Real-time analytics at Charter

by Jacob Ferlin · Agustin Schapira · November 9, 2020

Getting to a single, real-time version of the truth

"It is the mark of an educated mind to be able to entertain a thought without accepting it." — Aristotle*

Jacob Ferlin - Charter Communications, Inc. - Data Platforms
Agustin Schapira – AFB Soft, LLC. - Data Platforms

Charter Communications, Inc. is a leading broadband connectivity company and cable operator serving more than 30 million customers in 41 states through its Spectrum brand. Over an advanced communications network, the company offers a full range of state-of-the-art residential and business services including Spectrum Internet®, TV, Mobile, and Voice.

Charter knows that the customer expects those services and more. Customer satisfaction continually expands to include better reliability, competitive pricing, and exciting new features. By extension, the growing expectation is to continually understand and react quickly to the customer. Charter recognized the advantage of being able to instrument, collect, and continually analyze the performance of its platforms to drive improvements in both the product and each customer’s experience.

Under the leadership of Group Vice President Michael Baldino, the Data Platforms team has led the vision, security, and daily operations for the customer-experience-focused platform. During its formative years, the Data Platforms team explored multiple options for providing real-time data. Ultimately, the decision to develop a custom solution worked very well and would likely still be in use today, had we not reevaluated Druid and Imply Cloud. In 2014, Druid was one of the solutions we explored for real-time analytics; at the time, the size and volume of our records made Druid untenable once throughput exceeded 25,000 messages per second. In the end, the decision to go in another direction came with substantial trade-offs.

Deep Storage

The previous solution had no concept of deep storage. Data was tied to each running instance, and attempts to use non-local storage had significant cost and performance impacts. The active-active data-load setup meant that schema alterations and updates required creating, assessing, and operating nearly redundant datasets and environments. These would run in parallel with production for seven days, by which point the minimum required amount of data would have accrued. At that point, the team would initiate a graceful cutover, slowly bringing down the current instance while activating the latest version.

Several additional factors further complicated the management of this process:

  • The old DB was not column-oriented. Performance constraints resulted in the need for cross-table partitions. The addition of those partitions increased complexity and reduced flexibility.
  • Adding new nodes required a re-index, re-shard, and rebalance of the data. This process was both complicated and time-consuming.
  • The previous DB required significant handholding to ensure it met our high standard for SLOs.
  • To accommodate increased traffic, we'd need to launch beefier clusters, upgrade, then roll back.
  • Difficulty manipulating and updating data meant that large-scale data removal was difficult, if not impossible.
  • Kafka ingestion was not built in.
    • The lack of built-in Kafka ingestion required us to write a custom thin layer that read events from Kafka, collected them, and inserted them into the database via 'INSERT INTO' statements. That solution worked well for a long time, until it didn't. Within two short years, our data loads had increased 50x, and performance under the designed framework became a significant issue.
    • The original ingest layer was written in Python, which eventually became a hurdle to achieving higher ingestion rates; plans to rewrite it in C or Java were explored but abandoned in favor of moving to Imply Cloud.
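A thin layer like the one described above can be sketched as follows. This is an illustrative reconstruction, not Charter's actual code: the table name, column names, and batch size are hypothetical, and the Kafka consumer is abstracted away as a plain iterable (a real version might wrap a client such as kafka-python's KafkaConsumer).

```python
from typing import Iterable, List, Tuple

def build_insert(table: str, columns: List[str], rows: List[Tuple]) -> str:
    """Render one multi-row INSERT INTO statement for a batch of events."""
    cols = ", ".join(columns)
    values = ", ".join(
        "(" + ", ".join(repr(v) for v in row) + ")" for row in rows
    )
    return f"INSERT INTO {table} ({cols}) VALUES {values};"

def ingest(events: Iterable[Tuple], batch_size: int = 1000) -> List[str]:
    """Collect events into batches and return the INSERT statements.

    A production loop would execute each statement against the database
    and commit consumer offsets only after a successful insert.
    """
    statements, batch = [], []
    for event in events:
        batch.append(event)
        if len(batch) >= batch_size:
            statements.append(
                build_insert("events", ["ts", "device", "metric"], batch))
            batch = []
    if batch:  # flush the final partial batch
        statements.append(
            build_insert("events", ["ts", "device", "metric"], batch))
    return statements
```

The weakness the post describes is visible in the design: every batch pays the round-trip and parse cost of a SQL statement, so per-row overhead that is tolerable at thousands of messages per second becomes a bottleneck at 50x that load.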

Revisiting our approach

The solution that had worked previously needed reevaluation. The execution of a strong vision had taken us from directing a stream of data to managing a torrent. We slowed our existing development and began a comprehensive evaluation of several technologies, some new, others old.

A surprising conclusion emerged: the best platform for scalability and customer usage was Imply, particularly when paired with the exceptional services provided by the Imply team. Their support and the extensive expertise of their developers facilitated the migration of a core component of our network analytics to the Imply platform. Over time, the platform has been adopted across groups and organizations, providing a common language with which Charter team members can identify, solve, and communicate customer issues in real time.

This success has led to the continued growth and adoption of the Imply platform across Charter. To date, we have seen approximately a 10x increase in data storage (~200TB). And forget concerns about handling 25k messages per second; we now average twice that, with bursts spiking into the 100–150k messages-per-second range.

Our monitoring and operations are vastly more predictable. When issues do arise, most are resolved within minutes. The majority of the time, the fix is so simple, even a ‘former’ developer can be tasked with ‘fixing’ production (guilty).

The Data Platforms mantra can be summarized in three words: quality, tenacity, enablement. Quality in all that we do, tenacity in how we approach our work and challenge each other to be better, and finally enablement: it’s our mission to democratize data, enabling teams to access and evolve their data as needed while providing the business a set of commonly understood metrics.

Under a previous architecture, changes to the data model or requests for new metrics required significant resource investments. Today, schema updates and metric changes still require some dev work. What has changed is our ability to provide users access to the latest data via the Imply Pivot analytics UI. This means our end users can develop and review personalized reports in near ‘real time.’

Our partnership with Imply, and its growing adoption across teams, has raised our 'average' data temperature. More of our data has shifted into the 'hot' (0.1–3s; recent, highly concurrent, highly interactive) and 'warm' (5–30s; less recent, highly concurrent, some interactivity) zones.
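The tiers above can be read as a simple bucketing of query response time. As a rough illustration only (the thresholds come from the ranges quoted above; the boundary handling and tier name 'cold' are assumptions, since the post doesn't define them):

```python
def temperature(latency_s: float) -> str:
    """Bucket a query's response time into the data-temperature tiers
    described above. Thresholds follow the post's quoted ranges;
    exact boundary behavior is an assumption."""
    if latency_s <= 3:
        return "hot"    # 0.1-3s: recent, highly concurrent, highly interactive
    if latency_s <= 30:
        return "warm"   # 5-30s: less recent, some interactivity
    return "cold"       # everything slower / colder storage
```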

Imply has empowered users to take ownership, experiment, and improve the way they use the data. The path to data democratization is long; our early foray has seen breakthroughs and setbacks. By now, some of you may be thinking that the democratization of data seems like a technical Shangri-La. While data utopia seems out of reach, enabling users and expanding data literacy requires the continual tenacity and vision to develop and support tools and systems that provide users access to quality data promptly.

Below are some tips and general principles for beginning the long, arduous, and never-ending journey towards data democratization.

  • Have a vision, revisit, and adjust it accordingly
  • Establish trust in the data broadly (required for each new data set, team, or application)
  • Establish trust in the platform broadly (required for each new technology, reporting tool, or access pattern identified)
  • Model hot, warm, and cold data similarly; reduce the number of downstream manipulations required.
  • Periodically assess, review, and reestablish trust in your platform and data
  • Understand your user base, know what access patterns and tools match with what people, err on the side of enablement (taking into account security and performance) — if they have approval, and the system can handle it, set them free!
  • Closely monitor system usage, quickly identify and kill system-intensive queries. Work with users privately on the proper use and care of the platform.
  • Constantly rethink, reengineer, and explore ways to expand access to data and analytics.
  • Have patience. Not everyone has the same level of interest or experience with data; find ways to first take care of yourself and then help the user get what they need. A leader who understands the world we work in can often smell the difference between an issue and a complaint.

Special thanks to the superb edits and proofing courtesy of the great Agustin Schapira and Cole Lechleiter.

*Aristotle actually said something more like, “Each person judges well what they know and is thus a good critic of those things. For each thing in specific, someone must be educated to be a critic; to be a critic in general one must be educated about everything.”

The CliffsNotes version used above takes some liberties.
