As applications grow more complex and architectures become ever more distributed, monitoring, alerting, and optimization become increasingly important. When does it make sense to build or buy an observability solution? The answer depends on several factors, including the traffic your application receives, the budget available to you, the complexity of your infrastructure, and the amount of customization you need.
When should you build—and when should you buy?
Because each organization’s situation is different, there is no single right answer. The same company may even change its approach as its needs and lifecycle stage evolve. As a startup, it might adopt an off-the-shelf observability solution, but once its customer base and user data grow exponentially, it may eventually choose to build its own in-house platform.
Expense
The first factor is the expense of buying an observability application, which includes both list pricing and hidden costs. Although some observability services have straightforward payment plans, others charge for almost everything that customers monitor: per log line, per virtual machine, per server, or per batch of UX tests. Under these pricing models, larger applications consisting of many virtual machines handling massive volumes of data and user traffic are at a disadvantage.
Complexity
An especially complex environment is another reason to build, rather than buy, observability. An architecture with many loosely connected, poorly mapped microservices, leftover code from previous iterations (and teams), or a mixed monolith-microservices codebase may require expensive professional services from an observability provider to map out. An in-house team building its own platform, by contrast, will likely reduce these costs significantly.
Scale
Scale is another consideration. If your environment ingests terabytes or petabytes of data (in the form of user traffic, event logs, application traces, or performance metrics) per day or week, an off-the-shelf provider may be unable to manage this volume efficiently. In fact, a prominent fintech firm encountered this exact problem: its previous service could store only 40 days of historical data (petabytes in total), while the firm needed a full year’s worth for analysis. As a result, it decided to build its own observability solution on Druid, which enabled cost-effective storage and efficient queries.
Streaming Data
Observability solutions must also be compatible with streaming technologies such as Apache Kafka or Amazon Kinesis. Streaming remains the best way to ingest real-time data at scale—a necessity when monitoring digital environments. Yet not all off-the-shelf observability platforms offer built-in streaming support; many instead require an external integration or additional software in the form of APIs or SDKs (both of which involve significant work from SRE or DevOps teams) to set up a stream.
Interactivity
Lastly, any observability solution, whether purchased externally or built in-house, has to support rapid, flexible data exploration. To find and fix the root cause of an issue, teams need to hold interactive conversations with their data: filtering, zooming, and slicing and dicing across high-cardinality dimensions. Pre-aggregating data—restructuring a database to answer commonly asked questions—isn’t feasible here, because during an investigation users don’t yet know what they’re looking for.
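As a rough sketch of what this kind of exploration looks like in practice, the Druid SQL query below drills into a hypothetical app_events datasource (all table and column names here are illustrative) by grouping on a high-cardinality session identifier over a narrow time window, exactly the sort of question a pre-aggregated schema could not have anticipated:

```sql
-- Ad hoc drill-down: which sessions produced the most errors in one hour?
-- "app_events", "session_id", "service", and "status" are hypothetical names.
SELECT
  session_id,
  service,
  COUNT(*) AS error_count
FROM app_events
WHERE __time >= TIMESTAMP '2023-06-01 14:00:00'
  AND __time <  TIMESTAMP '2023-06-01 15:00:00'
  AND status = 'ERROR'
GROUP BY session_id, service
ORDER BY error_count DESC
LIMIT 20
```

Because a column like session_id may take millions of distinct values, a question like this can only be answered by aggregating at query time, not by a pre-built summary table.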
The database for in-house observability
If you do decide to build rather than buy, Apache Druid is an ideal database on which to build your observability platform. Today, a number of leading companies use Druid and its associated products to build their in-house solutions, including network visibility provider Cisco ThousandEyes, cloud pioneer VMware, streaming media giant Netflix, and software powerhouse Atlassian, to name a few.
Built for speed (and, by extension, streaming data), Druid is well suited for time-sensitive use cases like observability. It can retrieve data in milliseconds, regardless of the volume of users or queries, and enables data to be queried immediately on arrival, without first having to be persisted to deep storage. Further, Druid is natively compatible with streaming technologies like Amazon Kinesis or Apache Kafka, requiring only a few clicks to set up an event stream.
When incidents escalate, many users, from site reliability engineers to executives to customers, will log on to help or oversee efforts. Druid can support large numbers of concurrent users and queries, returning results in milliseconds without freezing up or failing to complete queries. In fact, adtech provider Amobee was able to query over one trillion rows of high-dimensionality data in milliseconds.
Druid can also scale with the size of your application, up to millions of events per second. Thanks to its unique architecture, Druid can operate at massive scale cheaply and efficiently, even during a crisis, when already-large environments may generate still more data in the form of event logs, metrics, topology, or traces. For instance, Paytm, one of India’s leading financial services providers, was able to reduce its infrastructure costs by 67% despite ingesting five billion events per day.
Lastly, visualizations play a significant role in observability—especially in troubleshooting and crisis management. Imply, the company founded by the creators of Apache Druid, also provides Pivot, an intuitive GUI for building visualizations and dashboards that support interactive data exploration. For each action on a dashboard, such as a zoom or a filter, Pivot automatically executes multiple SQL queries on the backend in milliseconds. Not only does this provide a seamless data experience, but it also removes the need for users to have a deep understanding of SQL or data engineering.
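To make this concrete, here is a sketch of the kind of Druid SQL a single zoom might translate into, assuming a hypothetical app_events datasource with a latency_ms column and the DataSketches extension loaded:

```sql
-- Zooming a dashboard into a 15-minute window might re-bucket the
-- time series at one-minute granularity. All names are illustrative.
SELECT
  TIME_FLOOR(__time, 'PT1M') AS "minute",
  COUNT(*) AS requests,
  APPROX_QUANTILE_DS(latency_ms, 0.99) AS p99_latency_ms
FROM app_events
WHERE __time >= TIMESTAMP '2023-06-01 14:00:00'
  AND __time <  TIMESTAMP '2023-06-01 14:15:00'
GROUP BY 1
ORDER BY 1
```

A tool like Pivot issues queries of this shape on every interaction, so sub-second response times are what make a dashboard feel conversational rather than batch-like.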
Customer story: Salesforce
Salesforce provides the leading customer relationship management (CRM) software to top companies around the world. In 2022, Salesforce generated over $31 billion in revenue, an 18% increase year over year.
With over 150,000 global customers and 2,700 production services, data flows in from a massive variety of sources—and into an equally diverse array of storage. In total, the Salesforce environment ingested more than five billion events daily while storing over five petabytes of log data in data centers and almost 200 petabytes of data in Hadoop storage.
Given these massive volumes, attaining granular visibility into various aspects of the environment was difficult, as Salesforce’s existing observability providers were limited in several ways. For instance, its metrics software could not provide historical context alongside real-time data, and neither observability solution could provide rapid, interactive data exploration. Queries could take hours to complete, ruling out real-time insights, while building new dashboards or charts was a difficult, prolonged process. Yet Salesforce needed to “support batch, stream, and interactive, ad hoc query processing on these data sets,” writes Ram Sangireddy, a Senior Director of Big Data and Intelligence Platforms.
For the various Salesforce teams and departments, monitoring was critical to a wide variety of use cases. Service owners had to understand metrics such as usage, tenant size, traffic, and other key indicators. In addition to typical observability tasks such as performance monitoring and rapid troubleshooting, Salesforce teams also needed to monitor the customer experience of new product releases.
By switching to Druid, Salesforce teams achieved key operational improvements. First, Druid enabled rapid queries regardless of the size of the data set, reducing query times by 30%. Druid also enriched analytics, providing key features such as GROUP BY aggregations and the ability to build and share rich dashboards and visualizations—enabling flexible, versatile data exploration. By quickly querying and retrieving data, teams could use real-time insights to detect and resolve issues before they affected customers.
The transition to Druid also brought key cost benefits—a result of Druid’s unique architecture, which is optimized for efficiently storing massive amounts of data. Lead Software Engineer Dun Lu writes that “for one day of data, we end up saving 82% in the total number of rows stored in Druid, which translated to 47% total savings in the storage footprint.” This was possible due to Druid’s compaction, which compresses and combines segments by time interval, and rollup, which aggregates rows that share identical dimension values within a single interval (such as all of a customer’s visits each hour).
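Rollup is normally configured in Druid’s ingestion spec, but its effect can be sketched in SQL terms. The hypothetical statement below (using Druid’s SQL-based ingestion; all datasource and column names are illustrative) collapses raw visit events into one row per customer per hour while preserving the count:

```sql
-- Rollup, approximated in SQL: many raw rows with the same dimension
-- values in an hour become a single aggregated row.
INSERT INTO customer_visits_hourly
SELECT
  TIME_FLOOR(__time, 'PT1H') AS __time,
  customer_id,
  COUNT(*) AS visit_count
FROM customer_visits_raw
GROUP BY 1, 2
PARTITIONED BY DAY
```

If a customer visits 500 times in an hour, those 500 raw rows shrink to one, which is where the row-count and storage savings quoted above come from.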
Whether it’s monitoring performance, alerting on anomalies to preempt issues before they escalate, or ensuring the best possible user experience, observability is critical to the everyday operations and performance of applications.
To learn more about Druid, read the architecture guide.
Build your observability solution with Imply Polaris, the fully managed Druid-as-a-service. Register for a free trial of Polaris today.