Imply Videos

Dec 9, 2022

Production ML Model Quality Monitoring Using Druid

Machine learning and AI have progressed out of the initial technology development phase and into full production. In this new domain, the concept of monitoring has been redefined. Monitoring still encompasses the traditional APM and infrastructure monitoring domains, but it now extends beyond them to include measurement of the performance of the models themselves.

This extension of monitoring brings with it the need to develop entirely new tools and approaches to capture, calculate, and present model performance data to users in a form they can interpret and act on. Doing so requires extensive research and development, including the selection of the data storage technology that powers a system designed to meet these requirements.

In this talk, I will cover the technical requirements of a production model monitoring system, the architecture selected, and the system that was implemented to bring a production ML model monitoring product to market. Along the way, I will dive deeply into our choice of Apache Druid as the core data storage technology, and how we have leveraged and extended the Druid platform to build this product. This will include an in-depth discussion of running Druid inside Kubernetes, custom data aggregation, and operating Druid in production.