Case study

Production ML Monitoring: From Weeks to Minutes

A practical build spanning Python, Airflow, Kubernetes, Great Expectations, Datadog, PagerDuty, AWS EKS, S3.

Overview

Why this project matters

Designed and built a real-time model monitoring system at JUMO that reduced data anomaly detection time from weeks to minutes. The system automated batch scoring with Airflow-orchestrated Kubernetes jobs, computed PSI and CSI metrics via Great Expectations, and surfaced alerts through Datadog dashboards and PagerDuty — enabling proactive model maintenance instead of reactive firefighting.

Context

The problem

JUMO ran dozens of credit scoring models in production, but anomaly detection was manual — data scientists would discover drift weeks after it began, often only when business metrics degraded. In a FinTech environment where model accuracy directly affects lending decisions and regulatory compliance, this latency was unacceptable.

Architecture

How it was built

The monitoring pipeline is orchestrated by Airflow DAGs that trigger batch scoring pods on Kubernetes. Each run executes three stages: feature materialisation, model scoring, and monitoring. Great Expectations computes CSI (Characteristic Stability Index) for input features and PSI (Population Stability Index) for model scores on every daily run. All metrics are logged to Datadog dashboards with PagerDuty alerting for threshold breaches. When a new model version is deployed to the artifact registry, a corresponding monitoring DAG is automatically created and scheduled — no manual setup required.
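The PSI computed on each daily run follows the standard definition: bin the baseline score distribution, compare bin proportions against the current batch, and sum the weighted log-ratios. A minimal sketch in NumPy (illustrative only — in the actual pipeline this was expressed as Great Expectations checks, not hand-rolled):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a current sample."""
    # Bin edges come from the baseline (expected) distribution's quantiles,
    # so each baseline bin holds roughly equal mass.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover out-of-range values

    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Floor the proportions to avoid log(0) / division by zero.
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)

    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))
```

CSI is the same calculation applied per input feature rather than to the model score, which is why one implementation serves both metrics.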

Outcome

What was delivered

Drift detection latency dropped from weeks to minutes. The automated DAG creation meant new models were monitored from day one of production deployment. This contributed to an ~90% reduction in P1/P2 incidents across the ML platform, as issues were caught and addressed before they impacted business metrics.
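The automatic DAG creation mentioned above amounts to a small factory keyed on the model registry. A hypothetical sketch of the config side (the registry API and config shape here are illustrative assumptions, not JUMO's actual interfaces; in Airflow the same pattern is realised by generating DAG objects at parse time):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MonitoringDagConfig:
    """Everything needed to schedule one model's monitoring DAG."""
    dag_id: str
    schedule: str
    stages: tuple[str, ...]

def monitoring_dag_config(model_name: str, version: str) -> MonitoringDagConfig:
    # One DAG per deployed model version; the three stages mirror the
    # pipeline described in the architecture section.
    return MonitoringDagConfig(
        dag_id=f"monitor__{model_name}__v{version}",
        schedule="@daily",
        stages=("feature_materialisation", "model_scoring", "monitoring"),
    )
```

Because the factory runs against whatever the registry reports, a newly deployed model version picks up a monitoring DAG on the next scheduler parse with no human in the loop.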

Retrospective

What I would do differently

Great Expectations proved excellent for statistical monitoring but required careful tuning of thresholds to avoid alert fatigue. In hindsight, I would add concept drift detection alongside the distributional monitoring — PSI and CSI catch data drift well but don't directly measure prediction quality degradation.
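The threshold-tuning problem reduces to mapping a PSI value onto alert severities. A minimal sketch using the conventional PSI cut-offs of 0.1 and 0.25 (the defaults here are the textbook values; the thresholds actually tuned in production were model-specific):

```python
def psi_severity(psi_value: float, warn: float = 0.1, critical: float = 0.25) -> str:
    """Map a PSI value to an alert action using conventional cut-offs."""
    if psi_value >= critical:
        return "page"   # open a PagerDuty incident
    if psi_value >= warn:
        return "warn"   # annotate the Datadog dashboard only
    return "ok"         # no action
```

Keeping the "warn" band dashboard-only is what buys headroom against alert fatigue: only sustained or severe drift pages a human.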

More projects

Keep exploring

Production RAG Chatbot (Enterprise application)

Led architecture and delivery of a production-grade RAG chatbot for John Lewis Partnership's internal workforce — from first …

The Prediction Factory: Designing an ML Platform from First Principles

Defined and delivered JUMO's internal ML platform from first principles — a config-driven orchestration layer that scaled model …

Transaction Categorisation Engine — Investec

Built an intelligent transaction categorisation engine for Investec Private Banking that reduced manual labelling effort by 60%. The …

Technology stack

Python · Airflow · Kubernetes · Great Expectations · Datadog · PagerDuty · AWS EKS · S3

Next steps

Interested in similar work?

If you need secure GenAI delivery, RAG engineering, MLOps automation, or production ML systems support, feel free to get in touch.