Case study

Production ML Monitoring: From Weeks to Minutes

A practical build spanning Python, Airflow, Kubernetes, Great Expectations, Datadog, PagerDuty, AWS EKS, S3.

Overview

Why this project matters

Designed and built a real-time model monitoring system at JUMO that reduced data anomaly detection time from weeks to minutes. The system automated batch scoring with Airflow-orchestrated Kubernetes jobs, computed PSI and CSI metrics via Great Expectations, and surfaced alerts through Datadog dashboards and PagerDuty — enabling proactive model maintenance instead of reactive firefighting.

Context

The problem

JUMO ran dozens of credit scoring models in production, but anomaly detection was manual — data scientists would discover drift weeks after it began, often only when business metrics degraded. In a FinTech environment where model accuracy directly affects lending decisions and regulatory compliance, this latency was unacceptable.

Architecture

How it was built

The monitoring pipeline is orchestrated by Airflow DAGs that trigger batch scoring pods on Kubernetes. Each run executes three stages: feature materialisation, model scoring, and monitoring. Great Expectations computes CSI (Characteristic Stability Index) for input features and PSI (Population Stability Index) for model scores on every daily run. All metrics are logged to Datadog dashboards with PagerDuty alerting for threshold breaches. When a new model version is deployed to the artifact registry, a corresponding monitoring DAG is automatically created and scheduled — no manual setup required.
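The PSI computed on each daily run follows the standard definition: bin the baseline score distribution, compare bin proportions against the current batch, and sum the weighted log-ratios. A minimal sketch in NumPy (illustrative only — in the actual pipeline this was expressed as Great Expectations checks, not hand-rolled):

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a current sample."""
    # Bin edges come from the baseline (expected) distribution's quantiles,
    # so each baseline bin holds roughly equal mass.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover out-of-range values

    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Floor the proportions to avoid log(0) / division by zero.
    exp_pct = np.clip(exp_pct, 1e-6, None)
    act_pct = np.clip(act_pct, 1e-6, None)

    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))
```

CSI is the same calculation applied per input feature rather than to the model score, which is why one implementation serves both metrics.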

Outcome

What was delivered

Drift detection latency dropped from weeks to minutes. The automated DAG creation meant new models were monitored from day one of production deployment. This contributed to an ~90% reduction in P1/P2 incidents across the ML platform, as issues were caught and addressed before they impacted business metrics.
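The automatic DAG creation mentioned above amounts to a small factory keyed on the model registry. A hypothetical sketch of the config side (the registry API and config shape here are illustrative assumptions, not JUMO's actual interfaces; in Airflow the same pattern is realised by generating DAG objects at parse time):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MonitoringDagConfig:
    """Everything needed to schedule one model's monitoring DAG."""
    dag_id: str
    schedule: str
    stages: tuple[str, ...]

def monitoring_dag_config(model_name: str, version: str) -> MonitoringDagConfig:
    # One DAG per deployed model version; the three stages mirror the
    # pipeline described in the architecture section.
    return MonitoringDagConfig(
        dag_id=f"monitor__{model_name}__v{version}",
        schedule="@daily",
        stages=("feature_materialisation", "model_scoring", "monitoring"),
    )
```

Because the factory runs against whatever the registry reports, a newly deployed model version picks up a monitoring DAG on the next scheduler parse with no human in the loop.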

Retrospective

What I would do differently

Great Expectations proved excellent for statistical monitoring but required careful tuning of thresholds to avoid alert fatigue. In hindsight, I would add concept drift detection alongside the distributional monitoring — PSI and CSI catch data drift well but don't directly measure prediction quality degradation.
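The threshold-tuning problem reduces to mapping a PSI value onto alert severities. A minimal sketch using the conventional PSI cut-offs of 0.1 and 0.25 (the defaults here are the textbook values; the thresholds actually tuned in production were model-specific):

```python
def psi_severity(psi_value: float, warn: float = 0.1, critical: float = 0.25) -> str:
    """Map a PSI value to an alert action using conventional cut-offs."""
    if psi_value >= critical:
        return "page"   # open a PagerDuty incident
    if psi_value >= warn:
        return "warn"   # annotate the Datadog dashboard only
    return "ok"         # no action
```

Keeping the "warn" band dashboard-only is what buys headroom against alert fatigue: only sustained or severe drift pages a human.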

More projects

Keep exploring

Production RAG Chatbot (Enterprise application)

Led architecture and delivery of a production-grade RAG chatbot for John Lewis Partnership's internal workforce — from first …

The Prediction Factory: Designing an ML Platform from First Principles

Defined and delivered JUMO's internal ML platform from first principles — a config-driven orchestration layer that scaled model …

Transaction Categorisation Engine — Investec

Built an intelligent transaction categorisation engine for Investec Private Banking that reduced manual labelling effort by 60%. The …

Technology stack

Python · Airflow · Kubernetes · Great Expectations · Datadog · PagerDuty · AWS EKS · S3

Next steps

Interested in similar work?

If you need secure GenAI delivery, RAG engineering, MLOps automation, or production ML systems support, feel free to get in touch.