Case study
A practical build spanning Python, Airflow, Kubernetes, Great Expectations, Datadog, PagerDuty, AWS EKS, S3.
Overview
Designed and built a real-time model monitoring system at JUMO that reduced data anomaly detection time from weeks to minutes. The system automated batch scoring with Airflow-orchestrated Kubernetes jobs, computed PSI and CSI metrics via Great Expectations, and surfaced alerts through Datadog dashboards and PagerDuty — enabling proactive model maintenance instead of reactive firefighting.
Context
JUMO ran dozens of credit scoring models in production, but anomaly detection was manual — data scientists would discover drift weeks after it began, often only when business metrics degraded. In a FinTech environment where model accuracy directly affects lending decisions and regulatory compliance, this latency was unacceptable.
Architecture
The monitoring pipeline is orchestrated by Airflow DAGs that trigger batch scoring pods on Kubernetes. Each run executes three stages: feature materialisation, model scoring, and monitoring. On every daily run, Great Expectations computes CSI (Characteristic Stability Index) for each input feature and PSI (Population Stability Index) for the model scores. All metrics are sent to Datadog and shown on dashboards, and threshold breaches trigger PagerDuty alerts. When a new model version is deployed to the artifact registry, a corresponding monitoring DAG is created and scheduled automatically, with no manual setup required.
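To make the stability metrics concrete, here is a minimal sketch of the standard PSI formula in plain Python. It is not the Great Expectations configuration used in the project; the function names, the bin edges, and the small floor applied to empty bins are all assumptions made for illustration. CSI uses the same formula, applied to one input feature at a time instead of to the model scores.

```python
import bisect
import math
from typing import List, Sequence


def bin_fractions(values: Sequence[float], edges: Sequence[float],
                  eps: float = 1e-6) -> List[float]:
    """Share of values falling in each bin defined by sorted interior edges.

    Empty bins are floored at eps (an illustrative choice) so that the
    logarithm in the PSI formula stays finite.
    """
    if not values:
        raise ValueError("values must not be empty")
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    return [max(c / len(values), eps) for c in counts]


def psi(expected: Sequence[float], actual: Sequence[float],
        edges: Sequence[float]) -> float:
    """Population Stability Index: sum over bins of (a - e) * ln(a / e)."""
    e = bin_fractions(expected, edges)
    a = bin_fractions(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical data: training-time scores versus today's batch scores.
baseline_scores = [0.0, 0.0, 1.0, 1.0]
todays_scores = [0.0, 1.0, 1.0, 1.0]
drift = psi(baseline_scores, todays_scores, edges=[0.5])
```

In this toy case the baseline splits 50/50 across the two bins and today's batch splits 25/75, so the index comes out well above zero. Identical distributions give exactly zero.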
Outcome
Drift detection latency dropped from weeks to minutes. Because monitoring DAGs were created automatically, new models were monitored from day one of production deployment. This contributed to a roughly 90% reduction in P1/P2 incidents across the ML platform, as issues were caught and addressed before they impacted business metrics.
Retrospective
Great Expectations proved excellent for statistical monitoring but required careful tuning of thresholds to avoid alert fatigue. In hindsight, I would add concept drift detection alongside the distributional monitoring — PSI and CSI catch data drift well but don't directly measure prediction quality degradation.
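One common way to reduce alert fatigue is to page only when a breach persists across consecutive runs, and to treat a single breach as a dashboard warning. The sketch below illustrates that idea; it is not the alerting logic used in the project. The class name is hypothetical, and the 0.1 and 0.25 thresholds are the rule-of-thumb PSI cut-offs often quoted in credit scoring, not the tuned values from this system.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class DriftAlertPolicy:
    """Hypothetical policy that maps a PSI history to an alert level."""
    warn_threshold: float = 0.1    # assumed rule-of-thumb value
    page_threshold: float = 0.25   # assumed rule-of-thumb value
    consecutive_runs: int = 2      # breaches required before paging

    def classify(self, history: Sequence[float]) -> str:
        """Return 'ok', 'warn', or 'page' for PSI values ordered oldest first."""
        if not history:
            return "ok"
        recent = list(history[-self.consecutive_runs:])
        sustained = (len(recent) == self.consecutive_runs
                     and all(v >= self.page_threshold for v in recent))
        if sustained:
            return "page"
        if history[-1] >= self.warn_threshold:
            return "warn"
        return "ok"


policy = DriftAlertPolicy()
single_spike = policy.classify([0.05, 0.30])     # one breach only
sustained_drift = policy.classify([0.30, 0.31])  # two breaches in a row
```

With these defaults a one-off spike stays a warning, and only drift that persists across two daily runs escalates to a page.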