2025

MLOps Inference Pipeline with Monitoring

A production-grade MLOps pipeline with automated model training, a FastAPI inference API, and a comprehensive monitoring stack built on Prometheus, Grafana, and Alertmanager.

FastAPI · Prometheus · Grafana · Docker · AWS EC2 · Terraform · Python · MLOps · Alertmanager · Slack API · IaC · Google Colab · Kaggle
View repository

Problem statement

ML models in production need observability, alerting, and a path to retraining when data drifts.

Architecture overview

Terraform provisions an AWS EC2 host. On it, Docker runs the services: the FastAPI inference API, Prometheus (metrics scraping), Grafana (dashboards), and Alertmanager, which routes alerts to Slack.
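To make the monitoring path concrete, here is a minimal stdlib-only sketch of the kind of request metrics the inference service exposes for Prometheus to scrape. It is an illustration, not the project's actual code: in the real service these counters and histograms would be registered with `prometheus_client` and served from the FastAPI app's `/metrics` route; the metric names and latency buckets below are assumptions.

```python
from collections import Counter

# Latency histogram bucket upper bounds, in seconds (assumed values).
BUCKETS = [0.005, 0.01, 0.05, 0.1, 0.5, 1.0]

class InferenceMetrics:
    """Tracks request count by status and a latency histogram, and renders
    them in the Prometheus text exposition format."""

    def __init__(self):
        self.request_count = Counter()   # keyed by HTTP status code
        self.latency_buckets = Counter() # cumulative histogram bucket counts
        self.latency_sum = 0.0

    def observe(self, status: int, seconds: float) -> None:
        """Record one inference request: its status code and latency."""
        self.request_count[status] += 1
        self.latency_sum += seconds
        for le in BUCKETS:
            if seconds <= le:
                self.latency_buckets[le] += 1
        self.latency_buckets["+Inf"] += 1  # every sample lands in +Inf

    def exposition(self) -> str:
        """Render the metrics as a Prometheus scrape payload."""
        total = sum(self.request_count.values())
        lines = ["# TYPE inference_requests_total counter"]
        for status, n in sorted(self.request_count.items()):
            lines.append(f'inference_requests_total{{status="{status}"}} {n}')
        lines.append("# TYPE inference_latency_seconds histogram")
        for le in BUCKETS + ["+Inf"]:
            lines.append(
                f'inference_latency_seconds_bucket{{le="{le}"}} '
                f"{self.latency_buckets[le]}"
            )
        lines.append(f"inference_latency_seconds_sum {self.latency_sum}")
        lines.append(f"inference_latency_seconds_count {total}")
        return "\n".join(lines)

metrics = InferenceMetrics()
metrics.observe(200, 0.03)
metrics.observe(200, 0.2)
metrics.observe(500, 0.7)
```

Prometheus computes error rate and latency quantiles from exactly these primitives (counters by status, cumulative histogram buckets), which is why the API only needs to expose raw counts rather than precomputed rates.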

Challenges & learnings

  • Designing meaningful ML metrics (latency, throughput, error rate) for Prometheus.
  • Terraform state and safe EC2 lifecycle management.
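The state-management point above can be sketched in Terraform itself. This is a hedged illustration, not the project's actual configuration: the bucket, table, variable, and resource names are placeholders, but the pattern (remote state with locking, plus a destroy guard on the instance) is the standard way to make EC2 lifecycle changes safe.

```hcl
terraform {
  backend "s3" {
    bucket         = "example-tf-state"             # placeholder bucket name
    key            = "mlops-pipeline/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "example-tf-locks"             # state locking
    encrypt        = true
  }
}

resource "aws_instance" "inference" {
  ami           = var.ami_id                        # assumed input variable
  instance_type = "t3.medium"

  lifecycle {
    prevent_destroy = true   # fail the plan rather than tear down the host
  }
}
```

Remote state with DynamoDB locking prevents two concurrent `terraform apply` runs from corrupting the state file, and `prevent_destroy` turns an accidental teardown of the monitored host into an explicit, deliberate step.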

Features

  • Prometheus & Grafana monitoring with Slack alerts
  • AWS EC2 deployment with Terraform IaC
  • Drift detection and auto-retraining
  • Multi-service Docker architecture
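The drift-detection-and-retrain feature can be sketched with a Population Stability Index check, a common drift statistic. This is an assumed approach, not necessarily the one the project uses; the 0.2 threshold is a conventional rule of thumb, and the function names are illustrative.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample (e.g. training
    data) and a live sample of the same feature. Higher PSI => more drift."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def histogram(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(x > e for e in edges)] += 1  # bin index by edge count
        n = len(sample)
        # Floor empty bins so the log term stays defined.
        return [max(c / n, 1e-6) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def should_retrain(reference, live, threshold=0.2):
    """Assumed policy: trigger the retraining job when PSI exceeds 0.2."""
    return psi(reference, live) > threshold

reference = [i / 100 for i in range(100)]        # uniform training sample
shifted = [0.9 + i / 1000 for i in range(100)]   # live data bunched at the top
```

In a pipeline like this one, the check would run on a schedule against recent inference inputs, with the PSI value itself exported as a Prometheus gauge so Grafana can plot drift and Alertmanager can page Slack before retraining fires.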