2025
MLOps Inference Pipeline with Monitoring
Production-grade MLOps pipeline with automated model training, FastAPI inference API, and comprehensive monitoring stack using Prometheus, Grafana, and Alertmanager.
FastAPI · Prometheus · Grafana · Docker · AWS EC2 · Terraform · Python · MLOps · Alertmanager · Slack API · IaC · Google Colab · Kaggle
Problem statement
ML models in production need observability, alerting, and a path to retraining when data drifts.
Architecture overview
Terraform provisions an AWS EC2 instance; on it, Docker runs the FastAPI inference service alongside Prometheus (metrics scraping), Grafana (dashboards), and Alertmanager, which routes alerts to Slack.
Challenges & learnings
- Designing meaningful ML metrics (latency, throughput, error rate) for Prometheus.
- Terraform state and safe EC2 lifecycle management.
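The first learning above, designing latency/throughput/error-rate metrics for Prometheus, can be sketched without any dependencies. This is an illustrative stdlib-only model of what the service exposes on `/metrics`; the metric and method names (`InferenceMetrics`, `record_request`) are assumptions, not taken from the repository, and a real service would use `prometheus_client` instead.

```python
from bisect import bisect_left

# Latency buckets in seconds, Prometheus-histogram style: each bucket is
# cumulative and counts requests whose latency is <= its upper bound.
BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, float("inf")]


class InferenceMetrics:
    """Counters a /metrics endpoint could expose for Prometheus to scrape."""

    def __init__(self):
        self.bucket_counts = [0] * len(BUCKETS)
        self.request_total = 0  # throughput = rate(inference_requests_total)
        self.error_total = 0    # error rate = rate(errors) / rate(requests)

    def record_request(self, latency_s, ok=True):
        self.request_total += 1
        if not ok:
            self.error_total += 1
        # Cumulative histogram: increment every bucket whose bound >= latency.
        for j in range(bisect_left(BUCKETS, latency_s), len(BUCKETS)):
            self.bucket_counts[j] += 1

    def render(self):
        """Simplified Prometheus text exposition format."""
        lines = [
            f"inference_requests_total {self.request_total}",
            f"inference_errors_total {self.error_total}",
        ]
        for le, n in zip(BUCKETS, self.bucket_counts):
            label = "+Inf" if le == float("inf") else le
            lines.append(f'inference_latency_seconds_bucket{{le="{label}"}} {n}')
        return "\n".join(lines)
```

Grafana then derives p95 latency from the bucket counts with `histogram_quantile()` over `rate(...)`, which is why cumulative buckets (rather than raw timings) are the right shape to export.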
Features
- Prometheus & Grafana monitoring with Slack alerts
- AWS EC2 deployment with Terraform IaC
- Drift detection and auto-retraining
- Multi-service Docker architecture
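One common way to implement the drift-detection feature listed above is the Population Stability Index (PSI), which compares the distribution of a feature at training time against live traffic. The repository's actual method isn't shown here, so treat this as a sketch under assumptions: the function names and the 0.2 retraining threshold are illustrative choices, not confirmed project values.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of one numeric feature.

    `expected` is a sample from training data, `actual` from live requests.
    Common rule of thumb: PSI > 0.2 indicates significant drift.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            i = int((x - lo) / width)
            counts[min(max(i, 0), bins - 1)] += 1  # clamp out-of-range live values
        return [c / len(sample) + eps for c in counts]  # eps keeps log() finite

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_retrain(train_sample, live_sample, threshold=0.2):
    """Gate for the auto-retraining step (threshold is a project choice)."""
    return psi(train_sample, live_sample) > threshold
```

In a pipeline like this one, `should_retrain` would run on a schedule over recent request logs, and a positive result could both fire a Prometheus alert and kick off the retraining job.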