2025

MLOps Inference Pipeline with Monitoring

A production-grade MLOps pipeline with automated model training, a FastAPI inference API, and a comprehensive monitoring stack built on Prometheus, Grafana, and Alertmanager.

FastAPI · Prometheus · Grafana · Docker · AWS EC2 · Terraform · Python · MLOps · Alertmanager · Slack API · IaC · Google Colab · Kaggle
View repository

Problem statement

ML models in production need observability, alerting, and a path to retraining when data drifts.

Architecture overview

Terraform provisions an AWS EC2 host. On it, Docker runs the services: the FastAPI inference API, Prometheus (metrics scraping), Grafana (dashboards), and Alertmanager, which routes alerts to Slack.
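To make the monitoring path concrete, here is a minimal stdlib-only sketch of the kind of request metrics the inference service exposes for Prometheus to scrape. It is an illustration, not the project's actual code: in the real service these counters and histograms would be registered with `prometheus_client` and served from the FastAPI app's `/metrics` route; the metric names and latency buckets below are assumptions.

```python
from collections import Counter

# Latency histogram bucket upper bounds, in seconds (assumed values).
BUCKETS = [0.005, 0.01, 0.05, 0.1, 0.5, 1.0]

class InferenceMetrics:
    """Tracks request count by status and a latency histogram, and renders
    them in the Prometheus text exposition format."""

    def __init__(self):
        self.request_count = Counter()   # keyed by HTTP status code
        self.latency_buckets = Counter() # cumulative histogram bucket counts
        self.latency_sum = 0.0

    def observe(self, status: int, seconds: float) -> None:
        """Record one inference request: its status code and latency."""
        self.request_count[status] += 1
        self.latency_sum += seconds
        for le in BUCKETS:
            if seconds <= le:
                self.latency_buckets[le] += 1
        self.latency_buckets["+Inf"] += 1  # every sample lands in +Inf

    def exposition(self) -> str:
        """Render the metrics as a Prometheus scrape payload."""
        total = sum(self.request_count.values())
        lines = ["# TYPE inference_requests_total counter"]
        for status, n in sorted(self.request_count.items()):
            lines.append(f'inference_requests_total{{status="{status}"}} {n}')
        lines.append("# TYPE inference_latency_seconds histogram")
        for le in BUCKETS + ["+Inf"]:
            lines.append(
                f'inference_latency_seconds_bucket{{le="{le}"}} '
                f"{self.latency_buckets[le]}"
            )
        lines.append(f"inference_latency_seconds_sum {self.latency_sum}")
        lines.append(f"inference_latency_seconds_count {total}")
        return "\n".join(lines)

metrics = InferenceMetrics()
metrics.observe(200, 0.03)
metrics.observe(200, 0.2)
metrics.observe(500, 0.7)
```

Prometheus computes error rate and latency quantiles from exactly these primitives (counters by status, cumulative histogram buckets), which is why the API only needs to expose raw counts rather than precomputed rates.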

Challenges & learnings

  • Designing meaningful ML metrics (latency, throughput, error rate) for Prometheus.
  • Terraform state and safe EC2 lifecycle management.
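The state-management point above can be sketched in Terraform itself. This is a hedged illustration, not the project's actual configuration: the bucket, table, variable, and resource names are placeholders, but the pattern (remote state with locking, plus a destroy guard on the instance) is the standard way to make EC2 lifecycle changes safe.

```hcl
terraform {
  backend "s3" {
    bucket         = "example-tf-state"             # placeholder bucket name
    key            = "mlops-pipeline/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "example-tf-locks"             # state locking
    encrypt        = true
  }
}

resource "aws_instance" "inference" {
  ami           = var.ami_id                        # assumed input variable
  instance_type = "t3.medium"

  lifecycle {
    prevent_destroy = true   # fail the plan rather than tear down the host
  }
}
```

Remote state with DynamoDB locking prevents two concurrent `terraform apply` runs from corrupting the state file, and `prevent_destroy` turns an accidental teardown of the monitored host into an explicit, deliberate step.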

Features

  • Prometheus & Grafana monitoring with Slack alerts
  • AWS EC2 deployment with Terraform IaC
  • Drift detection and auto-retraining
  • Multi-service Docker architecture
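The drift-detection-and-retrain feature can be sketched with a Population Stability Index check, a common drift statistic. This is an assumed approach, not necessarily the one the project uses; the 0.2 threshold is a conventional rule of thumb, and the function names are illustrative.

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample (e.g. training
    data) and a live sample of the same feature. Higher PSI => more drift."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def histogram(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(x > e for e in edges)] += 1  # bin index by edge count
        n = len(sample)
        # Floor empty bins so the log term stays defined.
        return [max(c / n, 1e-6) for c in counts]

    e, a = histogram(expected), histogram(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def should_retrain(reference, live, threshold=0.2):
    """Assumed policy: trigger the retraining job when PSI exceeds 0.2."""
    return psi(reference, live) > threshold

reference = [i / 100 for i in range(100)]        # uniform training sample
shifted = [0.9 + i / 1000 for i in range(100)]   # live data bunched at the top
```

In a pipeline like this one, the check would run on a schedule against recent inference inputs, with the PSI value itself exported as a Prometheus gauge so Grafana can plot drift and Alertmanager can page Slack before retraining fires.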