Scaling a SaaS backend from 3K to 180K monthly active users without a rewrite
The platform ran on a single EC2 instance with a local PostgreSQL database. During exam season, the most critical traffic window, the site went down completely for 2–4 hours at a time. The CTO was applying manual hotfixes over SSH at 2 a.m. There was no monitoring, no staging environment, and no deployment automation.
Started with a full infrastructure audit to understand the current state before changing anything. The audit identified three bottlenecks: a single-instance database with no connection pooling, no caching layer (so the same expensive queries ran repeatedly), and a monolithic EC2 deployment that could not scale horizontally.
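To illustrate the first bottleneck: without pooling, every request opens its own database connection, and a traffic spike can exhaust the database's connection limit. The following is a minimal sketch of a fixed-size pool, not the project's actual code. The `ConnectionPool` class and the `fake_connect` factory are hypothetical names introduced for illustration; a real deployment would more likely use an existing pooler or a driver's built-in pool.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: connections are opened once and reused across requests."""

    def __init__(self, connect, size):
        # `connect` is a hypothetical zero-argument factory that opens one connection.
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=5.0):
        # Wait for a free connection; raises queue.Empty after `timeout` seconds.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            # Always hand the connection back, even if the request failed.
            self._idle.put(conn)


# Stand-in factory so the sketch runs without a database (assumption).
opened = []


def fake_connect():
    opened.append(object())
    return opened[-1]


pool = ConnectionPool(fake_connect, size=2)
for _ in range(100):
    with pool.connection() as conn:
        pass  # a real handler would run its query with `conn` here
```

In this sketch, 100 simulated requests reuse the same two connections instead of opening 100, and a request that cannot get a connection within the timeout fails fast rather than piling onto the database.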
Migrated the database to RDS PostgreSQL with a Multi-AZ standby. Added ElastiCache Redis for session storage and query-result caching. Moved the application to ECS Fargate behind an ALB, enabling zero-downtime rolling deployments. Built a GitHub Actions CI/CD pipeline with automated testing. Added CloudWatch dashboards and PagerDuty alerting.
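Query-result caching of the kind described above is commonly done with the cache-aside pattern: check the cache first, fall back to the database on a miss, then store the result with an expiry. The sketch below assumes that pattern; the source does not say which one the project used. `TTLCache` is a hypothetical in-memory stand-in for Redis so the example runs without a server, and `cached_query`, `slow_report`, and the key name are likewise illustrative.

```python
import time


class TTLCache:
    """In-memory stand-in for a Redis-like store with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Entry has expired: drop it and report a miss.
            del self._items[key]
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._items[key] = (value, self._clock() + ttl_seconds)


def cached_query(cache, key, run_query, ttl_seconds=60):
    """Cache-aside: return the cached result if present, else query and store it.

    Note: a result of None is treated as a miss and is never cached.
    """
    hit = cache.get(key)
    if hit is not None:
        return hit
    result = run_query()
    cache.set(key, result, ttl_seconds)
    return result


# Hypothetical expensive query; `calls` counts how often it actually runs.
calls = []


def slow_report():
    calls.append(1)
    return {"rows": 42}


cache = TTLCache()
first = cached_query(cache, "report:exam-results", slow_report)
second = cached_query(cache, "report:exam-results", slow_report)
```

Here the second call is served from the cache, so `slow_report` runs only once. The TTL bounds how stale a cached result can get, which is the main trade-off to tune per query.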
P95 API latency dropped from 2.3 s to 420 ms. The platform maintained 99.96% uptime through the next exam season, serving 60x the previous peak load without incident. The CTO reported zero 2 a.m. alerts in the 4 months after launch.
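For readers less familiar with these metrics, the sketch below shows one common way to compute a P95 (the nearest-rank method) and what an uptime percentage implies in minutes of downtime. The 30-day window and the sample latencies are assumptions for illustration only; the source does not state how long the exam season lasted or how its figures were measured.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def downtime_minutes(uptime_pct, window_days):
    """Minutes of downtime implied by an uptime percentage over a window."""
    return (100 - uptime_pct) / 100 * window_days * 24 * 60


# Illustrative latencies of 1..100 ms (made-up data, not measurements).
latencies_ms = list(range(1, 101))
p95 = percentile_nearest_rank(latencies_ms, 95)

# Assumed 30-day window: 99.96% uptime leaves roughly 17 minutes of downtime.
budget = downtime_minutes(99.96, window_days=30)
```

Under that assumed 30-day window, 99.96% uptime corresponds to about 17.3 minutes of total downtime, compared with the earlier outages of 2–4 hours each.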