
What actually happened on these projects.

Infrastructure, DevOps, backend systems, and AI workflow engineering. Specific numbers. Real situations. Every engagement listed here was designed, built, and delivered end-to-end.

SaaS / EdTech · 8 weeks

Scaling a SaaS backend from 3K to 180K monthly active users without a rewrite

The platform ran on a single EC2 instance with a local PostgreSQL database. During exam season — their most critical traffic window — the site went down completely for 2–4 hours at a time. The CTO was doing manual hotfixes via SSH at 2am. There was no monitoring, no staging environment, and no deployment automation.

Started with a full infrastructure audit to understand the current state before touching anything. It identified three specific bottlenecks: a single-instance database with no connection pooling, no caching layer (so the same expensive queries ran repeatedly), and a monolithic EC2 deployment that could not scale horizontally.

Migrated the database to RDS PostgreSQL with Multi-AZ standby. Added ElastiCache Redis for session storage and query result caching. Moved the application to ECS Fargate with an ALB, enabling zero-downtime rolling deployments. Built a GitHub Actions CI/CD pipeline with automated testing. Added CloudWatch dashboards and PagerDuty alerting.
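A minimal sketch of the cache-aside read pattern behind query result caching. It is an illustration, not the production code: `TTLCache` is a hypothetical in-memory stand-in for Redis, and `cached_query` and its parameters are assumed names.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Hypothetical in-memory stand-in for Redis: values expire after a TTL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # Drop expired entries so the next read reloads from the database.
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: Any, ttl_seconds: float) -> None:
        self._store[key] = (self._clock() + ttl_seconds, value)


def cached_query(cache: TTLCache, key: str, ttl_seconds: float,
                 run_query: Callable[[], Any]) -> Any:
    """Cache-aside read: try the cache first, hit the database only on a miss.

    A result of None is treated as a miss and is never cached.
    """
    value = cache.get(key)
    if value is None:
        value = run_query()
        cache.set(key, value, ttl_seconds)
    return value
```

The TTL bounds how stale a cached result can get, which is the main trade-off to tune per query.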

P95 API latency dropped from 2.3s to 420ms. Platform maintained 99.96% uptime through the next exam season — serving 60x the previous peak load without incident. The CTO reported zero 2am alerts in the 4 months after launch.

−82%
P95 API latency
99.96%
Uptime maintained during peak
AWS ECS Fargate, RDS PostgreSQL, ElastiCache Redis, CloudFront, Terraform, GitHub Actions, Prometheus

Fintech / Payments · 6 weeks

CI/CD overhaul: eliminating 45-minute manual deployments and two failed prod deploys per month

Every production deployment involved SSH-ing into three servers in sequence, running database migration scripts by hand, manually verifying the application had started correctly, and updating a Notion doc with 23 steps. There was no staging environment. Rollbacks required manually reverting database changes. The team averaged two failed production deployments per month that required immediate hotfixes.

Audited the full deployment process by shadowing two production deployments. The root problems were: no infrastructure-as-code, no automated testing in the deployment path, manual database migration execution, and no health check verification before traffic cutover.

Set up a staging environment in AWS using Terraform, identical to production but with smaller instances. Built a GitHub Actions pipeline with automated unit tests, integration tests, a database migration dry run, a staging deploy with smoke tests, and a production deploy with blue/green switching and automatic rollback on health check failure. Added Slack notifications for every pipeline stage.
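The cutover logic (health checks before the switch, automatic rollback after it) can be sketched roughly as below. This is an assumption-laden illustration: `blue_green_deploy`, `check_health`, `switch_traffic`, and the `required_passes` count are hypothetical names and values, and in practice these steps would be wired into the pipeline rather than run as one function.

```python
from typing import Callable


def blue_green_deploy(
    check_health: Callable[[str], bool],
    switch_traffic: Callable[[str], None],
    live: str = "blue",
    candidate: str = "green",
    required_passes: int = 3,  # assumed value, not taken from the project
) -> str:
    """Return the environment that serves traffic after the deploy attempt."""
    # Gate 1: the candidate must pass consecutive health checks before cutover.
    for _ in range(required_passes):
        if not check_health(candidate):
            return live  # traffic never moved, so there is nothing to roll back

    switch_traffic(candidate)

    # Gate 2: check again after the switch and roll back automatically on failure.
    if not check_health(candidate):
        switch_traffic(live)
        return live
    return candidate
```

Keeping the old environment running until the second gate passes is what makes the rollback a traffic switch rather than a redeploy.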

Deploy time went from 45 minutes to 8 minutes. Zero failed production deployments in the 5 months since implementation. The team went from dreading deployments to shipping 3–4 times per week.

−82%
Deploy time (45min → 8min)
Zero
Failed deploys in 5 months post-launch
GitHub Actions, Docker, AWS ECS, Terraform, Blue/Green deployment, Slack

D2C E-commerce · 3 weeks

Eliminating 4 hours/day of manual WhatsApp messaging with automated order workflows

Two operations staff spent 4+ hours every day manually copying order information from Shopify into WhatsApp messages to customers: order confirmations, dispatch notifications, delivery updates, and cart recovery follow-ups. They were using a spreadsheet to track who had been messaged. When volume increased, customer communication started to slip — some customers weren't notified for 12+ hours after ordering.

Mapped every message type and trigger point across the customer journey. Identified that 90% of the messages followed a fixed template with order-specific variables. The remaining 10% were genuine customer queries that needed human handling.

Built a FastAPI webhook handler on AWS Lambda that receives Shopify order events and triggers the appropriate WhatsApp Business API messages. Order confirmations send within 60 seconds of purchase. Dispatch notifications trigger from the fulfillment webhook. Cart recovery messages send 2 and 24 hours after abandonment if the cart has still not been purchased. A simple admin dashboard shows message delivery status.
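A webhook handler like this typically starts by verifying the request and mapping the event to a message type. The sketch below shows that step with the standard library only. Shopify signs each webhook with a base64-encoded HMAC-SHA256 of the raw body, sent in the `X-Shopify-Hmac-Sha256` header; the template names and function names here are hypothetical, not the ones used in the project.

```python
import base64
import hashlib
import hmac
from typing import Optional

# Hypothetical mapping from Shopify webhook topic to a WhatsApp template name.
TEMPLATE_BY_TOPIC = {
    "orders/create": "order_confirmation",
    "fulfillments/create": "dispatch_notification",
}


def is_valid_shopify_webhook(raw_body: bytes, hmac_header: str, secret: str) -> bool:
    """Verify the base64 HMAC-SHA256 signature against the raw request body."""
    digest = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).digest()
    expected = base64.b64encode(digest)
    # compare_digest runs in constant time, so it does not leak partial matches.
    return hmac.compare_digest(expected, hmac_header.encode("utf-8"))


def pick_template(topic: str) -> Optional[str]:
    """Return the template for a topic, or None for events that are ignored."""
    return TEMPLATE_BY_TOPIC.get(topic)
```

The signature must be computed over the raw bytes as received, before any JSON parsing, or valid requests will fail verification.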

Staff time on manual messaging dropped from 4 hours to under 20 minutes per day — only handling genuine customer queries. Cart recovery automation recovered an additional ₹3.8L in the first month. Customer satisfaction scores improved as communication became consistent and immediate.

₹3.8L
Monthly cart recovery
−95%
Manual messaging time
FastAPI, AWS Lambda, WhatsApp Business API, Shopify Webhooks, PostgreSQL, Redis

B2B SaaS · 30 days

AWS cost reduction from ₹18L/month to ₹7.4L/month — without any service degradation

The AWS bill had grown to ₹18L/month for a startup with ₹40L MRR. Leadership knew it was too high, but the bill was opaque: Cost Explorer showed 140+ line items across 6 AWS services, none of them tagged to a specific team or feature. Two engineers had looked at it previously and given up.

Started with a 3-day audit: full resource inventory using AWS Config, cost attribution analysis using Cost Explorer with a 12-month view, and a live architecture walkthrough with the engineering team. Found seven specific categories of waste.

Stopped 23 EC2 instances from failed experiments and prototypes (none had been used in 6+ months). Shut down dev/staging environments outside working hours using Lambda schedulers — they were running 24/7. Right-sized 8 over-provisioned RDS instances (none were above 40% CPU utilisation). Moved S3 storage to Intelligent-Tiering. Purchased 1-year Reserved Instances for stable production workloads. Set up AWS Budgets with SNS alerts at 80% and 100% of monthly target.
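The decision a Lambda scheduler makes on each tick for dev/staging resources can be sketched as follows. The working window (weekdays, 09:00 to 19:00) is an assumption for illustration since the actual hours are not stated, and `should_be_running` and `desired_action` are hypothetical names. The real scheduler would then call the relevant AWS start/stop APIs.

```python
from datetime import datetime, time

# Assumed working window; the actual hours used are not stated.
WORK_START = time(9, 0)
WORK_END = time(19, 0)


def should_be_running(now: datetime) -> bool:
    """True when a dev/staging resource should be up at the given local time."""
    is_weekday = now.weekday() < 5  # Monday=0 ... Friday=4
    return is_weekday and WORK_START <= now.time() < WORK_END


def desired_action(now: datetime, currently_running: bool) -> str:
    """Return 'start', 'stop', or 'none' for one scheduler tick."""
    wanted = should_be_running(now)
    if wanted and not currently_running:
        return "start"
    if not wanted and currently_running:
        return "stop"
    return "none"
```

Under that assumed window, an environment runs 50 of the 168 hours in a week, roughly 70% fewer running hours than 24/7.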

Monthly spend dropped from ₹18L to ₹7.4L — a 59% reduction — with zero degradation in application performance. Annual savings of approximately ₹1.3Cr. The engineering team now has full cost visibility with per-service tagging and weekly cost review in their existing Slack channel.

−59%
Monthly AWS spend (₹18L → ₹7.4L)
₹1.3Cr
Annual savings
AWS Cost Explorer, AWS Config, Terraform, Reserved Instances, Lambda schedulers, S3 Intelligent-Tiering

AI / SaaS Startup · 6 weeks

Making an AI customer support feature production-reliable after three failed attempts

The startup had tried three times to ship an LLM-powered customer support feature. The first attempt hit OpenAI rate limits with no fallback, returning errors to users. The second had no response validation — the model occasionally returned partial JSON that crashed the parser. The third ran up a $3,800 API bill in 4 days because someone forgot to set token limits. Engineering confidence in shipping the feature was low.

Did a code review of all three previous attempts to understand exactly what failed. The core problems were: no retry logic with exponential backoff, no response schema validation, no token budget management, no caching for repeated queries, and no observability into model behaviour.

Rebuilt the LLM integration layer as a standalone FastAPI service: structured prompting with Pydantic response models and strict schema validation, Redis caching for identical queries (roughly 40% cache hit rate in testing), token budget enforcement per request and per user per day, exponential backoff with fallback to a simpler rule-based response on repeated failure, and a Prometheus metrics endpoint tracking cost per query, cache hit rate, and model latency. Integrated with the existing Zendesk setup for escalation.
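Two of the safeguards above, the per-user daily token budget and exponential backoff with a rule-based fallback, can be sketched with the standard library alone. This is a simplified illustration, not the service's code: `TokenBudget`, `call_with_backoff`, the default limits, and the delays are all assumed, and real code would catch only rate-limit and timeout errors rather than every exception.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBudget:
    """Tracks tokens spent per user per day against a fixed daily cap."""

    def __init__(self, daily_limit: int) -> None:
        self.daily_limit = daily_limit
        self._spent: Dict[Tuple[str, str], int] = {}

    def try_spend(self, user_id: str, day: str, tokens: int) -> bool:
        key = (user_id, day)
        spent = self._spent.get(key, 0)
        if spent + tokens > self.daily_limit:
            return False  # over budget: the caller should skip the model call
        self._spent[key] = spent + tokens
        return True


def call_with_backoff(
    call_model: Callable[[], str],
    fallback: Callable[[], str],
    max_attempts: int = 4,      # assumed value
    base_delay: float = 0.5,    # assumed value, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry call_model with doubling delays, then use the fallback."""
    for attempt in range(max_attempts):
        try:
            return call_model()
        except Exception:  # simplification: narrow this in real code
            if attempt < max_attempts - 1:
                # Delays grow as base, 2*base, 4*base, ...
                sleep(base_delay * (2 ** attempt))
    return fallback()
```

Passing `sleep` as a parameter keeps the retry logic testable without real waiting. Adding random jitter to each delay is a common refinement when many clients retry at once.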

Feature shipped successfully on the fourth attempt. 71% of tier-1 support tickets are now resolved automatically without human review. Monthly API cost stabilised at $340, well within budget. CSAT scores for AI-handled tickets are 4.1/5, compared to 3.9/5 for human-handled tier-1 queries.

71%
L1 tickets resolved automatically
−91%
Support API cost vs failed attempt
FastAPI, OpenAI API, LangChain, Pinecone, Redis, Prometheus, Zendesk API

Internal Tooling / SaaS · 5 weeks

RAG-powered internal knowledge base: eliminating 2 hours/day of manual documentation search

A 35-person engineering and ops team at a B2B SaaS company had accumulated 4,000+ documents across Notion, Google Drive, and Confluence — onboarding guides, runbooks, API docs, client SOPs. New hires took 6–8 weeks to become productive because finding the right document required asking colleagues or searching across three separate tools. Senior engineers were being interrupted 8–12 times per day to answer questions that were documented somewhere.

Audited the full document corpus to understand structure, quality, and retrieval patterns. Chose a hybrid retrieval approach (semantic + keyword) after testing pure vector search, which missed exact-match queries such as error codes and command names. Designed a chunking strategy based on document type rather than fixed token counts.

Built a RAG pipeline using LangChain, pgvector on PostgreSQL (avoiding a separate vector DB for simplicity and cost), and OpenAI embeddings. Document ingestion runs as a nightly batch job with incremental updates. A FastAPI service handles queries with hybrid retrieval, re-ranking, and source attribution. Deployed a Slack bot as the primary interface — engineers ask questions in natural language and get answers with document citations.
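One common way to merge semantic and keyword results is reciprocal rank fusion (RRF). The sketch below shows it as an illustration of hybrid retrieval in general. Whether this pipeline used RRF or a different fusion method is not stated, so treat the technique, the function name, and the constant `k = 60` as assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists of document IDs; each hit scores 1 / (k + rank)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending, then by ID so ties are deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical document IDs: the keyword list surfaces an exact error-code match.
semantic_hits = ["runbook-a", "guide-b", "sop-c"]
keyword_hits = ["err-e42", "runbook-a"]
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

A document found by both retrievers rises to the top, while an exact-match hit that vector search missed still ranks near the top instead of being dropped.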

Average time-to-answer for documentation queries dropped from 18 minutes to under 2 minutes. Senior engineer interruptions reduced by approximately 70% within the first month. New hire ramp time decreased from 6–8 weeks to 3–4 weeks. The system handles 200+ queries per day with a 94% relevance rating from user feedback.

−78%
Time spent searching documentation
−70%
Senior engineer interruptions
FastAPI, LangChain, pgvector, PostgreSQL, OpenAI Embeddings, Slack API, Redis

Operations / Logistics · 4 weeks

AI-powered document extraction pipeline: replacing 3 hours/day of manual data entry

A logistics company received 150–200 vendor invoices, delivery notes, and customs documents daily — in PDF, image, and mixed-format email attachments. A three-person ops team spent 3+ hours daily manually extracting line items, dates, reference numbers, and amounts into a spreadsheet, then cross-referencing against purchase orders. Error rate was around 4%, which caused payment disputes and reconciliation delays.

Tested several approaches, including AWS Textract alone (poor performance on low-quality scans and non-standard layouts) and a pure GPT-4 Vision approach (too expensive at scale). Settled on a two-stage pipeline: preprocessing with OCR normalisation, then structured extraction with a carefully tuned prompt and a strict Pydantic output schema.

Built a FastAPI processing service on AWS Lambda: email attachment ingestion via Gmail API webhook, PDF/image normalisation with preprocessing, GPT-4o extraction with a structured prompt template producing validated JSON, confidence scoring to flag uncertain extractions for human review, and direct database write to the ERP system. Human review queue in a simple React dashboard for the 8–12% of documents below confidence threshold.
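The validate-then-route step can be sketched as below. To stay dependency-free, this illustration uses a standard-library dataclass as a stand-in for the Pydantic schema. The field names, the `0.85` threshold, and the function names are assumptions, not the project's actual values.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal, InvalidOperation
from typing import Any, Dict

# Assumed threshold; the actual cut-off is not stated.
CONFIDENCE_THRESHOLD = 0.85


@dataclass(frozen=True)
class Invoice:
    reference: str
    issued_on: date
    total: Decimal


def parse_invoice(payload: Dict[str, Any]) -> Invoice:
    """Validate raw model output; raise ValueError on anything malformed."""
    try:
        reference = str(payload["reference"]).strip()
        issued_on = date.fromisoformat(payload["issued_on"])
        total = Decimal(str(payload["total"]))
    except (KeyError, TypeError, ValueError, InvalidOperation) as exc:
        raise ValueError(f"invalid extraction: {exc!r}") from exc
    if not reference or not total.is_finite() or total < 0:
        raise ValueError("invalid extraction: bad reference or total")
    return Invoice(reference, issued_on, total)


def route(payload: Dict[str, Any], confidence: float) -> str:
    """Return 'auto' to write to the ERP or 'review' for the human queue."""
    try:
        parse_invoice(payload)
    except ValueError:
        return "review"  # malformed output always goes to a person
    return "auto" if confidence >= CONFIDENCE_THRESHOLD else "review"
```

Schema failures and low confidence both land in the same review queue, so nothing malformed or uncertain reaches the ERP unreviewed.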

Manual data entry time dropped from 3 hours to 25 minutes per day — the team now only reviews flagged edge cases. Error rate on auto-processed documents: 0.3% vs 4% manual. ROI on the engagement was realised within 6 weeks through labour savings.

−87%
Daily manual data entry time
0.3%
Error rate (vs 4% manual)
FastAPI, GPT-4o API, AWS Lambda, Gmail API, PostgreSQL, React, Pydantic

B2B SaaS / Ops · 7 weeks

Agentic workflow: automating a 6-step sales enrichment process that took 45 minutes per lead

A B2B SaaS sales team qualified 30–50 inbound leads per day. Each lead required: LinkedIn research, company website review, tech stack identification via BuiltWith, CRM history check, competitive context lookup, and a personalised outreach draft. A sales development rep spent 45 minutes per lead on this. At 30 leads per day, that was roughly 22.5 hours of work per day for a team of three.

Mapped the full 6-step enrichment process with the sales team to understand which steps were rule-based (CRM lookup, data APIs) and which required judgment (personalisation quality, relevance scoring). Designed an agent architecture with separate tool functions for each data source, LLM-driven synthesis, and a human approval step before outreach is sent.

Built a LangGraph multi-agent pipeline: a coordinator agent dispatches sub-tasks to specialised tool agents (LinkedIn scraper, BuiltWith API, CRM API, web search), collects structured outputs, and passes them to a synthesis agent that scores lead quality and drafts a personalised outreach message. Human SDR reviews the draft and approves/edits before send. Full workflow runs in under 3 minutes per lead.
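The coordinator's fan-out-and-collect behaviour can be illustrated with plain `asyncio`. This sketch deliberately does not use LangGraph, and the tool names, the `enrich_lead` function, and the `pending_sdr_review` status are hypothetical. It only shows the shape: run the data-source tools concurrently, tolerate individual failures, and stop at the human approval step.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

Tool = Callable[[str], Awaitable[Dict[str, Any]]]


async def enrich_lead(lead_id: str, tools: Dict[str, Tool]) -> Dict[str, Any]:
    """Run every tool concurrently and collect results keyed by tool name."""
    names = list(tools)
    results = await asyncio.gather(
        *(tools[name](lead_id) for name in names),
        return_exceptions=True,  # one failing source should not sink the lead
    )
    enriched: Dict[str, Any] = {"lead_id": lead_id, "errors": []}
    for name, result in zip(names, results):
        if isinstance(result, BaseException):
            enriched["errors"].append(name)
        else:
            enriched[name] = result
    # The draft always waits for a human; nothing here sends outreach.
    enriched["status"] = "pending_sdr_review"
    return enriched
```

Running the lookups concurrently means total latency tracks the slowest source rather than the sum of all of them.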

Per-lead enrichment time dropped from 45 minutes to 4 minutes of human review. Team capacity increased from 30 to 120 leads per day without additional headcount. Pipeline quality improved — lead scoring consistency meant sales calls were better targeted, improving demo-to-close rate from 18% to 24% over the following quarter.

−91%
Time per lead (45min → 4min)
4×
Lead capacity increase
LangGraph, LangChain, OpenAI API, FastAPI, PostgreSQL, CRM API, BuiltWith API

Working on something similar?

Book a call and I'll tell you what a similar engagement would look like for your situation.
