From EC2 to ECS with Zero Downtime: A Migration Playbook

April 16, 2025

When I joined as Head of Development, the entire platform ran on EC2 instances managed with a mix of shell scripts and manual deployments. Deployments were risky, rollbacks were slow, and scaling meant SSH-ing into boxes. We needed to move to ECS — but we couldn't afford any downtime.

This is the playbook we followed.

Why ECS over EKS

EKS (Kubernetes) was on the table, but for a team of our size it was overkill. ECS with Fargate gave us container orchestration without the operational overhead of managing a control plane. The trade-off was less flexibility, but we didn't need it — we needed reliability and speed.

The migration strategy

We ran both environments in parallel for weeks. The key principle: never cut over in one go.

1. Containerise everything first

Before touching infrastructure, we containerised every service. This meant writing Dockerfiles, setting up multi-stage builds, and making sure each service could run identically in both EC2 and ECS.

FROM node:20-alpine AS builder
WORKDIR /app
COPY package.json pnpm-lock.yaml ./
RUN corepack enable && pnpm install --frozen-lockfile
COPY . .
RUN pnpm build
 
FROM node:20-alpine AS runner
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
EXPOSE 3000
CMD ["node", "dist/main.js"]

We validated every container locally and in CI before moving on.
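Locally, validation was a quick smoke test per service before CI repeated the same steps. A minimal sketch, assuming a service listening on port 3000 with a `/health` endpoint (the image name is illustrative):

```shell
# Build the image and run it detached, mapping the service port.
docker build -t api:local .
docker run -d --rm --name api-smoke -p 3000:3000 api:local

# Give the app a moment to boot, then probe the same endpoint
# the ECS health check will later hit. --fail makes curl exit
# non-zero on any 4xx/5xx, which fails the CI job.
sleep 5
curl --fail http://localhost:3000/health

docker stop api-smoke
```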

2. AWS Copilot for infrastructure

Instead of writing raw Terraform or CloudFormation, we used AWS Copilot CLI to define and manage our ECS services. Copilot abstracts away the boilerplate — VPCs, ALBs, target groups, task definitions — and lets you describe services declaratively.

Copilot uses a manifest.yml for each service where you configure resources, health checks, and scaling:

name: api
type: Load Balanced Web Service
 
image:
  build: Dockerfile
  port: 3000
 
http:
  path: "/"
  deregistration_delay: 20s
  healthcheck:
    path: /health
    healthy_threshold: 2
    unhealthy_threshold: 2
    interval: 15s
    timeout: 10s
 
cpu: 1024
memory: 2048
platform: linux/arm64
count:
  cpu_percentage: 70
  range:
    min: 1
    max: 10
    spot_from: 1

Copilot also handled environment promotion (staging → production) and secrets management via AWS Secrets Manager, which saved us a lot of glue code.
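The day-to-day commands behind that workflow look roughly like this; the environment and secret names are illustrative, not our actual values:

```shell
# Provision and deploy an environment (VPC, subnets, ECS cluster).
copilot env init --name production
copilot env deploy --name production

# Store a secret in AWS Secrets Manager; Copilot prompts for the
# value and the services can then reference it from manifest.yml.
copilot secret init --name DB_PASSWORD
```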

The deployment circuit breaker came for free — if a new deployment failed health checks, ECS would automatically roll back.
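Copilot enables the circuit breaker by default. If you manage the service directly, the equivalent raw ECS setting looks like this (cluster and service names are placeholders):

```shell
# Enable the deployment circuit breaker with automatic rollback,
# so a deployment that fails health checks reverts on its own.
aws ecs update-service \
  --cluster production \
  --service api \
  --deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true}"
```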

3. Parallel running with weighted routing

This was the key to zero downtime. We used an ALB with two target groups — one pointing at EC2, one at ECS. Route 53 weighted routing let us gradually shift traffic.

The rollout looked like this:

  • Week 1: 100% EC2, 0% ECS (ECS running but no traffic)
  • Week 2: 90% EC2, 10% ECS
  • Week 3: 50/50
  • Week 4: 10% EC2, 90% ECS
  • Week 5: 0% EC2, 100% ECS
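Each weekly step was a single Route 53 update to the two weighted records. A sketch of the week-2 shift, with placeholder hosted-zone ID and DNS names:

```shell
# Two weighted CNAME records share the same name; traffic splits
# in proportion to Weight (here 90% EC2 / 10% ECS).
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0000000EXAMPLE \
  --change-batch '{
    "Changes": [
      {"Action": "UPSERT", "ResourceRecordSet": {
        "Name": "api.example.com", "Type": "CNAME", "TTL": 60,
        "SetIdentifier": "ec2", "Weight": 90,
        "ResourceRecords": [{"Value": "ec2-alb.example.com"}]}},
      {"Action": "UPSERT", "ResourceRecordSet": {
        "Name": "api.example.com", "Type": "CNAME", "TTL": 60,
        "SetIdentifier": "ecs", "Weight": 10,
        "ResourceRecords": [{"Value": "ecs-alb.example.com"}]}}
    ]
  }'
```

A low TTL matters here: with 60 seconds, shifting traffic back during an incident takes effect almost immediately.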

At each stage we monitored error rates, latency, and resource usage in Datadog. Any spike and we could shift traffic back in minutes.

4. Database connections and shared state

The trickiest part was ensuring both environments could talk to the same database and cache layers without conflicts. RDS and ElastiCache already lived inside the VPC, so granting the ECS tasks access was mostly a matter of security group rules.
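Granting the ECS tasks the same database access the EC2 instances had comes down to one ingress rule per port. An illustrative example with placeholder security group IDs:

```shell
# Allow the ECS tasks' security group to reach RDS (Postgres, 5432),
# alongside the existing rule for the EC2 instances.
aws ec2 authorize-security-group-ingress \
  --group-id sg-0aaa111rds \
  --protocol tcp --port 5432 \
  --source-group sg-0bbb222ecs
```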

Session state was already in Redis, so users could hit either environment without noticing.

5. CI/CD pipeline migration

We moved from a Jenkins-based pipeline to GitHub Actions. Copilot made deployments trivial:

copilot svc deploy --name api --env production

Each push to main would:

  1. Build the Docker image
  2. Push it to ECR and update the ECS task definition via Copilot
  3. Let ECS roll out the new tasks automatically
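A trimmed-down sketch of the workflow, assuming OIDC-based AWS credentials (account ID, role, and region are placeholders):

```yaml
name: deploy
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/deploy  # placeholder
          aws-region: eu-west-1
      # Install the Copilot CLI, then let it build, push, and deploy.
      - run: |
          curl -Lo copilot https://github.com/aws/copilot-cli/releases/latest/download/copilot-linux
          chmod +x copilot
      - run: ./copilot svc deploy --name api --env production
```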

Deployments went from "schedule a window and hope" to "merge and forget". Average deploy time dropped from 25 minutes to under 5.

What we got wrong

  • Health check tuning: Our initial health check intervals were too aggressive. ECS kept killing containers that were still warming up. We had to increase the grace period and adjust thresholds.
  • Log routing: We underestimated the volume of logs from Fargate tasks. CloudWatch costs spiked until we set up log filtering and moved to a more targeted approach.
  • Task sizing: We over-provisioned at first (safe, but expensive). It took a few weeks of monitoring to right-size CPU and memory allocations.
  • Shadow services: During the audit we discovered services running on EC2 that nobody on the team knew about. Undocumented cron jobs, legacy background workers, forgotten internal tools — all had to be containerised and brought under monitoring before we could safely decommission the old instances.
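The health-check fix, in manifest terms, was a warm-up grace period plus gentler thresholds. The values below are illustrative, not our production numbers:

```yaml
http:
  healthcheck:
    path: /health
    grace_period: 60s        # ignore failing checks while the app warms up
    healthy_threshold: 2
    unhealthy_threshold: 3   # tolerate one more blip before killing the task
    interval: 30s
    timeout: 10s
```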

Results

  • Zero downtime during the entire migration
  • Deployment frequency went from weekly to multiple times per day
  • Mean time to recovery dropped from hours to minutes (automatic rollbacks)
  • Infrastructure costs decreased ~30% after right-sizing Fargate tasks
  • Engineers could deploy independently without ops involvement

Takeaways

If you're planning a similar migration:

  1. Containerise first, migrate second. Don't try to do both at once.
  2. Run parallel environments and shift traffic gradually. It's slower but dramatically safer.
  3. Invest in observability before you start. You need to see what's happening in both environments.
  4. Use deployment circuit breakers. Automatic rollbacks saved us more than once.
  5. Right-size later. Start generous with resources and optimise once you have production data.

The whole migration took about 6 weeks from first container to full cutover. The team could deploy at any time with confidence, and we never had to tell a customer "we're doing maintenance tonight."

That's the goal.