Production System Design

All production system design concepts fall into 4 buckets:

Handle more users → scaling
Respond faster → performance
Survive failures → reliability
Control behavior safely → release + observability

Module 1: Traffic & Load Handling

Core Concept	The Real-World Problem	Production Solution (Node/AWS)	Failure Mode / Trade-off
Load Balancing & Horizontal Scaling	A single server crashes under a 10k-user traffic spike.	Wrap Express app in Docker; run multiple replicas behind AWS ALB.	Breaks in-memory state (sessions, local caches); requires external session storage.
Auto Scaling	Paying for 100 idle containers at 3 AM.	ECS Service Auto Scaling (or EC2 Auto Scaling Groups for instances) triggered by active connections or request count, not just CPU.	Scaling out takes time (provisioning, pulling images); sudden spikes can still cause downtime.
Rate Limiting	Abusive scripts spam and crash the Express API.	Distributed rate limiting using Redis (`rate-limit-redis`).	Redis becomes a Single Point of Failure (SPOF); adds latency.
Backpressure	Reading a 5GB S3 file into memory crashes Node (OOM).	Use Node.js Streams (`stream.pipeline()` or `.pipe()`) so reading pauses when the write buffer is full.	Adds complexity to simple I/O tasks; plain `.pipe()` does not forward errors between streams.
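
The rate-limiting row can be sketched as a fixed-window counter. This is a minimal illustration, not the `rate-limit-redis` implementation: `FixedWindowLimiter` and its injected clock are hypothetical names, and the `Map` only stands in for Redis. With several replicas behind the ALB, the counters would have to live in the shared store.

```typescript
// Hypothetical fixed-window limiter. The Map stands in for Redis; with
// several replicas the counters must live in a shared store instead.
interface WindowState {
  windowStart: number; // ms timestamp when the current window opened
  count: number; // requests seen in the current window
}

class FixedWindowLimiter {
  private readonly windows = new Map<string, WindowState>();
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly now: () => number;

  constructor(limit: number, windowMs: number, now: () => number = Date.now) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now; // injectable clock keeps the sketch testable
  }

  // Returns true when the caller identified by `key` (e.g. an IP) may proceed.
  allow(key: string): boolean {
    const t = this.now();
    const state = this.windows.get(key);
    if (state === undefined || t - state.windowStart >= this.windowMs) {
      // First request, or the previous window has expired: open a new one.
      this.windows.set(key, { windowStart: t, count: 1 });
      return true;
    }
    if (state.count >= this.limit) {
      return false; // over the limit: the API would answer 429
    }
    state.count += 1;
    return true;
  }
}
```

A fixed window is the simplest variant; it allows short bursts at window boundaries, which sliding-window or token-bucket limiters smooth out.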

Module 2: Performance Optimization

Core Concept	The Real-World Problem	Production Solution (Node/AWS)	Failure Mode / Trade-off
Caching (Cache-Aside)	Complex PostgreSQL joins take 500ms and max out DB CPU.	Check Redis first. If miss, query DB, save to Redis with TTL, return data.	Cache invalidation; users seeing stale data after an update.
Content Delivery Network (CDN)	High latency for global users fetching static assets.	Push Next.js static bundles and images to Cloudflare / AWS CloudFront.	Accidentally caching authenticated/private API routes globally.
Lazy Loading & Pagination	DB returns 10,000 rows; Next.js bundle is 4MB.	Cursor-based pagination on backend; `next/dynamic` on frontend.	Cursor pagination is harder to implement than simple `OFFSET`.
Compression	Sending massive raw JSON payloads clogs slow mobile networks.	Offload Brotli/Gzip compression to Nginx or AWS CloudFront (ALB does not compress responses).	Compressing inside Node.js burns CPU and libuv thread-pool capacity needed for request handling.
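
The cache-aside flow from the table (check cache, on a miss query the DB, store with TTL, return) can be sketched as follows. `CacheAside` and `loadFromDb` are hypothetical names, and an in-memory `Map` stands in for Redis so the example stays self-contained.

```typescript
// Hypothetical cache-aside helper; an in-memory Map stands in for Redis.
interface CacheEntry<T> {
  value: T;
  expiresAt: number; // ms timestamp after which the entry is stale
}

class CacheAside<T> {
  private readonly entries = new Map<string, CacheEntry<T>>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // 1. check the cache, 2. on a miss load from the source of truth,
  // 3. store the result with a TTL, 4. return the value.
  async get(key: string, loadFromDb: (key: string) => Promise<T>): Promise<T> {
    const hit = this.entries.get(key);
    if (hit !== undefined && hit.expiresAt > this.now()) {
      return hit.value;
    }
    const value = await loadFromDb(key);
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
    return value;
  }

  // Call after a write so readers do not keep seeing the old value.
  invalidate(key: string): void {
    this.entries.delete(key);
  }
}
```

The TTL bounds how long stale data can survive; calling `invalidate` on every write narrows that window further, at the cost of coupling write paths to the cache.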

Module 3: Asynchronous Processing

Core Concept	The Real-World Problem	Production Solution (Node/AWS)	Failure Mode / Trade-off
Message Queues	5-second image resize blocks the Node event loop for everyone.	Return `202 Accepted`; push job to AWS SQS; process in background worker.	Worker crashes mid-job; requires ACKs (delete the message only after success), visibility timeouts, and a Dead Letter Queue (DLQ).
Pub/Sub (Event-Driven)	Uploading a video triggers 4 separate microservices sequentially.	Publish `VideoUploaded` to SNS/Kafka; services process in parallel.	Eventual consistency; UI needs WebSockets (or polling) to know when processing is done.
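
The ACK and DLQ trade-off in the Message Queues row can be sketched with a toy in-memory queue. Everything here is hypothetical (`InMemoryJobQueue`, `deliverNext`, `maxReceives`) and nothing calls AWS; with SQS the same roles are played by deleting the message after success, the visibility timeout, and a redrive policy.

```typescript
// Hypothetical in-memory queue that mimics the ACK / redelivery / DLQ cycle.
interface Job {
  id: string;
  body: string;
  receiveCount: number; // how many times the job was handed to a worker
}

class InMemoryJobQueue {
  readonly pending: Job[] = [];
  readonly deadLetters: Job[] = [];
  private readonly maxReceives: number;

  constructor(maxReceives: number) {
    this.maxReceives = maxReceives;
  }

  send(id: string, body: string): void {
    this.pending.push({ id, body, receiveCount: 0 });
  }

  // Hands the next job to `handler`. A normal return acts as the ACK and the
  // job is gone; a thrown error requeues it, or parks it in the DLQ once it
  // has been received `maxReceives` times. Returns false on an empty queue.
  deliverNext(handler: (job: Job) => void): boolean {
    const job = this.pending.shift();
    if (job === undefined) {
      return false;
    }
    job.receiveCount += 1;
    try {
      handler(job);
    } catch {
      if (job.receiveCount >= this.maxReceives) {
        this.deadLetters.push(job); // poison message: stop retrying
      } else {
        this.pending.push(job); // make it visible again for another attempt
      }
    }
    return true;
  }
}
```

Because a job may be delivered more than once, handlers should be idempotent: resizing the same image twice must be harmless.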

Module 4: System Reliability

Core Concept	The Real-World Problem	Production Solution (Node/AWS)	Failure Mode / Trade-off
Timeouts	3rd-party API hangs; pending requests pile up until Node runs out of sockets or memory.	Always use `AbortController` or Axios timeouts for external requests.	Choosing the right timeout window (too short = false failures; too long = requests still pile up).
Retries & Backoff	Instant retries act as a DDoS attack on struggling APIs.	Implement Exponential Backoff with Jitter (1s, 2.5s, 4.1s).	Delays the eventual failure response back to the client.
Circuit Breaker	Payment API is completely down; waiting for every timeout ties up connections and memory.	Use `opossum` to "open" the circuit and fail instantly for 30 seconds.	Requires careful tuning of failure thresholds and recovery windows.
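
The Retries & Backoff row can be sketched as below. The function names, the 50% jitter ratio, and the default base and cap are illustrative assumptions; other jitter strategies (for example, picking a random delay between zero and the exponential value) are equally valid.

```typescript
// Delay before retry number `attempt` (0-based): exponential growth, capped,
// then stretched by up to 50% random jitter so clients do not retry in
// lockstep. The cap applies before jitter; the 50% ratio is illustrative.
function backoffDelayMs(
  attempt: number,
  baseMs: number,
  capMs: number,
  random: () => number = Math.random,
): number {
  const exponential = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.round(exponential * (1 + 0.5 * random()));
}

// Runs `operation` up to `maxAttempts` times, sleeping between failures.
// `sleep` is injectable so the sketch can be exercised without real delays.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts: number,
  baseMs: number = 1000,
  capMs: number = 30000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, baseMs, capMs));
      }
    }
  }
  throw lastError; // all attempts failed: surface the final error
}
```

With a 1s base, the first three delays fall in the ranges 1 to 1.5s, 2 to 3s, and 4 to 6s, consistent with the example sequence in the table. Only retry operations that are idempotent or carry an idempotency key.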

Module 5: Security & Access Control

Core Concept	The Real-World Problem	Production Solution (Node/AWS)	Failure Mode / Trade-off
Authentication (JWT)	Storing JWTs in `localStorage` exposes them to theft via XSS.	Send short-lived JWTs in `httpOnly`, `Secure`, `SameSite` cookies; store Refresh Tokens in DB.	Cookies need CSRF protection; implementing secure token rotation and revocation is complex.
Authorization (ABAC)	User A modifies `projectId` in API payload to delete User B's project (IDOR).	Validate ownership against DB or embed resource IDs in the JWT payload.	DB lookup on every protected API route; IDs embedded in a JWT go stale until the token expires.
API Gateway & Edge Security	Botnets brute-force the Express login endpoint.	AWS WAF blocks malicious IPs at the edge before hitting Docker.	Legitimate users getting blocked by overly aggressive WAF rules.
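
The ownership check from the Authorization row can be sketched like this. The types, error classes, and `deleteProject` signature are hypothetical, and a `Map` stands in for the projects table; the point is that the requester identity comes from the verified session, while the payload only supplies the target ID.

```typescript
// Hypothetical types; a Map stands in for the projects table.
interface Project {
  id: string;
  ownerId: string;
}

class NotFoundError extends Error {}
class ForbiddenError extends Error {}

// `requesterId` must come from the verified session or JWT, never from the
// request payload. `projectId` is untrusted input supplied by the client.
function deleteProject(
  projects: Map<string, Project>,
  requesterId: string,
  projectId: string,
): void {
  const project = projects.get(projectId);
  if (project === undefined) {
    throw new NotFoundError("project not found");
  }
  if (project.ownerId !== requesterId) {
    // The ID exists but belongs to someone else: refuse instead of deleting.
    throw new ForbiddenError("not the owner of this project");
  }
  projects.delete(projectId);
}
```

In an Express handler these errors would typically map to 404 and 403. Some APIs return 404 in both cases so attackers cannot probe which IDs exist.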

Module 6: Data & Database Scaling

Core Concept	The Real-World Problem	Production Solution (Node/AWS)	Failure Mode / Trade-off
Replication (Read/Write Split)	Single DB cannot handle the volume of `SELECT` queries.	Route `INSERT/UPDATE/DELETE` to Primary DB, route `SELECT` to Read Replicas via Prisma.	Replication lag; users refresh the page and see old data.
Database Indexing	Query scans 5 million rows ($O(N)$), taking 8 seconds.	Create B-Tree indexes on heavily queried columns.	Indexes consume storage and memory, and every extra index adds work to each write.
Sharding	Data size exceeds the physical limits of a single AWS RDS instance.	Horizontally partition data across multiple DBs using a Shard Key (e.g., `tenantId`).	Cross-shard joins and transactions must be handled in application code and become slow and complex.
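
Routing by shard key, as in the Sharding row, can be sketched with a stable hash of `tenantId`. `fnv1a` and `shardFor` are illustrative names, and FNV-1a is just one convenient choice of stable hash, not something the table prescribes.

```typescript
// FNV-1a: a small, stable 32-bit string hash. Any hash that gives the same
// result on every replica would do; this choice is illustrative.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // 32-bit multiply by the FNV prime
  }
  return hash >>> 0; // force an unsigned result
}

// Maps a tenant to a shard index in the range [0, shardCount).
function shardFor(tenantId: string, shardCount: number): number {
  return fnv1a(tenantId) % shardCount;
}
```

Plain modulo routing remaps most tenants when `shardCount` changes, so resharding means moving data. Consistent hashing or a tenant-to-shard lookup table reduces that cost.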

Module 7: Deployment & Release Strategies

Core Concept	The Real-World Problem	Production Solution (Node/AWS)	Failure Mode / Trade-off
Rolling Deployment	Restarting all containers at once causes downtime.	Orchestrator replaces old containers with new ones gradually.	Mixed versions: v1 and v2 run simultaneously, which breaks clients unless API and schema changes are backward compatible.
Blue-Green Deployment	Need instant zero-downtime rollbacks if a release fails.	Deploy to isolated "Green" environment; flip ALB traffic 100% instantly.	If both environments share a database, migrations run for Green can crash the live Blue environment.
Canary Release	Pushing a hidden bug to 100% of users.	Route 5% of traffic to the new version; monitor errors; ramp up to 100%.	Requires sticky sessions to prevent users bouncing between versions.
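
One way to get the stickiness the Canary Release row asks for is to derive the version from a stable hash of the user ID instead of a random draw per request. This is a sketch under that assumption; `bucketFor` and `pickVersion` are hypothetical names, and in practice the split is often done at the load balancer with weighted target groups.

```typescript
type Version = "stable" | "canary";

// Maps a user to a stable bucket in 0..99 using an FNV-1a style hash.
function bucketFor(userId: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < userId.length; i++) {
    hash ^= userId.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0) % 100;
}

// Users whose bucket falls below `canaryPercent` get the new version.
// The same user always lands in the same bucket, so they do not bounce
// between versions, and raising the percentage only ever adds users.
function pickVersion(userId: string, canaryPercent: number): Version {
  return bucketFor(userId) < canaryPercent ? "canary" : "stable";
}
```

Ramping from 5 to 100 then becomes a config change to `canaryPercent`, and setting it back to 0 acts as the rollback.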