The production system design concepts below are grouped into seven tables, one per problem area:

| Core Concept | The Real-World Problem | Production Solution (Node/AWS) | Failure Mode / Trade-off |
|---|---|---|---|
| Load Balancing & Horizontal Scaling | A single server crashes under a 10k-user traffic spike. | Wrap the Express app in Docker; run multiple replicas behind an AWS ALB. | Breaks in-memory state; requires external session storage. |
| Auto Scaling | Paying for 100 idle containers at 3 AM. | AWS Auto Scaling (ECS Service Auto Scaling or EC2 Auto Scaling Groups) driven by request count or active connections, not just CPU. | Scaling out takes time (launching instances, pulling images); sudden spikes can still cause downtime. |
| Rate Limiting | Abusive scripts spam and crash the Express API. | Distributed rate limiting using Redis (rate-limit-redis). | Redis becomes a Single Point of Failure (SPOF); adds latency. |
| Backpressure | Reading a 5GB S3 file into memory crashes Node (OOM). | Use Node.js Streams (.pipe()) to pause reading when the write buffer is full. | Adds complexity to simple I/O tasks. |
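
The rate-limiting row can be sketched as a fixed-window counter. This is a minimal illustration, not how rate-limit-redis is implemented: `MemoryStore` and `createRateLimiter` are hypothetical names, and the in-memory `Map` stands in for the shared Redis counter that lets every replica see the same count.

```javascript
// Hypothetical in-memory stand-in for a shared Redis counter with expiry.
class MemoryStore {
  constructor() {
    this.counters = new Map(); // key -> { count, expiresAt }
  }

  // Increment the counter for `key`, starting a new window if the old one expired.
  increment(key, windowMs, now) {
    const entry = this.counters.get(key);
    if (!entry || entry.expiresAt <= now) {
      const fresh = { count: 1, expiresAt: now + windowMs };
      this.counters.set(key, fresh);
      return fresh;
    }
    entry.count += 1;
    return entry;
  }
}

// Returns a function that decides whether a request from `clientId` is allowed.
function createRateLimiter({ store, limit, windowMs }) {
  return function check(clientId, now = Date.now()) {
    const { count, expiresAt } = store.increment(`rl:${clientId}`, windowMs, now);
    const allowed = count <= limit;
    return {
      allowed,
      remaining: Math.max(0, limit - count),
      retryAfterMs: allowed ? 0 : expiresAt - now, // useful for a Retry-After header
    };
  };
}
```

In an Express app the result would typically map to a 429 response when `allowed` is false. Because the counter lives outside the process, the trade-off in the table applies: if the store is unreachable, the limiter has to either fail open or reject traffic.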

| Core Concept | The Real-World Problem | Production Solution (Node/AWS) | Failure Mode / Trade-off |
|---|---|---|---|
| Caching (Cache-Aside) | Complex PostgreSQL joins take 500ms and max out DB CPU. | Check Redis first. If miss, query DB, save to Redis with TTL, return data. | Cache invalidation; users seeing stale data after an update. |
| Content Delivery Network (CDN) | High latency for global users fetching static assets. | Push Next.js static bundles and images to Cloudflare / AWS CloudFront. | Accidentally caching authenticated/private API routes globally. |
| Lazy Loading & Pagination | DB returns 10,000 rows; Next.js bundle is 4MB. | Cursor-based pagination on backend; next/dynamic on frontend. | Cursor pagination is harder to implement than simple OFFSET. |
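| Compression | Large uncompressed JSON responses are slow to download on mobile networks. | Offload Brotli/Gzip compression to Nginx or a CDN such as AWS CloudFront (ALB does not compress responses). | Compressing inside Node.js competes with request handling for CPU and thread-pool time. |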
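
The cache-aside row describes a read path (check cache, fall back to the database, store with a TTL) and a failure mode (stale data after an update). The sketch below shows both under stated assumptions: `TtlCache` is a hypothetical in-memory stand-in for Redis, and `loadFromDb` / `writeToDb` are placeholder callbacks for the real PostgreSQL queries.

```javascript
// Hypothetical in-memory TTL cache standing in for Redis (not a Redis client).
class TtlCache {
  constructor(now = () => Date.now()) {
    this.now = now;
    this.entries = new Map(); // key -> { value, expiresAt }
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: treat as a miss
      return undefined;
    }
    return entry.value;
  }

  set(key, value, ttlMs) {
    this.entries.set(key, { value, expiresAt: this.now() + ttlMs });
  }

  delete(key) {
    this.entries.delete(key);
  }
}

// Read path: check the cache, fall back to the loader on a miss, store with a TTL.
async function getWithCache(cache, key, ttlMs, loadFromDb) {
  const cached = cache.get(key);
  if (cached !== undefined) return cached;
  const fresh = await loadFromDb();
  cache.set(key, fresh, ttlMs);
  return fresh;
}

// Write path: update the source of truth first, then drop the cached copy
// so the next read repopulates it instead of serving stale data.
async function updateAndInvalidate(cache, key, writeToDb) {
  const result = await writeToDb();
  cache.delete(key);
  return result;
}
```

Deleting the key on write narrows the stale-data window but does not close it: a read that started before the write can still repopulate the cache with the old value, which is why the TTL remains the backstop.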

| Core Concept | The Real-World Problem | Production Solution (Node/AWS) | Failure Mode / Trade-off |
|---|---|---|---|
| Message Queues | 5-second image resize blocks the Node event loop for everyone. | Return 202 Accepted; push job to AWS SQS; process in background worker. | Worker crashes mid-job; requires Dead Letter Queues (DLQ) and ACKs. |
| Pub/Sub (Event-Driven) | Uploading a video triggers 4 separate microservices sequentially. | Publish VideoUploaded to SNS/Kafka; services process in parallel. | Eventual consistency; UI needs WebSockets to know when processing is done. |
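
The failure mode in the message-queue row (a worker failing mid-job) is usually handled by redelivering the message a bounded number of times and then parking it in a dead-letter queue. The sketch below models that behavior in memory; `InMemoryQueue` and `maxReceiveCount` are illustrative names, not the SQS API, and the handler is synchronous to keep the example short.

```javascript
// In-memory queue with at-least-once delivery and a dead-letter list.
// A stand-in for queue semantics only; it does not talk to SQS.
class InMemoryQueue {
  constructor({ maxReceiveCount = 3 } = {}) {
    this.maxReceiveCount = maxReceiveCount;
    this.messages = [];    // waiting for (re)delivery
    this.deadLetters = []; // gave up after maxReceiveCount attempts
  }

  send(body) {
    this.messages.push({ body, receiveCount: 0 });
  }

  // Deliver every waiting message once. A handler that returns normally acts as
  // the ACK (the message is dropped); a thrown error re-queues the message or,
  // once the receive limit is reached, moves it to the dead-letter list.
  drain(handler) {
    const batch = this.messages;
    this.messages = [];
    for (const message of batch) {
      message.receiveCount += 1;
      try {
        handler(message.body);
      } catch (err) {
        message.lastError = String((err && err.message) || err);
        if (message.receiveCount >= this.maxReceiveCount) {
          this.deadLetters.push(message);
        } else {
          this.messages.push(message);
        }
      }
    }
  }
}
```

Because a message can be delivered more than once, the worker's handler should be idempotent: resizing the same image twice must not produce duplicate side effects.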

| Core Concept | The Real-World Problem | Production Solution (Node/AWS) | Failure Mode / Trade-off |
|---|---|---|---|
| Timeouts | 3rd-party API hangs; Node connections pile up and crash. | Always use AbortController or Axios timeouts for external requests. | Choosing the right timeout window (too short = false failures). |
| Retries & Backoff | Instant retries act as a DDoS attack on struggling APIs. | Implement Exponential Backoff with Jitter (1s, 2.5s, 4.1s). | Delays the eventual failure response back to the client. |
| Circuit Breaker | Payment API is completely down; waiting for timeouts wastes CPU. | Use opossum to "open" the circuit and fail instantly for 30 seconds. | Requires careful tuning of failure thresholds and recovery windows. |
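
The retry row's example delays (1s, 2.5s, 4.1s) are consistent with a doubling base delay plus a random jitter term. A minimal sketch of that variant follows; the function names, the default values, and the injectable `random` and `sleep` parameters are assumptions made for illustration and testing, not part of any library.

```javascript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped,
// plus a random jitter in [0, jitterMs) so clients do not retry in lockstep.
function backoffDelayMs(
  attempt,
  { baseMs = 1000, capMs = 30000, jitterMs = 1000, random = Math.random } = {}
) {
  const exponential = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.round(exponential + random() * jitterMs);
}

function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Run `operation` up to `retries + 1` times, sleeping between failed attempts.
// Rethrows the last error once the retry budget is exhausted.
async function retryWithBackoff(
  operation,
  { retries = 3, sleep = defaultSleep, ...delayOptions } = {}
) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt += 1) {
    try {
      return await operation(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < retries) {
        await sleep(backoffDelayMs(attempt, delayOptions));
      }
    }
  }
  throw lastError;
}
```

With jitter draws of 0, 0.5 and 0.1, the first three delays come out as 1000, 2500 and 4100 ms. Retries only make sense for idempotent operations and for errors that are plausibly transient; retrying a 4xx validation failure just delays the inevitable.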

| Core Concept | The Real-World Problem | Production Solution (Node/AWS) | Failure Mode / Trade-off |
|---|---|---|---|
| Authentication (JWT) | Storing JWTs in localStorage leads to XSS theft. | Send short-lived JWTs in httpOnly cookies; store Refresh Tokens in DB. | Implementing secure token rotation and revocation is complex. |
| Authorization (ABAC) | User A modifies projectId in API payload to delete User B's project (IDOR). | Validate ownership against DB or embed resource IDs in the JWT payload. | Heavy DB lookups on every single protected API route. |
| API Gateway & Edge Security | Botnets brute-force the Express login endpoint. | AWS WAF blocks malicious IPs at the edge before hitting Docker. | Legitimate users getting blocked by overly aggressive WAF rules. |
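
The IDOR scenario in the authorization row comes down to one rule: the server decides ownership from its own data, never from IDs the client supplies. A minimal sketch, assuming a hypothetical in-memory `projects` map in place of the database lookup and a `userId` that was already extracted from a verified token:

```javascript
// Hypothetical in-memory project table standing in for a database query.
const projects = new Map([
  ['p1', { id: 'p1', ownerId: 'userA' }],
  ['p2', { id: 'p2', ownerId: 'userB' }],
]);

// Decide the outcome of DELETE /projects/:projectId for an authenticated user.
// `userId` must come from the verified token, never from the request body.
function authorizeProjectDelete(
  userId,
  projectId,
  findProject = (id) => projects.get(id)
) {
  const project = findProject(projectId);
  if (!project) return { status: 404, allowed: false };
  if (project.ownerId !== userId) return { status: 403, allowed: false };
  return { status: 204, allowed: true };
}
```

Some APIs return 404 instead of 403 for resources the caller does not own, so that the response does not reveal which IDs exist. The lookup itself is the cost named in the table: one extra read per protected request unless ownership is cached or folded into the main query.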

| Core Concept | The Real-World Problem | Production Solution (Node/AWS) | Failure Mode / Trade-off |
|---|---|---|---|
| Replication (Read/Write Split) | Single DB cannot handle the volume of SELECT queries. | Route INSERT/UPDATE to Primary DB, route SELECT to Read Replicas via Prisma. | Replication lag; users refresh page and see old data. |
| Database Indexing | Query scans 5 million rows ($O(N)$), taking 8 seconds. | Create B-Tree indexes on heavily queried columns. | Indexes consume storage and memory, and each one adds work to every write. |
| Sharding | Data size exceeds the physical limits of a single AWS RDS instance. | Horizontally partition data across multiple DBs using a Shard Key (e.g., tenantId). | Cross-shard joins become practically impossible. |
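
Routing by shard key, as in the sharding row, can be sketched as a hash of the key taken modulo the number of shards. This is one simple scheme among several; the connection strings and function names below are hypothetical.

```javascript
const { createHash } = require('node:crypto');

// Map a shard key (for example a tenantId) to a shard index in [0, shardCount).
function shardIndexFor(shardKey, shardCount) {
  if (!Number.isInteger(shardCount) || shardCount < 1) {
    throw new RangeError('shardCount must be a positive integer');
  }
  const digest = createHash('sha256').update(String(shardKey)).digest();
  // The first 4 bytes, read as an unsigned 32-bit integer, select the shard.
  return digest.readUInt32BE(0) % shardCount;
}

// Hypothetical connection strings; a real setup would hold one DB client per shard.
const shardUrls = ['postgres://shard-0', 'postgres://shard-1', 'postgres://shard-2'];

// Every query for a given tenant goes to the same database.
function shardUrlFor(tenantId) {
  return shardUrls[shardIndexFor(tenantId, shardUrls.length)];
}
```

Plain modulo hashing has a known cost: changing the number of shards remaps most keys, so growing the cluster means moving data. Consistent hashing or an explicit tenant-to-shard lookup table are common ways to soften that.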

| Core Concept | The Real-World Problem | Production Solution (Node/AWS) | Failure Mode / Trade-off |
|---|---|---|---|
| Rolling Deployment | Restarting all containers at once causes downtime. | Orchestrator replaces old containers with new ones gradually. | Mixed versions: v1 and v2 run simultaneously, which can break API contracts. |
| Blue-Green Deployment | Need instant zero-downtime rollbacks if a release fails. | Deploy to isolated "Green" environment; flip ALB traffic 100% instantly. | DB migrations running on Green can crash the live Blue environment. |
| Canary Release | Pushing a hidden bug to 100% of users. | Route 5% of traffic to the new version; monitor errors; ramp up to 100%. | Requires sticky sessions to prevent users bouncing between versions. |
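
The stickiness requirement in the canary row can be met without session state by assigning each user to a stable bucket derived from a hash of their ID. The sketch below is one way to do that in application code; `bucketFor` and `isCanaryUser` are illustrative names, and in practice the split is often done at the load balancer with weighted target groups instead.

```javascript
const crypto = require('node:crypto');

// Place a user in a stable bucket from 0 to 99 based on a hash of their ID.
function bucketFor(userId) {
  const digest = crypto.createHash('sha256').update(String(userId)).digest();
  return digest.readUInt32BE(0) % 100;
}

// A user is served by the canary when their bucket falls below the rollout percentage.
// The same user always lands in the same bucket, so they do not bounce between versions.
function isCanaryUser(userId, canaryPercent) {
  return bucketFor(userId) < canaryPercent;
}
```

Raising `canaryPercent` from 5 to 25 to 100 only adds users to the canary group; nobody who already saw the new version is moved back to the old one during the ramp-up.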