https://arpitbhayani.me/system-design-for-beginners

Notes

https://drive.google.com/file/d/1Dc8XnR1d64bRaqBLEkAWs03k_3BsI1FE/view?usp=sharing

System Design — Database Architecture

Topic	Core Concepts & Mental Model	Tools & Libraries	Key Techniques	Tradeoffs & Failure Modes	Resources
Relational DB (SQL)	Tables with fixed schema, relationships via foreign keys, ACID transactions. Default choice for most apps — structured, consistent, queryable. Two workloads: OLTP (many small transactions, row-oriented) vs OLAP (few large analytical queries, column-oriented)	PostgreSQL (recommended), MySQL, SQLite (dev/test)	CRUD, JOINs, normalization (1NF/2NF/3NF), constraints (FK, UNIQUE, CHECK), EXPLAIN ANALYZE for query plans, schema migrations with zero downtime	Schema migrations on live traffic are risky — use expand-contract pattern. N+1 queries are the #1 performance killer. Missing indexes = full table scans at scale. OLTP and OLAP on same DB = analytics queries kill transaction performance	PostgreSQL docs
Isolation Levels	Controls what a transaction can see from concurrent transactions. Weakest to strongest: Read Uncommitted → Read Committed → Repeatable Read → Serializable. Stronger = safer + more locking + lower throughput	PostgreSQL, MySQL InnoDB	Read Uncommitted: dirty reads (reads uncommitted data). Read Committed: no dirty reads (PostgreSQL default). Repeatable Read: no phantom reads (MySQL default). Serializable: full isolation — transactions appear sequential	Wrong isolation level = subtle data bugs only visible under concurrent load. Serializable is safe but kills throughput under contention. Read Committed is correct default for most web apps	PostgreSQL Isolation
Indexing	Without indexes every query = full table scan. B-tree: equality + range (default). Hash: equality only, faster. GIN: arrays, JSONB, full-text. Composite: column order matters — most selective first. Covering index: query satisfied by index alone — no heap access needed	PostgreSQL EXPLAIN ANALYZE, CREATE INDEX CONCURRENTLY, pg_stat_user_indexes	Run EXPLAIN ANALYZE before and after. CREATE INDEX CONCURRENTLY in production (non-blocking). Partial indexes (WHERE deleted_at IS NULL). Find unused indexes via pg_stat_user_indexes and drop them	Too many indexes = slow writes (every write updates all indexes). Composite (a, b) helps WHERE a=? but NOT WHERE b=? alone. Low-cardinality column index (boolean) is wasteful. Wrong column order in composite = index unused	Use the Index, Luke
Replication	Copy data across multiple nodes for availability and read scaling. Primary-Replica: all writes go to primary, reads from any replica. Synchronous: primary waits for replica ACK — no data loss, higher write latency. Asynchronous: primary confirms without waiting — lower latency, small data loss window on crash	PostgreSQL streaming replication, MySQL replication, MongoDB replica sets, AWS RDS Multi-AZ	Read replicas for read-heavy workloads. Monitor replica lag via pg_stat_replication. Auto failover on primary crash (Patroni, AWS RDS). Promote replica to primary on failure	Replica lag: write to primary then immediately read replica = stale data. Always read from primary for critical paths (payments, auth checks). Async replication has a data loss window — size depends on lag	PostgreSQL HA
Connection Pooling	Each DB connection costs ~1–10MB RAM + TCP overhead. Without pooling: 100 concurrent requests = 100 DB connections → PostgreSQL collapses (default max ~100). Pool reuses connections — requests share a fixed pool	PgBouncer (PostgreSQL pooler), Prisma (connection_limit), pg Pool ({ max: 10 }), AWS RDS Proxy	Pool size formula: (num_cores × 2) + effective_spindle_count. Session vs transaction vs statement pooling modes. Monitor pool wait queue — growing queue = pool too small	Pool too small = requests wait for connection (latency spike). Pool too large = DB overwhelmed. PgBouncer transaction mode incompatible with prepared statements — use session mode or disable prepared statements	PgBouncer docs
Sharding	Split a large table across multiple DB instances — each shard holds a subset of rows. Hash sharding: shard = hash(userId) % N — even distribution. Range sharding: shard by value range — supports range queries but creates hot spots. Directory sharding: lookup table maps key → shard	MongoDB (native), Vitess (MySQL), PostgreSQL + Citus, DynamoDB (partition key)	Shard key is the most critical design decision. Embed shard key in every query to avoid scatter-gather. Avoid cross-shard JOINs — use denormalization or application-level joins. Plan resharding before you need it	Hot shard: sharding by createdAt + high insert rate = newest shard is always hot. Cross-shard queries require querying all shards in parallel then merging (expensive). Resharding live data requires double-writing + migration window	Designing Data-Intensive Applications Ch. 6
Vertical Partitioning	Split a wide table by columns. Move rarely-accessed or large columns to a separate table. Common in microservices — each service owns only the columns it needs	Standard SQL schema design	users(id, email, name) + user_profiles(id, bio, avatar, preferences). Fetch heavy columns only when needed. Each microservice owns its own tables — never share tables across services	Risk of joins becoming necessary to reconstruct original row. Over-partitioning adds query complexity. Wrong split = frequent JOINs that cost more than the savings	—
Document DB (MongoDB)	JSON-like documents, flexible schema, no fixed structure. Embed related data in one document (vs SQL joins). Best for: variable-structure data, rapid iteration, hierarchical data	MongoDB, CouchDB, MongoDB Atlas	Schema design: embed when data is "owned" (post contains comments array). Reference (foreign key) when data is shared (user referenced by many posts). Aggregation pipeline for complex queries. Change streams for real-time	No JOINs — use $lookup (slow) or denormalize. Flexible schema becomes inconsistent schema without discipline. Transactions require replica set. "Flexible schema" is a trap without Mongoose/Zod validation	MongoDB Data Modeling
Key-Value Store (Redis)	In-memory hash map. Sub-millisecond reads/writes. Not just a cache — also: session store, rate limiter, pub/sub broker, distributed lock, leaderboard, job queue	Redis, DynamoDB (for persistent KV at scale), Upstash (serverless Redis)	Data structures: String, Hash, List, Set, Sorted Set. TTL/EXPIRE for auto-expiry. SETNX for distributed locks. Sorted Sets for leaderboards. Redis Streams for event log	In-memory = data lost on restart without persistence (RDB snapshots or AOF log). Redis is single-threaded — avoid large blocking commands (KEYS *, LRANGE on huge lists). Memory is expensive — don't cache everything	Redis docs
Wide-Column Store	Rows identified by a partition key + optional clustering columns. Optimized for high write throughput and time-series data. Schema is query-driven — design tables around access patterns, not normalized relationships	Cassandra, HBase, AWS DynamoDB, Bigtable	Design tables per query. Partition key determines which node stores the data. Clustering columns control sort order within a partition. Wide rows for time-series (one row per device, columns are timestamps)	No JOINs, no ad-hoc queries. Cross-partition queries are expensive — scatter-gather across all nodes. Cassandra is AP — eventual consistency by default. Cannot change partition key after creation	Cassandra Data Modeling
Graph DB	Data modeled as nodes (entities) and edges (relationships). Native graph traversal is dramatically faster than SQL recursive JOINs for relationship-heavy queries	Neo4j, Amazon Neptune, ArangoDB	Cypher query language (Neo4j). Use for: social graph (friends of friends), recommendation engine, fraud detection, knowledge graphs, dependency resolution	Graph DBs are specialized — don't replace relational DB, augment it. Slow for non-graph queries. Hard to shard horizontally. Most teams don't need a graph DB — recursive CTEs in PostgreSQL handle most graph queries	Neo4j GraphAcademy
Choosing the Right DB	Match DB to workload — not to hype or familiarity. Wrong DB = fundamental re-architecture later	—

System Design — Caching

Mental Model: A cache trades consistency for speed. Every caching decision answers three questions: (1) Where does the cache live? (2) How does data get in? (3) How does stale data get out? Get any one wrong and you either have slow performance or wrong data shown to users.

Topic	Core Concepts & Mental Model	Tools & Libraries	Key Techniques	Tradeoffs & Failure Modes	Resources
Cache-Aside (Lazy Loading)	App manages the cache manually. Read: check cache → miss → DB → write cache → return. Write: update DB → invalidate key (don’t update cache). Populates lazily on first miss.	Redis + any DB	Key design (users:${id}, posts:${userId}:page:${n}). TTL per type. Warm hot keys. On write: `del(key)` not `set(key, newValue)` (avoids races).	First-request miss penalty. Thundering herd on expiry. Mitigate with TTL jitter (±10%) + mutex/lock on miss.	Redis patterns
Write-Through	Write to cache and DB on every write. Cache stays in sync; reads hit cache after first write.	Redis + ORM wrapper	Write both atomically. Good for read-heavy + write-heavy where consistency matters. Combine with TTL.	Slower writes (2 writes). Cache may fill with never-read data. If cache write succeeds but DB fails, inconsistency window (use transactions where possible).	—
Write-Back (Write-Behind)	Write to cache only; persist to DB async. Fastest write latency.	Redis + BullMQ background worker	Batch DB writes. Use for counters/analytics/metrics where some loss is acceptable.	Data loss if cache crashes before flush. Complex retries + DLQ needed. Never for financial/critical data.	—
Refresh-Ahead	Refresh hot keys before TTL expiry to avoid misses.	Background job, Redis keyspace notifications	Detect hot keys; refresh in background. Useful for homepage/trending/config.	Hard to predict hot keys. Added background load + complexity. Over-refresh wastes DB reads.	—
Cache Eviction Policies	When memory is full: LRU, LFU, FIFO, TTL-based eviction.	Redis maxmemory-policy: noeviction, allkeys-lru, volatile-lru, allkeys-lfu, allkeys-random	Set maxmemory; pick policy per access pattern. volatile-lru only evicts TTL keys; allkeys-lru evicts any key.	LRU can thrash under bursts. LFU resists thrash but adapts slower to changing hot keys.	Redis eviction
Cache Warming	Pre-populate cache before traffic to avoid cold-start stampede.	Redis pipeline, batch loader scripts	Warm top-N keys at startup or before traffic shift (blue/green). Schedule refresh for expiring hot keys.	Warm-up takes time. Warming too much wastes memory; warming wrong keys misses real traffic.	—
Cache Invalidation	Keep cache from serving stale DB data: TTL, event-based deletes, versioned keys.	redis.del(key), Next.js revalidateTag(), Kafka-driven invalidation	Tag-based invalidation; distributed lock (SETNX) to prevent stampede on miss.	Miss one invalidation → stale until TTL. Bulk invalidation can cause herd to DB. Event-driven is most reliable at scale.	—
Distributed Cache	Scale beyond a single Redis instance using consistent hashing across nodes.	Redis Cluster, Memcached cluster	Consistent hashing reduces remap on node changes; virtual nodes improve balance. Redis Cluster: 16384 hash slots; usually 3 primaries + replicas.	Network RTT per cache call. Node failures make some keys unavailable until failover. Can still get hot nodes without enough virtual nodes.	Redis Cluster
Cache Hierarchy (Layers)	L1 (process) → L2 (Redis) → L3 (CDN) → DB. Hit higher layers first.	L1: lru-cache/node-cache; L2: Redis; L3: Cloudflare/CloudFront; L4: DB buffer pool	L1 for very hot small data; L2 for shared sessions/results; L3 for static/public; DB buffer pool is automatic.	L1 is per-process (stale across instances). Each extra layer adds invalidation/debug complexity.	—
Browser Caching	Client caches via HTTP headers (disk/memory) for repeat requests.	Cache-Control, ETag, Last-Modified, Expires	Use immutable + hashed filenames for static assets. no-cache means revalidate, not “don’t store”.	Over-aggressive cache → stale JS/CSS after deploy. Fix with content-hash filenames.	MDN Cache-Control
CDN Caching	Edge caching close to users for static/semi-static content. Pull vs push.	Cloudflare, AWS CloudFront, Fastly	Use s-maxage for CDN TTL, versioned filenames, purge API, Cache-Control public/private, and Vary headers.	Missing Vary can leak user-specific responses. Stale CDN after deploy is common; use versioned assets + purge.	Cloudflare Cache
DB Buffer Pool	DB-level caching: hot pages stay in memory (no app changes).	Postgres shared_buffers; MySQL InnoDB buffer pool	Monitor hit ratio; size buffers appropriately. Use materialized views for heavy reads.	Big analytical queries can evict OLTP hot pages. Materialized views need refresh strategy (CONCURRENTLY).	PostgreSQL Buffer Pool

System Design — Async Processing & Messaging

Mental Model: Async processing decouples when work is requested from when it is done. The core principle: never do slow, expensive, or failure-prone work inside an HTTP request handler. Return fast, process async. This protects user experience and prevents cascading failures.

Topic	Core Concepts & Mental Model	Tools & Libraries	Key Techniques	Tradeoffs & Failure Modes	Resources
Message Queues	Producer writes a job/message; consumer processes independently. Queue buffers bursts. Point-to-point (1 message → 1 consumer) vs pub/sub (1 message → many consumers).	BullMQ + Redis (Node.js), RabbitMQ, AWS SQS, Google Cloud Tasks	FIFO vs priority queues; DLQ; ACK after success; delayed jobs; retries with exponential backoff.	No DLQ → silent loss. ACK too early → message lost on crash. Unbounded queue → resource exhaustion (set limits). BullMQ depends on Redis HA.	BullMQ docs
Event Streaming (Kafka)	Persistent, ordered, replayable event log. Kafka retains events for retention; consumers track offsets and can replay. Multiple consumer groups each get all events.	Apache Kafka, AWS MSK, Confluent Cloud, Redpanda	Topics, partitions (parallelism), consumer groups, offset mgmt, schema registry (Avro/Protobuf) for evolution.	Ordering only within a partition. Wrong partition key → single hot partition (no parallelism). Partition count changes often require migration. Ops complexity (prefer managed).	Kafka docs
Event-Driven Architecture	Publish events (“something happened”) instead of commands (“do this”). Enables loose coupling, audit logs, replay, and multiple independent consumers.	Kafka, AWS EventBridge, RabbitMQ topic exchange, Redis Pub/Sub (ephemeral)	Event schema (eventType, version, timestamp, aggregateId, payload). Choreography vs orchestration. Version events for backward compat.	Eventual consistency. Schema changes can break consumers. Choreography is hard to trace; orchestration introduces coordinator coupling/SPOF.	Martin Fowler — EDA
Saga Pattern	Distributed transaction via sequence of local transactions + compensating actions (no global ACID across services).	Temporal, AWS Step Functions (orchestration), Kafka events (choreography)	Design idempotent steps. Example: Reserve Inventory → Charge Payment → Notify Shipping; on failure run compensations (Release Inventory).	Intermediate “pending” states visible. Some actions aren’t reversible (e.g., emails). Choreography sagas hard to debug; use tracing.	Saga Pattern
Real-time Pub/Sub	Broadcast to all subscribers in real time. Redis Pub/Sub is ephemeral (no persistence); Kafka is durable/replayable.	Redis Pub/Sub, Kafka, Google Cloud Pub/Sub, AWS SNS+SQS	Redis for live notifications/presence/cache invalidation. Kafka for durable fan-out/audit/event sourcing. Fan-out: SNS topic → multiple SQS queues.	Redis Pub/Sub: offline subscribers miss messages; no consumer groups. Use Kafka/Redis Streams if durability is needed.	Redis Pub/Sub
WebSocket (Real-time Bidirectional)	Persistent connection; both client and server can push anytime. For chat/collab/live dashboards/multiplayer.	Socket.io, native WebSocket, Pusher/Ably/Liveblocks	Auth via handshake. Rooms/namespaces. Heartbeats. Reconnect with backoff. Scale across instances with Redis adapter (Socket.io).	Serverless often can’t hold long-lived connections (use hosted or dedicated servers). Each connection costs memory; reconnection storms after restarts.	Socket.io docs
SSE — Server-Sent Events	One-way server → client streaming over HTTP. Auto-reconnect with Last-Event-ID. Simpler than WebSocket for push-only.	EventSource API, Next.js streaming route handlers, @microsoft/fetch-event-source	Use event/data/id/retry fields. Set Content-Type: text/event-stream. Stream updates incrementally.	No client → server messages on same connection. HTTP/1.1 has low per-domain connection limits (HTTP/2 helps). Serverless timeouts can break long-lived SSE.	MDN SSE
Background Jobs & Scheduling	Run slow/failure-prone work outside HTTP. Use queues for triggered work, cron for recurring. Don’t block handlers (email, PDFs, image resize, slow APIs).	BullMQ, node-cron, Inngest, Temporal, Vercel Cron	Queue: attempts + backoff. Cron: schedule. Ensure single-run in distributed setups with locks or dedicated scheduler. Add DLQ + alerting.	Cron on multiple instances → duplicate runs (use locks). Silent failures without DLQ/alerts. Long jobs need progress visibility.	BullMQ docs

System Design — Resiliency & Fault Tolerance

Mental Model: Failures are not exceptional — they are normal. Design assuming everything will fail: servers crash, networks partition, DBs go slow, third-party APIs timeout. The goal is graceful degradation — the system degrades predictably rather than collapsing entirely.

Topic	Core Concepts & Mental Model	Tools & Libraries	Key Techniques	Tradeoffs & Failure Modes	Resources
Circuit Breaker	Prevents cascading failures by failing fast when a dependency is unhealthy. States: Closed → Open → Half-Open (probe).	opossum (Node.js), Resilience4j (Java), AWS App Mesh, Istio	Set error/timeout thresholds, open duration, half-open probes; use fallbacks (cache/default/degraded response).	Too sensitive opens on spikes; too lax allows cascades. Half-open probe failures keep it open. Use per-dependency breakers.	opossum Node.js
Retry with Exponential Backoff + Jitter	Retries can recover transient failures; jitter avoids synchronized retry storms.	axios-retry, p-retry, custom implementation	Retry only idempotent ops; respect Retry-After; cap retries (3–5); use backoff + jitter; retry 5xx/timeouts not 4xx.	Retrying non-idempotent POST can duplicate side effects. No jitter causes thundering herd. Infinite retries exhaust resources; use retry budgets.	—
Bulkhead Pattern	Isolates failures so one subsystem can’t starve others (separate “compartments”).	Separate HTTP clients/pools, separate DB pools, Kubernetes resource limits	Independent pools per dependency (auth/payment/email). Use K8s requests/limits to cap blast radius.	Too many tiny pools waste capacity; sizing is hard; extra config overhead—apply at critical boundaries.	—
Rate Limiting Algorithms	Protect systems from abuse/overload. Common algorithms: fixed window, sliding window (log/counter), token bucket.	Upstash Rate Limit, express-rate-limit, rate-limit-redis	Use distributed limiter (Redis) for multi-instance. Limit per IP/user/API key. Return 429 + Retry-After and rate headers.	Fixed window can be bursty at boundaries. In-memory limiters don’t work across instances. Per-IP can be bypassed (e.g., IPv6 rotation).	Upstash Rate Limit
Health Checks	Liveness: process alive. Readiness: safe to receive traffic. Keep them separate.	Kubernetes probes, AWS ALB health checks, custom endpoints	Simple liveness endpoint; readiness checks dependencies (DB/Redis). Use startup probes for slow boot. Cache readiness result briefly.	DB in liveness can restart all pods on DB blip. Over-frequent readiness checks can add DB load.	—
Data Redundancy & Backup	Plan around RTO (time to recover) and RPO (acceptable data loss).	AWS Backup, RDS backups, PITR, pg_dump, cross-region replication	Automated snapshots, PITR, cross-region DR copies, and regular restore tests.	Untested backups are worthless. Cross-region adds cost/latency. Define RTO/RPO first.	AWS Backup docs
Leader Election	Ensures only one instance performs singleton work (cron, coordinator, partition leader).	Zookeeper, etcd, Redis SETNX, Raft-based systems	Use leases/TTL; quorum-based election; ephemeral nodes for auto re-election.	Split-brain risk under partitions; TTL too short thrashes, too long blocks after crashes; quorum mitigates.	etcd docs
Graceful Degradation	Serve a useful degraded response when dependencies fail (live → stale cache → default → error).	Circuit breaker fallbacks, stale caches, feature flags	Fallback content, queue-and-process-later for slow deps, precomputed popular results.	Must communicate degradation; silently serving stale data can mislead. Test degraded paths intentionally.	—
Timeouts	Every outbound call needs a timeout to avoid hanging request threads.	AbortController, axios timeout, pg statement/query timeout	Set sensible timeouts per dependency; on timeout return fallback, log, and emit metrics.	No timeout causes pile-ups. Too short creates false timeouts; too long delays failure detection.	—