Mental Model: As a developer you don't need to be a Linux sysadmin. You need to be comfortable enough to: SSH into a server, read logs, debug a running process, manage files, and not accidentally break things. Nearly every cloud server, Docker container, and CI runner runs Linux.
| Skill | Core Concepts & Mental Model | Key Commands & Techniques | Tradeoffs & Failure Modes | Resources |
|---|---|---|---|---|
| File System Navigation | Linux file tree: everything is a file. / is root. Key dirs: /etc (system configs), /var/log (logs), /home (user dirs), /tmp (cleared on reboot), /usr/bin (installed binaries), /proc (live kernel/process info) | ls -la, cd, pwd, find / -name "file" -type f, tree, du -sh * (disk usage per dir), df -h (disk free), stat file (metadata) | rm -rf with wrong path = unrecoverable data loss. Always double-check path before destructive commands. /proc/meminfo and /proc/cpuinfo are useful for inspecting running system without installing tools | Linux Command Line (free book) |
| File Operations & Permissions | Linux permissions: owner / group / others × read(4) / write(2) / execute(1). Octal notation: 755 = rwxr-xr-x. Symbolic: chmod +x file | cp, mv, rm -rf, mkdir -p (create nested dirs), touch, cat, less, head -n 20, tail -f (live log follow), grep -r "text" ., chmod 755 file, chown user:group file, ln -s target link (symlink) | Never run chmod 777 — world-writable files are a security risk. chown -R as root on wrong directory can break system binaries. tail -f is the most-used command for watching live logs in production | chmod Calculator |
| Process Management | Every running program is a process with a PID. Parent-child relationships. Signals: SIGTERM (15) = graceful shutdown request, SIGKILL (9) = force kill (no cleanup), SIGHUP (1) = reload config | ps aux (all processes), top / htop (live — htop is better), kill -15 PID (graceful), kill -9 PID (force), pkill node (by name), lsof -i :3000 (what's using port 3000), nohup command & (survive logout), jobs / fg / bg | kill -9 skips cleanup handlers — use kill -15 first, wait 5s, then kill -9 if needed. Zombie processes = child exited but parent never collected exit status (ps shows Z state). lsof -i :3000 is the most useful command when "port already in use" | DigitalOcean Process Guide |
| Text Processing | Parse logs, extract data, transform output: the most frequent real-world Linux task. Pipe chains process data as streams, so huge files are not loaded into memory at once | grep -E "ERROR\|WARN" app.log (regex), grep -v "health" (exclude), awk '{print $1}' (print column), sed 's/old/new/g' (replace), sort \| uniq -c | | |
| Environment & Shell | Shell (bash/zsh) interprets commands. Environment variables are key=value pairs inherited by child processes. PATH controls where shell searches for binaries | export VAR=value (current session), echo $VAR, env (all vars), printenv VAR, which node (binary location), type command (alias vs binary), source ~/.zshrc (reload config). Add to ~/.zshrc or ~/.bashrc for persistence | export in terminal = session only. Secrets in .bashrc/.zshrc = visible in plain text and potentially git-tracked. Use direnv (per-directory .envrc files) for project-specific env vars without polluting global shell | direnv |
| SSH | Encrypted remote terminal access. Key-based auth: your private key + server's authorized_keys. More secure and more convenient than passwords. Never enable password auth on production servers | ssh-keygen -t ed25519 -C "email" (generate keypair), ssh-copy-id user@ip (add pubkey to server), ssh user@ip, ssh -i key.pem user@ip (AWS style), chmod 400 key.pem (required by SSH), ~/.ssh/config for saved host aliases, ssh -L 5432:localhost:5432 user@server (port forwarding — access remote DB locally) | Private key must never leave your machine. PEM files require chmod 400 — SSH refuses overly-permissive keys. ed25519 is preferred over RSA (shorter key, same security). SSH port forwarding is the safest way to access remote DBs locally without exposing DB port publicly | SSH Essentials |
| Cron Jobs | Schedule recurring tasks. Runs as the crontab owner's user. Format: minute hour day month weekday command (5 fields) | crontab -e (edit), crontab -l (list), 0 2 * * * /scripts/backup.sh (daily at 2am), */5 * * * * (every 5 min). Always redirect output: >> /var/log/cron.log 2>&1. Use absolute paths for binaries: /usr/bin/node not node | Cron uses minimal PATH — relative binary names silently fail. Silent failures are the #1 cron bug — always log output. Test scripts manually before adding to crontab. For distributed systems use BullMQ repeatable jobs or Inngest instead of cron — avoids multi-instance double-execution | Crontab Guru |
| Networking Commands | Debug connectivity issues, inspect open ports, trace network paths | curl -I https://api.example.com (HTTP headers), curl -v (verbose — full request/response), wget, ping, traceroute / tracepath, netstat -tulpn (open ports + listening processes), ss -tulpn (modern replacement for netstat), nc -zv host port (test port connectivity), dig domain (DNS lookup), nslookup domain | curl -v is the most useful tool for debugging API calls from a server. ss -tulpn shows exactly what's listening on which port. If curl works but browser doesn't = DNS or CORS issue, not a server issue | — |
| System Monitoring | Understand resource usage before scaling or debugging performance issues | htop (interactive process + CPU/memory), free -h (memory), df -h (disk), iostat (disk I/O), vmstat (CPU/memory/IO overview), uptime (load averages), dmesg (kernel messages: hardware/boot issues), journalctl -u myservice -f (systemd service logs, live) | Sustained load average > number of CPU cores = work is queuing (CPU saturation or, on Linux, tasks blocked on disk I/O). Memory usage near 100% + swap increasing = memory pressure, often a leak; judge by the available column in free -h, since page cache is reclaimable. Disk full = silent failures (logs stop writing, DB crashes). Check df -h before diagnosing mysterious errors | — |
| Package Management | Install, update, remove system packages. apt (Debian/Ubuntu), yum/dnf (RHEL/Amazon Linux) | apt update (refresh package index), apt install -y package, apt remove package, apt list --installed, which binary (confirm install). On Alpine (Docker): apk add --no-cache package | Always apt update before apt install in Docker or CI — stale package index = version not found errors. Use apt-get (non-interactive) in Dockerfiles, not apt (interactive). Pin package versions in Dockerfiles for reproducible builds | — |
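
The Text Processing row above describes chaining grep, awk, sort, and uniq into a streaming pipeline. A minimal sketch of that idea follows. The function name `count_levels`, the sample log lines, and the assumption that the log level is the second whitespace-separated field are all made up for illustration.

```sh
# Count ERROR/WARN lines per level, ignoring health-check noise.
# Reads log lines on stdin. Assumes the level is field 2 (hypothetical
# log format; adjust $2 for your own logs).
count_levels() {
  grep -E 'ERROR|WARN' |        # keep only error and warning lines
    grep -v '/health' |         # drop health-check requests
    awk '{print $2}' |          # extract the level column
    sort | uniq -c |            # count occurrences of each level
    sort -rn |                  # most frequent first
    awk '{print $2 "=" $1}'     # normalize "   2 ERROR" to "ERROR=2"
}

# Invented sample data, fed through the pipeline via a here-document.
count_levels <<'EOF'
2026-01-05T10:00:01Z INFO GET /health 200
2026-01-05T10:00:02Z ERROR POST /orders 500
2026-01-05T10:00:03Z WARN GET /users 429
2026-01-05T10:00:04Z ERROR GET /orders 502
2026-01-05T10:00:05Z ERROR GET /health 503
EOF
```

For the sample above this prints `ERROR=2` and then `WARN=1`. On a real server the same function could be fed with `count_levels < app.log` or `tail -n 10000 app.log | count_levels`.
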
Mental Model: Git tracks snapshots of your project, not diffs. Three areas: Working Directory (files you edit) → Staging Area (what you mark for commit) → Repository (committed history). Commits are immutable content-addressed objects — you never modify history, you add new commits on top.
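
"Content-addressed" can be made concrete without Git itself: a blob's ID is the SHA-1 of a short header plus the file bytes. The sketch below assumes a repository using the default SHA-1 object format and the `sha1sum` tool from GNU coreutils. The function name `git_blob_id` is hypothetical.

```sh
# Compute the object ID Git would assign to a file's content (a "blob").
# Git hashes the header "blob <size>\0" followed by the raw bytes.
# Reads the content on stdin.
git_blob_id() {
  tmp=$(mktemp)
  cat > "$tmp"                                  # buffer stdin so we can measure it
  size=$(wc -c < "$tmp" | tr -d ' ')            # byte count, whitespace stripped
  { printf 'blob %s\0' "$size"; cat "$tmp"; } | sha1sum | cut -d' ' -f1
  rm -f "$tmp"
}

printf 'hello world\n' | git_blob_id
# prints 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
```

Inside any repository, `printf 'hello world\n' | git hash-object --stdin` should report the same ID. Identical content always maps to the same object, which is why changing a committed file produces a new object rather than altering the old one.
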
| Skill | Core Concepts & Mental Model | Key Commands & Techniques | Tradeoffs & Failure Modes | Resources |
|---|---|---|---|---|
| Core Concepts | Four Git objects: Blob (file content), Tree (directory), Commit (snapshot + metadata + parent), Tag (named commit). HEAD = pointer to current commit. Branch = lightweight pointer that moves forward on each commit | git add ., git commit -m "message", git status, git log --oneline --graph --all, git diff (unstaged), git diff --staged (staged), git show <hash> (inspect commit) | Committing node_modules or .env = bloated repo + leaked secrets. Always have .gitignore before first commit. Large binary files in git history are permanent — use Git LFS for assets | Pro Git Book (free) |
| Branching & Merging | Branches are cheap pointers to commits. Three strategies: GitHub Flow (main + feature branches — best for most teams + CI/CD), GitFlow (main/develop/feature/hotfix — complex, slower), Trunk-Based Development (everyone commits to main daily with feature flags — fastest, requires discipline) | git checkout -b feature/name, git merge feature/name (merge commit), git rebase main (linear history), git cherry-pick <hash> (single commit), git branch -d feature/name, git branch -a (all branches including remote) | Long-lived feature branches = merge conflict pain. Rebase rewrites history — never rebase shared/pushed branches. Cherry-pick overused = diverged history. Trunk-based development requires feature flags to hide incomplete work | — |
| Remote Workflows | origin = your primary remote (GitHub/GitLab). upstream = original repo (open source forks). fetch downloads without merging. pull = fetch + merge | git remote -v, git fetch origin (download, no merge), git pull origin main, git push origin feature/name, git push --force-with-lease (safer than -f — fails if remote has new commits), git clone --depth 1 url (shallow clone — faster for CI) | git pull on wrong branch = merges into wrong branch. git push -f on shared branches overwrites teammates' commits. Always use --force-with-lease instead of --force. Shallow clones miss full history — some git operations fail | — |
| Undoing Changes | Git almost never permanently deletes committed data (reflog entries are kept for 90 days by default, 30 days for commits no longer reachable from the branch tip). Strategy depends on whether commits are local or already pushed to a shared branch | git restore file (discard working dir changes), git reset --soft HEAD~1 (undo last commit, keep changes staged), git reset HEAD~1 (undo last commit, keep changes unstaged in the working dir), git reset --hard HEAD~1 (undo + discard: dangerous), git revert <hash> (safe undo for pushed commits: creates a new reverting commit), git stash / git stash pop, git reflog (find lost commits) | git reset --hard = uncommitted changes gone (they were never committed, so reflog cannot bring them back). Check reflog before panicking about lost commits. git revert is the ONLY safe undo for already-pushed commits: never reset shared history. git stash drop = stash gone permanently | — |
| Conventional Commits | Structured commit message format that enables: automated changelogs, semantic versioning, readable history, and CI triggers based on commit type | Types: feat (new feature → minor version bump), fix (bug fix → patch bump), chore (maintenance), refactor, docs, test, ci, perf. Breaking changes are not a type: mark them with ! after the type/scope or with a BREAKING CHANGE: footer (→ major bump). Format: type(scope): description. Example: feat(auth): add OAuth Google login | Inconsistent commit messages = useless git log. WIP commits should be squashed before merge (git rebase -i). Use commitlint + husky to enforce the format locally, and run commitlint in CI as well | Conventional Commits |
| Branch Protection & GitHub Workflows | Pull Requests are the unit of code review and the CI trigger. Branch protection enforces quality gates before merge. CODEOWNERS auto-assigns reviewers by file path | Branch protection rules: require PR review before merge, require CI status checks to pass, disable force push to main, require linear history. CODEOWNERS: src/auth/ @security-team. Merge strategies: Squash merge (clean history), Rebase merge (linear), Merge commit (full history) | Direct push to main = no review, no audit trail. Missing status check requirement = broken CI doesn't block merge. Squash merge loses individual commit history — use for feature branches. Merge commit for releases (preserves full context) | GitHub Branch Protection |
| Secrets & .gitignore | .gitignore prevents untracked files from being added. But once a file is committed, adding it to .gitignore neither stops tracking it nor removes it from history | .gitignore essentials: node_modules/, .env, .env.local, dist/, .DS_Store, *.log, coverage/. Remove accidentally committed file: git rm --cached .env → commit. If secret already pushed: rotate immediately, use git-filter-repo to purge history, assume it's compromised | .gitignore should exist before the first commit: files already tracked are not ignored by adding them later. git rm --cached removes from tracking but keeps the local file. Private repo does NOT mean secrets are safe: treat all committed secrets as compromised | gitignore templates |
| Git Hooks & Automation | Scripts that run automatically at Git lifecycle events. Pre-commit: run linting/formatting before commit. Commit-msg: validate commit message format. Pre-push: run tests before push | husky (manage hooks in package.json), lint-staged (run linters only on staged files — fast). Setup: npx husky init → .husky/pre-commit → npx lint-staged. package.json: "lint-staged": { "*.ts": ["eslint --fix", "prettier --write"] } | Hooks run locally — can be bypassed with git commit --no-verify. Never put critical security checks only in hooks — enforce in CI too. Slow hooks = developers bypass them | husky docs |
| Monorepo Git Patterns | Single repo for multiple apps/packages. Git operations need to be scoped: you don't want to run all tests when only one package changed | Turborepo: turbo run test --filter=@myapp/api (run only that package), turbo run test --filter=[HEAD^1] (run only packages changed since the previous commit). GitHub Actions path filters: on.push.paths: ['apps/api/**'] to trigger only relevant workflows. git diff --name-only HEAD~1 to detect changed packages | Without path-based CI filtering: every commit triggers full test suite for all packages = slow. Turborepo remote cache: reuse unchanged package build outputs across CI runs | Turborepo |
| Git in 2026 — AI-assisted Workflows | Git workflows now integrate with AI tools for: auto-generated commit messages, PR descriptions, code review suggestions, and conflict resolution assistance | GitHub Copilot generates commit messages and PR descriptions. git commit with Copilot CLI. Cursor/Windsurf AI-assisted merge conflict resolution. Conventional commit message generation from staged diff | AI-generated commit messages can be generic — review before accepting. AI PR descriptions miss business context — always add "why" manually. Auto-merge bots (Renovate, Dependabot) handle dependency updates via PRs | GitHub Copilot CLI |
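
The Conventional Commits row maps commit types to version bumps. The sketch below shows one way that mapping could be derived from a commit message. `bump_for` is a hypothetical helper, not part of any release tool, and it covers only the feat/fix/breaking rules listed in the table.

```sh
# Print the semantic-version bump implied by a commit message:
# major, minor, patch, or none.
bump_for() {
  msg=$1
  header=$(printf '%s\n' "$msg" | head -n 1)
  case $header in
    *:*) ;;                          # looks like "type(scope): description"
    *) echo none; return ;;          # not a conventional commit header
  esac
  prefix=${header%%:*}               # text before the first colon, e.g. "feat(auth)!"
  type=${prefix%%\(*}                # drop an optional "(scope)..." suffix
  type=${type%!}                     # drop a trailing "!" if there was no scope
  case $msg in
    *"BREAKING CHANGE:"*) echo major; return ;;   # breaking-change footer
  esac
  case $prefix in
    *'!') echo major; return ;;                   # "type!:" or "type(scope)!:"
  esac
  case $type in
    feat) echo minor ;;
    fix)  echo patch ;;
    *)    echo none ;;
  esac
}

bump_for 'feat(auth): add OAuth Google login'   # prints minor
bump_for 'feat(api)!: drop v1 endpoints'        # prints major
```

Real tools such as semantic-release apply more rules (and configurable ones), so this is only a model of the idea that the commit header alone can drive the version number.
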
Mental Model: CI/CD makes deployment a non-event — small, frequent, automated, and reversible. CI (Continuous Integration) catches bugs early by building and testing every commit. CD (Continuous Delivery/Deployment) removes manual steps from shipping code. The goal: confidence to deploy 10 times a day without fear.
| Skill | Core Concepts & Mental Model | Tools & Techniques | Tradeoffs & Failure Modes | Resources |
|---|---|---|---|---|
| CI/CD Mental Model | CI: every commit triggers automated build + test. CD (Delivery): every merged commit is automatically deployable (may require manual approval). CD (Deployment): every merged commit automatically deploys to production. Pipeline stages run in order: Trigger → Install → Lint → Type Check → Test → Build → Deploy. Fail fast: run cheap checks (lint) before expensive ones (E2E) | GitHub Actions (most common), GitLab CI, CircleCI, Buildkite (self-hosted runners) | Slow CI = developers stop waiting = commits piling up = defeats CI purpose. CI that never fails = tests not catching real bugs. No rollback plan = deployment is gambling. Target: CI feedback in under 5 minutes | GitHub Actions docs |
| GitHub Actions Core Concepts | YAML-based workflow automation. Triggered by events. Jobs run in parallel by default. Steps within a job run sequentially. GitHub-hosted runners are ephemeral VMs that are destroyed after each run | on (triggers: push, pull_request, schedule, workflow_dispatch). jobs (parallel units of work). steps (sequential commands in a job). uses (reuse community actions). env (env vars). secrets (encrypted, injected at runtime). needs (job dependency: run after another job) | Runners are stateless: no file persistence between runs without artifacts or cache. Pin action versions (actions/checkout@v4, not @main; a full commit SHA is stricter because tags can be moved): unpinned actions = supply chain risk. GITHUB_TOKEN has limited permissions by default; configure permissions explicitly | GitHub Actions docs |
| Standard Node.js CI Pipeline | Practical pipeline covering the full quality gate for a Node.js/TypeScript project | name: CI. on: [push, pull_request]. jobs: ci: runs-on: ubuntu-latest. steps: checkout → setup-node (node-version: 20, cache: npm) → npm ci → npm run lint → npx tsc --noEmit → npm test -- --coverage → npm run build | npm ci (not npm install) — uses lockfile exactly, fails if lockfile is out of sync. Missing tsc --noEmit = TypeScript errors only caught at runtime. Missing npm run build in CI = build errors discovered at deploy time, not commit time | — |
| Caching Dependencies | Reinstalling node_modules on every CI run wastes 1–3 minutes. Cache based on lockfile hash — reinstall only when dependencies change | actions/cache with key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}. Or use actions/setup-node with cache: 'npm' (handles caching automatically). Turborepo Remote Cache: cache entire build outputs — unchanged packages don't rebuild | Wrong cache key = stale node_modules served from cache. Cache key must include OS + lockfile hash. Turborepo remote cache requires TURBO_TOKEN secret in CI | actions/cache |
| Parallel Jobs | Independent jobs (lint, test, build) can run simultaneously, which reduces total pipeline time | jobs: lint: ..., typecheck: ..., test: ... (all run in parallel). deploy: needs: [lint, typecheck, test] (runs only after all pass). Matrix builds: strategy.matrix.node-version: [18, 20, 22] to test across multiple Node versions | Too many parallel jobs = exceed free tier minute limits. Matrix without filtering = job count multiplies with every added dimension. Use a concurrency group to cancel outdated runs on new push: concurrency.group: ${{ github.workflow }}-${{ github.ref }} together with concurrency.cancel-in-progress: true (without it, newer runs only queue behind older ones) | — |
| Deployment Strategies | How you ship code determines your risk, downtime, and rollback capability | Blue-Green: two identical envs (blue=live, green=new), switch traffic instantly → zero downtime + instant rollback, double cost. Rolling: replace instances one at a time → zero downtime, slower rollback. Canary: route 5% of traffic to new version → validate before full rollout. Feature Flags: deploy code dark, toggle via config → decouple deploy from release entirely | Blue-green: DB schema migrations are tricky — both versions run simultaneously, so old version must handle new schema. Canary: user hits old + new version in same session = state inconsistency risk. Feature flags: flag debt accumulates — schedule cleanup of old flags | LaunchDarkly |
| Environment Strategy | Separate environments for each stage of confidence: dev → staging → production. Each has its own config, DB, and secrets | dev: local machine or ephemeral preview deploy (Vercel preview, Railway PR deploys). staging: mirrors production, used for QA and smoke tests. production: real users. GitHub Environments: configure protection rules (required reviewers, wait timer) per environment. environment: production in job config triggers approval gate | Staging data diverging from production = staging tests don't reflect prod behavior. No staging env = testing in production. Auto-deploy to production without approval = one bad merge ships immediately | GitHub Environments |
| Secrets Management | Never store secrets in YAML files or source code. Inject at runtime from a secure store | GitHub Secrets: Settings → Secrets and Variables → accessible as ${{ secrets.DATABASE_URL }} — masked in logs. AWS Secrets Manager: rotation, audit log, cross-service access. HashiCorp Vault: self-hosted, fine-grained. Doppler: developer-friendly sync to GitHub/Vercel/AWS. In Actions: env: DATABASE_URL: ${{ secrets.DATABASE_URL }} | Secret committed to git = rotate immediately even if repo is private. NEXT_PUBLIC_ prefix = secret visible in browser bundle. Secrets should support rotation — design app to handle new credentials without restart | GitHub Encrypted Secrets |
| Docker in CI | Build and push Docker images as part of CI, then deploy the image — not raw source code | actions/checkout → docker/setup-buildx-action → docker/login-action (GHCR or DockerHub) → docker/build-push-action with cache-from: type=gha and cache-to: type=gha,mode=max → tag with git SHA (not latest). Pull exact SHA tag in deploy step | Using latest tag in CD = non-deterministic deploys. GitHub Actions cache for Docker layers dramatically speeds up image builds. Multi-platform builds (linux/amd64,linux/arm64) needed if deploying to ARM (AWS Graviton) | docker/build-push-action |
| Release & Versioning | Automate version bumps and changelogs from conventional commits | semantic-release: reads conventional commits → bumps package.json version → generates CHANGELOG.md → creates GitHub Release → publishes to npm. release-please (Google): PR-based approach — creates a release PR that batches version bump + changelog. Manual: npm version patch/minor/major + git tag | Without automated versioning: manual bumps get forgotten or inconsistent. semantic-release requires strict conventional commit discipline. release-please is safer for teams — change is a PR, not an automatic push | semantic-release |
| 2026 — AI in CI/CD | AI is now integrated into the CI/CD feedback loop — speeding up review, detection, and fix cycles | GitHub Copilot code review in PRs: suggests fixes inline. Automated dependency updates: Renovate Bot + Dependabot auto-creates PRs for dep updates with changelogs. AI-generated PR descriptions from diff. Flaky test detection: GitHub Actions now surfaces flaky tests automatically. OpenTelemetry traces from CI builds (build observability) | AI review suggestions can miss business context. Auto-merged Renovate PRs without test coverage = shipping broken updates. Treat Dependabot PRs like any other code — CI must pass before merge | Renovate Bot |
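
The Caching Dependencies row keys the cache on the OS plus a hash of the lockfile. The sketch below models why that works: the key changes only when the lockfile bytes change. `cache_key` is a hypothetical helper, and it is not the exact algorithm behind `hashFiles`, only the same idea using `sha256sum`.

```sh
# Build a cache key of the form "<os>-node-<sha256 of lockfile>".
# $1 = OS label, lockfile content on stdin.
cache_key() {
  os=$1
  digest=$(sha256sum | cut -d' ' -f1)    # hash of the lockfile bytes
  printf '%s-node-%s\n' "$os" "$digest"
}

# Same lockfile content gives the same key, so the cache is reused.
printf '{"lockfileVersion":3,"packages":{}}\n' | cache_key Linux
printf '{"lockfileVersion":3,"packages":{}}\n' | cache_key Linux

# Any dependency change alters the lockfile, hence the key: cache miss, fresh install.
printf '{"lockfileVersion":3,"packages":{"left-pad":{}}}\n' | cache_key Linux
```

In a checkout the call would look like `cache_key "$(uname -s)" < package-lock.json`. Leaving the OS out of the key would let a cache built on one platform be restored on another, which is the "wrong cache key" failure the table warns about.
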
Mental Model: A container is an isolated process on the host OS kernel — not a VM. It packages the app + its exact runtime dependencies into a portable unit that runs identically everywhere. Image = immutable blueprint. Container = running instance. Docker solved "works on my machine." Containers are now the universal unit of deployment.
| Skill | Core Concepts & Mental Model | Tools & Techniques | Tradeoffs & Failure Modes | Resources |
|---|---|---|---|---|
| Core Concepts | Image: read-only blueprint (like a class). Container: running instance (like an object). Registry: image storage (Docker Hub, GHCR, ECR). Images are built in layers: each Dockerfile instruction adds a layer. Layers are cached and reused across builds | docker build -t myapp:1.0 . (build), docker run -p 3000:3000 myapp:1.0 (run), docker ps (list running), docker images (list local images), docker logs -f container-id (follow logs), docker exec -it container-id sh (enter container shell), docker stop / docker rm | Images are immutable: never modify a running container, rebuild the image. Large images = slow CI push/pull + slow cold starts. Tag images with git SHA, not latest: latest is non-deterministic | Docker docs |
| Dockerfile | Text file with ordered instructions to build an image. Each instruction creates a new layer. Layer order = cache efficiency: put rarely-changing steps first (install deps), frequently-changing last (copy source) | FROM node:20-alpine (alpine = minimal base; the base itself is ~5MB, so the Node image is a fraction of the roughly 1GB Debian-based node:20). WORKDIR /app. COPY package*.json ./ (lockfile first). RUN npm ci. COPY . . (source last). RUN npm run build. EXPOSE 3000. USER node (non-root). CMD ["node", "dist/server.js"] | node:latest = unpredictable version changes. Running as root = security risk. COPY . . before npm ci = cache bust on every source change = slow builds. Leaving devDependencies in production image = bloated + larger attack surface | Dockerfile Best Practices |
| Multi-stage Builds | Multiple FROM stages in one Dockerfile. Build stage: install all deps + compile TypeScript. Production stage: copy only compiled output and production dependencies (no devDeps, no source, no build tools). Result: dramatically smaller, more secure image | Stage 1 (builder): FROM node:20-alpine AS builder → WORKDIR /app → COPY package*.json ./ → RUN npm ci → COPY . . → RUN npm run build → RUN npm prune --omit=dev (drop devDeps before node_modules is copied). Stage 2 (runner): FROM node:20-alpine AS runner → WORKDIR /app → COPY --from=builder /app/dist ./dist → COPY --from=builder /app/node_modules ./node_modules → USER node → CMD ["node", "dist/server.js"] | Without multi-stage: image can be 1.5GB+. With multi-stage: ~150–200MB. Next.js standalone output + multi-stage = ~120MB. Test stage: add a third stage that runs tests, so the build fails if tests fail | Multi-stage builds |
| .dockerignore | Prevents files from being sent to Docker build context and baked into the image. Equivalent to .gitignore for Docker. Missing it = slow builds + secrets in image | Essentials: node_modules (rebuilt inside), .git, .env (never bake secrets into image), dist (rebuilt), *.log, .DS_Store, README.md, coverage/, .github/ | .env in image = secrets visible to anyone who pulls the image. node_modules in build context = slow build (Docker uploads entire folder to daemon). Always create .dockerignore before writing Dockerfile | — |
| Docker Compose | Define and run multi-container apps with a single YAML file. Manages networking, volumes, env vars, and service dependencies. The standard tool for local development environments | docker-compose.yml: services (app, postgres, redis), ports, environment, volumes, depends_on, healthcheck. docker compose up -d (background), docker compose logs -f app (follow logs), docker compose down -v (stop + remove volumes) | depends_on only waits for container start, not readiness — use healthcheck + condition: service_healthy. Hardcoded passwords in docker-compose.yml = committed to git. Use .env file + ${VAR} substitution | Docker Compose docs |
| Volumes & Persistence | Containers are ephemeral: the writable layer is discarded when a container is removed or recreated (for example on every redeploy). Volumes persist data outside the container lifecycle | Named volume (Docker-managed): docker run -v mydata:/var/lib/postgresql/data. Bind mount (maps host dir): docker run -v $(pwd):/app (dev hot-reload). docker volume inspect mydata. docker volume ls. Backup: docker run --rm -v mydata:/data -v $(pwd):/backup alpine tar czf /backup/data.tar.gz /data | Never store DB data inside the container: always use a named volume. Bind mounts in production = host path dependency. Volume permissions: container UID must match host file ownership or use --user flag. Named volumes need explicit backup strategy | — |
| Networking | Containers on the same user-defined network reach each other by service name (DNS). Bridge network isolates from host. Each docker-compose project gets its own default network | docker network create mynet. Service DNS in compose: postgres:5432, redis:6379 (use the service name as hostname; these are not HTTP endpoints, so the URL scheme is the client's own, e.g. postgres:// or redis://). Expose vs Publish: EXPOSE documents port (internal), -p 3000:3000 maps to host. docker network inspect bridge | Never expose DB port to host in production (-p 5432:5432 on DB). Only the app container should reach the DB. Containers on different networks can't communicate: an explicit shared network is needed. host.docker.internal = reach host machine from container (Mac/Windows) | Docker networking |
| Security Best Practices | Containers reduce attack surface but introduce new risks if misconfigured | Run as non-root: USER node (add before CMD). Read-only filesystem: docker run --read-only. Drop capabilities: --cap-drop=ALL --cap-add=NET_BIND_SERVICE. No secrets in ENV in Dockerfile (visible in docker inspect and image layers). Use Docker secrets or mount secrets as files at runtime. Scan images: docker scout cves myapp:latest or trivy image myapp:latest | Root in container = root on host if container escapes (rare but possible). ENV secrets baked into image layers = visible in docker history. Trivy/Scout catches known CVEs in base image and dependencies — run in CI before push | Trivy |
| Useful Patterns | Common Docker patterns used in real projects | Health check: HEALTHCHECK --interval=30s --timeout=3s CMD curl -f http://localhost:3000/health \|\| exit 1 (curl must be installed in the image; alpine-based images do not ship it). Init process: use dumb-init as PID 1 to forward signals correctly (FROM node:20-alpine → RUN apk add dumb-init → ENTRYPOINT ["dumb-init", "--"]). Debug running container: docker exec -it container-id sh. Copy file from container: docker cp container-id:/app/log.txt . | | |
| 2026 — Docker in Modern Dev | Docker usage patterns have evolved — new tooling makes containers faster and more ergonomic | Docker Init: docker init auto-generates optimized Dockerfile + compose file for your project. Dev Containers (VS Code / Cursor): develop inside a container — entire dev environment is reproducible and shareable (.devcontainer/devcontainer.json). Testcontainers: spin up real Docker containers in integration tests (real PostgreSQL, real Redis — no mocks). Docker Scout: built-in vulnerability scanning replacing third-party tools | Dev Containers solve "new developer setup takes 2 days." Testcontainers are now the standard for integration testing — much more reliable than mocked DB clients. Docker Desktop alternatives: OrbStack (Mac — faster, lighter than Docker Desktop) | Dev Containers spec, Testcontainers |
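
The Docker Compose row notes that depends_on waits only for container start, not readiness. Besides healthcheck + condition: service_healthy, an entrypoint script can poll a dependency itself. The sketch below is one such poll loop. `wait_until_ready` is a hypothetical helper, and the probe command is whatever check suits the dependency.

```sh
# Retry a probe command until it succeeds or attempts run out.
# $1 = max attempts, $2 = delay in seconds between attempts, rest = probe command.
wait_until_ready() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    if "$@" >/dev/null 2>&1; then
      return 0                          # dependency answered: ready
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1                          # give up so the container fails visibly
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Trivial demonstration with a probe that always succeeds.
if wait_until_ready 3 0 true; then
  echo "dependency ready"
fi
```

In a container entrypoint the probe might be `wait_until_ready 30 1 nc -z postgres 5432` before starting the app, assuming `nc` exists in the image and the database service is named `postgres` as in the Networking row. Returning non-zero after the last attempt lets the orchestrator restart the container instead of leaving it hanging.
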
Mental Model: As a developer you don't need to know every AWS service (there are 200+). You need to know: how to deploy an app, store files, manage a database, handle secrets, send traffic to the right place, and know when things break. AWS is the default cloud — these services appear in almost every production stack.
| Skill | Core Concepts & Mental Model | Key Services & Techniques | Tradeoffs & Failure Modes | Resources |
|---|---|---|---|---|
| Core Concepts | Region: geographic cluster of data centers. AZ (Availability Zone): isolated data center within a region. VPC (Virtual Private Cloud): your private network in AWS. IAM: identity and access management — controls who/what can do what. Everything in AWS is accessed via IAM | Always deploy across multiple AZs for high availability. Use us-east-1 as default (most services available, cheapest). IAM roles for EC2/ECS (not access keys hardcoded in code). Least privilege: grant minimum permissions needed | Single-AZ deploy = one data center outage takes down your app. IAM access keys in code = credential leak. IAM role vs IAM user: roles are preferred for services (no long-lived credentials) | AWS docs |
| EC2 | Virtual machine in the cloud. You control OS, runtime, and config. Most flexible but most maintenance. t3.micro (free tier) → t3.medium → c5 (compute) → r5 (memory) instance types | Launch: choose AMI (Amazon Linux 2023 or Ubuntu 22.04), instance type, security group (firewall rules), key pair (SSH access). Connect: ssh -i key.pem ec2-user@ip. User data script: runs on first boot (install Node, pull image, start app). Elastic IP: static IP for your instance | EC2 = you manage patching, scaling, and OS. Security group: inbound rules control what ports are open — only open ports 22 (SSH, restricted to your IP), 80, 443. Never open 0.0.0.0/0 on port 22 (brute-force target). Use Systems Manager Session Manager instead of SSH for production access | EC2 docs |
| ECS (Elastic Container Service) | Run Docker containers without managing servers (Fargate mode) or on EC2. The standard way to deploy containerized apps on AWS. Task Definition = the run configuration for a container, closer to a docker-compose service than to a Dockerfile (image, CPU, memory, env vars, port). Service = desired count of running tasks + auto-restart | ECS Fargate: serverless containers, no EC2 to manage. Task Definition: image URI (ECR), CPU (256/512/1024), memory, environment variables (from Secrets Manager), port mappings. Service: desired count, auto-scaling, ALB integration. Rolling deploy: replace tasks one at a time | Fargate cold starts: ~10–30s to start a new task. CPU/memory limits are hard: the app is OOM-killed if it exceeds its memory limit. ECR image pull timeout if image is too large. The ECS Service restarts failed tasks automatically, so check CloudWatch logs when tasks keep restarting | ECS docs |
| ECR (Elastic Container Registry) | Private Docker image registry in AWS. Stores images for ECS/EKS deployment. Integrated with IAM for access control | aws ecr get-login-password \| docker login --username AWS --password-stdin <account>.dkr.ecr.<region>.amazonaws.com. Tag: docker tag myapp:latest <ecr-uri>:latest. Push: docker push <ecr-uri>:latest. Lifecycle policy: auto-delete old images (keep last 10 tagged images) | Without lifecycle policy: ECR storage costs grow unbounded (every CI push stores a new image). Always push git SHA tag + latest, never only latest (can't roll back to previous). ECR image scanning: enable on push to catch CVEs automatically | |
| RDS (Relational Database Service) | Managed PostgreSQL/MySQL. AWS handles: backups, patching, failover, replication. You focus on schema and queries | Multi-AZ: primary + standby replica in different AZ — automatic failover. Read replicas: offload read traffic. Parameter groups: DB config (max_connections, shared_buffers). Subnet group: deploy RDS in private subnet (not public internet). Security group: only allow traffic from app server/ECS task security group | Never put RDS in public subnet — only accessible from within VPC. Enable automated backups (default 7 days — extend to 30 for production). RDS Proxy: connection pooler between app and RDS — prevents connection exhaustion from Lambda/ECS scaling | RDS docs |
| S3 (Simple Storage Service) | Object storage for files, images, backups, static sites. Effectively unlimited capacity, 11 nines durability. Flat key-value store — no real folders, just prefixes | Bucket: container for objects. Object key: the "path" (images/user-123/avatar.jpg). Pre-signed URL: time-limited URL for direct client upload/download — no server proxy needed. Bucket policy vs ACL: use bucket policy (ACL is legacy). Storage classes: Standard → Infrequent Access → Glacier (archival). Lifecycle rules: auto-transition old objects to cheaper tier | Public bucket = anyone on the internet can read its objects (and write them, if the policy allows it). Keep the default: block all public access. Use pre-signed URLs for user uploads instead of proxying file bytes through your server. S3 egress (data out) is expensive — serve via CloudFront CDN instead | S3 docs |
| CloudFront | CDN for serving S3 files and dynamic content at edge. Can cut latency for cached content from hundreds of milliseconds to tens of milliseconds or less, depending on user location. Integrates with S3, ALB, API Gateway, and custom origins | Distribution: configure origin (S3 bucket or ALB). Cache behavior: which paths to cache (/* for static, /api/* bypass cache). Cache invalidation: aws cloudfront create-invalidation --distribution-id <id> --paths "/*". SSL: attach ACM certificate for HTTPS on custom domain. Lambda@Edge / CloudFront Functions: run code at edge (auth, redirects, A/B testing) | CDN serves stale content until TTL expires — use versioned filenames (app.[hash].js) for instant "invalidation." CloudFront invalidation is not instant (typically seconds to a few minutes); the first 1,000 paths per month are free, then $0.005 per path. Lambda@Edge has cold starts — CloudFront Functions are faster for simple logic | CloudFront docs |
| ALB (Application Load Balancer) | L7 HTTP load balancer. Routes traffic to ECS tasks, EC2 instances, or Lambda. Required in front of any multi-instance deployment | Target group: pool of instances/tasks. Health check: GET /health → 200 = healthy, 5xx or timeout = remove from rotation. Listener rules: route by path (/api/* → API target group, /* → frontend target group) or host (api.example.com vs app.example.com). HTTPS: attach ACM certificate to listener | No health check = ALB routes to crashed instances. Health check interval too short = flapping. Connection draining: ALB waits for in-flight requests to complete before deregistering instance (set deregistration delay: 30s) | ALB docs |
| IAM (Identity & Access Management) | Controls who (user, service, role) can do what (action) on which resource. Principle of least privilege: grant minimum permissions needed | IAM Role: attached to EC2/ECS/Lambda — temporary credentials, no long-lived keys. IAM Policy: JSON document defining permissions. Managed policies: AWS pre-built (AmazonS3ReadOnlyAccess). Inline policy: embedded in a single user, group, or role. aws sts get-caller-identity (who am I?). IAM Access Analyzer: finds overly permissive policies | IAM access keys in code/git = credential leak — use IAM roles for services. Wildcard actions (s3:*) = too permissive. Never use root account for daily operations — create an IAM user/role. Rotate access keys every 90 days if you must use them | IAM Best Practices |
| Secrets Manager & SSM | Store and inject secrets securely. Secrets Manager: supports automatic rotation, versioning, cross-account access. SSM Parameter Store: simpler, cheaper, no auto-rotation | Secrets Manager: aws secretsmanager get-secret-value --secret-id prod/myapp/db. SSM: aws ssm get-parameter --name /myapp/prod/DATABASE_URL --with-decryption. In ECS Task Definition: reference the secret ARN in the container's secrets section (name + valueFrom), not in environment → injected as an env var at container start. In Lambda: fetch at cold start, cache in module scope | Fetching secrets on every request = latency + Secrets Manager rate limit. Cache secrets in memory (module scope) and refresh on rotation event. Secrets Manager costs $0.40/secret/month — use SSM for simple non-rotating secrets | Secrets Manager docs |
| CloudWatch | AWS-native monitoring: logs, metrics, alarms, dashboards. Default destination for logs from ECS, Lambda, EC2 | Log groups: ECS tasks log to a group such as /ecs/myapp when the awslogs log driver is configured. Logs Insights: query logs with a purpose-built, pipe-based query language (fields, filter, stats, sort). Metrics: CPU, memory, request count, error rate. Alarms: trigger SNS notification or auto-scaling action. aws logs tail /ecs/myapp --follow (live logs in terminal) | CloudWatch Logs retention default = never expires = unbounded cost. Set retention policy (30–90 days). Logs Insights is billed by data scanned, so queries on large log groups are slow and expensive: narrow the time range and use structured JSON logs so fields can be filtered directly. Standard metrics have 1-minute resolution; high-resolution custom metrics go down to 1 second at extra cost | CloudWatch docs |
| Lambda | Run functions without managing servers. Event-triggered: HTTP (API Gateway), S3 event, SQS message, scheduled (EventBridge). Billed per invocation + duration | Handler: exports.handler = async (event) => { return { statusCode: 200, body: JSON.stringify(result) } }. Cold start: ~100ms–1s for Node.js (provisioned concurrency keeps instances warm to avoid cold starts). Timeout: max 15 minutes. Memory: 128MB–10GB (more memory = more CPU). Layers: shared dependencies across functions | Cold starts are a real UX problem for user-facing APIs — use ECS for low-latency APIs, Lambda for async processing. Each concurrent execution opens its own DB connection, so bursts can exhaust the database: use RDS Proxy for DB access. Deployment package limit: 50MB zipped for direct upload and 250MB unzipped including layers (use a container image, up to 10GB, for larger deps) | Lambda docs |
| 2026 — Modern Deployment Patterns | Cloud deployment patterns have simplified significantly for developer teams | App Runner: deploy a Docker image or GitHub repo with zero infrastructure config — AWS manages scaling, load balancing, TLS. Better than ECS for small teams. Copilot CLI: deploy ECS apps with one command (copilot init). CDK (Cloud Development Kit): define AWS infrastructure in TypeScript — more ergonomic than raw CloudFormation/Terraform for developers. Alternatives: Railway, Render, Fly.io (simpler than AWS for early-stage projects — no IAM, no VPC config) | App Runner is more expensive than ECS at scale but dramatically simpler. CDK generates CloudFormation — still has CFN limitations. Railway/Render: excellent DX but less control and higher cost at scale. Fly.io: best for globally distributed apps (edge regions) | AWS App Runner, AWS CDK |
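
The Secrets Manager & SSM row recommends caching secrets in memory rather than fetching them on every request. The sketch below shows one way to do that in Python with a time-based refresh. It is a minimal illustration, not the AWS SDK's own caching client: `SecretCache`, `fake_fetch`, and the 300-second TTL are hypothetical names and values, and a TTL is used here in place of the rotation-event refresh mentioned in the table. In a real service, `fetch` would wrap the actual Secrets Manager or SSM call.

```python
import time
from typing import Callable, Dict, Tuple


class SecretCache:
    """Keep secret values in process memory and refetch them after a TTL."""

    def __init__(
        self,
        fetch: Callable[[str], str],   # hypothetical: wraps the real secret lookup
        ttl_seconds: float = 300.0,    # assumed refresh interval
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[str, float]] = {}  # id -> (value, expiry)

    def get(self, secret_id: str) -> str:
        now = self._clock()
        entry = self._entries.get(secret_id)
        if entry is not None and now < entry[1]:
            return entry[0]  # cache hit: no network call
        value = self._fetch(secret_id)
        self._entries[secret_id] = (value, now + self._ttl)
        return value

    def invalidate(self, secret_id: str) -> None:
        """Drop one entry, e.g. after a rotation notice or an auth failure."""
        self._entries.pop(secret_id, None)


# Stand-in for the real lookup so the sketch runs without AWS credentials.
calls = []

def fake_fetch(secret_id: str) -> str:
    calls.append(secret_id)
    return f"value-for-{secret_id}-v{len(calls)}"


# Created once at module scope, so warm invocations reuse the cached value.
cache = SecretCache(fake_fetch, ttl_seconds=300)
first = cache.get("prod/myapp/db")
second = cache.get("prod/myapp/db")  # served from memory, fake_fetch not called again
```

Calling `cache.invalidate("prod/myapp/db")` when a database login fails is a simple way to pick up a rotated credential before the TTL expires.
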
Mental Model: Kubernetes (K8s) is a container orchestrator — it runs your Docker containers across a cluster of machines, handles restarts, scaling, and rolling deploys automatically. As a developer you don't need to operate a cluster, but you must understand enough to: read manifests, debug failing pods, deploy your app, and know what Kubernetes is actually doing when your deployment is "stuck."
| Skill | Core Concepts & Mental Model | Key Techniques & Commands | Tradeoffs & Failure Modes | Resources |
|---|---|---|---|---|
| Core Architecture | Control Plane: manages the cluster (API Server, etcd, Scheduler, Controller Manager). Node: worker machine (EC2/VM) that runs containers. Pod: smallest deployable unit — one or more containers that share network and storage. Kubernetes reconciles desired state (YAML) with actual state continuously | kubectl get pods, kubectl get nodes, kubectl describe pod <name> (events + status), kubectl logs <pod> -f (live logs), kubectl exec -it <pod> -- sh (enter container), kubectl apply -f manifest.yaml, kubectl delete -f manifest.yaml | K8s adds significant operational complexity — don't use it unless you have 3+ services that need independent scaling/deployment. Use ECS/App Runner/Fly.io for simpler setups. kubectl is your primary debugging tool | Kubernetes docs |
| Pod | Smallest deployable unit. Usually one container per pod. Pods are ephemeral — they die and are replaced. Never deploy a naked Pod (not rescheduled if its node fails, no scaling) — always use a Deployment | Pod spec: containers (image, ports, resources, env, volumeMounts), volumes, restartPolicy. Pods get a cluster-internal IP (not stable: a replacement pod gets a new IP). Pod-to-pod communication: use Service DNS, not pod IP directly | Pod IP changes whenever the pod is replaced — never hardcode pod IPs. Pod dying continuously (CrashLoopBackOff) = check kubectl logs and kubectl describe pod for events. OOMKilled = memory limit too low or a memory leak | — |
| Deployment | Manages a ReplicaSet — ensures N replicas of your pod are always running. Handles rolling updates and rollbacks. The standard way to run stateless apps | deployment.yaml: apiVersion: apps/v1, kind: Deployment, spec.replicas: 3, spec.selector.matchLabels, spec.template (pod template). kubectl rollout status deployment/myapp. kubectl rollout undo deployment/myapp (rollback). kubectl scale deployment myapp --replicas=5 | Rolling update default: maxSurge 25% and maxUnavailable 25%, so pods are replaced in batches; avoiding downtime also depends on a working readiness probe. maxSurge: how many extra pods during update. maxUnavailable: how many pods can be down during update. rollback only reverts pod template — does NOT revert DB migrations | K8s Deployments |
| Service | Stable DNS name + IP that routes traffic to matching pods. Pods come and go — Service provides a stable endpoint. Three common types: ClusterIP (internal only), NodePort (exposes on each node's IP), LoadBalancer (creates cloud LB — use for external traffic). A fourth type, ExternalName, maps a Service to an external DNS name | service.yaml: kind: Service, spec.selector (matches pod labels), spec.ports (port: 80, targetPort: 3000), spec.type: ClusterIP. DNS: http://myapp.default.svc.cluster.local or just http://myapp within same namespace. kubectl get services, kubectl describe service myapp | ClusterIP only reachable within cluster. LoadBalancer creates an AWS/GCP LB per service (expensive) — use Ingress instead for multiple services. Service selector must exactly match pod labels or traffic won't route (kubectl get endpoints myapp shows whether any pods matched) | — |
| Ingress | Routes external HTTP/HTTPS traffic to Services based on host/path rules. One Ingress Controller (NGINX, Traefik, AWS Load Balancer Controller) handles routing for all services — cheaper than one LoadBalancer per service | ingress.yaml: kind: Ingress, rules: host: api.example.com, http.paths: path: /api, pathType: Prefix, backend.service.name: api-service. TLS: spec.tls with secret containing cert. Annotations configure NGINX behavior (rate limit, CORS, timeout) | Ingress requires an Ingress Controller deployed in the cluster — not included by default. When several paths match, the longest matching path wins and Exact beats Prefix, so a catch-all / rule does not shadow /api. cert-manager automates TLS certificate provisioning from Let's Encrypt | NGINX Ingress |
| ConfigMap & Secret | ConfigMap: non-sensitive config injected as env vars or files. Secret: sensitive data (base64 encoded — NOT encrypted by default in etcd). Both decouple config from container image | configmap.yaml: kind: ConfigMap, data: NODE_ENV: production. secret.yaml: kind: Secret, data: DB_PASSWORD: <base64>. Reference in pod: env.valueFrom.configMapKeyRef or secretKeyRef. Mount as file: volumeMounts + volumes.configMap | K8s Secrets are base64 encoded not encrypted — anyone who can read Secrets through the API (or read etcd directly) can decode them. Restrict access with RBAC, and use External Secrets Operator + AWS Secrets Manager to keep the source of truth outside the cluster. Changing a ConfigMap does not restart pods, and env vars keep their old values: add a checksum annotation to trigger a rollout or run kubectl rollout restart | K8s Secrets |
| Resources & Limits | Every container should declare CPU and memory requests (what K8s reserves) and limits (hard cap). Scheduler uses requests to decide which node to place the pod on | resources.requests: cpu: "250m" (250 millicores = 0.25 CPU), memory: "256Mi". resources.limits: cpu: "500m", memory: "512Mi". kubectl top pods (live CPU/memory usage). kubectl describe node (allocatable vs allocated) | OOMKilled: container exceeded memory limit — increase limit or fix memory leak. CPU throttling: container exceeds CPU limit — not killed but slowed down (check kubectl top). No resource requests = scheduler cannot account for the pod's usage and may overpack a node = noisy neighbor problem. Always set both requests AND limits | — |
| Liveness & Readiness Probes | Liveness: is the container healthy? Fail → restart container. Readiness: is the container ready for traffic? Fail → remove from Service endpoints (no traffic). These are distinct purposes | livenessProbe: httpGet: path: /health/live, port: 3000, initialDelaySeconds: 10, periodSeconds: 30. readinessProbe: httpGet: path: /health/ready, port: 3000, initialDelaySeconds: 5, periodSeconds: 10 | Liveness probe that tests DB = one DB hiccup restarts all pods simultaneously (cascading failure). Readiness probe timeout too short = pod removed from rotation on slow response. No readiness probe = pod receives traffic before app is ready to serve | — |
| HPA (Horizontal Pod Autoscaler) | Automatically scale pod count based on CPU, memory, or custom metrics. Scale up when load increases, scale down when idle | hpa.yaml: kind: HorizontalPodAutoscaler, spec.scaleTargetRef.name: myapp, spec.minReplicas: 2, spec.maxReplicas: 10, spec.metrics: cpu averageUtilization: 70. kubectl get hpa (shows current replicas + targets) | Scale-down too aggressive = pods removed before load decreases → spikes cause thrashing. minReplicas: 1 = single point of failure. HPA requires metrics-server installed. Custom metrics HPA (queue depth, request rate) more accurate than CPU-based | — |
| Namespaces | Virtual clusters within a physical cluster. Isolate environments (dev/staging/prod) or teams. Resource quotas per namespace | kubectl get namespaces, kubectl create namespace staging, kubectl apply -f manifest.yaml -n staging, kubectl config set-context --current --namespace=myapp (set default namespace). ResourceQuota: limit total CPU/memory per namespace | Default namespace = no isolation. Don't run prod and dev in same namespace. Namespace doesn't provide network isolation by default — use NetworkPolicy for that. Cross-namespace service DNS: http://service.namespace.svc.cluster.local | — |
| Debugging Pods | Systematic approach to diagnosing K8s issues | CrashLoopBackOff: kubectl logs <pod> --previous (logs before crash). Pending: kubectl describe pod — check Events for "Insufficient CPU/memory" or "Unschedulable." ImagePullBackOff: wrong image name, wrong tag, or missing registry credentials (imagePullSecrets). OOMKilled: increase memory limit or fix the leak. kubectl debug -it <pod> --image=busybox --target=<container> (ephemeral debug container attached to a running pod, stable since Kubernetes 1.25). kubectl debug node/<node> -it --image=busybox (debug pod running on the node itself) | Most K8s issues are diagnosed via kubectl describe (events section) and kubectl logs. "ContainerCreating" stuck = volume mount issue or image pull issue. Events section in describe is the most useful debugging output | K8s Troubleshooting |
| 2026 — Managed K8s & Tooling | K8s operational complexity has been abstracted — developers interact via higher-level tools | EKS (AWS), GKE (Google — best managed K8s), AKS (Azure). Helm: package manager for K8s — install apps as charts (helm install postgres bitnami/postgresql). Kustomize: environment-specific config overlays (built into kubectl). Lens / k9s: GUI/TUI for K8s — vastly better DX than raw kubectl. ArgoCD: GitOps — K8s manifests in git, ArgoCD syncs cluster to git state automatically. Skaffold / Tilt: local K8s dev loop (hot reload in cluster) | Most developers should use managed K8s (EKS/GKE) not self-hosted. ArgoCD is the 2026 standard for CD into K8s — declarative, auditable, self-healing. k9s is faster than kubectl for day-to-day debugging | k9s, ArgoCD |
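
Several rows above rest on simple arithmetic that is easy to check by hand: CPU quantities in millicores, the HPA replica calculation, and the pod-count bounds during a rolling update. The Python sketch below works through them. The function names are hypothetical, and the HPA function is a simplification that ignores the controller's tolerance band and stabilization windows, so treat it as a way to build intuition rather than a reimplementation of Kubernetes.

```python
import math
from typing import Tuple


def cpu_to_cores(quantity: str) -> float:
    """Convert a CPU quantity such as "250m" or "2" to a number of cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000  # millicores
    return float(quantity)


def hpa_desired_replicas(
    current_replicas: int,
    current_utilization: float,  # average CPU utilization across pods, in percent
    target_utilization: float,   # averageUtilization from the HPA spec
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Core HPA rule: ceil(current * currentMetric / targetMetric), then clamp."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


def rolling_update_bounds(
    replicas: int,
    max_surge_pct: int = 25,        # Deployment default
    max_unavailable_pct: int = 25,  # Deployment default
) -> Tuple[int, int]:
    """Return (minimum available pods, maximum total pods) during a rollout.

    Percentages are converted to pod counts by rounding maxSurge up and
    maxUnavailable down.
    """
    surge = -(-replicas * max_surge_pct // 100)            # integer ceiling
    unavailable = replicas * max_unavailable_pct // 100    # integer floor
    return replicas - unavailable, replicas + surge


print(cpu_to_cores("250m"))                    # 0.25
print(hpa_desired_replicas(4, 90, 70, 2, 10))  # 6
print(rolling_update_bounds(3))                # (3, 4)
print(rolling_update_bounds(10))               # (8, 13)
```

With 3 replicas and the default percentages, the rollout never drops below 3 available pods and adds at most 1 extra, which is why small Deployments appear to update one pod at a time.
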