DevOps for Developers — 01. Linux & Terminal

Mental Model: As a developer you don't need to be a Linux sysadmin. You need to be comfortable enough to: SSH into a server, read logs, debug a running process, manage files, and not accidentally break things. Every cloud server, Docker container, and CI runner is Linux.

Skill Core Concepts & Mental Model Key Commands & Techniques Tradeoffs & Failure Modes Resources
File System Navigation Linux file tree: everything is a file. / is root. Key dirs: /etc (system configs), /var/log (logs), /home (user dirs), /tmp (cleared on reboot), /usr/bin (installed binaries), /proc (live kernel/process info) ls -la, cd, pwd, find / -name "file" -type f, tree, du -sh * (disk usage per dir), df -h (disk free), stat file (metadata) rm -rf with wrong path = unrecoverable data loss. Always double-check path before destructive commands. /proc/meminfo and /proc/cpuinfo are useful for inspecting running system without installing tools Linux Command Line (free book)
File Operations & Permissions Linux permissions: owner / group / others × read(4) / write(2) / execute(1). Octal notation: 755 = rwxr-xr-x. Symbolic: chmod +x file cp, mv, rm -rf, mkdir -p (create nested dirs), touch, cat, less, head -n 20, tail -f (live log follow), grep -r "text" ., chmod 755 file, chown user:group file, ln -s target link (symlink) Never run chmod 777 — world-writable files are a security risk. chown -R as root on wrong directory can break system binaries. tail -f is the most-used command for watching live logs in production chmod Calculator
Process Management Every running program is a process with a PID. Parent-child relationships. Signals: SIGTERM (15) = graceful shutdown request, SIGKILL (9) = force kill (no cleanup), SIGHUP (1) = reload config ps aux (all processes), top / htop (live — htop is better), kill -15 PID (graceful), kill -9 PID (force), pkill node (by name), lsof -i :3000 (what's using port 3000), nohup command & (survive logout), jobs / fg / bg kill -9 skips cleanup handlers — use kill -15 first, wait 5s, then kill -9 if needed. Zombie processes = child exited but parent never collected exit status (ps shows Z state). lsof -i :3000 is the most useful command when "port already in use" DigitalOcean Process Guide
Text Processing Parse logs, extract data, transform output: the most frequent real-world Linux task. Pipe chains process data as streams, so filters such as grep, awk, and sed handle huge files without loading them into memory grep -E "ERROR|WARN" app.log (regex alternation), grep -v "health" (exclude), awk '{print $1}' (print column), sed 's/old/new/g' (replace), sort | uniq -c (count duplicate lines)
Environment & Shell Shell (bash/zsh) interprets commands. Environment variables are key=value pairs inherited by child processes. PATH controls where shell searches for binaries export VAR=value (current session), echo $VAR, env (all vars), printenv VAR, which node (binary location), type command (alias vs binary), source ~/.zshrc (reload config). Add to ~/.zshrc or ~/.bashrc for persistence export in terminal = session only. Secrets in .bashrc/.zshrc = visible in plain text and potentially git-tracked. Use direnv (per-directory .envrc files) for project-specific env vars without polluting global shell direnv
SSH Encrypted remote terminal access. Key-based auth: your private key + server's authorized_keys. More secure and more convenient than passwords. Never enable password auth on production servers ssh-keygen -t ed25519 -C "email" (generate keypair), ssh-copy-id user@ip (add pubkey to server), ssh user@ip, ssh -i key.pem user@ip (AWS style), chmod 400 key.pem (required by SSH), ~/.ssh/config for saved host aliases, ssh -L 5432:localhost:5432 user@server (port forwarding — access remote DB locally) Private key must never leave your machine. PEM files require chmod 400 — SSH refuses overly-permissive keys. ed25519 is preferred over RSA (shorter key, same security). SSH port forwarding is the safest way to access remote DBs locally without exposing DB port publicly SSH Essentials
Cron Jobs Schedule recurring tasks. Runs as the crontab owner's user. Format: minute hour day month weekday command (5 fields) crontab -e (edit), crontab -l (list), 0 2 * * * /scripts/backup.sh (daily at 2am), */5 * * * * (every 5 min). Always redirect output: >> /var/log/cron.log 2>&1. Use absolute paths for binaries: /usr/bin/node not node Cron uses minimal PATH — relative binary names silently fail. Silent failures are the #1 cron bug — always log output. Test scripts manually before adding to crontab. For distributed systems use BullMQ repeatable jobs or Inngest instead of cron — avoids multi-instance double-execution Crontab Guru
Networking Commands Debug connectivity issues, inspect open ports, trace network paths curl -I https://api.example.com (HTTP headers), curl -v (verbose — full request/response), wget, ping, traceroute / tracepath, netstat -tulpn (open ports + listening processes), ss -tulpn (modern replacement for netstat), nc -zv host port (test port connectivity), dig domain (DNS lookup), nslookup domain curl -v is the most useful tool for debugging API calls from a server. ss -tulpn shows exactly what's listening on which port. If curl works but browser doesn't = DNS or CORS issue, not a server issue
System Monitoring Understand resource usage before scaling or debugging performance issues htop (interactive process + CPU/memory), free -h (memory), df -h (disk), iostat (disk I/O), vmstat (CPU/memory/IO overview), uptime (load averages), dmesg (kernel messages: hardware/boot issues), journalctl -u myservice -f (systemd service logs, live) Load average consistently above the number of CPU cores = more work queued than the cores can run; on Linux this also counts tasks blocked on disk I/O, so check top and iostat to tell a CPU bottleneck from an I/O one. Low "available" memory in free -h + swap increasing = memory pressure, often a leak (high "used" alone may just be page cache). Disk full = silent failures (logs stop writing, DB crashes). Check df -h before diagnosing mysterious errors
Package Management Install, update, remove system packages. apt (Debian/Ubuntu), yum/dnf (RHEL/Amazon Linux) apt update (refresh package index), apt install -y package, apt remove package, apt list --installed, which binary (confirm install). On Alpine (Docker): apk add --no-cache package Always apt update before apt install in Docker or CI — stale package index = version not found errors. Use apt-get (non-interactive) in Dockerfiles, not apt (interactive). Pin package versions in Dockerfiles for reproducible builds
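
The pipe chain from the Text Processing row can be sketched end to end as follows. This is a minimal sketch, assuming a hypothetical log format of `date time LEVEL message`; the sample log and the `count_levels` function name are invented for illustration.

```shell
# Build a small sample log in a temp file (hypothetical format: date time LEVEL message).
log_file=$(mktemp)
cat > "$log_file" <<'EOF'
2026-01-10 12:00:01 INFO GET /health 200
2026-01-10 12:00:02 ERROR db connection refused
2026-01-10 12:00:03 WARN slow query 1200ms
2026-01-10 12:00:04 ERROR db connection refused
2026-01-10 12:00:05 INFO GET /users 200
EOF

# count_levels FILE: count ERROR and WARN lines per level, most frequent first.
count_levels() {
  grep -E "ERROR|WARN" "$1" |   # keep only lines matching either level
    grep -v "health" |          # drop health-check noise
    awk '{print $3}' |          # third column is the level in this format
    sort | uniq -c |            # uniq -c only merges adjacent lines, so sort first
    sort -rn |                  # highest count first
    awk '{print $2 "=" $1}'     # normalize to LEVEL=count
}

count_levels "$log_file"
```

Each stage reads the previous stage's output as a stream, so the same chain works unchanged on a multi-gigabyte file; only `sort` needs to see all of its input before it can emit anything.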

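The Process Management advice (kill -15 first, wait, then kill -9) could be wrapped in a small helper. This is a sketch only: the function name `stop_gracefully` and its default 5-second timeout are assumptions taken from the row's wording, not a standard tool.

```shell
# stop_gracefully PID [TIMEOUT]: send SIGTERM, wait up to TIMEOUT seconds
# (default 5), then fall back to SIGKILL. The function name is hypothetical.
stop_gracefully() {
  pid=$1
  timeout=${2:-5}
  kill -15 "$pid" 2>/dev/null || return 0   # nothing to signal: already gone
  waited=0
  while kill -0 "$pid" 2>/dev/null; do      # kill -0 sends no signal, it only checks the PID
    if [ "$waited" -ge "$timeout" ]; then
      kill -9 "$pid" 2>/dev/null            # cleanup handlers will not run
      return 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}

# Demo on a throwaway background process.
sleep 60 &
demo_pid=$!
stop_gracefully "$demo_pid"
wait "$demo_pid" 2>/dev/null || true        # collect the exit status so no zombie is left
echo "stopped $demo_pid"
```

The final `wait` matters for the zombie case mentioned in the row: until the parent collects the exit status, the PID stays in the process table.
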
DevOps for Developers — 02. Git & Version Control

Mental Model: Git tracks snapshots of your project, not diffs. Three areas: Working Directory (files you edit) → Staging Area (what you mark for commit) → Repository (committed history). Commits are immutable content-addressed objects — you never modify history, you add new commits on top.

Skill Core Concepts & Mental Model Key Commands & Techniques Tradeoffs & Failure Modes Resources
Core Concepts Four Git objects: Blob (file content), Tree (directory), Commit (snapshot + metadata + parent), Tag (named commit). HEAD = pointer to current commit. Branch = lightweight pointer that moves forward on each commit git add ., git commit -m "message", git status, git log --oneline --graph --all, git diff (unstaged), git diff --staged (staged), git show <hash> (inspect commit) Committing node_modules or .env = bloated repo + leaked secrets. Always have .gitignore before first commit. Large binary files in git history are permanent — use Git LFS for assets Pro Git Book (free)
Branching & Merging Branches are cheap pointers to commits. Three strategies: GitHub Flow (main + feature branches — best for most teams + CI/CD), GitFlow (main/develop/feature/hotfix — complex, slower), Trunk-Based Development (everyone commits to main daily with feature flags — fastest, requires discipline) git checkout -b feature/name, git merge feature/name (merge commit), git rebase main (linear history), git cherry-pick <hash> (single commit), git branch -d feature/name, git branch -a (all branches including remote) Long-lived feature branches = merge conflict pain. Rebase rewrites history — never rebase shared/pushed branches. Cherry-pick overused = diverged history. Trunk-based development requires feature flags to hide incomplete work
Remote Workflows origin = your primary remote (GitHub/GitLab). upstream = original repo (open source forks). fetch downloads without merging. pull = fetch + merge git remote -v, git fetch origin (download, no merge), git pull origin main, git push origin feature/name, git push --force-with-lease (safer than -f — fails if remote has new commits), git clone --depth 1 url (shallow clone — faster for CI) git pull on wrong branch = merges into wrong branch. git push -f on shared branches overwrites teammates' commits. Always use --force-with-lease instead of --force. Shallow clones miss full history — some git operations fail
Undoing Changes Git almost never permanently deletes committed data (the reflog keeps entries for 30 to 90 days by default). Strategy depends on whether commits are local or already pushed to a shared branch git restore file (discard working dir changes), git reset --soft HEAD~1 (undo last commit, keep changes staged), git reset HEAD~1 (undo last commit, keep changes unstaged), git reset --hard HEAD~1 (undo + discard: dangerous), git revert <hash> (safe undo for pushed commits: creates new reverting commit), git stash / git stash pop, git reflog (find lost commits) git reset --hard = uncommitted changes gone for good; the dropped commit itself is still in the reflog, so check it before panicking. git revert is the ONLY safe undo for already-pushed commits: never reset shared history. git stash drop = stash gone permanently
Conventional Commits Structured commit message format that enables: automated changelogs, semantic versioning, readable history, and CI triggers based on commit type Types: feat (new feature → minor version bump), fix (bug fix → patch bump), chore (maintenance), refactor, docs, test, ci, perf. A breaking change is not a type: mark it with ! after the type/scope (feat(api)!: ...) or with a BREAKING CHANGE: footer (→ major bump). Format: type(scope): description. Example: feat(auth): add OAuth Google login Inconsistent commit messages = useless git log. WIP commits should be squashed before merge (git rebase -i). Use commitlint + husky to enforce the format locally, and run commitlint again in CI Conventional Commits
Branch Protection & GitHub Workflows Pull Requests are the unit of code review and the CI trigger. Branch protection enforces quality gates before merge. CODEOWNERS auto-assigns reviewers by file path Branch protection rules: require PR review before merge, require CI status checks to pass, disable force push to main, require linear history. CODEOWNERS: src/auth/ @security-team. Merge strategies: Squash merge (clean history), Rebase merge (linear), Merge commit (full history) Direct push to main = no review, no audit trail. Missing status check requirement = broken CI doesn't block merge. Squash merge loses individual commit history — use for feature branches. Merge commit for releases (preserves full context) GitHub Branch Protection
Secrets & .gitignore .gitignore prevents untracked files from being tracked. But once a file is committed, adding it to .gitignore neither untracks it nor removes it from history .gitignore essentials: node_modules/, .env, .env.local, dist/, .DS_Store, *.log, coverage/. Remove accidentally committed file: git rm --cached .env → commit. If secret already pushed: rotate immediately, use git-filter-repo to purge history, assume it's compromised .gitignore must exist before first commit: files already tracked are not ignored by adding them later. git rm --cached removes from tracking but keeps local file. Private repo does NOT mean secrets are safe: treat all committed secrets as compromised gitignore templates
Git Hooks & Automation Scripts that run automatically at Git lifecycle events. Pre-commit: run linting/formatting before commit. Commit-msg: validate commit message format. Pre-push: run tests before push husky (manage hooks in package.json), lint-staged (run linters only on staged files — fast). Setup: npx husky init → .husky/pre-commit → npx lint-staged. package.json: "lint-staged": { "*.ts": ["eslint --fix", "prettier --write"] } Hooks run locally — can be bypassed with git commit --no-verify. Never put critical security checks only in hooks — enforce in CI too. Slow hooks = developers bypass them husky docs
Monorepo Git Patterns Single repo for multiple apps/packages. Git operations need to be scoped — you don't want to run all tests when only one package changed Turborepo: turbo run test --filter=@myapp/api (run only affected). GitHub Actions path filters: on.push.paths: ['apps/api/**'] to trigger only relevant workflows. git diff --name-only HEAD~1 to detect changed packages Without path-based CI filtering: every commit triggers full test suite for all packages = slow. Turborepo remote cache: reuse unchanged package build outputs across CI runs Turborepo
Git in 2026 — AI-assisted Workflows Git workflows now integrate with AI tools for: auto-generated commit messages, PR descriptions, code review suggestions, and conflict resolution assistance GitHub Copilot generates commit messages and PR descriptions. git commit with Copilot CLI. Cursor/Windsurf AI-assisted merge conflict resolution. Conventional commit message generation from staged diff AI-generated commit messages can be generic — review before accepting. AI PR descriptions miss business context — always add "why" manually. Auto-merge bots (Renovate, Dependabot) handle dependency updates via PRs GitHub Copilot CLI
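
To make the type(scope): description format from the Conventional Commits row concrete, here is a rough sketch of the check a commit-msg hook performs. The function name `is_conventional` and the exact regex are assumptions for illustration; commitlint, which the section recommends, applies a fuller rule set.

```shell
# is_conventional MESSAGE: succeed if the first line looks like
# type(scope): description. The type list mirrors the types named in this section.
is_conventional() {
  printf '%s\n' "$1" | head -n 1 |
    grep -Eq '^(feat|fix|chore|refactor|docs|test|ci|perf)(\([a-z0-9-]+\))?!?: .+'
}

for msg in "feat(auth): add OAuth Google login" "fixed stuff"; do
  if is_conventional "$msg"; then
    echo "ok:  $msg"
  else
    echo "bad: $msg"
  fi
done
```

Git passes the path of the message file to a commit-msg hook as its first argument, so inside a hook the call would look like `is_conventional "$(cat "$1")" || exit 1`. As the Git Hooks row notes, this runs locally and can be bypassed, so the same check belongs in CI.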

DevOps for Developers — 03. CI/CD Pipelines

Mental Model: CI/CD makes deployment a non-event — small, frequent, automated, and reversible. CI (Continuous Integration) catches bugs early by building and testing every commit. CD (Continuous Delivery/Deployment) removes manual steps from shipping code. The goal: confidence to deploy 10 times a day without fear.

Skill Core Concepts & Mental Model Tools & Techniques Tradeoffs & Failure Modes Resources
CI/CD Mental Model CI: every commit triggers automated build + test. CD (Delivery): every merged commit is automatically deployable (may require manual approval). CD (Deployment): every merged commit automatically deploys to production. Pipeline stages run in order: Trigger → Install → Lint → Type Check → Test → Build → Deploy. Fail fast: run cheap checks (lint) before expensive ones (E2E) GitHub Actions (most common), GitLab CI, CircleCI, Buildkite (self-hosted runners) Slow CI = developers stop waiting = commits piling up = defeats CI purpose. CI that never fails = tests not catching real bugs. No rollback plan = deployment is gambling. Target: CI feedback in under 5 minutes GitHub Actions docs
GitHub Actions Core Concepts YAML-based workflow automation. Triggered by events. Jobs run in parallel by default. Steps within a job run sequentially. GitHub-hosted runners are ephemeral VMs that are destroyed after each run on (triggers: push, pull_request, schedule, workflow_dispatch). jobs (parallel units of work). steps (sequential commands in a job). uses (reuse community actions). env (env vars). secrets (encrypted, injected at runtime). needs (job dependency: run after another job) Runners are stateless: no file persistence between runs without artifacts or cache. Pin action versions (actions/checkout@v4, not @main); for third-party actions prefer a full commit SHA, because tags can be moved. Unpinned = supply chain risk. GITHUB_TOKEN default permissions depend on repository/organization settings, so declare permissions explicitly in the workflow GitHub Actions docs
Standard Node.js CI Pipeline Practical pipeline covering the full quality gate for a Node.js/TypeScript project name: CI. on: [push, pull_request]. jobs: ci: runs-on: ubuntu-latest. steps: checkout → setup-node (node-version: 20, cache: npm) → npm ci → npm run lint → npx tsc --noEmit → npm test -- --coverage → npm run build npm ci (not npm install) — uses lockfile exactly, fails if lockfile is out of sync. Missing tsc --noEmit = TypeScript errors only caught at runtime. Missing npm run build in CI = build errors discovered at deploy time, not commit time
Caching Dependencies Reinstalling node_modules on every CI run wastes 1–3 minutes. Cache based on lockfile hash — reinstall only when dependencies change actions/cache with key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}. Or use actions/setup-node with cache: 'npm' (handles caching automatically). Turborepo Remote Cache: cache entire build outputs — unchanged packages don't rebuild Wrong cache key = stale node_modules served from cache. Cache key must include OS + lockfile hash. Turborepo remote cache requires TURBO_TOKEN secret in CI actions/cache
Parallel Jobs Independent jobs (lint, test, build) can run simultaneously, which reduces total pipeline time jobs: lint: ..., typecheck: ..., test: ... (all run in parallel). deploy: needs: [lint, typecheck, test] (runs only after all pass). Matrix builds: strategy.matrix.node-version: [18, 20, 22] to test across multiple Node versions Too many parallel jobs = exceed free tier minute limits. Matrix without filtering = job count multiplies with every added dimension. Use concurrency groups to cancel outdated runs on new push: concurrency.group: ${{ github.workflow }}-${{ github.ref }} together with concurrency.cancel-in-progress: true (without it, a newer run only queues behind the one in progress)
Deployment Strategies How you ship code determines your risk, downtime, and rollback capability Blue-Green: two identical envs (blue=live, green=new), switch traffic instantly → zero downtime + instant rollback, double cost. Rolling: replace instances one at a time → zero downtime, slower rollback. Canary: route 5% of traffic to new version → validate before full rollout. Feature Flags: deploy code dark, toggle via config → decouple deploy from release entirely Blue-green: DB schema migrations are tricky — both versions run simultaneously, so old version must handle new schema. Canary: user hits old + new version in same session = state inconsistency risk. Feature flags: flag debt accumulates — schedule cleanup of old flags LaunchDarkly
Environment Strategy Separate environments for each stage of confidence: dev → staging → production. Each has its own config, DB, and secrets dev: local machine or ephemeral preview deploy (Vercel preview, Railway PR deploys). staging: mirrors production, used for QA and smoke tests. production: real users. GitHub Environments: configure protection rules (required reviewers, wait timer) per environment. environment: production in job config triggers approval gate Staging data diverging from production = staging tests don't reflect prod behavior. No staging env = testing in production. Auto-deploy to production without approval = one bad merge ships immediately GitHub Environments
Secrets Management Never store secrets in YAML files or source code. Inject at runtime from a secure store GitHub Secrets: Settings → Secrets and Variables → accessible as ${{ secrets.DATABASE_URL }} — masked in logs. AWS Secrets Manager: rotation, audit log, cross-service access. HashiCorp Vault: self-hosted, fine-grained. Doppler: developer-friendly sync to GitHub/Vercel/AWS. In Actions: env: DATABASE_URL: ${{ secrets.DATABASE_URL }} Secret committed to git = rotate immediately even if repo is private. NEXT_PUBLIC_ prefix = secret visible in browser bundle. Secrets should support rotation — design app to handle new credentials without restart GitHub Encrypted Secrets
Docker in CI Build and push Docker images as part of CI, then deploy the image — not raw source code actions/checkout → docker/setup-buildx-action → docker/login-action (GHCR or DockerHub) → docker/build-push-action with cache-from: type=gha and cache-to: type=gha,mode=max → tag with git SHA (not latest). Pull exact SHA tag in deploy step Using latest tag in CD = non-deterministic deploys. GitHub Actions cache for Docker layers dramatically speeds up image builds. Multi-platform builds (linux/amd64,linux/arm64) needed if deploying to ARM (AWS Graviton) docker/build-push-action
Release & Versioning Automate version bumps and changelogs from conventional commits semantic-release: reads conventional commits → bumps package.json version → generates CHANGELOG.md → creates GitHub Release → publishes to npm. release-please (Google): PR-based approach — creates a release PR that batches version bump + changelog. Manual: npm version patch/minor/major + git tag Without automated versioning: manual bumps get forgotten or inconsistent. semantic-release requires strict conventional commit discipline. release-please is safer for teams — change is a PR, not an automatic push semantic-release
2026 — AI in CI/CD AI is now integrated into the CI/CD feedback loop — speeding up review, detection, and fix cycles GitHub Copilot code review in PRs: suggests fixes inline. Automated dependency updates: Renovate Bot + Dependabot auto-creates PRs for dep updates with changelogs. AI-generated PR descriptions from diff. Flaky test detection: GitHub Actions now surfaces flaky tests automatically. OpenTelemetry traces from CI builds (build observability) AI review suggestions can miss business context. Auto-merged Renovate PRs without test coverage = shipping broken updates. Treat Dependabot PRs like any other code — CI must pass before merge Renovate Bot
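
The Caching Dependencies row keys the cache on OS plus lockfile hash. The sketch below imitates that idea locally to show why the key stays stable across source edits and changes when dependencies change. It only mimics the shape of the key: the real `hashFiles` expression computes its digest differently, and the helper names `hash_file` and `cache_key` are invented for this example.

```shell
# hash_file FILE: print the SHA-256 of FILE (sha256sum on Linux, shasum on macOS).
hash_file() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum "$1" | awk '{print $1}'
  else
    shasum -a 256 "$1" | awk '{print $1}'
  fi
}

# cache_key OS LOCKFILE: build a key shaped like "<os>-node-<lockfile hash>".
cache_key() {
  printf '%s-node-%s\n' "$1" "$(hash_file "$2")"
}

work=$(mktemp -d)
printf '{"lockfileVersion": 3, "packages": {}}\n' > "$work/package-lock.json"
key_before=$(cache_key Linux "$work/package-lock.json")

# A source edit does not touch the lockfile, so the key (and the cache hit) is unchanged.
echo "console.log('hi')" > "$work/index.js"
key_same=$(cache_key Linux "$work/package-lock.json")

# A dependency change rewrites the lockfile, which produces a new key.
printf '{"lockfileVersion": 3, "packages": {"left-pad": {}}}\n' > "$work/package-lock.json"
key_after=$(cache_key Linux "$work/package-lock.json")

echo "before: $key_before"
echo "after:  $key_after"
```

Leaving the OS out of the key is the "wrong cache key" failure the row warns about: a cache built on one runner OS could then be restored on another.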

DevOps for Developers — 04. Docker & Containerization

Mental Model: A container is an isolated process on the host OS kernel — not a VM. It packages the app + its exact runtime dependencies into a portable unit that runs identically everywhere. Image = immutable blueprint. Container = running instance. Docker solved "works on my machine." Containers are now the universal unit of deployment.

Skill Core Concepts & Mental Model Tools & Techniques Tradeoffs & Failure Modes Resources
Core Concepts Image: read-only blueprint (like a class). Container: running instance (like an object). Registry: image storage (Docker Hub, GHCR, ECR). Images are built in layers: each Dockerfile instruction adds a layer. Layers are cached and reused across builds docker build -t myapp:1.0 . (build), docker run -p 3000:3000 myapp:1.0 (run), docker ps (list running), docker images (list local images), docker logs -f container-id (follow logs), docker exec -it container-id sh (enter container shell), docker stop / docker rm (stop / remove a container) Images are immutable: never modify a running container, rebuild the image. Large images = slow CI push/pull + slow cold starts. Tag images with git SHA not latest, because latest is non-deterministic Docker docs
Dockerfile Text file with ordered instructions to build an image. Each instruction creates a new layer. Layer order = cache efficiency: put rarely-changing steps first (install deps), frequently-changing last (copy source) FROM node:20-alpine (base: the Alpine layer is only ~5MB, so the image is several times smaller than the Debian-based node:20, which is around 1GB). WORKDIR /app. COPY package*.json ./ (lockfile first). RUN npm ci. COPY . . (source last). RUN npm run build. EXPOSE 3000. USER node (non-root). CMD ["node", "dist/server.js"] node:latest = unpredictable version changes. Running as root = security risk. COPY . . before npm ci = cache bust on every source change = slow builds. Leaving devDependencies in production image = bloated + larger attack surface Dockerfile Best Practices
Multi-stage Builds Multiple FROM stages in one Dockerfile. Build stage: install all deps + compile TypeScript. Production stage: compiled output plus production dependencies only, with no devDeps, no source, no build tools. Result: dramatically smaller, more secure image Stage 1 (builder): FROM node:20-alpine AS builder → WORKDIR /app → COPY package*.json ./ → RUN npm ci → COPY . . → RUN npm run build. Stage 2 (runner): FROM node:20-alpine AS runner → WORKDIR /app → COPY package*.json ./ → RUN npm ci --omit=dev → COPY --from=builder /app/dist ./dist → USER node → CMD ["node", "dist/server.js"]. Copying /app/node_modules from the builder instead would carry devDependencies into the final image Without multi-stage: image can be 1.5GB+. With multi-stage: ~150–200MB. Next.js standalone output + multi-stage = ~120MB. Test stage: add a third stage that runs tests, so the build fails if tests fail Multi-stage builds
.dockerignore Prevents files from being sent to Docker build context and baked into the image. Equivalent to .gitignore for Docker. Missing it = slow builds + secrets in image Essentials: node_modules (rebuilt inside), .git, .env (never bake secrets into image), dist (rebuilt), *.log, .DS_Store, README.md, coverage/, .github/ .env in image = secrets visible to anyone who pulls the image. node_modules in build context = slow build (Docker uploads entire folder to daemon). Always create .dockerignore before writing Dockerfile
Docker Compose Define and run multi-container apps with a single YAML file. Manages networking, volumes, env vars, and service dependencies. The standard tool for local development environments docker-compose.yml: services (app, postgres, redis), ports, environment, volumes, depends_on, healthcheck. docker compose up -d (background), docker compose logs -f app (follow logs), docker compose down -v (stop + remove volumes) depends_on only waits for container start, not readiness — use healthcheck + condition: service_healthy. Hardcoded passwords in docker-compose.yml = committed to git. Use .env file + ${VAR} substitution Docker Compose docs
Volumes & Persistence Containers are ephemeral: files written inside a container survive a restart but are lost when the container is removed or recreated (for example on every redeploy). Volumes persist data outside the container lifecycle Named volume (Docker-managed): docker run -v mydata:/var/lib/postgresql/data. Bind mount (maps host dir): docker run -v $(pwd):/app (dev hot-reload). docker volume inspect mydata. docker volume ls. Backup: docker run --rm -v mydata:/data -v $(pwd):/backup alpine tar czf /backup/data.tar.gz /data Never store DB data inside container: always use named volume. Bind mounts in production = host path dependency. Volume permissions: container UID must match host file ownership or use --user flag. Named volumes need explicit backup strategy
Networking Containers on the same user-defined network reach each other by service name (DNS). Bridge network isolates from host. Each docker-compose project gets its own default network docker network create mynet. Service DNS in compose: postgres:5432, redis:6379 (use the service name as the hostname; these are not HTTP URLs). Expose vs Publish: EXPOSE only documents the port (internal), -p 3000:3000 maps it to the host. docker network inspect bridge Never expose DB port to host in production (-p 5432:5432 on DB). Only the app container should reach the DB. Containers on different networks can't communicate: explicit shared network needed. host.docker.internal = reach host machine from container (Mac/Windows) Docker networking
Security Best Practices Containers reduce attack surface but introduce new risks if misconfigured Run as non-root: USER node (add before CMD). Read-only filesystem: docker run --read-only. Drop capabilities: --cap-drop=ALL --cap-add=NET_BIND_SERVICE. No secrets in ENV in Dockerfile (visible in docker inspect and image layers). Use Docker secrets or mount secrets as files at runtime. Scan images: docker scout cves myapp:latest or trivy image myapp:latest Root in container = root on host if container escapes (rare but possible). ENV secrets baked into image layers = visible in docker history. Trivy/Scout catches known CVEs in base image and dependencies — run in CI before push Trivy
Useful Patterns Common Docker patterns used in real projects Health check: HEALTHCHECK --interval=30s --timeout=3s CMD curl -f http://localhost:3000/health || exit 1 (curl is not included in Alpine-based images by default: install it or use wget). Init process: use dumb-init as PID 1 to forward signals correctly (FROM node:20-alpine → RUN apk add --no-cache dumb-init → ENTRYPOINT ["dumb-init", "--"]). Debug running container: docker exec -it container-id sh. Copy file from container: docker cp container-id:/app/log.txt .
2026 — Docker in Modern Dev Docker usage patterns have evolved — new tooling makes containers faster and more ergonomic Docker Init: docker init auto-generates optimized Dockerfile + compose file for your project. Dev Containers (VS Code / Cursor): develop inside a container — entire dev environment is reproducible and shareable (.devcontainer/devcontainer.json). Testcontainers: spin up real Docker containers in integration tests (real PostgreSQL, real Redis — no mocks). Docker Scout: built-in vulnerability scanning replacing third-party tools Dev Containers solve "new developer setup takes 2 days." Testcontainers are now the standard for integration testing — much more reliable than mocked DB clients. Docker Desktop alternatives: OrbStack (Mac — faster, lighter than Docker Desktop) Dev Containers spec, Testcontainers
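
The Multi-stage Builds and .dockerignore rows can be written out as actual files. The sketch below scaffolds both into a scratch directory so that nothing in a real project is overwritten. It assumes a Node.js project whose `build` script emits `dist/server.js`, as in the rows above, and installs production dependencies in the runner stage with `npm ci --omit=dev` so devDependencies stay out of the final image.

```shell
# Scaffold into a scratch directory; copy the files into a real project to use them.
project_dir=$(mktemp -d)

# Quoting 'EOF' stops the shell from expanding anything inside the heredoc.
cat > "$project_dir/Dockerfile" <<'EOF'
# Stage 1: install all dependencies (including devDependencies) and compile.
FROM node:20-alpine AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build

# Stage 2: runtime image with production dependencies and compiled output only.
FROM node:20-alpine AS runner
WORKDIR /app
ENV NODE_ENV=production
COPY package*.json ./
RUN npm ci --omit=dev
COPY --from=builder /app/dist ./dist
EXPOSE 3000
USER node
CMD ["node", "dist/server.js"]
EOF

cat > "$project_dir/.dockerignore" <<'EOF'
node_modules
.git
.env
dist
*.log
coverage/
EOF

echo "Wrote $project_dir/Dockerfile and $project_dir/.dockerignore"
# After copying both files into a project root (requires Docker and git):
#   docker build -t myapp:"$(git rev-parse --short HEAD)" .
```

Everything from the builder stage except `/app/dist` is discarded, which is where the size reduction described in the row comes from.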

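The Docker Compose row points out that depends_on waits for the container to start, not for the service to be ready. The idea behind healthcheck + condition: service_healthy is readiness polling, which can be sketched as a retry loop. The function name `wait_for` is hypothetical, and the marker file stands in for a real probe such as `nc -z postgres 5432`.

```shell
# wait_for TIMEOUT CMD...: retry CMD once per second until it succeeds
# or TIMEOUT seconds have passed. Returns 0 on success, 1 on timeout.
wait_for() {
  timeout=$1
  shift
  elapsed=0
  until "$@" >/dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 0
}

# Demo: a background job creates a marker file after one second,
# standing in for a database that takes a moment to accept connections.
ready_dir=$(mktemp -d)
(sleep 1; touch "$ready_dir/ready") &

if wait_for 5 test -f "$ready_dir/ready"; then
  echo "service ready"
else
  echo "timed out waiting for service"
fi
```

In Compose itself the healthcheck is the better home for this logic, since the orchestrator then knows the state; a loop like this is mainly useful in entrypoint scripts and CI steps.
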
DevOps for Developers — 05. Cloud — AWS for Developers

Mental Model: As a developer you don't need to know every AWS service (there are 200+). You need to know: how to deploy an app, store files, manage a database, handle secrets, send traffic to the right place, and know when things break. AWS is the default cloud — these services appear in almost every production stack.

Skill Core Concepts & Mental Model Key Services & Techniques Tradeoffs & Failure Modes Resources
Core Concepts Region: geographic cluster of data centers. AZ (Availability Zone): one or more isolated data centers within a region. VPC (Virtual Private Cloud): your private network in AWS. IAM: identity and access management, which controls who/what can do what. Every AWS API call is authorized through IAM Always deploy across multiple AZs for high availability. us-east-1 is a common default (widest service availability, usually among the lowest prices). IAM roles for EC2/ECS (not access keys hardcoded in code). Least privilege: grant minimum permissions needed Single-AZ deploy = one AZ outage takes down your app. IAM access keys in code = credential leak. IAM role vs IAM user: roles are preferred for services (no long-lived credentials) AWS docs
EC2 Virtual machine in the cloud. You control OS, runtime, and config. Most flexible but most maintenance. t3.micro (free tier) → t3.medium → c5 (compute) → r5 (memory) instance types Launch: choose AMI (Amazon Linux 2023 or Ubuntu 22.04), instance type, security group (firewall rules), key pair (SSH access). Connect: ssh -i key.pem ec2-user@ip. User data script: runs on first boot (install Node, pull image, start app). Elastic IP: static IP for your instance EC2 = you manage patching, scaling, and OS. Security group: inbound rules control what ports are open — only open ports 22 (SSH, restricted to your IP), 80, 443. Never open 0.0.0.0/0 on port 22 (brute-force target). Use Systems Manager Session Manager instead of SSH for production access EC2 docs
ECS (Elastic Container Service) Run Docker containers without managing servers (Fargate mode) or on EC2. The standard way to deploy containerized apps on AWS. Task Definition = the container's run configuration, closer to a docker run command or a Compose service than to a Dockerfile (image, CPU, memory, env vars, port). Service = desired count of running tasks + auto-restart ECS Fargate: serverless containers, no EC2 to manage. Task Definition: image URI (ECR), CPU (256/512/1024), memory, environment variables (from Secrets Manager), port mappings. Service: desired count, auto-scaling, ALB integration. Rolling deploy: replace tasks one at a time Fargate cold starts: ~10–30s to start a new task. CPU/memory limits are hard: the app is OOM-killed if it exceeds its memory limit. ECR image pull timeout if image is too large. ECS Service auto-restarts failed tasks, so check CloudWatch logs when tasks keep restarting ECS docs
ECR (Elastic Container Registry) Private Docker image registry in AWS. Stores images for ECS/EKS deployment. Integrated with IAM for access control Login: aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <account>.dkr.ecr.<region>.amazonaws.com. Tag: docker tag myapp:latest <ecr-uri>:latest. Push: docker push <ecr-uri>:latest. Lifecycle policy: auto-delete old images (keep last 10 tagged images) Without lifecycle policy: ECR storage costs grow unbounded (every CI push stores a new image). Always push git SHA tag + latest, never only latest (can't roll back to previous). ECR image scanning: enable on push to catch CVEs automatically
RDS (Relational Database Service) Managed PostgreSQL/MySQL. AWS handles: backups, patching, failover, replication. You focus on schema and queries Multi-AZ: primary + standby replica in different AZ — automatic failover. Read replicas: offload read traffic. Parameter groups: DB config (max_connections, shared_buffers). Subnet group: deploy RDS in private subnet (not public internet). Security group: only allow traffic from app server/ECS task security group Never put RDS in public subnet — only accessible from within VPC. Enable automated backups (default 7 days — extend to 30 for production). RDS Proxy: connection pooler between app and RDS — prevents connection exhaustion from Lambda/ECS scaling RDS docs
S3 (Simple Storage Service) Object storage for files, images, backups, static sites. Virtually unlimited capacity, designed for 11 nines (99.999999999%) durability. Flat key-value store: no real folders, just key prefixes Bucket: container for objects. Object key: the "path" (images/user-123/avatar.jpg). Pre-signed URL: time-limited URL for direct client upload/download, no server proxy needed. Bucket policy vs ACL: use bucket policy (ACL is legacy). Storage classes: Standard → Infrequent Access → Glacier (archival). Lifecycle rules: auto-transition old objects to cheaper tier Public bucket = anyone on the internet can read your objects (and write them if the policy allows it). Default: block all public access. Use pre-signed URLs for user uploads so large files bypass your server. S3 egress (data out) is expensive; serve via CloudFront CDN instead S3 docs
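The "no real folders, just prefixes" point can be illustrated without touching AWS. The sketch below imitates how a prefix-plus-delimiter listing groups flat keys into folder-like entries; it is a local simulation with a hypothetical function name, not the S3 API:

```python
def list_objects(keys, prefix="", delimiter="/"):
    """Group flat keys the way a prefix/delimiter listing does.

    Returns (objects directly under the prefix, folder-like common prefixes).
    """
    objects, common_prefixes = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Everything up to the next delimiter looks like a "folder".
            folder = rest.split(delimiter, 1)[0]
            common_prefixes.add(prefix + folder + delimiter)
        else:
            objects.append(key)
    return objects, sorted(common_prefixes)
```

Listing with prefix "images/" over the keys images/logo.png and images/user-123/avatar.jpg returns the logo as an object and images/user-123/ as a prefix, even though no folder object was ever created.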
CloudFront CDN for serving S3 files and dynamic content at edge. Cached responses come from a nearby edge location, which can cut latency from roughly 200ms to roughly 10ms for distant users. Integrates with S3, ALB, API Gateway, and custom origins Distribution: configure origin (S3 bucket or ALB). Cache behavior: which paths to cache (/* for static, /api/* bypass cache). Cache invalidation: aws cloudfront create-invalidation --distribution-id <id> --paths "/*". SSL: attach ACM certificate for HTTPS on custom domain. Lambda@Edge / CloudFront Functions: run code at edge (auth, redirects, A/B testing) CDN serves stale content until TTL expires, so use versioned filenames (app.[hash].js) for instant "invalidation." CloudFront invalidations typically take from seconds to a few minutes; the first 1,000 paths per month are free, then $0.005 per path (a wildcard such as /* counts as one path). Lambda@Edge has cold starts; CloudFront Functions are faster for simple logic CloudFront docs
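Versioned filenames work because a changed file gets a new URL, so the CDN never has a stale copy to serve. A minimal sketch of deriving the name from the file content; the function name and the 8-character digest length are arbitrary choices for illustration, and real bundlers do this for you:

```python
import hashlib
from pathlib import PurePosixPath


def versioned_name(filename: str, content: bytes, digest_len: int = 8) -> str:
    """Insert a short content hash before the extension: app.js -> app.<hash>.js."""
    digest = hashlib.sha256(content).hexdigest()[:digest_len]
    path = PurePosixPath(filename)
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))
```

Identical content always yields the same name, so unchanged assets stay cached across deploys while changed assets are fetched fresh.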
ALB (Application Load Balancer) L7 HTTP load balancer. Routes traffic to ECS tasks, EC2 instances, or Lambda. Required in front of any multi-instance deployment Target group: pool of instances/tasks. Health check: GET /health → 200 = healthy, 5xx or timeout = remove from rotation. Listener rules: route by path (/api/* → API target group, /* → frontend target group) or host (api.example.com vs app.example.com). HTTPS: attach ACM certificate to listener No health check = ALB routes to crashed instances. Health check interval too short = flapping. Connection draining: ALB waits for in-flight requests to complete before deregistering instance (set deregistration delay: 30s) ALB docs
IAM (Identity & Access Management) Controls who (user, service, role) can do what (action) on which resource. Principle of least privilege: grant minimum permissions needed IAM Role: attached to EC2/ECS/Lambda, provides temporary credentials, no long-lived keys. IAM Policy: JSON document defining permissions. Managed policies: AWS pre-built (AmazonS3ReadOnlyAccess). Inline policy: embedded in one specific user, group, or role. aws sts get-caller-identity (who am I?). IAM Access Analyzer: finds overly permissive policies IAM access keys in code/git = credential leak, so use IAM roles for services. Wildcard actions (s3:*) = too permissive. Never use the root account for daily operations; create an IAM user/role. Rotate access keys every 90 days if you must use them IAM Best Practices
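Wildcard actions are easy to spot mechanically before a policy ships. The sketch below shows a least-privilege policy document in the standard IAM JSON shape and a small checker; the bucket name and the function name are hypothetical:

```python
# A read-only policy for one bucket (bucket name is a placeholder).
READ_ONLY_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-bucket/*",
        }
    ],
}


def wildcard_actions(policy: dict) -> list:
    """Return every Allow action that contains a wildcard."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]
    found = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = stmt.get("Action", [])
        if isinstance(actions, str):  # Action may be a string or a list
            actions = [actions]
        found.extend(a for a in actions if "*" in a)
    return found
```

A check like this fits in CI as a cheap first filter; IAM Access Analyzer remains the thorough option.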
Secrets Manager & SSM Store and inject secrets securely. Secrets Manager: supports automatic rotation, versioning, cross-account access. SSM Parameter Store: simpler, cheaper, no auto-rotation Secrets Manager: aws secretsmanager get-secret-value --secret-id prod/myapp/db. SSM: aws ssm get-parameter --name /myapp/prod/DATABASE_URL --with-decryption. In ECS Task Definition: reference the secret ARN in the container's secrets list (name + valueFrom) → injected as an env var at container start. In Lambda: fetch at cold start, cache in module scope Fetching secrets on every request = latency + Secrets Manager rate limit. Cache secrets in memory (module scope) and refresh on a TTL or rotation event. Secrets Manager costs $0.40/secret/month, so use SSM for simple non-rotating secrets Secrets Manager docs
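The caching advice in this row can be sketched as a small in-memory cache with a time-to-live. The fetch function is injected so the sketch runs without AWS; in a real service it would wrap the Secrets Manager or SSM call. The class name and the 300-second default are assumptions:

```python
import time


class SecretCache:
    """Cache secret values in memory and refetch after ttl_seconds."""

    def __init__(self, fetch, ttl_seconds=300.0, clock=time.monotonic):
        self._fetch = fetch          # callable: secret_id -> value
        self._ttl = ttl_seconds
        self._clock = clock          # injectable for tests
        self._values = {}            # secret_id -> (value, expires_at)

    def get(self, secret_id):
        now = self._clock()
        cached = self._values.get(secret_id)
        if cached is not None and cached[1] > now:
            return cached[0]
        value = self._fetch(secret_id)
        self._values[secret_id] = (value, now + self._ttl)
        return value

    def invalidate(self, secret_id):
        """Drop a cached value, e.g. after a rotation event."""
        self._values.pop(secret_id, None)
```

Creating one SecretCache at module scope means warm invocations reuse the value instead of calling the secrets API on every request.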
CloudWatch AWS-native monitoring: logs, metrics, alarms, dashboards. Default destination for logs from ECS, Lambda, EC2 Log groups: ECS tasks that use the awslogs log driver write to the log group you configure (commonly /ecs/myapp). Log Insights: query logs with SQL-like syntax. Metrics: CPU, memory, request count, error rate. Alarms: trigger SNS notification or auto-scaling action. aws logs tail /ecs/myapp --follow (live logs in terminal) CloudWatch Logs retention default = never expires = unbounded cost. Set retention policy (30–90 days). Log Insights queries on large log groups are slow and expensive because cost follows data scanned, so narrow the time range and use structured JSON logs that queries can filter by field. Standard metrics have 1-minute resolution; custom high-resolution metrics go down to 1 second at extra cost CloudWatch docs
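Structured JSON logs are straightforward to produce with the standard library. A minimal sketch of a JSON formatter; the class name and the convention of passing extra fields under a "fields" key are assumptions for illustration:

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Optional structured fields passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str = "myapp") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as logger.info("order placed", extra={"fields": {"order_id": "A1"}}) then emits a line whose order_id can be filtered on directly in a log query.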
Lambda Run functions without managing servers. Event-triggered: HTTP (API Gateway), S3 event, SQS message, scheduled (EventBridge). Billed per invocation + duration Handler: exports.handler = async (event) => { return { statusCode: 200, body: JSON.stringify(result) } }. Cold start: ~100ms–1s for Node.js (provisioned concurrency keeps environments warm to avoid it). Timeout: max 15 minutes. Memory: 128MB–10GB (more memory = more CPU). Layers: shared dependencies across functions Cold starts are a real UX problem for user-facing APIs, so use ECS for low-latency APIs and Lambda for async processing. Each concurrent execution environment opens its own DB connections, so a traffic spike can exhaust the database; use RDS Proxy for DB access. Deployment package limit: 50MB zipped for direct upload and 250MB unzipped including layers; use a container image (up to 10GB) for larger dependencies Lambda docs
2026 — Modern Deployment Patterns Cloud deployment patterns have simplified significantly for developer teams App Runner: deploy a Docker image or GitHub repo with zero infrastructure config — AWS manages scaling, load balancing, TLS. Better than ECS for small teams. Copilot CLI: deploy ECS apps with one command (copilot init). CDK (Cloud Development Kit): define AWS infrastructure in TypeScript — more ergonomic than raw CloudFormation/Terraform for developers. Alternatives: Railway, Render, Fly.io (simpler than AWS for early-stage projects — no IAM, no VPC config) App Runner is more expensive than ECS at scale but dramatically simpler. CDK generates CloudFormation — still has CFN limitations. Railway/Render: excellent DX but less control and higher cost at scale. Fly.io: best for globally distributed apps (edge regions) AWS App Runner, AWS CDK

DevOps for Developers — 06. Kubernetes (Developer Basics)

Mental Model: Kubernetes (K8s) is a container orchestrator — it runs your Docker containers across a cluster of machines, handles restarts, scaling, and rolling deploys automatically. As a developer you don't need to operate a cluster, but you must understand enough to: read manifests, debug failing pods, deploy your app, and know what Kubernetes is actually doing when your deployment is "stuck."

Skill Core Concepts & Mental Model Key Techniques & Commands Tradeoffs & Failure Modes Resources
Core Architecture Control Plane: manages the cluster (API Server, etcd, Scheduler, Controller Manager). Node: worker machine (EC2/VM) that runs containers. Pod: smallest deployable unit — one or more containers that share network and storage. Kubernetes reconciles desired state (YAML) with actual state continuously kubectl get pods, kubectl get nodes, kubectl describe pod <name> (events + status), kubectl logs <pod> -f (live logs), kubectl exec -it <pod> -- sh (enter container), kubectl apply -f manifest.yaml, kubectl delete -f manifest.yaml K8s adds significant operational complexity — don't use it unless you have 3+ services that need independent scaling/deployment. Use ECS/App Runner/Fly.io for simpler setups. kubectl is your primary debugging tool Kubernetes docs
Pod Smallest deployable unit. Usually one container per pod. Pods are ephemeral: they die and are replaced. Never deploy a naked Pod (it is not rescheduled if its node fails and cannot be scaled or rolled out); always use a Deployment Pod spec: containers (image, ports, resources, env, volumeMounts), volumes, restartPolicy. Pods get a cluster-internal IP (not stable: a replacement pod gets a new IP). Pod-to-pod communication: use Service DNS, not pod IP directly Pod IP changes whenever a pod is replaced, so never hardcode pod IPs. Pod dying continuously (CrashLoopBackOff) = check kubectl logs and kubectl describe pod for events. OOMKilled = container exceeded its memory limit (limit too low or a memory leak)
Deployment Manages a ReplicaSet, which ensures N replicas of your pod are always running. Handles rolling updates and rollbacks. The standard way to run stateless apps deployment.yaml: apiVersion: apps/v1, kind: Deployment, spec.replicas: 3, spec.selector.matchLabels, spec.template (pod template). kubectl rollout status deployment/myapp. kubectl rollout undo deployment/myapp (rollback). kubectl scale deployment myapp --replicas=5 Rolling update default: maxSurge 25% and maxUnavailable 25%, so pods are replaced in small batches while most replicas keep serving (no downtime, provided readiness probes are set). maxSurge: how many extra pods may exist during the update. maxUnavailable: how many pods can be down during the update. Rollback only reverts the pod template; it does NOT revert DB migrations K8s Deployments
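The effect of maxSurge and maxUnavailable is easier to see with numbers. The sketch below resolves percentage values the way Kubernetes documents it (maxSurge rounds up, maxUnavailable rounds down); the function names are hypothetical:

```python
def _resolve(value, replicas: int, round_up: bool) -> int:
    """Turn "25%" or an absolute number into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = replicas * int(value[:-1])
        # Integer arithmetic avoids floating point rounding surprises.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas: int, max_surge="25%", max_unavailable="25%") -> dict:
    """Upper bound on total pods and lower bound on available pods."""
    surge = _resolve(max_surge, replicas, round_up=True)
    unavailable = _resolve(max_unavailable, replicas, round_up=False)
    return {
        "max_total_pods": replicas + surge,
        "min_available_pods": replicas - unavailable,
    }
```

For 3 replicas the defaults resolve to one extra pod and zero unavailable, which is why small Deployments appear to update one pod at a time; for 10 replicas the update may run up to 13 pods with at least 8 available.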
Service Stable DNS name + IP that routes traffic to matching pods. Pods come and go — Service provides a stable endpoint. Three types: ClusterIP (internal only), NodePort (exposes on each node's IP), LoadBalancer (creates cloud LB — use for external traffic) service.yaml: kind: Service, spec.selector (matches pod labels), spec.ports (port: 80, targetPort: 3000), spec.type: ClusterIP. DNS: http://myapp.default.svc.cluster.local or just http://myapp within same namespace. kubectl get services, kubectl describe service myapp ClusterIP only reachable within cluster. LoadBalancer creates an AWS/GCP LB per service (expensive) — use Ingress instead for multiple services. Service selector must exactly match pod labels or traffic won't route
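The rule that a Service selector must match pod labels comes down to a subset check: every key/value in the selector must appear in the pod's labels, while extra pod labels are ignored. A small simulation with hypothetical names:

```python
def selector_matches(selector: dict, pod_labels: dict) -> bool:
    """True if every selector key/value pair is present in the pod's labels."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


def matching_pods(selector: dict, pods: dict) -> list:
    """Names of pods (name -> labels) the Service would route to."""
    return sorted(name for name, labels in pods.items()
                  if selector_matches(selector, labels))
```

A selector of app: myapp matches a pod labeled app: myapp, tier: web but not one labeled app: my-app, which is the typo that typically leaves a Service with no endpoints.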
Ingress Routes external HTTP/HTTPS traffic to Services based on host/path rules. One Ingress Controller (NGINX, Traefik, AWS ALB Ingress) handles routing for all services, which is cheaper than one LoadBalancer per service ingress.yaml: kind: Ingress, rules: host: api.example.com, http.paths: path: /api, pathType: Prefix, backend.service.name: api-service. TLS: spec.tls with secret containing cert. Annotations configure NGINX behavior (rate limit, CORS, timeout) Ingress requires an Ingress Controller deployed in the cluster; it is not included by default. Path precedence: when several paths match, the longest matching path wins and Exact beats Prefix on a tie, so set pathType explicitly rather than relying on rule order. cert-manager automates TLS certificate provisioning from Let's Encrypt NGINX Ingress
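When several Ingress paths match a request, the Ingress specification gives precedence to the longest matching path, and Prefix matching works on whole path segments. A simplified simulation of that behavior for Prefix paths only; function names and service names are illustrative:

```python
def _segments(path: str) -> list:
    """Split a URL path into its non-empty segments."""
    return [part for part in path.split("/") if part]


def prefix_matches(rule_path: str, request_path: str) -> bool:
    """Segment-wise prefix match: /api matches /api/users but not /apix."""
    rule, request = _segments(rule_path), _segments(request_path)
    return request[:len(rule)] == rule


def pick_backend(rules, request_path: str):
    """rules is a list of (path, service); the longest matching path wins."""
    matches = [(path, svc) for path, svc in rules
               if prefix_matches(path, request_path)]
    if not matches:
        return None
    return max(matches, key=lambda match: len(_segments(match[0])))[1]
```

Because the longest match wins, listing the catch-all / before or after /api gives the same routing in this model.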
ConfigMap & Secret ConfigMap: non-sensitive config injected as env vars or files. Secret: sensitive data (base64 encoded — NOT encrypted by default in etcd). Both decouple config from container image configmap.yaml: kind: ConfigMap, data: NODE_ENV: production. secret.yaml: kind: Secret, data: DB_PASSWORD: <base64>. Reference in pod: env.valueFrom.configMapKeyRef or secretKeyRef. Mount as file: volumeMounts + volumes.configMap K8s Secrets are base64 encoded not encrypted — anyone with cluster access can read them. Use External Secrets Operator + AWS Secrets Manager for real encryption. Changing ConfigMap does not auto-restart pods — add a checksum annotation to trigger rollout K8s Secrets
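That base64 is an encoding and not encryption takes a few lines to demonstrate. The sample password below is made up, and the helper names are hypothetical:

```python
import base64


def encode_secret_value(plaintext: str) -> str:
    """Produce the base64 string that goes under `data:` in a Secret manifest."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Anyone holding the manifest can reverse the encoding without any key."""
    return base64.b64decode(encoded).decode("utf-8")
```

Decoding requires no key or password, which is why the row recommends an external secret store when real encryption is needed.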
Resources & Limits Every container must declare CPU and memory requests (what K8s reserves) and limits (hard cap). Scheduler uses requests to decide which node to place the pod on resources.requests: cpu: "250m" (250 millicores = 0.25 CPU), memory: "256Mi". resources.limits: cpu: "500m", memory: "512Mi". kubectl top pods (live CPU/memory usage). kubectl describe node (allocatable vs allocated) OOMKilled: container exceeded memory limit — increase limit or fix memory leak. CPU throttling: container exceeds CPU limit — not killed but slowed down (check kubectl top). No resource requests = scheduler places pods randomly = noisy neighbor problem. Always set both requests AND limits
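The quantity strings in this row ("250m", "256Mi") convert to plain numbers as follows. This sketch handles only the common suffixes shown and is not a full parser for Kubernetes quantities; the function names are hypothetical:

```python
def cpu_to_cores(quantity: str) -> float:
    """Convert a CPU quantity to cores: "250m" -> 0.25, "2" -> 2.0."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


# Binary suffixes are powers of 1024; M and G are powers of 1000.
_MEMORY_UNITS = {
    "Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3,
    "M": 1000 ** 2, "G": 1000 ** 3,
}


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity to bytes: "256Mi" -> 268435456."""
    for suffix in ("Ki", "Mi", "Gi", "M", "G"):
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * _MEMORY_UNITS[suffix]
    return int(quantity)  # no suffix means plain bytes
```

Note the difference between 512Mi and 512M: the first is about 5% larger, which matters when a limit is set close to actual usage.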
Liveness & Readiness Probes Liveness: is the container healthy? Fail → restart container. Readiness: is the container ready for traffic? Fail → remove from Service endpoints (no traffic). These are distinct purposes livenessProbe: httpGet: path: /health/live, port: 3000, initialDelaySeconds: 10, periodSeconds: 30. readinessProbe: httpGet: path: /health/ready, port: 3000, initialDelaySeconds: 5, periodSeconds: 10 Liveness probe that tests DB = one DB hiccup restarts all pods simultaneously (cascading failure). Readiness probe timeout too short = pod removed from rotation on slow response. No readiness probe = pod receives traffic before app is ready to serve
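The split between the two probes can be expressed as a single decision function that an HTTP handler would call. This is a framework-free sketch; the two boolean inputs are assumed to come from the application's own state:

```python
def probe_status(path: str, app_started: bool, db_reachable: bool) -> int:
    """Return the HTTP status code for a probe request."""
    if path == "/health/live":
        # Liveness: the process can answer, nothing more. Dependencies are
        # deliberately not checked, so a DB outage does not restart pods.
        return 200
    if path == "/health/ready":
        # Readiness: only accept traffic once started and dependencies work.
        return 200 if (app_started and db_reachable) else 503
    return 404
```

During a database outage this returns 200 for liveness and 503 for readiness, so pods are taken out of rotation but not restarted.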
HPA (Horizontal Pod Autoscaler) Automatically scale pod count based on CPU, memory, or custom metrics. Scale up when load increases, scale down when idle hpa.yaml: kind: HorizontalPodAutoscaler, spec.scaleTargetRef.name: myapp, spec.minReplicas: 2, spec.maxReplicas: 10, spec.metrics: cpu averageUtilization: 70. kubectl get hpa (shows current replicas + targets) Scale-down too aggressive = pods removed before load decreases → spikes cause thrashing. minReplicas: 1 = single point of failure. HPA requires metrics-server installed. Custom metrics HPA (queue depth, request rate) more accurate than CPU-based
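The HPA's core calculation, as described in the Kubernetes documentation, is desired = ceil(current replicas × current metric / target metric), skipped when the ratio is within a small tolerance of 1. A sketch of that formula with min/max clamping; the 0.1 tolerance is the documented default, and stabilization windows are left out:

```python
import math


def desired_replicas(current: int, current_util: float, target_util: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Replica count the autoscaler would aim for on one evaluation."""
    ratio = current_util / target_util
    if abs(ratio - 1.0) <= tolerance:
        desired = current  # close enough to target: no change
    else:
        desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

With 4 replicas at 90% CPU and a 70% target, the ratio is about 1.29 and the result is 6 replicas; at 72% the ratio is inside the tolerance and nothing changes.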
Namespaces Virtual clusters within a physical cluster. Isolate environments (dev/staging/prod) or teams. Resource quotas per namespace kubectl get namespaces, kubectl create namespace staging, kubectl apply -f manifest.yaml -n staging, kubectl config set-context --current --namespace=myapp (set default namespace). ResourceQuota: limit total CPU/memory per namespace Default namespace = no isolation. Don't run prod and dev in same namespace. Namespace doesn't provide network isolation by default — use NetworkPolicy for that. Cross-namespace service DNS: http://service.namespace.svc.cluster.local
Debugging Pods Systematic approach to diagnosing K8s issues CrashLoopBackOff: kubectl logs <pod> --previous (logs before crash). Pending: kubectl describe pod, then check Events for "Insufficient CPU/memory" or "Unschedulable." ImagePullBackOff: wrong image name, wrong tag, or missing registry credentials (imagePullSecrets). OOMKilled: increase memory limit or fix the leak. kubectl debug -it <pod> --image=busybox --target=<container> (attach an ephemeral debug container to a running pod); kubectl debug node/<node> -it --image=busybox (debug pod on the node itself) Most K8s issues are diagnosed via kubectl describe (events section) and kubectl logs. "ContainerCreating" stuck = volume mount issue or image pull issue. Events section in describe is the most useful debugging output K8s Troubleshooting
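The status-to-next-step mapping in this row can be kept as a small lookup, for example inside a team runbook script. The table restates the row above; the function name and the fallback choice are assumptions:

```python
# First commands to try for each pod status, taken from the row above.
NEXT_STEPS = {
    "CrashLoopBackOff": ["kubectl logs <pod> --previous", "kubectl describe pod <pod>"],
    "Pending": ["kubectl describe pod <pod>"],
    "ImagePullBackOff": ["kubectl describe pod <pod>"],
    "OOMKilled": ["kubectl describe pod <pod>", "kubectl top pods"],
    "ContainerCreating": ["kubectl describe pod <pod>"],
}

# Fallback for any status not listed: events first, then logs.
DEFAULT_STEPS = ["kubectl describe pod <pod>", "kubectl logs <pod>"]


def next_steps(status: str, pod: str) -> list:
    """Return the suggested commands with the pod name filled in."""
    steps = NEXT_STEPS.get(status, DEFAULT_STEPS)
    return [step.replace("<pod>", pod) for step in steps]
```

Every path leads through kubectl describe, which matches the row's point that the Events section is the most useful output.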
2026 — Managed K8s & Tooling K8s operational complexity has been abstracted — developers interact via higher-level tools EKS (AWS), GKE (Google — best managed K8s), AKS (Azure). Helm: package manager for K8s — install apps as charts (helm install postgres bitnami/postgresql). Kustomize: environment-specific config overlays (built into kubectl). Lens / k9s: GUI/TUI for K8s — vastly better DX than raw kubectl. ArgoCD: GitOps — K8s manifests in git, ArgoCD syncs cluster to git state automatically. Skaffold / Tilt: local K8s dev loop (hot reload in cluster) Most developers should use managed K8s (EKS/GKE) not self-hosted. ArgoCD is the 2026 standard for CD into K8s — declarative, auditable, self-healing. k9s is faster than kubectl for day-to-day debugging k9s, ArgoCD