🐧 Terminal Simulator — Module 3

Process Management & Troubleshooting

When the alert fires at 2 AM, you need to check disk space, hunt rogue processes, and fix permissions — fast.

STEP 1 / 10
df

Alert! Disk Space Critical — Identify the Full Partition

Real-World Scenario

Zabbix just fired a P1 alert: "CRITICAL: /var/log partition 100% full on prod-server-03." The application is failing to write logs and users are reporting 500 errors. You SSH into the server immediately. The usual first command in a disk space emergency is `df -h`: it shows all mounted filesystems at a glance so you can identify which one is full.

Technical Breakdown

`df` (disk free) reports filesystem disk space usage. Without flags, it shows sizes in raw 1K blocks, which are hard to read. `-h` (human-readable) scales them automatically to KB, MB, GB (in powers of 1024). `-T` adds the filesystem type column (ext4, xfs, tmpfs). In production, you almost always want `df -h`. Look at the "Use%" column: a common convention treats anything above 90% as a warning and anything above 95% as critical.

-h       Human-readable sizes (KB, MB, GB) instead of raw blocks.
-T       Show filesystem type (ext4, xfs, tmpfs, etc.).
-i       Show inode usage instead of block usage.
--total  Add a grand total row at the bottom.
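
To act on the "Use%" column from a script, a minimal sketch might filter df output by a threshold. It assumes `df -P` (the POSIX output format, one line per filesystem) rather than `df -h`, because its fixed columns are easier to parse. `flag_full_partitions` is a hypothetical helper name, and the default of 90 simply mirrors the warning level mentioned above.

```shell
# flag_full_partitions (hypothetical helper): read `df -P` style output on
# stdin and print "mountpoint use%" for filesystems at or above a threshold.
flag_full_partitions() {
  local threshold="${1:-90}"
  # df -P columns: Filesystem, blocks, Used, Available, Capacity, Mounted on
  awk -v limit="$threshold" '
    NR > 1 {                       # skip the header row
      pct = $5
      sub(/%/, "", pct)            # "97%" -> "97"
      if (pct + 0 >= limit) print $6, $5
    }'
}
```

For example, `df -P | flag_full_partitions 95` would print only the mount points at or above 95%. Mount points that contain spaces would need more careful parsing than this sketch does.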

Your Task

Check disk usage in human-readable format. Type: df -h

Quick Guide: Incident Response

Understanding the basics in 30 seconds

How It Works

  • df -h shows filesystem disk usage — identify which partition is full
  • du -sh * | sort -rh ranks files and directories by size to find what is eating disk space
  • ps aux | grep hunts specific processes by name
  • kill sends signals: SIGTERM (15, graceful) or SIGKILL (9, forced)
  • free -h checks RAM and swap usage — look at "available" not "free"
  • top gives real-time CPU, memory, and process monitoring
  • systemctl restart/status manages systemd services
  • journalctl -u reads service logs from the systemd journal
  • chmod controls file permissions — +x adds execute permission
  • chown changes file ownership — critical when root creates files
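
The free -h bullet above says to read "available" rather than "free". On Linux that figure corresponds to the MemAvailable line in /proc/meminfo, so a rough sketch of a scripted check could compute it as a percentage of total memory. `mem_available_pct` is a hypothetical helper name, and it reads meminfo-style text on stdin so it can be tried on canned data.

```shell
# mem_available_pct (hypothetical helper): read /proc/meminfo-style text on
# stdin and print available memory as a whole-number percentage of the total.
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }   # values are in kB
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total > 0) printf "%d\n", avail * 100 / total
      else exit 1                          # no MemTotal line found
    }'
}
```

On a live system this might be called as `mem_available_pct < /proc/meminfo`. A low MemFree value with a healthy MemAvailable value usually just means the kernel is using spare RAM for cache.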

Key Benefits

  • Complete incident response flow from alert to resolution
  • Efficient disk space diagnostics with df + du pipe chains
  • Safe process termination with proper signal escalation
  • Memory monitoring to prevent OOM killer situations
  • Service management with systemctl restart/reload
  • Log analysis with journalctl for post-incident verification
  • Permission and ownership fixes for CI/CD pipelines

Real-World Uses

  • Responding to Zabbix/Prometheus disk space alerts at 2 AM
  • Finding 90GB debug logs eating production disk space
  • Killing runaway cron jobs consuming 85% CPU
  • Restarting Nginx after disk-full crash
  • Verifying clean service startup with journalctl
  • Fixing deploy script permissions after git pull
  • Fixing root-owned log files blocking application writes

The Incident Response Playbook

The Complete Troubleshooting Framework

When an alert fires, senior engineers follow a systematic 10-step approach rather than panicking. This module walks through a complete production incident, from the initial Zabbix alert to the final ownership fix. The framework is:

1. df -h → Identify the full partition
2. du -sh * | sort -rh → Find the space hog
3. ps aux | grep → Hunt the rogue process
4. kill → Terminate the process (SIGTERM first, kill -9 only if it will not exit)
5. free -h → Check memory/swap health
6. top → Verify system is stabilizing
7. systemctl restart → Bring crashed services back
8. journalctl -u → Verify clean startup
9. chmod +x → Fix deploy script permissions
10. chown -R → Fix file ownership issues

🔍 Diagnostics

Identify what's wrong.

  • df -h — Disk usage per partition
  • du -sh /* — Find largest directories
  • ps aux | grep — Find processes
  • free -h — Memory and swap
  • top — Real-time system overview
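
As an illustration of the du step in the list above, here is a small sketch that ranks the direct children of a directory by size. `largest_entries` is a hypothetical helper name, and `sort -h` (human-numeric sort) is assumed to be available, as it is in GNU coreutils.

```shell
# largest_entries (hypothetical helper): print the N largest direct children
# of a directory, biggest first.
largest_entries() {
  local dir="${1:-.}" count="${2:-5}"
  # du -s: one total per argument; -h: human-readable sizes.
  # sort -h understands the size suffixes; -r puts the biggest first.
  # The * glob skips dotfiles, and errors on unreadable paths are discarded.
  du -sh -- "$dir"/* 2>/dev/null | sort -rh | head -n "$count"
}
```

Running `largest_entries /var/log 10` would list the ten largest files and directories directly under /var/log. Repeating it on the top result is a simple way to drill down to the actual space hog.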

⚡ Actions

Fix the problem.

  • kill -9 PID — Force kill process
  • systemctl restart — Restart service
  • chmod +x — Fix permissions
  • chown user:group — Fix ownership

✅ Verify

Confirm it's fixed.

  • journalctl -u — Service logs
  • top — CPU/MEM stabilizing
  • df -h — Space recovered
  • ls -la — Verify permissions
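
The restart and verify steps can be combined in a short sketch. `restart_and_verify` is a hypothetical helper name; it assumes a systemd host and relies only on `systemctl restart`, `systemctl is-active --quiet`, and `journalctl -u UNIT -n 20 --no-pager`.

```shell
# restart_and_verify (hypothetical helper): restart a systemd unit, report
# whether it is active, and show its most recent journal lines.
restart_and_verify() {
  local unit="${1:-}"
  if [ -z "$unit" ]; then
    echo "usage: restart_and_verify UNIT" >&2
    return 2
  fi
  # Restarting a system unit usually needs root (run via sudo if required).
  systemctl restart "$unit" || return 1
  local status=0
  if systemctl is-active --quiet "$unit"; then
    echo "$unit is active"
  else
    echo "$unit is not active after restart" >&2
    status=1
  fi
  # Show the 20 most recent journal lines for the unit either way.
  journalctl -u "$unit" -n 20 --no-pager
  return "$status"
}
```

A call such as `restart_and_verify nginx` would then cover both the action and the check in one step. Reading the journal lines is still worthwhile even when the unit reports active, since a service can start and then log errors.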

Kill Signal Cheat Sheet

Signal Escalation:
  1. kill PID — Send SIGTERM (15). Process can clean up gracefully.
  2. Wait 5 seconds. If still running...
  3. kill -9 PID — Send SIGKILL (9). Kernel terminates immediately.
  4. Verify: ps -p PID should list no process once it is gone (grepping ps aux for a PID also matches unrelated lines).
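
The escalation above can be sketched as a small function. `graceful_kill` is a hypothetical helper name, and the 5-second default is taken from the wait in step 2; a real service may need longer to shut down cleanly.

```shell
# graceful_kill (hypothetical helper): SIGTERM first, wait up to N seconds,
# then SIGKILL only if the process is still there.
graceful_kill() {
  local pid="$1" timeout="${2:-5}" waited=0
  # SIGTERM (15): ask the process to exit and let it clean up.
  kill -TERM "$pid" 2>/dev/null || return 1   # no such process, or no permission
  # kill -0 sends no signal; it only tests whether the PID can still be signalled.
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      kill -KILL "$pid" 2>/dev/null           # SIGKILL (9): cannot be caught
      break
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

For instance, `graceful_kill 4242 10` would give a process with the made-up PID 4242 ten seconds to exit before forcing it. Starting with SIGTERM gives the process a chance to flush buffers and remove lock files, which SIGKILL never allows.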

Permission vs Ownership — Know the Difference

chmod changes what actions are allowed (read, write, execute).

chown changes who the file belongs to (user and group).

Common mistake: a file has 755 permissions but is owned by root. A non-root user can read and execute it but NOT write to it, because the write bit in 755 applies only to the owner (root). Fix: sudo chown devops:devops file, then chmod 755 file if the mode also needs resetting.
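
To make the distinction concrete, here is a hedged sketch with two hypothetical helpers: `show_access` prints a file's octal mode and owner, and `fix_access` applies the ownership change before the mode change. It assumes a `stat` that supports the `-c` format flag, as GNU coreutils does.

```shell
# show_access (hypothetical helper): print octal mode and owner:group.
# Assumes a stat that supports -c, as GNU coreutils does.
show_access() {
  stat -c '%a %U:%G' -- "$1"
}

# fix_access (hypothetical helper): set ownership first, then the mode.
# Taking a file away from root normally requires running this via sudo.
fix_access() {
  local owner="$1" mode="$2" file="$3"
  chown "$owner" -- "$file" && chmod "$mode" -- "$file"
}
```

Running `show_access deploy.sh` before and after `fix_access devops:devops 755 deploy.sh` would show whether the owner, the mode, or both were the problem; the file name here is only an example.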