Linux Process Management & Troubleshooting Simulator

The Incident Response Playbook

The Complete Troubleshooting Framework

When an alert fires, senior engineers follow a systematic 10-step approach rather than panicking. This module walks through a complete production incident from the initial Zabbix alert to the final post-mortem. The framework is:

1. df -h → Identify the full partition
2. du -sh /* | sort -rh → Find the space hog
3. ps aux | grep → Hunt the rogue process
4. kill, then kill -9 → Terminate the process (SIGTERM first, SIGKILL only if it does not exit)
5. free -h → Check memory/swap health
6. top → Verify system is stabilizing
7. systemctl restart → Bring crashed services back
8. journalctl -u → Verify clean startup
9. chmod +x → Fix deploy script permissions
10. chown -R → Fix file ownership issues

🔍 Diagnostics

Identify what's wrong.

df -h — Disk usage per partition
du -sh /* — Find largest directories
ps aux | grep — Find processes
free -h — Memory and swap
top — Real-time system overview
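
The first diagnostic step can be scripted so that only the partitions that matter are listed. The sketch below is one possible approach, not part of the playbook itself: over_threshold is a hypothetical helper name and the 90% limit is an assumed value. It reads df -P output (the POSIX column layout) on standard input and assumes mount points contain no spaces.

```shell
#!/usr/bin/env bash
# over_threshold: hypothetical helper. Reads `df -P` output on stdin and prints
# "mountpoint capacity" for every filesystem at or above the given percentage.
# The 90% default is an assumption, not a value taken from the playbook.
over_threshold() {
    local limit=${1:-90}
    awk -v limit="$limit" '
        NR > 1 {                      # skip the header row
            use = $5                  # Capacity column, e.g. "95%"
            sub(/%/, "", use)         # strip the percent sign
            if (use + 0 >= limit) print $6, $5
        }'
}

# List filesystems that are at least 90% full on this machine.
df -P | over_threshold 90
```

Once a mount point is flagged, du -sh on its top-level directories piped through sort -rh narrows the search to the directory that is consuming the space.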

⚡ Actions

Fix the problem.

kill -9 PID — Force kill process
systemctl restart — Restart service
chmod +x — Fix permissions
chown user:group — Fix ownership

✅ Verify

Confirm it's fixed.

journalctl -u — Service logs
top — CPU/MEM stabilizing
df -h — Space recovered
ls -la — Verify permissions

Kill Signal Cheat Sheet

Signal Escalation:

kill PID — Send SIGTERM (15). Process can clean up gracefully.
Wait 5 seconds. If still running...
kill -9 PID — Send SIGKILL (9). Kernel terminates the process immediately; it cannot catch the signal or clean up.
Verify: ps -p PID — Confirm it's gone (no process row is listed). Unlike ps aux | grep PID, this cannot match the grep command itself.
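
The escalation above can be wrapped in a small function. This is a minimal sketch under stated assumptions: stop_process is a hypothetical helper name, the timeout defaults to the 5 seconds mentioned above, and the liveness check relies on bash reaping its own background children.

```shell
#!/usr/bin/env bash
# stop_process: hypothetical helper that sends SIGTERM, waits, then escalates.
stop_process() {
    local pid=$1 timeout=${2:-5} i

    # SIGTERM (15): ask the process to exit. A failure here usually means
    # the PID no longer exists, so there is nothing left to do.
    kill "$pid" 2>/dev/null || return 0

    for ((i = 0; i < timeout; i++)); do
        # Signal 0 delivers nothing; it only tests whether the PID exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
    done

    # SIGKILL (9): last resort, the process cannot catch or ignore it.
    kill -9 "$pid" 2>/dev/null
}

# Demo: a child that ignores SIGTERM, so the helper has to escalate.
bash -c 'trap "" TERM; exec sleep 30' &
stop_process "$!" 2
```

Polling once per second keeps the wait short when the process exits promptly, and SIGKILL is only sent after the full timeout has elapsed.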

Permission vs Ownership — Know the Difference

chmod changes what actions are allowed (read, write, execute).

chown changes who the file belongs to (user and group).

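Common mistake: a file has 755 permissions (rwxr-xr-x) but is owned by root. A non-root user can read and execute it but NOT write to it, because the write bit is set only for the owner, and the owner is root. Fix: chown devops:devops file so the intended user owns the file, then chmod 755 file to confirm the mode.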
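
Mode and ownership can be inspected separately, which makes it easier to tell which of the two is actually wrong. The sketch below uses a throwaway file and GNU stat (the -c format option is assumed to be available, as it is on typical Linux systems). The chown step is shown only as a comment because it needs root, and devops is the example account from the text above.

```shell
#!/usr/bin/env bash
# Work on a temporary file so nothing real is modified.
f=$(mktemp)

chmod 644 "$f"
stat -c '%a %U:%G %n' "$f"    # octal mode, then owner:group, then the path

chmod 755 "$f"                # changes what is allowed, not who owns the file
mode=$(stat -c '%a' "$f")
echo "$mode"                  # prints 755

# Changing the owner requires root, for example:
#   sudo chown devops:devops "$f"

rm -f "$f"
```

If the mode is already correct and a user still cannot write, the owner:group column is the place to look.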