The Incident Response Playbook
The Complete Troubleshooting Framework
When an alert fires, senior engineers follow a systematic 10-step approach rather than panicking. This module walks through a complete production incident from the initial Zabbix alert to the final post-mortem. The framework is:
1. df -h → Identify the full partition
2. du -sh /* | sort -rh → Find the space hog
3. ps aux | grep → Hunt the rogue process
4. kill, then kill -9 → Terminate the process (SIGTERM first, SIGKILL as a last resort)
5. free -h → Check memory/swap health
6. top → Verify system is stabilizing
7. systemctl restart → Bring crashed services back
8. journalctl -u → Verify clean startup
9. chmod +x → Fix deploy script permissions
10. chown -R → Fix file ownership issues
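Step 1 can be scripted so that only the partitions needing attention are shown. The sketch below is one possible approach, not part of the original playbook: over_threshold is a hypothetical helper name, and the 90 percent default is an assumed value. It reads df -P style output on standard input, and it assumes mount points contain no spaces.

```shell
#!/usr/bin/env bash
# Sketch: list mount points at or above a usage threshold.
# over_threshold is a hypothetical helper, not a standard tool.
# It reads `df -P` style output on stdin, so it can be fed sample data.
over_threshold() {
    local limit="${1:-90}"            # percent; 90 is an assumed default
    awk -v limit="$limit" '
        NR > 1 {                      # skip the header row
            pct = $5
            sub(/%/, "", pct)         # "97%" -> "97"
            if (pct + 0 >= limit) print $6, $5   # mount point, usage
        }'
}

# Usage against the live system (-P keeps each filesystem on one line):
#   df -P | over_threshold 90
```

Reading from standard input rather than calling df inside the function keeps the parsing step easy to check against saved output from an earlier incident.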
🔍 Diagnostics
Identify what's wrong.
df -h → Disk usage per partition
du -sh /* → Find largest directories
ps aux | grep → Find processes
free -h → Memory and swap
top → Real-time system overview
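The du step is usually repeated one directory level at a time until the space hog is found. A minimal sketch of that drill-down follows, assuming GNU du and sort; largest_dirs is a hypothetical helper name, and the default of 10 entries is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Sketch: show the N largest entries directly under a directory.
# largest_dirs is a hypothetical helper, not a standard tool.
largest_dirs() {
    local dir="${1:-/}" count="${2:-10}"
    # du -s: one total per argument; -h: human-readable sizes
    # sort -rh: largest first, understands K/M/G suffixes (GNU sort)
    # 2>/dev/null hides "Permission denied" noise when not run as root
    du -sh -- "$dir"/* 2>/dev/null | sort -rh | head -n "$count"
}

# Usage: start at the full partition, then descend into the biggest entry.
#   largest_dirs /var 5
#   largest_dirs /var/log 5
```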
⚡ Actions
Fix the problem.
kill -9 PID → Force kill process
systemctl restart → Restart service
chmod +x → Fix permissions
chown user:group → Fix ownership
✅ Verify
Confirm it's fixed.
journalctl -u → Service logs
top → CPU/MEM stabilizing
df -h → Space recovered
ls -la → Verify permissions
Kill Signal Cheat Sheet
1. kill PID → Send SIGTERM (15). Process can clean up gracefully.
2. Wait 5 seconds. If still running...
3. kill -9 PID → Send SIGKILL (9). Kernel terminates immediately.
4. Verify: ps aux | grep PID → Confirm it's gone.
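The escalation in this cheat sheet can be expressed as a small shell function. This is a sketch under stated assumptions rather than a definitive implementation: stop_process is a hypothetical helper name, and the 5-second grace period simply mirrors the wait suggested here.

```shell
#!/usr/bin/env bash
# Sketch: SIGTERM first, SIGKILL only if the process outlives the grace period.
# stop_process is a hypothetical helper, not a standard tool.
stop_process() {
    local pid="$1" grace="${2:-5}"           # grace period in seconds
    # Default signal is SIGTERM (15); give up if the PID cannot be signalled.
    kill "$pid" 2>/dev/null || return 1      # no such process, or no permission
    local waited=0
    while kill -0 "$pid" 2>/dev/null; do     # signal 0 only tests for existence
        if [ "$waited" -ge "$grace" ]; then
            kill -9 "$pid" 2>/dev/null       # SIGKILL cannot be caught or ignored
            break
        fi
        sleep 1
        waited=$((waited + 1))
    done
}

# Usage: stop_process 12345      (default 5-second grace period)
#        stop_process 12345 10   (allow 10 seconds for cleanup)
```

Polling with kill -0 avoids matching the grep command itself, which can happen when checking with ps aux | grep PID.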
Permission vs Ownership — Know the Difference
chmod changes what actions are allowed (read, write, execute).
chown changes who the file belongs to (user and group).
Common mistake: a file has 755 permissions but is owned by root. Mode 755 grants write access to the owner only, so a non-root user can read and execute the file but NOT write to it, even though the write bit is set. Fix: chown devops:devops file (as root or with sudo), then chmod 755 file.
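To see both halves of the problem at once, it helps to print the mode and the owner side by side before changing anything. The sketch below assumes GNU stat (the -c format option); describe_access is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: print mode and ownership for a file, then report whether the
# current user can write to it. describe_access is a hypothetical helper.
describe_access() {
    local file="$1"
    # GNU stat format: %a = octal mode, %U = owner name, %G = group name
    stat -c '%a %U:%G' -- "$file"
    if [ -w "$file" ]; then
        echo "writable by $(id -un)"
    else
        echo "not writable by $(id -un)"
    fi
}

# Usage: describe_access /path/to/deploy.sh
# A first line showing root as owner, followed by "not writable", points to
# an ownership problem (chown) rather than a permission problem (chmod).
```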




