System Administration Rules
User Management
- Create service accounts with the `--system` flag — no home directory, no login shell
- `sudo` with specific commands, not blanket ALL — principle of least privilege
- Lock accounts instead of deleting: `usermod -L` — preserves audit trail and file ownership
- SSH keys in `~/.ssh/authorized_keys` with restrictive permissions — 600 for file, 700 for directory
- `visudo` to edit sudoers — catches syntax errors before saving, prevents lockout
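
The SSH permission rule can be sketched as a small shell helper. `secure_ssh_dir` is a hypothetical name, not a standard tool; it assumes the caller owns the home directory passed as the first argument.

```shell
# secure_ssh_dir (hypothetical helper): apply 700 to .ssh and 600 to
# authorized_keys under the home directory given as $1.
secure_ssh_dir() {
    ssh_dir="$1/.ssh"
    mkdir -p "$ssh_dir"
    touch "$ssh_dir/authorized_keys"
    chmod 700 "$ssh_dir"
    chmod 600 "$ssh_dir/authorized_keys"
}
```

Usage would look like `secure_ssh_dir /home/deploy`, where `/home/deploy` is an example path only.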
Process Management
- `systemctl` for services, not `service` — systemd is standard on modern distros
- `journalctl -u service -f` for live logs — more powerful than tail on log files
- `nice` and `ionice` for background tasks — don't compete with production workloads
- Kill signals: SIGTERM (15) first, SIGKILL (9) last resort — SIGKILL doesn't allow cleanup
- `nohup` or `screen`/`tmux` for long-running commands — SSH disconnect kills regular processes
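
The signal-ordering rule might look like this in practice. This is a minimal sketch: `graceful_stop` is a hypothetical helper and the 10-second default grace period is an assumption, not a recommendation from this document.

```shell
# graceful_stop (hypothetical helper): send SIGTERM, wait up to $2 seconds
# (default 10, an assumed value), and send SIGKILL only if still running.
graceful_stop() {
    pid="$1"
    remaining="${2:-10}"
    kill -TERM "$pid" 2>/dev/null || return 0       # process already gone
    while [ "$remaining" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0      # exited after SIGTERM
        sleep 1
        remaining=$((remaining - 1))
    done
    kill -KILL "$pid" 2>/dev/null || true           # last resort, no cleanup
}
```

For example, `graceful_stop 4242 30` would give process 4242 (an illustrative PID) thirty seconds to clean up before it is killed.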
File Systems and Storage
- `df -h` for disk usage, `du -sh *` to find culprits — check before disk fills completely
- `lsof +D /path` finds processes using a directory — needed before unmounting
- `ncdu` for interactive disk usage — faster than repeated du commands
- Mount options matter: `noexec,nosuid` for security on data partitions
- Resize filesystems with care: grow is safe, shrink risks data loss — always back up first
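
A rough sketch of the "check before the disk fills" habit follows. `disk_use_percent` and `check_disk` are hypothetical helpers, and the 90% default threshold is an assumed value to adjust per system.

```shell
# disk_use_percent (hypothetical): integer use% of the filesystem holding $1.
# df -P forces the POSIX one-line-per-filesystem layout, so column 5 is use%.
disk_use_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_disk (hypothetical): warn and list the largest entries when usage
# reaches the limit given as $2 (default 90, an assumed threshold).
check_disk() {
    target="$1"
    limit="${2:-90}"
    used=$(disk_use_percent "$target")
    if [ "$used" -ge "$limit" ]; then
        echo "WARN: $target at ${used}% (limit ${limit}%)"
        # du -sk instead of -sh so the sizes sort numerically
        du -sk "$target"/* 2>/dev/null | sort -rn | head -n 5
        return 1
    fi
    echo "OK: $target at ${used}%"
}
```

Something like `check_disk /var 85` could run from cron, with the non-zero exit status feeding whatever alerting is already in place.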
Logs and Monitoring
- `logrotate` prevents disk fill — configure size limits and retention
- Centralize logs to external system — local logs lost if server dies
- `/var/log/auth.log` or `/var/log/secure` for login attempts — watch for brute force
- `dmesg` for kernel messages — hardware errors, OOM kills appear here
- Monitor inode usage, not just disk space — many small files exhaust inodes
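
Watching for brute-force attempts can start with a simple count per source address. The sketch below assumes OpenSSH's usual `Failed password for ... from <ip> port <n>` message format; `failed_logins_by_ip` is a hypothetical helper, and the log path depends on the distribution.

```shell
# failed_logins_by_ip (hypothetical): count failed SSH password attempts per
# source address in the log file given as $1, busiest address first.
# Assumes the address is the word following "from" in each matching line.
failed_logins_by_ip() {
    grep 'Failed password' "$1" |
        awk '{ for (i = 1; i < NF; i++) if ($i == "from") print $(i + 1) }' |
        sort | uniq -c | sort -rn
}
```

On a Debian-family host this might be run as `failed_logins_by_ip /var/log/auth.log | head`. For the inode point, `df -i` reports inode usage in the same layout that `df -h` uses for space.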
Permissions and Security
- `chmod 600` for secrets, `640` for configs, `644` for public — world-writable is almost never correct
- Sticky bit on shared directories (`chmod +t`) — users can only delete their own files
- `setfacl` for complex permissions — when traditional owner/group/other isn't enough
- `chattr +i` makes files immutable — even root can't modify without removing flag
- SELinux/AppArmor in enforcing mode — permissive logs but doesn't protect
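
Two of these rules are easy to check or apply from a script. Both function names below are hypothetical; the sketch relies only on POSIX `find` and `chmod`.

```shell
# find_world_writable (hypothetical): list regular files under $1 that any
# user can write to, without crossing into other mounted filesystems.
find_world_writable() {
    find "$1" -xdev -type f -perm -0002
}

# make_shared_dir (hypothetical): create a shared directory with the sticky
# bit set (mode 1777), so users can delete only their own files in it.
make_shared_dir() {
    mkdir -p "$1" && chmod 1777 "$1"
}
```

Running `find_world_writable /etc` on a healthy system would be expected to print nothing; any output is worth a closer look.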
Package Management
- `apt update` before `apt upgrade` — upgrade without update uses stale package lists
- Unattended security updates: `unattended-upgrades` — critical patches shouldn't wait
- Pin package versions in production — unexpected upgrades cause unexpected outages
- Remove unused packages: `apt autoremove` — reduces attack surface and disk usage
- Know your package manager: apt/yum/dnf/pacman — commands differ, concepts similar
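
The update-then-upgrade order can be captured in a script. This sketch assumes a Debian-family host and uses `apt-get` and `apt-mark hold`, the script-oriented counterparts of the `apt` commands above. `run`, `patch_host`, and `hold_package` are hypothetical names, and `DRY_RUN` is an assumed convention for previewing the commands without executing them.

```shell
# run (hypothetical): print the command when DRY_RUN is set, otherwise run it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# patch_host (hypothetical): refresh package lists, upgrade against the
# fresh lists, then remove dependencies that are no longer needed.
patch_host() {
    run apt-get update &&
    run apt-get upgrade -y &&
    run apt-get autoremove -y
}

# hold_package (hypothetical): keep package $1 at its current version until
# it is released again with 'apt-mark unhold'.
hold_package() {
    run apt-mark hold "$1"
}
```

With `DRY_RUN=1` set, `patch_host` only prints the three commands, which allows a review before running it for real as root.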
Backups
- Test restores regularly — backups that can't restore are worthless
- Include package lists and configs, not just data — recreating environment is painful
- Offsite backups mandatory — local backups don't survive disk failure or ransomware
- Back up before any risky change — "I'll just quickly edit" are famous last words
- Document restore procedure — 3am disaster is wrong time to figure it out
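
A restore test does not need to be elaborate. The sketch below archives a directory with `tar`, restores it into a scratch directory, and compares the result with the source. `backup_dir` and `verify_restore` are hypothetical helpers; the comparison is only meaningful while the source is unchanged, and the archive must live outside the directory being backed up.

```shell
# backup_dir (hypothetical): archive the contents of directory $1 into the
# gzip-compressed tar file $2 (which must be outside $1).
backup_dir() {
    src="$1"
    archive="$2"
    tar -czf "$archive" -C "$src" .
}

# verify_restore (hypothetical): extract archive $2 into a scratch directory
# and compare it with $1; returns non-zero on extraction failure or mismatch.
verify_restore() {
    src="$1"
    archive="$2"
    scratch=$(mktemp -d) || return 1
    tar -xzf "$archive" -C "$scratch" && diff -r "$src" "$scratch" >/dev/null
    rc=$?
    rm -rf "$scratch"
    return "$rc"
}
```

A scheduled job could call `backup_dir /etc /backup/etc.tar.gz && verify_restore /etc /backup/etc.tar.gz` (paths are examples) and alert when the second step fails.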
Performance
- `top`/`htop` for live view, `vmstat` for trends — understand baseline before diagnosing
- `iotop` for disk I/O bottlenecks — slow disk often blamed on CPU
- Load average: up to 1.0 per core is healthy — consistently higher means processes are queuing
- Swap usage isn't inherently bad — but consistent swapping indicates memory shortage
- `sar` for historical data — retroactively diagnose what happened during incident
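
The per-core load rule is simple arithmetic: divide the load average by the number of CPUs. A minimal sketch, assuming Linux (`/proc/loadavg`) and coreutils `nproc`; `load_per_core` is a hypothetical helper that uses the 5-minute average, the second field of `/proc/loadavg`.

```shell
# load_per_core (hypothetical): 5-minute load average divided by the number
# of available CPUs, printed with two decimals.
load_per_core() {
    cores=$(nproc)
    awk -v cores="$cores" '{ printf "%.2f\n", $2 / cores }' /proc/loadavg
}
```

By the rule above, a value that stays over 1.00 suggests work is queuing; a 5-minute load of 6.0 on a 4-core machine, for instance, works out to 1.50.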
Networking Basics
- `ss -tulpn` shows listening ports — `netstat` is deprecated
- `ip addr` and `ip route` replace `ifconfig` and `route` — learn the new tools
- Check both host firewall and cloud security groups — traffic blocked at either level fails
- `/etc/hosts` for local overrides — quick testing without DNS changes
- `curl -v` shows connection details — request and response headers, TLS handshake; add `-w` for timing
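
For scripting, the output of `ss` can be reduced to a plain list of port numbers. The sketch assumes the column layout of `ss -Hltn` from iproute2 (TCP listeners, no header, local address in the fourth column); `listening_ports` is a hypothetical filter that reads that output on standard input.

```shell
# listening_ports (hypothetical): read 'ss -Hltn' output on stdin and print
# the unique listening TCP port numbers in ascending order.
# The port is whatever follows the last ':' in the local address column,
# which also handles IPv6 forms such as [::]:22.
listening_ports() {
    awk '{ n = split($4, parts, ":"); print parts[n] }' | sort -un
}

# Usage on a live host (requires iproute2):
#   ss -Hltn | listening_ports
```

Comparing this list against the ports allowed in the host firewall and in cloud security groups is one way to spot a mismatch between the two layers.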
Common Mistakes
- Running services as root — one exploit owns the system
- No monitoring until something breaks — reactive is expensive
- Editing config without backup — `cp file file.bak` takes two seconds
- Rebooting to "fix" issues — masks the problem, it'll return
- Ignoring disk space warnings — 100% full causes cascading failures
- Forgetting timezone configuration — logs from different servers don't correlate
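
The backup-before-edit habit is easier to keep when it is a single command. In this sketch `backup_file` is a hypothetical helper; the timestamp suffix is an assumption that keeps repeated backups from overwriting each other, where the plain `file.bak` form keeps only the latest copy.

```shell
# backup_file (hypothetical): copy $1 to $1.bak.<timestamp>, preserving mode
# and modification time, and print the path of the backup.
backup_file() {
    file="$1"
    backup="$file.bak.$(date +%Y%m%d%H%M%S)"
    cp -p "$file" "$backup" && echo "$backup"
}
```

For example, running `backup_file /etc/ssh/sshd_config` before opening the file in an editor leaves a dated copy to roll back to.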