# ZFS Homelab Management

## ⚠️ MANDATORY SKILL INVOCATION ⚠️
YOU MUST invoke this skill (NOT optional) when the user mentions ANY of these triggers:
- "check ZFS pool health", "ZFS status", "pool health", "zpool status"
- "setup ZFS replication", "configure replication", "sync ZFS datasets"
- "configure ZFS snapshots", "setup Sanoid", "snapshot retention"
- "optimize ZFS performance", "tune ZFS properties", "ZFS compression"
- "troubleshoot ZFS", "ZFS errors", "failed replication", "degraded pool"
- "schedule ZFS scrubs", "setup scrubbing", "monthly scrub"
- "check pool capacity", "pool space", "ZFS disk usage"
- Any mention of ZFS, Sanoid, Syncoid, zrepl, raidz1, or ZFS datasets
Failure to invoke this skill when triggers occur violates your operational requirements.
## Purpose
Comprehensive ZFS management for homelab environments with multi-device replication, automated snapshot management, performance optimization, and health monitoring.
**Read-Write Operations:** This skill performs both monitoring (read-only) and management (read-write) operations, including:
- Pool health checks and monitoring
- Snapshot creation and pruning
- Dataset replication between devices
- Property optimization
- Scrub scheduling
**Recommended Architecture:** Pull-based replication from a centralized backup server to 5 source devices.

**Based on research:** Synthesized from 130+ URLs, 56,000+ vector database entries, and official OpenZFS/Oracle/FreeBSD documentation.
## 🚨 DESTRUCTIVE OPERATIONS - CRITICAL SAFETY PROTOCOL
**ABSOLUTE REQUIREMENT:** NO DESTRUCTIVE COMMANDS WITHOUT EXPLICIT USER AUTHORIZATION AND DOUBLE CONFIRMATION
This section defines the MANDATORY safety protocol that MUST be followed for ALL destructive ZFS operations.
### Destructive Command Categories
**EXTREMELY DESTRUCTIVE (Permanent Data Loss):**

- `zfs destroy` - Permanently destroys datasets/snapshots (UNRECOVERABLE)
- `zfs destroy -R` - Recursively destroys all snapshots and child datasets (CATASTROPHIC)
- `zpool destroy` - Destroys entire pool and all data (CATASTROPHIC)
- `zfs rollback` - Rolls back to a snapshot, LOSING all intermediate changes (DATA LOSS)
- `zpool labelclear` - Removes ZFS labels, making data inaccessible (DATA LOSS)

**HIGHLY DESTRUCTIVE (Configuration/Availability Loss):**

- `zpool remove` - Removes a vdev from the pool (CANNOT BE UNDONE)
- `zpool detach` - Detaches a mirror device (removes redundancy)
- `zpool offline` - Takes a disk offline (reduces pool availability)
- `zpool replace` - Replaces a disk in the pool (requires resilver)
- `zpool clear` - Clears pool errors (can mask real problems)

**MODERATELY DESTRUCTIVE (Cleanup Operations):**

- `sanoid --prune-snapshots` - Deletes old snapshots per retention policy
- `syncoid --no-sync-snap` - Deletes snapshots on the destination
- `zpool scrub -s` - Stops an in-progress scrub (interrupts data verification)

**POTENTIALLY IRREVERSIBLE (Property Changes):**

- `zfs set recordsize=` - CANNOT be changed efficiently on existing data
- `zfs set compression=` - Only affects new data, not existing
- `zfs inherit` - Resets properties to inherited values (loses custom settings)
### MANDATORY CONFIRMATION PROTOCOL
YOU MUST FOLLOW THIS PROTOCOL FOR ALL DESTRUCTIVE OPERATIONS:
- User Request Required - NEVER suggest or execute destructive commands unprompted
- Explicit Command Presentation - Show the EXACT command that will be executed
- Impact Explanation - Clearly explain what will be destroyed/lost/changed
- Backup Verification - Confirm backups exist before proceeding
- Double Confirmation - Require user to confirm TWICE:
- First: "Do you want to execute this destructive operation?"
- Second: "Are you absolutely sure? This cannot be undone."
- Dry-Run When Possible - Use `--dry-run` or `-n` flags if available
- Manual Execution Only - Provide the command for the user to run manually; DO NOT execute automatically
### Safety Examples
**WRONG - NEVER DO THIS:**
User: "I have too many snapshots"
Assistant: "Let me clean those up for you."
[Executes: sanoid --prune-snapshots] ❌ NO CONFIRMATION
**CORRECT - ALWAYS DO THIS:**
User: "I have too many snapshots"
Assistant: "I can help you prune old snapshots. First, let me check what would be deleted:
Command: `sanoid --prune-snapshots --verbose --dry-run`
This will show which snapshots would be removed based on your retention policy in /etc/sanoid/sanoid.conf.
⚠️ WARNING: This is a DESTRUCTIVE operation. Deleted snapshots cannot be recovered.
Before proceeding:
1. Have you verified your retention policy is correct?
2. Do you have backups of critical data?
3. Are you sure you want to permanently delete old snapshots?
If yes, I'll provide the command for you to run manually."
[User confirms twice]
Assistant: "Here is the command to run manually:
```bash
sudo sanoid --prune-snapshots --verbose
```

⚠️ FINAL WARNING: This will PERMANENTLY DELETE old snapshots. Are you absolutely sure?"
[User confirms a second time: "Yes, delete them"]
Assistant: "Understood. Please run the command above. I cannot execute destructive operations automatically."
### Safe Operations (No Confirmation Needed)
These operations are read-only and safe to execute:
- `./scripts/pool-health.sh` - Pool health checks
- `zpool status` - View pool status
- `zfs list` - List datasets/snapshots
- `zpool list` - List pools
- `zfs get all` - View properties
- `sanoid --verbose --dry-run` - Preview snapshot operations
- `syncoid --dry-run` - Preview replication
## Setup
This skill uses ZFS commands directly and does not require additional credentials beyond SSH access for remote replication.
**Required for remote replication:**
Remote replication requires passwordless SSH authentication between the backup server and source devices. This allows Syncoid to pull snapshots automatically without manual intervention.
1. **SSH keys configured** (passwordless authentication):
```bash
# Generate SSH key on backup server
ssh-keygen -t ed25519 -C "zfs-replication"
# Copy to each source device
ssh-copy-id user@device1
ssh-copy-id user@device2
# ... repeat for all devices
# Test connectivity
ssh user@device1 echo "SSH working"
```

2. **ZFS delegation** (for non-root replication) - see the delegation sketch after this list:

   ```bash
   # On backup server
   zfs allow -u replication-user create,mount,receive backup/device1
   ```

3. **Sanoid/Syncoid installed** (for automation):

   ```bash
   sudo apt install sanoid   # Debian/Ubuntu
   sudo pkg install sanoid   # FreeBSD
   ```
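If the replication user also needs to send from the source devices, delegation must be granted on both ends. A minimal sketch, reusing the `replication-user` and the pool names from the examples above:

```bash
# On each source device: let the user snapshot, hold, and send the source pool
sudo zfs allow -u replication-user send,snapshot,hold tank

# On the backup server: let the user receive into the per-device target
sudo zfs allow -u replication-user create,mount,receive backup/device1

# Verify what has been delegated
zfs allow tank
```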
See README.md for detailed setup instructions.
## Commands

### Pool Health Check
```bash
# Check all pools
./scripts/pool-health.sh

# Check specific pool
./scripts/pool-health.sh tank

# JSON output for monitoring
./scripts/pool-health.sh --json
```
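If the helper script is unavailable, plain `zpool` commands give a quick baseline:

```bash
# Prints only pools with problems; otherwise reports "all pools are healthy"
zpool status -x

# One-line summary per pool: name, health, used capacity, fragmentation
zpool list -o name,health,capacity,fragmentation
```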
### Snapshot Management (Sanoid)
```bash
# Configure Sanoid
sudo cp assets/sanoid.conf.template /etc/sanoid/sanoid.conf
sudo nano /etc/sanoid/sanoid.conf

# Manual snapshot
sudo sanoid --take-snapshots --verbose

# Manual prune
sudo sanoid --prune-snapshots --verbose

# List snapshots
zfs list -t snapshot
```
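To automate both snapshot creation and pruning, Sanoid's `--cron` mode (equivalent to `--take-snapshots` plus `--prune-snapshots`) is typically run from cron. A sketch, assuming the Debian install path:

```bash
# /etc/cron.d/sanoid -- run every 15 minutes; binary path may differ by distro
*/15 * * * * root /usr/sbin/sanoid --cron
```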
### Replication Setup (Syncoid)
```bash
# Manual replication (pull-based)
syncoid --recursive user@device1:tank backup/device1

# With options
syncoid \
  --recursive \
  --no-privilege-elevation \
  --identifier=device1 \
  --compress=zstd-fast \
  user@device1:tank backup/device1
```
### Property Optimization
```bash
# Enable compression (always beneficial)
zfs set compression=lz4 pool/dataset

# Disable atime
zfs set atime=off pool/dataset

# Tune recordsize (workload-specific; applies only to newly written data)
zfs set recordsize=8K pool/databases   # Databases
zfs set recordsize=1M pool/media       # Large files
zfs set recordsize=128K pool/data      # Default
```
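After setting properties, it is worth verifying what actually took effect and how well compression is paying off:

```bash
# Show effective values, their sources, and the achieved compression ratio
zfs get compression,atime,recordsize,compressratio pool/dataset
```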
### Scrub Scheduling
```bash
# Manual scrub
zpool scrub poolname

# Check scrub status
zpool status poolname

# Pause scrub
zpool scrub -p poolname

# Add to cron (runs at 02:00 on the first Sunday of each month)
0 2 * * 0 [ $(date +\%d) -le 7 ] && /usr/sbin/zpool scrub tank
```
## Workflow

**When user asks about ZFS pool health:**

- "Check my ZFS pools" → Run `./scripts/pool-health.sh`
- "Is my pool degraded?" → Run `zpool status -v poolname`
- "When was the last scrub?" → Check pool health script output
- "Pool capacity warnings" → Check capacity (warn at 70%, critical at 80%; see the sketch below)
**When user asks about ZFS replication:**

- "Setup replication from 5 devices" → Configure pull-based syncoid jobs with staggered scheduling (see the detailed flow below)
- "Replicate device1 now" → Run syncoid command
- "Failed replication" → Check for resume tokens (sketch below), consult troubleshooting guide
- "Check replication status" → Review logs: `grep syncoid /var/log/syslog | tail -20`
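For the failed-replication case, an interrupted receive leaves a resume token on the destination. A sketch, assuming the pull target layout used above:

```bash
# Show any partial receives waiting to be resumed
zfs get -r receive_resume_token backup/device1

# Either rerun syncoid (it resumes by default), or abort the partial receive
sudo zfs receive -A backup/device1/tank
```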
**When user asks about snapshots:**

- "Setup automated snapshots" → Copy sanoid.conf template, configure retention policy, add to cron
- "List snapshots" → Run `zfs list -t snapshot` (SAFE - read-only)
- "Restore from snapshot" → ⚠️ DESTRUCTIVE - Use `zfs rollback` or `zfs clone` (REQUIRES DOUBLE CONFIRMATION; a non-destructive alternative is sketched below)
- "Too many snapshots" → ⚠️ DESTRUCTIVE - Run `sanoid --prune-snapshots` (REQUIRES DOUBLE CONFIRMATION, use `--dry-run` first)
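For restores, browsing the hidden `.zfs` directory or cloning is non-destructive, unlike rollback. A sketch (snapshot and dataset names are illustrative):

```bash
# Read-only: browse snapshot contents directly, with no ZFS changes at all
ls /tank/data/.zfs/snapshot/autosnap_2024-01-01_00:00/

# Read-write copy: clone the snapshot to a new dataset and copy files back
zfs clone tank/data@autosnap_2024-01-01_00:00 tank/data-restore
```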
**When user asks about performance:**

- "Optimize ZFS" → Check compression, atime, recordsize settings
- "Slow writes" → Check capacity (>80% impacts performance), consider adding a SLOG
- "Tune for databases" → Set `recordsize=8K`, enable `lz4` compression (sketch below)
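Because recordsize only applies efficiently to data written after it is set, database datasets are best created with the tuned properties up front. A sketch:

```bash
# Create the dataset with database-friendly properties before loading any data
zfs create -o recordsize=8K -o compression=lz4 -o atime=off tank/databases
```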
### Detailed Flow for Multi-Device Replication Setup

1. Install Sanoid on all 5 devices plus the backup server
2. Configure short retention on sources (24 hourly snapshots only)
3. Configure pull-based syncoid on the backup server:
   - Stagger devices by 15-20 minutes
   - Use the `--identifier` flag per device
   - Enable compression with `--compress=zstd-fast`
4. Set up ZFS delegation for non-root operation
5. Add to cron with a staggered schedule (see the sketch below)
6. Test a manual run before enabling automation
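A sketch of the staggered pull schedule on the backup server (user, host, and dataset names are illustrative; the binary path may differ by distro):

```bash
# /etc/cron.d/syncoid -- one pull job per device, offset 15 minutes apart
0  1 * * * backup /usr/sbin/syncoid --recursive --identifier=device1 --compress=zstd-fast user@device1:tank backup/device1
15 1 * * * backup /usr/sbin/syncoid --recursive --identifier=device2 --compress=zstd-fast user@device2:tank backup/device2
30 1 * * * backup /usr/sbin/syncoid --recursive --identifier=device3 --compress=zstd-fast user@device3:tank backup/device3
# ... continue for devices 4 and 5, offset by a further 15 minutes each
```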
## References

### scripts/

- `scripts/pool-health.sh` - Comprehensive pool health checker with JSON output support. Checks state, capacity, scrub status, and generates alerts.
### references/

- `references/command-reference.md` - Complete ZFS command syntax reference for zpool, zfs, sanoid, and syncoid commands with parameters and examples.
- `references/quick-reference.md` - Quick command cheatsheet for common ZFS operations.
- `references/troubleshooting.md` - Comprehensive troubleshooting guide covering:
  - Common replication failures (network, space, permissions, conflicts)
  - Pool health issues (degraded pools, high capacity, scrub errors)
  - Performance problems (slow writes/reads)
  - SSH/network issues
  - Recovery procedures
  - Decision trees for diagnostics
Load this reference when user encounters errors or requests troubleshooting assistance.
### assets/

- `assets/sanoid.conf.template` - Sanoid configuration template with homelab-optimized retention policies, ready to copy to /etc/sanoid/sanoid.conf.
## Notes

### 🚨 CRITICAL SAFETY REMINDER
ALL DESTRUCTIVE COMMANDS REQUIRE:
- Explicit user request
- Exact command shown to user
- Impact explanation
- First confirmation
- Second confirmation
- Manual execution (commands provided to user, NEVER auto-executed)
NEVER execute these commands automatically:
- `zfs destroy` - Permanent data loss
- `zpool destroy` - Catastrophic data loss
- `zfs rollback` - Loses intermediate changes
- `sanoid --prune-snapshots` - Deletes snapshots permanently
- Any command that modifies, removes, or destroys data
See "🚨 DESTRUCTIVE OPERATIONS - CRITICAL SAFETY PROTOCOL" section above for complete details.
### RAIDZ1 Critical Warnings
- Single parity - Can tolerate only 1 disk failure
- Two disk failures = complete data loss
- Monitor SMART data aggressively
- Monthly scrubs MANDATORY for data integrity
- Consider migrating to RAIDZ2 for critical data
### Pool Capacity Thresholds
- <70%: Optimal performance
- 70%: Warning (performance may degrade)
- 80%: Critical (fragmentation increases)
- 90%: Emergency (severe write degradation)
- >95%: Risk of pool exhaustion
### Important Concepts
- Backup does NOT replace scrubbing - Both are required
- Set recordsize BEFORE writing data - Cannot change efficiently after
- NEVER enable dedup in homelab - Requires roughly 5GB of RAM per TB of deduplicated data
- LZ4 compression is always beneficial - Minimal overhead, often improves performance
### Security Considerations
For production deployments:
- Use dedicated replication user (not root)
- Implement SSH key restrictions (see the sketch below)
- Use ZFS delegation for non-root replication
- Firewall rules limiting SSH to backup server IP
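A sketch of an SSH key restriction on each source device; the `from=` address stands in for the backup server's IP and the key material is a placeholder:

```bash
# ~/.ssh/authorized_keys entry for the replication key: only accept
# connections from the backup server and disable forwarding
from="192.0.2.10",no-port-forwarding,no-agent-forwarding,no-X11-forwarding ssh-ed25519 AAAA... zfs-replication
```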