# Architecture Evaluation Framework

## Current Technology Stack
| Layer | Technology | Purpose |
|---|---|---|
| OS | Talos Linux | Immutable, API-driven Kubernetes OS |
| GitOps | Flux + ResourceSets | Declarative cluster state reconciliation |
| CNI/Network | Cilium | eBPF networking, network policies, Hubble observability |
| Storage | Longhorn | Distributed block storage with S3 backup |
| Object Storage | Garage | S3-compatible distributed object storage |
| Database | CNPG (CloudNativePG) | PostgreSQL operator with HA and backups |
| Cache/KV | Dragonfly | Redis-compatible in-memory store |
| Monitoring | kube-prometheus-stack | Prometheus + Grafana + Alertmanager |
| Logging | Alloy → Loki | Log collection pipeline |
| Certificates | cert-manager | Automated TLS certificate management |
| Secrets | ESO + AWS SSM | External Secrets Operator with Parameter Store |
| Upgrades | Tuppr | Declarative Talos/Kubernetes/Cilium upgrades |
| Infrastructure | Terragrunt + OpenTofu | Infrastructure as Code for bare-metal provisioning |
| CI/CD | GitHub Actions + OCI | Artifact-based promotion pipeline |
## Evaluation Criteria
When evaluating any proposed technology addition or architecture change, assess against these criteria:
### Principle Alignment

Score the proposal against each core principle (Strong/Weak/Neutral):

- **Enterprise at Home**: Does it reflect production-grade patterns?
- **Everything as Code**: Can it be fully represented in git?
- **Automation is Key**: Does it reduce or increase manual toil?
- **Learning First**: Does it teach valuable enterprise skills?
- **DRY and Code Reuse**: Does it leverage existing patterns or create duplication?
- **Continuous Improvement**: Does it make the system more maintainable?
### Stack Fit

- Does this overlap with existing tools? (e.g., adding Redis when Dragonfly exists)
- Does it integrate with the GitOps workflow? (Must be Flux-deployable)
- Does it work on bare metal? (No cloud-only services)
- Does it support the multi-cluster model? (dev → integration → live)
### Operational Cost

- How is it monitored? (Must integrate with kube-prometheus-stack)
- How is it backed up? (Must have a recovery story)
- How does it handle upgrades? (Must be declarative, ideally via Renovate)
- What's the failure blast radius? (Isolated > cluster-wide)
### Complexity Budget

- Is the complexity justified by the learning value?
- Could a simpler existing tool solve the same problem?
- What's the maintenance burden over 12 months?
### Alternative Analysis

- What existing stack components could solve this? (Always check first)
- What are the top 2-3 alternatives in the ecosystem?
- What do other production homelabs use? (kubesearch research)
### Failure Modes

- What happens when this component is unavailable?
- How does it interact with network policies? (Default deny)
- What's the recovery procedure? (Must be documented in a runbook)
- Can it self-heal? (Strong preference for self-healing)
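The default-deny interaction is the baseline every new component must account for: nothing talks to anything until a policy allows it. A minimal sketch of that baseline, assuming a hypothetical `example-app` namespace and standard Kubernetes NetworkPolicy semantics (which Cilium enforces; the cluster's actual profile labels may layer richer Cilium-specific policies on top):

```yaml
# Default-deny baseline: selects every pod in the namespace and
# declares both policy types without any allow rules, so all
# ingress and egress is blocked until more specific policies
# (e.g. the namespace's network-policy profile) open paths.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: example-app   # hypothetical namespace
spec:
  podSelector: {}          # empty selector = all pods in the namespace
  policyTypes:
    - Ingress
    - Egress
```

With this in place, every allowed flow is an explicit, reviewable exception, which is what makes the "how does it interact with network policies?" question answerable per component.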
## Common Design Patterns
### New Application

- HelmRelease via ResourceSet (flux-gitops pattern)
- Namespace with network-policy profile label
- ExternalSecret for credentials
- ServiceMonitor + PrometheusRule for observability
- GarageBucketClaim if S3 storage needed
- CNPG Cluster if database needed
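The first item above can be sketched as a plain Flux HelmRelease; in this repo it would typically be templated through a ResourceSet rather than committed directly, and the app name, chart, version, and `example-charts` HelmRepository here are all hypothetical:

```yaml
# Minimal Flux v2 HelmRelease: Flux reconciles the named chart
# version from a HelmRepository source into the app's namespace.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: example-app        # hypothetical application
  namespace: example-app
spec:
  interval: 30m            # how often Flux re-checks desired vs. actual state
  chart:
    spec:
      chart: example-app
      version: "1.2.3"     # pinned so Renovate can propose bumps via PR
      sourceRef:
        kind: HelmRepository
        name: example-charts   # hypothetical chart source
        namespace: flux-system
  values:
    replicaCount: 2
```

Pinning the chart version in git (rather than using a floating range) is what lets the Renovate-driven upgrade flow from the Operational Cost criteria work: every upgrade is a reviewable diff.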
### New Infrastructure Component

- OpenTofu module in infrastructure/modules/
- Unit in appropriate stack under infrastructure/units/
- Test coverage in .tftest.hcl files
- Version pinned in versions.env if applicable
### New Secret

- Store in AWS SSM Parameter Store
- Reference via ExternalSecret CR
- Never commit to git, not even encrypted
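A minimal ExternalSecret sketch tying the steps together. The store name, parameter path, and key names are hypothetical, and the `apiVersion` depends on the installed ESO release (older releases use `external-secrets.io/v1beta1`):

```yaml
# ESO reads the SSM parameter through the referenced store and
# materializes it as a regular Kubernetes Secret in-cluster,
# so only this pointer — never the value — lives in git.
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: example-app-credentials   # hypothetical
  namespace: example-app
spec:
  refreshInterval: 1h             # re-sync cadence from Parameter Store
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-ssm                 # hypothetical store name
  target:
    name: example-app-credentials # resulting Kubernetes Secret
  data:
    - secretKey: password
      remoteRef:
        key: /example-app/db-password   # hypothetical SSM parameter path
```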
### New Storage

- Longhorn PVC for block storage (default)
- GarageBucketClaim for object storage (S3-compatible)
- Never use hostPath or emptyDir for persistent data
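The block-storage default is a standard PVC against the Longhorn StorageClass. A minimal sketch (claim name, namespace, and size are hypothetical; the StorageClass is assumed to be named `longhorn`, the chart's default):

```yaml
# Longhorn-backed claim: the volume is replicated across nodes
# and covered by Longhorn's S3 backup target, unlike hostPath
# or emptyDir, which are node-local and unrecoverable.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-app-data   # hypothetical
  namespace: example-app
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn
  resources:
    requests:
      storage: 10Gi
```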
### New Database

- CNPG Cluster CR for PostgreSQL
- Automated backups to Garage S3
- Connection pooling via PgBouncer (CNPG-managed)
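A minimal CNPG Cluster sketch wiring the first two items together. The bucket, Garage endpoint, and credentials Secret are hypothetical; note also that recent CNPG releases are migrating `barmanObjectStore` backups to the Barman Cloud plugin, so check the operator version in use:

```yaml
# Three-instance HA PostgreSQL cluster with WAL archiving and
# base backups shipped to Garage's S3-compatible endpoint.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: example-db          # hypothetical
  namespace: example-app
spec:
  instances: 3              # one primary, two replicas
  storage:
    size: 10Gi
    storageClass: longhorn
  backup:
    barmanObjectStore:
      destinationPath: s3://example-backups/example-db   # hypothetical bucket
      endpointURL: http://garage.storage.svc:3900        # hypothetical Garage service
      s3Credentials:                                     # sourced via ExternalSecret
        accessKeyId:
          name: example-db-s3
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: example-db-s3
          key: SECRET_ACCESS_KEY
```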
### New Network Exposure

- HTTPRoute for HTTP/HTTPS traffic (Gateway API)
- Appropriate network-policy profile label
- cert-manager Certificate for TLS
- Internal gateway for internal-only services
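A minimal HTTPRoute sketch for an internal-only service. The Gateway name and namespace, hostname, and backend Service are hypothetical; TLS termination is assumed to happen at the Gateway with a cert-manager-issued certificate:

```yaml
# Attaches the route to the internal Gateway, so the service is
# reachable only via the internal entry point, never the public one.
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-app         # hypothetical
  namespace: example-app
spec:
  parentRefs:
    - name: internal        # hypothetical internal-only Gateway
      namespace: network
  hostnames:
    - app.example.internal  # hypothetical hostname
  rules:
    - backendRefs:
        - name: example-app # backing Service
          port: 8080
```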
## Anti-Patterns to Challenge
| Anti-Pattern | Why It's Wrong | Correct Approach |
|---|---|---|
| "Just run a container" without monitoring | Invisible failures, no alerting | ServiceMonitor + PrometheusRule required |
| Adding a new tool when existing ones suffice | Stack bloat, maintenance burden | Evaluate existing stack first |
| Skipping observability "for now" | Technical debt that never gets paid | Monitoring is day-1, not day-2 |
| Manual operational steps | Drift, inconsistency, bus factor | Everything declarative via GitOps |
| Cloud-only services | Vendor lock-in, can't run on bare metal | Self-hosted alternatives preferred |
| Single-instance without HA story | Single point of failure | At minimum, document recovery procedure |
| Storing state outside git | Shadow configuration, drift | Git is the source of truth |