mTLS and Service Mesh Security
Comprehensive guide to securing service-to-service communication with mutual TLS and service mesh patterns.
When to Use This Skill
- Implementing mTLS between services
- Deploying a service mesh (Istio, Linkerd)
- Certificate management for services
- Zero trust networking within clusters
- Service identity and authentication
- Encrypting east-west traffic
Mutual TLS (mTLS) Fundamentals
TLS vs mTLS
Standard TLS (one-way):
  Client ──────────────────► Server
  Client verifies server identity
Mutual TLS (two-way):
  Client ◄────────────────► Server
  Both verify each other
Standard TLS:
- Server presents certificate
- Client validates server
- Client remains anonymous to server
Mutual TLS:
- Server presents certificate
- Client validates server
- Client presents certificate
- Server validates client
- Both identities verified
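On the server side, one-way TLS and mTLS differ in a single switch: whether a client certificate is required. The following sketch uses only Python's standard `ssl` module to build both kinds of context; the file-path parameters are hypothetical placeholders for the PEM files a mesh would mount, and no connection is opened.

```python
import ssl
from typing import Optional


def server_context(require_client_cert: bool,
                   cert_file: Optional[str] = None,
                   key_file: Optional[str] = None,
                   ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Server side: one-way TLS and mTLS differ only in requiring a client cert."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    if cert_file:
        # The server always presents its own certificate.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    if ca_file:
        # CA bundle used to validate client certificates.
        ctx.load_verify_locations(cafile=ca_file)
    if require_client_cert:
        # mTLS: the handshake fails unless the client presents a valid certificate.
        ctx.verify_mode = ssl.CERT_REQUIRED
    # Otherwise PROTOCOL_TLS_SERVER keeps CERT_NONE and the client stays anonymous.
    return ctx


def client_context(cert_file: Optional[str] = None,
                   key_file: Optional[str] = None,
                   ca_file: Optional[str] = None) -> ssl.SSLContext:
    """Client side: always validates the server; for mTLS it also presents a cert."""
    # PROTOCOL_TLS_CLIENT defaults to CERT_REQUIRED with hostname checking enabled.
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if ca_file:
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file:
        # Presenting a client certificate is what turns TLS into mTLS.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

`check_hostname` matches DNS names; mesh sidecars typically validate the SPIFFE URI SAN instead, which this sketch does not cover.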
mTLS Handshake
mTLS Handshake Flow:
- Client Hello
  └── Client → Server: "Hello, I support these ciphers"
- Server Hello + Certificate
  └── Server → Client: "Let's use this cipher"
  └── Server → Client: "Here's my certificate"
  └── Server → Client: "Please provide your certificate" (CertificateRequest)
- Client Certificate
  └── Client → Server: "Here's my certificate"
  └── Client → Server: CertificateVerify, a signature proving it holds the matching private key
- Certificate Verification
  └── Both sides verify:
      - Certificate chain valid
      - Not expired
      - Not revoked
      - Identity matches expected
- Key Exchange
  └── Derive shared session keys (simplified; in TLS 1.3 the key shares already travel in the Hello messages)
- Encrypted Communication
  └── All traffic encrypted with the session keys
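The four checks in the Certificate Verification step can be sketched as a plain function. This is an illustrative model, not how OpenSSL or Envoy implements verification: `Cert` is a hypothetical simplified record, chain validation is reduced to a one-level issuer lookup, and a set of revoked serial numbers stands in for a CRL or OCSP query.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Set


@dataclass
class Cert:
    """Hypothetical, simplified view of a parsed peer certificate."""
    serial: int
    issuer: str
    not_before: datetime
    not_after: datetime
    sans: Set[str] = field(default_factory=set)


def verify_peer(cert: Cert, trusted_issuers: Set[str], revoked_serials: Set[int],
                expected_identity: str, now: datetime) -> List[str]:
    """Run the four checks; an empty result means the peer is accepted."""
    failures = []
    if cert.issuer not in trusted_issuers:              # chain valid (one level only)
        failures.append("untrusted issuer")
    if not (cert.not_before <= now <= cert.not_after):  # not expired, not yet-valid
        failures.append("outside validity window")
    if cert.serial in revoked_serials:                  # stand-in for CRL/OCSP
        failures.append("revoked")
    if expected_identity not in cert.sans:              # identity matches expected
        failures.append("identity mismatch")
    return failures
```

Returning every failure instead of stopping at the first one mirrors what is useful when debugging a rejected handshake.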
Certificate Components
Service Certificate Fields:
Subject:
  CN = my-service
  O = my-organization
Subject Alternative Names (SANs):
- DNS: my-service.default.svc.cluster.local
- DNS: my-service.default
- DNS: my-service
- URI: spiffe://cluster.local/ns/default/sa/my-service
Issuer: CN = cluster-ca (the CA that signed the certificate)
Validity:
  Not Before: 2025-01-01
  Not After: 2025-01-08 (short-lived, auto-rotated)
Key Usage:
- Digital Signature
- Key Encipherment
Extended Key Usage:
- TLS Web Server Authentication
- TLS Web Client Authentication
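Each sidecar both accepts inbound connections (server role) and opens outbound ones (client role), which is why the certificate above carries both Extended Key Usages. A small sketch of that check plus the lifetime arithmetic, assuming the EKUs have already been extracted as OID strings (parsing X.509 itself needs a library such as `cryptography` and is out of scope here):

```python
from datetime import date
from typing import Iterable, Set

# OIDs behind the Extended Key Usage names listed above (RFC 5280).
SERVER_AUTH = "1.3.6.1.5.5.7.3.1"  # TLS Web Server Authentication
CLIENT_AUTH = "1.3.6.1.5.5.7.3.2"  # TLS Web Client Authentication


def mtls_roles(eku_oids: Iterable[str]) -> Set[str]:
    """Return which ends of an mTLS connection the listed EKUs permit."""
    oids = set(eku_oids)
    roles = set()
    if SERVER_AUTH in oids:
        roles.add("server")
    if CLIENT_AUTH in oids:
        roles.add("client")
    return roles


def lifetime_days(not_before: date, not_after: date) -> int:
    """Validity period in whole days."""
    return (not_after - not_before).days
```

With the example dates above, `lifetime_days(date(2025, 1, 1), date(2025, 1, 8))` returns 7, and a sidecar certificate should yield `{"server", "client"}` from `mtls_roles`.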
Service Mesh Architecture
Components
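Service Mesh Architecture (Istio example; Pilot, Citadel and Galley are Istio's original control-plane components, merged into the single istiod binary since Istio 1.5):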
┌────────────────────────────────────────────────────┐
│                    Control Plane                   │
│  ┌────────────┐   ┌────────────┐   ┌────────────┐  │
│  │   Pilot    │   │  Citadel   │   │   Galley   │  │
│  │  (Config)  │   │    (CA)    │   │(Validation)│  │
│  └─────┬──────┘   └─────┬──────┘   └─────┬──────┘  │
└────────┼────────────────┼────────────────┼─────────┘
         └────────────────┼────────────────┘
                          │
                          ▼
┌────────────────────────────────────────────────────┐
│                     Data Plane                     │
│   ┌─────────────┐                ┌─────────────┐   │
│   │   Sidecar   │◄─────mTLS─────►│   Sidecar   │   │
│   │   (Envoy)   │                │   (Envoy)   │   │
│   └──────┬──────┘                └──────┬──────┘   │
│          │                              │          │
│   ┌──────▼──────┐                ┌──────▼──────┐   │
│   │  Service A  │                │  Service B  │   │
│   └─────────────┘                └─────────────┘   │
└────────────────────────────────────────────────────┘
Control Plane Functions:
- Certificate authority (issue/rotate certs)
- Configuration distribution
- Policy management
- Service discovery
Data Plane Functions:
- mTLS termination
- Traffic encryption
- Policy enforcement
- Telemetry collection
Sidecar Proxy Pattern
Sidecar Injection:
Without Sidecar:
┌───────────────────┐
│ Pod               │
│  ┌─────────────┐  │
│  │     App     │──────► Network
│  └─────────────┘  │
└───────────────────┘
With Sidecar:
┌────────────────────────────────────┐
│ Pod                                │
│  ┌─────────────┐   ┌─────────────┐ │
│  │     App     │──►│   Sidecar   │──────► mTLS ──►
│  │             │   │   (Envoy)   │ │
│  │  localhost  │   │   handles   │ │
│  │    :8080    │   │  security   │ │
│  └─────────────┘   └─────────────┘ │
└────────────────────────────────────┘
Traffic Flow:
- App sends the request to the destination service as usual (in plaintext)
- Sidecar intercepts it (iptables rules redirect the pod's traffic to the proxy)
- Sidecar establishes mTLS connection
- Traffic encrypted to destination sidecar
- Destination sidecar decrypts
- Destination sidecar forwards to app
Istio Security
Istio mTLS Modes
PeerAuthentication Modes:
- PERMISSIVE (default initially)
  - Accepts both plaintext and mTLS
  - Good for migration
- STRICT
  - mTLS required for all traffic
  - Rejects plaintext connections
- DISABLE
  - Disable mTLS (not recommended)
Example Policy:
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: production
spec:
  mtls:
    mode: STRICT
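The example policy is namespace-wide because it has no `selector`. Istio resolves the effective mode from the most specific policy: workload-level (with a `selector`) over namespace-level over mesh-wide (a policy in the root namespace, `istio-system` by default), with `UNSET` inheriting from the parent and PERMISSIVE applying when nothing is set. A minimal Python model of that precedence, deliberately ignoring port-level overrides:

```python
from typing import Optional

VALID_MODES = {"PERMISSIVE", "STRICT", "DISABLE", "UNSET"}


def effective_mode(mesh: Optional[str] = None,
                   namespace: Optional[str] = None,
                   workload: Optional[str] = None) -> str:
    """Most specific policy wins; a missing policy (None) or UNSET inherits
    the mode from the level above it."""
    mode = "PERMISSIVE"  # applied when no policy sets a mode
    for level in (mesh, namespace, workload):  # least to most specific
        if level is None or level == "UNSET":
            continue
        if level not in VALID_MODES:
            raise ValueError(f"unknown mode: {level}")
        mode = level
    return mode


def accepts_plaintext(mode: str) -> bool:
    """STRICT rejects plaintext; PERMISSIVE and DISABLE accept it."""
    return mode != "STRICT"
```

For example, a mesh-wide STRICT policy plus a workload-level PERMISSIVE exception leaves that one workload accepting plaintext, which is the usual way to carve out a legacy client during migration.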
Istio Authorization Policies
Authorization Policy Structure:
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: orders-policy
  namespace: production
spec:
  selector:
    matchLabels:
      app: orders-service
  action: ALLOW
  rules:
  - from:
    - source:
        principals: ["cluster.local/ns/production/sa/frontend"]
    to:
    - operation:
        methods: ["GET", "POST"]
        paths: ["/api/orders/*"]
    when:
    - key: request.headers[x-custom-header]
      values: ["valid-value"]
Policy Logic:
- selector: Which workloads this applies to
- from: Who can make requests (source identity)
- to: What operations are allowed
- when: Additional conditions
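How `from`, `to` and `when` combine can be sketched as follows. This is a simplified model, not Istio's implementation: it assumes a single ALLOW policy already selected for the workload, ignores DENY and CUSTOM actions, treats every rule field as required (Istio treats an omitted field as "match anything"), and compares header names case-sensitively.

```python
from dataclasses import dataclass, field
from typing import Dict, List


def path_matches(pattern: str, path: str) -> bool:
    """Exact, prefix ("/api/orders/*"), suffix ("*/status") or match-all ("*")."""
    if pattern == "*":
        return True
    if pattern.endswith("*"):
        return path.startswith(pattern[:-1])
    if pattern.startswith("*"):
        return path.endswith(pattern[1:])
    return path == pattern


@dataclass
class Rule:
    principals: List[str]  # from.source.principals
    methods: List[str]     # to.operation.methods
    paths: List[str]       # to.operation.paths
    headers: Dict[str, List[str]] = field(default_factory=dict)  # when conditions


@dataclass
class Request:
    principal: str  # peer identity taken from the mTLS certificate
    method: str
    path: str
    headers: Dict[str, str] = field(default_factory=dict)


def rule_matches(rule: Rule, req: Request) -> bool:
    # from, to and when must all match; any value inside each list may match.
    from_ok = req.principal in rule.principals
    to_ok = req.method in rule.methods and any(
        path_matches(p, req.path) for p in rule.paths)
    when_ok = all(req.headers.get(k) in vals for k, vals in rule.headers.items())
    return from_ok and to_ok and when_ok


def allowed(rules: List[Rule], req: Request) -> bool:
    """For a workload selected by an ALLOW policy: allow if any rule matches."""
    return any(rule_matches(rule, req) for rule in rules)
```

With the `orders-policy` rule, a `GET /api/orders/42` from the frontend principal carrying `x-custom-header: valid-value` is allowed, while the same request as a `DELETE`, or without the header, is denied. An ALLOW policy with no rules matches nothing, so `allowed([], req)` is `False`.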
Istio Certificate Management
Istio Certificate Flow:
- Workload starts with sidecar
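- Sidecar (istio-agent) generates a private key and sends a certificate signing request (CSR) to Istiod (CA)
- Istiod authenticates the request using the workload's Kubernetes service account token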
- Istiod issues short-lived certificate (24h default)
- Sidecar stores certificate in memory
- Certificate auto-rotated before expiry
SPIFFE Identity Format: spiffe://<trust-domain>/ns/<namespace>/sa/<service-account> (Istio's default trust domain is cluster.local)
Certificate Properties:
- Short-lived (hours, not years)
- Auto-rotated (no manual intervention)
- Bound to Kubernetes service account
- No private key leaves workload
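Istio `AuthorizationPolicy` principals, such as `cluster.local/ns/production/sa/frontend` in the policy example, are the SPIFFE ID without the `spiffe://` scheme. A small helper for converting between the two, assuming the Kubernetes path layout shown in the format line:

```python
from typing import NamedTuple


class SpiffeId(NamedTuple):
    trust_domain: str
    namespace: str
    service_account: str


def parse_spiffe_id(uri: str) -> SpiffeId:
    """Parse spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>."""
    prefix = "spiffe://"
    if not uri.startswith(prefix):
        raise ValueError(f"not a SPIFFE ID: {uri}")
    parts = uri[len(prefix):].split("/")
    if len(parts) != 5 or parts[1] != "ns" or parts[3] != "sa":
        raise ValueError(f"unexpected Kubernetes SPIFFE path: {uri}")
    return SpiffeId(parts[0], parts[2], parts[4])


def istio_principal(sid: SpiffeId) -> str:
    """AuthorizationPolicy principals drop the spiffe:// scheme."""
    return f"{sid.trust_domain}/ns/{sid.namespace}/sa/{sid.service_account}"
```

Because the identity is bound to the service account rather than to a pod IP, policies written against these principals keep working as pods are rescheduled.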
Linkerd Security
Linkerd mTLS
Linkerd Automatic mTLS:
Features:
- mTLS enabled by default
- Zero-configuration setup
- Automatic certificate rotation
- No YAML required for basic mTLS
Identity System:
- Uses Kubernetes service accounts
- Certificates issued by Linkerd's identity service
- 24-hour certificate lifetime (default)
- Automatic rotation
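Linkerd names workload identities in a DNS-like form derived from the service account, rather than as SPIFFE URIs. A sketch that builds that name, assuming the default control-plane namespace `linkerd` and trust domain `cluster.local` (both are parameters because an installation can change them):

```python
def linkerd_identity(service_account: str, namespace: str,
                     control_plane_ns: str = "linkerd",
                     trust_domain: str = "cluster.local") -> str:
    """DNS-style identity Linkerd's identity service assigns to a meshed workload."""
    return (f"{service_account}.{namespace}.serviceaccount.identity."
            f"{control_plane_ns}.{trust_domain}")
```

For instance, `linkerd_identity("frontend", "production")` yields `frontend.production.serviceaccount.identity.linkerd.cluster.local`, the identity a `serviceAccounts` entry for `frontend` in `production` refers to.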
Verification:
$ linkerd viz tap deploy/my-service
  Shows mTLS status of connections
$ linkerd check --proxy
  Validates mTLS configuration
Linkerd Server Authorization
Linkerd Authorization:
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: orders-api
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: orders-service
  port: 8080
  proxyProtocol: HTTP/2
---
apiVersion: policy.linkerd.io/v1beta1
kind: ServerAuthorization
metadata:
  name: orders-authz
  namespace: production
spec:
  server:
    name: orders-api
  client:
    meshTLS:
      serviceAccounts:
        - name: frontend
          namespace: production
Certificate Management
Certificate Rotation Strategies
Rotation Approaches:
- Short-Lived Certificates (Recommended)
  - 1-24 hour validity
  - Auto-rotated by mesh
  - No revocation infrastructure needed (a compromised cert simply expires within hours)
  - Service mesh handles automatically
- Long-Lived with Revocation
  - Days to months validity
  - Requires revocation infrastructure
  - CRL or OCSP for checking
  - More complex to manage
- Hybrid
  - Short-lived for service mesh
  - Longer-lived for external connections
  - Different approaches for different contexts
Rotation Timeline (certificate lifetime, e.g., 24 hours):
├──────────────────────┼──────────────────────┤
Issue                  Rotate                 Expire
t=0                    t=12h                  t=24h
                       (50% of life)
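The 50% rotation point in the timeline is simple to compute. The sketch below also adds optional jitter, which is an assumption of this example rather than something the timeline prescribes: it pulls renewal earlier by a random fraction of the lifetime so that workloads issued at the same moment do not all contact the CA at once.

```python
import random
from datetime import datetime, timedelta
from typing import Callable


def rotation_time(not_before: datetime, not_after: datetime, ratio: float = 0.5,
                  jitter: float = 0.0,
                  rng: Callable[[], float] = random.random) -> datetime:
    """Renew after `ratio` of the lifetime has elapsed, moved earlier by up
    to `jitter` of the lifetime (rng() is expected to return [0, 1))."""
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be between 0 and 1")
    lifetime = not_after - not_before
    rotate_at = not_before + lifetime * ratio
    if jitter:
        rotate_at -= lifetime * (jitter * rng())
    return rotate_at
```

For a 24-hour certificate issued at `t=0`, the default call returns `t=12h`; with `jitter=0.1` the renewal lands somewhere between roughly `t=9.6h` and `t=12h`, still leaving ample margin before expiry.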
Root CA Management
Root CA Hierarchy:
Option 1: Single Root (Simple)
  Root CA
  └── Workload Certificates
Option 2: Intermediate CAs (Recommended)
  Root CA (offline, very long-lived)
  ├── Cluster CA 1 (intermediate, medium-lived)
  │   └── Workload Certs (short-lived)
  ├── Cluster CA 2
  │   └── Workload Certs
  └── Cluster CA 3
      └── Workload Certs
Root CA Rotation:
- Generate new root CA
- Update trust bundle (include both old and new)
- Issue new intermediates from new root
- Workloads accept certs from both roots
- Remove old root after all certs rotated
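Why the old root has to stay in the trust bundle until every certificate has been reissued can be shown with a name-only chain walk. The certificate names and the `issuer_of` mapping are hypothetical; a real verifier also checks signatures, validity periods and CA constraints at every hop.

```python
from typing import Dict, Optional, Set


def chains_to_trusted_root(cert: str, issuer_of: Dict[str, Optional[str]],
                           trust_bundle: Set[str], max_depth: int = 5) -> bool:
    """Follow issuer links (workload -> intermediate -> root) and accept only
    if the chain ends at a root present in the trust bundle."""
    current = cert
    for _ in range(max_depth):
        issuer = issuer_of.get(current)
        if issuer is None:  # no issuer recorded: treat as the self-signed root
            return current in trust_bundle
        current = issuer
    return False            # chain longer than max_depth: reject
```

While the bundle holds both `root-old` and `root-new`, certificates from either hierarchy verify; removing `root-old` too early immediately breaks every workload that still presents a certificate chained to it.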
Migration to mTLS
Migration Strategy
Phase 1: Observe (Weeks 1-2)
- Enable mesh in permissive mode
- Monitor which connections are plaintext
- Identify all service-to-service traffic
- Document dependencies
Phase 2: Test (Weeks 3-4)
- Enable strict mode in test environment
- Verify all services can communicate
- Test failure scenarios
- Fix any issues
Phase 3: Rollout (Weeks 5-8)
- Enable strict mode namespace by namespace
- Start with least critical namespaces
- Monitor for connection failures
- Rollback plan ready
Phase 4: Enforce (Week 9+)
- Enable strict mode cluster-wide
- Remove permissive policies
- Document exceptions
- Ongoing monitoring
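Phase 1's output, a list of who still sends plaintext, decides when each namespace can move to strict mode. A sketch that turns observed connections into a per-namespace readiness report; the input tuples are a hypothetical export format, which could be derived for example from the `connection_security_policy` label on Istio's standard request metrics as reported by the destination.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (source namespace, destination namespace, used_mtls), observed while the
# destinations run in PERMISSIVE mode. Hypothetical export format.
Connection = Tuple[str, str, bool]


def strict_readiness(connections: Iterable[Connection]) -> Dict[str, List[str]]:
    """For each destination namespace, list the sources still sending plaintext.
    An empty list means no observed client would be rejected by STRICT."""
    plaintext_sources: Dict[str, Set[str]] = defaultdict(set)
    destinations: Set[str] = set()
    for src, dst, used_mtls in connections:
        destinations.add(dst)
        if not used_mtls:
            plaintext_sources[dst].add(src)
    return {dst: sorted(plaintext_sources[dst]) for dst in sorted(destinations)}
```

Namespaces with an empty list are candidates for the early, least-critical rollouts in Phase 3; the rest need their plaintext clients meshed or an explicit exception first.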
Common Migration Issues
Issue: External services can't connect
Fix: Use Gateway for external → internal traffic
Issue: Legacy services don't support mTLS
Fix: Use permissive mode for specific services
Issue: Performance degradation
Fix: Tune sidecar resources, connection pools
Issue: Certificate errors
Fix: Check trust bundle, certificate chain
Issue: Non-meshed services can't communicate
Fix: Either add to mesh or use permissive mode
Best Practices
Security Best Practices:
- Certificate Management
  □ Use short-lived certificates (hours, not years)
  □ Automate rotation completely
  □ Protect root CA (offline if possible)
  □ Monitor certificate expiry
- Policy Management
  □ Default deny, explicit allow
  □ Use namespace isolation
  □ Regular policy audits
  □ Test policies in staging first
- Observability
  □ Monitor mTLS success/failure rates
  □ Alert on plaintext connection attempts (rejected in strict mode)
  □ Log authorization decisions
  □ Trace requests across services
- Operations
  □ Document all exceptions
  □ Regular security reviews
  □ Incident response procedures
  □ Rotation runbooks
- Performance
  □ Right-size sidecar resources
  □ Connection pooling
  □ Monitor latency overhead
  □ Benchmark with and without mesh
Troubleshooting
Common Issues:
- "Connection reset" errors
  - Check if both sides have valid certs
  - Verify trust bundle is synchronized
  - Check for certificate expiry
- "503 Service Unavailable"
  - Destination not in the mesh (no sidecar)
  - mTLS mode mismatch between client and destination
  - Note: authorization policy denials on HTTP traffic usually return 403 ("RBAC: access denied" in Istio), not 503
- High latency
  - Sidecar resource constraints
  - Certificate verification overhead
  - Network policy conflicts
- Intermittent failures
  - Certificate rotation race condition
  - Trust bundle propagation delay
  - Sidecar restart during rotation
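For the certificate-expiry checks above, Python's standard library can parse the `notAfter` text format that `openssl x509 -noout -enddate` prints (after the `notAfter=` prefix) and that `SSLSocket.getpeercert()` returns. A sketch that classifies a certificate; the warning threshold is an arbitrary parameter, not a recommended value.

```python
import ssl
import time
from typing import Optional


def expiry_status(not_after: str, warn_seconds: float,
                  now: Optional[float] = None) -> str:
    """Classify a certificate by its notAfter date.

    `not_after` uses OpenSSL's text form, e.g. "Jan  8 00:00:00 2025 GMT".
    `now` is a Unix timestamp; it defaults to the current time."""
    expires = ssl.cert_time_to_seconds(not_after)
    remaining = expires - (time.time() if now is None else now)
    if remaining <= 0:
        return "expired"
    if remaining <= warn_seconds:
        return "expiring soon"
    return "ok"
```

With short-lived mesh certificates, an "expiring soon" result on a running workload usually means rotation has stalled, which points back at the trust bundle and control-plane checks listed above.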
Debug Commands (Istio):
$ istioctl analyze
$ istioctl proxy-status
$ istioctl proxy-config secret <pod>
Debug Commands (Linkerd):
$ linkerd check
$ linkerd viz tap <resource>
$ linkerd viz stat <resource>
Related Skills
- zero-trust-architecture: Overall security architecture
- api-security: Application-level security
- container-orchestration: Kubernetes and service mesh
- distributed-tracing: Observability in service mesh