Internal Developer Platform

Comprehensive guide to designing and building Internal Developer Platforms (IDPs) that improve developer productivity and experience.

When to Use This Skill

Designing an Internal Developer Platform
Building or restructuring platform teams
Improving developer experience (DevEx)
Evaluating platform technologies (Backstage, Port, etc.)
Creating self-service capabilities for developers
Measuring platform adoption and success

Platform Engineering Fundamentals

What is an Internal Developer Platform?

Internal Developer Platform (IDP): A layer on top of infrastructure that provides self-service capabilities to development teams while maintaining governance.

┌─────────────────────────────────────────────────────────────┐
│                         DEVELOPERS                          │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐         │
│  │ Team A  │  │ Team B  │  │ Team C  │  │ Team D  │         │
│  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘         │
│       │            │            │            │              │
│       └────────────┴─────┬──────┴────────────┘              │
│                          │                                  │
│  ┌───────────────────────┴───────────────────────────────┐  │
│  │              INTERNAL DEVELOPER PLATFORM              │  │
│  │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐  │  │
│  │  │ Service  │ │ Template │ │  Self-   │ │ Docs &   │  │  │
│  │  │ Catalog  │ │ Library  │ │ Service  │ │ Discovery│  │  │
│  │  └──────────┘ └──────────┘ └──────────┘ └──────────┘  │  │
│  └───────────────────────────────────────────────────────┘  │
│                          │                                  │
│  ┌───────────────────────┴───────────────────────────────┐  │
│  │                    INFRASTRUCTURE                     │  │
│  │ Kubernetes │ Cloud │ CI/CD │ Observability │ Security │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

Key Value Propositions:
├── Self-service: Developers can provision without tickets
├── Standardization: Consistent patterns across teams
├── Guardrails: Security and compliance built-in
├── Visibility: Centralized service catalog and docs
└── Efficiency: Reduce cognitive load on developers

Platform vs Infrastructure

Infrastructure Team (Traditional):

Ticket-based requests
Manual provisioning
Bespoke solutions per team
Ops handles deployments
Documentation scattered

Platform Team (Modern):

Self-service capabilities
Automated provisioning
Standardized templates
Developers own deployments
Centralized documentation

Key Shift: "You Build It, You Run It" + "Platform Handles the How"

Platform Core Components

Service Catalog

Service Catalog: Centralized registry of all services with ownership, docs, and metadata.

┌─────────────────────────────────────────────────────────────┐
│                       SERVICE CATALOG                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Payment Service                                   [API] │ │
│ │ Owner: Payments Team          │ Tier: Critical          │ │
│ │ Tech: Node.js, PostgreSQL     │ Dependencies: 4         │ │
│ │ [Docs] [API Spec] [Runbook] [Alerts] [Deploy]           │ │
│ └─────────────────────────────────────────────────────────┘ │
│                                                             │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ User Service                                  [Backend] │ │
│ │ Owner: Identity Team          │ Tier: High              │ │
│ │ Tech: Go, MongoDB             │ Dependencies: 2         │ │
│ │ [Docs] [API Spec] [Runbook] [Alerts] [Deploy]           │ │
│ └─────────────────────────────────────────────────────────┘ │
│                                                             │
│ Service Metadata:                                           │
│ ├── Owner team and contacts                                 │
│ ├── Technical stack                                         │
│ ├── Service tier/criticality                                │
│ ├── Dependencies (upstream/downstream)                      │
│ ├── API specifications                                      │
│ ├── Documentation links                                     │
│ ├── Deployment information                                  │
│ └── Observability dashboards                                │
└─────────────────────────────────────────────────────────────┘
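The service metadata listed above can be modelled as a small record type. The following is a minimal sketch, not the schema of any particular portal: the class `CatalogEntry`, the helper names, the tier values, and the required link keys are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical tier names; the catalog above only shows "Critical" and "High".
VALID_TIERS = {"critical", "high", "medium", "low"}


@dataclass
class CatalogEntry:
    """One service in the catalog, mirroring the metadata fields listed above."""
    name: str
    owner: str
    tier: str
    tech: list[str] = field(default_factory=list)
    depends_on: list[str] = field(default_factory=list)
    links: dict[str, str] = field(default_factory=dict)  # e.g. docs, runbook


def missing_metadata(entry: CatalogEntry, required_links=("docs", "runbook")) -> list[str]:
    """Return the names of metadata fields that are absent or invalid."""
    problems = []
    if not entry.owner:
        problems.append("owner")
    if entry.tier.lower() not in VALID_TIERS:
        problems.append("tier")
    for link in required_links:
        if link not in entry.links:
            problems.append(f"link:{link}")
    return problems


def dependents_of(name: str, entries: list[CatalogEntry]) -> list[str]:
    """Downstream lookup: which services declare a dependency on `name`?"""
    return sorted(e.name for e in entries if name in e.depends_on)
```

A completeness check like `missing_metadata` is one way to feed the "% of services in catalog" style of adoption metric, since an entry with no owner or runbook is only nominally cataloged.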

Template Library

Template Library: Pre-built templates for common patterns that encode best practices.

Template Categories:
├── Application Templates
│   ├── REST API (Go, Node.js, .NET, Python)
│   ├── GraphQL Service
│   ├── gRPC Service
│   ├── Event Consumer
│   ├── Scheduled Job
│   └── Frontend (React, Vue, Angular)
│
├── Infrastructure Templates
│   ├── Database (PostgreSQL, MySQL, MongoDB)
│   ├── Cache (Redis, Memcached)
│   ├── Message Queue (Kafka, RabbitMQ)
│   └── Storage (S3, GCS)
│
└── Integration Templates
    ├── Third-party API client
    ├── Authentication flow
    └── Webhook handler

Template Contents:
┌─────────────────────────────────────────────────────────────┐
│ Template: node-rest-api                                     │
├─────────────────────────────────────────────────────────────┤
│ ├── src/                 │ Application code                 │
│ ├── tests/               │ Test setup                       │
│ ├── Dockerfile           │ Container image                  │
│ ├── helm/                │ Kubernetes deployment            │
│ ├── .github/workflows/   │ CI/CD pipelines                  │
│ ├── docs/                │ Documentation templates          │
│ ├── catalog-info.yaml    │ Backstage registration           │
│ └── terraform/           │ Infrastructure as Code           │
│                                                             │
│ Built-in:                                                   │
│ ✓ Health checks              ✓ Structured logging           │
│ ✓ OpenTelemetry tracing      ✓ Prometheus metrics           │
│ ✓ Security headers           ✓ Input validation             │
│ ✓ Error handling             ✓ API documentation            │
└─────────────────────────────────────────────────────────────┘
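To illustrate how a template turns into a new repository, here is a rough scaffolding sketch using only the Python standard library. It is not how Backstage Software Templates or any other tool is implemented; the `scaffold` function, the `EXAMPLE_TEMPLATE` mapping, and the `$service_name` / `$owner` placeholders are hypothetical.

```python
from pathlib import Path
from string import Template

# Hypothetical in-memory template: relative path -> file body with $placeholders.
# A real template would hold the full tree shown above (src/, helm/, terraform/, ...).
EXAMPLE_TEMPLATE = {
    "README.md": "# $service_name\n\nOwned by $owner.\n",
    "docs/index.md": "Documentation for $service_name.\n",
}


def scaffold(template: dict[str, str], target: Path, values: dict[str, str]) -> list[Path]:
    """Render every file of `template` into `target`, filling in placeholders.

    Template.substitute raises KeyError when a placeholder has no value,
    so an incomplete request fails instead of producing a half-filled file.
    """
    created = []
    for rel_path, body in template.items():
        out = target / rel_path
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(Template(body).substitute(values))
        created.append(out)
    return created
```

Calling `scaffold(EXAMPLE_TEMPLATE, Path("orders-api"), {"service_name": "orders-api", "owner": "orders-team"})` would write two rendered files under `orders-api/`. Failing fast on a missing value is a deliberate choice here: a scaffolded service with an empty owner would undermine the catalog from day one.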

Self-Service Portal

Self-Service Capabilities: Actions developers can perform without tickets or approvals.

┌─────────────────────────────────────────────────────────────┐
│                     SELF-SERVICE PORTAL                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│ Create New Service              [5 min setup, no tickets]   │
│ ├── Choose template                                         │
│ ├── Configure options                                       │
│ ├── Generate repository                                     │
│ ├── Create CI/CD pipeline                                   │
│ ├── Provision infrastructure                                │
│ └── Register in catalog                                     │
│                                                             │
│ Common Self-Service Actions:                                │
│ ┌────────────────┬────────────────┬────────────────┐        │
│ │ Environments   │ Databases      │ Secrets        │        │
│ │ ├── Create env │ ├── Provision  │ ├── Create     │        │
│ │ ├── Clone env  │ ├── Scale      │ ├── Rotate     │        │
│ │ └── Destroy    │ └── Backup     │ └── Access     │        │
│ └────────────────┴────────────────┴────────────────┘        │
│ ┌────────────────┬────────────────┬────────────────┐        │
│ │ Deployments    │ Domains        │ Access         │        │
│ │ ├── Deploy     │ ├── Request    │ ├── Request    │        │
│ │ ├── Rollback   │ ├── Configure  │ ├── Review     │        │
│ │ └── Promote    │ └── Cert       │ └── Audit      │        │
│ └────────────────┴────────────────┴────────────────┘        │
│                                                             │
│ Guardrails (automatic):                                     │
│ ✓ Security scanning          ✓ Compliance checks            │
│ ✓ Cost limits                ✓ Naming conventions           │
│ ✓ Resource quotas            ✓ Approval workflows           │
└─────────────────────────────────────────────────────────────┘
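The automatic guardrails above can be pictured as a policy check that runs before any self-service request is fulfilled. This is a minimal sketch under stated assumptions: the naming pattern, the quota and cost limits, and the `ProvisionRequest` type are invented for illustration, and a real platform would load such limits from configuration or a policy engine.

```python
import re
from dataclasses import dataclass

# Hypothetical policy values, not recommendations.
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9-]{2,39}$")
MAX_CPU_CORES = 8
MAX_MONTHLY_COST_USD = 500


@dataclass
class ProvisionRequest:
    name: str
    team: str
    cpu_cores: int
    estimated_monthly_cost_usd: float


def check_guardrails(req: ProvisionRequest) -> tuple[str, list[str]]:
    """Return (decision, reasons); decision is 'approved', 'needs_approval' or 'rejected'."""
    # Naming violations are rejected outright: the requester can fix them immediately.
    if not NAME_PATTERN.fullmatch(req.name):
        return "rejected", ["name violates naming convention"]
    reasons = []
    if req.cpu_cores > MAX_CPU_CORES:
        reasons.append("cpu above quota")
    if req.estimated_monthly_cost_usd > MAX_MONTHLY_COST_USD:
        reasons.append("cost above limit")
    # Requests over a limit are routed to an approval workflow instead of being blocked.
    return ("needs_approval" if reasons else "approved"), reasons
```

The three-way outcome reflects the split in the diagram: most requests pass with no human involved, and only those that exceed a quota or cost limit fall through to an approval workflow.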

Platform Team Structure

Team Topologies

Platform Team Types:

Enabling Team (Recommended Start)
Purpose: Help stream-aligned teams adopt platform
Size: 3-5 people
Activities:
├── Pair programming with product teams
├── Create documentation and guides
├── Gather feedback and requirements
└── Provide training and support

Platform Team (Mature)
Purpose: Build and maintain the platform
Size: 5-15 people (scale with org)
Activities:
├── Build self-service capabilities
├── Maintain templates and tooling
├── Define and enforce standards
└── Operate platform infrastructure

Complicated Subsystem Team (Specialized)
Purpose: Handle complex technical domains
Size: 3-7 people per domain
Examples:
├── Data platform team
├── ML platform team
└── Security platform team

Team Interaction:
┌─────────────────────────────────────────────────────────────┐
│                                                             │
│   ┌──────────────┐         ┌──────────────┐                 │
│   │   Stream-    │◄───────►│   Platform   │                 │
│   │ Aligned Team │ X-as-a- │     Team     │                 │
│   └──────────────┘ Service └──────────────┘                 │
│          │                        │                         │
│          │ Collaboration          │ Facilitation            │
│          │                        │                         │
│          ▼                        ▼                         │
│   ┌──────────────┐         ┌──────────────┐                 │
│   │ Complicated  │◄───────►│   Enabling   │                 │
│   │  Subsystem   │ Service │     Team     │                 │
│   └──────────────┘         └──────────────┘                 │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Platform Team Skills

Platform Team Competencies:

Technical:
├── Kubernetes and container orchestration
├── Infrastructure as Code (Terraform, Pulumi)
├── CI/CD pipeline design
├── API design and development
├── Observability tooling
├── Security engineering
└── Cloud platforms (AWS, GCP, Azure)

Product:
├── Developer experience research
├── User journey mapping
├── Metrics and analytics
├── Documentation writing
└── Training and enablement

Organizational:
├── Stakeholder management
├── Communication skills
├── Change management
└── Technical leadership

Platform Technology Choices

Backstage (Spotify)

Backstage: Open-source developer portal framework by Spotify.

Core Features:
├── Service Catalog (software component registry)
├── Software Templates (scaffolding)
├── TechDocs (docs-as-code)
├── Search (unified search across everything)
└── Plugins (extensible ecosystem)

Architecture:
┌─────────────────────────────────────────────────────────────┐
│                          BACKSTAGE                          │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ Frontend (React)                                     │   │
│  │ ├── Catalog UI   ├── Templates UI   ├── Plugins      │   │
│  └──────────────────────────────────────────────────────┘   │
│                            │                                │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ Backend (Node.js)                                    │   │
│  │ ├── Catalog API  ├── Auth           ├── Plugin APIs  │   │
│  └──────────────────────────────────────────────────────┘   │
│                            │                                │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ Integrations                                         │   │
│  │ ├── GitHub       ├── Kubernetes     ├── CI/CD        │   │
│  │ ├── PagerDuty    ├── Prometheus     ├── Custom       │   │
│  └──────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘

Catalog Entity:

# catalog-info.yaml
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payment-service
  description: Handles payment processing
  annotations:
    github.com/project-slug: org/payment-service
    backstage.io/techdocs-ref: dir:.
spec:
  type: service
  lifecycle: production
  owner: payments-team
  system: payments
  dependsOn:
    - component:user-service
  providesApis:
    - payment-api
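A pre-merge check can catch incomplete entity files before they reach the catalog. The sketch below validates an already-parsed entity (a plain `dict`, however the YAML was loaded) against the fields used in the example above. It is not Backstage's own validator: `validate_component` is a hypothetical helper, it only accepts the `apiVersion` shown here, and the name pattern is an approximation of Backstage's naming rule that should be checked against the official descriptor documentation.

```python
import re

# Approximation of the entity-name rule (1-63 characters, alphanumeric runs
# separated by '-', '_' or '.'); treat it as an assumption, not the official check.
NAME_RE = re.compile(r"^[A-Za-z0-9]+(?:[-_.][A-Za-z0-9]+)*$")
REQUIRED_SPEC_FIELDS = ("type", "lifecycle", "owner")


def validate_component(entity: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the entity passed."""
    errors = []
    if entity.get("apiVersion") != "backstage.io/v1alpha1":
        errors.append("unexpected apiVersion")
    if entity.get("kind") != "Component":
        errors.append("kind must be Component")
    name = entity.get("metadata", {}).get("name", "")
    if not (1 <= len(name) <= 63 and NAME_RE.fullmatch(name)):
        errors.append("invalid metadata.name")
    spec = entity.get("spec", {})
    for field_name in REQUIRED_SPEC_FIELDS:
        if not spec.get(field_name):
            errors.append(f"missing spec.{field_name}")
    return errors


# The payment-service entity from above, as it would look after YAML parsing.
payment_service = {
    "apiVersion": "backstage.io/v1alpha1",
    "kind": "Component",
    "metadata": {"name": "payment-service", "description": "Handles payment processing"},
    "spec": {
        "type": "service",
        "lifecycle": "production",
        "owner": "payments-team",
        "system": "payments",
        "dependsOn": ["component:user-service"],
        "providesApis": ["payment-api"],
    },
}
```

Running `validate_component(payment_service)` returns an empty list, while dropping `spec.owner` yields `["missing spec.owner"]`, which is the kind of message a CI job could surface on a pull request.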

Alternative Platforms

Platform Options Comparison:

Platform	Type	Strengths	Considerations
Backstage	OSS	Extensible, active community	Requires customization
Port	Commercial	Quick setup, polished UI	Vendor lock-in
Cortex	Commercial	SRE focused, scorecards	Enterprise pricing
OpsLevel	Commercial	Service maturity	Smaller ecosystem
Roadie	Managed	Hosted Backstage	Less control

Decision Factors:
├── Build vs Buy tolerance
├── Customization requirements
├── Team capacity for maintenance
├── Integration needs
├── Budget constraints
└── Timeline expectations

Developer Experience Metrics

DORA Metrics

DORA (DevOps Research and Assessment) Metrics:

Performance bands below are approximate; DORA revises the thresholds in its annual State of DevOps reports.

Deployment Frequency
How often you deploy to production
├── Elite: Multiple times per day
├── High: Weekly to monthly
├── Medium: Monthly to every 6 months
└── Low: Every 6+ months

Lead Time for Changes
Time from code commit to production
├── Elite: < 1 hour
├── High: 1 day to 1 week
├── Medium: 1 week to 1 month
└── Low: 1 month to 6 months

Mean Time to Recovery (MTTR)
Time to recover from production failure
├── Elite: < 1 hour
├── High: < 1 day
├── Medium: < 1 week
└── Low: 1 week to 1 month

Change Failure Rate
Percentage of deployments causing failure
├── Elite: 0-15%
├── High: 16-30%
├── Medium: 31-45%
└── Low: 46-60%
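The four metrics above can be derived from a log of deployment records. The following is a minimal sketch, assuming each record carries a commit time, a deploy time, a failure flag, and an optional restore time; the `Deployment` type and `dora_summary` function are hypothetical, and real pipelines usually assemble this data from CI/CD and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set once a failed deploy is recovered


def dora_summary(deploys: list[Deployment], window_days: int) -> dict:
    """Compute the four DORA metrics over a reporting window of `window_days`."""
    if not deploys or window_days <= 0:
        raise ValueError("need at least one deployment and a positive window")
    failures = [d for d in deploys if d.failed]
    restore_times = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deploys) / window_days,
        # Median is used so one slow change does not dominate the lead time.
        "median_lead_time": median(d.deployed_at - d.committed_at for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_recovery": (
            sum(restore_times, timedelta()) / len(restore_times) if restore_times else None
        ),
    }
```

Using the median for lead time and the mean for recovery time is a design choice in this sketch, not something the metric definitions mandate; teams should pick one aggregation and keep it stable so trends stay comparable.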

Platform-Specific Metrics

Platform Success Metrics:

Adoption:
├── % of services in catalog
├── % of teams using templates
├── Self-service usage rate
├── Portal active users
└── Template utilization

Efficiency:
├── Time to first deployment (new service)
├── Time to provision infrastructure
├── Ticket reduction rate
├── Toil automation percentage
└── Developer time saved

Satisfaction:
├── Developer NPS
├── Platform satisfaction surveys
├── Support ticket volume
├── Documentation usefulness
└── Onboarding feedback

Quality:
├── Template adoption vs custom builds
├── Security compliance rate
├── Standards adherence
└── Incident rate for platform-built services
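Several of these metrics reduce to simple ratios. The sketch below computes a few of them, including Developer NPS using the standard definition (percentage of promoters scoring 9-10 minus percentage of detractors scoring 0-6). The function names and the input counts in the usage note are hypothetical.

```python
def percentage(part: float, whole: float) -> float:
    """Share of `part` in `whole` as a percentage, rounded to one decimal."""
    return round(100 * part / whole, 1) if whole else 0.0


def net_promoter_score(scores: list[int]) -> int:
    """NPS on 0-10 survey answers: % promoters (9-10) minus % detractors (0-6)."""
    if not scores:
        raise ValueError("no survey responses")
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return round(100 * (promoters - detractors) / len(scores))


def adoption_report(total_services: int, cataloged: int, templated: int,
                    tickets_before: int, tickets_after: int) -> dict:
    """Bundle three of the adoption and efficiency metrics listed above."""
    return {
        "catalog_coverage_pct": percentage(cataloged, total_services),
        "template_usage_pct": percentage(templated, total_services),
        "ticket_reduction_pct": percentage(tickets_before - tickets_after, tickets_before),
    }
```

For example, with 120 services of which 90 are cataloged and 48 were created from templates, and tickets falling from 200 to 150 per quarter, `adoption_report(120, 90, 48, 200, 150)` gives 75.0% catalog coverage, 40.0% template usage, and a 25.0% ticket reduction.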

Implementation Roadmap

Phased Approach

Phase 1: Foundation (3-6 months)
├── Service catalog (inventory what exists)
├── Basic documentation site
├── Initial template (1-2 golden paths)
├── Platform team formation
└── Metrics baseline

Phase 2: Self-Service (6-12 months)
├── Template library expansion
├── Self-service provisioning
├── CI/CD standardization
├── Developer portal launch
└── Adoption campaigns

Phase 3: Optimization (12-18 months)
├── Advanced templates
├── Platform APIs
├── Automation expansion
├── Cost optimization
└── Advanced analytics

Phase 4: Ecosystem (18+ months)
├── Plugin ecosystem
├── ML/data platform integration
├── Cross-team collaboration features
├── External developer experience
└── Continuous evolution

Success Criteria Per Phase:
Phase 1: 50% service discovery complete
Phase 2: 70% of new services use templates
Phase 3: 80% self-service capability
Phase 4: Platform is indispensable
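The numeric success criteria can be turned into a simple phase gate. This sketch encodes only the three phases that have a measurable target above; Phase 4 has no numeric criterion in this guide and is left out. The metric key names and the `highest_phase_met` function are assumptions made for illustration.

```python
# Phase number -> (metric key, minimum percentage), taken from the criteria above.
# The metric key names are hypothetical.
PHASE_TARGETS = {
    1: ("catalog_coverage_pct", 50),
    2: ("new_services_from_templates_pct", 70),
    3: ("self_service_coverage_pct", 80),
}


def highest_phase_met(metrics: dict[str, float]) -> int:
    """Return the highest phase whose target, and all earlier targets, are met.

    Returns 0 when even the Phase 1 target is missed. Phases are checked in
    order, so a later target cannot compensate for an earlier miss.
    """
    reached = 0
    for phase in sorted(PHASE_TARGETS):
        key, target = PHASE_TARGETS[phase]
        if metrics.get(key, 0) < target:
            break
        reached = phase
    return reached
```

Checking phases in order mirrors the roadmap: strong self-service numbers do not count as Phase 3 success if the catalog still covers less than half of the services.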

Common Anti-Patterns

Platform Anti-Patterns:

"Build It and They Will Come"
❌ Building features without user research
✓ Start with developer interviews and pain points

"One Size Fits All"
❌ Forcing every team into same workflow
✓ Provide flexibility with sensible defaults

"Platform as Gatekeeper"
❌ Adding friction and approval gates
✓ Enable self-service with guardrails

"Technical Purity"
❌ Choosing tech for platform team excitement
✓ Choose what solves developer problems

"Big Bang Launch"
❌ Building for 2 years before releasing
✓ Iterate quickly with early adopters

"Mandates Without Value"
❌ Forcing adoption via policy
✓ Make platform so good teams want to use it

"Documentation Afterthought"
❌ Minimal or outdated docs
✓ Treat docs as product feature

"Ivory Tower Platform"
❌ Platform team isolated from users
✓ Embed with product teams regularly

Best Practices

Platform Engineering Best Practices:

Treat Platform as Product
├── Have product owner/manager
├── Conduct user research
├── Prioritize based on impact
└── Measure outcomes, not outputs

Start with Golden Paths
├── Identify most common use cases
├── Create templates for those first
├── Make golden path easiest choice
└── Don't block non-golden paths

Optimize for Self-Service
├── Target <5 minutes for common tasks
├── Eliminate manual approvals where safe
├── Provide escape hatches when needed
└── Clear error messages and guidance

Build Community
├── Developer advocates/champions
├── Office hours and support channels
├── Contribution guidelines
└── Celebrate platform wins

Measure Everything
├── Adoption metrics
├── Developer satisfaction
├── Time savings
└── Platform reliability

Iterate Rapidly
├── Ship early, improve often
├── Gather feedback continuously
├── Deprecate gracefully
└── Communicate changes clearly

Related Skills

golden-paths: Designing standardized development workflows
self-service-infrastructure: Infrastructure self-service patterns
slo-sli-error-budget: Platform reliability targets
observability-patterns: Platform observability
