System Design Generator
Create comprehensive system architecture plans from requirements.
System Design Document Template
System Design: [Feature/Product Name]
Overview
Brief description of what we're building and why.
Requirements
Functional
- User can upload videos (max 1GB)
- System processes video within 5 minutes
- User receives notification when complete
Non-Functional
- Handle 1000 uploads/day
- 99.9% uptime
- Process videos in <5 minutes (p95)
- Cost: <$0.50 per video
High-Level Architecture
┌─────────┐      ┌──────────┐      ┌─────────────┐
│ Client  │─────▶│   API    │─────▶│   Upload    │
│         │      │ Gateway  │      │   Service   │
└─────────┘      └──────────┘      └─────────────┘
                                          │
                                          ▼
                                   ┌─────────────┐
                                   │   Storage   │
                                   │    (S3)     │
                                   └─────────────┘
                                          │
                                          ▼
                                   ┌─────────────┐
                                   │ Processing  │◀─┐
                                   │   Queue     │  │
                                   └─────────────┘  │
                                          │         │
                                          ▼         │
                                   ┌─────────────┐  │
                                   │  Processor  │──┘
                                   │   Workers   │
                                   └─────────────┘
                                          │
                                          ▼
                                   ┌─────────────┐
                                   │Notification │
                                   │   Service   │
                                   └─────────────┘
Components
1. API Gateway
Responsibilities:
- Authentication
- Rate limiting
- Request routing
Technology: Kong or AWS API Gateway
Scaling: Auto-scale based on requests/sec
2. Upload Service
Responsibilities:
- Generate pre-signed S3 URLs
- Validate file metadata
- Enqueue processing jobs
API:
POST /uploads
Request: { filename, size, content_type }
Response: { upload_url, upload_id }
Technology: Node.js + Express
Scaling: Horizontal (stateless)
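The metadata validation step could look roughly like the sketch below. It assumes the 1 GB limit from the functional requirements (read here as 1024^3 bytes) and a hypothetical allow-list of content types, since the design does not say which types are accepted. `validateUploadRequest`, `MAX_UPLOAD_BYTES`, and `ALLOWED_TYPES` are illustrative names, not part of any framework.

```typescript
// Request body for POST /uploads, as described above.
interface UploadRequest {
  filename: string;
  size: number; // bytes
  content_type: string;
}

// 1 GB limit from the functional requirements (interpreted as 1024^3 bytes).
const MAX_UPLOAD_BYTES = 1024 * 1024 * 1024;

// Hypothetical allow-list; the design does not specify accepted types.
const ALLOWED_TYPES = ["video/mp4", "video/quicktime", "video/webm"];

// Returns a list of validation errors; an empty list means the request is valid.
function validateUploadRequest(req: UploadRequest): string[] {
  const errors: string[] = [];
  if (req.filename.trim().length === 0) {
    errors.push("filename is required");
  }
  if (!Number.isInteger(req.size) || req.size <= 0) {
    errors.push("size must be a positive integer number of bytes");
  } else if (req.size > MAX_UPLOAD_BYTES) {
    errors.push("size exceeds the 1 GB limit");
  }
  if (ALLOWED_TYPES.indexOf(req.content_type) === -1) {
    errors.push("unsupported content_type");
  }
  return errors;
}
```

Rejecting oversized or unsupported files before issuing a pre-signed URL keeps invalid uploads from ever reaching S3 or the queue.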
3. Storage (S3)
Responsibilities:
- Store raw videos
- Store processed outputs
- Serve content via CDN
Structure:
/uploads/{user_id}/{upload_id}/original.mp4
/processed/{user_id}/{upload_id}/output.mp4
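The key layout above can be produced by a small helper. This is a sketch: `storageKeys` is a hypothetical name, the ids are assumed to be URL-safe strings such as UUIDs, and the leading slash is dropped on the assumption that the paths are used as S3 object keys.

```typescript
// Builds the object keys for one upload, following the layout above.
// Assumption: ids contain only letters, digits, '_' or '-'.
function storageKeys(
  userId: string,
  uploadId: string
): { original: string; processed: string } {
  const idPattern = /^[A-Za-z0-9_-]+$/;
  if (!idPattern.test(userId) || !idPattern.test(uploadId)) {
    // Reject anything that could escape the per-user prefix (e.g. "../").
    throw new Error("ids must contain only letters, digits, '_' or '-'");
  }
  return {
    original: `uploads/${userId}/${uploadId}/original.mp4`,
    processed: `processed/${userId}/${uploadId}/output.mp4`,
  };
}
```

Deriving both keys from the same ids means a worker can find the output location without an extra lookup.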
4. Processing Queue
Responsibilities:
- Buffer processing jobs
- Ensure at-least-once delivery
- DLQ for failed jobs
Technology: AWS SQS
Configuration:
- Visibility timeout: 15 minutes
- DLQ after 3 retries
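Expressed as SQS queue attributes, the configuration above might look like the following sketch. `VisibilityTimeout` (in seconds) and `RedrivePolicy` (a JSON string with `deadLetterTargetArn` and `maxReceiveCount`) are SQS attribute names; the helper name and the ARN are placeholders, and reading "3 retries" as `maxReceiveCount: 3` is an assumption.

```typescript
// Queue attributes in the string-valued shape SQS uses for queue settings.
function processingQueueAttributes(
  dlqArn: string
): { VisibilityTimeout: string; RedrivePolicy: string } {
  return {
    // 15 minutes, expressed in seconds.
    VisibilityTimeout: String(15 * 60),
    // After maxReceiveCount receives without a delete, the message moves to the DLQ.
    RedrivePolicy: JSON.stringify({
      deadLetterTargetArn: dlqArn,
      maxReceiveCount: 3, // assumption: "3 retries" means 3 receives in total
    }),
  };
}

// Placeholder ARN, for illustration only.
const queueAttributes = processingQueueAttributes(
  "arn:aws:sqs:us-east-1:123456789012:video-processing-dlq"
);
```

The visibility timeout should stay comfortably above the worst-case processing time, otherwise a slow job becomes visible again and is picked up twice.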
5. Processor Workers
Responsibilities:
- Transcode videos
- Generate thumbnails
- Update database
Technology: Python + FFmpeg
Scaling: Auto-scale on queue depth
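Scaling on queue depth needs a rule that maps backlog to a worker count. One possible rule is sketched below; the drain target, the bounds, and the name `desiredWorkers` are assumptions, and each worker is assumed to handle one video at a time.

```typescript
// Target number of workers for a given backlog.
// Assumption: the backlog should drain within `targetDrainMinutes`.
function desiredWorkers(
  queueDepth: number,
  minutesPerVideo: number,
  targetDrainMinutes: number,
  minWorkers: number,
  maxWorkers: number
): number {
  // How many videos one worker can finish inside the drain window (at least 1).
  const videosPerWorker = Math.max(1, Math.floor(targetDrainMinutes / minutesPerVideo));
  const needed = Math.ceil(queueDepth / videosPerWorker);
  // Clamp to the configured fleet bounds.
  return Math.min(maxWorkers, Math.max(minWorkers, needed));
}
```

With a 5-minute drain target and 5 minutes per video, this reduces to one worker per queued job, capped at `maxWorkers`.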
Data Flow
Upload Flow
- Client requests an upload URL from the Upload Service (POST /uploads)
- Upload Service generates a pre-signed S3 URL and returns it with the upload_id
- Client uploads the file directly to S3
- Client notifies the Upload Service that the upload is complete
- Upload Service enqueues a processing job
- Upload Service responds to the client, echoing the upload_id
Processing Flow
- Worker polls queue for jobs
- Downloads video from S3
- Processes video (transcode, thumbnail)
- Uploads results to S3
- Updates database status
- Sends notification
- Deletes message from queue
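The ordering of these steps matters, in particular deleting the message last. The sketch below shows one way to structure a worker's job handler with injected dependencies so the ordering can be tested without AWS. All names (`WorkerDeps`, `handleJob`, `receipt`) are hypothetical, and the signatures are synchronous only for brevity; a real worker would be asynchronous.

```typescript
type Status = "pending" | "processing" | "complete" | "failed";

interface Job {
  upload_id: string;
  receipt: string; // handle needed to delete the message
}

// Hypothetical dependencies, injected so the flow can run against fakes.
interface WorkerDeps {
  download(uploadId: string): string; // returns a local path
  transcode(localPath: string): string; // returns the output path
  uploadResult(uploadId: string, outputPath: string): string; // returns processed_url
  setStatus(uploadId: string, status: Status, processedUrl?: string): void;
  notify(uploadId: string, status: Status): void;
  deleteMessage(receipt: string): void;
}

// Handles one job. Returns true if the message was deleted from the queue.
function handleJob(job: Job, deps: WorkerDeps): boolean {
  try {
    deps.setStatus(job.upload_id, "processing");
    const input = deps.download(job.upload_id);
    const output = deps.transcode(input);
    const url = deps.uploadResult(job.upload_id, output);
    deps.setStatus(job.upload_id, "complete", url);
    deps.notify(job.upload_id, "complete");
    // Delete last: if anything above throws, the message becomes visible
    // again after the visibility timeout and the job is retried.
    deps.deleteMessage(job.receipt);
    return true;
  } catch {
    // Leave the message on the queue so the retry/DLQ policy applies.
    return false;
  }
}
```

Because the notification is sent before the delete, a failure between those two steps can produce a duplicate notification on retry, which is consistent with at-least-once delivery.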
Data Model
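// One uploaded video and its processing state.
interface Upload {
  id: string;
  user_id: string;
  filename: string;
  size: number; // bytes
  status: 'pending' | 'processing' | 'complete' | 'failed';
  original_url: string;
  processed_url?: string; // set once processing completes
  created_at: Date;
  processed_at?: Date; // set once processing completes
}

// Queue message payload for one processing job.
interface ProcessingJob {
  upload_id: string;
  attempts: number; // delivery attempts so far
  error?: string; // last failure message, if any
}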
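The `status` field implies a small state machine. The sketch below makes the allowed transitions explicit; the transition table is an assumption (the original lists only the four states), and `canTransition` is a hypothetical helper.

```typescript
type UploadStatus = "pending" | "processing" | "complete" | "failed";

// Allowed transitions. Assumptions: "complete" is terminal, a failed upload
// may be retried, and "processing" -> "processing" covers a retry after a
// worker crash left the record in the processing state.
const TRANSITIONS: Record<UploadStatus, UploadStatus[]> = {
  pending: ["processing", "failed"],
  processing: ["processing", "complete", "failed"],
  complete: [],
  failed: ["processing"],
};

// True if an upload may move from `from` to `to`.
function canTransition(from: UploadStatus, to: UploadStatus): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}
```

Checking transitions at the database update keeps a late or duplicate worker from overwriting a completed upload.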
API Contract
Upload Endpoints
POST /uploads - Request upload URL
GET /uploads/:id - Get upload status
DELETE /uploads/:id - Cancel upload
GET /uploads - List user uploads
Webhooks
POST {webhook_url}
{
  "event": "upload.completed",
  "upload_id": "...",
  "status": "complete",
  "processed_url": "..."
}
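A typed builder for this payload might look like the sketch below. The `upload.completed` shape follows the example above; the `upload.failed` event and its `error` field are assumptions added for illustration, and `buildWebhookPayload` is a hypothetical name.

```typescript
// "upload.completed" matches the example above; "upload.failed" is assumed.
type WebhookPayload =
  | { event: "upload.completed"; upload_id: string; status: "complete"; processed_url: string }
  | { event: "upload.failed"; upload_id: string; status: "failed"; error: string };

// Builds the webhook body from a processing result.
function buildWebhookPayload(
  uploadId: string,
  result: { processedUrl: string } | { error: string }
): WebhookPayload {
  if ("processedUrl" in result) {
    return {
      event: "upload.completed",
      upload_id: uploadId,
      status: "complete",
      processed_url: result.processedUrl,
    };
  }
  return { event: "upload.failed", upload_id: uploadId, status: "failed", error: result.error };
}
```

Modelling the payload as a discriminated union lets receivers switch on `event` and get the matching fields checked by the compiler.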
Scaling Considerations
Current Capacity
- 1000 uploads/day ≈ 0.7 per minute on average (1000 / 1440)
- Single worker can process 1 video every 5 minutes
- Need 4 workers on average (0.7 × 5 ≈ 3.5, rounded up); provision 5 for headroom
10x Scale (10,000/day)
- ~7 uploads per minute on average
- Need ~35 workers on average; provision up to 50 for peaks
- Use spot instances for cost savings
- Add Redis cache for status checks
100x Scale (100,000/day)
- ~70 uploads per minute on average
- Partition by region
- Use Kafka instead of SQS
- Database sharding by user_id
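The worker counts at each scale follow from simple arithmetic: arrival rate times processing time. The sketch below assumes uploads are spread evenly over the day and that each worker handles one video at a time; the `headroom` multiplier is an assumption to cover peaks, and `workersNeeded` is a hypothetical name.

```typescript
const MINUTES_PER_DAY = 24 * 60;

// Average number of busy workers, rounded up.
// `headroom` (e.g. 1.4 = 40% spare capacity) is an assumed safety factor.
function workersNeeded(
  uploadsPerDay: number,
  minutesPerVideo: number,
  headroom: number = 1
): number {
  const uploadsPerMinute = uploadsPerDay / MINUTES_PER_DAY;
  return Math.ceil(uploadsPerMinute * minutesPerVideo * headroom);
}

// Print the average and the headroom-adjusted count for 1x, 10x, and 100x.
for (const perDay of [1000, 10000, 100000]) {
  console.log(perDay, workersNeeded(perDay, 5), workersNeeded(perDay, 5, 1.4));
}
```

At 5 minutes per video the averages come to 4, 35, and 348 workers. Figures of 5 and 50 correspond to rounding the arrival rate up to 1 and 10 uploads per minute, which builds in some headroom.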
Failure Modes
S3 Unavailable
- Impact: Uploads fail
- Mitigation: Multi-region S3 replication
Queue Backed Up
- Impact: Processing delays
- Mitigation: Scale out workers earlier (lower the queue-depth threshold or shorten the scaling cooldown)
Worker Crash During Processing
- Impact: Message becomes visible again after the visibility timeout and the job is retried
- Mitigation: Idempotent processing
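Idempotent processing can be approximated with a status check before doing the work. The sketch below uses an in-memory map as a stand-in for the uploads table; `statusById`, `processOnce`, and `doWork` are hypothetical names, and a real implementation would need an atomic conditional update in the database.

```typescript
type JobStatus = "pending" | "processing" | "complete" | "failed";

// In-memory stand-in for the uploads table.
const statusById = new Map<string, JobStatus>();

// Runs `doWork` unless the upload is already complete, so a duplicate
// delivery of the same message does not repeat finished work.
function processOnce(
  uploadId: string,
  doWork: (id: string) => void
): "skipped" | "processed" {
  if (statusById.get(uploadId) === "complete") {
    return "skipped"; // duplicate delivery: the result already exists
  }
  statusById.set(uploadId, "processing");
  // If doWork throws (worker crash), the status stays "processing" and a
  // later retry runs the work again, overwriting any partial output.
  doWork(uploadId);
  statusById.set(uploadId, "complete");
  return "processed";
}
```

Writing outputs to a deterministic key per upload_id is what makes the rerun safe: a retry overwrites the same object instead of creating a second one.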
Cost Estimate
Monthly (1000 uploads/day):
- S3 Storage: $50
- S3 Transfer: $100
- SQS: $10
- Workers (EC2): $300
- Database: $100
Total: ~$560/month
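The estimate can be checked against the <$0.50 per video requirement with a few lines of arithmetic. The sketch below uses the line items above and assumes a 30-day month; the variable names are illustrative.

```typescript
// Monthly line items from the estimate above, in USD.
const lineItems: { name: string; usd: number }[] = [
  { name: "S3 Storage", usd: 50 },
  { name: "S3 Transfer", usd: 100 },
  { name: "SQS", usd: 10 },
  { name: "Workers (EC2)", usd: 300 },
  { name: "Database", usd: 100 },
];

const uploadsPerDay = 1000;
const daysPerMonth = 30; // assumption: 30-day month

// Sum the line items and divide by the monthly video count.
const totalMonthly = lineItems.reduce((sum, item) => sum + item.usd, 0);
const costPerVideo = totalMonthly / (uploadsPerDay * daysPerMonth);

console.log(totalMonthly, costPerVideo.toFixed(3));
```

The total is $560 across 30,000 videos, or roughly $0.019 per video, well under the $0.50 target.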
Security
- Pre-signed URLs expire in 1 hour
- Videos in private S3 buckets
- CloudFront signed URLs for delivery
- Rate limiting per user
Monitoring
Metrics:
- Upload success rate
- Processing time (p50, p95, p99)
- Queue depth
- Worker CPU/memory
- Error rate by type
Alerts:
- Queue depth >1000
- Processing time p95 >10 minutes
- Error rate >5%
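The alert thresholds above can be expressed as a pure function over a metrics snapshot. This is a sketch: the `Snapshot` shape, the alert names, and the nearest-rank percentile method are assumptions, and a real setup would usually define these rules in the monitoring system itself.

```typescript
// Nearest-rank percentile (p in 0..100). Returns NaN for an empty sample.
function percentile(values: number[], p: number): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  const value = sorted[rank - 1];
  return value === undefined ? NaN : value;
}

interface Snapshot {
  queueDepth: number;
  processingMinutes: number[]; // recent per-video processing times
  errors: number;
  total: number;
}

// Returns the names of the alerts that would fire for this snapshot.
function firedAlerts(s: Snapshot): string[] {
  const fired: string[] = [];
  if (s.queueDepth > 1000) fired.push("queue_depth");
  if (s.processingMinutes.length > 0 && percentile(s.processingMinutes, 95) > 10) {
    fired.push("processing_p95");
  }
  if (s.total > 0 && s.errors / s.total > 0.05) fired.push("error_rate");
  return fired;
}
```

Keeping the thresholds in one function makes it easy to test them against recorded metrics before wiring them into real alerts.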
Open Questions
- Video retention policy? (30 days? 1 year?)
- Maximum video duration? (affects processing time)
- Regional data residency requirements?
## Component Template
````markdown
### Component Name
**Responsibilities:**
- Primary responsibility
- Secondary responsibility
**Technology Stack:**
- Language: [Python/Node/Go]
- Framework: [Express/FastAPI/Gin]
- Database: [PostgreSQL/MongoDB]
**API/Interface:**
```typescript
interface ComponentAPI {
  method(params): ReturnType;
}
```
**Scaling Strategy:**
- Horizontal: Stateless, load balanced
- Vertical: Cache layer, connection pooling
**Dependencies:**
- Service A (for X)
- Database B (for persistence)
**Failure Handling:**
- Retry with exponential backoff
- Circuit breaker for downstream services
- Fallback to cached data
````
## Best Practices
1. **Start with requirements**: Functional + non-functional
2. **Draw diagrams first**: Visual clarity
3. **Define boundaries**: What's in scope vs out
4. **Document tradeoffs**: Every choice has costs
5. **Plan for failure**: What breaks and how to handle
6. **Consider scale**: Current, 10x, 100x
7. **Estimate costs**: Build vs buy decisions
8. **Leave open questions**: Don't pretend to know everything
## Output Checklist
- [ ] Requirements documented (functional + non-functional)
- [ ] High-level architecture diagram
- [ ] Component breakdown (3-7 components)
- [ ] Data flow documented
- [ ] Data model defined
- [ ] API contracts specified
- [ ] Scaling considerations (1x, 10x, 100x)
- [ ] Failure modes identified
- [ ] Cost estimate provided
- [ ] Security considerations
- [ ] Monitoring plan