go-performance-best-practices

Go performance optimization guidelines for profiling, allocation, GC tuning, concurrency, PGO, and I/O. This skill should be used when writing, reviewing, or optimizing Go code for performance. Triggers on tasks involving slow services, high latency, high memory usage, memory leaks, goroutine leaks, GC pressure, CPU profiling, pprof analysis, allocation reduction, sync.Pool, mutex contention, HTTP client tuning, Profile-Guided Optimization, GOMEMLIMIT tuning, Go 1.24 features, Swiss Tables, or any Go performance investigation.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "go-performance-best-practices" with this command: npx skills add mcart13/dev-skills/mcart13-dev-skills-go-performance-best-practices

Go Performance Best Practices

Comprehensive performance optimization guide for Go codebases. Contains 41 rules across 8 categories with real-world benchmarks, BOMvault-specific examples, and proven optimization patterns from 10+ years of production experience.

When to Apply

Reference these guidelines when:

  • Writing or refactoring Go code
  • Tuning latency, throughput, allocation rate, or GC behavior
  • Investigating performance regressions
  • Reviewing code for performance issues
  • Debugging memory leaks or goroutine leaks
  • Optimizing containerized services (ECS, Kubernetes)

The Performance Optimization Workflow

Phase 1: Measure First (Don't Guess)

Never optimize without data. The #1 mistake is optimizing based on intuition.

# Step 1: Establish baseline with benchmarks
go test -bench=. -benchmem -count=5 ./... | tee baseline.txt

# Step 2: Generate CPU profile for hot paths
go test -bench=BenchmarkCriticalPath -cpuprofile=cpu.prof
go tool pprof -http=:8080 cpu.prof

# Step 3: Generate heap profile for allocations
go test -bench=BenchmarkCriticalPath -memprofile=heap.prof
go tool pprof -http=:8080 heap.prof

# Step 4: Check allocation counts (correlates with latency)
go tool pprof -alloc_objects heap.prof

Key pprof views:

| View | Use For |
|---|---|
| top | Quick ranking of hot functions |
| list funcname | Line-by-line attribution |
| web | Visual call graph |
| flame | Flame graph for deep call stacks |
| peek funcname | Callers and callees |

Phase 2: Identify the Bottleneck

Use the right profile for the right problem:

| Symptom | Profile Type | pprof Flag |
|---|---|---|
| High CPU usage | CPU | -cpuprofile |
| High memory usage | Heap (inuse) | -memprofile + -inuse_space |
| High allocation rate / GC pressure | Heap (alloc) | -memprofile + -alloc_objects |
| Goroutine leaks | Goroutine | runtime/pprof.Lookup("goroutine") |
| Lock contention | Mutex | -mutexprofile |
| Blocking operations | Block | -blockprofile |

Quick diagnosis commands:

# CPU: What's using the most cycles?
go tool pprof -top cpu.prof

# Memory: What's consuming the most heap?
go tool pprof -top -inuse_space heap.prof

# Allocations: What's creating the most objects?
go tool pprof -top -alloc_objects heap.prof

# Compare before/after
go tool pprof -base baseline.prof optimized.prof

Phase 3: Apply Targeted Optimization

Match the symptom to the optimization category:

| Symptom | Category | Key Rules |
|---|---|---|
| CPU-bound | Work Avoidance | work-cache-*, work-short-circuit-* |
| Memory-bound | Allocation | alloc-preallocate-*, alloc-copy-to-avoid-retention |
| GC pauses | GC Tuning | gc-set-gomemlimit, gc-use-sync-pool |
| I/O latency | I/O | io-buffered-io, io-reuse-http-client |
| Lock contention | Concurrency | conc-reduce-lock-contention, conc-use-atomics |
| Goroutine explosion | Concurrency | conc-limit-goroutines, conc-bounded-channels |

Phase 4: Verify Improvement

# Run benchmark again
go test -bench=. -benchmem -count=5 ./... | tee optimized.txt

# Compare results
benchstat baseline.txt optimized.txt

# Verify no regressions in other benchmarks

Success criteria:

  • Measurable improvement (not just "feels faster")
  • No regressions in other areas
  • Code remains readable and maintainable
  • Changes are justified by data

Common Optimization Scenarios

Scenario 1: High Latency / Slow Response Times

Symptoms: P99 latency spikes, slow API responses, timeouts

Diagnosis:

# CPU profile during slow requests
curl "http://localhost:8080/debug/pprof/profile?seconds=30" > cpu.prof
go tool pprof -http=:8080 cpu.prof

Common causes and fixes:

| Cause | Indicator | Fix |
|---|---|---|
| JSON encoding | encoding/json in top | Use json.NewEncoder streaming, consider jsoniter |
| Regex compilation | regexp.Compile in hot path | Cache compiled regex at init |
| Slice/map scanning | Loops in profile | Convert to map lookup |
| String concatenation | + operator in loops | Use strings.Builder |
| Excessive logging | Logger in top | Reduce log level in hot path |
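The strings.Builder fix above can be sketched as follows (joinIDs is an illustrative name, not part of this skill's rule set):

```go
package main

import (
	"fmt"
	"strings"
)

// joinIDs builds a comma-separated list. A strings.Builder appends in
// place, avoiding the O(n^2) copying that `s += part` incurs in a loop.
func joinIDs(ids []string) string {
	var b strings.Builder
	b.Grow(len(ids) * 8) // rough size estimate avoids repeated growth
	for i, id := range ids {
		if i > 0 {
			b.WriteByte(',')
		}
		b.WriteString(id)
	}
	return b.String()
}

func main() {
	fmt.Println(joinIDs([]string{"a", "b", "c"})) // a,b,c
}
```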

Scenario 2: High Memory Usage / OOM Kills

Symptoms: Container OOM killed, memory growing over time, swap thrashing

Diagnosis:

# Heap profile
curl http://localhost:8080/debug/pprof/heap > heap.prof
go tool pprof -inuse_space -top heap.prof

# Check for memory leaks (growing allocations)
go tool pprof -alloc_space -top heap.prof

Common causes and fixes:

| Cause | Indicator | Fix |
|---|---|---|
| Large slice retention | append with small subslices | copy() to new slice |
| Unbounded caches | Map growing without eviction | Add LRU/TTL eviction |
| io.ReadAll on large files | Large []byte allocations | Stream with io.Copy |
| String/[]byte conversions | runtime.stringtoslicebyte | Stay in one domain |
| Goroutine leaks | Goroutine count growing | Check context cancellation |
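The copy() fix for large slice retention looks like this in practice (header is an illustrative helper):

```go
package main

import "fmt"

// header returns the first n bytes of buf without pinning the whole
// backing array: returning buf[:n] directly would keep all of buf
// reachable for as long as the subslice lives.
func header(buf []byte, n int) []byte {
	h := make([]byte, n)
	copy(h, buf[:n])
	return h // buf can now be garbage collected
}

func main() {
	big := make([]byte, 1<<20) // 1 MiB buffer
	copy(big, "magic")
	fmt.Printf("%s\n", header(big, 5)) // magic
}
```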

Scenario 3: High GC Pressure / CPU Spent in GC

Symptoms: gc_pause_seconds high, runtime.mallocgc in CPU profile

Diagnosis:

# Check GC stats
GODEBUG=gctrace=1 ./myservice 2>&1 | head -20

# Allocation profile
go tool pprof -alloc_objects -top heap.prof

Common causes and fixes:

| Cause | Indicator | Fix |
|---|---|---|
| Many small allocations | High alloc_objects | Use sync.Pool |
| Creating slices in loops | make([]T, ...) in hot path | Preallocate or pool |
| fmt.Sprintf in hot path | fmt.* allocations | Use strconv |
| Interface boxing | interface{} conversions | Use generics or concrete types |
| Not setting GOMEMLIMIT | Frequent GC cycles | Set GOMEMLIMIT to 80-90% of container memory |
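A sketch of the strconv replacement for fmt.Sprintf in a hot path (appendMetric is an illustrative name):

```go
package main

import (
	"fmt"
	"strconv"
)

// strconv.AppendInt writes digits into an existing buffer, avoiding
// the interface boxing and temporary string that fmt.Sprintf("%d", v)
// allocates on every call.
func appendMetric(buf []byte, name string, v int64) []byte {
	buf = append(buf, name...)
	buf = append(buf, '=')
	return strconv.AppendInt(buf, v, 10)
}

func main() {
	buf := make([]byte, 0, 64) // reuse this buffer across calls in a real hot path
	fmt.Println(string(appendMetric(buf, "requests", 42))) // requests=42
}
```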

Scenario 4: Goroutine Leaks / Count Growing

Symptoms: Goroutine count increases over time, eventual resource exhaustion

Diagnosis:

# Goroutine profile (full stacks)
curl "http://localhost:8080/debug/pprof/goroutine?debug=2" > goroutine.txt
head -100 goroutine.txt

# Count by state
curl "http://localhost:8080/debug/pprof/goroutine?debug=1" | head -50

Common causes and fixes:

| Cause | Indicator | Fix |
|---|---|---|
| Blocked channel receive | chan receive in stack | Add timeout or close channel |
| HTTP client no timeout | net/http.(*persistConn).readLoop | Set client timeout |
| Ticker not stopped | time.Tick in stack | Use time.NewTicker + defer Stop() |
| Context not cancelled | context.Background() everywhere | Pass and check context |
| Worker pool leak | Workers blocked on a channel that is never closed | Proper shutdown signaling |

Scenario 5: Lock Contention / Serialized Execution

Symptoms: CPU not fully utilized, goroutines blocked on mutex

Diagnosis:

# Mutex profile (must be enabled)
curl http://localhost:8080/debug/pprof/mutex > mutex.prof
go tool pprof -top mutex.prof

# Block profile
curl http://localhost:8080/debug/pprof/block > block.prof
go tool pprof -top block.prof

Common causes and fixes:

| Cause | Indicator | Fix |
|---|---|---|
| Global mutex | Single lock in mutex profile | Shard by key |
| Write lock for reads | sync.Mutex on read-heavy map | Use sync.RWMutex |
| Lock held during I/O | I/O calls while holding lock | Release lock before I/O |
| Atomic operations on struct | atomic.Value for config | Use atomic.Pointer[T] |
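The atomic.Pointer[T] fix from the last row, sketched for a hot-reloaded config (Config and RateLimit are illustrative):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type Config struct {
	RateLimit int
}

// Readers call Load with no lock at all; a writer swaps the entire
// struct in one atomic store. atomic.Pointer[T] (Go 1.19+) replaces
// the interface-based atomic.Value and needs no type assertion.
var current atomic.Pointer[Config]

func main() {
	current.Store(&Config{RateLimit: 100})
	fmt.Println(current.Load().RateLimit) // 100

	current.Store(&Config{RateLimit: 250}) // hot reload
	fmt.Println(current.Load().RateLimit)  // 250
}
```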

BOMvault Service Optimization Guide

License Enricher

Profile: CPU-bound, high allocation rate from parsing

Key optimizations:

  • Cache compiled SPDX license regex patterns at init
  • Pool bytes.Buffer for license text processing
  • Preallocate slice for AffectedPackages based on typical size
  • Stream large license files instead of io.ReadAll

// BOMvault license-enricher pattern
var (
    spdxRegex = regexp.MustCompile(`^[A-Za-z0-9][A-Za-z0-9.-]*$`)
    bufPool   = sync.Pool{New: func() any { return new(bytes.Buffer) }}
)

func (e *Enricher) ProcessLicense(data []byte) (*License, error) {
    buf := bufPool.Get().(*bytes.Buffer)
    buf.Reset()
    defer bufPool.Put(buf)
    // ... use buf for processing, then decode the result;
    // decodeLicense stands in for the elided parsing step
    return decodeLicense(buf.Bytes())
}

Vulnerability Enricher

Profile: I/O-bound (NVD API), memory spikes from CVE data

Key optimizations:

  • Reuse http.Client with connection pooling
  • Stream JSON responses for large CVE feeds
  • Set GOMEMLIMIT to 80% of container memory
  • Use map for CVE ID lookups instead of slice scanning
  • Batch database inserts (100-500 per batch)

// BOMvault vulnerability-enricher pattern
var nvdClient = &http.Client{
    Timeout: 30 * time.Second,
    Transport: &http.Transport{
        MaxIdleConns:        100,
        MaxIdleConnsPerHost: 10,
        IdleConnTimeout:     90 * time.Second,
    },
}

type CVEIndex struct {
    byID map[string]*CVE  // O(1) lookup
}

Graph Ingest

Profile: Memory-bound, large SBOM processing

Key optimizations:

  • Stream SBOM JSON parsing with json.Decoder
  • Copy component slices to avoid retaining entire SBOM
  • Use GOMEMLIMIT with soft memory limit
  • Bounded worker pool for parallel component processing
  • Context timeouts for database operations

// BOMvault graph-ingest pattern
func (g *GraphIngest) ProcessSBOM(ctx context.Context, r io.Reader) error {
    dec := json.NewDecoder(r)  // Stream, don't ReadAll

    // Bounded parallelism
    sem := make(chan struct{}, 10)
    var wg sync.WaitGroup

    for dec.More() {
        var component Component
        if err := dec.Decode(&component); err != nil {
            return err
        }

        sem <- struct{}{}
        wg.Add(1)
        go func(c Component) {
            defer wg.Done()
            defer func() { <-sem }()
            g.processComponent(ctx, c)
        }(component)
    }
    wg.Wait()  // don't return while workers are still in flight
    return nil
}

Alert Writer

Profile: I/O-bound (SARIF generation), batch processing

Key optimizations:

  • Precompute report templates at startup
  • Batch writes to reduce syscalls
  • Pool buffers for SARIF report generation
  • Use strings.Builder for alert message construction

// BOMvault alert-writer pattern
var (
    reportTemplates = template.Must(template.ParseGlob("templates/*.html"))
    bufPool         = sync.Pool{New: func() any { return new(bytes.Buffer) }}
)

func (w *AlertWriter) GenerateSARIF(findings []*Finding) ([]byte, error) {
    buf := bufPool.Get().(*bytes.Buffer)
    buf.Reset()
    buf.Grow(len(findings) * 500)  // Estimate size
    defer bufPool.Put(buf)

    // ... batch writes into buf, then copy out: returning buf.Bytes()
    // directly would hand out memory that goes back into the pool
    out := make([]byte, buf.Len())
    copy(out, buf.Bytes())
    return out, nil
}

Rule Categories by Priority

| Priority | Category | Impact | Prefix |
|---|---|---|---|
| 1 | Measurement & Profiling | CRITICAL | prof- |
| 2 | Allocation & Data Structures | HIGH | alloc- |
| 3 | Strings, Bytes & Encoding | HIGH | bytes- |
| 4 | Concurrency & Synchronization | HIGH | conc- |
| 5 | GC & Memory Limits | HIGH | gc- |
| 6 | I/O & Networking | HIGH | io- |
| 7 | Runtime & Scheduling | MEDIUM | rt- |
| 8 | Work Avoidance & Caching | MEDIUM | work- |

Quick Reference

1. Measurement & Profiling (CRITICAL)

| Rule | Impact | When to Apply |
|---|---|---|
| prof-use-testing-benchmarks | Foundation | Always benchmark before optimizing |
| prof-report-allocs | Foundation | When allocation rate matters |
| prof-benchmark-timers | Foundation | When setup skews results |
| prof-cpu-profile | Foundation | CPU-bound workloads |
| prof-heap-profile | Foundation | Memory issues, GC pressure |

2. Allocation & Data Structures (HIGH)

| Rule | Impact | When to Apply |
|---|---|---|
| alloc-preallocate-slices | 2-10x | Known size, append loops |
| alloc-preallocate-maps | 2-5x | Known cardinality |
| alloc-copy-to-avoid-retention | Memory leak | Subslices of large arrays |
| alloc-use-copy-builtin | 2-3x | Slice-to-slice moves |
| alloc-avoid-string-byte-conv | 2x | Frequent conversions |
| alloc-use-zero-value-buffers | Minor | Buffer initialization |
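A sketch of alloc-preallocate-slices, the first rule above (squares is an illustrative function):

```go
package main

import "fmt"

// squares preallocates capacity, so the append loop never reallocates
// or copies; a nil slice would instead grow through ~log2(n) steps,
// each one an allocation plus a copy.
func squares(n int) []int {
	out := make([]int, 0, n) // capacity known up front
	for i := 0; i < n; i++ {
		out = append(out, i*i)
	}
	return out
}

func main() {
	s := squares(4)
	fmt.Println(s, len(s), cap(s)) // [0 1 4 9] 4 4
}
```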

3. Strings, Bytes & Encoding (HIGH)

| Rule | Impact | When to Apply |
|---|---|---|
| bytes-use-strings-builder | 100-1000x | String concatenation loops (vs + operator) |
| bytes-use-bytes-buffer | 10-100x | Byte accumulation |
| bytes-grow-when-known | 2-5x | Known final size |
| bytes-avoid-fmt-in-hot-path | 5-10x | Number formatting |
| bytes-precompile-regexp | 10-100x | Regex in hot path |

4. Concurrency & Synchronization (HIGH)

| Rule | Impact | When to Apply |
|---|---|---|
| conc-limit-goroutines | Stability | Unbounded parallelism |
| conc-bounded-channels | 2-5x | Burst absorption |
| conc-use-context-cancel | Resource safety | Long-running operations |
| conc-reduce-lock-contention | 2-10x | Mutex in profile |
| conc-use-atomics | 5-10x | Simple counters |
| conc-pass-context | Resource safety | All API boundaries |

5. GC & Memory Limits (HIGH)

| Rule | Impact | When to Apply |
|---|---|---|
| gc-set-gomemlimit | OOM prevention | Containerized apps |
| gc-tune-gogc | CPU/memory tradeoff | GC overhead visible |
| gc-use-sync-pool | 10-50x | Short-lived buffers |
| gc-reset-before-put | Memory leak | Pooled objects with refs |
| gc-avoid-pooling-large | Memory | Large objects (>32KB) |

6. I/O & Networking (HIGH)

| Rule | Impact | When to Apply |
|---|---|---|
| io-buffered-io | 10x | Unbuffered file I/O |
| io-stream-large-bodies | O(1) memory | Large HTTP bodies |
| io-reuse-http-client | 7-10x | Multiple HTTP requests |
| io-tune-transport | 2-5x | High-concurrency HTTP |
| io-set-timeouts | Stability | All HTTP servers/clients |

7. Runtime & Scheduling (MEDIUM)

| Rule | Impact | When to Apply |
|---|---|---|
| rt-avoid-busy-loop | 100x CPU | Polling loops |
| rt-stop-tickers | Resource leak | time.NewTicker usage |
| rt-set-gomaxprocs | Container CPU | Docker/ECS/K8s |
| rt-use-timeout-contexts | Stability | External calls |

8. Work Avoidance & Caching (MEDIUM)

| Rule | Impact | When to Apply |
|---|---|---|
| work-cache-compiled-regex | 10-100x | Regex in request path |
| work-cache-lookups | O(1) vs O(n) | Repeated containment checks |
| work-batch-small-writes | 3-10x | Many small writes |
| work-precompute-templates | 10-100x | Template in request path |
| work-short-circuit-common | 2-10x | Common trivial inputs |
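work-cache-lookups sketched: build a set once at startup and probe it per request, turning an O(n) slice scan into an O(1) map lookup (allowSet and the purl values are illustrative):

```go
package main

import "fmt"

// allowSet converts a slice that would otherwise be scanned with a
// loop into a set. struct{} values cost zero bytes per entry.
func allowSet(ids []string) map[string]struct{} {
	s := make(map[string]struct{}, len(ids)) // preallocate known cardinality
	for _, id := range ids {
		s[id] = struct{}{}
	}
	return s
}

func main() {
	allowed := allowSet([]string{"pkg:golang/stdlib", "pkg:npm/left-pad"})
	_, ok := allowed["pkg:npm/left-pad"]
	fmt.Println(ok) // true
}
```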

Decision Trees

"My service is slow"

Is it CPU-bound? (CPU near 100%)
├── Yes → Profile CPU
│   ├── Hot function is I/O → Check io-* rules
│   ├── Hot function is encoding → Check bytes-* rules
│   ├── Hot function is your code → Check work-* rules
│   └── Hot function is GC → Check gc-* rules
└── No → Profile for blocking
    ├── Mutex contention → Check conc-reduce-lock-contention
    ├── Channel blocking → Check conc-bounded-channels
    ├── Network I/O → Check io-* rules
    └── Disk I/O → Check io-buffered-io

"My service uses too much memory"

Is memory growing over time?
├── Yes (leak) →
│   ├── Goroutine count growing → Check context cancellation
│   ├── Map growing → Add eviction/TTL
│   ├── Slice retention → Use copy() for subslices
│   └── Pooled object refs → Reset before Put
└── No (steady but high) →
    ├── Large allocations → Stream instead of ReadAll
    ├── Many small allocations → Use sync.Pool
    ├── High peak usage → Set GOMEMLIMIT
    └── Buffer reallocation → Preallocate with known size

"My service has GC problems"

Is GC taking too much CPU?
├── Yes →
│   ├── Many objects → Pool short-lived objects
│   ├── Large heap → Set GOMEMLIMIT higher
│   └── Frequent cycles → Increase GOGC (200-400)
└── No, but pauses are long →
    ├── Large heap → Reduce allocation rate
    └── Pointer-heavy structures → Consider flat arrays

Profiling Cheat Sheet

Enable pprof in Production

import (
    "log"
    "net/http"
    _ "net/http/pprof"  // registers /debug/pprof handlers on DefaultServeMux
)

func main() {
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    // ... rest of app
}

Common pprof Commands

# Interactive mode
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"
go tool pprof http://localhost:6060/debug/pprof/heap

# Web UI (recommended)
go tool pprof -http=:8080 cpu.prof

# Command-line analysis
go tool pprof -top cpu.prof
go tool pprof -list=FunctionName cpu.prof
go tool pprof -png -output=profile.png cpu.prof

# Compare profiles
go tool pprof -base before.prof after.prof

# Allocation analysis
go tool pprof -alloc_objects heap.prof  # Count of allocations
go tool pprof -alloc_space heap.prof    # Bytes allocated
go tool pprof -inuse_objects heap.prof  # Current live objects
go tool pprof -inuse_space heap.prof    # Current memory usage

Benchmark Commands

# Run all benchmarks
go test -bench=. -benchmem ./...

# Run specific benchmark
go test -bench=BenchmarkProcess -benchmem

# Multiple runs for statistical significance
go test -bench=. -benchmem -count=10 | tee results.txt

# Compare results
go install golang.org/x/perf/cmd/benchstat@latest
benchstat before.txt after.txt

# Generate profiles from benchmarks
go test -bench=BenchmarkProcess -cpuprofile=cpu.prof -memprofile=mem.prof

Profile-Guided Optimization (PGO)

Go 1.21+ supports PGO for 2-7% performance improvement in production workloads.

PGO Workflow

# Step 1: Collect production CPU profile (30+ seconds recommended)
curl "http://localhost:6060/debug/pprof/profile?seconds=60" > default.pgo

# Step 2: Place profile in package directory
cp default.pgo ./cmd/myservice/default.pgo

# Step 3: Build with PGO (auto-detects default.pgo)
go build ./cmd/myservice

# Step 4: Inspect PGO decisions (pgodebug prints PGO-driven inlining
# and devirtualization choices)
go build -gcflags=-d=pgodebug=1 ./cmd/myservice

Best practices:

  • Collect profiles under realistic production load
  • Re-collect profiles periodically (weekly/monthly)
  • PGO improves inlining and devirtualization decisions
  • Works best for CPU-bound workloads

PGO Impact by Workload Type

| Workload Type | Expected Improvement | Notes |
|---|---|---|
| HTTP services | 2-4% | Helps with routing, JSON, template code |
| gRPC services | 3-5% | Protocol buffer encoding benefits |
| CLI tools | 2-3% | Shorter startup time |
| Computation-heavy | 5-7% | Best for math, parsing, encoding |

Go 1.24 Features (February 2025+)

Go 1.24 introduces significant runtime improvements:

Swiss Tables for Maps

Maps now use Swiss Tables internally for ~10% faster operations on average:

// No code changes required - automatic in Go 1.24+
m := make(map[string]int)  // Uses Swiss Tables internally

Impact: map lookup and iteration can be 10-30% faster in microbenchmarks, depending on workload.

testing.B.Loop for Benchmarks

New idiomatic benchmark pattern (Go 1.24+):

// Go 1.23 and earlier
func BenchmarkProcess(b *testing.B) {
    for i := 0; i < b.N; i++ {
        process()
    }
}

// Go 1.24+ (preferred)
func BenchmarkProcess(b *testing.B) {
    for b.Loop() {
        process()
    }
}

Benefits: Avoids common mistakes with benchmark timers, cleaner syntax.

Version Compatibility Table

| Feature | Minimum Go Version | Impact |
|---|---|---|
| Generics | 1.18 | Type-safe pools |
| GOMEMLIMIT | 1.19 | OOM prevention |
| PGO | 1.21 | 2-7% CPU improvement |
| maps stdlib package | 1.21 | Clone, Equal |
| slices stdlib package | 1.21 | Sort, Clone |
| sync.OnceFunc | 1.21 | Lazy init |
| cmp package | 1.21 | Generic compare |
| log/slog | 1.21 | Structured logs |
| Swiss Tables (maps) | 1.24 | ~10% faster maps |
| testing.B.Loop | 1.24 | Cleaner benchmarks |

References

Full Compiled Document

For the complete guide with all rules expanded: AGENTS.md

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
