Golang Performance

This skill provides guidance on optimizing Go application performance including profiling, memory management, concurrency optimization, and avoiding common performance pitfalls.

When to Use This Skill

When profiling Go applications for CPU or memory issues
When optimizing memory allocations and reducing GC pressure
When implementing efficient concurrency patterns
When analyzing escape analysis results
When optimizing hot paths in production code

Profiling with pprof

Enable Profiling in HTTP Server

import ( "net/http" _ "net/http/pprof" )

func main() { // pprof endpoints available at /debug/pprof/ go func() { http.ListenAndServe("localhost:6060", nil) }()

// Main application

}

CPU Profiling

Collect 30-second CPU profile

go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30

Interactive commands

(pprof) top10 # Top 10 functions by CPU (pprof) list FuncName # Show source with timing (pprof) web # Open flame graph in browser

Memory Profiling

Heap profile

go tool pprof http://localhost:6060/debug/pprof/heap

Allocs profile (all allocations)

go tool pprof http://localhost:6060/debug/pprof/allocs

Interactive commands

(pprof) top10 -cum # Top by cumulative allocations (pprof) list FuncName # Show allocation sites

Programmatic Profiling

import ( "os" "runtime/pprof" )

func profileCPU() { f, _ := os.Create("cpu.prof") defer f.Close()

pprof.StartCPUProfile(f)
defer pprof.StopCPUProfile()

// Code to profile

}

func profileMemory() { f, _ := os.Create("mem.prof") defer f.Close()

runtime.GC() // Get accurate stats
pprof.WriteHeapProfile(f)

}

Memory Optimization

Reduce Allocations

// BAD: Allocates on every call func Process(items []string) []string { result := []string{} for _, item := range items { result = append(result, transform(item)) } return result }

// GOOD: Pre-allocate with known capacity func Process(items []string) []string { result := make([]string, 0, len(items)) for _, item := range items { result = append(result, transform(item)) } return result }

Use sync.Pool for Frequent Allocations

var bufferPool = sync.Pool{ New: func() interface{} { return new(bytes.Buffer) }, }

func ProcessRequest(data []byte) []byte { buf := bufferPool.Get().(*bytes.Buffer) defer func() { buf.Reset() bufferPool.Put(buf) }()

// Use buffer
buf.Write(data)
return buf.Bytes()

}

Avoid String Concatenation in Loops

// BAD: O(n^2) allocations func BuildString(parts []string) string { result := "" for _, part := range parts { result += part } return result }

// GOOD: Single allocation func BuildString(parts []string) string { var builder strings.Builder for _, part := range parts { builder.WriteString(part) } return builder.String() }

Slice Memory Leaks

// BAD: Keeps entire backing array alive func GetFirst(data []byte) []byte { return data[:10] }

// GOOD: Copy to release backing array func GetFirst(data []byte) []byte { result := make([]byte, 10) copy(result, data[:10]) return result }

Escape Analysis

Show escape analysis decisions

go build -gcflags="-m" ./...

More verbose

go build -gcflags="-m -m" ./...

Avoiding Heap Escapes

// ESCAPES: Returned pointer func NewUser() *User { return &User{} // Allocated on heap }

// STAYS ON STACK: Value return func NewUser() User { return User{} // May stay on stack }

// ESCAPES: Interface conversion func Process(v interface{}) { ... }

func main() { x := 42 Process(x) // x escapes to heap }

Concurrency Optimization

Worker Pool Pattern

func ProcessItems(items []Item, workers int) []Result { jobs := make(chan Item, len(items)) results := make(chan Result, len(items))

// Start workers
var wg sync.WaitGroup
for i := 0; i &#x3C; workers; i++ {
    wg.Add(1)
    go func() {
        defer wg.Done()
        for item := range jobs {
            results &#x3C;- process(item)
        }
    }()
}

// Send jobs
for _, item := range items {
    jobs &#x3C;- item
}
close(jobs)

// Wait and collect
go func() {
    wg.Wait()
    close(results)
}()

var output []Result
for r := range results {
    output = append(output, r)
}
return output

}

Buffered Channels for Throughput

// SLOW: Unbuffered causes blocking ch := make(chan int)

// FAST: Buffer reduces contention ch := make(chan int, 100)

Avoid Lock Contention

// BAD: Global lock var mu sync.Mutex var cache = make(map[string]string)

func Get(key string) string { mu.Lock() defer mu.Unlock() return cache[key] }

// GOOD: Sharded locks type ShardedCache struct { shards [256]struct { mu sync.RWMutex items map[string]string } }

func (c *ShardedCache) getShard(key string) *struct { mu sync.RWMutex items map[string]string } { h := fnv.New32a() h.Write([]byte(key)) return &c.shards[h.Sum32()%256] }

func (c *ShardedCache) Get(key string) string { shard := c.getShard(key) shard.mu.RLock() defer shard.mu.RUnlock() return shard.items[key] }

Use sync.Map for Specific Cases

// Good for: keys written once, read many; disjoint key sets var cache sync.Map

func Get(key string) (string, bool) { v, ok := cache.Load(key) if !ok { return "", false } return v.(string), true }

func Set(key, value string) { cache.Store(key, value) }

Data Structure Optimization

Struct Field Ordering (Memory Alignment)

// BAD: 24 bytes (padding) type Bad struct { a bool // 1 byte + 7 padding b int64 // 8 bytes c bool // 1 byte + 7 padding }

// GOOD: 16 bytes (no padding) type Good struct { b int64 // 8 bytes a bool // 1 byte c bool // 1 byte + 6 padding }

Avoid Interface{} When Possible

// SLOW: Type assertions, boxing func Sum(values []interface{}) float64 { var sum float64 for _, v := range values { sum += v.(float64) } return sum }

// FAST: Concrete types func Sum(values []float64) float64 { var sum float64 for _, v := range values { sum += v } return sum }

Benchmarking Patterns

func BenchmarkProcess(b *testing.B) { data := generateTestData() b.ResetTimer() // Exclude setup time

for i := 0; i &#x3C; b.N; i++ {
    Process(data)
}

}

// Memory benchmarks func BenchmarkAllocs(b *testing.B) { b.ReportAllocs() for i := 0; i < b.N; i++ { _ = make([]byte, 1024) } }

// Compare implementations func BenchmarkComparison(b *testing.B) { b.Run("old", func(b *testing.B) { for i := 0; i < b.N; i++ { OldImplementation() } }) b.Run("new", func(b *testing.B) { for i := 0; i < b.N; i++ { NewImplementation() } }) }

Run with:

go test -bench=. -benchmem ./... go test -bench=. -benchtime=5s ./... # Longer runs

Common Pitfalls

Defer in Hot Loops

// BAD: Defer overhead per iteration for _, item := range items { mu.Lock() defer mu.Unlock() // Defers stack up! process(item) }

// GOOD: Explicit unlock for _, item := range items { mu.Lock() process(item) mu.Unlock() }

// BETTER: Extract to function for _, item := range items { processWithLock(item) }

func processWithLock(item Item) { mu.Lock() defer mu.Unlock() process(item) }

JSON Encoding Performance

// SLOW: Reflection on every call json.Marshal(v)

// FAST: Reuse encoder var buf bytes.Buffer encoder := json.NewEncoder(&buf) encoder.Encode(v)

// FASTER: Code generation (easyjson, ffjson)

Best Practices

Measure before optimizing - Profile to find actual bottlenecks
Pre-allocate slices - Use make([]T, 0, capacity) when size is known
Pool frequently allocated objects - Use sync.Pool for buffers
Minimize allocations in hot paths - Reuse objects, avoid interfaces
Right-size channels - Buffer to reduce blocking without wasting memory
Avoid premature optimization - Clarity first, optimize measured problems
Use value receivers for small structs - Avoid pointer indirection
Order struct fields by size - Largest to smallest reduces padding

golang-performance

Safety Notice

Copy this and send it to your AI assistant to learn

Collect 30-second CPU profile

Interactive commands

Heap profile

Allocs profile (all allocations)

Interactive commands

Show escape analysis decisions

More verbose

Source Transparency

Related Skills

meta-cognitive-reasoning

golang-enterprise-patterns

nextjs-architecture

file-converter