Node.js Performance

Use this workflow to turn Node.js performance/resource investigations into safe, reviewable PRs.

Goals

Improve execution time first: reduce p50/p95/p99 latency and increase throughput without changing intended behavior.
Reduce CPU, memory, event-loop lag, I/O pressure, or lock contention when it supports execution-time gains.
Ship small, isolated changes with measurable impact.

Operating Rules

Work on one optimization per PR.
Always choose the highest expected-impact task first.
Confirm and respect intentional behaviors before changing them.
Prefer low-risk changes in high-frequency paths.
Prioritize request/job execution-path work over bootstrap/startup micro-optimizations unless startup is on the critical path at scale.
Include evidence: targeted tests plus a before/after benchmark.

Impact-First Selection

Before coding, rank candidates using this score:

priority = (frequency × blast_radius × expected_gain) / (risk × effort)

Use 1-5 for each factor:

frequency: how often the path runs in production.
blast_radius: how many requests/jobs/users are affected.
expected_gain: estimated latency/resource improvement.
risk: probability of behavior regression.
effort: engineering time and change surface area.

Pick the top-ranked candidate, then validate with a baseline measurement.

If two candidates have similar scores, pick the one with the clearer end-to-end execution-time impact.
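The scoring formula above can be sketched as a small ranking helper. The candidate names and their 1-5 scores below are hypothetical placeholders, not measurements; only the formula and the factor names come from this document.

```javascript
'use strict';

const FACTORS = ['frequency', 'blast_radius', 'expected_gain', 'risk', 'effort'];

// priority = (frequency x blast_radius x expected_gain) / (risk x effort)
function priority(candidate) {
  for (const factor of FACTORS) {
    const value = candidate[factor];
    if (!Number.isFinite(value) || value < 1 || value > 5) {
      throw new RangeError(`${factor} must be a number from 1 to 5, got ${value}`);
    }
  }
  const gain = candidate.frequency * candidate.blast_radius * candidate.expected_gain;
  const cost = candidate.risk * candidate.effort;
  return gain / cost;
}

// Returns a new array sorted by descending priority; the input is not mutated.
function rankCandidates(list) {
  return list
    .map((c) => ({ ...c, priority: priority(c) }))
    .sort((a, b) => b.priority - a.priority);
}

// Hypothetical candidates, for illustration only.
const candidates = [
  { name: 'cache compiled schema validator', frequency: 5, blast_radius: 5, expected_gain: 3, risk: 2, effort: 2 },
  { name: 'lazy-load admin report module', frequency: 1, blast_radius: 1, expected_gain: 2, risk: 1, effort: 1 },
  { name: 'reuse HTTP keep-alive agent', frequency: 5, blast_radius: 4, expected_gain: 4, risk: 2, effort: 1 },
];

for (const c of rankCandidates(candidates)) {
  console.log(c.priority.toFixed(2), c.name);
}
// 40.00 reuse HTTP keep-alive agent
// 18.75 cache compiled schema validator
// 2.00 lazy-load admin report module
```

The score is only a way to order the discussion; the top candidate still needs a baseline measurement before any code changes.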

Prioritization Targets

Start with code that runs on every request/job/task:

Request/job wrappers and middleware.
Retry/timeout/circuit-breaker code.
Connection pools (DB/Redis/HTTP) and socket reuse.
Stream/pipeline transformations and buffering.
Serialization/deserialization hot paths (JSON, parsers, schema validation).
Queue consumers, schedulers, and worker dispatch.
Event listener attach/detach lifecycle and cleanup logic.

Deprioritize unless justified by a production profile:

One-time startup/bootstrap code.
Rare admin/debug-only flows.
Teardown paths that are not on the steady-state critical path.

Common Hot-Path Smells

Recomputing invariant values per invocation.
Re-parsing code/AST repeatedly.
Duplicate async lookups returning the same value.
Per-call heavy object allocation in common-case parsing.
Unnecessary awaits in teardown/close/dispose paths.
Missing fast paths for dominant input shapes.
Unbounded retries or retry storms under degraded dependencies.
Excessive concurrency causing memory spikes or downstream saturation.
Work done for logging/telemetry/metrics formatting even when disabled.
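As one illustration of these smells, "duplicate async lookups returning the same value" can often be addressed by sharing a single in-flight promise per key. This is a minimal sketch, not a prescribed fix: `dedupeInFlight` and `fetchTenantConfig` are hypothetical names, and the sketch assumes the lookup is safe to share between concurrent callers.

```javascript
'use strict';

// Wraps an async lookup so concurrent calls for the same key share one promise.
// Entries are removed once the promise settles, so this is not a result cache
// and failures are not remembered.
function dedupeInFlight(lookup) {
  const inFlight = new Map();
  return function dedupedLookup(key) {
    const pending = inFlight.get(key);
    if (pending) return pending;
    const promise = Promise.resolve()
      .then(() => lookup(key))
      .finally(() => inFlight.delete(key));
    inFlight.set(key, promise);
    return promise;
  };
}

// Demo with a hypothetical slow lookup that counts how often it really runs.
async function demo() {
  let calls = 0;
  async function fetchTenantConfig(tenantId) {
    calls += 1;
    await new Promise((resolve) => setTimeout(resolve, 10));
    return { tenantId, plan: 'standard' };
  }
  const getTenantConfig = dedupeInFlight(fetchTenantConfig);
  const [a, b] = await Promise.all([getTenantConfig('t1'), getTenantConfig('t1')]);
  return { calls, sameObject: a === b };
}

demo().then((result) => console.log(result));
```

Because callers receive the same object, this only fits lookups whose results are treated as read-only. If callers mutate the result, copy it before returning.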

Execution Workflow

Pick one candidate

Rank candidates and pick the highest priority score.
Explain the issue in one sentence.
State expected impact (CPU, latency, memory, event-loop lag, I/O, contention).

Prove it is hot

Add a focused micro-benchmark or scenario benchmark.
Capture baseline numbers before editing.
Prefer scenario benchmarks that include real request/job flow when the goal is execution-time improvement.
For resource issues, capture process metrics (rss, heap, FD count, event-loop delay).
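The process metrics mentioned above can be captured with Node.js built-ins. The sketch below is one possible sampler: `startResourceSampler` is a hypothetical helper name, and the FD count relies on `/proc/self/fd`, which is an assumption that only holds on Linux (the function returns `null` elsewhere).

```javascript
'use strict';

const fs = require('node:fs');
const { monitorEventLoopDelay } = require('node:perf_hooks');

const MIB = 2 ** 20;

// Linux-only assumption: each entry in /proc/self/fd is an open descriptor.
function countOpenFds() {
  try {
    return fs.readdirSync('/proc/self/fd').length;
  } catch {
    return null; // not available on this platform
  }
}

function startResourceSampler() {
  // The histogram records event-loop delay in nanoseconds.
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  histogram.enable();

  function snapshot() {
    const { rss, heapUsed, heapTotal } = process.memoryUsage();
    return {
      rssMiB: rss / MIB,
      heapUsedMiB: heapUsed / MIB,
      heapTotalMiB: heapTotal / MIB,
      fdCount: countOpenFds(),
      eventLoopDelayMs: {
        mean: histogram.mean / 1e6,
        p99: histogram.percentile(99) / 1e6,
        max: histogram.max / 1e6,
      },
    };
  }

  function stop() {
    histogram.disable();
  }

  return { snapshot, stop };
}

// Take one snapshot after the workload has had time to run.
const sampler = startResourceSampler();
setTimeout(() => {
  console.log(sampler.snapshot());
  sampler.stop();
}, 100);
```

Taking a snapshot before and after the scenario benchmark, with the same workload, gives the resource baseline to compare against later.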

Design minimal fix

Keep behavior-compatible defaults.
Add a fallback path for edge cases.
Avoid broad refactors in the same PR.

Implement

Make the smallest patch that removes repeated work.
Keep interfaces stable unless a change is necessary.

Test

Add/adjust targeted tests for new behavior and regressions.
Run the relevant package's tests first rather than defaulting to the whole-monorepo suite.
Add concurrency/degradation tests when the bug appears only under load.

Benchmark again

Re-run the same benchmark with the same parameters.
Report absolute and relative deltas.
Report execution-time deltas first (p50/p95/p99 latency, throughput), then resource deltas when applicable.
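Reporting absolute and relative deltas can be as simple as the sketch below. `describeDelta` and `formatDelta` are hypothetical helper names, and the numbers in the usage lines are made-up placeholders, not results from any real benchmark.

```javascript
'use strict';

// Computes the absolute change and the change relative to the baseline.
function describeDelta(name, before, after, unit) {
  const absolute = after - before;
  const relativePct = before === 0 ? NaN : (absolute / before) * 100;
  return { name, before, after, unit, absolute, relativePct };
}

// Renders one line such as "p95: 40 -> 30 ms (-10.00 ms, -25.0%)".
function formatDelta(d) {
  const sign = d.absolute > 0 ? '+' : '';
  return (
    `${d.name}: ${d.before} -> ${d.after} ${d.unit} ` +
    `(${sign}${d.absolute.toFixed(2)} ${d.unit}, ${sign}${d.relativePct.toFixed(1)}%)`
  );
}

// Placeholder numbers for illustration only.
console.log(formatDelta(describeDelta('p95', 40, 30, 'ms')));
// p95: 40 -> 30 ms (-10.00 ms, -25.0%)
console.log(formatDelta(describeDelta('throughput', 1000, 1250, 'req/s')));
// throughput: 1000 -> 1250 req/s (+250.00 req/s, +25.0%)
```

For latency a negative delta is an improvement, while for throughput a positive delta is; stating the direction next to each metric avoids misreading the report.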

Package PR

Branch naming: codex/perf-<area>-<change>.
Commit message: perf(<package>): <what changed>.
Include risk notes and state how simple the change is to roll back.

Iterate

Wait for review, then move to the next isolated improvement.

Benchmarking Guidance

Keep benchmark scope narrow to isolate one change.
Use warmup iterations.
Measure both:
micro: operation-level overhead.
scenario: request/job flow, concurrency, and degraded-dependency conditions.
For execution-time work, scenario numbers are the decision-maker; micro numbers are supporting evidence.
Always print:
total time
per-op time
p50/p95/p99 latency when applicable
speedup ratio
iterations and workload shape
resource counters (rss, heap, handles, event-loop delay) when relevant
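A micro-benchmark that follows this guidance might look like the sketch below. It is a minimal harness under stated assumptions, not a replacement for a dedicated benchmarking tool: `bench` and `percentile` are hypothetical helpers, the percentile uses the nearest-rank method, and the regex workload is only an example of "recomputing invariant values per invocation". Timing every call individually adds overhead of its own, so the per-call percentiles are best read as relative, not absolute, numbers.

```javascript
'use strict';

const { performance } = require('node:perf_hooks');

// Nearest-rank percentile over an ascending-sorted array (or typed array).
function percentile(sorted, p) {
  if (sorted.length === 0) return NaN;
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank - 1))];
}

function bench(label, fn, { warmup = 1000, iterations = 10000 } = {}) {
  for (let i = 0; i < warmup; i++) fn(); // let the JIT settle before measuring

  const samples = new Float64Array(iterations);
  const start = performance.now();
  for (let i = 0; i < iterations; i++) {
    const t0 = performance.now();
    fn();
    samples[i] = performance.now() - t0;
  }
  const totalMs = performance.now() - start;
  samples.sort(); // typed arrays sort numerically by default

  return {
    label,
    warmup,
    iterations,
    totalMs,
    perOpUs: (totalMs / iterations) * 1000,
    p50Ms: percentile(samples, 50),
    p95Ms: percentile(samples, 95),
    p99Ms: percentile(samples, 99),
  };
}

function report(r) {
  console.log(
    `${r.label}: total=${r.totalMs.toFixed(2)}ms per-op=${r.perOpUs.toFixed(3)}us ` +
      `p50=${r.p50Ms.toFixed(4)}ms p95=${r.p95Ms.toFixed(4)}ms p99=${r.p99Ms.toFixed(4)}ms ` +
      `iterations=${r.iterations} warmup=${r.warmup}`
  );
}

// Example workload: build a RegExp per call versus reuse a hoisted one.
const input = 'user-12345';
const HOISTED = /^user-(\d+)$/;
const baseline = bench('regex built per call', () => new RegExp('^user-(\\d+)$').test(input));
const candidate = bench('regex hoisted', () => HOISTED.test(input));

report(baseline);
report(candidate);
console.log(`speedup ratio: ${(baseline.totalMs / candidate.totalMs).toFixed(2)}x`);
```

The same `bench` options should be used for the before and after runs so the two results stay comparable.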

Resource Exhaustion Checklist

Cap concurrency at each boundary (ingress, queue, downstream clients).
Ensure timeouts and cancellation are wired end-to-end.
Ensure retries are bounded and jittered.
Confirm listeners/timers/intervals are always cleaned up.
Confirm streams are closed/destroyed on success and error paths.
Confirm object caches have size/TTL controls.
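Two items from this checklist, capped concurrency and bounded, jittered retries with cancellation, can be sketched with built-ins only. `mapWithLimit` and `retryWithJitter` are hypothetical helpers, the default attempt counts and delays are arbitrary placeholders, and the backoff uses a "full jitter" scheme (a random delay between zero and an exponentially growing ceiling), which is one common choice among several.

```javascript
'use strict';

const { setTimeout: sleep } = require('node:timers/promises');

// Runs task(item, index) over items with at most `limit` tasks in flight.
async function mapWithLimit(items, limit, task) {
  const results = new Array(items.length);
  let next = 0;
  async function worker() {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index], index);
    }
  }
  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Retries `operation` at most `maxAttempts` times, sleeping a random time in
// [0, ceiling) between attempts; the ceiling doubles per attempt up to maxDelayMs.
// An optional AbortSignal cancels both the checks between attempts and the sleep.
async function retryWithJitter(
  operation,
  { maxAttempts = 3, baseDelayMs = 50, maxDelayMs = 1000, signal } = {}
) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    signal?.throwIfAborted();
    try {
      return await operation({ attempt, signal });
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts) break;
      const ceiling = Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1));
      await sleep(Math.random() * ceiling, undefined, { signal });
    }
  }
  throw lastError;
}

// Usage: at most 2 concurrent calls, each retried up to 3 times.
async function main() {
  const doubled = await mapWithLimit([1, 2, 3, 4], 2, (n) =>
    retryWithJitter(async () => n * 2, { maxAttempts: 3, baseDelayMs: 5 })
  );
  console.log(doubled);
}

main();
```

Passing the same `signal` into the wrapped operation (for example, to an HTTP client that accepts one) is what makes the cancellation reach the downstream call instead of stopping at the retry loop.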

CI / Flake Handling

If CI-only failures appear, add temporary diagnostic output to the affected tests and remove it once the cause is found.
Serialize only affected flaky tests when resource contention is the cause.
Keep determinism improvements in test code, not production code, unless required.

Output Template

For each PR, report:

Issue being fixed.
Why it matters under load.
Code locations changed.
Tests run and results.
Benchmark before/after numbers (execution first: p50/p95/p99 and throughput).
Risk assessment.
Next candidate optimization.