cuda-skill

CUDA & PTX Reference

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "cuda-skill" with this command: npx skills add slowlyc/cursor-gpu-skills/slowlyc-cursor-gpu-skills-cuda-skill

Documentation Locations

All documentation is under the references/ directory within this skill's install location. The base path depends on which agent tool is used:

  • Cursor: ~/.cursor/skills/cuda-skill/references/

  • Claude Code: ~/.claude/skills/cuda-skill/references/

  • Codex: ~/.agents/skills/cuda-skill/references/

Determine the actual path at runtime:

CUDA_REFS="$(find ~/.cursor/skills ~/.claude/skills ~/.agents/skills -name 'cuda-skill' -type d 2>/dev/null | head -1)/references"

All rg examples below use ~/.cursor/skills/cuda-skill/references/ as a placeholder; replace it with the actual path.
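The resolution logic can be sanity-checked against a throwaway layout, with no CUDA install or real skills directory required (the mock home below is an illustration, not a real path):

```shell
# Sketch: exercise the find | head -1 pipeline against a mock skill layout.
tmp="$(mktemp -d)"
mkdir -p "$tmp/.claude/skills/cuda-skill/references"

# Same search, rooted at the mock home instead of ~.
CUDA_REFS="$(find "$tmp/.cursor/skills" "$tmp/.claude/skills" "$tmp/.agents/skills" \
    -name 'cuda-skill' -type d 2>/dev/null | head -1)/references"

echo "$CUDA_REFS"     # ends in .claude/skills/cuda-skill/references
[ -d "$CUDA_REFS" ]   # the resolved directory exists
rm -rf "$tmp"
```

Only the first match wins, so if the skill is installed under more than one agent tool, the search order above decides which copy is used.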

references/
├── ptx-docs/              # PTX ISA 9.1 full spec (405 files, 2.3MB)
├── ptx-simple/            # PTX condensed quick-ref (13 files, 149KB)
├── cuda-runtime-docs/     # CUDA Runtime API 13.1 (107 files, 0.9MB)
├── cuda-driver-docs/      # CUDA Driver API 13.1 (128 files, 0.8MB)
├── cuda-guide/            # CUDA Programming Guide v13.1 (39 pages, 1.6MB)
│   ├── 01-introduction/   # Programming model, CUDA platform
│   ├── 02-basics/         # CUDA C++, kernels, async, memory, nvcc
│   ├── 03-advanced/       # Advanced APIs, kernel programming, driver API, multi-GPU
│   ├── 04-special-topics/ # Graphs, Unified Memory, Coop Groups, TMA, etc.
│   ├── 05-appendices/     # Compute Capabilities, C++ extensions, math funcs
│   └── INDEX.md
├── best-practices-guide/  # CUDA C++ Best Practices Guide
├── ncu-docs/              # Nsight Compute full docs (ProfilingGuide, CLI, etc.)
├── nsys-docs/             # Nsight Systems full docs (UserGuide, etc.)
├── ptx-isa.md             # PTX search guide
├── cuda-runtime.md        # Runtime API search guide
├── cuda-driver.md         # Driver API search guide
├── nsys-guide.md          # Nsight Systems quick reference
├── ncu-guide.md           # Nsight Compute quick reference
├── debugging-tools.md     # compute-sanitizer, cuda-gdb
├── nvtx-patterns.md       # NVTX instrumentation
└── performance-traps.md   # Bank conflicts, coalescing

ptx-simple/ Contents (Condensed Quick-Ref)

ptx-simple/
├── ptx-isa-arithmetic.md      # add, sub, mul, mad, fma, div, min, max
├── ptx-isa-data-types.md      # Types, cvt, rounding, pack
├── ptx-isa-memory-spaces.md   # .reg, .global, .shared, fences
├── ptx-isa-load-store.md      # ld, st, prefetch
├── ptx-isa-control-flow.md    # @p, setp, bra, call, ret, exit
├── ptx-isa-tensor-cores.md    # mma.sync, ldmatrix, wgmma
├── ptx-isa-async-copy.md      # cp.async, cp.async.bulk, TMA
├── ptx-isa-barriers.md        # bar.sync, mbarrier
├── ptx-isa-warp-ops.md        # shfl, vote, match, redux
├── ptx-isa-cache-hints.md     # Cache control
├── ptx-isa-sm90-hopper.md     # Hopper-specific (sm_90)
├── ptx-isa-sm100-blackwell.md # Blackwell-specific (sm_100, tcgen05)
└── ptx-isa-misc.md            # Other instructions

Search Strategy

Use Grep tool to search documentation. Never load entire files into context.

PTX Instruction Lookup

Find specific instruction

rg "mbarrier.init" ~/.cursor/skills/cuda-skill/references/ptx-docs/9-instruction-set/

Find WGMMA register fragments

rg "register fragment" ~/.cursor/skills/cuda-skill/references/ptx-docs/9-instruction-set/ | rg -i wgmma

Find TMA swizzling modes

rg "swizzle_mode" ~/.cursor/skills/cuda-skill/references/ptx-docs/

Quick PTX syntax lookup (condensed)

rg "wgmma" ~/.cursor/skills/cuda-skill/references/ptx-simple/ptx-isa-tensor-cores.md

CUDA Runtime API Lookup

Error code meaning

rg "cudaErrorInvalidValue" ~/.cursor/skills/cuda-skill/references/cuda-runtime-docs/

Function documentation

rg -A 20 "cudaStreamSynchronize" ~/.cursor/skills/cuda-skill/references/cuda-runtime-docs/modules/group__cudart__stream.md

Struct fields

rg "sharedMemPerBlock|multiProcessorCount" ~/.cursor/skills/cuda-skill/references/cuda-runtime-docs/data-structures/structcudadeviceprop.md

CUDA Driver API Lookup

Context management

rg -A 20 "cuCtxCreate" ~/.cursor/skills/cuda-skill/references/cuda-driver-docs/modules/group__cuda__ctx.md

Module loading

rg "cuModuleLoad" ~/.cursor/skills/cuda-skill/references/cuda-driver-docs/modules/group__cuda__module.md

Virtual memory

rg "cuMemMap" ~/.cursor/skills/cuda-skill/references/cuda-driver-docs/modules/group__cuda__va.md

CUDA Programming Guide Lookup

Compute Capabilities table

rg -A 5 "sm_90" ~/.cursor/skills/cuda-skill/references/cuda-guide/05-appendices/compute-capabilities.md

CUDA Graphs usage

rg "cudaGraph" ~/.cursor/skills/cuda-skill/references/cuda-guide/04-special-topics/cuda-graphs.md

Cooperative Groups

rg "cooperative" ~/.cursor/skills/cuda-skill/references/cuda-guide/04-special-topics/cooperative-groups.md

Unified Memory behavior

rg "managed" ~/.cursor/skills/cuda-skill/references/cuda-guide/04-special-topics/unified-memory.md

Thread Block Clusters (Hopper+)

rg "cluster" ~/.cursor/skills/cuda-skill/references/cuda-guide/01-introduction/programming-model.md

Programming Guide index (discover all topics)

cat ~/.cursor/skills/cuda-skill/references/cuda-guide/INDEX.md

Best Practices Guide Lookup

Memory coalescing best practices

rg -i "coalescing" ~/.cursor/skills/cuda-skill/references/best-practices-guide/

Occupancy optimization

rg -i "occupancy" ~/.cursor/skills/cuda-skill/references/best-practices-guide/

Shared memory usage patterns

rg -i "shared memory" ~/.cursor/skills/cuda-skill/references/best-practices-guide/

Nsight Compute Lookup

Metric meanings and collection

rg -i "metric" ~/.cursor/skills/cuda-skill/references/ncu-docs/ProfilingGuide.md

CLI usage and options

rg -i "section" ~/.cursor/skills/cuda-skill/references/ncu-docs/NsightComputeCli.md

Roofline analysis

rg -i "roofline" ~/.cursor/skills/cuda-skill/references/ncu-docs/ProfilingGuide.md

Nsight Systems Lookup

CLI profiling options

rg -i "nsys profile" ~/.cursor/skills/cuda-skill/references/nsys-docs/UserGuide.md

CUDA trace analysis

rg -i "cuda.*trace" ~/.cursor/skills/cuda-skill/references/nsys-docs/UserGuide.md

When to Use Each Source

| Need | Source | Path shorthand |
| --- | --- | --- |
| PTX instruction syntax/semantics | Full PTX docs | ptx-docs/9-instruction-set/ |
| Quick PTX syntax check | Condensed PTX | ptx-simple/ |
| State spaces, data types | Full PTX docs | ptx-docs/5-state-spaces-types-and-variables/ |
| Memory consistency model | Full PTX docs | ptx-docs/8-memory-consistency-model/ |
| Special registers (%tid, etc.) | Full PTX docs | ptx-docs/10-special-registers/ |
| Directives (.version, .target) | Full PTX docs | ptx-docs/11-directives/ |
| CUDA Runtime functions | Runtime docs | cuda-runtime-docs/modules/ |
| CUDA structs (cudaDeviceProp) | Runtime docs | cuda-runtime-docs/data-structures/ |
| Driver API (cuCtx, cuModule) | Driver docs | cuda-driver-docs/modules/ |
| sm_90 / Hopper specifics | Condensed PTX | ptx-simple/ptx-isa-sm90-hopper.md |
| sm_100 / Blackwell / tcgen05 | Condensed PTX | ptx-simple/ptx-isa-sm100-blackwell.md |
| CUDA C++ programming concepts | Programming Guide | cuda-guide/02-basics/ |
| Thread/block/grid model | Programming Guide | cuda-guide/01-introduction/programming-model.md |
| Compute Capabilities table | Programming Guide | cuda-guide/05-appendices/compute-capabilities.md |
| CUDA Graphs usage | Programming Guide | cuda-guide/04-special-topics/cuda-graphs.md |
| Unified Memory | Programming Guide | cuda-guide/04-special-topics/unified-memory.md |
| Cooperative Groups | Programming Guide | cuda-guide/04-special-topics/cooperative-groups.md |
| Async barriers/pipelines (C++) | Programming Guide | cuda-guide/04-special-topics/async-barriers.md |
| L2 cache control | Programming Guide | cuda-guide/04-special-topics/l2-cache-control.md |
| Dynamic parallelism | Programming Guide | cuda-guide/04-special-topics/dynamic-parallelism.md |
| C++ language extensions | Programming Guide | cuda-guide/05-appendices/cpp-language-extensions.md |
| Math functions (device) | Programming Guide | cuda-guide/05-appendices/mathematical-functions.md |
| Multi-GPU programming | Programming Guide | cuda-guide/03-advanced/multi-gpu-systems.md |
| Environment variables | Programming Guide | cuda-guide/05-appendices/environment-variables.md |
| Memory optimization practices | Best Practices | best-practices-guide/ |
| Performance profiling strategy | Best Practices | best-practices-guide/ |
| ncu metrics, sections, roofline | Nsight Compute | ncu-docs/ProfilingGuide.md |
| ncu CLI options and workflows | Nsight Compute | ncu-docs/NsightComputeCli.md |
| nsys profiling and tracing | Nsight Systems | nsys-docs/UserGuide.md |
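For repeated lookups, the routing above can be folded into a small helper that maps a topic keyword to the matching sub-tree. This is a sketch: `cuda_ref_dir` is a hypothetical name, and the default root assumes the Cursor install path unless CUDA_REFS is set.

```shell
# Hypothetical helper: resolve a topic keyword to a references/ sub-tree,
# so searches become e.g.:  rg -i "coalescing" "$(cuda_ref_dir best-practices)"
cuda_ref_dir() {
    refs="${CUDA_REFS:-$HOME/.cursor/skills/cuda-skill/references}"
    case "$1" in
        ptx)            echo "$refs/ptx-docs" ;;
        ptx-quick)      echo "$refs/ptx-simple" ;;
        runtime)        echo "$refs/cuda-runtime-docs" ;;
        driver)         echo "$refs/cuda-driver-docs" ;;
        guide)          echo "$refs/cuda-guide" ;;
        best-practices) echo "$refs/best-practices-guide" ;;
        ncu)            echo "$refs/ncu-docs" ;;
        nsys)           echo "$refs/nsys-docs" ;;
        *)              echo "$refs" ;;   # unknown topic: search everything
    esac
}

cuda_ref_dir ptx   # prints <refs>/ptx-docs
```

The fallback case keeps the helper safe to use with unrecognized keywords: the search simply widens to the whole references/ tree.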

Debugging Workflow

  • Reproduce minimally — Isolate failing kernel with smallest input

  • Add printf — if (idx == 0) printf(...) in device code

  • Run compute-sanitizer:

compute-sanitizer --tool memcheck ./program
compute-sanitizer --tool racecheck ./program

  • cuda-gdb backtrace (non-interactive):

cuda-gdb -batch -ex "run" -ex "bt" ./program

  • When tools fail — Minimize diff between working/broken code, read it carefully

For detailed tool options, read ~/.cursor/skills/cuda-skill/references/debugging-tools.md.
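The sanitizer step can be wrapped in a loop that tries each tool in turn. This is a sketch: `run_sanitizers` is a hypothetical helper, and the SANITIZER override exists only so the loop can be dry-run on machines without the CUDA toolkit.

```shell
# Sketch: run every compute-sanitizer tool against a binary built with -g -G.
# SANITIZER defaults to the real tool; override it (e.g. SANITIZER=true) to dry-run.
run_sanitizers() {
    prog="${1:-./program_debug}"
    san="${SANITIZER:-compute-sanitizer}"
    rc=0
    for tool in memcheck racecheck initcheck synccheck; do
        echo "== $san --tool $tool $prog =="
        "$san" --tool "$tool" "$prog" || rc=1   # keep going; report at the end
    done
    return $rc
}
```

Typical use: `run_sanitizers ./program_debug`; a nonzero exit means at least one tool reported errors.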

Performance Optimization Workflow

Never optimize without profiling. GPU bottleneck intuition is almost always wrong.

  • Establish baseline timing

  • nsys — Where is time spent?

nsys profile -o report ./program
nsys stats report.nsys-rep --report cuda_gpu_kern_sum

  • ncu — Why is this kernel slow?

ncu --kernel-name "myKernel" --set full -o report ./program

  • Hypothesize based on metrics, change ONE thing, verify
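The steps above can be strung together as a sketch. The binary name ./program and the kernel name myKernel are placeholders, and each tool invocation is skipped when the binary or tool is absent:

```shell
# Two-level profiling pass: timeline first, then per-kernel deep dive.
PROG=./program
KERNEL=myKernel

# Where is time spent? (nsys timeline + per-kernel summary)
if [ -x "$PROG" ] && command -v nsys >/dev/null 2>&1; then
    nsys profile -o report "$PROG"
    nsys stats report.nsys-rep --report cuda_gpu_kern_sum
fi

# Why is this kernel slow? (full ncu metric set for one kernel)
if [ -x "$PROG" ] && command -v ncu >/dev/null 2>&1; then
    ncu --kernel-name "$KERNEL" --set full -o kernel_report "$PROG"
fi
```

Collecting `--set full` is slow because ncu replays the kernel many times; narrowing to one kernel with --kernel-name keeps the run tractable.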

| Symptom | Likely cause | Tool |
| --- | --- | --- |
| Low GPU utilization | Launch overhead, CPU bottleneck | nsys timeline |
| Memory bound | Poor coalescing, low cache hit | ncu memory section |
| Compute bound but slow | Low occupancy, register pressure | ncu occupancy |
| High sectors/request (>4) | Poor coalescing | ncu memory metrics |

For detailed guides, read:

  • ~/.cursor/skills/cuda-skill/references/nsys-guide.md (quick reference)

  • ~/.cursor/skills/cuda-skill/references/ncu-guide.md (quick reference)

  • ~/.cursor/skills/cuda-skill/references/performance-traps.md

  • ~/.cursor/skills/cuda-skill/references/ncu-docs/ProfilingGuide.md (full Nsight Compute profiling guide)

  • ~/.cursor/skills/cuda-skill/references/nsys-docs/UserGuide.md (full Nsight Systems user guide)

  • ~/.cursor/skills/cuda-skill/references/best-practices-guide/ (CUDA C++ Best Practices)

Compilation Reference

Debug

nvcc -g -G -lineinfo -O0 program.cu -o program_debug

Release with line info (always use -lineinfo for profiling)

nvcc -O3 -lineinfo program.cu -o program

Target architecture

nvcc -arch=sm_80 program.cu   # Ampere
nvcc -arch=sm_90 program.cu   # Hopper
nvcc -arch=sm_100 program.cu  # Blackwell
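To ship one binary covering several architectures, the per-arch flags above generalize to -gencode. A sketch; the architecture list is illustrative and should match the GPUs you actually target:

```shell
# Fat binary: SASS for Ampere and Hopper, plus Hopper PTX so newer GPUs can JIT.
GENCODE="-gencode arch=compute_80,code=sm_80 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90"

# Build only when nvcc and the source are actually present.
if command -v nvcc >/dev/null 2>&1 && [ -f program.cu ]; then
    nvcc -O3 -lineinfo $GENCODE program.cu -o program
fi
```

The code=compute_90 entry embeds PTX alongside the SASS, so architectures newer than sm_90 can JIT-compile at load time instead of failing to launch.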

Generate PTX / inspect binary

nvcc -ptx program.cu
cuobjdump -ptx ./program
cuobjdump -sass ./program
nvcc --ptxas-options=-v program.cu  # Register usage

Inline PTX in CUDA

__device__ int myAdd(int a, int b) {
    int result;
    asm("add.s32 %0, %1, %2;" : "=r"(result) : "r"(a), "r"(b));
    return result;
}
// Constraint codes: r=32b reg, l=64b reg, f=f32, d=f64, n=immediate

PTX Documentation Structure

ptx-docs/
├── 1-introduction/
├── 2-programming-model/        # Thread hierarchy, memory
├── 3-ptx-machine-model/        # SIMT architecture
├── 4-syntax/                   # PTX syntax rules
├── 5-state-spaces-types-and-variables/  # Memory spaces, data types
├── 6-instruction-operands/     # Operand types
├── 7-abstracting-the-abi/      # Functions, calling conventions
├── 8-memory-consistency-model/ # Memory ordering, atomics
├── 9-instruction-set/          # 186 instruction files
│   ├── 9.7.1-*   # Integer arithmetic
│   ├── 9.7.3-*   # Floating point
│   ├── 9.7.9-*   # Data movement (includes TMA)
│   ├── 9.7.14-*  # WMMA (sm_70+)
│   ├── 9.7.15-*  # WGMMA (sm_90+)
│   └── 9.7.16-*  # TensorCore Gen5 (sm_100+)
├── 10-special-registers/       # %tid, %ctaid, %clock64
├── 11-directives/              # .version, .target, .entry
├── 12-descriptions-ofpragmastrings/
└── 13-release-notes/

Updating Documentation

cd /path/to/cursor-gpu-skills

Update everything

uv run scrape_docs.py all --force

Or update individually:

uv run scrape_docs.py ptx-simple --force      # Condensed PTX from triton repo
uv run scrape_docs.py ptx                     # Full PTX ISA from NVIDIA
uv run scrape_docs.py runtime                 # CUDA Runtime API
uv run scrape_docs.py driver                  # CUDA Driver API
uv run scrape_docs.py guide --force           # CUDA Programming Guide v13.1
uv run scrape_docs.py best-practices --force  # CUDA C++ Best Practices Guide
uv run scrape_docs.py ncu-docs --force        # Nsight Compute docs
uv run scrape_docs.py nsys-docs --force       # Nsight Systems docs

Additional References

For deeper investigation, read the search guide files:

  • PTX search workflow: ~/.cursor/skills/cuda-skill/references/ptx-isa.md

  • Runtime API guide: ~/.cursor/skills/cuda-skill/references/cuda-runtime.md

  • Driver API guide: ~/.cursor/skills/cuda-skill/references/cuda-driver.md

