cutlass-skill

CUTLASS & CuTeDSL Development

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "cutlass-skill" with this command: npx skills add slowlyc/cursor-gpu-skills/slowlyc-cursor-gpu-skills-cutlass-skill

Source Code Locations

The CUTLASS source lives under repos/cutlass/ inside this skill's install directory. The actual path depends on the tool in use:

  • Cursor: ~/.cursor/skills/cutlass-skill/repos/cutlass/

  • Claude Code: ~/.claude/skills/cutlass-skill/repos/cutlass/

  • Codex: ~/.agents/skills/cutlass-skill/repos/cutlass/

CUTLASS_REPO: the examples below use ~/.cursor/skills/cutlass-skill/repos/cutlass/ as a placeholder; replace it with your actual path.

If that path does not exist, run bash update-repos.sh cutlass from the project directory.
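
The path lookup described above can be sketched as a small helper. This is a minimal illustration, not part of the skill: the function name find_cutlass_repo and the idea of honoring a CUTLASS_REPO environment variable as an override are assumptions made for this example; only the three install locations come from the list above.

```python
import os
from pathlib import Path

# Install locations listed above, relative to the user's home directory.
CANDIDATE_DIRS = (
    ".cursor/skills/cutlass-skill/repos/cutlass",
    ".claude/skills/cutlass-skill/repos/cutlass",
    ".agents/skills/cutlass-skill/repos/cutlass",
)


def find_cutlass_repo(home=None, env=None):
    """Return the first existing CUTLASS checkout as a Path, or None.

    Assumption of this sketch: a CUTLASS_REPO environment variable, if set,
    is tried before the tool-specific defaults.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    candidates = []
    override = env.get("CUTLASS_REPO")
    if override:
        candidates.append(Path(override))
    candidates.extend(home / rel for rel in CANDIDATE_DIRS)

    for path in candidates:
        if path.is_dir():
            return path
    return None
```

If the helper returns None, the checkout is missing and running bash update-repos.sh cutlass is the next step.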

CuTeDSL (Python DSL for GPU Kernels)

CUTLASS_REPO/python/CuTeDSL/
├── cutlass/
│   ├── base_dsl/      # DSL foundations: types, variables, functions, PTX emit
│   ├── cute/          # CuTe Python bindings: Layout, Tensor, TiledMMA, TiledCopy
│   ├── cutlass_dsl/   # CUTLASS DSL: GEMM builder, epilogue, pipeline
│   ├── pipeline/      # Pipeline abstractions: MainloopPipeline, PipelineAsync
│   ├── jax/           # JAX integration
│   ├── utils/         # Compilation tools, profiler, tensor utilities
│   └── torch.py       # PyTorch integration

CuTeDSL examples:

CUTLASS_REPO/examples/python/CuTeDSL/
├── ampere/                      # Ampere: sgemm, tensorop_gemm, flash_attention_v2
├── hopper/                      # Hopper: TMA gemm, FP8, grouped GEMM
├── blackwell/                   # Blackwell: blockwise_gemm
├── blackwell_geforce/           # GeForce Blackwell
├── cute/                        # CuTe tutorials (Python)
├── distributed/                 # Distributed GEMM
├── experimental/                # Experimental features
├── jax/                         # JAX examples
├── notebooks/                   # Jupyter notebooks (English)
├── notebooks-zh/                # Jupyter notebooks (Chinese)
└── advanced_compiler_control/   # Advanced compiler control

CUTLASS C++ Examples (by architecture)

CUTLASS_REPO/examples/
├── 00-47: Ampere and earlier architectures
├── 48-69: Hopper (sm_90)
│   ├── 48_hopper_warp_specialized_gemm
│   ├── 49_hopper_gemm_with_collective_builder
│   ├── 54_hopper_fp8_warp_specialized_gemm
│   ├── 55_hopper_mixed_dtype_gemm
│   ├── 57_hopper_grouped_gemm
│   ├── 62_hopper_sparse_gemm
│   ├── 67_hopper_fp8..._blockwise_scaling
│   ├── 88_hopper_fmha   # Hopper example numbered outside the 48-69 range
│   └── ...
├── 70-93: Blackwell (sm_100)
│   ├── 70_blackwell_gemm
│   ├── 71_blackwell_gemm_with_collective_builder
│   ├── 72_blackwell_narrow_precision_gemm
│   ├── 77_blackwell_fmha
│   ├── 81_blackwell_gemm_blockwise
│   ├── 83_blackwell_sparse_gemm
│   ├── 92_blackwell_moe_gemm
│   ├── 93_blackwell_low_latency_gqa
│   └── ...
└── cute/tutorial/   # CuTe C++ tutorials (sgemm, tiled_copy, hopper, blackwell)

CuTe C++ Headers

CUTLASS_REPO/include/cute/
├── layout.hpp    # Layout core: Shape, Stride, composition
├── tensor.hpp    # Tensor: make_tensor, local_tile, partition
├── swizzle.hpp   # Swizzle patterns
├── algorithm/    # copy, gemm, fill, clear
├── arch/         # Architecture-specific: copy_sm90, mma_sm90, copy_sm100
├── atom/         # MMA atom and Copy atom definitions
│   ├── mma_atom.hpp
│   ├── copy_atom.hpp
│   └── mma_traits_sm90_gmma.hpp   # WGMMA traits
├── numeric/      # Numeric types
└── container/    # tuple, array
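
Before reading layout.hpp it can help to see what a Shape:Stride pair computes. The sketch below is pure Python and uses neither CuTe nor pycute. It covers only flat (non-nested) shapes, and the function names are chosen for this illustration. It assumes the usual CuTe convention that a 1-D index is unpacked with the leftmost mode varying fastest.

```python
from math import prod


def crd2idx(coord, stride):
    """Map a flat coordinate to a linear offset: sum of coord[i] * stride[i]."""
    if len(coord) != len(stride):
        raise ValueError("coordinate and stride must have the same rank")
    return sum(c * d for c, d in zip(coord, stride))


def idx2crd(index, shape):
    """Unpack a 1-D index into a coordinate, leftmost mode fastest."""
    coord = []
    for extent in shape:
        coord.append(index % extent)
        index //= extent
    return tuple(coord)


def layout_offsets(shape, stride):
    """List the offset produced for every 1-D index of the layout."""
    return [crd2idx(idx2crd(i, shape), stride) for i in range(prod(shape))]


# Column-major 4x2, written (4,2):(1,4) in CuTe notation.
column_major = layout_offsets((4, 2), (1, 4))
# Row-major 4x2, written (4,2):(2,1).
row_major = layout_offsets((4, 2), (2, 1))
```

Here column_major is [0, 1, 2, 3, 4, 5, 6, 7] and row_major is [0, 2, 4, 6, 1, 3, 5, 7]. The two layouts cover the same eight offsets and differ only in the order in which they are visited.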

CUTLASS C++ Headers

CUTLASS_REPO/include/cutlass/
├── gemm/           # GEMM device layer, collective, kernel
│   ├── collective/ # CollectiveMainloop, CollectiveEpilogue
│   ├── kernel/     # GemmUniversal
│   └── device/     # Device launch interface
├── epilogue/       # Epilogue: bias, activation, scaling
├── conv/           # Convolution
├── arch/           # MMA instruction wrappers (mma_sm90.h, mma_sm100.h)
├── pipeline/       # Pipeline: PipelineTmaAsync, PipelineAsync
├── experimental/   # Experimental APIs
└── detail/         # Internal implementation details

pycute (Python CuTe bindings)

CUTLASS_REPO/python/pycute/
├── layout.py      # Layout, make_layout, complement, coalesce
├── int_tuple.py   # IntTuple operations
├── swizzle.py     # Swizzle
└── typing.py      # Type definitions
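
As a reading aid for swizzle.py and swizzle.hpp, here is a standalone sketch of a swizzle as an XOR on an integer offset. It is written from the commonly described Swizzle<bits, base, shift> definition and handles only a positive shift. Treat it as an assumption to check against the source files rather than a copy of them.

```python
def swizzle(offset, bits, base, shift):
    """XOR `bits` bits taken from position base+shift into position base.

    Sketch of a Swizzle<bits, base, shift>-style functor, restricted to
    shift >= bits >= 0 so the source and target bit fields do not overlap.
    """
    if bits < 0 or base < 0 or shift < bits:
        raise ValueError("this sketch requires shift >= bits >= 0 and base >= 0")
    source_mask = ((1 << bits) - 1) << (base + shift)
    return offset ^ ((offset & source_mask) >> shift)


# With bits=3, base=3, shift=3, bits 6..8 of the offset are XORed into bits 3..5.
remapped = [swizzle(i, 3, 3, 3) for i in range(512)]
```

Because the source bits are never modified, the mapping is a permutation of each 512-element block, and applying it twice returns the original offset.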

Search Strategy

Search with the Grep tool; do not load whole files.

CuTeDSL usage

CUTLASS_REPO="$HOME/.cursor/skills/cutlass-skill/repos/cutlass"

Find CuTeDSL GEMM examples (the examples apply the decorators as cute.jit and cute.kernel, so a bare "@jit" pattern would miss them)

rg "@cute\.(jit|kernel)" $CUTLASS_REPO/examples/python/CuTeDSL/

Find TiledMMA usage

rg "TiledMMA|tiled_mma" $CUTLASS_REPO/python/CuTeDSL/cutlass/cute/

Find pipeline usage

rg "MainloopPipeline|PipelineAsync" $CUTLASS_REPO/python/CuTeDSL/cutlass/pipeline/

Find Blackwell CuTeDSL examples

rg "sm_100|blackwell" $CUTLASS_REPO/examples/python/CuTeDSL/blackwell/

CuTe C++ usage

Find Layout operations

rg "make_layout|composition|complement" $CUTLASS_REPO/include/cute/layout.hpp

Find TiledCopy usage

rg "TiledCopy|make_tiled_copy" $CUTLASS_REPO/include/cute/

Find MMA atom traits

rg "MMA_Traits" $CUTLASS_REPO/include/cute/atom/

Find Hopper WGMMA

rg "SM90_64x" $CUTLASS_REPO/include/cute/atom/mma_traits_sm90_gmma.hpp

Find TMA copy

rg "SM90_TMA" $CUTLASS_REPO/include/cute/arch/
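
TMA copies are sensitive to alignment, and this document's troubleshooting notes cite a 128-byte figure. A host-side sanity check might look like the sketch below. The function name, its parameters, and the choice to test both the base address and the row stride are assumptions of this example. The exact hardware rules differ between global and shared memory, so check them against the CUDA documentation.

```python
def alignment_issues(base_address, row_stride_elems, elem_bits, alignment_bytes=128):
    """Return a list of alignment problems; an empty list means none were found.

    base_address     -- byte address of the first element
    row_stride_elems -- distance between consecutive rows, in elements
    elem_bits        -- element width in bits (e.g. 16 for half, 8 for FP8)
    alignment_bytes  -- required alignment; 128 mirrors the figure in this document
    """
    issues = []
    if base_address % alignment_bytes != 0:
        issues.append(f"base address is not {alignment_bytes}-byte aligned")

    stride_bits = row_stride_elems * elem_bits
    if stride_bits % 8 != 0:
        issues.append("row stride is not a whole number of bytes")
    elif (stride_bits // 8) % alignment_bytes != 0:
        issues.append(f"row stride is not a multiple of {alignment_bytes} bytes")
    return issues
```

For example, a half-precision matrix with a leading dimension of 64 elements has a 128-byte row stride and passes, while a leading dimension of 100 elements (200 bytes) is flagged.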

CUTLASS Collective Builder

Find CollectiveBuilder usage

rg "CollectiveBuilder" $CUTLASS_REPO/examples/49_hopper_gemm_with_collective_builder/

Find Collective Mainloop

rg "CollectiveMainloop" $CUTLASS_REPO/include/cutlass/gemm/collective/

Find epilogue fusion

rg "fusion|EVT" $CUTLASS_REPO/include/cutlass/epilogue/

Find the kernel launch template

rg "GemmUniversal" $CUTLASS_REPO/include/cutlass/gemm/device/

GEMM example searches

Find FP8 GEMM configuration

rg "float_e4m3|float_e5m2|fp8" $CUTLASS_REPO/examples/54_hopper_fp8_warp_specialized_gemm/

Find blockwise scaling

rg "blockwise|block_scale" $CUTLASS_REPO/examples/67_hopper_fp8_warp_specialized_gemm_with_blockwise_scaling/

Find grouped GEMM

rg "grouped|ProblemShape::Group" $CUTLASS_REPO/examples/57_hopper_grouped_gemm/

Find sparse GEMM

rg "sparse|Sparse" $CUTLASS_REPO/examples/62_hopper_sparse_gemm/

Find StreamK

rg "StreamK|stream_k" $CUTLASS_REPO/examples/47_ampere_gemm_universal_streamk/
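
Where rg is unavailable, the same strategy of returning matching lines instead of loading whole files can be approximated in Python. This is a minimal sketch using only the standard library. The function name grep_tree, the default suffix list, and the hit limit are choices made for this example.

```python
import re
from pathlib import Path


def grep_tree(root, pattern, suffixes=(".py", ".hpp", ".h", ".cu"), max_hits=50):
    """Return up to max_hits (path, line_number, line) tuples matching pattern."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        # Stream the file line by line so large headers are never held in memory.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                if regex.search(line):
                    hits.append((path, lineno, line.rstrip("\n")))
                    if len(hits) >= max_hits:
                        return hits
    return hits
```

A call such as grep_tree(repo / "include/cute", r"make_tiled_copy") plays the role of the TiledCopy search above, where repo is the resolved CUTLASS_REPO path.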

When to Use Each Source

| Need | Source | Path |
| --- | --- | --- |
| Getting started with CuTeDSL | CuTeDSL examples | examples/python/CuTeDSL/ampere/ |
| CuTeDSL Hopper GEMM | CuTeDSL examples | examples/python/CuTeDSL/hopper/ |
| CuTeDSL Blackwell GEMM | CuTeDSL examples | examples/python/CuTeDSL/blackwell/ |
| CuTeDSL API definitions | CuTeDSL source | python/CuTeDSL/cutlass/ |
| CuTe Layout semantics | CuTe headers | include/cute/layout.hpp |
| CuTe Tensor operations | CuTe headers | include/cute/tensor.hpp |
| MMA atom traits | CuTe atom | include/cute/atom/ |
| TMA copy (architecture-specific) | CuTe arch | include/cute/arch/copy_sm90* |
| CUTLASS GEMM templates | CUTLASS examples | examples/48-93_* |
| Collective Builder | CUTLASS examples | examples/49_hopper_gemm_with_collective_builder/ |
| Collective Mainloop | CUTLASS headers | include/cutlass/gemm/collective/ |
| Epilogue fusion | CUTLASS headers | include/cutlass/epilogue/ |
| Pipeline abstractions | CUTLASS headers | include/cutlass/pipeline/ |
| pycute Layout operations | pycute | python/pycute/ |
| Hopper FMHA | CUTLASS examples | examples/88_hopper_fmha/ |
| Blackwell FMHA | CUTLASS examples | examples/77_blackwell_fmha/ |
| MoE GEMM | CUTLASS examples | examples/92_blackwell_moe_gemm/ |

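CuTeDSL authoring patterns

Basic elementwise kernel

from cutlass import jit, Int32, Float32

@jit
def add_kernel(x: Float32, y: Float32) -> Float32:
    return x + y

Treat this snippet as schematic: the shipped examples import cutlass.cute as cute and apply the decorators as @cute.jit and @cute.kernel. See examples/python/CuTeDSL/ampere/elementwise_add.py for a complete example.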

CuTeDSL GEMM

See examples/python/CuTeDSL/ampere/sgemm.py for a basic SGEMM, and examples/python/CuTeDSL/hopper/ for Hopper TMA GEMM.

CuTe C++ GEMM patterns

See examples/cute/tutorial/sgemm_1.cu through sgemm_sm80.cu for a progressive CuTe SGEMM tutorial.

Compilation Reference

Build a CUTLASS example

cd CUTLASS_REPO && mkdir -p build && cd build
cmake .. -DCUTLASS_NVCC_ARCHS=90a  # Hopper
cmake --build . --target 49_hopper_gemm_with_collective_builder

Build for specific architectures

cmake .. -DCUTLASS_NVCC_ARCHS="80;90a;100a"
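
When these builds are driven from a script, the semicolon-separated architecture list is easy to mangle with shell quoting. The sketch below assembles the same two cmake invocations as argument lists. The helper names are invented for this example and it does not run cmake itself.

```python
def cmake_configure_args(archs, source_dir=".."):
    """Build the argv for the configure step with a list of CUDA architectures."""
    if not archs:
        raise ValueError("at least one architecture is required")
    arch_list = ";".join(str(arch) for arch in archs)
    return ["cmake", source_dir, f"-DCUTLASS_NVCC_ARCHS={arch_list}"]


def cmake_build_args(target, build_dir="."):
    """Build the argv for compiling a single example target."""
    return ["cmake", "--build", build_dir, "--target", target]


configure = cmake_configure_args(["80", "90a", "100a"])
build = cmake_build_args("49_hopper_gemm_with_collective_builder")
```

Passing these lists to subprocess.run(..., cwd=build_directory, check=True) bypasses the shell, so the semicolons in the architecture list need no quoting.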

Run CuTeDSL

pip install -e python/CuTeDSL/
python examples/python/CuTeDSL/ampere/sgemm.py

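Troubleshooting

| Problem | Likely cause | Where to look |
| --- | --- | --- |
| GEMM results are inaccurate | Epilogue not configured with the correct accumulator type | rg "ElementAccumulator" examples/49_* |
| TMA error | Tensor alignment does not meet the 128 B requirement | rg -e "Alignment" -e "alignment" examples/48_* |
| Collective Builder fails to compile | Architecture mismatch or wrong pipeline configuration | rg -e "KernelSchedule" -e "EpilogueSchedule" examples/49_* |
| CuTeDSL type error | DSL type does not match the CUDA type | rg -e "dtype" -e "element_type" examples/python/CuTeDSL/ |
| Layout swizzle error | Swizzle pattern incompatible with the data layout | include/cute/swizzle.hpp |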
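
The accumulator-type row is easy to reproduce without a GPU. The sketch below emulates half-precision rounding with the standard struct module and compares a dot product accumulated in FP16 against one accumulated in a wider type. Python floats stand in for an FP32 accumulator here, and the function names are specific to this illustration.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def dot_fp16_accumulator(a, b):
    """Dot product whose running sum is rounded to FP16 after every add."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp16(acc + x * y)
    return acc


def dot_wide_accumulator(a, b):
    """Dot product accumulated in a wider type (a Python float here)."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc


ones = [1.0] * 4096
narrow = dot_fp16_accumulator(ones, ones)
wide = dot_wide_accumulator(ones, ones)
```

With 4096 ones, wide is 4096.0 while narrow stalls at 2048.0, because above 2048 the spacing between FP16 values is 2 and adding 1 rounds back down. A mismatched ElementAccumulator produces the same kind of silent error.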

Updating the CUTLASS source

From the cursor-gpu-skills project directory:

bash update-repos.sh cutlass

Additional References

  • Official CUTLASS documentation: https://github.com/NVIDIA/cutlass

  • CuTe documentation: CUTLASS_REPO/media/docs/cute/ (only if installed in --full mode)

  • CuTeDSL notebooks: examples/python/CuTeDSL/notebooks/

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
