tilelang-developer

Write, optimize, and debug high-performance AI compute kernels using TileLang (a Python DSL for GPU programming). Use when the user requests: (1) Writing custom GPU kernels for AI workloads (GEMM, Attention, MLA, etc.), (2) Optimizing existing TileLang code for NVIDIA, AMD, or Ascend hardware, (3) Implementing non-standard operators (like DeepSeek MLA, FlashAttention variants), (4) Debugging TileLang compilation or runtime errors, or (5) Cross-platform kernel development targeting multiple GPU vendors.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "tilelang-developer" with this command: npx skills add yzlnew/infra-skills/yzlnew-infra-skills-tilelang-developer

TileLang Developer

Write high-performance AI compute kernels using TileLang, a tile-based programming model that bridges the gap between CUDA's low-level control and high-level abstractions.

When to Use This Skill

Use this skill when the user needs to:

  • Implement custom GPU kernels for AI operations (matrix multiplication, attention mechanisms, etc.)
  • Optimize performance-critical operators for modern GPUs (NVIDIA Ampere/Hopper, AMD MI300X, Ascend NPU)
  • Debug TileLang code or resolve performance issues
  • Port kernels across different hardware platforms
  • Understand or modify existing TileLang implementations

Kernel Development Workflow

Follow these steps when writing a TileLang kernel:

Step 1: Analyze Requirements

Gather essential information:

Input/Output Specifications:

  • Tensor shapes (M, N, K dimensions)
  • Data types (float16, float32, bfloat16, int8)
  • Memory layout (row-major, column-major)

Hardware Target:

  • NVIDIA GPU (Ampere A100, Hopper H100, etc.)
  • AMD GPU (MI300X, etc.)
  • Huawei Ascend NPU

Performance Goals:

  • Target throughput or latency
  • Memory bandwidth constraints
  • Comparison baseline (cuBLAS, vendor libraries)

Ask clarifying questions if details are missing.

Step 2: Set Up Kernel Structure

Create the basic kernel scaffold:

import tilelang
import tilelang.language as T

@tilelang.jit(target="cuda", out_idx=[2])  # Specify output indices
def kernel_name(M, N, K, block_M, block_N, block_K):
    @T.prim_func
    def main(
        A: T.Buffer((M, K), "float16"),
        B: T.Buffer((K, N), "float16"),
        C: T.Buffer((M, N), "float16")
    ):
        # Kernel logic will go here
        pass

    return main

Key decisions:

  • target: "cuda" (NVIDIA), "hip" (AMD), or "cpu"
  • out_idx: List indices of output parameters
  • Block dimensions: Typical values are 64, 128, or 256
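The launch geometry implied by these choices can be sanity-checked ahead of time in plain Python. The helpers below are illustrative stand-ins, not TileLang API, but `ceildiv` computes the same value as `T.ceildiv`:

```python
# Plain-Python sanity check of launch geometry.
# Helper names are illustrative, not part of the TileLang API.

def ceildiv(a: int, b: int) -> int:
    """Ceiling division, the same value T.ceildiv produces."""
    return -(-a // b)

def grid_shape(M, N, block_M, block_N):
    """(grid_x, grid_y) = tiles along N and M, matching T.Kernel's argument order."""
    return ceildiv(N, block_N), ceildiv(M, block_M)

gx, gy = grid_shape(M=1024, N=1024, block_M=128, block_N=128)
print(gx, gy, gx * gy)  # 8 8 64 -- 64 thread blocks cover the output
```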

Step 3: Define Grid and Memory Hierarchy

Set up computation grid and allocate memory:

# Define grid dimensions
with T.Kernel(
    T.ceildiv(N, block_N),  # Grid X
    T.ceildiv(M, block_M),  # Grid Y
    threads=128
) as (bx, by):

    # Allocate shared memory (explicitly managed on-chip SRAM)
    A_shared = T.alloc_shared((block_M, block_K), "float16")
    B_shared = T.alloc_shared((block_K, block_N), "float16")

    # Allocate register fragments (accumulators)
    C_local = T.alloc_fragment((block_M, block_N), "float32")

    # CRITICAL: Apply swizzle layout to avoid bank conflicts
    T.annotate_layout({
        A_shared: T.make_swizzled_layout(A_shared),
        B_shared: T.make_swizzled_layout(B_shared)
    })

Memory hierarchy:

  • Global Memory (HBM): Input/output tensors, slowest
  • Shared Memory: Explicitly managed on-chip SRAM (shares storage with L1), ~164KB per SM on A100
  • Registers: Fastest, used for accumulators and temporaries

Critical optimization: Always apply T.make_swizzled_layout to shared memory to eliminate bank conflicts.
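Why the swizzle matters can be shown with a toy model of the 32 shared-memory banks. The XOR mapping below is a deliberately simplified stand-in for the layout `T.make_swizzled_layout` actually emits, but it captures the idea:

```python
# Toy illustration of why swizzling removes bank conflicts.
# Shared memory is divided into 32 banks. With a naive row-major layout,
# reading one column hits the same bank 32 times (fully serialized).
# A per-row XOR permutation (a simplified stand-in for the real
# T.make_swizzled_layout mapping) spreads the accesses over all banks.

BANKS = 32

def bank_naive(row, col):
    return col % BANKS  # every row maps column c to the same bank

def bank_swizzled(row, col):
    return (col ^ row) % BANKS  # per-row XOR permutation

col = 5
naive = {bank_naive(r, col) for r in range(BANKS)}
swiz = {bank_swizzled(r, col) for r in range(BANKS)}
print(len(naive), len(swiz))  # 1 32: naive serializes, swizzled spreads
```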

Step 4: Implement Computation Logic

Use TileLang primitives for data movement and computation:

# Initialize accumulator
T.clear(C_local)

# Main computation loop with software pipelining
for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
    # Load tiles from global to shared memory
    T.copy(A[by * block_M, k * block_K], A_shared)
    T.copy(B[k * block_K, bx * block_N], B_shared)

    # Compute using Tensor Cores
    T.gemm(A_shared, B_shared, C_local, transpose_B=False)

# Write results back
T.copy(C_local, C[by * block_M, bx * block_N])

Key primitives:

  • T.copy: Intelligent data transfer (auto-selects cp.async, TMA, etc.)
  • T.gemm: Matrix multiplication using Tensor Cores
  • T.Pipelined: Software pipelining to overlap compute and memory transfer
  • T.Parallel: Element-wise parallel operations

Pipeline stages:

  • num_stages=2: Double buffering
  • num_stages=3: Triple buffering (recommended for most workloads)
  • num_stages=4+: Diminishing returns, increases shared memory usage
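Because each in-flight stage buffers its own copy of the A and B tiles, shared-memory use grows roughly linearly with `num_stages`. A rough budget check (assuming fp16 tiles and ignoring any compiler padding; helper names are illustrative):

```python
# Rough shared-memory budget per pipeline stage (fp16 tiles assumed).
# Each T.Pipelined stage buffers one A tile and one B tile, so total
# usage grows linearly with num_stages. Estimates only: the compiler
# may add padding or other allocations on top.

def stage_bytes(block_M, block_N, block_K, dtype_bytes=2):
    a_tile = block_M * block_K * dtype_bytes
    b_tile = block_K * block_N * dtype_bytes
    return a_tile + b_tile

def max_stages(block_M, block_N, block_K, smem_bytes):
    return smem_bytes // stage_bytes(block_M, block_N, block_K)

# 128x128x32 fp16 tiles against an A100-class 164 KB budget:
per_stage = stage_bytes(128, 128, 32)  # 16384 bytes per stage
print(per_stage, max_stages(128, 128, 32, 164 * 1024))
```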

Step 5: Validate and Test

Generate test code to verify correctness:

# Example instantiation
func = kernel_name(
    M=1024, N=1024, K=1024,
    block_M=128, block_N=128, block_K=32
)

# Test against reference implementation
import torch
A = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')
B = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')
C_reference = A @ B

# With out_idx=[2] in the decorator, the runtime allocates the output
# buffer and returns it, so C is not passed in explicitly
C_tilelang = func(A, B)

# Verify with appropriate tolerance for FP16
torch.testing.assert_close(C_tilelang, C_reference, rtol=1e-3, atol=1e-3)
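Once results are correct, measured latency can be converted to achieved TFLOPS for comparison against cuBLAS or the targets in Step 6; a GEMM performs 2*M*N*K floating-point operations. A small pure-Python helper (illustrative name):

```python
# Convert a measured kernel latency into achieved TFLOPS.
# A GEMM performs 2*M*N*K flops (one multiply and one add per
# inner-product term).

def gemm_tflops(M, N, K, latency_s):
    return (2 * M * N * K) / latency_s / 1e12

# e.g. a 4096^3 GEMM measured at 500 microseconds:
print(round(gemm_tflops(4096, 4096, 4096, 500e-6), 1))  # 274.9
```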

Step 6: Optimize Performance

Apply advanced optimizations if performance is suboptimal:

Block Size Tuning:

  • A100: Try 128×128×32 or 64×64×32
  • H100: Can use larger blocks (256×128×32)
  • MI300X: May need smaller blocks due to 64KB shared memory limit

Pipeline Depth:

  • Increase num_stages if memory-bound
  • Decrease if shared memory is exhausted

Warp Policy (for advanced cases):

T.gemm(A, B, C, policy=T.GemmWarpPolicy.FullRow)  # For attention
T.gemm(A, B, C, policy=T.GemmWarpPolicy.FullCol)  # For MLA decode

Block-level swizzle:

T.use_swizzle(panel_size=10)  # Improves L2 cache hit rate
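These knobs can be swept offline. A minimal pre-filter (pure-Python sketch; candidate sets and names are illustrative) discards block-size/stage combinations whose fp16 tiles would not fit the shared-memory budget before any benchmarking:

```python
# Minimal offline filter over block-size / stage candidates: keep only
# configurations whose pipelined fp16 tiles fit the shared-memory budget.
# A real autotuner would then benchmark the survivors.
from itertools import product

def smem_bytes(bm, bn, bk, stages, dtype_bytes=2):
    return stages * (bm * bk + bk * bn) * dtype_bytes

def feasible_configs(smem_limit):
    for bm, bn, bk, stages in product((64, 128, 256), (64, 128),
                                      (16, 32), (2, 3, 4)):
        if smem_bytes(bm, bn, bk, stages) <= smem_limit:
            yield bm, bn, bk, stages

a100 = list(feasible_configs(164 * 1024))   # A100: 164 KB budget
mi300x = list(feasible_configs(64 * 1024))  # MI300X: 64 KB, fewer survive
print(len(a100), len(mi300x))
```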

Common Kernel Patterns

Matrix Multiplication (GEMM)

Most fundamental kernel. See EXAMPLES.md for complete implementation.

Key features:

  • 3-stage pipelining
  • Swizzle layout for shared memory
  • Float32 accumulator for precision

FlashAttention

Memory-efficient attention with online softmax. See EXAMPLES.md for complete implementation.

Key features:

  • O(N) memory complexity
  • Online softmax statistics
  • Fused kernel (no intermediate materialization)

DeepSeek MLA

Multi-Head Latent Attention with KV compression. See EXAMPLES.md for complete implementation.

Key features:

  • Split-KV parallelization
  • Non-standard dimensions
  • FullCol warp policy for narrow matrices

Reference Documentation

When you need specific information:

  • API details (parameters, signatures, options): Read API_REFERENCE.md
  • Complete code examples (GEMM, Attention, MLA): Read EXAMPLES.md
  • Troubleshooting (errors, performance issues): Read DEBUGGING.md

Critical Performance Guidelines

Always include these optimizations:

  1. Swizzle layout for shared memory:

    T.annotate_layout({
        A_shared: T.make_swizzled_layout(A_shared)
    })
    
  2. Software pipelining:

    for k in T.Pipelined(num_blocks, num_stages=3):
    
  3. Float32 accumulators:

    C_local = T.alloc_fragment((M, N), "float32")  # Not float16
    
  4. Aligned block_K:

    block_K = 32  # Or 16; must be a multiple of the Tensor Core K tile (16 for FP16)
    
  5. Initialize accumulators:

    T.clear(C_local)
    

Hardware-Specific Considerations

NVIDIA GPUs

  • Ampere (A100): Use cp.async, num_stages=3, block_K=32
  • Hopper (H100): Can use TMA, larger blocks (256×128), num_stages=4
  • Shared memory: 164KB (A100), 228KB (H100)

AMD GPUs

  • MI300X: Use target="hip", smaller blocks, 64KB shared memory limit
  • Test with both HIP and CUDA backends for compatibility

Huawei Ascend

  • More experimental backend
  • May require specific block sizes
  • Consult Ascend-specific documentation

Code Quality Standards

When generating TileLang code:

  1. Add explanatory comments for non-obvious optimizations
  2. Specify hardware assumptions (e.g., "optimized for A100")
  3. Include usage examples showing instantiation
  4. Document block size choices and tuning rationale
  5. Provide performance expectations (e.g., "~90% of cuBLAS")

Example Kernel Request Flow

User: "Write a FP16 matrix multiplication kernel for A100"

Response:

  1. Clarify dimensions (if not specified)
  2. Generate complete kernel code with:
    • Proper structure (@tilelang.jit, @T.prim_func)
    • Swizzle layouts
    • 3-stage pipelining
    • Appropriate block sizes (128×128×32)
  3. Add usage example
  4. Explain key optimizations:
    • "Swizzle layout eliminates bank conflicts"
    • "3-stage pipeline overlaps memory and compute"
    • "Float32 accumulator prevents overflow"
  5. Suggest testing approach

Troubleshooting Quick Reference

Compilation errors:

  • Shared memory exceeded → Reduce block size or num_stages
  • Shape mismatch → Verify dimension alignment in T.gemm

Runtime errors:

  • Results all zeros → Check T.clear() and out_idx in decorator
  • NaN/Inf → Use float32 accumulator, add epsilon in division

Performance issues:

  • Low throughput → Verify swizzle layout and pipelining enabled
  • Low occupancy → Reduce shared memory usage or block size
  • Bank conflicts → Apply T.make_swizzled_layout

For detailed solutions, consult DEBUGGING.md.

