tilelang-developer

Write, optimize, and debug high-performance AI compute kernels using TileLang (a Python DSL for GPU programming). Use when the user requests: (1) Writing custom GPU kernels for AI workloads (GEMM, Attention, MLA, etc.), (2) Optimizing existing TileLang code for NVIDIA, AMD, or Ascend hardware, (3) Implementing non-standard operators (like DeepSeek MLA, FlashAttention variants), (4) Debugging TileLang compilation or runtime errors, or (5) Cross-platform kernel development targeting multiple GPU vendors.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "tilelang-developer" with this command: npx skills add yzlnew/infra-skills/yzlnew-infra-skills-tilelang-developer

TileLang Developer

Write high-performance AI compute kernels using TileLang, a tile-based programming model that bridges the gap between CUDA's low-level control and high-level abstractions.

When to Use This Skill

Use this skill when the user needs to:

  • Implement custom GPU kernels for AI operations (matrix multiplication, attention mechanisms, etc.)
  • Optimize performance-critical operators for modern GPUs (NVIDIA Ampere/Hopper, AMD MI300X, Ascend NPU)
  • Debug TileLang code or resolve performance issues
  • Port kernels across different hardware platforms
  • Understand or modify existing TileLang implementations

Kernel Development Workflow

Follow these steps when writing a TileLang kernel:

Step 1: Analyze Requirements

Gather essential information:

Input/Output Specifications:

  • Tensor shapes (M, N, K dimensions)
  • Data types (float16, float32, bfloat16, int8)
  • Memory layout (row-major, column-major)

Hardware Target:

  • NVIDIA GPU (Ampere A100, Hopper H100, etc.)
  • AMD GPU (MI300X, etc.)
  • Huawei Ascend NPU

Performance Goals:

  • Target throughput or latency
  • Memory bandwidth constraints
  • Comparison baseline (cuBLAS, vendor libraries)

Ask clarifying questions if details are missing.

Step 2: Set Up Kernel Structure

Create the basic kernel scaffold:

import tilelang
import tilelang.language as T

@tilelang.jit(target="cuda", out_idx=[2])  # Specify output indices
def kernel_name(M, N, K, block_M, block_N, block_K):
    @T.prim_func
    def main(
        A: T.Buffer((M, K), "float16"),
        B: T.Buffer((K, N), "float16"),
        C: T.Buffer((M, N), "float16")
    ):
        # Kernel logic will go here
        pass

    return main

Key decisions:

  • target: "cuda" (NVIDIA), "hip" (AMD), or "cpu"
  • out_idx: List indices of output parameters
  • Block dimensions: Typical values are 64, 128, or 256
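The launch geometry implied by these choices can be sanity-checked ahead of time in plain Python. The helpers below are illustrative stand-ins, not TileLang API, but `ceildiv` computes the same value as `T.ceildiv`:

```python
# Plain-Python sanity check of launch geometry.
# Helper names are illustrative, not part of the TileLang API.

def ceildiv(a: int, b: int) -> int:
    """Ceiling division, the same value T.ceildiv produces."""
    return -(-a // b)

def grid_shape(M, N, block_M, block_N):
    """(grid_x, grid_y) = tiles along N and M, matching T.Kernel's argument order."""
    return ceildiv(N, block_N), ceildiv(M, block_M)

gx, gy = grid_shape(M=1024, N=1024, block_M=128, block_N=128)
print(gx, gy, gx * gy)  # 8 8 64 -- 64 thread blocks cover the output
```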

Step 3: Define Grid and Memory Hierarchy

Set up computation grid and allocate memory:

# Define grid dimensions
with T.Kernel(
    T.ceildiv(N, block_N),  # Grid X
    T.ceildiv(M, block_M),  # Grid Y
    threads=128
) as (bx, by):

    # Allocate shared memory (explicitly managed on-chip SRAM)
    A_shared = T.alloc_shared((block_M, block_K), "float16")
    B_shared = T.alloc_shared((block_K, block_N), "float16")

    # Allocate register fragments (accumulators)
    C_local = T.alloc_fragment((block_M, block_N), "float32")

    # CRITICAL: Apply swizzle layout to avoid bank conflicts
    T.annotate_layout({
        A_shared: T.make_swizzled_layout(A_shared),
        B_shared: T.make_swizzled_layout(B_shared)
    })

Memory hierarchy:

  • Global Memory (HBM): Input/output tensors, slowest
  • Shared Memory: Explicitly managed on-chip SRAM (shares storage with L1), ~164KB per SM on A100
  • Registers: Fastest, used for accumulators and temporaries

Critical optimization: Always apply T.make_swizzled_layout to shared memory to eliminate bank conflicts.
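Why the swizzle matters can be shown with a toy model of the 32 shared-memory banks. The XOR mapping below is a deliberately simplified stand-in for the layout `T.make_swizzled_layout` actually emits, but it captures the idea:

```python
# Toy illustration of why swizzling removes bank conflicts.
# Shared memory is divided into 32 banks. With a naive row-major layout,
# reading one column hits the same bank 32 times (fully serialized).
# A per-row XOR permutation (a simplified stand-in for the real
# T.make_swizzled_layout mapping) spreads the accesses over all banks.

BANKS = 32

def bank_naive(row, col):
    return col % BANKS  # every row maps column c to the same bank

def bank_swizzled(row, col):
    return (col ^ row) % BANKS  # per-row XOR permutation

col = 5
naive = {bank_naive(r, col) for r in range(BANKS)}
swiz = {bank_swizzled(r, col) for r in range(BANKS)}
print(len(naive), len(swiz))  # 1 32: naive serializes, swizzled spreads
```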

Step 4: Implement Computation Logic

Use TileLang primitives for data movement and computation:

# Initialize accumulator
T.clear(C_local)

# Main computation loop with software pipelining
for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
    # Load tiles from global to shared memory
    T.copy(A[by * block_M, k * block_K], A_shared)
    T.copy(B[k * block_K, bx * block_N], B_shared)

    # Compute using Tensor Cores
    T.gemm(A_shared, B_shared, C_local, transpose_B=False)

# Write results back
T.copy(C_local, C[by * block_M, bx * block_N])

Key primitives:

  • T.copy: Intelligent data transfer (auto-selects cp.async, TMA, etc.)
  • T.gemm: Matrix multiplication using Tensor Cores
  • T.Pipelined: Software pipelining to overlap compute and memory transfer
  • T.Parallel: Element-wise parallel operations

Pipeline stages:

  • num_stages=2: Double buffering
  • num_stages=3: Triple buffering (recommended for most workloads)
  • num_stages=4+: Diminishing returns, increases shared memory usage
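Because each in-flight stage buffers its own copy of the A and B tiles, shared-memory use grows roughly linearly with `num_stages`. A rough budget check (assuming fp16 tiles and ignoring any compiler padding; helper names are illustrative):

```python
# Rough shared-memory budget per pipeline stage (fp16 tiles assumed).
# Each T.Pipelined stage buffers one A tile and one B tile, so total
# usage grows linearly with num_stages. Estimates only: the compiler
# may add padding or other allocations on top.

def stage_bytes(block_M, block_N, block_K, dtype_bytes=2):
    a_tile = block_M * block_K * dtype_bytes
    b_tile = block_K * block_N * dtype_bytes
    return a_tile + b_tile

def max_stages(block_M, block_N, block_K, smem_bytes):
    return smem_bytes // stage_bytes(block_M, block_N, block_K)

# 128x128x32 fp16 tiles against an A100-class 164 KB budget:
per_stage = stage_bytes(128, 128, 32)  # 16384 bytes per stage
print(per_stage, max_stages(128, 128, 32, 164 * 1024))
```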

Step 5: Validate and Test

Generate test code to verify correctness:

# Example instantiation
func = kernel_name(
    M=1024, N=1024, K=1024,
    block_M=128, block_N=128, block_K=32
)

# Test against reference implementation
import torch
A = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')
B = torch.randn(1024, 1024, dtype=torch.float16, device='cuda')
C_reference = A @ B

# With out_idx=[2] in the decorator, the runtime allocates the output
# buffer and returns it, so C is not passed in explicitly
C_tilelang = func(A, B)

# Verify with appropriate tolerance for FP16
torch.testing.assert_close(C_tilelang, C_reference, rtol=1e-3, atol=1e-3)
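Once results are correct, measured latency can be converted to achieved TFLOPS for comparison against cuBLAS or the targets in Step 6; a GEMM performs 2*M*N*K floating-point operations. A small pure-Python helper (illustrative name):

```python
# Convert a measured kernel latency into achieved TFLOPS.
# A GEMM performs 2*M*N*K flops (one multiply and one add per
# inner-product term).

def gemm_tflops(M, N, K, latency_s):
    return (2 * M * N * K) / latency_s / 1e12

# e.g. a 4096^3 GEMM measured at 500 microseconds:
print(round(gemm_tflops(4096, 4096, 4096, 500e-6), 1))  # 274.9
```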

Step 6: Optimize Performance

Apply advanced optimizations if performance is suboptimal:

Block Size Tuning:

  • A100: Try 128×128×32 or 64×64×32
  • H100: Can use larger blocks (256×128×32)
  • MI300X: May need smaller blocks due to 64KB shared memory limit

Pipeline Depth:

  • Increase num_stages if memory-bound
  • Decrease if shared memory is exhausted

Warp Policy (for advanced cases):

T.gemm(A, B, C, policy=T.GemmWarpPolicy.FullRow)  # For attention
T.gemm(A, B, C, policy=T.GemmWarpPolicy.FullCol)  # For MLA decode

Block-level swizzle:

T.use_swizzle(panel_size=10)  # Improves L2 cache hit rate
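These knobs can be swept offline. A minimal pre-filter (pure-Python sketch; candidate sets and names are illustrative) discards block-size/stage combinations whose fp16 tiles would not fit the shared-memory budget before any benchmarking:

```python
# Minimal offline filter over block-size / stage candidates: keep only
# configurations whose pipelined fp16 tiles fit the shared-memory budget.
# A real autotuner would then benchmark the survivors.
from itertools import product

def smem_bytes(bm, bn, bk, stages, dtype_bytes=2):
    return stages * (bm * bk + bk * bn) * dtype_bytes

def feasible_configs(smem_limit):
    for bm, bn, bk, stages in product((64, 128, 256), (64, 128),
                                      (16, 32), (2, 3, 4)):
        if smem_bytes(bm, bn, bk, stages) <= smem_limit:
            yield bm, bn, bk, stages

a100 = list(feasible_configs(164 * 1024))   # A100: 164 KB budget
mi300x = list(feasible_configs(64 * 1024))  # MI300X: 64 KB, fewer survive
print(len(a100), len(mi300x))
```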

Common Kernel Patterns

Matrix Multiplication (GEMM)

Most fundamental kernel. See EXAMPLES.md for complete implementation.

Key features:

  • 3-stage pipelining
  • Swizzle layout for shared memory
  • Float32 accumulator for precision

FlashAttention

Memory-efficient attention with online softmax. See EXAMPLES.md for complete implementation.

Key features:

  • O(N) memory complexity
  • Online softmax statistics
  • Fused kernel (no intermediate materialization)

DeepSeek MLA

Multi-Head Latent Attention with KV compression. See EXAMPLES.md for complete implementation.

Key features:

  • Split-KV parallelization
  • Non-standard dimensions
  • FullCol warp policy for narrow matrices

Reference Documentation

When you need specific information:

  • API details (parameters, signatures, options): Read API_REFERENCE.md
  • Complete code examples (GEMM, Attention, MLA): Read EXAMPLES.md
  • Troubleshooting (errors, performance issues): Read DEBUGGING.md

Critical Performance Guidelines

Always include these optimizations:

  1. Swizzle layout for shared memory:

    T.annotate_layout({
        A_shared: T.make_swizzled_layout(A_shared)
    })
    
  2. Software pipelining:

    for k in T.Pipelined(num_blocks, num_stages=3):
    
  3. Float32 accumulators:

    C_local = T.alloc_fragment((M, N), "float32")  # Not float16
    
  4. Aligned block_K:

    block_K = 32  # Or 16; must be a multiple of the Tensor Core K tile (16 for FP16)
    
  5. Initialize accumulators:

    T.clear(C_local)
    

Hardware-Specific Considerations

NVIDIA GPUs

  • Ampere (A100): Use cp.async, num_stages=3, block_K=32
  • Hopper (H100): Can use TMA, larger blocks (256×128), num_stages=4
  • Shared memory: 164KB (A100), 228KB (H100)

AMD GPUs

  • MI300X: Use target="hip", smaller blocks, 64KB shared memory limit
  • Test with both HIP and CUDA backends for compatibility

Huawei Ascend

  • More experimental backend
  • May require specific block sizes
  • Consult Ascend-specific documentation

Code Quality Standards

When generating TileLang code:

  1. Add explanatory comments for non-obvious optimizations
  2. Specify hardware assumptions (e.g., "optimized for A100")
  3. Include usage examples showing instantiation
  4. Document block size choices and tuning rationale
  5. Provide performance expectations (e.g., "~90% of cuBLAS")

Example Kernel Request Flow

User: "Write a FP16 matrix multiplication kernel for A100"

Response:

  1. Clarify dimensions (if not specified)
  2. Generate complete kernel code with:
    • Proper structure (@tilelang.jit, @T.prim_func)
    • Swizzle layouts
    • 3-stage pipelining
    • Appropriate block sizes (128×128×32)
  3. Add usage example
  4. Explain key optimizations:
    • "Swizzle layout eliminates bank conflicts"
    • "3-stage pipeline overlaps memory and compute"
    • "Float32 accumulator prevents overflow"
  5. Suggest testing approach

Troubleshooting Quick Reference

Compilation errors:

  • Shared memory exceeded → Reduce block size or num_stages
  • Shape mismatch → Verify dimension alignment in T.gemm

Runtime errors:

  • Results all zeros → Check T.clear() and out_idx in decorator
  • NaN/Inf → Use float32 accumulator, add epsilon in division

Performance issues:

  • Low throughput → Verify swizzle layout and pipelining enabled
  • Low occupancy → Reduce shared memory usage or block size
  • Bank conflicts → Apply T.make_swizzled_layout

For detailed solutions, consult DEBUGGING.md.

