# Ascend NPU Kernel Butler

Expert guide for Ascend NPU hardware architecture and triton-ascend kernel development. It avoids confusion with GPU concepts by explaining the fundamental differences between Ascend NPU and GPU architectures.
## Overview

Ascend NPU (Neural Processing Unit) is Huawei's AI accelerator, built on a fundamentally different architecture from GPUs. This skill provides accurate, NPU-specific guidance for kernel development using triton-ascend, ensuring code correctness and optimal performance.

**Critical:** When answering Ascend-related questions, always use NPU-specific terminology and concepts. Do not map GPU concepts (warp, SM, shared memory) directly onto the NPU architecture.
## Ascend Hardware Architecture

### AI Core Structure

The AI Core is the fundamental computing unit in the Ascend NPU, organized differently from GPU Streaming Multiprocessors (SMs):
| Component | Function | GPU Equivalent |
|---|---|---|
| Cube Unit | Matrix computation (16x16 FP16 matmul per cycle) | Tensor Core |
| Vector Unit | SIMD vector operations | CUDA Core |
| Scalar Unit | Control flow, instruction scheduling (mini-CPU) | Warp Scheduler |
| MTE1/MTE2/MTE3 | Data transfer between storage levels | Memory Controller |
| FixPipe | On-the-fly format/type conversion | N/A |
### Operating Modes

**Coupled Mode (A1 series):**
- A single Scalar unit schedules both the Cube and Vector units

**Decoupled Mode (A2/A3 series):**
- Independent Scalar units for Cube and Vector
- Higher parallelism potential
### Memory Hierarchy

```
Global Memory (GM)
      ↓ MTE2
   L1 Buffer
      ↓ MTE1
┌─────────┬─────────┐
   L0A       L0B
  (Cube     (Cube
  input)    input)
└───── Cube ─────┘
      ↓
 L0C (Cube output)
      ↓ FixPipe
Unified Buffer (UB)
      ↓ MTE3
Global Memory (GM)
```

MTE2 loads data inbound from GM, MTE1 feeds the Cube input buffers from L1, and MTE3 writes UB back out to GM; Vector-unit data can also move directly GM → UB via MTE2.
**Key differences from GPU:**
- No unified memory space
- Explicit data movement between levels (via the MTE units)
- Strict data-flow paths
## Common GPU vs NPU Confusions

### 1. Memory Model

**GPU:** unified memory space; shared memory accessible by all threads in a block
- Shared memory → fast, software-managed cache

**NPU:** multi-level storage hierarchy; explicit data movement required
- Unified Buffer (UB) → general-purpose data staging
- L0A/L0B/L0C → Cube-unit-specific buffers
- L1 Buffer → intermediate storage
### 2. Threading Model

**GPU:** thread blocks, warps (32 threads), SIMT execution
- `tl.program_id(axis)` → block ID
- `tl.arange()` → per-element offsets within the block

**NPU:** block-based execution, no warp concept
- Blocks are the fundamental execution unit
- No SIMT warp-level synchronization
- Use block-level barriers instead
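The block decomposition described above can be sketched in plain Python. This is an illustrative helper (`block_ranges` is a hypothetical name, not a triton-ascend API): given a problem size and `BLOCK_SIZE`, it computes the launch grid and each block's offset range, mirroring `pid = tl.program_id(0)` and `block_start = pid * BLOCK_SIZE`.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division: number of blocks needed to cover `a` elements."""
    return (a + b - 1) // b

def block_ranges(n_elements: int, block_size: int):
    """Yield (pid, start, stop) for each block in the launch grid."""
    for pid in range(cdiv(n_elements, block_size)):
        start = pid * block_size
        stop = min(start + block_size, n_elements)  # last block may be partial
        yield pid, start, stop

# Example: 1000 elements with BLOCK_SIZE = 256 → 4 blocks, last one partial
ranges = list(block_ranges(1000, 256))
# → [(0, 0, 256), (1, 256, 512), (2, 512, 768), (3, 768, 1000)]
```

The partial last block is exactly why bounds masking matters in the kernel itself.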
### 3. Synchronization

**GPU:** `__syncthreads()`, warp-level primitives
- `tl.atomic_*` for atomic memory operations

**NPU:** PipeBarrier and SetFlag/WaitFlag for pipeline synchronization
- Different synchronization semantics
- Avoid GPU synchronization patterns
### 4. Data Access Patterns

**GPU:** flexible memory access; coalescing matters
- Arbitrary access patterns are possible (at a performance cost)

**NPU:** strict alignment requirements
- Vector instructions require 32B alignment
- Cache-line alignment improves load efficiency
- Plan data movement carefully
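A minimal sketch of the alignment check (hypothetical helper names, not a triton-ascend API): compute a block's byte footprint from its element count and dtype, and verify it is a multiple of the 32B Vector boundary (or 64B for cache-line alignment).

```python
DTYPE_BYTES = {"fp16": 2, "bf16": 2, "fp32": 4}

def block_bytes(block_size: int, dtype: str) -> int:
    """Bytes moved per block for a given element count and dtype."""
    return block_size * DTYPE_BYTES[dtype]

def is_aligned(block_size: int, dtype: str, boundary: int = 32) -> bool:
    """True if the block's byte size is a multiple of `boundary`
    (32B for Vector instructions, 64B for cache-line alignment)."""
    return block_bytes(block_size, dtype) % boundary == 0

# 128 fp16 elements = 256B → aligned to both 32B and 64B
# 100 fp32 elements = 400B → misaligned (400 % 32 == 16)
```

Picking `BLOCK_SIZE` as a power of two (64, 128, 256, ...) satisfies both boundaries for common dtypes.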
## triton-ascend Development

### Basic Kernel Structure

```python
import triton
import triton.language as tl

@triton.jit
def npu_kernel(
    x_ptr, y_ptr, z_ptr,
    n_elements,
    BLOCK_SIZE: tl.constexpr,
):
    # Block ID (different from the GPU thread-block concept)
    pid = tl.program_id(axis=0)

    # Offset calculation
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)

    # Mask out-of-range offsets so the last (partial) block is safe
    mask = offsets < n_elements

    # Load from Global Memory to Unified Buffer
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)

    # Compute in the Vector Unit
    z = x + y

    # Store back to Global Memory
    tl.store(z_ptr + offsets, z, mask=mask)
```
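To sanity-check block indexing without NPU hardware, the per-block arithmetic of an elementwise-add kernel can be emulated in plain Python. This is an illustrative sketch (`emulate_npu_kernel` is a hypothetical name), not how triton-ascend actually executes; the bounds check plays the role of the `mask` argument to `tl.load`/`tl.store` for the partial last block.

```python
def emulate_npu_kernel(x, y, block_size):
    """Emulate a z = x + y kernel: one loop iteration per block (pid)."""
    n = len(x)
    z = [0.0] * n
    n_blocks = (n + block_size - 1) // block_size  # launch grid, cdiv
    for pid in range(n_blocks):
        block_start = pid * block_size
        for off in range(block_start, block_start + block_size):
            if off < n:  # bounds check: the mask in a real kernel
                z[off] = x[off] + y[off]
    return z

# 10 elements with BLOCK_SIZE = 4 → 3 blocks; the last masks 2 lanes
out = emulate_npu_kernel([1.0] * 10, [2.0] * 10, 4)
# → [3.0, 3.0, ..., 3.0]
```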
### GPU to NPU Migration Checklist

When migrating GPU Triton kernels to NPU:

- Replace `tl.dot()` with explicit `tl.matmul()` for NPU
- Check data alignment (32B for Vector, 64B for Cube)
- Verify memory access patterns match the NPU hierarchy
- Remove GPU-specific synchronization primitives
- Use NPU-specific intrinsic functions when needed
- Consider multi-buffering for pipeline efficiency
## Performance Optimization

### Reduce Scalar Computation

Scalar units have limited throughput. Minimize:
- Complex branching logic
- Runtime-dependent calculations
- Dynamic loop conditions

Good:

```python
# Precompute at compile time
TILE_SIZE: tl.constexpr = 64
```

Avoid:

```python
# Runtime calculation on the Scalar unit
tile_size = tl.sqrt(n_elements).to(tl.int32)
```
### Data Alignment

- Vector instructions: 32B alignment minimum
- Cache-line alignment: 64B for better performance
- Use `tl.contiguous()` to ensure memory layout
### Cache Utilization

Maximize ICache (instruction cache) and DCache (data cache) hit rates:
- Keep kernels compact
- Reuse loaded data
- Minimize Global Memory access
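To see why data reuse matters, the sketch below (plain Python with hypothetical names, counting element accesses as stand-ins for GM loads) compares a sliding-window sum computed naively against a version that loads each element once and reuses it:

```python
def windowed_sum_naive(x, w):
    """Reload every element for each window: ~w reads per output."""
    reads = 0
    out = []
    for i in range(len(x) - w + 1):
        s = 0
        for j in range(i, i + w):
            s += x[j]
            reads += 1  # each access counted as a fresh GM load
        out.append(s)
    return out, reads

def windowed_sum_reuse(x, w):
    """Load each element once and keep a running sum: 1 read per element."""
    reads = len(x)  # one load per element, then reuse
    s = sum(x[:w])
    out = [s]
    for i in range(w, len(x)):
        s += x[i] - x[i - w]  # reuse previously loaded values
        out.append(s)
    return out, reads

x = list(range(16))
naive_out, naive_reads = windowed_sum_naive(x, 4)   # 52 reads
reuse_out, reuse_reads = windowed_sum_reuse(x, 4)   # 16 reads
```

Same results, roughly `w`× fewer loads; on the NPU the analogous win is staging data once in UB/L1 instead of re-fetching from GM.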
### Key Intrinsic Functions

| Function | Purpose | Notes |
|---|---|---|
| `tl.program_id(axis)` | Get block index | Not a thread ID |
| `tl.arange(start, stop)` | Generate offset sequence | Block-local |
| `tl.load(ptr)` | Load from GM to UB | Respects alignment |
| `tl.store(ptr, val)` | Store from UB to GM | Respects alignment |
| `tl.matmul(a, b)` | Matrix multiplication | Uses the Cube unit |
| `tl.exp(x)`, `tl.sqrt(x)` | Math functions | Use the Vector unit |
## Migration from GPU Triton

For detailed migration guidance, refer to:

- [references/migrate-from-gpu.md](references/migrate-from-gpu.md) - Step-by-step migration guide
- [references/architecture-difference.md](references/architecture-difference.md) - Detailed architecture comparison
When migrating kernels:
- Analyze memory access patterns
- Verify data flow through storage hierarchy
- Replace GPU-specific operations with NPU equivalents
- Test with small inputs first
- Profile and optimize based on NPU-specific counters
## Additional Resources

### Official Documentation
- Ascend Basic Architecture - Hardware fundamentals
- Abstract Hardware Architecture - Programming model
- Architecture Difference - GPU vs NPU comparison
- Migration Guide - Kernel migration
- Performance Guidelines - Optimization tips
- Core Features - triton-ascend design
### Reference Files in This Skill

- [references/hardware-architecture.md](references/hardware-architecture.md) - Detailed hardware architecture
- [references/triton-ascend-guide.md](references/triton-ascend-guide.md) - Development workflow
- [references/gpu-npu-differences.md](references/gpu-npu-differences.md) - Comprehensive comparison
### Example Code

Working examples in `examples/`:

- [kernel-example.py](examples/kernel-example.py) - Basic NPU kernel template
## Common Pitfalls

- **Using GPU terminology** → Always use NPU-specific terms (AI Core, not SM; UB, not shared memory)
- **Ignoring alignment** → Vector ops require 32B alignment; cache lines are 64B
- **Wrong synchronization** → No warps on the NPU; use block-level barriers
- **Excessive Scalar computation** → Scalar units are slow; precompute at compile time
- **Poor data reuse** → Minimize GM access; maximize UB/L1 utilization