amd-rocm-porting

Port NVIDIA CUDA codebases to AMD ROCm GPUs. Use when making PyTorch models run on AMD GPUs, replacing NVIDIA-specific libraries with AMD equivalents, fixing ROCm build/runtime failures, or porting C/C++ CUDA kernels to HIP. Also covers dependency debugging and environment setup on ROCm Docker images.


AMD ROCm Porting

Port NVIDIA CUDA codebases to AMD ROCm GPUs for functional equivalence.

5 Critical Rules (read first)

  1. NVIDIA isolation: Every ROCm change MUST be gated behind is_rocm. The NVIDIA code path must be byte-for-byte identical to the pre-porting state.

    is_rocm = hasattr(torch.version, "hip") and torch.version.hip is not None
    
  2. Compile mode: NEVER use mode="reduce-overhead" on ROCm — causes 65x slowdown. Use mode="default" on ROCm, keep original mode for NVIDIA.

  3. Inductor: Disable triton.cudagraphs, triton.cudagraph_trees, and memory_planning on ROCm. Also override max_autotune = False (AMD Docker images set it True by default, causing mode="default" to silently behave like max-autotune and hang). Details: references/torch-compile-and-cudagraph.md

  4. Warp width: AMD wavefronts are 64-wide (not 32). All ballot/mask operations need uint64_t. (C/C++ repos only; pure Python repos skip this.)

  5. Three-tier fallback: AMD-optimized lib → PyTorch SDPA → pure PyTorch eager. Details: references/library-and-model-adaptation.md
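Rules 1 and 2 can be captured in a small gating helper. A minimal sketch, assuming nothing beyond what the rules state; the helper names are illustrative, not part of any skill API:

```python
# Sketch of the is_rocm gate (Rule 1) and the compile-mode override (Rule 2).
# Helper names are illustrative.

def detect_rocm(torch_version) -> bool:
    # Rule 1: ROCm builds of PyTorch expose torch.version.hip as a string;
    # NVIDIA builds leave it absent or None.
    return getattr(torch_version, "hip", None) is not None


def pick_compile_mode(is_rocm: bool, requested: str) -> str:
    # Rule 2: never "reduce-overhead" on ROCm; NVIDIA keeps the caller's
    # requested mode byte-for-byte.
    return "default" if is_rocm else requested
```

Usage would look like `torch.compile(model, mode=pick_compile_mode(is_rocm, original_mode))`, so the NVIDIA path stays untouched.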

Decision Tree: Which Phases to Run

Does the repo have C/C++ CUDA kernels (.cu / .cuh files)?
├── NO  → Skip Phases 2, 3, 4. Run Phases 1, 5, 6, 7, 8 only.
│         (Pure Python/PyTorch repos — most HuggingFace models, etc.)
└── YES → Run all 8 phases.
          Does it use flash-attn, CUTLASS, or custom extensions?
          ├── flash-attn only → Phase 5 (replace with aiter)
          ├── CUTLASS         → Phase 3 + manual CK rewrite
          └── custom kernels  → Full Phase 2 + 3 HIPIFY workflow

Context Management

  • Load reference files lazily — only read a reference when actively working on that phase.
  • Summarize findings — after each phase, record a brief summary rather than retaining raw grep output in context.

Phase Checklist

Phase 0: Environment Setup

Step 1 — Audit existing environment before installing anything. AMD Docker images often ship with ROCm PyTorch, aiter, and flash-attn pre-installed. Check what exists:

env | grep -iE 'TORCH|INDUCTOR|AUTOTUNE|TRITON|HIP|ROCM|HSA|GPU|AMD|CUDA' | sort
pip show torch torchvision transformers 2>/dev/null | grep -E "^(Name|Version|Location)"

Add the repo src/ to sys.path in scripts to make the package importable without pip install:

import sys, pathlib
sys.path.insert(0, str(pathlib.Path(__file__).resolve().parents[1] / "src"))

Run the target script and note only the ModuleNotFoundErrors that actually occur. Install those packages individually (pip install --no-deps <pkg>).
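The "note only the errors that actually occur" step can be sketched as a small probe, assuming a helper name of our own invention:

```python
# Sketch: probe the target package for its first missing dependency,
# instead of installing the whole requirements list up front.
# The helper name is illustrative.
import importlib


def find_missing_module(module_name: str):
    """Return the name of the first missing dependency, or None if it imports."""
    try:
        importlib.import_module(module_name)
        return None
    except ModuleNotFoundError as e:
        # Install just this one: pip install --no-deps <e.name>, then re-probe.
        return e.name
```

Re-run after each `pip install --no-deps <pkg>` until the probe returns None.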

Step 2 — Never run pip install -e . on AMD without exclusions. The pyproject.toml was written for NVIDIA and often contains jax[cuda12] and torch==X.Y.Z. Running pip install -e . will overwrite your ROCm PyTorch with CUDA versions. Use: pip install --no-deps --ignore-requires-python -e ., then install only missing packages.

CRITICAL: Protect PyTorch after any pip install. Always verify:

python3 -c "import torch; print(torch.__version__, torch.version.hip)"

If torch.version.hip is None, your ROCm PyTorch was overwritten. Reinstall it.

Step 3 — Python version constraint. requires-python = ">=3.11" is often a conservative constraint. Use --ignore-requires-python.

Step 4 — Repos with JAX + PyTorch: use PyTorch-only path. Skip all JAX-dependent code; do not attempt to install or fix JAX for ROCm.

Step 5 — Dependency debugging. If you hit ImportError, version mismatch, or dtype errors, read references/dependency-debugging.md for the diagnostic protocol.

Phase 1: ROCm Detection & Flags

  • Detect ROCm: is_rocm = hasattr(torch.version, "hip") and torch.version.hip is not None
  • Detect GPU arch (never hardcode): rocminfo | grep -o 'gfx[0-9a-f]*' | head -1 → e.g. gfx942 (MI300X) or gfx950 (MI350X)
  • Set ROCm-safe env vars: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True (omit max_split_size_mb)
  • Set performance env vars:
    export GPU_MAX_HW_QUEUES=2  HIP_FORCE_DEV_KERNARG=1  HSA_NO_SCRATCH_RECLAIM=1  AMD_LOG_LEVEL=0
    
  • Disable NUMA balancing (10-30% perf loss if left on): sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
  • Verify GPU: rocm-smi, rocminfo | grep gfx, hipcc --version
  • FP8 dtype depends on arch:
    arch = torch.cuda.get_device_properties(0).gcnArchName
    fp8_dtype = torch.float8_e4m3fnuz if "gfx942" in arch else torch.float8_e4m3fn
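The environment variables in this checklist can be applied from Python before any torch import (several are read at HIP runtime init, so setting them early matters). A sketch using the values listed above; the helper name is illustrative:

```python
# Sketch: apply the Phase 1 ROCm environment variables in-process,
# before importing torch. Values are the ones from the checklist above.
import os

ROCM_ENV = {
    "PYTORCH_CUDA_ALLOC_CONF": "expandable_segments:True",  # no max_split_size_mb
    "GPU_MAX_HW_QUEUES": "2",
    "HIP_FORCE_DEV_KERNARG": "1",
    "HSA_NO_SCRATCH_RECLAIM": "1",
    "AMD_LOG_LEVEL": "0",
}


def apply_rocm_env(env=os.environ):
    for key, value in ROCM_ENV.items():
        # Don't clobber values already set by the Docker image or the user.
        env.setdefault(key, value)
```

Because `setdefault` is used, anything the image already exports wins; audit with the Phase 0 `env | grep` command if behavior looks off.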
    

Phase 2: Source Translation (C/C++ only)

  • Run hipify-perl --inplace for initial pass, then hipify-clang for complex templates
  • Key header mappings: cuda_runtime.h → hip/hip_runtime.h, cublas_v2.h → hipblas/hipblas.h
  • Flag inline PTX (grep -rn "asm\s*(") — cannot be auto-ported; flag CUTLASS — needs manual CK rewrite

Phase 3: Architecture Adaptation (C/C++ only)

  • Replace 32-bit ballot masks with uint64_t for AMD 64-wide wavefronts
  • Replace __popc with __popcll for 64-bit masks; prefer 64-element shared memory tiles
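Why the 32-bit habit fails silently can be shown host-side, no GPU required. A pure-Python illustration (the constants stand in for ballot results; `popcount` stands in for `__popcll` vs `__popc`):

```python
# Host-side illustration of the wavefront-width pitfall: on AMD a full
# ballot mask covers 64 lanes, so truncating it to 32 bits (the CUDA habit)
# silently drops half the lanes.
FULL_WAVEFRONT_64 = (1 << 64) - 1   # what __ballot yields on a 64-wide wavefront
FULL_WARP_32 = (1 << 32) - 1        # the width 32-bit CUDA code assumes


def popcount(mask: int) -> int:
    # Stands in for __popcll (64-bit); a 32-bit __popc sees only the low half.
    return bin(mask).count("1")
```

With all 64 lanes active, `popcount(FULL_WAVEFRONT_64)` is 64, but `popcount(FULL_WAVEFRONT_64 & FULL_WARP_32)` is 32: the wrong answer, with no error raised.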

Phase 4: Build System (C/C++ only)

  • Detect GPU arch at runtime — never hardcode: GPU_ARCH=$(rocminfo | grep -o 'gfx[0-9a-f]*' | head -1)
  • CMake: find_package(HIP), set CMAKE_HIP_ARCHITECTURES to the detected arch
  • setup.py: detect is_rocm, use CUDAExtension (hipcc handles .cu on ROCm)
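The runtime arch detection above can be done in Python (e.g. inside setup.py) instead of grep. A sketch with an illustrative helper name, assuming `rocminfo` is on PATH when no text is supplied:

```python
# Sketch: Python equivalent of
#   GPU_ARCH=$(rocminfo | grep -o 'gfx[0-9a-f]*' | head -1)
# for use inside setup.py or CMake-driving scripts.
import re
import subprocess
from typing import Optional


def detect_gfx_arch(rocminfo_output: Optional[str] = None) -> Optional[str]:
    """Return the first gfx arch token (e.g. 'gfx942'), or None if not found."""
    if rocminfo_output is None:
        rocminfo_output = subprocess.run(
            ["rocminfo"], capture_output=True, text=True, check=False
        ).stdout
    m = re.search(r"gfx[0-9a-f]+", rocminfo_output)
    return m.group(0) if m else None
```

The result feeds `CMAKE_HIP_ARCHITECTURES` or `--offload-arch` directly; per the checklist, never hardcode it.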

Phase 5: Library Replacement

  • flash-attn → aiter (different API; wrap with three-tier fallback)
  • NCCL → RCCL, cuBLAS → hipBLAS (drop-in via HIPIFY)
  • pynvml: guard with try/except, use torch.cuda.is_available() as primary GPU check
  • PYTORCH_CUDA_ALLOC_CONF: remove max_split_size_mb on ROCm (rejected by HIP allocator)
  • Details + fallback patterns: references/library-and-model-adaptation.md
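The three-tier pattern reduces to a generic fallback chain. A minimal sketch: the chain helper is our own invention, and in practice the probes would wrap an aiter import/smoke call, `F.scaled_dot_product_attention`, and a pure-PyTorch eager path:

```python
# Sketch of the three-tier fallback from Phase 5: AMD-optimized lib ->
# PyTorch SDPA -> pure eager. The helper name and backend labels are
# illustrative; real probes would exercise the actual attention call.
def first_working(candidates):
    """Return the name of the first backend whose probe() succeeds."""
    for name, probe in candidates:
        try:
            probe()          # e.g. `import aiter` plus a tiny smoke call
            return name
        except Exception:
            continue         # fall through to the next tier
    raise RuntimeError("no attention backend available")
```

Probing at startup (rather than per-call) keeps the hot path free of try/except overhead.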

Phase 6: torch.compile Adaptation

  • Gate compile mode: "default" on ROCm, original mode on NVIDIA
  • Audit env vars first: env | grep -iE 'TORCH|INDUCTOR|AUTOTUNE' — unset TORCHINDUCTOR_MAX_AUTOTUNE if present (causes hangs even in default mode)
  • Apply Inductor config: disable cudagraphs, memory_planning; set max_autotune=False; use ATEN GEMM backend
  • Details + monkey-patch: references/torch-compile-and-cudagraph.md
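Taken together, the Phase 6 overrides look roughly like the config fragment below. Attribute names follow `torch._inductor.config` as of recent PyTorch releases; verify them against your torch version (and the reference file above) before relying on them:

```python
# Config fragment: Phase 6 Inductor overrides, applied only on ROCm so the
# NVIDIA path is untouched. Attribute names per torch._inductor.config;
# double-check against your installed torch version.
import torch
import torch._inductor.config as inductor_config

is_rocm = hasattr(torch.version, "hip") and torch.version.hip is not None
if is_rocm:
    inductor_config.triton.cudagraphs = False        # Rule 3
    inductor_config.triton.cudagraph_trees = False
    inductor_config.memory_planning = False          # deep-recursion crash
    inductor_config.max_autotune = False             # Docker images default it True
    inductor_config.max_autotune_gemm_backends = "ATEN"  # ATEN GEMM backend
```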

Phase 7: CUDAGraph / HIP Graph (optional)

  • Only needed if kernel launch overhead is a bottleneck (profile first).
  • Since Inductor CUDAGraphs are disabled on ROCm, use manual CUDAGraph capture.
  • HIP does NOT raise errors for illegal ops during capture — it silently produces wrong results on replay. Always validate outputs.
  • For capture patterns, RNG patch, and graph break debugging, see the amd-kernel-optimization skill's torch-compile-and-graphs.md.

Phase 8: Verification

  • Static: grep for remaining cuda_runtime.h, inline PTX, NVIDIA-specific types
  • Build (C/C++): GPU_ARCH=$(rocminfo | grep -o 'gfx[0-9a-f]*' | head -1); hipcc -c kernels.hip --offload-arch=$GPU_ARCH
  • Functional: forward + backward pass, compare loss to CPU reference
  • Numerical: torch.testing.assert_close(rocm_out, cuda_ref, rtol=5e-2, atol=5e-2)
  • Details + golden vector methodology: references/verification-methodology.md

First-Run Compilation Penalty (NORMAL on AMD)

Every JIT component has a slow first run. Do NOT conclude something is broken because first run is slow.

| Component | First Run | Subsequent | Cache |
|---|---|---|---|
| torch.compile (default) | 2-5 min | <1s | TORCHINDUCTOR_CACHE_DIR |
| torch.compile (max-autotune) | 5-15 min | <1s | TORCHINDUCTOR_CACHE_DIR |
| AITER JIT kernels | 1-3 min | <1s | aiter jit/build/ |
| Triton kernels | 1-2 min | <1s | ~/.triton/cache |
| TunableOp GEMM tuning | 1-5 min | <1s | PYTORCH_TUNABLEOP_FILENAME |

Set timeout ≥ 600s for first compilation. Do NOT kill processes under 15 minutes.

Common Pitfalls

| Pitfall | Symptom | Fix |
|---|---|---|
| pip install -e . on AMD | Overwrites ROCm torch with CUDA version | Use --no-deps --ignore-requires-python; install missing pkgs individually |
| TORCHINDUCTOR_MAX_AUTOTUNE=1 in Docker env | mode="default" hangs (silently becomes max-autotune) | unset TORCHINDUCTOR_MAX_AUTOTUNE before any compile |
| reduce-overhead compile mode | 65x slowdown, hangs | mode="default" on ROCm |
| max_split_size_mb in PYTORCH_CUDA_ALLOC_CONF | RuntimeError at startup | Remove on ROCm |
| Top-level import pynvml | ImportError | Guard with try/except; use torch.cuda.is_available() first |
| Inductor cudagraphs enabled | Slowdown, capture errors | inductor_config.triton.cudagraphs = False |
| Inductor memory_planning | Deep recursion crash | inductor_config.memory_planning = False |
| torch.cuda.get_rng_state() during capture | RuntimeError | Apply Dynamo RNG patch |
| torch.backends.cuda.matmul.allow_tf32 | AttributeError on ROCm | Gate behind if not is_rocm |
| NUMA balancing on | 10-30% perf loss, intermittent errors | echo 0 > /proc/sys/kernel/numa_balancing |
| FP8 dtype mismatch | Crash or accuracy loss | gfx942=e4m3fnuz, gfx950=e4m3fn |
| 32-bit warp masks (C/C++) | Silent wrong results | Use uint64_t for ballot/active masks |
| Patching files into wrong site-packages path | Custom model code never loads | Verify with inspect.getfile(TheClass) after patching |

References

Load only when actively working on that phase:

  • references/dependency-debugging.md — dependency diagnostic protocol (Phase 0)
  • references/library-and-model-adaptation.md — library replacement and fallback patterns (Phase 5)
  • references/torch-compile-and-cudagraph.md — Inductor config and CUDAGraph details (Phase 6)
  • references/verification-methodology.md — golden vector methodology (Phase 8)
