Tutorial: Adding a New Kernel to sgl-kernel (AOT / Heavyweight)


This tutorial walks through adding a simple element-wise scale operation as an AOT kernel. We'll implement scale(x, factor) = x * factor to demonstrate the complete workflow.

Goal

Add a new operation that scales each element of a tensor by a scalar factor:

  • Input: tensor x (CUDA) and scalar factor (float)

  • Output: x * factor (element-wise, in place or into a pre-allocated out)

  • Supported dtypes: FP16 (torch.float16), BF16 (torch.bfloat16), FP32 (torch.float32)

  • Dispatched via the DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16 macro (defined in sgl-kernel/include/utils.h)

Two rules of thumb (must follow)

  • Prefer python/sglang/jit_kernel first when the kernel does not depend on CUTLASS or another large C++ project. This is the default path for lightweight kernels that benefit from rapid iteration.

  • Prefer sgl-kernel when the kernel does depend on CUTLASS or another large C++ project, or when it should be part of the AOT wheel / torch op registration flow.

  • Exception: if the dependency is flashinfer, or CUTLASS as already provided through flashinfer, the kernel can still be implemented as a jit_kernel.

In addition, every new kernel must ship with:

  • Tests (pytest)

  • A benchmark script (triton.testing)

Repository integration map

You will typically touch these files/areas:

  • Implementation: sgl-kernel/csrc/elementwise/scale.cu (pick the right subdirectory)

  • Public declarations: sgl-kernel/include/sgl_kernel_ops.h

  • Torch extension registration: sgl-kernel/csrc/common_extension.cc

  • Build: sgl-kernel/CMakeLists.txt (set(SOURCES ...))

  • Python API: sgl-kernel/python/sgl_kernel/ and sgl-kernel/python/sgl_kernel/__init__.py

  • Tests: sgl-kernel/tests/test_scale.py

  • Benchmarks: sgl-kernel/benchmark/bench_scale.py

Step 1: Implement the kernel in csrc/

Pick the right subdirectory:

  • csrc/elementwise/ — for element-wise ops (our example)

  • csrc/gemm/, csrc/attention/, csrc/moe/ — for other categories

Create sgl-kernel/csrc/elementwise/scale.cu:

```cuda
#include <ATen/cuda/CUDAContext.h>
#include <c10/cuda/CUDAGuard.h>
#include <torch/all.h>

#include "utils.h"  // DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16

// scale_kernel: out[i] = input[i] * factor
// Supports float, half (__half), and __nv_bfloat16 via template parameter T.
template <typename T>
__global__ void scale_kernel(T* __restrict__ out, const T* __restrict__ input, float factor, int64_t n) {
  int64_t idx = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (idx < n) {
    out[idx] = static_cast<T>(static_cast<float>(input[idx]) * factor);
  }
}

void scale(at::Tensor& out, const at::Tensor& input, double factor) {
  TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
  TORCH_CHECK(input.is_contiguous(), "input must be contiguous");
  TORCH_CHECK(out.is_cuda(), "out must be a CUDA tensor");
  TORCH_CHECK(out.is_contiguous(), "out must be contiguous");
  TORCH_CHECK(out.sizes() == input.sizes(), "out and input must have the same shape");
  TORCH_CHECK(out.scalar_type() == input.scalar_type(), "out and input must have the same dtype");

  const int64_t n = input.numel();
  const int threads = 256;
  const int blocks = static_cast<int>((n + threads - 1) / threads);

  // Set the device before querying the stream so both refer to input's device.
  const at::cuda::OptionalCUDAGuard device_guard(device_of(input));
  const cudaStream_t stream = at::cuda::getCurrentCUDAStream();

  // Dispatches over float, float16, bfloat16.
  DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16(input.scalar_type(), c_type, [&] {
    scale_kernel<c_type><<<blocks, threads, 0, stream>>>(
        static_cast<c_type*>(out.data_ptr()),
        static_cast<const c_type*>(input.data_ptr()),
        static_cast<float>(factor),
        n);
    cudaError_t status = cudaGetLastError();
    TORCH_CHECK(status == cudaSuccess, "scale_kernel launch failed: ", cudaGetErrorString(status));
    return true;
  });
}
```
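
For very large tensors, the one-thread-per-element launch above eventually runs into the maximum grid size. A grid-stride loop is the usual alternative; this variant is not part of the tutorial's kernel, just an illustrative sketch:

```cuda
// Illustrative grid-stride variant: each thread handles multiple elements,
// so a bounded grid covers any n.
template <typename T>
__global__ void scale_kernel_strided(T* __restrict__ out, const T* __restrict__ input, float factor, int64_t n) {
  const int64_t stride = static_cast<int64_t>(gridDim.x) * blockDim.x;
  for (int64_t idx = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x; idx < n; idx += stride) {
    out[idx] = static_cast<T>(static_cast<float>(input[idx]) * factor);
  }
}
```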

Key points:

  • Use at::Tensor (PyTorch tensors), TORCH_CHECK for validation, at::cuda::getCurrentCUDAStream() for stream

  • DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16 covers float, half (FP16), and __nv_bfloat16 (BF16)

  • Add device error checking after every kernel launch

  • If a kernel only works on certain architectures, enforce that with TORCH_CHECK in the launcher and matching skip logic in tests (a sketch follows this list)
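
For example, a hypothetical guard for a kernel that requires Ampere (SM80) or newer; our scale kernel has no such constraint, so this is illustration only:

```cuda
// Hypothetical arch guard inside the launcher (illustrative only):
const cudaDeviceProp* prop = at::cuda::getCurrentDeviceProperties();
TORCH_CHECK(prop->major >= 8, "this kernel requires compute capability 8.0+, got ",
            prop->major, ".", prop->minor);
```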

Step 2: Add a C++ declaration in include/sgl_kernel_ops.h

Edit sgl-kernel/include/sgl_kernel_ops.h and add the declaration to the elementwise section:

```cpp
void scale(at::Tensor& out, const at::Tensor& input, double factor);
```

Step 3: Register the op in csrc/common_extension.cc

Edit sgl-kernel/csrc/common_extension.cc, inside TORCH_LIBRARY_FRAGMENT(sgl_kernel, m):

```cpp
// From csrc/elementwise
m.def("scale(Tensor! out, Tensor input, float factor) -> ()");
m.impl("scale", torch::kCUDA, &scale);
```

Key points:

  • Tensor! means in-place / mutable output argument

  • The schema is important for torch.compile and for consistent call signatures

  • If your underlying C++ API uses float but the PyTorch binding passes double (the schema type float maps to C++ double), the implicit cast is fine for scalars; use shims if other types need conversion. A quick registration check follows this list.
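
Once the extension is rebuilt (Step 8), a quick way to confirm the registration took effect is to call the raw op directly. A minimal sketch, assuming a CUDA device is available:

```python
import torch
import sgl_kernel  # importing the package loads the extension and registers the ops

x = torch.randn(8, device="cuda", dtype=torch.float16)
out = torch.empty_like(x)

# Call the registered op through torch.ops, matching the schema exactly.
torch.ops.sgl_kernel.scale.default(out, x, 2.0)
torch.testing.assert_close(out, x * 2.0)
```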

Step 4: Add the new source file to CMakeLists.txt

Edit sgl-kernel/CMakeLists.txt and add the file to set(SOURCES ...):

```cmake
csrc/elementwise/scale.cu
```

Key points:

  • Keep the list alphabetically sorted (a comment in the file explicitly requires this); see the excerpt after this list

  • If the kernel has arch constraints, reflect that in tests/benchmarks via skip logic
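
For context, the new entry lands inside the existing list. This excerpt is illustrative; only activation.cu and scale.cu are real names here, and the actual list is much longer:

```cmake
# Illustrative excerpt of sgl-kernel/CMakeLists.txt; keep entries alphabetical.
set(SOURCES
  # ... earlier entries elided ...
  csrc/elementwise/activation.cu
  csrc/elementwise/scale.cu   # NEW entry, inserted in sorted order
  # ... later entries elided ...
)
```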

Step 5: Expose a Python API under sgl-kernel/python/sgl_kernel/

Prefer following the existing module organization first. For elementwise kernels, the usual pattern is:

  • implement the Python wrapper in sgl-kernel/python/sgl_kernel/elementwise.py

  • then re-export it from sgl-kernel/python/sgl_kernel/__init__.py

For example, in sgl-kernel/python/sgl_kernel/elementwise.py, add:

```python
import torch


def scale(
    input: torch.Tensor,
    factor: float,
    out: torch.Tensor | None = None,
) -> torch.Tensor:
    """
    Element-wise scale: out = input * factor.

    Supported dtypes: torch.float16, torch.bfloat16, torch.float32.

    Parameters
    ----------
    input  : CUDA input tensor
    factor : scale factor (float)
    out    : optional pre-allocated CUDA output tensor (same shape/dtype as input)
    """
    if out is None:
        out = torch.empty_like(input)
    torch.ops.sgl_kernel.scale.default(out, input, factor)
    return out
```
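
Usage then looks like this (a sketch, after the package is built and installed):

```python
import torch
import sgl_kernel

x = torch.randn(1024, device="cuda", dtype=torch.bfloat16)
y = sgl_kernel.scale(x, 0.5)      # allocates a fresh output tensor
sgl_kernel.scale(x, 0.5, out=y)   # reuses a pre-allocated output
```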

Then re-export it from sgl-kernel/python/sgl_kernel/__init__.py following the existing import style used by other kernels.
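
A minimal sketch of the re-export (the real file may use a different import form; match what is already there):

```python
# In sgl-kernel/python/sgl_kernel/__init__.py (illustrative):
from sgl_kernel.elementwise import scale
```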

Step 6: Write tests (required)

Create sgl-kernel/tests/test_scale.py:

```python
import pytest
import torch

import sgl_kernel


@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16, torch.float32])
@pytest.mark.parametrize("size", [128, 1024, 4096, 65536])
@pytest.mark.parametrize("factor", [0.5, 1.0, 2.0])
def test_scale_correctness(dtype, size, factor):
    input = torch.randn(size, dtype=dtype, device="cuda")
    out = torch.empty_like(input)

    result = sgl_kernel.scale(input, factor, out=out)
    assert result is out

    expected = input * factor
    rtol, atol = (1e-5, 1e-6) if dtype == torch.float32 else (1e-2, 1e-2)
    torch.testing.assert_close(out, expected, rtol=rtol, atol=atol)


def test_scale_shape_mismatch():
    input = torch.randn(128, dtype=torch.float16, device="cuda")
    out = torch.empty(256, dtype=torch.float16, device="cuda")
    with pytest.raises(RuntimeError, match="same shape"):
        sgl_kernel.scale(input, 2.0, out=out)


def test_scale_cpu_input():
    input = torch.randn(128, dtype=torch.float16)  # CPU tensor
    out = torch.empty_like(input)
    with pytest.raises(RuntimeError, match="CUDA"):
        sgl_kernel.scale(input, 2.0, out=out)


if __name__ == "__main__":
    pytest.main([__file__, "-q"])
```
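
If the kernel had architecture constraints (see Step 1), the tests would also carry skip logic. A hedged sketch; scale itself runs on any CUDA arch, and the marker name is hypothetical:

```python
import pytest
import torch

# Illustrative guard for an SM80-only kernel.
requires_sm80 = pytest.mark.skipif(
    not torch.cuda.is_available() or torch.cuda.get_device_capability()[0] < 8,
    reason="requires compute capability 8.0+",
)


@requires_sm80
def test_arch_constrained_kernel():
    ...
```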

Step 7: Add a benchmark (required)

Create sgl-kernel/benchmark/bench_scale.py:

```python
import itertools
import os

import torch
import triton
import triton.testing

import sgl_kernel

IS_CI = (
    os.getenv("CI", "false").lower() == "true"
    or os.getenv("GITHUB_ACTIONS", "false").lower() == "true"
)

dtypes = [torch.float16] if IS_CI else [torch.float16, torch.bfloat16, torch.float32]
sizes = [4096] if IS_CI else [2**n for n in range(10, 20)]  # 1K … 512K
factors = [2.0]

configs = list(itertools.product(dtypes, sizes))


def torch_scale(input: torch.Tensor, factor: float) -> torch.Tensor:
    return input * factor


@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=["dtype", "size"],
        x_vals=configs,
        line_arg="provider",
        line_vals=["sglang", "torch"],
        line_names=["SGL Kernel", "PyTorch"],
        styles=[("green", "-"), ("red", "--")],
        ylabel="µs (median)",
        plot_name="scale-performance",
        args={},
    )
)
def benchmark(dtype, size, provider):
    input = torch.randn(size, dtype=dtype, device="cuda")
    out = torch.empty_like(input)
    factor = 2.0

    if provider == "sglang":
        fn = lambda: sgl_kernel.scale(input, factor, out=out)
    else:
        fn = lambda: torch_scale(input, factor)

    ms, min_ms, max_ms = triton.testing.do_bench_cudagraph(
        fn, quantiles=[0.5, 0.2, 0.8]
    )
    return 1000 * ms, 1000 * max_ms, 1000 * min_ms


if __name__ == "__main__":
    benchmark.run(print_data=True)
```

Step 8: Build

Build:

```bash
cd sgl-kernel
make build -j16
```

If you need to limit host resource usage:

```bash
cd sgl-kernel
make build -j1 MAX_JOBS=2 CMAKE_ARGS="-DSGL_KERNEL_COMPILE_THREADS=1"
```

Step 9: Validate

After building successfully, run the test and benchmark:

```bash
pytest sgl-kernel/tests/test_scale.py -q
python sgl-kernel/benchmark/bench_scale.py
```

Troubleshooting

  • Async CUDA errors: rerun with CUDA_LAUNCH_BLOCKING=1 so the error surfaces at the failing launch (see the example commands after this list)

  • Memory errors: compute-sanitizer --tool memcheck python ...

  • Build is too slow / OOM: reduce MAX_JOBS and SGL_KERNEL_COMPILE_THREADS

  • Binary bloat: use sgl-kernel/analyze_whl_kernel_sizes.py

  • CMake sources list: if your .cu file is missing from SOURCES, the symbol will be undefined at link time
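
For example, applied to this tutorial's files:

```bash
# Surface async CUDA errors at the failing launch site (slows execution):
CUDA_LAUNCH_BLOCKING=1 pytest sgl-kernel/tests/test_scale.py -q

# Hunt out-of-bounds or misaligned accesses in the new kernel:
compute-sanitizer --tool memcheck python sgl-kernel/benchmark/bench_scale.py
```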

References

  • sgl-kernel/README.md

  • sgl-kernel/include/sgl_kernel_ops.h

  • sgl-kernel/csrc/common_extension.cc

  • sgl-kernel/CMakeLists.txt

  • sgl-kernel/include/utils.h — DISPATCH_PYTORCH_DTYPE_TO_CTYPE_FLOAT_FP16 macro and friends

  • sgl-kernel/csrc/elementwise/activation.cu — reference for the FP16/BF16/FP32 dispatch pattern

Summary of Files Created/Modified

```text
sgl-kernel/csrc/elementwise/scale.cu        # NEW: CUDA kernel + launcher
sgl-kernel/include/sgl_kernel_ops.h         # MODIFIED: C++ declaration
sgl-kernel/csrc/common_extension.cc         # MODIFIED: schema + dispatch registration
sgl-kernel/CMakeLists.txt                   # MODIFIED: add source file (alphabetical)
sgl-kernel/python/sgl_kernel/elementwise.py # MODIFIED: Python wrapper
sgl-kernel/python/sgl_kernel/__init__.py    # MODIFIED: re-export Python API
sgl-kernel/tests/test_scale.py              # NEW: tests
sgl-kernel/benchmark/bench_scale.py         # NEW: benchmark
```
