browser-onnx

Implements high-performance local machine learning inference in the browser using ONNX Runtime Web. Use this skill when the user needs privacy-first, low-latency, or offline AI capabilities (e.g., image classification, object detection, or NLP) without server-side processing.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "browser-onnx" with this command: npx skills add thongnt0208/browser-onnx-skills/thongnt0208-browser-onnx-skills-browser-onnx

Browser-Based ONNX Inference

This skill provides a comprehensive workflow for executing ONNX models locally in the browser using ONNX Runtime Web (ORT-Web). Local inference offers significant advantages: data stays private, server costs drop, and capacity scales naturally because each user supplies their own compute.

1. Setup and Installation

Install the required library via npm:

npm install onnxruntime-web

Note: For experimental features like WebGPU or WebNN, use the nightly version onnxruntime-web@dev.

2. Global Environment Configuration

Set global ort.env flags before creating a session to optimize the runtime environment.

  • WebAssembly (CPU): Enable multi-threading by setting ort.env.wasm.numThreads (default is half of hardware concurrency) and use a Proxy Worker (ort.env.wasm.proxy = true) to keep the UI responsive.
  • WASM Paths: If binaries are not in the same directory as the JS bundle, manually override paths using ort.env.wasm.wasmPaths to point to local assets or a CDN.
  • WebGPU (GPU): Use ort.env.webgpu.profiling = { mode: 'default' } for performance diagnosis during development.
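The flags above can be set once at startup, before the first session is created. A minimal configuration sketch (the thread cap and the `/assets/ort/` path are illustrative assumptions, not required values):

```javascript
import * as ort from 'onnxruntime-web';

// WASM (CPU) tuning: cap threads and move inference off the main thread.
ort.env.wasm.numThreads = Math.min(4, navigator.hardwareConcurrency || 1);
ort.env.wasm.proxy = true; // run the WASM backend in a Proxy Worker

// Serve the .wasm binaries from a known location (local assets or a CDN).
ort.env.wasm.wasmPaths = '/assets/ort/';

// WebGPU profiling, for development diagnostics only.
ort.env.webgpu.profiling = { mode: 'default' };
```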

3. Creating an Inference Session

Initialize the session by choosing the appropriate Execution Provider (EP):

import * as ort from 'onnxruntime-web';

const session = await ort.InferenceSession.create('./model.onnx', {
  executionProviders: ['webgpu', 'wasm'], // Prioritize GPU, fallback to CPU
  graphOptimizationLevel: 'all' // Enable all graph-level optimizations
});

4. Data Preprocessing

Input data must match the model's training format (e.g., NCHW for vision models).

  • Image-to-Tensor: Use libraries like JIMP or OpenCV.js to resize, normalize (divide by 255.0), and convert RGBA to RGB.
  • Tensor Creation: Use new ort.Tensor('float32', float32Data, dims) (e.g., dims = [1, 3, 224, 224] for a single NCHW image) to prepare the input feeds.
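As a concrete sketch of the image-to-tensor step (the function name is illustrative, and only the divide-by-255.0 normalization from above is applied; some models additionally expect mean/std normalization):

```javascript
// Convert RGBA pixel data (e.g. from CanvasRenderingContext2D.getImageData)
// into a normalized NCHW Float32Array for a [1, 3, height, width] input.
function rgbaToNchw(rgba, width, height) {
  const plane = width * height;
  const out = new Float32Array(3 * plane);
  for (let i = 0; i < plane; i++) {
    out[i]             = rgba[i * 4]     / 255.0; // R plane
    out[plane + i]     = rgba[i * 4 + 1] / 255.0; // G plane
    out[2 * plane + i] = rgba[i * 4 + 2] / 255.0; // B plane
    // the alpha channel (rgba[i * 4 + 3]) is dropped
  }
  return out;
}
```

The result can then be wrapped with new ort.Tensor('float32', data, [1, 3, height, width]) and passed to session.run() as part of the input feeds.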

5. Optimized Inference Patterns

  • Graph Capture: For models with static shapes on WebGPU, enable enableGraphCapture: true to reduce CPU overhead by replaying kernel executions.
  • IO Binding: For transformer models, keep data on the GPU by using ort.Tensor.fromGpuBuffer() and setting preferredOutputLocation: 'gpu-buffer' to avoid expensive memory copies.
  • Quantization: Prefer uint8 quantized models for CPU (WASM) inference to improve performance; avoid float16 on CPU as it lacks native support and is slow.
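The WebGPU-specific options above combine into a session configuration sketch (assuming a WebGPU-capable browser and a model with fully static input shapes; option names follow the ORT-Web session options):

```javascript
const options = {
  executionProviders: ['webgpu'],
  graphOptimizationLevel: 'all',
  enableGraphCapture: true,             // static shapes only
  preferredOutputLocation: 'gpu-buffer' // keep outputs on the GPU
};
```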

6. Large Model Handling (>2GB)

  • Platform Limits: Browsers like Chrome limit ArrayBuffer to ~2GB. Models exceeding this must be exported with external data.
  • Loading External Data: Explicitly link external weight files in the session options:
    const session = await ort.InferenceSession.create(modelUrl, {
      externalData: [{ path: './model.data', data: dataUrl }]
    });
    

7. Common Edge Cases

  • Memory Management: Explicitly call tensor.dispose() for GPU tensors to prevent memory leaks.
  • Zero-Sized Tensors: ORT-Web treats tensors with a dimension of 0 as CPU tensors regardless of the selected EP.
  • Thermal Throttling: Sustained inference on mobile devices can trigger frequency scaling, which may sharply increase latency over time. Prefer lightweight "tiny" model variants to stay within the device's thermal budget.

8. Examples

Multilingual Translation

Offload heavy translation tasks to a separate Web Worker using a singleton pattern to ensure the model (e.g., NLLB-200) loads only once.
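The singleton idea can be sketched independently of the model; createSession here is a hypothetical stand-in for the actual ort.InferenceSession.create call made inside the worker:

```javascript
// Lazily create the session once and share the same promise across callers,
// so concurrent requests never trigger a second model download.
let sessionPromise = null;

function getSession(createSession) {
  if (sessionPromise === null) {
    sessionPromise = createSession();
  }
  return sessionPromise;
}
```

Inside the worker, createSession would be something like () => ort.InferenceSession.create(modelUrl, options); every message handler awaits getSession() instead of creating its own session.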

Object Detection (YOLO)

Implement Non-Max Suppression (NMS). If the browser lacks support for specific NMS ops, run a separate NMS ONNX model to filter overlapping boxes locally.
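A pure-JavaScript fallback is also possible when neither the model nor the runtime provides the NMS op. A minimal greedy sketch, with boxes as [x1, y1, x2, y2] and an illustrative 0.5 IoU threshold:

```javascript
// Intersection-over-union of two axis-aligned boxes [x1, y1, x2, y2].
function iou(a, b) {
  const x1 = Math.max(a[0], b[0]), y1 = Math.max(a[1], b[1]);
  const x2 = Math.min(a[2], b[2]), y2 = Math.min(a[3], b[3]);
  const inter = Math.max(0, x2 - x1) * Math.max(0, y2 - y1);
  const areaA = (a[2] - a[0]) * (a[3] - a[1]);
  const areaB = (b[2] - b[0]) * (b[3] - b[1]);
  return inter / (areaA + areaB - inter);
}

// Greedy non-max suppression: visit boxes in descending score order,
// keeping each box only if it does not overlap an already-kept box.
function nms(boxes, scores, iouThreshold = 0.5) {
  const order = scores.map((s, i) => i).sort((i, j) => scores[j] - scores[i]);
  const keep = [];
  for (const i of order) {
    if (keep.every(k => iou(boxes[i], boxes[k]) <= iouThreshold)) keep.push(i);
  }
  return keep; // indices of surviving boxes
}
```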

