nemo-evaluator-sdk

NeMo Evaluator SDK - Enterprise LLM Benchmarking

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

To install this skill, run:

npx skills add davila7/claude-code-templates/davila7-claude-code-templates-nemo-evaluator-sdk


Quick Start

NeMo Evaluator SDK evaluates LLMs across 100+ benchmarks from 18+ harnesses using containerized, reproducible evaluation with multi-backend execution (local Docker, Slurm HPC, Lepton cloud).

Installation:

pip install nemo-evaluator-launcher

Set API key and run evaluation:

export NGC_API_KEY=nvapi-your-key-here

Create minimal config

cat > config.yaml << 'EOF'
defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

evaluation:
  tasks:
    - name: ifeval
EOF

Run evaluation

nemo-evaluator-launcher run --config-dir . --config-name config

View available tasks:

nemo-evaluator-launcher ls tasks

Common Workflows

Workflow 1: Evaluate Model on Standard Benchmarks

Run core academic benchmarks (MMLU, GSM8K, IFEval) on any OpenAI-compatible endpoint.

Checklist:

Standard Evaluation:

  • Step 1: Configure API endpoint
  • Step 2: Select benchmarks
  • Step 3: Run evaluation
  • Step 4: Check results

Step 1: Configure API endpoint

config.yaml

defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./results

target:
  api_endpoint:
    model_id: meta/llama-3.1-8b-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions
    api_key_name: NGC_API_KEY

For self-hosted endpoints (vLLM, TRT-LLM):

target:
  api_endpoint:
    model_id: my-model
    url: http://localhost:8000/v1/chat/completions
    api_key_name: ""  # No key needed for local
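
If you generate configs from a script, the same Step 1 settings can be written with PyYAML instead of editing YAML by hand. A minimal sketch, assuming PyYAML is installed; the dictionary only reproduces the fields shown above.

import yaml  # pip install pyyaml

config = {
    "defaults": [{"execution": "local"}, {"deployment": "none"}, "_self_"],
    "execution": {"output_dir": "./results"},
    "target": {
        "api_endpoint": {
            "model_id": "my-model",
            "url": "http://localhost:8000/v1/chat/completions",
            "api_key_name": "",  # no key for a local endpoint
        }
    },
    "evaluation": {"tasks": [{"name": "ifeval"}]},
}

# Write the config the launcher will read with --config-dir . --config-name config
with open("config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)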

Step 2: Select benchmarks

Add tasks to your config:

evaluation:
  tasks:
    - name: ifeval                # Instruction following
    - name: gpqa_diamond          # Graduate-level QA
      env_vars:
        HF_TOKEN: HF_TOKEN        # Some tasks need HF token
    - name: gsm8k_cot_instruct    # Math reasoning
    - name: humaneval             # Code generation
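
To confirm task names before adding them, the output of `nemo-evaluator-launcher ls tasks` can be filtered programmatically. A minimal sketch, assuming the launcher CLI is on PATH; the exact output format of `ls tasks` is not specified here, so the script just filters raw lines by keyword.

import subprocess

def find_tasks(keyword: str) -> list[str]:
    # "ls tasks" lists the available benchmarks (see Quick Start)
    out = subprocess.run(
        ["nemo-evaluator-launcher", "ls", "tasks"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Output format is an assumption: keep any line mentioning the keyword
    return [line for line in out.splitlines() if keyword.lower() in line.lower()]

if __name__ == "__main__":
    for line in find_tasks("gsm8k"):
        print(line)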

Step 3: Run evaluation

Run with config file

nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config

Override output directory

nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o execution.output_dir=./my_results

Limit samples for quick testing

nemo-evaluator-launcher run \
  --config-dir . \
  --config-name config \
  -o +evaluation.nemo_evaluator_config.config.params.limit_samples=10

Step 4: Check results

Check job status

nemo-evaluator-launcher status <invocation_id>

List all runs

nemo-evaluator-launcher ls runs

View results

cat results/<invocation_id>/<task>/artifacts/results.yml
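
For programmatic access, a task's results.yml can be loaded with PyYAML. A minimal sketch, assuming PyYAML is installed; the schema of results.yml is not documented here, so the script only pretty-prints whatever the file contains.

import json
import sys
from pathlib import Path

import yaml  # pip install pyyaml

# e.g. results/<invocation_id>/<task>/artifacts/results.yml
results_path = Path(sys.argv[1])
with results_path.open() as f:
    results = yaml.safe_load(f)

# Schema is not documented here; just pretty-print what was parsed
print(json.dumps(results, indent=2, default=str))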

Workflow 2: Run Evaluation on Slurm HPC Cluster

Execute large-scale evaluation on HPC infrastructure.

Checklist:

Slurm Evaluation:

  • Step 1: Configure Slurm settings
  • Step 2: Set up model deployment
  • Step 3: Launch evaluation
  • Step 4: Monitor job status

Step 1: Configure Slurm settings

slurm_config.yaml

defaults:
  - execution: slurm
  - deployment: vllm
  - _self_

execution:
  hostname: cluster.example.com
  account: my_slurm_account
  partition: gpu
  output_dir: /shared/results
  walltime: "04:00:00"
  nodes: 1
  gpus_per_node: 8

Step 2: Set up model deployment

deployment:
  checkpoint_path: /shared/models/llama-3.1-8b
  tensor_parallel_size: 2
  data_parallel_size: 4
  max_model_len: 4096

target:
  api_endpoint:
    model_id: llama-3.1-8b
    # URL auto-generated by deployment

Step 3: Launch evaluation

nemo-evaluator-launcher run \
  --config-dir . \
  --config-name slurm_config

Step 4: Monitor job status

Check status (queries sacct)

nemo-evaluator-launcher status <invocation_id>

View detailed info

nemo-evaluator-launcher info <invocation_id>

Kill if needed

nemo-evaluator-launcher kill <invocation_id>
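
For unattended runs, the status command can be polled from a script. A rough sketch, assuming the launcher CLI is available; the status text format and terminal states are assumptions, so adjust the stop condition to what `status` actually prints.

import subprocess
import sys
import time

invocation_id = sys.argv[1]

while True:
    out = subprocess.run(
        ["nemo-evaluator-launcher", "status", invocation_id],
        capture_output=True, text=True,
    ).stdout
    print(out.strip())
    # Stop heuristic (assumption): keep polling while the raw status text
    # still mentions an active state; adjust to the actual status values.
    if not any(word in out.lower() for word in ("running", "pending")):
        break
    time.sleep(60)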

Workflow 3: Compare Multiple Models

Benchmark multiple models on the same tasks for comparison.

Checklist:

Model Comparison:

  • Step 1: Create base config
  • Step 2: Run evaluations with overrides
  • Step 3: Export and compare results

Step 1: Create base config

base_eval.yaml

defaults:
  - execution: local
  - deployment: none
  - _self_

execution:
  output_dir: ./comparison_results

evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.01
        parallelism: 4
  tasks:
    - name: mmlu_pro
    - name: gsm8k_cot_instruct
    - name: ifeval

Step 2: Run evaluations with model overrides

Evaluate Llama 3.1 8B

nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=meta/llama-3.1-8b-instruct \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions

Evaluate Mistral 7B

nemo-evaluator-launcher run \
  --config-dir . \
  --config-name base_eval \
  -o target.api_endpoint.model_id=mistralai/mistral-7b-instruct-v0.3 \
  -o target.api_endpoint.url=https://integrate.api.nvidia.com/v1/chat/completions
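
The two commands above can also be driven from a loop so every model is evaluated against the same base_eval config. A minimal sketch using only the CLI flags shown above.

import subprocess

URL = "https://integrate.api.nvidia.com/v1/chat/completions"
MODELS = [
    "meta/llama-3.1-8b-instruct",
    "mistralai/mistral-7b-instruct-v0.3",
]

for model_id in MODELS:
    # One launcher invocation per model, overriding only the endpoint fields
    subprocess.run(
        [
            "nemo-evaluator-launcher", "run",
            "--config-dir", ".",
            "--config-name", "base_eval",
            "-o", f"target.api_endpoint.model_id={model_id}",
            "-o", f"target.api_endpoint.url={URL}",
        ],
        check=True,
    )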

Step 3: Export and compare

Export to MLflow

nemo-evaluator-launcher export <invocation_id_1> --dest mlflow
nemo-evaluator-launcher export <invocation_id_2> --dest mlflow

Export to local JSON

nemo-evaluator-launcher export <invocation_id> --dest local --format json

Export to Weights & Biases

nemo-evaluator-launcher export <invocation_id> --dest wandb
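
To export several runs in one pass, the same export commands can be looped. A minimal sketch; replace the placeholder invocation ids with real values from `nemo-evaluator-launcher ls runs`.

import subprocess

invocation_ids = ["<invocation_id_1>", "<invocation_id_2>"]  # placeholders

for inv in invocation_ids:
    # Push metrics to MLflow and also keep a local JSON copy
    subprocess.run(["nemo-evaluator-launcher", "export", inv, "--dest", "mlflow"], check=True)
    subprocess.run(
        ["nemo-evaluator-launcher", "export", inv, "--dest", "local", "--format", "json"],
        check=True,
    )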

Workflow 4: Safety and Vision-Language Evaluation

Evaluate models on safety benchmarks and VLM tasks.

Checklist:

Safety/VLM Evaluation:

  • Step 1: Configure safety tasks
  • Step 2: Set up VLM tasks (if applicable)
  • Step 3: Run evaluation

Step 1: Configure safety tasks

evaluation:
  tasks:
    - name: aegis      # Safety harness
    - name: wildguard  # Safety classification
    - name: garak      # Security probing

Step 2: Configure VLM tasks

For vision-language models

target:
  api_endpoint:
    type: vlm  # Vision-language endpoint
    model_id: nvidia/llama-3.2-90b-vision-instruct
    url: https://integrate.api.nvidia.com/v1/chat/completions

evaluation:
  tasks:
    - name: ocrbench   # OCR evaluation
    - name: chartqa    # Chart understanding
    - name: mmmu       # Multimodal understanding

When to Use vs Alternatives

Use NeMo Evaluator when:

  • You need 100+ benchmarks from 18+ harnesses in one platform

  • You run evaluations on Slurm HPC clusters or in the cloud

  • You require reproducible, containerized evaluation

  • You evaluate against OpenAI-compatible APIs (vLLM, TRT-LLM, NIMs)

  • You need enterprise-grade evaluation with result export (MLflow, W&B)

Use alternatives instead:

  • lm-evaluation-harness: Simpler setup for quick local evaluation

  • bigcode-evaluation-harness: Focused only on code benchmarks

  • HELM: Stanford's broader evaluation (fairness, efficiency)

  • Custom scripts: Highly specialized domain evaluation

Supported Harnesses and Tasks

Harness                       Task Count   Categories
lm-evaluation-harness         60+          MMLU, GSM8K, HellaSwag, ARC
simple-evals                  20+          GPQA, MATH, AIME
bigcode-evaluation-harness    25+          HumanEval, MBPP, MultiPL-E
safety-harness                3            Aegis, WildGuard
garak                         1            Security probing
vlmevalkit                    6+           OCRBench, ChartQA, MMMU
bfcl                          6            Function calling v2/v3
mtbench                       2            Multi-turn conversation
livecodebench                 10+          Live coding evaluation
helm                          15           Medical domain
nemo-skills                   8            Math, science, agentic

Common Issues

Issue: Container pull fails

Ensure NGC credentials are configured:

docker login nvcr.io -u '$oauthtoken' -p $NGC_API_KEY

Issue: Task requires environment variable

Some tasks need HF_TOKEN or JUDGE_API_KEY:

evaluation:
  tasks:
    - name: gpqa_diamond
      env_vars:
        HF_TOKEN: HF_TOKEN  # Maps env var name to env var

Issue: Evaluation timeout

Increase parallelism or reduce samples:

-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=100

Issue: Slurm job not starting

Check Slurm account and partition:

execution:
  account: correct_account
  partition: gpu
  qos: normal  # May need specific QOS

Issue: Different results than expected

Verify configuration matches reported settings:

evaluation:
  nemo_evaluator_config:
    config:
      params:
        temperature: 0.0  # Deterministic
        num_fewshot: 5    # Check paper's fewshot count

CLI Reference

Command       Description
run           Execute evaluation with config
status <id>   Check job status
info <id>     View detailed job info
ls tasks      List available benchmarks
ls runs       List all invocations
export <id>   Export results (mlflow/wandb/local)
kill <id>     Terminate running job

Configuration Override Examples

Override model endpoint

-o target.api_endpoint.model_id=my-model
-o target.api_endpoint.url=http://localhost:8000/v1/chat/completions

Add evaluation parameters

-o +evaluation.nemo_evaluator_config.config.params.temperature=0.5
-o +evaluation.nemo_evaluator_config.config.params.parallelism=8
-o +evaluation.nemo_evaluator_config.config.params.limit_samples=50

Change execution settings

-o execution.output_dir=/custom/path
-o execution.mode=parallel

Dynamically set tasks

-o 'evaluation.tasks=[{name: ifeval}, {name: gsm8k}]'

Python API Usage

For programmatic evaluation without the CLI:

from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    EvaluationConfig,
    EvaluationTarget,
    ApiEndpoint,
    EndpointType,
    ConfigParams,
)

Configure evaluation

eval_config = EvaluationConfig(
    type="mmlu_pro",
    output_dir="./results",
    params=ConfigParams(
        limit_samples=10,
        temperature=0.0,
        max_new_tokens=1024,
        parallelism=4,
    ),
)

Configure target endpoint

target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        model_id="meta/llama-3.1-8b-instruct",
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        type=EndpointType.CHAT,
        api_key="nvapi-your-key-here",
    )
)

Run evaluation

result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
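
The same objects can be reused to run several benchmarks back-to-back. A minimal sketch built only from the classes and fields shown above; the structure of the returned result is not documented here, so it is simply printed.

from nemo_evaluator.core.evaluate import evaluate
from nemo_evaluator.api.api_dataclasses import (
    ApiEndpoint,
    ConfigParams,
    EndpointType,
    EvaluationConfig,
    EvaluationTarget,
)

target_config = EvaluationTarget(
    api_endpoint=ApiEndpoint(
        model_id="meta/llama-3.1-8b-instruct",
        url="https://integrate.api.nvidia.com/v1/chat/completions",
        type=EndpointType.CHAT,
        api_key="nvapi-your-key-here",
    )
)

for task in ["mmlu_pro", "ifeval", "gsm8k_cot_instruct"]:
    eval_config = EvaluationConfig(
        type=task,
        output_dir=f"./results/{task}",
        params=ConfigParams(
            limit_samples=10,
            temperature=0.0,
            max_new_tokens=1024,
            parallelism=4,
        ),
    )
    # Return value structure is not documented here; print it as-is
    result = evaluate(eval_cfg=eval_config, target_cfg=target_config)
    print(task, result)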

Advanced Topics

  • Multi-backend execution: see references/execution-backends.md
  • Configuration deep-dive: see references/configuration.md
  • Adapter and interceptor system: see references/adapter-system.md
  • Custom benchmark integration: see references/custom-benchmarks.md

Requirements

  • Python: 3.10-3.13

  • Docker: Required for local execution

  • NGC API Key: For pulling containers and using NVIDIA Build

  • HF_TOKEN: Required for some benchmarks (GPQA, MMLU)
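
A small preflight script can verify these requirements before launching an evaluation. A minimal sketch; the checks and messages are illustrative only.

import os
import shutil
import sys

# Python 3.10-3.13 per the requirements above
assert (3, 10) <= sys.version_info[:2] <= (3, 13), "Python 3.10-3.13 required"

# Docker is required for local execution
if shutil.which("docker") is None:
    print("WARNING: docker not found; local execution needs Docker")

# Key environment variables used by the workflows above
for var, why in [("NGC_API_KEY", "containers / NVIDIA Build"),
                 ("HF_TOKEN", "gated benchmarks such as GPQA, MMLU")]:
    status = "set" if os.environ.get(var) else "MISSING"
    print(f"{var}: {status} ({why})")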
