
llama.cpp

Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.

When to use llama.cpp

Use llama.cpp when:

  • Running on CPU-only machines

  • Deploying on Apple Silicon (M1/M2/M3/M4)

  • Using AMD or Intel GPUs (no CUDA)

  • Edge deployment (Raspberry Pi, embedded systems)

  • Need simple deployment without Docker/Python

Use TensorRT-LLM instead when:

  • Have NVIDIA GPUs (A100/H100)

  • Need maximum throughput (100K+ tok/s)

  • Running in datacenter with CUDA

Use vLLM instead when:

  • Have NVIDIA GPUs

  • Need Python-first API

  • Want PagedAttention

Quick start

Installation

macOS/Linux

brew install llama.cpp

Or build from source

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

With Metal (Apple Silicon)

make LLAMA_METAL=1

With CUDA (NVIDIA)

make LLAMA_CUDA=1

With ROCm (AMD)

make LLAMA_HIP=1

Download model

Download from HuggingFace (GGUF format)

huggingface-cli download \
  TheBloke/Llama-2-7B-Chat-GGUF \
  llama-2-7b-chat.Q4_K_M.gguf \
  --local-dir models/

Or convert from HuggingFace

python convert_hf_to_gguf.py models/llama-2-7b-chat/

Run inference

Simple chat

./llama-cli \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  -p "Explain quantum computing" \
  -n 256        # Max tokens

Interactive chat

./llama-cli \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  --interactive

Server mode

Start OpenAI-compatible server

./llama-server \
  -m models/llama-2-7b-chat.Q4_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 32       # Offload 32 layers to GPU

Client request

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b-chat",
    "messages": [{"role": "user", "content": "Hello!"}],
    "temperature": 0.7,
    "max_tokens": 100
  }'
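The curl call above can be wrapped in a small Python client using only the standard library. The function names here (`build_chat_request`, `chat`) are illustrative, not part of llama.cpp; the endpoint path and payload shape are the OpenAI-compatible ones that llama-server exposes.

```python
import json
import urllib.request

def build_chat_request(prompt, model="llama-2-7b-chat",
                       temperature=0.7, max_tokens=100):
    """Build an OpenAI-style chat-completion payload for llama-server."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }

def chat(prompt, base_url="http://localhost:8080"):
    """POST the payload to the server and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Because the endpoint follows the OpenAI schema, the official `openai` Python SDK also works by pointing its `base_url` at the server.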

Quantization formats

GGUF format overview

| Format | Bits | Size (7B) | Speed   | Quality   | Use case            |
|--------|------|-----------|---------|-----------|---------------------|
| Q4_K_M | 4.5  | 4.1 GB    | Fast    | Good      | Recommended default |
| Q4_K_S | 4.3  | 3.9 GB    | Faster  | Lower     | Speed critical      |
| Q5_K_M | 5.5  | 4.8 GB    | Medium  | Better    | Quality critical    |
| Q6_K   | 6.5  | 5.5 GB    | Slower  | Best      | Maximum quality     |
| Q8_0   | 8.0  | 7.0 GB    | Slow    | Excellent | Minimal degradation |
| Q2_K   | 2.5  | 2.7 GB    | Fastest | Poor      | Testing only        |

Choosing quantization

General use (balanced)

Q4_K_M # 4-bit, medium quality

Maximum speed (more degradation)

Q2_K or Q3_K_M

Maximum quality (slower)

Q6_K or Q8_0

Very large models (70B, 405B)

Q3_K_M or Q4_K_S # Lower bits to fit in memory
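The trade-off above can be automated: pick the highest-quality quant whose file still fits in your RAM budget. A minimal sketch, where the helper name, the hard-coded 7B sizes (taken from the table above), and the 1.5 GB headroom figure are all illustrative assumptions:

```python
# File sizes in GB for a 7B model, from the GGUF table above,
# ordered lowest- to highest-quality.
QUANT_SIZES_7B = {
    "Q2_K": 2.7, "Q4_K_S": 3.9, "Q4_K_M": 4.1,
    "Q5_K_M": 4.8, "Q6_K": 5.5, "Q8_0": 7.0,
}

def pick_quant(ram_gb, headroom_gb=1.5):
    """Pick the highest-quality quant that fits in RAM, leaving
    headroom for the KV cache and the OS."""
    budget = ram_gb - headroom_gb
    fitting = [q for q, size in QUANT_SIZES_7B.items() if size <= budget]
    # The dict is ordered by quality, so the last fit is the best one.
    return fitting[-1] if fitting else None
```

On an 8 GB machine this picks Q6_K; with 16 GB, Q8_0 fits comfortably.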

Hardware acceleration

Apple Silicon (Metal)

Build with Metal

make LLAMA_METAL=1

Run with GPU acceleration (automatic)

./llama-cli -m model.gguf -ngl 999 # Offload all layers

Typical performance: ~40-60 tok/s on an Apple M3 Max (Llama 2 7B, Q4_K_M)

NVIDIA GPUs (CUDA)

Build with CUDA

make LLAMA_CUDA=1

Offload layers to GPU

./llama-cli -m model.gguf -ngl 35 # Offload 35/40 layers

Hybrid CPU+GPU for large models

./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20 # GPU: 20 layers, CPU: rest
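A rough way to size `-ngl` for hybrid runs: divide the GGUF file size by the layer count and see how many layers fit in free VRAM. A sketch with illustrative numbers (the helper is hypothetical; real per-layer sizes vary, and the KV cache and compute buffers also consume VRAM, which the `overhead_gb` guess only approximates):

```python
def layers_that_fit(vram_gb, model_size_gb, n_layers, overhead_gb=1.0):
    """Rough estimate of how many transformer layers fit in VRAM.
    Assumes layer weights dominate file size and are evenly sized."""
    per_layer_gb = model_size_gb / n_layers
    usable = max(vram_gb - overhead_gb, 0.0)
    return min(n_layers, int(usable / per_layer_gb))

# e.g. a ~40 GB 70B Q4_K_M with 80 layers on a 24 GB card
n = layers_that_fit(24, 40, 80)   # about 46 layers -> -ngl 46
```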

AMD GPUs (ROCm)

Build with ROCm

make LLAMA_HIP=1

Run with AMD GPU

./llama-cli -m model.gguf -ngl 999

Common patterns

Batch processing

Process multiple prompts from file

cat prompts.txt | ./llama-cli \
  -m model.gguf \
  --batch-size 512 \
  -n 100

Constrained generation

JSON output with grammar

./llama-cli \
  -m model.gguf \
  -p "Generate a person: " \
  --grammar-file grammars/json.gbnf

Outputs valid JSON only
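A grammar file is plain GBNF. A minimal sketch that constrains output to a one-field JSON object — the `json.gbnf` shipped in the repo's grammars/ directory is far more complete and is the one to use in practice:

```gbnf
root   ::= "{" ws "\"name\"" ws ":" ws string ws "}"
string ::= "\"" [a-zA-Z ]* "\""
ws     ::= [ \t\n]*
```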

Context size

Increase context (default 512)

./llama-cli \
  -m model.gguf \
  -c 4096       # 4K context window

Very long context (if model supports)

./llama-cli -m model.gguf -c 32768 # 32K context
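Larger contexts cost memory: the KV cache stores a K and a V tensor per layer, each with `n_ctx × n_kv_heads × head_dim` elements, in f16 by default. A back-of-the-envelope calculation, assuming Llama 2 7B's published dimensions (32 layers, 32 KV heads, head dimension 128):

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elt=2):
    """KV cache size: 2 tensors (K and V) per layer, each holding
    n_ctx * n_kv_heads * head_dim elements of bytes_per_elt each."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elt

# Llama 2 7B at 4K context with an f16 cache
gb = kv_cache_bytes(32, 4096, 32, 128) / 2**30   # 2.0 GB
```

At 4K context that is about 2 GB on top of the weights, and it scales linearly with `-c`, which is why 32K contexts can need more memory for the cache than for the model itself.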

Performance benchmarks

CPU performance (Llama 2-7B Q4_K_M)

| CPU               | Threads | Speed    | Cost       |
|-------------------|---------|----------|------------|
| Apple M3 Max      | 16      | 50 tok/s | $0 (local) |
| AMD Ryzen 9 7950X | 32      | 35 tok/s | $0.50/hour |
| Intel i9-13900K   | 32      | 30 tok/s | $0.40/hour |
| AWS c7i.16xlarge  | 64      | 40 tok/s | $2.88/hour |

GPU acceleration (Llama 2-7B Q4_K_M)

| GPU                  | Speed     | vs CPU | Cost       |
|----------------------|-----------|--------|------------|
| NVIDIA RTX 4090      | 120 tok/s | 3-4×   | $0 (local) |
| NVIDIA A10           | 80 tok/s  | 2-3×   | $1.00/hour |
| AMD MI250            | 70 tok/s  | 2×     | $2.00/hour |
| Apple M3 Max (Metal) | 50 tok/s  | ~same  | $0 (local) |

Supported models

LLaMA family:

  • Llama 2 (7B, 13B, 70B)

  • Llama 3 (8B, 70B, 405B)

  • Code Llama

Mistral family:

  • Mistral 7B

  • Mixtral 8x7B, 8x22B

Other:

  • Falcon, BLOOM, GPT-J

  • Phi-3, Gemma, Qwen

  • LLaVA (vision), Whisper (audio)

Find models: https://huggingface.co/models?library=gguf

References

  • Quantization Guide - GGUF formats, conversion, quality comparison

  • Server Deployment - API endpoints, Docker, monitoring

  • Optimization - Performance tuning, hybrid CPU+GPU
