ollama-rag

Build RAG systems with Ollama - run locally or use cloud for massive models.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn the skill:

Install skill "ollama-rag" with this command: npx skills add cuba6112/skillfactory/cuba6112-skillfactory-ollama-rag

Ollama RAG Guide

Build RAG systems with Ollama - run locally or use cloud for massive models.

Ollama Cloud Models (Dec 2025)

Access these models via ollama signin (requires Ollama v0.12+). No local model storage is needed, and privacy is preserved.

| Model | Params | Context | Best For |
|-------|--------|---------|----------|
| deepseek-v3.2:cloud | 671B | 160K | GPT-5 level, reasoning |
| deepseek-v3.1:671b-cloud | 671B | 160K | Thinking + non-thinking hybrid |
| qwen3-coder:480b-cloud | 480B | 256K-1M | Agentic coding, repo-scale |
| minimax-m2:cloud | 230B (10B active) | 128K | #1 open-source, tools |
| gpt-oss:120b-cloud | 120B | 128K | OpenAI open weights |
| glm-4.6:cloud | | | Code generation |

```bash
# Sign in to access cloud
ollama signin

# Run cloud models
ollama run deepseek-v3.2:cloud
ollama run qwen3-coder:480b-cloud
ollama run minimax-m2:cloud
```
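
After signing in, cloud model tags work through the same API as local models. A minimal sketch using the ollama Python client (the prompt is illustrative; any cloud tag from the table above should behave the same way):

```python
import ollama

# Cloud models are addressed by tag, just like local ones; inference runs on Ollama's servers.
response = ollama.chat(
    model="deepseek-v3.2:cloud",
    messages=[{"role": "user", "content": "Summarize the trade-offs of RAG vs. long-context prompting."}],
)
print(response["message"]["content"])
```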

Local Models (Dec 2025)

Reasoning Models

| Model | Params | Context | Best For |
|-------|--------|---------|----------|
| nemotron-3-nano | 30B (3.6B active) | 1M tokens | Agents, long docs, code |
| deepseek-r1 | 7B-671B | 128K | Reasoning, math, code |
| qwq | 32B | 32K | Logic, analysis |
| llama4 | 109B/400B | 128K | General, multimodal |

Fast/Efficient Models

| Model | Size | RAM | Speed |
|-------|------|-----|-------|
| llama3.2:3b | 2GB | 8GB | Very fast |
| mistral-small-3.1 | 24B | 16GB | Fast |
| gemma3 | 4B-27B | 8-32GB | Balanced |

Embedding Models

| Model | Dims | Context | MTEB Score |
|-------|------|---------|------------|
| snowflake-arctic-embed2 | 1024 | 8K | 67.5 |
| mxbai-embed-large | 1024 | 512 | 64.68 |
| nomic-embed-text | 768 | 8K | 53.01 |

Recommendation: snowflake-arctic-embed2 for accuracy, nomic-embed-text for speed.
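
To see what this choice means in practice, here is a minimal sketch (passages are illustrative) that embeds a query and a few passages with the ollama client and ranks the passages by cosine similarity; swap the model tag for nomic-embed-text if speed matters more than accuracy:

```python
import ollama
import numpy as np

passages = [
    "Ollama runs large language models locally.",
    "ChromaDB is an embedded vector database.",
    "Bananas are rich in potassium.",
]

def embed(texts):
    # ollama.embed accepts a single string or a list of strings
    return np.array(ollama.embed(model="snowflake-arctic-embed2", input=texts)["embeddings"])

doc_vecs = embed(passages)
query_vec = embed(["Which tool stores vectors?"])[0]

# Cosine similarity between the query and each passage, highest first
scores = doc_vecs @ query_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
for score, passage in sorted(zip(scores, passages), reverse=True):
    print(f"{score:.3f}  {passage}")
```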

Quick Start

Cloud (No Local Resources)

```bash
ollama signin
ollama run deepseek-v3.2:cloud        # GPT-5 level
ollama run qwen3-coder:480b-cloud     # 1M context for huge repos
```

Local

```bash
ollama pull nemotron-3-nano           # 1M context, 24GB VRAM
ollama pull snowflake-arctic-embed2

# Or for lower RAM (8GB)
ollama pull llama3.2:3b
ollama pull nomic-embed-text
```

Stack Options

Option A: LangChain + ChromaDB (Most Common)

```python
from langchain_ollama import OllamaLLM, OllamaEmbeddings
from langchain_chroma import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain.chains import RetrievalQA

# Load and split
loader = PyPDFLoader("document.pdf")
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
docs = splitter.split_documents(loader.load())

# Embed and store
embeddings = OllamaEmbeddings(model="snowflake-arctic-embed2")
vectorstore = Chroma.from_documents(docs, embeddings, persist_directory="./db")

# Query - LOCAL
llm = OllamaLLM(model="nemotron-3-nano")
# Or CLOUD (GPT-5 level, no local resources):
# llm = OllamaLLM(model="deepseek-v3.2:cloud")

retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever)
answer = qa.invoke("What is the main topic?")
```

Option B: LlamaIndex (Better Accuracy)

```python
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings

# Configure
Settings.llm = Ollama(model="nemotron-3-nano", request_timeout=300.0)
Settings.embed_model = OllamaEmbedding(model_name="snowflake-arctic-embed2")

# Load and index
documents = SimpleDirectoryReader("./docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# Query
query_engine = index.as_query_engine()
response = query_engine.query("Summarize the key findings")
```
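
For repeated runs it is usually worth persisting the index so documents are not re-embedded every time. A minimal sketch using LlamaIndex's storage context (the persist directory is an arbitrary choice):

```python
from llama_index.core import StorageContext, load_index_from_storage

# First run: save the built index to disk
index.storage_context.persist(persist_dir="./index_storage")

# Later runs: reload instead of re-embedding (Settings must be configured as above)
storage_context = StorageContext.from_defaults(persist_dir="./index_storage")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()
```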

Option C: Direct Ollama API (Minimal Dependencies)

```python
import ollama
import chromadb

# Embed
def embed(text):
    return ollama.embed(model="nomic-embed-text", input=text)["embeddings"][0]

# Store in ChromaDB
client = chromadb.PersistentClient(path="./db")
collection = client.get_or_create_collection("docs")
collection.add(ids=["1"], documents=["text"], embeddings=[embed("text")])

# Retrieve and generate
results = collection.query(query_embeddings=[embed("query")], n_results=3)
context = "\n".join(results["documents"][0])

response = ollama.chat(
    model="nemotron-3-nano",
    messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: ..."}],
)
```

Vector Database Options

| Database | Install | Best For |
|----------|---------|----------|
| ChromaDB | pip install chromadb | Simple, embedded |
| FAISS | pip install faiss-cpu | Fast similarity |
| Qdrant | pip install qdrant-client | Production scale |
| Weaviate | Docker | Full-featured |
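
Any of these can replace Chroma in the Option A pipeline. As an example, a minimal sketch swapping in FAISS (assumes faiss-cpu is installed and reuses docs and embeddings from Option A):

```python
from langchain_community.vectorstores import FAISS

# Build the index in memory and persist it to disk
vectorstore = FAISS.from_documents(docs, embeddings)
vectorstore.save_local("./faiss_index")

# Reload later without re-embedding
vectorstore = FAISS.load_local(
    "./faiss_index", embeddings, allow_dangerous_deserialization=True
)
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
```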

Nemotron 3 Nano Deep Dive

Why Nemotron for RAG:

  • 1M token context = entire codebases, long documents

  • Hybrid Mamba-Transformer = 4x faster inference

  • MoE (3.6B active params) = runs on 24GB VRAM

  • Apache 2.0 license = commercial use OK

```python
# For very long documents
llm = OllamaLLM(
    model="nemotron-3-nano",
    num_ctx=131072,   # 128K context, increase as needed
    temperature=0.1,  # Lower for factual RAG
)
```

Hardware Requirements

| Model | RAM | GPU VRAM |
|-------|-----|----------|
| 3B models | 8GB | 4GB |
| 7-8B models | 16GB | 8GB |
| 30B models | 32GB | 24GB |
| 70B+ models | 64GB+ | 48GB+ |
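
If you want to pick a model programmatically, a rough sketch mapping detected system RAM to the tiers above (the thresholds and model tags are illustrative assumptions, not part of the table; requires psutil):

```python
import psutil

# Map available system RAM to a model tier from the table above
ram_gb = psutil.virtual_memory().total / 1e9
if ram_gb >= 32:
    model = "nemotron-3-nano"    # 30B-class: ~32GB RAM / 24GB VRAM
elif ram_gb >= 16:
    model = "mistral-small-3.1"  # mid tier
else:
    model = "llama3.2:3b"        # fits in ~8GB RAM
print(f"Detected {ram_gb:.0f}GB RAM -> using {model}")
```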

References

  • Model selection guide

  • Ollama Library

  • Nemotron on Ollama

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

  • torchaudio (General): no summary provided by upstream source. [Repository Source · Needs Review]

  • unsloth-sft (General): no summary provided by upstream source. [Repository Source · Needs Review]

  • pytorch-onnx (General): no summary provided by upstream source. [Repository Source · Needs Review]