# RAG Pipeline Starter

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and any referenced scripts before running them.

Install the "RAG Pipeline Starter" skill with:

npx skills add abhinas90/rag-pipeline-starter

Production-grade RAG pipeline setup with chunking strategies, embedding benchmarks, and retrieval tuning for 50K-500K row datasets.

Overview

This skill provides a complete toolkit for building and optimizing RAG (Retrieval-Augmented Generation) pipelines. It analyzes your data, recommends optimal chunking strategies, benchmarks embedding models, and helps tune retrieval parameters for maximum accuracy.

When to Use

  • Building a new RAG system from scratch
  • Optimizing an existing RAG pipeline's retrieval quality
  • Choosing the right embedding model for your domain
  • Processing large document collections (50K-500K rows)
  • Balancing speed vs. accuracy for your use case

Scripts

chunking_analyzer.py

Analyzes documents and recommends optimal chunking strategies based on content structure.

Usage:

# Assess data and get strategy recommendation
python chunking_analyzer.py --assess ./data

# Apply chunking strategy to documents
python chunking_analyzer.py --strategy recursive --input ./data/doc.txt --output ./chunks/ --chunk-size 500 --overlap 50

Options:

  • --assess <dir> - Analyze documents and recommend strategy
  • --strategy <name> - Chunking strategy: fixed, semantic, recursive, hierarchical
  • --input <path> - Input file or directory
  • --output <dir> - Output directory for chunks
  • --chunk-size <int> - Chunk size (default: 500)
  • --overlap <int> - Overlap between chunks (default: 50)
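The recursive strategy can be sketched as follows. This is an illustrative standalone implementation, not the script's actual code; the function name and splitting heuristics are assumptions:

```python
def recursive_chunk(text, chunk_size=500, overlap=50,
                    separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator first, recurse into pieces that
    are still too large, and carry `overlap` trailing characters from
    each emitted chunk into the next one."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    for sep in separators:
        if sep not in text:
            continue
        chunks, current = [], ""
        for piece in text.split(sep):
            candidate = (current + sep + piece) if current else piece
            if len(candidate) <= chunk_size:
                current = candidate
                continue
            if current:
                chunks.append(current)
                # start the next chunk with an overlapping tail
                current = (current[-overlap:] + sep + piece) if overlap else piece
            else:
                current = piece
            if len(current) > chunk_size:
                chunks.extend(recursive_chunk(current, chunk_size, overlap, separators))
                current = ""
        if current:
            chunks.append(current)
        return chunks
    # no separator applies: hard-split with overlap
    step = max(chunk_size - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = ("First paragraph. " * 10) + "\n\n" + ("Second paragraph. " * 10)
for c in recursive_chunk(text, chunk_size=120, overlap=20):
    print(len(c), repr(c[:40]))
```

Falling back through progressively finer separators is what keeps paragraph and sentence boundaries intact wherever the size budget allows.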

embedding_benchmark.py

Tests multiple embedding models on your data to find the best fit for your domain.

Usage:

python embedding_benchmark.py --data ./chunks/ --domain finance --output results.json

Options:

  • --embeddings <models> - Embedding models to test (space-separated)
  • --data <dir> - Directory with chunked text files (required)
  • --domain <name> - Domain name for context-specific recommendations
  • --output <file> - Output file for results (JSON)

Supported Embeddings:

  • sentence-transformers/all-MiniLM-L6-v2 (384 dims, fast, free)
  • sentence-transformers/all-mpnet-base-v2 (768 dims, medium, free)
  • openai/text-embedding-ada-002 (1536 dims, fast, paid)
  • cohere/embed-english-v3.0 (1024 dims, fast, paid)
  • bm25 (sparse, fast, free)

retrieval_tuner.py

Optimizes retrieval parameters (top-k, similarity threshold) for your specific use case.

Usage:

python retrieval_tuner.py --index ./vector_store/ --queries ./test_queries.json --output tuning_results.json

Options:

  • --index <dir> - Vector store index directory
  • --queries <file> - JSON file with test queries and expected results
  • --output <file> - Output file for tuning results
  • --top-k-range <min> <max> - Range of top-k values to test (default: 1 20)
  • --threshold-range <min> <max> <step> - Similarity threshold range
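The tuner's job is essentially a grid search: for each (top-k, threshold) pair, filter the ranked candidates and score them against the expected results from the queries file. A minimal sketch with toy data (the real script's metric and JSON format may differ):

```python
import itertools

def tune(results, top_k_values, thresholds):
    """Grid-search top-k and similarity threshold for the best mean F1.
    `results`: list of (ranked [(doc_id, score), ...], relevant_id_set)."""
    best_params, best_f1 = None, -1.0
    for k, thr in itertools.product(top_k_values, thresholds):
        f1s = []
        for ranked, relevant in results:
            kept = {doc for doc, score in ranked[:k] if score >= thr}
            tp = len(kept & relevant)
            prec = tp / len(kept) if kept else 0.0
            rec = tp / len(relevant) if relevant else 0.0
            f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
        mean_f1 = sum(f1s) / len(f1s)
        if mean_f1 > best_f1:
            best_params, best_f1 = (k, thr), mean_f1
    return best_params, best_f1

# toy ranked results for two queries with known relevant documents
results = [
    ([("a", 0.9), ("b", 0.6), ("c", 0.2)], {"a"}),
    ([("d", 0.8), ("e", 0.7), ("f", 0.1)], {"d", "e"}),
]
params, score = tune(results, top_k_values=[1, 2, 3], thresholds=[0.0, 0.5, 0.65])
print(params, round(score, 3))  # → (2, 0.65) 1.0
```

Note how the threshold lets a larger top-k stay precise: here k=2 with a 0.65 cutoff keeps exactly the relevant documents for both queries.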

vector_store_manager.py

Manages vector store operations: create, update, search, and maintain indexes.

Usage:

# Create index from chunks
python vector_store_manager.py --create --chunks ./chunks/ --index ./vector_store/ --embedding sentence-transformers/all-MiniLM-L6-v2

# Search index
python vector_store_manager.py --search --index ./vector_store/ --query "your search query" --top-k 5

Options:

  • --create - Create new index from chunks
  • --chunks <dir> - Directory with chunked text files
  • --index <dir> - Vector store directory
  • --embedding <model> - Embedding model to use
  • --search - Search existing index
  • --query <text> - Search query
  • --top-k <int> - Number of results to return (default: 5)
  • --update - Update index with new documents
  • --stats - Show index statistics
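A minimal in-memory sketch of the create/search operations (illustrative only: the real script presumably persists the index under --index, and the character-bigram embedder here is a toy stand-in for the benchmarked model):

```python
import numpy as np

def embed(text, dims=256):
    """Toy character-bigram embedder; a real store would use the
    embedding model chosen in the benchmark step."""
    v = np.zeros(dims)
    for a, b in zip(text, text[1:]):
        v[(ord(a) * 31 + ord(b)) % dims] += 1
    return v

class TinyVectorStore:
    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.texts, self.vectors = [], None

    def create(self, chunks):
        """Embed all chunks and L2-normalise so dot product = cosine."""
        self.texts = list(chunks)
        vecs = np.array([self.embed_fn(t) for t in self.texts])
        self.vectors = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

    def search(self, query, top_k=5):
        q = self.embed_fn(query)
        q = q / np.linalg.norm(q)
        scores = self.vectors @ q
        order = np.argsort(scores)[::-1][:top_k]
        return [(self.texts[i], float(scores[i])) for i in order]

store = TinyVectorStore(embed)
store.create(["chunk overlap tuning", "embedding benchmarks", "vector search basics"])
print(store.search("vector search", top_k=2)[0][0])  # → vector search basics
```

Normalising vectors at index time means search is a single matrix-vector product, which is the core trick most vector stores build on.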

Pricing Strategy

Free tier (this skill): Core chunking + embedding benchmark tools
Paid guide ($49): Complete production RAG setup with:

  • Multi-modal document processing
  • Hybrid search (dense + sparse)
  • Re-ranking pipeline
  • Evaluation framework
  • Deployment scripts

Workflow

  1. Assess your data

    python chunking_analyzer.py --assess ./your_data/
    
  2. Apply chunking strategy

    python chunking_analyzer.py --strategy recursive --input ./data/ --output ./chunks/
    
  3. Benchmark embeddings

    python embedding_benchmark.py --data ./chunks/ --domain your_domain
    
  4. Create vector store

    python vector_store_manager.py --create --chunks ./chunks/ --index ./vector_store/ --embedding <recommended_model>
    
  5. Tune retrieval (optional)

    python retrieval_tuner.py --index ./vector_store/ --queries ./test_queries.json
    

Requirements

  • Python 3.8+
  • Dependencies: numpy, sentence-transformers (optional for real embeddings)

Files

  • chunking_analyzer.py - Document analysis and chunking
  • embedding_benchmark.py - Embedding model benchmarking
  • retrieval_tuner.py - Retrieval parameter optimization
  • vector_store_manager.py - Vector store operations
  • skill.json - Skill metadata

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

  • Ephemeral Media Hosting - Temporary media hosting system with automatic deletion
  • Ethereum Read Only - Read on-chain state without a wallet using Foundry cast
  • OpenClaw Memory - Manage, optimize, and troubleshoot the OpenClaw memory system: MEMORY.md curation, daily logs (memory/YYYY-MM-DD.md), memory_search tuning, compaction survi...
  • ImageRouter - Generate AI images with any model using the ImageRouter API (requires API key)