Chunking Strategy for RAG Systems

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "chunking-strategy" with this command: npx skills add giuseppe-trisciuoglio/developer-kit/giuseppe-trisciuoglio-developer-kit-chunking-strategy

Overview

Provides chunking strategies for RAG systems, vector databases, and document processing. Recommends chunk sizes, overlap percentages, and boundary detection methods; validates semantic coherence; evaluates retrieval metrics.

When to Use

Use when building or optimizing RAG systems, vector search pipelines, document chunking workflows, or performance-tuning existing systems with poor retrieval quality.

Instructions

Choose Chunking Strategy

Select based on document type and use case:

Fixed-Size Chunking (Level 1)

  • Use for simple documents without clear structure

  • Start with 512 tokens and 10-20% overlap

  • Adjust: 256 for factoid queries, 1024 for analytical
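The sliding-window arithmetic behind fixed-size chunking can be sketched without any library, operating on a pre-tokenized list (word-level tokens stand in for model tokens here, and `fixed_size_chunks` is a hypothetical helper, not a named API):

```python
def fixed_size_chunks(tokens, chunk_size=512, overlap_pct=0.15):
    """Split a token list into fixed-size windows with fractional overlap."""
    # stride forward by the non-overlapping portion of each chunk
    step = max(1, int(chunk_size * (1 - overlap_pct)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # final window already covers the tail
    return chunks
```

With 1,000 tokens, `chunk_size=512`, and 15% overlap, this yields three chunks whose neighbors share 77 tokens.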

Recursive Character Chunking (Level 2)

  • Use for documents with structural boundaries

  • Hierarchical separators: paragraphs → sentences → words

  • Customize for document types (HTML, Markdown, JSON)
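As an illustration of the hierarchical fallback, here is a minimal recursive splitter: it tries coarse separators first and descends to finer ones only when a piece is still too large. This is a sketch using character lengths, not LangChain's implementation:

```python
def recursive_split(text, separators=("\n\n", "\n", " "), chunk_size=200):
    """Split on the coarsest separator that appears, recursing on oversized pieces."""
    if not separators:
        # no separators left: hard-slice as a last resort
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    if len(text) <= chunk_size:
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, rest, chunk_size)
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = (current + sep + piece) if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # merge small pieces back together
        else:
            if current:
                chunks.append(current)
            if len(piece) > chunk_size:
                # piece itself is too big: descend to finer separators
                chunks.extend(recursive_split(piece, rest, chunk_size))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```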

Structure-Aware Chunking (Level 3)

  • Use for structured content (Markdown, code, tables, PDFs)

  • Preserve semantic units: functions, sections, table blocks

  • Validate structure preservation post-split

Semantic Chunking (Level 4)

  • Use for complex documents with thematic shifts

  • Embedding-based boundary detection with 0.8 similarity threshold

  • Buffer size: 3-5 sentences

Advanced Methods (Level 5)

  • Late Chunking for long-context models

  • Contextual Retrieval for high-precision requirements

  • Monitor computational cost vs. retrieval gain

Reference: references/strategies.md.

Implement Chunking Pipeline

Pre-process documents

  • Analyze structure, content types, information density

  • Identify multi-modal content (tables, images, code)

Select parameters

  • Chunk size: embedding model context window / 4

  • Overlap: 10-20% for most cases

  • Strategy-specific settings
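The sizing heuristics above can be bundled into one helper. The `query_style` scale factors follow the earlier 256/512/1024 guidance, and `suggest_parameters` is a hypothetical name for illustration:

```python
def suggest_parameters(model_context_window, query_style="balanced"):
    """Derive starting chunk parameters from the embedding model's context window."""
    base = model_context_window // 4  # heuristic: context window / 4
    # shrink for factoid lookups, grow for analytical queries
    scale = {"factoid": 0.5, "balanced": 1.0, "analytical": 2.0}[query_style]
    chunk_size = int(base * scale)
    overlap = int(chunk_size * 0.15)  # midpoint of the 10-20% range
    return {"chunk_size": chunk_size, "overlap": overlap}
```

For a model with a 2,048-token window this suggests 512-token chunks with 76 tokens of overlap.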

Process and validate

  • Apply chunking strategy

  • Validate coherence: run evaluate_chunks.py --coherence (see below)

  • Test with representative documents

Evaluate and iterate

  • Measure precision and recall

  • If precision < 0.7: reduce chunk_size by 25% and re-evaluate

  • If recall < 0.6: increase overlap by 10% and re-evaluate

  • Monitor latency and memory usage
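The adjustment rules above can be expressed as a small tuning loop. Here `evaluate` is a caller-supplied function returning `(precision, recall)` for the given parameters, and "increase overlap by 10%" is read as adding 10 percentage points; both are assumptions of this sketch:

```python
def tune_chunking(evaluate, chunk_size=512, overlap_pct=0.15, max_rounds=5):
    """Iterate the precision/recall adjustment rules until targets are met."""
    for _ in range(max_rounds):
        precision, recall = evaluate(chunk_size, overlap_pct)
        if precision >= 0.7 and recall >= 0.6:
            break  # both targets met
        if precision < 0.7:
            chunk_size = int(chunk_size * 0.75)  # reduce chunk_size by 25%
        if recall < 0.6:
            overlap_pct = min(overlap_pct + 0.10, 0.5)  # add 10 points, capped
    return chunk_size, overlap_pct
```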

Reference: references/implementation.md.

Validate Chunk Quality

Run these validation snippets to assess chunk quality:

Check semantic coherence (requires sentence-transformers)

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
chunks = [...]  # your chunks
# normalize so the dot product below is cosine similarity
embeddings = model.encode(chunks, normalize_embeddings=True)
similarity = (embeddings @ embeddings.T).mean()
print(f'Cohesion: {similarity:.3f}')  # target: 0.3-0.7

Measure retrieval precision

relevant = sum(1 for c in retrieved if c in relevant_chunks)
precision = relevant / len(retrieved)
print(f'Precision: {precision:.2f}')  # target: >= 0.7
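Recall, the complementary metric used in the iteration step above, can be measured the same way (`retrieved` and `relevant_chunks` are placeholders as in the precision snippet; `recall_at_k` is a hypothetical helper name):

```python
def recall_at_k(retrieved, relevant_chunks):
    """Fraction of the relevant chunks that appear in the retrieved set."""
    if not relevant_chunks:
        return 0.0  # avoid division by zero when no ground truth exists
    hits = sum(1 for c in relevant_chunks if c in retrieved)
    return hits / len(relevant_chunks)  # target: >= 0.6
```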

Check chunk size distribution

import numpy as np

sizes = [len(c.split()) for c in chunks]
print(f'Mean: {np.mean(sizes):.0f}, Std: {np.std(sizes):.0f}')
print(f'Min: {min(sizes)}, Max: {max(sizes)}')

Reference: references/evaluation.md.

Examples

Fixed-Size Chunking

from langchain.text_splitter import CharacterTextSplitter

# CharacterTextSplitter gives true fixed-size splits;
# RecursiveCharacterTextSplitter is the Level 2 strategy
splitter = CharacterTextSplitter(
    chunk_size=256,
    chunk_overlap=25,
    length_function=len,
)
chunks = splitter.split_documents(documents)

Structure-Aware Code Chunking

import ast

def chunk_python_code(code):
    # one chunk per top-level function or class; iterating tree.body
    # (rather than ast.walk) avoids emitting methods twice, once inside
    # their class chunk and once on their own
    tree = ast.parse(code)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(ast.get_source_segment(code, node))
    return chunks

Semantic Chunking

def semantic_chunk(text, similarity_threshold=0.8):
    sentences = split_into_sentences(text)
    embeddings = generate_embeddings(sentences)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        sim = cosine_similarity(embeddings[i - 1], embeddings[i])
        if sim < similarity_threshold:
            chunks.append(" ".join(current))
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks

Best Practices

Core Principles

  • Balance context preservation with retrieval precision

  • Maintain semantic coherence within chunks

  • Optimize for embedding model context window constraints

Implementation

  • Start with fixed-size (512 tokens, 15% overlap)

  • Iterate based on document characteristics

  • Test with domain-specific documents before deployment

Pitfalls to Avoid

  • Over-chunking: context-poor small chunks

  • Under-chunking: missing information in oversized chunks

  • Ignoring semantic boundaries and document structure

  • One-size-fits-all for diverse content types

Constraints and Warnings

Resource Considerations

  • Semantic methods require significant compute resources

  • Late chunking needs long-context embedding models

  • Complex strategies increase processing latency

  • Monitor memory for large document batches

Quality Requirements

  • Validate semantic coherence post-processing

  • Test with representative documents before deployment

  • Ensure chunks maintain standalone meaning

  • Implement error handling for malformed content

References

  • strategies.md - Detailed strategies

  • implementation.md - Implementation guidelines

  • evaluation.md - Performance metrics

  • tools.md - Libraries and frameworks

  • research.md - Research papers

  • advanced-strategies.md - 11 advanced methods

  • semantic-methods.md - Semantic approaches

  • visualization-tools.md - Visualization tools
