Building GitHub Index
Create markdown indexes of GitHub repositories optimized for Claude project knowledge. Each index pairs a semantic description (used for relevance matching) with a file path that can be retrieved via the GitHub API.
Quick Start
# Documentation repos (markdown/notebooks)
python scripts/github_index.py owner/repo -o index.md
# Code repos (extract symbols via tree-sitter)
python scripts/github_index.py owner/repo --code-symbols -o index.md
# Multiple repos combined
python scripts/github_index.py owner/repo1 owner/repo2 -o combined.md
Script Options
| Flag | Description |
|---|---|
| `-o`, `--output` | Output file (default: github_index.md) |
| `--token` | GitHub PAT; also reads GITHUB_TOKEN env |
| `--include-patterns` | Only index matching globs: "docs/**" "src/**" |
| `--exclude-patterns` | Skip matching globs: "test/**" |
| `--max-files` | Cap files per repo (default: 200) |
| `--skip-fetch` | Tree only, no content fetch (fast, filename-only descriptions) |
| `--code-symbols` | Include code files, extract function/class names via tree-sitter |
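To make the interaction of `--include-patterns`, `--exclude-patterns`, and `--max-files` concrete, here is a minimal sketch of one plausible filtering order (include, then exclude, then cap). The function name `select_paths` and the use of `fnmatch` semantics are assumptions for illustration; the actual script may match globs differently.

```python
from fnmatch import fnmatchcase

def select_paths(paths, include=None, exclude=None, max_files=200):
    """Filter repo paths by glob patterns, then cap the result (hypothetical helper)."""
    selected = []
    for path in paths:
        # Keep only paths matching at least one include pattern, if any were given.
        if include and not any(fnmatchcase(path, pat) for pat in include):
            continue
        # Drop paths matching any exclude pattern.
        if exclude and any(fnmatchcase(path, pat) for pat in exclude):
            continue
        selected.append(path)
    # Apply the per-repo cap last, mirroring the default of 200.
    return selected[:max_files]

paths = ["docs/intro.md", "docs/api/auth.md", "src/main.py", "test/test_main.py"]
print(select_paths(paths, include=["docs/**", "src/**"], exclude=["test/**"]))
```

Note that in `fnmatch` a `*` also matches `/`, so `docs/**` covers nested directories here; a stricter glob implementation could behave differently.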
Description Extraction Priority
- YAML frontmatter - `title:` and `description:` fields
- Markdown headings - First h1/h2 as title, subsequent as topics
- Notebook cells - First markdown cell heading
- Code symbols - Public function/class names (with `--code-symbols`)
- Path-derived - Convert filename to words (fallback)
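The fallback order above can be sketched as a chain of attempts that returns on the first hit. This is an illustrative approximation, not the script's implementation: the function name `describe`, the regexes, and the output formatting are assumptions, and the notebook and code-symbol steps are omitted to keep it short.

```python
import re
from pathlib import PurePosixPath

def describe(path, text):
    """Return a description using a simplified version of the priority order (hypothetical)."""
    # 1. YAML frontmatter: title:/description: between leading '---' lines.
    fm = re.match(r"^---\s*\n(.*?)\n---\s*(?:\n|$)", text, re.DOTALL)
    if fm:
        fields = dict(re.findall(r"^(title|description):[ \t]*(.+)$", fm.group(1), re.MULTILINE))
        if fields:
            return " - ".join(fields[k].strip() for k in ("title", "description") if k in fields)
    # 2. Markdown headings: first h1/h2 is the title, later ones become topics.
    headings = re.findall(r"^#{1,2}[ \t]+(.+)$", text, re.MULTILINE)
    if headings:
        title = headings[0].strip()
        topics = [h.strip() for h in headings[1:]]
        return f"{title}: {', '.join(topics)}" if topics else title
    # 3. Path-derived fallback: turn the filename into words.
    stem = PurePosixPath(path).stem
    return re.sub(r"[_\-]+", " ", stem).strip().capitalize()

print(describe("docs/setup.md", "# Guide\n\ntext\n## Install\n## Usage\n"))
# Guide: Install, Usage
```

A file with neither frontmatter nor headings falls through to the path step, which is why stub files end up with filename-only descriptions.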
When Descriptions Fail
Some repos have stub files (links to external docs, empty READMEs) that yield no useful description. In these cases, manual curation is recommended: use the tree output and domain knowledge.
# Get tree structure only (fast)
python scripts/github_index.py owner/repo --skip-fetch -o skeleton.md
# Then manually enhance descriptions based on domain knowledge
For code-heavy repos with embedded apps:
- Directory names encode purpose: `acc_wav_gen` → "ACC waveform generation"
- Peripheral acronyms map to functions: AFEC=ADC, MCAN=CAN, TWIHS=I2C
- Operation modes: blocking, interrupt, dma, polled
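These heuristics can be partly mechanized as a first pass before manual review. The sketch below is an assumption-laden illustration: the helper `describe_dir`, the `EXPANSIONS` and `UPPERCASE` tables, and the output style are hypothetical; only the peripheral mappings, the mode names, and the `acc_wav_gen` example come from the text above.

```python
# Peripheral acronyms and operation modes as listed above.
PERIPHERALS = {"afec": "ADC", "mcan": "CAN", "twihs": "I2C"}
MODES = {"blocking", "interrupt", "dma", "polled"}
# Hypothetical lookup tables; extend these from your own domain knowledge.
EXPANSIONS = {"wav": "waveform", "gen": "generation"}
UPPERCASE = {"acc"}

def describe_dir(name):
    """Turn a snake_case directory name into a rough description (hypothetical helper)."""
    words = []
    for token in name.lower().split("_"):
        if token in PERIPHERALS:
            words.append(f"{token.upper()} ({PERIPHERALS[token]})")
        elif token in UPPERCASE:
            words.append(token.upper())
        elif token in MODES:
            words.append(f"({token} mode)")
        else:
            words.append(EXPANSIONS.get(token, token))
    return " ".join(words)

print(describe_dir("acc_wav_gen"))  # ACC waveform generation
```

Unknown tokens pass through unchanged, so the output still needs a human pass for anything the tables do not cover.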
Output Format
# {Repo} - Content Index
**Repository:** {url}
**Branch:** `{branch}`
## Retrieval Method
{API curl commands}
---
## {Category}
| Description | Path |
|-------------|------|
| {What this covers} | `{path/file.md}` |
Description column leads (relevance matching), path follows (retrieval key).
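For manually curated entries, a category table in this layout could be generated with a few lines of Python. This is a sketch only; `render_category` is a hypothetical helper, and escaping `|` inside descriptions is an added precaution rather than documented script behavior.

```python
def render_category(category, entries):
    """Render one category as a markdown table, description first (hypothetical helper)."""
    lines = [f"## {category}", "", "| Description | Path |", "|-------------|------|"]
    for description, path in entries:
        # Escape pipes so a description cannot break the table row.
        safe = description.replace("|", "\\|")
        lines.append(f"| {safe} | `{path}` |")
    return "\n".join(lines)

print(render_category("Docs", [("Install guide", "docs/install.md")]))
```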
API Access
Enumerate files:
curl -sL "https://api.github.com/repos/OWNER/REPO/git/trees/BRANCH?recursive=1"
Fetch content:
curl -s "https://api.github.com/repos/OWNER/REPO/contents/PATH?ref=BRANCH" \
-H "Accept: application/vnd.github+json" | \
python3 -c "import sys,json,base64; print(base64.b64decode(json.load(sys.stdin)['content']).decode())"
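The same fetch can be done from Python with only the standard library. The sketch below mirrors the curl command above; the function names `decode_contents` and `fetch_file` are illustrative, and the `Authorization: Bearer` header is only needed when a token is supplied.

```python
import base64
import json
import urllib.parse
import urllib.request

def decode_contents(payload):
    """Decode the base64 'content' field of a contents API response."""
    if payload.get("encoding") != "base64":
        raise ValueError(f"unexpected encoding: {payload.get('encoding')!r}")
    # The API wraps the base64 text with newlines; b64decode ignores them by default.
    return base64.b64decode(payload["content"]).decode("utf-8")

def fetch_file(owner, repo, path, branch, token=None):
    """Fetch one file via the contents endpoint (performs a network request)."""
    quoted = urllib.parse.quote(path)
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{quoted}?ref={branch}"
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return decode_contents(json.load(response))
```

Keeping the decoding step separate from the request makes it easy to check against a saved response without touching the network.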
Network
Both scripts (github_index.py and pk_index.py) download a repo tarball (single HTTP request, no per-file rate limits), then process files locally. Allowlist: api.github.com (the tarball is requested through this endpoint, which redirects to the download).
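As an illustration of the local processing step, the sketch below walks an already-downloaded tarball in memory and yields text files of interest. It is not the scripts' actual code: `iter_tarball_files` and its `suffixes` default are assumptions. It relies on the fact that GitHub tarballs wrap the repo in a single top-level directory, which is stripped to recover repo-relative paths.

```python
import io
import tarfile

def iter_tarball_files(tar_bytes, suffixes=(".md", ".ipynb")):
    """Yield (repo_relative_path, text) for matching files in a gzipped tarball."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Strip the single top-level directory that wraps the repo contents.
            _, _, rel_path = member.name.partition("/")
            if not rel_path.endswith(suffixes):
                continue
            data = tar.extractfile(member).read()
            yield rel_path, data.decode("utf-8", errors="replace")
```

Because everything after the download is local, the number of files indexed does not affect the number of API requests.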
Related Skills
- accessing-github-repos - Private repos, PAT setup, tarball download
- mapping-codebases - Detailed code structure (methods, imports, line numbers)
Condensed Format (pk_index.py)
For token-constrained project knowledge, use the condensed script:
python scripts/pk_index.py owner/repo -o repo_pk.md
Produces ~80% smaller output:
- Single line per file: `path — description`
- Symbols only (no signatures)
- 15 files max per category
- No retrieval instructions section
Ideal when adding multiple repo indexes to project knowledge.
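A rough sketch of the condensed layout, assuming entries arrive as `(category, path, description)` tuples: one line per file and at most 15 files per category. The function `condensed_lines` and the `## {category}` heading style are assumptions; pk_index.py may format its output differently.

```python
def condensed_lines(entries, per_category=15):
    """Render entries as one 'path — description' line each, capped per category."""
    by_category = {}
    for category, path, description in entries:
        by_category.setdefault(category, []).append(f"{path} — {description}")
    out = []
    for category, lines in by_category.items():
        out.append(f"## {category}")
        # Keep only the first N files in each category.
        out.extend(lines[:per_category])
    return "\n".join(out)

entries = [("Docs", f"docs/page{i}.md", f"Page {i}") for i in range(20)]
print(condensed_lines(entries).count("\n") + 1)  # 16 lines: 1 heading + 15 files
```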