github-research

Explore and analyze GitHub repositories related to a research topic. Reads deep-research output, discovers repos from multiple sources, deeply analyzes code, and produces integration blueprints.

Install skill "github-research" with this command: npx skills add lingzhi227/agent-research-skills/lingzhi227-agent-research-skills-github-research

GitHub Research Skill

Trigger

Activate this skill when the user makes a request like one of the following, or invokes the slash command:

  • "Find repos for [topic]", "GitHub research on [topic]"
  • "Analyze open-source code for [topic]"
  • "Find implementations of [paper/technique]"
  • "Which repos implement [algorithm]?"
  • /github-research <deep-research-output-dir>

Overview

This skill systematically discovers, evaluates, and deeply analyzes GitHub repositories related to a research topic. It reads deep-research output (paper database, phase reports, code references) and produces an actionable integration blueprint for reusing open-source code.

  • Installation: ~/.claude/skills/github-research/ (scripts, references, and this skill definition)
  • Output: ./github-research-output/{slug}/ relative to the current working directory
  • Input: a deep-research output directory (containing paper_db.jsonl, phase reports, code_repos.md, etc.)

6-Phase Pipeline

Phase 1: Intake     → Extract refs, URLs, keywords from deep-research output
Phase 2: Discovery  → Multi-source broad GitHub search (50-200 repos)
Phase 3: Filtering  → Score & rank → select top 15-30 repos
Phase 4: Deep Dive  → Clone & deeply analyze top 8-15 repos (code reading)
Phase 5: Analysis   → Per-repo reports + cross-repo comparison
Phase 6: Blueprint  → Integration/reuse plan for research topic

Output Directory Structure

github-research-output/{slug}/
├── repo_db.jsonl                     # Master repo database
├── phase1_intake/
│   ├── extracted_refs.jsonl          # URLs, keywords, paper-repo links
│   └── intake_summary.md
├── phase2_discovery/
│   ├── search_results/               # Raw JSONL from each search
│   └── discovery_log.md
├── phase3_filtering/
│   ├── ranked_repos.jsonl            # Scored & ranked subset
│   └── filtering_report.md
├── phase4_deep_dive/
│   ├── repos/                        # Cloned repos (shallow)
│   ├── analyses/                     # Per-repo analysis .md files
│   └── deep_dive_summary.md
├── phase5_analysis/
│   ├── comparison_matrix.md          # Cross-repo comparison
│   ├── technique_map.md              # Paper concept → code mapping
│   └── analysis_report.md
└── phase6_blueprint/
    ├── integration_plan.md           # How to combine repos
    ├── reuse_catalog.md              # Reusable components catalog
    ├── final_report.md               # Complete compiled report
    └── blueprint_summary.md

Scripts Reference

All scripts are Python 3, stdlib-only, located in ~/.claude/skills/github-research/scripts/.

| Script | Purpose | Key Flags |
| --- | --- | --- |
| extract_research_refs.py | Parse deep-research output for GitHub URLs, paper refs, keywords | --research-dir, --output |
| search_github.py | Search GitHub repos via gh api | --query, --language, --min-stars, --sort, --max-results, --topic, --output |
| search_github_code.py | Search GitHub code for implementations | --query, --language, --filename, --max-results, --output |
| search_paperswithcode.py | Search Papers With Code for paper→repo mappings | --paper-title, --arxiv-id, --query, --output |
| repo_db.py | JSONL repo database management | subcommands: merge, filter, score, search, tag, stats, export, rank |
| repo_metadata.py | Fetch detailed metadata via gh api | --repos, --input, --output, --delay |
| clone_repo.py | Shallow-clone repos for analysis | --repo, --output-dir, --depth, --branch |
| analyze_repo_structure.py | Map file tree, key files, LOC stats | --repo-dir, --output |
| extract_dependencies.py | Extract and parse dependency files | --repo-dir, --output |
| find_implementations.py | Search cloned repo for specific code patterns | --repo-dir, --patterns, --output |
| repo_readme_fetch.py | Fetch README without cloning | --repos, --input, --output, --max-chars |
| compare_repos.py | Generate comparison matrix across repos | --input, --output |
| compile_github_report.py | Assemble final report from all phases | --topic-dir |

Phase 1: Intake

Goal: Extract all relevant references, URLs, and keywords from the deep-research output.

Steps

  1. Create output directory structure:

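    # Lowercase the topic, turn spaces into hyphens, drop every other character.
    SLUG=$(echo "$TOPIC" | tr '[:upper:]' '[:lower:]' | tr ' ' '-' | tr -cd 'a-z0-9-')
    # Brace expansion below needs bash or zsh; the quotes guard against an empty or odd slug.
    mkdir -p "github-research-output/$SLUG"/{phase1_intake,phase2_discovery/search_results,phase3_filtering,phase4_deep_dive/{repos,analyses},phase5_analysis,phase6_blueprint}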
    
  2. Extract references from deep-research output:

    python ~/.claude/skills/github-research/scripts/extract_research_refs.py \
      --research-dir <deep-research-output-dir> \
      --output github-research-output/$SLUG/phase1_intake/extracted_refs.jsonl
    
  3. Review extracted refs: Read the generated JSONL. Note:

    • GitHub URLs found directly in reports
    • Paper titles and arxiv IDs (for Papers With Code lookup)
    • Research keywords and themes (for GitHub search queries)
  4. Write intake summary: Create phase1_intake/intake_summary.md with:

    • Number of direct GitHub URLs found
    • Number of papers with potential code links
    • Key research themes extracted
    • Planned search queries for Phase 2
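
To make the review in step 3 concrete, here is a minimal sketch of how GitHub URLs and arXiv IDs can be pulled out of report text with regular expressions. It is not the code of extract_research_refs.py, whose internals are not shown here; the function name `extract_refs` and both patterns are assumptions made for illustration.

```python
import re

# Hypothetical patterns; the real script may match differently.
GITHUB_RE = re.compile(r"https?://github\.com/([A-Za-z0-9_.-]+)/([A-Za-z0-9_.-]+)")
ARXIV_RE = re.compile(r"\b(\d{4}\.\d{4,5})(?:v\d+)?\b")


def extract_refs(text: str) -> dict:
    """Return de-duplicated repo ids (owner/name) and arXiv ids found in text."""
    repos = []
    for owner, name in GITHUB_RE.findall(text):
        name = name.rstrip(".")            # drop a sentence-ending period
        if name.endswith(".git"):
            name = name[:-4]               # normalize clone URLs
        repo_id = f"{owner}/{name}"
        if repo_id not in repos:           # keep first-seen order
            repos.append(repo_id)
    arxiv_ids = sorted(set(ARXIV_RE.findall(text)))
    return {"repos": repos, "arxiv_ids": arxiv_ids}
```

The repo ids feed the direct-URL lookup in Phase 2, and the arXiv ids feed the Papers With Code search.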

Checkpoint

  • extracted_refs.jsonl exists with entries
  • intake_summary.md written
  • Search strategy documented

Phase 2: Discovery

Goal: Cast a wide net to find 50-200 candidate repos from multiple sources.

Steps

  1. Fetch metadata for direct URLs: pass every GitHub repo found in Phase 1 to repo_metadata.py:

    python ~/.claude/skills/github-research/scripts/repo_metadata.py \
      --repos owner1/name1 owner2/name2 ... \
      --output github-research-output/$SLUG/phase2_discovery/search_results/direct_urls.jsonl
    
  2. Search Papers With Code: For each paper with an arxiv ID:

    python ~/.claude/skills/github-research/scripts/search_paperswithcode.py \
      --arxiv-id 2401.12345 \
      --output github-research-output/$SLUG/phase2_discovery/search_results/pwc_2401.12345.jsonl
    
  3. Search GitHub by keywords (3-8 queries based on research themes):

    python ~/.claude/skills/github-research/scripts/search_github.py \
      --query "multi-agent LLM coordination" \
      --min-stars 10 --sort stars --max-results 50 \
      --output github-research-output/$SLUG/phase2_discovery/search_results/gh_query1.jsonl
    
  4. Search GitHub code (for specific implementations):

    python ~/.claude/skills/github-research/scripts/search_github_code.py \
      --query "class MultiAgentOrchestrator" \
      --language python --max-results 30 \
      --output github-research-output/$SLUG/phase2_discovery/search_results/code_query1.jsonl
    
  5. Fetch READMEs for repos that lack descriptions:

    python ~/.claude/skills/github-research/scripts/repo_readme_fetch.py \
      --input <repos.jsonl> \
      --output github-research-output/$SLUG/phase2_discovery/search_results/readmes.jsonl
    
  6. Merge all results into master database:

    python ~/.claude/skills/github-research/scripts/repo_db.py merge \
      --inputs github-research-output/$SLUG/phase2_discovery/search_results/*.jsonl \
      --output github-research-output/$SLUG/repo_db.jsonl
    
  7. Write discovery log: Create phase2_discovery/discovery_log.md with search queries used, results per source, total unique repos found.
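
The merge in step 6 has to collapse the same repo found by several searches into one record. The sketch below shows one way to do that, keyed on owner/name. It is an illustration under assumptions, not the logic of repo_db.py: the `repo_id` and `sources` field names and the "first non-empty value wins" rule are hypothetical.

```python
import json


def merge_jsonl(lines_by_source: dict) -> list:
    """Merge JSONL records from several searches into one record per repo."""
    merged = {}
    for source, lines in lines_by_source.items():
        for line in lines:
            line = line.strip()
            if not line:
                continue                    # tolerate blank lines
            record = json.loads(line)
            # GitHub owner/repo names are case-insensitive, so compare in lowercase.
            repo_id = record["repo_id"].lower()
            entry = merged.setdefault(repo_id, {"repo_id": repo_id, "sources": []})
            # Later records only fill in fields that earlier ones lacked.
            for key, value in record.items():
                if key != "repo_id" and entry.get(key) in (None, ""):
                    entry[key] = value
            if source not in entry["sources"]:
                entry["sources"].append(source)
    return list(merged.values())
```

Tracking which sources returned a repo also gives the discovery log its per-source counts.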

Rate Limits

  • GitHub search API: 30 requests/minute (authenticated); the code search endpoint is stricter at 10 requests/minute
  • Papers With Code API: No strict limit but be respectful (1 req/sec)
  • Add --delay 1.0 to batch operations when needed
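
For batch loops that call the search API directly, a small throttle keeps requests under the limit. This is a minimal sketch, assuming a fixed minimum gap between calls is acceptable; the `Throttle` class is hypothetical and is not part of the skill's scripts, which expose `--delay` instead.

```python
import time


class Throttle:
    """Sleep as needed so calls are at least 60 / per_minute seconds apart."""

    def __init__(self, per_minute: int, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / per_minute
        self._clock = clock      # injectable for testing
        self._sleep = sleep
        self._last = None        # time of the previous permitted call

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
                now += remaining
        self._last = now
        return slept
```

Calling `Throttle(30).wait()` before each request spaces calls 2 seconds apart, which stays within 30 requests/minute.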

Checkpoint

  • repo_db.jsonl populated with 50-200 repos
  • discovery_log.md with search details

Phase 3: Filtering

Goal: Score and rank repos, select top 15-30 for deeper analysis.

Steps

  1. Enrich metadata for all repos:

    python ~/.claude/skills/github-research/scripts/repo_metadata.py \
      --input github-research-output/$SLUG/repo_db.jsonl \
      --output github-research-output/$SLUG/repo_db.jsonl \
      --delay 0.5
    
  2. Score repos (quality + activity scores):

    python ~/.claude/skills/github-research/scripts/repo_db.py score \
      --input github-research-output/$SLUG/repo_db.jsonl \
      --output github-research-output/$SLUG/repo_db.jsonl
    
  3. LLM relevance scoring: Read through the top ~50 repos (by quality_score) and assign relevance_score (0.0-1.0) based on:

    • Direct relevance to research topic
    • Implementation completeness
    • Code quality signals (from README, description)

    Record each relevance score with the tag subcommand:

    python ~/.claude/skills/github-research/scripts/repo_db.py tag \
      --input github-research-output/$SLUG/repo_db.jsonl \
      --ids owner/name --tags "relevance:0.85"
    
  4. Compute composite scores and rank:

    python ~/.claude/skills/github-research/scripts/repo_db.py score \
      --input github-research-output/$SLUG/repo_db.jsonl \
      --output github-research-output/$SLUG/repo_db.jsonl
    python ~/.claude/skills/github-research/scripts/repo_db.py rank \
      --input github-research-output/$SLUG/repo_db.jsonl \
      --output github-research-output/$SLUG/phase3_filtering/ranked_repos.jsonl \
      --by composite_score
    
  5. Select top repos: Filter to top 15-30:

    python ~/.claude/skills/github-research/scripts/repo_db.py filter \
      --input github-research-output/$SLUG/phase3_filtering/ranked_repos.jsonl \
      --output github-research-output/$SLUG/phase3_filtering/ranked_repos.jsonl \
      --max-repos 30 --not-archived
    
  6. Write filtering report: Create phase3_filtering/filtering_report.md:

    • Stats before/after filtering
    • Score distributions
    • Selected repos (up to 30) with scores and rationale

Scoring Formula

activity_score = sigmoid((days_since_push < 90) * 0.4 + has_recent_commits * 0.3 + open_issues_ratio * 0.3)
quality_score  = normalize(log(stars+1) * 0.3 + log(forks+1) * 0.2 + has_license * 0.15 + has_readme * 0.15 + not_archived * 0.2)
composite_score = relevance * 0.4 + quality * 0.35 + activity * 0.25
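
The formula above can be written out in Python as follows. This is a sketch, not the code of repo_db.py. The source does not define `normalize`, so min-max scaling across the candidate set is an assumption here, and all function names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def activity_score(days_since_push: int, has_recent_commits: bool,
                   open_issues_ratio: float) -> float:
    # Booleans count as 0 or 1, exactly as in the formula.
    raw = ((days_since_push < 90) * 0.4
           + has_recent_commits * 0.3
           + open_issues_ratio * 0.3)
    return sigmoid(raw)


def raw_quality(stars: int, forks: int, has_license: bool,
                has_readme: bool, archived: bool) -> float:
    return (math.log(stars + 1) * 0.3 + math.log(forks + 1) * 0.2
            + has_license * 0.15 + has_readme * 0.15 + (not archived) * 0.2)


def normalize(values: list) -> list:
    """Min-max scale to [0, 1]. Assumption: the source leaves normalize() undefined."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def composite_score(relevance: float, quality: float, activity: float) -> float:
    return relevance * 0.4 + quality * 0.35 + activity * 0.25
```

One property worth knowing when reading the scores: if open_issues_ratio lies in [0, 1], the raw activity sum also lies in [0, 1], so the sigmoid maps it to roughly 0.50 to 0.73. Activity therefore shifts the composite score by only a few hundredths between the least and most active repos.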

Checkpoint

  • ranked_repos.jsonl with 15-30 repos
  • filtering_report.md with scoring details

Phase 4: Deep Dive

Goal: Clone and deeply analyze the top 8-15 repos.

Steps

  1. Select repos for deep dive: Take top 8-15 from ranked list.

  2. Clone each repo (shallow):

    python ~/.claude/skills/github-research/scripts/clone_repo.py \
      --repo owner/name \
      --output-dir github-research-output/$SLUG/phase4_deep_dive/repos/
    
  3. Analyze structure for each cloned repo:

    python ~/.claude/skills/github-research/scripts/analyze_repo_structure.py \
      --repo-dir github-research-output/$SLUG/phase4_deep_dive/repos/name/ \
      --output github-research-output/$SLUG/phase4_deep_dive/analyses/name_structure.json
    
  4. Extract dependencies:

    python ~/.claude/skills/github-research/scripts/extract_dependencies.py \
      --repo-dir github-research-output/$SLUG/phase4_deep_dive/repos/name/ \
      --output github-research-output/$SLUG/phase4_deep_dive/analyses/name_deps.json
    
  5. Find implementations: Search for key algorithms/concepts from research:

    python ~/.claude/skills/github-research/scripts/find_implementations.py \
      --repo-dir github-research-output/$SLUG/phase4_deep_dive/repos/name/ \
      --patterns "class Transformer" "def forward" "attention" \
      --output github-research-output/$SLUG/phase4_deep_dive/analyses/name_impls.jsonl
    
  6. Deep code reading: For each repo, READ the key source files identified by structure analysis. Write a per-repo analysis in phase4_deep_dive/analyses/{name}_analysis.md:

    • Architecture overview
    • Key algorithms implemented
    • Code quality assessment
    • API / interface design
    • Dependencies and requirements
    • Strengths and limitations
    • Reusability assessment (how easy to extract components)
  7. Write deep dive summary: phase4_deep_dive/deep_dive_summary.md
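
The pattern search in step 5 amounts to scanning source files for the strings of interest and recording where they occur. The sketch below illustrates the idea with literal substring matching. It is not find_implementations.py itself: the function name, the output fields, and the default of scanning only `.py` files are assumptions.

```python
from pathlib import Path


def find_patterns(repo_dir: str, patterns: list, suffixes=(".py",)) -> list:
    """Return one hit per (file, line, pattern) for literal substring matches."""
    hits = []
    root = Path(repo_dir)
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        if ".git" in path.relative_to(root).parts:
            continue                        # skip git metadata
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for pattern in patterns:
                if pattern in line:
                    hits.append({
                        "file": str(path.relative_to(root)),
                        "line": lineno,
                        "pattern": pattern,
                        "text": line.strip(),
                    })
    return hits
```

The hits tell you which files to open first for the deep code reading in step 6; they do not replace reading those files.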

IMPORTANT: Actually Read Code

Do NOT just summarize READMEs. You must:

  • Read the main source files (entry points, core modules)
  • Understand the actual implementation approach
  • Identify specific functions/classes that implement research concepts
  • Note code patterns, design decisions, and trade-offs

Checkpoint

  • Repos cloned in repos/
  • Per-repo analysis files in analyses/
  • deep_dive_summary.md written

Phase 5: Analysis

Goal: Cross-repo comparison and technique-to-code mapping.

Steps

  1. Generate comparison matrix:

    python ~/.claude/skills/github-research/scripts/compare_repos.py \
      --input github-research-output/$SLUG/phase4_deep_dive/analyses/ \
      --output github-research-output/$SLUG/phase5_analysis/comparison.json
    
  2. Write comparison matrix: Create phase5_analysis/comparison_matrix.md:

    • Table comparing repos across dimensions (language, LOC, stars, framework, license, tests)
    • Dependency overlap analysis
    • Strengths/weaknesses per repo
  3. Write technique map: Create phase5_analysis/technique_map.md:

    • Map each paper concept / research technique → specific repo + file + function
    • Identify gaps (techniques with no implementation found)
    • Note alternative implementations of the same concept
  4. Write analysis report: phase5_analysis/analysis_report.md:

    • Executive summary of findings
    • Key insights from code analysis
    • Recommendations for which repos to use for which purposes
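
The technique map in step 3 is essentially a grouping of findings by concept, plus a list of concepts with no finding. A minimal sketch of that data structure follows; the function names and the `technique`, `repo`, `file`, and `symbol` fields are hypothetical, chosen to mirror the "repo + file + function" wording above.

```python
def build_technique_map(techniques: list, findings: list):
    """Group findings by technique; return (mapping, gaps)."""
    mapping = {name: [] for name in techniques}
    for item in findings:
        if item["technique"] in mapping:
            mapping[item["technique"]].append(
                (item["repo"], item["file"], item["symbol"]))
    # A gap is a technique for which no implementation was found.
    gaps = [name for name, impls in mapping.items() if not impls]
    return mapping, gaps


def to_markdown(mapping: dict) -> str:
    """Render the mapping as a table, one row per implementation or gap."""
    lines = ["| Technique | Repo | File | Symbol |",
             "| --- | --- | --- | --- |"]
    for name, impls in mapping.items():
        if not impls:
            lines.append(f"| {name} | (gap: no implementation found) | | |")
        for repo, file, symbol in impls:
            lines.append(f"| {name} | {repo} | {file} | {symbol} |")
    return "\n".join(lines)
```

Several rows for the same technique are the "alternative implementations" the map should call out.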

Checkpoint

  • comparison_matrix.md with repo comparison table
  • technique_map.md mapping concepts to code
  • analysis_report.md with findings

Phase 6: Blueprint

Goal: Produce an actionable integration and reuse plan.

Steps

  1. Write integration plan: phase6_blueprint/integration_plan.md:

    • Recommended architecture for combining repos
    • Step-by-step integration approach
    • Dependency resolution strategy
    • Potential conflicts and how to resolve them
  2. Write reuse catalog: phase6_blueprint/reuse_catalog.md:

    • For each reusable component: source repo, file path, function/class, what it does, how to extract it
    • License compatibility matrix
    • Effort estimates (easy/medium/hard to integrate)
  3. Compile final report:

    python ~/.claude/skills/github-research/scripts/compile_github_report.py \
      --topic-dir github-research-output/$SLUG/
    
  4. Write blueprint summary: phase6_blueprint/blueprint_summary.md:

    • One-page executive summary
    • Top 5 repos and why
    • Recommended next steps

Checkpoint

  • integration_plan.md complete
  • reuse_catalog.md with component catalog
  • final_report.md compiled
  • blueprint_summary.md as executive summary

Quality Conventions

  1. Repos are ranked by composite score: relevance × 0.4 + quality × 0.35 + activity × 0.25
  2. Deep dive requires reading actual code, not just READMEs
  3. Integration blueprint must map paper concepts → specific code files/functions
  4. Incremental saves: Each phase writes to disk immediately
  5. Checkpoint recovery: Can resume from any phase by checking what outputs exist
  6. All scripts are stdlib-only Python — no pip installs needed
  7. gh CLI is required for GitHub API access (must be authenticated)
  8. Deduplication by repo_id (owner/name) across all searches
  9. Rate limit awareness: Respect GitHub search API limits (30 req/min)
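
Checkpoint recovery (convention 5) can be sketched as a scan for the first phase whose checkpoint files are missing. The file lists below are taken from the checkpoints and directory layout in this document; the `next_phase` helper itself is hypothetical and not one of the skill's scripts.

```python
from pathlib import Path

# Files each phase is expected to leave behind, relative to the topic directory.
PHASE_OUTPUTS = [
    ("phase1_intake", ["phase1_intake/extracted_refs.jsonl",
                       "phase1_intake/intake_summary.md"]),
    ("phase2_discovery", ["repo_db.jsonl",
                          "phase2_discovery/discovery_log.md"]),
    ("phase3_filtering", ["phase3_filtering/ranked_repos.jsonl",
                          "phase3_filtering/filtering_report.md"]),
    ("phase4_deep_dive", ["phase4_deep_dive/deep_dive_summary.md"]),
    ("phase5_analysis", ["phase5_analysis/comparison_matrix.md",
                         "phase5_analysis/technique_map.md",
                         "phase5_analysis/analysis_report.md"]),
    ("phase6_blueprint", ["phase6_blueprint/integration_plan.md",
                          "phase6_blueprint/reuse_catalog.md",
                          "phase6_blueprint/final_report.md",
                          "phase6_blueprint/blueprint_summary.md"]),
]


def next_phase(topic_dir: str):
    """Return the first phase with a missing or empty output, or None if all are done."""
    root = Path(topic_dir)
    for phase, files in PHASE_OUTPUTS:
        for rel in files:
            path = root / rel
            if not path.is_file() or path.stat().st_size == 0:
                return phase
    return None
```

Treating an empty file as missing guards against a phase that was interrupted right after creating its output.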

Error Handling

  • If gh is not installed: warn user and provide installation instructions
  • If a repo is archived/deleted: skip gracefully, note in log
  • If clone fails: skip, note in log, continue with remaining repos
  • If Papers With Code API is down: skip, rely on GitHub search only
  • Always write partial progress to disk so work is not lost
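
Two of these rules, checking for gh up front and skipping failed clones, can be sketched as below. This assumes plain `git clone --depth 1` over HTTPS; clone_repo.py may work differently, and both function names are hypothetical.

```python
import shutil
import subprocess


def gh_available() -> bool:
    """True if the gh executable is on PATH."""
    return shutil.which("gh") is not None


def clone_all(repos: list, dest_dir: str, run=subprocess.run):
    """Shallow-clone each owner/name repo; return (cloned, skipped)."""
    cloned, skipped = [], []
    for repo in repos:
        name = repo.split("/")[-1]
        cmd = ["git", "clone", "--depth", "1",
               f"https://github.com/{repo}.git", f"{dest_dir}/{name}"]
        try:
            result = run(cmd, capture_output=True, text=True)
            ok = result.returncode == 0
        except OSError:
            ok = False                      # e.g. git itself is missing
        # A failure is recorded and the loop continues with the next repo.
        (cloned if ok else skipped).append(repo)
    return cloned, skipped
```

The skipped list is what goes into the log, so a deleted or private repo never aborts the rest of the deep dive.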

References

  • See references/phase-guide.md for detailed phase execution guidance
  • Deep-research skill: ~/.claude/skills/deep-research/SKILL.md
  • Paper database pattern: ~/.claude/skills/deep-research/scripts/paper_db.py
