Fork Intelligence
Systematic methodology for discovering valuable work in GitHub fork ecosystems. Stars-only filtering misses 60-100% of substantive forks — this skill uses branch-level divergence analysis, upstream PR cross-referencing, and domain-specific heuristics to find what matters.
Validated empirically across 10 repositories spanning Python, Rust, TypeScript, C++/Python, and Node.js (tensortrade, backtesting.py, kokoro, pymoo, firecrawl, barter-rs, pueue, dukascopy-node, ArcticDB, flowsurface).
FIRST — TodoWrite Task Templates
MANDATORY: Select and load the appropriate template before any fork analysis.
Template A — Full Analysis (new repository)
- Get upstream baseline (stars, forks, default branch, last push)
- List all forks with pagination, note timestamp clusters
- Filter to unique-timestamp forks (skip bulk mirrors)
- Check default branch divergence (ahead_by/behind_by)
- Check non-default branches for all forks with recent push or >1 branch
- Evaluate commit content, author emails, tags/releases
- Cross-reference upstream PR history from fork owners
- Tier ranking and cross-fork convergence analysis
- Produce report with actionable recommendations
Template B — Quick Scan (triage only)
- Get upstream baseline
- List forks, filter by timestamp clustering
- Check default branch divergence only
- Report forks with ahead_by > 0
Template C — Targeted Fork Evaluation (specific fork)
- Compare fork vs upstream on all branches
- Examine commit messages and changed files
- Check for tags/releases, open issues, PRs
- Assess cherry-pick viability
Signal Priority Order
Ranked by empirical reliability across 10 repositories. See signal-priority.md for details.
| Rank | Signal | Reliability | What It Catches |
|---|---|---|---|
| 1 | Branch-level divergence | Highest | Work on feature branches (50%+ of substantive forks) |
| 2 | Upstream PR cross-reference | High | Rebased/force-pushed work invisible to compare API |
| 3 | Tags/releases on fork | High | Independent maintenance intent |
| 4 | Commit email domains | High | Institutional contributors (@company.com) |
| 5 | Timestamp clustering | Medium | Eliminates 85%+ mirror noise |
| 6 | Cross-fork convergence | Medium | Reveals unmet upstream demand |
| 7 | Stars | Lowest | Often anti-correlated with actual value |
Pipeline — 7 Steps
Step 1: Upstream Baseline
UPSTREAM="OWNER/REPO"
gh api "repos/$UPSTREAM" --jq '{forks_count, pushed_at, default_branch, stargazers_count}'
Step 2: List All Forks + Timestamp Clustering
List all forks with activity signals
gh api "repos/$UPSTREAM/forks" --paginate \
  --jq '.[] | {full_name, pushed_at, stargazers_count, default_branch}'
Timestamp clustering: Forks whose pushed_at exactly matches upstream's were created by GitHub's fork mechanism and never touched, so they are bulk mirrors. Group forks by pushed_at: those with unique timestamps warrant investigation. This step alone eliminates 85%+ of the noise.
Filter to unique-timestamp forks (skip bulk mirrors)
gh api "repos/$UPSTREAM/forks" --paginate \
  --jq '.[] | {full_name, pushed_at, stargazers_count}' |
  jq -s 'group_by(.pushed_at) | map(select(length == 1)) | flatten'
Step 3: Default Branch Divergence
BRANCH=$(gh api "repos/$UPSTREAM" --jq '.default_branch')
For each candidate fork
gh api "repos/$UPSTREAM/compare/$BRANCH...FORK_OWNER:$BRANCH" \
  --jq '{ahead_by, behind_by, status}'
The status field meanings:
- identical: pure mirror, skip
- behind: stale mirror, skip
- diverged: has original commits AND is behind (interesting)
- ahead: has original commits, up-to-date with upstream (rare, most valuable)
Important: Always compare from the upstream repo's perspective (repos/UPSTREAM/compare/...). The reverse direction (repos/FORK/compare/...) returns 404 for some repositories.
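As a minimal sketch of the status triage above, the helper below maps a compare status to a next action. The function name triage_status and the action labels are hypothetical; they are not part of the gh CLI or the GitHub API. "skip" applies only to the branch being compared, so the non-default branch checks in Step 4 still apply.

```shell
# triage_status STATUS
# Maps the compare API "status" field to a next action.
# Hypothetical helper; the action labels are illustrative only.
triage_status() {
  case "$1" in
    identical|behind) echo "skip" ;;         # pure or stale mirror on this branch
    diverged)         echo "investigate" ;;  # original commits, but behind upstream
    ahead)            echo "prioritize" ;;   # original commits, up to date
    *)                echo "unknown" ;;      # unexpected value; inspect manually
  esac
}
```

One possible use is to feed it the status returned by the compare call, for example triage_status "$(gh api "repos/$UPSTREAM/compare/$BRANCH...FORK_OWNER:$BRANCH" --jq '.status')".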
Step 4: Non-Default Branch Analysis (CRITICAL)
This is the single biggest methodology improvement. Across all 10 repos tested, 50%+ of the most valuable fork work lived exclusively on feature branches.
Examples:
- flowsurface/aviu16: 7,000-line GPU shader heatmap only on shader-heatmap
- ArcticDB/DerThorsten: 147 commits across conda_build, clang, apple_changes
- pueue/FrancescElies: Duration display only on cesc/duration
- barter-rs: 6 of 12 top forks had work only on feature branches
List branches on a fork
gh api "repos/FORK_OWNER/REPO/branches" --jq '.[].name' | head -20
Check divergence on a specific branch
gh api "repos/$UPSTREAM/compare/$BRANCH...FORK_OWNER:FEATURE_BRANCH" \
  --jq '{ahead_by, behind_by, status}'
Heuristics for which forks need branch checks:
- Any fork with pushed_at more recent than upstream but ahead_by == 0 on default branch
- Any fork with more than 1 branch
- Branch count > 10 is suspicious: likely non-trivial work (ArcticDB: Rohan-flutterint had 197 branches)
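The three heuristics above can be sketched as a single predicate. The function name needs_branch_check, its argument order, and the reason labels are hypothetical. The sketch assumes both timestamps use the API's UTC format (YYYY-MM-DDTHH:MM:SSZ), so stripping the non-digit characters yields integers that compare correctly.

```shell
# needs_branch_check FORK_PUSHED_AT UPSTREAM_PUSHED_AT AHEAD_BY BRANCH_COUNT
# Prints the reason a fork deserves a branch-level check, or "no".
# Hypothetical helper. Assumes timestamps look like 2024-01-31T12:00:00Z.
needs_branch_check() (
  # Reduce each timestamp to a 14-digit integer (YYYYMMDDHHMMSS).
  fork_ts=$(printf '%s' "$1" | tr -cd '0-9')
  upstream_ts=$(printf '%s' "$2" | tr -cd '0-9')
  ahead_by=$3
  branch_count=$4

  if [ "$branch_count" -gt 10 ]; then
    echo "many-branches"
  elif [ "$fork_ts" -gt "$upstream_ts" ] && [ "$ahead_by" -eq 0 ]; then
    echo "recent-push-no-default-divergence"
  elif [ "$branch_count" -gt 1 ]; then
    echo "multiple-branches"
  else
    echo "no"
  fi
)
```

The order of the checks is a design choice: the strongest hint is reported first, and any result other than "no" means the fork's branches should be listed and compared.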
Step 5: Commit Content Evaluation
gh api "repos/$UPSTREAM/compare/$BRANCH...FORK_OWNER:BRANCH" \
  --jq '.commits[] | {sha: .sha[:8], message: (.commit.message | split("\n")[0]), date: .commit.committer.date[:10], author: .commit.author.email}'
What to look for:
- Commit email domains reveal institutional contributors (@man.com, @quantstack.net)
- Subtract merge commits from ahead_by count (e.g., akeda2/pueue showed 35 ahead but 28 were upstream merges)
- Build system changes (CMakeLists.txt, Cargo.toml, pyproject.toml) indicate platform enablement
- Protobuf schema changes indicate architectural-level features
- Test files alongside source changes signal production-intent work
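Two of the checks above lend themselves to small helpers; the following is a rough sketch. Both function names (email_domains and original_commits) are hypothetical. The first tallies author email domains from a list of addresses; the second subtracts merge commits from the raw ahead_by figure.

```shell
# email_domains: reads one author email per line on stdin and prints
# "count domain" pairs, most frequent first. Lines without "@" are ignored.
# Hypothetical helper name.
email_domains() {
  sed -n 's/.*@//p' | sort | uniq -c | sort -rn | awk '{print $1, $2}'
}

# original_commits AHEAD_BY MERGE_COMMITS
# Subtracts merge commits from the raw ahead_by count.
# Hypothetical helper name.
original_commits() {
  echo $(( $1 - $2 ))
}
```

With the akeda2/pueue figures quoted above, original_commits 35 28 gives 7. One way to obtain the merge count, assuming merges are the commits with more than one parent, is to add --jq '[.commits[] | select((.parents | length) > 1)] | length' to the compare call; this counts every merge, not only merges from upstream.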
Step 6: Fork-Specific Signals
Tags/releases (strongest independent maintenance signal)
gh api "repos/FORK_OWNER/REPO/tags" --jq '.[].name' | head -10
gh api "repos/FORK_OWNER/REPO/releases" --jq '.[] | {tag_name, name, published_at}' | head -5
Open issues on the fork (signals independent project maintenance; note that this endpoint also returns pull requests and only 30 items per page by default)
gh api "repos/FORK_OWNER/REPO/issues?state=open" --jq 'length'
Check if repo was renamed (strong divergence intent signal)
gh api "repos/FORK_OWNER/REPO" --jq '.name'
| Signal | Strength | Example |
|---|---|---|
| Tags/releases on fork | Highest | pueue/freesrz93 had 6 releases |
| Open PRs against upstream | High | Formal proposals with review context |
| Open issues on the fork | High | Independent project maintenance |
| Repo renamed | Medium | flowsurface/sinaha81 became volume_flow |
| Build config changes | High (compiled languages) | Cargo.toml, CMakeLists.txt diff |
| Description changed | Weak | Many vanity renames with no code |
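For the rename signal, one possible shortcut is to compare the repository part of each fork's full_name (already returned by the fork listing in Step 2) with upstream's. The helper name is_renamed is hypothetical, and OWNER below is a placeholder for the upstream owner.

```shell
# is_renamed UPSTREAM_FULL_NAME FORK_FULL_NAME
# Exits 0 when the fork's repository name differs from upstream's.
# Hypothetical helper; both arguments are "owner/repo" strings.
is_renamed() {
  # ${var##*/} strips everything up to the last slash, leaving the repo name.
  [ "${1##*/}" != "${2##*/}" ]
}
```

For example, is_renamed OWNER/flowsurface sinaha81/volume_flow succeeds, while a fork that kept the name flowsurface would not.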
Step 7: Cross-Fork Convergence + Upstream PR History
Check upstream PRs from fork owners
gh api "repos/$UPSTREAM/pulls?state=all" --paginate \
  --jq '.[] | select(.head.repo.fork) | {number, title, state, user: .user.login}'
Cross-fork convergence: When multiple forks independently solve the same problem, it signals unmet upstream demand:
- firecrawl: 3 forks adopted Patchright for anti-detection
- flowsurface: 3 forks added technical indicators independently
- kokoro: 2 independent batched inference implementations
- barter-rs: 4 forks added Bybit support
Upstream PR cross-reference catches:
- Rebased/force-pushed work invisible to compare API
- Work that was merged upstream (fork shows 0 ahead but was historically significant)
- Declined PRs with valuable code that the fork still maintains
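Convergence detection can be sketched as a simple tally once each fork has been labeled with the theme of its work. The labels are assumed to be assigned by hand while reading commits in Step 5; the helper name convergent_themes and the tab-separated input format are hypothetical.

```shell
# convergent_themes [MIN_FORKS]
# Reads "fork<TAB>theme" lines on stdin and prints "count<TAB>theme" for
# every theme found in at least MIN_FORKS distinct forks (default 2),
# most common first. Hypothetical helper; themes are hand-assigned labels.
convergent_themes() {
  # sort -u drops repeated fork/theme pairs so each fork counts once per theme.
  sort -u | awk -F '\t' -v min="${1:-2}" '
    { count[$2]++ }
    END { for (t in count) if (count[t] >= min) printf "%d\t%s\n", count[t], t }
  ' | sort -t "$(printf '\t')" -k1,1nr -k2,2
}
```

A theme that appears in several unrelated forks is a candidate for the Cross-Fork Convergence Patterns section of the report.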
Tier Classification
After running the pipeline, classify forks into tiers:
| Tier | Criteria | Action |
|---|---|---|
| Tier 1: Major Extensions | New features, architectural changes, >10 original commits | Deep evaluation, cherry-pick candidates |
| Tier 2: Targeted Features | Focused additions, bug fixes, 2-10 commits | Cherry-pick individual commits |
| Tier 3: Infrastructure | CI/CD, packaging, deployment, docs | Evaluate if relevant to your setup |
| Tier 4: Historical | Merged upstream or stale but once significant | Note for context, no action needed |
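A first-pass tier assignment can be sketched from the table above. This is a simplification: the helper name classify_tier and the KIND labels (feature, infra, historical) are hypothetical, KIND is assumed to be chosen by hand after reading the commits, and commit count alone cannot tell whether a change is architectural.

```shell
# classify_tier KIND ORIGINAL_COMMITS
# KIND is a hand-assigned label: feature, infra, or historical.
# ORIGINAL_COMMITS is ahead_by with merge commits subtracted.
# Hypothetical helper; a rough first pass, not a final ranking.
classify_tier() {
  case "$1" in
    historical) echo "Tier 4" ;;
    infra)      echo "Tier 3" ;;
    feature)
      if [ "${2:-0}" -gt 10 ]; then
        echo "Tier 1"
      elif [ "${2:-0}" -ge 2 ]; then
        echo "Tier 2"
      else
        echo "unranked"   # single-commit forks fall outside the table
      fi ;;
    *) echo "unranked" ;;
  esac
}
```

Forks reported as unranked need a manual decision; a single well-aimed bug fix may still merit Tier 2.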
Domain-Specific Patterns
Different codebases exhibit different fork behaviors. See domain-patterns.md for full details.
| Domain | Key Pattern | Example |
|---|---|---|
| Scientific/ML | Researchers fork-implement-publish-vanish, zero social engagement | pymoo: 300-file fork with 0 stars |
| Trading/Finance | Exchange connectors dominate; best forks are private | barter-rs: 4 independent Bybit impls |
| Infrastructure/DevTools | Self-hosting/SaaS-removal is the dominant theme | firecrawl: devflowinc/firecrawl-simple (630 stars) |
| C++/Python Mixed | Feature work lives on branches; email domains reveal institutions | ArcticDB: @man.com, @quantstack.net |
| Node.js Libraries | Check npm publication as separate packages | dukascopy-node: kyo06 published dukascopy-node-plus |
| Rust CLI | Cargo.toml diff is reliable quick filter; "superset" forks add subcommands | pueue: freesrz93 added 7 subcommands |
Quick-Scan Pipeline (5-minute triage)
For rapid triage of any new repo:
UPSTREAM="OWNER/REPO"
BRANCH=$(gh api "repos/$UPSTREAM" --jq '.default_branch')
1. Baseline
gh api "repos/$UPSTREAM" --jq '{forks_count, pushed_at, stargazers_count}'
2. Forks with unique timestamps (skip mirrors)
gh api "repos/$UPSTREAM/forks" --paginate \
  --jq '.[] | {full_name, pushed_at, stargazers_count}' |
  jq -s 'group_by(.pushed_at) | map(select(length == 1)) | flatten | sort_by(.pushed_at) | reverse'
3. Check ahead_by for each candidate
(loop over candidates from step 2)
4. Check upstream PRs from fork authors
gh api "repos/$UPSTREAM/pulls?state=all" --paginate \
  --jq '.[] | select(.head.repo.fork) | {number, title, state, user: .user.login}'
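The loop in step 3 is left open above; one way it might look is sketched here. The function name check_candidates is hypothetical. The sketch assumes each fork still has a branch with the same name as upstream's default branch, and it treats any failed compare call (deleted fork, missing branch) as a skip.

```shell
# check_candidates UPSTREAM BRANCH
# Reads fork full names (owner/repo), one per line, on stdin and prints
# "full_name<TAB>ahead_by<TAB>behind_by<TAB>status" for forks with ahead_by > 0.
# Hypothetical helper; assumes the fork has a branch named BRANCH.
check_candidates() (
  upstream=$1
  branch=$2
  while IFS= read -r fork; do
    owner=${fork%%/*}
    # </dev/null keeps gh from touching the list of forks on stdin.
    result=$(gh api "repos/$upstream/compare/$branch...$owner:$branch" \
      --jq 'select(.ahead_by > 0) | [.ahead_by, .behind_by, .status] | @tsv' \
      2>/dev/null </dev/null) || continue
    if [ -n "$result" ]; then
      printf '%s\t%s\n' "$fork" "$result"
    fi
  done
)
```

It could be fed from the fork listing, for example by piping gh api "repos/$UPSTREAM/forks" --paginate --jq '.[].full_name' into check_candidates "$UPSTREAM" "$BRANCH". Running it over the unique-timestamp candidates from step 2 rather than every fork keeps the number of API calls down.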
Known Limitations
| Limitation | Impact | Workaround |
|---|---|---|
| GitHub compare API 250-commit limit | Highly divergent forks may truncate | Use gh api -i "repos/FORK/commits?per_page=1" and read the last page number from the Link header to get the total count |
| Private forks invisible | Trading firms keep best work private | Accepted limitation |
| Force-pushed branches break compare API | Shows 0 ahead despite significant work | Cross-reference upstream PR history |
| Renamed forks may break API calls | Old URLs may 404 | Use gh api repos/FORK_OWNER/REPO --jq '.name' to detect renames |
| Rate limiting on large fork ecosystems | 1000 forks = many API calls | Use timestamp clustering to reduce calls by 85%+ |
| Maintainer dev forks look like independent work | Branch names 1:1 with upstream PRs | Cross-reference branch names against upstream PR branch names |
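The commits?per_page=1 workaround in the table relies on pagination: with one commit per page, the page number of the rel="last" link equals the total number of commits on the branch. A minimal sketch of pulling that number out of the response headers follows; the helper name last_page_from_link is hypothetical.

```shell
# last_page_from_link: reads HTTP response headers on stdin and prints the
# page number of the rel="last" pagination link. Prints nothing when there
# is no such link (for example, when everything fits on a single page).
# Hypothetical helper name.
last_page_from_link() {
  tr -d '\r' | sed -n 's/.*[?&]page=\([0-9][0-9]*\)>; rel="last".*/\1/p'
}
```

A possible invocation is gh api -i "repos/FORK_OWNER/REPO/commits?per_page=1" | last_page_from_link. The result is the total commit count on the fork's default branch, not the ahead_by value, so it is best compared against the same count taken from upstream.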
Report Template
Use this structure for the final analysis report:
Fork Analysis Report: OWNER/REPO
Repository: OWNER/REPO (N stars, M forks)
Analysis date: YYYY-MM-DD
Fork Landscape Summary
| Metric | Value |
|---|---|
| Total forks | N |
| Pure mirrors | N (X%) |
| Divergent forks (ahead on any branch) | N |
| Substantive forks (meaningful work) | N |
| Stars-only miss rate | X% |
Tiered Ranking
Tier 1: Major Extensions
(fork details with ahead_by, key features, files changed)
Tier 2: Targeted Features
...
Tier 3: Infrastructure/Packaging
...
Cross-Fork Convergence Patterns
(themes that multiple forks independently implemented)
Actionable Recommendations
- Cherry-pick candidates
- Feature inspiration
- Security fixes
Post-Change Checklist
After modifying THIS skill:
- YAML frontmatter valid (no colons in description)
- Trigger keywords current in description
- All ./references/ links resolve
- Pipeline steps numbered consistently
- Shell commands tested against a real repository
- Append changes to evolution-log.md