GitHub Wayback Recovery

Purpose: Recover deleted GitHub content (README files, issues, PRs, wiki pages, repository metadata) from the Internet Archive's Wayback Machine when content is no longer available on GitHub.

When to Use This Skill

Repository has been deleted and you need README, wiki, or metadata
Issues or PRs were deleted by author, maintainer, or moderation
Need to recover file contents that may have been archived
Investigating historical state of a repository
Finding forks of deleted repositories via archived network pages
Recovering release notes or documentation from deleted projects

Complementary Skills:

github-archive: For structured event data (who did what, when) - always check first
github-commit-recovery: For accessing commits when you have SHAs
github-wayback-recovery (this skill): For web page snapshots when content is fully deleted

Core Principles

Wayback Machine Archives Web Pages, Not Git Repositories:

Cannot git clone from archived content
Cannot reconstruct full commit history
Recovery success depends on whether specific URLs were crawled

What CAN Be Recovered:

README files and repository descriptions
Issue titles, bodies, and comments (Archive Team prioritizes these)
PR conversations and descriptions (Files Changed tab often fails)
Wiki pages (especially wiki home)
Release notes and descriptions
Repository metadata (stars, language, license visible on homepage)
Commit SHAs from archived commit list pages (use with github-commit-recovery skill to access actual content)

What CANNOT Be Recovered:

Private repository content (never crawled)
Complete git history or repository clone
Content behind authentication

Quick Start

Check if a repository page was archived:

curl -s "https://archive.org/wayback/available?url=github.com/owner/repo" | jq

Search for all archived URLs under a repository:

curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/owner/repo/*&output=json&collapse=urlkey" | head -50

Access an archived snapshot:

https://web.archive.org/web/{TIMESTAMP}/https://github.com/owner/repo

GitHub URL Patterns for Archive Searches

Understanding GitHub's URL structure is essential for constructing archive queries.

Repository-Level URLs

Content Type URL Pattern

Homepage github.com/{owner}/{repo}

Commits list github.com/{owner}/{repo}/commits/{branch}

Individual commit github.com/{owner}/{repo}/commit/{full-sha}

Fork network github.com/{owner}/{repo}/network/members

File and Directory URLs

Content Type URL Pattern

File view github.com/{owner}/{repo}/blob/{branch}/{path/to/file}

Directory view github.com/{owner}/{repo}/tree/{branch}/{directory}

File history github.com/{owner}/{repo}/commits/{branch}/{path/to/file}

Raw file raw.githubusercontent.com/{owner}/{repo}/{branch}/{path}

Note: blob = files, tree = directories. Raw URLs are rarely archived compared to rendered views.

Collaboration Artifacts

Content Type URL Pattern

Pull request github.com/{owner}/{repo}/pull/{number}

PR files github.com/{owner}/{repo}/pull/{number}/files

PR commits github.com/{owner}/{repo}/pull/{number}/commits

Issue github.com/{owner}/{repo}/issues/{number}

Wiki page github.com/{owner}/{repo}/wiki/{page-name}

Release github.com/{owner}/{repo}/releases/tag/{tag-name}

All PRs github.com/{owner}/{repo}/pulls?state=all

All issues github.com/{owner}/{repo}/issues?state=all

CDX API Reference

The Capture Index (CDX) API provides structured search across all archived URLs.

Basic Query Structure

https://web.archive.org/cdx/search/cdx?url={URL}&output=json

Essential Parameters

Parameter Effect Example

matchType=exact

Exact URL only (default) Single page

matchType=prefix

All URLs starting with path All repo content

url=.../*

Wildcard (same as prefix) github.com/owner/repo/*

from=YYYY

Start date filter from=2023

to=YYYY

End date filter to=2024

filter=statuscode:200

Only successful captures Skip redirects/errors

collapse=timestamp:8

One capture per day Reduce duplicates

collapse=urlkey

Unique URLs only List all archived pages

limit=N

Limit results limit=100

output=json

JSON format Machine-readable

Query Examples

Find all archived pages under a repository:

curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/facebook/react/*&matchType=prefix&output=json&collapse=urlkey"

Find archived issues for a specific repository:

curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/owner/repo/issues/*&output=json&collapse=urlkey&filter=statuscode:200"

Find archived snapshots of a specific file:

curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/owner/repo/blob/*/path/to/file&output=json"

Check for archived snapshots near a specific date:

curl -s "https://archive.org/wayback/available?url=github.com/owner/repo&timestamp=20230615"

CDX Response Format

[ ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"], ["com,github)/owner/repo", "20230615142311", "https://github.com/owner/repo", "text/html", "200", "ABC123...", "12345"] ]

Investigation Patterns

Recovering Deleted File Contents

Scenario: Repository or file has been deleted, need to recover file contents.

Step 1: Search for blob URLs

curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/owner/repo/blob/*/README.md&output=json"

Step 2: Construct archive URL from timestamp

https://web.archive.org/web/20230615142311/https://github.com/owner/repo/blob/main/README.md

Step 3: Extract content manually or use waybackpack

pip install waybackpack waybackpack "https://github.com/owner/repo/blob/main/README.md" -d output_dir

Forensic Value: Recover documentation, configuration files, or evidence that existed at specific points in time.

Recovering Deleted Issue/PR Content

Scenario: Issue or PR was deleted and you need the original content.

Step 1: Query for issue page snapshots

curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/owner/repo/issues/123*&output=json"

Step 2: Access archived page

https://web.archive.org/web/{TIMESTAMP}/https://github.com/owner/repo/issues/123

Step 3: If issue number unknown, search PR/issue listing

curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/owner/repo/issues?state=all&output=json"

Note: Archive Team actively crawls GitHub issues and PRs since 2020. Issue content has higher recovery success than file contents.

Finding Forks of Deleted Repositories

Scenario: Repository is deleted, but forks may contain the full git history.

Step 1: Search for archived fork network page

curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/owner/repo/network/members&output=json"

Step 2: Access archived network page

https://web.archive.org/web/{TIMESTAMP}/https://github.com/owner/repo/network/members

Step 3: Extract fork usernames from archived page, check if forks still exist

Check if fork exists

curl -s -o /dev/null -w "%{http_code}" https://github.com/forker/repo

Forensic Value: Active forks contain complete git history including all commits. This often yields better results than trying to recover individual files.

Recovering Wiki Content

Scenario: Repository wiki has been deleted or made private.

Step 1: Search for wiki pages

curl -s "https://web.archive.org/cdx/search/cdx?url=github.com/owner/repo/wiki*&output=json&collapse=urlkey"

Step 2: Access wiki home or specific pages

https://web.archive.org/web/{TIMESTAMP}/https://github.com/owner/repo/wiki https://web.archive.org/web/{TIMESTAMP}/https://github.com/owner/repo/wiki/Page-Name

Python Implementation

import requests import json from typing import Optional, List, Dict from time import sleep

class WaybackGitHubRecovery: CDX_API = "https://web.archive.org/cdx/search/cdx" AVAILABILITY_API = "https://archive.org/wayback/available" ARCHIVE_URL = "https://web.archive.org/web"

def check_availability(self, url: str, timestamp: Optional[str] = None) -> Optional[Dict]:
    """Check if URL has any archived snapshots."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp

    resp = requests.get(self.AVAILABILITY_API, params=params)
    data = resp.json()

    if data.get("archived_snapshots", {}).get("closest"):
        return data["archived_snapshots"]["closest"]
    return None

def search_cdx(self, url: str, match_type: str = "prefix",
               collapse: str = "urlkey", limit: int = 1000) -> List[Dict]:
    """Search CDX API for archived URLs."""
    params = {
        "url": url,
        "output": "json",
        "matchType": match_type,
        "collapse": collapse,
        "filter": "statuscode:200",
        "limit": limit
    }

    resp = requests.get(self.CDX_API, params=params)
    data = resp.json()

    if len(data) &#x3C;= 1:  # Only header row
        return []

    headers = data[0]
    results = []
    for row in data[1:]:
        results.append(dict(zip(headers, row)))

    return results

def find_repository_content(self, owner: str, repo: str) -> Dict[str, List]:
    """Find all archived content for a repository."""
    base_url = f"github.com/{owner}/{repo}"

    results = {
        "homepage": self.search_cdx(base_url, match_type="exact"),
        "issues": self.search_cdx(f"{base_url}/issues/*"),
        "pulls": self.search_cdx(f"{base_url}/pull/*"),
        "wiki": self.search_cdx(f"{base_url}/wiki*"),
        "files": self.search_cdx(f"{base_url}/blob/*"),
        "network": self.search_cdx(f"{base_url}/network/members", match_type="exact"),
    }

    return results

def get_archived_page(self, url: str, timestamp: str) -> Optional[str]:
    """Retrieve archived page content."""
    archive_url = f"{self.ARCHIVE_URL}/{timestamp}/{url}"
    resp = requests.get(archive_url)

    if resp.status_code == 200:
        return resp.text
    return None

def find_forks(self, owner: str, repo: str) -> List[str]:
    """Find potential forks from archived network page."""
    network_results = self.search_cdx(
        f"github.com/{owner}/{repo}/network/members",
        match_type="exact"
    )

    forks = []
    if network_results:
        # Get most recent snapshot
        latest = network_results[-1]
        content = self.get_archived_page(
            f"https://github.com/{owner}/{repo}/network/members",
            latest["timestamp"]
        )
        if content:
            # Extract fork usernames (simplified - would need HTML parsing)
            # Look for patterns like href="/username/repo"
            import re
            pattern = rf'href="/([^/]+)/{repo}"'
            matches = re.findall(pattern, content)
            forks = list(set(matches) - {owner})

    return forks

Usage Example

recovery = WaybackGitHubRecovery()

Check if repository homepage was archived

snapshot = recovery.check_availability("https://github.com/deleted-user/deleted-repo") if snapshot: print(f"Archived at: {snapshot['url']}") print(f"Timestamp: {snapshot['timestamp']}")

Find all archived content

content = recovery.find_repository_content("deleted-user", "deleted-repo") print(f"Found {len(content['issues'])} archived issue pages") print(f"Found {len(content['files'])} archived file pages")

Find potential forks

forks = recovery.find_forks("deleted-user", "deleted-repo") for fork in forks: print(f"Potential fork: github.com/{fork}/deleted-repo")

Limitations and Considerations

Technical Limitations

JavaScript-rendered content: GitHub's modern interface uses AJAX; archived pages may have broken file trees, blame views, and navigation
Raw file downloads: raw.githubusercontent.com URLs are rarely archived
Binary assets: Release binaries and attachments typically fail to archive

Rate Limiting

Archive.org has undocumented rate limits:

Sustainable rate: ~100 requests/minute
Implement exponential backoff if you receive 429 responses
Use collapse parameters to reduce result count
Cache results locally for repeated analysis

Troubleshooting

No archived snapshots found:

Repository may be too new or obscure for crawling
Try searching with wildcards: github.com/owner/repo/*
Check if repo was ever public (private repos not crawled)

Archived page shows broken layout:

Normal for JavaScript-heavy pages
Try "View Source" to extract text content
Use older timestamps (pre-2020 GitHub had simpler rendering)

CDX API returns empty results:

Verify URL format (no trailing slashes, correct case)
Try matchType=prefix instead of exact
Remove filter=statuscode:200 to see all captures

Rate limited by Archive.org:

Implement delays between requests (1-2 seconds)
Use collapse=timestamp:8 to reduce duplicates
Download during off-peak hours

Learn More

Wayback Machine CDX API: https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server
Archive Team GitHub Project: https://wiki.archiveteam.org/index.php/GitHub
Internet Archive Python Library: https://archive.org/services/docs/api/internetarchive/
waybackpy Documentation: https://pypi.org/project/waybackpy/

github-wayback-recovery

Safety Notice

Copy this and send it to your AI assistant to learn

Check if fork exists

Usage Example

Check if repository homepage was archived

Find all archived content

Find potential forks

Source Transparency

Related Skills

github-commit-recovery

github-archive

code coverage with gcov