Web Archive Analysis Skill
Purpose
Query the Wayback Machine to discover historical technology usage and detect technology migrations over time.
Operations
- query_cdx_api
Get historical snapshots from the Wayback Machine CDX API.
Endpoint:
GET https://web.archive.org/cdx/search/cdx
Parameters:
url: {domain}
output: json
filter: statuscode:200
collapse: timestamp:6  # Group by month (YYYYMM)
limit: 100
from: {start_year}
to: {end_year}
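Example Request:
GET https://web.archive.org/cdx/search/cdx?url={domain}&output=json&filter=statuscode:200&collapse=timestamp:6&limit=100&from={start_year}&to={end_year}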
Response Format:
[
  ["urlkey", "timestamp", "original", "mimetype", "statuscode", "digest", "length"],
  ["com,example)/", "20240115120000", "https://example.com/", "text/html", "200", "ABC123...", "45678"]
]
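The request and response described above can be sketched in Python as follows. This is a minimal illustration, not a definitive client: the helper names `build_cdx_url` and `parse_cdx_rows` are hypothetical, the `https` scheme is an assumption, and the first row of the JSON response is treated as the column header, as in the sample.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(domain, start_year, end_year, limit=100):
    """Build a CDX query URL from the parameters listed above (hypothetical helper)."""
    params = {
        "url": domain,
        "output": "json",
        "filter": "statuscode:200",
        "collapse": "timestamp:6",  # one capture per month (YYYYMM)
        "limit": limit,
        "from": start_year,
        "to": end_year,
    }
    # urlencode percent-encodes the colons, e.g. statuscode%3A200.
    return f"{CDX_ENDPOINT}?{urlencode(params)}"


def parse_cdx_rows(rows):
    """Turn the JSON array-of-arrays into dicts keyed by the header row."""
    if not rows:
        return []
    header, *records = rows
    return [dict(zip(header, record)) for record in records]
```

Each parsed record can then be addressed by field name, for example `record["timestamp"]`, instead of by column position.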
- select_snapshots
Choose representative snapshots for analysis.
Selection Strategy:
def select_snapshots(all_snapshots):
    # Get snapshots at regular intervals
    intervals = [
        "6 months ago", "1 year ago", "2 years ago", "3 years ago", "5 years ago"
    ]
    selected = []
    for interval in intervals:
        target_date = calculate_date(interval)
        closest = find_closest_snapshot(all_snapshots, target_date)
        if closest:
            selected.append(closest)
    return selected
Snapshot Priority:
- Recent (baseline for comparison)
- 1 year ago (detect recent changes)
- 2-3 years ago (medium-term evolution)
- 5+ years ago (historical context)
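The selection strategy leaves `calculate_date` and `find_closest_snapshot` undefined. A minimal sketch of how they might look, assuming snapshots are dicts with a 14-digit Wayback `timestamp` field and approximating each interval as a fixed number of days (the `INTERVAL_DAYS` table and the `now` parameter are assumptions, not part of the original skill):

```python
from datetime import datetime, timedelta

# Hypothetical day offsets for the intervals named in the selection strategy.
INTERVAL_DAYS = {
    "6 months ago": 182,
    "1 year ago": 365,
    "2 years ago": 730,
    "3 years ago": 1095,
    "5 years ago": 1825,
}


def parse_timestamp(ts):
    """Parse a 14-digit Wayback timestamp (YYYYMMDDhhmmss)."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def calculate_date(interval, now):
    return now - timedelta(days=INTERVAL_DAYS[interval])


def find_closest_snapshot(all_snapshots, target_date):
    """Return the snapshot nearest in time to target_date, or None if there are none."""
    if not all_snapshots:
        return None
    return min(
        all_snapshots,
        key=lambda s: abs(parse_timestamp(s["timestamp"]) - target_date),
    )


def select_snapshots(all_snapshots, now=None):
    now = now or datetime.now()
    selected = []
    for interval in INTERVAL_DAYS:
        closest = find_closest_snapshot(all_snapshots, calculate_date(interval, now))
        # Skip duplicates: sparse archives can map several targets to one capture.
        if closest and closest not in selected:
            selected.append(closest)
    return selected
```

Passing `now` explicitly keeps the selection reproducible. The duplicate check is a design choice added here so that a domain with few captures does not get the same snapshot analyzed several times.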
- fetch_archived_content
Retrieve archived pages for analysis.
Wayback URL Format:
https://web.archive.org/web/{timestamp}/{original_url}
Example:
https://web.archive.org/web/20230115120000/https://example.com/
Headers to Request:
Accept: text/html
User-Agent: TechStackAgent/1.0 (OSINT research)
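As an illustration of the URL format and headers above, the following sketch builds a request object with the standard library. The helper names `build_wayback_url` and `build_request` are hypothetical; nothing is sent over the network until the request is passed to `urllib.request.urlopen`.

```python
from urllib.request import Request

WAYBACK_PREFIX = "https://web.archive.org/web"
HEADERS = {
    "Accept": "text/html",
    "User-Agent": "TechStackAgent/1.0 (OSINT research)",
}


def build_wayback_url(timestamp, original_url):
    """Compose https://web.archive.org/web/{timestamp}/{original_url}."""
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


def build_request(timestamp, original_url):
    """Return a Request carrying the headers listed above (not yet sent)."""
    return Request(build_wayback_url(timestamp, original_url), headers=HEADERS)
```

A caller would typically run `urllib.request.urlopen(build_request(ts, url), timeout=30)`; the 30-second timeout is an assumed value chosen because archived pages can be slow.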
- compare_snapshots
Detect technology changes between snapshots.
Comparison Points:
{
  "headers_to_compare": [
    "Server",
    "X-Powered-By",
    "Set-Cookie"
  ],
  "html_elements": [
    "meta[name=generator]",
    "script[src]",
    "link[href]"
  ],
  "patterns_to_track": [
    "/wp-content/",
    "/_next/",
    "/_nuxt/",
    "/static/js/"
  ]
}
Change Detection:
def detect_changes(old_snapshot, new_snapshot):
    changes = []
    # Compare technologies
    old_tech = extract_technologies(old_snapshot)
    new_tech = extract_technologies(new_snapshot)
    added = new_tech - old_tech
    removed = old_tech - new_tech
    for tech in added:
        changes.append({
            "type": "technology_added",
            "technology": tech,
            "first_seen": new_snapshot.timestamp
        })
    for tech in removed:
        changes.append({
            "type": "technology_removed",
            "technology": tech,
            "last_seen": old_snapshot.timestamp
        })
    return changes
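The change detection above depends on an `extract_technologies` function that the skill does not define. A minimal runnable sketch, assuming a snapshot carries only a timestamp and raw HTML, and using a hypothetical mapping from three of the tracked path patterns to technology names (`/static/js/` is left out because it does not identify a single technology on its own):

```python
from dataclasses import dataclass

# Hypothetical mapping from tracked path patterns to technology names.
PATTERN_TO_TECH = {
    "/wp-content/": "WordPress",
    "/_next/": "Next.js",
    "/_nuxt/": "Nuxt",
}


@dataclass
class Snapshot:
    timestamp: str
    html: str


def extract_technologies(snapshot):
    """Return the set of technologies whose path pattern occurs in the HTML."""
    return {tech for pattern, tech in PATTERN_TO_TECH.items() if pattern in snapshot.html}


def detect_changes(old_snapshot, new_snapshot):
    """Same set-difference logic as above, with sorted output for determinism."""
    old_tech = extract_technologies(old_snapshot)
    new_tech = extract_technologies(new_snapshot)
    changes = [
        {"type": "technology_added", "technology": t, "first_seen": new_snapshot.timestamp}
        for t in sorted(new_tech - old_tech)
    ]
    changes += [
        {"type": "technology_removed", "technology": t, "last_seen": old_snapshot.timestamp}
        for t in sorted(old_tech - new_tech)
    ]
    return changes
```

Substring matching is the simplest possible detector. A fuller version would presumably also inspect the response headers and HTML elements listed under Comparison Points.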
- detect_migrations
Identify framework/platform migrations.
Common Migration Patterns:
{
  "WordPress → Custom/React": {
    "indicators": [
      "/wp-content/ disappears",
      "React globals appear",
      "/_next/ or /static/js/ paths"
    ],
    "typical_timeline": "6-18 months"
  },
  "AngularJS → Angular": {
    "indicators": [
      "ng-app disappears",
      "ng-version appears",
      "Angular 2+ patterns"
    ],
    "typical_timeline": "12-24 months"
  },
  "jQuery → React/Vue": {
    "indicators": [
      "jQuery CDN removed",
      "Modern framework globals",
      "SPA patterns"
    ],
    "typical_timeline": "6-12 months"
  },
  "On-prem → Cloud": {
    "indicators": [
      "CloudFront/Cloudflare headers appear",
      "AWS/GCP/Azure signatures",
      "CDN usage"
    ],
    "typical_timeline": "3-12 months"
  }
}
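One way to turn these patterns into a check is to pair each migration with the technologies expected to disappear and to appear. The sketch below is an assumption-laden simplification: the `MIGRATION_RULES` table and the `detect_migrations` signature are hypothetical, and it works on already-extracted technology names rather than on the raw indicators.

```python
# Hypothetical rule table derived from the migration patterns above:
# (label, technologies that disappear, technologies that appear).
MIGRATION_RULES = [
    ("WordPress → Custom/React", {"WordPress"}, {"React", "Next.js"}),
    ("AngularJS → Angular", {"AngularJS"}, {"Angular"}),
    ("jQuery → React/Vue", {"jQuery"}, {"React", "Vue"}),
]


def detect_migrations(removed, added):
    """Report rules where a 'from' technology was removed and a 'to' technology was added."""
    migrations = []
    for label, sources, targets in MIGRATION_RULES:
        gone = removed & sources
        arrived = added & targets
        if gone and arrived:
            migrations.append({
                "type": label,
                "from": sorted(gone),
                "to": sorted(arrived),
            })
    return migrations
```

Requiring both a removal and an addition keeps a plain technology drop, such as removing jQuery without adopting a framework, from being reported as a migration.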
- extract_historical_tech
Parse archived HTML for technology signals.
Process:
- Fetch the archived page
- Apply the same analysis as the html_content_analysis skill
- Record each detected technology with the snapshot timestamp
- Build a timeline of technology usage
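The last two steps of the process can be sketched as a small aggregation. This assumes each analyzed snapshot has been reduced to an ISO date string plus a list of technology names; the function name `build_timeline` is hypothetical. The `first_seen`, `last_seen`, and `status` fields mirror the `technology_timeline` entries in the Output section.

```python
def build_timeline(observations):
    """Build first_seen/last_seen entries from (date, technologies) observations.

    `observations` is a list of (ISO date string, iterable of technology names).
    """
    ordered = sorted(observations, key=lambda obs: obs[0])
    if not ordered:
        return []
    newest_techs = set(ordered[-1][1])
    seen = {}
    for date, techs in ordered:
        for tech in techs:
            entry = seen.setdefault(tech, {"technology": tech, "first_seen": date})
            entry["last_seen"] = date
    for tech, entry in seen.items():
        # A technology still present in the newest snapshot counts as current.
        if tech in newest_techs:
            entry["last_seen"] = "present"
            entry["status"] = "current"
        else:
            entry["status"] = "removed"
    return list(seen.values())
```

Because only a handful of snapshots are analyzed, the resulting dates bound when a technology was observed rather than when it was actually adopted or dropped.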
Output
{
  "skill": "web_archive_analysis",
  "domain": "string",
  "results": {
    "archive_coverage": {
      "oldest_snapshot": "2015-03-15",
      "newest_snapshot": "2024-01-10",
      "total_snapshots": 450,
      "snapshots_analyzed": 5
    },
    "snapshots_analyzed": [
      {
        "timestamp": "2024-01-10",
        "url": "https://web.archive.org/web/20240110/...",
        "technologies_detected": ["Next.js", "React", "Vercel"]
      },
      {
        "timestamp": "2022-06-15",
        "url": "https://web.archive.org/web/20220615/...",
        "technologies_detected": ["React", "Create React App", "Heroku"]
      },
      {
        "timestamp": "2020-01-20",
        "url": "https://web.archive.org/web/20200120/...",
        "technologies_detected": ["WordPress", "PHP"]
      }
    ],
    "technology_timeline": [
      {
        "technology": "WordPress",
        "first_seen": "2015-03-15",
        "last_seen": "2020-06-01",
        "status": "removed"
      },
      {
        "technology": "React",
        "first_seen": "2020-03-01",
        "last_seen": "present",
        "status": "current"
      },
      {
        "technology": "Next.js",
        "first_seen": "2023-01-15",
        "last_seen": "present",
        "status": "current"
      }
    ],
    "migrations_detected": [
      {
        "type": "CMS → Modern Framework",
        "from": "WordPress",
        "to": "React/Next.js",
        "approximate_date": "2020-Q1 to 2020-Q2",
        "confidence": 85
      },
      {
        "type": "Hosting Migration",
        "from": "Heroku",
        "to": "Vercel",
        "approximate_date": "2023-Q1",
        "confidence": 80
      }
    ],
    "current_vs_historical": {
      "current_stack": ["Next.js", "React", "Vercel"],
      "historical_stack": ["WordPress", "PHP", "Heroku"],
      "major_changes": 2
    }
  },
  "evidence": [
    {
      "type": "archived_snapshot",
      "timestamp": "string",
      "archive_url": "string",
      "technologies": ["array"],
      "analysis_timestamp": "ISO-8601"
    }
  ]
}
Rate Limiting
- Wayback CDX API: 15 requests/minute
- Archived page fetches: 10/minute
- Cache CDX results to avoid repeated queries
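A minimal sketch of how these limits and the cache could be enforced client-side. The `RateLimiter` class and `cached_cdx_query` helper are hypothetical names; the limiter simply spaces calls evenly (one every 4 seconds for 15 requests/minute), which is one of several reasonable strategies, and the cache is an in-memory dict keyed by query URL.

```python
import time


class RateLimiter:
    """Space calls so that at most `per_minute` happen per minute."""

    def __init__(self, per_minute, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / per_minute
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now


cdx_limiter = RateLimiter(per_minute=15)   # CDX API budget from the list above
page_limiter = RateLimiter(per_minute=10)  # archived page fetches
_cdx_cache = {}


def cached_cdx_query(url, fetch):
    """Return a cached CDX result, calling `fetch(url)` only on a cache miss."""
    if url not in _cdx_cache:
        cdx_limiter.wait()
        _cdx_cache[url] = fetch(url)
    return _cdx_cache[url]
```

The `clock` and `sleep` parameters exist so the limiter can be exercised in tests without real delays. Cache hits skip the limiter entirely, since they make no request.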
Error Handling
- 404: Domain or URL not archived
- 503: Wayback Machine overloaded; retry with backoff
- Timeout: Archived pages can be slow to load; increase the request timeout
- On partial failures, continue with the snapshots that are available
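These rules might be implemented along the following lines. This is a sketch under stated assumptions: `fetch` is any caller-supplied function that raises `urllib.error.HTTPError` on HTTP failures, and the retry count and the 2-second base delay are illustrative values, not figures from the skill.

```python
import time
from urllib.error import HTTPError


def fetch_with_backoff(fetch, url, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(url); retry on HTTP 503 with exponential backoff.

    Returns None on 404 (not archived). Other errors propagate to the caller.
    """
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except HTTPError as err:
            if err.code == 404:
                return None
            if err.code != 503 or attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...


def fetch_all(fetch, urls, **kwargs):
    """Fetch every URL, keeping whatever succeeds and skipping failures."""
    results = {}
    for url in urls:
        try:
            body = fetch_with_backoff(fetch, url, **kwargs)
        except OSError:
            # HTTPError, URLError, and TimeoutError are all OSError subclasses.
            continue
        if body is not None:
            results[url] = body
    return results
```

`fetch_all` reflects the last rule in the list: one failed snapshot does not abort the analysis of the others.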
Security Considerations
- Only access public archives
- Respect Wayback Machine rate limits
- Do not store archived content beyond analysis
- Note that archived content may contain outdated security vulnerabilities
- Log all queries for audit
Confidence Notes
Historical data provides contextual signals:
- Confirms technology transitions
- Validates current technology choices
- Lower weight than current direct evidence
- Base confidence: 60-75%