TikTok Hotspot Monitor — Agent Skill
1. Task Boundary (Scope)
Responsible For
- Crawling TikTok video public metadata (keyword/hashtag/creator/music sources) via the Apify cloud Actor (clockworks/tiktok-scraper)
- Fallback crawling via Playwright browser automation with a saved session
- Offline deduplication, heat scoring, and trend analysis
- Term extraction: content keywords and TikTok hashtags, with multi-bucket aging
- Long-term term status based on current-snapshot age distribution, not only previous-snapshot overlap
- Coverage scoring to surface "broadly appearing" signals vs "single viral" signals
- Static HTML report generation with dark theme
NOT Responsible For
- Downloading video/audio files
- Real-time streaming or WebSocket data
- TikTok login or session management (must be pre-configured)
- Sentiment analysis of comments
- Cross-platform trend comparison
- Automated social media posting
- User authentication or authorization
- Data persistence beyond local JSONL/JSON files
Agent Addition Scope
The agent MAY add new keyword/hashtag sources to the config. The agent MUST NOT modify crawl window weights or add new window types without user approval, as those affect Apify billing.
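For example, an in-scope change appends one entry to `sources` and leaves `apify.input.crawl_windows` untouched (the hashtag value below is illustrative, not from the shipped config):

```json
{
  "sources": [
    { "type": "hashtag", "value": "springoutfits", "enabled": true }
  ]
}
```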
2. Input Schema
2.1 Main Config (config/tiktok_hotspot_sources.json)
interface CrawlerConfig {
market: string; // default: "US"
output: {
base_dir: string; // default: "data/tiktok_hotspots"
snapshots_dir: string; // default: "snapshots"
logs_dir: string; // default: "logs"
};
provider: {
type: "apify" | "tiktok_mcp"; // default: "apify"
actor_id?: string; // required if type=apify
};
defaults: {
limit: number; // default: 10, per-source limit
};
sources: Array<{
type: "keyword" | "hashtag" | "creator" | "music";
value: string;
limit?: number; // override defaults.limit
enabled?: boolean; // default: true
}>;
apify?: {
token_env?: string; // default: "APIFY_TOKEN"
actor_id?: string;
input: {
defaults: Record<string, any>;
per_source?: Record<string, any>;
crawl_windows?: Record<string, CrawlWindow[]>;
};
};
tiktok_mcp?: {
command?: string;
args?: string[];
timeout_seconds?: number;
reject_simulated?: boolean;
};
}
interface CrawlWindow {
name: string;
label: string;
weight: number; // allocation weight
input: Record<string, any>; // searchSorting, searchDatePosted, etc.
}
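A minimal config instance matching the schema above might look like this (the source values and limits are illustrative):

```json
{
  "market": "US",
  "output": {
    "base_dir": "data/tiktok_hotspots",
    "snapshots_dir": "snapshots",
    "logs_dir": "logs"
  },
  "provider": { "type": "apify", "actor_id": "clockworks/tiktok-scraper" },
  "defaults": { "limit": 10 },
  "sources": [
    { "type": "keyword", "value": "summer dresses", "limit": 20 },
    { "type": "hashtag", "value": "ootd", "enabled": false }
  ],
  "apify": {
    "token_env": "APIFY_TOKEN",
    "input": { "defaults": {} }
  }
}
```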
2.2 CLI Arguments
| Argument | Type | Default | Description |
|---|---|---|---|
| --config | Path | config/tiktok_hotspot_sources.json | Config file |
| --once | Flag | - | Run a single crawl |
| --schedule | Flag | - | Run continuously |
| --max-sources | int | None | Limit the number of enabled sources |
| --snapshot | Path | latest | JSONL snapshot to analyze |
| --previous-snapshot | Path | auto | Previous snapshot for comparison |
| --top | int | 10 | Items per ranked section |
| --report | Path | latest | Analysis JSON for rendering |
2.3 Environment Variables
| Variable | Required | Description |
|---|---|---|
| APIFY_TOKEN | For Apify mode | Apify API token |
| TIKTOK_PROXY | For Playwright mode | Proxy URL |
3. Output Schema
3.1 Crawl Snapshot (JSONL, one record per line)
interface CrawlRecord {
crawl_timestamp: string; // UTC ISO
source_type: "keyword" | "hashtag" | "creator" | "music";
source_value: string;
crawl_window: string;
crawl_window_label: string;
crawl_window_limit: number;
video_id: string | null;
webpage_url: string | null;
title: string | null;
description: string | null;
uploader: string | null;
uploader_id: string | null;
view_count: number | null;
like_count: number | null;
comment_count: number | null;
share_count: number | null;
collect_count: number | null;
hashtags: string[] | null;
music: {
id: string | null;
track: string | null;
artist: string | null;
};
upload_date: string | null; // ISO date
duration: number | null;
is_ad: boolean | null;
}
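Since the snapshot is one JSON object per line, a consumer can stream it without loading the whole file. A minimal reader sketch (the skip-on-corrupt-line behavior is an assumption, consistent with the error-recovery table in section 6.2):

```python
import json

def read_snapshot(path):
    """Yield CrawlRecord dicts from a JSONL snapshot, skipping blank or corrupt lines."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # corrupt line: dropped, matching the re-crawl guidance in 6.2
```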
3.2 Crawl Log (JSONL)
interface LogEntry {
crawl_timestamp: string;
source_type: string;
source_value: string;
crawl_window: string;
crawl_window_limit: number;
status: "success" | "failed";
record_count: number;
error: string | null;
}
Last entry is a CrawlRoundSummary:
interface CrawlRoundSummary {
event: "crawl_round_summary";
crawl_timestamp: string;
provider: string;
enabled_source_count: number;
crawl_window_count: number;
planned_run_count: number;
requested_total_limit: number;
completed_run_count: number;
failed_run_count: number;
raw_record_count: number;
unique_video_count: number;
duplicate_rate: number; // 0.0 - 1.0
effective_unique_yield: number; // unique / requested
windows: Record<string, WindowMetrics>;
cost_model_note: string;
}
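A minimal sketch of how the two ratio fields could be derived from a round's raw counts (field names follow the schema above; the helper itself and its guards are illustrative):

```python
def round_metrics(raw_record_count: int, unique_video_count: int,
                  requested_total_limit: int) -> dict:
    """Derive duplicate_rate and effective_unique_yield for a CrawlRoundSummary."""
    raw = max(raw_record_count, 1)            # guard: empty round
    requested = max(requested_total_limit, 1)  # guard: zero requested
    return {
        "duplicate_rate": 1.0 - unique_video_count / raw,
        "effective_unique_yield": unique_video_count / requested,
    }

# e.g. 480 raw records, 360 unique videos, 500 requested in total
m = round_metrics(480, 360, 500)
```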
3.3 Analysis Report (JSON)
interface AnalysisReport {
generated_at: string;
snapshot_path: string;
previous_snapshot_path: string | null;
analysis_window: {
current_snapshot_time: string;
previous_snapshot_time: string | null;
interval_hours: number | null;
matched_previous_video_count: number;
};
record_count: number;
unique_video_count: number;
source_counts: Record<string, number>;
top_videos: VideoItem[];
top_rising_videos: VideoItem[];
recent_videos_by_age: AgeBucket<VideoItem>[];
recent_signals_by_age: SignalBucket[];
established_terms: TermItem[];
established_hashtags: TermItem[];
top_music: RankedItem[];
top_creators: RankedItem[];
crawl_metrics: CrawlRoundSummary | null;
}
3.4 HTML Report
Self-contained static HTML file at data/tiktok_hotspot_analysis/tiktok_hotspot_report_<timestamp>.html.
No external dependencies. Dark themed. Machine-readable data embedded as JSON in comments.
4. Tools
4.1 crawl_tiktok_hotspots.py — Metadata Crawler
When to call:
- User requests data collection
- Need fresh snapshot for analysis
- Smoke test / validation run
When NOT to call:
- User wants to view existing data only (use analyze instead)
- Config is invalid and has not been fixed since the last failure
- Apify mode: APIFY_TOKEN not set (check env first)
- MCP mode: session file missing (run tiktok_login_save_session.py first)
Provider switching:
Edit config/tiktok_hotspot_sources.json to switch between providers:
// Apify mode (default, full features)
{ "provider": { "type": "apify", "actor_id": "clockworks/tiktok-scraper" } }
// Local MCP mode (limited, testing only)
{ "provider": { "type": "tiktok_mcp" } }
MCP mode requires:
- pip install playwright && playwright install chromium
- python scripts/tiktok_login_save_session.py (manual TikTok login)
- Config tiktok_mcp.args pointing to scripts/tiktok_search_mcp_adapter.py
Implementation:
# Provider dispatch
if config.provider_type == "apify":
# Requires APIFY_TOKEN in env
# Each source × window → one Actor run
# Supports all 4 source types
elif config.provider_type == "tiktok_mcp":
# Requires saved session file
# Keyword/hashtag only, ~12 items per source
Error states:
| Error | Recovery |
|---|---|
| Apify token missing | Check env, prompt user to set APIFY_TOKEN |
| Actor run timeout | Retry with same config |
| No videos found | Log as failed window, continue |
| MCP session expired | Prompt re-login via tiktok_login_save_session.py |
| Proxy unreachable | Skip proxy or switch to Apify |
| Snapshot empty | Check sources config, ensure keywords are valid |
Retry policy:
- Network errors: retry up to 2 times with 5s backoff
- Actor failures: no retry (Apify handles internally), log and continue
- MCP browser crash: retry once
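The network-error branch of the policy above could be wrapped in a small helper. A sketch under stated assumptions — real code would catch the provider's specific exception types rather than bare `Exception`:

```python
import time

def with_retries(fn, retries=2, backoff_s=5, sleep=time.sleep):
    """Call fn(); on failure, retry up to `retries` more times with a fixed backoff."""
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise          # exhausted: surface to caller so the run is logged as failed
            sleep(backoff_s)   # fixed 5s backoff per the policy above

# illustrative flaky callable: fails twice, then succeeds
calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("transient")
    return "ok"

result = with_retries(flaky, retries=2, backoff_s=0)
```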
4.2 analyze_tiktok_hotspots.py — Offline Analyzer
When to call:
- After crawl completes
- User has existing snapshot to analyze
- Need updated report
Implementation steps:
- Load snapshot JSONL → validate each record has video_id
- Deduplicate by video_id (keep the record with the highest heat score)
- Compute per-video heat score
- Bucket videos by upload age (1d/3d/7d/14d)
- Extract content terms and hashtags
- Compute cross-bucket novelty (new vs existing terms)
- Compute coverage scores
- Compare with previous snapshot for growth metrics
- Output structured JSON
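The first two steps can be sketched as follows. Note the heat formula here is a stand-in with illustrative weights — the skill's actual scoring formula is not specified in this document:

```python
def heat_score(rec: dict) -> float:
    """Hypothetical heat score: weighted engagement counts (weights are
    illustrative, not the skill's actual formula)."""
    return ((rec.get("view_count") or 0) * 0.001
            + (rec.get("like_count") or 0) * 0.1
            + (rec.get("comment_count") or 0) * 0.5
            + (rec.get("share_count") or 0) * 1.0)

def dedupe_by_video_id(records):
    """Keep, per video_id, the record with the highest heat score."""
    best = {}
    for rec in records:
        vid = rec.get("video_id")
        if vid is None:
            continue  # step 1: records without video_id are dropped
        if vid not in best or heat_score(rec) > heat_score(best[vid]):
            best[vid] = rec
    return list(best.values())

rows = [
    {"video_id": "a", "view_count": 1000, "like_count": 10},
    {"video_id": "a", "view_count": 5000, "like_count": 50},
    {"video_id": None, "view_count": 9999},
]
unique = dedupe_by_video_id(rows)
```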
4.2.1 Long-term Term Status
Long-term content terms and hashtags are not dropped when they are missing from the previous snapshot. A term enters the long-term section when its oldest matched video is older than 30 days. Its status is then computed from the current snapshot's video-age distribution:
| Status | Condition | Meaning |
|---|---|---|
| spreading | newest video <= 7 days AND recent_7d_count / video_count >= 10% | Still actively spreading |
| mature_or_flat | newest video <= 30 days but 7d ratio is too low | Existing signal, activity weakening |
| cooling | newest video > 30 days | No recent new videos; cooling down |
This avoids losing a long-term term simply because the previous crawl did not hit it, while also preventing one recent video among many old videos from falsely marking a term as spreading.
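The status table above can be sketched as a small classifier (parameter names are illustrative; thresholds follow the table):

```python
def long_term_status(newest_age_days: float, recent_7d_count: int,
                     video_count: int) -> str:
    """Classify a long-term term from the current snapshot's age distribution."""
    vc = max(video_count, 1)  # division-by-zero guard (section 6.2)
    if newest_age_days > 30:
        return "cooling"
    if newest_age_days <= 7 and recent_7d_count / vc >= 0.10:
        return "spreading"
    return "mature_or_flat"

statuses = [
    long_term_status(3, 4, 20),   # recent videos, 20% within the last 7 days
    long_term_status(5, 1, 40),   # one recent video among many old ones
    long_term_status(45, 0, 12),  # nothing new for over a month
]
```

The second case shows the safeguard described above: a single recent video among forty old ones stays mature_or_flat rather than spreading.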
4.3 render_tiktok_hotspot_report.py — HTML Report Generator
When to call:
- After analysis completes
- User requests visual output
Output: Valid HTML5, self-contained, no external CSS/JS.
4.4 tiktok_login_save_session.py — Session Setup (optional)
When to call:
- User wants to use local Playwright mode
- Session file missing or expired
5. State Machine
IDLE
│
▼
CONFIG_LOAD ──invalid──▶ ERROR (report config issue)
│
▼
CRAWL_PLAN
├─ Build requests: enabled_sources × crawl_windows
├─ Compute: planned_run_count, requested_total_limit
└─ Validate: at least 1 enabled source
│
▼
CRAWL_EXECUTE ──fail──▶ PARTIAL_COMPLETE (log failures, continue)
│ │
▼ ▼
SNAPSHOT_WRITTEN PARTIAL_SNAPSHOT
│ │
└───────both────────────▶
│
▼
ANALYZE ──empty_snapshot──▶ ERROR (no records to analyze)
│
▼
REPORT_GENERATE ──fail──▶ ERROR (corrupted analysis JSON)
│
▼
COMPLETE
State management is handled by the Python scripts via:
- Exit codes: 0 (success), 1 (partial failure), 2 (config/input error)
- Logs: per-run JSONL entries with status
- Summary: CrawlRoundSummary as the last log entry
6. Error Recovery
6.1 Crawl Phase
| Failure Mode | Detection | Recovery |
|---|---|---|
| Invalid config | load_config() raises ValueError | Report exact field, suggest fix |
| No enabled sources | Config load check | Add at least one source |
| Apify token missing | os.environ.get() returns empty | Message: "Set APIFY_TOKEN in .env" |
| All sources fail | All log entries show failed | Check token, network, actor_id |
| Some sources fail | Log shows mixed success/fail | Continue, report failed count |
| Snapshot empty | 0 records written | Check source keywords/limits |
| Disk full | write() raises OSError | Free disk space, retry |
| MCP browser timeout | asyncio.wait_for raises | Fallback to fewer sources |
| MCP session expired | Actor raises RuntimeError | Run tiktok_login_save_session.py |
6.2 Analyze Phase
| Failure Mode | Detection | Recovery |
|---|---|---|
| Snapshot missing | FileNotFoundError | Run crawl first |
| Corrupted JSONL | json.JSONDecodeError | Check snapshot, re-crawl |
| No video records | All lines lack video_id | Report empty snapshot |
| Previous snapshot missing | valid_snapshots() empty | Run without comparison |
| Division by zero | video_count = 0 | Guard with max(vc, 1) |
6.3 Report Phase
| Failure Mode | Detection | Recovery |
|---|---|---|
| Analysis JSON missing | FileNotFoundError | Run analyze first |
| Corrupted JSON | json.JSONDecodeError | Re-run analyze |
| KeyError in template | report.get(key) missing | Graceful fallback to empty |
| Encoding error | UnicodeEncodeError | Force UTF-8 output |
7. Planning Logic
7.1 Task Decomposition
For a typical hotspot monitoring request, decompose as:
Step 1: Check existing data
├─ Is there a recent snapshot? (< 24h old)
│ └─ Yes → skip crawl, go to Step 3
│ └─ No → continue to Step 2
│
Step 2: Crawl
├─ Validate APIFY_TOKEN exists
├─ Load config
├─ Run crawl (with timeout guard)
└─ Verify snapshot has records
│
Step 3: Analyze
├─ Auto-select latest snapshot
├─ Auto-select previous snapshot (if exists)
├─ Run analysis
└─ Verify output JSON has all required fields
│
Step 4: Generate report
├─ Render HTML from analysis JSON
└─ Verify output is valid HTML
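Step 1's freshness gate can be sketched as follows, using file modification time as the age proxy and a flat snapshots directory of `*.jsonl` files — both assumptions about the on-disk layout:

```python
import glob
import os
import time

def latest_snapshot(snapshots_dir: str, max_age_hours: float = 24.0):
    """Return the newest *.jsonl snapshot if fresher than max_age_hours,
    else None (meaning: crawl first)."""
    paths = glob.glob(os.path.join(snapshots_dir, "*.jsonl"))
    if not paths:
        return None
    newest = max(paths, key=os.path.getmtime)
    age_h = (time.time() - os.path.getmtime(newest)) / 3600.0
    return newest if age_h < max_age_hours else None
```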
7.2 Decision Tree
User: "check TikTok trends for summer dresses"
Check: Does latest snapshot exist and have records?
├─ YES: Is it < 24h old?
│ ├─ YES: Skip crawl, go to analyze
│ └─ NO: Is user OK waiting 5-30 min for crawl?
│ ├─ YES: Run crawl, then analyze
│ └─ NO: Use existing snapshot, warn about staleness
└─ NO: Must crawl first
├─ Is APIFY_TOKEN configured?
│ ├─ YES: Use Apify provider
│ └─ NO: Check MCP session
│ ├─ EXISTS: Use MCP provider (limited data)
│ └─ MISSING: Ask user to configure one
└─ Run crawl
8. Guardrails
8.1 Cost Limits
| Guardrail | Value | Enforcement |
|---|---|---|
| Max sources per crawl | 50 | Config validation |
| Max limit per source | 500 | Config validation (positive_int) |
| Max requested total | 5000 | Config validation (project-level) |
| Max planned runs | 250 | 50 sources × 5 windows |
| Apify mode | Required for > 200 records | MCP limited to ~12/source |
| Report HTML size | < 5MB | Self-limiting (trim if exceeded) |
8.2 Time Limits
| Operation | Timeout | Enforcement |
|---|---|---|
| Single crawl run | 60 min | Bash timeout parameter |
| Per-Apify Actor | No limit | Apify handles internally |
| Per-MCP search | 120s | tiktok_mcp.timeout_seconds |
| Analysis | 30s | Python processing (fast) |
| Report render | 10s | Python processing (fast) |
8.3 Rate Limits
- No concurrent Apify runs (sequentially dispatched)
- MCP browser: one at a time (sequential per source)
- Web fetching: 60s minimum between full re-crawls
8.4 Token / Credit Safety
- Never commit .env to git
- Never print API tokens in logs or console
- APIFY_TOKEN is read from environment only
- MCP session file is local only
9. Evaluation Criteria
9.1 Crawl Success
| Criterion | Passing | Warning | Failing |
|---|---|---|---|
| Run completion | ≥ 90% runs succeed | 70-90% | < 70% |
| Record count | ≥ 80% requested | 50-80% | < 50% |
| Duplicate rate | < 25% | 25-40% | > 40% |
| Failed windows | 0 | 1-3 | > 3 |
| Unique videos | ≥ 50 | 20-50 | < 20 |
9.2 Analysis Success
| Criterion | Passing | Failing |
|---|---|---|
| Snapshot has records | ≥ 10 unique videos | < 10 |
| Dedup processed | All records checked | Missing video_id |
| Term extraction | ≥ 1 content term found | 0 terms |
| JSON output | All required fields present | Missing required fields |
| Processing time | < 30s | > 60s |
9.3 Report Success
| Criterion | Passing | Failing |
|---|---|---|
| Valid HTML | Closes </html> tag | Missing closing tag |
| Metrics visible | ≥ 4 grid metrics shown | Empty grid |
| Videos rendered | Top list non-empty | Empty list |
| All sections present | 6+ sections | < 4 sections |
9.4 Decision: Proceed to Next Stage
After a validation crawl (target ~500 records):
unique_yield = unique_videos / requested_total_limit
if unique_yield >= 0.6 and duplicate_rate < 0.25:
✅ Proceed to pilot (2000 target)
elif unique_yield >= 0.4:
⚠️ Proceed with caution, review source quality
else:
❌ Block scaling, fix sources/windows first
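The gate above can be expressed as a small function for use in an orchestrating script (names are illustrative; thresholds follow section 9.4):

```python
def scaling_decision(unique_videos: int, requested_total_limit: int,
                     duplicate_rate: float) -> str:
    """Decide whether to scale up after a validation crawl (section 9.4 gate)."""
    unique_yield = unique_videos / max(requested_total_limit, 1)
    if unique_yield >= 0.6 and duplicate_rate < 0.25:
        return "proceed"   # scale to the 2000-record pilot
    if unique_yield >= 0.4:
        return "caution"   # review source quality first
    return "block"         # fix sources/windows before scaling

decisions = [
    scaling_decision(330, 500, 0.18),  # 0.66 yield, low duplicates
    scaling_decision(250, 500, 0.30),  # 0.50 yield
    scaling_decision(150, 500, 0.45),  # 0.30 yield
]
```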
10. Composability
10.1 Output Consumption
Other skills/agents consume analysis JSON via standard path:
# Example: another agent reads the analysis for downstream processing
import json

with open("data/tiktok_hotspot_analysis/latest_analysis.json", encoding="utf-8") as fh:
    report = json.load(fh)
top_signals = [t["name"] for t in report.get("top_videos", [])[:5]]
hot_terms = [t["name"] for t in report.get("established_terms", [])[:10]]
10.2 Pipeline Integration
Data Source Agent
└─► TikTok Hotspot Monitor Skill
├─► crawl → snapshot.jsonl
│ └─► [External] Apify usage dashboard (cost tracking)
├─► analyze → analysis.json
│ └─► [Downstream] Trend prediction / alerting
└─► render → report.html
└─► [Downstream] Static hosting / dashboard
10.3 File-Based Contract
All inter-skill communication is file-based:
| Artifact | Format | Schema | Consumer |
|---|---|---|---|
| Snapshot | JSONL | CrawlRecord | Analysis, ML pipeline |
| Analysis | JSON | AnalysisReport | Report, dashboards |
| Log | JSONL | LogEntry / Summary | Monitoring, cost tracking |
| Report | HTML | Self-contained | Human viewing |
10.4 Exit Codes
# Standard exit codes for script chaining
0: Success (all operations completed)
1: Partial success (some failures, usable results)
2: Configuration error (fix config before retry)
Appendix: Quick Reference
# Full pipeline (one command each)
python scripts/crawl_tiktok_hotspots.py --config config/tiktok_hotspot_sources.json --once
python scripts/analyze_tiktok_hotspots.py
python scripts/render_tiktok_hotspot_report.py
# Smoke test (2 sources)
python scripts/crawl_tiktok_hotspots.py --once --max-sources 2
# Validation run (500 records)
python scripts/crawl_tiktok_hotspots.py --config config/_tiktok_hotspot_apify_500_config.json --once
Apify Cost Note: Verify actual charges at console.apify.com → Usage. Cost depends on Actor pricing, run count, compute duration, memory, proxy usage, retries, add-ons, and account plan — not only requested result count.