ClawText Ingest — Production-Ready Memory Ingestion
Version: 1.3.0 | License: MIT | Status: Production ✅
Author: ragesaq | Category: Memory & Knowledge Management
GitHub: https://github.com/ragesaq/clawtext-ingest
🎯 What It Does
ClawText Ingest transforms external data (Discord forums, files, URLs, JSON, text) into structured, deduplicated memories for AI agents.
The Problem It Solves
- ❌ Manual ingestion — Tedious, error-prone, no metadata
- ❌ Duplicate memories — Same data ingested multiple times
- ❌ Unstructured data — No hierarchy, no context preservation
- ❌ One-time imports — No recurring/scheduled ingestion
- ❌ Discord-specific gaps — Can't preserve forum post↔reply structure
The Solution
✅ One command imports from Discord, files, URLs, or JSON
✅ 100% idempotent — Run 1000x, zero duplicates
✅ Automatic metadata — YAML frontmatter with date, project, type, entities
✅ 6 agent patterns — Autonomous workflows documented and ready
✅ Discord-native — Forum hierarchy preserved, progress bars, auto-batch mode
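As a rough illustration of the "automatic metadata" point, the sketch below renders a memory as YAML frontmatter followed by its body. The field names (date, project, type, entities) come from the list above; the helper name `toMemory` and the exact layout are assumptions for illustration, not the format ClawText Ingest actually writes.

```javascript
// Quote every scalar so values containing ':' or '#' stay valid YAML.
const yamlString = (value) => JSON.stringify(String(value));

// Hypothetical helper: renders a memory as YAML frontmatter plus body text.
function toMemory(body, { date, project, type, entities = [] }) {
  const lines = [
    '---',
    `date: ${yamlString(date)}`,
    `project: ${yamlString(project)}`,
    `type: ${yamlString(type)}`,
    // Flow-style YAML sequence of quoted strings.
    `entities: [${entities.map(yamlString).join(', ')}]`,
    '---',
    '',
    body.trim(),
  ];
  return lines.join('\n') + '\n';
}

const memory = toMemory('Finding: X is better than Y', {
  date: '2024-01-15',
  project: 'findings',
  type: 'fact',
  entities: ['X', 'Y'],
});
console.log(memory);
```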
✨ Key Features
🎯 Discord Integration (New in v1.3.0)
- Forum + Channel + Thread support
- Hierarchy preservation — Post↔reply structure in metadata
- Real-time progress — Live feedback for large ingestions
- Auto-batch mode — <500 posts: full, ≥500 posts: streaming
- One-command setup — 5-minute bot creation
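The auto-batch rule above can be pictured as a simple threshold check. This is only a sketch: the 500-post threshold and the mode names "full" and "streaming" are taken from the list, while the function name `chooseIngestMode` is hypothetical and the real selection logic inside clawtext-ingest-discord may differ.

```javascript
// Threshold from the feature list: fewer than 500 posts is full mode, 500 or more is streaming.
const AUTO_BATCH_THRESHOLD = 500;

// Hypothetical helper, not part of the clawtext-ingest API.
function chooseIngestMode(postCount) {
  if (!Number.isInteger(postCount) || postCount < 0) {
    throw new RangeError(`postCount must be a non-negative integer, got ${postCount}`);
  }
  return postCount < AUTO_BATCH_THRESHOLD ? 'full' : 'streaming';
}

console.log(chooseIngestMode(120)); // full
console.log(chooseIngestMode(500)); // streaming
```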
📁 Multi-Source Ingestion
- Files — Glob patterns (Markdown, text, etc.)
- URLs — Single or bulk URL ingestion
- JSON — Chat exports, API responses
- Raw text — Quick knowledge capture
- Batch operations — Unified ingestion from multiple sources
🔄 Deduplication & Safety
- SHA1-based — Cryptographic hash matching
- 100% idempotent — Safe for repeated runs
- Configurable — checkDedupe: true/false per operation
- Zero data loss — Failed items tracked, fallback per-item ingestion
- Hash persistence — .ingest_hashes.json for cross-session tracking
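To show how SHA1-based deduplication with a persisted hash file can work in principle, here is a minimal sketch using only Node.js built-ins. The `HashStore` class and the JSON-array file layout are assumptions; the actual structure of .ingest_hashes.json is not described in this README. It is written as CommonJS (`require`) so it runs as a standalone script; in an ES module, swap in the equivalent `import` statements.

```javascript
const { createHash } = require('node:crypto');
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical store: keeps SHA1 hex digests of content already ingested.
class HashStore {
  constructor(file = '.ingest_hashes.json') {
    this.file = file;
    // Load earlier hashes so deduplication survives across sessions.
    const saved = fs.existsSync(file) ? JSON.parse(fs.readFileSync(file, 'utf8')) : [];
    this.seen = new Set(saved);
  }

  // Returns true if the content is new (and records it), false if it is a duplicate.
  add(content) {
    const hash = createHash('sha1').update(content, 'utf8').digest('hex');
    if (this.seen.has(hash)) return false;
    this.seen.add(hash);
    return true;
  }

  save() {
    fs.writeFileSync(this.file, JSON.stringify([...this.seen], null, 2));
  }
}

// Demo against a throwaway file in the temp directory.
const demoFile = path.join(os.tmpdir(), `ingest_hashes_demo_${process.pid}.json`);
fs.rmSync(demoFile, { force: true });
const store = new HashStore(demoFile);
console.log(store.add('same text')); // true: first time seen
console.log(store.add('same text')); // false: duplicate
store.save();
```

Because the check is keyed on content alone, re-running the same ingestion adds nothing new, which is the property the "idempotent" claim relies on.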
🤖 Agent-Ready
- 6 documented patterns — Direct API, Discord Agent, CLI, Cron, Batch, Thread
- Working code examples — Copy-paste ready
- Real-world patterns — GitHub sync, Discord monitoring, team decisions
- Error handling — Comprehensive error recovery
- Progress callbacks — Track ingestion in real-time
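The error-recovery and progress ideas above can be combined in one pattern: try the whole batch first, then fall back to item-by-item ingestion while tracking failures. The sketch below assumes caller-supplied `ingestBatch` and `ingestOne` functions; these names and the shape of the progress object (other than `percent`, which appears in the forum monitoring example) are hypothetical and not part of the documented API.

```javascript
// Hypothetical runner: ingestBatch and ingestOne are supplied by the caller.
async function ingestWithFallback(items, { ingestBatch, ingestOne, onProgress = () => {} }) {
  const failed = [];
  try {
    await ingestBatch(items);
    onProgress({ done: items.length, total: items.length, percent: 100 });
    return { ingested: items.length, failed };
  } catch {
    // Batch failed: retry item by item so one bad record cannot sink the rest.
  }

  let ingested = 0;
  for (const [index, item] of items.entries()) {
    try {
      await ingestOne(item);
      ingested += 1;
    } catch (err) {
      failed.push({ item, error: err.message });
    }
    const done = index + 1;
    onProgress({ done, total: items.length, percent: Math.round((done / items.length) * 100) });
  }
  return { ingested, failed };
}

// Demo: the batch call fails, and one item fails on the per-item retry.
const demo = ingestWithFallback(['a', 'bad', 'c'], {
  ingestBatch: async () => { throw new Error('batch rejected'); },
  ingestOne: async (item) => { if (item === 'bad') throw new Error('cannot parse'); },
  onProgress: (p) => console.log(`${p.percent}% complete`),
});
demo.then((result) => console.log(result));
```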
🛠️ Developer-Friendly
- CLI tool — clawtext-ingest + clawtext-ingest-discord commands
- Node.js API — Simple imports for programmatic use
- TypeScript-ready — Clear method signatures
- Extensible — Custom transforms, field mapping
- Well-documented — 11 guides, 20+ examples
🔗 ClawText Integration
- Automatic cluster indexing — New memories indexed after rebuild
- RAG injection — Relevant context injected into agent prompts
- Project routing — Organize memories by project/source
- Entity linking — Auto-extract and link related entities
🚀 Quick Start
Installation
# Via npm
npm install clawtext-ingest
# Via OpenClaw
openclaw install clawtext-ingest
Discord Ingestion (5 minutes)
# 1. Set up Discord bot (see DISCORD_BOT_SETUP.md)
# 2. Get bot token, set DISCORD_TOKEN env var
# 3. Inspect forum
clawtext-ingest-discord describe-forum --forum-id FORUM_ID --verbose
# 4. Ingest with progress
DISCORD_TOKEN=xxx clawtext-ingest-discord fetch-discord --forum-id FORUM_ID
# 5. Rebuild ClawText clusters
clawtext-ingest rebuild
File Ingestion
clawtext-ingest ingest-files --input="docs/*.md" --project="docs"
Node.js API
import { ClawTextIngest } from 'clawtext-ingest';
const ingest = new ClawTextIngest();
// Ingest files
await ingest.fromFiles(['docs/**/*.md'], { project: 'docs', type: 'fact' });
// Ingest JSON
await ingest.fromJSON(chatArray, { project: 'team' }, {
keyMap: { contentKey: 'message', dateKey: 'timestamp', authorKey: 'user' }
});
// Rebuild clusters for RAG injection
await ingest.rebuildClusters();
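The keyMap option in the fromJSON call above tells the ingester which fields of each JSON item hold the content, date, and author. A minimal sketch of what such a mapping might do follows; `applyKeyMap`, `normalizeChat`, and the default key names are assumptions for illustration, not the library's implementation.

```javascript
// Hypothetical normaliser: picks content/date/author out of an arbitrary chat item.
function applyKeyMap(item, { contentKey = 'content', dateKey = 'date', authorKey = 'author' } = {}) {
  return {
    content: item[contentKey],
    date: item[dateKey],
    author: item[authorKey],
  };
}

// Maps a whole export and drops entries that have no usable text content.
function normalizeChat(items, keyMap) {
  return items
    .map((item) => applyKeyMap(item, keyMap))
    .filter((entry) => typeof entry.content === 'string' && entry.content.trim() !== '');
}

const chatArray = [
  { message: 'Ship it on Friday', timestamp: '2024-01-15T10:00:00Z', user: 'ana' },
  { message: '', timestamp: '2024-01-15T10:01:00Z', user: 'bo' },
];
const keyMap = { contentKey: 'message', dateKey: 'timestamp', authorKey: 'user' };
console.log(normalizeChat(chatArray, keyMap));
```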
🤖 Agent Integration (6 Patterns)
Pattern 1: Direct API
For: In-agent code
Use when: Agents need to ingest as part of workflow
const ingest = new ClawTextIngest();
await ingest.fromFiles(['docs/**/*.md'], { project: 'docs' });
Pattern 2: Discord Agent
For: Autonomous Discord ingestion
Use when: Agents need to fetch Discord forums
const runner = new DiscordIngestionRunner(ingest);
await runner.ingestForumAutonomous({
forumId, mode: 'batch', token: process.env.DISCORD_TOKEN
});
Pattern 3: CLI Subprocess
For: Agents executing commands
Use when: Simpler CLI-based execution needed
// execAsync is assumed to be promisify(exec), using node:util and node:child_process
await execAsync('clawtext-ingest-discord fetch-discord --forum-id ID');
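When the forum ID comes from user or agent input, passing it as an argument array avoids shell interpolation. The sketch below uses Node's built-in `execFile` for that; it relies only on the flags shown in this README (`--forum-id`, `--mode`, `--batch-size`, `--verbose`), while the helper names `buildFetchArgs` and `fetchForum` are hypothetical. It is written as CommonJS so it runs as a standalone script.

```javascript
const { execFile } = require('node:child_process');
const { promisify } = require('node:util');

const execFileAsync = promisify(execFile);

// Builds the argument list for `clawtext-ingest-discord fetch-discord`.
function buildFetchArgs({ forumId, mode, batchSize, verbose = false }) {
  if (!forumId) throw new Error('forumId is required');
  const args = ['fetch-discord', '--forum-id', String(forumId)];
  if (mode) args.push('--mode', mode);
  if (batchSize) args.push('--batch-size', String(batchSize));
  if (verbose) args.push('--verbose');
  return args;
}

// Runs the CLI without a shell, so the forum ID is never parsed as shell syntax.
async function fetchForum(options, token = process.env.DISCORD_TOKEN) {
  if (!token) throw new Error('DISCORD_TOKEN is not set');
  const { stdout } = await execFileAsync('clawtext-ingest-discord', buildFetchArgs(options), {
    env: { ...process.env, DISCORD_TOKEN: token },
  });
  return stdout;
}

console.log(buildFetchArgs({ forumId: '123', mode: 'batch', batchSize: 100 }));
```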
Pattern 4: Cron/Scheduled
For: Recurring tasks
Use when: Daily/hourly ingestion needed
// '0 * * * *' fires at minute 0 of every hour
cron.schedule('0 * * * *', () => agentIngest());
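With recurring ingestion, a slow run can still be in progress when the next tick fires. One possible guard, sketched here with no scheduler dependency, wraps the task so overlapping invocations are skipped. The wrapper name `skipIfRunning` is hypothetical, and `agentIngest` stands for whatever function the schedule calls.

```javascript
// Wraps an async task so a new call is skipped while the previous one is still running.
function skipIfRunning(task) {
  let running = false;
  return async (...args) => {
    if (running) return { skipped: true };
    running = true;
    try {
      return { skipped: false, result: await task(...args) };
    } finally {
      // Always clear the flag, even if the task throws.
      running = false;
    }
  };
}

// Usage with the schedule above (agentIngest is the caller's own function):
//   const guardedIngest = skipIfRunning(agentIngest);
//   cron.schedule('0 * * * *', guardedIngest);
```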
Pattern 5: Batch Multi-Source
For: Unified ingestion
Use when: Multiple sources in one operation
await ingest.ingestAll([
{ type: 'files', data: ['docs/**/*.md'], metadata: {...} },
{ type: 'json', data: chatExport, metadata: {...} }
]);
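Conceptually, a multi-source call like the one above routes each entry to a handler based on its type. The sketch below shows one way that routing could look; `ingestSources` and the handler map are hypothetical stand-ins, not the internals of `ingestAll`.

```javascript
// Hypothetical dispatcher: routes each source to a handler keyed by its type
// and records per-source success or failure instead of aborting the whole run.
async function ingestSources(sources, handlers) {
  const results = [];
  for (const { type, data, metadata = {} } of sources) {
    if (!Object.hasOwn(handlers, type)) {
      results.push({ type, ok: false, error: `no handler for type "${type}"` });
      continue;
    }
    try {
      results.push({ type, ok: true, value: await handlers[type](data, metadata) });
    } catch (err) {
      results.push({ type, ok: false, error: err.message });
    }
  }
  return results;
}

// Demo handlers that only count what they were given.
const demoHandlers = {
  files: async (globs) => globs.length,
  json: async (items) => items.length,
};
const demoRun = ingestSources(
  [
    { type: 'files', data: ['docs/**/*.md'], metadata: { project: 'docs' } },
    { type: 'json', data: [{ message: 'hi' }], metadata: { project: 'team' } },
    { type: 'rss', data: [] },
  ],
  demoHandlers,
);
demoRun.then((results) => console.log(results));
```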
Pattern 6: Discord Thread
For: Thread-specific ingestion
Use when: Single thread fetch needed
await runner.ingestThread(threadId);
→ See AGENT_GUIDE.md for complete examples
📊 Real-World Examples
Example 1: Daily Documentation Sync
async function syncDocsDaily() {
const ingest = new ClawTextIngest();
const result = await ingest.ingestAll([
{ type: 'files', data: ['docs/**/*.md'], metadata: { project: 'docs' } },
{ type: 'urls', data: ['https://docs.example.com/api'], metadata: { project: 'api-docs' } }
]);
await ingest.rebuildClusters();
return result;
}
Example 2: Discord Forum Monitoring
async function monitorDiscordForum(forumId) {
const ingest = new ClawTextIngest();
const runner = new DiscordIngestionRunner(ingest);
const result = await runner.ingestForumAutonomous({
forumId,
mode: 'batch',
token: process.env.DISCORD_TOKEN,
onProgress: (p) => console.log(`${p.percent}% complete...`)
});
return result;
}
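Forum ingestion is described as preserving the post↔reply structure in metadata. As an illustration of what that could mean for the stored records, the sketch below flattens a forum post and its replies into rows that each point back to their parent. The input shape and the field names (kind, postId, parentId, replyTo) are assumptions, not the metadata schema the tool emits.

```javascript
// Hypothetical flattening: one record per message, each keeping a link to its parent.
function flattenForumPost(post) {
  const root = {
    id: post.id,
    kind: 'post',
    postId: post.id,
    parentId: null,
    content: post.content,
  };
  const replies = (post.replies ?? []).map((reply) => ({
    id: reply.id,
    kind: 'reply',
    postId: post.id,
    // Replies without an explicit parent attach to the forum post itself.
    parentId: reply.replyTo ?? post.id,
    content: reply.content,
  }));
  return [root, ...replies];
}

const records = flattenForumPost({
  id: 'p1',
  content: 'How do we rotate tokens?',
  replies: [
    { id: 'r1', content: 'Use the admin panel.' },
    { id: 'r2', content: 'Which tab?', replyTo: 'r1' },
  ],
});
console.log(records);
```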
Example 3: Team Decisions Ingestion
async function ingestTeamDecisions() {
const ingest = new ClawTextIngest();
const result = await ingest.ingestAll([
{ type: 'files', data: ['decisions/adr/**/*.md'], metadata: { type: 'adr' } },
{ type: 'json', data: slackThread, metadata: { type: 'decision', source: 'slack' } }
]);
await ingest.rebuildClusters();
return result;
}
🛒 CLI Commands
clawtext-ingest — File/URL/JSON/Text Ingestion
clawtext-ingest ingest-files --input="docs/*.md" --project="docs" --verbose
clawtext-ingest ingest-urls --input="https://example.com" --project="research"
clawtext-ingest ingest-json --input=messages.json --source="slack"
clawtext-ingest ingest-text --input="Finding: X is better than Y" --project="findings"
clawtext-ingest batch --config=sources.json
clawtext-ingest rebuild
clawtext-ingest status
clawtext-ingest-discord — Discord Integration
# Inspect forum
clawtext-ingest-discord describe-forum --forum-id FORUM_ID --verbose
# Fetch & ingest
DISCORD_TOKEN=xxx clawtext-ingest-discord fetch-discord \
--forum-id FORUM_ID \
--mode batch \
--batch-size 100 \
--verbose
📚 Documentation
| Document | Purpose | Read Time |
|---|---|---|
| README.md | Overview + quick start | 5 min |
| QUICKSTART.md | 5-minute setup | 5 min |
| AGENT_GUIDE.md | 6 autonomous patterns | 10 min |
| API_REFERENCE.md | Complete API docs | 15 min |
| PHASE2_CLI_GUIDE.md | CLI commands | 10 min |
| DISCORD_BOT_SETUP.md | Bot creation | 5 min |
| CLAYHUB_GUIDE.md | Publication | 5 min |
| INDEX.md | Documentation index | 2 min |
🎯 Who Should Use This
- ✅ AI/Agent developers — Building knowledge-aware agents
- ✅ RAG engineers — Populating memory for context injection
- ✅ Teams using Discord — Leveraging Discord as knowledge base
- ✅ DevOps/MLOps — Automated knowledge ingestion pipelines
- ✅ Researchers — Structuring unstructured data sources
⚡ Performance
| Operation | Approx. Time | Notes |
|---|---|---|
| Ingest 100 files | ~5 sec | With SHA1 dedup check |
| Ingest 1000 JSON items | ~15 sec | Batch processing |
| Small forum (<100 msgs) | ~10 sec | Full mode |
| Large forum (1000+ msgs) | ~2 min | Auto-batch, streaming |
| Rebuild clusters | ~5-30 sec | Depends on total memories |
✅ Quality Metrics
| Metric | Value |
|---|---|
| Tests | 22/22 passing ✅ |
| Code | 1,254 production lines |
| Documentation | 92 KB across 11 guides |
| Examples | 20+ working examples |
| Coverage | 100% critical paths |
🔗 Integration with ClawText
1. Ingest data → Creates memories with YAML metadata
2. Rebuild clusters → ClawText indexes new memories
3. RAG layer → Relevant context injected on next prompt
4. Agent response → Enhanced with contextual information
# Complete workflow
clawtext-ingest-discord fetch-discord --forum-id ID # Step 1
clawtext-ingest rebuild # Step 2
# Steps 3-4 are automatic (ClawText + Agent)
🆘 Support
- Documentation: See INDEX.md for navigation
- Issues: https://github.com/ragesaq/clawtext-ingest/issues
- Examples: 20+ examples in documentation
- Troubleshooting: Built into each guide
📦 Installation & Requirements
Requirements:
- Node.js ≥ 18.0.0
- OpenClaw (for agent patterns)
- ClawText ≥ 1.2.0 (for RAG integration)
Installation:
npm install clawtext-ingest
# or
openclaw install clawtext-ingest
Binaries:
- clawtext-ingest — File/URL/JSON ingestion
- clawtext-ingest-discord — Discord integration
🚀 Why This Over Alternatives
| Feature | ClawText-Ingest | Manual | Generic Importer | API Tool |
|---|---|---|---|---|
| Discord native | ✅ | ❌ | ❌ | ❌ |
| Deduplication | ✅ | ❌ | Partial | ❌ |
| Agent patterns | ✅ | ❌ | ❌ | ❌ |
| Metadata auto | ✅ | ❌ | Partial | ❌ |
| ClawText integration | ✅ | ❌ | ❌ | ❌ |
| Idempotent | ✅ | ❌ | ❌ | Partial |
📄 License
MIT — Use freely, open source, community supported
🙌 Contributing
Contributions welcome! See GitHub issues for current priorities.
Ready to ingest? Start with QUICKSTART.md (5 min) or AGENT_GUIDE.md if you're building agents.