file-splitter

Split large files into smaller chunks with semantic boundary detection. Supports JSON, Markdown, and TXT formats. Preserves data integrity by splitting at natural boundaries (JSON array elements, MD headings, TXT paragraphs).

Use when: user needs to split large files, chunk datasets, segment corpora, or break down files into manageable pieces for processing or analysis.

Triggers: split file, chunk, segment, file splitter, JSON split, MD split, TXT split, corpus segmentation, data chunking.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "file-splitter" with this command: npx skills add expeditionhub/file-splitter

File Splitter - Universal File Splitting Tool

Split large files into smaller, manageable chunks while preserving semantic structure.

Quick Start

python <skill_dir>/scripts/split_files.py --input <input_folder> --output <output_folder> [options]

Parameters

| Parameter | Required | Default | Description |
|---|---|---|---|
| --input | Yes | - | Source folder containing files to split |
| --output | Yes | - | Output folder for split chunks |
| --max-size | No | 512000 (500KB) | Maximum bytes per chunk |
| --min-size | No | 409600 (400KB) | Minimum bytes per chunk |
| --seq-digits | No | 9 | Number of digits in sequence numbers |
| --formats | No | json,md,txt | File formats to process (comma-separated) |
| --dry-run | No | false | Preview mode: show what would be split without executing |
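
To make the flags and defaults above concrete, here is a minimal sketch of how such a command line could be parsed with argparse. The actual split_files.py is not shown in this listing, so build_parser and parse_formats are hypothetical names, and treating --dry-run as a boolean switch is an assumption based on how it is used in the examples.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(prog="split_files.py")
    parser.add_argument("--input", required=True,
                        help="Source folder containing files to split")
    parser.add_argument("--output", required=True,
                        help="Output folder for split chunks")
    parser.add_argument("--max-size", type=int, default=512000,
                        help="Maximum bytes per chunk")
    parser.add_argument("--min-size", type=int, default=409600,
                        help="Minimum bytes per chunk")
    parser.add_argument("--seq-digits", type=int, default=9,
                        help="Number of digits in sequence numbers")
    parser.add_argument("--formats", default="json,md,txt",
                        help="File formats to process (comma-separated)")
    # Assumption: the flag is a switch that takes no value.
    parser.add_argument("--dry-run", action="store_true",
                        help="Show what would be split without executing")
    return parser


def parse_formats(value: str) -> list[str]:
    """Turn 'json,md,txt' into ['json', 'md', 'txt'], ignoring blanks and dots."""
    return [item.strip().lower().lstrip(".")
            for item in value.split(",") if item.strip()]


args = build_parser().parse_args(
    ["--input", "./notes", "--output", "./notes/chunks",
     "--max-size", "204800", "--min-size", "153600"]
)
formats = parse_formats(args.formats)
```

With the arguments above, any flag that is not passed keeps its documented default (9 sequence digits, all three formats, dry-run off).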

Examples

# Default 500KB split
python split_files.py --input "./corpus" --output "./corpus/chunks"

# Custom 200KB chunks
python split_files.py --input "./notes" --output "./notes/chunks" --max-size 204800 --min-size 153600

# JSON files only
python split_files.py --input "./data" --output "./data/out" --formats json

# Preview mode
python split_files.py --input "./data" --output "./data/out" --dry-run
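
The byte values used in the parameters and examples follow from 1 KB = 1024 bytes. A small helper (kb_to_bytes is a hypothetical name, not part of the skill) shows the arithmetic:

```python
def kb_to_bytes(kilobytes: int) -> int:
    """Convert a size in KB to bytes, using 1 KB = 1024 bytes."""
    return kilobytes * 1024


default_max = kb_to_bytes(500)  # 512000, the --max-size default
default_min = kb_to_bytes(400)  # 409600, the --min-size default
custom_max = kb_to_bytes(200)   # 204800, used in the 200KB example
custom_min = kb_to_bytes(150)   # 153600, used in the 200KB example
```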

Splitting Rules

JSON Files

  • Splits at JSON array element boundaries
  • Each chunk is a valid JSON array [...]
  • Automatically extracts list values if top-level is an object
  • Never cuts individual records in half
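
The JSON rules above can be sketched as a greedy grouping of whole records. This is an illustration under stated assumptions, not the skill's actual code: extract_records and split_json_records are hypothetical names, "extracts list values" is read here as "take the first list found in the top-level object", and the role of --min-size is left out because the listing does not describe it.

```python
import json


def extract_records(data):
    """Return the records to split: a top-level array, or (assumption)
    the first list value found in a top-level object."""
    if isinstance(data, list):
        return data
    if isinstance(data, dict):
        for value in data.values():
            if isinstance(value, list):
                return value
    raise ValueError("no JSON array found to split")


def split_json_records(records, max_size):
    """Group whole records so that each chunk, serialized with
    json.dumps(chunk, ensure_ascii=False), stays within max_size bytes
    whenever a single record allows it."""
    chunks, current = [], []
    current_size = 2  # the enclosing "[" and "]"
    for record in records:
        encoded = len(json.dumps(record, ensure_ascii=False).encode("utf-8"))
        extra = encoded + (2 if current else 0)  # ", " between elements
        if current and current_size + extra > max_size:
            chunks.append(current)
            current, current_size = [], 2
            extra = encoded
        # A record larger than max_size still goes in whole, as its own chunk.
        current.append(record)
        current_size += extra
    if current:
        chunks.append(current)
    return chunks


records = extract_records(
    {"meta": "demo", "items": [{"id": i, "text": "x" * 10} for i in range(5)]}
)
chunks = split_json_records(records, max_size=70)
```

Every chunk is a list of complete records, so dumping each one yields a valid JSON array and no record is cut in half.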

Markdown Files

  • Splits at heading boundaries (# through ######)
  • Each chunk maintains complete heading structure
  • Never cuts content within a heading section
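
As a rough illustration of heading-boundary splitting, the sketch below breaks a Markdown string into sections that each start at a heading line. The function name and the regular expression are assumptions; in particular, this simple pattern would also treat a "# comment" line inside a fenced code block as a heading, which a real implementation may handle differently.

```python
import re

# One to six "#" characters followed by whitespace, at the start of a line.
HEADING = re.compile(r"^#{1,6}\s")


def split_markdown_sections(text):
    """Split text into sections that each begin at a heading line.
    Any content before the first heading becomes its own section."""
    sections, current = [], []
    for line in text.splitlines(keepends=True):
        if HEADING.match(line) and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))
    return sections


sample = "intro\n# A\nbody a\n## B\nbody b\n#hashtag is not a heading\n"
sections = split_markdown_sections(sample)
```

Sections produced this way can then be grouped into chunks up to the size limit without ever cutting inside a section.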

TXT Files

  • Prefers splitting at paragraph boundaries (empty lines)
  • Falls back to line-by-line splitting if no paragraphs exist
  • Never cuts within a paragraph
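
The TXT rules can be sketched in two steps: cut the text into paragraphs (or single lines when there are no blank lines), then pack those units greedily up to the size limit. Both function names are hypothetical, and the packing ignores --min-size because its exact effect is not described here.

```python
def split_text_units(text):
    """Return paragraphs separated by blank lines, or single lines when
    the text contains no blank line at all. Separators stay attached to
    the preceding unit, so joining the units restores the input."""
    lines = text.splitlines(keepends=True)
    if not any(line.strip() == "" for line in lines):
        return lines  # fallback: line-by-line splitting
    units, current = [], []
    previous_blank = False
    for line in lines:
        is_blank = line.strip() == ""
        # A non-blank line right after a blank one starts a new paragraph.
        if previous_blank and not is_blank and current:
            units.append("".join(current))
            current = []
        current.append(line)
        previous_blank = is_blank
    if current:
        units.append("".join(current))
    return units


def pack_units(units, max_size):
    """Greedily join whole units into chunks of at most max_size bytes.
    A unit that is larger than max_size becomes a chunk of its own."""
    chunks, current, size = [], [], 0
    for unit in units:
        unit_size = len(unit.encode("utf-8"))
        if current and size + unit_size > max_size:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(unit)
        size += unit_size
    if current:
        chunks.append("".join(current))
    return chunks


text = "aaaa\n\nbbbb\n\n" + "c" * 50 + "\n\ndd\n"
units = split_text_units(text)
chunks = pack_units(units, max_size=12)
```

In this example the 52-byte paragraph exceeds the 12-byte limit and ends up alone in the second chunk rather than being cut.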

Output Naming Convention

Format: {source_filename_without_extension}{sequence_number}{extension}, where the sequence number is zero-padded to the --seq-digits width (9 digits by default)

Examples:

  • dataset000000001.json
  • dataset000000002.json
  • notes000000001.md
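
The naming pattern can be reproduced with a zero-padded format specifier. chunk_name is a hypothetical helper written for illustration; the numbering starts at 1 to match the examples above.

```python
from pathlib import Path


def chunk_name(source, index, seq_digits=9):
    """Build '<stem><zero-padded index><extension>' for a chunk."""
    path = Path(source)
    return f"{path.stem}{index:0{seq_digits}d}{path.suffix}"


first = chunk_name("dataset.json", 1)
second = chunk_name("corpus/dataset.json", 2)
short = chunk_name("notes.md", 1, seq_digits=4)
```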

Safety Features

  1. Source File Preservation: Read-only access to source files; never deletes or modifies originals
  2. Duplicate Detection: Automatically skips files that already have N-digit sequence suffixes to avoid re-splitting
  3. Small File Skip: Files ≤ max-size are automatically skipped (no need to split)
  4. Sequential Processing: Processes files one at a time to ensure stability
  5. Data Validation: Compares total size/record count before and after splitting; reports verification results
  6. UTF-8 Encoding: Forces UTF-8 for all read/write operations to avoid encoding issues on Windows
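
Three of these checks are easy to sketch in isolation. The helpers below are hypothetical and only illustrate the idea: the suffix test assumes that "N-digit" means the --seq-digits width, and it would also skip an unrelated file whose name happens to end in that many digits.

```python
import re
from pathlib import Path


def looks_like_chunk(filename, seq_digits=9):
    """True when the name (without extension) ends in seq_digits digits,
    which suggests the file is the output of an earlier split."""
    return re.search(rf"\d{{{seq_digits}}}$", Path(filename).stem) is not None


def needs_split(size_bytes, max_size=512000):
    """Files at or below max_size are skipped."""
    return size_bytes > max_size


def record_counts_match(original_records, chunks):
    """Compare the record count before and after splitting."""
    return len(original_records) == sum(len(chunk) for chunk in chunks)


skip_existing = looks_like_chunk("dataset000000001.json")
skip_small = not needs_split(512000)
verified = record_counts_match([1, 2, 3, 4, 5], [[1, 2], [3, 4], [5]])
```

For the encoding rule, passing encoding="utf-8" explicitly to Python's built-in open() avoids depending on the platform default, which on Windows is often not UTF-8.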

Notes

  • Console may display garbled Chinese characters on Windows, but functionality is unaffected
  • If a single data block/paragraph exceeds max-size, it becomes its own chunk (integrity takes priority over size limits)
  • Output folder is automatically created if it doesn't exist
  • License: MIT-0

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
