# clean-csv-toolkit

v0.1.0
A small, honest toolkit for the work agents end up doing constantly: read a CSV someone sent you, work out what's in it, clean it up, and forward only the safe rows downstream. Built on the Python 3 standard library only. No pandas, no numpy, no pip installs, no remote calls.
## What this skill does
- `scripts/inspect.py` — profile a `.csv`/`.tsv`/`.jsonl` file: row count, auto-detected column types (`int`, `float`, `bool`, `date`, `datetime`, `string`, `empty`), null counts per column, distinct value counts (capped), three sample values per column, file size, and detected encoding.
- `scripts/validate.py` — check the file against a small JSON schema (required columns, per-column type, min/max, enum, regex, unique). Exits 0/1 so it slots into CI.
- `scripts/dedupe.py` — remove duplicate rows by full-row match or by key columns. Optional `--keep first|last`, `--case-insensitive`, `--trim`, and a JSONL report of every removed row.
- `scripts/diff.py` — compare two files by key column(s) and classify every row as added / removed / changed / unchanged, with a per-column before/after diff for changed rows.
- `scripts/convert.py` — convert between CSV, TSV, JSON Lines, JSON array, and GitHub-flavored Markdown table.
- `scripts/check_deps.sh` — verify `python3` is available.
## What this skill does not do
- It does not call any LLM, web service, or remote API.
- It does not load a full dataframe into memory just to do simple structural work; the helpers stream rows where possible.
- It does not write outside the input/output paths the caller provides.
- It does not do statistical analysis (mean, percentile, correlation). For that, use a dataframe library.
- It does not parse Excel files (.xls / .xlsx). Export to CSV first.
## Required dependencies
```bash
bash scripts/check_deps.sh
```
Only `python3` is required. The skill uses `csv`, `json`, `re`, `pathlib`, `argparse`, `datetime`, `collections` — all stdlib.
## Workflows
### 1. Profile an unknown CSV
```bash
python3 scripts/inspect.py customers.csv
```
Output:
```
file: /path/customers.csv
size: 284 B (284 bytes)
encoding: utf-8
kind: csv
rows: 5
columns: 6

#  name         type    nulls  null%  distinct  sample
----------------------------------------------------------------------------------------------------
1  id           int     0      0.00   5         '1', '2', '3'
2  email        string  0      0.00   5         'alice@example.com', ...
3  name         string  0      0.00   5         'Alice', 'Bob', 'Carol'
4  amount       float   1      20.00  4         '42.50', '100.00', '7.25'
5  status       string  0      0.00   3         'approved', 'pending', ...
6  signup_date  date    0      0.00   5         '2025-01-15', ...
```
Pass `--json` for machine-readable output that pipes into other tools.
The script auto-detects the dialect (CSV vs TSV vs JSON Lines) and a sensible encoding (`utf-8`, `utf-8-sig`, `cp1252`, `latin-1`). Type inference takes up to 1000 non-empty values per column and picks the most specific type that fits all of them.
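The most-specific-type rule can be pictured with a short sketch. This is illustrative only; the regexes and their ordering here are assumptions, not the patterns `inspect.py` actually ships with:

```python
import re

# Ordered from most to least specific; the first shape that matches every
# sampled value wins. Illustrative assumptions, not inspect.py's patterns.
PATTERNS = [
    ("int",      re.compile(r"[+-]?\d+")),
    ("float",    re.compile(r"[+-]?(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?")),
    ("bool",     re.compile(r"true|false", re.IGNORECASE)),
    ("date",     re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("datetime", re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?")),
]

def infer_type(values, cap=1000):
    """Pick the most specific type that fits all non-empty samples."""
    samples = [v.strip() for v in values if v.strip()][:cap]
    if not samples:
        return "empty"
    for name, pattern in PATTERNS:
        if all(pattern.fullmatch(v) for v in samples):
            return name
    return "string"
```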
### 2. Validate against a schema
Write a `schema.json`:
```json
{
  "required_columns": ["id", "email", "amount", "status"],
  "columns": {
    "id": {"type": "int", "required": true, "unique": true, "min": 1},
    "email": {"type": "string", "required": true, "regex": ".+@.+\\..+"},
    "amount": {"type": "float", "min": 0, "max": 100000},
    "status": {"type": "string", "enum": ["pending", "approved", "rejected"]},
    "signup_date": {"type": "date"}
  }
}
```
Then:
```bash
python3 scripts/validate.py customers.csv --schema schema.json
```
A clean file exits 0 with `verdict: pass`. A bad file exits 1 with a detailed error table:
```
row  column  kind              detail
------------------------------------------------------------------------------------------------
2    email   regex_mismatch    value did not match regex | value='not-an-email'
2    amount  bad_type          value does not match type 'float' | value='abc'
3    amount  below_min         value -50.0 < min 0 | value='-50.00'
3    status  not_in_enum       value not in allowed set | value='unknown_status'
4    id      duplicate_unique  value already seen earlier in this column | value='1'
```
Pass `--json` for a structured report and `--max-errors N` to cap error collection on huge files.
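Conceptually, each cell is checked against its column's rules, accumulating `(kind, detail)` pairs like the ones in the table above. A minimal sketch of that per-cell loop, using a hypothetical `check_cell` helper rather than `validate.py`'s real internals:

```python
import re

def check_cell(value, rules, seen):
    """Collect (kind, detail) violations for one cell.

    `rules` is one column's schema entry; `seen` holds values already
    observed for unique columns. Hypothetical helper, illustrative only.
    """
    errors = []
    if "enum" in rules and value not in rules["enum"]:
        errors.append(("not_in_enum", f"value not in allowed set | value={value!r}"))
    if "regex" in rules and not re.fullmatch(rules["regex"], value):
        errors.append(("regex_mismatch", f"value did not match regex | value={value!r}"))
    if rules.get("type") == "float":
        try:
            number = float(value)
        except ValueError:
            errors.append(("bad_type", f"value does not match type 'float' | value={value!r}"))
        else:
            if "min" in rules and number < rules["min"]:
                errors.append(("below_min", f"value {number} < min {rules['min']} | value={value!r}"))
    if rules.get("unique"):
        if value in seen:
            errors.append(("duplicate_unique", f"value already seen earlier in this column | value={value!r}"))
        seen.add(value)
    return errors
```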
### 3. Remove duplicates
By full-row match (any two rows identical in every column):
```bash
python3 scripts/dedupe.py messy.csv clean.csv
```
By a key column (only one canonical row per `id`):
```bash
python3 scripts/dedupe.py messy.csv clean.csv --key id \
    --removed-report removed.jsonl
```
`--keep first` (the default) keeps the earlier-occurring row; `--keep last` keeps the later one — useful when later rows are corrections. `--case-insensitive` and `--trim` normalise key values before comparison so " alice@example.com" and "ALICE@example.com" collapse to one row.
The `--removed-report` writes one JSON object per removed row, with the original 1-based row index, the key tuple that was duplicated, and the full row, so the dedup decision is auditable.
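The effect of `--trim` and `--case-insensitive` is easiest to see as key normalisation before comparison. A sketch with a hypothetical `normalise_key` helper, not `dedupe.py`'s code:

```python
def normalise_key(row, key_columns, case_insensitive=False, trim=False):
    """Build the comparison tuple for one row (hypothetical helper)."""
    parts = []
    for column in key_columns:
        value = row[column]
        if trim:
            value = value.strip()         # --trim
        if case_insensitive:
            value = value.casefold()      # --case-insensitive
        parts.append(value)
    return tuple(parts)

row_a = {"email": " alice@example.com"}
row_b = {"email": "ALICE@example.com"}
key = lambda r: normalise_key(r, ["email"], case_insensitive=True, trim=True)
assert key(row_a) == key(row_b)           # the two rows collapse to one
```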
### 4. Diff two files
```bash
python3 scripts/diff.py customers_old.csv customers_new.csv --key id
```
Output:
```
added: 1
removed: 1
changed: 1
--- ADDED (1) ---
+ 6
--- REMOVED (1) ---
- 4
--- CHANGED (1) ---
~ 2
    amount: '100.00' -> '150.00'
    status: 'pending' -> 'approved'
```
Multi-column keys are supported: `--key customer_id,date`. Exit codes are 0 if the files are identical on the key columns, 1 if they differ — so this also works as a CI guard ("fail the build if the snapshot file changed").
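Conceptually, the diff builds one dict per file keyed by the key tuple and classifies the union of keys. A minimal sketch under the same in-memory assumption the Performance section describes:

```python
def classify(old_rows, new_rows, key_columns):
    """Label every key as added / removed / changed / unchanged. Illustrative."""
    def make_key(row):
        return tuple(row[c] for c in key_columns)

    old = {make_key(r): r for r in old_rows}
    new = {make_key(r): r for r in new_rows}
    labels = {}
    for k in old.keys() | new.keys():
        if k not in old:
            labels[k] = "added"
        elif k not in new:
            labels[k] = "removed"
        elif old[k] != new[k]:
            labels[k] = "changed"  # per-column diff: compare old[k][col] vs new[k][col]
        else:
            labels[k] = "unchanged"
    return labels
```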
### 5. Convert between formats
```bash
python3 scripts/convert.py data.csv data.jsonl        # row -> JSON Lines
python3 scripts/convert.py data.jsonl data.csv        # back
python3 scripts/convert.py data.csv data.json --pretty
python3 scripts/convert.py data.csv data.md           # GitHub-flavored table
python3 scripts/convert.py data.tsv data.csv          # delimiter change
```
Output format is picked from the extension. Allowed extensions: `.csv`, `.tsv`, `.jsonl`, `.json`, `.md`. The Markdown writer escapes `|` and `\n` in cell values so the table stays well-formed.
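The escaping rule is small enough to show directly. This sketch assumes `|` is backslash-escaped and newlines become spaces (matching the Known limitations section); `escape_md_cell` is an illustrative helper, not `convert.py`'s function:

```python
def escape_md_cell(value: str) -> str:
    """Keep a GitHub-flavored Markdown table well-formed for any cell value."""
    return value.replace("|", "\\|").replace("\n", " ")

header = ["id", "note"]
rows = [["1", "has | pipe\nand newline"]]
lines = ["| " + " | ".join(header) + " |",
         "| " + " | ".join("---" for _ in header) + " |"]
for row in rows:
    lines.append("| " + " | ".join(escape_md_cell(cell) for cell in row) + " |")
print("\n".join(lines))
```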
## Exit codes
| Code | Meaning |
|---|---|
| 0 | success / validation pass / files identical |
| 1 | validation fail / files differ / no rows in input |
| 2 | bad arguments / unsafe path / missing input / unsupported extension / schema malformed |
This 0/1/2 split is consistent across all five scripts, so they slot into shell pipelines cleanly:
```bash
python3 scripts/validate.py incoming.csv --schema schema.json \
  && python3 scripts/dedupe.py incoming.csv clean.csv --key id \
  && python3 scripts/inspect.py clean.csv
```
## Safety properties
- Pure Python 3 standard library. No third-party dependencies.
- No `subprocess` calls. No shell invocation.
- All file paths are validated against a strict allowlist regex that rejects shell metacharacters (`;`, `|`, `&`, `>`, `<`, `$`, `` ` ``, backslash-newline, etc.).
- Scripts only read the input paths the caller provides and write to the output paths the caller provides. No temp files outside the system's tempdir.
- All inputs and outputs use UTF-8 by default; CSV reads automatically fall back through `utf-8-sig`, `cp1252`, and `latin-1` when the file's encoding is not UTF-8 (see the sketch after this list).
- Deterministic: the same input produces the same output every time.
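A minimal sketch of that encoding fallback chain. Illustrative only; trying `utf-8-sig` before plain `utf-8` (so a BOM is stripped rather than kept as a stray character) is an assumption about the ordering:

```python
from pathlib import Path

CANDIDATES = ("utf-8-sig", "utf-8", "cp1252", "latin-1")

def decode_with_fallback(path):
    """Try candidate encodings in order; latin-1 decodes any byte
    sequence, so the chain always terminates."""
    data = Path(path).read_bytes()
    for encoding in CANDIDATES:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue

text, encoding = decode_with_fallback("customers.csv")
```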
## Performance
- `inspect.py` profiles 10,000 rows in well under one second on a single core (single-pass streaming read).
- All scripts stream rows; they do not load the entire file into memory for processing. The exceptions are `dedupe.py` and `diff.py`, which build an in-memory dict keyed by row identity — fine for hundreds of thousands of rows on a typical laptop.
- No background threads, no process pool, no caching.
## Known limitations
- Type inference uses regex-shape matching, not locale-aware parsing. `"1,234.56"` is detected as `string`, not `float`. Re-export with a different number format if you need different inference.
- The Markdown writer flattens multi-line cells to single lines (newlines become spaces).
- JSON Lines input must have one JSON object per line. Multi-line JSON arrays are not supported; use the regular CSV/JSONL pipeline (a stdlib workaround is sketched below).
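If you are handed a multi-line JSON array, a stdlib one-off can reshape it into JSON Lines first. A sketch assuming the array fits in memory and holds flat objects:

```python
import json

def array_to_jsonl(src, dst):
    """Rewrite a JSON array file as one object per line (JSON Lines)."""
    with open(src, encoding="utf-8") as f:
        records = json.load(f)  # whole array in memory
    with open(dst, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

array_to_jsonl("data.json", "data.jsonl")
```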
## License
MIT. See LICENSE.