# clean-csv-toolkit

v0.1.0
A small, honest toolkit for the work agents end up doing constantly: read a CSV someone sent you, work out what's in it, clean it up, and forward only the safe rows downstream. Built on the Python 3 standard library only. No pandas, no numpy, no pip installs, no remote calls.
## What this skill does
- `scripts/inspect.py` — profile a `.csv`/`.tsv`/`.jsonl` file: row count, auto-detected column types (`int`, `float`, `bool`, `date`, `datetime`, `string`, `empty`), null counts per column, distinct value counts (capped), three sample values per column, file size, and detected encoding.
- `scripts/validate.py` — check the file against a small JSON schema (required columns, per-column type, min/max, enum, regex, unique). Exits 0/1 so it slots into CI.
- `scripts/dedupe.py` — remove duplicate rows by full-row match or by key columns. Optional `--keep first|last`, `--case-insensitive`, `--trim`, and a JSONL report of every removed row.
- `scripts/diff.py` — compare two files by key column(s) and classify every row as added / removed / changed / unchanged, with a per-column before/after diff for changed rows.
- `scripts/convert.py` — convert between CSV, TSV, JSON Lines, JSON array, and GitHub-flavored Markdown table.
- `scripts/check_deps.sh` — verify `python3` is available.
## What this skill does not do
- It does not call any LLM, web service, or remote API.
- It does not load a full dataframe into memory just to do simple structural work; the helpers stream rows where possible.
- It does not write outside the input/output paths the caller provides.
- It does not do statistical analysis (mean, percentile, correlation). For that, use a dataframe library.
- It does not parse Excel files (.xls / .xlsx). Export to CSV first.
## Required dependencies
```bash
bash scripts/check_deps.sh
```
Only `python3` is required. The skill uses `csv`, `json`, `re`, `pathlib`, `argparse`, `datetime`, `collections` — all stdlib.
## Workflows
### 1. Profile an unknown CSV
```bash
python3 scripts/inspect.py customers.csv
```
Output:
```
file: /path/customers.csv
size: 284 B (284 bytes)
encoding: utf-8
kind: csv
rows: 5
columns: 6

#  name         type    nulls  null%  distinct  sample
----------------------------------------------------------------------------------------------------
1  id           int     0      0.00   5         '1', '2', '3'
2  email        string  0      0.00   5         'alice@example.com', ...
3  name         string  0      0.00   5         'Alice', 'Bob', 'Carol'
4  amount       float   1      20.00  4         '42.50', '100.00', '7.25'
5  status       string  0      0.00   3         'approved', 'pending', ...
6  signup_date  date    0      0.00   5         '2025-01-15', ...
```
Pass `--json` for machine-readable output that pipes into other tools.
The script auto-detects the dialect (CSV vs TSV vs JSON Lines) and a sensible encoding (`utf-8`, `utf-8-sig`, `cp1252`, `latin-1`). Type inference takes up to 1000 non-empty values per column and picks the most specific type that fits all of them.
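The most-specific-type rule can be pictured with a short sketch. This is illustrative only; the regexes and their ordering here are assumptions, not the patterns `inspect.py` actually ships with:

```python
import re

# Ordered from most to least specific; the first shape that matches every
# sampled value wins. Illustrative assumptions, not inspect.py's patterns.
PATTERNS = [
    ("int",      re.compile(r"[+-]?\d+")),
    ("float",    re.compile(r"[+-]?(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?")),
    ("bool",     re.compile(r"true|false", re.IGNORECASE)),
    ("date",     re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("datetime", re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}(:\d{2})?")),
]

def infer_type(values, cap=1000):
    """Pick the most specific type that fits all non-empty samples."""
    samples = [v.strip() for v in values if v.strip()][:cap]
    if not samples:
        return "empty"
    for name, pattern in PATTERNS:
        if all(pattern.fullmatch(v) for v in samples):
            return name
    return "string"
```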
### 2. Validate against a schema
Write a `schema.json`:
```json
{
  "required_columns": ["id", "email", "amount", "status"],
  "columns": {
    "id": {"type": "int", "required": true, "unique": true, "min": 1},
    "email": {"type": "string", "required": true, "regex": ".+@.+\\..+"},
    "amount": {"type": "float", "min": 0, "max": 100000},
    "status": {"type": "string", "enum": ["pending", "approved", "rejected"]},
    "signup_date": {"type": "date"}
  }
}
```
Then:
```bash
python3 scripts/validate.py customers.csv --schema schema.json
```
A clean file exits 0 with `verdict: pass`. A bad file exits 1 with a detailed error table:
```
row  column  kind              detail
------------------------------------------------------------------------------------------------
2    email   regex_mismatch    value did not match regex | value='not-an-email'
2    amount  bad_type          value does not match type 'float' | value='abc'
3    amount  below_min         value -50.0 < min 0 | value='-50.00'
3    status  not_in_enum       value not in allowed set | value='unknown_status'
4    id      duplicate_unique  value already seen earlier in this column | value='1'
```
Pass `--json` for a structured report and `--max-errors N` to cap error collection on huge files.
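Conceptually, each cell is checked against its column's rules, accumulating `(kind, detail)` pairs like the ones in the table above. A minimal sketch of that per-cell loop, using a hypothetical `check_cell` helper rather than `validate.py`'s real internals:

```python
import re

def check_cell(value, rules, seen):
    """Collect (kind, detail) violations for one cell.

    `rules` is one column's schema entry; `seen` holds values already
    observed for unique columns. Hypothetical helper, illustrative only.
    """
    errors = []
    if "enum" in rules and value not in rules["enum"]:
        errors.append(("not_in_enum", f"value not in allowed set | value={value!r}"))
    if "regex" in rules and not re.fullmatch(rules["regex"], value):
        errors.append(("regex_mismatch", f"value did not match regex | value={value!r}"))
    if rules.get("type") == "float":
        try:
            number = float(value)
        except ValueError:
            errors.append(("bad_type", f"value does not match type 'float' | value={value!r}"))
        else:
            if "min" in rules and number < rules["min"]:
                errors.append(("below_min", f"value {number} < min {rules['min']} | value={value!r}"))
    if rules.get("unique"):
        if value in seen:
            errors.append(("duplicate_unique", f"value already seen earlier in this column | value={value!r}"))
        seen.add(value)
    return errors
```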
### 3. Remove duplicates
By full-row match (any two rows identical in every column):
```bash
python3 scripts/dedupe.py messy.csv clean.csv
```
By a key column (only one canonical row per `id`):
```bash
python3 scripts/dedupe.py messy.csv clean.csv --key id \
    --removed-report removed.jsonl
```
`--keep first` (the default) keeps the earlier-occurring row; `--keep last` keeps the later one — useful when later rows are corrections. `--case-insensitive` and `--trim` normalise key values before comparison so " alice@example.com" and "ALICE@example.com" collapse to one row.
The `--removed-report` writes one JSON object per removed row, with the original 1-based row index, the key tuple that was duplicated, and the full row, so the dedup decision is auditable.
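The effect of `--trim` and `--case-insensitive` is easiest to see as key normalisation before comparison. A sketch with a hypothetical `normalise_key` helper, not `dedupe.py`'s code:

```python
def normalise_key(row, key_columns, case_insensitive=False, trim=False):
    """Build the comparison tuple for one row (hypothetical helper)."""
    parts = []
    for column in key_columns:
        value = row[column]
        if trim:
            value = value.strip()         # --trim
        if case_insensitive:
            value = value.casefold()      # --case-insensitive
        parts.append(value)
    return tuple(parts)

row_a = {"email": " alice@example.com"}
row_b = {"email": "ALICE@example.com"}
key = lambda r: normalise_key(r, ["email"], case_insensitive=True, trim=True)
assert key(row_a) == key(row_b)           # the two rows collapse to one
```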
### 4. Diff two files
```bash
python3 scripts/diff.py customers_old.csv customers_new.csv --key id
```
Output:
```
added: 1
removed: 1
changed: 1
--- ADDED (1) ---
+ 6
--- REMOVED (1) ---
- 4
--- CHANGED (1) ---
~ 2
    amount: '100.00' -> '150.00'
    status: 'pending' -> 'approved'
```
Multi-column keys are supported: `--key customer_id,date`. Exit codes are 0 if the files are identical on the key columns, 1 if they differ — so this also works as a CI guard ("fail the build if the snapshot file changed").
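Conceptually, the diff builds one dict per file keyed by the key tuple and classifies the union of keys. A minimal sketch under the same in-memory assumption the Performance section describes:

```python
def classify(old_rows, new_rows, key_columns):
    """Label every key as added / removed / changed / unchanged. Illustrative."""
    def make_key(row):
        return tuple(row[c] for c in key_columns)

    old = {make_key(r): r for r in old_rows}
    new = {make_key(r): r for r in new_rows}
    labels = {}
    for k in old.keys() | new.keys():
        if k not in old:
            labels[k] = "added"
        elif k not in new:
            labels[k] = "removed"
        elif old[k] != new[k]:
            labels[k] = "changed"  # per-column diff: compare old[k][col] vs new[k][col]
        else:
            labels[k] = "unchanged"
    return labels
```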
### 5. Convert between formats
```bash
python3 scripts/convert.py data.csv data.jsonl        # row -> JSON Lines
python3 scripts/convert.py data.jsonl data.csv        # back
python3 scripts/convert.py data.csv data.json --pretty
python3 scripts/convert.py data.csv data.md           # GitHub-flavored table
python3 scripts/convert.py data.tsv data.csv          # delimiter change
```
Output format is picked from the extension. Allowed extensions: `.csv`, `.tsv`, `.jsonl`, `.json`, `.md`. The Markdown writer escapes `|` and `\n` in cell values so the table stays well-formed.
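The escaping rule is small enough to show directly. This sketch assumes `|` is backslash-escaped and newlines become spaces (matching the Known limitations section); `escape_md_cell` is an illustrative helper, not `convert.py`'s function:

```python
def escape_md_cell(value: str) -> str:
    """Keep a GitHub-flavored Markdown table well-formed for any cell value."""
    return value.replace("|", "\\|").replace("\n", " ")

header = ["id", "note"]
rows = [["1", "has | pipe\nand newline"]]
lines = ["| " + " | ".join(header) + " |",
         "| " + " | ".join("---" for _ in header) + " |"]
for row in rows:
    lines.append("| " + " | ".join(escape_md_cell(cell) for cell in row) + " |")
print("\n".join(lines))
```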
## Exit codes
| Code | Meaning |
|---|---|
| 0 | success / validation pass / files identical |
| 1 | validation fail / files differ / no rows in input |
| 2 | bad arguments / unsafe path / missing input / unsupported extension / schema malformed |
This 0/1/2 split is consistent across all five scripts, so they slot into shell pipelines cleanly:
```bash
python3 scripts/validate.py incoming.csv --schema schema.json \
  && python3 scripts/dedupe.py incoming.csv clean.csv --key id \
  && python3 scripts/inspect.py clean.csv
```
## Safety properties
- Pure Python 3 standard library. No third-party dependencies.
- No `subprocess` calls. No shell invocation.
- All file paths are validated against a strict allowlist regex that rejects shell metacharacters (`;`, `|`, `&`, `>`, `<`, `$`, `` ` ``, backslash-newline, etc.).
- Scripts only read the input paths the caller provides and write to the output paths the caller provides. No temp files outside the system's tempdir.
- All inputs and outputs use UTF-8 by default; CSV reads automatically fall back through `utf-8-sig`, `cp1252`, and `latin-1` when the file's encoding is not UTF-8 (see the sketch after this list).
- Deterministic: the same input produces the same output every time.
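A minimal sketch of that encoding fallback chain. Illustrative only; trying `utf-8-sig` before plain `utf-8` (so a BOM is stripped rather than kept as a stray character) is an assumption about the ordering:

```python
from pathlib import Path

CANDIDATES = ("utf-8-sig", "utf-8", "cp1252", "latin-1")

def decode_with_fallback(path):
    """Try candidate encodings in order; latin-1 decodes any byte
    sequence, so the chain always terminates."""
    data = Path(path).read_bytes()
    for encoding in CANDIDATES:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue

text, encoding = decode_with_fallback("customers.csv")
```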
## Performance
- `inspect.py` profiles 10,000 rows in well under one second on a single core (single-pass streaming read).
- All scripts stream rows; they do not load the entire file into memory for processing. The exceptions are `dedupe.py` and `diff.py`, which build an in-memory dict keyed by row identity — fine for hundreds of thousands of rows on a typical laptop.
- No background threads, no process pool, no caching.
## Known limitations
- Type inference uses regex-shape matching, not locale-aware parsing. `"1,234.56"` is detected as `string`, not `float`. Re-export with a different number format if you need different inference.
- The Markdown writer flattens multi-line cells to single lines (newlines become spaces).
- JSON Lines input must have one JSON object per line. Multi-line JSON arrays are not supported; use the regular CSV/JSONL pipeline (a stdlib workaround is sketched below).
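If you are handed a multi-line JSON array, a stdlib one-off can reshape it into JSON Lines first. A sketch assuming the array fits in memory and holds flat objects:

```python
import json

def array_to_jsonl(src, dst):
    """Rewrite a JSON array file as one object per line (JSON Lines)."""
    with open(src, encoding="utf-8") as f:
        records = json.load(f)  # whole array in memory
    with open(dst, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

array_to_jsonl("data.json", "data.jsonl")
```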
## License
MIT. See LICENSE.