Quick Reference
| Task | Load |
|---|
| Creative writing (voice, dialogue, POV) | writing.md |
| Data processing (CSV, regex, encoding) | data.md |
| Academic/citations (APA, MLA, Chicago) | academic.md |
| Marketing copy (headlines, CTA, email) | copy.md |
| Translation/localization | localization.md |
Universal Text Rules
Encoding
- Always verify encoding first:
file -bi document.txt
- Normalize line endings:
tr -d '\r'
- Remove BOM if present:
sed -i '1s/^\xEF\xBB\xBF//'
Whitespace
- Collapse multiple spaces:
sed 's/[[:space:]]\+/ /g'
- Trim leading/trailing:
sed 's/^[[:space:]]*//;s/[[:space:]]*$//'
Common Traps
- Smart quotes (
" ") break parsers → normalize to "
- Em/en dashes (
– —) break ASCII → normalize to -
- Zero-width chars invisible but break comparisons → strip them
- String length ≠ byte length in UTF-8 (
"café" = 4 chars, 5 bytes)
Format Detection
# Detect encoding
file -I document.txt
# Detect line endings
cat -A document.txt | head -1
# ^M at end = Windows (CRLF)
# No ^M = Unix (LF)
# Detect delimiter (CSV/TSV)
head -1 file | tr -cd ',;\t|' | wc -c
Quick Transformations
| Task | Command |
|---|
| Lowercase | tr '[:upper:]' '[:lower:]' |
| Remove punctuation | tr -d '[:punct:]' |
| Count words | wc -w |
| Count unique lines | sort -u | wc -l |
| Find duplicates | sort | uniq -d |
| Extract emails | grep -oE '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' |
| Extract URLs | `grep -oE 'https?://[^[:space:]<>"{} |
Before Processing Checklist