name: docx-pdf-knowledge-parser
description: Parse local docx and pdf files into report-first knowledge artifacts. Use when ChatGPT needs to extract text from uploaded or locally available attachments and generate ingest-report.md, kb-items.jsonl, failed-items.jsonl, and MEMORY.candidate.md without directly writing MEMORY.md.
Docx PDF Knowledge Parser
Use this skill to turn local or uploaded .docx and .pdf files into structured, reviewable knowledge outputs.
What this skill does
- Accept local or already-available .docx and .pdf files.
- Classify files into parseable, manual-review, or failed.
- Parse .docx and .pdf in v1.0.
- Produce report-first outputs instead of writing MEMORY.md directly.
- Preserve failures and uncertainty instead of guessing content.
Supported v1.0 scope
Inputs
- Local .docx file path
- Local .pdf file path
- A batch of local .docx and .pdf files in one directory
Parsing
- .docx
- .pdf
Outputs
- ingest-report.md
- kb-items.jsonl
- failed-items.jsonl
- MEMORY.candidate.md
Required behavior
- Only process files that are already available locally or have already been provided to the runtime.
- Do not claim file content was learned unless text was actually extracted.
- Default to report-first. Do not write MEMORY.md in v1.0.
- Record every failed file with a concrete reason.
- Prefer plain-text summaries over complex cards when reporting progress.
File routing rules
Parseable
Treat these as parseable in v1.0:
- .docx
- .pdf
Manual-review
Route here when the file is out of scope or low-confidence in v1.0:
- .pptx
- images
- scans with no extractable text
- archives
- unusual file types
Failed
Route here when the file cannot be opened, parsed, or extracted successfully.
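The routing rules above can be sketched as two small functions: one that routes by extension before any parsing, and one that revises the route after an extraction attempt. This is a minimal sketch, not the skill's actual implementation. The names initial_route and final_route and the route strings are assumptions made for illustration.

```python
from pathlib import Path
from typing import Optional

# Extensions treated as parseable in v1.0 (taken from the list above).
PARSEABLE_SUFFIXES = {".docx", ".pdf"}


def initial_route(path: str) -> str:
    """Route by extension only; filenames are hints, not proof of contents."""
    suffix = Path(path).suffix.lower()
    return "parseable" if suffix in PARSEABLE_SUFFIXES else "manual-review"


def final_route(route: str, text: Optional[str], error: Optional[str] = None) -> str:
    """Revise the route once an extraction attempt has finished."""
    if route != "parseable":
        return route  # out-of-scope files stay in manual-review
    if error is not None:
        return "failed"  # the file could not be opened, parsed, or extracted
    if not text or not text.strip():
        return "manual-review"  # e.g. a scan with no extractable text
    return "parseable"
```

Splitting the decision in two keeps the extension check cheap and lets the final route depend on what was actually extracted.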
Standard workflow
- Resolve input type.
  - Single file path -> process one file
  - Directory path -> enumerate supported files
- Create a batch record.
  - Generate batch_id
  - Record started_at
- Build a manifest.
  - File name
  - File path
  - File type
  - Route decision
- Attempt extraction.
  - .docx -> use parsers/parse_docx.py
  - .pdf -> use parsers/parse_pdf.py
- Produce structured outputs.
  - success -> append to kb-items.jsonl
  - failure -> append to failed-items.jsonl
- Summarize the batch.
  - Write ingest-report.md
  - Write MEMORY.candidate.md
- Finish the batch.
  - Record finished_at
  - Never auto-write MEMORY.md
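The first three workflow steps (resolve input, create a batch record, build a manifest) might look like the sketch below. It is illustrative only: the function names, the dict-based records, the UUID-based batch_id, and the ISO 8601 UTC timestamps are all assumptions, and the bundled run.py may do this differently.

```python
import uuid
from datetime import datetime, timezone
from pathlib import Path

SUPPORTED_SUFFIXES = {".docx", ".pdf"}


def utc_now() -> str:
    """Timestamp format is an assumption: ISO 8601 in UTC."""
    return datetime.now(timezone.utc).isoformat()


def new_batch() -> dict:
    """Create a batch record; finished_at is filled in when the batch ends."""
    return {"batch_id": uuid.uuid4().hex, "started_at": utc_now(), "finished_at": None}


def resolve_inputs(target: str) -> list:
    """Directory -> supported files inside it; anything else -> one file."""
    path = Path(target)
    if path.is_dir():
        return sorted(
            p for p in path.iterdir()
            if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
        )
    return [path]


def build_manifest(paths: list) -> list:
    """One manifest entry per file: name, path, type, and route decision."""
    manifest = []
    for p in paths:
        suffix = p.suffix.lower()
        manifest.append({
            "file_name": p.name,
            "file_path": str(p),
            "file_type": suffix.lstrip("."),
            "route": "parseable" if suffix in SUPPORTED_SUFFIXES else "manual-review",
        })
    return manifest
```

Setting finished_at only at the end of the run keeps the batch record honest about whether the batch actually completed.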
Output contracts
kb-items.jsonl
Write one JSON object per successfully extracted knowledge item with at least:
- batch_id
- source_file
- source_path
- file_type
- topic
- content_type
- summary
- extracted_at
- confidence
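A minimal sketch of appending one item to kb-items.jsonl while enforcing the required fields listed above. The helper name append_kb_item is hypothetical, and every value in EXAMPLE_ITEM is invented for illustration. In particular, a numeric confidence is an assumption, since the contract does not fix its type.

```python
import json

KB_REQUIRED_FIELDS = (
    "batch_id", "source_file", "source_path", "file_type", "topic",
    "content_type", "summary", "extracted_at", "confidence",
)

# Hypothetical record; none of these values come from a real batch.
EXAMPLE_ITEM = {
    "batch_id": "example-batch",
    "source_file": "handbook.docx",
    "source_path": "inbox/handbook.docx",
    "file_type": "docx",
    "topic": "onboarding",
    "content_type": "procedure",
    "summary": "Steps for requesting system access.",
    "extracted_at": "2024-01-01T00:00:00+00:00",
    "confidence": 0.8,
}


def append_kb_item(jsonl_path: str, item: dict) -> None:
    """Append one JSON object per line, rejecting incomplete records."""
    missing = [field for field in KB_REQUIRED_FIELDS if field not in item]
    if missing:
        raise ValueError(f"kb item is missing required fields: {missing}")
    with open(jsonl_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(item, ensure_ascii=False) + "\n")
```

Opening the file in append mode means a crash partway through a batch leaves earlier lines intact.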
failed-items.jsonl
Write one JSON object per failed file with at least:
- batch_id
- source_file
- source_path
- file_type
- failure_reason
- error_detail
- suggested_action
- failed_at
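One way to turn a caught exception into a record with the fields above is sketched here. The function name failure_record is hypothetical. The split between a short failure_reason chosen by the caller and an error_detail derived from the exception is also an assumption about how the two fields are meant to differ.

```python
from datetime import datetime, timezone
from pathlib import Path


def failure_record(batch_id: str, path: str, failure_reason: str,
                   exc: Exception, suggested_action: str) -> dict:
    """Build one failed-items.jsonl record from a caught exception."""
    p = Path(path)
    return {
        "batch_id": batch_id,
        "source_file": p.name,
        "source_path": str(p),
        "file_type": p.suffix.lower().lstrip("."),
        "failure_reason": failure_reason,  # short, concrete label
        "error_detail": f"{type(exc).__name__}: {exc}",  # raw error text
        "suggested_action": suggested_action,
        "failed_at": datetime.now(timezone.utc).isoformat(),
    }
```

A caller would typically wrap each parser call in try/except and serialize the returned dict as one line of failed-items.jsonl.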
MEMORY.candidate.md
Include:
- batch header (batch_id, started_at, finished_at, source_directory or source_file)
- grouped knowledge summaries
- source references
- confidence notes
- items needing review
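The sections listed above could be rendered from the batch record and the extracted items roughly as follows. This is a sketch under several assumptions: items are grouped by their topic field, confidence is numeric, anything below a hypothetical review_below threshold is flagged for review, and the markdown headings in the output are a layout choice rather than a required format.

```python
from collections import defaultdict


def render_candidate(batch: dict, items: list, review_below: float = 0.6) -> str:
    """Return the text of MEMORY.candidate.md; the caller writes the file."""
    source = batch.get("source_directory") or batch.get("source_file") or "unknown"
    lines = [
        "# MEMORY candidate",
        "",
        f"- batch_id: {batch['batch_id']}",
        f"- started_at: {batch['started_at']}",
        f"- finished_at: {batch['finished_at']}",
        f"- source: {source}",
        "",
    ]
    # Group summaries by topic, keeping source and confidence next to each one.
    by_topic = defaultdict(list)
    for item in items:
        by_topic[item["topic"]].append(item)
    for topic in sorted(by_topic):
        lines.append(f"## {topic}")
        for item in by_topic[topic]:
            lines.append(
                f"- {item['summary']} "
                f"(source: {item['source_file']}, confidence: {item['confidence']})"
            )
        lines.append("")
    # Low-confidence items are listed again so a reviewer sees them in one place.
    lines.append("## Items needing review")
    flagged = [i for i in items if i["confidence"] < review_below]
    if flagged:
        lines.extend(f"- {i['source_file']}: {i['summary']}" for i in flagged)
    else:
        lines.append("- none")
    return "\n".join(lines) + "\n"
```

The function only builds a string for the candidate file and never touches MEMORY.md, in line with the report-first rule.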
ingest-report.md
Include:
- Batch summary
- Input scope
- File counts and routing counts
- Successful extraction summary
- Failures and risks
- Recommended next actions
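A sketch of how the countable parts of the report might be assembled from the manifest and the two JSONL outputs. The function name render_report, the route strings, and the markdown layout are assumptions. Recommended next actions are deliberately left out of the sketch because they need judgment rather than counting.

```python
from collections import Counter


def render_report(batch: dict, manifest: list, kb_items: list, failed_items: list) -> str:
    """Return the text of ingest-report.md; the caller writes the file."""
    routes = Counter(entry["route"] for entry in manifest)
    source = batch.get("source_directory") or batch.get("source_file") or "unknown"
    lines = [
        "# Ingest report",
        "",
        "## Batch summary",
        f"- batch_id: {batch['batch_id']}",
        f"- started_at: {batch['started_at']}",
        f"- finished_at: {batch['finished_at']}",
        "",
        "## Input scope",
        f"- source: {source}",
        "",
        "## File counts and routing counts",
        f"- total files: {len(manifest)}",
    ]
    for route in ("parseable", "manual-review", "failed"):
        lines.append(f"- {route}: {routes.get(route, 0)}")
    lines += [
        "",
        "## Successful extraction summary",
        f"- knowledge items written: {len(kb_items)}",
        "",
        "## Failures and risks",
    ]
    if failed_items:
        lines.extend(f"- {f['source_file']}: {f['failure_reason']}" for f in failed_items)
    else:
        lines.append("- none")
    return "\n".join(lines) + "\n"
```

Every failed file appears with its recorded reason, which keeps the report consistent with failed-items.jsonl.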
Safety rules
- Never invent text that was not extracted.
- If parsing fails, say so plainly and log it.
- Treat filenames as hints only, never as proof of document contents.
- Keep sensitive data out of MEMORY.candidate.md unless the workflow explicitly allows it.
Included files
- run.py: minimal batch runner for local testing
- parsers/parse_docx.py: docx text extraction helper
- parsers/parse_pdf.py: pdf text extraction helper
- references/output_examples.md: sample output shapes and field guidance
- README.md: setup and usage notes