Working with Documents

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "working-with-documents" with this command: npx skills add mjunaidca/mjs-agent-skills/mjunaidca-mjs-agent-skills-working-with-documents

Quick Reference

| Format | Read | Create | Edit |
|--------|------|--------|------|
| DOCX | pandoc, python-docx | docx-js | OOXML (unpack/edit/pack) |
| PDF | pdfplumber, pypdf | reportlab | pypdf (merge/split) |
| PPTX | markitdown | html2pptx | OOXML (unpack/edit/pack) |

Word Documents (.docx)

Reading Content

Convert to markdown (preserves structure)

pandoc document.docx -o output.md

With tracked changes visible

pandoc --track-changes=all document.docx -o output.md

Creating New Documents

Use docx-js (JavaScript):

const fs = require('fs');
const { Document, Packer, Paragraph, TextRun } = require('docx');

// One section containing a single bold paragraph
const doc = new Document({
  sections: [{
    children: [
      new Paragraph({
        children: [
          new TextRun({ text: "Hello World", bold: true }),
        ],
      }),
    ],
  }],
});

// Serialize the document and write it to disk
Packer.toBuffer(doc).then(buffer => {
  fs.writeFileSync("output.docx", buffer);
});

Editing Existing Documents (Tracked Changes)

1. Unpack

python ooxml/scripts/unpack.py document.docx unpacked/

2. Edit XML files in unpacked/word/document.xml

Key files:

- word/document.xml (main content)

- word/comments.xml (comments)

- word/media/ (images)

3. Pack

python ooxml/scripts/pack.py unpacked/ edited.docx
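
The unpack and pack steps work because a .docx (or .pptx) file is a ZIP archive of XML parts. The repository's unpack.py and pack.py are not reproduced here; what follows is only a minimal standard-library sketch of the same idea. The function names `unpack_ooxml` and `pack_ooxml` are hypothetical, and the real scripts may do more (for example pretty-printing or validation).

```python
import zipfile
from pathlib import Path


def unpack_ooxml(archive_path, out_dir):
    """Extract every part of a .docx/.pptx (a ZIP archive) into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(out)
    # Return the extracted part names, e.g. "word/document.xml"
    return sorted(p.relative_to(out).as_posix() for p in out.rglob("*") if p.is_file())


def pack_ooxml(src_dir, archive_path):
    """Zip the files under src_dir back into a single archive."""
    src = Path(src_dir)
    files = sorted(p for p in src.rglob("*") if p.is_file())
    # Assumption: put [Content_Types].xml first to mirror the layout of
    # Office-generated packages; the stable sort keeps the rest in order.
    files.sort(key=lambda p: p.name != "[Content_Types].xml")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in files:
            zf.write(path, path.relative_to(src).as_posix())
```

Usage would look like `unpack_ooxml("document.docx", "unpacked/")`, an edit to `unpacked/word/document.xml`, then `pack_ooxml("unpacked/", "edited.docx")`.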

Tracked changes XML pattern:

<!-- Deletion --> <w:del><w:r><w:delText>old text</w:delText></w:r></w:del>

<!-- Insertion --> <w:ins><w:r><w:t>new text</w:t></w:r></w:ins>
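
In real documents the `w:ins` and `w:del` elements also carry `w:id`, `w:author`, and `w:date` attributes, and the text must be XML-escaped. A small sketch of generating such markup is shown here; the helper name `tracked_replace` is hypothetical, and it assumes the caller supplies a `change_id` that does not collide with IDs already used in the document.

```python
from datetime import datetime, timezone
from xml.sax.saxutils import escape, quoteattr


def tracked_replace(old_text, new_text, change_id, author, when=None):
    """Return WordprocessingML for a tracked deletion followed by an insertion."""
    # Normalize the timestamp to UTC so the trailing "Z" is accurate
    when = (when or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = when.strftime("%Y-%m-%dT%H:%M:%SZ")
    attrs = f'w:author={quoteattr(author)} w:date="{stamp}"'
    deletion = (
        f'<w:del w:id="{change_id}" {attrs}>'
        f'<w:r><w:delText xml:space="preserve">{escape(old_text)}</w:delText></w:r></w:del>'
    )
    insertion = (
        f'<w:ins w:id="{change_id + 1}" {attrs}>'
        f'<w:r><w:t xml:space="preserve">{escape(new_text)}</w:t></w:r></w:ins>'
    )
    return deletion + insertion
```

The returned string is meant to be pasted into a paragraph in word/document.xml in place of the original run; run formatting (`w:rPr`) is omitted for brevity.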

PDF Documents

Reading PDFs

import pdfplumber

# Extract text
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        print(page.extract_text())

# Extract tables
with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            for row in table:
                print(row)
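
Each table returned by `extract_tables()` is a list of rows, and empty cells come back as `None`. As a hedged follow-up, the sketch below turns one such table into CSV text using only the standard library; the helper name `table_to_csv` is hypothetical, and collapsing in-cell line breaks to spaces is a design choice, not something pdfplumber requires.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one extracted table (a list of rows) as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Empty cells arrive as None; collapse in-cell whitespace runs to single spaces
        writer.writerow(["" if cell is None else " ".join(cell.split()) for cell in row])
    return buffer.getvalue()
```

For example, `table_to_csv(page.extract_tables()[0])` would give a string ready to write to a .csv file.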

Creating PDFs

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("output.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = [
    Paragraph("Report Title", styles['Title']),
    Paragraph("Body text goes here.", styles['Normal']),
]
doc.build(story)

Merging/Splitting PDFs

from pypdf import PdfReader, PdfWriter

# Merge
writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)
# Use a context manager so the output file is flushed and closed
with open("merged.pdf", "wb") as f:
    writer.write(f)

# Split (one output file per page)
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as f:
        writer.write(f)
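
Splitting often means pulling out a range of pages rather than every page individually. The following sketch shows one way to do that with pypdf; the helper names `parse_page_ranges` and `extract_pages` and the "1-3,5" spec syntax are illustrative assumptions, not part of pypdf.

```python
def parse_page_ranges(spec, page_count):
    """Turn a 1-based spec such as "1-3,5" into zero-based page indexes."""
    indexes = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        first, _, last = part.partition("-")
        start = int(first)
        end = int(last) if last else start
        if not 1 <= start <= end <= page_count:
            raise ValueError(f"invalid range {part!r} for {page_count} pages")
        indexes.extend(range(start - 1, end))
    return indexes


def extract_pages(src, dst, spec):
    """Copy the selected pages of src into a new PDF at dst."""
    from pypdf import PdfReader, PdfWriter  # imported lazily; requires pypdf

    reader = PdfReader(src)
    writer = PdfWriter()
    for index in parse_page_ranges(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst, "wb") as handle:
        writer.write(handle)
```

A call such as `extract_pages("input.pdf", "excerpt.pdf", "1-3,5")` would write pages 1, 2, 3, and 5 to excerpt.pdf.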

Command-Line Tools

Extract text

pdftotext input.pdf output.txt
pdftotext -layout input.pdf output.txt  # Preserve layout

Merge with qpdf

qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf

Split pages

qpdf input.pdf --pages . 1-5 -- pages1-5.pdf

PowerPoint Presentations (.pptx)

Reading Content

Convert to markdown

python -m markitdown presentation.pptx

Creating New Presentations

Use html2pptx workflow:

  • Create HTML slides (720pt × 405pt for 16:9)

  • Convert with html2pptx.js library

  • Validate with thumbnail grid

Create thumbnails for validation

python scripts/thumbnail.py output.pptx --cols 4
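
As a quick check on the slide dimensions used in the workflow, the arithmetic below confirms that 720pt × 405pt is exactly 16:9 and converts it to inches and to EMU (English Metric Units, the length unit used inside OOXML files). The function name `slide_size` is hypothetical; the constants 72 points per inch and 914,400 EMU per inch are standard.

```python
from fractions import Fraction

POINTS_PER_INCH = 72
EMU_PER_INCH = 914_400  # English Metric Units per inch


def slide_size(width_pt, height_pt):
    """Return aspect ratio, size in inches, and size in EMU for a slide."""
    ratio = Fraction(width_pt, height_pt)
    inches = (width_pt / POINTS_PER_INCH, height_pt / POINTS_PER_INCH)
    emu = (width_pt * EMU_PER_INCH // POINTS_PER_INCH,
           height_pt * EMU_PER_INCH // POINTS_PER_INCH)
    return ratio, inches, emu


ratio, inches, emu = slide_size(720, 405)
print(ratio, inches, emu)  # 16/9 (10.0, 5.625) (9144000, 5143500)
```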

Editing Existing Presentations

1. Unpack

python ooxml/scripts/unpack.py presentation.pptx unpacked/

Key files:

- ppt/slides/slide1.xml, slide2.xml, etc.

- ppt/notesSlides/ (speaker notes)

- ppt/media/ (images)

2. Edit XML

3. Validate

python ooxml/scripts/validate.py unpacked/ --original presentation.pptx

4. Pack

python ooxml/scripts/pack.py unpacked/ edited.pptx

Rearranging Slides

Duplicate, reorder, delete slides

python scripts/rearrange.py template.pptx output.pptx 0,3,3,5,7

Creates: slide 0, slide 3 (twice), slide 5, slide 7
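
The index list is zero-based, and repeating an index duplicates that slide. The semantics can be illustrated on a plain Python list; this is only a sketch of the ordering logic, not the implementation of scripts/rearrange.py, and the function name `rearrange` is hypothetical.

```python
def rearrange(slides, order):
    """Build a new slide sequence from a comma-separated list of zero-based indexes."""
    indexes = [int(token) for token in order.split(",") if token.strip()]
    for index in indexes:
        if not 0 <= index < len(slides):
            raise IndexError(f"slide index {index} out of range (0-{len(slides) - 1})")
    # Slides not listed are dropped; repeated indexes appear more than once
    return [slides[index] for index in indexes]
```

With an eight-slide template, `rearrange(slides, "0,3,3,5,7")` yields five slides: the first, the fourth twice, the sixth, and the eighth.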

Converting Between Formats

DOCX/PPTX to PDF

soffice --headless --convert-to pdf document.docx
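
When the conversion runs from a script, it can help to control the output directory and know where the PDF will land. The sketch below wraps the same command with `subprocess`; the helper names are hypothetical, `--outdir` is LibreOffice's option for the destination folder, and the 120-second timeout is an arbitrary assumption.

```python
import subprocess
from pathlib import Path


def soffice_pdf_command(source, out_dir="."):
    """Build the argument list for a headless LibreOffice PDF conversion."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(source)]


def convert_to_pdf(source, out_dir=".", timeout=120):
    """Run LibreOffice and return the path where the PDF is expected."""
    subprocess.run(soffice_pdf_command(source, out_dir), check=True, timeout=timeout)
    # LibreOffice keeps the input's base name and swaps the extension
    return Path(out_dir) / (Path(source).stem + ".pdf")
```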

PDF to images

pdftoppm -jpeg -r 150 document.pdf page

Creates: page-1.jpg, page-2.jpg, etc.
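
When the images are processed afterwards, they should be ordered by page number rather than by plain string sort, since "page-10.jpg" sorts before "page-2.jpg" alphabetically and pdftoppm may zero-pad the number for longer documents (for example page-01.jpg). A small sketch, with hypothetical helper names:

```python
import re
from pathlib import Path


def page_number(path):
    """Pull the trailing page number out of a name such as page-07.jpg."""
    match = re.search(r"-(\d+)\.[A-Za-z]+$", Path(path).name)
    if match is None:
        raise ValueError(f"no page number in {path!r}")
    return int(match.group(1))


def sorted_pages(paths):
    """Order page images numerically by their page number."""
    return sorted(paths, key=page_number)
```

For instance, `sorted_pages(str(p) for p in Path(".").glob("page-*.jpg"))` returns the images in reading order.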

DOCX to Markdown

pandoc document.docx -o output.md

OCR for Scanned Documents

import pytesseract
from pdf2image import convert_from_path

# Render each PDF page to an image, then OCR it
images = convert_from_path('scanned.pdf')
text = ""
for image in images:
    text += pytesseract.image_to_string(image)
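
OCR is slow, so it is usually worth trying normal text extraction first and falling back to OCR only when the PDF has no usable text layer. One possible heuristic is sketched below; the function name `needs_ocr`, the 20-character threshold, and the "more than half the pages" rule are all assumptions to tune, not established values.

```python
def needs_ocr(page_texts, min_chars=20):
    """Guess whether a PDF is scanned: True when most pages have almost no text."""
    if not page_texts:
        return True
    # Treat None (no text layer) the same as an empty string
    sparse = sum(1 for text in page_texts if len((text or "").strip()) < min_chars)
    return sparse > len(page_texts) / 2
```

The input would be the per-page results of `page.extract_text()` from pdfplumber; if `needs_ocr(...)` returns True, run the pytesseract path instead.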

Design Guidelines (Presentations)

Color Palettes

Pick 3-5 colors that work together:

| Palette | Colors |
|---------|--------|
| Classic Blue | Navy #1C2833, Slate #2E4053, Silver #AAB7B8 |
| Teal & Coral | Teal #5EA8A7, Coral #FE4447, White #FFFFFF |
| Black & Gold | Gold #BF9A4A, Black #000000, Cream #F4F6F6 |

Web-Safe Fonts Only

Arial, Helvetica, Times New Roman, Georgia, Verdana, Tahoma, Trebuchet MS, Courier New, Impact

Layout Rules

  • Two-column: Use for exactly 2 distinct items

  • Three-column: Use for exactly 3 items

  • Never vertically stack charts below text

  • Full-bleed images with text overlays work well

Dependencies

Python

pip install pypdf pdfplumber reportlab python-docx openpyxl pytesseract pdf2image 'markitdown[pptx]'

System tools

apt-get install pandoc poppler-utils libreoffice qpdf tesseract-ocr

Node.js (for docx-js)

npm install docx

Verification

Run: python scripts/verify.py
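
The contents of scripts/verify.py are not shown here. As a rough idea of what a dependency check for this skill could look like, the sketch below tests for importable Python modules and for command-line tools on PATH; the function name, the lists, and the output format are assumptions rather than the script's actual behavior.

```python
import importlib.util
import shutil

# Import names, not pip names: python-docx is imported as "docx"
PYTHON_MODULES = ["pypdf", "pdfplumber", "reportlab", "docx"]
CLI_TOOLS = ["pandoc", "pdftotext", "pdftoppm", "soffice"]


def missing_dependencies(modules=PYTHON_MODULES, tools=CLI_TOOLS):
    """Return labels for every module that cannot be imported or tool not on PATH."""
    missing = [f"python:{name}" for name in modules
               if importlib.util.find_spec(name) is None]
    missing += [f"cli:{name}" for name in tools if shutil.which(name) is None]
    return missing


if __name__ == "__main__":
    problems = missing_dependencies()
    print("OK" if not problems else "Missing: " + ", ".join(problems))
```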

Related Skills

  • working-with-spreadsheets: Excel file handling

  • building-nextjs-apps: Frontend for document uploads

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
