Universal CV Data Ingestion & Organization

<when_to_activate> Activate when the user:

Provides raw career data (Obsidian vault, CSV, zip, markdown)
Says "ingest", "import", "process my data", "load my experience"
Wants to populate the knowledge base from source files
Has files in source-data/ directory to process

Trigger phrases: "ingest", "import", "process", "load data", "parse my resume" </when_to_activate>

Input Directory

Drop all source files in source-data/ — this directory is gitignored.

source-data/ ├── obsidian-vault/ # Obsidian folders with .md files ├── linkedin-export.csv # CSV exports ├── resume.pdf # Resumes ├── gemini-review.md # AI summaries (processed first) └── archive.zip # Bulk exports

Target Output Structure

All output goes to the existing content/ directory:

content/ ├── profile.yaml # Identity, hero, about, stats ├── experience/index.yaml # Work history with rich details ├── case-studies/*.md # Project deep-dives (frontmatter + markdown) ├── testimonials/index.yaml # Social proof quotes ├── certifications/index.yaml # Certs and credentials ├── passion-projects/index.yaml ├── skills/index.yaml # Technical & domain skills └── variants/ # Job-specific personalizations

Supported Input Formats

Obsidian Vault (.md files)

Daily notes, project notes, work journals
YAML frontmatter with metadata
Tags, links, dataview syntax
Nested folder structures

CSV Files

LinkedIn exports
Spreadsheet data (experience, skills, projects)
Any structured tabular career data

Zip Archives

Mixed content (CSV, text, images, videos)
Bulk exports from tools
Compressed vault backups

Text Files

Resume text dumps
Job descriptions
Performance reviews
Testimonial snippets

Images & Media

Screenshots of achievements
Certificates
Project visuals
Linked for case studies

Processing Workflow

Phase 0: Schema Understanding (MANDATORY FIRST STEP)

Objective: Read and internalize the actual Zod schemas before writing ANY content.

CRITICAL: This step MUST be completed before any file writes. Schema mismatches were the #1 source of validation errors in past runs.

Always read the schema file first

Read src/lib/schemas.ts

Key Schema Constraints to Verify:

Content Type Field Constraint

Skills skills

z.array(z.string()) — simple strings only, NOT objects

Case Study Media type

z.enum(['blog', 'twitter', 'linkedin', 'video', 'article', 'slides'])

Testimonial featured

z.boolean() — required, not optional

Experience highlights

z.array(z.string()) — simple strings only

Case Study CTA action

z.enum(['contact', 'calendly', 'linkedin']) — only these 3 values

Before writing any file, confirm:

✅ Field types match schema exactly (string vs object vs array)
✅ Enum values are from the allowed set
✅ Required fields are present
✅ Optional fields use correct null/undefined handling

Phase 1: Data Discovery & Inventory

Objective: Map all source files before processing.

Step 1a: Check for Existing AI Summaries (HIGH PRIORITY)

Before parsing raw files, look for existing AI-generated summaries or reviews. These are often more structured and save significant parsing time.

Look for AI review files

find source-data/ -name "review" -o -name "summary" -o -name "gemini" -o -name "claude" -o -name "gpt"

Check for structured exports

find source-data/ -name "*.json" -o -name "export"

If found: Use as PRIMARY source, cross-reference with raw data for verification.

Step 1b: Inventory Raw Sources

For Obsidian vault

find source-data/ -name "*.md" -type f | head -50

For zip archive (extract first)

unzip -l source-data/*.zip | head -50

Categorize by type

find source-data/ -type f ( -name ".md" -o -name ".csv" -o -name ".txt" -o -name ".json" )

Create an inventory table with confidence scores:

File Type Category Priority Confidence

gemini-review.md AI Summary All HIGH ⬛⬛⬛ High

work-history.md Markdown Experience High ⬛⬛⬜ Medium

skills.csv CSV Skills Medium ⬛⬛⬛ High

testimonials.txt Text Social Proof High ⬛⬜⬜ Low

Phase 2: Experience Enrichment

For each work role, extract/synthesize:

Target: content/experience/index.yaml

jobs:

company: "Company Name" role: "Job Title" period: "YYYY – YYYY" location: "City, State" logo: "/images/logos/company.svg" # Optional, nullable url: "https://company.com" # Company website - ALWAYS extract if available highlights:
- "Achievement with specific metric (e.g., 15× revenue growth)"
- "Technical accomplishment with stack details"
- "Cross-functional impact with scope"
- "Decision/trade-off that shows judgment"
- "Scale indicator (users, transactions, team size)" tags: ["Domain1", "Tech1", "Skill1"]

Enrichment criteria:

3-4 highlights per role (focused, not exhaustive)
Specific metrics > vague claims (SMART: Specific, Measurable, Achievable, Relevant)
Technologies and methodologies named
Collaboration scope mentioned
Key decisions with trade-offs

Product Links in Highlights (IMPORTANT)

Highlights support markdown links — use Product Name to link to shipped products, docs, or proof.

Example with inline product links

highlights:

"Shipped Advanced API from 0→1, serving 1M+ daily requests"
"Built Xbox royalties blockchain achieving 99% faster processing"
"Launched Playground V2 improving developer onboarding by 60%"

Link priority for highlights:

Live product/demo URLs (highest credibility)
Official documentation
NPM/PyPI package pages
GitHub repos
Case studies or press coverage

Why this matters: Recruiters verify claims. A clickable link to a live product instantly validates the work is real.

Link Extraction (IMPORTANT)

Always extract and verify company/project URLs so viewers can quickly validate the work is real.

Sources for URLs:

Company websites mentioned in source docs
LinkedIn company pages
Product URLs, app store links
GitHub repos, documentation sites
Press releases, news articles

Link extraction process:

Search source files for URLs

grep -oE 'https?://[^\s<>"]+' [SOURCE_FILES]

Common patterns to look for

"company.com", "startup.io", "project.xyz"
LinkedIn: linkedin.com/company/[name]
GitHub: github.com/[org]
Product Hunt, Crunchbase links

Validation:

Verify URL is still active (not 404)
Prefer official company domain over third-party
For defunct companies: use archive.org link or LinkedIn

Company Type URL Priority

Active startup Official website (e.g., mempools.com)

Enterprise Company careers/about page

Acquired Parent company or press release

Defunct LinkedIn or archive.org

Confidence Scoring for Experience Data

Track extraction confidence for each data point:

Confidence Criteria Action

⬛⬛⬛ High Exact quote from source, verified metric, named source Use directly

⬛⬛⬜ Medium Synthesized from multiple sources, approximated metric Use with note

⬛⬜⬜ Low Inferred, placeholder metric like "[X%]", uncertain date Flag for user review

Before writing experience file:

List all LOW confidence items for user confirmation
Mark uncertain metrics with [VERIFY] prefix in draft
Only write after user confirms or provides corrections

Pre-Validation Check (Before Writing)

Validate structure matches schema BEFORE writing

ExperienceSchema expects:

jobs:

company: string # Required role: string # Required period: string # Required: "YYYY – YYYY" format location: string # Required logo: string | null # Optional url: string | null # Optional - but ALWAYS try to populate! highlights: string[] # Required: array of STRINGS (not objects!) tags: string[] # Required

Phase 3: Case Study Mining

Identify projects with case study potential:

Criteria:

Clear problem → approach → result arc
Quantifiable impact (revenue, users, time saved)
Decision points with reasoning
Lessons learned (what worked, what didn't)

Target: content/case-studies/NN-slug.md frontmatter

id: [next_available_number] slug: project-slug title: "Project Title" company: "Company Name" year: "YYYY-YY" tags: [Tag1, Tag2, Tag3] duration: "X months" role: "Your Role"

hook: headline: "One-line impact statement" impactMetric: value: "XX%" label: "metric label" subMetrics: - value: "YY" label: "secondary metric" thumbnail: /images/case-study-slug.png

cta: headline: "Call to action question" subtext: "Optional elaboration" action: calendly # or: contact, linkedin linkText: "Let's talk →"

[Markdown body with: Challenge, Approach, Key Decision, Execution, Results, Learnings]

Phase 4: Testimonial Extraction

Distinguishing Personal vs Project Testimonials (CRITICAL)

Personal testimonials (USE these):

About the individual: "Working with [Name]...", "[Name] is exceptional at..."
Collaboration focused: "...a pleasure to work with", "...always delivered"
Skill endorsements: "...best product manager I've worked with"
Growth observations: "...grew from... to..."

Project testimonials (DO NOT use as personal):

About the project/product: "The blockchain solution transformed..."
Company endorsements: "Microsoft's approach to..."
Generic team praise: "The team delivered..."

Detection Heuristics

High-confidence personal testimonial signals:

✅ Contains person's name + positive verb: "[Name] led", "[Name] designed" ✅ Uses "I" + relationship: "I worked with", "I hired", "I managed" ✅ Personal qualities: "collaborative", "thoughtful", "proactive" ✅ Recommendation language: "I recommend", "would hire again" ✅ Source: LinkedIn recommendation, performance review

Low-confidence / Project quote signals:

⚠️ No personal name mentioned ⚠️ Focuses on deliverable, not person ⚠️ Source: press release, case study, marketing material ⚠️ Generic praise: "great work", "successful project"

Search Patterns for Testimonials

Phrases indicating personal testimonials

grep -i "working with [name]" [SOURCE] grep -i "I recommend" [SOURCE] grep -i "pleasure to work" [SOURCE] grep -i "[name] is" [SOURCE] grep -i "one of the best" [SOURCE]

LinkedIn recommendation patterns

grep -i "I had the opportunity" [SOURCE] grep -i "I've had the pleasure" [SOURCE]

Testimonial Quality Checklist

Before including a testimonial:

Is it about the PERSON, not just the project?
Does it include specific details (not generic praise)?
Can the source be verified (LinkedIn, named author)?
Is the author's relationship clear (manager, peer, report)?

Target: content/testimonials/index.yaml

testimonials:

quote: > Full testimonial text with specific details about collaboration, impact, or skills demonstrated. author: "Full Name" # Or "Engineering Lead" if anonymized initials: "FN" role: "Their Role" company: "Company Name" avatar: null # Or path to image featured: true # Required boolean - for prominent display

Gap Handling

If no valid personal testimonials found:

Flag as HIGH priority gap in gap analysis
Suggest manual follow-up actions:
Export LinkedIn recommendations
Request recommendations from past colleagues
Check old emails for praise snippets
DO NOT fabricate or use project quotes as personal testimonials

Phase 5: Skills Taxonomy

CRITICAL: Schema Constraint

The skills schema expects SIMPLE STRINGS, not objects!

❌ WRONG - This will fail validation

categories:

name: "Technical Leadership" skills:
- name: "Platform Architecture" level: "expert" evidence: "..."

✅ CORRECT - Simple string array

categories:

name: "Technical Leadership" skills:
- "Platform Architecture"
- "API Design"
- "System Integration"

Preserving Evidence (Workaround)

Since the schema doesn't support evidence fields, preserve context using YAML comments:

Target: content/skills/index.yaml

categories:

name: "Technical Leadership" skills:

Evidence: Designed multi-client validator infra at Anchorage
- "Platform Architecture"
Evidence: Shipped Ankr Advanced APIs, 1M+ daily requests
- "API Design"
name: "Web3 & Blockchain" skills:

Evidence: First smart contract at Microsoft (2016)
- "Ethereum"
Evidence: ETF-grade validator provisioning
- "Staking Infrastructure"

Pre-Validation Check

// SkillsSchema from src/lib/schemas.ts: SkillsSchema = z.object({ categories: z.array(z.object({ name: z.string(), skills: z.array(z.string()) // <-- STRINGS ONLY! })) });

Before writing skills file:

Verify each skill is a plain string
Move any evidence/level data to comments
Test parse the YAML structure mentally

Phase 6: Validation & Gap Analysis

Validate all content against Zod schemas

npm run validate

Check for gaps

Gap checklist:

All required fields populated
No placeholder URLs (demo.com, example.com)
Metrics are specific, not vague ("improved" → "improved by 40%")
Missing metrics marked with [METRIC] for user to fill (e.g., "Increased revenue by [METRIC]%")
Each experience has 3-4 highlights (focused, not exhaustive)
Product links included for every shipped product/tool
All dates in consistent format (YYYY or YYYY–YY)

Schema Reference

Experience Entry

{ company: string; // Required role: string; // Required period: string; // Required: "YYYY – YYYY" or "YYYY – Present" location: string; // Required logo?: string; // Optional: path to logo url?: string; // Optional: company website - ALWAYS include if available highlights: string[]; // Required: 3-4 items with Product links where possible tags: string[]; // Required: 2-5 tags }

Case Study Frontmatter

{ id: number; // Required: unique incremental slug: string; // Required: URL-friendly title: string; // Required company: string; // Required year: string; // Required: "YYYY" or "YYYY-YY" tags: string[]; // Required duration: string; // Required role: string; // Required hook: { headline: string; impactMetric: { value: string; label: string }; subMetrics?: { value: string; label: string }[]; thumbnail?: string; }; cta: { headline: string; subtext?: string; action: "contact" | "calendly" | "linkedin"; linkText: string; }; // Optional links demoUrl?: string; githubUrl?: string; docsUrl?: string; media?: { type: string; url: string; label?: string }[]; }

Testimonial Entry

{ quote: string; // Required: the testimonial text author: string; // Required: name or title if anonymized initials: string; // Required: for avatar fallback role: string; // Required company: string; // Required avatar?: string; // Optional: image path featured?: boolean; // Optional: for prominent display }

Automation Outputs

After processing, generate:

Data Lineage Document (docs/data-lineage.md )

Source file → target field mapping
Transformation notes
Decisions made during processing

Gap Analysis Report (docs/gap-analysis.md )

Missing required fields
Thin content areas
Recommended improvements

Processing Summary

Files processed count
Content items created/updated
Validation status

Example Workflows

Workflow 1: Obsidian Vault Import

User: "I have an Obsidian vault at ~/Documents/Career-Notes with my work history"

Steps:

Scan vault for relevant files (work, projects, reviews folders)
Parse each markdown file, extract structured data
Map to experience/index.yaml format
Identify case study candidates
Extract any testimonial quotes
Validate and report gaps

Workflow 2: LinkedIn Export + Resume

User: "I have a LinkedIn CSV export and my resume as text"

Steps:

Parse CSV for positions, skills, recommendations
Parse resume text for additional details
Merge and deduplicate
Enrich highlights with specific metrics
Create skills taxonomy
Validate against schemas

Workflow 3: Zip Archive with Mixed Content

User: "I have a zip file with CSV files, text notes, and screenshots"

Steps:

Extract archive to temp directory
Inventory all files by type
Process CSVs for structured data
Process text files for testimonials/notes
Catalog images for case study thumbnails
Consolidate into content structure
Clean up temp files

Quality Criteria

"Hired-on-Sight" Standard

Every piece of content should pass this test:

Specific: Numbers, names, technologies — not vague claims
Verified: Can point to source data
Impactful: Shows business/user value, not just activity
Consistent: Formatting, dates, terminology aligned
Complete: No "TODO" or placeholder content

Validation Commands

Full content validation

npm run validate

Build to catch any issues

npm run build

Preview locally

npm run dev

Tips

Preserve source references: Note which file each piece of data came from
Prioritize metrics: "Shipped feature" < "Shipped feature used by 10K users"
Anonymize carefully: If removing names, keep role context
Check dates: Ensure no overlapping periods, consistent formatting
Cross-reference: Same project might appear in experience AND as case study
Images: Store in /public/images/ , reference as /images/filename.png

Dependencies

Node.js (for validation scripts)
npm run validate command
Zod schemas in src/lib/schemas.ts

No additional packages required.

cv-data-ingestion

Safety Notice

Copy this and send it to your AI assistant to learn

Always read the schema file first

Look for AI review files

Check for structured exports

For Obsidian vault

For zip archive (extract first)

Categorize by type

Target: content/experience/index.yaml

Example with inline product links

Search source files for URLs

Common patterns to look for

Validate structure matches schema BEFORE writing

ExperienceSchema expects:

Target: content/case-studies/NN-slug.md frontmatter

cta: headline: "Call to action question" subtext: "Optional elaboration" action: calendly # or: contact, linkedin linkText: "Let's talk →"

Phrases indicating personal testimonials

LinkedIn recommendation patterns

Target: content/testimonials/index.yaml

❌ WRONG - This will fail validation

✅ CORRECT - Simple string array

Target: content/skills/index.yaml

Evidence: Designed multi-client validator infra at Anchorage

Evidence: Shipped Ankr Advanced APIs, 1M+ daily requests

Evidence: First smart contract at Microsoft (2016)

Evidence: ETF-grade validator provisioning

Validate all content against Zod schemas

Check for gaps

Full content validation

Build to catch any issues

Preview locally

Source Transparency

Related Skills

ultrathink

linkedin

sprint-sync