Universal CV Data Ingestion & Organization
<when_to_activate> Activate when the user:
-
Provides raw career data (Obsidian vault, CSV, zip, markdown)
-
Says "ingest", "import", "process my data", "load my experience"
-
Wants to populate the knowledge base from source files
-
Has files in source-data/ directory to process
Trigger phrases: "ingest", "import", "process", "load data", "parse my resume" </when_to_activate>
Input Directory
Drop all source files in source-data/ — this directory is gitignored.
source-data/ ├── obsidian-vault/ # Obsidian folders with .md files ├── linkedin-export.csv # CSV exports ├── resume.pdf # Resumes ├── gemini-review.md # AI summaries (processed first) └── archive.zip # Bulk exports
Target Output Structure
All output goes to the existing content/ directory:
content/ ├── profile.yaml # Identity, hero, about, stats ├── experience/index.yaml # Work history with rich details ├── case-studies/*.md # Project deep-dives (frontmatter + markdown) ├── testimonials/index.yaml # Social proof quotes ├── certifications/index.yaml # Certs and credentials ├── passion-projects/index.yaml ├── skills/index.yaml # Technical & domain skills └── variants/ # Job-specific personalizations
Supported Input Formats
Obsidian Vault (.md files)
-
Daily notes, project notes, work journals
-
YAML frontmatter with metadata
-
Tags, links, dataview syntax
-
Nested folder structures
CSV Files
-
LinkedIn exports
-
Spreadsheet data (experience, skills, projects)
-
Any structured tabular career data
Zip Archives
-
Mixed content (CSV, text, images, videos)
-
Bulk exports from tools
-
Compressed vault backups
Text Files
-
Resume text dumps
-
Job descriptions
-
Performance reviews
-
Testimonial snippets
Images & Media
-
Screenshots of achievements
-
Certificates
-
Project visuals
-
Linked for case studies
Processing Workflow
Phase 0: Schema Understanding (MANDATORY FIRST STEP)
Objective: Read and internalize the actual Zod schemas before writing ANY content.
CRITICAL: This step MUST be completed before any file writes. Schema mismatches were the #1 source of validation errors in past runs.
Always read the schema file first
Read src/lib/schemas.ts
Key Schema Constraints to Verify:
Content Type Field Constraint
Skills skills
z.array(z.string()) — simple strings only, NOT objects
Case Study Media type
z.enum(['blog', 'twitter', 'linkedin', 'video', 'article', 'slides'])
Testimonial featured
z.boolean() — required, not optional
Experience highlights
z.array(z.string()) — simple strings only
Case Study CTA action
z.enum(['contact', 'calendly', 'linkedin']) — only these 3 values
Before writing any file, confirm:
-
✅ Field types match schema exactly (string vs object vs array)
-
✅ Enum values are from the allowed set
-
✅ Required fields are present
-
✅ Optional fields use correct null/undefined handling
Phase 1: Data Discovery & Inventory
Objective: Map all source files before processing.
Step 1a: Check for Existing AI Summaries (HIGH PRIORITY)
Before parsing raw files, look for existing AI-generated summaries or reviews. These are often more structured and save significant parsing time.
Look for AI review files
find source-data/ -name "review" -o -name "summary" -o -name "gemini" -o -name "claude" -o -name "gpt"
Check for structured exports
find source-data/ -name "*.json" -o -name "export"
If found: Use as PRIMARY source, cross-reference with raw data for verification.
Step 1b: Inventory Raw Sources
For Obsidian vault
find source-data/ -name "*.md" -type f | head -50
For zip archive (extract first)
unzip -l source-data/*.zip | head -50
Categorize by type
find source-data/ -type f ( -name ".md" -o -name ".csv" -o -name ".txt" -o -name ".json" )
Create an inventory table with confidence scores:
File Type Category Priority Confidence
gemini-review.md AI Summary All HIGH ⬛⬛⬛ High
work-history.md Markdown Experience High ⬛⬛⬜ Medium
skills.csv CSV Skills Medium ⬛⬛⬛ High
testimonials.txt Text Social Proof High ⬛⬜⬜ Low
Phase 2: Experience Enrichment
For each work role, extract/synthesize:
Target: content/experience/index.yaml
jobs:
- company: "Company Name"
role: "Job Title"
period: "YYYY – YYYY"
location: "City, State"
logo: "/images/logos/company.svg" # Optional, nullable
url: "https://company.com" # Company website - ALWAYS extract if available
highlights:
- "Achievement with specific metric (e.g., 15× revenue growth)"
- "Technical accomplishment with stack details"
- "Cross-functional impact with scope"
- "Decision/trade-off that shows judgment"
- "Scale indicator (users, transactions, team size)" tags: ["Domain1", "Tech1", "Skill1"]
Enrichment criteria:
-
3-4 highlights per role (focused, not exhaustive)
-
Specific metrics > vague claims (SMART: Specific, Measurable, Achievable, Relevant)
-
Technologies and methodologies named
-
Collaboration scope mentioned
-
Key decisions with trade-offs
Product Links in Highlights (IMPORTANT)
Highlights support markdown links — use Product Name to link to shipped products, docs, or proof.
Example with inline product links
highlights:
- "Shipped Advanced API from 0→1, serving 1M+ daily requests"
- "Built Xbox royalties blockchain achieving 99% faster processing"
- "Launched Playground V2 improving developer onboarding by 60%"
Link priority for highlights:
-
Live product/demo URLs (highest credibility)
-
Official documentation
-
NPM/PyPI package pages
-
GitHub repos
-
Case studies or press coverage
Why this matters: Recruiters verify claims. A clickable link to a live product instantly validates the work is real.
Link Extraction (IMPORTANT)
Always extract and verify company/project URLs so viewers can quickly validate the work is real.
Sources for URLs:
-
Company websites mentioned in source docs
-
LinkedIn company pages
-
Product URLs, app store links
-
GitHub repos, documentation sites
-
Press releases, news articles
Link extraction process:
Search source files for URLs
grep -oE 'https?://[^\s<>"]+' [SOURCE_FILES]
Common patterns to look for
- "company.com", "startup.io", "project.xyz"
- LinkedIn: linkedin.com/company/[name]
- GitHub: github.com/[org]
- Product Hunt, Crunchbase links
Validation:
-
Verify URL is still active (not 404)
-
Prefer official company domain over third-party
-
For defunct companies: use archive.org link or LinkedIn
Company Type URL Priority
Active startup Official website (e.g., mempools.com)
Enterprise Company careers/about page
Acquired Parent company or press release
Defunct LinkedIn or archive.org
Confidence Scoring for Experience Data
Track extraction confidence for each data point:
Confidence Criteria Action
⬛⬛⬛ High Exact quote from source, verified metric, named source Use directly
⬛⬛⬜ Medium Synthesized from multiple sources, approximated metric Use with note
⬛⬜⬜ Low Inferred, placeholder metric like "[X%]", uncertain date Flag for user review
Before writing experience file:
-
List all LOW confidence items for user confirmation
-
Mark uncertain metrics with [VERIFY] prefix in draft
-
Only write after user confirms or provides corrections
Pre-Validation Check (Before Writing)
Validate structure matches schema BEFORE writing
ExperienceSchema expects:
jobs:
- company: string # Required role: string # Required period: string # Required: "YYYY – YYYY" format location: string # Required logo: string | null # Optional url: string | null # Optional - but ALWAYS try to populate! highlights: string[] # Required: array of STRINGS (not objects!) tags: string[] # Required
Phase 3: Case Study Mining
Identify projects with case study potential:
Criteria:
-
Clear problem → approach → result arc
-
Quantifiable impact (revenue, users, time saved)
-
Decision points with reasoning
-
Lessons learned (what worked, what didn't)
Target: content/case-studies/NN-slug.md frontmatter
id: [next_available_number] slug: project-slug title: "Project Title" company: "Company Name" year: "YYYY-YY" tags: [Tag1, Tag2, Tag3] duration: "X months" role: "Your Role"
hook: headline: "One-line impact statement" impactMetric: value: "XX%" label: "metric label" subMetrics: - value: "YY" label: "secondary metric" thumbnail: /images/case-study-slug.png
cta: headline: "Call to action question" subtext: "Optional elaboration" action: calendly # or: contact, linkedin linkText: "Let's talk →"
[Markdown body with: Challenge, Approach, Key Decision, Execution, Results, Learnings]
Phase 4: Testimonial Extraction
Distinguishing Personal vs Project Testimonials (CRITICAL)
Personal testimonials (USE these):
-
About the individual: "Working with [Name]...", "[Name] is exceptional at..."
-
Collaboration focused: "...a pleasure to work with", "...always delivered"
-
Skill endorsements: "...best product manager I've worked with"
-
Growth observations: "...grew from... to..."
Project testimonials (DO NOT use as personal):
-
About the project/product: "The blockchain solution transformed..."
-
Company endorsements: "Microsoft's approach to..."
-
Generic team praise: "The team delivered..."
Detection Heuristics
High-confidence personal testimonial signals:
✅ Contains person's name + positive verb: "[Name] led", "[Name] designed" ✅ Uses "I" + relationship: "I worked with", "I hired", "I managed" ✅ Personal qualities: "collaborative", "thoughtful", "proactive" ✅ Recommendation language: "I recommend", "would hire again" ✅ Source: LinkedIn recommendation, performance review
Low-confidence / Project quote signals:
⚠️ No personal name mentioned ⚠️ Focuses on deliverable, not person ⚠️ Source: press release, case study, marketing material ⚠️ Generic praise: "great work", "successful project"
Search Patterns for Testimonials
Phrases indicating personal testimonials
grep -i "working with [name]" [SOURCE] grep -i "I recommend" [SOURCE] grep -i "pleasure to work" [SOURCE] grep -i "[name] is" [SOURCE] grep -i "one of the best" [SOURCE]
LinkedIn recommendation patterns
grep -i "I had the opportunity" [SOURCE] grep -i "I've had the pleasure" [SOURCE]
Testimonial Quality Checklist
Before including a testimonial:
-
Is it about the PERSON, not just the project?
-
Does it include specific details (not generic praise)?
-
Can the source be verified (LinkedIn, named author)?
-
Is the author's relationship clear (manager, peer, report)?
Target: content/testimonials/index.yaml
testimonials:
- quote: > Full testimonial text with specific details about collaboration, impact, or skills demonstrated. author: "Full Name" # Or "Engineering Lead" if anonymized initials: "FN" role: "Their Role" company: "Company Name" avatar: null # Or path to image featured: true # Required boolean - for prominent display
Gap Handling
If no valid personal testimonials found:
-
Flag as HIGH priority gap in gap analysis
-
Suggest manual follow-up actions:
-
Export LinkedIn recommendations
-
Request recommendations from past colleagues
-
Check old emails for praise snippets
-
DO NOT fabricate or use project quotes as personal testimonials
Phase 5: Skills Taxonomy
CRITICAL: Schema Constraint
The skills schema expects SIMPLE STRINGS, not objects!
❌ WRONG - This will fail validation
categories:
- name: "Technical Leadership"
skills:
- name: "Platform Architecture" level: "expert" evidence: "..."
✅ CORRECT - Simple string array
categories:
- name: "Technical Leadership"
skills:
- "Platform Architecture"
- "API Design"
- "System Integration"
Preserving Evidence (Workaround)
Since the schema doesn't support evidence fields, preserve context using YAML comments:
Target: content/skills/index.yaml
categories:
-
name: "Technical Leadership" skills:
Evidence: Designed multi-client validator infra at Anchorage
- "Platform Architecture"
Evidence: Shipped Ankr Advanced APIs, 1M+ daily requests
- "API Design"
-
name: "Web3 & Blockchain" skills:
Evidence: First smart contract at Microsoft (2016)
- "Ethereum"
Evidence: ETF-grade validator provisioning
- "Staking Infrastructure"
Pre-Validation Check
// SkillsSchema from src/lib/schemas.ts: SkillsSchema = z.object({ categories: z.array(z.object({ name: z.string(), skills: z.array(z.string()) // <-- STRINGS ONLY! })) });
Before writing skills file:
-
Verify each skill is a plain string
-
Move any evidence/level data to comments
-
Test parse the YAML structure mentally
Phase 6: Validation & Gap Analysis
Validate all content against Zod schemas
npm run validate
Check for gaps
Gap checklist:
-
All required fields populated
-
No placeholder URLs (demo.com, example.com)
-
Metrics are specific, not vague ("improved" → "improved by 40%")
-
Missing metrics marked with [METRIC] for user to fill (e.g., "Increased revenue by [METRIC]%")
-
Each experience has 3-4 highlights (focused, not exhaustive)
-
Product links included for every shipped product/tool
-
All dates in consistent format (YYYY or YYYY–YY)
Schema Reference
Experience Entry
{ company: string; // Required role: string; // Required period: string; // Required: "YYYY – YYYY" or "YYYY – Present" location: string; // Required logo?: string; // Optional: path to logo url?: string; // Optional: company website - ALWAYS include if available highlights: string[]; // Required: 3-4 items with Product links where possible tags: string[]; // Required: 2-5 tags }
Case Study Frontmatter
{ id: number; // Required: unique incremental slug: string; // Required: URL-friendly title: string; // Required company: string; // Required year: string; // Required: "YYYY" or "YYYY-YY" tags: string[]; // Required duration: string; // Required role: string; // Required hook: { headline: string; impactMetric: { value: string; label: string }; subMetrics?: { value: string; label: string }[]; thumbnail?: string; }; cta: { headline: string; subtext?: string; action: "contact" | "calendly" | "linkedin"; linkText: string; }; // Optional links demoUrl?: string; githubUrl?: string; docsUrl?: string; media?: { type: string; url: string; label?: string }[]; }
Testimonial Entry
{ quote: string; // Required: the testimonial text author: string; // Required: name or title if anonymized initials: string; // Required: for avatar fallback role: string; // Required company: string; // Required avatar?: string; // Optional: image path featured?: boolean; // Optional: for prominent display }
Automation Outputs
After processing, generate:
Data Lineage Document (docs/data-lineage.md )
-
Source file → target field mapping
-
Transformation notes
-
Decisions made during processing
Gap Analysis Report (docs/gap-analysis.md )
-
Missing required fields
-
Thin content areas
-
Recommended improvements
Processing Summary
-
Files processed count
-
Content items created/updated
-
Validation status
Example Workflows
Workflow 1: Obsidian Vault Import
User: "I have an Obsidian vault at ~/Documents/Career-Notes with my work history"
Steps:
- Scan vault for relevant files (work, projects, reviews folders)
- Parse each markdown file, extract structured data
- Map to experience/index.yaml format
- Identify case study candidates
- Extract any testimonial quotes
- Validate and report gaps
Workflow 2: LinkedIn Export + Resume
User: "I have a LinkedIn CSV export and my resume as text"
Steps:
- Parse CSV for positions, skills, recommendations
- Parse resume text for additional details
- Merge and deduplicate
- Enrich highlights with specific metrics
- Create skills taxonomy
- Validate against schemas
Workflow 3: Zip Archive with Mixed Content
User: "I have a zip file with CSV files, text notes, and screenshots"
Steps:
- Extract archive to temp directory
- Inventory all files by type
- Process CSVs for structured data
- Process text files for testimonials/notes
- Catalog images for case study thumbnails
- Consolidate into content structure
- Clean up temp files
Quality Criteria
"Hired-on-Sight" Standard
Every piece of content should pass this test:
-
Specific: Numbers, names, technologies — not vague claims
-
Verified: Can point to source data
-
Impactful: Shows business/user value, not just activity
-
Consistent: Formatting, dates, terminology aligned
-
Complete: No "TODO" or placeholder content
Validation Commands
Full content validation
npm run validate
Build to catch any issues
npm run build
Preview locally
npm run dev
Tips
-
Preserve source references: Note which file each piece of data came from
-
Prioritize metrics: "Shipped feature" < "Shipped feature used by 10K users"
-
Anonymize carefully: If removing names, keep role context
-
Check dates: Ensure no overlapping periods, consistent formatting
-
Cross-reference: Same project might appear in experience AND as case study
-
Images: Store in /public/images/ , reference as /images/filename.png
Dependencies
-
Node.js (for validation scripts)
-
npm run validate command
-
Zod schemas in src/lib/schemas.ts
No additional packages required.