pSEO Data Architecture
Design and implement the structured data layer that feeds all programmatic SEO pages. This is the foundation every other pSEO skill depends on.
Core Principles
- Single source of truth: All page data flows from one data layer
- SEO-complete models: Every content model includes all fields needed for metadata, schema markup, and linking
- Unique slugs by construction: Slug generation enforces uniqueness at the data level
- Type safety: All data models are fully typed (TypeScript interfaces/types)
- Separation of concerns: Data fetching is decoupled from page rendering
Implementation Steps
1. Define Content Models
Create TypeScript interfaces for each page type using a two-tier model. The lightweight index tier is safe to hold in memory for all pages; the heavy full tier is loaded per-page only.
// Index tier: safe to load all at once (~1KB per page)
interface PageIndex {
  slug: string; // unique, URL-safe
  title: string; // page title (50-60 chars target)
  metaDescription: string; // meta description (150-160 chars target)
  h1: string; // primary heading (can differ from title)
  canonicalPath: string; // canonical URL path
  category: string; // for hub-spoke and breadcrumbs
  lastModified: string; // ISO date for sitemap
}
// Full tier: extends PageIndex with heavy fields (~50-500KB per page)
interface BaseSEOContent extends PageIndex {
  introText: string;
  bodyContent: string;
  faqs?: FAQ[];
  relatedSlugs?: string[];
  featuredImage?: SEOImage;
}
Extend BaseSEOContent for each page type with domain-specific fields. The interfaces above show the minimum required fields. See references/content-models.md for the full definitions (which add subcategory, tags, publishedDate, status, and more) and extended type examples (LocationPage, ProductPage, ComparisonPage, CategoryPage).
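As an illustration of extending the base model, the sketch below adds a page type and a helper that projects a full record down to the index tier. The LocationPage fields (city, region), the FAQ and SEOImage shapes, and the toPageIndex helper are all assumptions made for this example; the actual definitions are the ones in references/content-models.md.

```typescript
// Assumed minimal shapes; the real FAQ and SEOImage definitions live in references/content-models.md.
interface FAQ { question: string; answer: string }
interface SEOImage { src: string; alt: string }

interface PageIndex {
  slug: string;
  title: string;
  metaDescription: string;
  h1: string;
  canonicalPath: string;
  category: string;
  lastModified: string;
}

interface BaseSEOContent extends PageIndex {
  introText: string;
  bodyContent: string;
  faqs?: FAQ[];
  relatedSlugs?: string[];
  featuredImage?: SEOImage;
}

// Hypothetical page type: these domain fields are illustrative only.
interface LocationPage extends BaseSEOContent {
  city: string;
  region: string;
}

// Hypothetical helper: copy only the index-tier fields so heavy fields
// (bodyContent, faqs, ...) are not retained when building lists.
function toPageIndex(page: BaseSEOContent): PageIndex {
  const { slug, title, metaDescription, h1, canonicalPath, category, lastModified } = page;
  return { slug, title, metaDescription, h1, canonicalPath, category, lastModified };
}
```

Because LocationPage extends BaseSEOContent, any function that accepts the base type (such as toPageIndex here) also accepts the extended type without casts.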
2. Build the Data-Fetching Layer
Create a centralized data module (e.g., lib/data.ts or src/data/index.ts) that exports:
- getAllSlugs() - Returns all valid slugs for static generation. Must handle pagination internally when the data source has 1000+ records (fetch in batches, return the complete list).
- getPageData(slug) - Returns full content for a single page
- getPagesByCategory(category, opts?) - Returns pages in a category for hub pages. Accept optional limit and offset for paginated hub pages.
- getRelatedPages(slug, limit?) - Returns related pages for internal linking
- getAllCategories() - Returns all categories for navigation and hubs
- getPageCount() - Returns total page count (useful for sitemap splitting and build diagnostics)
All functions must be:
- Cached or memoized during build to avoid redundant reads
- Typed with explicit return types
- Guarded against missing or malformed data
- Internally paginated when the data source imposes limits (e.g., CMS APIs with 100-item pages). The consumer should never need to handle pagination — the data layer abstracts it.
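The requirements above (memoized, explicitly typed, internally paginated) can be sketched as a small factory. This is a minimal sketch under stated assumptions, not a definitive implementation: createDataLayer and the FetchBatch signature are hypothetical, the PageIndex here is trimmed to three fields, and getAllSlugs() returns index records as described in the two-tier section of this document.

```typescript
// Trimmed index record for the example; the full PageIndex has more fields.
interface PageIndex {
  slug: string;
  title: string;
  category: string;
}

// Hypothetical paged source: returns at most `limit` records starting at `offset`.
type FetchBatch = (offset: number, limit: number) => Promise<PageIndex[]>;

function createDataLayer(fetchBatch: FetchBatch, batchSize = 100) {
  // Memoize the in-flight promise so concurrent callers share a single read.
  let indexPromise: Promise<PageIndex[]> | undefined;

  function loadIndex(): Promise<PageIndex[]> {
    if (indexPromise === undefined) {
      indexPromise = (async () => {
        const all: PageIndex[] = [];
        for (let offset = 0; ; offset += batchSize) {
          const batch = await fetchBatch(offset, batchSize);
          all.push(...batch);
          if (batch.length < batchSize) break; // a short batch marks the end
        }
        return all;
      })();
    }
    return indexPromise;
  }

  return {
    async getAllSlugs(): Promise<PageIndex[]> {
      return loadIndex();
    },
    async getPagesByCategory(
      category: string,
      opts: { limit?: number; offset?: number } = {},
    ): Promise<PageIndex[]> {
      const { limit, offset = 0 } = opts;
      const matches = (await loadIndex()).filter((p) => p.category === category);
      return limit === undefined ? matches.slice(offset) : matches.slice(offset, offset + limit);
    },
    async getPageCount(): Promise<number> {
      return (await loadIndex()).length;
    },
  };
}
```

Callers never see offsets or batch sizes from the source; they only see complete results. One caveat of this sketch: a failed fetch leaves the rejected promise memoized, so a production version would likely reset indexPromise on error.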
3. Implement Slug Generation
Design a slug strategy that:
- Produces URL-safe, lowercase, hyphenated strings
- Guarantees uniqueness across the entire dataset
- Is deterministic (same input always produces same slug)
- Includes a collision detection mechanism
- Follows a consistent URL hierarchy (e.g., /category/page-slug)
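One possible slug strategy meeting the criteria above is sketched here. The slugify and assignPaths names and the SlugSource record shape are hypothetical. The sketch detects collisions and fails loudly instead of appending numeric suffixes, on the assumption that order-dependent suffixes would undermine determinism.

```typescript
// Normalize arbitrary text into a lowercase, hyphenated, URL-safe slug.
function slugify(input: string): string {
  return input
    .normalize("NFKD") // split accented letters into base letter + combining mark
    .replace(/[\u0300-\u036f]/g, "") // drop the combining marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse every other run of characters into one hyphen
    .replace(/^-+|-+$/g, ""); // trim leading and trailing hyphens
}

// Hypothetical source record used only for this example.
interface SlugSource {
  id: string;
  title: string;
  category: string;
}

// Returns a map of record id -> URL path (/category/page-slug).
// Throws if a slug is empty or if two records produce the same slug.
function assignPaths(records: SlugSource[]): Map<string, string> {
  const ownerBySlug = new Map<string, string>();
  const pathById = new Map<string, string>();
  for (const record of records) {
    const slug = slugify(record.title);
    if (slug === "") {
      throw new Error(`Record ${record.id} produces an empty slug`);
    }
    const owner = ownerBySlug.get(slug);
    if (owner !== undefined) {
      throw new Error(`Slug collision on "${slug}": records ${owner} and ${record.id}`);
    }
    ownerBySlug.set(slug, record.id);
    pathById.set(record.id, `/${slugify(record.category)}/${slug}`);
  }
  return pathById;
}
```

Failing the build on a collision pushes the fix back to the data (for example, by adding a distinguishing field to the title), which keeps every slug a pure function of its own record.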
4. Validate Data Integrity
Build a validation function or script that checks:
- No duplicate slugs exist
- All required fields are present and non-empty
- Title and description lengths are within SEO targets
- All category references resolve to valid categories
- No orphan pages (pages not reachable through any category)
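The checks above might look like the following sketch. The validatePages function, the ValidationIssue shape, and the rule names are hypothetical. It also assumes a flat model in which a page is reachable only through its category, so an unresolved category doubles as the orphan check, and it treats the length targets from the content model comments as hard limits, which a real project may prefer to report as warnings.

```typescript
interface PageIndex {
  slug: string;
  title: string;
  metaDescription: string;
  h1: string;
  canonicalPath: string;
  category: string;
  lastModified: string;
}

interface ValidationIssue {
  slug: string;
  rule: string;
  message: string;
}

const REQUIRED_FIELDS: (keyof PageIndex)[] = [
  "slug", "title", "metaDescription", "h1", "canonicalPath", "category", "lastModified",
];
const TITLE_TARGET = { min: 50, max: 60 };
const DESCRIPTION_TARGET = { min: 150, max: 160 };

// Input is Partial because unvalidated records may be missing fields.
function validatePages(pages: Partial<PageIndex>[], categories: string[]): ValidationIssue[] {
  const issues: ValidationIssue[] = [];
  const knownCategories = new Set(categories);
  const seenSlugs = new Set<string>();

  for (const page of pages) {
    const slug = page.slug ?? "(missing slug)";
    const report = (rule: string, message: string): void => {
      issues.push({ slug, rule, message });
    };

    if (seenSlugs.has(slug)) report("duplicate-slug", `slug "${slug}" appears more than once`);
    seenSlugs.add(slug);

    for (const field of REQUIRED_FIELDS) {
      const value = page[field];
      if (typeof value !== "string" || value.trim() === "") {
        report("required-field", `${field} is missing or empty`);
      }
    }

    const titleLength = (page.title ?? "").length;
    if (titleLength < TITLE_TARGET.min || titleLength > TITLE_TARGET.max) {
      report("title-length", `title is ${titleLength} chars`);
    }
    const descriptionLength = (page.metaDescription ?? "").length;
    if (descriptionLength < DESCRIPTION_TARGET.min || descriptionLength > DESCRIPTION_TARGET.max) {
      report("description-length", `metaDescription is ${descriptionLength} chars`);
    }

    // An unresolved category also means the page is unreachable from any hub (orphan).
    if (!knownCategories.has(page.category ?? "")) {
      report("unresolved-category", `category "${page.category ?? ""}" is not a known category`);
    }
  }
  return issues;
}
```

Returning a list of issues instead of throwing on the first one lets a build script print every problem in a single run and exit non-zero only if the list is non-empty.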
5. Set Up Data Source Integration
Based on the data source ($ARGUMENTS or detected):
JSON files: Create a data/ directory with typed JSON, a loader, and build-time validation.
CMS (headless): Create API client with typed responses, implement caching, handle pagination for 1000+ items.
Database: Create a query layer with connection pooling, implement cursor-based pagination, add query caching.
MDX files: Set up frontmatter schema validation, create a content loader with gray-matter parsing.
API: Create a typed API client, implement rate limiting and retry logic, add response caching.
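For the API and CMS cases, retry logic and response caching can be layered around whatever client is used. The sketch below shows one way to do that; withRetry, cached, and their option names are hypothetical helpers, the backoff values are arbitrary, and rate limiting is not covered.

```typescript
interface RetryOptions {
  retries?: number; // additional attempts after the first failure
  baseDelayMs?: number; // delay before the first retry; doubles each time
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Run `operation`, retrying with exponential backoff when it rejects.
async function withRetry<T>(operation: () => Promise<T>, options: RetryOptions = {}): Promise<T> {
  const { retries = 3, baseDelayMs = 200 } = options;
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      if (attempt < retries) {
        await sleep(baseDelayMs * 2 ** attempt);
      }
    }
  }
  throw lastError;
}

// Cache results per key for the lifetime of the build process.
// Failed loads are evicted so a later call can try again.
function cached<T>(load: (key: string) => Promise<T>): (key: string) => Promise<T> {
  const cache = new Map<string, Promise<T>>();
  return (key: string): Promise<T> => {
    let entry = cache.get(key);
    if (entry === undefined) {
      entry = load(key).catch((error: unknown) => {
        cache.delete(key);
        throw error;
      });
      cache.set(key, entry);
    }
    return entry;
  };
}
```

The two helpers compose: wrapping a client call as cached((slug) => withRetry(() => client(slug))) gives a loader that retries transient failures and reads each slug at most once per build, where client stands in for the project's own typed fetch function.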
Scale Limits
The in-memory and file-based patterns in this skill work up to ~10K pages. Beyond that:
- 10K-50K pages: Requires a database (PostgreSQL, MySQL). In-memory index tier becomes borderline at 50K (~50MB). File-based data sources are too slow.
- 50K-100K+ pages: Requires database + cache layer (Redis) + cursor-based pagination. getAllSlugs() must use cursor iteration, not array return. Data sufficiency gating prevents generating thin pages.
See pseo-scale for the complete database-backed data layer, sufficiency scoring, and scale-specific patterns.
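To make "cursor iteration, not array return" concrete, here is a minimal sketch using an async generator. The iterateSlugs name, the FetchSlugPage signature, and the opaque string cursor are assumptions for illustration; the database-backed version belongs to pseo-scale.

```typescript
interface SlugPage {
  slugs: string[];
  nextCursor: string | null; // null when there are no more pages
}

// Hypothetical source: returns up to `limit` slugs after the given cursor.
type FetchSlugPage = (cursor: string | null, limit: number) => Promise<SlugPage>;

// Yields slugs one at a time while holding only a single page of results in memory.
async function* iterateSlugs(fetchPage: FetchSlugPage, limit = 1000): AsyncGenerator<string> {
  let cursor: string | null = null;
  do {
    const page: SlugPage = await fetchPage(cursor, limit);
    yield* page.slugs;
    cursor = page.nextCursor;
  } while (cursor !== null);
}
```

Consumers use for await (const slug of iterateSlugs(fetchPage)) and never materialize the full list, so peak memory is bounded by the page size instead of the dataset size.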
Memory-Conscious Data Patterns
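At 1000+ pages, how data is loaded matters more than what is loaded. A full content model with body text, FAQs, and images can be 50-500KB per page, so 1,000 pages already amount to roughly 50-500MB. Loading all of them into memory simultaneously can exhaust the heap and crash the build with an out-of-memory (OOM) error.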
Two-tier data model:
Split the data layer into lightweight index data and full page data. The PageIndex and BaseSEOContent interfaces from section 1 define the two tiers:
- getAllSlugs(), getRelatedPages(), getPagesByCategory(): return PageIndex[] (lightweight, ~1KB per page)
- getPageData(): returns BaseSEOContent (or an extended type) for a single page (heavy, ~50-500KB per page, only one at a time)
Never do this:
// Loads ALL full content into memory — will OOM at scale
const allPages = await Promise.all(slugs.map(s => getPageData(s)));
Instead:
// Process pages one at a time or in small batches
for (const slug of slugs) {
  const page = await getPageData(slug);
  await processPage(page);
  // page goes out of scope here, so it becomes eligible for garbage collection
}
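The "small batches" variant could be sketched as follows. processInBatches is a hypothetical helper; load and handle stand in for getPageData() and processPage(), and the right batch size depends on page weight and available memory.

```typescript
// Load and process pages in fixed-size batches so that at most
// `batchSize` full pages are held in memory at any time.
async function processInBatches<T>(
  slugs: readonly string[],
  batchSize: number,
  load: (slug: string) => Promise<T>,
  handle: (page: T) => Promise<void>,
): Promise<void> {
  if (batchSize < 1) throw new RangeError("batchSize must be at least 1");
  for (let start = 0; start < slugs.length; start += batchSize) {
    const batch = slugs.slice(start, start + batchSize);
    // Loads within a batch run concurrently; batches run sequentially.
    const pages = await Promise.all(batch.map((slug) => load(slug)));
    for (const page of pages) {
      await handle(page);
    }
    // `pages` goes out of scope here, so the batch becomes eligible for garbage collection.
  }
}
```

Compared with the one-at-a-time loop, this trades a bounded amount of extra memory (batchSize pages) for concurrent loading within each batch.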
CMS/API pagination:
- Fetch in batches of 100-250 records
- Yield or push to an array incrementally — don't hold all API responses in memory simultaneously
- If using GraphQL, only request index fields in list queries, full fields in single-item queries
File Organization
lib/
  data/
    index.ts       # public API (re-exports)
    types.ts       # TypeScript interfaces
    fetcher.ts     # data source integration
    slugs.ts       # slug generation and validation
    validation.ts  # data integrity checks
    cache.ts       # build-time caching utilities
Quality Checks
Before considering this complete:
- All content models extend BaseSEOContent (which extends PageIndex)
- getAllSlugs() returns 0 duplicates
- Data validation passes with zero errors
- Data layer exports are fully typed with no any
- Fetching is memoized for build performance
- A test or script can validate the full dataset
- Two-tier data model implemented (index data vs. full page data)
- No function loads all full page content into memory simultaneously
- CMS/API fetching uses batched pagination internally
Relationship to Other Skills
This skill provides the data foundation for:
- pseo-templates: Consumes getPageData() and getAllSlugs()
- pseo-metadata: Reads title, description, canonical from content models
- pseo-schema: Uses structured fields for JSON-LD generation
- pseo-linking: Uses getRelatedPages() and category data
- pseo-quality-guard: Validates against the content models