pSEO Data Architecture

Design and implement the structured data layer that feeds all programmatic SEO pages. This is the foundation every other pSEO skill depends on.

Core Principles

Single source of truth: All page data flows from one data layer
SEO-complete models: Every content model includes all fields needed for metadata, schema markup, and linking
Unique slugs by construction: Slug generation enforces uniqueness at the data level
Type safety: All data models are fully typed (TypeScript interfaces/types)
Separation of concerns: Data fetching is decoupled from page rendering

Implementation Steps

1. Define Content Models

Create TypeScript interfaces for each page type using a two-tier model. The lightweight index tier is safe to hold in memory for all pages; the heavy full tier is loaded per-page only.

// Index tier: safe to load all at once (~1KB per page)
interface PageIndex {
  slug: string;               // unique, URL-safe
  title: string;              // page title (50-60 chars target)
  metaDescription: string;    // meta description (150-160 chars target)
  h1: string;                 // primary heading (can differ from title)
  canonicalPath: string;      // canonical URL path
  category: string;           // for hub-spoke and breadcrumbs
  lastModified: string;       // ISO date for sitemap
}

// Full tier: extends PageIndex with heavy fields (~50-500KB per page)
interface BaseSEOContent extends PageIndex {
  introText: string;
  bodyContent: string;
  faqs?: FAQ[];
  relatedSlugs?: string[];
  featuredImage?: SEOImage;
}

Extend BaseSEOContent for each page type with domain-specific fields. The interfaces above show the minimum required fields. See references/content-models.md for the full definitions (which add subcategory, tags, publishedDate, status, and more) and extended type examples (LocationPage, ProductPage, ComparisonPage, CategoryPage).

2. Build the Data-Fetching Layer

Create a centralized data module (e.g., lib/data.ts or src/data/index.ts) that exports:

getAllSlugs() - Returns all valid slugs for static generation. Must handle pagination internally when the data source has 1000+ records (fetch in batches, return the complete list).
getPageData(slug) - Returns full content for a single page
getPagesByCategory(category, opts?) - Returns pages in a category for hub pages. Accept optional limit and offset for paginated hub pages.
getRelatedPages(slug, limit?) - Returns related pages for internal linking
getAllCategories() - Returns all categories for navigation and hubs
getPageCount() - Returns total page count (useful for sitemap splitting and build diagnostics)

All functions must be:

Cached or memoized during build to avoid redundant reads
Typed with explicit return types
Guarded against missing or malformed data
Internally paginated when the data source imposes limits (e.g., CMS APIs with 100-item pages). The consumer should never need to handle pagination — the data layer abstracts it.

3. Implement Slug Generation

Design a slug strategy that:

Produces URL-safe, lowercase, hyphenated strings
Guarantees uniqueness across the entire dataset
Is deterministic (same input always produces same slug)
Includes a collision detection mechanism
Follows a consistent URL hierarchy (e.g., /category/page-slug)

4. Validate Data Integrity

Build a validation function or script that checks:

No duplicate slugs exist
All required fields are present and non-empty
Title and description lengths are within SEO targets
All category references resolve to valid categories
No orphan pages (pages not reachable through any category)

5. Set Up Data Source Integration

Based on the data source ($ARGUMENTS or detected):

JSON files: Create a data/ directory with typed JSON, a loader, and build-time validation.

CMS (headless): Create API client with typed responses, implement caching, handle pagination for 1000+ items.

Database: Create a query layer with connection pooling, implement cursor-based pagination, add query caching.

MDX files: Set up frontmatter schema validation, create a content loader with gray-matter parsing.

API: Create a typed API client, implement rate limiting and retry logic, add response caching.

Scale Limits

The in-memory and file-based patterns in this skill work up to ~10K pages. Beyond that:

10K-50K pages: Requires a database (PostgreSQL, MySQL). In-memory index tier becomes borderline at 50K (~50MB). File-based data sources are too slow.
50K-100K+ pages: Requires database + cache layer (Redis) + cursor-based pagination. getAllSlugs() must use cursor iteration, not array return. Data sufficiency gating prevents generating thin pages.

See pseo-scale for the complete database-backed data layer, sufficiency scoring, and scale-specific patterns.

Memory-Conscious Data Patterns

At 1000+ pages, how data is loaded matters more than what is loaded. A full content model with body text, FAQs, and images can be 50-500KB per page. Loading all pages into memory simultaneously will OOM.

Two-tier data model:

Split the data layer into lightweight index data and full page data. The PageIndex and BaseSEOContent interfaces from section 1 define the two tiers:

getAllSlugs(), getRelatedPages(), getPagesByCategory() — return PageIndex[] (lightweight, ~1KB per page)
getPageData() — returns BaseSEOContent (or an extended type) for a single page (heavy, ~50-500KB per page, only one at a time)

Never do this:

// Loads ALL full content into memory — will OOM at scale
const allPages = await Promise.all(slugs.map(s => getPageData(s)));

Instead:

// Process pages one at a time or in small batches
for (const slug of slugs) {
  const page = await getPageData(slug);
  await processPage(page);
  // page is GC'd after each iteration
}

CMS/API pagination:

Fetch in batches of 100-250 records
Yield or push to an array incrementally — don't hold all API responses in memory simultaneously
If using GraphQL, only request index fields in list queries, full fields in single-item queries

File Organization

lib/
  data/
    index.ts          # public API (re-exports)
    types.ts          # TypeScript interfaces
    fetcher.ts        # data source integration
    slugs.ts          # slug generation and validation
    validation.ts     # data integrity checks
    cache.ts          # build-time caching utilities

Quality Checks

Before considering this complete:

All content models extend BaseSEOContent (which extends PageIndex)
getAllSlugs() returns 0 duplicates
Data validation passes with zero errors
Data layer exports are fully typed with no any
Fetching is memoized for build performance
A test or script can validate the full dataset
Two-tier data model implemented (index data vs. full page data)
No function loads all full page content into memory simultaneously
CMS/API fetching uses batched pagination internally

Relationship to Other Skills

This skill provides the data foundation for:

pseo-templates: Consumes getPageData() and getAllSlugs()
pseo-metadata: Reads title, description, canonical from content models
pseo-schema: Uses structured fields for JSON-LD generation
pseo-linking: Uses getRelatedPages() and category data
pseo-quality-guard: Validates against the content models

pseo-data

Safety Notice

Copy this and send it to your AI assistant to learn