# Spice Data Accelerators
Accelerators materialize data locally from connected sources for faster queries and reduced load on source systems.
## Basic Configuration

```yaml
datasets:
  - from: postgres:my_table
    name: my_table
    acceleration:
      enabled: true
      engine: duckdb # arrow, duckdb, sqlite, cayenne, postgres, turso
      mode: memory # memory or file
      refresh_check_interval: 1h
```
## Choosing an Accelerator
| Use Case | Engine | Why |
|---|---|---|
| Small datasets (<1 GB), max speed | arrow | In-memory, lowest latency |
| Medium datasets (1-100 GB), complex SQL | duckdb | Mature SQL, memory management |
| Large datasets (100 GB-1+ TB), analytics | cayenne | Built on Vortex (Linux Foundation), 10-20x faster scans |
| Point lookups on large datasets | cayenne | 100x faster random access vs Parquet |
| Simple queries, low resource usage | sqlite | Lightweight, minimal overhead |
| Async operations, concurrent workloads | turso | Native async, modern connection pooling |
| External database integration | postgres | Leverage existing PostgreSQL infra |
### Cayenne vs DuckDB
Choose Cayenne when datasets exceed ~1 TB, multi-file ingestion is needed, or point lookups are common. Choose DuckDB when datasets are under ~1 TB, complex SQL (window functions, CTEs) is needed, or DuckDB tooling is beneficial.
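As an illustration, a large analytics dataset could be accelerated with Cayenne. This is a sketch using only the fields from the basic configuration above; note that cayenne supports file mode only (see the engine table below):

```yaml
datasets:
  - from: postgres:events
    name: events
    acceleration:
      enabled: true
      engine: cayenne
      mode: file # cayenne is file-based; memory mode is not supported
```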
## Supported Engines

| Engine | Mode | Status |
|---|---|---|
| arrow | memory | Stable |
| duckdb | memory, file | Stable |
| sqlite | memory, file | Release Candidate |
| cayenne | file | Beta |
| postgres | N/A (attached) | Release Candidate |
| turso | memory, file | Beta |
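Switching engines is typically a one-line change. For instance, the async-native turso engine from the table above could be configured as follows (a sketch reusing only fields shown in the basic configuration; turso-specific tuning parameters are not covered here):

```yaml
acceleration:
  enabled: true
  engine: turso # native async, modern connection pooling
  mode: file
  refresh_check_interval: 1h
```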
## Refresh Modes

| Mode | Description | Use Case |
|---|---|---|
| full | Complete dataset replacement on each refresh | Small, slowly-changing datasets |
| append (batch) | Adds new records based on a time_column | Append-only logs, time-series data |
| append (stream) | Continuous streaming without a time column | Real-time event streams (Kafka, Debezium) |
| changes | CDC-based incremental updates via Debezium or DynamoDB Streams | Frequently updated transactional data |
| caching | Request-based row-level caching | API responses, HTTP endpoints |
```yaml
# Full refresh every 8 hours
acceleration:
  refresh_mode: full
  refresh_check_interval: 8h
```

```yaml
# Append mode: check for new records from the last day every 10 minutes.
# time_column is a dataset-level field, as in the examples below.
datasets:
  - from: postgres:my_table
    name: my_table
    time_column: created_at
    acceleration:
      refresh_mode: append
      refresh_check_interval: 10m
      refresh_data_window: 1d
```

```yaml
# Continuous ingestion using Kafka
acceleration:
  refresh_mode: append
```

```yaml
# CDC with Debezium or DynamoDB Streams
acceleration:
  refresh_mode: changes
```
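The remaining mode from the table above, request-based caching, would follow the same pattern. This is a sketch only: the exact value `caching` and any associated tuning options should be verified against the Spice documentation for your runtime version:

```yaml
# Request-based row-level caching for API-backed datasets (assumed syntax)
acceleration:
  refresh_mode: caching
```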
## Common Configurations

### In-Memory with Interval Refresh

```yaml
acceleration:
  enabled: true
  engine: arrow
  refresh_check_interval: 5m
```
### File-Based with Append and Time Window

```yaml
datasets:
  - from: postgres:events
    name: events
    time_column: created_at
    acceleration:
      enabled: true
      engine: duckdb
      mode: file
      refresh_mode: append
      refresh_check_interval: 1h
      refresh_data_window: 7d
```
### With Retention Policy

Retention policies prevent unbounded growth of accelerated datasets. Spice supports time-based and custom SQL-based retention strategies:

```yaml
datasets:
  - from: postgres:events
    name: events
    time_column: created_at
    acceleration:
      enabled: true
      engine: duckdb
      retention_check_enabled: true
      retention_period: 30d
      retention_check_interval: 1h
```
### With SQL-Based Retention

```yaml
acceleration:
  retention_check_enabled: true
  retention_check_interval: 1h
  retention_sql: "DELETE FROM logs WHERE status = 'archived'"
```
### With Indexes (DuckDB, SQLite, Turso)

```yaml
acceleration:
  enabled: true
  engine: sqlite
  indexes:
    user_id: enabled
    '(created_at, status)': unique
  primary_key: id
```
## Engine-Specific Parameters

### DuckDB

```yaml
acceleration:
  engine: duckdb
  mode: file
  params:
    duckdb_file: ./data/cache.db
```
### SQLite

```yaml
acceleration:
  engine: sqlite
  mode: file
  params:
    sqlite_file: ./data/cache.sqlite
```
## Constraints and Indexes

Accelerated datasets support primary key constraints and indexes:

```yaml
acceleration:
  enabled: true
  engine: duckdb
  primary_key: order_id # Creates non-null unique index
  indexes:
    customer_id: enabled # Single column index
    '(created_at, status)': unique # Multi-column unique index
```
## Snapshots (DuckDB, SQLite & Cayenne file mode)

Bootstrap file-based accelerations from S3 or filesystem snapshots on startup. This dramatically reduces cold-start latency in distributed deployments.

Snapshot triggers vary by refresh mode:

- `refresh_complete`: Creates snapshots after each refresh (full and batch-append modes)
- `time_interval`: Creates snapshots on a fixed schedule (all refresh modes)
- `stream_batches`: Creates snapshots after every N batches (streaming modes: Kafka, Debezium, DynamoDB Streams)
```yaml
snapshots:
  enabled: true
  location: s3://my_bucket/snapshots/
  bootstrap_on_failure_behavior: warn # warn | retry | fallback
  params:
    s3_auth: iam_role
```
Per-dataset opt-in:
```yaml
acceleration:
  enabled: true
  engine: duckdb
  mode: file
  snapshots:
    enabled: true
```
## Memory Considerations

When using `mode: memory` (the default), the entire dataset is loaded into RAM. Provision enough memory for the dataset itself plus overhead for queries and the runtime. To reduce memory pressure, use `mode: file` with the duckdb, sqlite, turso, or cayenne engines.
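Putting the pieces together, a file-based DuckDB acceleration that combines append refresh, retention, an index, and snapshots might look like this. This is a sketch composed only of parameters shown above; validate it against your spicepod schema before use:

```yaml
datasets:
  - from: postgres:events
    name: events
    time_column: created_at # drives both append refresh and retention
    acceleration:
      enabled: true
      engine: duckdb
      mode: file # keeps the dataset on disk instead of in RAM
      refresh_mode: append
      refresh_check_interval: 10m
      refresh_data_window: 7d
      retention_check_enabled: true
      retention_period: 30d
      retention_check_interval: 1h
      indexes:
        '(created_at, status)': unique
      snapshots:
        enabled: true # opt in to snapshot bootstrap for this dataset
      params:
        duckdb_file: ./data/events.db
```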