turbo-architecture

Design and architect Goldsky Turbo pipelines. Use this skill for 'should I use X or Y' decisions: kafka source vs dataset source, streaming vs job mode, which resource size (xs/s/m/l/xl/xxl) for my workload, postgres vs clickhouse vs kafka sink, fan-in vs fan-out data flow, one pipeline vs many, dynamic table vs SQL join, how to handle multi-chain deployments. Also use when the user asks 'what's the best way to...' for a pipeline design problem, or is unsure how to structure their pipeline before building it.

Install skill "turbo-architecture" with this command: npx skills add goldsky-io/goldsky-agent/goldsky-io-goldsky-agent-turbo-architecture

Turbo Pipeline Architecture

Help users make architecture decisions for Turbo pipelines — source types, data flow patterns, resource sizing, sink strategies, streaming vs job mode, dynamic table design, and multi-chain deployment.

Agent Instructions

Step 1: Understand the Requirements

Ask the user about:

  1. What data? — Which chain(s), which events/datasets, historical or real-time only?
  2. Where does it go? — Database, webhook, streaming topic, multiple destinations?
  3. How much volume? — Single contract or all chain activity? How many events/sec?
  4. Latency needs? — Real-time dashboards (sub-second) or analytics (minutes OK)?

Step 2: Recommend an Architecture

Use the patterns and decision guides below to recommend a pipeline architecture. Reference the templates in templates/ as starting points:

  • templates/linear-pipeline.yaml — Simple decode → filter → sink
  • templates/fan-out-pipeline.yaml — One source → multiple sinks
  • templates/fan-in-pipeline.yaml — Multiple events → UNION ALL → sink
  • templates/multi-chain-templated.yaml — Per-chain pipeline pattern

Step 3: Hand Off to Implementation Skills

After the architecture is decided, direct the user to:

  • /turbo-pipelines — To build and deploy the YAML
  • /turbo-transforms — To write the SQL transforms
  • /secrets — To set up sink credentials

Reminder: When presenting complete pipeline YAML as part of architecture recommendations, validate it first with goldsky turbo validate. Templates in templates/ are structural patterns — any customized version must be validated before presenting to the user.


Source Types

Dataset Sources — Historical + Real-Time

Use type: dataset when you need to process historical blockchain data and/or continue streaming new data.

sources:
  my_source:
    type: dataset
    dataset_name: base.erc20_transfers
    version: 1.2.0
    start_at: earliest  # or: latest

Best for: Raw logs, transactions, transfers, blocks — anything where you need historical backfill or decoded event processing.

Available EVM datasets: raw_logs, blocks, raw_transactions, raw_traces, erc20_transfers, erc721_transfers, erc1155_transfers, decoded_logs

Non-EVM chains also supported: Solana (solana.*), Bitcoin (bitcoin.raw.*), Stellar (stellar_mainnet.*)

Kafka Sources — Real-Time Streaming

Note: Kafka sources are used in production pipelines but are not documented in the official Goldsky docs. Use with caution and contact Goldsky support for topic names.

Use type: kafka when consuming from a Goldsky-managed Kafka topic, typically for continuously-updated state like balances.

sources:
  my_source:
    type: kafka
    topic: base.raw.latest_balances_v2

Best for: Balance snapshots, latest state data, high-volume continuous streams.

Key differences from dataset sources:

  • No start_at or version fields
  • Optional fields: filter, include_metadata, starting_offsets
  • Delivers the latest state rather than historical event logs
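A sketch with the optional fields populated (the field names come from the list above, but the values shown for the filter expression and offset mode are assumptions; confirm exact syntax and topic names with Goldsky support):

```yaml
sources:
  my_source:
    type: kafka
    topic: base.raw.latest_balances_v2
    # Optional fields; all values below are illustrative assumptions:
    filter: token_type = 'ERC_20'   # narrow the stream before transforms
    include_metadata: true          # attach topic metadata columns
    starting_offsets: latest        # where to begin consuming
```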

When to Use Which

| Scenario | Source Type | Why |
|---|---|---|
| Decode contract events from logs | dataset | Need raw_logs + _gs_log_decode() |
| Track token transfers | dataset | erc20_transfers has structured data |
| Historical backfill + live | dataset | start_at: earliest processes history |
| Live token balances | kafka | latest_balances_v2 is a streaming topic |
| Real-time state snapshots | kafka | Kafka delivers latest state continuously |
| Only need new data going forward | Either | Dataset with start_at: latest or Kafka |

Data Flow Patterns

Linear Pipeline

The simplest pattern. One source → one or more chained transforms → one sink.

source → transform_a → transform_b → sink

Use when: You have a single data source, single destination, and straightforward processing (decode, filter, reshape).

Example: templates/linear-pipeline.yaml — raw logs → decode → extract trade events → postgres

Resource size: s or m
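As a sketch, the linear shape maps to YAML like this (the dataset version, contract address, and sink details are placeholders; validate any customized version with goldsky turbo validate before use):

```yaml
name: my-linear-pipeline
resource_size: s

sources:
  logs:
    type: dataset
    dataset_name: base.raw_logs
    version: 1.0.0            # placeholder; check the current version
    start_at: earliest

transforms:
  trades:
    type: sql
    primary_key: id
    sql: |
      -- decode/filter step; see /turbo-transforms for decode syntax
      SELECT * FROM logs
      WHERE address = '0x0000000000000000000000000000000000000000'  -- placeholder

sinks:
  db:
    type: postgres
    from: trades
    schema: public
    table: trade_events
    secret_name: MY_POSTGRES
```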

Fan-Out (One Source → Multiple Sinks)

One source feeds multiple transforms, each writing to a different sink.

             ┌→ transform_a → sink_1 (clickhouse)
source ──────┤
             └→ transform_b → sink_2 (webhook)

Use when: You need different views or subsets of the same data going to different destinations — e.g., balances to a warehouse AND token metadata to a webhook.

Example: templates/fan-out-pipeline.yaml — one Kafka source → fungible balances to ClickHouse + all tokens to a webhook

transforms:
  fungible_balances:
    type: sql
    primary_key: id
    sql: |
      SELECT ... FROM latest_balances balance
      WHERE balance.token_type = 'ERC_20' OR balance.token_type IS NULL

  all_tokens:
    type: sql
    primary_key: id
    sql: |
      SELECT ... FROM latest_balances balance
      WHERE balance.token_type IN ('ERC_20', 'ERC_721', 'ERC_1155')

sinks:
  warehouse:
    type: clickhouse
    from: fungible_balances
    # ...
  webhook:
    type: webhook
    from: all_tokens
    # ...

Resource size: m (multiple output paths)

Fan-In (Multiple Events → One Output)

Multiple event types decoded from the same source, normalized to a common schema, then combined with UNION ALL into a single sink.

                ┌→ event_type_a ──┐
source → decode ┤                 ├→ UNION ALL → sink
                └→ event_type_b ──┘

Use when: You want a unified activity feed, combining trades, deposits, withdrawals, transfers, etc. into one table.

Example: templates/fan-in-pipeline.yaml — one raw_logs source → decode → multiple event-type transforms → UNION ALL → ClickHouse

Resource size: l (complex processing with many transforms)
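A sketch of the transform section for this pattern (the decoded table, event names, and columns are hypothetical; note that every branch must project the same column list, since UNION ALL requires matching schemas):

```yaml
transforms:
  deposits:
    type: sql
    primary_key: id
    sql: |
      SELECT id, account, amount, 'deposit' AS activity_type
      FROM decoded WHERE event_name = 'Deposit'

  withdrawals:
    type: sql
    primary_key: id
    sql: |
      SELECT id, account, amount, 'withdrawal' AS activity_type
      FROM decoded WHERE event_name = 'Withdrawal'

  activity_feed:
    type: sql
    primary_key: id
    sql: |
      SELECT * FROM deposits
      UNION ALL
      SELECT * FROM withdrawals
```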

Multi-Chain Fan-In

Multiple sources from different chains combined into a single output.

source_chain_a ──┐
source_chain_b ──┼→ UNION ALL → sink
source_chain_c ──┘

Use when: You want cross-chain analytics or a unified view across chains.

sources:
  eth_transfers:
    type: dataset
    dataset_name: ethereum.erc20_transfers
    version: 1.0.0
    start_at: latest
  base_transfers:
    type: dataset
    dataset_name: base.erc20_transfers
    version: 1.2.0
    start_at: latest

transforms:
  combined:
    type: sql
    primary_key: id
    sql: |
      SELECT *, 'ethereum' AS chain FROM eth_transfers
      UNION ALL
      SELECT *, 'base' AS chain FROM base_transfers

Resource size: m or l depending on chain count


Templated Multi-Chain Deployment

When you need the same pipeline logic across multiple chains, create separate pipeline files per chain rather than one multi-source pipeline. This gives you:

  • Independent lifecycle (deploy/delete per chain)
  • Independent checkpointing (one chain failing doesn't block others)
  • Clearer monitoring per chain

Pattern: Copy the pipeline YAML and swap the chain-specific values:

| Field | Chain A (base) | Chain B (arbitrum) |
|---|---|---|
| name | base-balance-streaming | arbitrum-balance-streaming |
| topic | base.raw.latest_balances_v2 | arbitrum.raw.latest_balances_v2 |
| Source key | base_latest_balances_v2 | arbitrum_latest_balances_v2 |
| Transform SQL | 'base' AS chain | 'arbitrum' AS chain |
| Sink table | base_token_balances | arbitrum_token_balances |

Example: templates/multi-chain-templated.yaml — shows the base chain version; duplicate for each chain.

When to use templated vs multi-source:

| Approach | Pros | Cons |
|---|---|---|
| Templated (per-chain) | Independent lifecycle, clear monitoring | More files to manage |
| Multi-source (one file) | Single deployment, cross-chain UNION possible | Coupled lifecycle, harder to debug |
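Since the per-chain files differ only in the substitutions listed above, generating them with a small script keeps the copies in sync. A hedged sketch (this helper is hypothetical, not part of the Goldsky CLI, and the TEMPLATE text is a simplified stand-in for multi-chain-templated.yaml):

```python
# Hypothetical templating helper. Every {chain} slot below corresponds
# to a row in the substitution table above.
TEMPLATE = """\
name: {chain}-balance-streaming
resource_size: m

sources:
  {chain}_latest_balances_v2:
    type: kafka
    topic: {chain}.raw.latest_balances_v2

transforms:
  balances:
    type: sql
    primary_key: id
    sql: |
      SELECT *, '{chain}' AS chain FROM {chain}_latest_balances_v2

sinks:
  warehouse:
    type: clickhouse
    from: balances
    table: {chain}_token_balances
"""

def render_pipeline(chain: str) -> str:
    """Fill every chain-specific field for one chain."""
    return TEMPLATE.format(chain=chain)

# One rendered pipeline per chain, ready to write out as separate files.
pipelines = {chain: render_pipeline(chain) for chain in ("base", "arbitrum")}
```

Each rendered string can then be saved as its own pipeline file and deployed independently, preserving the per-chain lifecycle benefits above.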

Resource Sizing

Each size doubles the previous tier's CPU and memory:

| Size | Workers | CPU Request | Memory | When to Use |
|---|---|---|---|---|
| xs | — | 0.4 | 0.5 Gi | Small datasets, light testing |
| s | 1 | 0.8 | 1.0 Gi | Testing, simple filters, single source/sink, low volume (default) |
| m | 4 | 1.6 | 2.0 Gi | Multiple sinks, Kafka streaming, moderate transform complexity |
| l | 10 | 3.2 | 4.0 Gi | Multi-event decoding with UNION ALL, high-volume historical backfill |
| xl | 20 | 6.4 | 8.0 Gi | Large chain backfills, complex JOINs (e.g. Solana accounts+transactions) |
| xxl | 40 | 12.8 | 16.0 Gi | Highest throughput needs; up to 6.3M rows/min |

Rules of thumb from production pipelines:

  • Simple filter + single sink → s (default, try this first)
  • Kafka source + multiple sinks OR multiple transforms → m
  • Raw log decoding + 5+ event types + UNION ALL → l
  • Historical backfill of high-volume data → l or xl (can downsize after catch-up)
  • Start small and scale up; conservative sizing avoids paying for unused capacity

Sink Selection

Quick Reference

| Destination | Sink Type | Best For |
|---|---|---|
| Application DB | postgres | Row-level lookups, joins, application serving |
| Real-time aggregates | postgres_aggregate | Balances, counters, running totals via triggers |
| Analytics queries | clickhouse | Large-scale aggregations, time-series data |
| Event processing | kafka | Downstream consumers, event-driven systems |
| Serverless streaming | s2_sink | S2.dev streams, alternative to Kafka |
| Notifications | webhook | Lambda functions, API callbacks, alerts |
| Data lake | s3_sink | Long-term archival, batch processing |
| Testing | blackhole | Validate pipeline without writing data |

Decision Flowchart

What's your primary use case?
│
├─ Application serving (REST/GraphQL API)
│  └─ PostgreSQL ← row-level lookups, joins, strong consistency
│
├─ Analytics / dashboards
│  ├─ Time-series queries → ClickHouse ← columnar, fast aggregations
│  └─ Full-text search → Elasticsearch / OpenSearch
│
├─ Real-time aggregations (balances, counters)
│  └─ PostgreSQL Aggregate ← trigger-based running totals
│
├─ Event-driven downstream processing
│  ├─ Need Kafka ecosystem → Kafka
│  └─ Serverless / simpler → S2 (s2.dev)
│
├─ Notifications / webhooks
│  └─ Webhook ← HTTP POST per event
│
├─ Long-term archival
│  └─ S3 ← object storage, cheapest for bulk data
│
├─ Just testing
│  └─ Blackhole ← validates pipeline without writing
│
└─ Multiple of the above
   └─ Use multiple sinks in the same pipeline (fan-out pattern)

PostgreSQL Aggregate Sink

The postgres_aggregate sink is uniquely suited for real-time running aggregations (balances, counters, totals). It uses a two-table pattern: a landing table that receives raw events, and an aggregation table maintained by a database trigger.

sinks:
  token_balances:
    type: postgres_aggregate
    from: transfers
    schema: public
    landing_table: transfer_events
    agg_table: account_balances
    primary_key: id
    secret_name: MY_POSTGRES
    group_by:
      account:
        type: text
      token_address:
        type: text
    aggregate:
      balance:
        from: amount
        fn: sum
      transfer_count:
        from: id
        fn: count

Supported aggregation functions: sum, count, avg, min, max
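To make the two-table pattern concrete, here is a self-contained illustration of the trigger mechanics using SQLite (a conceptual model only: the real sink targets PostgreSQL and generates its own trigger; table and column names follow the YAML example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transfer_events (        -- landing table: raw rows from the pipeline
  id TEXT PRIMARY KEY,
  account TEXT,
  token_address TEXT,
  amount INTEGER
);
CREATE TABLE account_balances (       -- aggregation table: running totals
  account TEXT,
  token_address TEXT,
  balance INTEGER DEFAULT 0,
  transfer_count INTEGER DEFAULT 0,
  PRIMARY KEY (account, token_address)
);
-- The trigger maintains the aggregates as each event lands.
CREATE TRIGGER maintain_balances AFTER INSERT ON transfer_events
BEGIN
  INSERT OR IGNORE INTO account_balances (account, token_address)
  VALUES (NEW.account, NEW.token_address);
  UPDATE account_balances
     SET balance = balance + NEW.amount,
         transfer_count = transfer_count + 1
   WHERE account = NEW.account AND token_address = NEW.token_address;
END;
""")

# Three transfers land; the trigger keeps account_balances current.
conn.executemany(
    "INSERT INTO transfer_events VALUES (?, ?, ?, ?)",
    [("t1", "alice", "0xtoken", 100),
     ("t2", "alice", "0xtoken", 50),
     ("t3", "bob",   "0xtoken", 25)],
)
rows = conn.execute(
    "SELECT account, balance, transfer_count FROM account_balances ORDER BY account"
).fetchall()
print(rows)  # → [('alice', 150, 2), ('bob', 25, 1)]
```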

Multi-Sink Considerations

  • Each sink declares its input with a from: field — different sinks can read from different transforms
  • Sinks are independent — one failing doesn't block others
  • Use different batch_size / batch_flush_interval per sink based on latency needs
  • ClickHouse supports parallelism: N for concurrent writers (default 1)
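A sketch of per-sink tuning (the field names come from the list above, but the specific values and the duration format are assumptions):

```yaml
sinks:
  warehouse:
    type: clickhouse
    from: fungible_balances
    parallelism: 4             # concurrent writers (default 1)
    batch_size: 10000          # large batches; analytics tolerates latency
    batch_flush_interval: 30s  # duration format is an assumption
  webhook:
    type: webhook
    from: all_tokens
    batch_size: 1              # per-event delivery for low latency
```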

Webhook Sinks Without Secrets

Webhooks can use a direct URL instead of a secret when no auth headers are needed:

sinks:
  my_webhook:
    type: webhook
    from: my_transform
    url: https://my-lambda.us-west-2.on.aws/

Pipeline Splitting Decisions

One Pipeline vs. Multiple

Use one pipeline when:

  • All data comes from the same source
  • Transforms share intermediate results (e.g., a shared decode step)
  • You want atomic deployment of the whole flow

Split into multiple pipelines when:

  • Different data sources with no shared transforms
  • Different lifecycle needs (one is stable, another changes frequently)
  • Different resource requirements (one needs l, another needs s)
  • Different chains with independent processing (templated pattern)

Keeping Pipelines Focused

A pipeline should ideally do one logical thing:

| Pipeline | Focus |
|---|---|
| dex-trades | Trade events → Postgres |
| dex-activities | All activity types → ClickHouse DWH |
| token-balances | Token balances → Postgres |
| base-balance-streaming | Base balances → ClickHouse + webhook |
Even though trades are a subset of activities, they're separate pipelines because they serve different consumers (application DB vs data warehouse).


Streaming vs Job Mode

Turbo pipelines have two execution modes:

Streaming Mode (Default)

name: my-streaming-pipeline
resource_size: s
# job: false  (default — omit this field)

  • Runs continuously, processing data as it arrives
  • Maintains checkpoints for exactly-once processing
  • Use for real-time feeds, dashboards, APIs

Job Mode (One-Time Batch)

name: my-backfill-job
resource_size: l
job: true

  • Runs to completion and stops automatically
  • Auto-deletes resources ~1 hour after completion
  • Must be deleted before redeploying — a job pipeline cannot be updated; run goldsky turbo delete first
  • Cannot use restart — use delete + apply instead
  • Use for historical backfills, one-time data migrations, snapshot exports

When to Use Which

| Scenario | Mode | Why |
|---|---|---|
| Real-time dashboard | Streaming | Continuous updates needed |
| Backfill 6 months of history | Job | One-time, stops when done |
| Real-time + catch-up on deploy | Streaming | start_at: earliest does backfill then streams |
| Export data to S3 once | Job | No need for continuous processing |
| Webhook notifications on events | Streaming | Needs to react as events happen |
| Load test with historical data | Job | Process and inspect, then discard |

Job Mode with Bounded Ranges

Combine job mode with start_at: earliest and an end_block to process a specific range:

name: historical-export
resource_size: l
job: true

sources:
  logs:
    type: dataset
    dataset_name: ethereum.raw_logs
    version: 1.0.0
    start_at: earliest
    end_block: 19000000
    filter: >-
      address = '0xdac17f958d2ee523a2206206994597c13d831ec7'

Dynamic Table Architecture

Dynamic tables enable runtime-updatable lookup data within a pipeline. They're the Turbo answer to the "no joins in streaming SQL" limitation.

Pattern: Dynamic Allowlist/Blocklist

                    ┌──────────────────────┐
                    │   External Updates   │
                    │  (Postgres / REST)   │
                    └──────────┬───────────┘
                               ▼
source ──→ sql transform ──→ [dynamic_table_check()] ──→ sink

The SQL transform filters records against the dynamic table. The table contents can be updated externally without pipeline restart.

Pattern: Lookup Enrichment

source ──→ decode ──→ filter ──→ sql (with dynamic_table_check) ──→ sink
                                        ▲
                              [token_metadata table]
                              (postgres-backed)

Store metadata (token symbols, decimals, protocol names) in a PostgreSQL table. Reference it in transforms for enrichment.

Backend Decisions

| Backend | backend_type | When to Use |
|---|---|---|
| PostgreSQL | Postgres | Data managed by external systems, shared across pipeline restarts |
| In-memory | InMemory | Auto-populated from pipeline data, ephemeral, fastest lookups |

Sizing Considerations

  • Dynamic tables add memory overhead proportional to table size
  • For large lookup tables (>100K rows), use Postgres backend
  • For small, frequently-changing lists (<10K rows), InMemory is faster
  • Dynamic table queries are async — they add slight latency per record

For full dynamic table configuration syntax and examples, see /turbo-transforms.


Related

  • /turbo-builder — Build and deploy pipelines interactively using these architecture patterns
  • /turbo-doctor — Diagnose and fix pipeline issues
  • /turbo-pipelines — Pipeline YAML configuration reference
  • /turbo-transforms — SQL, TypeScript, and dynamic table transform reference
  • /datasets — Blockchain dataset and chain prefix reference
  • /secrets — Sink credential management
  • /turbo-monitor-debug — Monitoring and debugging reference
  • /turbo-lifecycle — Pipeline lifecycle command reference
