bq-query-optimization

BigQuery Query Optimization

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "bq-query-optimization" with this command: npx skills add funnelenvy/agents_webinar_demos/funnelenvy-agents-webinar-demos-bq-query-optimization

BigQuery Query Optimization

Use this skill when writing, debugging, or optimizing BigQuery SQL queries for performance and efficiency.

Query Execution Analysis

Estimating and Inspecting Execution

BigQuery has no EXPLAIN statement. Estimate cost with a dry run, then inspect the stage-level plan from job metadata after the query runs:

-- Estimate bytes scanned without running the query
bq query --dry_run --use_legacy_sql=false 'SELECT * FROM project.dataset.table WHERE condition'

-- Stage-level execution details are available in the console's
-- "Execution details" tab, the Jobs API, or INFORMATION_SCHEMA.JOBS.

What the plan shows:

  • Stages of execution

  • Bytes read per stage

  • Slot time consumed

  • Potential bottlenecks

Performance Best Practices

  1. Avoid SELECT *

❌ Bad (scans all columns):

SELECT * FROM project.dataset.large_table

✅ Good (only needed columns):

SELECT customer_id, amount, date FROM project.dataset.large_table

Impact: Full table scan vs targeted column read. Can reduce data scanned by 90%+.
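The savings scale with the fraction of column bytes you stop reading. A back-of-the-envelope sketch (the per-column byte counts and row count are made-up illustration values, not from any real table):

```python
# Hypothetical bytes per row for each column of a wide table
column_bytes = {
    "customer_id": 8, "amount": 8, "date": 4,
    "raw_payload": 900, "notes": 80,  # wide columns SELECT * would also read
}
rows = 100_000_000

full_scan = sum(column_bytes.values()) * rows
targeted = sum(column_bytes[c] for c in ("customer_id", "amount", "date")) * rows

reduction = 1 - targeted / full_scan
print(f"scanned {targeted / 1e9:.1f} GB instead of {full_scan / 1e9:.1f} GB "
      f"({reduction:.0%} less)")
```

With a few wide columns in the table, dropping them from the SELECT list easily clears the 90% mark.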

  2. Filter Early and Often

❌ Bad (every row flows into the aggregation; filtering happens only afterwards):

SELECT customer_id, SUM(amount) AS total
FROM project.dataset.orders
GROUP BY customer_id
HAVING SUM(amount) > 1000

✅ Good (push row-level predicates into WHERE so less data reaches GROUP BY):

SELECT customer_id, SUM(amount) AS total
FROM project.dataset.orders
WHERE amount > 100  -- Filter early (note: this also changes which rows are summed)
GROUP BY customer_id
HAVING SUM(amount) > 1000

Use WHERE for row-level conditions and reserve HAVING for conditions on aggregates.

  3. Use Partitioned Tables

Without partition filter:

-- Scans the entire table (order_date is not the partition column)
SELECT * FROM project.dataset.orders
WHERE order_date >= '2024-01-01'

With partition filter:

-- Only scans relevant partitions
SELECT * FROM project.dataset.orders
WHERE DATE(order_timestamp) >= '2024-01-01'  -- Partition column

Key: Filter on the partition column for automatic partition pruning.

  4. Break Complex Queries

❌ Anti-pattern (one huge query):

SELECT ...
FROM (
  SELECT ...
  FROM (
    SELECT ...  -- Deeply nested
  )
)
WHERE ...

✅ Good (use CTEs):

WITH base_data AS (
  SELECT customer_id, amount, date
  FROM project.dataset.orders
  WHERE date >= '2024-01-01'
),
aggregated AS (
  SELECT customer_id, SUM(amount) AS total
  FROM base_data
  GROUP BY customer_id
)
SELECT * FROM aggregated WHERE total > 1000

✅ Better (multi-statement with temp tables):

CREATE TEMP TABLE base_data AS
SELECT customer_id, amount, date
FROM project.dataset.orders
WHERE date >= '2024-01-01';

CREATE TEMP TABLE aggregated AS
SELECT customer_id, SUM(amount) AS total
FROM base_data
GROUP BY customer_id;

SELECT * FROM aggregated WHERE total > 1000;

  5. JOIN Optimization

Put largest table first:

-- ✅ Large table first
SELECT l.*, s.detail
FROM project.dataset.large_table l
JOIN project.dataset.small_table s ON l.id = s.id

Use clustering on JOIN columns:

  • Cluster tables on frequently joined columns

  • BigQuery can prune data blocks more effectively

Consider ARRAY/STRUCT for 1:many:

-- Instead of a JOIN for 1:many relationships
SELECT
  order_id,
  ARRAY_AGG(STRUCT(product_id, quantity, price)) AS items
FROM project.dataset.order_items
GROUP BY order_id

  6. Leverage Automatic Features

BigQuery automatically performs:

  • Query rewrites - Optimizes query structure

  • Partition pruning - With proper filters

  • Dynamic filtering - Reduces data scanned

Ensure your queries enable these:

  • Filter on partition columns

  • Use simple, clear predicates

  • Avoid functions on partition columns in WHERE clause

Parameterized Queries

CLI Syntax

bq query \
  --use_legacy_sql=false \
  --parameter=start_date:DATE:2024-01-01 \
  --parameter=end_date:DATE:2024-12-31 \
  --parameter=min_amount:FLOAT64:100.0 \
  'SELECT * FROM project.dataset.orders
   WHERE order_date BETWEEN @start_date AND @end_date
     AND amount >= @min_amount'

Python Syntax

from google.cloud import bigquery

client = bigquery.Client()

query = """ SELECT customer_id, SUM(amount) as total FROM project.dataset.orders WHERE order_date >= @start_date GROUP BY customer_id """

job_config = bigquery.QueryJobConfig( query_parameters=[ bigquery.ScalarQueryParameter("start_date", "DATE", "2024-01-01") ] )

results = client.query_and_wait(query, job_config=job_config)

Key points:

  • Use @ prefix for named parameters

  • Syntax: name:TYPE:value or name::value (STRING default)

  • Parameters cannot stand in for identifiers (column or table names)

  • Only works with standard SQL

User-Defined Functions (UDFs)

SQL UDF (Simple)

CREATE TEMP FUNCTION CleanEmail(email STRING)
RETURNS STRING AS (
  LOWER(TRIM(email))
);

SELECT CleanEmail(customer_email) as email FROM project.dataset.customers;

JavaScript UDF (Complex Logic)

CREATE TEMP FUNCTION ParseUserAgent(ua STRING)
RETURNS STRUCT<browser STRING, version STRING>
LANGUAGE js AS r"""
  var match = ua.match(/(Chrome|Firefox|Safari)\/(\d+)/);
  return {
    browser: match ? match[1] : 'Unknown',
    version: match ? match[2] : '0'
  };
""";

SELECT ParseUserAgent(user_agent).browser as browser FROM project.dataset.sessions;

Limitations:

  • INT64 unsupported in JavaScript (use FLOAT64 or STRING)

  • JavaScript doesn't support 64-bit integers natively

Persistent UDFs

-- Create once, use many times
CREATE FUNCTION project.dataset.clean_email(email STRING)
RETURNS STRING AS (LOWER(TRIM(email)));

-- Use anywhere
SELECT project.dataset.clean_email(email) FROM ...

Scripting & Procedural Language

Variables

DECLARE total_orders INT64;
SET total_orders = (SELECT COUNT(*) FROM project.dataset.orders);

SELECT total_orders;

Loops

LOOP:

DECLARE x INT64 DEFAULT 0;
LOOP
  SET x = x + 1;
  IF x >= 10 THEN
    LEAVE;
  END IF;
END LOOP;

WHILE:

DECLARE x INT64 DEFAULT 0;
WHILE x < 10 DO
  SET x = x + 1;
END WHILE;

FOR with arrays:

DECLARE ids ARRAY<STRING>;
SET ids = ['id1', 'id2', 'id3'];

FOR item IN (SELECT * FROM UNNEST(ids) AS id) DO
  -- Process each id via the loop variable
  SELECT item.id;
END FOR;

Query Caching

Automatic caching (24 hours):

  • Identical queries serve cached results (free)

  • No additional cost

  • Instant response

To bypass cache:

bq query --use_cache=false 'SELECT...'

Common Anti-Patterns

❌ Using LIMIT to reduce cost

-- LIMIT doesn't reduce data scanned or cost!
SELECT * FROM project.dataset.huge_table LIMIT 10

Impact: Still scans entire table. Use WHERE filters instead.

❌ Functions on partition columns

-- Prevents partition pruning
WHERE CAST(date_column AS STRING) = '2024-01-01'

✅ Better:

WHERE date_column = DATE('2024-01-01')

❌ Cross joins without filters

-- Cartesian product = huge result
SELECT * FROM table1 CROSS JOIN table2

Impact: Can generate millions/billions of rows.

❌ Correlated subqueries

-- Runs the subquery for each row
SELECT * FROM orders o
WHERE amount > (SELECT AVG(amount) FROM orders WHERE customer_id = o.customer_id)

✅ Better (use window functions):

SELECT *
FROM (
  SELECT *,
    AVG(amount) OVER (PARTITION BY customer_id) AS avg_amount
  FROM orders
)
WHERE amount > avg_amount
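The window-function rewrite computes each partition's average in one pass and then filters. The same logic in plain Python (toy data with hypothetical customer IDs) makes the equivalence concrete:

```python
from collections import defaultdict

orders = [("c1", 50.0), ("c1", 150.0), ("c2", 80.0), ("c2", 80.0)]

# Pass 1: AVG(amount) OVER (PARTITION BY customer_id)
amounts = defaultdict(list)
for customer_id, amount in orders:
    amounts[customer_id].append(amount)
avg = {cid: sum(v) / len(v) for cid, v in amounts.items()}

# Pass 2: keep rows where amount exceeds the partition average
above_avg = [(cid, amt) for cid, amt in orders if amt > avg[cid]]
print(above_avg)  # c1's average is 100, so only (c1, 150.0) survives
```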

Common SQL Gotchas

CTE Re-execution (CRITICAL COST ISSUE)

Problem: When a CTE is referenced multiple times, BigQuery re-executes it each time, billing you multiple times.

❌ Bad (CTE runs 3 times - billed 3x):

WITH expensive_cte AS (
  SELECT * FROM project.dataset.huge_table
  WHERE complex_conditions AND lots_of_joins
)
SELECT COUNT(*) FROM expensive_cte
UNION ALL
SELECT SUM(amount) FROM expensive_cte
UNION ALL
SELECT MAX(date) FROM expensive_cte;

Impact: If the CTE scans 10 TB, you're billed for 30 TB (10 TB × 3).

✅ Good (use temp table - billed 1x):

CREATE TEMP TABLE expensive_data AS
SELECT * FROM project.dataset.huge_table
WHERE complex_conditions AND lots_of_joins;

SELECT COUNT(*) FROM expensive_data
UNION ALL
SELECT SUM(amount) FROM expensive_data
UNION ALL
SELECT MAX(date) FROM expensive_data;

When CTEs are OK:

  • Referenced only once

  • Very small result set

  • Part of larger query (BigQuery may optimize)

When to use temp tables:

  • CTE referenced 2+ times

  • Large data volumes

  • Complex/expensive CTE query
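The billing arithmetic is simple enough to sanity-check before running: scanned bytes multiply by the number of times the CTE is re-executed. A sketch with an assumed on-demand price (the $6.25/TiB figure is only an illustration; check current pricing):

```python
PRICE_PER_TIB = 6.25  # illustrative on-demand price, not authoritative

def tib_billed(cte_scan_tib: float, references: int, use_temp_table: bool) -> float:
    """A temp table scans the source once; a re-executed CTE scans once per reference."""
    scans = 1 if use_temp_table else references
    return cte_scan_tib * scans

cte_cost = tib_billed(10, references=3, use_temp_table=False) * PRICE_PER_TIB
temp_cost = tib_billed(10, references=3, use_temp_table=True) * PRICE_PER_TIB
print(f"CTE: ${cte_cost:.2f}, temp table: ${temp_cost:.2f}")
```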

NOT IN with NULL Values (SILENT FAILURE)

Problem: NOT IN returns NOTHING (empty result) if ANY NULL exists in the subquery.

❌ Broken (returns empty if blocked_customers has any NULL):

SELECT * FROM customers
WHERE customer_id NOT IN (
  SELECT customer_id FROM blocked_customers  -- If ANY value is NULL, returns 0 rows!
);

Why it fails:

  • SQL three-valued logic: TRUE, FALSE, UNKNOWN

  • NULL IN (...) evaluates to UNKNOWN

  • NOT UNKNOWN is still UNKNOWN

  • Rows with UNKNOWN are filtered out

✅ Solution 1: Use NOT EXISTS (safest):

SELECT * FROM customers c
WHERE NOT EXISTS (
  SELECT 1 FROM blocked_customers b
  WHERE b.customer_id = c.customer_id
);

✅ Solution 2: Filter NULLs explicitly:

SELECT * FROM customers
WHERE customer_id NOT IN (
  SELECT customer_id FROM blocked_customers
  WHERE customer_id IS NOT NULL  -- Explicit NULL filter
);

✅ Solution 3: Use LEFT JOIN:

SELECT c.*
FROM customers c
LEFT JOIN blocked_customers b ON c.customer_id = b.customer_id
WHERE b.customer_id IS NULL;

Best practice: prefer NOT EXISTS; it's clearer, safer, and often faster.
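The three-valued logic above can be reproduced outside SQL to see exactly why zero rows come back. A small Python model where None stands in for SQL NULL/UNKNOWN:

```python
def sql_in(value, subquery):
    """SQL IN: TRUE on a match, UNKNOWN (None) if no match but a NULL is present."""
    if any(v is not None and v == value for v in subquery):
        return True
    if any(v is None for v in subquery):
        return None  # UNKNOWN
    return False

def sql_not_in(value, subquery):
    r = sql_in(value, subquery)
    return None if r is None else not r  # NOT UNKNOWN is still UNKNOWN

customers = [1, 2, 3]
blocked = [2, None]  # a single NULL poisons NOT IN

# WHERE keeps only rows whose predicate is TRUE; UNKNOWN rows are dropped
kept = [c for c in customers if sql_not_in(c, blocked) is True]
print(kept)  # [] - every non-matching row evaluates to UNKNOWN, not TRUE
```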

DML Statement Performance

Problem: BigQuery is optimized for analytics (OLAP), not transactional updates (OLTP). DML statements are slow and expensive.

Why DML is slow in BigQuery:

  • Columnar storage (not row-based)

  • Designed for bulk reads, not individual updates

  • No indexes for fast row lookups

  • Every update rewrites affected partitions

❌ Very slow (row-by-row updates):

-- Don't do this - single-row DML is slow and expensive
UPDATE project.dataset.orders
SET status = 'processed'
WHERE order_id = '12345';

-- This is even worse - runs one UPDATE per row
FOR record IN (SELECT order_id FROM orders_to_update) DO
  UPDATE orders
  SET status = 'processed'
  WHERE order_id = record.order_id;
END FOR;

⚠️ Better (batch updates):

UPDATE project.dataset.orders
SET status = 'processed'
WHERE order_id IN (SELECT order_id FROM orders_to_update);

✅ Best (recreate table - fastest):

CREATE OR REPLACE TABLE project.dataset.orders AS
SELECT
  * EXCEPT(status),
  CASE
    WHEN order_id IN (SELECT order_id FROM orders_to_update) THEN 'processed'
    ELSE status
  END AS status
FROM project.dataset.orders;

For INSERT/UPSERT - use MERGE:

MERGE project.dataset.customers AS target
USING project.dataset.customer_updates AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET name = source.name, updated_at = CURRENT_TIMESTAMP()
WHEN NOT MATCHED THEN
  INSERT (customer_id, name, created_at)
  VALUES (source.customer_id, source.name, CURRENT_TIMESTAMP());

Best practices:

  • Batch updates instead of row-by-row

  • Use MERGE for upserts

  • Consider recreating table for large updates

  • Partition tables to limit update scope

  • Avoid frequent small DML operations

Data Type Gotchas

INT64 is the only integer type:

-- All of these are the same type: INT64
CREATE TABLE example (
  col1 INT64,     -- ✅ Explicit
  col2 INTEGER,   -- Alias for INT64
  col3 INT,       -- Alias for INT64
  col4 SMALLINT,  -- Alias for INT64
  col5 BIGINT     -- Alias for INT64
);

No UUID type - use STRING:

-- PostgreSQL
CREATE TABLE users (id UUID);

-- BigQuery
CREATE TABLE users (id STRING);  -- Store the UUID as a string

NUMERIC precision limits:

-- NUMERIC: 38 digits of precision, 9 decimal places
NUMERIC(38, 9)

-- BIGNUMERIC: 76 digits of precision, 38 decimal places
BIGNUMERIC(76, 38)

-- Example
SELECT
  CAST('12345678901234567890.123456789' AS NUMERIC) AS num,
  CAST('12345678901234567890.123456789' AS BIGNUMERIC) AS bignum;

TIMESTAMP vs DATETIME vs DATE:

-- TIMESTAMP: an absolute point in time, stored in UTC
SELECT CURRENT_TIMESTAMP();  -- 2024-01-15 10:30:45.123456 UTC

-- DATETIME: civil date and time, no timezone
SELECT CURRENT_DATETIME();  -- 2024-01-15 10:30:45.123456

-- DATE: date only
SELECT CURRENT_DATE();  -- 2024-01-15

-- Conversion
SELECT
  TIMESTAMP('2024-01-15 10:30:45'),           -- Assumes UTC
  DATETIME(TIMESTAMP '2024-01-15 10:30:45'),  -- Loses the timezone
  DATE(TIMESTAMP '2024-01-15 10:30:45');      -- Loses the time
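The same three-way split exists in Python's datetime module, which makes the conversions easy to reason about (an analogy sketch, not BigQuery semantics verbatim):

```python
from datetime import datetime, timezone

# TIMESTAMP analog: an absolute instant, timezone-aware (UTC)
ts = datetime(2024, 1, 15, 10, 30, 45, tzinfo=timezone.utc)

# DATETIME analog: same wall-clock fields, timezone dropped
dt = ts.replace(tzinfo=None)

# DATE analog: date only, time dropped
d = ts.date()

print(ts.isoformat())  # 2024-01-15T10:30:45+00:00
print(dt.isoformat())  # 2024-01-15T10:30:45
print(d.isoformat())   # 2024-01-15
```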

Type coercion in JOINs:

-- ❌ A cast in the join condition can prevent optimization
SELECT * FROM table1 t1
JOIN table2 t2 ON t1.id = CAST(t2.id AS STRING);  -- Prevents clustering optimization

-- ✅ Match types explicitly
SELECT * FROM table1 t1
JOIN table2 t2 ON t1.id = t2.id;  -- Both STRING

Window Functions

Window functions perform calculations across a set of rows related to the current row, without collapsing the result set.

Basic Syntax

<function> OVER (
  [PARTITION BY partition_expression]
  [ORDER BY sort_expression]
  [window_frame_clause]
)

Ranking Functions

ROW_NUMBER() - Sequential numbering:

SELECT
  customer_id,
  order_date,
  amount,
  ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS row_num
FROM orders;

Common use: Deduplication

SELECT * EXCEPT(row_num)
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS row_num
  FROM customers
)
WHERE row_num = 1;

RANK() - Rank with gaps:

SELECT product_name, revenue,
  RANK() OVER (ORDER BY revenue DESC) AS rank
FROM products;

-- Results:
-- Product A: $1000, rank 1
-- Product B: $900,  rank 2
-- Product C: $900,  rank 2 (tie)
-- Product D: $800,  rank 4 (gap after tie)

DENSE_RANK() - Rank without gaps:

SELECT product_name, revenue,
  DENSE_RANK() OVER (ORDER BY revenue DESC) AS dense_rank
FROM products;

-- Results:
-- Product A: $1000, rank 1
-- Product B: $900,  rank 2
-- Product C: $900,  rank 2 (tie)
-- Product D: $800,  rank 3 (no gap)
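The gap behavior is easy to pin down by reimplementing both functions over a revenue list sorted descending; this pure-Python sketch reproduces the results above:

```python
def rank(sorted_values):
    """Competition ranking: ties share a rank, the next distinct value skips ahead."""
    out, prev, r = [], None, 0
    for position, v in enumerate(sorted_values, start=1):
        if v != prev:
            r = position  # jump to the current position, leaving a gap after ties
        out.append(r)
        prev = v
    return out

def dense_rank(sorted_values):
    """Dense ranking: ties share a rank, the next distinct value increments by 1."""
    out, prev, r = [], None, 0
    for v in sorted_values:
        if v != prev:
            r += 1  # no gaps
        out.append(r)
        prev = v
    return out

revenues = [1000, 900, 900, 800]
print(rank(revenues))        # [1, 2, 2, 4]
print(dense_rank(revenues))  # [1, 2, 2, 3]
```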

NTILE() - Divide into N buckets:

SELECT customer_id, total_spend, NTILE(4) OVER (ORDER BY total_spend DESC) AS quartile FROM customer_totals;

Analytical Functions

LEAD() and LAG() - Access rows before/after:

-- Time series analysis
SELECT
  date,
  revenue,
  LAG(revenue) OVER (ORDER BY date) AS prev_day_revenue,
  LEAD(revenue) OVER (ORDER BY date) AS next_day_revenue,
  revenue - LAG(revenue) OVER (ORDER BY date) AS daily_change,
  ROUND(100.0 * (revenue - LAG(revenue) OVER (ORDER BY date))
        / LAG(revenue) OVER (ORDER BY date), 2) AS pct_change
FROM daily_sales
ORDER BY date;
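LAG pairs each row with its predecessor, so the day-over-day math reduces to a zip. A sketch with made-up revenue numbers:

```python
revenue = [100.0, 110.0, 99.0]  # hypothetical daily revenue, ordered by date

# LAG(revenue) OVER (ORDER BY date): shift the series down by one row
prev = [None] + revenue[:-1]

daily_change = [None if p is None else round(c - p, 2)
                for c, p in zip(revenue, prev)]
pct_change = [None if p is None else round(100.0 * (c - p) / p, 2)
              for c, p in zip(revenue, prev)]

print(daily_change)  # [None, 10.0, -11.0]
print(pct_change)    # [None, 10.0, -10.0]
```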

With PARTITION BY:

-- Per-customer analysis
SELECT
  customer_id,
  order_date,
  amount,
  LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_order_amount,
  amount - LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS amount_diff
FROM orders;

FIRST_VALUE() and LAST_VALUE():

SELECT
  date,
  revenue,
  FIRST_VALUE(revenue) OVER (
    ORDER BY date
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS first_day_revenue,
  LAST_VALUE(revenue) OVER (
    ORDER BY date
    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
  ) AS last_day_revenue
FROM daily_sales;

NTH_VALUE() - Get Nth value:

SELECT product_id, date, sales, NTH_VALUE(sales, 2) OVER (PARTITION BY product_id ORDER BY date) AS second_day_sales FROM product_sales;

Aggregate Window Functions

SUM/AVG/COUNT as window functions:

SELECT
  date,
  revenue,
  SUM(revenue) OVER (ORDER BY date) AS cumulative_revenue,
  AVG(revenue) OVER (
    ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS moving_avg_7day,
  COUNT(*) OVER (
    ORDER BY date ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
  ) AS rolling_30day_count
FROM daily_sales;
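A `ROWS BETWEEN N PRECEDING AND CURRENT ROW` frame is just a sliding slice that shrinks at the start of the series. A sketch of the moving-average logic (3-row window over toy data):

```python
def moving_avg(values, window):
    """AVG(x) OVER (ORDER BY ... ROWS BETWEEN window-1 PRECEDING AND CURRENT ROW)."""
    out = []
    for i in range(len(values)):
        frame = values[max(0, i - window + 1): i + 1]  # frame shrinks at the start
        out.append(sum(frame) / len(frame))
    return out

print(moving_avg([10, 20, 30, 40], window=3))  # [10.0, 15.0, 20.0, 30.0]
```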

Running totals:

SELECT customer_id, order_date, amount, SUM(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS customer_lifetime_value FROM orders;

QUALIFY Clause (BigQuery-Specific)

QUALIFY filters on window function results - no subquery needed!

❌ Standard SQL (verbose):

SELECT *
FROM (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS row_num
  FROM orders
)
WHERE row_num = 1;

✅ BigQuery with QUALIFY (clean):

SELECT customer_id, order_date, amount
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) = 1;

More QUALIFY examples:

-- Get the top 3 products per category
SELECT category, product_name, revenue
FROM products
QUALIFY RANK() OVER (PARTITION BY category ORDER BY revenue DESC) <= 3;

-- Filter outliers (keep the middle 80%)
SELECT *
FROM measurements
QUALIFY NTILE(10) OVER (ORDER BY value) BETWEEN 2 AND 9;

-- Get the first order per customer
SELECT customer_id, order_date, amount
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date ASC) = 1;

Window Frame Clauses

Control which rows are included in the window:

-- ROWS: physical row count
ROWS BETWEEN 3 PRECEDING AND CURRENT ROW  -- Last 4 rows including current

-- RANGE: logical range based on the ORDER BY value.
-- BigQuery requires a numeric ORDER BY expression for RANGE offsets,
-- so convert dates with UNIX_DATE() to express "last 7 days".

-- Examples:
SELECT
  date,
  sales,
  -- Last 7 rows
  AVG(sales) OVER (ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7rows,
  -- Last 7 days (logical, spans gaps in the data)
  AVG(sales) OVER (ORDER BY UNIX_DATE(date) RANGE BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7days,
  -- All preceding rows
  SUM(sales) OVER (ORDER BY date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_sum,
  -- Centered window (3 before, 3 after)
  AVG(sales) OVER (ORDER BY date ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING) AS centered_avg
FROM daily_sales;

Window Function Performance Tips

  1. Partition appropriately:

-- ✅ Good: the partition filter reduces data scanned
SELECT *
FROM events
WHERE date >= '2024-01-01'  -- Partition filter
QUALIFY ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY timestamp DESC) = 1;

  2. Avoid window functions in WHERE:

-- ❌ Wrong: window functions aren't allowed in WHERE
SELECT * FROM orders
WHERE ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY date) = 1;  -- ERROR

-- ✅ Use QUALIFY instead
SELECT * FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY date) = 1;

  3. Reuse window definitions:

SELECT
  date,
  revenue,
  ROW_NUMBER() OVER w AS row_num,
  RANK() OVER w AS rank,
  AVG(revenue) OVER w AS avg_revenue
FROM sales
WINDOW w AS (PARTITION BY category ORDER BY revenue DESC);

JSON Functions

BigQuery provides rich functions for parsing, extracting, and manipulating JSON data.

Extracting JSON Values

JSON_VALUE() - Extract scalar values (Standard SQL):

SELECT
  JSON_VALUE(json_column, '$.user.name') AS user_name,
  JSON_VALUE(json_column, '$.user.email') AS email,
  CAST(JSON_VALUE(json_column, '$.amount') AS FLOAT64) AS amount,
  CAST(JSON_VALUE(json_column, '$.quantity') AS INT64) AS quantity
FROM events;

JSON_QUERY() - Extract objects or arrays:

SELECT
  JSON_QUERY(json_column, '$.user') AS user_object,
  JSON_QUERY(json_column, '$.items') AS items_array
FROM events;

JSON_EXTRACT() - Legacy, still widely used:

SELECT
  JSON_EXTRACT(json_column, '$.user.name') AS user_name,       -- Returns JSON-formatted STRING
  JSON_EXTRACT_SCALAR(json_column, '$.user.email') AS email    -- Returns STRING
FROM events;

JSONPath syntax (BigQuery supports a limited subset):

-- Dot notation
'$.user.name'

-- Array index
'$.items[0].product_id'

Note: array slices ('$.items[0:3]'), wildcards ('$.users[*]'), and recursive descent ('$..name') are not part of BigQuery's JSONPath subset; extract the array with JSON_EXTRACT_ARRAY or JSON_QUERY_ARRAY and UNNEST it instead.

Working with JSON Arrays

JSON_EXTRACT_ARRAY() - Extract array elements:

SELECT event_id, tag FROM events, UNNEST(JSON_EXTRACT_ARRAY(tags_json, '$')) AS tag;

JSON_VALUE_ARRAY() - Extract array of scalars:

SELECT product_id, tag FROM products, UNNEST(JSON_VALUE_ARRAY(tags_json, '$')) AS tag;

Complete example:

-- JSON: {"product_id": "A123", "tags": ["electronics", "sale", "featured"]} SELECT JSON_VALUE(product_json, '$.product_id') AS product_id, tag FROM products, UNNEST(JSON_VALUE_ARRAY(product_json, '$.tags')) AS tag;

-- Results: -- A123, electronics -- A123, sale -- A123, featured
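The extract-and-unnest pattern maps directly onto Python's json module, which is a handy way to prototype the extraction before writing SQL:

```python
import json

product_json = '{"product_id": "A123", "tags": ["electronics", "sale", "featured"]}'
doc = json.loads(product_json)

# Analog of UNNEST(JSON_VALUE_ARRAY(..., '$.tags')): one row per array element
rows = [(doc["product_id"], tag) for tag in doc["tags"]]
for row in rows:
    print(row)
```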

Creating JSON

TO_JSON_STRING() - Convert to JSON:

SELECT
  customer_id,
  TO_JSON_STRING(STRUCT(name, email, created_at)) AS customer_json
FROM customers;

Create JSON objects:

SELECT TO_JSON_STRING(STRUCT(
  'John' AS name,
  30 AS age,
  ['reading', 'hiking'] AS hobbies,
  STRUCT('123 Main St' AS street, 'Boston' AS city) AS address
)) AS person_json;

-- Result:
-- {"name":"John","age":30,"hobbies":["reading","hiking"],"address":{"street":"123 Main St","city":"Boston"}}
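TO_JSON_STRING's compact output matches what json.dumps produces with whitespace-free separators, which is useful when diffing query results against fixtures:

```python
import json

person = {
    "name": "John",
    "age": 30,
    "hobbies": ["reading", "hiking"],
    "address": {"street": "123 Main St", "city": "Boston"},
}

# separators=(",", ":") removes the spaces json.dumps adds by default
person_json = json.dumps(person, separators=(",", ":"))
print(person_json)
# {"name":"John","age":30,"hobbies":["reading","hiking"],"address":{"street":"123 Main St","city":"Boston"}}
```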

Aggregate into JSON:

SELECT
  customer_id,
  TO_JSON_STRING(
    ARRAY_AGG(
      STRUCT(order_id, amount, date)
      ORDER BY date DESC LIMIT 5
    )
  ) AS recent_orders_json
FROM orders
GROUP BY customer_id;

Parsing JSON Strings

PARSE_JSON() - Convert string to JSON:

SELECT JSON_VALUE(PARSE_JSON('{"name":"Alice","age":25}'), '$.name') AS name;

Safe JSON parsing (avoid errors):

SELECT SAFE.JSON_VALUE(invalid_json, '$.name') AS name  -- Returns NULL on error
FROM events;

Complex JSON Examples

Nested JSON extraction:

-- JSON structure:
-- {
--   "order": {
--     "id": "ORD123",
--     "items": [
--       {"product": "A", "qty": 2, "price": 10.50},
--       {"product": "B", "qty": 1, "price": 25.00}
--     ]
--   }
-- }

SELECT
  JSON_VALUE(data, '$.order.id') AS order_id,
  JSON_VALUE(item, '$.product') AS product,
  CAST(JSON_VALUE(item, '$.qty') AS INT64) AS quantity,
  CAST(JSON_VALUE(item, '$.price') AS FLOAT64) AS price
FROM events,
UNNEST(JSON_EXTRACT_ARRAY(data, '$.order.items')) AS item;

Transform relational data to JSON:

SELECT
  category,
  TO_JSON_STRING(STRUCT(
    category AS category_name,
    COUNT(*) AS product_count,
    ROUND(AVG(price), 2) AS avg_price,
    ARRAY_AGG(STRUCT(product_name, price) ORDER BY price DESC LIMIT 3) AS top_products
  )) AS category_summary
FROM products
GROUP BY category;

JSON Performance Tips

  1. Extract once, reuse:

-- ❌ Bad: multiple extractions from the full document
SELECT
  JSON_VALUE(data, '$.user.id'),
  JSON_VALUE(data, '$.user.name'),
  JSON_VALUE(data, '$.user.email')
FROM events;

-- ✅ Better: extract the object once
WITH extracted AS (
  SELECT JSON_QUERY(data, '$.user') AS user_json FROM events
)
SELECT
  JSON_VALUE(user_json, '$.id'),
  JSON_VALUE(user_json, '$.name'),
  JSON_VALUE(user_json, '$.email')
FROM extracted;

  2. Consider STRUCT columns instead of JSON:

-- If the schema is known and stable, use STRUCT
CREATE TABLE events (
  user STRUCT<id STRING, name STRING, email STRING>,
  timestamp TIMESTAMP
);

-- Query with dot notation (faster than JSON extraction)
SELECT user.id, user.name, user.email FROM events;

  3. Materialize frequently accessed JSON fields:

-- BigQuery has no generated columns, so materialize the field instead
CREATE OR REPLACE TABLE events AS
SELECT *, JSON_VALUE(data, '$.user.id') AS user_id
FROM events;

-- Now queries can filter efficiently
SELECT * FROM events WHERE user_id = 'U123';

ARRAY and STRUCT

BigQuery's native support for nested and repeated data allows for powerful denormalization and performance optimization.

ARRAY Basics

Creating arrays:

SELECT
  [1, 2, 3, 4, 5] AS numbers,
  ['apple', 'banana', 'cherry'] AS fruits,
  [DATE '2024-01-01', DATE '2024-01-02'] AS dates;

ARRAY_AGG() - Aggregate into array:

SELECT
  customer_id,
  ARRAY_AGG(order_id ORDER BY order_date DESC) AS order_ids,
  ARRAY_AGG(amount) AS order_amounts
FROM orders
GROUP BY customer_id;

With LIMIT:

SELECT customer_id, ARRAY_AGG(order_id ORDER BY order_date DESC LIMIT 5) AS recent_order_ids FROM orders GROUP BY customer_id;

UNNEST - Flattening Arrays

Basic UNNEST:

SELECT element
FROM UNNEST(['a', 'b', 'c']) AS element;

-- Results:
-- a
-- b
-- c

UNNEST with table:

-- Table: customers
-- customer_id | order_ids
-- 1           | [101, 102, 103]
-- 2           | [201, 202]

SELECT customer_id, order_id
FROM customers,
UNNEST(order_ids) AS order_id;

-- Results:
-- 1, 101
-- 1, 102
-- 1, 103
-- 2, 201
-- 2, 202

UNNEST with OFFSET (get array index):

SELECT item, idx
FROM UNNEST(['first', 'second', 'third']) AS item WITH OFFSET AS idx;

-- Results:
-- first, 0
-- second, 1
-- third, 2

STRUCT - Nested Records

Creating structs:

SELECT
  STRUCT('John' AS name, 30 AS age, 'Engineer' AS role) AS person,
  STRUCT('123 Main St' AS street, 'Boston' AS city, '02101' AS zip) AS address;

Querying struct fields:

SELECT person.name, person.age, address.city
FROM (
  SELECT
    STRUCT('John' AS name, 30 AS age) AS person,
    STRUCT('Boston' AS city) AS address
);

ARRAY of STRUCT (Most Powerful Pattern)

Create:

SELECT
  customer_id,
  ARRAY_AGG(
    STRUCT(order_id, amount, order_date, status)
    ORDER BY order_date DESC
  ) AS orders
FROM orders
GROUP BY customer_id;

Query:

-- Flatten the array of structs (`order` is a reserved word, so use another alias)
SELECT
  customer_id,
  o.order_id,
  o.amount,
  o.order_date
FROM customers,
UNNEST(orders) AS o
WHERE o.status = 'completed';

Filter array elements:

SELECT
  customer_id,
  ARRAY(
    SELECT AS STRUCT order_id, amount
    FROM UNNEST(orders) AS o
    WHERE o.status = 'completed'
    ORDER BY amount DESC
    LIMIT 3
  ) AS top_completed_orders
FROM customers;

ARRAY Functions

ARRAY_LENGTH():

SELECT customer_id, ARRAY_LENGTH(order_ids) AS total_orders FROM customers;

ARRAY_CONCAT():

SELECT ARRAY_CONCAT([1, 2], [3, 4], [5]) AS combined; -- Result: [1, 2, 3, 4, 5]

ARRAY_TO_STRING():

SELECT ARRAY_TO_STRING(['apple', 'banana', 'cherry'], ', ') AS fruits; -- Result: 'apple, banana, cherry'

GENERATE_ARRAY():

SELECT GENERATE_ARRAY(1, 10) AS numbers; -- Result: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

SELECT GENERATE_ARRAY(0, 100, 10) AS multiples_of_10; -- Result: [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

ARRAY_REVERSE():

SELECT ARRAY_REVERSE([1, 2, 3, 4, 5]) AS reversed; -- Result: [5, 4, 3, 2, 1]

Performance: ARRAY vs JOIN

Traditional approach (2 tables, JOIN):

-- Table 1: customers (1M rows)
-- Table 2: orders (10M rows)

SELECT c.customer_id, c.name, o.order_id, o.amount
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.customer_id = '12345';

-- Scans: customers (1M) + orders (10M) = 11M rows
-- Join cost: high

Array approach (1 table with ARRAY):

-- Table: customers (1M rows with a nested orders array)

SELECT customer_id, name, o.order_id, o.amount
FROM customers,
UNNEST(orders) AS o  -- `order` is reserved, so alias the element as o
WHERE customer_id = '12345';

-- Scans: customers (1M) only
-- No join cost
-- 50-80% faster for 1:many relationships

When to use ARRAY:

  • 1:many relationships (orders per customer)

  • Moderate array size (< 1000 elements)

  • Frequent filtering by parent entity

  • Want to reduce JOINs

When NOT to use ARRAY:

  • Many:many relationships

  • Very large arrays (> 10,000 elements)

  • Need to query array elements independently

  • Array elements frequently updated

Complete Example: Denormalized Design

Traditional normalized:

-- 3 tables, 2 JOINs
SELECT c.customer_id, c.name, o.order_id, oi.product_id, oi.quantity
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
JOIN order_items oi ON o.order_id = oi.order_id;

Denormalized with ARRAY/STRUCT:

-- 1 table, no JOINs
CREATE TABLE customers_denormalized AS
SELECT
  c.customer_id,
  c.name,
  ARRAY_AGG(
    STRUCT(
      o.order_id,
      o.order_date,
      o.status,
      ARRAY(
        SELECT AS STRUCT product_id, quantity, price
        FROM order_items
        WHERE order_id = o.order_id
      ) AS items
    )
    ORDER BY o.order_date DESC
  ) AS orders
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name;

-- Query (no JOINs!)
SELECT customer_id, name, ord.order_id, item.product_id, item.quantity
FROM customers_denormalized,
UNNEST(orders) AS ord,  -- `order` is reserved, so alias the element as ord
UNNEST(ord.items) AS item
WHERE customer_id = '12345';

Performance improvement: 3-5x faster for typical queries.

BigQuery-Specific Features

EXCEPT and REPLACE in SELECT

EXCEPT - Exclude columns:

-- Select all except sensitive columns
SELECT * EXCEPT(ssn, password, credit_card)
FROM customers;

-- Combine with WHERE
SELECT * EXCEPT(internal_notes)
FROM orders
WHERE status = 'shipped';

REPLACE - Modify columns:

-- Replace column values
SELECT * REPLACE(UPPER(name) AS name, ROUND(price, 2) AS price)
FROM products;

-- Anonymize data
SELECT * REPLACE('' AS ssn, '' AS credit_card)
FROM customers;

Combine EXCEPT and REPLACE:

SELECT * EXCEPT(password) REPLACE(LOWER(email) AS email) FROM users;

SAFE Functions (NULL Instead of Errors)

SAFE_CAST() - Returns NULL on error:

-- Regular CAST throws an error on invalid input
SELECT CAST('invalid' AS INT64);  -- ERROR

-- SAFE_CAST returns NULL
SELECT SAFE_CAST('invalid' AS INT64) AS result;  -- NULL

SAFE_DIVIDE() - Returns NULL on division by zero:

SELECT
  revenue,
  orders,
  SAFE_DIVIDE(revenue, orders) AS avg_order_value  -- NULL if orders = 0
FROM daily_metrics;

Other SAFE functions:

-- SAFE_SUBTRACT
SELECT SAFE_SUBTRACT(-9223372036854775807, 2);  -- NULL (INT64 underflow)

-- SAFE_MULTIPLY
SELECT SAFE_MULTIPLY(9223372036854775807, 2);  -- NULL (overflow)

-- SAFE_ADD
SELECT SAFE_ADD(9223372036854775807, 1);  -- NULL (overflow)

Use case: Data quality checks:

SELECT
  COUNT(*) AS total_rows,
  COUNT(SAFE_CAST(amount AS FLOAT64)) AS valid_amounts,
  COUNT(*) - COUNT(SAFE_CAST(amount AS FLOAT64)) AS invalid_amounts
FROM transactions;

GROUP BY with Column Numbers

-- ✅ Valid: group by column position
SELECT customer_id, DATE(order_date) AS order_date, SUM(amount) AS total
FROM orders
GROUP BY 1, 2;  -- Same as: GROUP BY customer_id, DATE(order_date)

-- ✅ Also valid: mix names and numbers
SELECT customer_id, DATE(order_date) AS order_date, SUM(amount) AS total
FROM orders
GROUP BY customer_id, 2;

When useful:

  • Long expressions in SELECT

  • Complex CASE statements

  • Simplifies GROUP BY clause

TABLESAMPLE - Random Sampling

System sampling (fast, approximate):

-- Sample ~10% of data (by storage blocks)
SELECT * FROM large_table TABLESAMPLE SYSTEM (10 PERCENT);

Use cases:

  • Quick data exploration

  • Testing queries on large tables

  • Statistical sampling

Note: SYSTEM sampling is block-based, not truly random. For exact percentages, use ROW_NUMBER() with RAND().
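If block-based SYSTEM sampling is too coarse, the row-level idea is a Bernoulli draw per row, which is what a RAND()-based filter does. A Python sketch of that behavior (counts vary run to run unless seeded):

```python
import random

random.seed(42)  # seeded only so the sketch is repeatable
population = range(10_000)

# Analog of: WHERE RAND() < 0.10 - each row kept independently with p = 0.10
sample = [row for row in population if random.random() < 0.10]

print(len(sample))  # roughly 1,000, but not exactly 10%
```

For an exact count, rank rows by a random key and keep the first N, which is the ROW_NUMBER()-over-RAND() approach mentioned above.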

PIVOT and UNPIVOT

PIVOT - Rows to columns:

SELECT *
FROM (
  SELECT product, quarter, sales FROM quarterly_sales
)
PIVOT (
  SUM(sales) FOR quarter IN ('Q1', 'Q2', 'Q3', 'Q4')
);

-- Before:
-- product | quarter | sales
-- A       | Q1      | 100
-- A       | Q2      | 150

-- After:
-- product | Q1  | Q2  | Q3  | Q4
-- A       | 100 | 150 | ... | ...

UNPIVOT - Columns to rows:

SELECT *
FROM quarterly_totals
UNPIVOT (
  sales FOR quarter IN (Q1, Q2, Q3, Q4)
);

-- Before:
-- product | Q1  | Q2  | Q3  | Q4
-- A       | 100 | 150 | 200 | 250

-- After:
-- product | quarter | sales
-- A       | Q1      | 100
-- A       | Q2      | 150

Standard SQL vs Legacy SQL

BigQuery has two SQL dialects:

Feature          | Standard SQL          | Legacy SQL
-----------------|-----------------------|------------------------
ANSI compliance  | ✅ Yes                | ❌ No
Recommended      | ✅ Yes                | ❌ Deprecated
Table reference  | project.dataset.table | [project:dataset.table]
CTEs (WITH)      | ✅ Yes                | ❌ No
Window functions | ✅ Full support       | ⚠️ Limited
ARRAY/STRUCT     | ✅ Native             | ⚠️ Limited
Portability      | ✅ High               | ❌ BigQuery-only

How to detect Legacy SQL:

-- Legacy SQL indicators:

-- 1. Square brackets for tables
SELECT * FROM [project:dataset.table]

-- 2. GROUP EACH BY
SELECT customer_id, COUNT(*) FROM orders GROUP EACH BY customer_id

-- 3. FLATTEN
SELECT * FROM FLATTEN([project:dataset.table], field)

-- 4. TABLE_DATE_RANGE
SELECT * FROM TABLE_DATE_RANGE([dataset.table_],
  TIMESTAMP('2024-01-01'), TIMESTAMP('2024-12-31'))

Standard SQL equivalent:

-- 1. Backticks
SELECT * FROM `project.dataset.table`

-- 2. Regular GROUP BY
SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id

-- 3. UNNEST
SELECT * FROM `project.dataset.table`, UNNEST(field)

-- 4. Partitioned table filter
SELECT * FROM `project.dataset.table`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2024-01-01') AND TIMESTAMP('2024-12-31')

Migration:

Set default to Standard SQL

bq query --use_legacy_sql=false 'SELECT ...'

Or in Python

job_config = bigquery.QueryJobConfig(use_legacy_sql=False)

Performance Checklist

Before running expensive queries:

  • ☐ Use --dry_run to estimate cost

  • ☐ Check if partition pruning is active

  • ☐ Verify clustering on JOIN/WHERE columns

  • ☐ Remove SELECT *

  • ☐ Filter early with WHERE

  • ☐ Use EXPLAIN to analyze plan

  • ☐ Consider materialized views for repeated queries

  • ☐ Test with LIMIT first on full query

  • ☐ Avoid CTE re-execution (use temp tables if referenced 2+ times)

  • ☐ Use NOT EXISTS instead of NOT IN

  • ☐ Batch DML operations (avoid row-by-row updates)

  • ☐ Consider ARRAY/STRUCT for 1:many relationships

Monitoring Query Performance

-- Check query statistics (the region qualifier must be backquoted)
SELECT
  job_id,
  user_email,
  total_bytes_processed,
  total_slot_ms,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) AS duration_sec
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
ORDER BY total_bytes_processed DESC
LIMIT 10;

Quick Wins

Immediate improvements:

  • Add partition filter → 50-90% cost reduction

  • Remove SELECT * → 30-70% cost reduction

  • Use clustering → 20-50% performance improvement

  • Break complex queries → 2-5x faster execution

  • Enable query cache → Free repeated queries

