# BigQuery Data Loading

Use this skill when importing data into BigQuery from various file formats and sources.
## Basic Load Command

```bash
bq load \
  --location=LOCATION \
  --source_format=FORMAT \
  PROJECT:DATASET.TABLE \
  SOURCE_PATH \
  SCHEMA
```
## Loading from CSV

### Basic CSV Load

```bash
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  dataset.table \
  gs://bucket/data.csv \
  customer_id:STRING,amount:FLOAT,date:DATE
```
### CSV with Schema File

```bash
# Create schema.json
echo '[
  {"name": "customer_id", "type": "STRING"},
  {"name": "amount", "type": "FLOAT"},
  {"name": "date", "type": "DATE"}
]' > schema.json

# Load with schema file
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  dataset.table \
  gs://bucket/data.csv \
  ./schema.json
```
### CSV Options

```bash
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  --field_delimiter=',' \
  --quote='"' \
  --allow_quoted_newlines \
  --allow_jagged_rows \
  --max_bad_records=100 \
  --null_marker='NULL' \
  dataset.table \
  gs://bucket/data.csv \
  schema.json
```
Key flags:

- `--skip_leading_rows=N` - Skip header rows
- `--field_delimiter=','` - Column separator (default: comma)
- `--allow_quoted_newlines` - Allow newlines in quoted fields
- `--allow_jagged_rows` - Allow rows with missing fields
- `--max_bad_records=N` - Tolerate N parsing errors
- `--null_marker='NULL'` - String representing NULL values
### CSV with Auto-detect

```bash
bq load \
  --source_format=CSV \
  --autodetect \
  --skip_leading_rows=1 \
  dataset.table \
  gs://bucket/data.csv
```

Warning: Auto-detect is convenient but not recommended for production. Schema may change between loads.
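If you do use auto-detect for a first exploratory load, one way to pin the result for later loads is to export the detected schema and reuse it explicitly (a sketch; table and file names are placeholders):

```bash
# Export the schema BigQuery detected on the first load...
bq show --schema --format=prettyjson dataset.table > schema.json

# ...and reuse it explicitly on subsequent loads
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  dataset.table \
  gs://bucket/next_batch.csv \
  ./schema.json
```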
## Loading from JSON

### Newline-Delimited JSON

```bash
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  dataset.table \
  gs://bucket/data.json \
  customer_id:STRING,amount:FLOAT,date:DATE
```
JSON format required (one object per line):

```json
{"customer_id": "C001", "amount": 99.99, "date": "2024-01-15"}
{"customer_id": "C002", "amount": 149.99, "date": "2024-01-15"}
```

NOT a standard JSON array:

```json
// ❌ This won't work
[
  {"customer_id": "C001", "amount": 99.99},
  {"customer_id": "C002", "amount": 149.99}
]
```
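If a source file arrives as a JSON array, one option is to convert it to newline-delimited JSON before loading, for example with `jq` (a sketch; file names are placeholders):

```bash
# Flatten a JSON array into one compact object per line (NDJSON)
jq -c '.[]' array.json > data.ndjson

bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  --autodetect \
  dataset.table \
  ./data.ndjson
```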
### JSON with Auto-detect

```bash
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  --autodetect \
  dataset.table \
  gs://bucket/data.json
```
### JSON with Nested Fields

Schema with nested STRUCT:

```json
[
  {"name": "customer_id", "type": "STRING"},
  {"name": "address", "type": "RECORD", "fields": [
    {"name": "street", "type": "STRING"},
    {"name": "city", "type": "STRING"},
    {"name": "zip", "type": "STRING"}
  ]},
  {"name": "orders", "type": "RECORD", "mode": "REPEATED", "fields": [
    {"name": "order_id", "type": "STRING"},
    {"name": "amount", "type": "FLOAT"}
  ]}
]
```
## Loading from Avro

### Basic Avro Load

```bash
bq load \
  --source_format=AVRO \
  dataset.table \
  gs://bucket/data.avro
```

Key benefit: Schema is auto-detected from Avro metadata. No schema specification needed!
### Avro with Wildcards

```bash
bq load \
  --source_format=AVRO \
  dataset.table \
  "gs://bucket/path/to/*.avro"
```

Note: Use quotes around wildcard paths so the shell does not expand them.
### Avro Advantages

- Self-describing: Schema embedded in file
- Efficient: Compact binary format
- Type-safe: Strong typing preserved
- No schema drift: Schema always matches data
## Loading from Parquet

### Basic Parquet Load

```bash
bq load \
  --source_format=PARQUET \
  dataset.table \
  gs://bucket/data.parquet
```

Like Avro: Schema auto-detected from Parquet metadata.
### Parquet with Compression

```bash
bq load \
  --source_format=PARQUET \
  dataset.table \
  gs://bucket/data.snappy.parquet
```

Supported compression: SNAPPY, GZIP, LZO, BROTLI, LZ4, ZSTD
## Loading Strategies

### Append Data (Default)

```bash
bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/new_data.csv \
  schema.json
```

Behavior: Adds rows to the existing table.
### Replace Table

```bash
bq load \
  --source_format=CSV \
  --replace \
  dataset.table \
  gs://bucket/data.csv \
  schema.json
```

Behavior: Deletes all existing data, then loads the new data.
### Overwrite Partition

```bash
bq load \
  --source_format=CSV \
  --replace \
  --time_partitioning_field=date \
  'dataset.table$20240115' \
  gs://bucket/data_20240115.csv \
  schema.json
```

Syntax: `TABLE$YYYYMMDD` targets a specific partition (quote the decorator so the shell does not expand the `$`).
Behavior: Replaces only that partition, leaving others intact.
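To confirm which partitions a load touched, one option is to inspect partition metadata afterwards (a sketch; `dataset` and `table` are placeholders):

```bash
# List partitions with their row counts and last-modified times
bq query --use_legacy_sql=false '
SELECT partition_id, total_rows, last_modified_time
FROM dataset.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = "table"
ORDER BY partition_id DESC'
```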
### Append or Skip

```bash
# Skip load if table already has data
bq load \
  --source_format=CSV \
  --if_not_exists \
  dataset.table \
  gs://bucket/data.csv \
  schema.json
```
## Loading from Cloud Storage

### Single File

```bash
bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/data.csv \
  schema.json
```
### Multiple Files (List)

```bash
bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/file1.csv,gs://bucket/file2.csv,gs://bucket/file3.csv \
  schema.json
```
### Wildcard Pattern

```bash
bq load \
  --source_format=CSV \
  dataset.table \
  "gs://bucket/path/data-*.csv" \
  schema.json
```

Patterns:

- `gs://bucket/*.csv` - All CSV files in bucket root
- `gs://bucket/2024/*/*.csv` - All CSV files in subdirectories
- `gs://bucket/data-[0-9]*.csv` - Files matching pattern
## Loading from Local Files

### Local CSV

```bash
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  dataset.table \
  /path/to/local/data.csv \
  schema.json
```

Limitation: Files are uploaded first, then loaded. Slower for large files. Use GCS for better performance.
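For larger files, a common workaround is to stage the file in Cloud Storage first and load from there (a sketch; the bucket and paths are placeholders):

```bash
# Stage the file in GCS
gsutil cp /path/to/local/data.csv gs://bucket/staging/data.csv

# Load from GCS instead of uploading through bq
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  dataset.table \
  gs://bucket/staging/data.csv \
  schema.json
```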
## Schema Handling

### Inline Schema

```bash
bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/data.csv \
  customer_id:STRING,amount:FLOAT64,order_date:DATE,active:BOOLEAN
```

### Schema File

```json
[
  {"name": "customer_id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "amount", "type": "FLOAT64"},
  {"name": "order_date", "type": "DATE"},
  {"name": "metadata", "type": "JSON"}
]
```
Modes:

- `REQUIRED` - Field must have a value
- `NULLABLE` - Field can be NULL (default)
- `REPEATED` - Field is an array
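A schema file can mix all three modes; for example (illustrative field names):

```json
[
  {"name": "order_id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "coupon_code", "type": "STRING", "mode": "NULLABLE"},
  {"name": "tags", "type": "STRING", "mode": "REPEATED"}
]
```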
### Auto-detect Schema

```bash
bq load \
  --source_format=CSV \
  --autodetect \
  dataset.table \
  gs://bucket/data.csv
```

Pros:

- Quick for exploration
- No schema definition needed

Cons:

- Not reliable for production
- Schema may change between loads
- May misinterpret types
### Schema Evolution

Add new columns:

```bash
bq load \
  --source_format=CSV \
  --schema_update_option=ALLOW_FIELD_ADDITION \
  dataset.table \
  gs://bucket/data_with_new_column.csv \
  schema.json
```

Relax REQUIRED columns to NULLABLE:

```bash
bq load \
  --source_format=CSV \
  --schema_update_option=ALLOW_FIELD_RELAXATION \
  dataset.table \
  gs://bucket/data.csv \
  schema.json
```
## Compression

### Compressed Files

BigQuery automatically detects compression:

```bash
# Gzip-compressed CSV
bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/data.csv.gz \
  schema.json
```

Supported codecs vary by format: GZIP for CSV and JSON files; DEFLATE and SNAPPY for Avro data blocks.

Performance note: Uncompressed files load faster because BigQuery can read them in parallel. Use compression only if network or storage is the bottleneck.
## Error Handling

### Allow Bad Records

```bash
bq load \
  --source_format=CSV \
  --max_bad_records=1000 \
  dataset.table \
  gs://bucket/data.csv \
  schema.json
```

Behavior: Skips up to 1000 rows with errors and loads the rest.
### Ignore Unknown Values

```bash
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  --ignore_unknown_values \
  dataset.table \
  gs://bucket/data.json \
  schema.json
```

Behavior: Ignores JSON fields that are not in the schema.
### Validate Only (Dry Run)

```bash
bq load \
  --dry_run \
  --source_format=CSV \
  dataset.table \
  gs://bucket/data.csv \
  schema.json
```

Behavior: Validates schema and format without loading data.
## Loading to Partitioned Tables

### Time-Partitioned Table

```bash
bq load \
  --source_format=CSV \
  --time_partitioning_field=order_date \
  --time_partitioning_type=DAY \
  dataset.partitioned_orders \
  gs://bucket/orders_2024.csv \
  order_id:STRING,customer_id:STRING,order_date:DATE,amount:FLOAT
```
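Once loaded, queries that filter on the partitioning column scan only the matching partitions; a quick check might look like this (a sketch, assuming the table above):

```bash
bq query --use_legacy_sql=false '
SELECT customer_id, SUM(amount) AS total
FROM dataset.partitioned_orders
WHERE order_date = "2024-01-15"   -- prunes to a single daily partition
GROUP BY customer_id'
```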
### Load Specific Partition

```bash
# Load into the 2024-01-15 partition
bq load \
  --source_format=CSV \
  'dataset.orders$20240115' \
  gs://bucket/orders_20240115.csv \
  schema.json
```
### Range-Partitioned Table

```bash
bq load \
  --source_format=CSV \
  --range_partitioning=customer_id,0,1000,100 \
  dataset.range_partitioned \
  gs://bucket/data.csv \
  customer_id:INTEGER,amount:FLOAT
```

Format: `--range_partitioning=column,start,end,interval`.
## Performance Optimization

### Parallel Loading

BigQuery automatically parallelizes loads from:

- Multiple files matched by wildcards
- Uncompressed files (internal parallelization)

Optimal: Split large files into 1GB chunks.
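One way to do that splitting before upload is with GNU `split` (a sketch; file and bucket names are placeholders):

```bash
# Drop the CSV header, then write at most 1 GB of complete lines per chunk
tail -n +2 big_export.csv | split -C 1G -d --additional-suffix=.csv - chunk_

# Upload all chunks in parallel and load them with a single wildcard
gsutil -m cp chunk_*.csv gs://bucket/staging/
bq load \
  --source_format=CSV \
  dataset.table \
  "gs://bucket/staging/chunk_*.csv" \
  schema.json
```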
### File Format Recommendations

Best performance:

1. Avro - Fastest, self-describing, splittable
2. Parquet - Fast, columnar, good compression
3. CSV uncompressed - Good for parallel processing
4. JSON - Flexible but slower

Avoid:

- Very large single files (>10GB)
- CSV with complex escaping
- Highly compressed files (limits parallelism)
## Monitoring Loads

### Check Load Jobs

```bash
bq ls --jobs --max_results=10
```

### Job Details

```bash
bq show -j JOB_ID
```
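In scripts it is often useful to block until a job finishes; `bq wait` does that (JOB_ID is a placeholder):

```bash
# Block until the job completes
bq wait JOB_ID

# Or wait at most 300 seconds before returning
bq wait JOB_ID 300
```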
### Failed Loads

```sql
SELECT
  job_id,
  user_email,
  error_result.message AS error_message,
  creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  job_type = 'LOAD'
  AND state = 'DONE'
  AND error_result IS NOT NULL
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
ORDER BY creation_time DESC;
```
## Common Patterns

### Daily Batch Load

```bash
#!/bin/bash
DATE=$(date +%Y%m%d)

bq load \
  --source_format=CSV \
  --replace \
  "dataset.daily_data\$${DATE}" \
  gs://bucket/data_${DATE}.csv \
  schema.json
```
### Incremental Load with Deduplication

```bash
# Load to staging
bq load \
  --source_format=CSV \
  dataset.staging_orders \
  gs://bucket/new_orders.csv \
  schema.json

# Merge to production (dedup)
bq query --use_legacy_sql=false '
MERGE project.dataset.orders T
USING project.dataset.staging_orders S
ON T.order_id = S.order_id
WHEN NOT MATCHED THEN INSERT ROW
'
```
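After the MERGE succeeds, the staging table is typically cleared or dropped so the next run starts clean (a sketch):

```bash
# Truncate staging for the next incremental batch
bq query --use_legacy_sql=false 'TRUNCATE TABLE project.dataset.staging_orders'

# Or drop it entirely
bq rm -f -t dataset.staging_orders
```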
## Troubleshooting

### "Too many errors"

Problem: The number of bad rows exceeds `--max_bad_records`.
Solution: Increase `--max_bad_records` or fix data quality at the source.

### "Schema mismatch"

Problem: CSV columns don't match the schema.
Solution: Verify column order and count.

### "Invalid CSV format"

Problem: Unescaped quotes or newlines.
Solution: Use `--allow_quoted_newlines`.

### "Permission denied"

Problem: No access to the GCS bucket.
Solution: Grant the account running the load job the Storage Object Viewer role on the bucket.
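A minimal sketch of granting that role with `gsutil` (the service account email is a hypothetical placeholder for whichever identity submits the load job):

```bash
# Grant read access on the source bucket to the account running bq load
gsutil iam ch serviceAccount:loader@my-project.iam.gserviceaccount.com:roles/storage.objectViewer gs://bucket
```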
## Quick Reference

Format priorities (fastest first):

1. Avro (splittable, self-describing)
2. Parquet (columnar, efficient)
3. CSV uncompressed (parallel)
4. JSON (flexible)

Schema strategies:

- Production: Explicit schema file
- Development: Auto-detect
- Evolution: Allow field addition/relaxation

Loading strategies:

- New data: Append (default)
- Replace all: `--replace`
- Replace partition: `TABLE$YYYYMMDD`
- Incremental: Load staging → MERGE