# BigQuery Data Loading

Use this skill when importing data into BigQuery from various file formats and sources.
## Basic Load Command

```bash
bq load \
  --location=LOCATION \
  --source_format=FORMAT \
  PROJECT:DATASET.TABLE \
  SOURCE_PATH \
  SCHEMA
```
## Loading from CSV

### Basic CSV Load

```bash
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  dataset.table \
  gs://bucket/data.csv \
  customer_id:STRING,amount:FLOAT,date:DATE
```
### CSV with Schema File

```bash
# Create schema.json
echo '[
  {"name": "customer_id", "type": "STRING"},
  {"name": "amount", "type": "FLOAT"},
  {"name": "date", "type": "DATE"}
]' > schema.json

# Load with schema file
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  dataset.table \
  gs://bucket/data.csv \
  ./schema.json
```
### CSV Options

```bash
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  --field_delimiter=',' \
  --quote='"' \
  --allow_quoted_newlines \
  --allow_jagged_rows \
  --max_bad_records=100 \
  --null_marker='NULL' \
  dataset.table \
  gs://bucket/data.csv \
  schema.json
```
Key flags:

- `--skip_leading_rows=N` - Skip header rows
- `--field_delimiter=','` - Column separator (default: comma)
- `--allow_quoted_newlines` - Allow newlines in quoted fields
- `--allow_jagged_rows` - Allow rows with missing fields
- `--max_bad_records=N` - Tolerate N parsing errors
- `--null_marker='NULL'` - String representing NULL values
### CSV with Auto-detect

```bash
bq load \
  --source_format=CSV \
  --autodetect \
  --skip_leading_rows=1 \
  dataset.table \
  gs://bucket/data.csv
```

Warning: Auto-detect is convenient but not recommended for production. Schema may change between loads.
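If you do use auto-detect for a first exploratory load, one way to pin the result for later loads is to export the detected schema and reuse it explicitly (a sketch; table and file names are placeholders):

```bash
# Export the schema BigQuery detected on the first load...
bq show --schema --format=prettyjson dataset.table > schema.json

# ...and reuse it explicitly on subsequent loads
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  dataset.table \
  gs://bucket/next_batch.csv \
  ./schema.json
```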
## Loading from JSON

### Newline-Delimited JSON

```bash
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  dataset.table \
  gs://bucket/data.json \
  customer_id:STRING,amount:FLOAT,date:DATE
```
JSON format required (one object per line):

```json
{"customer_id": "C001", "amount": 99.99, "date": "2024-01-15"}
{"customer_id": "C002", "amount": 149.99, "date": "2024-01-15"}
```

NOT a standard JSON array:

```json
// ❌ This won't work
[
  {"customer_id": "C001", "amount": 99.99},
  {"customer_id": "C002", "amount": 149.99}
]
```
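If a source file arrives as a JSON array, one option is to convert it to newline-delimited JSON before loading, for example with `jq` (a sketch; file names are placeholders):

```bash
# Flatten a JSON array into one compact object per line (NDJSON)
jq -c '.[]' array.json > data.ndjson

bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  --autodetect \
  dataset.table \
  ./data.ndjson
```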
### JSON with Auto-detect

```bash
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  --autodetect \
  dataset.table \
  gs://bucket/data.json
```
### JSON with Nested Fields

Schema with nested STRUCT:

```json
[
  {"name": "customer_id", "type": "STRING"},
  {"name": "address", "type": "RECORD", "fields": [
    {"name": "street", "type": "STRING"},
    {"name": "city", "type": "STRING"},
    {"name": "zip", "type": "STRING"}
  ]},
  {"name": "orders", "type": "RECORD", "mode": "REPEATED", "fields": [
    {"name": "order_id", "type": "STRING"},
    {"name": "amount", "type": "FLOAT"}
  ]}
]
```
## Loading from Avro

### Basic Avro Load

```bash
bq load \
  --source_format=AVRO \
  dataset.table \
  gs://bucket/data.avro
```

Key benefit: Schema is auto-detected from Avro metadata. No schema specification needed!
### Avro with Wildcards

```bash
bq load \
  --source_format=AVRO \
  dataset.table \
  "gs://bucket/path/to/*.avro"
```

Note: Use quotes around wildcard paths so the shell does not expand them.
### Avro Advantages

- Self-describing: Schema embedded in file
- Efficient: Compact binary format
- Type-safe: Strong typing preserved
- No schema drift: Schema always matches data
## Loading from Parquet

### Basic Parquet Load

```bash
bq load \
  --source_format=PARQUET \
  dataset.table \
  gs://bucket/data.parquet
```

Like Avro: Schema auto-detected from Parquet metadata.
### Parquet with Compression

```bash
bq load \
  --source_format=PARQUET \
  dataset.table \
  gs://bucket/data.snappy.parquet
```

Supported compression: SNAPPY, GZIP, LZO, BROTLI, LZ4, ZSTD
## Loading Strategies

### Append Data (Default)

```bash
bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/new_data.csv \
  schema.json
```

Behavior: Adds rows to the existing table.
### Replace Table

```bash
bq load \
  --source_format=CSV \
  --replace \
  dataset.table \
  gs://bucket/data.csv \
  schema.json
```

Behavior: Deletes all existing data, then loads the new data.
### Overwrite Partition

```bash
bq load \
  --source_format=CSV \
  --replace \
  --time_partitioning_field=date \
  'dataset.table$20240115' \
  gs://bucket/data_20240115.csv \
  schema.json
```

Syntax: `TABLE$YYYYMMDD` targets a specific partition (quote the decorator so the shell does not expand the `$`).
Behavior: Replaces only that partition, leaving others intact.
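To confirm which partitions a load touched, one option is to inspect partition metadata afterwards (a sketch; `dataset` and `table` are placeholders):

```bash
# List partitions with their row counts and last-modified times
bq query --use_legacy_sql=false '
SELECT partition_id, total_rows, last_modified_time
FROM dataset.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = "table"
ORDER BY partition_id DESC'
```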
### Append or Skip

```bash
# Skip load if table already has data
bq load \
  --source_format=CSV \
  --if_not_exists \
  dataset.table \
  gs://bucket/data.csv \
  schema.json
```
## Loading from Cloud Storage

### Single File

```bash
bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/data.csv \
  schema.json
```
### Multiple Files (List)

```bash
bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/file1.csv,gs://bucket/file2.csv,gs://bucket/file3.csv \
  schema.json
```
### Wildcard Pattern

```bash
bq load \
  --source_format=CSV \
  dataset.table \
  "gs://bucket/path/data-*.csv" \
  schema.json
```

Patterns:

- `gs://bucket/*.csv` - All CSV files in bucket root
- `gs://bucket/2024/*/*.csv` - All CSV files in subdirectories
- `gs://bucket/data-[0-9]*.csv` - Files matching pattern
## Loading from Local Files

### Local CSV

```bash
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  dataset.table \
  /path/to/local/data.csv \
  schema.json
```

Limitation: Files are uploaded first, then loaded. Slower for large files. Use GCS for better performance.
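For larger files, a common workaround is to stage the file in Cloud Storage first and load from there (a sketch; the bucket and paths are placeholders):

```bash
# Stage the file in GCS
gsutil cp /path/to/local/data.csv gs://bucket/staging/data.csv

# Load from GCS instead of uploading through bq
bq load \
  --source_format=CSV \
  --skip_leading_rows=1 \
  dataset.table \
  gs://bucket/staging/data.csv \
  schema.json
```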
## Schema Handling

### Inline Schema

```bash
bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/data.csv \
  customer_id:STRING,amount:FLOAT64,order_date:DATE,active:BOOLEAN
```

### Schema File

```json
[
  {"name": "customer_id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "amount", "type": "FLOAT64"},
  {"name": "order_date", "type": "DATE"},
  {"name": "metadata", "type": "JSON"}
]
```
Modes:

- `REQUIRED` - Field must have a value
- `NULLABLE` - Field can be NULL (default)
- `REPEATED` - Field is an array
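A schema file can mix all three modes; for example (illustrative field names):

```json
[
  {"name": "order_id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "coupon_code", "type": "STRING", "mode": "NULLABLE"},
  {"name": "tags", "type": "STRING", "mode": "REPEATED"}
]
```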
### Auto-detect Schema

```bash
bq load \
  --source_format=CSV \
  --autodetect \
  dataset.table \
  gs://bucket/data.csv
```

Pros:

- Quick for exploration
- No schema definition needed

Cons:

- Not reliable for production
- Schema may change between loads
- May misinterpret types
### Schema Evolution

Add new columns:

```bash
bq load \
  --source_format=CSV \
  --schema_update_option=ALLOW_FIELD_ADDITION \
  dataset.table \
  gs://bucket/data_with_new_column.csv \
  schema.json
```

Relax REQUIRED columns to NULLABLE:

```bash
bq load \
  --source_format=CSV \
  --schema_update_option=ALLOW_FIELD_RELAXATION \
  dataset.table \
  gs://bucket/data.csv \
  schema.json
```
## Compression

### Compressed Files

BigQuery automatically detects compression:

```bash
# Gzip-compressed CSV
bq load \
  --source_format=CSV \
  dataset.table \
  gs://bucket/data.csv.gz \
  schema.json
```

Supported codecs vary by format: GZIP for CSV and JSON files; DEFLATE and SNAPPY for Avro data blocks.

Performance note: Uncompressed files load faster because BigQuery can read them in parallel. Use compression only if network or storage is the bottleneck.
## Error Handling

### Allow Bad Records

```bash
bq load \
  --source_format=CSV \
  --max_bad_records=1000 \
  dataset.table \
  gs://bucket/data.csv \
  schema.json
```

Behavior: Skips up to 1000 rows with errors and loads the rest.
### Ignore Unknown Values

```bash
bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  --ignore_unknown_values \
  dataset.table \
  gs://bucket/data.json \
  schema.json
```

Behavior: Ignores JSON fields that are not in the schema.
### Validate Only (Dry Run)

```bash
bq load \
  --dry_run \
  --source_format=CSV \
  dataset.table \
  gs://bucket/data.csv \
  schema.json
```

Behavior: Validates schema and format without loading data.
## Loading to Partitioned Tables

### Time-Partitioned Table

```bash
bq load \
  --source_format=CSV \
  --time_partitioning_field=order_date \
  --time_partitioning_type=DAY \
  dataset.partitioned_orders \
  gs://bucket/orders_2024.csv \
  order_id:STRING,customer_id:STRING,order_date:DATE,amount:FLOAT
```
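Once loaded, queries that filter on the partitioning column scan only the matching partitions; a quick check might look like this (a sketch, assuming the table above):

```bash
bq query --use_legacy_sql=false '
SELECT customer_id, SUM(amount) AS total
FROM dataset.partitioned_orders
WHERE order_date = "2024-01-15"   -- prunes to a single daily partition
GROUP BY customer_id'
```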
### Load Specific Partition

```bash
# Load into the 2024-01-15 partition
bq load \
  --source_format=CSV \
  'dataset.orders$20240115' \
  gs://bucket/orders_20240115.csv \
  schema.json
```
### Range-Partitioned Table

```bash
bq load \
  --source_format=CSV \
  --range_partitioning=customer_id,0,1000,100 \
  dataset.range_partitioned \
  gs://bucket/data.csv \
  customer_id:INTEGER,amount:FLOAT
```

Format: `--range_partitioning=column,start,end,interval`.
## Performance Optimization

### Parallel Loading

BigQuery automatically parallelizes loads from:

- Multiple files matched by wildcards
- Uncompressed files (internal parallelization)

Optimal: Split large files into 1GB chunks.
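One way to do that splitting before upload is with GNU `split` (a sketch; file and bucket names are placeholders):

```bash
# Drop the CSV header, then write at most 1 GB of complete lines per chunk
tail -n +2 big_export.csv | split -C 1G -d --additional-suffix=.csv - chunk_

# Upload all chunks in parallel and load them with a single wildcard
gsutil -m cp chunk_*.csv gs://bucket/staging/
bq load \
  --source_format=CSV \
  dataset.table \
  "gs://bucket/staging/chunk_*.csv" \
  schema.json
```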
### File Format Recommendations

Best performance:

1. Avro - Fastest, self-describing, splittable
2. Parquet - Fast, columnar, good compression
3. CSV uncompressed - Good for parallel processing
4. JSON - Flexible but slower

Avoid:

- Very large single files (>10GB)
- CSV with complex escaping
- Highly compressed files (limits parallelism)
## Monitoring Loads

### Check Load Jobs

```bash
bq ls --jobs --max_results=10
```

### Job Details

```bash
bq show -j JOB_ID
```
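In scripts it is often useful to block until a job finishes; `bq wait` does that (JOB_ID is a placeholder):

```bash
# Block until the job completes
bq wait JOB_ID

# Or wait at most 300 seconds before returning
bq wait JOB_ID 300
```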
### Failed Loads

```sql
SELECT
  job_id,
  user_email,
  error_result.message AS error_message,
  creation_time
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE
  job_type = 'LOAD'
  AND state = 'DONE'
  AND error_result IS NOT NULL
  AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 24 HOUR)
ORDER BY creation_time DESC;
```
## Common Patterns

### Daily Batch Load

```bash
#!/bin/bash
DATE=$(date +%Y%m%d)

bq load \
  --source_format=CSV \
  --replace \
  "dataset.daily_data\$${DATE}" \
  gs://bucket/data_${DATE}.csv \
  schema.json
```
### Incremental Load with Deduplication

```bash
# Load to staging
bq load \
  --source_format=CSV \
  dataset.staging_orders \
  gs://bucket/new_orders.csv \
  schema.json

# Merge to production (dedup)
bq query --use_legacy_sql=false '
MERGE project.dataset.orders T
USING project.dataset.staging_orders S
ON T.order_id = S.order_id
WHEN NOT MATCHED THEN INSERT ROW
'
```
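After the MERGE succeeds, the staging table is typically cleared or dropped so the next run starts clean (a sketch):

```bash
# Truncate staging for the next incremental batch
bq query --use_legacy_sql=false 'TRUNCATE TABLE project.dataset.staging_orders'

# Or drop it entirely
bq rm -f -t dataset.staging_orders
```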
## Troubleshooting

### "Too many errors"

Problem: The number of bad rows exceeds `--max_bad_records`.
Solution: Increase `--max_bad_records` or fix data quality at the source.

### "Schema mismatch"

Problem: CSV columns don't match the schema.
Solution: Verify column order and count.

### "Invalid CSV format"

Problem: Unescaped quotes or newlines.
Solution: Use `--allow_quoted_newlines`.

### "Permission denied"

Problem: No access to the GCS bucket.
Solution: Grant the account running the load job the Storage Object Viewer role on the bucket.
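A minimal sketch of granting that role with `gsutil` (the service account email is a hypothetical placeholder for whichever identity submits the load job):

```bash
# Grant read access on the source bucket to the account running bq load
gsutil iam ch serviceAccount:loader@my-project.iam.gserviceaccount.com:roles/storage.objectViewer gs://bucket
```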
## Quick Reference

Format priorities (fastest first):

1. Avro (splittable, self-describing)
2. Parquet (columnar, efficient)
3. CSV uncompressed (parallel)
4. JSON (flexible)

Schema strategies:

- Production: Explicit schema file
- Development: Auto-detect
- Evolution: Allow field addition/relaxation

Loading strategies:

- New data: Append (default)
- Replace all: `--replace`
- Replace partition: `TABLE$YYYYMMDD`
- Incremental: Load staging → MERGE