Remote Storage Access
Comprehensive guide to accessing cloud storage (S3, GCS, Azure) and remote filesystems in Python. Covers three major libraries - fsspec, pyarrow.fs, and obstore - and their integration with data engineering tools.
Quick Comparison
Feature fsspec pyarrow.fs obstore
Best For Broad compatibility, ecosystem integration Arrow-native workflows, Parquet High-throughput, performance-critical
Backends S3, GCS, Azure, HTTP, FTP, 20+ more S3, GCS, HDFS, local S3, GCS, Azure, local
Performance Good (with caching) Excellent for Parquet 9x faster for concurrent ops
Dependencies Backend-specific (s3fs, gcsfs) Bundled with PyArrow Zero Python deps (Rust)
Async Support Yes (aiohttp) Limited Native sync/async
DataFrame Integration Universal PyArrow-native Via fsspec wrapper
Maturity Very mature (2018+) Mature New (2025), rapidly evolving
When to Use Which?
Use fsspec when:
-
You need broad ecosystem compatibility (pandas, xarray, Dask)
-
Working with multiple storage backends (S3, GCS, Azure, HTTP)
-
You need protocol chaining and caching features
-
Your workflow involves diverse data formats beyond Parquet
Use pyarrow.fs when:
-
Your pipeline is Arrow/Parquet-native
-
You need zero-copy integration with PyArrow datasets
-
Predicate pushdown and column pruning are critical
-
Working with partitioned Parquet datasets
Use obstore when:
-
Performance is paramount (many small files, high concurrency)
-
You need async/await support for concurrent operations
-
You want minimal dependencies (Rust-based)
-
Working with large-scale data ingestion/egestion
Skill Dependencies
Prerequisites:
-
@data-engineering-core
-
Polars, DuckDB, PyArrow basics
-
@data-engineering-storage-authentication
-
AWS, GCP, Azure auth patterns
-
@data-engineering-storage-formats
-
Parquet, Arrow, Lance, Zarr, Avro, ORC
Related:
-
@data-engineering-storage-lakehouse
-
Delta Lake, Iceberg on cloud storage
-
@data-engineering-orchestration
-
dbt with cloud storage
Detailed Guides
Library Deep Dives
-
@data-engineering-storage-remote-access-libraries-fsspec
-
Universal filesystem interface
-
@data-engineering-storage-remote-access-libraries-pyarrow-fs
-
Native Arrow integration
-
@data-engineering-storage-remote-access-libraries-obstore
-
High-performance Rust
DataFrame Integrations
-
@data-engineering-storage-remote-access-integrations-polars
-
Polars + cloud URIs
-
@data-engineering-storage-remote-access-integrations-duckdb
-
DuckDB HTTPFS extension
-
@data-engineering-storage-remote-access-integrations-pandas
-
Pandas + remote files
-
@data-engineering-storage-remote-access-integrations-pyarrow
-
PyArrow datasets
-
@data-engineering-storage-remote-access-integrations-delta-lake
-
Delta on S3/GCS/Azure
-
@data-engineering-storage-remote-access-integrations-iceberg
-
Iceberg with cloud catalogs
Infrastructure Patterns
-
@data-engineering-storage-authentication
-
AWS, GCP, Azure auth patterns, IAM roles, service principals
-
See performance.md in this skill - Caching, concurrency, async
-
See patterns.md in this skill - Incremental loading, partitioned writes, cross-cloud copy
Storage Formats
- @data-engineering-storage-formats
- Parquet, Arrow/Feather, Lance, Zarr, Avro, ORC
Quick Start Example
import fsspec import pyarrow.fs as fs import obstore as obs
Method 1: fsspec (universal)
s3_fs = fsspec.filesystem('s3') with s3_fs.open('s3://bucket/data.parquet', 'rb') as f: df = pl.read_parquet(f)
Method 2: pyarrow.fs (Arrow-native)
s3_pa = fs.S3FileSystem(region='us-east-1') table = pq.read_table("bucket/data.parquet", filesystem=s3_pa)
Method 3: obstore (high-performance)
from obstore.store import S3Store store = S3Store(bucket='my-bucket', region='us-east-1') data = obs.get(store, 'data.parquet').bytes()
All approaches work - choose based on your performance and ecosystem needs
Authentication
All three libraries follow standard cloud authentication patterns: explicit credentials → environment variables → config files → IAM roles/Managed Identities.
See: @data-engineering-storage-authentication
Performance Optimization
Key strategies:
-
Caching: fsspec's SimpleCache for repeated access
-
Concurrency: obstore async API for many small files
-
Predicate pushdown: Filter at storage layer using partitioning
-
Column pruning: Read only required columns
See: @data-engineering-storage-remote-access/performance.md
References
-
fsspec Documentation
-
PyArrow Filesystems
-
obstore Documentation
-
s3fs Documentation
-
gcsfs Documentation