
Remote Storage Access

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "data-engineering-storage-remote-access" with this command: npx skills add legout/data-platform-agent-skills/legout-data-platform-agent-skills-data-engineering-storage-remote-access


Comprehensive guide to accessing cloud storage (S3, GCS, Azure) and remote filesystems in Python. Covers three major libraries - fsspec, pyarrow.fs, and obstore - and their integration with data engineering tools.

Quick Comparison

| Feature | fsspec | pyarrow.fs | obstore |
|---|---|---|---|
| Best For | Broad compatibility, ecosystem integration | Arrow-native workflows, Parquet | High-throughput, performance-critical |
| Backends | S3, GCS, Azure, HTTP, FTP, 20+ more | S3, GCS, HDFS, local | S3, GCS, Azure, local |
| Performance | Good (with caching) | Excellent for Parquet | 9x faster for concurrent ops |
| Dependencies | Backend-specific (s3fs, gcsfs) | Bundled with PyArrow | Zero Python deps (Rust) |
| Async Support | Yes (aiohttp) | Limited | Native sync/async |
| DataFrame Integration | Universal | PyArrow-native | Via fsspec wrapper |
| Maturity | Very mature (2018+) | Mature | New (2025), rapidly evolving |

When to Use Which?

Use fsspec when:

  • You need broad ecosystem compatibility (pandas, xarray, Dask)

  • Working with multiple storage backends (S3, GCS, Azure, HTTP)

  • You need protocol chaining and caching features

  • Your workflow involves diverse data formats beyond Parquet
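A minimal sketch of that uniform interface. The in-memory backend is used so the example runs without cloud credentials; with `s3fs` installed, the same calls work unchanged against `fsspec.filesystem("s3")` or any other backend:

```python
import fsspec

# The same AbstractFileSystem API works for "s3", "gcs", "az", "http", etc.
mfs = fsspec.filesystem("memory")

# Write an object, then list and read it back like on any backend.
mfs.pipe_file("/bucket/hello.txt", b"hello remote world")

with mfs.open("/bucket/hello.txt", "rb") as f:
    data = f.read()

print(data)
```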

Use pyarrow.fs when:

  • Your pipeline is Arrow/Parquet-native

  • You need zero-copy integration with PyArrow datasets

  • Predicate pushdown and column pruning are critical

  • Working with partitioned Parquet datasets

Use obstore when:

  • Performance is paramount (many small files, high concurrency)

  • You need async/await support for concurrent operations

  • You want minimal dependencies (Rust-based)

  • Working with large-scale data ingestion and egress

Skill Dependencies

Prerequisites:

  • @data-engineering-core: Polars, DuckDB, PyArrow basics

  • @data-engineering-storage-authentication: AWS, GCP, Azure auth patterns

  • @data-engineering-storage-formats: Parquet, Arrow, Lance, Zarr, Avro, ORC

Related:

  • @data-engineering-storage-lakehouse: Delta Lake, Iceberg on cloud storage

  • @data-engineering-orchestration: dbt with cloud storage

Detailed Guides

Library Deep Dives

  • @data-engineering-storage-remote-access-libraries-fsspec: Universal filesystem interface

  • @data-engineering-storage-remote-access-libraries-pyarrow-fs: Native Arrow integration

  • @data-engineering-storage-remote-access-libraries-obstore: High-performance Rust-based store

DataFrame Integrations

  • @data-engineering-storage-remote-access-integrations-polars: Polars + cloud URIs

  • @data-engineering-storage-remote-access-integrations-duckdb: DuckDB HTTPFS extension

  • @data-engineering-storage-remote-access-integrations-pandas: Pandas + remote files

  • @data-engineering-storage-remote-access-integrations-pyarrow: PyArrow datasets

  • @data-engineering-storage-remote-access-integrations-delta-lake: Delta on S3/GCS/Azure

  • @data-engineering-storage-remote-access-integrations-iceberg: Iceberg with cloud catalogs

Infrastructure Patterns

  • @data-engineering-storage-authentication: AWS, GCP, Azure auth patterns, IAM roles, service principals

  • performance.md (in this skill): caching, concurrency, async

  • patterns.md (in this skill): incremental loading, partitioned writes, cross-cloud copy

Storage Formats

  • @data-engineering-storage-formats: Parquet, Arrow/Feather, Lance, Zarr, Avro, ORC

Quick Start Example

```python
import fsspec
import polars as pl
import pyarrow.fs as fs
import pyarrow.parquet as pq
import obstore as obs
from obstore.store import S3Store

# Method 1: fsspec (universal)
s3_fs = fsspec.filesystem('s3')
with s3_fs.open('s3://bucket/data.parquet', 'rb') as f:
    df = pl.read_parquet(f)

# Method 2: pyarrow.fs (Arrow-native)
s3_pa = fs.S3FileSystem(region='us-east-1')
table = pq.read_table('bucket/data.parquet', filesystem=s3_pa)

# Method 3: obstore (high-performance)
store = S3Store(bucket='my-bucket', region='us-east-1')
data = obs.get(store, 'data.parquet').bytes()
```

All three approaches work; choose based on your performance and ecosystem needs.

Authentication

All three libraries follow standard cloud authentication patterns: explicit credentials → environment variables → config files → IAM roles/Managed Identities.

See: @data-engineering-storage-authentication

Performance Optimization

Key strategies:

  • Caching: fsspec's SimpleCache for repeated access

  • Concurrency: obstore async API for many small files

  • Predicate pushdown: Filter at storage layer using partitioning

  • Column pruning: Read only required columns
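The caching strategy can be sketched with fsspec protocol chaining: `simplecache::` is a real fsspec protocol that caches remote reads to a local temp directory, and the in-memory backend stands in here for a remote store such as `s3://`:

```python
import fsspec

# Populate an in-memory "remote" store (stand-in for s3://bucket/...).
mfs = fsspec.filesystem("memory")
mfs.pipe_file("/data/blob.bin", b"hello" * 100)

# Protocol chaining: simplecache:: caches the remote object locally,
# so repeated opens hit the local cache instead of the backend.
with fsspec.open("simplecache::memory://data/blob.bin", "rb") as f:
    data = f.read()

print(len(data))  # 500
```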

See: @data-engineering-storage-remote-access/performance.md

References

  • fsspec Documentation

  • PyArrow Filesystems

  • obstore Documentation

  • s3fs Documentation

  • gcsfs Documentation

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

  • data-science-visualization

  • data-science-feature-engineering

  • data-engineering-core