data-engineering-storage-remote-access-integrations-pandas

Pandas Integration with Remote Storage

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "data-engineering-storage-remote-access-integrations-pandas" with this command: npx skills add legout/data-platform-agent-skills/legout-data-platform-agent-skills-data-engineering-storage-remote-access-integrations-pandas


Pandas leverages fsspec under the hood for cloud storage access (s3://, gs://, etc.). This makes reading from and writing to cloud storage straightforward.

Auto-Detection (Simplest)

Pandas automatically uses fsspec for cloud URIs:

```python
import pandas as pd

# Read CSV/Parquet/JSON directly from cloud URIs
df = pd.read_csv("s3://bucket/data.csv")
df = pd.read_parquet("s3://bucket/data.parquet")
df = pd.read_json("gs://bucket/data.json")

# Compression is auto-detected from the file extension
df = pd.read_csv("s3://bucket/data.csv.gz")  # Automatically decompressed
```

Note: Auto-detection uses default credentials. For explicit auth, see below.
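As a quick runnable check of the auto-detection path, fsspec's in-process `memory://` filesystem can stand in for `s3://` — the bucket and file names below are made up, but the calling pattern is identical to the cloud case:

```python
import fsspec
import pandas as pd

# Write a small CSV into the in-process "memory" filesystem
fs = fsspec.filesystem("memory")
with fs.open("/bucket/data.csv", "w") as f:
    f.write("id,value\n1,100\n2,200\n")

# pandas routes any fsspec URI through fsspec automatically
df = pd.read_csv("memory://bucket/data.csv")
print(df.shape)  # (2, 2)
```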

Explicit Filesystem (More Control)

```python
import fsspec
import pandas as pd

# Create an fsspec filesystem with explicit configuration
fs = fsspec.filesystem("s3", anon=False)  # Uses the default credential chain

# Open a file through the filesystem
with fs.open("s3://bucket/data.csv") as f:
    df = pd.read_csv(f)

# Or pass the filesystem directly (recommended for performance)
df = pd.read_parquet(
    "s3://bucket/data.parquet",
    filesystem=fs,
    columns=["id", "value"],                 # Column pruning reduces data transfer
    filters=[("date", ">=", "2024-01-01")],  # Row group filtering
)
```

PyArrow Filesystem Backend

For better Arrow integration and zero-copy transfers:

```python
import pyarrow.fs as fs
import pandas as pd

s3_fs = fs.S3FileSystem(region="us-east-1")

# Read with column filtering
# Note: no s3:// prefix when a pyarrow filesystem is passed
df = pd.read_parquet(
    "bucket/data.parquet",
    filesystem=s3_fs,
    columns=["id", "name", "value"],
)

# Write to cloud storage (again, no s3:// prefix with a pyarrow filesystem)
df.to_parquet(
    "bucket/output/",
    filesystem=s3_fs,
    partition_cols=["year", "month"],  # Partitioned write
)
```

Partitioned Writes

Write partitioned datasets efficiently:

```python
import fsspec
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "year": [2024, 2024, 2023],
    "month": [1, 2, 12],
    "value": [100.0, 200.0, 150.0],
})

# Using fsspec
fs = fsspec.filesystem("s3")
df.to_parquet(
    "s3://bucket/output/",
    partition_cols=["year", "month"],
    filesystem=fs,
)
```

Output structure: s3://bucket/output/year=2024/month=1/part-0.parquet

Authentication

  • Auto-detection: Uses default credential chain (AWS_PROFILE, ~/.aws/credentials, IAM role)

  • Explicit: Pass key= and secret= to the fsspec.filesystem() constructor

  • For S3-compatible stores (MinIO, Ceph): pass a custom endpoint, e.g. fs = fsspec.filesystem("s3", client_kwargs={"endpoint_url": "http://minio.local:9000"})

See @data-engineering-storage-authentication for detailed patterns.

Performance Tips

  • Column pruning: pd.read_parquet(columns=[...]) only reads needed columns

  • Row group filtering: Use filters= parameter for partitioned data

  • Cache results: Wrap filesystem with simplecache:: or filecache::

```python
# Wrap S3 in a local cache, then open files through the cached filesystem
cached_fs = fsspec.filesystem("simplecache", target_protocol="s3")
with cached_fs.open("s3://bucket/data.parquet") as f:
    df = pd.read_parquet(f)
```

  • Use Parquet, not CSV: Parquet supports pushdown, compression, and typed storage

  • For large datasets: Consider PySpark or Dask instead of pandas (pandas loads everything into memory)

Limitations

  • pandas loads entire DataFrame into memory - not suitable for datasets larger than RAM

  • For lazy evaluation and better performance with large files, use @data-engineering-core (Polars)

  • Multi-file reads require manual iteration (use fs.glob() plus a list comprehension)
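The glob-then-concat pattern for multi-file reads can be sketched with fsspec's in-process memory filesystem standing in for S3 (all paths below are invented):

```python
import fsspec
import pandas as pd

# Create a few CSV parts in the in-process memory filesystem
fs = fsspec.filesystem("memory")
for i in range(3):
    with fs.open(f"/bucket/part-{i}.csv", "w") as f:
        f.write(f"id,value\n{i},{i * 10}\n")

# glob returns protocol-less paths; re-add the protocol for pandas
paths = sorted(fs.glob("/bucket/part-*.csv"))
df = pd.concat(
    (pd.read_csv(f"memory://{p}") for p in paths), ignore_index=True
)
print(len(df))  # 3
```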

Alternatives

  • Polars (@data-engineering-core): Faster, memory-mapped, lazy evaluation

  • Dask: Parallel pandas for out-of-core computation

  • PySpark: Distributed processing for big data

References

  • pandas I/O documentation

  • fsspec documentation

  • @data-engineering-storage-remote-access/libraries/fsspec

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals:

  • data-science-eda

  • data-engineering-core

  • data-science-feature-engineering