data-engineering-storage-remote-access-libraries-obstore

obstore: High-Performance Rust-Based Storage

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "data-engineering-storage-remote-access-libraries-obstore" with this command: npx skills add legout/data-platform-agent-skills/legout-data-platform-agent-skills-data-engineering-storage-remote-access-libraries-obstore

obstore (released 2025) provides a minimal, stateless API built on Rust's object_store crate. It targets concurrent workloads, where it is reported to be up to 9x faster than Python-based alternatives.

Installation

pip install obstore

Or with conda

conda install -c conda-forge obstore

Core Concepts

obstore exposes a functional API: operations are top-level functions that take the store as their first argument, such as obs.get(store, path), rather than methods such as store.get(path).

Creating Stores

import obstore as obs
from obstore.store import S3Store, GCSStore, AzureStore, LocalStore

# S3 store
s3 = S3Store(
    bucket="my-bucket",
    region="us-east-1",
    access_key_id="AKIA...",
    secret_access_key="...",
    # Or omit the keys to use environment credentials
)

# GCS store
gcs = GCSStore(
    bucket="my-bucket",
    # Uses GOOGLE_APPLICATION_CREDENTIALS by default
)

# Azure store (container passed positionally here, since the keyword
# name may differ between obstore releases)
azure = AzureStore(
    "my-container",
    account_name="myaccount",
    account_key="...",
    # Or use DefaultAzureCredential
)

# Local filesystem
local = LocalStore("/path/to/root")

# From environment (picks up standard env vars).
# from_env may not be available in every obstore release; check the
# installed version, as the plain constructors may already read the
# environment.
s3 = S3Store.from_env(bucket="my-bucket")
gcs = GCSStore.from_env(bucket="my-bucket")

Basic Operations

import obstore as obs
from obstore.store import S3Store

store = S3Store(bucket="my-bucket", region="us-east-1")

# Put object (bytes)
obs.put(store, "hello.txt", b"Hello, World!")

# Put from file
with open("local-file.csv", "rb") as f:
    obs.put(store, "data/file.csv", f)

# Get object
response = obs.get(store, "hello.txt")
print(response.bytes())  # b"Hello, World!"
print(response.meta)     # Object metadata (path, size, last_modified, e_tag, etc.)

# Get range (efficient partial reads); recent releases take keyword-only
# start plus either end or length
partial = obs.get_range(store, "large-file.bin", start=0, length=1024)

# Stream download (process is a placeholder for your own handler)
stream = obs.get(store, "large-file.bin")
for chunk in stream.stream(min_chunk_size=8 * 1024 * 1024):
    process(chunk)

# List objects (streaming, no pagination needed!)
# The stream yields batches (lists) of metadata dicts, not single objects.
for batch in obs.list(store, prefix="data/2024/"):
    for obj in batch:
        print(f"{obj['path']}: {obj['size']} bytes")

# List with delimiter (like directory listing)
result = obs.list_with_delimiter(store, prefix="data/")
print(result["common_prefixes"])  # "directories"
print(result["objects"])          # files

# Delete
obs.delete(store, "old-file.txt")

# Copy within same store
obs.copy(store, "src/file.txt", "dst/file.txt")

# Rename/move
obs.rename(store, "old-name.txt", "new-name.txt")

# Check existence (via head).
# Recent releases raise the built-in FileNotFoundError for missing objects;
# older ones may raise a library-specific error from obstore.exceptions.
try:
    meta = obs.head(store, "file.txt")
    print(f"Exists: {meta['size']} bytes")
except FileNotFoundError:
    print("File not found")
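For the partial reads mentioned in this section, a large object is often fetched as a series of byte ranges. The sketch below shows one way to plan those ranges; plan_ranges is a hypothetical helper, not part of obstore, and it only does the arithmetic. Each (start, length) pair could then be handed to a range read.

```python
def plan_ranges(total_size: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split [0, total_size) into (start, length) pairs of at most chunk_size bytes."""
    if total_size < 0 or chunk_size <= 0:
        raise ValueError("total_size must be >= 0 and chunk_size must be > 0")
    return [
        (start, min(chunk_size, total_size - start))
        for start in range(0, total_size, chunk_size)
    ]


# A 2500-byte object read in 1 KiB pieces: two full ranges and a 452-byte tail.
ranges = plan_ranges(2500, 1024)
print(ranges)
```

The object size would typically come from the metadata returned by a head call.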

Async API

import asyncio
import obstore as obs
from obstore.store import S3Store

async def main():
    store = S3Store(bucket="my-bucket", region="us-east-1")

    # Concurrent uploads
    await asyncio.gather(
        obs.put_async(store, "file1.txt", b"content1"),
        obs.put_async(store, "file2.txt", b"content2"),
        obs.put_async(store, "file3.txt", b"content3"),
    )

    # Concurrent downloads
    responses = await asyncio.gather(
        obs.get_async(store, "file1.txt"),
        obs.get_async(store, "file2.txt"),
        obs.get_async(store, "file3.txt"),
    )

    for resp in responses:
        print(await resp.bytes_async())

asyncio.run(main())
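With thousands of objects, gathering every request at once can be more concurrency than is wanted. One common pattern, sketched here with the standard library only, caps the number of in-flight coroutines with a semaphore. gather_bounded and fake_get are hypothetical names for illustration; in real use the factory would wrap a call such as obs.get_async(store, path).

```python
import asyncio
from collections.abc import Awaitable, Callable, Iterable
from typing import TypeVar

T = TypeVar("T")


async def gather_bounded(
    factories: Iterable[Callable[[], Awaitable[T]]], limit: int
) -> list[T]:
    """Run coroutine factories concurrently, with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run(factory: Callable[[], Awaitable[T]]) -> T:
        async with semaphore:
            return await factory()

    # gather returns results in the order the factories were given.
    return await asyncio.gather(*(run(f) for f in factories))


async def fake_get(path: str) -> bytes:
    """Stand-in for a real download coroutine."""
    await asyncio.sleep(0)
    return path.encode()


async def demo() -> list[bytes]:
    paths = [f"file{i}.txt" for i in range(5)]
    return await gather_bounded([lambda p=p: fake_get(p) for p in paths], limit=2)


results = asyncio.run(demo())
print(results)
```

Factories are used instead of ready-made coroutines so that no request starts before it acquires the semaphore.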

Streaming Uploads

import asyncio
import obstore as obs
from obstore.store import S3Store

store = S3Store(bucket="my-bucket")

# Upload from generator (streaming, memory-efficient)
def data_generator():
    for i in range(1000):
        yield f"Row {i}\n".encode()

obs.put(store, "output.txt", data_generator())

# Upload from async iterator
async def async_data():
    for i in range(1000):
        await asyncio.sleep(0)
        yield f"Row {i}\n".encode()

async def upload_async():
    await obs.put_async(store, "output-async.txt", async_data())

asyncio.run(upload_async())

# Automatic multipart upload for large files
# (triggered automatically based on size)
with open("huge-file.bin", "rb") as f:
    obs.put(store, "huge-file.bin", f)  # Multipart is selected automatically
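The generators above yield one short row at a time. If many tiny chunks turn out to be a bottleneck, they can be coalesced before upload. The sketch below is a hypothetical helper (rebatch is not an obstore function), and whether it helps depends on how the library buffers its input, which is not stated here.

```python
from collections.abc import Iterable, Iterator


def rebatch(chunks: Iterable[bytes], min_size: int) -> Iterator[bytes]:
    """Coalesce small byte chunks into chunks of at least min_size bytes.

    The final chunk may be smaller than min_size.
    """
    buffer = bytearray()
    for chunk in chunks:
        buffer += chunk
        if len(buffer) >= min_size:
            yield bytes(buffer)
            buffer.clear()
    if buffer:
        yield bytes(buffer)


rows = (f"Row {i}\n".encode() for i in range(1000))
batches = list(rebatch(rows, min_size=1024))
print(len(batches), sum(len(b) for b in batches))
```

The wrapped generator could be passed wherever the original one was, for example as the data argument of an upload call.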

Arrow Integration

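import obstore as obs
from obstore.store import S3Store

store = S3Store(bucket="my-bucket")

# Return list results as Arrow data (faster, more memory-efficient).
# With return_arrow=True the listing stream yields Arrow record batches
# instead of lists of dicts; this mode needs the optional arro3-core package.
stream = obs.list(store, prefix="data/", return_arrow=True)
listing = stream.collect()  # gather all remaining results into one Arrow object
print(listing.schema)
# Expected columns (exact Arrow types may differ between releases):
# ├── path
# ├── last_modified
# ├── size
# ├── e_tag
# └── version

# Process with PyArrow/Polars.
# Recent Polars versions accept Arrow-compatible objects directly;
# otherwise convert through pyarrow first.
import polars as pl
df = pl.from_arrow(listing)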

fsspec Compatibility

obstore provides an fsspec-compatible wrapper:

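# Note: the fsspec wrapper's class names and constructor arguments have
# changed across obstore releases; check obstore.fsspec in the installed
# version before relying on the calls below.
from obstore.fsspec import FsspecStore, register
import pyarrow.parquet as pq

# Method 1: Register as default handler for protocols
register()

# Now fsspec uses obstore internally
import fsspec
fs = fsspec.filesystem("s3", region="us-east-1")

# Method 2: Use FsspecStore directly
fs = FsspecStore("s3", bucket="my-bucket", region="us-east-1")

# or, from an existing store object (s3_store_object is a placeholder
# for an S3Store instance)
fs = FsspecStore.from_store(s3_store_object)

# Use with PyArrow
parquet_file = pq.ParquetFile(
    "s3://bucket/data/file.parquet",
    filesystem=fs,
)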

When to Use obstore

Choose obstore when:

  • ✅ Performance is paramount (many small files, high concurrency)

  • ✅ You need async/await for concurrent operations

  • ✅ Minimal dependencies are desired (ships as a compiled Rust extension with no required Python dependencies)

  • ✅ Streaming uploads from generators/iterators

  • ✅ Large-scale data ingestion and export

Performance Comparison

| Aspect | fsspec | pyarrow.fs | obstore |
|---|---|---|---|
| Concurrent small files | Moderate | Moderate | Up to 9x faster |
| Async support | Yes (aiohttp) | Limited | Native |
| Streaming uploads | Yes | Limited | Yes (efficient) |
| Parquet pushdown | Via PyArrow | Excellent | Via PyArrow |
| Maturity (2025) | Very high | High | Rapidly growing |

Authentication

See @data-engineering-storage-authentication for credential patterns. The S3Store, GCSStore, and AzureStore constructors all accept explicit credentials or use environment variables via from_env().
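When explicit credentials are preferred over implicit environment lookup, one option is to collect them into keyword arguments first. The sketch below is a hypothetical helper (s3_kwargs_from_env is not part of obstore). It assumes the conventional AWS variable names and the keyword names used in the S3Store example earlier, and it skips variables that are unset or empty.

```python
import os
from collections.abc import Mapping

# Assumed mapping from conventional AWS environment variables to the
# keyword arguments used in the S3Store example.
_ENV_TO_KWARG = {
    "AWS_ACCESS_KEY_ID": "access_key_id",
    "AWS_SECRET_ACCESS_KEY": "secret_access_key",
    "AWS_REGION": "region",
}


def s3_kwargs_from_env(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Build S3 keyword arguments from the environment, skipping unset or empty values."""
    return {kwarg: env[name] for name, kwarg in _ENV_TO_KWARG.items() if env.get(name)}


# Demonstration with an explicit mapping instead of the real environment.
kwargs = s3_kwargs_from_env(
    {"AWS_REGION": "us-east-1", "AWS_ACCESS_KEY_ID": "", "UNRELATED": "x"}
)
print(kwargs)
# Possible use: S3Store(bucket="my-bucket", **kwargs)
```

Passing the environment as a parameter keeps the helper easy to test without touching real credentials.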

References

  • obstore Documentation

  • PyPI: obstore

  • object_store (Rust)
