Large Data With Dask Skill

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "large-data-with-dask" with this command: npx skills add oimiragieo/agent-studio/oimiragieo-agent-studio-large-data-with-dask

  • Consider using Dask for datasets that do not fit in memory.

Iron Laws

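  • ALWAYS build the full task graph first and materialize results once at the end of a pipeline. Each intermediate compute() call pulls results into local memory as pandas objects and re-runs any shared upstream work on the next call, which prevents Dask from optimizing and parallelizing the pipeline as a whole. When several results are needed, pass them together to a single dask.compute(a, b) call so shared intermediates are computed once.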

  • NEVER use df.apply(lambda ...) with Dask DataFrames for element-wise operations. Row-wise apply (axis=1) runs a Python function once per row inside every partition, bypasses the vectorized pandas/NumPy code paths, and holds the GIL, so under the threaded scheduler it can end up no faster than plain pandas.

  • ALWAYS check partition sizes when reading large datasets and set them explicitly when the defaults do not fit (blocksize= for CSV; split_row_groups= or, in recent Dask versions, blocksize= for Parquet). Too many tiny partitions add scheduler overhead, while a single giant partition leaves no parallelism and may not fit in memory.

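  • NEVER call len(df) on a Dask DataFrame casually: it triggers an immediate pass over the full dataset. df.shape does not compute on its own; df.shape[0] is a lazy value that resolves only when you call .compute() on it. Request a row count only when it is truly needed, and make the cost visible by writing df.shape[0].compute().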

  • ALWAYS use dask.distributed.Client for multi-machine or CPU-bound, Python-heavy workloads. The default threaded scheduler cannot run pure-Python code in parallel because of the GIL; the distributed scheduler sidesteps this by running tasks in separate worker processes.

Anti-Patterns

| Anti-Pattern | Why It Fails | Correct Approach |
|---|---|---|
| Multiple compute() calls in a pipeline | Forces data to materialize at each call and re-runs shared upstream work | Build the complete computation graph first; call compute() once at the end |
| df.apply(lambda ...) on large DataFrames | Row-by-row Python; GIL contention; can be no faster than equivalent pandas on a single core | Use vectorized Dask operations (map_partitions, assign, arithmetic operators) |
| Default blocksize on large CSV files | The default is derived from available memory and core count and capped at 64 MB in recent Dask versions, so a 100 GB file yields well over a thousand partitions and scheduler overhead grows | Set blocksize="256MB" or blocksize="1GB" for large files; profile to find the optimal size |
| Implicit len(df) calls | Triggers an immediate full dataset read and count; defeats lazy evaluation | Use df.shape[0].compute() explicitly; only compute when the size is truly needed |
| Threaded scheduler for CPU-bound work | The Python GIL serializes pure-Python computation across threads; no true parallelism | Use dask.distributed.LocalCluster() or a process-based scheduler for CPU-bound tasks |

Memory Protocol (MANDATORY)

Before starting:

cat .claude/context/memory/learnings.md

After completing: Record any new patterns or exceptions discovered.

ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.
