Large Data With Dask Skill
- Consider using Dask for larger-than-memory datasets; it evaluates lazily and in parallel behind a Pandas-like API.
Iron Laws
- ALWAYS call dask.compute() exactly once, at the end of a pipeline: multiple intermediate compute() calls break the lazy task graph and eliminate Dask's ability to fuse and parallelize operations (first sketch after this list).
- NEVER use df.apply(lambda ...) with Dask DataFrames for element-wise operations: Pandas-style apply forces row-by-row Python execution that bypasses the vectorized NumPy/Pandas code Dask runs on each partition, and is often slower than single-threaded Pandas (second sketch below).
- ALWAYS specify partition sizes explicitly when reading large datasets (blocksize= for read_csv; chunksize= or split_row_groups= for read_parquet, depending on Dask version): auto-detected partition sizes frequently produce thousands of tiny partitions (scheduler overhead dominates) or a single giant partition (no parallelism). See the third sketch below.
- NEVER call len(df) casually on a Dask DataFrame: it triggers an immediate computation of the full graph and negates lazy evaluation. When the row count is genuinely needed, request it explicitly with df.shape[0].compute() (fourth sketch below).
- ALWAYS use dask.distributed.Client for multi-machine or CPU-bound workloads: the default threaded scheduler serializes Python-heavy operations due to the GIL; the distributed scheduler runs workers in separate processes and bypasses it (fifth sketch below).
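A minimal sketch of the single-compute rule, assuming a hypothetical "events/" Parquet dataset with "day" and "amount" columns: build the whole graph lazily, then materialize every result you need in one dask.compute() call so Dask can fuse and share the work.

```python
import dask
import dask.dataframe as dd

# Hypothetical dataset: "events/" with "day" and "amount" columns.
df = dd.read_parquet("events/")             # lazy: builds a task graph, reads nothing yet
daily = df.groupby("day")["amount"].sum()   # still lazy
top10 = daily.nlargest(10)                  # still lazy

# One call materializes both results; the shared work
# (the read and the groupby) runs once, not twice.
daily_result, top10_result = dask.compute(daily, top10)
```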
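A sketch of the apply rule using toy data (the price/qty columns are illustrative): keep element-wise math vectorized, and push whole-partition Pandas logic through map_partitions instead of a row-wise lambda.

```python
import pandas as pd
import dask.dataframe as dd

# Toy data; in practice this comes from dd.read_csv / dd.read_parquet.
pdf = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})
df = dd.from_pandas(pdf, npartitions=2)

# Slow: runs a Python lambda for every row of every partition.
# df["total"] = df.apply(lambda row: row["price"] * row["qty"],
#                        axis=1, meta=("total", "f8"))

# Fast: vectorized arithmetic across whole partitions.
df["total"] = df["price"] * df["qty"]

# For logic that needs full Pandas, run it once per partition instead.
df = df.map_partitions(lambda part: part.assign(discounted=part["total"] * 0.9))

print(df.compute())
```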
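A sketch of explicit partition sizing; the path and the 256 MB value are illustrative starting points to profile, not universal answers, and the Parquet knobs vary by Dask version.

```python
import dask.dataframe as dd

# Roughly 256 MB of text per partition; profile 128 MB to 1 GB for your workload.
df = dd.read_csv("logs/*.csv", blocksize="256MB")
print(df.npartitions, "partitions")

# For Parquet, partitioning follows the row groups written by the producer;
# the aggregation knob (chunksize / split_row_groups / blocksize) depends on
# your Dask version, so check read_parquet's docs for the one you run.
```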
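A sketch of deliberate row counting; the toy frame stands in for a real dataset.

```python
import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({"x": range(1000)}), npartitions=4)

# len(df) would compute the whole graph eagerly, right here.
# When the count is genuinely needed, ask for it explicitly:
n_rows = df.shape[0].compute()

# Equivalent: count each partition, then sum the counts.
n_rows_alt = df.map_partitions(len).sum().compute()
assert n_rows == n_rows_alt == 1000
```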
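A sketch of the distributed scheduler for CPU-bound work; the worker counts and dataset path are illustrative and should be tuned for your machine.

```python
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

if __name__ == "__main__":
    # One thread per process-based worker sidesteps the GIL for CPU-bound Python.
    cluster = LocalCluster(n_workers=4, threads_per_worker=1)
    client = Client(cluster)  # subsequent compute() calls run on this cluster

    df = dd.read_parquet("events/")  # hypothetical path
    result = df.groupby("day")["amount"].mean().compute()

    client.close()
    cluster.close()
```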
Anti-Patterns
| Anti-Pattern | Why It Fails | Correct Approach |
| --- | --- | --- |
| Multiple compute() calls in a pipeline | Breaks the lazy graph; forces data to materialize and re-partition at each call | Build the complete computation graph first; call compute() once at the end |
| df.apply(lambda ...) on large DataFrames | Row-by-row Python; GIL contention; often slower than equivalent Pandas on a single core | Use vectorized Dask operations (map_partitions, assign, arithmetic operators) |
| Default blocksize on large CSV files | The small default blocksize yields thousands of tiny partitions on 100 GB-scale files; scheduler overhead dominates | Set blocksize="256MB" or blocksize="1GB" for large files; profile for the optimal size |
| len(df) on a Dask DataFrame | Triggers a full dataset read and count; defeats lazy evaluation | Use df.shape[0].compute() explicitly; compute the size only when it is truly needed |
| Threaded scheduler for CPU-bound work | The Python GIL serializes CPU computation across threads; no true parallelism | Use dask.distributed.LocalCluster() or the process-based scheduler for CPU tasks |
Memory Protocol (MANDATORY)
- Before starting: run `cat .claude/context/memory/learnings.md`.
- After completing: record any new patterns or exceptions discovered.
- ASSUME INTERRUPTION: your context may reset. If it's not in memory, it didn't happen.