dask

Dask Parallel and Distributed Computing

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "dask" with this command: npx skills add eyadsibai/ltk/eyadsibai-ltk-dask


Scale pandas/NumPy workflows beyond memory and across clusters.

When to Use

  • Datasets exceed available RAM

  • Need to parallelize pandas or NumPy operations

  • Processing multiple files efficiently (CSVs, Parquet)

  • Building custom parallel workflows

  • Distributing workloads across multiple cores/machines

Dask Collections

| Collection | Like | Use Case |
|---|---|---|
| `DataFrame` | pandas | Tabular data, CSV/Parquet |
| `Array` | NumPy | Numerical arrays, matrices |
| `Bag` | list | Unstructured data, JSON logs |
| `Delayed` | custom | Arbitrary Python functions |

Key concept: All collections are lazy; computation happens only when you call `.compute()`.

Lazy Evaluation

| Function | Behavior | Use |
|---|---|---|
| `dd.read_csv()` | Lazy load | Large CSVs |
| `dd.read_parquet()` | Lazy load | Large Parquet |
| Operations | Build graph | Chain transforms |
| `.compute()` | Execute | Get final result |

Key concept: Dask builds a task graph of operations, optimizes it, then executes in parallel. Call .compute() once at the end, not after every operation.

Schedulers

| Scheduler | Best For | Start |
|---|---|---|
| threaded | NumPy/pandas (releases GIL) | Default |
| processes | Pure Python (GIL-bound) | `scheduler='processes'` |
| synchronous | Debugging | `scheduler='synchronous'` |
| distributed | Monitoring, scaling, clusters | `Client()` |
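The scheduler is chosen per `compute()` call with the `scheduler` keyword. A sketch using the single-threaded scheduler, which is the easiest to step through in a debugger (`scheduler='processes'` would suit GIL-bound pure-Python work):

```python
import dask

@dask.delayed
def inc(x):
    # Pure-Python task; laziness comes from the @dask.delayed wrapper.
    return x + 1

tasks = [inc(i) for i in range(5)]

# 'synchronous' runs everything in the main thread, ideal for debugging.
results = dask.compute(*tasks, scheduler="synchronous")
print(results)  # (1, 2, 3, 4, 5)
```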

Distributed Scheduler

| Feature | Benefit |
|---|---|
| Dashboard | Real-time progress monitoring |
| Cluster scaling | Add/remove workers |
| Fault tolerance | Retry failed tasks |
| Worker resources | Memory management |
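A minimal sketch of starting the distributed scheduler locally, assuming the `distributed` package is installed (it ships with `dask[distributed]`); the worker counts are illustrative:

```python
from dask.distributed import Client, LocalCluster

# A local cluster stands in for a real one; processes=False keeps
# workers in this process, which is convenient for quick experiments.
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=False)
client = Client(cluster)  # client.dashboard_link points at the live dashboard

# Submit work directly to the cluster and gather the result.
future = client.submit(sum, [1, 2, 3])
result = future.result()
print(result)  # 6

client.close()
cluster.close()
```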

Chunking Concepts

DataFrame Partitions

| Concept | Description |
|---|---|
| Partition | Subset of rows (like a mini DataFrame) |
| `npartitions` | Number of partitions |
| `divisions` | Index boundaries between partitions |

Array Chunks

| Concept | Description |
|---|---|
| Chunk | Subset of array (n-dimensional block) |
| `chunks` | Tuple of chunk sizes per dimension |
| Optimal size | ~100 MB per chunk |

Key concept: Chunk size is critical. Too small = scheduling overhead. Too large = memory issues. Target ~100 MB.
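Chunk sizes are set at creation time. A small sketch (the array here is deliberately tiny; in real workloads you would size chunks toward the ~100 MB target):

```python
import dask.array as da

# A 4,000 x 4,000 float64 array split into 16 blocks of 1,000 x 1,000
# (~8 MB each; scale the chunk shape up toward ~100 MB in practice).
x = da.ones((4_000, 4_000), chunks=(1_000, 1_000))
print(x.chunks)  # ((1000, 1000, 1000, 1000), (1000, 1000, 1000, 1000))

# Reductions run per-chunk in parallel, then combine.
total = x.sum().compute()
print(total)  # 16000000.0
```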

DataFrame Operations

Supported (parallel)

| Category | Operations |
|---|---|
| Selection | `filter`, `loc`, column selection |
| Aggregation | `groupby`, `sum`, `mean`, `count` |
| Transforms | `apply` (row-wise), `map_partitions` |
| Joins | `merge`, `join` (shuffles data) |
| I/O | `read_csv`, `read_parquet`, `to_parquet` |

Avoid or Use Carefully

| Operation | Issue | Alternative |
|---|---|---|
| `iterrows` | Kills parallelism | `map_partitions` |
| `apply(axis=1)` | Slow | `map_partitions` |
| Repeated `compute()` | Inefficient | Single `compute()` at end |
| `sort_values` | Expensive shuffle | Avoid if possible |

Common Patterns

ETL Pipeline

  • `dd.read_*` (lazy load)

  • Chain filters and transforms

  • Single .compute() or .to_parquet()

Multi-File Processing

| Pattern | Description |
|---|---|
| Glob patterns | `dd.read_csv('data/*.csv')` |
| Partition per file | Natural parallelism |
| Output partitioned | `to_parquet('output/')` |

Custom Operations

| Method | Use Case |
|---|---|
| `map_partitions` | Apply function to each DataFrame partition |
| `map_blocks` | Apply function to each array block |
| `delayed` | Wrap arbitrary Python functions |
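A sketch of `delayed` building a small custom graph; the `load`/`summarize` functions are hypothetical placeholders for real work:

```python
import dask

@dask.delayed
def load(i):
    # Placeholder for real per-file or per-chunk loading work.
    return list(range(i))

@dask.delayed
def summarize(chunks):
    # Dask resolves the list of delayed inputs before calling this.
    return sum(len(c) for c in chunks)

pieces = [load(i) for i in range(1, 5)]
total = summarize(pieces)       # still lazy: a small custom task graph
answer = total.compute()
print(answer)  # 1 + 2 + 3 + 4 = 10
```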

Best Practices

| Practice | Why |
|---|---|
| Don't load with pandas first | Let Dask handle loading |
| Single `compute()` at end | Avoid redundant computation |
| Use Parquet | Faster than CSV, columnar |
| Match partitions to files | One partition per file |
| Check task graph size | `len(ddf.__dask_graph__()) < 100k` |
| Use distributed for debugging | Dashboard shows progress |

Common Pitfalls

| Pitfall | Solution |
|---|---|
| Loading with pandas first | Use `dd.read_*` directly |
| `compute()` in loops | Collect all, single `compute()` |
| Too many partitions | Repartition to ~100 MB each |
| Memory errors | Reduce chunk size, add workers |
| Slow shuffles | Avoid sorts/joins when possible |

vs Alternatives

| Tool | Best For | Trade-off |
|---|---|---|
| Dask | Scale pandas/NumPy, clusters | Setup complexity |
| Polars | Fast in-memory | Must fit in RAM |
| Vaex | Out-of-core single machine | Limited operations |
| Spark | Enterprise, SQL-heavy | Infrastructure |

