
Dask

  • Lets you run familiar Python code in parallel across multiple cores or machines.
  • Makes it easy to parallelize existing functions (e.g., with dask.delayed) and to distribute datasets that are too large for a single machine’s memory.
  • Avoids the need for complex parallel algorithms or redundant data copies while using available CPU and memory resources.

Dask is a flexible parallel computing library for Python. It allows users to harness the full power of their CPU and memory resources without the need for complex parallel algorithms or redundant copies of data. With Dask, you can write simple, familiar code that is executed in parallel across a cluster of machines or CPUs.

Dask provides two common modes of use described here:

  • Parallelizing existing Python code: You can wrap a function with the dask.delayed decorator so that each call becomes a task in a lazy graph; Dask then schedules and runs those tasks in parallel across available cores or machines without modifying the original implementation.

  • Distributed computing for large datasets: When a dataset is too large to fit in memory on a single machine, Dask can split the data into smaller pieces, distribute them across a cluster using a client, perform parallel computations (for example map and reduce operations), and gather the results back to the local machine. This workflow helps avoid memory limitations on a single node.

Parallelizing an existing function with dask.delayed

from dask import delayed

@delayed
def process_data(data):
    # perform some computations on the data
    ...

Now, calling this function does not run it immediately; each call is recorded as a task in a lazy graph instead. When you build up many such calls and trigger the computation (with .compute() or dask.compute), Dask schedules the independent tasks across multiple cores or machines, so the overall workload runs in parallel and finishes much faster.
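As a minimal sketch, assuming process_data is the decorated function above and chunks is a hypothetical list of input pieces: each call only builds a task, and dask.compute then runs the independent tasks in parallel.

from dask import compute

chunks = [1, 2, 3, 4]                      # hypothetical input pieces
tasks = [process_data(c) for c in chunks]  # each call just records a lazy task
results = compute(*tasks)                  # run the independent tasks in parallel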

Distributing a large dataset across a cluster

from dask.distributed import Client

client = Client()  # connect to a Dask cluster (starts a local one if no address is given)
futures = client.scatter(large_dataset)  # scatter the pieces of the dataset across the cluster
# perform some computations on the data in parallel
mapped = client.map(lambda x: x + 1, futures)
total = client.submit(sum, mapped)  # reduce the partial results on the cluster
results = client.gather(total)  # gather the final result back to the local machine

Now, the computation runs on the cluster: each worker holds only its piece of the scattered data, the map and reduce steps execute in parallel on the workers, and only the small final result is gathered back to the local machine. This lets you perform complex computations on very large datasets without running into the memory limits of a single node.
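When the dataset is too large even to construct on the local machine, one variation (shown here only as a sketch) is to build each piece directly on the workers with client.submit, so the full dataset never exists in one place. The names load_chunk and n_chunks below are hypothetical placeholders for however your data is actually produced.

from dask.distributed import Client

client = Client()
# hypothetical: load_chunk(i) reads or generates the i-th piece of the dataset on a worker
chunk_futures = [client.submit(load_chunk, i) for i in range(n_chunks)]
mapped = client.map(lambda x: x + 1, chunk_futures)  # map step runs in parallel on the workers
total = client.submit(sum, mapped)                   # reduce step stays on the cluster
result = client.gather(total)                        # only the final value comes back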

  • Parallelizing existing Python functions to run on multiple cores or machines.
  • Distributing and processing datasets that do not fit in memory on a single machine.
  • dask.delayed
  • dask.distributed
  • Client