Lecture: Parallel Computing with Dask
With the move toward multi-core CPUs, parallel processing is becoming ever more important. Python is often criticized in this area because of its Global Interpreter Lock, but the Dask library makes parallelism with Python tantalizingly simple. It offers an easy and consistent way to parallelize computations that scales from a single laptop to clusters with thousands of cores.
Parallelization is increasingly necessary to achieve good performance. Whether the goal is to speed up server provisioning, improve web service request handling rates, or process data in a timely fashion, parallelization is the way to go.
Python includes several mechanisms for concurrent processing, such as the threading, multiprocessing and asyncio modules. Both threading and multiprocessing are low-level primitives. asyncio and async/await are great, but only speed up I/O-intensive tasks. Dask makes parallelized and distributed computing easy. It's minimally invasive – a 3-line patch is enough to start using Dask – and works well for both I/O-intensive and CPU-intensive tasks.
Dask's delayed decorator wraps functions and turns them into Delayed objects. Calls to Delayed objects do not execute the code; instead, they build a task graph and return a "promise". Calling this object's compute() method executes the task graph in parallel and returns the result. Multi-threaded execution is the default, but switching to multi-process execution is as simple as passing a keyword argument to compute(). Major architectural decisions like threads versus processes become an implementation detail.
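The following is a minimal sketch of this workflow, not code from the lecture. The functions square and add and the helper build_total are hypothetical names chosen for illustration; delayed, compute() and the scheduler keyword are part of Dask. The multi-process call sits behind a main guard because worker processes may re-import the main module.

```python
from dask import delayed


@delayed
def square(x):
    # Hypothetical example task; stands in for any expensive function.
    return x * x


@delayed
def add(a, b):
    # Hypothetical example task that combines two earlier results.
    return a + b


def build_total():
    # Calling the wrapped functions runs nothing yet. Each call only adds
    # a node to the task graph and returns a Delayed object.
    return add(square(3), square(4))


total = build_total()

# compute() executes the graph; Delayed objects use threads by default.
result_threads = total.compute()

if __name__ == "__main__":
    # Same graph, different scheduler: only the keyword argument changes.
    result_processes = total.compute(scheduler="processes")
    print(result_threads, result_processes)
```

The two square calls do not depend on each other, so the scheduler is free to run them at the same time; add only runs once both have finished.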
Dask is designed to parallelize computation-heavy tasks, such as data processing. Consequently, it offers classes that behave like distributed versions of containers commonly used in this area: lists, NumPy arrays and pandas DataFrames. The Dask equivalents, Bag, Array, and DataFrame, have very similar APIs, but offer improved performance through parallel execution and can deal with data sets larger than the available memory. This API compatibility makes conversion of existing code to Dask very easy. When normal data processing code hits scaling boundaries, Dask offers a path forward without costly rewriting.
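As a rough illustration of this API similarity, the sketch below runs the same calculation once with NumPy and once with a Dask Array, and the same filter-and-map step once on a list and once on a Bag. The data, chunk size and partition count are arbitrary assumptions made for the example. The Bag is computed with the threaded scheduler here only to keep the sketch self-contained.

```python
import numpy as np
import dask.array as da
import dask.bag as db

# NumPy: the whole array lives in memory and is processed in one piece.
x_np = np.arange(1_000_000, dtype="float64")
mean_np = (x_np ** 2).mean()

# Dask Array: the same expression, but the data is split into chunks of
# 100,000 elements that can be processed independently.
x_da = da.arange(1_000_000, chunks=100_000, dtype="float64")
mean_da = (x_da ** 2).mean().compute()

# Plain Python list processing ...
words = ["dask", "numpy", "pandas", "bag", "array"]
lengths_list = [len(w) for w in words if len(w) > 3]

# ... and the Bag counterpart, split into two partitions.
bag = db.from_sequence(words, npartitions=2)
lengths_bag = bag.filter(lambda w: len(w) > 3).map(len).compute(scheduler="threads")
```

As with Delayed objects, the Dask expressions only describe the work; nothing is calculated until compute() is called.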
Just as Dask facilitates multi-threading and multi-processing, it also makes the next step simple: scaling beyond a single computer. Worker processes can run on remote machines, such as servers (via SSH), on a cluster, or on cloud computing instances. Once Dask has been told about these workers, it uses them to execute tasks without any further code changes. Dask automatically transfers data between machines and schedules jobs to maximize overall performance. An interactive dashboard shows job completion and resource utilization.
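A minimal sketch of this step, assuming the separate dask.distributed package is installed. The function names increment and run_on_cluster are hypothetical, and the scheduler address mentioned afterwards is a placeholder rather than a real host. When no address is given, the sketch starts a small in-process cluster as a stand-in for remote workers and disables the dashboard to keep the example lightweight.

```python
from dask import delayed
from dask.distributed import Client


@delayed
def increment(x):
    # Hypothetical example task.
    return x + 1


def run_on_cluster(address=None):
    if address is None:
        # Assumption for this sketch: a local, in-process cluster with two
        # workers stands in for remote machines.
        client = Client(processes=False, n_workers=2,
                        threads_per_worker=1, dashboard_address=None)
    else:
        # Connect to an already running scheduler on another machine.
        client = Client(address)
    try:
        # The task code is unchanged; once a Client exists, compute()
        # sends the graph to the cluster's workers.
        total = delayed(sum)([increment(i) for i in range(10)])
        return total.compute()
    finally:
        client.close()


if __name__ == "__main__":
    print(run_on_cluster())
```

To use real remote workers, the same function would be called with the address of a running scheduler, for example run_on_cluster("tcp://scheduler-host:8786"), where scheduler-host is a hypothetical host name.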