What is Dask in Python?
Overview
Dask is a Python library for parallel computing. It offers features such as:
Big data collections: It extends commonly known Python interfaces, such as pandas and NumPy, to larger-than-memory data.
Dynamic task scheduling: It is optimized for interactive computational workloads.
Why do we need Dask?
Most analytics workflows use NumPy and pandas to analyze big data, and these packages support a wide range of computations. Dask becomes especially useful when a dataset does not fit in the available memory. It can scale a computation up to a cluster with thousands of cores or CPUs, and it can also scale down to a single process or a single core.
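As a sketch of this scaling behavior, the same lazy computation can run on a single thread or on a local thread pool simply by choosing a scheduler (the array size here is illustrative):

```python
import dask.array as da

# A lazy array of 10,000 ones, split into chunks of 1,000 elements.
arr = da.ones(10_000, chunks=1_000)

# Scale down: execute the task graph on a single thread in this process.
total_single = arr.sum().compute(scheduler="synchronous")

# Scale up (locally): execute the same graph on a thread pool.
total_threads = arr.sum().compute(scheduler="threads")

print(total_single, total_threads)
```

Both calls produce the same result; only the execution strategy changes.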
Features of Dask
Developed with a wider community: Dask is an open-source project built in coordination with other community projects such as NumPy, pandas, and scikit-learn.
Dask DataFrames: A Dask DataFrame mirrors the pandas DataFrame API, so we can work with DataFrames that are larger than memory with ease. It helps users manipulate large datasets.
Dask array and Dask-ML: A Dask array offers parallel, larger-than-memory, n-dimensional arrays by using blocked algorithms:
A blocked algorithm completes a large computation by performing many smaller computations on chunks of the data, using all the cores on our system.
These arrays are helpful when we are working on a dataset that is larger than the available memory.
Dask-ML provides resources for both parallel and distributed machine learning.
Moreover, Dask breaks a given array into smaller pieces, which enables efficient data streaming from disk and decreases the memory footprint of the computation.
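To make the blocked-algorithm idea concrete, here is a minimal sketch: a reduction over a chunked Dask array is computed per chunk and then combined, matching the plain NumPy result:

```python
import numpy as np
import dask.array as da

# One million integers, split into ten blocks of 100,000 elements each.
x = da.arange(1_000_000, chunks=100_000)

# sum() is a blocked algorithm: each chunk is summed independently
# (potentially on a different core), then the partial sums are combined.
total = x.sum().compute()

# The result matches the equivalent in-memory NumPy computation.
print(total == np.arange(1_000_000).sum())  # → True
```

Only one chunk's worth of data needs to be in memory at a time, which is what makes larger-than-memory arrays practical.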
Installation
We can use the following command to install the package:
python -m pip install "dask[complete]"
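After installation, a quick sanity check is to import the library and print its version:

```python
# Verify that Dask and its array component were installed.
import dask
import dask.array

print(dask.__version__)
```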
Limitations of Dask
Dask cannot parallelize work within a discrete, individual task; parallelism comes from running many tasks at once.
Because it is a distributed computing framework, it permits the remote execution of arbitrary code. Dask workers should therefore be run only within trusted networks.
Explanations
In the sections below, we discuss different concepts of the Dask library that are used for parallel computing, including arrays, bags, DataFrames, workloads, xarrays, and more.
Arrays
In the code below, we create a 1000*1000 Dask array, where each 100*100 chunk is a NumPy array. Each chunk is filled with random values using daskArray.random.random().
```python
# import the dask.array module
import dask.array as daskArray

# invoke random() to generate a 1000 * 1000 Dask array with 100 * 100 chunks
arr = daskArray.random.random((1000, 1000), chunks=(100, 100))

# print the array on the console
print(arr.compute())
```
Explanation
Line 5: We create a 1000*1000 Dask array, which contains 100*100 chunks of NumPy arrays, for efficient data processing.
Line 8: We print the array to the console by invoking the compute() method.
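Building on the array above, aggregations such as the mean also run chunk by chunk and stay lazy until compute() is called:

```python
import dask.array as daskArray

arr = daskArray.random.random((1000, 1000), chunks=(100, 100))

# mean() builds a lazy task graph over the 100 chunks; compute() runs it.
mean = arr.mean().compute()
print(mean)  # close to 0.5 for uniform random values
```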
DataFrames
In the code below, we create a Dask DataFrame by loading a built-in time-series dataset. The timeseries() method returns this dataset as a single Dask DataFrame.
```python
# include the dask.datasets module
import dask.datasets as dd

# invoke timeseries() from dask.datasets
df = dd.timeseries()

# print the time series DataFrame on the console
print(df.compute())
```
Explanation
Line 5: We load the time-series dataset as the df DataFrame.
Line 8: We print the dataset to the console.