Getting Started

Get introduced to the course and what we’ll learn.

Overview

In this course, we’ll learn how to use PySpark instead of pandas wherever possible. In Python, pandas is a library used to manipulate and analyze data. PySpark is the Python API for Apache Spark, an engine written in Scala that is used for large-scale data processing.
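To get a feel for the difference, here is a minimal sketch of setting up each library. The application name and the sample data are placeholders; the only real requirement on the PySpark side is creating a SparkSession, the entry point to the DataFrame API.

```python
import pandas as pd
from pyspark.sql import SparkSession

# pandas works directly on in-memory data structures.
pdf = pd.DataFrame({"product": ["A", "B"], "rating": [4.0, 5.0]})

# PySpark needs a SparkSession before we can build DataFrames.
spark = SparkSession.builder.appName("getting-started").getOrCreate()
sdf = spark.createDataFrame(pdf)  # a Spark DataFrame with the same content

print(pdf.head())
sdf.show()
```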

We’ll use a subset of the Amazon Review Data to demonstrate the modules of the PySpark DataFrame API:

Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects

Jianmo Ni, Jiacheng Li, Julian McAuley

Empirical Methods in Natural Language Processing (EMNLP), 2019

In each part, we’ll first solve some tasks using pandas. Then we’ll try to accomplish the same tasks in PySpark.
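For example, computing the average star rating follows this pandas-first, PySpark-second pattern. The file name reviews.json and the overall column are assumptions standing in for whatever the actual review subset uses.

```python
import pandas as pd
from pyspark.sql import SparkSession, functions as F

# pandas: read the reviews (JSON Lines) and compute the mean star rating.
pdf = pd.read_json("reviews.json", lines=True)
print(pdf["overall"].mean())

# PySpark: the same task with the DataFrame API.
spark = SparkSession.builder.appName("reviews").getOrCreate()
sdf = spark.read.json("reviews.json")
sdf.agg(F.avg("overall").alias("avg_rating")).show()
```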

Obtain valuable information from data

The content flow of the course follows a short analytics project lifecycle. In this lifecycle, we follow an almost predetermined set of actions to get valuable information out of the data, as shown in the illustration below.

[Diagram: Load/Read → Select → Explore → Filter/Impute → Calculated Columns → Visualisation]
Steps to get valuable information from data
  1. Load or read the data, such as CSV, JSON, or Parquet files, into tabular form with pandas or PySpark.
  2. Select fields based on the project requirements. This is called subsetting.
  3. Explore the data a bit if it is new to you.
  4. Filter out or impute the invalid data.
  5. Introduce new calculated columns based on existing columns, and aggregate the data using the methods the framework (pandas or PySpark) provides, such as group by, order by, and limit.
  6. Calculate some metrics or produce visualizations that business partners can easily review as supporting documents when making data-driven decisions. (A minimal PySpark sketch of these steps follows the list.)
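The sketch below strings these steps together with PySpark. It assumes the review subset is a JSON Lines file named reviews.json with columns such as asin, overall, and reviewText; those names are placeholders for whatever the actual subset provides.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("review-lifecycle").getOrCreate()

# 1. Load or read the data.
reviews = spark.read.json("reviews.json")

# 2. Select only the fields the project needs (subsetting).
subset = reviews.select("asin", "overall", "reviewText")

# 3. Explore a bit: schema, sample rows, and summary statistics.
subset.printSchema()
subset.show(5)
subset.describe("overall").show()

# 4. Filter out invalid rows (here: missing ratings or empty review text).
clean = subset.filter(F.col("overall").isNotNull() & (F.length("reviewText") > 0))

# 5. Add a calculated column, then aggregate with group by / order by / limit.
with_length = clean.withColumn("review_length", F.length("reviewText"))
top_products = (
    with_length.groupBy("asin")
    .agg(F.avg("overall").alias("avg_rating"),
         F.count("*").alias("n_reviews"))
    .orderBy(F.desc("n_reviews"))
    .limit(10)
)

# 6. Produce a metric or a small table that is easy to review.
top_products.show()
```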

Useful tips

Always make a snapshot of the working DataFrame whenever it makes sense.

It reduces the extra overhead of querying the whole DataFrame and makes our queries much faster. Additionally, it allows us to get rid of fields we won’t use.

In PySpark, cache() (or the more configurable persist()) keeps a DataFrame, or a subset of it, in memory, spilling to local disk if needed. Later tasks reuse that cached data instead of recomputing it from the source, which increases query speed significantly.
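A minimal sketch of this snapshot pattern, assuming the same hypothetical reviews.json file and column names as above: select only the needed fields, cache the result, and reuse it for several queries.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

reviews = spark.read.json("reviews.json")

# Snapshot: keep only the fields we actually need, then cache it.
snapshot = reviews.select("asin", "overall").cache()
snapshot.count()  # an action materializes the cache; caching itself is lazy

# Later queries hit the in-memory snapshot instead of re-reading the file.
snapshot.groupBy("asin").agg(F.avg("overall").alias("avg_rating")).show()
print(snapshot.filter(F.col("overall") >= 4).count())
```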