A PySpark Primer

An overview of PySpark.

We'll cover the following

What is PySpark?

PySpark is the Python API for Apache Spark, and it is a powerful tool for both exploratory analysis and building machine learning pipelines. The core data type in PySpark is the Spark dataframe, which is similar to the Pandas dataframe but is designed to execute in a distributed environment.

While the Spark Dataframe API provides a familiar interface for Python programmers, there are significant differences in how commands issued to these objects are executed.

A key difference is that Spark commands are lazily executed: transformations only build up an execution plan, and nothing runs until an action requests a result. And because the data is distributed across partitions rather than held in a single indexed structure, positional operations such as iloc are not available on these objects. While working with Spark dataframes can feel constraining, the benefit is that PySpark can scale to much larger datasets than Pandas.
