Introduction to PySpark for Batch Pipelines

Get introduced to PySpark for Batch Pipelines.

Overview

Spark is a general-purpose computing framework that can scale to massive data volumes. It builds upon prior big data tools such as Hadoop MapReduce, while providing significant improvements in the expressivity of the languages it supports. One of the core components of Spark is resilient distributed datasets (RDDs), which enable clusters of machines to perform workloads in a coordinated and fault-tolerant manner. In more recent versions of Spark, the DataFrame API provides an abstraction on top of RDDs that resembles the data frame structures found in R and pandas. PySpark is the Python interface for Spark, and it provides an API for working with large-scale datasets in a distributed computing environment.
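To make this concrete, below is a minimal sketch of the PySpark DataFrame API running on a local SparkSession. The application name, sample rows, and filter condition are illustrative only; a real batch pipeline would typically read its input from a data source such as Parquet or CSV files instead.

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point to the DataFrame API.
spark = SparkSession.builder.appName("batch-pipeline-intro").getOrCreate()

# Build a small DataFrame from in-memory rows; in a real batch pipeline this
# would usually come from spark.read.parquet(...) or spark.read.csv(...).
df = spark.createDataFrame(
    [("alice", 34), ("bob", 29)],
    ["name", "age"],
)

# DataFrame operations are planned lazily and executed across the cluster
# (or locally, as here) only when an action such as show() is called.
df.filter(df.age > 30).show()

spark.stop()
```

Although the transformations look like ordinary pandas-style calls, Spark distributes the underlying work across RDD partitions, which is what allows the same code to scale from a laptop to a cluster.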
