Read Parquet Data Source

Learn how to read a Parquet data source with PySpark.

We'll cover the following...

Read data from a snapshot
Create a data catalog
Load data into PySpark

The PySpark API already provides a built-in function for reading distributed data. We only have to give it the location of the main directory, and PySpark treats the whole directory as a single data source. The SparkSession exposes a spark.read.<filetype> ...

SparkSession vs. SparkContext: Since the earliest versions of Spark and PySpark, SparkContext (JavaSparkContext for Java) has been the entry point for programming with RDDs and for connecting to a Spark cluster. Spark 2.0 introduced SparkSession, which became the entry point for programming with DataFrame and Dataset.
