Solution: PySpark Data Structures
Learn about the solution to the problem from the previous lesson.
Let’s look at the solutions to the problems on PySpark data structures, focusing specifically on PySpark DataFrames.
- Create a PySpark DataFrame named `df`, as shown below, from the following data:
data = [('1', 'Alice', '25', 'NY'), ('2', 'Bob', '30', 'Chicago'), ('3', 'Charlie', '35', 'San Diego')]
| ID | Name | Age | City |
|----|------|-----|------|
| 1 | Alice | 25 | NY |
| 2 | Bob | 30 | Chicago |
| 3 | Charlie | 35 | San Diego |
Let’s understand the above solution now:
- Line 1: Import the `SparkSession` class from the `pyspark.sql` module.
- Line 2: Create a `SparkSession` using the `builder` pattern and the `getOrCreate()` method.
- Line 5: Copy the input data from the question.
- Line 8: Create an RDD first by using `spark.sparkContext.parallelize(data)`.
- Line 11: Use the `createDataFrame()` method of the `SparkSession` to create a PySpark DataFrame named `df` from the created RDD `rdd`. The `schema` parameter is provided to specify the column names.
- Line 14: Use the `show()` method of the DataFrame to display the contents of the DataFrame `df`.
- Show the first three rows of the `df` DataFrame and print the schema.
Let’s understand the above solution:
- Line 1: Import the `SparkSession` class from the `pyspark.sql` module.
- Line 2: Create a `SparkSession` using the `builder`