Solution: PySpark Data Structures
Learn about the solution to the problem from the previous lesson.
We'll cover the following...
Let’s look at the solutions to the quiz problems on PySpark data structures, focusing specifically on PySpark DataFrames.
- Create a PySpark DataFrame named `df`, as shown below, with the following provided data:
```python
data = [('1', 'Alice', '25', 'NY'), ('2', 'Bob', '30', 'Chicago'), ('3', 'Charlie', '35', 'San Diego')]
```

| ID | Name    | Age | City      |
|----|---------|-----|-----------|
| 1  | Alice   | 25  | NY        |
| 2  | Bob     | 30  | Chicago   |
| 3  | Charlie | 35  | San Diego |
Python 3.8
```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# Copy the input data to here
data = [('1', 'Alice', '25', 'NY'), ('2', 'Bob', '30', 'Chicago'), ('3', 'Charlie', '35', 'San Diego')]

# Create an RDD first
rdd = spark.sparkContext.parallelize(data)

# Create a PySpark DataFrame named df
df = spark.createDataFrame(rdd, schema=["Id", "name", "age", "city"])

# Print the contents of the df
df.show()
```
Let’s understand the above solution now:
- Line 1: Import the `SparkSession` class from the `pyspark.sql` module.
- Line 2: Create a `SparkSession` using the `builder` pattern and the `getOrCreate()` method.
- Line 5: Copy the input data from the question.
- Line 8: Create an RDD first by using `spark.sparkContext.parallelize(data)`.
- Line 11: Use the `createDataFrame()` method of the `SparkSession` to create a PySpark DataFrame named `df` from the created RDD `rdd`. The `schema` parameter is provided to specify the column names.
- Line 14: Use the `show()` method of the DataFrame to display the contents of the DataFrame `df`.
- Show the first three rows of the `df` DataFrame and print the schema.
Python 3.8
```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

# Copy the input data to here
data = [('1', 'Alice', '25', 'NY'), ('2', 'Bob', '30', 'Chicago'), ('3', 'Charlie', '35', 'San Diego')]

# Create an RDD first
rdd = spark.sparkContext.parallelize(data)

# Create a PySpark DataFrame named df
df = spark.createDataFrame(rdd, schema=["Id", "name", "age", "city"])

# Show the first three rows of the `df` DataFrame
for row in df.take(3):
    print(row)

# Print the schema of the `df` DataFrame
df.printSchema()
```
Let’s understand the above solution:
- Line 1: Import the `SparkSession` class from the `pyspark.sql` module.
- Line 2: Create a `SparkSession` using the `builder` pattern and the `getOrCreate()` method.