
More Operations with DataFrames

Explore core DataFrame operations in Spark including renaming columns, changing data types, and using aggregation methods for analysis. Understand how to handle date conversion, perform groupings, and use functions like avg, max, min, and sum to extract insights from your data. This lesson helps you manipulate and analyze structured data with confidence.

We can also rename, drop, or change the data type of DataFrame columns. Let's look at examples of these operations.

Changing column names

Our data has a rather awkward name for the column that represents movie rating: hitFlop. We can rename the column to the more appropriate name, “Rating,” using the withColumnRenamed method.

scala> val moviesNewColDF = movies.withColumnRenamed("hitFlop","Rating")
moviesNewColDF: org.apache.spark.sql.DataFrame = [imdbId: string, title: string ... 8 more fields]

scala> moviesNewColDF.printSchema
root
 |-- imdbId: string (nullable = true)
 |-- title: string (nullable = true)
 |-- releaseYear: string (nullable = true)
 |-- releaseDate: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- writers: string (nullable = true)
 |-- actors: string (nullable = true)
 |-- directors: string (nullable = true)
 |-- sequel: integer (nullable = true)
 |-- Rating: integer (nullable = true)

The original movies DataFrame isn't changed; instead, a new DataFrame is created with the renamed column.
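As a quick sanity check (a sketch, not part of the lesson's transcript), you can compare the column lists of the two DataFrames in the shell:

// DataFrames are immutable: the rename shows up only in the new DataFrame,
// while the original movies DataFrame still carries the hitFlop column.
movies.columns.contains("hitFlop")            // true
moviesNewColDF.columns.contains("Rating")     // true
moviesNewColDF.columns.contains("hitFlop")    // false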

Changing column types

In our original movies DataFrame, the releaseDate column is inferred as a string rather than a date when we don't use the samplingRatio option. To fix this, we can use the withColumn method together with the to_date function to create a new launchDate column that parses releaseDate as a date, and then drop the original string column.

scala> val newDF = movies.withColumn("launchDate", to_date($"releaseDate", "d MMM yyyy")).drop("releaseDate")
newDF: org.apache.spark.sql.DataFrame = [imdbId: string, title: string ... 8 more fields]

scala> newDF.printSchema
root
 |-- imdbId: string (nullable = true)
 |-- title: string (nullable = true)
 |-- releaseYear: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- writers: string (nullable = true)
 |-- actors: string (nullable = true)
 |-- directors: string (nullable = true)
 |-- sequel: integer (nullable = true)
 |-- hitFlop: integer (nullable = true)
 |-- launchDate: date (nullable = true)
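Because launchDate now has a proper date type, date comparisons and functions can be applied to it directly. The snippet below is a minimal sketch (not part of the lesson's transcript) that uses the year function from org.apache.spark.sql.functions, available in spark-shell alongside to_date; the rows returned depend on your data:

// Keep only movies released on or after 1 Jan 2010 and derive the year
// from the new date column; the string literal is cast to a date for the comparison.
newDF
  .filter($"launchDate" >= "2010-01-01")
  .select($"title", year($"launchDate").as("launchYear"))
  .show(5)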