
An Example

Explore how to use Spark's low-level APIs to solve a car brand counting problem with fewer lines of code than MapReduce. Learn to create and manipulate RDDs, perform transformations such as map and flatMap, and apply reduceByKey to aggregate results. Understand Spark's process flow for processing text data in a distributed cluster environment.


In this lesson, we’ll use Spark to count cars by brand name as listed in a text file. We previously solved this same problem with MapReduce; now we’ll see how Spark implements a solution in far fewer lines of code.

  1. We’ll start with some commands for manipulating RDDs. Start the spark-shell in a terminal. Once the shell loads successfully, we see the Scala prompt; the commands we execute are written in Scala. The entry point to Spark’s low-level APIs is the SparkContext, which can be accessed as spark.sparkContext. If you enter that expression, you’ll see the object print out on the console as follows:
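
A minimal sketch of that exchange at the spark-shell prompt; the result variable number and the object’s hexadecimal address will vary on your machine:

scala> spark.sparkContext
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@1a2b3c4d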

Next, create an RDD by reading a comma-separated file containing car records from the local disk as follows:

val carsRDD =
...
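
The command above is cut off here. What follows is a minimal sketch of how it might be completed, not the lesson’s exact code: the file name cars.csv is a hypothetical stand-in, and we assume one comma-separated car record per line with the brand in the first field. The reduceByKey step mentioned in the overview then aggregates a count per brand:

// Read the file from local disk into an RDD of lines (hypothetical path)
val carsRDD = spark.sparkContext.textFile("cars.csv")

// Emit a (brand, 1) pair per record, assuming the brand is the first
// comma-separated field, then sum the counts for each brand
val brandCounts = carsRDD
  .map(line => (line.split(",")(0).trim, 1))
  .reduceByKey(_ + _)

// Bring the results back to the driver and print them
brandCounts.collect().foreach(println)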