What are Resilient Distributed Datasets in Apache Spark?
Resilient Distributed Datasets
Resilient Distributed Datasets (RDDs) are fault-tolerant collections of data in Apache Spark that can be operated on in parallel.
Setting up RDDs
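There are two main ways to create RDDs: parallelizing an existing collection in the driver program, or reading a dataset from an external storage system.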
Parallelizing iterables
The simplest way to set up an RDD is to pass an existing iterable, such as a list, to the SparkContext's parallelize method. The following code shows how an ordinary Python list may be parallelized.
# `spark` is assumed to be an existing SparkContext
numList = [10, 20, 30, 40, 50]
rddList = spark.parallelize(numList)
Reading data from existing sources
Spark can read data from existing sources such as the local file system, HDFS, Cassandra, and Amazon S3. Supported file formats include text files, SequenceFiles, and any other Hadoop InputFormat.
The following code block demonstrates how data can be read from locally stored text files.
rddFile = spark.textFile("data.txt")
Note that when reading from the local file system, the file must be accessible at the same path on every worker node. One option is to copy the file to that path on each worker; another is to use a network-mounted shared file system.
Once set up, an RDD can be operated on in parallel.