Trusted answers to developer questions

Related Tags

spark
big data
apache

What are Resilient Distributed Datasets in Apache Spark?

Educative Answers Team

Resilient Distributed Datasets

Resilient Distributed Datasets (RDDs) are fault-tolerant collections of data in Apache Spark that can be operated on in parallel.

Setting up RDDs

There are two main ways to create RDDs.

Parallelizing iterables

A simple way to create an RDD is to parallelize an existing iterable, such as a list, with the SparkContext's parallelize method. The following code shows how a normal Python list may be parallelized.

numList = [10, 20, 30, 40, 50]
rddList = sc.parallelize(numList)  # sc is the SparkContext

Reading data from existing sources

Spark can also read data from existing storage sources such as the local file system, HDFS, Cassandra, and Amazon S3. Supported formats include text files, SequenceFiles, and any other Hadoop InputFormat.

The following code block demonstrates how data can be read from locally stored text files.

rddFile = sc.textFile("data.txt")  # sc is the SparkContext

Note that the file must be accessible at the same path on all worker nodes. One solution is to copy the file to the same directory on every worker; another is to use a network-mounted shared file system.

Once an RDD is set up, parallel operations such as map, filter, and reduce can be carried out on it.

Copyright ©2022 Educative, Inc. All rights reserved