Components and Architecture

Explore the fundamental components and architecture of Apache Spark. Learn about Spark Core, Spark SQL, cluster managers, and how Spark uses in-memory and disk storage for efficient data processing and scalability.

We'll cover the following...

Core components

Spark Core
Spark SQL

Comparison of RDDs and DataFrames

Spark Streaming
MLlib and GraphX components

Cluster managers
Spark runtime architecture

Storage and memory

Core components

Behind the scenes, Spark consists of a core component on top of which different libraries sit. This is by design: the creators of Spark chose this architecture so that they could keep adding modules for different functionalities.

This type of architecture resembles a “Plugin Architecture,” in which features can be developed and incorporated over time.

Let’s take a brief look at each of these components.

Spark Core

The nucleus of Spark contains the basic but fundamental functionality for scheduling application execution, managing memory, interacting with storage systems, recovering from faults, and so on.

Spark Core is the home of the Resilient Distributed Dataset (RDD) data structure, an in-memory, fault-tolerant, and immutable collection of elements representing partitioned data. Besides raw data, it can also contain more complex types of data such as Scala, Python, or Java programmatic ...
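The three RDD properties named above (partitioned, immutable, fault-tolerant) can be illustrated with a small toy model in plain Python. This is only a sketch and not Spark's API: the names ToyRDD, parallelize, compute_partition, and collect are hypothetical stand-ins chosen for this example, and everything runs in a single process. The idea it shows is that a transformation never modifies existing data; it returns a new dataset that remembers its parent, so any lost partition can be rebuilt by replaying that lineage.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)  # frozen: instances cannot be modified after creation
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, NOT Spark's API)."""
    source: Tuple[tuple, ...] = ()       # partitioned source data (root datasets only)
    parent: Optional["ToyRDD"] = None    # lineage: the dataset this one derives from
    fn: Optional[Callable] = None        # per-element function applied to the parent

    def num_partitions(self) -> int:
        # Derived datasets have the same partitioning as their root.
        return len(self.source) if self.parent is None else self.parent.num_partitions()

    def map(self, fn: Callable) -> "ToyRDD":
        # Immutability: return a NEW dataset that only records how to derive itself.
        return ToyRDD(parent=self, fn=fn)

    def compute_partition(self, i: int) -> tuple:
        # Fault tolerance: a partition is always rebuilt from lineage on demand.
        if self.parent is None:
            return self.source[i]
        return tuple(self.fn(x) for x in self.parent.compute_partition(i))

    def collect(self) -> list:
        # Gather all partitions, in order, into one local list.
        return [x for i in range(self.num_partitions()) for x in self.compute_partition(i)]


def parallelize(data, num_partitions: int) -> ToyRDD:
    """Split data into contiguous chunks, one chunk per partition."""
    items = list(data)
    size = max(1, -(-len(items) // num_partitions))  # ceiling division
    chunks = tuple(tuple(items[i:i + size]) for i in range(0, len(items), size))
    return ToyRDD(source=chunks)


numbers = parallelize(range(1, 7), num_partitions=3)
squared = numbers.map(lambda x: x * x)

print(squared.collect())             # all partitions of the derived dataset
print(squared.compute_partition(1))  # rebuild just one partition from lineage
print(numbers.collect())             # the original dataset is unchanged
```

In real Spark, partitions live on different executors and recomputation is driven by the scheduler, but the principle of rebuilding data from recorded lineage is the same one this sketch mimics.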

