In this chapter, we will discuss the architecture of Apache Spark. Spark is a good example of a distributed system: many independent components cooperate to achieve one common goal.

The Spark architecture consists of several components that work together. Each component runs as one or more processes, often spread across multiple nodes. In most cases, Spark is deployed as a multi-node cluster for big data processing. We will discuss the components and show how they interact with each other.

High-level Apache Spark architecture

In order to understand the architecture, we first need to know a few Spark-specific concepts.

Resilient Distributed Dataset (RDD)

An RDD is Spark's abstraction for the data being processed. When we feed data into a Spark program, Spark reads the data and creates an RDD from it. Under the hood, Spark processes the data and produces results by operating on RDDs.

So what is an RDD? Let’s go over each part of the name.

  • Resilient: Fault-tolerant. If a node fails, the lost data can be recomputed from the RDD's lineage, the record of operations used to build it.

  • Distributed: Data is distributed among multiple nodes in a cluster.

  • Dataset: A collection of data split into partitions. The developer can control how the data is partitioned, for example by specifying the number of partitions or a partitioning key.

In short, an RDD is a fault-tolerant collection of data, split into partitions and distributed among the nodes of a cluster. Each partition is assigned to a node, and the nodes can process their partitions independently.
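As a rough illustration, here is a minimal Scala sketch of creating a partitioned RDD. It assumes a SparkContext named `sc` is already available (the driver program sketch later in this section shows where one comes from); the variable names and values are purely illustrative.

```scala
// Assumes an existing SparkContext named sc (see the driver program
// sketch in the "Driver node" section below).
// parallelize() splits the local collection into 4 partitions that
// worker nodes can process independently of each other.
val numbers = sc.parallelize(1 to 1000, numSlices = 4)
println(numbers.getNumPartitions)   // prints 4
```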

RDD is a powerful abstraction. Its greatest benefit is that developers do not need to think of an RDD as a distributed set of data at all. They can write code that looks like ordinary, sequential function calls on a local collection, and under the hood Spark distributes the data and runs the work in parallel.
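Continuing the sketch above, the following lines read like plain collection code, yet Spark executes them across the RDD's partitions in parallel:

```scala
// map() is applied to each partition independently; reduce() combines
// the partial results from the partitions into a single value.
val sumOfSquares = numbers
  .map(n => n * n)
  .reduce(_ + _)
println(sumOfSquares)
```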

Driver node

Every Spark application has a driver node. The driver node runs the driver program.

The code that developers write against the Spark API becomes the driver program, and it contains the instructions for the driver node. From these instructions, the driver node builds an execution plan and commands the other nodes in the cluster to carry out specific tasks.
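For a concrete picture, here is a minimal sketch of a driver program in Scala. The object name, application name, and data are placeholders, and `master("local[*]")` stands in for connecting to a real cluster.

```scala
import org.apache.spark.sql.SparkSession

object WordLengthApp {
  def main(args: Array[String]): Unit = {
    // Everything in main() runs on the driver node. Creating the
    // SparkSession connects the driver to the cluster (here, a local one).
    val spark = SparkSession.builder()
      .appName("word-length-app")
      .master("local[*]")   // placeholder; a real deployment would point at a cluster manager
      .getOrCreate()
    val sc = spark.sparkContext   // the SparkContext used in the sketches above

    // The driver turns these instructions into tasks and schedules them on
    // worker nodes; collect() brings the results back to the driver.
    val lengths = sc.parallelize(Seq("spark", "driver", "worker"))
      .map(word => word.length)
      .collect()
    println(lengths.mkString(", "))

    spark.stop()
  }
}
```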

We’ll go over this in more detail in the next sections.

Worker nodes

There can be anywhere from a few to a few thousand worker nodes in a Spark cluster. The worker nodes are responsible for doing all the heavy lifting on the partitions assigned to them. They simply carry out the tasks the driver sends them, which ultimately come from the instructions the developer wrote in the code.
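The amount of worker-side resources an application uses is typically requested through configuration. The keys below are standard Spark settings (how they are honored depends on the cluster manager in use); the values are purely illustrative.

```scala
import org.apache.spark.sql.SparkSession

// A sketch of requesting worker-side resources; the values are illustrative.
// The cluster manager / master URL is supplied when the application is
// submitted (see the next section).
val spark = SparkSession.builder()
  .appName("resource-config-sketch")
  .config("spark.executor.instances", "4")   // executor processes to run on the workers
  .config("spark.executor.cores", "2")       // CPU cores per executor
  .config("spark.executor.memory", "4g")     // memory per executor
  .getOrCreate()
```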

Cluster manager

The driver program works with a cluster manager, which is responsible for managing the cluster of worker nodes. Spark supports several cluster managers, including its own standalone manager, Hadoop YARN, and Kubernetes. Depending on how the system is configured, the cluster manager may run on the driver node or on a separate node.
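The choice of cluster manager surfaces in the application as the master setting, either in code or when the application is submitted. Here is a sketch of the common forms, with placeholder host names and ports:

```scala
import org.apache.spark.sql.SparkSession

// The master URL tells the driver which cluster manager to contact.
// Host names and ports below are placeholders.
val builder = SparkSession.builder().appName("cluster-manager-sketch")

builder.master("local[*]")                       // no cluster manager: run in a single local JVM
// builder.master("spark://manager-host:7077")   // Spark's standalone cluster manager
// builder.master("yarn")                        // Hadoop YARN
// builder.master("k8s://https://k8s-host:6443") // Kubernetes
```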

How components interact in Spark
