Trusted answers to developer questions

Related Tags

spark

What are the types of cluster managers in Spark?

Fahad Farid

The Spark Cluster

Spark applications run as independent sets of processes on a cluster. These clusters are coordinated by the SparkContext object in the driver (main) program.

To run on a cluster:

  • SparkContext must connect to a cluster manager, which allocates resources across applications.
  • Once connected, Spark acquires executors on the nodes in the cluster: processes that run computations and store data for the application.
  • Spark then sends the application code to the executors, and SparkContext sends them tasks to run.
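Which cluster manager SparkContext connects to is determined by the master URL the driver is configured with. The sketch below is illustrative only (the function is not part of Spark's API), but the URL schemes it recognizes are the real ones Spark accepts:

```python
# Illustrative sketch, NOT part of Spark's API: how the master URL passed to
# SparkContext selects the cluster manager the driver connects to.

def cluster_manager_for(master_url: str) -> str:
    """Map a Spark master URL to the cluster manager it selects."""
    if master_url.startswith("local"):
        return "local mode (no cluster manager; everything runs in the driver JVM)"
    if master_url.startswith("spark://"):
        return "Standalone"
    if master_url.startswith("mesos://"):
        return "Apache Mesos"
    if master_url == "yarn":
        return "Hadoop YARN (cluster location is read from the Hadoop configuration)"
    if master_url.startswith("k8s://"):
        return "Kubernetes"
    raise ValueError(f"Unrecognized master URL: {master_url}")

print(cluster_manager_for("spark://master-host:7077"))
print(cluster_manager_for("k8s://https://apiserver.example.com:6443"))
```

The host names above are hypothetical; in a real deployment the master URL points at your Standalone master, Mesos master, or Kubernetes API server.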

This approach has many advantages:

  • Each application gets its own executor processes, which run tasks in multiple threads. This isolates applications from one another on both the scheduling side and the executor side. It also means, however, that data cannot be shared across applications without writing it to external storage.
  • Spark is agnostic to the underlying cluster manager: as long as Spark can acquire executor processes and those processes can communicate with each other, it is easy to run, even on a manager that also supports other applications.
  • The driver program must listen for and accept incoming connections from its executors throughout its lifetime, so it must be network addressable from the worker nodes.
  • Since the driver schedules the tasks, it should run close to the worker nodes, preferably on the same local area network. For remote requests, it's better to open an RPC to the driver and have it submit operations from nearby.

Spark currently supports the following cluster managers:

  • Standalone: A simple cluster manager included with Spark. It can access HDFS, is easy to set up, and is well documented online. It is resilient in nature, successfully handling worker failures, and it can manage resources according to the requirements of each application.

  • Apache Mesos: A general-purpose, distributed cluster manager that manages resources per application, so Spark jobs, Hadoop MapReduce, and other service applications can run side by side on the same cluster. Mesos provides APIs for most popular programming languages.

  • Hadoop YARN: The resource manager in Hadoop 2 and later. It acts as a distributed computing framework that handles both job scheduling and resource management, and it offers pluggable schedulers and executors out of the box.

  • Kubernetes: A system for automating the deployment, scaling, and management of containerized applications. Spark runs on it via a native Kubernetes scheduler that has been added to Spark; however, this scheduler is currently experimental.
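In practice, the same application can be submitted to any of these managers by changing only the master URL passed to `spark-submit`. The sketch below assembles one invocation per manager; the host names and the `my_app.py` path are hypothetical, while `--master` and `--deploy-mode` are real `spark-submit` options:

```python
# Hedged sketch: building a spark-submit command line for each cluster manager.
# Host names and the application path are placeholders, not a real deployment.

def spark_submit_cmd(master: str, app: str, deploy_mode: str = "cluster") -> list:
    """Assemble a spark-submit invocation as a list of arguments."""
    return ["spark-submit", "--master", master, "--deploy-mode", deploy_mode, app]

# One invocation per supported cluster manager:
for master in [
    "spark://master-host:7077",          # Standalone
    "mesos://mesos-master:5050",         # Apache Mesos
    "yarn",                              # Hadoop YARN (cluster found via Hadoop config)
    "k8s://https://k8s-apiserver:6443",  # Kubernetes
]:
    print(" ".join(spark_submit_cmd(master, "my_app.py")))
```

The point of the design is that the application code does not change: only the submission-time `--master` value selects which cluster manager allocates the executors.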


Copyright ©2022 Educative, Inc. All rights reserved