Spark can read input data from any Hadoop data source. (Hadoop is a framework for distributed storage and big-data processing that uses the MapReduce programming model.) Every Spark application has a driver and worker nodes. An important question is how Spark determines how many workers each application needs in order to execute. That is the job of the cluster manager.
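As a quick illustration, a driver program can point a SparkContext at a Hadoop data source the same way it would at a local file. This is a minimal sketch; the namenode host, port, and file path below are hypothetical placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal driver that reads a text file from HDFS (a Hadoop data source).
// The namenode host/port and the file path are placeholders, not real endpoints.
object ReadFromHdfs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("read-from-hdfs")
    val sc   = new SparkContext(conf)

    // Any Hadoop-supported URI scheme (hdfs://, s3a://, file://, ...) works here.
    val lines = sc.textFile("hdfs://namenode:9000/data/input.txt")
    println(s"Line count: ${lines.count()}")

    sc.stop()
  }
}
```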

Cluster manager

Multiple applications can run on Spark at the same time. If a user starts an application while others are already running on a cluster of machines, that application needs resources allocated to its tasks. This is where the cluster manager comes in. The driver uses the cluster manager (an external service) to acquire a cluster of machines for the application. The cluster manager also monitors the cluster, detecting failed workers and replacing them, which greatly reduces the programming complexity that would otherwise have to be built into Spark itself.
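Concretely, the cluster manager an application talks to is chosen through the master URL the driver is configured with. The sketch below lists the standard master URL forms; the host names and ports are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Selecting a cluster manager via the master URL (hosts are placeholders):
//   "spark://master:7077"  -> Spark's standalone cluster manager
//   "yarn"                 -> Hadoop YARN
//   "mesos://master:5050"  -> Apache Mesos
//   "local[4]"             -> no cluster manager; run locally on 4 threads
val conf = new SparkConf()
  .setAppName("cluster-manager-demo")
  .setMaster("spark://master:7077") // standalone manager in this sketch

val sc = new SparkContext(conf)
```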

The cluster managers that Spark can use include Mesos, YARN, and Spark’s own standalone cluster manager. One option available on all of them is static partitioning of resources, meaning that each application is given a maximum amount of resources up front and holds on to them for the duration of its execution. Under this scheme, the following resource allocations can be controlled (see the sketch after the list):

  • The number of executors an application gets

  • The number of cores per executor

  • The executor memory
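As a sketch of how these three allocations are set, the standard Spark configuration keys below control the executor count, cores per executor, and executor memory. The values shown are arbitrary examples, not recommendations.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Controlling the three resource allocations listed above.
// Values are arbitrary examples.
val conf = new SparkConf()
  .setAppName("resource-allocation-demo")
  .set("spark.executor.instances", "4") // number of executors (effective on YARN)
  .set("spark.executor.cores", "2")     // cores per executor
  .set("spark.executor.memory", "4g")   // memory per executor

val sc = new SparkContext(conf)
```

With spark-submit, the same knobs correspond to the --num-executors (YARN), --executor-cores, and --executor-memory flags.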
