A Spark cluster is a combination of a Driver Program, a Cluster Manager, and Worker Nodes that work together to complete tasks. The SparkContext lets us coordinate processes across the cluster; it sends tasks to the Executors on the Worker Nodes to run.
Here’s a diagram to help you visualize a Spark cluster:
The first step in managing a Spark cluster is to launch one. Follow the steps below to launch your own.
This setup is for launching a cluster with one Master Node and two Worker Nodes.
Install Java on all the nodes. To install Java, run the following commands:

sudo apt update
sudo apt install openjdk-8-jre-headless

To check whether Java was installed successfully, run the following command:

java -version
Similarly, install Scala on all the nodes:

sudo apt install scala

To check whether Scala was installed successfully, run the following command:

scala -version
To allow the cluster nodes to communicate with each other, we need to set up keyless SSH. To do so, install openssh-server and openssh-client on the Master Node:

sudo apt install openssh-server openssh-client
Create an RSA key pair and name the key files. The following creates a key pair and saves it in files named rsaID and rsaID.pub:

cd ~/.ssh
ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key: rsaID
Your identification has been saved in rsaID.
Your public key has been saved in rsaID.pub.
Then, manually copy the contents of the rsaID.pub file into the ~/.ssh/authorized_keys file on each Worker Node. The entire contents should be on one line that starts with ssh-rsa and ends with the key's comment (your user or email identifier), as shown below:

cat ~/.ssh/rsaID.pub
ssh-rsa GGGGEGEGEA1421afawfa53Aga454aAG... firstname.lastname@example.org

To verify that SSH works, try to SSH from the Master Node into a Worker Node. Run the following command:

ssh -i ~/.ssh/rsaID <user>@<worker-private-ip>
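If you would rather not edit authorized_keys by hand, the ssh-copy-id tool (installed with openssh-client) can append the key for you. The user name and worker IPs below are placeholders for your own values:

# run once per Worker Node, from the Master Node
ssh-copy-id -i ~/.ssh/rsaID.pub <user>@<worker-private-ip1>
ssh-copy-id -i ~/.ssh/rsaID.pub <user>@<worker-private-ip2>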
Install Spark on all the nodes using the following command:
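One way to download this release on each node is to pull the tarball from the Apache archive; the URL below assumes the standard archive location for Spark 2.4.3 with Hadoop 2.7:

wget https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz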
Extract the files, move them to /usr/local/spark, and add spark/bin to the PATH variable:

tar xvf spark-2.4.3-bin-hadoop2.7.tgz
sudo mv spark-2.4.3-bin-hadoop2.7/ /usr/local/spark
vi ~/.profile
# add the following line to ~/.profile, then save and exit
export PATH=/usr/local/spark/bin:$PATH
source ~/.profile
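To confirm that the PATH change took effect, you can print the Spark version; spark-submit ships in spark/bin:

spark-submit --version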
Now, configure the Master Node to keep track of its Worker Nodes. To do this, we need to update the shell file, /usr/local/spark/conf/spark-env.sh.
CAUTION: If spark-env.sh doesn't exist, copy spark-env.sh.template and rename the copy to spark-env.sh.
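For example:

cp /usr/local/spark/conf/spark-env.sh.template /usr/local/spark/conf/spark-env.sh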
# contents of conf/spark-env.sh
export SPARK_MASTER_HOST=<master-private-ip>
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# For PySpark use
export PYSPARK_PYTHON=python3
We will also add all the IPs where a worker will be started. Open the /usr/local/spark/conf/slaves file and paste the following:
# contents of conf/slaves
<worker-private-ip1>
<worker-private-ip2>
Start the cluster using the following command.
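With a standard installation under /usr/local/spark, the master and all the workers listed in conf/slaves can typically be started from the Master Node with the bundled start-all.sh script:

/usr/local/spark/sbin/start-all.sh

Once the cluster is up, the master's web UI should be reachable at http://<master-private-ip>:8080, and you can connect a shell to the cluster with spark-shell --master spark://<master-private-ip>:7077.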