Running Spark Applications

This lesson explains SparkSession and SparkContext and demonstrates running a Spark application.

In previous lessons, when we fired up the spark-shell, we interacted with an object of type SparkSession, represented by the variable spark. Starting with Spark 2.0, SparkSession is the single, unified entry point for manipulating data with Spark. There is a one-to-one correspondence between a Spark application and a SparkSession: each Spark application is associated with exactly one SparkSession. SparkSession holds another field, SparkContext, which represents the connection to the Spark cluster. The SparkContext can create RDDs, accumulators, and broadcast variables, and can run code on the cluster.
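To make the relationship concrete, here is a minimal Scala sketch of a standalone Spark application. In the spark-shell, the SparkSession is pre-created as the variable spark, but in an application we build it ourselves and then reach the SparkContext through it. The application name and the local master URL below are illustrative choices, not part of the lesson.

```scala
import org.apache.spark.sql.SparkSession

object WordCountApp {
  def main(args: Array[String]): Unit = {
    // Build the SparkSession: the single, unified entry point since Spark 2.0.
    val spark = SparkSession.builder()
      .appName("WordCountApp")   // hypothetical application name
      .master("local[*]")        // run locally on all cores (for illustration)
      .getOrCreate()

    // The SparkContext lives inside the SparkSession and represents
    // the connection to the cluster.
    val sc = spark.sparkContext

    // SparkContext can create RDDs and run code on the cluster.
    val words  = sc.parallelize(Seq("spark", "hadoop", "spark"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println)

    spark.stop()
  }
}
```

Calling getOrCreate() rather than constructing a new session each time reflects the one-to-one correspondence between an application and its SparkSession: if a session already exists in the application, the same one is returned.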

The illustration below shows how Spark interacts with and runs jobs on a Hadoop cluster.
