This course offers a rich, interactive way to learn the fundamentals of Big Data. Throughout the course, you will have plenty of opportunities to get your hands dirty with functioning Hadoop clusters.

You will start off by learning about the rise of Big Data and the different types of data: structured, unstructured, and semi-structured. You will then dive into the core technologies of Big Data, such as YARN (Yet Another Resource Negotiator), MapReduce, HDFS (Hadoop Distributed File System), and Spark.

By the end of this course, you will have the foundations in place to start working with Big Data, which is a massively growing field.

Introduction to Big Data and Hadoop

## Introduction

Pig is a language for parallel data processing. A useful analogy is the relationship between higher-level programming languages and assembly language. A program written in assembly can be just as good as one written in a language like Java or C++, but writing it takes far more effort. Pig plays the same role for data processing: it provides an abstraction over MapReduce and other frameworks so that data analysis jobs are easy to express. The MapReduce paradigm requires writing a map function followed by a reduce function, which becomes challenging for complex workflows such as joins. Pig lets the user express a join directly and hides the underlying MapReduce complexity.
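To make the point about joins concrete, here is a minimal sketch in plain Python (not Hadoop code) of what a hand-written reduce-side inner join involves. The sample data and the names `map_phase`, `shuffle`, `reduce_phase`, and `run_join` are hypothetical and exist only for illustration; a real MapReduce job would also need job configuration, input formats, and serialization.

```python
from collections import defaultdict

# Hypothetical sample inputs: (user_id, name) and (user_id, item).
users = [(1, "ana"), (2, "bo")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]


def map_phase(users, orders):
    """Tag each record with its source and emit (join_key, tagged_record)."""
    for user_id, name in users:
        yield user_id, ("U", name)
    for user_id, item in orders:
        yield user_id, ("O", item)


def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle step."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Pair every user record with every order record sharing the key."""
    names = [v for tag, v in values if tag == "U"]
    items = [v for tag, v in values if tag == "O"]
    for name in names:
        for item in items:
            yield key, name, item


def run_join(users, orders):
    """Run map, shuffle, and reduce in sequence and collect the joined rows."""
    groups = shuffle(map_phase(users, orders))
    result = []
    for key in sorted(groups):
        result.extend(reduce_phase(key, groups[key]))
    return result


joined = run_join(users, orders)
print(joined)
```

In Pig Latin, the same inner join would be a single statement along the lines of `joined = JOIN users BY id, orders BY user_id;` (the relation and field names are assumed here), with the tagging, grouping, and pairing left to Pig.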

Pig's language layer consists of a textual language called _Pig Latin_, which is used to express dataflows. Pig's infrastructure layer is the environment in which Pig Latin programs are executed. It consists of a compiler that translates Pig Latin into sequences of jobs for an execution engine such as MapReduce, Spark, or Tez. Pig is not tied to a particular parallel framework, but it was first implemented on Hadoop. Originally developed at Yahoo Research in 2006, it offered researchers an ad-hoc way of creating and executing MapReduce jobs on large data sets. In 2007, it moved into the Apache Software Foundation. The unconventional name came from the [guiding principles](https://pig.apache.org/philosophy.html) of the project, which have a lot in common with pigs. Pig is fast, easy to use, compatible with different compute engines, and works with structured or unstructured data.

## Execution modes
Pig has the following execution modes:

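+ local: runs in a single JVM on one machine, using the local file system
+ MapReduce: runs on a Hadoop cluster with HDFS; this is the default mode
+ Spark: runs Pig Latin programs on the Spark engine
+ Tez: runs Pig Latin programs on the Tez engine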
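
The mode is normally chosen when Pig is launched. As a rough sketch, the helper below builds the command line for each mode, assuming the `pig` launcher is on the `PATH` and accepts the `-x` flag with the mode names shown; the helper `pig_command` and the script name `wordcount.pig` are hypothetical, and the Spark mode in particular depends on the Pig version installed.

```python
import shlex

# Mode names passed to the launcher's -x flag (assumed; check your Pig version).
MODE_FLAGS = {
    "local": "local",
    "MapReduce": "mapreduce",
    "Spark": "spark",
    "Tez": "tez",
}


def pig_command(mode, script):
    """Return the argument list that would run `script` in the given mode."""
    if mode not in MODE_FLAGS:
        raise ValueError(f"unknown execution mode: {mode}")
    return ["pig", "-x", MODE_FLAGS[mode], script]


# Print one command line per mode; nothing is executed here.
for mode in MODE_FLAGS:
    print(shlex.join(pig_command(mode, "wordcount.pig")))
```

The list returned by `pig_command` could be handed to `subprocess.run` on a machine where Pig is installed; the sketch only prints the commands so that it runs anywhere.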

