Introduction

Learn the evolution and history behind Spark, the ubiquitous and unified big data processing platform.

Getting started with Spark

Spark has become the ubiquitous platform for data processing and has largely displaced the traditional MapReduce framework; in fact, some technologists would go so far as to declare MapReduce dead. In numerous benchmarks and performance studies, Spark has outperformed MapReduce by as much as two orders of magnitude, up to 100x for workloads that fit in memory. Below, we briefly recount the history behind Spark's dominance in the big data space.

History

The big data movement began in earnest with Google’s ambition to index the World Wide Web and make it searchable for users at lightning speed. The result was three internal systems:

  • Google File System (GFS): A fault-tolerant distributed file system running on clusters of cheap commodity hardware.

  • Bigtable: A scalable store of structured data on top of GFS.

  • MapReduce: A new parallel programming paradigm for processing large amounts of data distributed across GFS and Bigtable (a minimal sketch of the paradigm follows this list).
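
To make the paradigm concrete, here is a conceptual word-count sketch written with plain Scala collections rather than the actual Hadoop API; the object name, input data, and explicit three-phase structure are illustrative assumptions, but they mirror how a MapReduce job emits key-value pairs, groups them by key, and aggregates each group in parallel.

```scala
// A conceptual word-count sketch of the MapReduce paradigm using plain
// Scala collections (not the Hadoop API). Real MapReduce runs each phase
// in parallel across a cluster; here the phases are just sequential steps.
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val documents = Seq("spark is fast", "mapreduce is batch", "spark is unified")

    // Map phase: each input record emits (key, value) pairs.
    val mapped: Seq[(String, Int)] =
      documents.flatMap(_.split(" ")).map(word => (word, 1))

    // Shuffle phase: group all emitted values by key.
    val shuffled: Map[String, Seq[Int]] =
      mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2)) }

    // Reduce phase: aggregate the values for each key.
    val counts: Map[String, Int] =
      shuffled.map { case (word, ones) => (word, ones.sum) }

    counts.foreach(println) // e.g. (spark,2), (is,3), (fast,1), ...
  }
}
```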

Google’s work was proprietary, but the papers coming out of the effort led to Hadoop, an open-source implementation of Google’s ideas developed by Yahoo engineers. The Hadoop project was later donated to Apache.

Although MapReduce works well for batch processing, it is cumbersome and complex, has a steep learning curve, and is slow. Its central weakness is that it writes intermediate results to disk, which drags down the overall computation. Consider the scenario where one MR job’s output is fed into a second job as input: the first job dumps its output to disk upon completion, and the second job then reads that same data back from disk. All of this disk I/O slows down the overall workflow.
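
Spark sidesteps this bottleneck by keeping intermediate results in memory. Below is a minimal sketch, assuming a local SparkSession and a made-up two-step pipeline, of how the output of one computation can be cached and consumed directly by the next with no disk round-trip in between.

```scala
// A minimal sketch of two chained computations in Spark. Unlike a pair of
// MapReduce jobs, the intermediate result never round-trips through disk:
// it is cached in memory and fed straight into the second computation.
// The input data and the filter condition are illustrative assumptions.
import org.apache.spark.sql.SparkSession

object ChainedJobsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("chained-jobs-sketch")
      .master("local[*]") // assumption: run locally for illustration
      .getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.parallelize(Seq("spark is fast", "mapreduce writes to disk"))

    // "Job 1": tokenize and count words, then cache the result in memory
    // rather than dumping it to the distributed file system.
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .cache()

    // "Job 2": consumes job 1's output directly from memory.
    val singletons = counts.filter { case (_, n) => n == 1 }

    singletons.collect().foreach(println)
    spark.stop()
  }
}
```

The cache() call marks the key design difference: where chained MapReduce jobs materialize intermediate data on disk, Spark holds it in executor memory and hands it straight to the next stage of the pipeline.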