Spark's Main Java Abstraction: The DataFrame
Explore the core concepts of the Spark DataFrame as a main abstraction in the Spark Java API. Understand its role as a logical data container that simplifies cluster processing and supports scaling from single machines to large clusters. Discover how DataFrames organize data in rows and columns, and learn about the Dataset abstraction tailored for Java's type safety. This lesson helps you grasp how Spark optimizes execution and enables flexible, immutable data structures for big data applications.
What is a DataFrame?
A DataFrame is both a logical container of data and an API. It was purposely built as a higher-level abstraction over RDDs, Spark's older abstraction, which the Java API exposes as JavaRDDs.
In the Spark context, a “logical container” is a placeholder for data that Spark loads and distributes, while the worker nodes of the physical cluster do the actual processing.
The DataFrame provides a simple yet powerful API that simplifies distributed data processing: it hides the complexity of executing an application on a cluster, so developers do not have to write that difficult code themselves.
Just like RDDs, but going one step further, DataFrames harness the distributed processing power that a big data processing model needs to handle huge amounts of information.
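To make this concrete, here is a minimal sketch of creating a DataFrame with the Java API, assuming the spark-sql dependency is on the classpath; the Person bean, its fields, and the local[*] master are illustrative assumptions, not code from this lesson:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import java.util.Arrays;
import java.util.List;

public class DataFrameIntro {

    // Simple Java bean used to infer the DataFrame schema (hypothetical example type).
    public static class Person implements java.io.Serializable {
        private String name;
        private int age;

        public Person() { }
        public Person(String name, int age) { this.name = name; this.age = age; }
        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    public static void main(String[] args) {
        // A local SparkSession; the same code runs unchanged on a cluster
        // by pointing master() at a cluster manager instead of "local[*]".
        SparkSession spark = SparkSession.builder()
                .appName("DataFrameIntro")
                .master("local[*]")
                .getOrCreate();

        // In the Java API a DataFrame is represented as Dataset<Row>.
        // Here we build one from a small in-memory list of beans.
        List<Person> people = Arrays.asList(
                new Person("Alice", 34),
                new Person("Bob", 45));
        Dataset<Row> df = spark.createDataFrame(people, Person.class);

        // The DataFrame organizes the data into rows and named columns.
        df.printSchema();
        df.show();

        spark.stop();
    }
}
```

The same Dataset<Row> handle works whether the data is a handful of rows in local memory or petabytes spread across a cluster; only the session configuration changes.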
Some of its main features are:
- The ability to scale from a few kilobytes on a single local machine to petabytes on a cluster.
- Support for a wide range of sources and formats when reading data (illustrated in the sketch after the note below).
- Code execution optimization through the Spark SQL Catalyst Optimizer.
Note: The Catalyst Optimizer is too complex and lengthy a topic to cover in this course, but the following link provides more information: Databricks optimizer docs.
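As a rough illustration of both the data-source API and the optimizer at work, the sketch below reads a CSV file, applies a couple of transformations, and prints the plans Catalyst produces. The file path data/sales.csv and the customer and amount columns are hypothetical assumptions used only for this example:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class ReadAndExplain {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ReadAndExplain")
                .master("local[*]")
                .getOrCreate();

        // Read a CSV file into a DataFrame; the same read() API also supports
        // JSON, Parquet, ORC, JDBC and other sources and formats.
        Dataset<Row> sales = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("data/sales.csv");   // hypothetical input path

        // Transformations are declarative; Catalyst rewrites them into an
        // optimized physical plan before anything actually executes.
        Dataset<Row> bigOrders = sales
                .filter(col("amount").gt(1000))
                .select("customer", "amount");

        // explain(true) prints the parsed, analyzed, and optimized logical
        // plans plus the physical plan chosen by the optimizer.
        bigOrders.explain(true);

        spark.stop();
    }
}
```

Running it prints the chain of plans, which is a convenient way to observe how the Catalyst Optimizer rewrites your DataFrame operations before execution.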