Parquet: Intro

This lesson introduces the Parquet format used for storing data in columnar format.

We'll cover the following


Parquet literally means a patterned wooden surface. But, in Hadoop world, the term refers to a columnar storage format. Parquet is based on Google’s Dremel paper. It is the love-child of collaboration between Cloudera and Twitter engineers. What sets Parquet apart from other columnar formats is its ability to efficiently store nested data. Deeply nested fields are also stored in a truly columnar fashion and can be read independently of other fields. Let’s see an example. Consider the following record which represents the structure for a car record in JSON:

  make        : "",
  model       : "",
  year        : "",
  horsepower  : "",
  stats       : {
                   zeroToSixty: "",
                   torque: "",
                   topSpeed: ""

We used JSON for readability, but the record could be in any format. The record is an example of nested data. The nested stats field is another object with three fields of its own. When the car record is stored in columnar format, all the top-level fields (make, model, year, horsepower, and stats) can be considered rows. Here’s a chart on of how data from several car records, gets stored as columnar format :

Get hands-on with 1200+ tech skills courses.