Parquet: Intro
Explore the core concepts of Parquet in Hadoop, focusing on its columnar storage approach that enables efficient handling of nested data. Understand how Parquet differs from other formats by storing nested fields independently, allowing faster queries and better space use.
We'll cover the following...
Parquet
Parquet literally means a patterned wooden surface. But, in Hadoop world, the term refers to a columnar storage format. Parquet is based on Google’s Dremel paper. It is the love-child of collaboration between Cloudera and Twitter engineers. What sets Parquet apart from other columnar formats is its ability to efficiently store nested data. Deeply nested fields are also stored in a truly columnar fashion and can be read independently of other fields. Let’s see an example. Consider the following record which represents the structure for a car record in JSON:
{
make : "",
model : "",
year : "",
...