Data Schema: Avro and Protobuf
Explore the use of Avro and Protobuf schemas to maintain data quality by defining data structure and validation rules. Understand their roles in data serialization and deserialization within data engineering pipelines, including use cases in Apache Kafka and Google Cloud services. Learn to choose between these schemas based on schema evolution needs and performance requirements.
One effective technique for ensuring data quality is to implement a data schema. By defining the structure of data in a specific format, a data schema ensures consistency and accuracy when data is exchanged, stored, and used.
For example, in the context of data exchange between two applications, a schema defines the structure and constraints of the data being passed between systems, including the data format (XML, JSON, or CSV), field types (int, float, or string), and any rules such as the allowed range of a numeric value or the expected date format. We will learn about two common data schema technologies, Avro and Protobuf, and how to incorporate them into data engineering pipelines.
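Before turning to those formats, here is a minimal sketch, in plain Python, of the kind of checks a schema encodes. The field names, types, and rules below are hypothetical and purely for illustration.

```python
# A hypothetical schema for an order record: expected field types plus simple value rules.
order_schema = {
    "order_id": {"type": int},
    "amount": {"type": float, "min": 0.0, "max": 10_000.0},
    "currency": {"type": str, "allowed": {"USD", "EUR", "GBP"}},
}

def validate(record: dict, schema: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the record conforms."""
    errors = []
    for field, rules in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
            continue
        value = record[field]
        if not isinstance(value, rules["type"]):
            errors.append(f"{field}: expected {rules['type'].__name__}, got {type(value).__name__}")
            continue
        if "min" in rules and value < rules["min"]:
            errors.append(f"{field}: {value} is below the minimum {rules['min']}")
        if "max" in rules and value > rules["max"]:
            errors.append(f"{field}: {value} is above the maximum {rules['max']}")
        if "allowed" in rules and value not in rules["allowed"]:
            errors.append(f"{field}: {value!r} is not one of {sorted(rules['allowed'])}")
    return errors

print(validate({"order_id": 42, "amount": 99.5, "currency": "USD"}, order_schema))  # []
print(validate({"order_id": 42, "amount": -5.0, "currency": "JPY"}, order_schema))  # two violations
```

Schema systems such as Avro and Protobuf formalize exactly this kind of contract, so the checks do not have to be hand-written in every application.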
Apache Avro
Apache Avro is an open-source data serialization system that allows data to be exchanged and stored efficiently between different applications, independent of the programming languages they use.
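An Avro schema is itself written in JSON and describes a record's fields and their types. The sketch below shows what such a schema might look like, expressed as the Python dictionary that Avro libraries accept; the record and field names are hypothetical.

```python
# A hypothetical Avro schema for a user record; in practice this is usually
# stored as a separate .avsc file and shared by producers and consumers.
user_schema = {
    "type": "record",
    "name": "User",
    "namespace": "example.pipeline",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        # A union with "null" marks the field as optional.
        {"name": "email", "type": ["null", "string"], "default": None},
        {"name": "signup_ts", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    ],
}
```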
Serialization is the process of converting an object into a format that can be easily stored, transmitted, and reconstructed later. It encodes the object's state and structure into a binary or textual format that other systems or programming languages can read. Deserialization is the process of reconstructing the object from the binary format. In this process, the Avro schema plays a critical role in defining the data ...
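As a sketch of that round trip, the example below uses the third-party fastavro library; the schema and record are hypothetical, and a real pipeline would typically load the schema from a shared .avsc file or a schema registry.

```python
import io
from datetime import datetime, timezone

from fastavro import parse_schema, reader, writer  # pip install fastavro

# Hypothetical schema; the same definition must be agreed on by writers and readers.
schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "signup_ts", "type": {"type": "long", "logicalType": "timestamp-millis"}},
    ],
})

records = [{"id": 1, "name": "Ada", "signup_ts": datetime(2024, 1, 15, tzinfo=timezone.utc)}]

# Serialization: the schema drives encoding into Avro's compact binary container format.
buffer = io.BytesIO()
writer(buffer, schema, records)

# Deserialization: the reader reconstructs the original records from the binary data.
buffer.seek(0)
for record in reader(buffer):
    print(record)
```

Because the Avro container format embeds the writer's schema, a reader can reconstruct records even when the schema is not distributed separately.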