Producer Serialization

This lesson explains how to use Apache Avro to serialize the messages that Kafka producers send.

Kafka comes with serialization classes for simple types such as strings, integers, and byte arrays. For complex types, however, we have to use a serialization library. We can use JSON, Apache Avro, Thrift, or Protobuf for serializing and deserializing Kafka messages.
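
For example, here is a minimal sketch of a producer that uses Kafka's built-in StringSerializer for both keys and values. The broker address and the topic name cars-topic are hypothetical placeholders:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class StringProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical broker address; replace with your cluster's address.
        props.put("bootstrap.servers", "localhost:9092");
        // Kafka's built-in serializer for simple string keys and values.
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // try-with-resources closes the producer and flushes pending sends.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("cars-topic", "car-1", "Tesla Model S"));
        }
    }
}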

Using Avro with Kafka

For this course, we’ll use Avro for serialization and deserialization and discuss it in the context of Kafka. Apache Avro is a language-neutral serialization framework, which makes it a good choice for Kafka. The Avro project was created by Doug Cutting, of Hadoop fame, and was later incubated at Apache.

Avro provides robust support for schema evolution. A schema can be thought of as a blueprint of the structure of each record in an .avro file. For instance, we can define a schema representing a car as follows:

{
  "namespace": "datajek.io.avro",
  "type": "record",
  "name": "Car",
  "fields": [
    {
      "name": "make",
      "type": "string"
    },
    {
      "name": "model",
      "type": [
        "string",
        "null"
      ]
    },
  ]
}
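
The getter methods mentioned below come from Avro's code generation: a schema can be compiled into a Java class with generated accessors such as getMake() and getModel(). As a sketch, assuming the schema above is saved as car.avsc and the Avro tools jar is at hand (the version number here is an assumption):

java -jar avro-tools-1.11.3.jar compile schema car.avsc .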

Avro is especially suited to Kafka because producers can switch to a new schema while consumers continue to read records conforming to either the old or the new schema. For instance, if we decide to add a new field ‘horsepower’ to the car record and remove the ‘model’ field, an application still using the old schema will receive a null value from the getter method for the ‘model’ field. If the reader application upgrades to the new schema but encounters a record written with the previous schema, the getter method for the ‘horsepower’ field will return null. There are, however, certain rules (mostly beyond the scope of this text) that govern schema resolution; one of them is why the optional ‘model’ field above declares a null default, which lets readers resolve records where the field is absent.
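
To make this concrete, the evolved schema might look like the following sketch. Giving ‘horsepower’ a null default is our assumption, but it is what allows records written without the field to resolve against the new schema:

{
  "namespace": "datajek.io.avro",
  "type": "record",
  "name": "Car",
  "fields": [
    {
      "name": "make",
      "type": "string"
    },
    {
      "name": "horsepower",
      "type": [
        "null",
        "int"
      ],
      "default": null
    }
  ]
}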

The beauty of Avro is that a reader doesn’t need to know a record’s schema before reading a .avro file: the schema comes embedded in the file. If the schema has evolved, the reader still needs access to the newer schema, even though the application expects the previous one. Since a Kafka topic can contain messages conforming to different Avro schemas, every message would have to carry its own schema. That is impractical for Kafka, because embedding the schema in every message bloats the message size. The Kafka ecosystem addresses this issue with the Schema Registry, where the actual schemas are stored; each Kafka message carries only an identifier pointing to its schema in the registry. This complexity is abstracted away from the user, with the serializer and deserializer responsible for pushing the schema to, and pulling it from, the registry.
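
Putting it all together, here is a minimal sketch of a producer configured for Avro and the Schema Registry. The broker address, registry URL, and topic name are assumptions, and the KafkaAvroSerializer class ships with Confluent's client libraries rather than Apache Kafka itself:

import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerExample {
    // The Car schema from this lesson, inlined as a string for brevity.
    private static final String CAR_SCHEMA = "{"
            + "\"namespace\": \"datajek.io.avro\","
            + "\"type\": \"record\","
            + "\"name\": \"Car\","
            + "\"fields\": ["
            + "{\"name\": \"make\", \"type\": \"string\"},"
            + "{\"name\": \"model\", \"type\": [\"null\", \"string\"], \"default\": null}"
            + "]}";

    public static void main(String[] args) {
        Properties props = new Properties();
        // Hypothetical addresses; replace with your broker and registry.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("schema.registry.url", "http://localhost:8081");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent's Avro serializer talks to the Schema Registry for us.
        props.put("value.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");

        // Build a record that conforms to the Car schema.
        Schema schema = new Schema.Parser().parse(CAR_SCHEMA);
        GenericRecord car = new GenericData.Record(schema);
        car.put("make", "Tesla");
        car.put("model", "Model S");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("cars-topic", "car-1", car));
        }
    }
}

On the first send, the serializer registers the Car schema with the registry (if it isn't already there) and prepends the returned schema identifier to each serialized message, so consumers can fetch the matching schema on their side.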
