Building Scalable Data Pipelines with Kafka/

...

Messaging Patterns

This lesson explains the two asynchronous messaging patterns and the use cases of Kafka.

We'll cover the following...

Publish subscribe (Pub-Sub)
Message queuing
Use cases of Kafka

In today’s world, most applications are data driven. An application can consume data from several sources and produce data to consumers downstream. Keeping track of all the sources and destinations of data for a single application requires a very tightly coupled design. Ideally, though, we would like to decouple consumers and producers of data from each other. This is where Kafka and other messaging systems come in. These systems are intermediaries that move data records from one application to another and decouple them from each other. The producer of data doesn’t know who the consumer of data is or even when the data is consumed. This allows developers to focus on the core logic of their applications and relieves them of directly connecting producers to consumers. This design that allows for decoupling producers and consumers is called asynchronous messaging and consists of two patterns:

Publish Subscribe (Pub-Sub)
Message Queuing

Publish subscribe (Pub-Sub)

In the publish-subscribe model, a participant in the system produces data and publishes the data to a channel or topic. The message can be consumed by multiple readers and the messages are always delivered to the consumers in the order that they were published in.

Use cases of Kafka

Originally, Kafka was developed at LinkedIn to provide a high performance messaging system to track user activity (page views, click tracking, modifications to profile, etc.) and system metrics in real-time.
Messaging: Kakfa can be used in scenarios where applications need to send out notifications. For instance, various applications can write messages to Kafka and a single application can then read the messages and take appropriate action (e.g. format the message a certain way, filter a message, batching messages in a single notification).
Metrics and logging: Kafka is a great tool for building metrics and logging data pipelines. Applications can publish metrics to Kafka topics which can then be consumed by monitoring and alerting systems. The pipelines can also be used for offline analysis using Hadoop. Similarly, logs can be published to Kafka topics which can then be routed to log search systems such as Elasticsearch or security analysis applications.
Commit log: Kafka is based on the concept of a commit log which opens up the possibility of using it for database changes. The stream of changes can be used to replicate database updates on a remote system.
Stream processing: The term “stream processing” generally refers to Hadoop’s map/reduce style of processing when applied to data in real-time. Kafka can be used by streaming frameworks to allow applications to operate on Kafka messages to perform actions such as counting metrics, partitioning messages for processing by other applications, combining messages, or applying transformations on them.

Basics

Kafka Producer

Kafka Consumer

Kafka Internals

Conclusion

Appendix

Reference: Replication

Reference: Partitioning

Reference: Transactions

Reference: Issues in Distributed Systems

Messaging Patterns

Publish subscribe (Pub-Sub)

Message queuing

Use cases of Kafka