Introduction

This lesson introduces the reader to Kafka and gives an overview of the technology.

Kafka, created by Jay Kreps, Neha Narkhede, and Jun Rao, emerged on the Big Data scene when it was open-sourced in early 2011. It has since enjoyed increasing popularity and use, especially in the enterprise space; any Big Data department will almost certainly have a team using and maintaining Kafka, or a wrapper service over Kafka, for the rest of the company to use. Kafka is written in Java and Scala and has its roots in LinkedIn, where it was originally developed. Its original use case was tracking users' actions on the LinkedIn website. These actions served as inputs to an array of backend applications, such as machine learning systems, search optimizers, and report generators, all of which play an important role in enriching the user experience.

The software was named after the famed short-story writer and novelist Franz Kafka, as it was intended to be a “system optimized for writing”.

Kafka is described by the official documentation as:

A distributed event streaming platform that lets you read, write, store, and process events (also called records or messages in the documentation) across many machines.

An event can be thought of as an independent piece of information that needs to be relayed from a producer to a consumer. These include events like Amazon payment transactions, iPhone geolocation updates, FedEx shipping orders, sensor measurements from IoT devices or medical equipment, and much more.
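
To make the relay from producer to consumer concrete, here is a minimal sketch of publishing one such event with Kafka's Java producer client. The topic name payment-events, the key, and the value are made-up placeholders, and a broker is assumed to be reachable at localhost:9092.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PaymentEventProducer {
    public static void main(String[] args) {
        // Minimal producer configuration: the broker address plus serializers
        // for the event's key and value.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // One event: an independent piece of information, keyed here by an order ID.
            ProducerRecord<String, String> event =
                    new ProducerRecord<>("payment-events", "order-42", "amount=19.99 USD");
            producer.send(event); // handed to the broker for any subscribed consumer to read
        }
    }
}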

Over the years, several tools for data storage and processing have been introduced to address various use cases. Each tool was originally optimized for the specific needs of the use cases it was built for. Over time, though, the lines differentiating these tools have blurred: a datastore such as Redis can now be used as a queue, and a queuing mechanism like Kafka offers database-like durability guarantees, allowing it to function as a datastore. Kafka in particular is used primarily for building data pipelines and implementing streaming solutions.
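
As a sketch of what those durability guarantees can look like in practice, the snippet below uses Kafka's Java AdminClient to create a topic whose records are never expired. The topic name audit-log and the single-partition, single-replica settings are assumptions for illustration; setting retention.ms to -1 disables time-based deletion, so the topic behaves like a durable, append-only store.

import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class DurableTopicSetup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // retention.ms = -1 tells Kafka never to expire this topic's records.
            NewTopic auditLog = new NewTopic("audit-log", 1, (short) 1)
                    .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG, "-1"));
            admin.createTopics(Collections.singleton(auditLog)).all().get();
        }
    }
}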

In this course, we’ll delve into the fundamentals of Kafka and work through practical examples using the browser terminal. Follow the instructions in the widget below to see an example of exchanging messages between a Kafka producer and a consumer.

# Change directory to Kafka's installation directory
cd /Kafka
# Run Zookeeper from the bin folder, redirecting its output to /dev/null so that
# messages from Zookeeper don't get mixed up with messages from Kafka on the console
bin/zookeeper-server-start.sh config/zookeeper.properties > /dev/null 2>&1 &
# You can run the following command to verify that Zookeeper is indeed running
ps -aef
# Run the Kafka service. You should see output from the service on the console.
# Press the Enter key when the output stops to return to the prompt.
bin/kafka-server-start.sh config/server.properties &
# At this point you have a basic Kafka environment setup and running.
# Create a topic 'datajek-topic' to publish to.
bin/kafka-topics.sh --create --topic datajek-topic --bootstrap-server localhost:9092
# Write some messages to the 'datajek-topic' topic
bin/kafka-console-producer.sh --topic datajek-topic --bootstrap-server localhost:9092
# Press Ctrl+C to stop the producer
# Read the messages written to the topic by running the consumer
bin/kafka-console-consumer.sh --topic datajek-topic --from-beginning --bootstrap-server localhost:9092
# You should see all the messages you typed earlier, displayed on the console. Press
# Ctrl+C anytime to stop the consumer.
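
The console scripts used above are thin wrappers around Kafka's client libraries; application code typically reads the same messages programmatically. Below is a minimal sketch, assuming the datajek-topic topic created above, a broker at localhost:9092, and a made-up consumer group name datajek-readers.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DatajekTopicConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "datajek-readers");
        props.put("auto.offset.reset", "earliest"); // start from the oldest message
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("datajek-topic"));
            // Poll a few times and print whatever has been written to the topic.
            for (int i = 0; i < 5; i++) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.value());
                }
            }
        }
    }
}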