Beginner Kafka Tutorial: Get Started with Distributed Systems
Explore the fundamentals of Apache Kafka and distributed systems. Understand Kafka's architecture and key features, including topics, partitions, brokers, producers, and consumers. Learn how Kafka supports scalable, fault-tolerant, real-time data streaming and event-driven applications. This lesson prepares you for advanced Kafka topics and hands-on data pipeline development.
Distributed systems are collections of computers that work together to appear as a single system to end users. They allow us to scale horizontally, handling billions of requests and rolling out upgrades without downtime. Apache Kafka has become one of the most widely used distributed systems on the market today.
According to the official Kafka site, Apache Kafka is an “open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.” Kafka is used by many Fortune 100 companies, including big tech names like LinkedIn, Netflix, and Microsoft.
In this Apache Kafka tutorial, we’ll discuss the uses, key features, and architectural components of the distributed streaming platform. Let’s get started!
What is Kafka?
Apache Kafka is an open-source software platform written primarily in Java and Scala. Kafka started in 2011 as a messaging system for LinkedIn but has since grown to become a popular distributed event streaming platform. The platform is capable of handling trillions of records per day.
Kafka is a distributed system consisting of servers and clients that communicate over a high-performance TCP network protocol. The system allows us to read, write, store, and process events. We can think of an event as an independent piece of information that needs to be relayed from a producer to a consumer. Some relevant examples include Amazon payment transactions, iPhone location updates, FedEx shipping orders, and much more. Kafka is primarily used for building data pipelines and implementing streaming solutions.
Kafka allows us to build apps that can constantly and accurately consume and process multiple streams at very high speeds. It works with streaming data from thousands of different data sources. With Kafka, we can:
Process records as they occur
Store records accurately and consistently
Publish or subscribe to data or event streams
The Kafka publish-subscribe messaging system is extremely popular in the Big Data scene and integrates well with Apache Spark and other stream processing frameworks such as Apache Flink.
Kafka use cases
You can use Kafka in many different ways, but here are some examples of different use cases shared on the official Kafka site:
- Processing financial transactions in real-time
- Tracking and monitoring transportation vehicles in real-time
- Capturing and analyzing sensor data
- Collecting and reacting to customer interactions
- Monitoring hospital patients
- Providing a foundation for data platforms, event-driven architectures, and microservices
- Performing large-scale messaging
- Serving as a commit-log for distributed systems
- And much more
Key features of Kafka
Let’s take a look at some of the key features that make Kafka so popular:
Scalability: Kafka scales out across event producers, consumers, connectors, and processors; we can add brokers and partitions as workloads grow.
Fault tolerance: Kafka is fault-tolerant. Because partitions are replicated across brokers, the cluster keeps serving data even when individual servers fail.
Consistency: Kafka can scale across many different servers and still maintain the ordering of your data within each partition.
High performance: Kafka has high throughput and low latency. It remains stable even when handling very large volumes of data.
Extensibility: Many different applications have integrations with Kafka.
Replication capabilities: Kafka replicates each topic partition across multiple brokers, so events remain available even if a broker is lost.
Availability: Kafka can stretch clusters over availability zones or connect different clusters across different regions. Depending on the Kafka version, cluster coordination is handled either by ZooKeeper or Kafka’s internal KRaft consensus mechanism.
Connectivity: The Kafka Connect interface allows you to integrate with many different event sources such as JMS and AWS S3.
Community: Kafka is one of the most active projects in the Apache Software Foundation. The community holds events like the Kafka Summit by Confluent.
Components of Kafka architecture
Before we dive into some of the components of the Kafka architecture, let’s take a look at some of the key concepts that will help us understand it:
Kafka topics
Topics help us organize our messages. We can think of them as channels that our data goes through. Kafka producers can publish messages to topics, and Kafka consumers can read messages from topics that they are subscribed to.
Kafka partitions
Kafka topics are divided into partitions. These partitions are replicated across different brokers. Because a topic is split into partitions, multiple consumers can read from it in parallel, each consuming from different partitions.
Topic replication factor
The topic replication factor determines how many copies of each partition exist, ensuring that data remains accessible and that deployment runs smoothly and efficiently. If a broker goes down, replicas of its partitions on other brokers ensure we can still access our data.
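To make topics, partitions, and replication concrete, here's a minimal sketch that creates a topic using Kafka's Java AdminClient. The topic name `payments`, the partition and replica counts, and the broker address are illustrative placeholders (a replication factor of 2 assumes a cluster with at least two brokers):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumes a broker is reachable at localhost:9092.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // A hypothetical "payments" topic with 3 partitions,
            // each replicated to 2 brokers.
            NewTopic topic = new NewTopic("payments", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```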
Kafka brokers
A single Kafka server is called a broker. Typically, multiple brokers operate as one Kafka cluster. The cluster is controlled by one of the brokers, called the controller. The controller is responsible for administrative actions like assigning partitions to other brokers and monitoring for failures and downtime.
Partitions can be assigned to multiple brokers. When this happens, the partition is replicated, creating redundancy in case one of the brokers fails. A broker is responsible for receiving messages from producers and committing them to disk. Brokers also receive requests from consumers and respond with messages taken from partitions.
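As a sketch of how the cluster looks from a client's perspective, the Java AdminClient can list the brokers and report which one is currently the controller (again assuming a broker at localhost:9092):

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.DescribeClusterResult;
import org.apache.kafka.common.Node;

import java.util.Properties;

public class DescribeClusterExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            DescribeClusterResult cluster = admin.describeCluster();
            // The controller broker handles administrative duties.
            System.out.println("Controller: " + cluster.controller().get());
            for (Node broker : cluster.nodes().get()) {
                System.out.println("Broker " + broker.id()
                        + " at " + broker.host() + ":" + broker.port());
            }
        }
    }
}
```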
Kafka consumer groups
Consumer groups consist of a set of related consumers that work together to process records. Kafka assigns partitions of a topic to consumers in a group so that each partition is consumed by at most one consumer within the group at a time.
Kafka consumers
Consumers receive messages from Kafka topics. They subscribe to a topic, then receive the messages that producers write to it. Normally, each consumer belongs to a consumer group, in which multiple consumers work together to read messages from a topic (a minimal consumer sketch follows the scenarios below).
Let’s take a look at some of the different configurations for consumers and partitions in a topic:
Number of consumers and partitions in a topic are equal
In this scenario, each consumer reads from one partition.
Number of partitions in a topic is greater than the number of consumers in a group
In this scenario, some or all of the consumers read from more than one partition.
Single consumer with multiple partitions
In this scenario, all partitions are consumed by a single consumer.
Number of partitions in a topic is less than the number of consumers in a group
In this scenario, some of the consumers will be idle.
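Here's a minimal consumer sketch illustrating the ideas above, assuming the hypothetical `payments` topic from the earlier example, a broker at localhost:9092, and a placeholder group name. Running several copies of this program with the same group.id would spread the topic's partitions across them:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // All consumers sharing this group.id split the topic's partitions.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "payment-processors");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("payments"));
            while (true) {
                // Poll for new records; each record carries its partition and offset.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```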
Kafka producers
Producers write messages to Kafka topics. They are responsible for choosing which topic and partition each message is sent to, either explicitly or by letting Kafka's default partitioner pick a partition based on the message key.
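Here's a minimal producer sketch using the same assumed broker and hypothetical `payments` topic as above. Records that share a key are routed to the same partition by Kafka's default partitioner, which preserves per-key ordering:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class ProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("customer-42", a placeholder) determines the partition,
            // so all events for this customer land on the same partition, in order.
            producer.send(new ProducerRecord<>("payments", "customer-42", "charged $10"));
            producer.flush();
        }
    }
}
```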
With Kafka’s core architectural components in place, applications interact with the system through a set of well-defined APIs.
Kafka APIs
Kafka has four essential APIs within its architecture. Let’s take a look at them!
Kafka producer API: The Producer API allows apps to publish streams of records to Kafka topics.
Kafka consumer API: The Consumer API allows apps to subscribe to Kafka topics. This API also allows the app to process streams of records.
Kafka connector API: The Connector API connects Kafka topics to external apps and data systems. It helps us build and run reusable connectors that move data in and out of Kafka, so we don't have to write custom producer or consumer code for each integration.
Kafka streams API: The Streams API allows apps to process data using stream processing. This API enables apps to take in input streams from different topics and process them with a stream processor. Then, the app can produce output streams and send them out to different topics.
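As an illustration of the Streams API, here's a minimal sketch that reads records from one topic, transforms each value, and writes the results to another. The topic names and application id are placeholders, and a broker at localhost:9092 is assumed:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class StreamsExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        // Read an input stream, transform each value, and write an output stream.
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> input = builder.stream("input-topic");
        input.mapValues(value -> value.toUpperCase()).to("output-topic");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        // Shut down cleanly when the JVM exits.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```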
Advanced concepts to explore next
Congrats on taking your first steps with Apache Kafka! Kafka is an efficient and powerful distributed system. Kafka’s scaling capabilities allow it to handle large workloads. It’s often the preferred choice over other message queues for real-time data pipelines. Overall, it’s a versatile platform that can support many use cases. You’re now ready to move on to some more advanced Kafka topics such as:
Producer serialization
Consumer configurations
Partition allocation
To get started learning these topics and a lot more, check out Educative’s curated course Building Scalable Data Pipelines with Kafka. In this course, we’ll introduce you to Kafka theory and provide you with a hands-on, interactive browser terminal to execute Kafka commands against a running Kafka broker. You’ll learn more about the concepts we covered in this article, along with other important topics.
By the end, you’ll have a stronger understanding of how to build scalable data pipelines with Apache Kafka.
Happy learning!