Streaming platforms are a widely used way to ingest real-time data. Compared to change data capture (CDC), where the data source is mostly a database, a streaming platform is a more universal solution for receiving real-time events, such as data from IoT sensors and from retail, web, and mobile applications. Data sources generate this data continuously, in small batches. Both cloud vendors and open-source communities offer a number of options for ingesting streaming data. In this lesson, we will look at two representative streaming platforms: one from the open-source community and one from a cloud vendor.

Apache Kafka

Apache Kafka is a popular distributed streaming platform for building real-time data pipelines. It's well known for its high throughput, high scalability, high availability, fault tolerance, and low latency. Kafka has a variety of use cases, including real-time fraud detection, online activity tracking, and operational metrics collection.

Kafka consists of a storage layer and a compute layer. Through Kafka Connect connectors, it integrates with a large number of external systems, both as sources and as sinks (for example, AWS S3, the BigQuery sink, and the GitHub source). Inside Kafka, the data is stored in topics. Each topic can be split into several partitions, which are distributed across the cluster so that they can be processed in parallel. Outside the cluster, there are producers and consumers. Producers act as the interface between a data source and a topic: they write events to the topic. Consumers read and process the data in the topics.
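To make the relationship between topics, partitions, producers, and consumers more concrete, here is a minimal in-memory sketch in Python. It is a toy model, not Kafka's API: the class names Topic, Producer, and Consumer, the send and poll methods, and the CRC32-based partition choice are all assumptions made for illustration. Real Kafka clients talk to brokers over the network and use their own partitioner.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Toy stand-in for a topic: a fixed number of append-only partition logs."""
    name: str
    num_partitions: int
    partitions: list = field(init=False)

    def __post_init__(self):
        # One independent, ordered log per partition.
        self.partitions = [[] for _ in range(self.num_partitions)]


class Producer:
    """Hypothetical producer: routes each record to a partition by its key."""

    def __init__(self, topic):
        self.topic = topic

    def send(self, key, value):
        # Hashing the key means records with the same key always land in the
        # same partition, so their relative order is preserved.
        # CRC32 is an assumption here, chosen only because it is deterministic.
        partition = zlib.crc32(key.encode("utf-8")) % self.topic.num_partitions
        log = self.topic.partitions[partition]
        log.append((key, value))
        return partition, len(log) - 1  # (partition, offset of the new record)


class Consumer:
    """Hypothetical consumer: tracks its own read offset per assigned partition."""

    def __init__(self, topic, assigned_partitions):
        self.topic = topic
        self.offsets = {p: 0 for p in assigned_partitions}

    def poll(self):
        # Return every record appended since the last poll, then advance offsets.
        records = []
        for p in list(self.offsets):
            log = self.topic.partitions[p]
            records.extend(log[self.offsets[p]:])
            self.offsets[p] = len(log)
        return records


# Usage: two sensors write to a three-partition topic; one consumer per
# partition reads its share, which is what allows parallel processing.
topic = Topic("sensor-readings", num_partitions=3)
producer = Producer(topic)
for i in range(6):
    producer.send(key=f"sensor-{i % 2}", value={"reading": i})

for p in range(topic.num_partitions):
    consumer = Consumer(topic, [p])
    print(f"partition {p}:", consumer.poll())
```

The sketch keeps one design point from the description above: a partition is the unit of parallelism. Order is only guaranteed within a single partition, which is why the producer sends all records with the same key to the same partition.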
