Streaming Data Ingestion with Amazon Kinesis
Streaming data ingestion with Amazon Kinesis enables real-time processing of high-velocity data, essential for modern workloads. Key components include Kinesis Data Streams for low-latency ingestion, Data Firehose for near-real-time delivery to storage, and Amazon Managed Service for Apache Flink for complex stream processing. Effective management of shard capacity and consumer types is crucial to avoid throttling, while strategies like high-cardinality partition keys enhance performance. Lambda integration allows for serverless processing, and Firehose optimizes data delivery to S3 with features like dynamic partitioning and format conversion, which make the stored data efficient to query.
In batch-oriented pipelines, data arrives in discrete chunks on a schedule, but many modern workloads demand continuous processing where events are analyzed within seconds of generation. The AWS Certified Data Engineer – Associate exam (DEA-C01) tests your ability to architect streaming ingestion pipelines that handle high-velocity data, distribute it to multiple consumers, and deliver it efficiently to storage. This lesson focuses on the core Kinesis services that power such architectures and walks through a real-world clickstream use case that ties every component together.
Three AWS services form the backbone of Kinesis-based streaming. Amazon Kinesis Data Streams provides true real-time ingest with sub-second latency and event- and shard-level throughput control. Amazon Data Firehose is a fully managed, near-real-time delivery pipeline that buffers records and writes them to destinations such as Amazon S3 without requiring custom consumer code. Amazon Managed Service for Apache Flink enables complex stream processing such as windowed aggregations and anomaly detection on data flowing through Kinesis Data Streams.
The running use case throughout this lesson captures continuous clickstream data from web servers, routes it through Kinesis Data Streams, performs real-time anomaly detection with Flink, and buffers the enriched output into S3 via Firehose in optimized Parquet format.
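As a sketch of the ingestion step in this use case, the snippet below builds the parameters for a Kinesis `PutRecord` call, using the session ID as a high-cardinality partition key so writes spread evenly across shards. The stream name `clickstream-events` and the event fields are illustrative assumptions, not values from the lesson.

```python
import json


def build_put_record_params(event: dict, stream_name: str = "clickstream-events") -> dict:
    """Build PutRecord parameters for one clickstream event.

    Using the session ID as the partition key gives high cardinality,
    which distributes records across shards and avoids hot-shard throttling.
    """
    return {
        "StreamName": stream_name,
        "Data": json.dumps(event).encode("utf-8"),
        "PartitionKey": event["session_id"],  # high-cardinality partition key
    }


# With boto3, the producer call would look like (client setup omitted):
#   import boto3
#   kinesis = boto3.client("kinesis")
#   kinesis.put_record(**build_put_record_params({"session_id": "s-42", "page": "/home"}))
```

Keeping the parameter construction in a pure function makes the partitioning strategy easy to test without touching AWS.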
Attention: Data Streams combined with Flink delivers true real-time analytics with sub-second latency. Firehose, by contrast, buffers data for 60–900 seconds before delivery, making it near-real-time.
Understanding these latency boundaries sets the stage for the throughput mechanics that govern how data enters and exits a Kinesis stream.
Kinesis Data Streams throughput
The shard is the fundamental unit of capacity in Kinesis Data Streams. Every stream consists of one or more shards, and each shard enforces hard throughput limits that directly affect pipeline design.
Shard capacity and provisioning modes
Each shard supports 1 MB/sec or 1,000 records/sec for writes (whichever limit is reached first) and 2 MB/sec for reads, shared across all consumers in the default shared-throughput mode. These limits are hard caps, so engineers must right-size shard counts to match expected data volumes.
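These write limits translate directly into a sizing calculation. The helper below is an illustrative sketch, not an official AWS formula: it takes the larger of the two per-shard write constraints (1 MB/sec and 1,000 records/sec) to find the minimum shard count.

```python
import math


def shards_for_writes(mb_per_sec: float, records_per_sec: float) -> int:
    """Minimum shard count needed to absorb a write workload.

    Each shard accepts up to 1 MB/sec or 1,000 records/sec, whichever
    limit is reached first, so the requirement is the larger of the two.
    """
    by_bytes = math.ceil(mb_per_sec / 1.0)            # 1 MB/sec per shard
    by_records = math.ceil(records_per_sec / 1000.0)  # 1,000 records/sec per shard
    return max(1, by_bytes, by_records)


print(shards_for_writes(5.0, 2000))  # bandwidth dominates: 5 shards
```

Note that this covers writes only; read-heavy fan-out may require additional capacity or enhanced fan-out consumers.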
Kinesis offers two capacity modes for managing shards:
Provisioned mode requires you to calculate the number of shards manually. If your producers generate 5 MB/sec of clickstream data, you need at least five shards for writes alone.
On-demand mode lets AWS auto-scale the shard count based on observed throughput, charging per GB ingested rather than per shard-hour. This simplifies ...