Amazon MSK and Database Streaming
Amazon Managed Streaming for Apache Kafka (MSK) offers a fully managed Kafka experience, allowing data engineers to utilize Kafka's ecosystem without compromising on open-source compatibility. It supports two deployment models: Provisioned, for predictable workloads, and Serverless, for variable demands. Change Data Capture (CDC) is facilitated through DynamoDB Streams for NoSQL databases and AWS DMS for relational databases, enabling real-time data synchronization. AWS Glue Streaming ETL provides a means to transform streaming data before storage, optimizing output for analytics. Together, these services enhance AWS's streaming capabilities, crucial for effective data ingestion and processing.
While the previous lesson explored Kinesis Data Streams, enhanced fan-out, and Firehose delivery as the AWS-native streaming backbone, many production data platforms rely on Apache Kafka, an open-source distributed event streaming platform with a massive ecosystem of connectors and client libraries. AWS bridges this gap with Amazon Managed Streaming for Apache Kafka (MSK), giving data engineers a fully managed Kafka experience without sacrificing open-source compatibility.
This lesson covers three critical pillars tested on the DEA-C01 exam.
Amazon MSK for Kafka-based data ingestion.
Change Data Capture (CDC) through DynamoDB Streams and AWS DMS for tracking incremental database mutations.
AWS Glue Streaming ETL for transforming data in motion before it lands in your analytics layer.
Understanding when to reach for MSK vs. Kinesis, when DynamoDB Streams is appropriate vs. DMS, and how Glue Streaming ETL ties everything together will sharpen your exam-readiness and real-world pipeline design skills.
Amazon MSK fundamentals
Amazon MSK is a fully managed service that provisions, configures, and maintains Apache Kafka broker nodes and Apache ZooKeeper ensembles on your behalf. Because MSK runs the open-source Apache Kafka distribution itself, every standard Kafka producer, consumer, topic configuration, and client library works without modification. This is the key differentiator from Kinesis: MSK speaks the native Kafka protocol.
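Because the protocol is native Kafka, any standard client library works against an MSK cluster with no MSK-specific code. As a sketch (the broker endpoint below is a placeholder; fetch your cluster's real bootstrap string with `aws kafka get-bootstrap-brokers`), a kafka-python producer needs nothing beyond the usual settings plus TLS:

```python
import json

def make_producer_config(bootstrap_servers):
    """kafka-python style settings; TLS on port 9094 is the MSK default."""
    return {
        "bootstrap_servers": bootstrap_servers,
        "security_protocol": "SSL",
        "key_serializer": lambda k: k.encode("utf-8"),
        "value_serializer": lambda v: json.dumps(v).encode("utf-8"),
    }

# Placeholder endpoint -- substitute your cluster's bootstrap string.
cfg = make_producer_config(["b-1.mycluster.kafka.us-east-1.amazonaws.com:9094"])

# With kafka-python installed, the same dict drives an ordinary producer:
#   from kafka import KafkaProducer
#   producer = KafkaProducer(**cfg)
#   producer.send("orders", key="order-42", value={"total": 19.99})
payload = cfg["value_serializer"]({"total": 19.99})
```

The same code would run unchanged against a self-hosted Kafka cluster; only the bootstrap string differs.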
Provisioned vs. serverless deployment
MSK offers two deployment models, each with distinct cost and operational trade-offs.
MSK provisioned requires you to select broker instance types (such as kafka.m5.large), specify the number of brokers per Availability Zone, and allocate EBS storage volumes. You control capacity directly, which suits predictable, high-throughput workloads where cost optimization depends on right-sizing.

MSK serverless eliminates broker management entirely. The cluster auto-scales capacity based on throughput demand, and you pay for the data you ingest and retain. This model fits variable or unpredictable workloads where operational simplicity outweighs fine-grained cost control.
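To make the trade-off concrete, here is a toy cost model. Every price in it is a made-up placeholder, not an AWS list price; the point is the shape of the two curves, with provisioned cost fixed by the capacity you reserve and serverless cost tracking the data you actually move:

```python
# Illustrative only: all unit prices are hypothetical placeholders.
HOURS_PER_MONTH = 730

def provisioned_monthly_cost(brokers, broker_hourly, ebs_gb, ebs_gb_month_price):
    """Cost is driven by reserved capacity, regardless of traffic."""
    return brokers * broker_hourly * HOURS_PER_MONTH + ebs_gb * ebs_gb_month_price

def serverless_monthly_cost(gb_ingested, per_gb_in, gb_retained, per_gb_retained):
    """Cost is driven by throughput and retention, not reserved brokers."""
    return gb_ingested * per_gb_in + gb_retained * per_gb_retained

# A steady 3-broker cluster vs. a spiky workload ingesting 500 GB/month:
steady = provisioned_monthly_cost(brokers=3, broker_hourly=0.21,
                                  ebs_gb=3000, ebs_gb_month_price=0.10)
spiky = serverless_monthly_cost(gb_ingested=500, per_gb_in=0.10,
                                gb_retained=200, per_gb_retained=0.05)
```

Under these placeholder numbers the provisioned cluster costs the same whether it is saturated or idle, which is exactly why right-sizing matters there and why low or bursty volumes tend to favor serverless.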
Topic partitions serve as the unit of parallelism in Kafka, analogous to Kinesis shards but with important differences. Partitions are configured per topic; throughput per partition depends on the broker instance size and disk I/O rather than a fixed cap, and there is no hard 1 MB/sec write limit per partition, as Kinesis imposes per shard.
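The per-key ordering guarantee that partitions provide can be sketched in a few lines. Kafka's default partitioner hashes the record key with murmur2 modulo the partition count; the md5-based stand-in below is not Kafka's actual algorithm, but it illustrates the same property: records with the same key always land on the same partition, so their relative order is preserved.

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a record key to a partition, Kafka-style: deterministic,
    so all records for one key share a partition and stay ordered.
    (Kafka really uses murmur2; md5 is an illustrative stand-in.)"""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Two sends keyed by the same customer hit the same partition:
p1 = partition_for(b"customer-42", 6)
p2 = partition_for(b"customer-42", 6)
```

Note the flip side: changing the partition count remaps keys, which is why partition counts are usually chosen generously up front rather than resized later.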
Practical tip: MSK Connect lets you deploy Kafka Connect connectors (such as S3 sink or JDBC source connectors) as fully managed workers. This eliminates the need to self-host connector infrastructure and simplifies data movement between Kafka topics and external systems like S3 or relational databases.
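As a sketch, the configuration for a managed S3 sink might look like the following (the bucket, topic, and region are placeholders; the property names follow the widely used Confluent S3 sink connector, which is one common choice to deploy on MSK Connect):

```json
{
  "connector.class": "io.confluent.connect.s3.S3SinkConnector",
  "tasks.max": "2",
  "topics": "orders",
  "s3.region": "us-east-1",
  "s3.bucket.name": "my-analytics-landing-bucket",
  "flush.size": "1000",
  "storage.class": "io.confluent.connect.s3.storage.S3Storage",
  "format.class": "io.confluent.connect.s3.format.json.JsonFormat"
}
```

With a configuration like this, MSK Connect runs the workers for you; you tune only `tasks.max` for parallelism and `flush.size` for how many records accumulate before each S3 object is written.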