Big Data Processing
Explore how to design AWS big data architectures using a durable S3-centric data lake and the complementary execution models of Amazon EMR, AWS Glue, and AWS Batch. Learn to match workloads to services for scalable, secure, and cost-effective data processing across enterprise-level pipelines.
Modern enterprise data platforms process petabytes daily, yet many scenarios challenge architects to move beyond the assumption that “big data equals Hadoop clusters.” AWS provides three distinct execution models for large-scale data processing, each optimized for different workload characteristics, operational requirements, and cost profiles. The architectural foundation across all three is an
The following diagram illustrates how these three services operate within a secure VPC boundary. Each consumes data from S3 through VPC endpoints while serving distinct processing patterns.
Understanding when each path applies requires examining its architectural characteristics, starting with migrating existing Hadoop and Spark workloads to Amazon EMR.
Migrating Hadoop and Spark to Amazon EMR
Amazon EMR provides managed infrastructure for Apache Hadoop, Spark, Hive, Presto, and HBase while maintaining full API compatibility with existing big data frameworks. The migration strategy from on-premises clusters centers on decoupling storage from compute by replacing HDFS with EMRFS, which presents S3 as a Hadoop-compatible file system. Jobs are refactored to read and write directly to S3, typically by replacing hdfs:// paths with s3:// URIs, which removes the need for persistent cluster storage and lets clusters scale independently of data volume.
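Much of the job refactoring described above is a matter of rewriting input and output locations. The following is a minimal sketch of that path mapping, assuming a one-to-one layout between the old HDFS directory tree and an S3 prefix. The function name, the bucket example-data-lake, and the raw prefix are hypothetical and used only for illustration.

```python
from urllib.parse import urlparse


def hdfs_to_s3_uri(path: str, bucket: str, prefix: str = "") -> str:
    """Map an HDFS path to an s3:// URI that EMRFS can resolve.

    Assumes the S3 key layout mirrors the HDFS directory layout,
    optionally nested under `prefix`.
    """
    parsed = urlparse(path)
    if parsed.scheme not in ("hdfs", ""):
        raise ValueError(f"expected an hdfs:// or scheme-less path, got {path!r}")
    # Drop the NameNode host and port; only the path component carries over.
    parts = (prefix.strip("/"), parsed.path.strip("/"))
    key = "/".join(part for part in parts if part)
    return f"s3://{bucket}/{key}"


# Hypothetical on-premises location and target bucket.
old_path = "hdfs://namenode:8020/warehouse/events/dt=2024-01-01"
new_path = hdfs_to_s3_uri(old_path, bucket="example-data-lake", prefix="raw")
print(new_path)  # s3://example-data-lake/raw/warehouse/events/dt=2024-01-01
```

In a Spark job, the resulting URI would replace the HDFS location passed to the reader or writer. The rest of the job logic usually stays unchanged, although jobs that rely on HDFS-specific behavior such as atomic renames may need further review.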
Instance fleets and managed scaling
EMR offers two capacity models, and the trade-offs between them come up frequently in scenario questions:
Instance groups define a single instance type per group (primary, core, task), providing simplicity but limiting flexibility when Spot capacity for that instance type is unavailable.
Instance fleets allow multiple instance types and purchasing options within a single logical group, enabling EMR to select from a diversified pool of On-Demand and Spot capacity to maximize availability and minimize cost. Reserved Instances are not a separate fleet option; they are a billing discount that applies to matching On-Demand usage. ...
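To make the instance fleet model concrete, here is a minimal sketch of a task fleet definition in the dictionary shape that the boto3 EMR client accepts under Instances.InstanceFleets in run_job_flow. The helper name, fleet name, instance types, weights, and timeout are illustrative assumptions, not recommendations. The sketch only builds the configuration and makes no AWS calls.

```python
def build_task_fleet(weighted_types: dict, spot_units: int, on_demand_units: int = 0) -> dict:
    """Build a TASK instance fleet spec that diversifies across instance types.

    `weighted_types` maps an instance type to its weighted capacity, i.e. how
    many capacity units one instance contributes (here, its vCPU count).
    """
    if not weighted_types:
        raise ValueError("at least one instance type is required")
    return {
        "Name": "task-fleet",  # hypothetical name
        "InstanceFleetType": "TASK",
        "TargetOnDemandCapacity": on_demand_units,
        "TargetSpotCapacity": spot_units,
        # Several types in one fleet let EMR pick whichever pools have capacity.
        "InstanceTypeConfigs": [
            {"InstanceType": itype, "WeightedCapacity": weight}
            for itype, weight in weighted_types.items()
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                # Assumed policy: fall back to On-Demand if Spot is not
                # fulfilled within 10 minutes.
                "TimeoutDurationMinutes": 10,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
                "AllocationStrategy": "capacity-optimized",
            }
        },
    }


# Three 4-vCPU types; 32 Spot units corresponds to roughly eight instances.
task_fleet = build_task_fleet(
    {"m5.xlarge": 4, "m5a.xlarge": 4, "r5.xlarge": 4},
    spot_units=32,
)
```

Under an instance group, the same request would be pinned to one instance type, so a shortage in that single Spot pool could stall the task tier. Listing several types of similar size is what gives the fleet room to find capacity.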