
Data Operations and Support III

Explore practical approaches to troubleshoot data skew in AWS Glue Spark jobs, optimize joining large datasets in Redshift with S3, and implement sampling strategies for large-scale clickstream data. Understand how to prevent duplicate processing in Lambda functions triggered by S3 events and evaluate when to use serverless AWS Glue versus Amazon EMR for ETL workloads.

Question 50

A company processes IoT sensor data using an AWS Glue Spark ETL job. The data arrives in Amazon S3 with a highly skewed distribution: 80% of the records belong to just 5 of the 10,000 sensor IDs. The Glue job performs a groupBy aggregation on sensor_id, and it frequently fails or runs extremely slowly due to executor out-of-memory errors. The data engineer must implement a mechanism to handle the data skew.

Which solution should the data engineer implement?

A. Increase the number of DPUs allocated to the Glue job to provide more aggregate memory and compute resources across all executors.

B. Repartition the DataFrame by a randomly generated column before performing the groupBy aggregation to distribute records evenly across partitions.

C. Implement a salting technique by appending a random salt value to the skewed sensor_id values, performing the groupBy aggregation on the salted key, and then performing a second aggregation to combine the salted results back to the original sensor_id.

D. Convert the Spark DataFrames to AWS Glue DynamicFrames, which automatically handle data skew through their built-in partitioning strategy.
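The salting approach in option C can be illustrated with a minimal pure-Python sketch of the two-stage aggregation. The names (`SALT_BUCKETS`, `salted_two_stage_sum`) and the bucket count are illustrative assumptions; in a real Glue job the same logic would be expressed with PySpark, e.g. appending a `rand()`-derived salt column and running two `groupBy` passes:

```python
import random
from collections import defaultdict

SALT_BUCKETS = 8  # number of salt values; an illustrative choice

def salted_two_stage_sum(records):
    """records: iterable of (sensor_id, value) pairs.

    Stage 1: aggregate on the salted key (sensor_id, salt), so a single
    hot sensor_id is split across up to SALT_BUCKETS partial groups
    (in Spark, these land on different executors instead of one).
    Stage 2: combine the partial results back to the original sensor_id.
    """
    stage1 = defaultdict(int)
    for sensor_id, value in records:
        salt = random.randrange(SALT_BUCKETS)  # random salt appended to the key
        stage1[(sensor_id, salt)] += value

    stage2 = defaultdict(int)
    for (sensor_id, _salt), partial_sum in stage1.items():
        stage2[sensor_id] += partial_sum
    return dict(stage2)

# A skewed input: one hot key dominates, mimicking the 5-of-10,000 scenario.
data = [("sensor-A", 1)] * 1000 + [("sensor-B", 2)] * 10
print(salted_two_stage_sum(data))  # totals: sensor-A -> 1000, sensor-B -> 20
```

Because the second aggregation only recombines partial sums, the final totals are identical to a direct groupBy; the salt exists solely to spread the hot key's records across partitions during the expensive first stage.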

...