
Data Operations and Support III

Explore practical approaches to troubleshoot data skew in AWS Glue Spark jobs, optimize joining large datasets in Redshift with S3, and implement sampling strategies for large-scale clickstream data. Understand how to prevent duplicate processing in Lambda functions triggered by S3 events and evaluate when to use serverless AWS Glue versus Amazon EMR for ETL workloads.

Question 50

A company processes IoT sensor data using an AWS Glue Spark ETL job. The data arrives in Amazon S3 with a highly skewed distribution: 80% of the records belong to just 5 of the 10,000 sensor IDs. The Glue job performs a groupBy aggregation on sensor_id, and it frequently fails or runs extremely slowly due to executor out-of-memory errors. The data engineer must implement a mechanism to handle the data skew.

Which solution should the data engineer implement?

A. Increase the number of DPUs allocated to the Glue job to provide more aggregate memory and compute resources across all executors.

B. Repartition the DataFrame by a randomly generated column before performing the groupBy aggregation to distribute records evenly across partitions.

C. Implement a salting technique by appending a random salt value to the skewed sensor_id values, performing the groupBy aggregation on the salted key, and then performing a second aggregation to combine the salted results back to the original sensor_id.

D. Convert the Spark DataFrames to AWS Glue DynamicFrames, which automatically handle data skew through their built-in partitioning strategy.
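The salting approach in option C can be illustrated with a minimal pure-Python sketch of the two-stage aggregation. The names (`SALT_BUCKETS`, `salted_two_stage_sum`) and the bucket count are illustrative assumptions; in a real Glue job the same logic would be expressed with PySpark, e.g. appending a `rand()`-derived salt column and running two `groupBy` passes:

```python
import random
from collections import defaultdict

SALT_BUCKETS = 8  # number of salt values; an illustrative choice

def salted_two_stage_sum(records):
    """records: iterable of (sensor_id, value) pairs.

    Stage 1: aggregate on the salted key (sensor_id, salt), so a single
    hot sensor_id is split across up to SALT_BUCKETS partial groups
    (in Spark, these land on different executors instead of one).
    Stage 2: combine the partial results back to the original sensor_id.
    """
    stage1 = defaultdict(int)
    for sensor_id, value in records:
        salt = random.randrange(SALT_BUCKETS)  # random salt appended to the key
        stage1[(sensor_id, salt)] += value

    stage2 = defaultdict(int)
    for (sensor_id, _salt), partial_sum in stage1.items():
        stage2[sensor_id] += partial_sum
    return dict(stage2)

# A skewed input: one hot key dominates, mimicking the 5-of-10,000 scenario.
data = [("sensor-A", 1)] * 1000 + [("sensor-B", 2)] * 10
print(salted_two_stage_sum(data))  # totals: sensor-A -> 1000, sensor-B -> 20
```

Because the second aggregation only recombines partial sums, the final totals are identical to a direct groupBy; the salt exists solely to spread the hot key's records across partitions during the expensive first stage.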

...