Data Preparation for Machine Learning (ML)
Explore how to prepare and process datasets efficiently for AWS machine learning projects. Learn about optimal storage formats such as Parquet, real-time ingestion methods, data anonymization for compliance, feature engineering, and techniques for handling imbalanced data. This lesson walks through practical solutions for improving ML data pipelines, from ingestion to training.
Question 1
A company has a 500 GB dataset of customer transaction records stored as CSV files in Amazon S3. Data scientists query this data frequently using Amazon Athena for exploratory analysis, but queries are slow and costly because of full-table scans. The team needs to optimize the storage format to reduce both query execution time and the amount of data scanned per query.
Which approach should the team implement to achieve the most significant improvement in Athena query performance and cost efficiency?
A. Use an AWS Glue ETL job to convert the CSV files to JSON format with GZIP compression and store the output in Amazon S3.
B. Use an AWS Glue ETL job to convert the CSV files to Apache Parquet format with Snappy compression and store the output in Amazon S3.
C. Keep the data in CSV format, but enable Amazon S3 Transfer Acceleration on the bucket to speed up data access for Athena.
D. Use an AWS Glue ETL job to convert the CSV files to Apache Avro format with no compression and store the output in Amazon S3.
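For reference, the conversion described in option B can be done with a short AWS Glue PySpark job. The following is a minimal sketch, assuming the CSV files include a header row; the bucket names and paths are placeholders.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw CSV files from S3 (paths are placeholders).
source = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://example-bucket/raw/transactions/"]},
    format="csv",
    format_options={"withHeader": True},
)

# Write the data back to S3 as Parquet. Glue writes Parquet with
# Snappy compression by default, which Athena can scan column by column.
glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://example-bucket/curated/transactions/"},
    format="parquet",
)

job.commit()
```

Partitioning the Parquet output by a frequently filtered column (for example, transaction date) further reduces the amount of data Athena scans per query.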
Question 2
An ML engineering team needs to ingest real-time clickstream data from a web application and store it in Amazon S3 for nightly batch model training. The data arrives at approximately 50,000 records per second. The team wants a solution that requires minimal operational overhead, supports automatic batching of records into larger files, and optionally converts data to a columnar format before landing in S3.
Which solution meets these requirements?
A. Use Amazon Data Firehose configured to deliver data to Amazon S3 with buffering hints and Apache Parquet format conversion enabled.
B. Use Amazon Kinesis Data Streams with a custom AWS Lambda consumer that writes accumulated records to Amazon S3 in batches.
C. Deploy an Apache Kafka cluster on Amazon EC2 instances with a custom S3 sink connector to write data to Amazon S3.
D. Use an AWS Glue streaming ETL job to consume the clickstream data and write it to Amazon S3.
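As a sketch of the configuration in option A, the delivery stream could be created as follows with boto3. All names and ARNs are placeholders, and record format conversion assumes a matching table already exists in the AWS Glue Data Catalog; Parquet conversion also requires a buffer size of at least 64 MB.

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-s3",  # placeholder name
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::example-training-data",
        "Prefix": "clickstream/",
        # Buffer incoming records into larger objects before delivery to S3.
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
        # Convert incoming JSON records to Parquet before they land in S3.
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            # The schema comes from a (hypothetical) Glue Data Catalog table.
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
                "DatabaseName": "clickstream_db",
                "TableName": "clickstream_events",
                "Region": "us-east-1",
            },
        },
    },
)
```

Firehose manages scaling, batching, and delivery retries itself, so no consumer code or cluster has to be maintained.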
Question 3
A healthcare company is preparing a training dataset that contains protected health information (PHI) stored in Amazon S3. The company must ensure compliance with HIPAA regulations while still keeping the data usable for ML model training. The team needs to implement technical safeguards that address data discovery, anonymization, and protection of data at rest.
Which strategies should the team implement? (Select two.)
A. Store the training data in a public S3 bucket with server-side encryption enabled using SSE-S3.
B. Use Amazon Macie to identify and classify PHI fields in the S3 bucket, then apply data masking and anonymization techniques to those fields before using the data for training.
C. Use Amazon Comprehend Medical to extract PHI entities from the dataset and then delete the original data files from S3.
D. Encrypt data at rest using AWS KMS with customer-managed keys (SSE-KMS) and restrict access through S3 bucket policies and IAM roles scoped to the ML team.
E. Apply ...
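For illustration, the SSE-KMS setup described in option D could be configured as in the following sketch. The bucket name and KMS key ARN are placeholders; in practice this would be paired with a bucket policy and IAM roles scoped to the ML team.

```python
import boto3

s3 = boto3.client("s3")

# Make SSE-KMS with a customer-managed key the default for the bucket,
# so every new object is encrypted at rest. (Names and ARNs are placeholders.)
s3.put_bucket_encryption(
    Bucket="example-phi-training-data",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID",
                },
                "BucketKeyEnabled": True,  # reduces per-request KMS costs
            }
        ]
    },
)

# Block all public access to complement the scoped bucket policy and IAM roles.
s3.put_public_access_block(
    Bucket="example-phi-training-data",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)
```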