Free AWS Certified Machine Learning Engineer Associate Exam
Explore a range of practical exam questions designed to test your knowledge of AWS machine learning engineering concepts. This lesson covers data ingestion strategies, preprocessing techniques, model training, deployment and scaling, monitoring, and cost optimization on AWS. You'll gain hands-on insights to help you prepare effectively for the AWS Certified Machine Learning Engineer Associate exam.
Question 1
A company collects IoT sensor data from 10,000 devices, with each device sending a JSON payload every second. The data must be stored in Amazon S3 for ML training, and the team uses Amazon Athena for downstream queries. The team is concerned about the small files problem: millions of tiny files in S3 that degrade Athena query performance because of excessive file listing and opening overhead.
Which solution addresses the small files problem with minimal operational overhead?
A. Use an AWS Lambda function triggered by each device event to write each JSON payload as an individual object in Amazon S3.
B. Use Amazon S3 event notifications to trigger a compaction Lambda function that periodically merges small files into larger ones.
C. Store the sensor data in Amazon DynamoDB instead of Amazon S3 to avoid the small files problem entirely.
D. Use Amazon Data Firehose with buffering configured (for example, a 128 MB buffer size or a 300-second buffer interval) and enable Apache Parquet format conversion.
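To make the buffering and format-conversion settings in option D concrete, here is a minimal sketch of the parameters that would be passed to boto3's `firehose.create_delivery_stream(**params)` call. The stream name, bucket, IAM role, and Glue database/table names are placeholders, not values from the question; note that when Parquet conversion is enabled, Firehose requires a buffer size of at least 64 MB.

```python
# Sketch of option D's Firehose settings: buffer incoming JSON records until
# either 128 MB or 300 seconds is reached, then convert to Parquet and write
# one large object to S3 instead of millions of tiny files.
params = {
    "DeliveryStreamName": "iot-sensor-stream",  # placeholder name
    "DeliveryStreamType": "DirectPut",
    "ExtendedS3DestinationConfiguration": {
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",  # placeholder
        "BucketARN": "arn:aws:s3:::ml-training-data",               # placeholder
        # Records are flushed when EITHER threshold is hit.
        "BufferingHints": {"SizeInMBs": 128, "IntervalInSeconds": 300},
        # Convert the JSON payloads to Apache Parquet, using an AWS Glue
        # table to supply the target schema.
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            "SchemaConfiguration": {
                "DatabaseName": "iot_db",      # placeholder Glue database
                "TableName": "sensor_events",  # placeholder Glue table
                "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",
            },
        },
    },
}
```

Because Firehose handles batching, compression, and format conversion as a managed service, no compaction jobs or custom Lambda code need to be maintained, which is what "minimal operational overhead" points to here.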
Question 2
An ML team is preparing a dataset for a sentiment analysis model. The dataset contains raw customer review text that needs to be preprocessed before feature extraction and model training. The team needs to apply appropriate text preprocessing techniques to normalize and structure the raw text.
Which text preprocessing techniques should the team apply? (Select TWO.)
A. Tokenization: splitting raw text into individual tokens (words or subwords) as a fundamental preprocessing step.
B. One-hot encode the entire review text to convert each review into a binary feature vector.
C. Apply min-max scaling to the text data to normalize values between zero and one.
D. Remove stop words and apply lowercasing to normalize the text before vectorization.
E. Bin the review text into predefined numeric categories based on text length.
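The techniques in options A and D can be illustrated with a small pure-Python sketch. Real pipelines would typically use a library such as NLTK or spaCy; the stop-word set below is a tiny illustrative subset, not a complete list.

```python
import re

# Tiny illustrative stop-word subset; production code would load a full
# list from a library such as NLTK or spaCy.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "it", "this"}

def preprocess(review: str) -> list[str]:
    """Lowercase the text, tokenize on word characters, drop stop words."""
    tokens = re.findall(r"[a-z0-9']+", review.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The battery life is AMAZING and it charges fast!"))
# -> ['battery', 'life', 'amazing', 'charges', 'fast']
```

The output is a normalized token list ready for vectorization (for example, TF-IDF or word embeddings). Options B, C, and E apply numeric-feature techniques to raw text, which is why they do not fit this scenario.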
Question 3
A financial services company needs to validate the quality of a newly ingested dataset before using it for model retraining. The company wants automated data quality checks that flag issues such as missing values exceeding 5%, duplicate rows, and schema drift compared to the expected schema. The validation must be integrated into the existing AWS Glue ETL pipeline.
Which solution should the company implement?
A. Use Amazon SageMaker Model Monitor to detect data quality issues in the ingested dataset before training.
B. Use Amazon Macie to scan the dataset for data quality anomalies and schema violations.
C. Write custom Python validation scripts in an AWS Lambda function triggered by S3 upload events.
D. Use AWS Glue Data Quality rules defined in Data Quality Definition Language (DQDL) to specify completeness, uniqueness, and schema conformity checks, and integrate them into the Glue ETL pipeline.
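As a sketch of what the DQDL ruleset in option D might look like: the column names below are hypothetical, and the rules map to the three requirements in the question (completeness above 95%, no duplicate rows via a unique key, and schema conformity).

```
Rules = [
    Completeness "customer_id" >= 0.95,
    IsUnique "transaction_id",
    ColumnExists "amount",
    ColumnDataType "amount" = "Double"
]
```

Such a ruleset can be evaluated directly from an AWS Glue ETL job, and failing rules can stop the pipeline before a degraded dataset reaches model retraining.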
Question 4
An ML engineer is configuring a SageMaker training job for a deep learning model that requires multiple epochs over an 800 GB dataset stored in Amazon S3. The engineer needs to choose a data input strategy that minimizes training startup time and efficiently supports repeated passes over the full dataset.
Which configuration should the engineer use?
A. Use Pipe mode to stream data directly from Amazon S3 to the training algorithm for maximum throughput.
B. Download the full 800 GB dataset to an Amazon EBS volume attached to the training instance before training begins.
C. Enable Amazon S3 Transfer Acceleration on the bucket to speed up data reads during training.
D. Use File mode with Amazon FSx for Lustre linked to the S3 bucket, enabling ...
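Option D pairs File mode with an FSx for Lustre file system linked to the S3 bucket, so repeated epochs read from a high-throughput POSIX mount rather than re-downloading from S3. A sketch of the training input channel, assuming boto3's `create_training_job(InputDataConfig=[...])` API, with the file system ID and directory path as placeholders:

```python
# Sketch of an InputDataConfig channel using FSx for Lustre in File mode.
# The file system ID and mount path are placeholders; this dict would be
# one element of the InputDataConfig list in a create_training_job call.
train_channel = {
    "ChannelName": "train",
    "InputMode": "File",
    "DataSource": {
        "FileSystemDataSource": {
            "FileSystemId": "fs-0123456789abcdef0",  # placeholder FSx ID
            "FileSystemType": "FSxLustre",
            "FileSystemAccessMode": "ro",            # read-only for training
            "DirectoryPath": "/fsx/training-data",   # placeholder mount path
        }
    },
}
```

Because FSx for Lustre lazy-loads data from the linked S3 bucket, the training job can start without first copying 800 GB, and every subsequent epoch benefits from file-system-speed reads.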