Free AWS Certified Data Engineer Associate Practice Exam

This practice exam for the AWS Certified Data Engineer Associate includes a series of questions that test knowledge of AWS services and data engineering concepts. Key topics include real-time data ingestion with Amazon Kinesis, data enrichment with DynamoDB, handling API throttling, configuring network access for on-premises data sources, and ensuring data pipeline replayability. The exam also covers data cataloging with AWS Glue, S3 lifecycle management, and lock management in Amazon Redshift. Each question presents a scenario and asks for the AWS solution that best meets the stated operational needs.

Question 1

A media company receives clickstream data from millions of users in real time and needs to ingest this data into an Amazon S3 data lake for downstream analytics. The data volume is highly unpredictable, with massive spikes during live events. The company wants a managed streaming ingestion solution that automatically scales with traffic and delivers data to S3 in near real time.

Which solution meets these requirements with the least operational overhead?

A. Use Amazon Kinesis Data Streams in provisioned mode with a Lambda consumer that writes records to Amazon S3.

B. Use Amazon Data Firehose configured with Amazon S3 as the destination.

C. Deploy an Amazon MSK cluster and configure a Kafka Connect S3 sink connector to write data to Amazon S3.

D. Use AWS Glue Streaming ETL to read from a Kinesis data stream and write output to Amazon S3.
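
For reference, option B relies on Amazon Data Firehose's built-in scaling and buffering. As a rough illustration (the stream, role, and bucket names here are hypothetical), a delivery stream with an S3 destination can be created with a few lines of boto3:

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="clickstream-to-s3",  # hypothetical stream name
    DeliveryStreamType="DirectPut",          # producers write directly to Firehose
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-s3-role",  # hypothetical role
        "BucketARN": "arn:aws:s3:::clickstream-data-lake",             # hypothetical bucket
        "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 64},  # near-real-time delivery
    },
)
```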

Question 2

A company uses Amazon Kinesis Data Streams to ingest IoT sensor data. The company needs to enrich each record with device metadata stored in Amazon DynamoDB before writing the enriched data to Amazon S3. The enrichment logic is lightweight and completes in under 30 seconds per batch.

Which approach should the data engineer use to perform this enrichment?

A. Configure Amazon Data Firehose with a data transformation Lambda function to enrich records before delivery to S3.

B. Deploy an Amazon EC2 instance running a Kinesis Client Library (KCL) application that reads from the stream, queries DynamoDB, and writes to S3.

C. Use AWS Glue Streaming ETL to read from the Kinesis data stream, perform a DynamoDB lookup for enrichment, and write results to S3.

D. Configure a Lambda event source mapping on the Kinesis data stream, where the Lambda function reads device metadata from DynamoDB and writes enriched records to S3.
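
For reference, option D wires a Lambda function to the stream with an event source mapping. A minimal sketch of such a handler, assuming a hypothetical DeviceMetadata table keyed on device_id and a hypothetical output bucket:

```python
import base64
import json

import boto3

dynamodb = boto3.resource("dynamodb")
s3 = boto3.client("s3")
metadata_table = dynamodb.Table("DeviceMetadata")  # hypothetical table name


def handler(event, context):
    """Triggered by a Kinesis event source mapping; enriches each record."""
    enriched = []
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Look up device metadata by partition key (hypothetical schema).
        item = metadata_table.get_item(Key={"device_id": payload["device_id"]})
        payload["metadata"] = item.get("Item", {})
        enriched.append(payload)

    # Write the enriched batch to S3, keyed by the batch's first sequence number.
    s3.put_object(
        Bucket="enriched-sensor-data",  # hypothetical bucket
        Key=f"batch-{event['Records'][0]['kinesis']['sequenceNumber']}.json",
        Body=json.dumps(enriched),
    )
```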

Question 3

A data engineering team needs to ingest data from a third-party REST API that enforces a rate limit of 100 requests per minute. The team’s current ingestion script frequently exceeds this limit, resulting in HTTP 429 errors and data loss. The data engineer must implement a solution to handle throttling gracefully.

Which solution addresses the throttling issue while preventing data loss?

A. Increase the AWS Lambda function’s reserved concurrency to process more requests in parallel.

B. Place Amazon API Gateway in front of the third-party API to manage rate limiting.

C. Implement exponential backoff with jitter in the Lambda function and use an Amazon SQS queue as a buffer to control the rate of API calls.

D. Store failed request payloads in Amazon DynamoDB and retry them manually during off-peak hours.
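
For reference, option C combines client-side retry logic with an SQS buffer. A minimal sketch of exponential backoff with full jitter (the URL and retry limit are illustrative; in the full design, an SQS queue would meter how quickly these calls are made):

```python
import random
import time

import requests  # assumed HTTP client

MAX_RETRIES = 5


def call_api_with_backoff(url):
    """Retry HTTP 429 responses with exponential backoff and full jitter."""
    for attempt in range(MAX_RETRIES):
        response = requests.get(url)
        if response.status_code != 429:
            response.raise_for_status()
            return response.json()
        # Full jitter: sleep a random amount up to 2^attempt seconds.
        time.sleep(random.uniform(0, 2 ** attempt))
    raise RuntimeError(f"Rate limited after {MAX_RETRIES} attempts: {url}")
```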

Question 4

A company has an on-premises data source that must send data to AWS services. The data source is behind a corporate firewall. The data engineer must configure network access so that the on-premises system can connect to an Amazon Kinesis Data Streams endpoint.

Which approach should the data engineer recommend?

A. Allowlist the AWS IP address ranges for the Kinesis service in the relevant AWS Region on the corporate firewall, or establish a VPN or Direct Connect connection combined with a VPC interface endpoint for Kinesis.

B. Create a VPC interface endpoint for Kinesis Data Streams and configure the on-premises application to use the endpoint DNS name without any additional network connectivity.

C. Open all outbound traffic on the corporate firewall to allow connections to any AWS endpoint.

D. Configure AWS PrivateLink for Kinesis Data Streams and route on-premises traffic through the public internet to the PrivateLink endpoint.
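
For reference, the private-connectivity half of option A can be set up by creating a VPC interface endpoint for Kinesis Data Streams, as sketched below (all resource IDs are hypothetical). On-premises traffic still needs a VPN or Direct Connect path to reach the endpoint privately:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",                          # hypothetical VPC
    ServiceName="com.amazonaws.us-east-1.kinesis-streams",  # Kinesis Data Streams service
    SubnetIds=["subnet-0123456789abcdef0"],                 # hypothetical subnet
    SecurityGroupIds=["sg-0123456789abcdef0"],              # hypothetical security group
    PrivateDnsEnabled=True,  # resolve the default Kinesis DNS name to this endpoint
)
```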

Question 5

A data engineer is building a data pipeline that reads events from an Amazon Kinesis data stream. If the pipeline fails, the team must be able to reprocess all events from the last 72 hours. The data engineer must ensure the pipeline supports replayability.

Which configuration ensures the pipeline can replay events from the last 72 hours?

A. Use Amazon SQS as the ingestion layer instead of Kinesis Data Streams, since SQS retains messages for up to 14 days.

B. Increase the Kinesis Data Streams retention period to at least 72 hours and use the AT_TIMESTAMP or TRIM_HORIZON iterator to reprocess records from the desired point in time.

C. Enable S3 versioning on the output bucket to recover previously processed data.

D. Keep the default 24-hour Kinesis Data Streams retention period and use the LATEST iterator type to resume processing from the most recent record.
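
For reference, option B maps to two API calls: one to extend retention and one to obtain a shard iterator positioned in the past. A minimal boto3 sketch for a hypothetical stream, reading one shard:

```python
from datetime import datetime, timedelta, timezone

import boto3

kinesis = boto3.client("kinesis")
STREAM = "iot-events"  # hypothetical stream name

# Extend retention from the 24-hour default to 72 hours.
kinesis.increase_stream_retention_period(StreamName=STREAM, RetentionPeriodHours=72)

# Replay: get an iterator positioned 72 hours in the past for one shard.
shard_id = kinesis.list_shards(StreamName=STREAM)["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=STREAM,
    ShardId=shard_id,
    ShardIteratorType="AT_TIMESTAMP",
    Timestamp=datetime.now(timezone.utc) - timedelta(hours=72),
)["ShardIterator"]

records = kinesis.get_records(ShardIterator=iterator)["Records"]
```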

Question 6

A healthcare company receives terabytes of data files daily from external partners via SFTP. Partners use standard SFTP clients and cannot modify their existing file transfer workflows. The company wants to migrate these file transfers to AWS and store incoming files directly in Amazon S3 using a fully managed solution.

Which AWS service should the data engineer use?

A. AWS Transfer Family configured with the SFTP protocol and an Amazon S3 backend

B. An Amazon EC2 instance running an open-source SFTP server with scripts to upload files to Amazon S3

C. AWS DataSync with an SFTP agent installed at each partner location

D. Amazon FSx for Lustre linked to an Amazon S3 bucket
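
For reference, option A can be provisioned with a single create_server call. A minimal boto3 sketch (the endpoint and identity settings shown are illustrative defaults):

```python
import boto3

transfer = boto3.client("transfer")

# Create a fully managed SFTP endpoint backed by Amazon S3.
server = transfer.create_server(
    Protocols=["SFTP"],
    Domain="S3",                             # store incoming files directly in S3
    IdentityProviderType="SERVICE_MANAGED",  # users managed by Transfer Family
    EndpointType="PUBLIC",
)
print(server["ServerId"])
```

Partners would then be onboarded as Transfer Family users whose home directories map to prefixes in the target S3 bucket, with no change to their existing SFTP clients.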

Question 7

A company uses Amazon Redshift as its data warehouse. During a critical monthly reporting window, a long-running analytical query acquires locks on key tables, preventing other users from executing UPDATE statements. These blocked queries eventually time out, disrupting downstream processes.

Which two strategies should the data engineer implement to manage locking and prevent prolonged access conflicts? (Select TWO.)

A. Use SET lock_timeout to configure a maximum wait time so that blocked queries fail fast rather than waiting indefinitely.

B. Review blocking sessions using STV_LOCKS and terminate them with PG_TERMINATE_BACKEND when necessary.

C. Enable Amazon Redshift concurrency scaling to handle the additional query load during the reporting window.

D. Increase the number of WLM queue slots to allow more queries to run concurrently.

E. Migrate the Redshift cluster to RA3 node types to improve lock handling performance.
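
For reference, options A and B translate directly into SQL. A minimal sketch using the redshift_connector Python driver (the connection details and the terminated PID are hypothetical):

```python
import redshift_connector  # AWS's Python driver for Redshift (assumed available)

conn = redshift_connector.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # hypothetical
    database="dev",
    user="admin",
    password="example-password",  # hypothetical credential
)
cur = conn.cursor()

# Option A: fail fast instead of waiting indefinitely on a lock (value in milliseconds).
cur.execute("SET lock_timeout TO 10000;")

# Option B: inspect current locks to find the blocking session.
cur.execute("SELECT table_id, lock_owner_pid, lock_status FROM stv_locks;")
for table_id, pid, status in cur.fetchall():
    print(table_id, pid, status)

# Terminate a specific blocking session when necessary (hypothetical PID).
cur.execute("SELECT pg_terminate_backend(12345);")
```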

Question 8

A data engineering team is building a centralized data catalog for their S3-based data lake. The lake contains data in CSV, JSON, and Apache Parquet formats distributed across hundreds of S3 prefixes. The team wants to automatically discover schemas and populate the catalog without writing custom code or manually defining table structures.

Which approach should the data engineer use?

A. Use AWS Glue crawlers to automatically discover schemas and populate the AWS Glue Data Catalog.

B. Manually create table definitions in the AWS Glue Data Catalog for each S3 prefix.

C. Use Amazon Athena CREATE TABLE statements to define each table manually.

D. Use AWS Lake Formation blueprints to discover and catalog the schemas.
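
For reference, option A requires only a crawler definition pointed at the lake's S3 prefixes. A minimal boto3 sketch with hypothetical names:

```python
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="data-lake-crawler",                                   # hypothetical crawler name
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",    # hypothetical IAM role
    DatabaseName="data_lake",                                   # hypothetical catalog database
    Targets={"S3Targets": [{"Path": "s3://company-data-lake/"}]},
    Schedule="cron(0 2 * * ? *)",  # optional: recrawl nightly at 02:00 UTC
)
glue.start_crawler(Name="data-lake-crawler")
```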

Question 9

A company stores application logs in Amazon S3. Logs less than 30 days old are accessed frequently for troubleshooting. Logs between 30 and 90 days old are accessed rarely but must remain retrievable. Logs older than 90 days are never accessed but must be retained for compliance for a total of one year. After one year, logs must be permanently deleted.

Which S3 Lifecycle configuration is the most cost-effective?

A. Transition objects to S3 Standard-IA after 30 days, transition to S3 Glacier Flexible Retrieval after 90 days, and expire (delete) objects after 365 days.

B. Enable S3 Intelligent-Tiering on the bucket and configure expiration at 365 days.

C. Transition objects to S3 Glacier Deep Archive after 30 days and expire objects after 365 days.

D. Keep all objects in S3 Standard for the entire year and expire objects after 365 days.
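
For reference, option A expressed as a lifecycle configuration, in a minimal boto3 sketch (the bucket name is hypothetical; GLACIER is the storage-class identifier for S3 Glacier Flexible Retrieval):

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="application-logs",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "log-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},  # permanently delete after one year
            }
        ]
    },
)
```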

Question 10

A company is migrating an on-premises Oracle database to Amazon Aurora PostgreSQL. The Oracle database contains hundreds of stored procedures, views, triggers, and complex schema objects. Before using AWS DMS for data migration, the data engineer needs to assess the migration complexity and convert the Oracle schema, including stored procedures, to PostgreSQL-compatible DDL.

Which tool should the data engineer use?

A. AWS Schema Conversion Tool (AWS SCT)

B. AWS Database Migration Service (AWS DMS) alone

C. Manually rewrite all stored procedures and schema ...