
Data Warehousing and Advanced Access

Amazon Redshift is a petabyte-scale columnar data warehouse optimized for online analytical processing (OLAP), enabling efficient aggregation queries by storing data in columns. Key considerations include choosing between serverless and provisioned deployment models based on workload patterns, optimizing schema design with distribution and sort keys, and utilizing compression techniques. Redshift also facilitates querying external data through Redshift Spectrum, federated queries, and materialized views, allowing for efficient data access without extensive ETL processes. Proper schema optimization and ongoing maintenance are crucial for maintaining performance over time.

In the previous lesson, you designed DynamoDB tables around application access patterns, optimizing for low-latency point lookups and high-volume transactional writes. Analytical workloads use a different design model. When a query aggregates revenue across billions of rows but reads only three columns, row-oriented storage can waste I/O by scanning data the query does not need. Amazon Redshift is designed for this type of workload: a petabyte-scale, columnar data warehouse on AWS built for online analytical processing, or OLAP.

Redshift stores data in columns rather than rows, so an aggregation over sale_amount across two billion records reads only that single column’s compressed blocks, skipping everything else. This lesson covers three pillars that the AWS Certified Data Engineer – Associate exam tests heavily:

  • Choosing between Redshift’s two deployment models.

  • Optimizing warehouse schemas through distribution keys, sort keys, and compression.

  • Querying external data in place using federated queries, Redshift Spectrum, and materialized views.
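The columnar-storage advantage described above can be sketched in plain Python. This is a toy model, not Redshift internals: it pivots a row-oriented table into per-column lists and counts how many values an aggregation over `sale_amount` must touch in each layout.

```python
# Toy model of row- vs column-oriented storage; illustrative only,
# not a representation of Redshift's actual block format.
rows = [
    {"sale_id": i, "region": "us-east-1", "sale_amount": 10.0, "notes": "x" * 50}
    for i in range(1_000)
]

# Row-oriented: SUM(sale_amount) scans every field of every row.
row_values_touched = sum(len(r) for r in rows)

# Column-oriented: the same table pivoted into one list per column.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# SUM(sale_amount) now reads only the sale_amount column.
col_values_touched = len(columns["sale_amount"])
total = sum(columns["sale_amount"])

print(f"row layout touched {row_values_touched} values; "
      f"columnar layout touched {col_values_touched}")
```

With four columns, the row layout touches four times as many values for the same aggregate; in a real warehouse with wide tables and compressed column blocks, the I/O gap is far larger.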

Redshift deployment models

Redshift offers two deployment models that share the same SQL engine and RA3 managed storage layer but differ in how compute is provisioned. Serverless compute is billed in Redshift Processing Units (RPUs), which scale automatically based on workload demand; provisioned clusters use fixed node types that you size yourself. The following table breaks down the critical differences the exam expects you to know.

| Dimension | Serverless | Provisioned |
| --- | --- | --- |
| Compute Unit | Redshift Processing Units (RPUs); 1 RPU = 16 GB memory | Node types (RA3, DC2) with vCPUs, RAM, and slices per node (e.g., RA3.xlplus: 4 vCPUs, 32 GiB, 2 slices) |
| Scaling Behavior | Automatic up/down based on workload; define base and max RPU capacity | Manual elastic/classic resize; concurrency scaling for peak demand |
| Billing Model | Per-RPU-hour (per-second, 60-second minimum); no charge when idle | Hourly per node (on-demand); reserved instances available for 1–3 year commitments |
| Ideal Workload Pattern | Intermittent, unpredictable workloads with spikes and idle periods | Steady, high-concurrency, continuously heavy workloads |
| Management Overhead | Minimal; AWS handles provisioning, patching, maintenance | Higher; requires node selection, WLM tuning, resize management |
| WLM Configuration | Automated via workgroups and namespaces | Manual queue configuration with concurrency, user groups, and query labeling options |

For an intermittent analytics workload with unpredictable spikes, serverless is the correct answer. When you have a 24/7 business intelligence (BI) dashboard layer with hundreds of concurrent users, provisioned with reserved instances wins on cost predictability. With deployment selection covered, the next step is designing schemas that make the chosen compute layer as efficient as possible.
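The cost intuition behind that decision can be sketched with the two billing models from the table. The prices below are illustrative placeholders, not current AWS rates, and the workload figures are assumptions chosen to show the pattern:

```python
# Sketch of serverless vs. provisioned billing from the table above.
# RPU_HOUR_PRICE and NODE_HOUR_PRICE are hypothetical placeholders,
# NOT current AWS pricing.
RPU_HOUR_PRICE = 0.375   # hypothetical $/RPU-hour (serverless)
NODE_HOUR_PRICE = 1.086  # hypothetical $/node-hour (provisioned, on-demand)

def serverless_cost(rpus: int, busy_seconds: float) -> float:
    """Per-second billing with a 60-second minimum; idle time is free."""
    billed_seconds = max(busy_seconds, 60)
    return rpus * (billed_seconds / 3600) * RPU_HOUR_PRICE

def provisioned_cost(nodes: int, hours: float) -> float:
    """Nodes bill for every hour they run, whether busy or idle."""
    return nodes * hours * NODE_HOUR_PRICE

# Intermittent workload: 8 RPUs busy 2 hours/day, idle the other 22.
monthly_serverless = 30 * serverless_cost(8, 2 * 3600)
# A 2-node provisioned cluster runs 24/7 regardless of load.
monthly_provisioned = provisioned_cost(2, 30 * 24)

print(f"serverless:  ${monthly_serverless:,.2f}/month")
print(f"provisioned: ${monthly_provisioned:,.2f}/month")
```

Under these assumptions the spiky workload is far cheaper on serverless, because 22 idle hours a day cost nothing; flip the workload to continuous and the provisioned cluster (especially with reserved instances) comes out ahead.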

Distribution and sort key design

Redshift distributes table data across compute slices, and the distribution style you choose determines whether joins require ...