Data Warehousing and Advanced Access
Amazon Redshift is a petabyte-scale columnar data warehouse optimized for online analytical processing (OLAP); storing data in columns makes aggregation queries efficient. Key considerations include choosing between the serverless and provisioned deployment models based on workload patterns, optimizing schema design with distribution and sort keys, and applying compression. Redshift can also query external data through Redshift Spectrum, federated queries, and materialized views, enabling efficient data access without extensive ETL. Proper schema design and ongoing maintenance are crucial for sustaining performance over time.
In the previous lesson, you designed DynamoDB tables around application access patterns, optimizing for low-latency point lookups and high-volume transactional writes. Analytical workloads use a different design model. When a query aggregates revenue across billions of rows but reads only three columns, row-oriented storage can waste I/O by scanning data the query does not need. Amazon Redshift is designed for this type of workload: a petabyte-scale, columnar data warehouse on AWS built for online analytical processing, or OLAP.
Redshift stores data in columns rather than rows, so an aggregation over sale_amount across two billion records reads only that single column’s compressed blocks, skipping everything else. This lesson covers three pillars that the AWS Certified Data Engineer – Associate exam tests heavily:
Choosing between Redshift’s two deployment models.
Optimizing warehouse schemas through distribution keys, sort keys, and compression.
Querying external data in place using federated queries, Redshift Spectrum, and materialized views.
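The columnar I/O savings described above can be sketched with a toy model in plain Python. This is not how Redshift works internally (real column stores also compress each column's blocks); it only counts the values a scan must touch under each layout:

```python
# Toy model: row-oriented vs column-oriented I/O for SUM(sale_amount).
# Hypothetical numbers chosen for illustration, not Redshift internals.

ROWS = 1_000_000
COLUMNS = ["sale_id", "sale_amount", "region"]  # 3 columns per record

# Row-oriented storage: every record is read in full, even though the
# query only needs sale_amount.
row_store_values_read = ROWS * len(COLUMNS)

# Column-oriented storage: only the sale_amount column is scanned.
column_store_values_read = ROWS

print(f"row store reads    : {row_store_values_read:,} values")
print(f"column store reads : {column_store_values_read:,} values")
print(f"I/O reduction      : {row_store_values_read // column_store_values_read}x")
```

With only 3 columns the reduction is 3x; on a wide fact table with dozens of columns (and per-column compression on top), the gap is far larger.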
Redshift deployment models
Redshift offers two deployment models that share the same SQL engine and RA3 managed storage layer but differ in how compute is provisioned.
| Dimension | Serverless | Provisioned |
| --- | --- | --- |
| Compute unit | Redshift Processing Units (RPUs); 1 RPU = 16 GB memory | Node types (RA3, DC2) with vCPUs, RAM, and slices per node (e.g., RA3.xlplus: 4 vCPUs, 32 GiB, 2 slices) |
| Scaling behavior | Automatic up/down based on workload; define base and max RPU capacity | Manual elastic/classic resize; concurrency scaling for peak demand |
| Billing model | Per-RPU-hour (per-second, 60s minimum); no charge when idle | Hourly per node (on-demand); reserved instances available for 1–3 year commitments |
| Ideal workload pattern | Intermittent, unpredictable workloads with spikes and idle periods | Steady, high-concurrency, continuously heavy workloads |
| Management overhead | Minimal; AWS handles provisioning, patching, maintenance | Higher; requires node selection, WLM tuning, resize management |
| WLM configuration | Automated via workgroups and namespaces | Manual queue configuration with concurrency, user groups, and query labeling options |
For an intermittent analytics workload with unpredictable spikes, serverless is the correct answer. When you have a 24/7 business intelligence (BI) dashboard layer with hundreds of concurrent users, provisioned with reserved instances wins on cost predictability. With deployment selection covered, the next step is designing schemas that make the chosen compute layer as efficient as possible.
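The billing difference can be made concrete with back-of-the-envelope arithmetic. The prices below are hypothetical placeholders, not current AWS pricing (check the pricing page for your region); the point is the structure of the calculation, since serverless bills only while queries run while provisioned nodes bill around the clock:

```python
# Rough monthly cost comparison: serverless vs provisioned Redshift.
# PRICES ARE ASSUMED PLACEHOLDERS -- verify against current AWS pricing.

RPU_HOUR_PRICE = 0.375    # assumed $/RPU-hour for serverless
NODE_HOUR_PRICE = 1.086   # assumed $/node-hour, e.g., an RA3 node on-demand

def serverless_monthly_cost(rpus: int, busy_hours_per_day: float) -> float:
    """Serverless bills per RPU-hour only while work runs; idle time is free."""
    return rpus * busy_hours_per_day * 30 * RPU_HOUR_PRICE

def provisioned_monthly_cost(nodes: int) -> float:
    """Provisioned clusters bill per node-hour 24/7, whether busy or idle."""
    return nodes * 24 * 30 * NODE_HOUR_PRICE

# Intermittent workload: 8 RPUs busy ~2 hours/day.
print(f"serverless  (8 RPUs, 2 h/day): ${serverless_monthly_cost(8, 2):,.2f}")
print(f"provisioned (2 nodes, 24/7):   ${provisioned_monthly_cost(2):,.2f}")
```

Run the same comparison at 20+ busy hours per day and the ranking flips, which is exactly the exam heuristic: spiky and idle favors serverless, steady and saturated favors provisioned (plus reserved instances).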
Distribution and sort key design
Redshift distributes table data across compute slices, and the distribution style you choose determines whether joins require ...