Data Warehousing and Advanced Access
Amazon Redshift is a petabyte-scale columnar data warehouse optimized for online analytical processing (OLAP); storing data in columns makes aggregation queries efficient. Key considerations include choosing between the serverless and provisioned deployment models based on workload patterns, optimizing schema design with distribution and sort keys, and applying compression. Redshift can also query external data through Redshift Spectrum, federated queries, and materialized views, enabling efficient data access without extensive ETL. Proper schema design and ongoing maintenance are crucial for sustaining performance over time.
In the previous lesson, you designed DynamoDB tables around application access patterns, optimizing for low-latency point lookups and high-volume transactional writes. Analytical workloads use a different design model. When a query aggregates revenue across billions of rows but reads only three columns, row-oriented storage can waste I/O by scanning data the query does not need. Amazon Redshift is designed for this type of workload: a petabyte-scale, columnar data warehouse on AWS built for online analytical processing, or OLAP.
Redshift stores data in columns rather than rows, so an aggregation over sale_amount across two billion records reads only that single column’s compressed blocks, skipping everything else. This lesson covers three pillars that the AWS Certified Data Engineer – Associate exam tests heavily:
Choosing between Redshift’s two deployment models.
Optimizing warehouse schemas through distribution keys, sort keys, and compression.
Querying external data in place using federated queries, Redshift Spectrum, and materialized views.
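The columnar I/O savings described above can be sketched with a toy model in plain Python. This is not how Redshift works internally (real column stores also compress each column's blocks); it only counts the values a scan must touch under each layout:

```python
# Toy model: row-oriented vs column-oriented I/O for SUM(sale_amount).
# Hypothetical numbers chosen for illustration, not Redshift internals.

ROWS = 1_000_000
COLUMNS = ["sale_id", "sale_amount", "region"]  # 3 columns per record

# Row-oriented storage: every record is read in full, even though the
# query only needs sale_amount.
row_store_values_read = ROWS * len(COLUMNS)

# Column-oriented storage: only the sale_amount column is scanned.
column_store_values_read = ROWS

print(f"row store reads    : {row_store_values_read:,} values")
print(f"column store reads : {column_store_values_read:,} values")
print(f"I/O reduction      : {row_store_values_read // column_store_values_read}x")
```

With only 3 columns the reduction is 3x; on a wide fact table with dozens of columns (and per-column compression on top), the gap is far larger.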
Redshift deployment models
Redshift offers two deployment models that share the same SQL engine and RA3 managed storage layer but differ in how compute is provisioned.
| Dimension | Serverless | Provisioned |
| --- | --- | --- |
| Compute unit | Redshift Processing Units (RPUs); 1 RPU = 16 GB memory | Node types (RA3, DC2) with vCPUs, RAM, and slices per node (e.g., RA3.xlplus: 4 vCPUs, 32 GiB, 2 slices) |
| Scaling behavior | Automatic up/down based on workload; define base and max RPU capacity | Manual elastic/classic resize; concurrency scaling for peak demand |
| Billing model | Per-RPU-hour (per-second, 60s minimum); no charge when idle | Hourly per node (on-demand); reserved instances available for 1–3 year commitments |
| Ideal workload pattern | Intermittent, unpredictable workloads with spikes and idle periods | Steady, high-concurrency, continuously heavy workloads |
| Management overhead | Minimal; AWS handles provisioning, patching, maintenance | Higher; requires node selection, WLM tuning, resize management |
| WLM configuration | Automated via workgroups and namespaces | Manual queue configuration with concurrency, user groups, and query labeling options |
For an intermittent analytics workload with unpredictable spikes, serverless is the correct answer. When you have a 24/7 business intelligence (BI) dashboard layer with hundreds of concurrent users, provisioned with reserved instances wins on cost predictability. With deployment selection covered, the next step is designing schemas that make the chosen compute layer as efficient as possible.
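The billing difference can be made concrete with back-of-the-envelope arithmetic. The prices below are hypothetical placeholders, not current AWS pricing (check the pricing page for your region); the point is the structure of the calculation, since serverless bills only while queries run while provisioned nodes bill around the clock:

```python
# Rough monthly cost comparison: serverless vs provisioned Redshift.
# PRICES ARE ASSUMED PLACEHOLDERS -- verify against current AWS pricing.

RPU_HOUR_PRICE = 0.375    # assumed $/RPU-hour for serverless
NODE_HOUR_PRICE = 1.086   # assumed $/node-hour, e.g., an RA3 node on-demand

def serverless_monthly_cost(rpus: int, busy_hours_per_day: float) -> float:
    """Serverless bills per RPU-hour only while work runs; idle time is free."""
    return rpus * busy_hours_per_day * 30 * RPU_HOUR_PRICE

def provisioned_monthly_cost(nodes: int) -> float:
    """Provisioned clusters bill per node-hour 24/7, whether busy or idle."""
    return nodes * 24 * 30 * NODE_HOUR_PRICE

# Intermittent workload: 8 RPUs busy ~2 hours/day.
print(f"serverless  (8 RPUs, 2 h/day): ${serverless_monthly_cost(8, 2):,.2f}")
print(f"provisioned (2 nodes, 24/7):   ${provisioned_monthly_cost(2):,.2f}")
```

Run the same comparison at 20+ busy hours per day and the ranking flips, which is exactly the exam heuristic: spiky and idle favors serverless, steady and saturated favors provisioned (plus reserved instances).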
Distribution and sort key design
Redshift distributes table data across compute slices, and the distribution style you choose determines whether joins require ...