Speeding up ML with SageMaker Lakehouse

This newsletter details how the SageMaker Lakehouse architecture unifies data in S3 (via Apache Iceberg) and Redshift, guaranteeing ACID transactional consistency and time travel for reproducible ML — all governed by AWS Lake Formation.
13 mins read
Nov 14, 2025


Modern enterprises suffer from fractured data architectures and isolated silos. Data resides in multiple systems (S3 data lakes, Redshift warehouses, NoSQL stores, etc.), forcing complex ETL pipelines and data copies. These pipelines introduce latency, inconsistency, and high maintenance. As AWS notes, organizations often struggle to unify their data ecosystems across multiple platforms, resulting in redundant data and slow analytics. Relying on hand-rolled dependency management (e.g., custom singleton tables or manual locking) makes data workflows brittle and error-prone, further hampering ML velocity.

Amazon SageMaker Lakehouse provides an open, unified data platform that breaks down silos. Built on Amazon S3 and Apache Iceberg, it enables data scientists to work from a single copy of data across lakes and warehouses. Through SageMaker Unified Studio and Glue Data Catalog/Lake Formation, Lakehouse unifies access and governance. S3 tabular data (including new S3 Tables), Redshift schemas, and third-party sources are all queryable in-place. Central orchestration and versioned Iceberg tables ensure reliability, consistency, and historical traceability, allowing teams to focus on ML rather than plumbing. For example, AWS reports that customers using Lakehouse can query Iceberg tables without the need for complex ETL processes or data duplication, dramatically accelerating insights.

Architectural foundation (SageMaker Unified Studio and S3 Tables)#

The combination of SageMaker Unified Studio and Amazon S3 Tables delivers a fully managed lakehouse experience. It bridges data engineering, model training, and analytics by coupling Iceberg-based table storage on S3 with a collaborative ML workspace that natively understands governed datasets.

SageMaker Lakehouse core architecture

Key capabilities of SageMaker Unified Studio#

Amazon SageMaker Unified Studio provides an integrated workspace for end-to-end data and AI development within the lakehouse. It connects analytics, governance, and machine learning tools under one interface, enabling teams to move seamlessly from data exploration to model deployment. The following are the key capabilities of SageMaker Unified Studio:

  • Unified environment: Find and access all organizational data (S3, Redshift, databases, applications) in one place. SageMaker Unified Studio integrates functionality from EMR, Athena, Glue, Redshift, and SageMaker AI into a single portal. Data and AI assets are exposed via the SageMaker Lakehouse layer (an Apache Iceberg–compatible catalog), ensuring everyone sees the same data.

  • Interactive querying and notebooks: Built-in SQL and notebook editors let you query any registered dataset. For example, Unified Studio includes a query editor where you select a lakehouse data source and run SQL against S3/Iceberg tables or Redshift tables with auto-populated schemas. It supports multiple engines (Athena, Spark, Redshift Spectrum) automatically.

  • Centralized workflow management: Within Unified Studio, you can create collaborative projects, build ETL pipelines (with visual Glue jobs), and version models, all under one governance model. Shared catalogs (federated or managed) and Lake Formation permissions flow through this studio, ensuring consistent policies. In practice, Unified Studio serves as the gateway for all data and ML workflows, spanning from data discovery and prototyping to production pipelines; you never leave this unified interface.
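
The interactive querying described above can also be scripted. The sketch below builds the parameter set for an Athena query against a lakehouse table; the catalog, database, and S3 locations are hypothetical placeholders, and the actual `boto3` submission is shown in comments since it requires AWS credentials.

```python
def build_athena_query(sql: str, catalog: str, database: str, output_s3: str) -> dict:
    """Build the parameter dict for athena.start_query_execution."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Catalog": catalog, "Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

params = build_athena_query(
    sql="SELECT * FROM daily_events LIMIT 10",
    catalog="s3tablescatalog",           # parent catalog created by the integration
    database="analytics_bucket",         # hypothetical table-bucket namespace
    output_s3="s3://my-query-results/",  # hypothetical results location
)

# With credentials configured, you would submit it like this:
# import boto3
# athena = boto3.client("athena")
# response = athena.start_query_execution(**params)
```

The same parameters work whether the target is an S3/Iceberg table or a federated Redshift table, because both are exposed through the unified catalog.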

Amazon S3 Tables (Optimized storage with built-in Iceberg support)#

Amazon S3 Tables introduces a new S3 bucket type designed specifically for analytics. Instead of generic buckets, a table bucket holds one or more Iceberg tables as subresources. Its key attributes include:

  • Purpose-built table storage: Table buckets deliver higher throughput and more transactions per second (TPS) than standard S3 buckets. They maintain the same durability and scalability as S3 but are architected for tables and queries. Typical use cases include daily transaction logs or sensor event streams (columnar tabular data).

  • Apache Iceberg integration: All S3 Tables are stored in the Apache Iceberg format. This means you can run standard SQL queries via Athena, Redshift Spectrum/Serverless, Spark, and other tools on the underlying data. Importantly, Iceberg enables ACID transactions, schema evolution, and time-travel on S3 Tables natively.

  • Automated optimization: S3 Tables continuously perform table maintenance (compaction, snapshot cleanup, orphan-file deletion) under the hood. This ensures high query performance and cost efficiency without manual tuning. In effect, S3 Tables and SageMaker Lakehouse together provide an automated, versioned store for analytics data across your lakehouse.
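
Provisioning this storage is a short scripted step. The sketch below shows an assumed ARN shape for a table bucket (verify against the AWS documentation) alongside the managed `boto3` calls, which are commented out because they require an AWS account; all names are hypothetical.

```python
def table_bucket_arn(region: str, account_id: str, bucket_name: str) -> str:
    """Assumed ARN shape for an S3 table bucket (check against AWS docs)."""
    return f"arn:aws:s3tables:{region}:{account_id}:bucket/{bucket_name}"

arn = table_bucket_arn("us-east-1", "123456789012", "ml-training-data")

# With credentials configured, the managed calls would look roughly like:
# import boto3
# s3tables = boto3.client("s3tables")
# s3tables.create_table_bucket(name="ml-training-data")
# s3tables.create_namespace(tableBucketARN=arn, namespace=["daily_events"])
```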

Step-by-step lakehouse catalog integration and governance#

Integrating S3 Tables into the SageMaker Lakehouse is streamlined and managed. You simply enable the analytics integration when creating a table bucket or catalog, and AWS handles the rest:

  • Glue Data Catalog setup: When you integrate S3 Tables, SageMaker (via S3) automatically creates a Glue catalog named s3tablescatalog in your account/region. Under this parent catalog, each table bucket gets its own child database (namespace) and tables. For example, creating a new table bucket via Unified Studio will produce a Glue database inside s3tablescatalog. This catalog is created with the necessary flag (AllowFullTableExternalDataAccess) so that analytics services can query all tables.

  • IAM and Lake Formation roles: The integration process creates a special IAM service role for Lake Formation access. This role (e.g., AWSServiceRoleForLakeFormationDataAccess) is granted permissions to all current and future S3 table buckets in the account, enabling Lake Formation to enforce permissions across them. In effect, Lake Formation becomes the central authority for granting or revoking access to any table bucket.

  • Redshift read-only admin: SageMaker Lakehouse adds the Amazon Redshift service role (AWSServiceRoleForRedshift) as a Lake Formation Read-only administrator. This enables Redshift to be aware of the S3 table metadata, allowing it to mount and query these tables transparently. In practice, once your s3tablescatalog is registered with Lake Formation, Redshift (Spectrum/Serverless) will automatically see all Iceberg tables under it.

  • Registering legacy S3: Note that this auto-integrate feature only covers table buckets. If you have legacy Iceberg tables in general-purpose S3 buckets, you must manually register those S3 data paths with Lake Formation. This registration step is required so that the Glue catalog entries for those tables are managed and secured by Lake Formation, just like the new S3 Table buckets.
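
For the legacy-bucket case above, a small helper can track which data locations still need manual registration; the paths are illustrative, and the Lake Formation call is shown in comments since it needs AWS credentials.

```python
def unregistered_paths(all_iceberg_paths, registered_paths):
    """Return legacy data locations not yet registered with Lake Formation."""
    return sorted(set(all_iceberg_paths) - set(registered_paths))

pending = unregistered_paths(
    ["s3://legacy-lake/sales/", "s3://legacy-lake/clicks/"],  # hypothetical paths
    ["s3://legacy-lake/sales/"],
)

# Each pending path would then be registered, e.g.:
# import boto3
# lf = boto3.client("lakeformation")
# for path in pending:
#     suffix = path.removeprefix("s3://").rstrip("/")
#     lf.register_resource(
#         ResourceArn=f"arn:aws:s3:::{suffix}",
#         UseServiceLinkedRole=True,  # Lake Formation assumes its service-linked role
#     )
```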

Apache Iceberg: Guaranteeing data reliability and consistency for AI#

Apache Iceberg brings database-level reliability to the data lake, enabling consistent, versioned, and atomic operations across massive datasets. It serves as the backbone for maintaining trustworthy, up-to-date data in AI and analytics workflows.

The transactional advantage: ACID properties at scale#

Apache Iceberg brings full ACID transactions to S3 data lakes. Every write operation (INSERT/UPDATE/DELETE) is an atomic metadata commit, swapping one metadata file for another. In practice, this means concurrent writers never corrupt the table. If two jobs attempt conflicting updates, Iceberg’s optimistic commit will cause one to retry based on the new table version. By design, Apache Iceberg’s transactional architecture guarantees atomicity and consistency, providing a reliable data foundation for modern AI pipelines.

  • Atomicity and isolation: Each commit is all-or-nothing and provides serializable isolation. Iceberg achieves this by atomically swapping table metadata files, so readers always see a complete, committed snapshot of the data, never a partial update.

  • Consistent reads: Iceberg guarantees that every reader sees a consistent view. When a query starts, it pins a specific metadata snapshot. Even if concurrent writes occur, the reader continues to use its original snapshot until it refreshes. As a result, no query ever sees an in-progress change, and users can rely on repeatable reads with full confidence that each SELECT operation is performed on a coherent dataset.

  • Durability: All Iceberg tables live on S3, so data durability is inherited from S3’s eleven-nines (99.999999999 percent) design. Because the AWS Glue catalog maintains all table metadata, the table state remains consistently available as well.
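
The optimistic commit behavior described above can be modeled in a few lines. This is an illustrative simulation, not the real Iceberg client: a writer's commit succeeds only if the table's metadata version is still the one it originally read, otherwise it re-reads and retries.

```python
class Table:
    def __init__(self):
        self.version = 0          # metadata pointer; swapped atomically on commit
        self.snapshots = [set()]  # one snapshot (set of data files) per version

    def read(self):
        """Pin the current version and its snapshot (what a reader/writer sees)."""
        return self.version, set(self.snapshots[self.version])

    def commit(self, base_version, new_snapshot):
        """Atomically swap metadata iff nothing committed since base_version."""
        if base_version != self.version:
            return False          # conflict: caller must re-read and retry
        self.snapshots.append(new_snapshot)
        self.version += 1
        return True

t = Table()

# Two writers read the same base version...
v1, snap1 = t.read()
v2, snap2 = t.read()

ok1 = t.commit(v1, snap1 | {"file-A"})  # first writer wins
ok2 = t.commit(v2, snap2 | {"file-B"})  # second writer conflicts...

if not ok2:                             # ...so it retries on the new version
    v3, snap3 = t.read()
    ok2 = t.commit(v3, snap3 | {"file-B"})
```

Note that neither writer's data is lost: the retry rebases the second change on top of the first, which is exactly why concurrent jobs never corrupt the table.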

Reproducibility as a feature: Time travel for model auditing#

Iceberg’s versioning makes time travel queries straightforward. The system maintains a history of table snapshots with their timestamps. You can run a query as of a past time, effectively rewinding the table to that state. This is important for machine learning because you can tie a model training run to the exact snapshot of data it used. In other words, if a model fails or drifts, you can reconstruct the exact training data by going back to that point in time. From a compliance perspective, time travel enables you to audit precisely what data was used for any analysis by querying the historical table version.
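
Conceptually, resolving a time-travel query means picking the latest snapshot committed at or before the requested timestamp. The sketch below models that lookup; the snapshot history is illustrative data, not real table metadata.

```python
def snapshot_as_of(history, as_of_ts):
    """Return the latest snapshot committed at or before as_of_ts."""
    eligible = [s for s in history if s["committed_at"] <= as_of_ts]
    if not eligible:
        raise ValueError("no snapshot exists at or before that time")
    return max(eligible, key=lambda s: s["committed_at"])

history = [
    {"snapshot_id": 101, "committed_at": 1700000000},  # initial load
    {"snapshot_id": 102, "committed_at": 1700003600},  # features added
    {"snapshot_id": 103, "committed_at": 1700007200},  # later backfill
]

# Reproduce the exact data a model trained on at t=1700005000:
training_snapshot = snapshot_as_of(history, 1700005000)
```

In Athena, the equivalent query uses a `FOR TIMESTAMP AS OF` clause on the Iceberg table, so tying a training run to its snapshot is a one-line change to the training query.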

Solving schema drift and partition optimization#

Iceberg was built for evolving data. Its metadata-driven design handles schema and partition changes without the need to rewrite data files. Iceberg achieves this adaptability through two key features that manage change efficiently:

  • Schema evolution: You can add, drop, or rename columns in the table’s metadata while keeping existing data files intact. Downstream queries are automatically updated to reflect the new schema. For example, Iceberg allows columns to be modified without breaking reports, as each schema change is an atomic metadata update. This means that schema drift, such as the introduction of new features in datasets, does not require costly backfills or ETL jobs, allowing the table to adapt seamlessly.

  • Partition evolution: Over time, data organization strategies may change, such as moving from hourly to daily partitions or introducing a sort order. Iceberg’s partition evolution feature lets you redefine how data is partitioned without rewriting existing files. It automatically manages new data under the updated scheme while continuing to read older partitions. Combined with automated compaction, this ensures optimized query performance and reduces manual maintenance.
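
The metadata-only nature of schema evolution can be made concrete with a toy model: each change appends a new schema version while data files written under older schemas stay untouched, and readers project old files through the current schema (missing columns read as null). This is an illustrative simplification of Iceberg's design, not its actual implementation.

```python
class IcebergLikeTable:
    def __init__(self, columns):
        self.schemas = [list(columns)]  # schema history (metadata only)
        self.data_files = []            # (schema_version, rows) pairs

    def write(self, rows):
        self.data_files.append((len(self.schemas) - 1, rows))

    def add_column(self, name):
        self.schemas.append(self.schemas[-1] + [name])  # atomic metadata update

    def scan(self):
        """Project every data file through the current (latest) schema."""
        current = self.schemas[-1]
        out = []
        for _, rows in self.data_files:
            for row in rows:
                out.append({col: row.get(col) for col in current})
        return out

t = IcebergLikeTable(["user_id", "amount"])
t.write([{"user_id": 1, "amount": 9.5}])
t.add_column("channel")                 # no data files are rewritten
t.write([{"user_id": 2, "amount": 3.0, "channel": "web"}])
rows = t.scan()
```

Old rows surface the new `channel` column as null, which is why adding features to a dataset never forces a backfill of historical files.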

Seamless interoperability: Bridging the lake and the warehouse with Redshift#

The SageMaker Lakehouse architecture integrates Amazon Redshift with Apache Iceberg tables on S3, enabling analytics teams to query both lake and warehouse data through a unified, governed interface.

Unified analytics strategy via Redshift#

A key benefit of SageMaker Lakehouse is that you can query both your high-performance Redshift data and your massive Iceberg datasets together. AWS makes this easy via Redshift Spectrum/Serverless. You simply point Redshift at the Glue catalog containing your Iceberg tables (including s3tablescatalog). From any Redshift workgroup, you can CREATE EXTERNAL SCHEMA on that catalog and then SELECT from Iceberg tables as if they were native external tables. This enables classic lakehouse patterns, allowing joins to be run between Redshift warehouse tables and S3-based tables in a single query. In practice, this means analysts and BI tools can combine the best of both worlds: fast Redshift for structured workloads and open S3/Iceberg for exabytes of raw data, without requiring data movement or copying.
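
The mount-and-join pattern above can be sketched as generated SQL. The schema, database, and role names below are hypothetical, and the exact `CREATE EXTERNAL SCHEMA` options should be checked against the Redshift documentation.

```python
def external_schema_ddl(schema: str, glue_database: str, iam_role_arn: str) -> str:
    """Render DDL that mounts a Glue/Iceberg database as a Redshift external schema."""
    return (
        f"CREATE EXTERNAL SCHEMA {schema} "
        f"FROM DATA CATALOG DATABASE '{glue_database}' "
        f"IAM_ROLE '{iam_role_arn}';"
    )

ddl = external_schema_ddl(
    "lakehouse",
    "daily_events",                                     # Glue database under the catalog
    "arn:aws:iam::123456789012:role/RedshiftSpectrum",  # hypothetical role
)

# A lakehouse-style query then joins native and external tables directly:
join_sql = """
SELECT c.segment, SUM(e.amount) AS revenue
FROM warehouse.customers c        -- native Redshift table
JOIN lakehouse.transactions e     -- Iceberg table on S3
  ON c.customer_id = e.customer_id
GROUP BY c.segment;
"""
```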

Transactional constraints and strategic data movement#

One current limitation is that Amazon Redshift can only read from Iceberg tables. It supports ACID-consistent SELECT operations but cannot perform INSERT, UPDATE, or DELETE commands directly on S3 tables. In practice, if you need to modify an Iceberg table, you use another engine such as Athena, EMR, or Spark (through Glue ETL jobs) to handle those writes. Redshift automatically picks up the new data on the next query, because the Iceberg table metadata has been updated. As a result, many teams employ a hybrid approach, using Redshift for high-performance analytics while offloading update-heavy or historical data to S3 and Iceberg. Over time, entire tables can even be migrated to Iceberg with Spark jobs or Athena CREATE TABLE AS SELECT statements, so that storage relies fully on S3.
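
The hybrid routing rule this paragraph describes can be written down directly: reads may go to any engine, while writes must go to one that can commit to Iceberg. The engine names below follow the text; they are labels for illustration, not an AWS API.

```python
READ_ENGINES = {"redshift", "athena", "emr-spark", "glue-etl"}
WRITE_ENGINES = {"athena", "emr-spark", "glue-etl"}  # Redshift is read-only on Iceberg

def choose_engine(operation: str, preferred: str = "redshift") -> str:
    """Pick an engine for an Iceberg table operation, falling back for writes."""
    if operation == "read":
        return preferred if preferred in READ_ENGINES else "athena"
    if operation in {"insert", "update", "delete"}:
        return preferred if preferred in WRITE_ENGINES else "glue-etl"
    raise ValueError(f"unknown operation: {operation}")
```

For example, a pipeline that prefers Redshift gets Redshift for its SELECTs but is silently routed to a Glue ETL job when it needs to run an UPDATE.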

Redshift catalogs in Glue: Bidirectional governance#

The integration goes both ways. Redshift data can also be exposed in the Glue Data Catalog. You can register Redshift clusters or Serverless namespaces with the Data Catalog via Lake Formation. This creates a federated catalog for each cluster, where Redshift databases and schemas are automatically mapped to Glue databases and tables. Once registered, any Lake Formation-permitted user can query Redshift tables through other Iceberg-compatible engines (such as Athena and EMR), without needing to copy data. The benefit is uniform governance: you manage access in one place. In conjunction with Lake Formation’s fine-grained policies and SageMaker Lakehouse’s open Iceberg REST API, this enables secure data sharing across accounts. In short, SageMaker Lakehouse supports bidirectional data sharing: S3 tables become queryable in Redshift, and Redshift tables become queryable in the lake, all under the same catalog and permissions.

Operationalizing governance, security, and data quality#

A unified data and AI platform is only as strong as its governance layer. The SageMaker Lakehouse operationalizes access control, compliance, and data quality directly within the data workflow, ensuring that every dataset used for analytics or ML adheres to enterprise standards.

Centralized fine-grained access control (FGAC) with Lake Formation#

The SageMaker Lakehouse relies on AWS Lake Formation as the security backbone. Lake Formation allows administrators to define table- and column-level access rules (GRANT/REVOKE) that are enforced across all services, S3 tables, Glue catalogs, and even Redshift external tables. In this model, instead of IAM bucket policies or ad hoc SQL grants, you manage one centralized policy graph. Redshift even fully honors these policies on Iceberg tables. All access is provided through the Glue/Lake Formation catalog interface (including the Iceberg REST endpoints), ensuring uniform FGAC across all platforms. For auditability, Lake Formation logs every access decision (via CloudTrail) and tracks DDL changes. Combined with Iceberg’s time-travel capabilities, you get a comprehensive compliance playbook; you can see who was allowed what and exactly what data snapshot they queried. In practice, no analytics query can bypass Lake Formation’s guardrails.
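
A column-level grant under this model has a well-defined shape. The sketch below builds the request you would pass to boto3's `lakeformation.grant_permissions`; the principal ARN, database, table, and column names are hypothetical placeholders.

```python
def column_grant(principal_arn, catalog_id, database, table, columns):
    """Build a Lake Formation grant limited to specific columns of one table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "CatalogId": catalog_id,
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": columns,
            }
        },
        "Permissions": ["SELECT"],  # read-only; no write permissions granted
    }

grant = column_grant(
    principal_arn="arn:aws:iam::123456789012:role/DataScientist",  # hypothetical
    catalog_id="123456789012",
    database="daily_events",
    table="transactions",
    columns=["event_time", "amount"],  # sensitive columns deliberately excluded
)

# With credentials configured:
# import boto3
# boto3.client("lakeformation").grant_permissions(**grant)
```

Because every engine resolves access through this same policy, the data scientist sees only `event_time` and `amount` whether they query from Athena, EMR, or Redshift.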

Data quality validation integrated into the ML workflow#

High-quality data is non-negotiable for AI. SageMaker Lakehouse embraces this with built-in data quality (DQ) checks. Through AWS Glue Data Quality integration, you can define and run quality rules (completeness, accuracy, etc.) directly on Iceberg/S3 tables. AWS recently announced that Glue Data Quality is now integrated with SageMaker Lakehouse, Apache Iceberg (on S3), and S3 Tables. In practice, you author DQ rules (or use Glue’s rule recommendations) and run them on your data lake tables. The results are surfaced in SageMaker Unified Studio, where you can visualize data quality scores and rule violations for each asset. This tight integration shortens the feedback loop, allowing data issues to be detected before they are incorporated into models. Teams can halt jobs or cancel the commit when DQ checks fail, preventing low-quality inputs from poisoning production models.
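
A minimal quality gate in the spirit of these checks computes a completeness score per column and blocks the write when any score falls below a threshold. The thresholds and column names below are illustrative, and this local sketch stands in for the managed Glue Data Quality evaluation.

```python
def completeness(rows, column):
    """Fraction of rows with a non-null value in `column`."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(column) is not None) / len(rows)

def dq_gate(rows, rules):
    """rules: {column: min_completeness}. Returns (passed, failing scores)."""
    failures = {c: completeness(rows, c)
                for c, threshold in rules.items()
                if completeness(rows, c) < threshold}
    return (not failures), failures

rows = [
    {"user_id": 1, "amount": 9.5},
    {"user_id": 2, "amount": None},  # missing amount
    {"user_id": 3, "amount": 4.0},
]
passed, failures = dq_gate(rows, {"user_id": 1.0, "amount": 0.9})
# amount completeness is 2/3, below the 0.9 threshold, so the write is rejected
```

Wiring a gate like this in front of the Iceberg commit is what keeps a bad batch from ever becoming a visible snapshot.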

Consistency management: Replacing brittle dependencies with reliable access#

By unifying data access, the Lakehouse obviates many old workarounds. There’s no longer a need for fragile singleton tables, ad hoc locking, or manual ETL sequencing to ensure consistency. All teams work off the same governed Iceberg tables, and the platform handles concurrency and versioning. For instance, data scientists simply query the lakehouse catalog instead of writing custom workflows to synchronize data. In effect, SageMaker Lakehouse absorbs the complexity of consistency management. The result is faster iterations and more predictable models, as everyone is always looking at the true, up-to-date (or properly versioned) data without relying on manual glue code.

Conclusion and strategic recommendations#

To fully realize the value of Amazon SageMaker Lakehouse, organizations should begin by establishing strong data governance through AWS Lake Formation and Glue Data Catalogs, ensuring fine-grained access control from the outset. Next, adopt Apache Iceberg on Amazon S3 for transactional workloads that require reliable inserts, updates, and ACID compliance, utilizing engines such as Athena, EMR, or Glue for writes, while maintaining Redshift as a performant, read-only query layer. Finally, operationalize data quality enforcement by embedding Glue Data Quality checks into ML pipelines and automating quality gating in SageMaker Studio. Together, these steps create a unified, governed, and high-quality data foundation where data silos are eliminated, models are trained on trusted data, and analytics deliver faster and more accurate insights.

Written By:
Fahim ul Haq