Training Approaches in SageMaker AI

Explore the three SageMaker training approaches—built-in algorithms, Script Mode, and Bring Your Own Container—to understand trade-offs in control, complexity, and cost. Gain insights into data handling, job orchestration, and cost-saving options like Managed Spot Training with checkpointing to efficiently scale and manage production ML workloads.

When a machine learning team ships its first model to production, the initial excitement fades quickly once it realizes that the training infrastructure demands constant attention: dependency conflicts break overnight builds, framework upgrades introduce silent regressions, and preprocessing logic duplicated between training and serving creates prediction drift that erodes model quality. The core architectural decision in SageMaker training, choosing how much control to retain vs. how much to delegate, determines whether our team spends cycles on ML innovation or container maintenance. This lesson maps the decision space across three training approaches, all unified by SageMaker’s standardized job orchestration, and positions us to make deliberate trade-offs that align with our workload’s complexity, compliance requirements, and cost constraints.

The training approach spectrum

Amazon SageMaker frames training workloads as a spectrum with three tiers: built-in algorithms, Script Mode, and Bring Your Own Container (BYOC). Each tier shifts the boundary between what SageMaker manages and what we own. The critical insight is that, regardless of which tier we choose, SageMaker enforces an identical contract for job orchestration. Data channels map from S3 to /opt/ml/input/data/{channel_name}/. Hyperparameters are injected into /opt/ml/input/config/hyperparameters.json. Trained model artifacts must be written to /opt/ml/model/, which SageMaker compresses and uploads to our specified S3 output path. Logs stream to CloudWatch automatically.
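To make the directory contract concrete, here is a minimal sketch of how a training program might read its inputs. The helper name load_job_inputs and the base_dir parameter are illustrative assumptions rather than part of any SageMaker library; base_dir exists only so the function can be exercised outside a training container.

```python
import json
from pathlib import Path


def load_job_inputs(base_dir="/opt/ml"):
    """Read hyperparameters and discover data channels under the /opt/ml layout.

    Hypothetical helper for illustration; not part of the SageMaker SDK.
    """
    base = Path(base_dir)

    # Hyperparameter values arrive as strings, so callers cast them as needed.
    hp_path = base / "input" / "config" / "hyperparameters.json"
    hyperparameters = json.loads(hp_path.read_text()) if hp_path.exists() else {}

    # Each subdirectory of input/data corresponds to one named channel.
    data_root = base / "input" / "data"
    channels = {}
    if data_root.is_dir():
        for channel_dir in sorted(p for p in data_root.iterdir() if p.is_dir()):
            channels[channel_dir.name] = sorted(
                f.name for f in channel_dir.iterdir() if f.is_file()
            )

    # Anything written here is what SageMaker packages and uploads to S3.
    model_dir = base / "model"
    model_dir.mkdir(parents=True, exist_ok=True)
    return hyperparameters, channels, model_dir
```

Inside a real job the function would be called with no arguments. Passing a temporary directory reproduces the same layout locally for testing.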

This standardization means the services we interact with remain consistent: the SageMaker Estimator API launches jobs, Amazon ECR stores container images (whether SageMaker-managed or custom), and S3 serves as the durable layer for inputs and outputs. Understanding this contract is a prerequisite for the next lesson on automated hyperparameter tuning, where SageMaker orchestrates hundreds of training jobs, each following this same directory structure, to search the hyperparameter space efficiently.

Built-in algorithms for rapid experimentation

SageMaker provides more than 17 pre-optimized built-in algorithms, including XGBoost, Linear Learner, BlazingText, Image Classification, Object Detection, and k-Nearest Neighbors, packaged as containers hosted in Amazon ECR across all commercial regions. Many of these containers support distributed training and GPU acceleration without any user code. We start by retrieving the algorithm’s image URI with sagemaker.image_uris.retrieve(), define hyperparameters as key-value pairs, point to S3 data channels for the training and validation splits, and call fit().

No custom training script is required. We only need to configure the algorithm, and SageMaker executes it. However, each algorithm mandates specific input formats: RecordIO-protobuf for high-throughput algorithms like Linear Learner, CSV or LibSVM for XGBoost, or image files for vision algorithms. Format mismatches are a common source of job failures for newcomers.
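As one illustration of a format requirement, the built-in XGBoost container’s CSV mode expects files with no header row and the label in the first column. The sketch below shows one way to serialize records into that shape. The function name to_xgboost_csv and the sample column names are hypothetical, and other algorithms have their own rules that should be checked in their documentation.

```python
import csv
import io


def to_xgboost_csv(rows, label_key):
    """Serialize dict rows as header-less CSV with the label in the first column.

    Hypothetical helper; feature order follows the key order of the first row.
    """
    if not rows:
        return ""
    feature_keys = [k for k in rows[0] if k != label_key]
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        # Label first, then features in a fixed order; no header is written.
        writer.writerow([row[label_key]] + [row[k] for k in feature_keys])
    return buffer.getvalue()


sample = [
    {"age": 31, "income": 52000, "churned": 1},
    {"age": 45, "income": 61000, "churned": 0},
]
csv_text = to_xgboost_csv(sample, label_key="churned")
```

The resulting text can be uploaded to the S3 prefix that backs the training channel.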

Attention: Built-in algorithms constrain us to their supported preprocessing and feature engineering logic. If our use case requires custom tokenization, domain-specific augmentation, or nonstandard loss functions, we must move up the spectrum to Script Mode.

Built-in algorithms are ideal for establishing baseline models during rapid prototyping. Once we validate that the problem is tractable, we graduate to custom code for production refinement.

Let’s examine the middle ground that most production teams adopt.

Script Mode and managed framework containers

Script Mode is the balanced default for production workloads. We write a custom training script, typically train.py, while SageMaker provides pre-built framework containers for TensorFlow, PyTorch, MXNet, scikit-learn, and Hugging Face. Our script must honor the /opt/ml/ contract: read hyperparameters from environment variables (for example, SM_HP_LEARNING_RATE) or the JSON config file, load data from /opt/ml/input/data/{channel_name}/, and persist the trained model to /opt/ml/model/.
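A minimal train.py skeleton that honors this contract might look like the following sketch. It assumes a channel named train (which surfaces as SM_CHANNEL_TRAIN) and a hyperparameter named learning_rate. The "model" it writes is a placeholder JSON file standing in for real training output.

```python
import argparse
import json
import os
from pathlib import Path


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # The flag name mirrors the hyperparameter key; SM_HP_LEARNING_RATE is the fallback.
    parser.add_argument(
        "--learning_rate",
        type=float,
        default=float(os.environ.get("SM_HP_LEARNING_RATE", "0.01")),
    )
    # Assumes a channel called "train"; the path defaults to the /opt/ml layout.
    parser.add_argument(
        "--train",
        default=os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
    )
    parser.add_argument(
        "--model-dir",
        default=os.environ.get("SM_MODEL_DIR", "/opt/ml/model"),
    )
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    train_files = sorted(p.name for p in Path(args.train).glob("*") if p.is_file())

    # Placeholder for real training: record what the job saw instead of fitting a model.
    model = {"learning_rate": args.learning_rate, "train_files": train_files}

    model_dir = Path(args.model_dir)
    model_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "model.json").write_text(json.dumps(model))
    return model
```

A real script would end by calling main() under the usual __name__ guard. Keeping argument parsing in its own function makes the script easy to test locally with explicit paths.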

Execution flow and dependency management

SageMaker handles instance provisioning, container orchestration, log streaming to CloudWatch, and artifact uploads to S3. When our script requires additional Python packages beyond the framework container’s defaults, we can include a requirements.txt file alongside the script. SageMaker installs these packages at job startup, which eliminates the need for a full custom container for minor dependency additions.

Practical tip: Separate preprocessing logic into a dedicated SageMaker Processing job rather than embedding it inside our training script. Coupling these concerns invites training-serving skew, because the exact transformations applied during training become difficult to replicate at inference time unless we extract them into a shared, versioned artifact.
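One way to picture a shared artifact is a small file of fitted preprocessing parameters that both the training and serving code load. The sketch below uses plain standardization as a stand-in transformation. All function names are hypothetical, and a real pipeline would more likely persist a fitted preprocessing object from its ML framework.

```python
import json
import statistics
from pathlib import Path


def fit_scaler(values):
    """Compute standardization parameters once, during preprocessing."""
    std = statistics.pstdev(values)
    # Guard against constant columns, which would otherwise divide by zero.
    return {"mean": statistics.fmean(values), "std": std or 1.0}


def apply_scaler(params, values):
    """Apply the same transformation at training time and at inference time."""
    return [(v - params["mean"]) / params["std"] for v in values]


def save_params(params, path):
    Path(path).write_text(json.dumps(params))


def load_params(path):
    return json.loads(Path(path).read_text())
```

Because training and serving both call apply_scaler with parameters loaded from the same file, the transformation cannot silently diverge between the two.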

Script Mode supports distributed training via framework-native strategies (PyTorch DDP and TensorFlow MirroredStrategy), with SageMaker managing the cluster topology and inter-node communication. This positions Script Mode as production-ready for most custom workloads without the operational burden of container maintenance.

The following diagram illustrates how all three approaches converge into SageMaker’s unified training architecture.

SageMaker training job orchestration with standardized directory contract across all container strategies

With the unified contract established, we can examine the scenario where even Script Mode’s flexibility proves insufficient.

Bring Your Own Container for full control

BYOC is the maximum-flexibility option. We build a custom Docker image, push it to Amazon ECR, and reference the image URI in the Estimator. This approach becomes necessary when our workload requires proprietary libraries with complex system-level dependencies (custom-compiled C++ libraries and specialized CUDA kernels), unsupported frameworks, or compliance mandates that require a hardened, security-scanned base image.

The Docker contract

SageMaker starts a training container by running it with the single argument train (in effect, docker run <image> train). The image must therefore either place an executable named train on its PATH or define a Dockerfile ENTRYPOINT that handles that argument, and the program it launches must read from the standard /opt/ml/ directory structure. As long as our container honors this contract, SageMaker treats it identically to any managed container by provisioning instances, mounting data channels, and collecting artifacts.
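The entry point itself can be a short program. The following sketch dispatches on the train argument, writes a placeholder artifact to the model directory, and on error writes a description to /opt/ml/output/failure, the file SageMaker reads to report why a job failed. The names entrypoint and run_training are hypothetical, and the base_dir parameter exists only to allow local testing.

```python
import sys
import traceback
from pathlib import Path


def run_training(base):
    """Hypothetical stand-in for real training logic."""
    model_dir = base / "model"
    model_dir.mkdir(parents=True, exist_ok=True)
    (model_dir / "model.txt").write_text("trained")


def entrypoint(argv, base_dir="/opt/ml", train_fn=run_training):
    """Return a process exit code: 0 on success, non-zero on failure."""
    base = Path(base_dir)
    if len(argv) < 2 or argv[1] != "train":
        print("usage: <program> train", file=sys.stderr)
        return 2
    try:
        train_fn(base)
        return 0
    except Exception:
        # The failure file's contents are surfaced as the job's failure reason.
        output_dir = base / "output"
        output_dir.mkdir(parents=True, exist_ok=True)
        (output_dir / "failure").write_text(traceback.format_exc())
        return 1
```

In an image, this file would be installed as the executable that the container runs, finishing with sys.exit(entrypoint(sys.argv)) so that a non-zero exit code marks the job as failed.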

Note: SageMaker offers a hybrid path via the open-source SageMaker Training Toolkit. Installing this library inside our custom container gives us Script Mode behavior (environment variable injection and automatic model upload) while retaining full Dockerfile control over system dependencies.

BYOC carries the highest operational burden. We own patching, vulnerability scanning, dependency resolution, and framework version management. However, the Estimator API remains unified: fit() launches the job, data channels are specified identically, and model artifacts land in the same S3 output path. This consistency enables SageMaker Pipelines to orchestrate heterogeneous training steps, some using built-in algorithms and others using BYOC, within a single DAG.

The following table compares all three training approaches across critical dimensions to guide our selection.

SageMaker ML Implementation Approaches Comparison

| Dimension | Built-in Algorithms | Script Mode | BYOC |
| --- | --- | --- | --- |
| Customization Level | No custom code | Custom training script | Full Dockerfile control |
| Container Management | SageMaker-managed | SageMaker-managed framework container | User-managed in ECR |
| Typical Use Case | Rapid prototyping with standard algorithms | Custom model logic with managed environment | Proprietary libraries or unsupported frameworks |
| Framework Support | Limited to supported algorithms | TensorFlow, PyTorch, Scikit-learn, MXNet, Hugging Face | Any framework or runtime |
| Infrastructure Overhead | Lowest | Moderate | Highest |
| Example Scenario | Baseline XGBoost model | Custom neural network with PyTorch | Custom C++ inference library or specialized CUDA kernels |

With the right training approach selected, cost optimization becomes the next lever for production readiness.

Cost optimization with Managed Spot Training

Managed Spot Training applies across all three training approaches and uses spare EC2 capacity at up to a 90% discount compared to on-demand pricing. Spot instances can be interrupted with a two-minute warning, so checkpointing is effectively mandatory for production workloads.

Checkpointing mechanics and recovery

The training script periodically saves model state to /opt/ml/checkpoints/, which SageMaker automatically syncs to the S3 location set in checkpoint_s3_uri. If a spot interruption occurs, SageMaker relaunches the job on a new instance and copies the checkpoints from S3 back into /opt/ml/checkpoints/. The script must detect and load the latest checkpoint at startup so that training resumes from the last saved state rather than restarting from scratch.
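The save-and-resume logic inside the script can be sketched as follows. The checkpoint-N.json naming scheme and the steps counter are illustrative assumptions; a real script would save framework state such as model weights and optimizer state instead.

```python
import json
import re
from pathlib import Path

CHECKPOINT_PATTERN = re.compile(r"^checkpoint-(\d+)\.json$")


def latest_checkpoint(checkpoint_dir):
    """Return (epoch, path) of the newest checkpoint, or (0, None) if none exists."""
    best = (0, None)
    directory = Path(checkpoint_dir)
    if not directory.is_dir():
        return best
    for path in directory.iterdir():
        match = CHECKPOINT_PATTERN.match(path.name)
        if match and int(match.group(1)) > best[0]:
            best = (int(match.group(1)), path)
    return best


def train(total_epochs, checkpoint_dir="/opt/ml/checkpoints"):
    """Run or resume training; returns the epoch resumed from and the final state."""
    directory = Path(checkpoint_dir)
    directory.mkdir(parents=True, exist_ok=True)

    # On a fresh start there is no checkpoint; after a restart we pick up the newest one.
    start_epoch, path = latest_checkpoint(directory)
    state = json.loads(path.read_text()) if path else {"steps": 0}

    for epoch in range(start_epoch + 1, total_epochs + 1):
        state["steps"] += 1  # placeholder for one epoch of real training
        (directory / f"checkpoint-{epoch}.json").write_text(json.dumps(state))
    return start_epoch, state
```

Calling train again on a directory that already holds checkpoints skips the completed epochs, which is the behavior a relaunched spot job relies on.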

The Estimator configuration requires three parameters:

  • use_spot_instances=True: Enables spot capacity.

  • max_run: Maximum training time in seconds.

  • max_wait: Maximum total time in seconds, including time lost to interruptions (must be greater than or equal to max_run).

Attention: Setting max_wait equal to max_run leaves zero buffer for spot recovery. In practice, set max_wait to 1.5 to 2 times max_run for jobs that tolerate interruption delays.
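The relationship between these settings can be captured in a small helper that derives max_wait from max_run. This is a sketch: the function spot_training_kwargs and its wait_factor parameter are hypothetical, and the bucket in the example is a placeholder. The dictionary keys are the Estimator’s keyword arguments, with checkpoint_s3_uri added so that checkpoints have an S3 destination.

```python
def spot_training_kwargs(max_run, checkpoint_s3_uri, wait_factor=1.5):
    """Build the spot-related keyword arguments for a SageMaker Estimator.

    Hypothetical helper; wait_factor is the multiple of max_run used for max_wait.
    """
    if max_run <= 0:
        raise ValueError("max_run must be a positive number of seconds")
    if wait_factor < 1.0:
        raise ValueError("wait_factor below 1.0 would make max_wait shorter than max_run")
    return {
        "use_spot_instances": True,
        "max_run": max_run,
        # Extra headroom beyond max_run covers waiting for capacity and restarts.
        "max_wait": int(max_run * wait_factor),
        "checkpoint_s3_uri": checkpoint_s3_uri,
    }


# Placeholder bucket name; one hour of training with a 1.5x wait budget.
spot_kwargs = spot_training_kwargs(3600, "s3://example-bucket/checkpoints/")
```

The result can be unpacked into the constructor alongside the usual arguments, as in Estimator(..., **spot_kwargs).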

Checkpointing is not exclusively a spot training concern. Any long-running job, whether it is distributed training across multiple nodes or large-scale fine-tuning, benefits from periodic state persistence for fault tolerance and experiment resumability. SageMaker Pipelines can orchestrate training steps with spot configuration baked into the pipeline definition, which connects cost optimization directly to the broader MLOps workflow.

The following diagram illustrates the spot training lifecycle from initial execution through interruption and recovery.

Managed Spot Training lifecycle

This cost optimization strategy feeds directly into automated tuning workflows, where dozens or hundreds of training jobs execute in parallel, each benefiting from spot pricing when configured correctly.