Orchestrating MLOps with Bedrock AgentCore and SageMaker

Building an autonomous agent that handles the entire ML lifecycle: from exploratory data analysis in a Code Interpreter sandbox to Bayesian hyperparameter tuning and model deployment on AWS SageMaker.

Dan

When you build machine learning pipelines, you quickly realize that the modeling itself is only a small fraction of the effort. The real work is in the plumbing: cleaning datasets, checking for stationarity and seasonality, engineering lag and rolling-window features, running hyperparameter searches, tracking jobs, building reports, and deploying endpoints.

Traditionally, data scientists do all of this by hand, usually in sprawling Jupyter notebooks. But what if an autonomous AI agent could handle the entire lifecycle instead?

I set out to build exactly that: a production-ready forecasting system that takes a raw CSV in S3, performs advanced statistical analysis, tunes hyperparameters, trains ARIMA and LSTM models, and deploys a live endpoint, all managed by an LLM-driven orchestrator.

Here is how I built a three-layer architecture combining AWS Bedrock AgentCore, Code Interpreter sandboxes, and AWS SageMaker.


The Three-Layer Architecture

An autonomous system needs to split concerns between reasoning, executing code, and running heavy machine learning jobs. I designed a three-layer architecture:

graph TD
    subgraph Layer1["Layer 1: Orchestration (AgentCore)"]
        Agent["Intelligent Agent<br/>(16 Tools)"]
        Boto3["AWS SDK (boto3)"]
    end

    subgraph Layer2["Layer 2: Data Science (Code Interpreter)"]
        Sandbox["Isolated Python Sandbox"]
        EDA["EDA / Statistical Tests"]
        Features["Feature Engineering"]
        Plotly["Plotly HTML Reports"]
    end

    subgraph Layer3["Layer 3: Scalable ML (SageMaker)"]
        Tuning["Bayesian HPO Jobs"]
        Training["Scalable Model Training"]
        Endpoint["Production REST Endpoint"]
    end

    Agent --> Sandbox
    Agent --> Boto3
    Boto3 --> Tuning
    Boto3 --> Training
    Boto3 --> Endpoint

Let's break down how these layers interact.


Layer 1: Orchestration with AWS Bedrock AgentCore

The orchestrator is an agent written with the Strands Agents SDK and hosted on AWS Bedrock AgentCore. It has access to 16 specialized tools, ranging from statistical analysis to SageMaker deployment commands.

In agent.py, the agent is initialized with a list of Python tool references:

from strands import Agent
from agents.advanced_eda_agent import run_advanced_eda
from agents.intelligent_feature_engineering_agent import recommend_features, create_features
from agents.sagemaker_simple import (
    create_sagemaker_training_job,
    get_training_job_status,
    deploy_sagemaker_model,
    invoke_sagemaker_endpoint
)

agent = Agent(
    name="IntelligentForecastingAgent",
    description="Intelligent time series forecasting with ARIMA and LSTM model comparison.",
    tools=[
        run_advanced_eda,
        recommend_features,
        create_features,
        create_sagemaker_training_job,
        get_training_job_status,
        deploy_sagemaker_model,
        invoke_sagemaker_endpoint,
        # ... other tools
    ]
)

The agent uses these tools to execute a 7-step forecasting workflow:

  1. Advanced EDA: Runs ADF/KPSS tests and seasonal decomposition.
  2. Feature Recommendations: The agent reviews the EDA statistics and recommends features.
  3. Feature Engineering: Generates rolling stats, lag columns, and calendar features.
  4. Bayesian Tuning: Launches SageMaker HPO to find optimal hyperparameters.
  5. Model Training: Trains classical models (ARIMA) and deep learning models (LSTM).
  6. Comparison: Evaluates metrics on hold-out test sets.
  7. Deployment and Reporting: Deploys the best model and returns a presigned URL to an interactive Plotly report.
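Step 3 is the most mechanical part of the workflow. A minimal pandas sketch of the kind of transformation it performs might look like the following; `add_time_series_features`, its parameters, and the chosen lags and windows are illustrative assumptions, not the project's actual `create_features` tool.

```python
import pandas as pd


def add_time_series_features(
    df: pd.DataFrame,
    time_col: str,
    value_col: str,
    lags: tuple = (1, 7),
    windows: tuple = (7,),
) -> pd.DataFrame:
    """Hypothetical helper: add lag, rolling-window, and calendar features."""
    out = df.copy()
    out[time_col] = pd.to_datetime(out[time_col])
    out = out.sort_values(time_col).reset_index(drop=True)

    # Lag features: the observed value k steps earlier.
    for k in lags:
        out[f"{value_col}_lag_{k}"] = out[value_col].shift(k)

    # Rolling statistics over the previous w observations. shift(1) keeps the
    # current value out of its own feature, which avoids target leakage.
    past = out[value_col].shift(1)
    for w in windows:
        out[f"{value_col}_roll_mean_{w}"] = past.rolling(w).mean()
        out[f"{value_col}_roll_std_{w}"] = past.rolling(w).std()

    # Calendar features derived from the timestamp.
    out["dayofweek"] = out[time_col].dt.dayofweek  # Monday = 0
    out["month"] = out[time_col].dt.month
    out["is_weekend"] = (out["dayofweek"] >= 5).astype(int)
    return out


# Usage on a toy daily series (2024-01-01 is a Monday).
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=10, freq="D"),
    "sales": list(range(10)),
})
feats = add_time_series_features(df, "date", "sales", lags=(1, 7), windows=(3,))
print(feats[["date", "sales", "sales_lag_1", "sales_roll_mean_3", "is_weekend"]])
```

Shifting by one before taking rolling statistics is the key design choice here: without it, each row's rolling mean would include the very value the model is trying to predict.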

Layer 2: Safe Execution via Code Interpreter

A major challenge for LLM agents is handling arbitrary code execution. I cannot run unverified code written by an LLM directly on my application server or inside my main database container.

My solution is to offload all data science computations (EDA, feature engineering, Plotly charting) to an isolated Code Interpreter sandbox. This sandbox has pre-installed scientific libraries (pandas, scipy, statsmodels, plotly) but is locked down: CPU and memory are capped, network access is restricted, and it holds only read-only credentials for the input data.

Here is the pattern I used for the Advanced EDA tool:

from typing import Optional
from strands import tool

# parse_s3_uri, sandbox_client, and extract_json_from_stdout are helpers defined elsewhere in the project.

@tool
def run_advanced_eda(dataset_s3_path: str, time_column: Optional[str] = None, value_column: Optional[str] = None) -> str:
    """
    Runs stationarity tests (ADF, KPSS), seasonal decomposition, and ACF/PACF analysis.
    """
    # 1. Parse and validate the target S3 path
    bucket, key = parse_s3_uri(dataset_s3_path)
    s3_uri = f"s3://{bucket}/{key}"

    # 2. Build the Python script to run in the sandbox.
    # Values are injected with !r so they arrive as properly quoted Python literals.
    code = f'''
import pandas as pd
import numpy as np
import json
import subprocess
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.tsa.seasonal import seasonal_decompose

# Downloader helper - the sandbox has pre-configured read-only AWS CLI credentials
subprocess.run(['aws', 's3', 'cp', {s3_uri!r}, '/tmp/data.csv'], check=True)
df = pd.read_csv('/tmp/data.csv')

time_col = {time_column!r}    # None means auto-detect
value_col = {value_column!r}  # None means auto-detect
# ... Auto-detect time/value columns and clean data ...
series = df[value_col].dropna()

# Run ADF and KPSS tests
adf_stat, adf_pvalue, _, _, _, _ = adfuller(series)
kpss_stat, kpss_pvalue, _, _ = kpss(series, regression='c')

# Return results as JSON (NumPy scalars cast to plain floats)
print("===JSON_START===")
print(json.dumps({{
    "adf": {{"statistic": float(adf_stat), "p_value": float(adf_pvalue)}},
    "kpss": {{"statistic": float(kpss_stat), "p_value": float(kpss_pvalue)}}
}}))
print("===JSON_END===")
'''
    # 3. Send script to the Code Interpreter sandbox and parse stdout
    stdout = sandbox_client.execute(code)
    return extract_json_from_stdout(stdout)

Because the tool only assembles a string of Python and hands it to the sandbox, none of the generated code executes on the agent's host server. The sandbox pulls the data from S3, computes the statistics, and prints structured JSON between sentinel markers, which the agent extracts from stdout.
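The EDA tool relies on `parse_s3_uri` and `extract_json_from_stdout`, which aren't shown above. Below is a minimal sketch of both, plus a hypothetical `interpret_stationarity` helper showing how the agent might turn the two p-values into a modeling decision. ADF's null hypothesis is a unit root (non-stationary) while KPSS's null is stationarity, so the tests are read together; the four-way rule below is a widely used rule of thumb, and the 0.05 threshold is an assumption.

```python
import json

JSON_START, JSON_END = "===JSON_START===", "===JSON_END==="


def parse_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/path/to/key' into ('bucket', 'path/to/key')."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"S3 URI needs both a bucket and a key: {uri!r}")
    return bucket, key


def extract_json_from_stdout(stdout: str) -> str:
    """Return the JSON text printed between the sentinel markers, validated."""
    start = stdout.find(JSON_START)
    if start == -1:
        raise ValueError("sandbox output has no JSON_START marker")
    start += len(JSON_START)
    end = stdout.find(JSON_END, start)
    if end == -1:
        raise ValueError("sandbox output has no JSON_END marker")
    payload = stdout[start:end].strip()
    json.loads(payload)  # raises if the sandbox printed malformed JSON
    return payload


def interpret_stationarity(result_json: str, alpha: float = 0.05) -> str:
    """Combine ADF and KPSS p-values into one verdict (rule of thumb)."""
    result = json.loads(result_json)
    # ADF H0: unit root. Rejecting it (small p) is evidence of stationarity.
    adf_says_stationary = result["adf"]["p_value"] < alpha
    # KPSS H0: stationarity. Failing to reject it is evidence of stationarity.
    kpss_says_stationary = result["kpss"]["p_value"] >= alpha
    if adf_says_stationary and kpss_says_stationary:
        return "stationary"
    if not adf_says_stationary and not kpss_says_stationary:
        return "non-stationary"          # difference the series
    if kpss_says_stationary:
        return "trend-stationary"        # remove the trend
    return "difference-stationary"       # differencing is appropriate
```

Validating the payload inside `extract_json_from_stdout` means a crashed or truncated sandbox run surfaces as a clear tool error rather than as half-parsed text handed to the LLM.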


Layer 3: Scalable ML Training and Deployment on SageMaker

While Code Interpreter is great for light calculations, it lacks the memory and compute capacity to train deep learning models (like LSTMs) or run large-scale hyperparameter searches. For that, I delegate to AWS SageMaker.

When the agent reaches the tuning and training steps, it uses the AWS SDK (boto3) to launch real SageMaker jobs on prebuilt Scikit-Learn or PyTorch containers from Amazon ECR. Step 4 runs HPO (hyperparameter optimization) tuning jobs; individual fits go through a training-job tool like this one:

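import json
import os

import boto3
from strands import tool

sagemaker = boto3.client('sagemaker')  # must run in the same region as the ECR image below (us-east-1)
BUCKET = os.environ['FORECAST_BUCKET']  # hypothetical: output bucket name supplied via environment

@tool
def create_sagemaker_training_job(
    job_name: str,
    dataset_s3_path: str,
    role_arn: str
) -> str:
    """
    Launches a Scikit-Learn ECR container on SageMaker for training.
    """
    try:
        response = sagemaker.create_training_job(
            TrainingJobName=job_name,
            AlgorithmSpecification={
                'TrainingImage': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:1.2-1-cpu-py3',
                'TrainingInputMode': 'File'
            },
            RoleArn=role_arn,
            InputDataConfig=[{
                'ChannelName': 'train',
                'DataSource': {
                    'S3DataSource': {
                        'S3DataType': 'S3Prefix',
                        'S3Uri': dataset_s3_path,
                    }
                },
                'ContentType': 'text/csv'
            }],
            OutputDataConfig={
                'S3OutputPath': f's3://{BUCKET}/sagemaker/models/'
            },
            ResourceConfig={
                'InstanceType': 'ml.m5.xlarge',
                'InstanceCount': 1,
                'VolumeSizeInGB': 10
            },
            HyperParameters={
                # Script-mode framework containers need an entry point and a source bundle.
                # Hypothetical names/locations; JSON-encoded as the SageMaker Python SDK does.
                'sagemaker_program': json.dumps('train.py'),
                'sagemaker_submit_directory': json.dumps(f's3://{BUCKET}/code/sourcedir.tar.gz'),
                # ARIMA orders, passed to the training script as strings
                'p': '2',
                'd': '1',
                'q': '3',
                'seasonal-p': '1',
            },
            StoppingCondition={'MaxRuntimeInSeconds': 3600}
        )
        return json.dumps({'success': True, 'job_arn': response['TrainingJobArn']})
    except Exception as e:
        # Report failures to the agent as data rather than raising, so it can decide what to do next
        return json.dumps({'success': False, 'error': str(e)})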
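The tool above fits one fixed configuration. Step 4's Bayesian search would instead go through boto3's `create_hyper_parameter_tuning_job`, which launches many training jobs and chooses each new set of hyperparameters based on earlier results. A minimal sketch of the request follows; the function names, search ranges, the `validation:rmse` metric name, and the log line the training script must print are all assumptions. Because a non-built-in image reports no metrics SageMaker understands, `MetricDefinitions` tells the tuner which regex to scrape from the training logs, and a script-mode framework container would also need its entry-point settings passed in `static_hps`.

```python
def build_tuning_request(job_name, training_image, role_arn, train_s3_uri,
                         output_s3_uri, static_hps=None, max_jobs=20, max_parallel=2):
    """Hypothetical helper: kwargs for create_hyper_parameter_tuning_job (Bayesian)."""
    if len(job_name) > 32:
        raise ValueError("tuning job names are limited to 32 characters")
    return {
        "HyperParameterTuningJobName": job_name,
        "HyperParameterTuningJobConfig": {
            "Strategy": "Bayesian",
            "HyperParameterTuningJobObjective": {
                "Type": "Minimize",
                "MetricName": "validation:rmse",
            },
            "ResourceLimits": {
                "MaxNumberOfTrainingJobs": max_jobs,
                "MaxParallelTrainingJobs": max_parallel,
            },
            # Assumed ARIMA search space; range bounds are passed as strings.
            "ParameterRanges": {
                "IntegerParameterRanges": [
                    {"Name": "p", "MinValue": "0", "MaxValue": "5"},
                    {"Name": "d", "MinValue": "0", "MaxValue": "2"},
                    {"Name": "q", "MinValue": "0", "MaxValue": "5"},
                ],
            },
        },
        "TrainingJobDefinition": {
            "StaticHyperParameters": dict(static_hps or {}),
            "AlgorithmSpecification": {
                "TrainingImage": training_image,
                "TrainingInputMode": "File",
                # Assumes the training script logs a line like "validation:rmse=12.3".
                "MetricDefinitions": [
                    {"Name": "validation:rmse", "Regex": r"validation:rmse=([0-9.]+)"},
                ],
            },
            "RoleArn": role_arn,
            "InputDataConfig": [{
                "ChannelName": "train",
                "DataSource": {"S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": train_s3_uri}},
                "ContentType": "text/csv",
            }],
            "OutputDataConfig": {"S3OutputPath": output_s3_uri},
            "ResourceConfig": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1, "VolumeSizeInGB": 10},
            "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
        },
    }


def launch_tuning_job(request, client=None):
    """Submit the request; returns immediately with the tuning job ARN."""
    if client is None:
        import boto3  # lazy import so the request builder can be tested without AWS
        client = boto3.client("sagemaker")
    return client.create_hyper_parameter_tuning_job(**request)["HyperParameterTuningJobArn"]
```

Keeping `max_parallel` small relative to `max_jobs` matters for Bayesian search: jobs launched in parallel cannot learn from each other's results.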

Because SageMaker jobs run asynchronously and can take 5 to 60 minutes, the agent tracks them with get_training_job_status, a quick, non-blocking status check that it calls repeatedly until the job reaches a terminal state. Once the job succeeds, the agent calls deploy_sagemaker_model, which registers the trained artifact (model.tar.gz in S3) as a SageMaker model and hosts it behind a real-time endpoint.
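A minimal sketch of what those two tools could do with boto3, using only the public SageMaker API. The names `check_training_job` and `deploy_model_artifact`, the `ml.m5.large` instance type, and reusing one name for the model, endpoint config, and endpoint are illustrative choices, not the project's code. The client is injectable so the logic can be exercised without AWS.

```python
import json

TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def _sagemaker_client():
    import boto3  # imported lazily so the functions can be tested with a stub client
    return boto3.client("sagemaker")


def check_training_job(job_name: str, client=None) -> str:
    """One non-blocking status check; the agent calls it again later if not done."""
    client = client or _sagemaker_client()
    desc = client.describe_training_job(TrainingJobName=job_name)
    status = desc["TrainingJobStatus"]  # InProgress | Completed | Failed | Stopping | Stopped
    summary = {
        "job_name": job_name,
        "status": status,
        "secondary_status": desc.get("SecondaryStatus"),
        "done": status in TERMINAL_STATES,
    }
    if status == "Failed":
        summary["failure_reason"] = desc.get("FailureReason")
    if status == "Completed":
        # S3 location of the model.tar.gz produced by the job
        summary["model_data_url"] = desc["ModelArtifacts"]["S3ModelArtifacts"]
    return json.dumps(summary)


def deploy_model_artifact(name, image_uri, model_data_url, role_arn,
                          instance_type="ml.m5.large", environment=None, client=None) -> str:
    """Model, then endpoint config, then endpoint: the three calls behind a real-time endpoint."""
    client = client or _sagemaker_client()
    client.create_model(
        ModelName=name,
        PrimaryContainer={
            "Image": image_uri,
            "ModelDataUrl": model_data_url,
            "Environment": dict(environment or {}),
        },
        ExecutionRoleArn=role_arn,
    )
    client.create_endpoint_config(
        EndpointConfigName=name,
        ProductionVariants=[{
            "VariantName": "AllTraffic",
            "ModelName": name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
        }],
    )
    response = client.create_endpoint(EndpointName=name, EndpointConfigName=name)
    # create_endpoint returns while the endpoint is still "Creating";
    # poll describe_endpoint until "InService" before invoking it.
    return json.dumps({"endpoint_name": name, "endpoint_arn": response["EndpointArn"]})
```

Returning a small JSON summary with an explicit `done` flag gives the LLM an unambiguous signal for whether to poll again or move on, instead of asking it to interpret raw API responses.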


The Two-Phase Data Pattern for Reports

Generating and serving rich reports raises a specific security problem. The Code Interpreter sandbox, which generates the interactive Plotly HTML files, should never have write access to my production S3 buckets, nor should it hold long-term AWS credentials.

To solve this, I implemented a Two-Phase Data Handling pattern:

sequenceDiagram
    participant Agent as AgentCore (Has S3 Write Credentials)
    participant CI as Code Interpreter Sandbox (No S3 Write Credentials)
    participant S3 as AWS S3

    Agent->>CI: Execute charting script
    CI->>CI: Generate Plotly chart & serialize to HTML
    CI-->>Agent: Print HTML string to stdout
    Agent->>Agent: Extract HTML string from stdout
    Agent->>S3: Upload HTML content using boto3
    S3-->>Agent: Confirm upload
    Agent->>Agent: Generate presigned URL locally (boto3 signs it)

By keeping the AWS write credentials strictly in the Orchestration Layer (AgentCore) and treating the Code Interpreter as a pure calculation engine, I maintain least-privilege security across my cloud infrastructure.
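On the agent side, the pattern needs only two boto3 S3 calls: `put_object` for the HTML and `generate_presigned_url` for the link. A minimal sketch, assuming the sandbox script wraps its output in hypothetical `===HTML_START===`/`===HTML_END===` markers (mirroring the JSON markers in the EDA tool); `extract_html` and `publish_report` are illustrative names.

```python
HTML_START, HTML_END = "===HTML_START===", "===HTML_END==="

# Inside the sandbox, the charting script would end with something like:
#   print("===HTML_START===")
#   print(fig.to_html(full_html=True, include_plotlyjs="cdn"))
#   print("===HTML_END===")
# include_plotlyjs="cdn" references plotly.js from a CDN instead of inlining
# the multi-megabyte library, which keeps stdout small.


def extract_html(stdout: str) -> str:
    """Phase 1: pull the HTML document out of the sandbox's stdout."""
    start = stdout.find(HTML_START)
    if start == -1:
        raise ValueError("sandbox output has no HTML_START marker")
    start += len(HTML_START)
    end = stdout.find(HTML_END, start)
    if end == -1:
        raise ValueError("sandbox output has no HTML_END marker")
    return stdout[start:end].strip()


def publish_report(stdout: str, bucket: str, key: str,
                   expires_in: int = 3600, s3_client=None) -> str:
    """Phase 2: upload with the agent's credentials and return a presigned GET URL."""
    if s3_client is None:
        import boto3  # lazy import so the function can be tested with a stub
        s3_client = boto3.client("s3")
    html = extract_html(stdout)
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=html.encode("utf-8"),
        ContentType="text/html; charset=utf-8",  # so browsers render the page instead of downloading it
    )
    # Signing happens locally in boto3; no request is sent to S3 here.
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )
```

Because the sandbox only ever prints text, it never needs write access. The presigned URL is the only thing the end user receives, and it stops working after `expires_in` seconds, or earlier if it was signed with temporary credentials that expire first.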


Conclusion & Lessons Learned

Building an autonomous ML pipeline taught me several key lessons:

  1. Match Compute to the Workload: Keep light calculations (cleaning, simple charts) inside fast, cheap sandboxes, and reserve expensive cloud instances (SageMaker ml.m5/ml.p3 nodes) for heavy model training.
  2. Handle Asynchrony Gracefully: A tool call blocks the agent until it returns, so a tool that waits out a 30-minute SageMaker job stalls the whole run. Launch jobs without waiting, record their state explicitly, and let the agent check progress with a lightweight status tool it calls periodically.
  3. Sandbox Everything: Never run LLM-generated code directly on your own hosts. Use a secure container runtime with strict network isolation.

By combining AWS Bedrock AgentCore's reasoning capabilities with SageMaker's massive scaling power, I successfully took a standard time-series pipeline and turned it into a fully autonomous MLOps platform.