# Running Codex on a local model with LM Studio

The orchestration system I am building hands Codex jobs from a DAG, and Codex runs the whole agentic loop for each of them, writing files, running shell commands, editing code, and checking its own work. Most of those jobs are small, renaming something, scaffolding a file, summarizing a diff, judging whether an output meets a criterion, and none of them needs a frontier model, yet every one of them goes out over the wire and shows up on the bill. The harness that plans, calls tools, and verifies is identical wherever the model lives, so I wanted the small jobs answered by a model running on my own Mac, for free and offline. I assumed this was a config change, it took the whole afternoon instead, and this blogpost collects everything that broke along the way to a setup that works.

## Choosing a model and getting it onto the machine

A model that sits inside an agent loop gets called many times per task, so latency compounds and the runtime choice matters. On a Mac that choice is [MLX](https://github.com/ml-explore/mlx), Apple's Metal-native array framework, which drives the unified-memory GPU directly and is the fastest option on Apple Silicon at a given quantization. [vLLM](https://github.com/vllm-project/vllm) has no real Metal backend, and the GGUF path in Ollama or LM Studio runs on llama.cpp, which MLX edges out at the same quant.

For the model itself, the lever is active parameters rather than total. A Mixture-of-Experts model only fires a slice of its weights on each token, so a 35B model with 3B active generates faster than a 27B dense model while carrying more total capacity for quality. I settled on [`unsloth/Qwen3.6-35B-A3B`](https://huggingface.co/unsloth) in MLX 4-bit, about 21.6 GB on disk with 3B active per token, the strongest and the fastest of everything I tried.

Two download problems cost me an hour before the model ever loaded. Unauthenticated Hugging Face pulls are throttled to a crawl, so set a free token. And Unsloth's repos use Xet storage, whose client kept freezing at 0 MB/s with a pile of half-downloaded chunks, while turning Xet off and falling back to plain HTTPS jumped the download straight to around 10 MB/s:

```bash
export HF_TOKEN=hf_xxx                 # otherwise the download crawls
export HF_HUB_DISABLE_XET=1            # the real fix — Xet stalls, plain HTTPS doesn't
export HF_HUB_ENABLE_HF_TRANSFER=1
uv tool install mlx-lm --with hf-transfer
```

## An afternoon of 404s

MLX ships a server, `mlx_lm.server`, that exposes an OpenAI-compatible endpoint, and Codex lets you define a custom model provider pointed at any OpenAI-compatible URL, so pointing one at the other looked like the whole job. Instead, every agent turn came back `404` and Codex retried, forever, and I watched it loop for forty minutes before digging into why.

The reason is which HTTP API Codex speaks. Modern Codex, 0.139 in my case, drives a custom model provider over the OpenAI Responses API, meaning it POSTs to `/v1/responses`, while `mlx_lm.server` only implements the older Chat Completions API at `/v1/chat/completions`. Codex knocks on a door the MLX server does not have, gets a 404, assumes a transient failure, and tries again. There used to be an escape hatch, `wire_api = "chat"`, which forced Codex back onto chat-completions, and it was removed in 0.139, so setting it now throws `wire_api = "chat" is no longer supported` and refuses to start. The advice the internet still gives you for this situation is a hard error today, and the honest conclusion is that a bare MLX server cannot run Codex's agent loop at all.

Two more failures showed up while I was flailing, and both look like a broken model when they have nothing to do with the model. A reconnect loop printing `rmcp ... Auth(AuthorizationRequired)` on every single run turned out to be a stale remote MCP server still listed in my Codex config, trying and failing to authenticate before any real work could start. And an older LM Studio build flatly rejected Codex's tool schema with `tools.N.type invalid`, which a current build accepts without complaint.

## The setup that works

Instead of forcing `mlx_lm.server` to speak the Responses API, I served the same MLX model through LM Studio and used Codex's built-in local-provider path, `--oss --local-provider lmstudio`, which speaks chat-completions and handles tool calls correctly. With that in place, Codex's agent harness ran entirely on the local Qwen3.6 and created a file on disk, about 20k tokens of agent loop with nothing leaving the machine.

Four conditions had to hold at once, and each one had failed independently before they lined up. The MLX model has to be served by LM Studio rather than converted to GGUF, and there is nothing to re-download, since the Hugging Face snapshot can be symlinked straight into `~/.lmstudio/models/<org>/<repo>` and LM Studio picks it up as-is. It has to load with a large context window, because at 8192 tokens Codex's own agent system prompt overflows before your task even starts and you get the cryptic `tokens to keep is greater than the context length`, while 32768 works. Any auth-failing MCP server has to come out of `~/.codex/config.toml`, or the `rmcp` reconnect loop stalls every run before the model gets a turn. And LM Studio itself has to be a current build, for the tool-schema reason above.

```bash
lms load qwen3.6-35b-a3b-ud-mlx --gpu max --context-length 32768 && lms server start
codex exec --oss --local-provider lmstudio -m qwen3.6-35b-a3b-ud-mlx \
  -C /tmp/t --sandbox workspace-write --skip-git-repo-check -c 'mcp_servers={}' \
  "Create hello.txt containing: hi. Then stop."   # → writes the file, zero cloud
```

One model-specific detail worth knowing before you blame the setup: Qwen3.6 is a thinking model, it writes its reasoning into a separate channel and only emits the real answer after it closes `</think>`, so if you starve it on `max_tokens` it spends the entire budget thinking and hands back empty content. Give it room to finish the thought.

## Splitting the work between local and cloud

With local Codex working, the decision stops being all-or-nothing. The multi-step changes where a wrong move is expensive stay on cloud Codex, since that is where the capability gap is real and worth paying for, while the long tail of simpler tasks, generating a file, renaming across a module, summarizing a run, judging whether an output meets a bar, runs locally for nothing, offline, with no data leaving the machine. In my setup the split is decided by an eval pass rather than by feel: a node runs on both engines, the outputs are compared, and the node only graduates to the local model once it demonstrably matches cloud quality on that task.

## A DSPy-style loop where generation is free

The part I wanted all of this for is prompt optimization. A [DSPy](https://github.com/stanfordnlp/dspy)-style loop proposes candidate prompts, runs them, scores which is better, and promotes the winner, and the cost and the risk of that loop live in different halves. Proposing and running happens hundreds of times per round and is forgiving, since you generate several candidates and keep only the best, so a single bad generation is harmless. Scoring happens a handful of times and is unforgiving, since a biased scorer optimizes you toward the wrong prompt and every later round inherits the mistake. So generation goes to the local model, where it costs nothing, and scoring runs as a Codex LLM judge on the most consistent model available, billed but tiny.

The judge needs one piece of care to be trustworthy. LLM judges have a real, measurable preference for whichever output they read first, so I judge twice with the order swapped between rounds, and only trust a verdict that survives the swap:

```js
// Round 1: A first.  Round 2: B first (positions swapped, then remapped).
if (round1Winner === round2Winner && round1Winner !== "tie") {
  verdict = round1Winner;   // it agreed with itself under reordering — trust it
} else {
  verdict = "tie";          // it flipped when you flipped the order — that was bias
}
```

If reordering the two outputs changes the answer, the preference was positional rather than a judgment of quality, and collapsing it to a tie throws that noise away. I also instruct the judge to grade depth and specificity rather than the mere presence of the required pieces, and to default to a fail unless the work is good, because a lenient judge waves box-ticking work through and the optimizer happily learns to produce more of it.

The last piece of glue is structured output. Every candidate result and every verdict has to parse into a record, and a thinking model will occasionally wrap its JSON in prose or a markdown fence no matter how you asked, so every response goes through a small salvage extractor that finds the first balanced object and validates it before accepting, and a failed parse becomes a dropped case rather than a crash:

```go
func extractJSON(text string) json.RawMessage {
    t := strings.TrimSpace(text)
    if json.Valid([]byte(t)) {
        return json.RawMessage(t)
    }
    first := strings.IndexAny(t, "[{")
    last := strings.LastIndexAny(t, "]}")
    if first >= 0 && last > first {
        if cand := t[first : last+1]; json.Valid([]byte(cand)) {
            return json.RawMessage(cand)
        }
    }
    return nil // a failed parse is a skipped case, not an exception
}
```

The evalset and the history of promotions are stored the same defensive way, newline-delimited JSON appended as the loop runs and read back line by line, so one malformed line drops itself instead of poisoning the whole file.

## Where this setup ends

The configuration above is specific to Codex 0.139 and its Responses-API requirement, and both Codex and LM Studio move quickly, so the exact flags will age. What I expect to age better is the shape of the thing: an agent harness pointed at a local brain for the forgiving work, a billed judge for scoring, since scoring mistakes compound across rounds, and an eval deciding which task goes where. If you try the same wiring against a different local server, I would be curious whether the Responses-versus-chat-completions mismatch bites there too.

## References

- [MLX](https://github.com/ml-explore/mlx)
- [vLLM](https://github.com/vllm-project/vllm)
- [Unsloth model repos](https://huggingface.co/unsloth)
- [LM Studio](https://lmstudio.ai)
- [DSPy](https://github.com/stanfordnlp/dspy)


Running Codex on a local model with LM Studio


# Running time-series forecasting in the browser with Rust and WebAssembly

Asking an enterprise user to upload historical sales or inventory CSVs to your servers starts a compliance process: InfoSec questionnaires, data processing agreements, GDPR and CCPA audits, and the standing fear of a leak. When I built [WaySightAI](https://github.com/xdanny/WaysightAI), I wanted to skip that conversation entirely, so I set one strict constraint: raw time-series data never leaves the user's browser. The backend handles authentication, saved metadata such as project names and configurations, usage limits, and high-level AI orchestration, while the actual work, parsing the CSV, cleaning the data, running stationarity tests, and computing forecasts, happens locally. To get statistical models like ARIMA and Holt-Winters running client-side at a reasonable speed, I wrote the core math in Rust and compiled it to WebAssembly, and this blogpost goes through how the pieces fit together.

## The monorepo

WaySightAI is organized as a monorepo to keep a clear boundary between the UI, the mathematical engine, and the management layer:

```text
packages/
  web/        React 18 + TypeScript + Vite (renders the workflow UI)
  wasm/       Rust forecasting engine compiled with wasm-pack
  api/        FastAPI backend (SQLAlchemy, Clerk authentication, Redis, Stripe hooks)
```

The forecasting path is strictly client-side:

1. The user uploads a CSV in the React web application.
2. The browser parses the CSV locally and lets the user select date and target columns.
3. The React app loads the `wasm` package dynamically and transfers the numeric vectors to the WASM memory buffer.
4. The Rust/WASM engine processes the calculations and returns JSON containing historical fitted values, forecast points, confidence intervals, and diagnostics.
5. The React app draws the charts using local data.

## The Rust core

Rust gives us deterministic memory management, zero-cost abstractions, and numerical libraries like `ndarray` and `statrs`. Uploaded data usually contains missing values and extreme anomalies that would break classical models like ARIMA, so preprocessing runs first, and here is the IQR-based outlier detection from `packages/wasm/src/preprocessing.rs`:

```rust
use serde::{Deserialize, Serialize};
use wasm_bindgen::prelude::*;

#[derive(Serialize, Deserialize)]
pub struct OutlierIndices {
    pub indices: Vec<usize>,
    pub z_scores: Vec<f64>,
}

/// Detect outliers using the IQR method
#[wasm_bindgen]
pub fn detect_outliers_iqr(values: &[f64]) -> Result<JsValue, JsValue> {
    let mut valid_values: Vec<(usize, f64)> = values
        .iter()
        .enumerate()
        .filter(|(_, &v)| v.is_finite())
        .map(|(i, &v)| (i, v))
        .collect();

    if valid_values.len() < 4 {
        return Ok(serde_wasm_bindgen::to_value(&OutlierIndices {
            indices: vec![],
            z_scores: vec![],
        })?);
    }

    // Sort to find quartiles
    valid_values.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());

    let q1_idx = valid_values.len() / 4;
    let q3_idx = (valid_values.len() * 3) / 4;
    let q1 = valid_values[q1_idx].1;
    let q3 = valid_values[q3_idx].1;
    let iqr = q3 - q1;

    let lower_bound = q1 - 1.5 * iqr;
    let upper_bound = q3 + 1.5 * iqr;

    let mean: f64 = valid_values.iter().map(|(_, v)| v).sum::<f64>() / valid_values.len() as f64;
    let variance: f64 = valid_values
        .iter()
        .map(|(_, v)| (v - mean).powi(2))
        .sum::<f64>()
        / valid_values.len() as f64;
    let std_dev = variance.sqrt();

    let mut outlier_indices = Vec::new();
    let mut z_scores = Vec::new();

    for (idx, &val) in values.iter().enumerate() {
        if val.is_finite() && (val < lower_bound || val > upper_bound) {
            outlier_indices.push(idx);
            z_scores.push((val - mean) / std_dev);
        }
    }

    Ok(serde_wasm_bindgen::to_value(&OutlierIndices {
        indices: outlier_indices,
        z_scores,
    })?)
}
```

Compiled with `wasm-pack`, this function is directly callable from JavaScript, which passes array buffers in and receives the results back as native JS values through `serde_wasm_bindgen`, and it runs in microseconds.

## Loading the engine from React

To make loading the compiled WASM file painless, I built a TypeScript wrapper class that handles dynamic imports, module initialization, and the conversion between standard JavaScript arrays and the typed arrays WebAssembly wants:

```typescript
import type {
  WaySightAIWasm,
  ForecastConfig,
  ForecastResult,
  DataStats
} from '../../types/wasm';

export class WasmEngine {
  private wasmModule: WaySightAIWasm | null = null;
  private isInitialized = false;
  private loadingPromise: Promise<void> | null = null;

  /**
   * Dynamically import and initialize the WASM module
   */
  async load(): Promise<void> {
    if (this.loadingPromise) return this.loadingPromise;
    if (this.isInitialized && this.wasmModule) return Promise.resolve();

    this.loadingPromise = (async () => {
      try {
        // Dynamically import the JS entry point generated by wasm-pack
        const wasmModule = await import('/pkg/waysightai_wasm.js');

        // Initialize the WebAssembly binary instance
        await wasmModule.default();
        wasmModule.init();

        this.wasmModule = wasmModule as unknown as WaySightAIWasm;
        this.isInitialized = true;
        console.log('[WasmEngine] WASM module loaded successfully');
      } catch (error) {
        console.error('[WasmEngine] Failed to load WASM module:', error);
        throw error;
      }
    })();

    await this.loadingPromise;
    this.loadingPromise = null;
  }

  private ensureInitialized(): WaySightAIWasm {
    if (!this.isInitialized || !this.wasmModule) {
      throw new Error('WASM module not initialized. Call load() first.');
    }
    return this.wasmModule;
  }

  /**
   * Run a local forecast job
   */
  async runForecast(
    timestamps: number[],
    values: number[],
    isUnixMs: boolean,
    config: ForecastConfig
  ): Promise<ForecastResult> {
    const wasm = this.ensureInitialized();
    const configJson = JSON.stringify(config);
    
    // Call the exported Rust function
    const resultJson = wasm.run_forecast(timestamps, values, isUnixMs, configJson);
    return JSON.parse(resultJson);
  }
}
```

wasm-pack produces the `.wasm` binary alongside a `.js` wrapper, and Vite loads both dynamically on demand, so the initial page bundle stays small and the 1.2MB forecasting package downloads only when the user reaches the active editor dashboard.

## Browser performance

Running ARIMA, which relies on MLE optimization, and triple exponential smoothing in a single-threaded browser environment sounds slow, and the measurements say otherwise:

| Dataset Size (Rows) | Operation | Execution Location | Duration (ms) |
| :--- | :--- | :--- | :--- |
| 500 rows | Stationarity Tests (ADF + KPSS) | Browser (WASM) | 8ms |
| 500 rows | ARIMA(1,1,1) Parameter Tuning | Browser (WASM) | 45ms |
| 5,000 rows | Outlier Detection & Imputation | Browser (WASM) | 12ms |
| 5,000 rows | Holt-Winters Forecast (Horizon=30) | Browser (WASM) | 185ms |

For business dashboards with datasets between 100 and 10,000 rows this is effectively instantaneous, while sending the same data to a Python service to parse, compute, serialize, and return takes 800ms to 2.5 seconds before server cold starts or database queries enter the picture. Local computation also enables a kind of interactivity a round-trip cannot offer: the seasonal parameters (alpha, beta, gamma) are exposed as sliders in React, and the chart updates in under 50ms as the WASM engine recalculates, so the user can feel what each parameter does instead of reading about it.

## The tradeoffs

Going fully client-side comes with real costs, and I want to name them rather than sell past them. The WASM binary has to be fetched on first load, which adds a couple of seconds of initialization on slow connections. The compiled code ships to the client, so anything proprietary in it can be reverse-engineered, which does not matter for standard algorithms like ARIMA and the exponential smoothing family, and rules this approach out for models you need to keep secret. And the browser has hard CPU and memory limits, so a dataset with millions of rows will lag or hit the memory ceiling, which is where a server-side orchestrator like SageMaker still belongs.

What the constraint buys in exchange is a backend that does no math at all. Since my servers never touch the data, the FastAPI backend runs on a cheap container instance, and thousands of concurrent users cost nothing in auto-scaling CPU nodes or GPU workers, because the heavy lifting happens on hardware the users already own.

## When to consider it

If you are building a data-heavy SaaS tool, it is worth asking which of your computations need the data shipped to them and which could run where the data already lives. For standard statistical forecasting on business-sized datasets, the browser turned out to be quite enough, and the compliance conversation it removes is the part my enterprise users notice most.

## References

- [WaySightAI](https://github.com/xdanny/WaysightAI)
- [wasm-pack](https://github.com/rustwasm/wasm-pack)


Running time-series forecasting in the browser with Rust and WebAssembly


# Orchestrating a forecasting pipeline with Bedrock AgentCore and SageMaker

Most of the work in a machine learning pipeline is not the modeling, it is the plumbing around it: cleaning datasets, checking stationarity and seasonality, engineering lag and rolling-window features, running hyperparameter search, tracking jobs, building reports, and deploying endpoints. That work usually lives in manually driven Jupyter notebooks, and I wanted to see how much of it an LLM agent can run on its own. So I built a forecasting system that takes a raw CSV in S3, performs the statistical analysis, tunes hyperparameters, trains ARIMA and LSTM models, and deploys a live endpoint, with an LLM-driven orchestrator managing every step. The architecture has three layers, AWS Bedrock AgentCore for reasoning, a Code Interpreter sandbox for data science code, and SageMaker for heavy training, and most of what I learned lives in the seams between them.

## The three layers

An autonomous system needs to split concerns between reasoning, executing code, and running heavy machine learning jobs:

```mermaid
graph TD
    subgraph Layer1["Layer 1: Orchestration (AgentCore)"]
        Agent["Intelligent Agent<br/>(16 Tools)"]
        Boto3["AWS SDK (boto3)"]
    end

    subgraph Layer2["Layer 2: Data Science (Code Interpreter)"]
        Sandbox["Isolated Python Sandbox"]
        EDA["EDA / Statistical Tests"]
        Features["Feature Engineering"]
        Plotly["Plotly HTML Reports"]
    end

    subgraph Layer3["Layer 3: Scalable ML (SageMaker)"]
        Tuning["Bayesian HPO Jobs"]
        Training["Scale Model Training"]
        Endpoint["Production REST Endpoint"]
    end

    Agent --> Sandbox
    Agent --> Boto3
    Boto3 --> Tuning
    Boto3 --> Training
    Boto3 --> Endpoint
```

## Layer 1: the orchestrator

The orchestrator is built on [AWS Bedrock AgentCore](https://aws.amazon.com/bedrock/), packaged via the Strands library, and holds 16 specialized tools ranging from statistical analysis to SageMaker deployment commands. In `agent.py` the agent is initialized with plain Python tool references:

```python
from strands import Agent
from agents.advanced_eda_agent import run_advanced_eda
from agents.intelligent_feature_engineering_agent import recommend_features, create_features
from agents.sagemaker_simple import (
    create_sagemaker_training_job,
    get_training_job_status,
    deploy_sagemaker_model,
    invoke_sagemaker_endpoint
)

agent = Agent(
    name="IntelligentForecastingAgent",
    description="Intelligent time series forecasting with ARIMA and LSTM model comparison.",
    tools=[
        run_advanced_eda,
        recommend_features,
        create_features,
        create_sagemaker_training_job,
        get_training_job_status,
        deploy_sagemaker_model,
        invoke_sagemaker_endpoint,
        # ... other tools
    ]
)
```

With these tools the agent executes a seven-step workflow:

1. Advanced EDA: ADF/KPSS tests and seasonal decomposition.
2. Feature recommendations: the agent examines the stats and recommends features.
3. Feature engineering: rolling stats, lag columns, and calendar features.
4. Bayesian tuning: SageMaker HPO to find optimal hyperparameters.
5. Model training: classical models (ARIMA) and deep learning models (LSTM).
6. Comparison: metrics on hold-out test sets.
7. Report generation: the best model is deployed and an interactive Plotly report is delivered through a presigned URL.

## Layer 2: the sandbox

An LLM agent that writes and runs arbitrary code cannot be allowed to do so on the application server or anywhere near the production database, so all data science computation, the EDA, the feature engineering, the Plotly charting, is offloaded to an isolated Code Interpreter sandbox. The sandbox has the scientific libraries pre-installed (`pandas`, `scipy`, `statsmodels`, `plotly`) and is locked down with limited CPU, memory, and network permissions. Here is the pattern for the EDA tool:

```python
@tool
def run_advanced_eda(dataset_s3_path: str, time_column: str = None, value_column: str = None) -> str:
    """
    Runs stationarity tests (ADF, KPSS), seasonal decomposition, and ACF/PACF analysis.
    """
    # 1. Parse the target S3 paths
    bucket, key = parse_s3_uri(dataset_s3_path)
    
    # 2. Build the Python script to run in the sandbox
    code = f'''
import pandas as pd
import numpy as np
import json
import subprocess
from statsmodels.tsa.stattools import adfuller, kpss
from statsmodels.tsa.seasonal import seasonal_decompose

# Downloader helper - Sandbox has pre-configured read-only AWS CLI credentials
subprocess.run(['aws', 's3', 'cp', 's3://{bucket}/{key}', '/tmp/data.csv'], check=True)
df = pd.read_csv('/tmp/data.csv')

# ... Auto-detect time/value columns and clean data ...
series = df[value_col].dropna()

# Run ADF and KPSS tests
adf_stat, adf_pvalue, _, _, _, _ = adfuller(series)
kpss_stat, kpss_pvalue, _, _ = kpss(series, regression='c')

# Return results as JSON
print("===JSON_START===")
print(json.dumps({{
    "adf": {{"statistic": adf_stat, "p_value": adf_pvalue}},
    "kpss": {{"statistic": kpss_stat, "p_value": kpss_pvalue}}
}}))
print("===JSON_END===")
'''
    # 3. Send script to the Code Interpreter sandbox and parse stdout
    stdout = sandbox_client.execute(code)
    return extract_json_from_stdout(stdout)
```

The tool builds a Python script as a text block, ships it into the sandbox, and parses structured JSON back out of stdout between markers, so the agent's host never executes model-written code and still receives clean, structured results. The sandbox pulls the data from S3 with read-only credentials, computes the stats, and hands the numbers back.

## Layer 3: SageMaker

The sandbox is fine for light calculations and lacks the memory and compute to train LSTMs or run large hyperparameter searches, so at the training step the agent switches to plain `boto3` calls that launch real SageMaker jobs on preconfigured Scikit-Learn or PyTorch ECR containers:

```python
import boto3
import json

sagemaker = boto3.client('sagemaker')

@tool
def create_sagemaker_training_job(
    job_name: str,
    dataset_s3_path: str,
    role_arn: str
) -> str:
    """
    Launches a Scikit-Learn ECR container on SageMaker for training.
    """
    try:
        response = sagemaker.create_training_job(
            TrainingJobName=job_name,
            AlgorithmSpecification={
                'TrainingImage': '683313688378.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:1.2-1-cpu-py3',
                'TrainingInputMode': 'File'
            },
            RoleArn=role_arn,
            InputDataConfig=[{
                'ChannelName': 'train',
                'DataSource': {
                    'S3DataSource': {
                        'S3DataType': 'S3Prefix',
                        'S3Uri': dataset_s3_path,
                    }
                },
                'ContentType': 'text/csv'
            }],
            OutputDataConfig={
                'S3OutputPath': f's3://{BUCKET}/sagemaker/models/'
            },
            ResourceConfig={
                'InstanceType': 'ml.m5.xlarge',
                'InstanceCount': 1,
                'VolumeSizeInGB': 10
            },
            HyperParameters={
                'p': '2',
                'd': '1',
                'q': '3',
                'seasonal-p': '1',
            },
            StoppingCondition={'MaxRuntimeInSeconds': 3600}
        )
        return json.dumps({'success': True, 'job_arn': response['TrainingJobArn']})
    except Exception as e:
        return json.dumps({'success': False, 'error': str(e)})
```

SageMaker jobs run asynchronously for anywhere between 5 and 60 minutes, and an LLM is naturally sequential, so the agent tracks jobs through a polling tool, `get_training_job_status`, instead of blocking on them. Once a job succeeds, the agent calls `deploy_sagemaker_model` to package the `model.tar.gz` artifact from S3 and host it behind a real-time SageMaker endpoint.

## Keeping credentials out of the sandbox

The sandbox generates the interactive Plotly HTML reports, and it should never have write access to my production S3 buckets or hold long-term AWS credentials, which leads to a two-phase pattern for getting reports out:

```mermaid
sequenceDiagram
    participant Agent as AgentCore (Has S3 Write Credentials)
    participant CI as Code Interpreter Sandbox (No S3 Credentials)
    participant S3 as AWS S3

    Agent->>CI: Execute charting script
    CI->>CI: Generate Plotly chart & serialize to HTML
    CI-->>Agent: Print HTML string to stdout
    Agent->>Agent: Extract HTML string from stdout
    Agent->>S3: Upload HTML content using boto3
    S3-->>Agent: Return secure presigned URL
```

The sandbox prints the finished HTML to stdout, the orchestrator extracts it and uploads it with its own credentials, and the user receives a presigned URL. Write credentials exist only in the orchestration layer, and the sandbox stays a pure calculation engine, which keeps least-privilege intact across the whole system.

## What running it taught me

Splitting compute profiles matters more than I expected: cleaning and charting belong in cheap, fast sandboxes, and the expensive SageMaker instances should only ever see actual training, because without that boundary the agent happily runs everything on the biggest machine available. Asynchrony is the part that fights the LLM's nature, since the model wants to finish its turn and a 30-minute training job does not care, and explicit job-state tools with periodic polling were the only reliable answer I found. And sandboxing model-written code is not optional at any scale, it runs in a locked-down container or it does not run.

What I have at the end is a pipeline that goes from a raw CSV in S3 to a deployed forecasting endpoint without me driving any of the steps, and the open question I am left with is how far this same three-layer pattern stretches beyond forecasting, since nothing in it is specific to time series except the tools themselves.

## References

- [AWS Bedrock AgentCore](https://aws.amazon.com/bedrock/)
- [AWS SageMaker](https://aws.amazon.com/sagemaker/)
- [Strands SDK](https://strandsagents.com)


Orchestrating a forecasting pipeline with Bedrock AgentCore and SageMaker


# Fine-tuning a local 9B model for multi-turn text-to-SQL

```text
Show revenue by country.
Only France.
Now break it down by month.
That looks empty. Try the country code instead.
```

This is the four-message conversation I used to test a text-to-SQL agent I was building, and it is the reason this blogpost exists. The first two messages went fine, somewhere around the third the model lost track of which table we had been working with, and it answered the last message with SQL against a table it had abandoned earlier in the conversation. None of this shows up when you test one question at a time, which is how I had been testing, and it is also how most of the published benchmarks test.

## Single-turn benchmarks do not contain this failure

When a model reports something like 85% on text-to-SQL, the number usually comes from [Spider](https://yale-lily.github.io/spider) or [BIRD](https://arxiv.org/abs/2305.03111), where every question arrives alone with fresh context. Both benchmarks have driven real progress, and neither contains the failure above, because that failure needs history to exist. The benchmarks that do contain it are newer and less quoted: [SParC](https://arxiv.org/abs/1906.02285) runs sequences of related questions over unseen databases, [CoSQL](https://arxiv.org/abs/1909.05378) adds real dialogue with ambiguity and clarification requests, and [BIRD-Interact](https://arxiv.org/abs/2510.05318) goes furthest and expects the assistant to recover from execution errors and from its own earlier wrong answers. Frontier models that dominate the single-turn leaderboards drop considerably on these, and that drop is what turned my broken demo into an experiment.

What made the experiment tempting is the shape of the failures. Forgetting the active table, dropping a filter established two turns earlier, not resolving what "it" refers to in "break it down by month": these read as behavior problems rather than intelligence problems, and behavior is what fine-tuning is good at shaping. So I set out to measure how much of the multi-turn gap a small local model can close with task-specific training.

## The experiment

The base model is [Qwen 3.5 9B](https://huggingface.co/Qwen), picked because it runs on consumer hardware, Unsloth supports it well, and the instruction-tuned variant already writes reasonable SQL. Training was bf16 LoRA on a single RTX 5090, rank 32, alpha 64, on the standard attention and MLP projections, and each run took about four hours.

The evaluation deserves more words than the training, because two decisions there shape every number that follows. The data is a CoSQL proxy of 100 turns across 32 real dialogs, and the results below come from a fixed subset of 43 follow-up turns from 12 of those dialogs, so every run in the comparison saw the same rows, the same scorer, and the same protocol. And the conversation history is generated, not teacher-forced: at every turn the model sees its own previous SQL, right or wrong, instead of the reference answer it should have produced. Teacher forcing is the more flattering setup and a useful diagnostic, but it evaluates an assistant that cannot exist in production, where turn 3 has to live with whatever turn 2 did. Generated history makes every number in this post lower than its teacher-forced equivalent, and it is the setup that matches what I would deploy.

The frontier baseline is Claude Sonnet 4.6 through [OpenRouter](https://openrouter.ai/), on exactly the same rows, scorer, and protocol.

## Five training strategies

[DIN-SQL](https://arxiv.org/abs/2304.11015) made the case that text-to-SQL decomposes into subtasks: choosing tables, understanding what values are stored, understanding what the user wants, generating the code, catching mistakes. If that is right, the interesting question for multi-turn work is which subtask carries the failure, so each strategy below trains a different bet.

### #1 Direct SQL

The control: train on the user's question and the database schema, producing SQL directly with no intermediate representation. If this does not beat the base model, nothing else here matters.

### #2 Semantic decomposition

Before any SQL, the model first writes down what the user is asking in structured terms, which entity, which metric, which filter, which grouping. The bet is that follow-ups fail at the level of meaning, since resolving "break it down by month" requires first deciding what "it" is, and that is not a SQL skill. The training data included a Cube-inspired semantic model of each database, entities, dimensions, measures, and join hints derived from the schema, so the model could learn to map user language onto governed business concepts before touching SQL.

### #3 Metric DSL

An intermediate language for business metrics, where the model first emits something like `MEASURE(revenue) BY customer_country` and a compiler expands it into the actual SQL. The bet here is that analysts ask for metrics rather than columns, so the model should preserve that intent and let deterministic code handle the plumbing.

### #4 Behavior recovery

Training conversations usually contain ideal history, every earlier turn answered perfectly, which is not the world the model will live in. For this run I generated training examples whose earlier turns contain the model's actual, sometimes wrong, SQL, so it could learn to notice an empty result or wrong data and adjust course. The bet is that repair is a trainable skill of its own.

### #5 Schema selection

A deliberate cheat, built to answer a diagnostic question rather than to be deployed. I parsed the correct reference SQL of each question with [sqlglot](https://github.com/tobymao/sqlglot) and injected the tables, columns, and joins it uses as hints into the prompt, so the model knows where to look because we peeked at the answer. Useless in production, where there is no answer to peek at, but it measures something I wanted to know: if navigation were solved, how good is the model's SQL?

## The numbers

| Run | Value accuracy | Strict accuracy | Notes |
| --- | ---: | ---: | --- |
| Claude Sonnet 4.6 (via OpenRouter) | `0.674` | `0.395` | The frontier baseline, same questions and scorer |
| Qwen 9B + schema hints from correct answer | `0.651` | `0.465` | The diagnostic cheat: told which tables and columns to use, not deployable |
| Qwen 9B + semantic training (50 steps) | `0.581` | `0.326` | Best result from a fair, deployable local model |
| Qwen 9B + direct SQL training (50 steps) | `0.581` | `0.302` | Same value accuracy, slightly less precise output shape |
| Qwen 9B out of the box | `0.558` | `0.302` | No training, just the base model with a SQL prompt |
| Qwen 9B + recovery training | `0.558` | `0.302` | Trained on messy histories, no improvement yet |

Value accuracy asks whether the SQL returned the right data and is lenient about aliases, so `total` instead of `revenue` with the right numbers still passes. Strict accuracy also cares about output shape. Value accuracy is the closer proxy for whether the answer was useful to the analyst, so it is the one I mostly reason from.

Three things stand out to me in this table. The clean, deployable fine-tunes give a real but small lift: semantic and direct SQL both reach `0.581` from the base model's `0.558`, a 2.3 percentage point improvement that shows the training data teaches something, and stays well short of Sonnet's `0.674`. Recovery training gave nothing at all, it tied the base model exactly, and my reading is that the generated repair data did not yet contain moves worth learning, so I file the idea as untested rather than disproven. And the cheat run jumped to `0.651` value accuracy, within about two points of Sonnet, with strict accuracy ahead of Sonnet's, `0.465` against `0.395`.

## Reading the cheat

The reason the schema-hints number matters is what the errors look like without the hints. Going through the failures, the model rarely writes garbage. Most wrong SQL is plausible SQL: a reasonable-looking query against `customer_orders` when the table is called `orders`, a `LEFT JOIN` where the question needs an `INNER JOIN`, a filter on "France" when the column stores "FR". Syntax was never the problem, since raw Qwen produces syntactically valid SQL essentially every time, `1.000` across the board, and it generally understands the English question too. What it cannot do reliably is find its way around a database it has never seen, and in a conversation that navigation problem compounds, because the model is also carrying context forward, resolving references, and adjusting when the filter that worked on turn 2 needs a different column on turn 4.

So the experiment I designed to test SQL writing ended up measuring something else. Table and column selection is the bottleneck, and once the hints remove it, a 9B model on my desk sits two points from the frontier on this data. That changed what I plan to work on next.

## Limits of this comparison

The comparison is controlled, same rows, same scorer, same protocol for every run including Sonnet, and it is small: 43 generated-history follow-up turns from 12 CoSQL dialogs. It supports the claims that fine-tuning lifts a local model a little, that Sonnet 4.6 stays ahead of every deployable local run, and that the gap nearly closes when navigation is handed to the model for free. It does not support claiming that a local 9B beats Sonnet, that schema hints are usable outside the lab, or that recovery training works. I am comfortable with those limits, since what I wanted from the exercise was a direction for the next round of work.

## What comes next

If table and column selection is the bottleneck, the next training data should teach it directly:

1. Which tables matter for this question?
2. Which columns express the requested metric or dimension?
3. Which stored values match the user's wording?
4. What should carry forward from the previous turn?
5. What changed in the follow-up: filter, grouping, metric, or repair?
6. When execution returns nothing, what is the next reasonable attempt?

Once that protocol is stable locally, I want to scale the same evaluation toward [BIRD-Interact](https://arxiv.org/abs/2510.05318) and re-ask the question against the hosted models there.

The fine-tuning code, the training data generation, and the eval harness are in [`github.com/xdanny/multiturn-sql-finetuning`](https://github.com/xdanny/multiturn-sql-finetuning). If your own agent handles single questions and gets lost in follow-ups, I would be curious whether table selection turns out to be the bottleneck on your schema as well.

## References

- [BIRD paper](https://arxiv.org/abs/2305.03111)
- [SParC paper](https://arxiv.org/abs/1906.02285)
- [CoSQL paper](https://arxiv.org/abs/1909.05378)
- [BIRD-Interact paper](https://arxiv.org/abs/2510.05318)
- [DIN-SQL paper](https://arxiv.org/abs/2304.11015)
- [Qwen model family](https://huggingface.co/Qwen)
- [OpenRouter](https://openrouter.ai/)


Fine-tuning a local 9B model for multi-turn text-to-SQL


# Building a code search chatbot with FAISS and the Strands SDK

Everyone spent last year declaring RAG dead, and after building a retrieval system from scratch I mostly agree about the word, since it always collapsed three different engineering problems, finding information, assembling context, and generating answers, into one buzzword that sounded like a solved pattern. The problems themselves did not go anywhere. Jeff Huber, the CEO of Chroma, gave the framing I ended up building against in a [Latent Space interview](https://www.latent.space/p/chroma):

> "Context engineering is the job of figuring out what should be in the context window at any given LLM generation step."

What I like about this framing is that it decomposes into pieces you can reason about independently, retrieval, filtering, re-ranking, assembly, memory, evaluation, each with its own failure modes and its own knobs. So I built a small system to find out which piece deserves the attention: a chatbot that answers questions about the [Strands SDK](https://strandsagents.com) by searching its own source code, the Python files, the markdown docs, and the examples. The stack is [FAISS](https://github.com/facebookresearch/faiss) for dense vector search with local embeddings, the Strands SDK itself for agent orchestration, [mem0](https://mem0.ai) for conversation memory, and Gemini 2.5 Flash Lite for generation. The full tutorial, a self-contained Jupyter notebook with one-command setup, lives at [learn-strands/rag-chatbot](https://github.com/xdanny/learn-strands/tree/main/rag-chatbot).

I expected retrieval to take most of the time. It was done in a day, while assembling context and memory filled the rest of the week, and the sections below follow that same order.

## Retrieval in a day

For a codebase under 10K chunks you do not need approximate indices. FAISS with `IndexFlatL2` and sentence-transformers gives you exact search with 100% recall, fast enough that nothing more sophisticated earns its complexity:

```python
class FAISSVectorStore:
    """FAISS-based vector store with local embeddings."""

    def __init__(self, local_model: str = "all-MiniLM-L6-v2", dimension: int = 384):
        self.embedder = SentenceTransformer(local_model)
        self.dimension = dimension
        self.index = faiss.IndexFlatL2(dimension)
        self.documents = []

    def add_documents(self, documents: List[Dict[str, Any]], batch_size: int = 32):
        """Add documents with batched embedding generation."""
        texts = [doc.get('text', '') for doc in documents]

        all_embeddings = []
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            embeddings = self.embedder.encode(batch, show_progress_bar=False)
            all_embeddings.append(embeddings)

        embeddings_array = np.vstack(all_embeddings).astype('float32')
        faiss.normalize_L2(embeddings_array)

        self.index.add(embeddings_array)
        self.documents.extend(documents)

    def search(self, query: str, k: int = 5, threshold: float = 0.3) -> List[Dict[str, Any]]:
        """Dense vector search with cosine similarity."""
        if self.index.ntotal == 0:
            return []

        query_embedding = self.embedder.encode([query]).astype('float32')
        faiss.normalize_L2(query_embedding)

        distances, indices = self.index.search(query_embedding, min(k, self.index.ntotal))
        similarities = 1 - (distances[0] / 2)  # L2 on normalized vectors → cosine
        results = []
        for idx, similarity in zip(indices[0], similarities):
            if similarity >= threshold and idx < len(self.documents):
                doc = self.documents[int(idx)].copy()
                doc['similarity'] = float(similarity)
                results.append(doc)

        return results
```

L2 normalization on the embeddings turns Euclidean distance into cosine similarity, and the 0.3 threshold is deliberately aggressive, since I would rather feed the agent fewer, better chunks than flood the context with marginal matches.

The search itself becomes a Strands tool that the agent calls explicitly:

```python
@tool
def search_strands_sdk(query: str, max_results: int = 5) -> List[Dict[str, Any]]:
    """Semantic search over Strands SDK codebase."""
    results = vector_store.search(query=query, k=max_results, threshold=0.3)

    return [{
        'text': r['text'],
        'similarity': r['similarity'],
        'file_path': r.get('file_path', 'unknown'),
        'file_type': r.get('file_type', 'unknown')
    } for r in results]


@tool
def search_by_file_type(query: str, file_type: str, max_results: int = 5) -> List[Dict[str, Any]]:
    """Combine dense search with metadata filtering."""
    all_results = vector_store.search(query=query, k=max_results * 3, threshold=0.3)
    filtered = [r for r in all_results if r.get('file_type', '') == file_type][:max_results]

    return [{
        'text': r['text'],
        'similarity': r['similarity'],
        'file_path': r.get('file_path', 'unknown')
    } for r in filtered]
```

Huber calls this "naming the primitives", and the name earns its keep in practice: when retrieval is an explicit, composable step instead of a hidden subroutine, you can reason about it, debug it, and swap it out. `search_strands_sdk` handles broad semantic queries, `search_by_file_type` narrows by extension, and both show up in the agent's tool log when something goes wrong.

## Accurate retrieval saves tokens downstream

The effect I did not anticipate: when the first stage is precise, going from 6,000 chunks to 20 good ones instead of 200 mediocre ones, the downstream agent burns dramatically fewer tokens. Sloppy retrieval hands the agent a pile of vaguely relevant chunks, it cannot find a clear answer, and it starts exploring, calling the search tool again with a rephrased query, reasoning through ambiguous context, or producing a hedged answer that invites a follow-up question, and every one of those extra steps costs tokens. With tight retrieval the agent reads five chunks, finds the answer, cites the source, and stops.

Huber frames the first stage in terms of recall:

> "Using signals like vector search, like full text search, like metadata filtering... to go from 10,000 down to 300."

What I saw is that precision matters at least as much once an agent is the consumer, because an agent with twenty excellent chunks answers immediately, while one with two hundred okay chunks wanders off into re-queries and hedges.

## Re-ranking, the corner I cut

Huber is direct about re-ranking:

> "Using an LLM as a re-ranker and brute forcing from 300 down to 30, I've seen now emerging... way more cost effective than a lot of people realize."

That is a separate pass: take the 300 first-stage candidates, score them with a cross-encoder or an LLM, keep the top 30, and only then generate the answer from those 30, so the re-ranking step reduces what enters the generation context. My system does not have that pass. What I built is a response specialist that receives all retrieved chunks in a single prompt and generates from them:

```python
@tool
def response_specialist_tool(query: str, context: str) -> str:
    """Generate response from retrieved context."""
    agent = Agent(
        system_prompt="""You are a response generation specialist for Strands SDK queries.

Generate answers based ONLY on provided context.

Guidelines:
1. PRIORITIZE the most relevant chunks from context
2. CITE sources with file paths and line numbers
3. COMBINE information from multiple chunks coherently
4. If context is insufficient, say so clearly
5. Provide runnable code snippets when possible""",
        tools=[use_llm],
        model=gemini_model
    )

    response = agent(f"""User Question: {query}

Retrieved Context:
{context}

Generate a comprehensive answer using ONLY the context above.""")

    return str(response)
```

For a while I told myself the LLM's attention was doing "implicit re-ranking", focusing on the relevant chunks and ignoring the rest, and that is not what re-ranking is. Everything is still in the context window, every token still counts against the budget, and the marginal chunks contribute to what Huber calls context rot:

> "As you use more and more tokens, the model can pay attention to less and then also can reason sort of less effectively."

With tight first-stage retrieval sending 5-10 chunks instead of 300, the missing pass is survivable for a codebase this size, and it will not survive scale, so a proper scoring step is the next thing on the list.

## Assembly and memory took the rest of the week

Assembly is the problem of taking retrieved chunks, disconnected fragments from different files and different sections of documentation, and composing them into context the model can reason about coherently, and it is more than concatenation: the order matters, the framing matters, and so does whether you include file paths and line numbers or show a chunk verbatim rather than summarized.

Conversation memory is the same problem stretched across turns. The model needs context from earlier in the chat, and appending the full history grows without bound and walks straight into context rot, which is where mem0 earned its place in the stack:

```python
mem0_config = {
    "vector_store": {
        "provider": "qdrant",
        "config": {
            "collection_name": "strands_chat",
            "embedding_model_dims": 384,
            "path": ":memory:"
        }
    },
    "embedder": {
        "provider": "huggingface",
        "config": {"model": "all-MiniLM-L6-v2"}
    }
}

memory = Memory.from_config(mem0_config)


@tool
def remember_conversation(user_message: str, assistant_response: str, user_id: str = "user") -> str:
    """Extract and store salient facts from the conversation."""
    memory.add(
        f"User asked: {user_message}\nAssistant responded: {assistant_response}",
        user_id=user_id
    )
    return f"Stored conversation in memory for {user_id}"


@tool
def recall_conversation(query: str = "", user_id: str = "user") -> str:
    """Retrieve relevant conversation history — not the full transcript."""
    if not query:
        query = "recent conversation history"

    results = memory.search(query, user_id=user_id, limit=5)

    if not results or 'results' not in results or not results['results']:
        return "No previous conversation history found."

    history = []
    for item in results['results']:
        if 'memory' in item:
            history.append(item['memory'])

    return "\n\n".join(history) if history else "No relevant conversation history found."
```

Instead of appending full dialogue turns, mem0 extracts salient facts and stores them as searchable memories, so when the agent needs conversation context it retrieves the relevant memories and not the entire transcript.

## The orchestrator

The orchestrator ties retrieval, response generation, and memory into a single agent:

```python
rag_chatbot = Agent(
    system_prompt="""You are an intelligent assistant with expertise in the Strands SDK.

WORKFLOW:
1. RETRIEVE: Use retrieval_specialist to find relevant docs and code
2. RECALL: Use recall_conversation for relevant conversation context
3. RESPOND: Use response_specialist to generate a cited answer
4. REMEMBER: Use remember_conversation to store salient facts

Always retrieve context before answering technical questions.
Prefer code examples from the actual Strands SDK codebase.""",
    tools=[
        retrieval_specialist,
        response_specialist_tool,
        remember_conversation,
        recall_conversation,
        use_llm
    ],
    model=gemini_model
)
```

When you ask "How do I create an agent with custom tools?", the orchestrator calls `retrieval_specialist` to search the codebase, `recall_conversation` to pull relevant memories from earlier turns, passes both to `response_specialist` for an answer citing specific files, and finishes with `remember_conversation` to store the key facts for future turns. Every step is visible and independently debuggable, and when an answer is wrong or a citation is off, you can tell which step failed instead of staring at one opaque pipeline.

## Bootstrapping the runtime

To run this outside a notebook, the FAISS store has to be initialized, the index loaded from disk or built on the fly, the memory database configured, and a chat loop started:

```python
import os
from pathlib import Path
from strands import Agent, tool
from strands.models.gemini import GeminiModel
from strands_tools import use_llm
from mem0 import Memory

# 1. Initialize the LLM
GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"]
gemini_model = GeminiModel(
    client_args={"api_key": GOOGLE_API_KEY},
    model_id="gemini-2.5-flash-lite"
)

# 2. Boot the FAISS vector store
vector_store = FAISSVectorStore(
    local_model="all-MiniLM-L6-v2",
    dimension=384
)

# 3. Load or index the codebase
if Path("data/strands_sdk.faiss").exists():
    vector_store.load("data/strands_sdk.faiss", "data/documents.json")
else:
    # Index the SDK docs and source code on the fly
    documents = load_and_chunk_documents(
        repo_path="data/strands-sdk",
        chunk_size=1000,
        chunk_overlap=200
    )
    vector_store.add_documents(documents)
    vector_store.save("data/strands_sdk.faiss", "data/documents.json")

# 4. Initialize in-memory conversation memory
memory = Memory.from_config({
    "vector_store": {
        "provider": "qdrant",
        "config": {
            "collection_name": "strands_chat",
            "embedding_model_dims": 384,
            "path": ":memory:"
        }
    },
    "embedder": {
        "provider": "huggingface",
        "config": {"model": "all-MiniLM-L6-v2"}
    }
})

# 5. Start the interactive chat loop
print("💬 Strands SDK RAG Chatbot is online.")
while True:
    query = input("👤 You: ")
    if query.lower() in ['exit', 'quit']:
        break

    response = rag_chatbot(query, user_id="user")
    print(f"\n🤖 Assistant: {response}\n")
```

Loading a pre-built index when one exists and indexing the SDK on the fly when it does not keeps iteration fast during development.

## What I would measure next

Huber's most practical recommendation is the one I have not implemented yet, golden datasets:

> "People should be creating small golden data sets of what queries they want to work and what chunks should return... quantitatively evaluate what matters."

The idea is to spend an evening labeling query-chunk pairs:

```json
[
  {
    "query": "How do I create an agent with custom tools?",
    "expected_chunks": [
      "data/strands-sdk/examples/custom_tools.py",
      "data/strands-sdk/docs/agents.md"
    ],
    "expected_concepts": ["@tool decorator", "Agent class initialization"]
  }
]
```

and then wire Recall@10 into CI so the build fails when it drops below a threshold. Without this, retrieval quality degrades silently: you swap an embedding model, change the chunking strategy, re-index the codebase, and never notice that three important queries stopped returning the right files.

## Where this leaves the RAG question

The project started because everyone was saying RAG is dead and I wanted to know what replaces it, and the answer I can defend after building one is the same engineering work, decomposed into stages that are honest about where the difficulty lives. For this codebase the difficulty lived in assembly and memory rather than retrieval, retrieval's precision turned out to set the agent's token bill, and the re-ranking pass and the golden dataset are the two gaps I would close before trusting the system at a larger scale. If you have built something similar and your expensive stage was a different one, I would be curious to hear which.

## References

- [Latent Space: Jeff Huber on Context Engineering](https://www.latent.space/p/chroma)
- [learn-strands Tutorial](https://github.com/xdanny/learn-strands/tree/main/rag-chatbot)
- [FAISS](https://github.com/facebookresearch/faiss)
- [Strands SDK](https://strandsagents.com)
- [mem0](https://mem0.ai)


Building a code search chatbot with FAISS and the Strands SDK


# Building data quality checks in your pySpark data pipelines

Data quality is a rather critical part of any production data pipeline. In order to provide accurate SLA metrics
and to ensure that the data is correct, it is important to have a way to validate the data and report the metrics
for further analysis. In this post, we will look at how to build data quality checks in your pySpark data pipelines.

## Exploring Delta Live Tables

Delta Live Tables is a new feature in Databricks that allows users to build reliable data pipelines with built-in
data quality metrics and monitoring. It is a new abstraction on top of Delta Lake that allows users to query the
data using streaming live tables. The data is updated in real-time as the underlying data changes. What caught my eye
was the data quality capabilities that the users can specify on dataset level. Using python decorators we can specify
@expect_all, @expect_all_or_drop, and @expect_all_or_fail expectations that accept a python dictionary as an argument,
where the key is the expectation name and the value is the expectation constraint. Example:

```python
@dlt.expect("valid timestamp", "col("timestamp") > '2012-01-01'")
@dlt.expect_or_drop("valid_current_page", "current_page_id IS NOT NULL AND current_page_title IS NOT NULL")
@dlt.expect_or_fail("valid_count", "count > 0")
```

Metrics of clean records and failed records are automatically collected and stored in the Delta Live Table metadata, so
the users can set up alerts and monitor the data quality of their pipelines.

Delta Live Tables however still face quite some limitations and are not yet ready for production use. Some limitations
include:
1. The data quality checks are only available for streaming live tables, not for batch tables. We can still create
streaming tables from batch tables, but if the version of your data is changing the pipeline will fail.
2. Lack of testing capabilities. There is no way to test the data quality checks in a local environment because
dlt package is available only in Databricks runtime.
3. Lack of documentation. The documentation is very limited and it is not clear how to use the data quality checks.
Currently only python and SQL API are supported.
4. Setting up DLT job doesn't support all the parameters that are available in the Databricks job.

## Building your own data quality checks as python decorators

In order to overcome the limitations of Delta Live Tables, we can build our own data quality checks as python decorators.
The idea is to create a decorator that will accept a python list of arguments, which will be constraints that we will
apply for a determined column. We will collect all the necessary metrics and store them as part of the Delta Lake
metadata.

We will start by building two simple conditions for our data, uniqueness and filtering based on a condition.

```python
from abc import ABC, abstractmethod

class ColumnCondition(ABC):
    @abstractmethod
    def get_cols(self):
        pass


class UniqueCondition(ColumnCondition):

    def __init__(self, col):
        self.col = col

    def get_cols(self):
        return self.col


class FilterCondition(ColumnCondition):

    def __init__(self, left_col, right_col):
        self.left_col = left_col
        self.right_col = right_col

    def get_cols(self):
        return self.left_col, self.right_col
```

For ease of use, let's define a couple of factory functions that will return the right condition:

```python
def is_not_null(col):
    return FilterCondition(col + " is not null", col + " is null")

def is_unique(col):
    return UniqueCondition(col)
```

The main idea is to use a function as a decorator argument using a certain column, which will return a condition object.
We can use the condition object to pattern match and apply the specific function depending on the condition type.
We start by creating a simple python decorator using functools wraps:

```python
    def expect_or_drop(self, conditions: List[FilterCondition]):
        def decorator(function):
            @wraps(function)
            def wrapper(*args, **kwargs):
                retval = function(*args, **kwargs)
                # apply conditions
                return retval
            return wrapper
        return decorator
```

We will create an Expectations class that will contain all the data quality checks. The rsd arguments represents the
maximum relative standard deviation allowed for the approx_count_distinct_functions. Read more
[here](https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.functions.approx_count_distinct.html).

```python

class Expectations:

    def __init__(self, spark: SparkSession, rsd=0.05):
        self.spark = spark
        self.schema = StructType([StructField("condition", StringType(), True),
                                  StructField("dropped_records", IntegerType(), True),
                                  StructField("clean_records", IntegerType(), True)])
        emptyRDD = spark.sparkContext.emptyRDD()
        self.metrics = spark.createDataFrame(emptyRDD, schema=self.schema)
        self.rsd = rsd
```

The metrics dataframe will contain the metrics for each data quality check. We can proceed to create our filtering
and uniqueness checks:

```python

    def apply_condition(self, dataframe, condition):
        if isinstance(condition, FilterCondition):
            return self.filter_condition(dataframe, condition.get_cols())
        elif isinstance(condition, UniqueCondition):
            return self.is_unique_extend(dataframe, condition.get_cols())
        return dataframe

    def filter_condition(self, dataframe: DataFrame, left_right) -> DataFrame:
        left, right = left_right
        total_records = dataframe.count()
        dropped_records = dataframe.filter(right).count()
        df = self.spark.createDataFrame([(left, dropped_records, (total_records - dropped_records))], schema=self.schema)
        self.metrics = self.metrics.unionAll(df)
        return dataframe.filter(left)

    def is_unique_extend(self, dataframe: DataFrame, col) -> DataFrame:
        total_records = dataframe.select(F.col(col)).count()
        distinct_records = dataframe.select(F.approx_count_distinct(col, self.rsd)).collect()[0][0]
        dropped_records = total_records - distinct_records
        df = self.spark.createDataFrame([(col + " is unique", dropped_records, distinct_records)], schema=self.schema)
        self.metrics = self.metrics.unionAll(df)
        return dataframe.dropDuplicates([col])
```

In order to apply the conditions we will use the apply_condition function to every condition in the list. In order to
do that we will use the functools reduce function as foldLeft:

```python
   foldl = lambda func, acc, xs: reduce(func, xs, acc)
   @wraps(function)
   def wrapper(*args, **kwargs):
        retval = function(*args, **kwargs)
        return foldl(self.apply_condition, retval, conditions)
```

Wrapping it all together:

```python

from abc import ABC, abstractmethod
from functools import reduce, wraps
from typing import List

import pyspark.sql.functions as F
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

foldl = lambda func, acc, xs: reduce(func, xs, acc)


class ColumnCondition(ABC):
    @abstractmethod
    def get_cols(self):
        pass


class UniqueCondition(ColumnCondition):

    def __init__(self, col):
        self.col = col

    def get_cols(self):
        return self.col


class FilterCondition(ColumnCondition):

    def __init__(self, left_col, right_col):
        self.left_col = left_col
        self.right_col = right_col

    def get_cols(self):
        return self.left_col, self.right_col


def is_not_null(col):
    return FilterCondition(col + " is not null", col + " is null")


def is_unique(col):
    return UniqueCondition(col)


class Expectations:

    def __init__(self, spark: SparkSession, rsd=0.05):
        self.spark = spark
        self.schema = StructType([StructField("condition", StringType(), True),
                                  StructField("dropped_records", IntegerType(), True),
                                  StructField("clean_records", IntegerType(), True)])
        emptyRDD = spark.sparkContext.emptyRDD()
        self.metrics = spark.createDataFrame(emptyRDD, schema=self.schema)
        self.rsd = rsd

    def expect_or_drop(self, conditions: List[FilterCondition]):
        def decorator(function):
            @wraps(function)
            def wrapper(*args, **kwargs):
                retval = function(*args, **kwargs)
                return foldl(self.apply_condition, retval, conditions)
            return wrapper
        return decorator

    def apply_condition(self, dataframe, condition):
        if isinstance(condition, FilterCondition):
            return self.filter_condition(dataframe, condition.get_cols())
        elif isinstance(condition, UniqueCondition):
            return self.is_unique_extend(dataframe, condition.get_cols())
        return dataframe

    def filter_condition(self, dataframe: DataFrame, left_right) -> DataFrame:
        left, right = left_right
        total_records = dataframe.count()
        dropped_records = dataframe.filter(right).count()
        df = self.spark.createDataFrame([(left, dropped_records, (total_records - dropped_records))], schema=self.schema)
        self.metrics = self.metrics.unionAll(df)
        return dataframe.filter(left)

    def is_unique_extend(self, dataframe: DataFrame, col) -> DataFrame:
        total_records = dataframe.select(F.col(col)).count()
        distinct_records = dataframe.select(F.approx_count_distinct(col, self.rsd)).collect()[0][0]
        dropped_records = total_records - distinct_records
        df = self.spark.createDataFrame([(col + " is unique", dropped_records, distinct_records)], schema=self.schema)
        self.metrics = self.metrics.unionAll(df)
        return dataframe.dropDuplicates([col])
```

Let's see the decorator in action:

```python
    @expectation.expect_or_drop([is_not_null("row"), is_unique("row")])
    def read_dataframe(df):
        return df

    result = read_dataframe(df_1)
    print(result.collect())
    print(expectation.metrics.collect())
```

The console will print:
```bash
[Row(row='row1', row_number=1)]
[Row(condition='row is not null', dropped_records=1, clean_records=2), Row(condition='row is unique',
dropped_records=1, clean_records=1)]
```

We can extend and add some plotting functions as well to our Expectations class.

```python
    def plot_pie_with_total(self, figsize=(10, 10)):
        labels = ["clean_records", "dropped_records"]
        df = self.metrics.toPandas()
        size = len(df.index)
        if size == 1:
            fig, axs = plt.subplots(1)
            axs.pie(df.iloc[0][labels], labels=labels, autopct='%1.1f%%')
            axs.set_title(df.iloc[0]["condition"])
        else:
            fig, axs = plt.subplots(1, size, figsize=figsize)
            for i in range(size):
                axs[i].pie(df.iloc[i][labels], labels=labels, autopct='%1.1f%%')
                axs[i].set_title(df.iloc[i]["condition"])
        plt.show()
```

That will generate a pie chart with the metrics for every condition:
![Image Name](/assets/blog/data-quality-pyspark/data_quality.png)

Happy coding and stay safe!


Building data quality checks in your pySpark data pipelines


# Improve your PySpark ETL's performance by providing explicit schema

Have you ever stumbled upon a Spark ETL and you were left wondering how a simple loading of a
dataset can take hours, even though the filtered dataset you are specifying is relatively small?
While there can be multiple reasons for the ETL being slow, from the cloud provider to wrong cluster Spark
configuration of executors, we will focus in this blogpost on optimizing dataframe reads for json and csv datasets.

Going through the [Spark Scala source code](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala#L213)
we can already understand one of the reasons of our query being slow. When reading json datasets Spark will go
through all the json files once to infer the schema before loading our datasets. In case of a heavily distributed
dataset across multiple json files, this can be a source of considerable time spent by Spark Executors. We can
reference the [Spark Documentation](https://spark.apache.org/docs/latest/sql-data-sources-json.html) in order
to better understand what kind of properties are set by default when reading json datasets. A quick fix to speed
our query is to set a smaller sampling ratio for Spark schema inference, by default this value is set to `1.0`.
A better long term fix for your ETL code, especially if the data infrastructure provides some schema guarantees,
is to provide Spark Schema when reading your datasets.

## How to provide a Spark schema when querying json files
### #1 Provide a typed PySpark Schema

As an example we will use an order event that we might receive via our event-bus. This event will have some custom
metadata as well as the order itself with custom order items. In this case our PySpark schema would look like this:

```python
from pyspark.sql.types import *

schema = StructType([
                StructField("metadata",StringType(),False),
                StructField("order",
                            StructType([
                                StructField("order_id",StringType(),False),
                                StructField("created_at",TimestampType(),True),
                                StructField("updated_at",TimestampType(),True),
                                StructField("customer_id",StringType(),False),
                                StructField("order_items",
                                            ArrayType(
                                                StructType([
                                                    StructField("item_id",StringType(),False),
                                                    StructField("item_value",DoubleType(),False)]),False),False)]),False)])

```

We would read the dataframe with the following PySpark command:
```python
df = spark.read.json("our/path", schema=schema)
```

With this configuration spark will read the dataset directly without trying to infer the schema. This setup
can feel quite cumbersome as the data engineer needs to work with Spark types directly, and the definition
of the struct can be quite verbose.

### #2 Provide PySpark schema as Python dataclasses

We can improve the example above using the library `tinsel` that we have leveraged in previous blogposts.
To install the library in your notebook simply run:
```%pip install tinsel```
Using tinsel transformer we can write our schema as dataclasses as in the following example:
```python
from typing import List, NamedTuple, Optional
from tinsel import struct, transform

@struct
class OrderItem(NamedTuple):
  item_id: str
  item_value: float

@struct
class Order(NamedTuple):
  order_id: str
  created_at: Optional[datetime]
  updated_at: Optional[datetime]
  customer_id: str
  order_items: List[OrderItem]

@struct
class Event(NamedTuple):
    metadata: str
    order: Order

schema = transform(Event)
df = spark.read.json("our/path", schema=schema)
```

### #3 Save and load json schema for PySpark dataframes

Writing our schema as python dataclasses is already an excellent step forward, however this might not always
be the right solution. Maintaining schema and schema migration can be quite challenging, and the software
developers might opt on using version control to specify the schemas as yaml or json. With PySpark we can
load the schema specified as json as a static resource, for example from S3. Using the example above we
can generate the json schema:
```python
df.schema.json()
```
Which would print our schema:
```python
json_schema =
"""
    {
  "fields": [
    {
      "metadata": {},
      "name": "metadata",
      "nullable": false,
      "type": "string"
    },
    {
      "metadata": {},
      "name": "order",
      "nullable": false,
      "type": {
        "fields": [
          {
            "metadata": {},
            "name": "order_id",
            "nullable": false,
            "type": "string"
          },
          {
            "metadata": {},
            "name": "created_at",
            "nullable": true,
            "type": "timestamp"
          },
          {
            "metadata": {},
            "name": "updated_at",
            "nullable": true,
            "type": "timestamp"
          },
          {
            "metadata": {},
            "name": "customer_id",
            "nullable": false,
            "type": "string"
          },
          {
            "metadata": {},
            "name": "order_items",
            "nullable": false,
            "type": {
              "containsNull": false,
              "elementType": {
                "fields": [
                  {
                    "metadata": {},
                    "name": "item_id",
                    "nullable": false,
                    "type": "string"
                  },
                  {
                    "metadata": {},
                    "name": "item_value",
                    "nullable": false,
                    "type": "double"
                  }
                ],
                "type": "struct"
              },
              "type": "array"
            }
          }
        ],
        "type": "struct"
      }
    }
  ],
  "type": "struct"
}"""
```

The schema can now be loaded using the following command:
```python
import json

new_schema = StructType.fromJson(json.loads(json_schema))
df = spark.read.json("our/path", schema=new_schema)
```


tinsel1 posts

Improve your PySpark ETL's performance by providing explicit schema

Recommended

Running Codex on a local model with LM Studio

Running time-series forecasting in the browser with Rust and WebAssembly

Orchestrating a forecasting pipeline with Bedrock AgentCore and SageMaker