Running Codex on a local model with LM Studio

The orchestration system I am building hands Codex jobs from a DAG, and Codex runs the whole agentic loop for each one: writing files, running shell commands, editing code, and checking its own work. Most of those jobs are small (renaming something, scaffolding a file, summarizing a diff, judging whether an output meets a criterion) and none of them needs a frontier model, yet every one goes out over the wire and shows up on the bill. The harness that plans, calls tools, and verifies is identical wherever the model lives, so I wanted the small jobs answered by a model running on my own Mac, for free and offline. I assumed this was a config change. It took the whole afternoon instead, and this post collects everything that broke along the way to a setup that works.

Choosing a model and getting it onto the machine

A model that sits inside an agent loop gets called many times per task, so latency compounds and the runtime choice matters. On a Mac that choice is MLX, Apple's Metal-native array framework, which drives the unified-memory GPU directly and is the fastest option on Apple Silicon at a given quantization. vLLM has no real Metal backend, and the GGUF path in Ollama or LM Studio runs on llama.cpp, which MLX edges out at the same quant.

For the model itself, the lever is active parameters rather than total. A Mixture-of-Experts model only fires a slice of its weights on each token, so a 35B model with 3B active generates faster than a 27B dense model while carrying more total capacity for quality. I settled on unsloth/Qwen3.6-35B-A3B in MLX 4-bit, about 21.6 GB on disk with 3B active per token, the strongest and the fastest of everything I tried.

Two download problems cost me an hour before the model ever loaded. First, unauthenticated Hugging Face pulls are throttled to a crawl, so set a free token. Second, Unsloth's repos use Xet storage, and the Xet client kept freezing at 0 MB/s with a pile of half-downloaded chunks. Turning Xet off and falling back to plain HTTPS took the download straight to around 10 MB/s:

export HF_TOKEN=hf_xxx                 # otherwise the download crawls
export HF_HUB_DISABLE_XET=1            # the real fix — Xet stalls, plain HTTPS doesn't
export HF_HUB_ENABLE_HF_TRANSFER=1
uv tool install mlx-lm --with hf-transfer

An afternoon of 404s

MLX ships a server, mlx_lm.server, that exposes an OpenAI-compatible endpoint, and Codex lets you define a custom model provider pointed at any OpenAI-compatible URL, so pointing one at the other looked like the whole job. Instead, every agent turn came back 404 and Codex retried, forever, and I watched it loop for forty minutes before digging into why.

The reason is which HTTP API Codex speaks. Modern Codex, 0.139 in my case, drives a custom model provider over the OpenAI Responses API, meaning it POSTs to /v1/responses, while mlx_lm.server only implements the older Chat Completions API at /v1/chat/completions. Codex knocks on a door the MLX server does not have, gets a 404, assumes a transient failure, and tries again. There used to be an escape hatch, wire_api = "chat", which forced Codex back onto chat-completions, but it was removed in 0.139. Setting it now makes Codex refuse to start with the error 'wire_api = "chat" is no longer supported'. The advice the internet still gives you for this situation is a hard error today, and the honest conclusion is that a bare MLX server cannot run Codex's agent loop at all.
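The mismatch can be confirmed before Codex ever runs by POSTing to both paths and looking at the status codes. The following is a minimal sketch of that classification, assuming the two status codes have already been collected by hand (for example with curl); `diagnoseWireApi` and its labels are hypothetical names, not part of Codex or MLX.

```typescript
// Hypothetical helper: classify a local server from the HTTP status codes
// returned by POST /v1/responses and POST /v1/chat/completions.
type WireSupport = "responses" | "chat-only" | "neither";

function diagnoseWireApi(responsesStatus: number, chatStatus: number): WireSupport {
  // A 404 means the route does not exist. Any other status (200, 400, 401, ...)
  // means the server at least recognises the path.
  const hasResponses = responsesStatus !== 404;
  const hasChat = chatStatus !== 404;
  if (hasResponses) return "responses";
  if (hasChat) return "chat-only"; // the mlx_lm.server situation described above
  return "neither";
}
```

A "chat-only" result is the case where Codex would loop on 404s, so it is worth checking once before starting a long run.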

Two more failures showed up while I was flailing, and both look like a broken model when they have nothing to do with the model. A reconnect loop printing "rmcp ... Auth(AuthorizationRequired)" on every single run turned out to be a stale remote MCP server still listed in my Codex config, trying and failing to authenticate before any real work could start. And an older LM Studio build flatly rejected Codex's tool schema with the error "tools.N.type invalid", which a current build accepts without complaint.

The setup that works

Instead of forcing mlx_lm.server to speak the Responses API, I served the same MLX model through LM Studio and used Codex's built-in local-provider path, --oss --local-provider lmstudio, which speaks chat-completions and handles tool calls correctly. With that in place, Codex's agent harness ran entirely on the local Qwen3.6 and created a file on disk: about 20k tokens of agent loop, with nothing leaving the machine.

Four conditions had to hold at once, and each one had failed independently before they lined up. First, the MLX model has to be served by LM Studio rather than converted to GGUF, and there is nothing to re-download, since the Hugging Face snapshot can be symlinked straight into ~/.lmstudio/models/<org>/<repo> and LM Studio picks it up as-is. Second, it has to load with a large context window, because at 8192 tokens Codex's own agent system prompt overflows before your task even starts and you get the cryptic error "tokens to keep is greater than the context length", while 32768 works. Third, any auth-failing MCP server has to come out of ~/.codex/config.toml, or the rmcp reconnect loop stalls every run before the model gets a turn. Fourth, LM Studio itself has to be a current build, for the tool-schema reason above.

lms load qwen3.6-35b-a3b-ud-mlx --gpu max --context-length 32768 && lms server start
codex exec --oss --local-provider lmstudio -m qwen3.6-35b-a3b-ud-mlx \
  -C /tmp/t --sandbox workspace-write --skip-git-repo-check -c 'mcp_servers={}' \
  "Create hello.txt containing: hi. Then stop."   # → writes the file, zero cloud

One model-specific detail is worth knowing before you blame the setup: Qwen3.6 is a thinking model. It writes its reasoning into a separate channel and only emits the real answer after it closes </think>, so if you starve it on max_tokens it spends the entire budget thinking and hands back empty content. Give it room to finish the thought.
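A starved response can be told apart from a genuinely empty answer by looking for an unclosed reasoning block. This is a rough sketch, assuming the raw text still carries inline <think> tags (some servers return reasoning in a separate field instead, in which case this does not apply); `splitThinking` and `ThinkingSplit` are hypothetical names.

```typescript
interface ThinkingSplit {
  reasoning: string;
  answer: string;
  truncated: boolean; // true when the budget ran out before </think>
}

function splitThinking(raw: string): ThinkingSplit {
  const open = "<think>";
  const close = "</think>";
  const end = raw.indexOf(close);
  if (end === -1) {
    // No closing tag. If an opening tag is present, the model was cut off mid-thought.
    if (raw.indexOf(open) !== -1) {
      return { reasoning: raw.replace(open, "").trim(), answer: "", truncated: true };
    }
    // No tags at all: treat the whole text as the answer.
    return { reasoning: "", answer: raw.trim(), truncated: false };
  }
  const reasoning = raw.slice(0, end).replace(open, "").trim();
  const answer = raw.slice(end + close.length).trim();
  return { reasoning, answer, truncated: false };
}
```

When `truncated` comes back true, the useful reaction is to retry with a larger max_tokens rather than to treat the empty answer as a model failure.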

Splitting the work between local and cloud

With local Codex working, the decision stops being all-or-nothing. The multi-step changes where a wrong move is expensive stay on cloud Codex, since that is where the capability gap is real and worth paying for, while the long tail of simpler tasks, generating a file, renaming across a module, summarizing a run, judging whether an output meets a bar, runs locally for nothing, offline, with no data leaving the machine. In my setup the split is decided by an eval pass rather than by feel: a node runs on both engines, the outputs are compared, and the node only graduates to the local model once it demonstrably matches cloud quality on that task.
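The graduation rule can be written down in a few lines. This is only a sketch of one way to read "matches cloud quality": the `EvalCase` shape, the pass/fail representation, and the `minCases` default of 20 are all assumptions made for illustration, not the criteria the actual system uses.

```typescript
// One eval case for a DAG node: did each engine's output pass the check?
interface EvalCase {
  localPass: boolean;
  cloudPass: boolean;
}

function shouldGraduate(cases: EvalCase[], minCases = 20): boolean {
  // Too few cases to conclude anything, so the node stays on cloud.
  if (cases.length < minCases) return false;
  const localPasses = cases.filter((c) => c.localPass).length;
  const cloudPasses = cases.filter((c) => c.cloudPass).length;
  // Graduate only when local passes at least as often as cloud on this node.
  return localPasses >= cloudPasses;
}
```

Keeping the decision per node, rather than per model, is what lets one task type move to the local engine while a harder one stays on cloud.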

A DSPy-style loop where generation is free

The part I wanted all of this for is prompt optimization. A DSPy-style loop proposes candidate prompts, runs them, scores which is better, and promotes the winner, and the cost and the risk of that loop live in different halves. Proposing and running happen hundreds of times per round and are forgiving, since you generate several candidates and keep only the best, so a single bad generation is harmless. Scoring happens a handful of times and is unforgiving, since a biased scorer optimizes you toward the wrong prompt and every later round inherits the mistake. So generation goes to the local model, where it costs nothing, and scoring runs as a Codex LLM judge on the most consistent model available, billed but tiny.
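One round of that loop can be sketched with the two halves kept behind separate function parameters, so each can be backed by a different engine. Everything here is illustrative: `optimizeRound`, `generate`, and `judge` are hypothetical names, and this simplified version judges every candidate, where a real loop would likely shortlist first to keep the billed calls few.

```typescript
type Preference = "A" | "B" | "tie";

// generate: cheap local call that proposes a candidate from the current best.
// judge: billed call comparing the current best (A) against the candidate (B).
function optimizeRound(
  incumbent: string,
  candidatesPerRound: number,
  generate: (base: string, i: number) => string,
  judge: (a: string, b: string) => Preference,
): string {
  let best = incumbent;
  for (let i = 0; i < candidatesPerRound; i++) {
    const candidate = generate(best, i);
    // Promote only on a clear win; a tie keeps the incumbent.
    if (judge(best, candidate) === "B") best = candidate;
  }
  return best;
}
```

Because a tie never promotes, a noisy or undecided judge leaves the current prompt in place instead of drifting.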

The judge needs one piece of care to be trustworthy. LLM judges have a real, measurable preference for whichever output they read first, so I judge twice with the order swapped between rounds, and only trust a verdict that survives the swap:

// Round 1: A first.  Round 2: B first (positions swapped, then remapped).
if (round1Winner === round2Winner && round1Winner !== "tie") {
  verdict = round1Winner;   // it agreed with itself under reordering — trust it
} else {
  verdict = "tie";          // it flipped when you flipped the order — that was bias
}

If reordering the two outputs changes the answer, the preference was positional rather than a judgment of quality, and collapsing it to a tie throws that noise away. I also instruct the judge to grade depth and specificity rather than the mere presence of the required pieces, and to default to a fail unless the work is good, because a lenient judge waves box-ticking work through and the optimizer happily learns to produce more of it.
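The remapping step is the part that is easy to get backwards, so here is a fuller sketch of the two-round judgment. It assumes a judge that reports which display slot it preferred; `judgeBothOrders`, `Slot`, and `Winner` are hypothetical names introduced for the example.

```typescript
type Slot = "first" | "second" | "tie";
type Winner = "A" | "B" | "tie";

// judge sees two outputs in display order and reports which slot it preferred.
function judgeBothOrders(
  a: string,
  b: string,
  judge: (first: string, second: string) => Slot,
): Winner {
  // Round 1: A is shown first, so "first" means A.
  const r1 = judge(a, b);
  const round1Winner: Winner = r1 === "first" ? "A" : r1 === "second" ? "B" : "tie";
  // Round 2: B is shown first, so the slots map the other way round.
  const r2 = judge(b, a);
  const round2Winner: Winner = r2 === "first" ? "B" : r2 === "second" ? "A" : "tie";
  // Agreement after remapping is trusted; any disagreement collapses to a tie.
  return round1Winner === round2Winner ? round1Winner : "tie";
}
```

A judge that always prefers the first slot produces "A" in round one and "B" in round two, so it ends up as a tie, which is the intended outcome for a purely positional preference.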

The last piece of glue is structured output. Every candidate result and every verdict has to parse into a record, and a thinking model will occasionally wrap its JSON in prose or a markdown fence no matter how you asked, so every response goes through a small salvage extractor that takes the span from the first opening bracket to the last closing one and validates it before accepting, and a failed parse becomes a dropped case rather than a crash:

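import (
    "encoding/json"
    "strings"
)

// extractJSON salvages a JSON value from model output that may be wrapped in
// prose or a markdown fence. It returns nil when nothing valid is found.
func extractJSON(text string) json.RawMessage {
    t := strings.TrimSpace(text)
    // Fast path: the whole response is already valid JSON.
    if json.Valid([]byte(t)) {
        return json.RawMessage(t)
    }
    // Otherwise try the span from the first opening bracket to the last closing one.
    first := strings.IndexAny(t, "[{")
    last := strings.LastIndexAny(t, "]}")
    if first >= 0 && last > first {
        if cand := t[first : last+1]; json.Valid([]byte(cand)) {
            return json.RawMessage(cand)
        }
    }
    return nil // a failed parse is a skipped case, not an exception
}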
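The extractor above validates the span from the first opening bracket to the last closing one, which drops the case when trailing prose happens to contain a stray brace. A stricter variant scans for the first balanced span instead. It is sketched here in TypeScript, to match the judge snippet, as an alternative and not as the code the system runs; `extractFirstBalanced` is a hypothetical name.

```typescript
// Return the first balanced {...} or [...] span that parses as JSON, or null.
function extractFirstBalanced(text: string): string | null {
  for (let start = 0; start < text.length; start++) {
    const open = text[start];
    if (open !== "{" && open !== "[") continue;
    let depth = 0;
    let inString = false;
    let escaped = false;
    for (let i = start; i < text.length; i++) {
      const ch = text[i];
      if (inString) {
        // Brackets inside string literals must not affect the depth count.
        if (escaped) escaped = false;
        else if (ch === "\\") escaped = true;
        else if (ch === '"') inString = false;
        continue;
      }
      if (ch === '"') inString = true;
      else if (ch === "{" || ch === "[") depth++;
      else if (ch === "}" || ch === "]") {
        depth--;
        if (depth === 0) {
          const candidate = text.slice(start, i + 1);
          try {
            JSON.parse(candidate);
            return candidate;
          } catch {
            break; // balanced but not valid JSON: try the next opening bracket
          }
        }
      }
    }
  }
  return null;
}
```

The trade-off is a worst-case quadratic scan in exchange for surviving bracketed prose before or after the payload, which is usually acceptable for responses of a few kilobytes.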

The evalset and the history of promotions are stored the same defensive way, newline-delimited JSON appended as the loop runs and read back line by line, so one malformed line drops itself instead of poisoning the whole file.
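The same drop-the-bad-line behaviour takes only a few lines. This sketch works on strings so that it stays self-contained, leaving the actual file reads and appends to the caller; `parseNdjson`, `appendRecord`, and `NdjsonResult` are hypothetical names.

```typescript
interface NdjsonResult {
  records: unknown[];
  dropped: number; // count of lines that failed to parse
}

function parseNdjson(text: string): NdjsonResult {
  const records: unknown[] = [];
  let dropped = 0;
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // blank lines, including the one after the final newline
    try {
      records.push(JSON.parse(trimmed));
    } catch {
      dropped++; // one bad line is skipped; the rest of the file survives
    }
  }
  return { records, dropped };
}

// Appending is one serialized record plus a newline.
function appendRecord(existing: string, record: unknown): string {
  return existing + JSON.stringify(record) + "\n";
}
```

Counting dropped lines, instead of silently skipping them, makes it possible to notice when a writer has started producing garbage.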

Where this setup ends

The configuration above is specific to Codex 0.139 and its Responses-API requirement, and both Codex and LM Studio move quickly, so the exact flags will age. What I expect to age better is the shape of the thing: an agent harness pointed at a local brain for the forgiving work, a billed judge for scoring, since scoring mistakes compound across rounds, and an eval deciding which task goes where. If you try the same wiring against a different local server, I would be curious whether the Responses-versus-chat-completions mismatch bites there too.
