I Ran Codex on a Local Model — No Cloud, No Bill
The agent harness is the same whether the brain is in a datacenter or on your desk. I wanted the simpler, cheaper tasks to run on a model on my own Mac — free, offline, private. Getting there meant 40 minutes of 404s, a config flag that no longer exists, and one fix that finally made Codex write a file with zero cloud. The DSPy payoff at the end is the fun part.

I use Codex as the headless engine that executes the tasks in a self-verifying orchestration system I'm building. Each node in a DAG hands Codex a job and it runs the whole agentic loop: it writes files, runs shell commands, edits code, and checks its own work. And most of the time the job is something trivial. Rename a thing. Scaffold a file. Summarize a diff. Judge whether two outputs satisfy a criterion. None of that needs a frontier model in a datacenter, but every single node still goes out over the wire and still shows up on the bill.
That bothered me more than the money did, honestly. The harness — the part that plans, calls tools, and verifies — is identical whether the brain answering each turn lives in the cloud or on my desk. The model is just a function the harness calls. So why not swap in a function that runs on my own machine, for free, offline, with nothing leaving the laptop?
That's the whole idea: keep Codex's harness, swap the brain for a local model, and route the simpler tasks to it. Hard, critical, agentic work stays on cloud Codex where it belongs. Everything cheap and forgiving runs locally for nothing.
It sounds like a config change. It was not a config change. Here's everything that broke, and the one thing that finally worked.
The brain: a Qwen MoE on Apple Silicon
Before any of the Codex wiring, you need a local model that's actually fast enough to sit inside an agent loop, where the model gets called many times per task and latency compounds.
On a Mac the runtime decision isn't close. MLX is Apple's Metal-native array framework, and it drives the unified-memory GPU directly, so for a given quantization it's the fastest option on Apple Silicon. vLLM is wonderful on an NVIDIA box but has no real Metal backend, so it's the wrong tool here. Ollama and LM Studio's GGUF path both work, but under the hood that's llama.cpp, and MLX edges it out at the same quant.
For the model itself, the lever that matters is active parameters, not total. A Mixture-of-Experts model only fires a slice of its weights on each token, so a 35B model with 3B active actually generates faster than a 27B dense model that has to push all 27B every token — while carrying more total capacity for quality. I settled on unsloth/Qwen3.6-35B-A3B in MLX 4-bit: about 21.6 GB on disk, 3B active per token, the strongest and the fastest thing I tried.
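A rough way to see why active parameters dominate: single-stream decoding is usually limited by how many weight bytes have to be read per token. The sketch below is back-of-envelope arithmetic, not a benchmark. The parameter counts and 4-bit width come from the paragraph above; the 400 GB/s memory bandwidth is an assumed example value, and KV cache, expert routing, and mixed-precision layers are ignored.

```typescript
// Back-of-envelope decode estimate: gigabytes of weights read per generated token.
// Assumes decoding is memory-bandwidth bound, so results are optimistic upper bounds.

interface ModelShape {
  name: string;
  activeParamsBillions: number; // parameters touched per token
  bitsPerWeight: number;        // quantization width
}

// Gigabytes of weights streamed from memory for one token.
function gbReadPerToken(m: ModelShape): number {
  return (m.activeParamsBillions * m.bitsPerWeight) / 8;
}

// Upper bound on tokens per second for a given memory bandwidth in GB/s.
function maxTokensPerSecond(m: ModelShape, bandwidthGBps: number): number {
  return bandwidthGBps / gbReadPerToken(m);
}

const moe: ModelShape = { name: "35B MoE, 3B active", activeParamsBillions: 3, bitsPerWeight: 4 };
const dense: ModelShape = { name: "27B dense", activeParamsBillions: 27, bitsPerWeight: 4 };

const ASSUMED_BANDWIDTH_GBPS = 400; // hypothetical figure; substitute your own chip's

for (const m of [moe, dense]) {
  const ceiling = maxTokensPerSecond(m, ASSUMED_BANDWIDTH_GBPS).toFixed(0);
  console.log(`${m.name}: ${gbReadPerToken(m)} GB/token, at most ${ceiling} tok/s`);
}
```

Under those assumptions the dense model reads nine times as many weight bytes per token as the MoE, whatever the bandwidth figure. Real throughput will be lower for both, but the ratio is the point.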
Two download gotchas cost me an hour before the model ever loaded, and knowing them should spare you the same hour. Unauthenticated Hugging Face pulls are throttled to a crawl, so set a free token. And Unsloth's repos use Xet storage, whose client kept freezing at 0 MB/s with a pile of half-downloaded chunks; turning Xet off and falling back to plain HTTPS jumped it straight to ~10 MB/s.
export HF_TOKEN=hf_xxx # otherwise the download crawls
export HF_HUB_DISABLE_XET=1 # the real fix — Xet stalls, plain HTTPS doesn't
export HF_HUB_ENABLE_HF_TRANSFER=1
uv tool install mlx-lm --with hf-transfer
With the weights down, you've got a model. Now comes the part that took the afternoon.
Where it all went wrong: Codex speaks an API your local server doesn't
My first instinct was the obvious one. MLX ships a server — mlx_lm.server — that exposes an OpenAI-compatible endpoint. Codex lets you define a custom model provider pointed at any OpenAI-compatible URL. Point one at the other, done.
Except every agent turn came back 404, and Codex just retried, forever. I sat and watched it 404-loop for forty minutes before I dug into why.
The reason is subtle and it's the single most important thing in this post: it's not about the tool schema, it's about which HTTP API Codex actually speaks. Modern Codex (0.139 in my case) drives a custom model provider over the OpenAI Responses API — it POSTs to /v1/responses. But mlx_lm.server only implements the older Chat Completions API, /v1/chat/completions. Those are two different endpoints. Codex knocks on a door that the MLX server simply doesn't have, gets a 404, assumes a transient failure, and tries again. And again.
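One way to confirm this mismatch before losing forty minutes is to probe both routes directly. The sketch below is a diagnostic idea, not part of Codex or MLX: the function names are hypothetical, the fetch function is injected so the logic can be exercised offline, and treating "any status other than 404" as "route exists" is a heuristic. The base URL in the usage comment is an assumption about where your server listens.

```typescript
// Probe which OpenAI-style endpoints a local server actually implements.

type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ status: number }>;

const ENDPOINTS = {
  responses: "/v1/responses",              // what Codex POSTs to for a custom provider
  chatCompletions: "/v1/chat/completions", // what mlx_lm.server implements
} as const;

type EndpointName = keyof typeof ENDPOINTS;

async function probeEndpoints(
  baseUrl: string,
  doFetch: FetchLike,
): Promise<Record<EndpointName, boolean>> {
  const found = { responses: false, chatCompletions: false };
  for (const name of Object.keys(ENDPOINTS) as EndpointName[]) {
    try {
      const res = await doFetch(baseUrl + ENDPOINTS[name], {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: "{}", // deliberately minimal: only the route's existence matters
      });
      // 404 means the route is missing; a 400 or 422 still proves it exists.
      found[name] = res.status !== 404;
    } catch {
      found[name] = false; // server unreachable
    }
  }
  return found;
}

function diagnose(found: Record<EndpointName, boolean>): string {
  if (found.responses) return "Responses API present: Codex can reach this server.";
  if (found.chatCompletions) return "Chat Completions only: expect Codex to 404 on /v1/responses.";
  return "Neither endpoint answered: check that the server is running and the base URL is right.";
}

// Usage against a live server (base URL is an assumption, adjust to yours):
// probeEndpoints("http://127.0.0.1:8080", fetch).then((f) => console.log(diagnose(f)));
```

A "Chat Completions only" result is the signature of the 404 loop described above.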
There used to be an escape hatch: wire_api = "chat" in the config, which forced Codex back onto Chat Completions. I went looking for it and found it had been removed in 0.139. Setting it now fails at startup with the error 'wire_api = "chat" is no longer supported', and Codex refuses to run. The old advice the internet gives you is now a hard error.
Two more failures showed up while I was flailing, and they're the kind of thing that sends you down the wrong debugging path entirely. One was a reconnect loop — rmcp ... Auth(AuthorizationRequired), fired on every single run — that turned out to be a stale remote MCP server still listed in my Codex config, trying and failing to authenticate before any real work could start. The other was an older LM Studio build flatly rejecting Codex's tool schema with tools.N.type invalid. Neither has anything to do with the model; both look, at a glance, like the model is broken.
So the honest conclusion was: a bare MLX server cannot run Codex's agent loop, full stop. The mismatch is structural, not a missing flag.
The fix that actually worked: LM Studio plus Codex's --oss path
The breakthrough was realizing I'd been trying to make the wrong server speak the right protocol. Instead of fighting mlx_lm.server into the Responses API, I served the same MLX model through LM Studio and used Codex's built-in local-provider path — --oss --local-provider lmstudio — which is designed to speak chat-completions and handle tool calls correctly.
And it worked, end to end. Codex's agent harness, running entirely on the local Qwen3.6 MLX, created a file on disk — about 20k tokens of agent loop, not one of them leaving the machine.
Four things all had to be true at once, and each one had failed independently before they lined up:
1. The MLX model has to be served by LM Studio, and it does not need converting to GGUF. The nice part is you don't re-download anything: the Hugging Face MLX snapshot can be symlinked straight into ~/.lmstudio/models/<org>/<repo> and LM Studio picks it up as-is.
2. It has to load with a big context window. At 8192 tokens, Codex's own agent system prompt overflows before your task even starts, and you get the cryptic error "tokens to keep is greater than the context length". Loading at 32768 fixes it: lms load <id> --gpu max --context-length 32768.
3. Any auth-failing MCP server has to come out of ~/.codex/config.toml first, or that rmcp Auth(AuthorizationRequired) reconnect loop will stall every run before the model gets a turn.
4. LM Studio itself has to be current, because the older builds are the ones that reject Codex's tool schema.
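Because each of the four conditions above fails with a different and misleading symptom, it can help to write them down as an explicit checklist. The sketch below is purely illustrative: the LocalSetup shape is invented for this example and nothing in it reads real LM Studio or Codex state. The 32768 threshold is simply the value that worked here, since the true minimum context length is not known.

```typescript
// Hypothetical preflight checklist for running Codex against a local model.

interface LocalSetup {
  servedBy: "lmstudio" | "mlx_lm.server" | "other";
  contextLength: number;
  mcpServers: { name: string; authOk: boolean }[];
  lmStudioIsCurrent: boolean;
}

const KNOWN_GOOD_CONTEXT = 32768; // 8192 overflowed; the real minimum is unknown

function preflight(s: LocalSetup): string[] {
  const problems: string[] = [];
  if (s.servedBy !== "lmstudio") {
    problems.push("Serve the model through LM Studio; a bare MLX server will 404.");
  }
  if (s.contextLength < KNOWN_GOOD_CONTEXT) {
    problems.push(`Context ${s.contextLength} may overflow on the agent system prompt.`);
  }
  for (const m of s.mcpServers.filter((x) => !x.authOk)) {
    problems.push(`MCP server "${m.name}" fails auth and will stall every run.`);
  }
  if (!s.lmStudioIsCurrent) {
    problems.push("Update LM Studio; older builds reject the tool schema.");
  }
  return problems; // an empty list means all four conditions hold
}
```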
Once those held, the smoke test is genuinely satisfying to run:
lms load qwen3.6-35b-a3b-ud-mlx --gpu max --context-length 32768 && lms server start
codex exec --oss --local-provider lmstudio -m qwen3.6-35b-a3b-ud-mlx \
  -C /tmp/t --sandbox workspace-write --skip-git-repo-check -c 'mcp_servers={}' \
  "Create hello.txt containing: hi. Then stop." # → writes the file, zero cloud
There's one model-specific wrinkle worth flagging, because it'll make you think the model failed when it didn't. Qwen3.6 is a thinking model: it writes its reasoning into a separate channel and only emits the real answer after it closes </think>. If you starve it on max_tokens, it spends the entire budget thinking and hands back empty content. Give it room to finish the thought.
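A small guard makes this failure visible instead of mysterious. The sketch below assumes the reasoning arrives inline between <think> tags in the raw text; some servers return it in a separate response field instead, in which case the same check applies to that field. The function and type names are hypothetical.

```typescript
// Separate a thinking model's reasoning from its answer, and flag the case
// where the token budget ran out before </think> was emitted.

interface ThinkingSplit {
  reasoning: string;
  answer: string;
  truncated: boolean; // true when reasoning never closed, so the answer is empty
}

function splitThinking(raw: string): ThinkingSplit {
  const OPEN = "<think>";
  const CLOSE = "</think>";
  const openAt = raw.indexOf(OPEN);
  const closeAt = raw.indexOf(CLOSE);

  if (closeAt >= 0) {
    // Some chat templates put the opening tag in the prompt, so it may be absent here.
    const start = openAt >= 0 && openAt < closeAt ? openAt + OPEN.length : 0;
    return {
      reasoning: raw.slice(start, closeAt).trim(),
      answer: raw.slice(closeAt + CLOSE.length).trim(),
      truncated: false,
    };
  }
  if (openAt >= 0) {
    // Opened but never closed: the whole budget went to thinking.
    return { reasoning: raw.slice(openAt + OPEN.length).trim(), answer: "", truncated: true };
  }
  // No tags at all: treat the whole text as the answer.
  return { reasoning: "", answer: raw.trim(), truncated: false };
}
```

When truncated comes back true, the sensible reaction is to retry with a larger max_tokens rather than to conclude the model failed the task.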
Why this is worth the trouble: route the simple stuff for free
With local Codex working, the payoff is a routing decision rather than an all-or-nothing one. You don't have to choose between "always cloud" and "always local." You split by what the task actually needs.
The hard, high-stakes, genuinely agentic work — the multi-step changes where a wrong move is expensive — stays on cloud Codex, because that's where the capability gap is real and worth paying for. But the long tail of simpler tasks — generate a file, rename across a module, summarize a run, draft a first pass, judge whether an output meets a bar — can run on the local model for nothing, offline, with no data leaving the machine. The harness is identical. Only the brain, and the bill, changes.
The trick to doing this responsibly is to measure which tasks survive the swap rather than guessing. In my setup that's an eval pass: run a node on both engines, compare, and only graduate it to local once it demonstrably matches cloud quality on that task. Cost routing without measurement is just hoping; with a cheap eval in front of it, it's a decision you can defend.
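That graduation rule can be written down in a few lines. This is a minimal sketch of one possible rule, not the eval pass described above: scores are assumed to lie in [0, 1] and come from some grader, and the type names, the minRuns guard, and the maxMeanGap tolerance are all assumptions made for illustration.

```typescript
// Decide per task type whether the local engine may replace cloud,
// based on paired eval runs of the same job on both engines.

interface PairedRun {
  task: string;
  localScore: number;
  cloudScore: number;
}

interface GraduationRule {
  minRuns: number;    // refuse to decide on too little evidence
  maxMeanGap: number; // local may trail cloud by at most this much on average
}

type Engine = "local" | "cloud";

// Average of (cloud - local); positive means local is worse.
function meanGap(runs: PairedRun[]): number {
  const total = runs.reduce((sum, r) => sum + (r.cloudScore - r.localScore), 0);
  return total / runs.length;
}

function routeFor(task: string, history: PairedRun[], rule: GraduationRule): Engine {
  const runs = history.filter((r) => r.task === task);
  if (runs.length < rule.minRuns) return "cloud"; // unmeasured work stays on cloud
  return meanGap(runs) <= rule.maxMeanGap ? "local" : "cloud";
}
```

Defaulting to cloud when there is not enough evidence keeps the failure mode expensive rather than wrong.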
The fun add-on: a DSPy loop where generation is free
Here's where it gets genuinely useful beyond saving a few dollars on housekeeping tasks. Once you have a local model that the harness can drive for free, you can build a DSPy-style prompt-optimization loop and run the expensive half of it at no cost.
DSPy optimization is, at heart, a loop: propose candidate prompts, run them, score which is better, promote the winner. The quiet insight is that the cost and the risk live in different halves. The expensive half is proposing and running — you do it hundreds of times per round — but it's forgiving, because you generate several candidates and just keep the best, so any single bad generation is harmless. The cheap half is scoring; you do it only a handful of times, but it's unforgiving, because a biased scorer quietly optimizes you toward the wrong prompt and every later round inherits the mistake.
So you put each half where it belongs. Generation runs on the local model for free. Scoring runs as a Codex LLM-judge — billed, but tiny, and pointed at your most consistent model because it's the one decision that compounds.
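The split can be sketched as a single optimization round with the two halves behind separate functions. This is an illustration of the shape, not DSPy's API and not the loop from this project: the Engines interface is hypothetical, and the model calls are stubbed as plain synchronous functions where real ones would be asynchronous.

```typescript
// One round of a propose / run / score loop with each half on its own engine.

type Verdict = "A" | "B" | "tie";

interface Engines {
  propose: (currentPrompt: string, n: number) => string[];       // local: candidate prompts
  run: (prompt: string, input: string) => string;                // local: execute a candidate
  judge: (input: string, outA: string, outB: string) => Verdict; // billed: pairwise score
}

// Returns the prompt to carry into the next round.
function optimizeRound(
  champion: string,
  evalInputs: string[],
  candidates: number,
  e: Engines,
): string {
  let best = champion;
  for (const challenger of e.propose(champion, candidates)) {
    let wins = 0;
    let losses = 0;
    for (const input of evalInputs) {
      // A is always the current best, B is the challenger.
      const verdict = e.judge(input, e.run(best, input), e.run(challenger, input));
      if (verdict === "B") wins++;
      else if (verdict === "A") losses++;
    }
    // Promote only when the challenger wins more decisive verdicts than it loses.
    if (wins > losses) best = challenger;
  }
  return best;
}
```

In this shape the judge is called once per candidate per eval input, so a small evalset is what keeps the billed half tiny, while propose and run can be called as often as you like.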
That judge needs one piece of care to be trustworthy: debias it for position. LLM judges have a real, measurable preference for whichever output they read first. So you judge twice, swapping the order between rounds, and only trust a verdict that survives the swap:
// Round 1: A first. Round 2: B first (positions swapped, then remapped).
let verdict;
if (round1Winner === round2Winner && round1Winner !== "tie") {
  verdict = round1Winner; // it agreed with itself under reordering, so trust it
} else {
  verdict = "tie"; // it flipped when you flipped the order, so that was bias
}
If reordering the two outputs changes the answer, the "preference" was an artifact of position, not quality, and collapsing it to a tie throws that noise away. It's ten lines, and it's the difference between a metric and a coin flip. I also tell the judge to grade depth and specificity rather than the mere presence of the required pieces, and to default to a fail unless the work is genuinely good — a lenient judge will wave box-ticking work straight through and your optimizer will happily learn to produce more of it.
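The comparison above leaves out the step that is easiest to get wrong, which is remapping the second round's answer back to A and B after the positions were swapped. A fuller sketch, assuming a judge that reports which position won (the PositionalJudge type and function name are hypothetical):

```typescript
// Position-debiased pairwise verdict: judge twice with the order swapped,
// remap both answers to A/B, and keep only a verdict that survives the swap.

type Side = "A" | "B" | "tie";

// A raw judge sees two outputs in order and says which position won.
type PositionalJudge = (first: string, second: string) => "first" | "second" | "tie";

function debiasedVerdict(judge: PositionalJudge, outA: string, outB: string): Side {
  // Round 1: A is shown first, so "first" means A.
  const r1 = judge(outA, outB);
  const round1Winner: Side = r1 === "first" ? "A" : r1 === "second" ? "B" : "tie";

  // Round 2: B is shown first, so "first" now means B.
  const r2 = judge(outB, outA);
  const round2Winner: Side = r2 === "first" ? "B" : r2 === "second" ? "A" : "tie";

  if (round1Winner === round2Winner && round1Winner !== "tie") return round1Winner;
  return "tie"; // disagreement under reordering is treated as position bias
}
```

A judge that always prefers whatever it reads first yields A in round one and B in round two, so it collapses to a tie, which is exactly the noise this is meant to discard.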
The last bit of glue is unglamorous but it's what keeps the loop alive: structured output. Every candidate result and every verdict has to parse into a record, and a thinking model will occasionally wrap its JSON in prose or a markdown fence. The defense is to ask for JSON mode but never trust it. Run every response through a small salvage extractor that slices from the first opening bracket to the last closing bracket and validates the result before accepting it, and treat anything that fails as a dropped case rather than a crash:
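import (
	"encoding/json"
	"strings"
)

// extractJSON returns the JSON value embedded in text, or nil if none validates.
func extractJSON(text string) json.RawMessage {
	t := strings.TrimSpace(text)
	// Fast path: the whole response is already valid JSON.
	if json.Valid([]byte(t)) {
		return json.RawMessage(t)
	}
	// Salvage path: slice from the first opener to the last closer, which
	// strips leading prose or a markdown fence, then validate the candidate.
	first := strings.IndexAny(t, "[{")
	last := strings.LastIndexAny(t, "]}")
	if first >= 0 && last > first {
		if cand := t[first : last+1]; json.Valid([]byte(cand)) {
			return json.RawMessage(cand)
		}
	}
	return nil // a failed parse is a skipped case, not an exception
}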
The evalset and the history of promotions are stored the same defensive way — newline-delimited JSON, appended as the loop runs, read back line by line so one malformed line drops itself instead of poisoning the whole file. It's not elegant. It's what survives a model that hiccups once an hour.
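The tolerant read side of that storage can be sketched as follows. The Promotion record shape is a hypothetical example, since the real fields are not shown here, and actual file I/O is left out so that only the parsing behavior is illustrated.

```typescript
// Append-only NDJSON with a tolerant reader: a malformed line is dropped and
// counted, never allowed to abort the whole read.

interface Promotion {
  round: number;
  prompt: string;
}

function isPromotion(v: unknown): v is Promotion {
  if (typeof v !== "object" || v === null) return false;
  const r = v as Record<string, unknown>;
  return typeof r.round === "number" && typeof r.prompt === "string";
}

// JSON.stringify escapes embedded newlines, so one record is always one line.
function toNdjsonLine(p: Promotion): string {
  return JSON.stringify(p) + "\n";
}

function readNdjson(text: string): { records: Promotion[]; dropped: number } {
  const records: Promotion[] = [];
  let dropped = 0;
  for (const line of text.split("\n")) {
    if (line.trim() === "") continue; // blank lines are not errors
    try {
      const parsed: unknown = JSON.parse(line);
      if (isPromotion(parsed)) records.push(parsed);
      else dropped++; // valid JSON but the wrong shape
    } catch {
      dropped++; // truncated or garbled line
    }
  }
  return { records, dropped };
}
```

Returning the dropped count alongside the records makes it possible to notice when the hiccup rate climbs, without ever turning one bad line into a crash.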
What I'd tell you if you were about to try this
Most of the difficulty here isn't where you'd expect. The model runs fine; the optimization loop is conceptually simple; the actual quicksand is the seam between Codex and a local server. So if you take three things from this, take these.
The make-or-break is which API Codex speaks, not the tool schema or the model. If you're seeing endless 404s, you're almost certainly serving Chat Completions to a Codex that wants the Responses API — and the old wire_api = "chat" workaround is gone. Serving the MLX model through LM Studio and using Codex's own --oss --local-provider lmstudio path sidesteps the whole problem, which is why that's the configuration I landed on.
Treat local versus cloud as a routing decision you measure, not a flag you flip. Send the simple, forgiving, private work to the free local engine and keep the hard, high-stakes work on cloud Codex — and let an eval, not a hunch, decide which task is which.
And if you build the DSPy loop on top, spend your reliability budget on the judge. Generation is cheap and forgiving and belongs wherever it's free. Judgment is the decision that compounds, so make it your strongest, most consistent model, and always debias it for order. That's the whole game: free where you can afford to be wrong, careful where you can't.


