What if a 9B model could beat Sonnet at SQL?

It started as a thought experiment: multi-turn benchmarks expose a surprising weakness in frontier models. How close could a fine-tuned local model get?

Dan
Dan
5 min read
Cover Image for What if a 9B model could beat Sonnet at SQL?
Table of Contents

I was building a data agent. The kind of thing where you type a question in plain English and it writes SQL, runs it, and shows you a chart. The demo worked great. Single question, single query, clean result. Ship it.

Then I tried using it the way a real analyst would.

Show revenue by country.
Only France.
Now break it down by month.
That looks empty. Try the country code instead.

Four turns. By the third one, the model had forgotten what we were looking at. By the fourth, it was writing SQL against a table it had already abandoned. The demo was a lie. Not because the model couldn't write SQL — it clearly could. But because it couldn't hold a conversation about data.

That failure sent me down a rabbit hole.

The benchmark gap nobody talks about

When you hear that frontier models score 85% on text-to-SQL, that number almost certainly comes from a single-turn benchmark. One question, one answer, fresh context. Spider and BIRD are the gold standards, and they've driven real progress.

But I kept pulling at the thread: what happens when we test models the way analysts actually work? Not one question at a time, but in sequences. Follow-ups. Refinements. "Wait, that doesn't look right — try it this way."

That's where SParC and CoSQL come in. SParC tests whether a model can handle sequences of related questions over databases it has never seen. CoSQL goes further — it's actual dialogue, with ambiguity, clarification requests, and all the messiness of a real conversation between an analyst and a SQL assistant.

And then there's BIRD-Interact, which argues that even those benchmarks are too polite. Real assistants need to recover from execution errors, ask for clarification, and deal with the fact that their own earlier answers might have been wrong.

Here's the thing that got me: frontier models that look dominant on single-turn benchmarks get dramatically less impressive on multi-turn ones. The gap isn't small. It's the kind of gap that makes you wonder whether we've been measuring the right thing.

So I started a thought experiment. If multi-turn is where the big models stumble, and if the problem is more about behavior than raw intelligence — keeping context, carrying filters, resolving references — could a small model be fine-tuned to close that gap?

Could a 9B model running on my desk beat Sonnet at the thing that actually matters?

Setting up the experiment

I grabbed Qwen 3.5 9B as my base. Why Qwen? It runs on consumer hardware, Unsloth has good support for it, and the instruction-tuned variant already knows what SQL looks like. The training ran on a single RTX 5090 with bf16 LoRA — rank 32, alpha 64, all the standard attention and MLP projections. Nothing exotic. About four hours of compute for each run.

For evaluation, I set up a CoSQL proxy: 100 turns spread across 32 real dialogs. The blog results I'm reporting use a clean subset of 43 follow-up turns from 12 of those dialogs, chosen because every run used the same rows, the same scorer, and the same protocol.

One crucial design choice: I used generated-history rollout, not teacher-forced history. In teacher-forced evaluation, the model gets perfect prior turns — it sees what the ideal previous SQL should have been. That's a useful diagnostic, but it's a fantasy. In practice, your model sees its own earlier output. If turn 2 was wrong, turn 3 has to deal with that. Generated-history makes the numbers uglier, but it's the only honest way to test a real assistant.

For the frontier baseline, I ran Claude Sonnet 4.6 through OpenRouter on the exact same rows, same scorer, same protocol. Same rules for both sides.

Five ideas, one clear signal

I didn't just try one training strategy. I was exploring a research question, and that meant testing different hypotheses about what the model needs to learn.

DIN-SQL had shown that decomposing the text-to-SQL problem into subtasks could help. That felt right intuitively — writing SQL isn't one skill. It's figuring out which tables to use, understanding what values look like in the database, grasping what the user actually wants, generating the code, and catching mistakes. All tangled together. The question was which piece matters most for multi-turn conversations.

Here's what I tried:

Direct SQL — the control group. Take the user's question and the database schema, train the model to produce SQL directly. No intermediate steps, no clever representations. Just question in, query out. If this doesn't improve over the base model, nothing else matters.

Semantic decomposition — before writing any SQL, first have the model describe what the user is asking for in structured terms: which entity, which metric, which filter, which grouping. The idea is that understanding "break it down by month" depends on first understanding what "it" refers to from the previous turn. That's a meaning problem before it's a SQL problem. The training data included a Cube-inspired semantic model (entities, dimensions, measures, join hints) derived from the database schema, so the model could learn to map user language to governed business concepts before touching SQL.

Metric DSL — a small intermediate language for business metrics. Instead of going straight to SUM(orders.amount), the model would first emit something like MEASURE(revenue) BY customer_country, and a compiler would expand that into the actual SQL. Attractive in theory, because when an analyst says "show me revenue," they're thinking about a business metric, not a column name. The model should preserve that intent and let something else handle the SQL plumbing.

Behavior recovery — in real use, the model will sometimes write bad SQL. The next turn has to deal with that. So instead of always training on perfect conversation histories, I generated training examples where earlier turns contained the model's actual (sometimes wrong) SQL output. The idea: teach the model to recognize when a previous query returned empty results or wrong data, and adjust. Show it the messy states it'll actually face in production.

Schema selection — this one was a deliberate cheat, designed to answer a diagnostic question. I took the correct reference SQL for each question, parsed it to extract exactly which tables, columns, and joins were needed, then fed those as hints in the prompt before asking the model to write SQL. Obviously you can't do this in production — you'd need to already know the answer to give the hint. But it tells you something critical: if the model already knew where to look in the database, how good would its SQL be?

The results told a clear story.

What the numbers say

Run Value accuracy Strict accuracy What it means
Claude Sonnet 4.6 (via OpenRouter) 0.674 0.395 The frontier. Same questions, same scorer.
Qwen 9B + schema hints from correct answer 0.651 0.465 Cheating — told which tables and columns to use. Not deployable, but revealing.
Qwen 9B + semantic training (50 steps) 0.581 0.326 Best result from a fair, deployable local model.
Qwen 9B + direct SQL training (50 steps) 0.581 0.302 Same value accuracy, slightly less precise output shape.
Qwen 9B out of the box 0.558 0.302 No training, just the base model with a SQL prompt.
Qwen 9B + recovery training 0.558 0.302 Trained on messy histories. Didn't help yet.

Before reading into these, a note on the metrics. Value accuracy asks: did the SQL return the right data? It's lenient about column aliases — if you called the column total instead of revenue but got the right numbers, that's a pass. Strict accuracy also cares about output shape. Both matter, but value accuracy is the more meaningful measure of whether the model is actually useful.

Some of these results are what you'd expect. Some are not.

The clean local lift is real but small. The best fine-tuned adapters (semantic and direct SQL) both reach 0.581 value accuracy, up from 0.558 for raw Qwen. That's a 2.3 percentage point improvement. Not nothing — it proves the model can learn from the task-specific data — but it doesn't close the gap to Sonnet's 0.674.

Behavior recovery was a disappointment. Conceptually, it's the right idea. In practice, it tied raw Qwen. The training data didn't teach the useful repair moves yet. That's a data quality problem, not a conceptual one, but it means recovery training is still a hypothesis, not a result.

The schema hints result is the real story. When you cheat and tell the model which tables and columns it needs, the local model jumps to 0.651 — within two points of Sonnet on value accuracy and actually ahead on strict accuracy (0.465 vs 0.395). You can't deploy this — the hints come from parsing the correct answer, which you obviously don't have at inference time. But it's the kind of result that changes the direction of your research.

What the schema hints result actually means

Here's what the cheat does concretely. The evaluation code takes the reference SQL (the correct answer), parses it with sqlglot, and extracts the relevant tables, the specific columns used, the join conditions, the projection shape, and grouping. It then injects those as hints into the prompt: "you need the orders table joined to customers on customer_id, selecting amount and country." The model still has to write the actual SQL, but it no longer has to figure out where to look.

Most wrong SQL is plausible SQL. The model doesn't usually produce garbage — it produces a reasonable-looking query against the wrong table, with the wrong join, using a value format that doesn't match what's in the database. "France" instead of "FR". customer_orders instead of orders. A LEFT JOIN where an INNER JOIN was needed.

When you remove that confusion — when the model knows where to look — a 9B local model gets within striking distance of the frontier. The remaining errors are about value grounding (the database stores "FR" but the user said "France"), execution edge cases, and grain issues (accidentally duplicating rows through a many-to-many join). The table-and-column selection problem is the bottleneck, not raw SQL writing ability.

This completely reframes the question. I started asking "can a small model write SQL as well as Sonnet?" and ended up learning that "can a small model pick the right tables?" is the question that actually matters.

The honest evidence boundary

I want to be precise about what I can and can't claim from this work.

I can say: The best clean local fine-tunes slightly beat raw Qwen on the same 43 generated-history rows. Fine-tuning works. The model learns something real from multi-turn SQL data.

I can say: Sonnet 4.6 remains ahead on the clean comparison. The frontier hasn't been reached.

I can say: When the model is told which tables and columns to use (by peeking at the answer), it nearly closes the gap to Sonnet. That tells us where the real problem is.

I cannot say: A local 9B model beats Sonnet. Not yet. Not on this data.

I cannot say: The schema-hints setup is deployable. You can't peek at the answer in production. It's a ceiling test that shows where to push.

I cannot say: Recovery training is proven. It tied the baseline. The idea might be right, but the current training data didn't teach it.

This is a 43-row generated-history slice from CoSQL. It's reproducible, controlled, and honest. It is not a final benchmark claim.

So what was the point?

The thought experiment started with "what if a 9B model could beat Sonnet at SQL?" and the honest answer is: not yet, not cleanly. But the experiment was worth running because of what it revealed about where the gap actually lives.

The gap isn't in SQL syntax. Raw Qwen already produces syntactically valid SQL almost every time — 1.000 syntax accuracy across the board. The gap isn't even in understanding English questions — the model generally knows what you're asking.

The gap is in navigating the database: knowing which tables to query, which columns to select, how to join them, and what the stored values actually look like. And in multi-turn conversations, that problem compounds. You're not just picking the right tables for one question. You're carrying forward context from previous turns, resolving references like "break that down by month," and adapting when a filter that worked on turn 2 needs a different column on turn 4.

That's a learnable problem. It's not a matter of needing a smarter model — it's a matter of needing better training data about which tables matter, how values are stored, and what carries forward across turns. The schema-hints experiment proved that the 9B model can do this work when it knows where to look. The next step is teaching it to figure that out on its own.

What comes next

The next experiment should teach context selection directly:

  1. Which tables matter for this question?
  2. Which columns express the requested metric or dimension?
  3. Which stored values match the user's wording?
  4. What should carry forward from the previous turn?
  5. What changed in the follow-up: filter, grouping, metric, or repair?
  6. When execution returns nothing, what's the next reasonable attempt?

That's the bridge from a small CoSQL checkpoint toward BIRD-Interact-style evaluation. First make the local protocol stable. Then scale the benchmark. Then ask whether the local model can close the hosted gap.

The thought experiment isn't over. It just got more specific.

Code

The full fine-tuning repo: github.com/xdanny/multiturn-sql-finetuning.

Sources