Fine-tuning a local 9B model for multi-turn text-to-SQL

Show revenue by country.
Only France.
Now break it down by month.
That looks empty. Try the country code instead.

This is the four-message conversation I used to test a text-to-SQL agent I was building, and it is the reason this blogpost exists. The first two messages went fine, somewhere around the third the model lost track of which table we had been working with, and it answered the last message with SQL against a table it had abandoned earlier in the conversation. None of this shows up when you test one question at a time, which is how I had been testing, and it is also how most of the published benchmarks test.

Single-turn benchmarks do not contain this failure

When a model reports something like 85% on text-to-SQL, the number usually comes from Spider or BIRD, where every question arrives alone with fresh context. Both benchmarks have driven real progress, and neither contains the failure above, because that failure needs history to exist. The benchmarks that do contain it are newer and less quoted: SParC runs sequences of related questions over unseen databases, CoSQL adds real dialogue with ambiguity and clarification requests, and BIRD-Interact goes furthest and expects the assistant to recover from execution errors and from its own earlier wrong answers. Frontier models that dominate the single-turn leaderboards drop considerably on these, and that drop is what turned my broken demo into an experiment.

What made the experiment tempting is the shape of the failures. Forgetting the active table, dropping a filter established two turns earlier, not resolving what "it" refers to in "break it down by month": these read as behavior problems rather than intelligence problems, and behavior is what fine-tuning is good at shaping. So I set out to measure how much of the multi-turn gap a small local model can close with task-specific training.

The experiment

The base model is Qwen 3.5 9B, picked because it runs on consumer hardware, Unsloth supports it well, and the instruction-tuned variant already writes reasonable SQL. Training was bf16 LoRA on a single RTX 5090, rank 32, alpha 64, on the standard attention and MLP projections, and each run took about four hours.

The evaluation deserves more words than the training, because two decisions there shape every number that follows. The data is a CoSQL proxy of 100 turns across 32 real dialogs, and the results below come from a fixed subset of 43 follow-up turns from 12 of those dialogs, so every run in the comparison saw the same rows, the same scorer, and the same protocol. And the conversation history is generated, not teacher-forced: at every turn the model sees its own previous SQL, right or wrong, instead of the reference answer it should have produced. Teacher forcing is the more flattering setup and a useful diagnostic, but it evaluates an assistant that cannot exist in production, where turn 3 has to live with whatever turn 2 did. Generated history makes every number in this post lower than its teacher-forced equivalent, and it is the setup that matches what I would deploy.

The frontier baseline is Claude Sonnet 4.6 through OpenRouter, on exactly the same rows, scorer, and protocol.

Five training strategies

DIN-SQL made the case that text-to-SQL decomposes into subtasks: choosing tables, understanding what values are stored, understanding what the user wants, generating the code, catching mistakes. If that is right, the interesting question for multi-turn work is which subtask carries the failure, so each strategy below trains a different bet.

#1 Direct SQL

The control: train on the user's question and the database schema, producing SQL directly with no intermediate representation. If this does not beat the base model, nothing else here matters.

#2 Semantic decomposition

Before any SQL, the model first writes down what the user is asking in structured terms, which entity, which metric, which filter, which grouping. The bet is that follow-ups fail at the level of meaning, since resolving "break it down by month" requires first deciding what "it" is, and that is not a SQL skill. The training data included a Cube-inspired semantic model of each database, entities, dimensions, measures, and join hints derived from the schema, so the model could learn to map user language onto governed business concepts before touching SQL.

#3 Metric DSL

An intermediate language for business metrics, where the model first emits something like MEASURE(revenue) BY customer_country and a compiler expands it into the actual SQL. The bet here is that analysts ask for metrics rather than columns, so the model should preserve that intent and let deterministic code handle the plumbing.

#4 Behavior recovery

Training conversations usually contain ideal history, every earlier turn answered perfectly, which is not the world the model will live in. For this run I generated training examples whose earlier turns contain the model's actual, sometimes wrong, SQL, so it could learn to notice an empty result or wrong data and adjust course. The bet is that repair is a trainable skill of its own.

#5 Schema selection

A deliberate cheat, built to answer a diagnostic question rather than to be deployed. I parsed the correct reference SQL of each question with sqlglot and injected the tables, columns, and joins it uses as hints into the prompt, so the model knows where to look because we peeked at the answer. Useless in production, where there is no answer to peek at, but it measures something I wanted to know: if navigation were solved, how good is the model's SQL?

The numbers

Run	Value accuracy	Strict accuracy	Notes
Claude Sonnet 4.6 (via OpenRouter)	`0.674`	`0.395`	The frontier baseline, same questions and scorer
Qwen 9B + schema hints from correct answer	`0.651`	`0.465`	The diagnostic cheat: told which tables and columns to use, not deployable
Qwen 9B + semantic training (50 steps)	`0.581`	`0.326`	Best result from a fair, deployable local model
Qwen 9B + direct SQL training (50 steps)	`0.581`	`0.302`	Same value accuracy, slightly less precise output shape
Qwen 9B out of the box	`0.558`	`0.302`	No training, just the base model with a SQL prompt
Qwen 9B + recovery training	`0.558`	`0.302`	Trained on messy histories, no improvement yet

Value accuracy asks whether the SQL returned the right data and is lenient about aliases, so total instead of revenue with the right numbers still passes. Strict accuracy also cares about output shape. Value accuracy is the closer proxy for whether the answer was useful to the analyst, so it is the one I mostly reason from.

Three things stand out to me in this table. The clean, deployable fine-tunes give a real but small lift: semantic and direct SQL both reach 0.581 from the base model's 0.558, a 2.3 percentage point improvement that shows the training data teaches something, and stays well short of Sonnet's 0.674. Recovery training gave nothing at all, it tied the base model exactly, and my reading is that the generated repair data did not yet contain moves worth learning, so I file the idea as untested rather than disproven. And the cheat run jumped to 0.651 value accuracy, within about two points of Sonnet, with strict accuracy ahead of Sonnet's, 0.465 against 0.395.

Reading the cheat

The reason the schema-hints number matters is what the errors look like without the hints. Going through the failures, the model rarely writes garbage. Most wrong SQL is plausible SQL: a reasonable-looking query against customer_orders when the table is called orders, a LEFT JOIN where the question needs an INNER JOIN, a filter on "France" when the column stores "FR". Syntax was never the problem, since raw Qwen produces syntactically valid SQL essentially every time, 1.000 across the board, and it generally understands the English question too. What it cannot do reliably is find its way around a database it has never seen, and in a conversation that navigation problem compounds, because the model is also carrying context forward, resolving references, and adjusting when the filter that worked on turn 2 needs a different column on turn 4.

So the experiment I designed to test SQL writing ended up measuring something else. Table and column selection is the bottleneck, and once the hints remove it, a 9B model on my desk sits two points from the frontier on this data. That changed what I plan to work on next.

Limits of this comparison

The comparison is controlled, same rows, same scorer, same protocol for every run including Sonnet, and it is small: 43 generated-history follow-up turns from 12 CoSQL dialogs. It supports the claims that fine-tuning lifts a local model a little, that Sonnet 4.6 stays ahead of every deployable local run, and that the gap nearly closes when navigation is handed to the model for free. It does not support claiming that a local 9B beats Sonnet, that schema hints are usable outside the lab, or that recovery training works. I am comfortable with those limits, since what I wanted from the exercise was a direction for the next round of work.

What comes next

If table and column selection is the bottleneck, the next training data should teach it directly:

Which tables matter for this question?
Which columns express the requested metric or dimension?
Which stored values match the user's wording?
What should carry forward from the previous turn?
What changed in the follow-up: filter, grouping, metric, or repair?
When execution returns nothing, what is the next reasonable attempt?

Once that protocol is stable locally, I want to scale the same evaluation toward BIRD-Interact and re-ask the question against the hosted models there.

The fine-tuning code, the training data generation, and the eval harness are in github.com/xdanny/multiturn-sql-finetuning. If your own agent handles single questions and gets lost in follow-ups, I would be curious whether table selection turns out to be the bottleneck on your schema as well.