Can a local 9B model compete with state-of-the-art systems for data analysis?

The real question is whether a Qwen 3.5 9B model running locally can handle serious data-analysis work, not just toy SQL demos. This post turns that question into a measured multi-turn SQL benchmark before making state-of-the-art claims.

Cover Image for Can a local 9B model compete with state-of-the-art systems for data analysis?

Can a local 9B model compete with state-of-the-art systems for data analysis?

This is the core question driving this project. I'm not interested in seeing if a small model can generate toy SQL queries or look impressive in a carefully curated demo. The real test is whether a model running on a single local workstation can handle the heavy lifting we usually reserve for frontier API models. Can it read a complex database schema, carry context across multiple follow-up questions, nail the right joins, normalize messy user inputs, and actually return trustworthy data?

I honestly don't know the answer yet, which is exactly why I'm building this out.

I chose SQL generation as the proving ground because it eliminates the wiggle room you get with natural language tasks. A SQL query either executes and returns the correct rows, or it fails. Multi-turn SQL pushes this further by mimicking how real analysts work: a user asks a question, gets a result, immediately asks for a breakdown or filter, and expects the system to remember everything they just talked about.

This first post isn't about claiming a victory lap for a local Qwen 3.5 9B model against GPT-4o, and I won't be dropping a final BIRD-Interact score just yet. Instead, this post is about establishing the engineering pipeline needed to make that comparison honest. We have to walk before we can run: preparing the multi-turn data, fine-tuning LoRA adapters, serving them via vLLM, executing the SQL locally against SQLite, and—crucially—isolating genuine progress from accidental data leakage.

Here is how I'm breaking down the journey:

Post The Focus
1. Benchmark Setup Setting up a local evaluation loop we can trust. (This post)
2. Local Training and Serving The nuts and bolts of the RTX 5090, Unsloth, and vLLM pipeline.
3. The Dataset Bridge Reconciling BIRD-Interact, BIRD mini-dev, CoSQL, SParC, and synthetic SQL.
4. Analyzing Fine-tuning Runs Which training tweaks actually move the needle?
5. Deconstructing Failures Why do schema linking, joins, and dialogue state break down in practice?
6. Building a Planner System Moving from an "oracle" cheat to a production-ready query planner.

Why one-shot SQL isn't enough

Single-turn text-to-SQL is already a notoriously difficult problem. The model has to ingest a schema, parse a standalone question, and write a flawless query:

Which airlines fly to Boston?

But multi-turn SQL fundamentally changes the game. Real users rarely restate their entire intent. They speak in shorthand:

Which airlines fly to Boston?
Only show the ones from France.

That second question is entirely dependent on the first. For an agent to succeed here, it has to carry forward the route target and the airline entity, append a country filter, select the correct tables, and preserve the original output shape. If it drops the ball on any of these, it might generate perfectly valid SQL that answers the completely wrong question.

When building this out, I ran into several concrete failure modes:

  • Filter amnesia: The model forgets a constraint established in a previous turn.
  • Table misattribution: Applying the correct value to the wrong table entirely.
  • Join fan-out: Selecting a join path that inadvertently multiplies the rows.
  • Projection drift: Returning a raw count when the user specifically asked for names.
  • Value grounding failures: Querying for a string that simply doesn't exist in the database (e.g. 'France' vs 'FR').
  • History misalignment: Resolving a follow-up query against the wrong turn in a long conversation.

This is why we can't treat multi-turn SQL as just "single-turn SQL plus a chat history string." The model has to actively plan before it generates code.

Isolating the datasets

I'm pulling from several SQL datasets, but tossing them all into one massive fine-tuning pile is a bad idea. Each dataset serves a specific engineering purpose:

Dataset Role in this project
BIRD-Interact Our north star. This is the interactive benchmark closest to real-world data analysis.
BIRD mini-dev Single-turn BIRD-style rows, great for testing our execution harness but not a substitute for dialogue.
CoSQL Our current local proxy. It provides multi-turn dialogues over Spider-style databases and runs easily against SQLite.
SParC Provides related, context-dependent SQL data to bolster the training signal.
Synthetic SQL Single-turn SQL paired with complex schemas to prevent the model from overfitting to simple database layouts.

For the benchmarks reported below, I consistently evaluated against the exact same proxy slice: 100 turns across 32 CoSQL dialogues.

I deliberately kept this slice small so I could iterate quickly while still exposing the system to complex follow-up failures. I also used "teacher-forced" dialogue history—meaning when the model is evaluated on turn 3, it's given the correct, ground-truth SQL from turns 1 and 2, rather than its own potentially broken history. This tells us whether the model is capable of utilizing clean context, deferring the harder problem of error recovery for later.

Testing the fine-tuning loop

The fine-tuning runs I'm highlighting here are small, exploratory LoRA experiments, not final training runs. I used balanced slices of 192 and 384 rows from our dataset mix just to verify that the training loop could actually move the needle.

Starting out, the base Qwen 3.5 9B model achieved a meager 0.370 strict execution accuracy. After a 100-step LoRA fine-tune, that jumped to 0.530.

Run Training rows Steps Strict execution accuracy
Base unsloth/Qwen3.5-9B 0 0 0.370
50-step LoRA 192 50 0.420
100-step LoRA 192 100 0.530
384-row LoRA 384 50 0.420

While a 0.530 accuracy is a real improvement, it's completely unviable for production. If a model fails on nearly half of its turns, the user's conversation is going to derail constantly. However, these numbers gave me exactly what I needed: proof that the training and evaluation loop was sensitive. The fact that the scores shifted gave me a reliable signal to start debugging.

Moving beyond strict scoring

I quickly realized that strict execution scoring was masking part of the picture. The scorer would brutally fail a query simply because the model chose a different output alias than the reference SQL, even if the returned rows and values were identical.

To fix this, I implemented a value-aware execution scorer. It still rigorously checks row values, duplicate counts, and order-sensitive outputs, but it forgives harmless cosmetic differences in column labels.

Under this fairer metric, my best non-oracle run maxed out at 0.640. This is the number I consider our true production-style proxy ceiling. It represents a run using only the inputs a real system would have access to: the question, conversation history, schema context, and standard prompt instructions.

Proxy SQL accuracy on the fixed CoSQL slice Bar chart showing base strict accuracy 0.370, 100-step LoRA strict accuracy 0.530, best non-oracle value accuracy 0.640, oracle prompt value accuracy 0.850, and oracle-trained value accuracy 0.890. The useful result is the gap, not the top number Fixed 100-turn CoSQL proxy. The last two bars use gold SQL-derived planning labels. 0.370 0.530 0.640 0.850 0.890 Base 100-step Best real Oracle Oracle train strict strict value value, gold plan value, gold plan
The first three bars are production-style proxy runs. The last two are diagnostics that use labels derived from the reference SQL.

The ceiling test: Diagnosing the bottleneck

Once I hit that 0.640 ceiling, I decided to run a diagnostic test that you would never be allowed to run in production.

I parsed the ground-truth reference SQL for every turn and extracted the exact tables, columns, joins, query skeleton, projection shape, and duplicate-row policy. I then fed these extracted labels directly to the model as planning hints.

This is what I call an oracle run—we are intentionally leaking the answer key. It's completely useless for production, but it's an incredible diagnostic tool. It isolates a very specific question: If the schema planning was already perfect, how well could the model actually write the SQL?

The results were eye-opening. A prompt-only oracle run hit 0.850 value accuracy, and a model specifically trained on that oracle-labelled format reached a massive 0.890.

This doesn't mean the project is solved, but it proves something vital: the SQL generator itself is actually quite robust when it isn't burdened with schema selection and projection logic. The core missing piece isn't just better prompting; it's a dedicated planner that can make those structural decisions without peeking at the answer key.

Formalizing the planner

Based on this insight, I restructured the pipeline to make the planner an explicit, measurable entity. Rather than asking the model to just "think step by step" in the prompt, we enforce a structured planning phase before any SQL is written.

Each evaluated turn now carries:

  • A gold_plan, extracted from the reference SQL and used purely for training/diagnostics.
  • A predicted_plan, generated from scratch without seeing the reference SQL.
  • Distinct scoring metrics for how well the planner chose tables, columns, joins, aggregations, and grouping.

To establish a baseline, I threw together a quick, naive lexical planner. It simply reads the visible schema, looks at the user's question, and attempts to guess the tables and columns using raw string matching.

As expected, it struggles:

Planner source Turns Dialogs Oracle prompt rows Macro score Table F1 Column F1 Skeleton F1
lexical_schema_baseline 100 32 0 0.571 0.599 0.117 0.648
Planner baseline metrics on the fixed CoSQL slice Bar chart showing macro planner score 0.571, table F1 0.599, column F1 0.117, skeleton F1 0.648, join F1 0.790, group-by F1 0.830, and selected-count match 0.180. The planner can now fail in specific ways Lexical non-oracle planner. 100 CoSQL turns, 32 dialogs, zero oracle prompt rows. 0.571 0.599 0.117 0.648 0.790 0.830 0.180 Macro Table Column Skeleton Join Group-by Count
The planner sometimes finds the rough table area, but column selection is still incredibly weak. This highlights exactly where our next engineering push needs to be.

This decoupling is a huge win for observability. Now, if the planner picks the wrong table, we flag it as a planning failure. If the planner is spot-on but the SQL syntax is mangled, that's a generation failure. If both are perfect but execution still fails, we know we have an issue with value normalization or duplicate rows. By splitting these concerns, we avoid collapsing every error into a useless bucket of "bad SQL."

The road ahead

With this groundwork laid, the path forward for our data-analysis claim is clear:

  1. Continue iterating on the CoSQL proxy slice for tight local feedback loops.
  2. Upgrade the naive planner to a robust, non-oracle system that can accurately predict tables, columns, joins, and value maps.
  3. Feed the predicted plans into the SQL generator and formally score planning distinct from execution.
  4. Establish baselines with heavy, hosted models on this exact same proxy slice.
  5. Finally, migrate this entire evaluation contract over to the rigorous BIRD-Interact benchmark.

At this stage, the data engineering effort is just as critical as the model fine-tuning. Up next, I'll be building out a value index, entity-resolution labels, join/fan-out fixtures, and a schema validator to catch bad column references before they trigger generic SQL failures.

While the local 9B model definitely improved through fine-tuning, the real victory of this first phase is that our evaluation loop successfully highlighted what the system cannot do yet.

To summarize our honest, current state:

  • 0.640 value accuracy is our production-ready proxy baseline.
  • 0.890 value accuracy is our oracle-assisted diagnostic ceiling.
  • The planner baseline is finally measurable, but needs significant work.
  • The definitive BIRD-Interact and hosted-model showdowns are up next.

A local 9B model might very well become a powerhouse for serious data analysis, but we have to build the evidence methodically: perfect the proxy, nail the planner, establish the baselines, and then tackle the final benchmark.

Reference Material

Recommended

Subscribe to new posts