Google’s Gemini-SQL2 Puts Text-To-SQL Accuracy Into The Enterprise Workflow Test
Google says Gemini-SQL2 reached 80.04% execution accuracy on BIRD, but the gap with human experts keeps the technology in a supervised workflow rather than a fully autonomous data-query layer.

A Database Interface Built Around Execution
Google has introduced Gemini-SQL2 as a text-to-SQL capability for turning natural-language questions into executable database queries.
The system is built on Gemini 3.1 Pro and is aimed at a familiar enterprise problem: business users can describe the answer they need, but the database still requires precise SQL that joins tables, handles dates and returns the correct result.
The important distinction is execution.
Gemini-SQL2 is presented as more than a query-writing assistant that produces plausible syntax.
On the BIRD benchmark, a generated query must run against the database and match the result of the reference SQL.
Google said Gemini-SQL2 reached 80.04% execution accuracy in BIRD's Single Trained Model category, 3.91 percentage points above the earlier Gemini-SQL score of 76.13% disclosed in November 2025.
That makes the announcement a data-product story, not only a model-performance claim.
If natural-language interfaces are going to sit inside analytics tools, finance systems or developer platforms, the useful measure is whether the query gives the right answer when it touches messy data.
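The execution test described above can be illustrated with a minimal sketch. This is not BIRD's official evaluation script: the `orders` table, the helper names `execution_match` and `execution_accuracy`, and the choice to compare results as order-insensitive multisets are all assumptions made for illustration.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, reference_sql):
    """Return True if the predicted query runs and returns the same rows
    as the reference query (row order ignored)."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run counts as a miss
    reference_rows = conn.execute(reference_sql).fetchall()
    # Compare as multisets so that row order does not matter.
    return Counter(predicted_rows) == Counter(reference_rows)


def execution_accuracy(conn, pairs):
    """Share of (predicted, reference) pairs whose results match."""
    matches = sum(execution_match(conn, p, r) for p, r in pairs)
    return matches / len(pairs)


# Hypothetical toy database standing in for a benchmark database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'EU', 10.0), (2, 'EU', 5.0), (3, 'US', 7.0);
""")

pairs = [
    # Different syntax, same result: counts as correct.
    ("SELECT region, SUM(amount) FROM orders GROUP BY region",
     "SELECT region, TOTAL(amount) FROM orders GROUP BY region ORDER BY region"),
    # Plausible syntax, wrong answer: counts as incorrect.
    ("SELECT COUNT(*) FROM orders WHERE region = 'EU'",
     "SELECT COUNT(*) FROM orders"),
    # Does not execute (no such table): counts as incorrect.
    ("SELECT amount FROM order_items", "SELECT amount FROM orders"),
]
accuracy = execution_accuracy(conn, pairs)  # one match out of three pairs
```

The second pair is the case that matters for enterprise use: the query is valid SQL and would pass a syntax check, but it answers a different question.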
BIRD Shows Why Enterprise SQL Is Hard
BIRD is designed to make text-to-SQL systems deal with enterprise-like complexity.
The benchmark includes 95 databases, 37 professional domains and 12,751 question-SQL pairs, with a total data scale of 33.4GB.
It also includes incomplete data and external-knowledge requirements, which are common failure points when a model tries to interpret a business request.
Those conditions matter because enterprise users rarely ask database questions in clean schema language.
A finance team could request regional monthly recurring revenue for customers who left within 90 days of an upgrade.
Turning that into SQL can require joins, window functions and date logic.
A data engineer may describe a transformation in plain language, then review generated BigQuery SQL before using it in a pipeline.
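To show why the finance request is harder than it sounds, here is one possible reading of it as SQL, run against SQLite from Python. Everything in the sketch is hypothetical: the `customers` and `plan_changes` tables, their columns and the sample rows are invented for illustration, and a real schema would likely force different joins. It assumes a SQLite build with window-function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT, churned_on TEXT);
    CREATE TABLE plan_changes (customer_id INTEGER, change_type TEXT,
                               changed_on TEXT, mrr REAL);
    INSERT INTO customers VALUES
        (1, 'EMEA', '2025-03-01'),   -- left 59 days after upgrading
        (2, 'EMEA', '2025-09-01'),   -- left long after upgrading
        (3, 'APAC', NULL),           -- still a customer
        (4, 'APAC', '2025-02-15');   -- left 31 days after latest upgrade
    INSERT INTO plan_changes VALUES
        (1, 'upgrade', '2025-01-01', 200.0),
        (2, 'upgrade', '2025-01-10', 300.0),
        (3, 'upgrade', '2025-02-01', 150.0),
        (4, 'upgrade', '2024-06-01', 80.0),
        (4, 'upgrade', '2025-01-15', 120.0);
""")

query = """
WITH latest_upgrade AS (
    -- Window function: rank each customer's upgrades, newest first.
    SELECT customer_id, changed_on, mrr,
           ROW_NUMBER() OVER (PARTITION BY customer_id
                              ORDER BY changed_on DESC) AS rn
    FROM plan_changes
    WHERE change_type = 'upgrade'
)
SELECT c.region, SUM(u.mrr) AS lost_mrr
FROM customers AS c
JOIN latest_upgrade AS u
  ON u.customer_id = c.id AND u.rn = 1
WHERE c.churned_on IS NOT NULL
  -- Date logic: churn happened within 90 days of the latest upgrade.
  AND julianday(c.churned_on) - julianday(u.changed_on) BETWEEN 0 AND 90
GROUP BY c.region
ORDER BY c.region
"""
rows = conn.execute(query).fetchall()
```

Even in this toy version the request hides decisions the user never stated, such as whether "an upgrade" means the latest upgrade or any upgrade, and which MRR figure to report. A generated query can run cleanly and still pick the wrong interpretation.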
Gemini-SQL2's score suggests stronger handling of that workflow, but it does not remove verification.
BIRD's stated human expert level is 92.96%, leaving a 12.9 percentage point gap.
At roughly 80% accuracy, about one in five benchmark queries still returns a wrong result, so production analytics teams would need review, testing and permission controls around generated queries.
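The arithmetic behind those figures is short enough to check directly. The sketch below uses only the three scores quoted in this article; the variable names are illustrative, and the benchmark miss rate is not a prediction of error rates on any particular production database.

```python
# Scores quoted in the article, in percent.
gemini_sql2 = 80.04
gemini_sql = 76.13
human_expert = 92.96

# Improvement over the earlier Gemini-SQL result, in percentage points.
gain_over_previous = round(gemini_sql2 - gemini_sql, 2)

# Remaining distance to BIRD's stated human expert level.
gap_to_human = round(human_expert - gemini_sql2, 2)

# Share of benchmark questions where the executed result was wrong.
benchmark_miss_rate = round(100 - gemini_sql2, 2)

print(f"gain: {gain_over_previous} pts, gap: {gap_to_human} pts, "
      f"misses: {benchmark_miss_rate}%")
```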
Specialized Training Still Matters
Google's comparison also points to an important technical pattern.
Some specialized SQL models at the 32-billion-parameter level outperformed general-purpose frontier language models on database work.
That supports a narrower lesson for enterprise AI: broad language ability is not always enough when the task is constrained by schema structure, execution rules and domain-specific data conventions.
Gemini-SQL2 is not described as a separate standalone model.
It is a capability built on Gemini 3.1 Pro, so the open product question is where Google will surface it.
The likely venues are existing Gemini-based SQL generation surfaces such as BigQuery Studio, AlloyDB AI and Cloud SQL Studio, although Google has not disclosed a separate Gemini-SQL2 API or model string.
The Next Test Is Product Control
The strongest near-term use case is supervised assistance.
SaaS companies with Ask Your Data features, enterprise analytics teams and data engineering groups could use the system to shorten the path from a question to a draft query.
The remaining control problem is deciding when the generated SQL can be trusted, when it requires human review and how much access the model should have to sensitive production data.
That is where the benchmark result becomes a deployment question.
Gemini-SQL2 improves the case for natural-language database interfaces, but the published numbers still point to a human-in-the-loop design.
Until the accuracy gap narrows further, the practical value is faster query construction with review, not unsupervised database automation.
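One way to picture the control problem is a gate that runs generated SQL automatically only when it is a read-only query against non-sensitive tables, and sends everything else to a person. The sketch below is an assumption-laden illustration rather than anything Google has described: it uses SQLite's authorizer hook from Python's standard `sqlite3` module, and the `employees` and `salaries` tables, the `SENSITIVE_TABLES` policy and the `run_or_escalate` helper are hypothetical.

```python
import sqlite3

SENSITIVE_TABLES = {"salaries"}  # hypothetical policy


def make_authorizer(blocked_tables):
    """Build an authorizer that permits plain reads of non-blocked tables."""
    def authorizer(action, arg1, arg2, db_name, source):
        if action in (sqlite3.SQLITE_SELECT, sqlite3.SQLITE_FUNCTION):
            return sqlite3.SQLITE_OK
        if action == sqlite3.SQLITE_READ:
            # For SQLITE_READ, arg1 is the table name being read.
            if arg1 in blocked_tables:
                return sqlite3.SQLITE_DENY
            return sqlite3.SQLITE_OK
        # Writes, schema changes, pragmas, transactions: all denied.
        return sqlite3.SQLITE_DENY
    return authorizer


def run_or_escalate(conn, generated_sql):
    """Execute generated SQL if the authorizer allows it; otherwise
    return a marker that the query needs human review."""
    try:
        rows = conn.execute(generated_sql).fetchall()
    except (sqlite3.Error, sqlite3.Warning) as exc:
        return ("needs_review", str(exc))
    return ("executed", rows)


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (id INTEGER, name TEXT);
    CREATE TABLE salaries (employee_id INTEGER, amount REAL);
    INSERT INTO employees VALUES (1, 'Ada'), (2, 'Lin');
    INSERT INTO salaries VALUES (1, 100.0), (2, 120.0);
""")
# Install the policy only after the setup statements have run.
conn.set_authorizer(make_authorizer(SENSITIVE_TABLES))

safe = run_or_escalate(conn, "SELECT name FROM employees ORDER BY name")
sensitive = run_or_escalate(conn, "SELECT amount FROM salaries")
write = run_or_escalate(conn, "DELETE FROM employees")
```

A gate like this addresses access and side effects only. It cannot tell whether an allowed query answers the question the user meant to ask, which is why review and testing remain part of the workflow.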