Uber QueryGPT - Multi-Agent Natural Language to SQL
Multi-agent system with Intent, Table, and Column Prune agents that cuts SQL authoring from 10 minutes to 3 minutes, with 78% of users reporting time savings
The Problem
At Uber, the data platform handles approximately 1.2 million interactive queries each month, with the Operations organization alone contributing about 36% of them. Authoring SQL queries requires not only a solid understanding of SQL syntax but also deep knowledge of how Uber's internal data models represent business concepts. Each query takes around 10 minutes to author, split between searching the data dictionary for relevant datasets and writing the query itself. With hundreds of thousands of datasets across the organization, finding the right tables and understanding their schemas creates significant productivity drag.
The Manual Process:
- Search through data dictionary to identify relevant datasets among hundreds of thousands of tables
- Read table schemas, understand column meanings, and map business concepts to data model
- Write SQL query with correct syntax, joins, filters, and aggregations
- Debug and iterate when queries fail or return incorrect results
- Repeat process for each analytical question across different business domains
- Average 10 minutes per query for experienced SQL users
Key Pain Points:
- Data discovery challenge - Finding the right tables among hundreds of thousands required extensive domain knowledge
- Schema complexity - Some Tier 1 tables span over 200 columns, consuming 40-60K tokens in LLM context
- Business terminology gap - Internal Uber lingo and business concepts must be correctly mapped to data model structures
- Context requirements - Queries often require joins across multiple tables and understanding of data relationships
- Token limits - Large schemas broke LLM calls outright when early models supported only 32K-token context windows (a rough token-count sketch follows this list)
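To make the schema-size problem concrete, here is a rough token-count sketch. It assumes OpenAI's tiktoken tokenizer and a hypothetical 200-column DDL as stand-ins; the source names neither the tokenizer nor the actual tables.

```python
# Rough illustration of why wide schemas strain context windows.
# tiktoken's cl100k_base encoding is an assumed stand-in for whatever
# tokenizer Uber's models actually use.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Hypothetical CREATE TABLE DDL for a wide Tier 1 table: 200 columns,
# each with a name, type, and short comment, mimicking real schema dumps.
columns = ",\n".join(
    f"  driver_metric_{i:03d} DOUBLE COMMENT 'rolling 28-day aggregate {i}'"
    for i in range(200)
)
ddl = f"CREATE TABLE mobility.fact_trips (\n{columns}\n);"

print(f"{len(enc.encode(ddl))} tokens for one 200-column schema")
# A handful of such tables plus few-shot SQL samples easily exceeds a
# 32K window, which is why schemas must be pruned before generation.
```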
The Solution
Uber built QueryGPT, a multi-agent system that decomposes natural-language-to-SQL generation into specialized agents (Intent, Table Selection, Column Pruning) working with curated workspaces and RAG-based similarity search. The system evolved through 20+ algorithm iterations from a hackathon prototype to production, reducing query authoring time from 10 minutes to 3 minutes.
Impact: In limited release, QueryGPT averages ~300 daily active users, with 78% reporting that the generated queries reduced the time they would have spent writing SQL from scratch. Multi-agent decomposition proved critical: giving LLMs single units of specialized work (intent classification, table selection, column pruning) dramatically improved accuracy over a monolithic generation approach. The data platform it serves handles 1.2M interactive queries monthly.
How It Works
Key Capabilities:
- Workspaces - Curated collections of SQL samples and tables organized by business domain (Mobility, Ads, Core Services, Platform Engineering, IT, etc.) - includes 11 system workspaces + custom workspace creation
- Intent Agent - LLM classifier mapping user questions to one or more business domains/workspaces to narrow search radius for RAG
- Table Agent - LLM that identifies relevant tables and presents to user for ACK/edit before query generation (human-in-the-loop validation)
- Column Prune Agent - LLM that removes irrelevant columns from large schemas to reduce token consumption and latency
- RAG with Similarity Search - k-NN vector search on SQL samples and schemas to retrieve relevant few-shot examples (a retrieval sketch follows this list)
- Evaluation Framework - Golden question-to-SQL mappings with metrics tracking (intent accuracy, table overlap, run success, query similarity)
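A minimal sketch of the k-NN retrieval step over precomputed embeddings. The `embed` function is a hypothetical stand-in for an embedding service; the source names neither the embedding model nor the vector store.

```python
# k-NN similarity search over SQL samples, assuming embeddings are
# produced elsewhere and L2-normalized up front.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: call the embedding service here."""
    raise NotImplementedError

def knn_search(question: str,
               sample_texts: list[str],
               sample_vecs: np.ndarray,  # shape (n_samples, dim), normalized
               k: int = 5) -> list[str]:
    """Return the k SQL samples most similar to the question."""
    q = embed(question)
    q = q / np.linalg.norm(q)
    scores = sample_vecs @ q            # cosine similarity via dot product
    top = np.argsort(scores)[::-1][:k]  # indices of the k best matches
    return [sample_texts[i] for i in top]
```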
Process Flow (an end-to-end orchestration sketch in code follows this list):
- User Question - User enters natural language prompt (e.g., “Find the number of trips completed yesterday in Seattle”)
- Intent Agent - LLM maps question to business domain/workspace (e.g., “Mobility” workspace for trips-related queries)
- Workspace Selection - Narrow search to curated SQL samples and tables for that domain
- RAG Similarity Search - k-NN search retrieves relevant SQL examples and table schemas based on question embeddings
- Table Agent - LLM identifies top candidate tables, user validates or edits the list (human-in-the-loop)
- Column Prune Agent - For large schemas (200+ columns), LLM prunes irrelevant columns to stay within token limits
- Query Generation - Few-shot prompting with: pruned schemas + relevant SQL samples + user question + Uber business instructions
- Output - Generated SQL query plus an explanation of how the LLM produced it
- Evaluation - Track intent accuracy, table overlap, run success, output presence, qualitative similarity to golden SQL
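The flow above, compressed into one orchestration sketch. Each agent is passed in as a plain function so the pipeline shape stays visible; every name and interface here is a hypothetical stand-in for Uber's internal components, not their actual code.

```python
# End-to-end pipeline shape: intent -> RAG -> table pick -> human ACK ->
# column prune -> few-shot generation. All callables are assumed stubs.
from typing import Callable

LLM = Callable[[str], str]  # stand-in for the team's LLM client

def generate_sql(
    llm: LLM,
    question: str,
    classify_intent: Callable[[str], str],                        # Intent Agent
    retrieve: Callable[[str, str], tuple[list[str], list[str]]],  # RAG search
    pick_tables: Callable[[str, list[str]], list[str]],           # Table Agent
    confirm: Callable[[list[str]], list[str]],                    # human-in-the-loop
    prune: Callable[[str, str], str],                             # Column Prune Agent
) -> str:
    workspace = classify_intent(question)             # narrow the search radius
    samples, schemas = retrieve(question, workspace)  # few-shot SQL + schemas
    chosen = confirm(pick_tables(question, schemas))  # user ACKs or edits the list
    pruned = [prune(question, s) for s in chosen]     # fit the token budget
    prompt = (
        "Schemas:\n" + "\n".join(pruned) +
        "\nExamples:\n" + "\n".join(samples) +
        f"\nWrite a SQL query to answer: {question}"
    )
    return llm(prompt)                                # SQL + explanation
```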
Technical Architecture: Multi-agent pipeline in which each agent handles a specialized classification task via LLM calls. Workspaces provide domain-scoped context that shrinks the search space. A vector database stores embeddings of SQL samples and schemas for similarity search. Human-in-the-loop validation at the table selection stage catches errors before expensive query generation. The Column Prune agent addresses token limits by cutting 40-60K-token schemas down to a manageable size (see the pruning sketch below).
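A minimal Column Prune Agent sketch, assuming a generic `call_llm` client. It asks the LLM only to pick column names, then rebuilds the slim DDL deterministically, so the pruned schema can never contain a hallucinated column; the source describes the agent's job, not its implementation.

```python
# Column pruning: LLM selects relevant column names; the DDL is rebuilt
# in code from the real catalog, never from LLM output.

def call_llm(prompt: str) -> str:
    """Placeholder: plug in the LLM client here."""
    raise NotImplementedError

def prune_schema(question: str, table: str, columns: dict[str, str]) -> str:
    """columns maps column name -> type; returns a reduced CREATE TABLE DDL."""
    reply = call_llm(
        "Return a comma-separated list of the columns needed to answer "
        f"the question, chosen only from: {', '.join(columns)}.\n"
        f"Question: {question}"
    )
    # Keep only names that actually exist in the schema.
    keep = [c.strip() for c in reply.split(",") if c.strip() in columns]
    body = ",\n".join(f"  {c} {columns[c]}" for c in keep)
    return f"CREATE TABLE {table} (\n{body}\n);"
```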
Evolution Journey (20+ iterations):
- Hackathon v1: Simple RAG with 7 Tier 1 tables, 20 SQL samples, k-NN similarity search on schemas/queries
- Problem: Accuracy declined as more tables were added; simple similarity search over schemas/SQL didn't work well; large schemas broke the 32K token limit
- Current Design: Workspaces for domain organization + Intent Agent for classification + Table Agent for validation + Column Prune Agent for token management + improved RAG
Evaluation Methodology:
- Vanilla Flow: Measures the baseline - input question → infer intent/datasets → generate SQL → evaluate end to end
- Decoupled Flow: Human-in-the-loop - input question + actual intent/datasets → generate SQL → enables component-level evaluation
- Metrics: Intent accuracy, Table Overlap Score (0-1, based on whether the correct tables were selected; a sketch follows this list), Successful Run, Run Has Output (>0 records), Qualitative Query Similarity (LLM-based score, 0-1)
- Golden Set: Manually curated question-to-SQL mappings covering variety of datasets and business domains
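The source names the Table Overlap Score but not its exact formula; one plausible definition is golden-set recall, sketched here under that assumption.

```python
# Table Overlap Score as golden-set recall: the fraction of golden
# tables that the pipeline actually selected, on a 0-1 scale.

def table_overlap_score(predicted: set[str], golden: set[str]) -> float:
    """1.0 when every golden table was selected, 0.0 when none were."""
    if not golden:
        return 1.0  # nothing required, trivially satisfied
    return len(predicted & golden) / len(golden)

assert table_overlap_score({"fact_trips", "dim_city"},
                           {"fact_trips", "dim_city"}) == 1.0
assert table_overlap_score({"fact_trips"},
                           {"fact_trips", "dim_city"}) == 0.5
```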
Key Insight
LLMs excel as specialized classifiers when given single units of work: decomposing query generation into Intent, Table Selection, and Column Pruning agents dramatically improved accuracy over the monolithic approach. Each agent performed well precisely because it was asked to do one narrow, specialized job rather than a broad, generalized task, and this pattern of breaking complex generation into specialized agents proved more effective than trying to solve everything in one LLM call (a minimal intent-classifier sketch follows).
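A minimal intent-classifier sketch illustrating the "single unit of work" idea: the LLM is asked to emit exactly one workspace label and nothing else. The prompt wording, fallback label, and `call_llm` stub are assumptions; only the workspace names come from the list above.

```python
# Intent Agent as a narrow classifier: one question in, one label out.

WORKSPACES = ["Mobility", "Ads", "Core Services", "Platform Engineering", "IT"]

def call_llm(prompt: str) -> str:
    """Placeholder: plug in the LLM client here."""
    raise NotImplementedError

def classify_intent(question: str) -> str:
    prompt = (
        "You are a classifier. Answer with exactly one label from "
        f"{WORKSPACES} and nothing else.\n"
        f"Question: {question}"
    )
    label = call_llm(prompt).strip()
    # Hypothetical fallback if the model strays from the label set.
    return label if label in WORKSPACES else "Core Services"
```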
Why This Matters:
- Specialization over generalization - Intermediate agents that decompose user prompts into better signals for RAG significantly improved accuracy over the first version
- Token management is critical - Column Prune agent not only solved token limit issues but also reduced cost and latency by providing smaller, more focused context
- Domain organization scales - Workspaces let the system scale toward the platform's 1.2M monthly queries by narrowing the search radius to the relevant business domains
- Human-in-the-loop catches errors early - Table Agent validation prevents expensive query generation with wrong datasets
Known Challenges:
- Hallucinations persist - The LLM sometimes generates queries with non-existent tables or columns; the team is experimenting with a validation agent for recursive fixing (a catalog-check sketch follows this list)
- User prompts lack context - Questions range from detailed prompts with keywords to 5-word queries with typos; a "prompt enhancer" is needed to enrich questions before sending them to the LLM
- High bar for accuracy - Users expect generated queries to “just work”; important to target right personas for initial rollout
- Non-deterministic evaluation - The same evaluation run can produce different outcomes; focus on error patterns over long periods rather than ~5% run-to-run changes
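A sketch of the catalog check such a validation agent could run, using sqlglot for SQL parsing (an assumption; the source names no parser). Any flagged errors could be appended to the prompt for a recursive retry. This is deliberately simplified: it ignores CTE names and select-list aliases.

```python
# Validate generated SQL against a known catalog: flag any table or
# column the model invented.
from sqlglot import exp, parse_one

def find_hallucinations(sql: str, catalog: dict[str, set[str]]) -> list[str]:
    """catalog maps table name -> set of its column names."""
    errors = []
    tree = parse_one(sql)
    for table in tree.find_all(exp.Table):
        if table.name not in catalog:
            errors.append(f"unknown table: {table.name}")
    known_columns = set().union(*catalog.values()) if catalog else set()
    for column in tree.find_all(exp.Column):
        if column.name not in known_columns:
            errors.append(f"unknown column: {column.name}")
    return errors  # empty list means the query passed the catalog check
```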
Links
- Original Blog Post - Comprehensive technical deep dive into QueryGPT’s architecture and evolution