Uber QueryGPT - Multi-Agent Natural Language to SQL
Multi-agent system with Intent, Table, and Column Prune agents that cuts SQL authoring from 10 minutes to 3 minutes, with 78% of users reporting time savings
The Problem
At Uber, the data platform handles approximately 1.2 million interactive queries each month, with the Operations organization alone contributing about 36% of them. Authoring SQL queries requires not only a solid understanding of SQL syntax but also deep knowledge of how Uber's internal data models represent business concepts. Each query takes around 10 minutes to author, split between searching the data dictionary for relevant datasets and writing the query itself. With hundreds of thousands of datasets across the organization, finding the right tables and understanding their schemas creates significant productivity drag.
The Manual Process:
- Search through data dictionary to identify relevant datasets among hundreds of thousands of tables
- Read table schemas, understand column meanings, and map business concepts to data model
- Write SQL query with correct syntax, joins, filters, and aggregations
- Debug and iterate when queries fail or return incorrect results
- Repeat process for each analytical question across different business domains
- Average 10 minutes per query for experienced SQL users
Key Pain Points:
- Data discovery challenge - Finding the right tables among hundreds of thousands required extensive domain knowledge
- Schema complexity - Some Tier 1 tables span over 200 columns, consuming 40-60K tokens in LLM context
- Business terminology gap - Internal Uber lingo and business concepts must be correctly mapped to data model structures
- Context requirements - Queries often require joins across multiple tables and understanding of data relationships
- Token limits - Large schemas broke LLM calls outright when early models supported only 32K-token context windows (a rough token-count sketch follows this list)
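To make the schema-size problem concrete, here is a rough token-count sketch. It assumes OpenAI's tiktoken tokenizer and a hypothetical 200-column DDL as stand-ins; the source names neither the tokenizer nor the actual tables.

```python
# Rough illustration of why wide schemas strain context windows.
# tiktoken's cl100k_base encoding is an assumed stand-in for whatever
# tokenizer Uber's models actually use.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Hypothetical CREATE TABLE DDL for a wide Tier 1 table: 200 columns,
# each with a name, type, and short comment, mimicking real schema dumps.
columns = ",\n".join(
    f"  driver_metric_{i:03d} DOUBLE COMMENT 'rolling 28-day aggregate {i}'"
    for i in range(200)
)
ddl = f"CREATE TABLE mobility.fact_trips (\n{columns}\n);"

print(f"{len(enc.encode(ddl))} tokens for one 200-column schema")
# A handful of such tables plus few-shot SQL samples easily exceeds a
# 32K window, which is why schemas must be pruned before generation.
```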
The Solution
Uber built QueryGPT, a multi-agent system that decomposes natural-language-to-SQL generation into specialized agents (Intent, Table Selection, Column Pruning) working with curated workspaces and RAG-based similarity search. The system evolved through 20+ algorithm iterations from a hackathon prototype to production, reducing query authoring time from 10 minutes to 3 minutes.
Impact: In limited release, QueryGPT averages ~300 daily active users, with 78% reporting that the generated queries reduced the time they would have spent writing SQL from scratch. Multi-agent decomposition proved critical: giving LLMs single units of specialized work (intent classification, table selection, column pruning) dramatically improved accuracy over a monolithic generation approach. The data platform it serves handles 1.2M interactive queries monthly.
How It Works
Key Capabilities:
- Workspaces - Curated collections of SQL samples and tables organized by business domain (Mobility, Ads, Core Services, Platform Engineering, IT, etc.) - includes 11 system workspaces + custom workspace creation
- Intent Agent - LLM classifier mapping user questions to one or more business domains/workspaces to narrow search radius for RAG
- Table Agent - LLM that identifies relevant tables and presents to user for ACK/edit before query generation (human-in-the-loop validation)
- Column Prune Agent - LLM that removes irrelevant columns from large schemas to reduce token consumption and latency
- RAG with Similarity Search - k-NN vector search on SQL samples and schemas to retrieve relevant few-shot examples (a retrieval sketch follows this list)
- Evaluation Framework - Golden question-to-SQL mappings with metrics tracking (intent accuracy, table overlap, run success, query similarity)
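A minimal sketch of the k-NN retrieval step over precomputed embeddings. The `embed` function is a hypothetical stand-in for an embedding service; the source names neither the embedding model nor the vector store.

```python
# k-NN similarity search over SQL samples, assuming embeddings are
# produced elsewhere and L2-normalized up front.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: call the embedding service here."""
    raise NotImplementedError

def knn_search(question: str,
               sample_texts: list[str],
               sample_vecs: np.ndarray,  # shape (n_samples, dim), normalized
               k: int = 5) -> list[str]:
    """Return the k SQL samples most similar to the question."""
    q = embed(question)
    q = q / np.linalg.norm(q)
    scores = sample_vecs @ q            # cosine similarity via dot product
    top = np.argsort(scores)[::-1][:k]  # indices of the k best matches
    return [sample_texts[i] for i in top]
```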
Process Flow (an end-to-end orchestration sketch in code follows this list):
- User Question - User enters natural language prompt (e.g., “Find the number of trips completed yesterday in Seattle”)
- Intent Agent - LLM maps question to business domain/workspace (e.g., “Mobility” workspace for trips-related queries)
- Workspace Selection - Narrow search to curated SQL samples and tables for that domain
- RAG Similarity Search - k-NN search retrieves relevant SQL examples and table schemas based on question embeddings
- Table Agent - LLM identifies top candidate tables, user validates or edits the list (human-in-the-loop)
- Column Prune Agent - For large schemas (200+ columns), LLM prunes irrelevant columns to stay within token limits
- Query Generation - Few-shot prompting with: pruned schemas + relevant SQL samples + user question + Uber business instructions
- Output - Generated SQL query plus an explanation of how the LLM produced it
- Evaluation - Track intent accuracy, table overlap, run success, output presence, qualitative similarity to golden SQL
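The flow above, compressed into one orchestration sketch. Each agent is passed in as a plain function so the pipeline shape stays visible; every name and interface here is a hypothetical stand-in for Uber's internal components, not their actual code.

```python
# End-to-end pipeline shape: intent -> RAG -> table pick -> human ACK ->
# column prune -> few-shot generation. All callables are assumed stubs.
from typing import Callable

LLM = Callable[[str], str]  # stand-in for the team's LLM client

def generate_sql(
    llm: LLM,
    question: str,
    classify_intent: Callable[[str], str],                        # Intent Agent
    retrieve: Callable[[str, str], tuple[list[str], list[str]]],  # RAG search
    pick_tables: Callable[[str, list[str]], list[str]],           # Table Agent
    confirm: Callable[[list[str]], list[str]],                    # human-in-the-loop
    prune: Callable[[str, str], str],                             # Column Prune Agent
) -> str:
    workspace = classify_intent(question)             # narrow the search radius
    samples, schemas = retrieve(question, workspace)  # few-shot SQL + schemas
    chosen = confirm(pick_tables(question, schemas))  # user ACKs or edits the list
    pruned = [prune(question, s) for s in chosen]     # fit the token budget
    prompt = (
        "Schemas:\n" + "\n".join(pruned) +
        "\nExamples:\n" + "\n".join(samples) +
        f"\nWrite a SQL query to answer: {question}"
    )
    return llm(prompt)                                # SQL + explanation
```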
Technical Architecture: Multi-agent pipeline in which each agent handles a specialized classification task via LLM calls. Workspaces provide domain-scoped context that shrinks the search space. A vector database stores embeddings of SQL samples and schemas for similarity search. Human-in-the-loop validation at the table selection stage catches errors before expensive query generation. The Column Prune agent addresses token limits by cutting 40-60K-token schemas down to a manageable size (see the pruning sketch below).
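A minimal Column Prune Agent sketch, assuming a generic `call_llm` client. It asks the LLM only to pick column names, then rebuilds the slim DDL deterministically, so the pruned schema can never contain a hallucinated column; the source describes the agent's job, not its implementation.

```python
# Column pruning: LLM selects relevant column names; the DDL is rebuilt
# in code from the real catalog, never from LLM output.

def call_llm(prompt: str) -> str:
    """Placeholder: plug in the LLM client here."""
    raise NotImplementedError

def prune_schema(question: str, table: str, columns: dict[str, str]) -> str:
    """columns maps column name -> type; returns a reduced CREATE TABLE DDL."""
    reply = call_llm(
        "Return a comma-separated list of the columns needed to answer "
        f"the question, chosen only from: {', '.join(columns)}.\n"
        f"Question: {question}"
    )
    # Keep only names that actually exist in the schema.
    keep = [c.strip() for c in reply.split(",") if c.strip() in columns]
    body = ",\n".join(f"  {c} {columns[c]}" for c in keep)
    return f"CREATE TABLE {table} (\n{body}\n);"
```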
Evolution Journey (20+ iterations):
- Hackathon v1: Simple RAG with 7 Tier 1 tables, 20 SQL samples, k-NN similarity search on schemas/queries
- Problem: Accuracy declined as more tables were added; simple similarity search over schemas/SQL didn't work well; large schemas broke the 32K token limit
- Current Design: Workspaces for domain organization + Intent Agent for classification + Table Agent for validation + Column Prune Agent for token management + improved RAG
Evaluation Methodology:
- Vanilla Flow: Measures the baseline - input question → infer intent/datasets → generate SQL → evaluate end to end
- Decoupled Flow: Human-in-the-loop - input question + actual intent/datasets → generate SQL → enables component-level evaluation
- Metrics: Intent accuracy, Table Overlap Score (0-1, based on whether the correct tables were selected; a sketch follows this list), Successful Run, Run Has Output (>0 records), Qualitative Query Similarity (LLM-based score, 0-1)
- Golden Set: Manually curated question-to-SQL mappings covering variety of datasets and business domains
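The source names the Table Overlap Score but not its exact formula; one plausible definition is golden-set recall, sketched here under that assumption.

```python
# Table Overlap Score as golden-set recall: the fraction of golden
# tables that the pipeline actually selected, on a 0-1 scale.

def table_overlap_score(predicted: set[str], golden: set[str]) -> float:
    """1.0 when every golden table was selected, 0.0 when none were."""
    if not golden:
        return 1.0  # nothing required, trivially satisfied
    return len(predicted & golden) / len(golden)

assert table_overlap_score({"fact_trips", "dim_city"},
                           {"fact_trips", "dim_city"}) == 1.0
assert table_overlap_score({"fact_trips"},
                           {"fact_trips", "dim_city"}) == 0.5
```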
Key Insight
LLMs excel as specialized classifiers when given single units of work: decomposing query generation into Intent, Table Selection, and Column Pruning agents dramatically improved accuracy over the monolithic approach. Each agent performed well precisely because it was asked to do one narrow, specialized job rather than a broad, generalized task, and this pattern of breaking complex generation into specialized agents proved more effective than trying to solve everything in one LLM call (a minimal intent-classifier sketch follows).
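A minimal intent-classifier sketch illustrating the "single unit of work" idea: the LLM is asked to emit exactly one workspace label and nothing else. The prompt wording, fallback label, and `call_llm` stub are assumptions; only the workspace names come from the list above.

```python
# Intent Agent as a narrow classifier: one question in, one label out.

WORKSPACES = ["Mobility", "Ads", "Core Services", "Platform Engineering", "IT"]

def call_llm(prompt: str) -> str:
    """Placeholder: plug in the LLM client here."""
    raise NotImplementedError

def classify_intent(question: str) -> str:
    prompt = (
        "You are a classifier. Answer with exactly one label from "
        f"{WORKSPACES} and nothing else.\n"
        f"Question: {question}"
    )
    label = call_llm(prompt).strip()
    # Hypothetical fallback if the model strays from the label set.
    return label if label in WORKSPACES else "Core Services"
```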
Why This Matters:
- Specialization over generalization - Intermediate agents that decompose user prompts into better signals for RAG significantly improved accuracy over the first version
- Token management is critical - Column Prune agent not only solved token limit issues but also reduced cost and latency by providing smaller, more focused context
- Domain organization scales - Workspaces let the system scale toward the platform's 1.2M monthly queries by narrowing the search radius to the relevant business domains
- Human-in-the-loop catches errors early - Table Agent validation prevents expensive query generation with wrong datasets
Known Challenges:
- Hallucinations persist - The LLM sometimes generates queries with non-existent tables or columns; the team is experimenting with a validation agent for recursive fixing (a catalog-check sketch follows this list)
- User prompts lack context - Questions range from detailed prompts with keywords to 5-word queries with typos; a "prompt enhancer" is needed to enrich questions before sending them to the LLM
- High bar for accuracy - Users expect generated queries to “just work”; important to target right personas for initial rollout
- Non-deterministic evaluation - The same evaluation run can produce different outcomes; focus on error patterns over long periods rather than ~5% run-to-run changes
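A sketch of the catalog check such a validation agent could run, using sqlglot for SQL parsing (an assumption; the source names no parser). Any flagged errors could be appended to the prompt for a recursive retry. This is deliberately simplified: it ignores CTE names and select-list aliases.

```python
# Validate generated SQL against a known catalog: flag any table or
# column the model invented.
from sqlglot import exp, parse_one

def find_hallucinations(sql: str, catalog: dict[str, set[str]]) -> list[str]:
    """catalog maps table name -> set of its column names."""
    errors = []
    tree = parse_one(sql)
    for table in tree.find_all(exp.Table):
        if table.name not in catalog:
            errors.append(f"unknown table: {table.name}")
    known_columns = set().union(*catalog.values()) if catalog else set()
    for column in tree.find_all(exp.Column):
        if column.name not in known_columns:
            errors.append(f"unknown column: {column.name}")
    return errors  # empty list means the query passed the catalog check
```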
Links
- Original Blog Post - Comprehensive technical deep dive into QueryGPT’s architecture and evolution