Pinterest Text-to-SQL in Querybook
RAG-enhanced Text-to-SQL achieving a 40%+ first-shot acceptance rate and a 35% improvement in task completion speed through table metadata optimization
The Problem
Writing queries to solve analytical problems is the core task for Pinterest's data users, yet finding the right data and translating those problems into correct, efficient SQL proved challenging in a fast-paced environment. With hundreds of thousands of tables spread across different domains, analysts spent significant time hunting for the right tables before they could even begin writing queries, while also struggling to encode complex business logic correctly.
The Manual Process:
- Browse through hundreds of thousands of tables in the data warehouse to find relevant datasets
- Manually read table schemas, column descriptions, and sample queries to understand table purpose
- Translate analytical questions into SQL code, encoding domain-specific logic and extraction conditions
- Debug and iterate on queries when initial attempts fail or return incorrect results
- Repeat process for each new analytical task across different business domains
Key Pain Points:
- Table discovery challenge - Finding the right tables among hundreds of thousands was overwhelming for analysts
- Domain knowledge barrier - Understanding which tables contained relevant data required deep institutional knowledge
- SQL translation difficulty - Converting natural language analytical questions into correct SQL syntax and logic was time-consuming
- Efficiency concerns - Writing performant queries on large datasets required optimization expertise
- Fast-paced environment - Business needs demanded quick answers, but data exploration took hours or days
The Solution
Pinterest built a Text-to-SQL feature integrated into Querybook (their open-source big data SQL query tool) using LLMs enhanced with RAG for intelligent table discovery. The system evolved from simple schema-based generation to a two-stage architecture that first finds relevant tables via vector similarity search, then generates optimized SQL queries.
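At a high level, the two-stage flow can be pictured with the sketch below; the function names and parameters are illustrative placeholders, not Pinterest's actual code, and the stages are detailed in "How It Works" further down.

```python
# Illustrative sketch of the two-stage Text-to-SQL flow; all helpers are hypothetical.
def text_to_sql(question: str, tables: list[str] | None = None) -> dict:
    # Stage 1: table discovery, only when the user has not specified tables.
    if not tables:
        candidates = vector_search(question, top_n=20)        # embedding similarity search
        tables = llm_select_tables(question, candidates, k=5)  # LLM re-selection

    # Stage 2: SQL generation from the confirmed tables' metadata.
    schemas = fetch_schemas(tables)       # names, descriptions, low-cardinality values
    prompt = build_prompt(question, schemas, dialect="SparkSQL")
    return stream_llm_generation(prompt)  # e.g. {"query": ..., "explanation": ...}
```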
Impact: The first-shot acceptance rate increased from 20% to above 40% as the implementation improved and users became familiar with the feature. Real-world data showed a 35% improvement in task completion speed for writing SQL queries with AI assistance (versus 50%+ reported in controlled research experiments). Table documentation weighting in embeddings proved critical: the search hit rate jumped from 40% without documentation to 90% with proper metadata weighting, demonstrating that data governance quality is the primary performance bottleneck.
How It Works
Key Capabilities:
- Metadata-Enriched Table Schemas - Includes table/column descriptions, low-cardinality column values for precise filtering, and column pruning for context window management
- Streaming Response Generation - WebSocket-based streaming with LangChain partial JSON parsing to avoid 10+ second wait times (see the streaming sketch after this list)
- RAG-Based Table Discovery - Offline vector indexing of table summaries and historical queries, with similarity search and LLM re-selection
- Table/Query Summarization - LLM-generated summaries of table purpose and use cases from schemas + sample queries
- Tiered Table Indexing - Only top-tier tables indexed to promote high-quality datasets
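To avoid a long blocking wait, the structured response can be streamed and parsed incrementally. Below is a minimal sketch using LangChain's JSON output parser, which tolerates incomplete JSON as tokens arrive; the model name, prompt wording, and response keys are assumptions rather than Pinterest's actual configuration, and the WebSocket push is replaced with a simple print.

```python
# Minimal sketch of streaming structured SQL generation with partial JSON parsing.
# Model, prompt wording, and response keys are illustrative assumptions.
from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "You are a {dialect} expert. Given these table schemas:\n{schemas}\n\n"
    "Answer the question below with a JSON object containing the keys "
    '"query" and "explanation".\n\nQuestion: {question}'
)
llm = ChatOpenAI(model="gpt-4", temperature=0)
chain = prompt | llm | JsonOutputParser()  # parser handles incomplete JSON chunks

# .stream() yields progressively more complete dicts; in Querybook each chunk
# would be pushed over a WebSocket so the SQL renders as it is generated.
for partial in chain.stream({
    "dialect": "SparkSQL",
    "schemas": "...",  # metadata-enriched schemas of the selected tables
    "question": "How many active users are on the web platform?",
}):
    print(partial)  # e.g. {'query': 'SELECT COUNT(DISTINCT user_id) FROM ...'}
```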
Process Flow (V2 with RAG):
- Offline Vector Index Creation - Background job generates embeddings for: (a) LLM-generated table summaries (from schema + sample queries), (b) LLM-generated query summaries (purpose + tables used)
- User Question - Analyst asks analytical question (e.g., “How many active users are on the web platform?”)
- Table Search (if not specified) - Question converted to embeddings, similarity search against vector index returns top N candidate tables
- LLM Table Re-selection - GPT evaluates top N table summaries against the question and selects the most relevant K tables (the table search and re-selection steps are sketched in code after this list)
- User Validation - Top K tables returned to user for confirmation or modification
- Schema Retrieval - Fetch table schemas from metadata store including: table/column names/descriptions, column types, low-cardinality column unique values
- Prompt Construction - Compile question + SQL dialect + table schemas into Text-to-SQL prompt
- SQL Generation - LLM generates query with streaming response via WebSocket (partial JSON parsing)
- Display to User - Structured response (SQL + explanation) displayed in real-time
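The table search and re-selection steps might look like the following sketch. It assumes an OpenSearch k-NN index of table-summary embeddings; the index name, field names, embedding helper, and LLM call are illustrative, not Pinterest's actual implementation.

```python
# Illustrative sketch of RAG table discovery (search + re-selection); names are assumptions.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def search_candidate_tables(question: str, top_n: int = 20) -> list[dict]:
    """Embed the question and run k-NN similarity search over table summaries."""
    query_vector = embed(question)  # hypothetical embedding helper
    body = {
        "size": top_n,
        "query": {"knn": {"summary_embedding": {"vector": query_vector, "k": top_n}}},
    }
    hits = client.search(index="table_summaries", body=body)["hits"]["hits"]
    return [{"table": h["_source"]["table_name"], "summary": h["_source"]["summary"]}
            for h in hits]

def reselect_tables(question: str, candidates: list[dict], k: int = 5) -> list[str]:
    """Ask the LLM to pick the K most relevant tables from the candidate summaries."""
    summaries = "\n".join(f"- {c['table']}: {c['summary']}" for c in candidates)
    prompt = (
        f"Question: {question}\n\nCandidate tables:\n{summaries}\n\n"
        f"Return the {k} table names most relevant to the question, one per line."
    )
    return call_llm(prompt).splitlines()[:k]  # hypothetical LLM call
```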
Technical Architecture: Two-tier system with (1) offline indexing pipeline running scheduled jobs to generate and store table/query summaries in OpenSearch vector store with embedding-based similarity search, and (2) real-time generation pipeline using LangChain for streaming LLM responses with WebSocket transport and partial JSON parsing to handle structured outputs during streaming.
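The offline tier can be pictured as a scheduled job that summarizes each top-tier table with an LLM, embeds the summary, and writes it to the vector index. The sketch below is an assumption-level illustration (index name, document fields, and helper functions are invented); Pinterest's actual summarization prompts are linked in the Links section.

```python
# Illustrative sketch of the offline indexing job; names, fields, and helpers are assumptions.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def index_table(table_name: str, schema: str, sample_queries: list[str]) -> None:
    # LLM-generated summary of the table's purpose and use cases (from schema + sample queries).
    summary = call_llm(  # hypothetical LLM helper
        f"Summarize the purpose and typical use cases of table {table_name}.\n"
        f"Schema:\n{schema}\n\nSample queries:\n" + "\n".join(sample_queries)
    )
    doc = {
        "table_name": table_name,
        "summary": summary,
        "summary_embedding": embed(summary),  # hypothetical embedding helper
    }
    client.index(index="table_summaries", id=table_name, body=doc)

# The scheduled job indexes top-tier tables only (tiered table indexing).
for table in get_top_tier_tables():  # hypothetical metadata-store lookup
    index_table(table.name, table.schema_text, table.sample_queries)
```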
Metadata Optimization Techniques:
- Low-cardinality value injection: Include unique values for frequently-filtered columns (e.g., platform values: 'WEB', 'MOBILE', 'TABLET') to prevent incorrect WHERE clauses (see the schema sketch after this list)
- Column pruning: Tag and exclude certain columns from schemas to stay within context window limits
- Reduced schema version: Fallback to minimal schema (table name, column name, type only) for extremely large tables
- Documentation weighting: Embeddings weighted heavily toward table documentation increased hit rate 40% → 90%
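To make these optimizations concrete, here is an illustrative sketch of how a metadata-enriched schema block might be serialized into the prompt, applying value injection, column pruning, and the reduced fallback; the data structures and output format are assumptions, not Pinterest's actual prompt format.

```python
# Illustrative schema serialization for the Text-to-SQL prompt; structures are assumptions.
from dataclasses import dataclass, field

@dataclass
class Column:
    name: str
    type: str
    description: str = ""
    low_cardinality_values: list[str] = field(default_factory=list)
    exclude_from_prompt: bool = False  # column-pruning tag

def render_schema(table: str, description: str, columns: list[Column],
                  reduced: bool = False) -> str:
    # "reduced" drops descriptions and values for extremely large tables.
    lines = [f"Table: {table}" if reduced else f"Table: {table} - {description}"]
    for col in columns:
        if col.exclude_from_prompt:
            continue  # pruned to stay within the context window
        line = f"  {col.name} ({col.type})"
        if not reduced:
            if col.description:
                line += f": {col.description}"
            if col.low_cardinality_values:
                # Inject possible values so the LLM writes correct WHERE clauses.
                line += f" [values: {', '.join(col.low_cardinality_values)}]"
        lines.append(line)
    return "\n".join(lines)

# Example: the platform column exposes its unique values to the model.
print(render_schema(
    "user_activity",
    "Daily user activity events",
    [Column("user_id", "BIGINT", "Unique user identifier"),
     Column("platform", "STRING", "Client platform", ["WEB", "MOBILE", "TABLET"])],
))
```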
Key Insight
Table documentation quality trumps model sophistication - weighting embeddings toward table metadata increased search hit rate from 40% to 90%, proving data governance is the primary bottleneck for Text-to-SQL performance. While benchmark datasets like Spider suggest Text-to-SQL is a solved problem with off-the-shelf LLMs, real-world deployment at Pinterest revealed that table discovery (finding the right tables among hundreds of thousands) and metadata quality (ensuring descriptions accurately reflect table purpose and column meanings) are far more critical than prompt engineering or model selection for production systems.
Why This Matters:
- Real-world differs from benchmarks - The Spider dataset uses a small number of pre-specified, well-normalized tables; production systems face hundreds of thousands of denormalized tables, making table discovery itself a core challenge
- Data governance is the foundation - Investment in table descriptions, column documentation, and business terminology standardization directly drives Text-to-SQL performance
- Embeddings need structure - Generic embedding similarity alone yielded a 40% hit rate; weighting toward curated metadata achieved 90%, showing human-created documentation outweighs pure semantic similarity (one plausible weighting sketch follows this list)
- Iterative user behavior is normal - 20-40% first-shot acceptance is reasonable; most queries require multiple iterations whether human or AI-generated
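The source does not spell out the weighting mechanics; one plausible, purely illustrative reading is a weighted combination of a documentation embedding and a schema embedding, with most of the weight on the documentation side.

```python
# Purely illustrative: bias the table embedding toward curated documentation.
# The actual Pinterest weighting scheme is not described in the source.
import numpy as np

def weighted_table_embedding(doc_text: str, schema_text: str,
                             doc_weight: float = 0.8) -> np.ndarray:
    doc_vec = embed(doc_text)        # hypothetical embedding helper
    schema_vec = embed(schema_text)
    combined = doc_weight * doc_vec + (1.0 - doc_weight) * schema_vec
    return combined / np.linalg.norm(combined)  # normalize for cosine similarity
```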
Evaluation Challenges: The team noted that existing benchmarks (Spider, etc.) treat a small number of well-labeled tables as given, failing to capture the table discovery challenge central to real-world Text-to-SQL. They advocate for more realistic benchmarks that include larger, denormalized table sets and treat table search as a core evaluation criterion.
Future Improvements Identified:
- Metadata enhancement: Include tiering, tags, domains in vector index for refined filtering
- Scheduled/real-time index updates: Currently manual; automation needed for new tables/queries
- Similarity scoring strategy: Current aggregation is basic; fine-tuning could improve relevance
- Query validation: No validation currently; constrained beam search could catch errors
- User feedback loops: UI for collecting feedback to improve vector index and metadata
Links
- Original Blog Post - Technical deep dive into Text-to-SQL implementation
- Querybook (Open Source) - Pinterest’s open-source big data SQL query tool where Text-to-SQL is deployed
- Table Summarization Prompt - Actual prompt used for LLM table summarization
- Query Summarization Prompt - Actual prompt used for query summarization
- Text-to-SQL Prompt - Actual prompt used for SQL generation
- Spider Benchmark Reference - Academic benchmark showing gap between research and production reality