Grab Hubble - Conversational Data Discovery with LLMs

Source: Grab Engineering Blog - Enabling conversational data discovery with LLMs at Grab
Authors: Shreyas Parbat, Amanda Ng, Yucheng Zeng, Vinnson Lee, Feng Cheng, Varun Torka
Date: September 26, 2024
Category: Data Analysis

Problem

At Grab, finding the right data among a massive repository was a significant challenge that hampered productivity across the organization.

The Data Discovery Challenge:

Over 200,000 tables in data lake, plus numerous Kafka streams, production databases, and ML features
Hubble (internal data discovery tool built on Datahub) primarily used as a reference tool, not for true discovery
Reliance on Elasticsearch: Performed well for keyword searches but couldn’t accept user-provided context (no semantic search capability)
Only 82% click-through rate: 18% of users abandoned searches without clicking on any dataset, indicating search results weren’t meeting needs
Low documentation coverage: Only 20% of most frequently queried tables (P80 tables) had existing documentation
Heavy reliance on tribal knowledge: Data consumers constantly asked colleagues via Slack to find datasets
Survey results: 51% of data consumers took multiple days to find the dataset they required

Solution

The Hubble team implemented a systematic three-step approach to revolutionize data discovery using AI and LLMs.

Vision and Goals

Vision: Remove humans from the data discovery loop by automating the entire process using LLM-powered products. Reduce time taken from multiple days to mere seconds.

Three Core Goals:

Build HubbleIQ: LLM-based chatbot serving as equivalent of Lead Data Analyst for data discovery, accessible via Slack
Improve documentation coverage: Achieve extensive high-quality documentation across datasets
Enhance Elasticsearch: Tune implementation to better meet Grab’s data consumer requirements

Step 1: Enhance Elasticsearch

Through clickstream analysis and user interviews, the team identified four categories of data search queries:

Search Query Categories:

Exact search (part of 75%): Query is substring of existing dataset name, with query length ≥40% of dataset’s name
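Partial search (part of 75%): Levenshtein distance >80 from an existing dataset name (most plausibly a 0-100 similarity score, since a large raw edit distance would indicate a poor match); usually spelling mistakes or shortened dataset names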
Inexact search (25%): Colloquial keywords or phrases semantically related to table/column/documentation (e.g., “city” or “taxi type”)
Semantic search: Free text queries with abundant contextual information - users didn’t even attempt these on Hubble, sending them via Slack instead
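
The first two rules above are concrete enough to sketch in code. The following is a minimal illustration, not Grab's implementation: it assumes the ">80" threshold is a 0-100 similarity score derived from edit distance, and the function names and sample dataset names are hypothetical. The source gives no mechanical rule for telling inexact and semantic queries apart, so both fall into one bucket here.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to 0-100, where 100 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


def classify_query(query: str, dataset_names: list[str]) -> str:
    """Label a query using the exact and partial rules from the list above."""
    q = query.strip().lower()
    names = [name.lower() for name in dataset_names]
    # Exact: query is a substring covering at least 40% of a dataset name.
    if any(q in name and len(q) >= 0.4 * len(name) for name in names):
        return "exact"
    # Partial: assumed similarity score above 80 against some dataset name.
    if any(similarity(q, name) > 80 for name in names):
        return "partial"
    # Everything else: colloquial keywords or free-text questions.
    return "inexact_or_semantic"
```

With the hypothetical names `grab_bookings_daily` and `driver_payouts`, the query `bookings_daily` is classified as exact, the misspelling `driver_payuots` as partial, and `taxi type` falls through to the last bucket.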

Elasticsearch Optimizations Implemented:

Tagging and boosting P80 tables (most frequently queried)
Boosting the most relevant schemas
Hiding irrelevant datasets like PowerBI dataset tables
Deboosting deprecated tables
Improving search UI by simplifying and reducing clutter
Adding relevant tags
Boosting certified tables
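
Several of these optimizations map naturally onto an Elasticsearch `function_score` query. The sketch below builds such a query body as a plain Python dictionary. The field names (`is_p80`, `is_certified`, `is_deprecated`, `schema`, `platform`), the schema list, and every weight are assumptions made for illustration; the source does not publish Grab's index mapping or tuning values.

```python
# Hypothetical list of schemas that should rank higher.
PREFERRED_SCHEMAS = ["analytics", "core"]


def build_search_query(user_query: str) -> dict:
    """Build an Elasticsearch request body with boosts and deboosts applied."""
    return {
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        "must": [{
                            "multi_match": {
                                "query": user_query,
                                # Name matches count more than description matches.
                                "fields": ["name^3", "description", "tags"],
                                # Tolerate small spelling mistakes.
                                "fuzziness": "AUTO",
                            }
                        }],
                        # Hide dataset types that are irrelevant to discovery.
                        "must_not": [{"term": {"platform": "powerbi"}}],
                    }
                },
                "functions": [
                    {"filter": {"term": {"is_p80": True}}, "weight": 3.0},
                    {"filter": {"term": {"is_certified": True}}, "weight": 2.0},
                    {"filter": {"terms": {"schema": PREFERRED_SCHEMAS}}, "weight": 1.5},
                    # A weight below 1 lowers the score of deprecated tables.
                    {"filter": {"term": {"is_deprecated": True}}, "weight": 0.2},
                ],
                "score_mode": "multiply",  # combine the matching function weights
                "boost_mode": "multiply",  # then scale the text relevance score
            }
        }
    }
```

Multiplying weights keeps text relevance as the base signal, so a boosted table still has to match the query to rank at all.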

Result: Click-through rate rose from 82% to 94% (12 percentage point increase)
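
For reference, click-through rate here is the share of searches that end in at least one dataset click. A minimal sketch of that calculation, assuming a hypothetical session log in which each record carries a `clicked_datasets` list:

```python
def click_through_rate(sessions: list[dict]) -> float:
    """Percentage of search sessions with at least one dataset click."""
    if not sessions:
        return 0.0
    clicked = sum(1 for session in sessions if session["clicked_datasets"])
    return 100.0 * clicked / len(sessions)


def percentage_point_change(before: float, after: float) -> float:
    """Absolute difference between two percentages."""
    return after - before
```

Under this definition, moving from 82% to 94% is a change of 12 percentage points, and the share of abandoned searches drops from 18% to 6%.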

Step 2: Build Context Store for HubbleIQ

To support LLM-based discovery, high-quality documentation was essential.

Documentation Generation Engine:

Built using GPT-4 to generate documentation based on table schemas and sample data
Refined prompt through multiple iterations of feedback from data producers
Added “generate” button on Hubble UI for easy documentation generation
Supported Grab-wide initiative to certify tables
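
The core input to such an engine is a prompt assembled from the schema and a few sample rows. The sketch below shows one way that assembly could look. The prompt wording, the `Column` type, and the `build_doc_prompt` helper are hypothetical; the prompt Grab refined with data producers is not published, and the model call itself is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    dtype: str


def build_doc_prompt(table_name: str, columns: list[Column],
                     sample_rows: list[dict], max_rows: int = 3) -> str:
    """Assemble a documentation prompt from a schema and a few sample rows."""
    schema = "\n".join(f"- {c.name} ({c.dtype})" for c in columns)
    header = " | ".join(c.name for c in columns)
    # Cap the sample so the prompt stays short and exposes little data.
    rows = "\n".join(
        " | ".join(str(row.get(c.name, "")) for c in columns)
        for row in sample_rows[:max_rows]
    )
    return (
        f"Write documentation for the table `{table_name}`.\n"
        "Describe the table's purpose in two or three sentences, then give a "
        "one-line description of every column. Say so when a meaning is "
        "unclear instead of guessing.\n\n"
        f"Schema:\n{schema}\n\n"
        f"Sample rows:\n{header}\n{rows}\n"
    )
```

The resulting string would be sent to the model as the user message, with the returned text stored as a draft for the data producer to review.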

Proactive Documentation:

Pre-populated docs for most critical tables
Notified data producers to review generated documentation
AI-generated docs visible with “AI-generated” tag as precaution
Tag removed when data producers accepted or edited documentation
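
The tag lifecycle described above amounts to a small state change on each document. A minimal sketch, with a hypothetical `DatasetDoc` record standing in for whatever Hubble stores internally:

```python
from dataclasses import dataclass, field

AI_TAG = "AI-generated"


@dataclass
class DatasetDoc:
    dataset: str
    text: str
    tags: set[str] = field(default_factory=set)


def prepopulate(dataset: str, generated_text: str) -> DatasetDoc:
    """Store a generated draft and mark it as not yet reviewed."""
    return DatasetDoc(dataset, generated_text, {AI_TAG})


def accept(doc: DatasetDoc) -> None:
    """Data producer confirms the draft as-is, so the tag is removed."""
    doc.tags.discard(AI_TAG)


def edit(doc: DatasetDoc, new_text: str) -> None:
    """Data producer rewrites the draft, which also removes the tag."""
    doc.text = new_text
    doc.tags.discard(AI_TAG)
```

Keeping the marker as an ordinary tag means consumers can see at a glance, or filter on, which documentation has not yet been reviewed by a human.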

Result: Documentation coverage for P80 tables increased by 70 percentage points to ~90%. User feedback showed ~95% of users found generated docs useful.

Step 3: Build and Launch HubbleIQ

With high documentation coverage in place, the team harnessed LLMs for conversational data discovery.

Technical Implementation:

Leveraged Glean: Used existing enterprise search tool at Grab to speed up go-to-market
Integration with Glean: Made all data lake tables with documentation available on Glean platform
HubbleIQ Bot Creation: Used Glean Apps to create bot (LLM with custom system prompt) that could access all Hubble datasets catalogued on Glean
Search Integration: For any search likely to be semantic, HubbleIQ results shown on top, followed by regular search results
Slack Integration: Recently integrated HubbleIQ with Slack for seamless discovery without breaking user flow
Channel Integration: Working with analytics teams to add bot to “ask” channels where data consumers ask contextual questions, acting as first line of defense
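
The search integration step can be pictured as a small routing layer in front of the two backends. The source does not say how a query is judged "likely to be semantic", so the word-count and question-mark heuristic below is purely an assumption, and both backends are passed in as hypothetical callables rather than real Glean or Elasticsearch clients.

```python
from typing import Callable


def looks_semantic(query: str, min_words: int = 5) -> bool:
    """Assumed heuristic: questions and long free-text queries are semantic."""
    q = query.strip()
    return q.endswith("?") or len(q.split()) >= min_words


def search(query: str,
           keyword_search: Callable[[str], list[str]],
           ask_hubbleiq: Callable[[str], list[str]]) -> list[dict]:
    """Return result sections in display order, chatbot answers first."""
    sections = []
    if looks_semantic(query):
        sections.append({"source": "hubbleiq", "items": ask_hubbleiq(query)})
    # Regular search results are always shown, below any chatbot answer.
    sections.append({"source": "search", "items": keyword_search(query)})
    return sections
```

Short keyword queries skip the LLM call entirely in this sketch, which keeps the common exact and partial lookups fast.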

Impact

User Satisfaction

73% of respondents found it easy to discover datasets (17 percentage point increase from previous survey)
Hubble reached all-time high in monthly active users

Time Savings

Reduced data discovery time from multiple days to mere seconds for semantic searches
Eliminated need for constant Slack messages to colleagues

Documentation Quality

Documentation coverage for critical tables increased from 20% to 90%
95% of users found AI-generated documentation useful

Search Effectiveness

Click-through rate improved from 82% to 94%
Successfully addressed all four search query categories

Key Insights

Four Distinct Search Categories: Understanding the different types of searches (exact, partial, inexact, semantic) was crucial for building the right solutions for each category.
Documentation is Foundation for LLM Discovery: Without high-quality documentation coverage, LLM-based data discovery cannot work effectively. The team had to solve documentation first.
Leverage Existing Tools: Using Glean (existing enterprise search tool) significantly accelerated go-to-market instead of building from scratch.
Meet Users Where They Are: Integrating HubbleIQ with Slack (where data consumers already work) increased adoption and reduced friction.
Incremental Approach Works: Systematic three-step approach (Elasticsearch → Documentation → LLM chatbot) allowed team to build on each success and validate assumptions.
AI-Assisted + Human Review: Using “AI-generated” tags and requiring data producer review balanced automation with quality control.
Click-Through Rate as Key Metric: CTR served as effective proxy for search quality, allowing team to measure and improve systematically.

Next Steps

Documentation Generation Enhancements:

Enrich generator with more context for improved accuracy
Enable data docs to be auto-updated from the content of Slack threads, triggered directly from Slack
Develop evaluator model leveraging LLMs to assess quality of both human and AI-written docs
Implement Reflexion (agentic workflow) that uses doc evaluator outputs to iteratively regenerate docs until quality benchmark is met
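
The last two items describe a generate, evaluate, regenerate loop. Since this is planned work, the sketch below is only one plausible shape for it: the generator and evaluator are hypothetical callables standing in for LLM calls, and the threshold and round limit are arbitrary illustrative values.

```python
from typing import Callable, Optional


def refine_doc(generate: Callable[[Optional[str]], str],
               evaluate: Callable[[str], tuple[float, str]],
               threshold: float = 0.8,
               max_rounds: int = 3) -> tuple[str, float]:
    """Regenerate a doc with evaluator feedback until it scores well enough."""
    feedback: Optional[str] = None
    best_doc, best_score = "", float("-inf")
    for _ in range(max_rounds):
        # The previous round's critique is fed back into the generator.
        doc = generate(feedback)
        score, feedback = evaluate(doc)
        if score > best_score:
            best_doc, best_score = doc, score
        if score >= threshold:
            break  # quality benchmark met
    return best_doc, best_score
```

Capping the number of rounds bounds cost when the evaluator never passes a draft, and returning the best-scoring draft avoids ending on a regression.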

HubbleIQ Improvements:

Add support for metric datasets and other dataset types
Enable follow-up questions to HubbleIQ directly on the Hubble UI
Intelligently pull additional metadata when user mentions specific dataset

Technical Stack

Datahub: Open-source data catalog platform (foundation for Hubble)
Elasticsearch: Enhanced with custom parameters for better search
GPT-4: Documentation generation from table schemas and sample data
Glean: Enterprise search tool for LLM integration
Glean Apps: Platform for creating HubbleIQ bot with custom system prompts
Slack: Integration for seamless conversational data discovery

Search Query Classification

The team’s analysis of search patterns provided a valuable framework for understanding data discovery needs:

Search Type	% of Queries	Characteristics	Solution
Exact	37.5% (est.)	Substring of dataset name, ≥40% length	Vanilla Elasticsearch
Partial	37.5% (est.)	Levenshtein distance >80, spelling variations	Enhanced Elasticsearch
Inexact	25%	Colloquial keywords, semantic relations	Enhanced Elasticsearch + Boosting
Semantic	Not attempted on UI	Free text with abundant context	HubbleIQ (LLM-powered)

Tags: Data Discovery, LLM, GPT-4, Documentation Generation, Semantic Search, Elasticsearch, Enterprise Search, Slack Integration, Datahub, Production System
