Meta AI for Efficient Incident Response
A two-stage incident response system that places the true root cause among its top five suggestions for 42% of backtested investigations: heuristic filtering narrows thousands of changes to a few hundred, then a fine-tuned Llama 2 (7B) model uses election-based ranking to narrow those to the top 5.
Source: Meta Engineering Blog - Leveraging AI for efficient incident response
Authors: Diana Hsu, Michael Neu, Mohamed Farrag, Rahul Kindi
Date: June 24, 2024
Category: Development & Engineering
Problem
Investigation is a critical part of ensuring system reliability at Meta and a prerequisite to mitigating issues quickly. However, investigating anomalies presents significant challenges:
Investigation Challenges:
- Every investigation is unique: The root cause must be identified before the issue can be properly mitigated
- Monolithic repository scalability: The accumulating volume of changes across many teams makes investigations complex
- Context building overhead: Responders need to build context on what is broken, which systems are involved, and who might be impacted
- Large search space: Thousands of potential code changes to review
- Time-consuming process: Complex and lengthy investigations delay mitigation
The challenge was particularly acute for systems dependent on monolithic repositories where the volume of changes and cross-team dependencies create a massive search space for root cause identification.
Solution
Meta developed an AI-assisted root cause analysis system that streamlines investigations through a combination of heuristic-based retrieval and large language model (LLM)-based ranking to speed up root cause identification.
Two-Stage Approach
Stage 1: Heuristics-Based Retriever
Reduces the search space from thousands of changes to a few hundred, without a significant reduction in accuracy, using:
- Code and directory ownership analysis
- Runtime code graph exploration of impacted systems
- Other novel heuristics specific to Meta’s infrastructure
Purpose: Intelligent filtering to maintain high recall while dramatically reducing the search space.
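The post names these heuristics but does not publish them. As a rough illustration of the idea, here is a minimal Python sketch of ownership-based filtering; the `Change` type, the scoring weights, and the `keep=300` cutoff are all hypothetical, not Meta's implementation.

```python
from dataclasses import dataclass

@dataclass
class Change:
    """A candidate code change; fields are illustrative, not Meta's schema."""
    id: str
    paths: list[str]       # files/directories the change touches
    author_team: str       # team that owns the author

def heuristic_retrieve(changes: list[Change],
                       impacted_dirs: set[str],
                       impacted_teams: set[str],
                       keep: int = 300) -> list[Change]:
    """Score each change with simple ownership heuristics and keep the
    top `keep` candidates (thousands -> a few hundred)."""
    def score(change: Change) -> float:
        s = 0.0
        # Directory/code ownership: the change touches a directory that
        # maps to one of the systems showing the anomaly.
        if any(p.startswith(d) for p in change.paths for d in impacted_dirs):
            s += 2.0
        # Team ownership: the authoring team owns an impacted system.
        if change.author_team in impacted_teams:
            s += 1.0
        return s
    return sorted(changes, key=score, reverse=True)[:keep]
```

A production retriever would also explore the runtime code graph from the impacted systems toward the changes that could have affected them; that step is omitted here.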
Stage 2: LLM-Based Ranker
Once the search space is reduced to a few hundred changes, the system uses a fine-tuned Llama 2 (7B) model to further reduce to the top 5 most likely root causes.
Ranking Through Election:
- Structures prompts to contain a maximum of 20 changes at a time
- Asks LLM to identify top 5 changes from each batch
- Aggregates outputs across LLM requests
- Repeats process until only 5 candidates remain
Key Innovation: This election-based approach accommodates context window limitations while enabling the model to reason across different changes effectively.
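The post describes this election procedure in prose only. Below is a minimal sketch of the loop, assuming an `ask_llm_top5` callable that wraps the fine-tuned ranker; the shuffle between rounds is an addition for illustration, not described in the post.

```python
import random
from typing import Callable, List

def election_rank(candidates: List[str],
                  ask_llm_top5: Callable[[List[str]], List[str]],
                  batch_size: int = 20,
                  winners: int = 5) -> List[str]:
    """Tournament-style ranking: no prompt ever holds more than
    `batch_size` changes, so the context window is respected however
    many candidates the retriever returned."""
    pool = list(candidates)
    while len(pool) > winners:
        random.shuffle(pool)  # mix survivors between rounds
        survivors: List[str] = []
        for i in range(0, len(pool), batch_size):
            batch = pool[i:i + batch_size]
            # Batches already at or below `winners` advance unchanged.
            survivors.extend(batch if len(batch) <= winners
                             else ask_llm_top5(batch))
        pool = survivors
    return pool

# Toy stand-in for the fine-tuned ranker, just to make the sketch runnable:
# top5 = election_rank(change_ids, lambda batch: sorted(batch)[:5])
```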
Training Process
The biggest lever to achieving high accuracy was fine-tuning a Llama 2 (7B) model using historical investigations with known root causes.
Continued Pre-Training (CPT)
- Used limited and approved internal wikis, Q&As, and code
- Exposed model to Meta-specific artifacts and terminology
- Built foundation for understanding Meta’s infrastructure
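Mechanically, CPT is ordinary causal-language-model training continued on a domain corpus. A minimal sketch using Hugging Face transformers follows; the corpus path and hyperparameters are placeholders, and the post does not disclose Meta's actual setup.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Approved internal wikis/Q&As/code exported as text files (placeholder path).
corpus = load_dataset("text", data_files={"train": "internal_corpus/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train_set = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama2-7b-cpt",
                           per_device_train_batch_size=1,
                           gradient_accumulation_steps=16,
                           num_train_epochs=1),
    train_dataset=train_set,
    # mlm=False gives the standard next-token (causal LM) objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```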
Supervised Fine-Tuning (SFT)
- Mixed Llama 2’s original SFT data with internal context
- Created dedicated investigation root cause analysis (RCA) SFT dataset
- Taught model to follow RCA instructions
RCA SFT Dataset:
- ~5,000 instruction-tuning examples
- Details of 2-20 changes from retriever
- Known root cause included
- Information available at investigation start (title, observed impact)
- Intentionally limited information density to match real-world scenarios
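Concretely, one such instruction-tuning record might look like the following; the field names and values are invented for illustration and are not Meta's schema:

```python
rca_sft_example = {
    "instruction": ("Given the investigation below and a list of recent "
                    "code changes, identify the most likely root cause."),
    "input": {
        # Only what is known at investigation start:
        "title": "Spike in checkout error rate",
        "observed_impact": "HTTP 500s on /checkout up 8x since 14:05 UTC",
        # 2-20 candidate changes surfaced by the heuristic retriever:
        "changes": [
            {"id": "D123", "summary": "Refactor payment session cache"},
            {"id": "D456", "summary": "Update checkout button styling"},
        ],
    },
    # Supervision signal from the resolved historical investigation:
    "output": "D123",
}
```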
Advanced Fine-Tuning for Ranked Lists:
- Gathered model’s log probabilities (logprobs) for each possible culprit
- Ranked the search space based on relevancy to the given investigation
- Created fine-tuning examples where model yields ranked list of potential code changes
- Expected root cause appears first in ranked output
- Re-ran SFT with this augmented dataset to enable proper ranking behavior
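The post does not include code for this step. One way to score candidates by model log probability with a Hugging Face causal LM is sketched below; the helper names are assumptions, and the clean prompt/completion token boundary is a simplification.

```python
import torch

def completion_logprob(model, tokenizer, prompt: str, completion: str) -> float:
    """Sum of log-probabilities the model assigns to `completion`
    conditioned on `prompt` (assumes tokenization splits cleanly at the
    prompt/completion boundary, a simplification)."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits            # (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    for pos in range(prompt_len, full_ids.shape[1]):
        # Logits at position pos-1 predict the token at position pos.
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

def rank_by_logprob(model, tokenizer, prompt: str,
                    candidates: list[str]) -> list[str]:
    """Order candidate root causes by model-assigned log probability."""
    scores = {c: completion_logprob(model, tokenizer, prompt, c)
              for c in candidates}
    return sorted(candidates, key=scores.get, reverse=True)
```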
Impact
Accuracy Metrics
- 42% accuracy in identifying root causes at investigation creation time, for investigations related to Meta's web monorepo
- Root cause appears in top 5 suggested code changes for 42% of backtested investigations
- Based on exhaustive backtesting with historical investigations and information available at their start
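For concreteness, the reported number is a top-5 hit rate over the backtest set; a trivial sketch of the metric follows (the record shape is illustrative):

```python
def top5_accuracy(backtests: list[dict]) -> float:
    """Share of backtested investigations whose known root cause appears
    among the top-5 suggestions, e.g. each record is
    {"suggestions": ["D123", ...], "root_cause": "D123"}."""
    hits = sum(1 for b in backtests
               if b["root_cause"] in b["suggestions"][:5])
    return hits / len(backtests)
```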
Time Savings
- Significantly reduced effort and time needed to root cause an investigation
- Streamlined onboarding of engineers to investigations
- Accelerated root cause isolation process
Operational Impact
- Production deployment at Meta scale
- Integrated into internal investigation tools
- Used alongside existing tools like Hawkeye (end-to-end ML workflow debugging)
Key Insights
- Two-Stage Retrieval Works: Combining heuristic-based filtering (thousands → hundreds) with LLM-based ranking (hundreds → top 5) provides an optimal balance of recall and precision.
- Election-Based Ranking Handles Context Limits: Processing 20 changes at a time and aggregating results enables reasoning across large candidate sets despite context window limitations.
- Fine-Tuning is Critical: The biggest performance lever was fine-tuning on ~5,000 historical investigations with known root causes, not just using pre-trained models.
- Limited Information Training Matches Reality: Training on intentionally sparse information (what's known at investigation start) makes the model perform better in real-world scenarios.
- Log Probabilities Enable Ranking: Using logprobs to rank relevancy and training the model to produce ranked lists significantly improved output quality.
- Closed Feedback Loops for Trust: Ensuring responders can independently reproduce results validates AI suggestions and prevents misleading recommendations.
- Precision Over Reach: Using confidence measurement methodologies to detect low-confidence answers, sacrificing reach in favor of precision, maintains trust.
- Domain-Specific Training Matters: Continued pre-training on internal wikis, Q&As, and code exposed the model to Meta-specific context, improving performance on Meta infrastructure.
Risk Mitigation
The team identified key risks and implemented mitigations:
Risk: Wrong Root Cause Suggestions
- Mitigation: Prioritize closed feedback loops and explainability
- Ensure responders can independently reproduce results
- Validate AI-generated results before acting
Risk: Low Confidence Answers
- Mitigation: Confidence measurement methodologies
- Avoid recommending low-confidence results
- Sacrifice reach in favor of precision
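The post does not detail its confidence measurement methodology. One simple possibility is a margin-based gate over the ranker's scores, sketched below; the threshold value is hypothetical.

```python
from typing import List, Optional, Tuple

def gate_suggestions(ranked: List[Tuple[str, float]],
                     min_margin: float = 2.0) -> Optional[List[str]]:
    """Precision-over-reach gate: surface the top-5 list only when the
    score margin between the best candidate and the runner-up is large
    enough; otherwise return nothing rather than a low-confidence guess."""
    if len(ranked) < 2:
        return None
    if ranked[0][1] - ranked[1][1] < min_margin:
        return None  # low confidence: stay silent
    return [change_id for change_id, _ in ranked[:5]]
```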
Risk: Over-Reliance on AI
- Mitigation: AI assists but doesn’t replace human judgment
- Responders maintain control of investigation process
- AI suggestions are starting points, not final answers
What’s Next
Autonomous Workflows
- Expanding AI capabilities to autonomously execute full workflows
- AI systems validate their own results
- Reduced need for human intervention in routine investigations
Proactive Incident Detection
- Detect potential incidents prior to code push
- Proactively mitigate risks before they arise
- Shift from reactive investigation to preventive detection
Broader Integration
- Expand beyond web monorepo to other systems
- Integrate deeper into existing investigation tooling
- Scale AI-assisted investigation across Meta’s infrastructure
Technical Stack
- LLM: Fine-tuned Llama 2 (7B) model
- Training Data: ~5,000 instruction-tuning examples from historical investigations
- Training Process: Continued pre-training (CPT) + supervised fine-tuning (SFT)
- Ranking Algorithm: Election-based ranking with max 20 changes per prompt
- Retrieval: Novel heuristics-based retrieval (code ownership, runtime code graph analysis)
- Integration: Internal investigation tooling, Hawkeye debugging system
Training Dataset Composition
RCA SFT Dataset Structure:
- Investigation title and observed impact
- 2-20 code changes from retriever
- Known root cause (for training)
- Change metadata (author, timestamp, affected files, etc.)
- Ranked list format with logprobs
Fine-Tuning Data Sources:
- Llama 2’s original SFT data (base instruction-following)
- Internal context (wikis, Q&As, code)
- Historical RCA investigations (~5,000 examples)
- Ranked list examples (augmented dataset)
Tags: AI-Assisted Investigation, Root Cause Analysis, LLM, Llama, Fine-Tuning, Incident Response, System Reliability, Heuristic Retrieval, Production System, Meta Infrastructure
Related Examples:
- Other AI-assisted development and debugging tools
- LLM-based code analysis systems
- Incident response automation platforms