Distilled insights from 52 real-world implementations
Instead of showcasing time saved or ROI metrics, these insights focus on what actually matters: problem patterns you can recognize, architecture decisions with rationale, breakthrough insights that made implementations successful, and anti-patterns to avoid.
Does your problem look like any of these?
Duolingo
Delivery Hero (Woowa Brothers), Pinterest, Swiggy +1 more
Delivery Hero Quick Commerce
eBay
Grab, Dropbox, Harvard Business School +5 more
Honeycomb
Salesforce, Grab, Ramp +12 more
Spotify
Walmart, Grab
LlamaIndex, SoftIQ
Adobe, Salesforce, Bertelsmann +1 more
Adyen
Manus, Jeppesen (Boeing), Airtable
Meta
DoorDash, Vimeo
Instacart
Whatnot
Anthropic, Exa
How teams structured their agents—and why those choices mattered.
The critical decisions that made implementations successful. Not time saved—what actually worked.
Patterns over prompts: feeding the AI existing curriculum content as examples dramatically outperformed adding more constraints to prompt instructions. A generate-many-filter-to-best approach (create multiple candidate episodes, then have AI evaluators select the highest quality) proved more effective than trying to generate perfect content on the first try, and grounding generation in the curriculum keeps content level-appropriate and grammatically sound.
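A minimal sketch of that generate-many-filter-to-best loop, assuming a generic `llm.complete(prompt) -> str` client; the function names, prompt wording, and scoring rubric are illustrative stand-ins, not Duolingo's actual pipeline:

```python
# Hypothetical sketch: generate several candidate episodes conditioned
# on curriculum content as patterns, then keep only the best-scored one.

def generate_candidates(llm, curriculum_examples, n=8):
    """Generate n candidate episodes, using curriculum content as
    few-shot patterns rather than piling on prompt constraints."""
    prompt = "Write a new story episode matching the style and level of these examples:\n\n"
    prompt += "\n---\n".join(curriculum_examples)
    return [llm.complete(prompt) for _ in range(n)]

def best_episode(llm, candidates):
    """Have an LLM evaluator score each candidate; return the top one."""
    def score(text):
        verdict = llm.complete(
            "Rate this episode 1-10 for level-appropriateness and grammar. "
            "Reply with a number only.\n\n" + text
        )
        return float(verdict.strip())
    return max(candidates, key=score)
```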
Enriching table metadata with business terminology, few-shot examples from domain experts, and multi-stage retrieval turns generic GPT-4 into a domain-expert Text-to-SQL system fit for production. GPT-4 alone produces queries that lack company context, ignore data policies, and hallucinate; the combination of (1) augmented documentation with detailed column descriptions and business glossaries, (2) search algorithms that refine the question and select relevant examples, and (3) ReAct prompting that reasons step by step while dynamically retrieving context yields query quality that employees trust for actual work. The 'garbage in, garbage out' principle applies: foundation-model performance is capped by input quality.
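A sketch of how such a domain-enriched prompt might be assembled, with hypothetical `retrieve_tables` and `retrieve_examples` hooks standing in for the multi-stage retrieval described above:

```python
# Hypothetical sketch: inject table documentation, a business glossary,
# and retrieved expert-written examples ahead of the user question.

def build_sql_prompt(question, retrieve_tables, glossary, retrieve_examples):
    tables = retrieve_tables(question)        # multi-stage table retrieval
    examples = retrieve_examples(question)    # expert Q -> SQL pairs
    parts = ["You are a Text-to-SQL assistant. Think step by step."]
    parts.append("## Tables\n" + "\n".join(
        f"{t['name']}: {t['description']}\n  columns: {t['columns']}"
        for t in tables))
    parts.append("## Business glossary\n" + "\n".join(
        f"{term}: {meaning}" for term, meaning in glossary.items()))
    parts.append("## Examples\n" + "\n\n".join(
        f"Q: {ex['question']}\nSQL: {ex['sql']}" for ex in examples))
    parts.append(f"## Question\n{question}\nSQL:")
    return "\n\n".join(parts)
```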
Knowledge distillation dramatically reduced costs while maintaining quality: a teacher model (GPT-4o) trained a smaller student (GPT-4o-mini) to reach the same quality with much shorter prompts.
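A hedged sketch of that setup: the teacher answers with the full, verbose prompt, and the resulting (short prompt → teacher output) pairs become fine-tuning data for the student. The chat-message JSONL layout below is one common fine-tuning format, not necessarily the one used here:

```python
# Hypothetical sketch: build a distillation dataset where the student
# learns to reproduce teacher outputs from a much shorter prompt.

import json

def build_distillation_set(teacher, inputs, long_template, short_template):
    rows = []
    for x in inputs:
        # Teacher sees the long, fully specified prompt.
        target = teacher.complete(long_template.format(input=x))
        # Student will be fine-tuned on the short prompt -> same answer.
        rows.append({
            "messages": [
                {"role": "user", "content": short_template.format(input=x)},
                {"role": "assistant", "content": target},
            ]
        })
    return rows

def save_jsonl(rows, path):
    """Write rows in a JSONL layout suitable for a fine-tuning API."""
    with open(path, "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
```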
An agent abstraction that decouples input/output from implementation allows multiple variations to honor the same contract while differing in complexity, models, and optimization, keeping agents compatible and reusable for rapid development. If a solution works in a developer's local environment for a single instance, the platform automatically scales it to eBay's industrial needs, serving hundreds of millions of users across billions of listings. Near-real-time (NRT) distributed queue-based messaging smooths peaks and valleys in user activity, giving consistent, controllable throughput and better GPU utilization.
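One way the contract idea can look in code, as a sketch: a shared request/response type plus a structural interface, so a cheap agent and a thorough agent are interchangeable to callers. All names here are hypothetical, not eBay's platform API:

```python
# Hypothetical sketch: every agent honors the same typed input/output
# contract, so implementations can vary in model, complexity, and cost
# without breaking callers.

from dataclasses import dataclass
from typing import Protocol

@dataclass
class AgentRequest:
    listing_id: str
    task: str

@dataclass
class AgentResponse:
    listing_id: str
    output: str

class Agent(Protocol):
    def run(self, request: AgentRequest) -> AgentResponse: ...

class CheapAgent:
    """Small model, short prompt: fast and inexpensive."""
    def run(self, request: AgentRequest) -> AgentResponse:
        return AgentResponse(request.listing_id, f"[small-model] {request.task}")

class ThoroughAgent:
    """Larger model with retrieval: same contract, higher quality."""
    def run(self, request: AgentRequest) -> AgentResponse:
        return AgentResponse(request.listing_id, f"[large-model] {request.task}")
```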
Documentation is the foundation for LLM discovery: without high-quality docs (coverage was raised from 20% to 90%), an LLM chatbot cannot work effectively. Four distinct search categories (exact, partial, inexact, semantic) require different solutions; keyword search handles 75% of queries, while the LLM handles the semantic remainder (see the routing sketch below). An incremental approach (Elasticsearch → docs → LLM) validated each step before the next, and leveraging the existing Glean tool accelerated go-to-market versus building from scratch.
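A minimal sketch of that tiered routing, assuming a keyword index with a `search` method and a semantic answerer callable; both are stand-ins, not the actual Elasticsearch or Glean APIs:

```python
# Hypothetical sketch: exact/partial/inexact queries go to keyword
# search; only queries it cannot satisfy fall through to the
# LLM-backed semantic tier over the documentation corpus.

def search(query, keyword_index, semantic_answerer, min_hits=1):
    """Try keyword retrieval first; fall back to semantic QA."""
    hits = keyword_index.search(query)
    if len(hits) >= min_hits:
        return {"tier": "keyword", "results": hits}
    return {"tier": "semantic", "results": semantic_answerer(query)}
```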
Single-pass generation is essential to avoid accuracy degradation: 90% per-call accuracy compounds to 59% over a chain of 5 LLM calls, so chaining multiple calls fails (the arithmetic is worked below). Few-shot prompting outperformed zero-shot and chain-of-thought. Accepting 'good enough' outputs (flexible interpretations, like a user typing 'slow') serves users better than rigid correctness; LLMs are engines for features, not standalone products. Filtering schemas by 7-day activity handles context limits for customers with 5,000+ fields.
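The compounding arithmetic, worked out:

```python
# If each LLM call in a chain is 90% accurate, the chain's overall
# accuracy is 0.9 ** n; five chained calls land near 59%.

for n in range(1, 6):
    print(n, round(0.9 ** n, 3))
# 1 0.9
# 2 0.81
# 3 0.729
# 4 0.656
# 5 0.59
```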
Automating quality evaluation with LLMs enables continuous, rapid iteration on search improvements that would be impossible with manual review. By cutting evaluation time from days or weeks to hours, LinkedIn can experiment with search enhancements and measure impact quickly. Slow feedback loops prevent experimentation and innovation; GPT-powered evaluation provides consistent, fast assessment that supports continuous improvement at a platform scale serving billions.
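A sketch of LLM-as-judge evaluation in that spirit: grade each query/result pair against a fixed rubric, then average over a query set. The rubric, the 0-3 scale, and the `llm.complete` client are illustrative assumptions:

```python
# Hypothetical sketch: score search quality with an LLM judge so
# experiments can be evaluated in hours instead of manual review cycles.

def judge_relevance(llm, query, result_snippet):
    """Return a 0-3 relevance grade from an LLM judge."""
    prompt = (
        "Grade how well the result answers the query.\n"
        "0=irrelevant, 1=partially relevant, 2=relevant, 3=perfect.\n"
        f"Query: {query}\nResult: {result_snippet}\n"
        "Reply with a single digit."
    )
    return int(llm.complete(prompt).strip())

def evaluate_ranker(llm, queries, ranker):
    """Mean judge score of the top result for each query."""
    scores = [judge_relevance(llm, q, ranker(q)[0]) for q in queries]
    return sum(scores) / len(scores)
```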
Table documentation quality trumps model sophistication: weighting embeddings toward table metadata increased search hit rate from 40% to 90%, showing that data governance is the primary bottleneck for Text-to-SQL performance. Real-world deployment revealed that table discovery (finding the right tables among hundreds of thousands) and metadata quality (accurate descriptions of table purpose and column meanings) matter far more than prompt engineering or model selection. Benchmarks like Spider miss this because they treat a small number of pre-specified, well-normalized tables as given.
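One plausible reading of "weighting embeddings toward table metadata", sketched below: embed the documentation and the raw schema separately, then combine with a weight favoring the documentation before ranking by cosine similarity. The 0.8 weight and the `embed` function are assumptions, not the reported configuration:

```python
# Hypothetical sketch: metadata-weighted table retrieval.

import numpy as np

def table_vector(embed, doc_text, schema_text, doc_weight=0.8):
    """Combine documentation and schema embeddings, favoring the docs."""
    v = doc_weight * embed(doc_text) + (1 - doc_weight) * embed(schema_text)
    return v / np.linalg.norm(v)

def top_tables(embed, query, tables, k=5):
    """Rank tables (each with a precomputed 'vector') by cosine similarity."""
    q = embed(query)
    q = q / np.linalg.norm(q)
    scored = sorted(tables, key=lambda t: -float(q @ t["vector"]))
    return scored[:k]
```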
Showing the top 8 of 52 insights.
Real-world constraints that shaped how teams built their agents.
What teams tried that didn't work. Learn from these failures to avoid repeating them.
Adding more English-only constraints to prompts for original content generation produced subpar results that required extensive manual editing; feeding curriculum content as patterns worked better.
— Duolingo
Trying to make the AI generate perfect content in one shot failed; a generate-many-filter-to-best approach with comprehensive evaluators was more effective.
— Duolingo
Giving the AI freedom to sequence exercises produced hit-or-miss quality; standardizing exercise order using learner session data improved reliability.
— Duolingo
Automated translations frequently missed accuracy and proficiency-level requirements; curriculum-driven generation aligned better with learning goals.
— Duolingo
Using GPT-4 with a basic prompt produces queries that lack company context, ignore data policies, and hallucinate; domain-enriched metadata and multi-stage retrieval are essential for production quality.
— Delivery Hero (Woowa Brothers)
Table schemas alone are insufficient to encode business logic; augmented documentation with detailed column descriptions, business-terminology glossaries, and few-shot SQL examples from domain experts is required.
— Delivery Hero (Woowa Brothers)
Showing 6 common failures.
Most-used frameworks in these implementations. Note: Popularity ≠ right for your problem.
Compare your requirements to these proven patterns. If you see similar problems, constraints, and complexity, you've found a validated starting point.