
Agent Implementation Insights

Distilled insights from 52 real-world implementations

Instead of showcasing time saved or ROI metrics, these insights focus on what actually matters: problem patterns you can recognize, architecture decisions with rationale, breakthrough insights that made implementations successful, and anti-patterns to avoid.

19 Problem Patterns · 52 Breakthrough Insights · 8 Common Constraints · 52 Anti-Patterns

Problem Patterns

Does your problem look like any of these? Click to explore examples.

Architecture Patterns

How teams structured their agents—and why those choices mattered.

Single Agent: 28 implementations
Multi Agent: 14 implementations
Hybrid: 5 implementations
Pipeline: 2 implementations
Multi-Stage System: 1 implementation
Ensemble: 1 implementation
Event-Driven: 1 implementation

Top Breakthrough Insights

The critical decisions that made implementations successful. Not time saved—what actually worked.

Duolingo

Patterns over prompts: feeding the AI existing curriculum content as examples dramatically outperformed adding more constraints to the prompt instructions. A generate-many, filter-to-best approach (create multiple episodes, let AI evaluators select the highest quality) proved more effective than trying to generate perfect content on the first try. The curriculum foundation ensures level-appropriate, grammatically sound content.
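The generate-many, filter-to-best loop can be sketched in a few lines. This is a minimal illustration, not Duolingo's implementation: `fake_generate` and the evaluator lambdas are toy stand-ins for LLM calls.

```python
import random

def generate_candidates(prompt, n, generate):
    """Produce n candidate episodes from the same prompt (assumed LLM call)."""
    return [generate(prompt) for _ in range(n)]

def select_best(candidates, evaluators):
    """Score each candidate with every evaluator; keep the highest average."""
    def score(c):
        return sum(e(c) for e in evaluators) / len(evaluators)
    return max(candidates, key=score)

# Toy stand-ins: a "generator" with varying quality, and two simple evaluators.
random.seed(0)
fake_generate = lambda prompt: {"text": prompt, "quality": random.random()}
evaluators = [lambda c: c["quality"], lambda c: 1.0 if c["text"] else 0.0]

candidates = generate_candidates("lesson about food vocabulary", 5, fake_generate)
best = select_best(candidates, evaluators)
```

In practice the evaluators would themselves be LLM graders checking grammar, level-appropriateness, and curriculum fit, which is the part that makes "generate many" cheap relative to "generate perfect once".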

Delivery Hero (Woowa Brothers)

Enriching table metadata with business terminology, few-shot examples from domain experts, and multi-stage retrieval algorithms transforms generic GPT-4 into a domain-expert Text-to-SQL system capable of production use. GPT-4 alone produces queries that lack company context, ignore data policies, and suffer hallucinations. The combination of (1) augmented documentation with detailed column descriptions and business glossaries, (2) sophisticated search algorithms that refine questions and select relevant examples, and (3) ReAct prompting that reasons step by step while dynamically retrieving context yields query quality employees trust for actual work. The 'garbage in, garbage out' principle applies: foundation-model performance is capped by input quality.
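The "augmented documentation plus retrieved examples" idea amounts to prompt assembly. The sketch below is hypothetical (table names, descriptions, and the example pair are invented), but it shows the shape: glossary-enriched column descriptions and similarity-retrieved question/SQL pairs go into the prompt alongside the user question.

```python
def build_sql_prompt(question, tables, examples):
    """Assemble a Text-to-SQL prompt from enriched metadata and few-shot pairs.
    `tables`: dicts carrying schema plus business-glossary descriptions.
    `examples`: (question, sql) pairs retrieved for similarity to `question`.
    """
    parts = ["You are a Text-to-SQL assistant. Use only the tables below."]
    for t in tables:
        cols = "\n".join(f"  {c['name']} -- {c['description']}" for c in t["columns"])
        parts.append(f"Table {t['name']}: {t['purpose']}\n{cols}")
    for q, sql in examples:
        parts.append(f"Q: {q}\nSQL: {sql}")
    parts.append(f"Q: {question}\nSQL:")
    return "\n\n".join(parts)

# Invented example data for illustration only.
tables = [{
    "name": "orders",
    "purpose": "One row per delivered order (excludes cancellations per data policy)",
    "columns": [
        {"name": "order_id", "description": "unique order identifier"},
        {"name": "gmv", "description": "gross merchandise value, local currency"},
    ],
}]
examples = [("Total GMV last week?", "SELECT SUM(gmv) FROM orders WHERE delivered_at >= :week_start")]
prompt = build_sql_prompt("Average GMV per order?", tables, examples)
```

The model never has to guess what `gmv` means or which rows a table contains; that context is exactly what a bare schema omits.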

Delivery Hero Quick Commerce

Knowledge distillation dramatically reduced costs while maintaining quality: a teacher model (GPT-4o) trained a smaller student (GPT-4o-mini) to achieve the same quality with much shorter prompts.
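One common way to do this is to have the teacher answer under the full, expensive prompt, then fine-tune the student on (short prompt, teacher answer) pairs. A minimal sketch, assuming a chat-format JSONL training file; `LONG_INSTRUCTIONS`, `stub_teacher`, and the classification task are invented placeholders, not the actual Delivery Hero setup.

```python
import json

LONG_INSTRUCTIONS = "Classify the product into a category. Rules: (many lines of policy)"

def build_distillation_set(items, teacher, short_prompt):
    """For each input, record (short prompt, teacher output) as a training pair.
    The teacher sees the full, expensive prompt; the student is later fine-tuned
    to reproduce its outputs from `short_prompt` alone."""
    rows = []
    for item in items:
        answer = teacher(f"{LONG_INSTRUCTIONS}\n\nInput: {item}")
        rows.append({"messages": [
            {"role": "system", "content": short_prompt},
            {"role": "user", "content": item},
            {"role": "assistant", "content": answer},
        ]})
    return "\n".join(json.dumps(r) for r in rows)

# Stand-in for a GPT-4o call, so the sketch runs offline.
stub_teacher = lambda prompt: "category: beverages"
jsonl = build_distillation_set(["cola 330ml"], stub_teacher, "Classify the product.")
```

The cost win comes twice: the student model is cheaper per token, and the short prompt cuts input tokens on every request.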

eBay

An agent abstraction that decouples input/output from implementation allows multiple variations to honor the same contract while differing in complexity, models, and optimization, keeping agents inter-compatible and reusable for rapid development. If a solution works in a developer's local environment for a single instance, the platform automatically scales it to eBay's industrial needs, serving hundreds of millions of users across billions of listings. Near-real-time (NRT) distributed queue-based messaging smooths peaks and valleys in user activity for consistent, controllable throughput and better GPU utilization.
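The contract idea is ordinary interface design. A minimal sketch (the `ListingAgent` protocol and both variants are invented for illustration): two implementations differ in cost and sophistication but are interchangeable to callers.

```python
from typing import Protocol

class ListingAgent(Protocol):
    """The contract: same input/output shape regardless of implementation."""
    def run(self, listing: dict) -> dict: ...

class CheapAgent:
    """Small-model variant: fast heuristic, honors the same contract."""
    def run(self, listing: dict) -> dict:
        return {"listing_id": listing["id"], "title": listing["title"].strip().title()}

class RichAgent:
    """Larger-model variant: would call an LLM; same inputs, same outputs."""
    def run(self, listing: dict) -> dict:
        title = listing["title"].strip().title()  # placeholder for an LLM rewrite
        return {"listing_id": listing["id"], "title": title + " | Fast Shipping"}

def process(agent: ListingAgent, listing: dict) -> dict:
    # Callers depend only on the contract, so variants swap without code changes.
    return agent.run(listing)

out = process(CheapAgent(), {"id": 1, "title": " vintage camera "})
```

Because the contract is stable, routing, batching, and queue-based scaling can be built once against `ListingAgent` rather than per implementation.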

Grab

Documentation is the foundation for LLM discovery: without high-quality docs (coverage increased from 20% to 90%), an LLM chatbot cannot work effectively. Four distinct search categories (exact, partial, inexact, semantic) require different solutions; keyword search handles 75% of queries, and the LLM handles the semantic remainder. An incremental approach (Elasticsearch → docs → LLM) validated each step before the next, and leveraging the existing Glean tool accelerated go-to-market versus building from scratch.
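The split between keyword-served and LLM-served queries implies a router in front of the two backends. A toy sketch (the term index and routing rule are invented; a real system would use Elasticsearch scoring rather than token overlap):

```python
def route_query(query: str, index_terms: set) -> str:
    """Route a help query: keyword search covers exact/partial matches
    (~75% of traffic per the write-up); fall back to the LLM for semantic asks."""
    tokens = set(query.lower().split())
    if tokens & index_terms:
        return "keyword_search"
    return "llm_semantic"

# Invented index vocabulary for illustration.
index_terms = {"vpn", "kafka", "deploy", "oncall"}
route_a = route_query("How do I deploy to staging?", index_terms)
route_b = route_query("My service feels sluggish after the rollout", index_terms)
```

Routing the cheap majority to keyword search keeps LLM cost and latency confined to the queries that actually need semantic understanding.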

Honeycomb

Single-pass generation is essential to avoid accuracy degradation: 90% per-call accuracy compounds to 59% over a 5-call chain, so chaining multiple LLM calls fails. Few-shot prompting outperformed both zero-shot and chain-of-thought. Accepting 'good enough' outputs (flexible interpretation, e.g. a user typing 'slow') serves users better than rigid correctness. LLMs are engines for features, not standalone products. Filtering the schema to fields active in the last 7 days handles context limits for customers with 5,000+ fields.
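The compounding claim is just independent probabilities multiplying, worth seeing once:

```python
def chained_accuracy(per_step: float, steps: int) -> float:
    """End-to-end accuracy when every call in the chain must succeed independently."""
    return per_step ** steps

# 90% per call compounds to ~59% over a 5-call chain -- hence single-pass generation.
five_chain = chained_accuracy(0.90, 5)    # ~0.59
single_pass = chained_accuracy(0.90, 1)   # 0.90
```

The independence assumption is pessimistic in some chains and optimistic in others, but the direction of the argument holds: every added LLM hop multiplies in another failure rate.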

LinkedIn

Automating quality evaluation with LLMs enables continuous, rapid iteration on search improvements that would be impossible with manual review. By cutting evaluation time from days or weeks to hours, LinkedIn can experiment with search enhancements and measure impact quickly, dramatically accelerating search quality improvements. Slow feedback loops prevent experimentation and innovation; GPT-powered evaluation provides consistent, fast assessment, enabling continuous improvement at a platform scale serving billions.
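The evaluation harness reduces to "run every query, have a judge score the results, aggregate". A runnable toy (the corpus, search function, and heuristic judge are stand-ins; LinkedIn's judge is a GPT-based grader):

```python
def evaluate_search(queries, search, judge):
    """Score each query's results with an automated judge and average the scores.
    `judge` stands in for an LLM grader returning a relevance score in [0, 1]."""
    scores = [judge(q, search(q)) for q in queries]
    return sum(scores) / len(scores)

# Toy stand-ins so the loop runs without a model.
corpus = {"pytorch": ["intro to pytorch"], "kafka": ["kafka ops guide"]}
search = lambda q: corpus.get(q, [])
judge = lambda q, results: 1.0 if any(q in r for r in results) else 0.0

baseline = evaluate_search(["pytorch", "kafka", "golang"], search, judge)
```

Once this loop is cheap and hours-fast, every proposed ranking change can be scored against a fixed query set before shipping, which is the feedback-loop speedup the insight describes.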

Pinterest

Table documentation quality trumps model sophistication: weighting embeddings toward table metadata increased the search hit rate from 40% to 90%, proving data governance is the primary bottleneck for Text-to-SQL performance. Real-world deployment revealed that table discovery (finding the right tables among hundreds of thousands) and metadata quality (accurate descriptions of table purpose and column meanings) matter far more than prompt engineering or model selection. Benchmarks like Spider fail to capture this because they treat a small number of pre-specified, well-normalized tables as given.
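"Weighting embeddings toward table metadata" can be read as a weighted blend of a description embedding and a raw-schema embedding before similarity search. A toy 2-D sketch (the vectors and the 0.8 weight are invented for illustration):

```python
import math

def combine(meta_vec, schema_vec, meta_weight=0.8):
    """Blend embeddings, weighting the table-description vector over the schema vector."""
    return [meta_weight * m + (1 - meta_weight) * s
            for m, s in zip(meta_vec, schema_vec)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy embeddings: the metadata vector aligns with the query; the schema doesn't.
query = [1.0, 0.0]
meta_vec, schema_vec = [0.9, 0.1], [0.1, 0.9]
weighted = cosine(query, combine(meta_vec, schema_vec))
unweighted = cosine(query, combine(meta_vec, schema_vec, meta_weight=0.5))
```

When descriptions carry the meaning (and raw column names don't), up-weighting them moves relevant tables up the ranking, which is consistent with the 40% → 90% hit-rate jump; it also explains why the fix is governance (write good descriptions), not modeling.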

Showing top 8 of 52 insights · Browse all examples

Most Common Constraints

Real-world constraints that shaped how teams built their agents.

Production Quality Requirements (2)
Cost Efficiency Required (2)
Hundreds of Thousands of Tables (2)
Metadata Quality Dependency (2)
Heterogeneous Systems (2)
Production Scale (2)
Manual Production Unsustainable (1)
Multi-Language Scale Required (1)

Common Anti-Patterns

What teams tried that didn't work. Learn from these failures to avoid repeating them.

complex-prompt-instructions

Adding more English-language constraints to prompt instructions for original content generation produced subpar results requiring extensive manual editing; feeding curriculum content as patterns worked better.

— Duolingo

perfect-first-try-generation

Trying to make the AI generate perfect content on the first attempt failed; a generate-many, filter-to-best approach with comprehensive evaluators was more effective.

— Duolingo

unstructured-exercise-placement

Giving the AI freedom to sequence exercises resulted in hit-or-miss quality; standardizing exercise order using learner session data improved reliability.

— Duolingo

translation-automation-only

Automated translations frequently missed accuracy and proficiency-level requirements; curriculum-driven generation aligned better with learning goals.

— Duolingo

simple-prompt-only

Using GPT-4 with a basic prompt produces queries that lack company context, ignore data policies, and suffer hallucinations; domain-enriched metadata and multi-stage retrieval are essential for production quality.

— Delivery Hero (Woowa Brothers)

generic-metadata-without-enrichment

Table schemas alone are insufficient to encode business logic; augmented documentation with detailed column descriptions, business-terminology glossaries, and few-shot SQL examples from domain experts is required.

— Delivery Hero (Woowa Brothers)

Showing 6 common failures · See all examples

Popular Frameworks

Most-used frameworks in these implementations. Note: Popularity ≠ right for your problem.

custom: 8 cases
Custom: 5 cases
LangGraph: 5 cases
LangChain: 3 cases
LlamaIndex: 3 cases

Is your problem a good fit for agents?

Compare your requirements to these proven patterns. If you see similar problems, constraints, and complexity, you've found a validated starting point.

Browse All Examples →