
Chapter 17: Building RAG Applications

How do I build applications that combine search with generation?


Large language models hallucinate. Ask about your company's policies, your product documentation, or yesterday's support tickets, and they'll generate confident, plausible, completely fabricated answers. You cannot prompt this away; it's fundamental to how these models work.

Retrieval-augmented generation transforms hallucination from an unsolvable problem into misquotation, which you can fix. Retrieve relevant context from your data before generating. Include that context in the prompt. The model grounds its answers in your specific information rather than its general training. When RAG fails, it fails in diagnosable, correctable ways.

RAG Failures Are Confidently Wrong

A RAG system that retrieves the wrong chunks fails confidently, with citations. Users trust cited answers more than uncited ones. Bad retrieval paired with good generation produces confidently wrong answers that are harder to catch than obvious hallucinations.

RAG introduces its own failure modes that require careful management. Too much retrieval overflows context windows or buries signal in noise; too little and the model falls back to hallucination. Building effective RAG means understanding these tradeoffs and designing systems that fail gracefully.

This chapter covers RAG on Cloudflare: the architectural decisions that determine success, when to use managed versus custom pipelines, and edge-specific patterns that distinguish Cloudflare RAG from traditional cloud implementations.

RAG economics

RAG has three cost centres: embedding generation, vector storage, and inference. The distribution surprises most teams.

Embedding generation runs once per chunk during indexing, then once per query. A million-document corpus at roughly five chunks per document produces around five million embeddings at indexing time, a one-time cost measured in single-digit dollars at Workers AI pricing. Query embeddings add ongoing cost, but embedding a short query is cheap.

Vector storage scales with corpus size and embedding dimensions. A 768-dimension embedding consumes roughly 3 KB. Five million vectors at 3 KB each is 15 GB. Vectorize pricing makes this manageable, but dimension choice compounds: 1536-dimension embeddings double storage costs for marginal quality gains.
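
The arithmetic is simple enough to sanity-check before committing to an embedding model. A minimal sketch, assuming float32 vectors (4 bytes per dimension) and ignoring index overhead and metadata:

// Rough vector storage estimate: float32 vectors, 4 bytes per dimension.
// Ignores index overhead and metadata, so treat the result as a lower bound.
function estimateVectorStorageGB(vectorCount: number, dimensions: number): number {
  const bytesPerVector = dimensions * 4;           // 768 dims ≈ 3 KB
  return (vectorCount * bytesPerVector) / 1024 ** 3;
}

// Five million 768-dimension vectors ≈ 14–15 GB of raw vector data
console.log(estimateVectorStorageGB(5_000_000, 768).toFixed(1));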

Inference dominates ongoing costs. Every RAG query runs an LLM completion with retrieved context in the prompt. More tokens, higher costs. The instinct to retrieve more chunks "just in case" directly increases your inference bill.

Optimise Inference First

Inference accounts for over 95% of ongoing RAG costs. Optimise where the money actually goes: generation, not storage.

Consider documentation search serving 100,000 monthly queries against a 50,000-document corpus. Indexing: $6 one-time for 250,000 chunks. Vector storage: $4 monthly. Query embeddings: $2 monthly. Inference: $180 to $500 monthly depending on model and response length. Optimise inference first.

This cost structure shapes architectural decisions. Aggressive chunking requires retrieving more chunks to assemble useful context, increasing inference costs. Conservative chunking increases storage but can reduce inference costs through fewer, more complete chunks. The optimum depends on your query patterns. Measuring beats intuition.

RAG on the edge

Traditional RAG concentrates components in a single region. Your vector database, embedding service, and LLM all sit in us-east-1, and users worldwide pay latency penalties.

Cloudflare RAG distributes differently. Vectorize indexes replicate globally, with queries executing at the edge nearest your user. Embedding generation through Workers AI happens at the edge. Only final LLM inference might route to specific hardware. Traditional RAG spends more time on network than compute; Cloudflare inverts that ratio.

A query taking 800ms in a centralised architecture might take 500ms on Cloudflare: same compute, less network. Edge architecture also changes failure characteristics. Regional outages in traditional cloud RAG take down the entire system, while Cloudflare's distribution means problems in one location don't affect users elsewhere.

Edge distribution introduces constraints. Vectorize indexes have size limits: ten million vectors on paid plans. Exceeding that requires multiple indexes with routing logic. For most applications, the latency and resilience benefits outweigh this complexity. For very large corpora, evaluate carefully.

Hyperscaler comparison

Aspect | Cloudflare RAG | AWS Bedrock + OpenSearch | Azure AI Search
Vector search latency | Sub-50ms at edge | 50–200ms regional | 50–150ms regional
Global distribution | Native replication | Manual multi-region setup | Manual replication
Managed RAG option | AI Search | Knowledge Bases | Azure AI Search
Maximum vectors | 10M per index | Effectively unlimited | 1B per index
Hybrid search | Custom via D1 FTS5 | Built-in | Built-in
Egress costs | None | Significant | Significant

Cloudflare wins on latency and simplicity for globally distributed applications with moderate corpus sizes. Hyperscalers win on scale limits and built-in hybrid search. If your corpus exceeds ten million vectors and can't be partitioned logically, or you need sophisticated hybrid search without building it yourself, hyperscaler offerings may fit better despite latency penalties.

Chunking strategy

Chunking determines retrieval quality more than any other factor. The decision isn't which algorithm to implement; any competent engineer can write a text splitter. Rather, it's about which tradeoffs to accept.

Smaller chunks (200–300 tokens) improve retrieval precision. When a query matches a small chunk, you know exactly which content is relevant. But small chunks lack context. A paragraph explaining a concept may need the preceding paragraph to make sense; retrieve the second without the first and you get technically accurate but practically useless content.

Larger chunks (500–1000 tokens) preserve context. Retrieve one and you have enough surrounding material to understand it. But larger chunks dilute relevance. If your query matches a single sentence in a 500-token chunk, you're injecting hundreds of irrelevant tokens into your context window: tokens that cost money and attention.

Overlapping chunks hedge against boundary losses. If an answer spans two chunks, non-overlapping chunking might split it perfectly, with each chunk containing half the answer and neither ranking highly enough to retrieve. Overlap of 10–15% ensures at least one chunk contains the complete thought, at the cost of proportionally more storage and embedding compute. A 400-token chunk with 10% overlap shares 40 tokens with its neighbours; at document boundaries, overlap simply stops.

Semantic chunking respects document structure. Split on paragraph boundaries, section headers, or logical divisions rather than arbitrary token counts. Authors grouped related information together for a reason. But semantic chunks vary wildly in size, complicating context window management.
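
A minimal chunker makes these tradeoffs concrete. The sketch below splits on paragraph boundaries, packs paragraphs into chunks up to a token budget, and carries a tail of the previous chunk forward as overlap; the four-characters-per-token heuristic and the default parameters are illustrative assumptions, not recommendations.

// Paragraph-aware chunking with overlap — a sketch, not a tuned implementation.
const APPROX_CHARS_PER_TOKEN = 4; // rough heuristic for English text

function estimateTokens(text: string): number {
  return Math.ceil(text.length / APPROX_CHARS_PER_TOKEN);
}

function chunkDocument(text: string, maxTokens = 400, overlapTokens = 40): string[] {
  const paragraphs = text.split(/\n\s*\n/).map((p) => p.trim()).filter(Boolean);
  const chunks: string[] = [];
  let current = "";

  for (const paragraph of paragraphs) {
    if (current && estimateTokens(current + "\n\n" + paragraph) > maxTokens) {
      chunks.push(current);
      // Carry the tail of the previous chunk forward so a thought that spans
      // a boundary appears complete in at least one chunk.
      current = current.slice(-overlapTokens * APPROX_CHARS_PER_TOKEN);
    }
    current = current ? current + "\n\n" + paragraph : paragraph;
  }
  if (current) chunks.push(current);
  return chunks;
}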

Choosing a chunking strategy

Document type | Recommended approach | Rationale
API documentation | Semantic chunking on endpoints/sections | Each endpoint is a logical unit; splitting mid-endpoint creates useless fragments
Legal contracts | Semantic chunking on clauses, 500–800 tokens | Clauses are self-contained; smaller chunks lose critical context
Support tickets | Small chunks, 200–300 tokens, minimal overlap | Conversational content has natural boundaries; precision matters more than context
Technical guides | Medium chunks, 400–500 tokens, 15% overlap | Balance between context preservation and retrieval precision
Mixed corpus | Document-type routing to appropriate strategy | One size doesn't fit all; invest in classification

Chunking Determines Retrieval Quality

Chunking affects retrieval quality more than embedding model choice, dimension count, or similarity function. Start with semantic chunking at 300–500 tokens with 10–15% overlap. Adjust based on observed failures: incomplete answers suggest chunks too small, irrelevant tangents suggest chunks too large.

Vectorize

Vectorize is Cloudflare's vector database, purpose-built for RAG with global distribution and edge query execution.

When Vectorize fits

Vectorize excels at moderate-scale RAG with global user bases. Ten million vectors handles substantial corpora: two million documents with five chunks each, or ten million individual records. Query latency runs sub-50ms at the edge, and the binding model integrates cleanly with Workers.

Vectorize struggles at extreme scale. Exceeding ten million vectors requires multiple indexes with routing logic: a Worker examining query metadata to select the appropriate index, with cross-index queries requiring fan-out and merge. If you need features Vectorize lacks (built-in hybrid search, complex filtering beyond metadata, custom similarity functions), you'll work around limitations or choose alternatives.
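
The routing logic can stay simple when the partition key is known at query time. A sketch, assuming two hypothetical index bindings (DOCS_INDEX and TICKETS_INDEX) partitioned by content type:

// Route a query to the right Vectorize index by content type.
// DOCS_INDEX and TICKETS_INDEX are hypothetical bindings for two partitioned indexes.
function selectIndex(contentType: "docs" | "tickets", env: Env): VectorizeIndex {
  return contentType === "docs" ? env.DOCS_INDEX : env.TICKETS_INDEX;
}

async function routedQuery(embedding: number[], contentType: "docs" | "tickets", env: Env) {
  return selectIndex(contentType, env).query(embedding, { topK: 5 });
}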

Dimension selection

Higher dimensions capture more semantic nuance but cost more to store and query. Default to 768 dimensions until benchmarks prove otherwise.

bge-base-en-v1.5 at 768 dimensions provides strong retrieval quality for most applications at reasonable cost. bge-small-en-v1.5 at 384 dimensions reduces storage and improves query latency at some quality cost; consider it for high-volume, latency-sensitive applications where retrieval quality can tolerate degradation. Models outputting 1536 dimensions capture maximal semantic detail but double storage costs versus 768; consider only when retrieval quality demonstrably improves for your specific queries.

Vectorize caps dimensions at 1536. Higher-dimension models require truncation or a different model choice.

Embedding model selection

Model | Dimensions | Best for | Avoid when
bge-base-en-v1.5 | 768 | Default choice, strong English performance | Multilingual corpus
bge-small-en-v1.5 | 384 | Latency-critical, cost-sensitive applications | Quality is paramount
bge-large-en-v1.5 | 1024 | When base model retrieval quality disappoints | Cost constraints exist
Multilingual models | Varies | Non-English or mixed-language corpus | English-only content
External APIs (OpenAI, Cohere) | Up to 1536 | Specific model requirements, existing investment | Latency-sensitive edge deployment

Never Mix Embedding Models

Different embedding models produce incompatible vector spaces. A 0.9 similarity score means nothing when comparing vectors from different models. Choose once and stay consistent, or reindex everything when you switch.

Index configuration

Dimensions and distance metric are immutable; changing either requires a new index and full reindex.

Cosine similarity is correct for most embedding models because it measures directional alignment independent of vector magnitude. Use Euclidean distance only if your embedding model documentation recommends it. Dot product suits models trained with that metric, which is rare.

Multi-tenancy with namespaces

Namespaces partition an index into isolated segments. Each namespace functions as an independent vector space; queries return results only from the specified namespace.

Multi-tenant vector isolation with namespaces
// Tenant data stays isolated without separate indexes.
// Namespace is set per vector on upsert and passed as an option on query.
await env.VECTOR_INDEX.upsert(
  vectors.map((v) => ({ ...v, namespace: `tenant-${tenantId}` }))
);
const results = await env.VECTOR_INDEX.query(embedding, {
  namespace: `tenant-${tenantId}`,
  topK: 5
});

This provides data separation without maintaining separate indexes per tenant. The tradeoff: all tenants share the ten-million-vector limit. If individual tenants have large corpora, you'll need dedicated indexes per tenant anyway.

Metadata design

Metadata enables filtering and provides context for retrieved results. Design the schema carefully; you'll query against it and use it to reconstruct context.

Store document references (IDs, titles, URLs) for citation and source linking. Store categorical data (document type, product line, language) for query filtering. Store the original chunk text as well: Vectorize holds vectors and metadata, not your source documents, so without text in the metadata you can't assemble context for generation.

Metadata filtering narrows search scope before similarity ranking. Use it when criteria are known at query time: searching only product documentation, excluding old content, restricting to a specific language. Post-filtering offers more flexibility but wastes retrieval capacity on results you'll discard. Filter early when you can; filter late when you must.
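
A sketch of what that looks like in practice, assuming the embedding, chunkText, and queryEmbedding values are already in scope; the field names and filter values are illustrative assumptions rather than a required schema:

// Upsert a chunk with citation references, categorical fields, and the chunk text itself.
await env.VECTOR_INDEX.upsert([{
  id: "doc-42#chunk-3",
  values: embedding,
  metadata: {
    docId: "doc-42",
    title: "Billing API reference",
    url: "https://example.com/docs/billing",
    docType: "api-docs",     // categorical field used for filtering
    lang: "en",
    text: chunkText          // original chunk text, needed to assemble context later
  }
}]);

// Filter before similarity ranking: only English product documentation.
// (Vectorize may require a metadata index on filtered fields — check current docs.)
const results = await env.VECTOR_INDEX.query(queryEmbedding, {
  topK: 5,
  filter: { docType: "api-docs", lang: "en" },
  returnMetadata: "all"
});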

Vectorize limits

Limit | Value | Architectural implication
Maximum dimensions | 1536 | Constrains embedding model choice
Vectors per index | 10,000,000 (paid plans) | Roughly 2M documents at 5 chunks each
Metadata per vector | 10 KB | Sufficient for chunk text plus metadata
Indexes per account | 100 | Enables multi-index architectures for scale

Approaching the limit requires architectural decisions. Multiple indexes with routing logic add complexity but remove the ceiling. More aggressive chunking reduces vector count but may harm retrieval quality. External vector databases trade latency for scale. Choose based on whether your scale is temporary (approaching the limit during a migration) or permanent (corpus genuinely exceeds Cloudflare's limits).

AI Search: managed RAG

AI Search handles chunking, embedding, indexing, and retrieval automatically. Point it at data sources and query directly. AI Search is the correct choice until it isn't. If standard chunking works for your data, you've saved months of engineering.

Choosing between AI Search and custom RAG

Requirement | AI Search | Custom RAG
Time to prototype | Hours | Weeks
Chunking control | None (standard algorithm) | Complete
Embedding model choice | Fixed | Any model
Hybrid search | Not available | Build with D1 FTS5
Corpus size | Up to 100,000 files | Limited by Vectorize (10M vectors)
Index management | Automatic | Programmatic
Supported formats | PDF, TXT, MD, HTML, CSV, JSON | Anything you can parse
Ongoing maintenance | Minimal | Significant

Choose AI Search for rapid prototyping, standard document formats, corpora under 100,000 files, and teams without RAG expertise. Choose custom RAG for specific chunking strategies, hybrid search, custom embedding models, programmatic index management, or unsupported formats.

The decision often becomes clear during prototyping. If AI Search disappoints and you can identify why (chunks splitting wrong, important keywords missing from semantic matches, filtering needs it can't express), custom RAG addresses those specific problems. If AI Search works acceptably, custom RAG's engineering effort rarely justifies marginal improvements. Measure before rebuilding.

AI Search constraints

AI Search limits shape what you can build. Ten instances per account means ten separate knowledge bases, sufficient for departmental separation but limiting for multi-tenant SaaS where each customer needs isolated search. The 100,000 files per instance ceiling suits documentation and support content but falls short for enterprise-scale repositories. The 100 MB maximum file size handles most documents but excludes large media files or data dumps.

Path filtering provides control over what gets indexed from website and R2 data sources. Include and exclude rules using glob patterns let you index documentation while skipping drafts, exclude admin pages from results, or limit indexing to specific language directories. This filtering improves result relevance by keeping irrelevant content out of the index, and enables splitting a single data source across multiple AI Search instances for specialised search experiences.

Needing more than ten instances requires routing logic across multiple AI Search instances or custom RAG. Path filtering can help here: the same R2 bucket or website can feed multiple instances, each filtering for different content subsets. Approaching 100,000 files per instance, evaluate whether content can split across instances using path filters or whether custom RAG offers more headroom.

Custom RAG architecture

When AI Search doesn't fit, you build custom pipelines. The complexity is real but manageable, and the control enables optimisations impossible with managed services.

The indexing pipeline

A custom indexing pipeline coordinates document processing, chunking, embedding, and storage. The orchestration is straightforward; the decisions hide inside the helper functions. Format-specific text extraction, chunking strategy selection based on document type, and metadata schema design determine quality. The pipeline structure itself is less important.
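
A condensed sketch of that orchestration, assuming Workers AI and Vectorize bindings (AI, VECTOR_INDEX), the @cf/baai/bge-base-en-v1.5 model identifier, and hypothetical helpers extractText and chunkForType that hold the format-specific decisions:

// Indexing pipeline sketch: extract → chunk → embed → upsert.
// extractText and chunkForType are hypothetical helpers holding the real decisions.
async function indexDocument(doc: { id: string; title: string; body: string; type: string }, env: Env) {
  const text = extractText(doc);                 // format-specific extraction
  const chunks = chunkForType(text, doc.type);   // chunking strategy varies by document type

  // Workers AI returns one embedding per input string.
  const { data } = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: chunks });

  await env.VECTOR_INDEX.upsert(
    chunks.map((chunk, i) => ({
      id: `${doc.id}#${i}`,
      values: data[i],
      metadata: { docId: doc.id, title: doc.title, docType: doc.type, text: chunk }
    }))
  );
}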

Chunking strategy should vary by content type. Technical documentation benefits from semantic chunking on section boundaries. Support tickets need smaller chunks with less overlap. A mixed corpus requires classification and routing. One-size-fits-all chunking produces mediocre results across everything.

The retrieval pipeline

Retrieval converts queries to vectors, searches the index, and assembles context for generation. The critical decision isn't the search itself (that's a single API call), but what to do when retrieval confidence is low.

Handling low-confidence retrieval
// Detect low-confidence retrieval before proceeding.
const threshold = 0.8; // start conservative; calibrate per embedding model
const results = await env.VECTOR_INDEX.query(queryEmbedding, { topK: 5 });

if (!results.matches.length || results.matches[0].score < threshold) {
  return "I couldn't find relevant information to answer this question.";
}

Threshold calibration requires care. Similarity scores don't have universal meaning; 0.7 might indicate strong relevance with one embedding model and weak relevance with another. Start conservative at 0.8, accepting false negatives where the system claims ignorance despite relevant content existing. Monitor user feedback to identify these cases. Lower the threshold gradually while tracking false positives; these are cases where the system retrieves and cites irrelevant content. Finding the optimal threshold requires empirical measurement, not guesswork.
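
Once retrieval clears the threshold, what remains is context assembly and generation. A sketch under the same assumptions as the indexing example (chunk text stored in metadata, Workers AI bindings, and the @cf/meta/llama-3.1-8b-instruct model identifier); the low-confidence guard above would slot in before the generation call:

// Assemble retrieved chunk text into a grounded prompt, then generate.
async function answer(query: string, env: Env): Promise<string> {
  const { data } = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: [query] });
  const results = await env.VECTOR_INDEX.query(data[0], { topK: 5, returnMetadata: "all" });

  const context = results.matches
    .map((m, i) => `[${i + 1}] ${m.metadata?.text}`)
    .join("\n\n");

  const completion = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      { role: "system", content: "Answer using only the provided context. Cite sources as [n]. If the context is insufficient, say so." },
      { role: "user", content: `Context:\n${context}\n\nQuestion: ${query}` }
    ]
  });
  return (completion as { response: string }).response;
}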

Vector search finds semantically similar content; keyword search finds exact term matches. Production RAG often needs both.

Vector search excels at semantic similarity. A query about "resetting credentials" matches content about "password recovery" even without those exact words. But vector search can miss exact matches that matter. A query for "ERR_SSL_PROTOCOL_ERROR" might retrieve generic SSL troubleshooting rather than the specific documentation for that error code.

Keyword search through D1's FTS5 extension finds exact term matches, complementing vector search for precise lookups.
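
A sketch of the keywordSearch helper used in the hybrid example below, assuming a D1 binding (DB) and an FTS5 virtual table named chunks_fts with chunk_id and text columns; the table name and schema are assumptions:

// Keyword search against a D1 FTS5 table. bm25() scores are lower-is-better,
// so ascending order puts the best matches first.
async function keywordSearch(query: string, env: Env) {
  const { results } = await env.DB.prepare(
    `SELECT chunk_id, text, bm25(chunks_fts) AS score
     FROM chunks_fts
     WHERE chunks_fts MATCH ?
     ORDER BY score
     LIMIT 10`
  )
    .bind(query)   // note: FTS5 MATCH has its own query syntax; sanitise user input
    .all();
  return results;
}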

Hybrid search combining vector and keyword results
async function hybridSearch(query: string, env: Env) {
  const [vectorResults, keywordResults] = await Promise.all([
    vectorSearch(query, env),   // Vectorize similarity query
    keywordSearch(query, env)   // D1 FTS5 query
  ]);

  return mergeResults(vectorResults, keywordResults);
}

Merging logic determines hybrid search quality. Simple interleaving alternates results: first vector, first keyword, second vector, and so on. This works when both sources produce comparable quality but can elevate mediocre keyword matches above excellent vector matches.

Scored merging normalises scores from both sources and ranks by combined score. Vector similarity typically ranges 0–1. FTS5 BM25 scores have no fixed range; divide by maximum score or apply a sigmoid to compress into standard range. Weight sources based on observed performance.

When documents appear in both result sets, deduplicate by keeping the higher-scored instance. Track which source contributed each result to diagnose retrieval quality issues.

Start with simple interleaving. Add scored merging only when interleaving demonstrably fails. Complexity should follow evidence of need.
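
A sketch of that starting point: mergeResults as interleaving with deduplication by ID, tracking which source contributed each result. The result shape is an assumption; in simple interleaving, the first occurrence of a duplicate wins because scores from the two sources aren't directly comparable without normalisation.

// Interleave vector and keyword results, dropping duplicates by id.
interface RankedResult {
  id: string;
  text: string;
  score: number;
  source: "vector" | "keyword";
}

function mergeResults(vector: RankedResult[], keyword: RankedResult[], limit = 5): RankedResult[] {
  const merged: RankedResult[] = [];
  const seen = new Set<string>();
  const maxLen = Math.max(vector.length, keyword.length);

  for (let i = 0; i < maxLen && merged.length < limit; i++) {
    // Alternate sources at each rank: vector first, then keyword.
    for (const candidate of [vector[i], keyword[i]]) {
      if (candidate && !seen.has(candidate.id) && merged.length < limit) {
        seen.add(candidate.id);
        merged.push(candidate);
      }
    }
  }
  return merged;
}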

RAG failure modes

RAG systems fail in characteristic ways. Understanding these helps you design monitoring and graceful degradation.

Retrieval failures

The most common failure: relevant content exists but isn't retrieved. Symptoms include answers that miss obvious information or generic responses when specific answers exist in the corpus.

Causes vary. Poor chunking splits relevant content across chunks, diluting similarity scores. Embedding model mismatch means your model encodes semantics differently than your queries express them; a model trained on formal text may not embed casual queries effectively. Insufficient indexing means relevant content was never processed, through ingestion failures or gaps in source coverage.

Context overflow

Models have finite context windows. Retrieve too many chunks and you exceed the limit, causing truncation or errors. More subtly, too many chunks dilute signal with noise as the model attends to irrelevant content.

Symptoms include answers citing irrelevant sources, responses missing the most relevant information despite it being retrieved, or explicit truncation errors.

Solutions: retrieve fewer chunks through better precision, summarise chunks before injection to preserve signal while reducing tokens, or use models with larger context windows at higher cost. Retrieval precision improvements compound; better retrieval means fewer chunks needed, lower inference costs, and better answer quality simultaneously.
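
One simple guard against overflow is to stop adding chunks once an estimated token budget is spent. A sketch using the same rough four-characters-per-token heuristic as the chunking example; the budget value is an assumption to tune against your model's context window:

// Add chunks in relevance order until the estimated token budget is spent.
function assembleContext(chunks: { text: string; score: number }[], maxContextTokens = 2000): string {
  const selected: string[] = [];
  let used = 0;

  for (const chunk of chunks) {            // assumed sorted by descending score
    const tokens = Math.ceil(chunk.text.length / 4);
    if (used + tokens > maxContextTokens) break;
    selected.push(chunk.text);
    used += tokens;
  }
  return selected.join("\n\n");
}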

Generation failures

Even with good retrieval, generation can fail. The model might ignore context and hallucinate anyway, particularly when context contradicts training data. It might synthesise incorrectly across sources, combining facts that don't belong together. It might quote accurately but miss the actual answer, fixating on related but non-responsive content.

These failures are harder to detect automatically. Structured prompts requiring explicit source citation make hallucination visible; the model must point to where it got each claim, making unsupported claims stand out. Spot-checking answers against retrieved context catches synthesis errors. User feedback remains the most reliable signal.

Staleness

RAG answers based on indexed content. If source documents update but indexes don't, answers become stale. Users trust RAG because it cites sources; stale citations betray that trust worse than honest uncertainty.

For slowly-changing corpora, scheduled batch reindexing suffices: nightly or weekly jobs rebuilding indexes from current sources. Frequently-changing content needs incremental indexing on document update, which adds operational complexity. Event-driven reindexing triggers on document changes and requires your storage to emit change events.

Answer caching requires careful invalidation. TTL-based invalidation accepts bounded staleness; answers may be up to N minutes old. Version-keyed caching includes corpus version in the cache key, invalidating everything when the index updates. Content-hash caching includes hashes of retrieved chunks, invalidating only when specific sources change. Choose based on staleness tolerance and cache hit rate requirements.
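
A sketch of version-keyed caching with Workers KV, assuming a hypothetical KV binding (ANSWER_CACHE), the answer function from earlier, and a corpusVersion value bumped on each reindex:

// Version-keyed answer cache: bumping corpusVersion invalidates everything at once.
async function cachedAnswer(query: string, corpusVersion: string, env: Env): Promise<string> {
  const key = `answer:${corpusVersion}:${await sha256(query)}`;

  const cached = await env.ANSWER_CACHE.get(key);
  if (cached !== null) return cached;

  const fresh = await answer(query, env);                            // the RAG pipeline from earlier
  await env.ANSWER_CACHE.put(key, fresh, { expirationTtl: 86400 });  // TTL bounds staleness too
  return fresh;
}

// Hash the query so arbitrary user text becomes a safe, fixed-length key.
async function sha256(text: string): Promise<string> {
  const digest = await crypto.subtle.digest("SHA-256", new TextEncoder().encode(text));
  return [...new Uint8Array(digest)].map((b) => b.toString(16).padStart(2, "0")).join("");
}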

Monitoring for failures

Log retrieval results alongside user feedback: query, retrieved chunk IDs, similarity scores, generated answer, and explicit feedback (thumbs up/down, corrections, follow-up questions indicating confusion).
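
A sketch of the log record; structured JSON lines like this can flow into whatever log pipeline you already use, and the field names are assumptions:

// Structured retrieval log: enough to correlate retrieval quality with feedback later.
interface RetrievalLog {
  queryId: string;
  query: string;
  chunkIds: string[];
  topScore: number;
  answerLength: number;
  feedback?: "up" | "down";
}

function logRetrieval(entry: RetrievalLog): void {
  console.log(JSON.stringify({ type: "rag_retrieval", ts: Date.now(), ...entry }));
}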

Track top-k similarity score distribution over time. Gradual decline signals corpus-query drift; your indexed content is becoming less relevant to user questions. Sudden drops indicate indexing failures or embedding model issues.

Sample queries for human evaluation regularly. Automated metrics cannot fully capture retrieval quality; human judgment identifies subtle failures that metrics miss. Even 1% sampling surfaces issues before they become widespread.

Correlate retrieval confidence with user satisfaction where feedback exists. If high-confidence retrievals receive positive feedback and low-confidence retrievals receive negative feedback, your thresholds are well-calibrated. Weak correlation means thresholds need adjustment or retrieval quality needs improvement.

Optimising RAG performance

RAG latency has three components: embedding the query, searching the index, and generating the answer. On Cloudflare, the first two happen at the edge, typically under 50ms combined. Generation dominates; it often takes 300ms to several seconds depending on model and response length.

Model selection directly affects latency and cost. Llama 3.1 8B generates faster and cheaper than 70B. For many RAG applications, the smaller model suffices because retrieved context provides the specificity that larger models achieve through extensive training. The smaller model reads your documentation and answers from it; it doesn't need to know everything, just to read well. Benchmark both before assuming you need the larger model.

Streaming responses improve perceived latency dramatically. Users see content appearing immediately rather than waiting for complete generation. Total time doesn't change, but experience does; a 2-second streaming response feels faster than a 1.5-second blocking response.
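
A sketch of streaming from a Worker, assuming Workers AI's streaming mode returns a readable stream of server-sent events for the chosen model and that context has already been assembled:

// Stream tokens back to the client as they are generated.
async function streamAnswer(query: string, context: string, env: Env): Promise<Response> {
  const stream = await env.AI.run("@cf/meta/llama-3.1-8b-instruct", {
    messages: [
      { role: "system", content: "Answer using only the provided context." },
      { role: "user", content: `Context:\n${context}\n\nQuestion: ${query}` }
    ],
    stream: true
  });

  // With stream: true, the binding returns a ReadableStream of server-sent events.
  return new Response(stream as ReadableStream, {
    headers: { "content-type": "text/event-stream" }
  });
}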

Answer caching suits RAG with repeated queries. Documentation search often sees the same questions; "how do I reset my password" appears constantly. Caching eliminates generation latency for cache hits. The tradeoff is invalidation complexity. For stable corpora, aggressive caching makes sense. For rapidly-changing content, shorter TTLs or version-keyed caching prevent stale answers.

When RAG is wrong

RAG finds relevant text in unstructured content. If your answer isn't in unstructured text, RAG is the wrong tool.

Very small corpora don't need RAG. If your knowledge base fits in a model's context window, include it directly. A 50-page document fits comfortably in modern context windows. Retrieval machinery adds complexity without benefit when you can provide all context every time.

Rapidly changing data strains RAG architectures. If source documents update every minute, indexing lag means perpetually stale answers. Real-time data needs different approaches, such as direct database queries, live API calls, or tool use, rather than pre-computed vector indexes.

Queries not benefiting from semantic search waste RAG's strengths. Exact lookups for order status or account balance need database queries, not vector similarity. Computations need code execution, not text retrieval. Structured data queries need SQL, not embeddings.

High-precision requirements may exceed RAG's capabilities. Legal research, medical diagnosis, and other domains where wrong answers cause serious harm need retrieval quality RAG may not achieve. Vector similarity scores don't map to accuracy guarantees; 0.85 similarity doesn't mean 85% confidence the answer is correct. When precision matters more than coverage, evaluate carefully and consider human review for high-stakes queries.

What comes next

RAG retrieves context for generation. It's a powerful but fundamentally reactive pattern. The system waits for questions and retrieves relevant context to answer them. Chapter 18 covers AI agents: applications that go beyond question-answering to take actions, maintain conversation state, and use tools to accomplish goals.

Agents often incorporate RAG as one capability among many. An agent might retrieve documentation, call an API based on what it learned, then update a database with results. The Agents SDK provides the framework for these multi-step, stateful AI applications, with Durable Objects providing the coordination stateful agents require.