Chapter 15: The AI Stack on Cloudflare

Can we build AI-powered applications on Cloudflare, and what's the architecture?


Cloudflare's AI stack trades frontier model quality for reduced latency and operational simplicity, and if that trade-off works for your use case, what follows will save you months of infrastructure work. If it doesn't, AI Gateway lets you use frontier models whilst still getting Cloudflare's observability. Know which path you're on before you start building.

This chapter covers how Cloudflare's AI infrastructure works, where it outperforms alternatives, and where you should use something else. Chapters 16 through 18 cover implementation.

The core trade-off

Workers AI runs capable open-source models (Llama, Mistral, Qwen) on Cloudflare's GPU infrastructure. These models handle most AI tasks competently: classification, summarisation, embeddings, code generation, and conversational responses. They don't match GPT-4 or Claude on complex reasoning, nuanced creative writing, or sophisticated instruction-following.

That gap matters less than you'd expect, because a response that's 90% as good but arrives 200ms faster often provides better user experience, and an embedding model that's "good enough" at a tenth of the cost processes ten times more documents for the same budget. The question isn't whether Workers AI models are the best available (they aren't) but whether they're good enough for your use case.

If your users can't tell the difference between Llama 3 and GPT-4 for your specific task, you're paying a tax on sophistication you don't need. If they can tell the difference and it matters, use the better model. AI Gateway routes those requests while keeping observability on-platform.

Cloudflare's AI stack suits applications where AI is a feature, not the product. Building the next ChatGPT? Use dedicated AI infrastructure with frontier models and fine-tuning. Adding intelligence to an existing application (search that understands intent, support that suggests answers, content that categorises itself)? Cloudflare gives you AI without the AI ops.

How "edge AI" actually works

"AI at the edge" suggests models running in every Cloudflare location alongside your Workers, but that's not how it works, and the distinction matters for latency calculations.

Workers AI runs on GPU clusters distributed across Cloudflare's network, but because GPUs are expensive and power-hungry, they're not in all 300+ locations where Workers run. When your Worker calls Workers AI, the request routes to a GPU-equipped data centre within the same continent but not necessarily the same city. Cloudflare doesn't publish exact locations, so assume inference requests travel regionally, not locally.

Workers AI latency has two components: the round-trip from your Worker to a GPU cluster plus inference time. For a London user calling a Worker that invokes Workers AI, the request might flow: user → London PoP (Worker executes) → GPU cluster (possibly London, possibly Frankfurt or Amsterdam) → back to Worker → back to user, with the Worker-to-GPU leg adding 10-50ms on top of inference time.

Compare this to calling OpenAI directly, where OpenAI's inference runs primarily from US data centres. A London user's request flows: user → London PoP → OpenAI (US) → back to London → back to user, with that transatlantic round-trip adding 100-150ms before OpenAI even starts inference.

Edge AI doesn't mean models in every city but rather models on the right continent. For a European user, that might mean 30ms of network overhead instead of 130ms, and whether that 100ms matters depends on your use case; for autocomplete that must feel instant, it's transformative, whilst for batch processing, it's irrelevant.

Scaling and capacity

Workers AI doesn't have cold starts in the traditional sense because models are loaded and ready, though inference requests queue when GPU capacity is constrained, and latency increases as requests wait for available GPU cycles.

Cloudflare scales GPU capacity automatically, but GPU scaling is slower than Worker scaling, so traffic spikes might see elevated latency for seconds or minutes whilst capacity adjusts. Cloudflare doesn't publish percentile latency guarantees under load, so test your specific workload patterns before committing to latency-sensitive production use.

Median latency is predictable and documented per model; P99 latency under normal load typically runs 2-3x median, and P99 during capacity constraints can spike significantly higher. If your application has strict latency SLAs, build fallback paths such as cached responses, alternative providers through AI Gateway, or graceful degradation.

The product stack

Cloudflare's AI offerings form a coherent stack. Understanding what each solves prevents both over-engineering and under-building.

Workers AI: serverless inference

Workers AI exists because GPU infrastructure is operationally complex: provisioning GPUs, managing CUDA drivers, handling model loading, and scaling capacity are all problems you avoid by calling Workers AI. Configure a binding, invoke a model, and receive a response.
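
A minimal sketch of what that looks like, assuming an AI binding declared in wrangler configuration and a model ID from the current catalogue (both illustrative; check the Workers AI catalogue for available models and exact parameters):

```typescript
// Minimal Worker calling Workers AI through a binding.
// Assumes the wrangler configuration declares an AI binding named "AI" and that
// types come from @cloudflare/workers-types. The model ID is illustrative.
export interface Env {
  AI: Ai;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const { prompt } = (await request.json()) as { prompt: string };

    // A single inference call: no GPU provisioning, drivers, or model loading.
    const result = await env.AI.run("@cf/meta/llama-3-8b-instruct", {
      prompt,
      max_tokens: 256,
    });

    return Response.json(result);
  },
};
```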

The trade-off is control, as you get the models Cloudflare offers, configured the way Cloudflare configures them, with no fine-tuning, no custom model hosting, and no proprietary models. If you need GPT-4 or Claude, call those providers directly (potentially through AI Gateway for unified observability).

Workers AI is well suited to embeddings for semantic search, classification and extraction, summarisation of moderate-length documents, conversational responses where speed matters more than sophistication, and code generation for common patterns.

External providers handle complex multi-step reasoning, mathematical problem-solving, creative writing where quality is the product, very long context synthesis, and any task where you've tested Workers AI models and found them inadequate.

Vectorize: when and why

Vectorize stores vector embeddings and performs similarity search. If you're building RAG, a recommendation system, or semantic search, you need vector storage; the question is whether that storage should be Vectorize.

Factor               | Choose Vectorize                              | Choose external vector DB
---------------------|-----------------------------------------------|------------------------------------------------------
Platform integration | Single bill, native bindings, no network hop  | Additional vendor, latency for external calls
Scale                | Handles millions of vectors adequately        | Pinecone, Weaviate proven at massive scale
Query capabilities   | Similarity search plus metadata filtering     | Advanced filtering, hybrid search, more query options
Operational model    | Managed, minimal configuration                | More control, more operational burden
Team familiarity     | New tooling to learn                          | Pinecone/Weaviate may be known quantities

Vectorize Constraints (February 2026)
  • Maximum 1536 dimensions per vector (sufficient for most embedding models)
  • Metadata filtering for scoped queries
  • Namespaces for multi-tenant isolation
  • 10 million vectors per index on paid plans

These constraints rarely limit typical applications, but verify against your requirements.

Vectorize isn't the most feature-rich vector database available, but it's adequate for most RAG and search applications, integrates cleanly with Workers, and eliminates a vendor relationship. If you need advanced capabilities (hybrid keyword-and-semantic search, complex filtering, proven performance at billions of vectors), evaluate Pinecone or Weaviate.

Vectorize is a building block, not a complete RAG solution. You provide embeddings; Vectorize stores and retrieves them. Chunking strategy, embedding model selection, retrieval logic, and re-ranking remain your responsibility. Chapter 17 covers building complete RAG pipelines.
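
A sketch of that division of labour, assuming AI and VECTORIZE bindings and an illustrative embedding model; the binding names, model ID, and metadata shape are assumptions rather than prescriptions:

```typescript
// Sketch: you generate embeddings (here via Workers AI), Vectorize stores and queries them.
// Binding names (AI, VECTORIZE) and the embedding model ID are illustrative.
export interface Env {
  AI: Ai;
  VECTORIZE: VectorizeIndex;
}

export async function indexDocument(env: Env, id: string, text: string): Promise<void> {
  // You choose the embedding model; Vectorize only stores the resulting vectors.
  const embedding = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: [text] });
  await env.VECTORIZE.upsert([
    { id, values: embedding.data[0], metadata: { source: "docs" } },
  ]);
}

export async function search(env: Env, query: string) {
  const embedding = await env.AI.run("@cf/baai/bge-base-en-v1.5", { text: [query] });
  // Similarity search; chunking, re-ranking, and retrieval logic remain yours.
  const results = await env.VECTORIZE.query(embedding.data[0], { topK: 5 });
  return results.matches;
}
```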

AI Gateway: optionality as architecture

AI Gateway sits between your application and AI providers (Workers AI, OpenAI, Anthropic, Azure OpenAI, or others), with every request passing through to enable logging, analytics, caching, and routing control.

The immediate value is observability; without AI Gateway, understanding AI costs requires aggregating logs from multiple providers, correlating request patterns, and building analytics pipelines, whereas AI Gateway provides per-request logging, cost tracking, latency analysis, and error rates by provider and model.

The strategic value is optionality, as AI Gateway adds about 10 milliseconds of latency but buys the ability to change AI providers in 10 minutes instead of 10 weeks. The AI provider landscape will look different in two years, with models improving, pricing changing, and providers having outages, so AI Gateway abstracts the provider, making switches a configuration change rather than a code rewrite.
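
To make "configuration change" concrete, here is a minimal sketch against AI Gateway's provider-scoped endpoint. The account ID, gateway name, model, and response shape are placeholders and assumptions; verify the exact URL format for your provider against current AI Gateway documentation.

```typescript
// Sketch: call a provider through AI Gateway. Switching providers is mostly a
// change to the provider path segment and model name, not an integration rewrite.
// ACCOUNT_ID and GATEWAY_ID are placeholders.
const GATEWAY_BASE = "https://gateway.ai.cloudflare.com/v1/ACCOUNT_ID/GATEWAY_ID";

export async function complete(prompt: string, apiKey: string): Promise<string> {
  const response = await fetch(`${GATEWAY_BASE}/openai/chat/completions`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4-turbo", // change the provider segment and model to switch vendors
      messages: [{ role: "user", content: prompt }],
    }),
  });

  const data = (await response.json()) as { choices: { message: { content: string } }[] };
  return data.choices[0].message.content;
}
```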

The abstraction isn't perfect because providers have different APIs, token limits, response formats, and capabilities; switching from GPT-4 to Llama requires prompt adjustments and quality testing. However, AI Gateway handles the mechanics so you can focus on model-specific concerns.

AI Gateway adds clear value when you use multiple providers, want unified observability, need fallback routing (if OpenAI fails, try Workers AI), or when repetitive queries benefit from caching. The overhead isn't worth it when you have a single provider with no plans to change, every millisecond of latency matters, or when highly variable queries yield negligible cache hit rates.

For most applications building on Cloudflare, AI Gateway is worth the minimal overhead. Even if you use only Workers AI today, routing through AI Gateway preserves future flexibility at negligible cost.

AI Search: managed RAG

AI Search handles RAG end-to-end by accepting data sources (R2 buckets, web URLs), chunking documents, generating embeddings, building an index, and answering questions with retrieved context, so the entire pipeline becomes configuration.

The trade-off is control for convenience, as AI Search chooses chunking strategies, embedding models, and retrieval algorithms, which you can't customise. If the defaults work for your use case, you've saved significant implementation effort; if they don't, you need custom RAG with Vectorize.

AI Search Constraints (February 2026)
  • Maximum 10 AI Search instances per account
  • Maximum 100,000 files per instance
  • Limited control over chunking and retrieval parameters

These constraints suit applications where RAG is a feature rather than the core product.

The question isn't whether you can build your own RAG pipeline but whether your competitive advantage lies in RAG infrastructure or in what you do with it. If you're building a documentation Q&A feature and "good enough" retrieval suffices, AI Search eliminates weeks of work; if retrieval quality is your competitive advantage, build custom RAG and invest in tuning.

Agents SDK: stateful AI applications

The Agents SDK provides a framework for AI agents: applications where models maintain conversation state, use tools, and take actions across multiple interactions. The SDK builds on Durable Objects, inheriting their single-threaded consistency guarantees.

When do you need the Agents SDK versus Durable Objects directly? The SDK provides abstractions for common patterns: conversation history management, tool registration and execution, and structured prompting.

If your "agent" is simple (a chatbot with conversation history), you might not need the SDK, as a Durable Object storing messages and calling Workers AI handles that without additional abstraction. The SDK earns its complexity when you need tool use (the model calls functions you define), multi-step reasoning (the model chains multiple operations), or sophisticated conversation management (branching dialogues, context windowing, memory summarisation).

Chapter 18 covers the Agents SDK in detail. For strategic purposes: it's the right choice for serious agent applications, overkill for simple conversational features.

Data privacy and compliance

For enterprises evaluating Cloudflare's AI stack, data privacy is often decisive.

Inference requests are processed and discarded; Cloudflare doesn't train models on customer data, doesn't retain prompts or completions beyond request processing, and doesn't share inference data across customers.

AI Gateway logging is configurable, allowing you to enable request/response logging for debugging and analytics (storing data in your Cloudflare account subject to your configured retention) or disable logging entirely for sensitive workloads (sacrificing observability for privacy).

Compliance certifications follow Cloudflare's broader platform: SOC 2 Type II, ISO 27001, and GDPR compliance, with Workers AI inheriting these as part of the Workers platform.

AI Data Residency

Data residency considerations are less mature than for Durable Objects. Workers AI inference occurs on GPU clusters whose locations aren't published. If you have strict data residency requirements (processing must occur within specific geographic boundaries), verify with Cloudflare whether Workers AI can meet them. For some regulated workloads, this uncertainty may be disqualifying.

For most commercial applications, Workers AI's privacy posture is adequate. For applications handling highly sensitive data (healthcare, financial, government), verify specific requirements with Cloudflare and your compliance team before committing.

When things break

AI systems fail in ways that traditional systems don't. Understanding failure modes shapes resilient architectures.

Workers AI failures

Workers AI can fail due to capacity constraints (GPU queuing), model issues, or infrastructure issues (GPU cluster unavailable), which manifest as elevated latency, error responses, or timeouts.

Detection requires monitoring beyond standard request metrics by tracking AI-specific signals: inference latency percentiles, error rates by model, and timeout rates. Alert when latency exceeds thresholds that affect user experience, not just when requests fail outright, because a 5-second response from a feature that should take 500ms is a failure even if it returns HTTP 200.

Recovery strategies depend on criticality. For features that can fail gracefully (a "related content" widget, a summarisation preview), return cached content, show a placeholder, or hide the feature temporarily. For features where AI is essential (a search interface requiring semantic understanding), implement fallbacks to alternative providers through AI Gateway with clear degradation paths when all providers fail.
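
A sketch of graceful degradation for a non-essential feature, assuming a KV namespace for cached output and an arbitrary 2-second timeout; both are illustrative choices, not recommendations:

```typescript
// Sketch: degrade gracefully when inference fails or is too slow.
// The KV binding (CACHE), model ID, and 2s timeout are illustrative.
interface Env {
  AI: Ai;
  CACHE: KVNamespace;
}

export async function summarise(env: Env, articleId: string, text: string): Promise<string | null> {
  const cacheKey = `summary:${articleId}`;

  try {
    const result = (await Promise.race([
      env.AI.run("@cf/meta/llama-3-8b-instruct", {
        prompt: `Summarise in two sentences:\n${text}`,
      }),
      new Promise((_, reject) => setTimeout(() => reject(new Error("timeout")), 2000)),
    ])) as { response: string };

    await env.CACHE.put(cacheKey, result.response, { expirationTtl: 86400 });
    return result.response;
  } catch {
    // Fall back to the last good summary; callers hide the widget when this is null.
    return env.CACHE.get(cacheKey);
  }
}
```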

Model deprecation

Cloudflare's model catalogue evolves, and models are deprecated with notice, so your application must handle transitions. Design for model flexibility from the start by abstracting model selection behind configuration, testing against multiple models during development, and monitoring for deprecation announcements.
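
A minimal sketch of configuration-driven model selection, assuming a TEXT_MODEL variable set in wrangler configuration; the variable name and default model are illustrative:

```typescript
// Sketch: model IDs live in configuration, not in call sites, so a deprecation
// becomes a variable change rather than a code change. Names are illustrative.
interface Env {
  AI: Ai;
  TEXT_MODEL?: string; // set via [vars] in wrangler configuration
}

export async function generate(env: Env, prompt: string): Promise<string> {
  const model = env.TEXT_MODEL ?? "@cf/meta/llama-3-8b-instruct";
  // Cast because Workers AI types constrain model IDs to the known catalogue.
  const result = (await env.AI.run(model as Parameters<Ai["run"]>[0], { prompt })) as {
    response: string;
  };
  return result.response;
}
```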

Quality degradation

AI quality can degrade without obvious signals, as a model update might handle your prompts differently and produce subtly worse results, load patterns might cause inconsistent behaviour, and prompt changes that improve one use case might harm another.

Monitor quality, not just availability. For classification tasks, track accuracy against labelled samples; for generation tasks, implement human review sampling or automated quality scoring. Quality monitoring is harder than availability monitoring, but undetected quality degradation is worse than detected outages.

Cost architecture

AI costs can dominate infrastructure spending. Understanding the cost structure enables informed decisions.

Workers AI pricing

Workers AI charges per inference using "neurons", a unit that normalises across model sizes, with larger models costing more neurons per request and longer inputs and outputs costing more than shorter ones.

Approximate costs as of February 2026. Verify current pricing:

  • Small models (7-8B parameters): roughly $0.01-0.03 per 1,000 short completions
  • Medium models (13-30B parameters): roughly $0.04-0.06 per 1,000 completions
  • Large models (70B+ parameters): roughly $0.15-0.25 per 1,000 completions
  • Embedding models: roughly $0.001-0.003 per 1,000 requests

Compare to proprietary models (approximate, per 1K tokens):

  • GPT-4 Turbo: $0.01 input, $0.03 output
  • GPT-3.5 Turbo: $0.0005 input, $0.0015 output
  • Claude 3 Sonnet: $0.003 input, $0.015 output

A typical request with 500 input tokens and 200 output tokens costs roughly $0.011 for GPT-4 Turbo or $0.00003 for Workers AI (Llama 3 8B equivalent), making Workers AI 10-500x cheaper than proprietary models for comparable tasks. The question is whether the quality difference justifies the cost difference for your use case.

Cost breakpoints

Workers AI makes economic sense when quality is adequate and volume is high enough to matter. Below a thousand daily requests, the cost difference is pocket change, so choose based on quality and integration; above a million daily requests, the cost difference funds an engineer, so make sure Workers AI quality suffices.

Workers AI wins when request volumes exceed a few hundred daily, open-source model quality meets requirements, latency benefits provide user-facing value, and traffic variability makes serverless pricing attractive.

External providers win when you need GPT-4 or Claude quality (no Workers AI model matches them), volume is low enough that cost is irrelevant, or enterprise agreements include AI services at negotiated rates.

Self-hosted wins when sustained volume exceeds millions of daily requests with predictable traffic, you need custom or fine-tuned models, or data residency requirements preclude third-party inference.

Optimisation strategies

Right-size model selection provides the largest cost impact by using the smallest model that produces acceptable results. Classification tasks rarely need 70B parameter models; 7B models often suffice at 10x lower cost. Test and measure.

AI Gateway caching reduces costs for repetitive queries, with FAQ-style applications possibly seeing 50%+ cache hit rates and effectively halving AI costs. Enable caching and monitor hit rates, then adjust cache TTL based on how quickly your content changes.
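
A sketch of per-request cache control when calling through AI Gateway; the cf-aig-cache-ttl header reflects my reading of the AI Gateway documentation and should be verified, and the endpoint placeholders match the earlier gateway sketch:

```typescript
// Sketch: request-level cache TTL when proxying through AI Gateway.
// The header name and endpoint placeholders should be verified against current
// AI Gateway documentation; ACCOUNT_ID and GATEWAY_ID are placeholders.
const GATEWAY_BASE = "https://gateway.ai.cloudflare.com/v1/ACCOUNT_ID/GATEWAY_ID";

export async function cachedCompletion(prompt: string, apiKey: string): Promise<unknown> {
  const response = await fetch(`${GATEWAY_BASE}/openai/chat/completions`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
      // Ask the gateway to cache identical requests for one hour.
      "cf-aig-cache-ttl": "3600",
    },
    body: JSON.stringify({
      model: "gpt-4-turbo",
      messages: [{ role: "user", content: prompt }],
    }),
  });
  return response.json();
}
```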

Prompt engineering for efficiency matters at scale, since verbose prompts cost more tokens. A prompt that achieves the same result with half the tokens costs half as much, and for high-volume applications, the cumulative savings are significant.

Decision frameworks

Should you use Cloudflare's AI stack?

The decision tree is simpler than it appears:

Do you need frontier model quality (GPT-4, Claude)? Use those providers directly (potentially through AI Gateway for observability), as Workers AI cannot substitute for frontier models on quality-demanding tasks.

Is inference latency critical? If sub-200ms response time matters, Workers AI's regional inference helps; if latency is flexible, external providers work fine and you should choose based on quality, cost, and integration.

Already building on Cloudflare? Workers AI integrates naturally with zero additional vendors; if not, integration complexity is similar across providers and you should evaluate based on features and cost.

Value multi-provider flexibility? Route through AI Gateway from the start, as the optionality is worth the minimal latency overhead. If you're certain you'll never switch providers (unlikely), direct integration is marginally simpler.

The hybrid strategy

Most serious applications benefit from a hybrid approach, using Workers AI for high-volume, latency-sensitive, quality-tolerant tasks and frontier models for low-volume, quality-critical tasks.

A single application might use Workers AI embeddings for semantic search (high volume, quality adequate, latency matters), Workers AI classification for content categorisation (high volume, quality adequate), GPT-4 through AI Gateway for customer-facing complex responses (lower volume, quality critical), and Workers AI summarisation for internal tools (quality tolerance higher for internal use).
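
One way to express that split is a small router keyed by task type. This is a sketch under stated assumptions: the task categories and model choices are illustrative, and callExternalViaGateway is a hypothetical helper wrapping the gateway call shown earlier.

```typescript
// Sketch: route tasks by quality tolerance. Task categories, model IDs, and the
// callExternalViaGateway helper are illustrative, not prescribed.
declare function callExternalViaGateway(
  model: string,
  prompt: string,
  apiKey: string,
): Promise<string>;

type Task =
  | { kind: "classify"; text: string } // high volume, quality-tolerant
  | { kind: "summarise"; text: string } // high volume, quality-tolerant
  | { kind: "support-reply"; text: string }; // lower volume, quality-critical

export async function route(env: { AI: Ai }, task: Task, apiKey: string): Promise<string> {
  switch (task.kind) {
    case "classify":
    case "summarise": {
      // Quality-tolerant work stays on Workers AI for cost and latency.
      const result = (await env.AI.run("@cf/meta/llama-3-8b-instruct", {
        prompt: task.text,
      })) as { response: string };
      return result.response;
    }
    case "support-reply":
      // Quality-critical, customer-facing work goes to a frontier model via AI Gateway.
      return callExternalViaGateway("gpt-4-turbo", task.text, apiKey);
  }
}
```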

The key is identifying which tasks are quality-tolerant and which aren't; test rather than assume by running your actual prompts through both Workers AI and frontier models. If users can't tell the difference, Workers AI saves money and latency; if they can tell and it matters, use the better model.

AI Gateway makes hybrid strategies practical with unified logging across providers, consistent authentication patterns, and a single interface for cost tracking; without AI Gateway, hybrid strategies require maintaining multiple provider integrations and aggregating observability manually.

Latency budget planning

AI features compete for user attention spans, so if your total acceptable latency is 500ms and you need both retrieval (Vectorize) and generation (Workers AI), you must allocate that budget carefully.

Typical latency components are Edge Worker execution (1-5ms), Vectorize query (10-30ms), Workers AI inference for small models with short completions (50-150ms), Workers AI inference for large models with longer completions (200-500ms), and external provider inference (add 50-150ms network latency).

For a RAG application with a 500ms budget, you might allocate roughly 20ms for Vectorize retrieval, 100ms for Workers AI generation (8B model), and 10ms for Worker logic and response formatting, totalling about 130ms and well within budget.

For complex generation with the same budget, a larger model or longer output pushes inference to 300-400ms, leaving minimal headroom for retrieval and processing, so consider streaming responses to improve perceived latency.

If your latency budget is tight, design constraints from the start: use smaller models, limit output length, enable streaming, and implement aggressive caching, since latency optimisation is easier when designed in from the beginning.
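
A sketch of streaming a Workers AI completion to the client, which improves perceived latency even though total inference time is unchanged; the model ID is illustrative:

```typescript
// Sketch: stream tokens to the client as server-sent events instead of
// waiting for the full completion. The model ID is illustrative.
export default {
  async fetch(request: Request, env: { AI: Ai }): Promise<Response> {
    const { prompt } = (await request.json()) as { prompt: string };

    const stream = await env.AI.run("@cf/meta/llama-3-8b-instruct", {
      prompt,
      stream: true, // returns a ReadableStream of SSE chunks
    });

    return new Response(stream as ReadableStream, {
      headers: { "content-type": "text/event-stream" },
    });
  },
};
```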

Comparison to hyperscaler AI

Aspect                   | Cloudflare                 | AWS Bedrock                     | Azure OpenAI      | Google Vertex
-------------------------|----------------------------|---------------------------------|-------------------|---------------------
Inference location       | Regional GPU clusters      | Regional                        | Regional          | Regional
Latency for global users | Reduced (regional routing) | Standard regional               | Standard regional | Standard regional
Model variety            | Open-source models         | Extensive (Claude, Llama, more) | OpenAI models     | Gemini + open-source
Proprietary models       | No                         | Yes                             | Yes               | Yes
Fine-tuning              | Not supported              | Supported                       | Supported         | Supported
Managed RAG              | AI Search                  | Knowledge Bases, Kendra         | AI Search         | Vertex AI Search
Multi-provider proxy     | AI Gateway                 | Not native                      | Not native        | Not native
Integration depth        | Deep with Workers          | Deep with AWS                   | Deep with Azure   | Deep with GCP

Choose Cloudflare when latency and simplicity matter more than model selection; choose hyperscalers when you need specific models, fine-tuning, or deep platform integration; choose both through AI Gateway when you want Cloudflare's edge benefits with hyperscaler model access.

AWS Bedrock suits organisations deeply invested in AWS wanting managed access to multiple model providers including Anthropic and Meta. Azure OpenAI suits enterprises with Microsoft relationships needing GPT-4 with enterprise compliance. Google Vertex suits GCP-centric organisations wanting Gemini access and advanced ML pipeline tooling. Cloudflare suits organisations building on Workers who want AI features without AI operations, value latency reduction for global users, and want to avoid deep lock-in to any single AI provider.

Build vs buy for AI features

Before building AI features on any infrastructure, ask whether building is the right choice.

Many applications don't need custom AI infrastructure but instead need a chatbot, a search feature, or content classification, and purpose-built products often serve better than primitives. Intercom handles support chat, Algolia handles search with AI features, and various vendors handle content moderation.

Build on Cloudflare's AI stack when your AI feature is differentiated (not commodity), you need control over prompts and behaviour, you want to avoid per-seat SaaS pricing at scale, or the AI feature integrates tightly with your application logic.

Buy a purpose-built product when the feature is commodity (standard support chat, basic search), time-to-market matters more than customisation, the vendor's model quality and training data exceed what you'd achieve, and operational simplicity outweighs control.

Most AI features are less differentiated than their builders believe. If your "AI-powered search" would be adequately served by Algolia, building custom RAG is engineering vanity.

What comes next

This chapter established when AI on Cloudflare makes strategic sense. The following chapters make it concrete: Chapter 16 covers Workers AI and inference patterns, Chapter 17 addresses RAG applications and knowledge retrieval, and Chapter 18 explores AI agents and advanced orchestration. Together, they provide the technical depth to execute on the strategic choices outlined here.