
Chapter 3: Workers: The Core Compute Primitive

How do Workers actually work, and how do I use them effectively?


Chapter 1 introduced the V8 isolate model. This chapter makes it concrete: the request lifecycle, the hard constraints and their implications, and the patterns that separate effective Workers code from code that fights the platform.

Understanding Workers deeply matters because every other Cloudflare service builds on them. Durable Objects are Workers with persistent state. Workflows are Workers with durable execution. Queues deliver messages to Workers. Containers route through Workers. Master this primitive and you understand the foundation of everything else.

The execution model

Workers execute JavaScript in V8 isolates, the same sandboxing technology browsers use to isolate tabs from one another. This isn't a virtual machine or container; it's a lightweight sandbox within an existing runtime. Creating an isolate costs roughly a hundred times less than spawning a new process, and starts proportionally faster.

Single-threaded by design

Each request executes in a single thread without parallelism. You cannot parallelise computation within a request: no worker threads API, no shared memory between concurrent operations. For parallel processing, you make multiple subrequests executing in potentially different isolates, or use Promise.all for concurrent I/O operations.

This constraint enables sub-millisecond cold starts as a natural consequence of the design. A model requiring thread pools, shared memory synchronisation, or process-level isolation would sacrifice the startup performance that makes Workers viable for latency-sensitive applications. The single-threaded model isn't a limitation you work around; it's the architectural choice that makes everything else possible.

The design implication: compute-heavy operations benefiting from parallelism need decomposition across multiple Workers invocations or delegation to specialised services. Workers excel at orchestration and I/O coordination, not raw parallel computation.
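
A minimal sketch of that I/O coordination role, assuming hypothetical endpoint URLs: several subrequests are issued together with Promise.all, so the waiting overlaps even though execution stays single-threaded.

Concurrent I/O with Promise.all
export default {
  async fetch(request: Request): Promise<Response> {
    // The three subrequests are issued together; execution is single-threaded,
    // but the waiting overlaps, so wall time is roughly that of the slowest call.
    const [users, orders, prices] = await Promise.all([
      fetch("https://api.example.com/users").then((r) => r.json()),
      fetch("https://api.example.com/orders").then((r) => r.json()),
      fetch("https://api.example.com/prices").then((r) => r.json()),
    ]);
    return Response.json({ users, orders, prices });
  },
};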

Isolate lifetime and state

An isolate running your Worker may persist between requests as an optimisation for performance. This is how Cloudflare achieves near-instant response times for warm isolates. But you cannot rely on this persistence as a guarantee. Global variables set during one request may or may not exist during the next.

For correctness, treat every request as if it's the first to reach a fresh isolate. Persistence might not happen at all; when it does, it's a performance optimisation, not a guarantee you can rely on.

This reality shapes how you use global scope strategically. Expensive initialisation producing immutable results (parsing configuration, compiling regular expressions, establishing reusable structures) belongs in global scope where it might persist between requests. State that matters (user sessions, request counts, accumulated data) belongs in Durable Objects or external storage where persistence is guaranteed.

The pattern that works: use global variables for caches where loss is acceptable and reconstruction is cheap. A cache miss means a slightly slower response, not incorrect behaviour. If losing data would cause errors or require complex recovery logic, it doesn't belong in isolate memory.
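
A sketch of that pattern, assuming a hypothetical configuration endpoint: the cache lives in global scope, a warm isolate reuses it, and a fresh isolate simply rebuilds it.

Loss-tolerant caching in global scope
// Global scope: may survive between requests in a warm isolate, may not.
let configCache: Record<string, unknown> | null = null;

async function getConfig(): Promise<Record<string, unknown>> {
  if (configCache) return configCache; // warm isolate: reuse
  // Fresh isolate: rebuild. A miss costs latency, never correctness.
  const res = await fetch("https://config.example.com/settings.json");
  const data = (await res.json()) as Record<string, unknown>;
  configCache = data;
  return data;
}

export default {
  async fetch(): Promise<Response> {
    const config = await getConfig();
    return Response.json(config);
  },
};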

The request lifecycle

Workers respond to events through handler methods. The fetch handler processes HTTP requests; other handlers exist for scheduled events, queue messages, email, and additional triggers. The handler receives the incoming event, configured bindings through env, and an execution context providing waitUntil() for background work.

Execution ends when the handler returns and no waitUntil() promises remain pending. The platform doesn't keep isolates alive for orphaned promises. If you start an async operation without awaiting it or passing it to waitUntil(), completion isn't guaranteed; the isolate may shut down first.

This differs fundamentally from long-running server processes where orphaned promises eventually complete because the process continues indefinitely. In Workers, the execution boundary is explicit and definitive: return from the handler, and the clock starts ticking toward isolate shutdown.

Background work with waitUntil()

waitUntil() extends execution lifetime without blocking the response to the user. Pass it a promise and the response returns to the client immediately, while the Worker keeps running in the background until that promise settles.

Use waitUntil() for operations that don't affect the response and where clients shouldn't wait: analytics tracking, logging to external systems, cache warming after serving stale content, cleanup operations. The client gets their response quickly; background work completes afterward.
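
A minimal sketch, assuming a hypothetical analytics endpoint and payload: the response goes out immediately while the logging call is handed to waitUntil().

Deferring background work with waitUntil()
export default {
  async fetch(request: Request, env: unknown, ctx: ExecutionContext): Promise<Response> {
    // Hand the logging call to waitUntil(): the response goes out now,
    // and the isolate stays alive until the promise settles.
    ctx.waitUntil(
      fetch("https://analytics.example.com/events", {
        method: "POST",
        body: JSON.stringify({ path: new URL(request.url).pathname }),
      })
    );

    return new Response("ok");
  },
};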

Don't use waitUntil() for operations where failure must change the outcome to the user. If the background operation fails, you've already sent a success response. The client believes their action succeeded, but your system knows it didn't. If failure should change the response, await the operation directly rather than deferring it.

Important constraints: waitUntil() extends wall time but not CPU time limits. Errors in waitUntil() promises appear in logs but don't affect the already-sent response. You can call waitUntil() multiple times; all promises must settle before the isolate shuts down.

Resource constraints

Workers operate within hard constraints that shape application design. These aren't suggestions or soft limits that increase cost. They're boundaries that cause failures when exceeded.

Memory: 128 MB per isolate

Each isolate has 128 MB of memory total, encompassing the JavaScript heap, WebAssembly linear memory, and buffers. This memory is shared across concurrent requests handled by the same isolate, though in practice most isolates handle one request at a time.

Reframe the Constraint

128 MB sounds restrictive until you realise most web requests don't need to hold entire files in memory; they need to stream bytes from one place to another. The limit constrains buffering, not capability.

What fits comfortably: typical request/response handling, JSON parsing of documents up to a few megabytes, reasonable in-memory data structures, most API gateway patterns.

What doesn't fit within constraints: buffering large file uploads before processing (stream directly to R2), loading datasets for in-memory analysis (use D1 or process incrementally), high-resolution image manipulation without streaming (use R2's image transformations), large ML models (use Workers AI).

The key strategy is streaming. When handling large request or response bodies, pipe directly to destination rather than accumulating in memory. A Worker can proxy gigabytes through R2 while staying under the memory limit because it never holds more than a buffer's worth at once.
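
A sketch of the streaming approach, assuming an R2 binding named BUCKET and a key taken from the path: the upload body is handed to R2 as a stream rather than read into memory first.

Streaming an upload into R2
interface Env {
  BUCKET: R2Bucket;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    if (request.method !== "PUT" || request.body === null) {
      return new Response("expected a PUT with a body", { status: 400 });
    }
    const key = new URL(request.url).pathname.slice(1);

    // request.body is a ReadableStream; R2 consumes it chunk by chunk,
    // so the Worker never holds the full upload in its 128 MB of memory.
    await env.BUCKET.put(key, request.body);

    return new Response(`stored ${key}`, { status: 201 });
  },
};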

When you genuinely need more memory because the workload requires holding substantial data in memory simultaneously rather than streaming it through, Workers aren't the right tool for that workload. Containers offer up to 12 GB at the cost of longer cold starts. Recognise this constraint early rather than fighting it with overly complex streaming gymnastics.

CPU time: the billing boundary

CPU time is limited to 30 seconds on paid plans (extendable to 5 minutes). The free tier allows only 10 milliseconds, enough for simple request routing but not substantial computation.

Understanding what consumes CPU time matters profoundly for staying within limits and controlling costs. Active JavaScript execution, JSON serialisation, cryptographic operations, WebAssembly computation, and regular expression evaluation all consume CPU time. Waiting does not. A fetch() to an external API, a D1 query, a KV read: while your code awaits these operations, CPU time doesn't accumulate.

This distinction is fundamental to Workers economics and receives deeper treatment shortly. The 30-second limit applies to computation time, not elapsed time. A Worker making dozens of external calls over several seconds of wall time might consume only milliseconds of CPU time.

Subrequest limits

Workers on paid plans default to 10,000 subrequests per invocation, configurable up to 10 million through the limits.subrequests setting in your Wrangler configuration. This includes external fetches and binding operations. Calls to D1, KV, R2, and Durable Objects all count. Free plans allow 50 external subrequests and 1,000 subrequests to Cloudflare services.

wrangler.jsonc
{
  "limits": {
    "subrequests": 50000
  }
}

The configurability is significant. Before February 2026, Workers had a fixed ceiling of 1,000 subrequests with no way to raise it. That hard wall forced architectural workarounds for legitimate use cases: long-lived WebSocket connections on Durable Objects, extended Workflows, and fan-out patterns that needed to reach more than a thousand backends. The new model treats subrequests as a tuneable safety valve. The default of 10,000 accommodates most workloads comfortably, and raising it further is a configuration change rather than an architectural redesign.

You can also set a lower limit to protect against runaway code or unexpected costs:

wrangler.jsonc
{
  "limits": {
    "subrequests": 10,
    "cpu_ms": 1000
  }
}
Still Not Infinite

Exceeding your configured limit still fails abruptly with no graceful degradation. The difference is that you now control where that ceiling sits. Set it deliberately based on your workload's expected behaviour rather than accepting the default.

When constraints don't fit

Hard Constraints

Workers have two truly hard limits: 128 MB memory (not configurable) and 5 minutes maximum CPU time. These boundaries cause failures when exceeded and cannot be raised through configuration. A third constraint, subrequests per invocation, defaults to 10,000 but can be raised to 10 million through Wrangler configuration. Recognise which constraints are architectural and which are tuneable.

If your workload routinely requires more than 128 MB per request and streaming won't help because you genuinely need data in memory simultaneously, you need Containers.

If your workload requires more than 5 minutes of computation per invocation, use Containers for long-running processes or Queues to distribute work across many short invocations.

If you are exceeding the default subrequest limit, raise it through configuration before redesigning your architecture. The limit exists to prevent amplification attacks and runaway costs, but it is now tuneable rather than fixed. Set it to match your workload's legitimate needs and treat it as a safety net, not a design constraint.

These constraints are not arbitrary. The 128 MB limit allows rapid isolate creation. The CPU time limit prevents runaway computation from affecting other tenants. The subrequest limit prevents amplification attacks. The distinction is that memory and CPU limits are fundamental to the isolate model whilst the subrequest limit is a configurable guard rail. Working within the hard constraints means working with the platform's design; configuring the soft constraints means adapting the platform to your workload.

Subrequest security behaviour

When a Worker makes a subrequest using fetch(), the security treatment depends on the destination. Same-zone subrequests bypass your zone's security features; cross-zone subrequests do not. Understanding this distinction prevents architectural surprises.

A same-zone subrequest is a fetch() call to a hostname within the same Cloudflare zone where the Worker runs. These requests route directly to your origin, bypassing the zone's security stack: WAF rules, DDoS protection, Bot Management, and other security features do not evaluate same-zone subrequests. If you have a WAF rule blocking requests to a specific path, that rule applies to external visitors but not to your own Workers calling the same path.

Cross-zone subrequests tell a different story. When your Worker fetches a URL in a different Cloudflare zone, even one within the same account, the request passes through that target zone's full security stack. WAF rules, rate limiting, and access controls all apply as they would for any external request.

Why this behaviour exists

Cloudflare treats same-zone subrequests as trusted internal traffic. Your Worker already runs within your zone; allowing it to bypass security features for requests within that same zone avoids the circular problem of your own security rules blocking your own backend operations. The request originates from code you deployed, targeting infrastructure you control, within a zone you own.

This design choice has architectural implications. If your Worker orchestrates requests to endpoints within the same zone, those requests cannot be blocked by WAF rules you've configured. You must implement any security logic directly in the Worker code, since the request never passes through the security evaluation layer.

Identifying Worker subrequests in rules

Cloudflare provides the cf.worker.upstream_zone field in rule expressions. This field contains the name of the zone whose Worker made the subrequest, or is empty for direct visitor requests. You can use this field in rate limiting rules, transform rules, and other rule types to identify and handle Worker-originated traffic differently:

(cf.worker.upstream_zone != "" and cf.worker.upstream_zone != "example.com")

This expression matches subrequests from Workers in other zones, excluding both direct visitor traffic and same-zone Worker traffic. Use it when you need to apply rules specifically to cross-zone Worker subrequests.

Architectural implications

If you need security features to apply to internal service calls, you have several options. Implement the security logic within the Worker itself, checking conditions before making the fetch. Alternatively, structure your architecture so the call crosses zone boundaries, causing it to pass through the target zone's security stack. For Worker-to-Worker communication where both Workers are in the same zone, consider whether the security concern is genuine; traffic between your own services may not need the same scrutiny as external traffic.

Comparison with hyperscaler serverless

Cloudflare's automatic same-zone bypass represents a meaningful architectural difference from hyperscaler platforms. AWS, Azure, and GCP require explicit configuration to achieve similar internal service trust; Cloudflare provides it by default.

On AWS, Lambda functions calling API Gateway endpoints pass through WAF rules regardless of whether the caller and target are in the same account. A Lambda function invoking another service through API Gateway will have its requests evaluated against any WAF rules attached to that gateway. To achieve internal trust, you must either configure explicit WAF rule exceptions identifying Lambda traffic or use IAM authentication to bypass public endpoints entirely. Neither approach is automatic; both require deliberate architectural decisions.

Azure takes a similar stance. Application Gateway WAF evaluates all incoming traffic, including requests from Azure Functions within the same virtual network. Achieving internal bypass requires configuring custom WAF rules to allow traffic from specific IP ranges or service tags, or routing internal traffic to avoid the Application Gateway entirely. The platform defaults to evaluating all traffic; internal trust is opt-in.

GCP's Cloud Armor presents an interesting variation. Cloud Functions and Cloud Run expose default URLs that bypass Cloud Armor entirely because they don't route through the load balancer where Cloud Armor policies attach. This creates a security gap rather than a feature: attackers discovering these default URLs can bypass your WAF rules. The recommended configuration is to disable default URLs and configure ingress controls so all traffic routes through the load balancer. GCP's internal bypass is accidental and undesirable, requiring explicit work to close rather than explicit work to enable.

The pattern across hyperscalers is consistent: WAF evaluation is the default for all traffic, internal trust requires explicit configuration, and the burden is on architects to carve out exceptions for service-to-service communication. Cloudflare inverts this: same-zone traffic is trusted automatically, the security stack applies only to cross-zone and external traffic, and architects must explicitly add security logic if they want it applied to internal calls.

Neither model is inherently superior. Hyperscalers' default-evaluate approach catches security issues in internal traffic that might otherwise propagate. Cloudflare's default-trust approach reduces configuration overhead and eliminates the risk of WAF rules accidentally blocking backend operations. The right choice depends on your threat model: if you distrust your own services and want defence in depth against compromised internal components, hyperscaler defaults serve better. If you trust code you deployed to call other code you deployed, Cloudflare's model removes friction.

The Key Insight

Same-zone subrequests are trusted; cross-zone subrequests are not. This behaviour is consistent and predictable once understood, but it catches architects unaware when they expect WAF rules to apply universally.

CPU time versus wall time

The separation of CPU time from wall time isn't a billing detail. It's the architectural insight that determines whether Workers make economic sense for your workload.

The Economic Insight

Lambda charges you for waiting. Workers don't. For I/O-heavy workloads, this changes everything.

The economics of waiting

Consider a typical API endpoint authenticating a request, querying a database twice, calling an external service, and returning formatted results:

  • JWT validation: 2ms CPU time
  • D1 query 1: 5ms waiting, 3ms CPU for parsing
  • D1 query 2: 5ms waiting, 2ms CPU for parsing
  • External API call: 200ms waiting, 3ms CPU for processing
  • Response serialisation: 5ms CPU time
  • Total wall time: ~220ms
  • Total CPU time: ~15ms

On Lambda with 512 MB allocated, you pay for 220ms × 0.5 GB = 110 GB-milliseconds. The function sat idle for 93% of that time, but you paid for all of it.

On Workers, you pay for 15ms of CPU time. The 205ms of waiting was free.

For I/O-heavy workloads (most API endpoints, most web applications, most integration layers), this difference compounds dramatically. Workers charge for computation. Lambda charges for existence.

The inverse is also true. For compute-heavy workloads (data transformation, complex calculations, CPU-bound processing), billing models converge or even favour Lambda's larger memory allowances and longer time limits. Workers optimise for the common case of web workloads, not compute-intensive batch processing.

Designing for free I/O

The cost model rewards designs maximising I/O relative to computation.

Move computation to specialised services. Need image resizing? R2's image transformations run on Cloudflare's infrastructure, not yours. Need ML inference? Workers AI handles the heavy lifting. Don't reimplement computationally intensive operations when bindings provide them.

Prefer focused queries over local filtering. A database query returning exactly what you need consumes I/O time (free waiting) and minimal parsing (cheap computation). Fetching excessive data and filtering locally consumes the same I/O time plus significant computation for filtering. Push predicates to the data source.

Cache computed results aggressively. A KV read is I/O; the computation producing the cached value already happened. For values changing infrequently relative to read frequency, computing once and caching beats recomputing every request.
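
A sketch of that compute-once-and-cache pattern, assuming a KV binding named CACHE, an illustrative report-building function, and an arbitrary one-hour TTL:

Caching a computed result in KV
interface Env {
  CACHE: KVNamespace;
}

// Stand-in for computation you'd rather not repeat on every request.
async function buildReport(): Promise<string> {
  return JSON.stringify({ generatedAt: new Date().toISOString() });
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // A KV read is I/O: waiting is free, and a hit skips the computation entirely.
    const cached = await env.CACHE.get("daily-report");
    if (cached !== null) {
      return new Response(cached, { headers: { "content-type": "application/json" } });
    }

    const report = await buildReport();
    // Store with a TTL so the value refreshes periodically.
    await env.CACHE.put("daily-report", report, { expirationTtl: 3600 });
    return new Response(report, { headers: { "content-type": "application/json" } });
  },
};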

Use streaming for large payloads. Parsing a 10 MB JSON document consumes substantial CPU time. Streaming the same data through without parsing consumes almost none. When you don't need the full parsed structure (when you're proxying, storing, or passing data through), don't parse it.

Common CPU time sinks

Certain patterns consume more CPU time than developers expect.

Cost Gotchas

Repeated JSON parsing/serialisation, string concatenation in loops, and heavy validation libraries accumulate CPU time quickly. Parse JSON once at the boundary. For hot paths, profile before adding heavyweight validation frameworks.

Repeated JSON operations accumulate quickly. Parse once at the boundary; pass the object through your code. Don't serialise and deserialise repeatedly as data moves between functions.

String manipulation in loops scales poorly. Concatenation in loops, repeated substring extraction, and applying regular expressions to the same content multiple times can be problematic. Each operation is cheap, but thousands add up.

Validation libraries vary enormously in efficiency. Schema validation frameworks designed for developer ergonomics may do far more work than manual validation for simple cases. Measure whether library overhead is acceptable.

Node.js dependencies may include abstractions costing CPU time without providing value in Workers. Workers-optimised alternatives often exist for common tasks.

Profiling and measurement

The Cloudflare dashboard shows CPU time distribution across requests. For development, measure synchronous code paths with timing calls. For operations without I/O waits, wall time approximates CPU time.

If P95 CPU time significantly exceeds P50, investigate outlier requests. They often hit edge cases: regular expressions catastrophically backtracking on certain inputs, loops iterating far more than expected, or unusually large payloads requiring more parsing.

Worker placement

Workers run globally by default: your code executes in whichever Cloudflare location is closest to the user. For applications serving static content or performing computation not depending on external data, this is optimal. Users everywhere get low latency.

But many Workers make backend calls: queries to databases, API calls, access to services living in specific regions. A Worker running in Sydney making ten calls to a database in Virginia adds 200ms of round-trip latency per call. The Worker is close to the user but far from the data.

Cloudflare offers three approaches to placement: automatic Smart Placement that learns from traffic patterns, explicit placement hints that target specific infrastructure, and jurisdictional placement for compliance requirements.

Smart placement

Smart Placement analyses your Worker's traffic patterns, specifically where subrequests go. If a Worker consistently calls backends in a particular region, Smart Placement runs the Worker closer to those backends rather than closer to the user.

Enabling Smart Placement
{
  "placement": {
    "mode": "smart"
  }
}

The trade-off is explicit: users further from the backend region experience higher latency reaching the Worker, but the Worker experiences lower latency reaching the backend. For workloads making many backend calls, total latency decreases despite the longer initial hop.

Enable Smart Placement when your Worker makes multiple calls to backends in a specific region and those calls dominate total latency. An API endpoint making five database queries to a PostgreSQL instance in eu-west-1 benefits from running in Europe rather than wherever the user happens to be.

Don't enable Smart Placement for Workers primarily serving cached content, performing computation without external calls, or calling globally distributed services without a single location.

Explicit placement hints

When you know exactly where your backend infrastructure lives, explicit placement hints provide more precise control than Smart Placement's automatic analysis. You can target specific cloud regions or let Cloudflare probe your infrastructure directly.

For backends in AWS, GCP, or Azure, specify the region directly:

Placement hint targeting AWS region
{
  "placement": {
    "region": "aws:us-east-1"
  }
}

For infrastructure outside the major cloud providers, expose it to placement probes. Layer 4 probes check connectivity to a host and port:

Placement hint with host probe
{
  "placement": {
    "host": "my_database_host.com:5432"
  }
}

Layer 7 probes check HTTP connectivity:

Placement hint with hostname probe
{
  "placement": {
    "hostname": "my_api_server.com"
  }
}

Explicit hints work best for single-homed infrastructure with a known, fixed location. They are not suitable for anycast or multicast resources where the "location" varies by requester. For those scenarios, Smart Placement's automatic analysis remains the better choice.

The decision framework: if you know precisely where your backend lives and it won't move, use explicit hints. If you have multiple backends, backends that might move, or you're unsure of optimal placement, use Smart Placement and let Cloudflare figure it out.

Jurisdictional placement

Distinct from performance-oriented placement, jurisdictional restrictions constrain where Workers execute for compliance reasons. If regulations require certain data processing within specific geographic boundaries, jurisdictional settings ensure Workers handling that data run only in compliant locations.

This isn't optimisation; it's a constraint you accept when compliance requires it. Latency may increase for users outside permitted regions.

Service bindings and composition

Workers can call other Workers through service bindings, enabling modular architectures without the latency, reliability, and observability challenges of network calls between microservices.

Architectural Insight

Service bindings give you microservice modularity without network call latency. Decomposition becomes nearly free, changing how you think about service boundaries.

How service bindings work

A service binding connects one Worker to another through configuration. The calling Worker receives a binding looking like an object with methods; calls route to the target Worker. When both Workers run in the same location (typical since they're invoked in the same request path), calls complete in under a millisecond with no network traversal.

Typical latency comparison
Communication Type                 Latency
Service binding (collocated)       0.1-0.5ms per call
Service binding (non-collocated)   1-5ms per call
External HTTP (same region)        20-50ms per call
External HTTP (cross-region)       100-300ms per call

This differs fundamentally from microservices communicating over HTTP. Traditional service-to-service calls involve DNS resolution, connection establishment, serialisation, network transmission, deserialisation, and the reverse path. Service bindings skip nearly all of this: an in-process function invocation when colocation permits, or minimal-overhead internal routing when it doesn't.

Reliability characteristics differ too. HTTP calls can fail through network partitions, DNS failures, connection timeouts, TLS handshake failures. Service binding calls fail only if the target Worker itself fails. The networking layer causing most distributed systems headaches doesn't exist.

Fetch forwarding versus RPC

Service bindings support two communication styles. Fetch forwarding passes an entire Request to the target Worker, which processes it and returns a Response. RPC-style bindings expose typed methods the calling Worker invokes directly.

Fetch forwarding suits scenarios where you're proxying requests or the target Worker needs full request context: headers, method, body, URL. An authentication Worker examining cookies and setting response headers benefits from receiving the complete Request.

RPC suits scenarios calling specific functionality with specific parameters. A rate-limiting Worker needing a user ID and action name doesn't need full request context, just those two values, returning a boolean. RPC provides type safety, clearer interfaces, and avoids the ceremony of constructing Request objects for simple operations.

Use fetch forwarding when the target Worker processes requests as requests. Use RPC when the target Worker provides functionality as functions.
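
A sketch of the RPC style, reusing the rate-limiting example; the binding name, class name, and method signature are illustrative assumptions, and the target and caller shown together here would in practice be two separately deployed Workers.

RPC through a service binding
// Target Worker ("rate-limiter"): exposes functionality as functions.
import { WorkerEntrypoint } from "cloudflare:workers";

export class RateLimiter extends WorkerEntrypoint {
  async isAllowed(userId: string, action: string): Promise<boolean> {
    // Real logic might consult a Durable Object; trivially allow in this sketch.
    return userId.length > 0 && action.length > 0;
  }
}

// Calling Worker: assumes a service binding named RATE_LIMITER configured in
// wrangler.jsonc, e.g. "services": [{ "binding": "RATE_LIMITER",
// "service": "rate-limiter", "entrypoint": "RateLimiter" }].
interface Env {
  RATE_LIMITER: Service<RateLimiter>;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const allowed = await env.RATE_LIMITER.isAllowed("user-123", "create-post");
    return allowed
      ? new Response("ok")
      : new Response("rate limited", { status: 429 });
  },
};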

When to split Workers

Decomposing into multiple Workers involves trade-offs that near-zero latency doesn't eliminate.

Start with a single Worker unless you have a specific reason to split. Service bindings make splitting cheap but not free. Each call adds latency even if sub-millisecond, and each additional Worker adds deployment complexity, versioning concerns, and cognitive overhead.

Split when deployment independence matters. If one part changes frequently while another is stable, deploying separately avoids unnecessary risk to the stable component.

Split when resource profiles differ significantly. A CPU-intensive Worker benefits from different optimisation strategies than an I/O-heavy Worker.

Split when team boundaries align with service boundaries. Different teams owning different functionality benefit from separate Workers with clearer ownership and reduced coordination overhead.

Split when sharing functionality across multiple entry points. A common authentication Worker called by several API Workers is cleaner than duplicating authentication logic: one place to update security policies, one place to fix vulnerabilities.

Don't split because microservices are architecturally fashionable. A design requiring twenty service binding calls per request performs worse than a monolith even with sub-millisecond latency, and complexity cost is substantial. Measure before distributing.

Error handling in edge systems

Error handling in Workers differs from traditional servers in ways affecting both implementation and architecture.

Named failure modes

Edge systems exhibit failure patterns worth naming explicitly.

Orphaned background work. A waitUntil() promise fails after the response was sent. The client believes success; your system recorded failure. Data inconsistency results, difficult to diagnose because no error surfaced to the user.

Regional backend outage. An external API or database fails in one region but remains healthy elsewhere. Aggregate error rate is 5%, but São Paulo users see 80% failures while London users see none. Aggregate metrics hide severity for affected users.

Isolate state assumption. Code assumes a global variable persists between requests. Works in testing where the same isolate handles sequential requests, then fails in production when requests hit fresh isolates.

Subrequest exhaustion. A fan-out pattern works at normal load, but unusual requests trigger more subrequests than expected, exceeding the configured limit and failing abruptly. Even with configurable limits (up to 10 million on paid plans), setting an appropriate ceiling and monitoring actual usage prevents surprises.

Timeout cascade. A slow upstream exhausts wall time budget. Other healthy upstreams in the same request path never get called because time ran out waiting for the slow one.

Cold cache stampede. A cached value expires. Many concurrent requests miss simultaneously and all attempt to regenerate the value, overwhelming the backend.

Naming these patterns gives your team vocabulary for design discussions and post-incident analysis. "We hit orphaned background work" communicates more precisely than "something went wrong with the async stuff."

The ephemeral context

Workers execute in ephemeral isolates. You cannot maintain in-memory circuit breaker state across requests since each request might execute in a different isolate with no knowledge of previous failures.

For circuit breaker behaviour, state must live outside the isolate. Durable Objects can track failure counts and provide consistent circuit breaker decisions. KV can store failure state with some eventual consistency tolerance. The pattern changes from "maintain state in memory" to "coordinate through external storage."
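
A sketch of that coordination-through-storage approach using KV; the binding name, threshold, TTL, and upstream URL are illustrative assumptions. KV's eventual consistency makes the breaker approximate; a Durable Object gives strongly consistent counts where that matters.

Circuit-breaker state outside the isolate
interface Env {
  BREAKER: KVNamespace;
}

const FAILURE_THRESHOLD = 5;

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // The failure count lives in KV, not isolate memory, so any isolate sees it.
    const failures = Number((await env.BREAKER.get("upstream-failures")) ?? "0");
    if (failures >= FAILURE_THRESHOLD) {
      return new Response("upstream temporarily unavailable", { status: 503 });
    }

    try {
      const upstream = await fetch("https://api.example.com/data");
      if (!upstream.ok) throw new Error(`upstream returned ${upstream.status}`);
      return upstream;
    } catch {
      // Record the failure; the TTL lets the breaker reset itself.
      await env.BREAKER.put("upstream-failures", String(failures + 1), {
        expirationTtl: 60,
      });
      return new Response("upstream error", { status: 502 });
    }
  },
};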

Global distribution implications

The same Worker code runs in over 300 locations. A regional backend outage affects requests routed to that region but not requests elsewhere. An external API having problems in Asia might cause failures for Asian users while European users experience normal operation.

Error rates can be geographically localised in ways aggregate metrics obscure. A 2% global error rate might represent 50% failures in one region and 0% everywhere else. Alerting and investigation should account for this distribution.

Retry economics

The CPU time billing model affects retry economics. A failed request waiting 2 seconds for a timeout cost almost nothing since you paid only for minimal computation around the failed call. Retrying is cheap from a billing perspective.

This doesn't mean retry everything. Only retry idempotent operations, use exponential backoff to avoid thundering herds, set maximum retry counts to prevent infinite loops. But the concern that retries double costs doesn't apply the way it would with wall-time billing.
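
A minimal retry helper along those lines; the attempt count and base delay are arbitrary illustrative choices, and only idempotent requests should go through it.

Retrying idempotent subrequests with backoff
async function fetchWithRetry(
  url: string,
  attempts = 3,
  baseDelayMs = 200
): Promise<Response> {
  let lastError: unknown;
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      const res = await fetch(url);
      if (res.ok) return res;
      lastError = new Error(`status ${res.status}`);
    } catch (err) {
      lastError = err;
    }
    // The backoff is wall time, not CPU time, so waiting here costs almost nothing.
    await new Promise((resolve) => setTimeout(resolve, baseDelayMs * 2 ** attempt));
  }
  throw lastError;
}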

Timeout decisions

Workers don't impose timeouts on subrequests by default. A slow upstream can consume your entire wall time budget waiting. Implement explicit timeouts using AbortController, setting limits appropriate to SLA requirements and expected upstream performance.

Five seconds is reasonable for most external API calls. Database queries through Hyperdrive might tolerate longer; calls to services with strict latency SLAs might require shorter. Choose timeouts reflecting "how long am I willing to wait before giving up" rather than "how long does this usually take."
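
A sketch of an explicit subrequest timeout with AbortController, using an illustrative five-second budget; the caller catches the abort error and maps it to whatever response fits its error strategy.

Bounding a slow upstream with AbortController
async function fetchWithTimeout(url: string, timeoutMs = 5000): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    // The subrequest is abandoned if the upstream hasn't responded in time.
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}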

Error response strategy

Handle errors at the handler level to return controlled responses. Unhandled exceptions produce generic 500 responses with no useful information. Wrap handler logic, catch exceptions, log with sufficient context for investigation, return responses helping clients understand what happened without leaking implementation details.

The categories that matter: client errors (4xx) for invalid requests the client can fix, server errors (5xx) for problems the client cannot fix. A missing resource is 404, not 500; a malformed request is 400, not 404. Precise status codes help clients respond appropriately and help you categorise errors in monitoring.
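
A sketch of handler-level error handling along these lines; the error class, routing stub, and logging shape are illustrative assumptions.

Handler-level error responses
class NotFoundError extends Error {}

async function handle(request: Request): Promise<Response> {
  // Application logic goes here; throw NotFoundError for missing resources.
  throw new NotFoundError("no such resource");
}

export default {
  async fetch(request: Request): Promise<Response> {
    try {
      return await handle(request);
    } catch (err) {
      if (err instanceof NotFoundError) {
        return Response.json({ error: "not found" }, { status: 404 });
      }
      // Log enough context to investigate; don't leak internals to the client.
      console.error("unhandled error", new URL(request.url).pathname, err);
      return Response.json({ error: "internal error" }, { status: 500 });
    }
  },
};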

Language support

Workers execute JavaScript natively, with additional options for teams with different requirements or existing codebases.

JavaScript and TypeScript

JavaScript executes directly in the V8 runtime with access to modern language features and standard Web APIs. TypeScript compiles to JavaScript through Cloudflare's tooling with no additional configuration. For most new projects, TypeScript provides type safety benefits without meaningful downsides.

You have access to Web APIs you'd use in browsers: fetch, Request, Response, Headers, URL, crypto, TextEncoder, TextDecoder, streams. Code written against these APIs is portable across Workers, browsers, Deno, and other Web-API-compatible environments.

Node.js compatibility

Workers aren't Node.js, but a compatibility layer provides access to many Node.js APIs. Enable through configuration for Buffer, crypto module mapping to Web Crypto, util, events, stream, and many commonly-used APIs.

What doesn't work: filesystem access (no filesystem), native modules compiled from C++ (no binary addon support), process-level APIs like process.exit() (no process to exit). The compatibility layer enables many npm packages to work without modification, but packages with native dependencies fail.

The decision is straightforward: use Node.js compatibility when migrating existing code or using npm packages requiring Node.js APIs. For new code, prefer Web APIs directly: standard, portable, and without compatibility layer overhead.

WebAssembly

WebAssembly allows running code compiled from Rust, C, C++, Go, and other languages: computationally intensive algorithms where JavaScript performance isn't adequate, existing compiled libraries you can't or won't rewrite, and code sharing across WASM-compatible environments.

WASM executes within the same constraints as JavaScript: 128 MB memory limit and CPU time limits apply identically. WebAssembly provides access to efficient compiled code, not resource limit evasion.

Choose WebAssembly when you have existing compiled code prohibitively expensive to rewrite, or when profiling shows JavaScript performance is genuinely inadequate. Don't choose WebAssembly because it seems faster; for I/O-heavy workloads, execution speed is rarely the bottleneck.

Python Workers

Python Workers execute Python code at the edge through Pyodide, a Python runtime compiled to WebAssembly. This isn't a compatibility layer or transpilation. It's actual Python executing in the Workers runtime with access to the standard library and a growing ecosystem of packages.

How Python Workers execute

When you deploy a Python Worker, Cloudflare uploads your Python code and packages specified in pyproject.toml. The runtime creates a V8 isolate, injects Pyodide, scans your code for imports, executes them, then captures a memory snapshot of this initialised state.

This snapshot is key to Python Workers' performance. Cold starts in Python would normally require loading the runtime, importing packages, and executing top-level code on every new isolate. With memory snapshots, expensive initialisation happens once at deploy time. Subsequent cold starts load the snapshot directly, bypassing import cost.

Python Workers achieve cold start times competitive with other platforms. Cloudflare's benchmarks show cold starts averaging around one second for Workers importing httpx, FastAPI, and Pydantic, compared to 2.5 seconds for Lambda and 3 seconds for Cloud Run. For warm requests, execution is fast regardless of language.

Package management with pywrangler

Python package management uses pywrangler, a CLI tool wrapping Wrangler and handling Python-specific concerns:

pyproject.toml
[project]
name = "my-worker"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "fastapi",
    "httpx",
    "pydantic"
]

[dependency-groups]
dev = [
    "workers-py",
    "workers-runtime-sdk"
]

Deploy with uv run pywrangler deploy. The tool bundles dependencies automatically, resolving versions and ensuring pure-Python packages are included.

Package support covers pure Python packages from PyPI and packages included in Pyodide. Packages with C extensions not pre-compiled for Pyodide don't work; you can't pip install arbitrary native code. The practical impact is smaller than it sounds: FastAPI, Pydantic, httpx, langchain, and many common packages work. NumPy, pandas, and other scientific computing packages with C extensions are included in Pyodide's pre-built set.

Accessing bindings from Python

Python Workers access Cloudflare services through the same binding model as JavaScript Workers, with Python-native syntax:

src/index.py
from workers import WorkerEntrypoint, Response

class Worker(WorkerEntrypoint):
    async def fetch(self, request):
        # Access KV
        value = await self.env.MY_KV.get("key")

        # Access D1 (user_id is a placeholder; real code would take it from the request)
        user_id = 1
        result = await self.env.DB.prepare(
            "SELECT * FROM users WHERE id = ?"
        ).bind(user_id).first()

        # Access R2
        obj = await self.env.BUCKET.get("file.txt")

        return Response.json({"value": value, "user": result})

The env object is a JavaScript object accessed through Pyodide's foreign function interface. When you call methods on bindings, you're invoking JavaScript APIs from Python. The JsProxy mechanism handles type conversion automatically.

Cron triggers, queue consumers, and other handler types work identically. Define the appropriate method on your WorkerEntrypoint class; the runtime invokes it when the trigger fires.

When to choose Python

Python Workers make sense when your team's expertise is Python and rewriting in JavaScript would slow development significantly. They're appropriate for I/O-heavy workloads where Python's execution speed isn't the bottleneck: API orchestration, data transformation, AI inference coordination.

Language Choice Decision

Choose Python Workers when team expertise in Python outweighs JavaScript performance benefits. Cold starts average ~1 second (vs. sub-5ms for JavaScript) but remain competitive with Lambda. Best for I/O-heavy workloads where network latency dominates execution time.

Python Workers are inappropriate when raw performance matters (JavaScript and WebAssembly execute faster for CPU-intensive computation) or when you need packages with native extensions not in Pyodide's pre-built set.

The mental model is simple: Python Workers let Python developers build on Cloudflare without learning JavaScript. They're not faster or more capable than JavaScript Workers; they're an alternative for teams where Python is the better choice for human reasons.

Comparing to hyperscaler serverless

Coming from Lambda, Azure Functions, or Cloud Functions, certain differences affect how you design and operate applications.

Cold starts: a category that disappears

Lambda cold starts range from 100ms for small Node.js functions to several seconds for JVM-based functions or those in VPCs. An entire ecosystem of workarounds exists: provisioned concurrency, warming pings, architectural patterns routing initial requests differently.

Workers cold starts are under 5ms; requests served by an already-warm isolate see under 1ms of startup overhead. The patterns designed to mitigate Lambda cold starts simply don't apply. This isn't a small improvement. It's an entire category of complexity that doesn't exist.

This affects application design. On Lambda, you might accept higher latency for first requests and design user experience around it. On Workers, there's no "first request penalty" worth designing around. Every request gets consistent latency.

Memory: a hard boundary

Lambda offers up to 10 GB of memory per function. Workers offer 128 MB. This is a hard boundary, not a tuning parameter.

The response isn't to avoid Workers but to understand which workloads fit each model. Request/response handling, API gateways, edge logic, and coordination tasks fit Workers' memory model. Data transformation, large file processing, and memory-intensive computation fit Lambda or Workers Containers.

The global model

Lambda functions deploy to regions. You choose a region, your function runs there, users far from that region experience latency. Multi-region deployment requires explicit configuration, additional infrastructure, and careful data synchronisation.

Workers deploy globally by default. You don't choose regions; your code runs everywhere. Users close to any Cloudflare location get low latency. The operational complexity of multi-region deployment doesn't exist because deployment is inherently global.

On Lambda, going global is a project. On Workers, you're global from the first deployment. Questions shift from "should we deploy to additional regions" to "are there reasons to constrain where we run."

Aspect             Workers               Lambda
Cold start         Under 5ms             100ms-3s+
Maximum memory     128 MB                10 GB
Maximum CPU time   5 minutes             15 minutes
Deployment scope   Global (automatic)    Regional
Billing model      CPU time              GB-seconds (wall time)
Multi-region       Default               Additional complexity
The Key Difference

Workers optimise for the common case of web workloads (I/O-heavy, latency-sensitive). Lambda optimises for flexibility (arbitrary memory, longer execution, regional control).

What comes next

This chapter covered Workers as the foundational compute primitive: how isolates execute, what constraints apply, how the billing model rewards I/O-heavy designs, and how service bindings enable composition.

Chapter 4 addresses full-stack applications. Workers as API handlers is the simple case; Workers serving static assets, rendering pages on the server, and integrating with frontend frameworks involves additional patterns.

Chapter 5 covers local development and testing. The execution model differs enough from traditional servers that testing requires specific approaches.

Chapter 6 introduces Durable Objects, the stateful coordination primitive enabling patterns impossible with Workers alone: real-time collaboration, distributed coordination, and strongly consistent state.

The compute model you've learned here underlies everything else. Constraints shape what's possible, the billing model shapes what's economical, global distribution shapes how you think about geography. Understanding this primitive deeply is understanding the foundation of the entire platform.