Chapter 7: Workflows: Durable Execution
How do I build processes that must complete even if individual steps fail?
Most Workers follow a simple pattern: handle a request, return a response, complete in milliseconds or seconds, and retry on failure. But some processes don't fit this model at all. Order fulfilment spans payment capture, inventory reservation, shipping label generation, and email notification; document approval waits days for human review; data synchronisation processes millions of records over hours. These long-running, multi-step processes must survive infrastructure failures, code deployments, and network outages while continuing to completion even when individual steps fail.
Cloudflare built Workflows because some processes are too important to fail and too long to hold in memory; it provides durable execution without requiring you to build orchestration infrastructure yourself.
What "durable" actually means
The term "durable" gets used loosely in distributed systems, but for Workflows it means something specific: after each step completes, its result persists to storage before the next step begins. If the system fails, the workflow resumes from the last completed step rather than restarting from the beginning.
This differs from retry logic. A retry re-executes the entire operation. Durable execution checkpoints progress and resumes from where it left off. A workflow that completed steps 1 through 7 before failing resumes at step 8, not step 1.
Durable execution has over a decade of production history across systems like Amazon SWF, Step Functions, Uber's Cadence, Temporal, and Azure Durable Functions. Cloudflare Workflows applies these proven patterns to edge computing with characteristic simplicity: you write TypeScript, each instance is isolated, and the platform handles replay, checkpointing, and step persistence.
The infrastructure beneath
Each workflow instance runs on its own SQLite-backed Durable Object, and understanding this architecture isn't merely academic; it explains Workflows' behaviour and constraints in practical terms.
The durability guarantees from Chapter 6 apply directly. Writes replicate to multiple data centres before external effects occur. Single-threaded execution means steps within an instance never run concurrently. The 1 MiB storage limit per step result traces directly to Durable Object storage characteristics. Understanding that a workflow instance is a Durable Object with orchestration logic layered on top makes constraints predictable rather than arbitrary.
This architecture explains scaling characteristics too. Durable Objects handle millions of instances across Cloudflare's network; Workflows inherits that capacity. Each workflow instance is independent, placed near where it's invoked, and costs nothing while idle. You can run 100,000 pending approval workflows, each waiting on external events, without capacity planning. The underlying Durable Objects hibernate until needed.
You cannot access the underlying Durable Object directly, as Workflows provides an abstraction that handles replay logic and step isolation. If you need direct Durable Object access (custom storage patterns, WebSocket connections, real-time state queries), use Durable Objects rather than Workflows.
The cost of durability
Durability isn't free; every checkpoint is a write, and every step carries overhead. A workflow with 50 steps processing 10,000 orders daily writes 500,000 checkpoints, meaning you're paying for orchestration machinery on every step.
Workflows pricing involves per-step charges plus CPU time. A 50-step workflow at current rates costs roughly $0.0003 per execution. At 10,000 daily executions, that's approximately $90 per month in orchestration overhead alone, before accounting for compute within each step. Processing 10,000 independent items through a Queue costs a fraction of that, with no step checkpointing, no replay machinery, and no orchestration state.
This overhead is justified when failure is expensive. An order that captures payment but fails to reserve inventory creates a customer service nightmare. A document approval that loses its place after three of five approvers signed wastes days of human effort. Checkpoint cost is negligible compared to recovery cost.
Not every background process needs this machinery, however. Processing 10,000 uploaded images where each is independent works better with a Queue and idempotent consumers, which is simpler, cheaper, and faster since the images don't depend on each other and you'd otherwise pay for orchestration you don't use.
Use Workflows when steps depend on each other, when failure mid-process creates inconsistent state, or when you need visibility into where a process is stuck. Workflows earns its overhead in these scenarios. For independent, idempotent tasks, simpler patterns suffice.
Exactly-once step semantics
Each step executes exactly once successfully, regardless of retries or failures. If step 3 succeeds and the system fails before step 4 begins, step 3 won't re-execute when the workflow resumes because its result was persisted; Workflows simply retrieves it from storage and continues.
A step may execute multiple attempts if retries occur, but only one attempt succeeds and persists, so you can design for retry while trusting that success is singular.
Any work before the step function returns might happen multiple times if the step fails and retries, while any work after the successful return happens exactly once. The step boundary serves as your commit point.
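A minimal sketch makes the commit point concrete; paymentApi here is a stand-in for whichever client you actually use:
const chargeId = await step.do("capture-payment", async () => {
  // Everything in here may run more than once if the step retries
  const charge = await paymentApi.capture(order.paymentMethodId, order.total);
  return charge.id; // Persisted exactly once when the step succeeds
});
// Code out here sees the persisted chargeId; the capture never re-executes on replay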
The constraints that shape design
Before examining step types, you need to understand the constraints governing workflow architecture, as these aren't limitations to work around but boundaries that actually define good design.
The 1 MiB barrier
Each step can return at most 1 MiB of data. Exceed this and the step fails hard. This isn't a soft limit you can configure around. It's fundamental to how workflow state persists in the underlying Durable Object.
This constraint shapes everything about workflow design. If your natural inclination is to pass rich objects between steps, you'll hit this wall quickly, so instead store large data externally (R2, KV, or D1) and pass references. A step processing a large dataset returns { resultPath: "results/workflow-123" } rather than the dataset itself, and later steps fetch from storage using the path.
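A sketch of that reference-passing shape, assuming an R2 bucket bound as REPORTS_BUCKET and a hypothetical buildLargeReport helper:
const reportKey = await step.do("compute-report", async () => {
  const report = await buildLargeReport(input);
  const key = `reports/${instanceId}.json`;
  await env.REPORTS_BUCKET.put(key, JSON.stringify(report));
  return key; // a short string, comfortably under 1 MiB
});
// Later steps fetch the report from R2 using reportKey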
Design for this from the start. Discovering the constraint after building means restructuring your entire workflow. Small step results also improve observability: you can inspect workflow state in the dashboard without wading through megabytes of payload.
The 128 MB runtime memory limit
The 1 MiB barrier governs what you can persist between steps, while a separate constraint governs what you can hold in memory during a step: Workflows share Workers' 128 MB memory limit. This distinction trips up developers who assume larger persisted state limits mean larger runtime capacity.
Persisted state limits (1 MiB per step, 1 GB total) describe what survives between steps. The 128 MB memory limit describes what you can hold in memory while a step executes. These are independent constraints serving different purposes.
A step processing a 500 MB file cannot buffer it in memory regardless of how little state it persists, because the step crashes before completing and never reaches the point where persisted state limits matter. The 1 GB total persisted state limit doesn't help here because memory exhaustion happens during execution, not during persistence.
The streaming pattern solves this. Instead of fetching a file into memory, transforming it, and writing the result, stream data directly from source to destination:
const processedPath = await step.do("process-large-file", async () => {
  const sourceObject = await env.SOURCE_BUCKET.get(inputPath);
  if (!sourceObject) throw new Error("Source file not found");
  // Stream through a transform without buffering the entire file
  const transformedStream = sourceObject.body
    .pipeThrough(new TransformStream({
      transform(chunk, controller) {
        // Process chunk by chunk, never holding the full file
        controller.enqueue(processChunk(chunk));
      }
    }));
  const outputPath = `processed/${instanceId}/${filename}`;
  await env.DEST_BUCKET.put(outputPath, transformedStream);
  return outputPath; // Return reference, not data
});
The file flows from R2 through the transform and back to R2 without the Worker holding more than one chunk at a time, so a 10 GB file processes successfully because memory usage stays bounded regardless of file size.
For workloads requiring analysis of entire files (not just streaming transforms), chunk the work across multiple steps where each step processes a portion, persists intermediate results to R2, and returns a reference so the next step can pick up where the previous left off. This pattern trades step overhead for memory safety.
Determinism requirements
Workflows may replay, re-executing your workflow code to reconstruct state after failures. Non-deterministic operations outside steps cause problems. If you generate a random number outside a step and use it to decide whether to execute a subsequent step, replay generates a different number and makes a different decision.
Wrap anything that might differ between executions in a step. Timestamps, random numbers, external reads: if the value changes, checkpoint it. The step's result persists; replays use the persisted value rather than re-computing.
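For example (the branch itself is arbitrary; the point is where the value is generated):
// Risky outside a step: replay generates a different value and may branch differently
// const offerDiscount = Math.random() < 0.1;

// Safe: checkpoint non-deterministic values so replays reuse the persisted result
const offerDiscount = await step.do("decide-discount", async () => Math.random() < 0.1);
const startedAt = await step.do("record-start-time", async () => Date.now());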
Idempotency for external calls
Steps may retry, and if a step calls an external API, that API might receive the same request multiple times. Payment processors, email services, and inventory systems must handle duplicate requests gracefully.
This discipline isn't Workflows-specific; any distributed system needs it, but Workflows makes it explicit and unavoidable. Every step modifying external state needs an idempotency strategy. Most payment processors accept idempotency keys, where a retry with the same key returns the original result instead of charging again. For services without native idempotency support, implement your own by checking a database before acting, recording actions immediately, and skipping on retry if already recorded.
Step types and when to use each
Workflows provides four step types, and understanding when to use each matters far more than memorising their syntax.
step.do(): execute and persist
The fundamental step wraps code execution with durability, persisting the result before the workflow continues; if the step fails, Workflows retries according to configuration.
const paymentResult = await step.do("capture-payment", async () => {
  return await paymentService.capture(order.paymentMethodId, order.total);
});
The string identifier serves three purposes: observability in dashboards, debugging in logs, and replay identification. Step names must be deterministic. Don't include timestamps or random values; replay will fail to match steps with their persisted results.
step.sleep() and step.sleepUntil(): pause without cost
These steps pause execution for a duration or until a specific time, and unlike a delay in a regular Worker (which counts against wall time and keeps resources allocated), a sleeping workflow consumes nothing. The workflow hibernates, the underlying Durable Object goes idle, and Cloudflare wakes it when the sleep completes.
This fundamentally changes what's economically viable. On AWS, a Step Functions workflow waiting 7 days either uses Express Workflows with polling (compute costs) or Standard Workflows with state transition costs, and Lambda-based polling adds even more expense. The Cloudflare model makes "wait for arbitrary duration" a first-class operation with zero marginal cost during the wait.
Patterns that are expensive elsewhere become trivial. Week-long approval flows cost nothing while waiting; reminder sequences firing emails at day 3, day 7, and day 14 hibernate between sends; subscription renewals scheduled 30 days out don't require cron jobs because the workflow resumes when the time comes. A single line, await step.sleep("reminder-delay", "7 days"), replaces infrastructure that traditionally required scheduled tasks, state tracking, and careful coordination.
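As a sketch, the whole reminder sequence reduces to a handful of lines; sendEmail stands in for whatever delivery mechanism you use:
await step.do("send-welcome", () => sendEmail(user, "welcome"));
await step.sleep("wait-until-day-3", "3 days");
await step.do("send-day-3-reminder", () => sendEmail(user, "reminder-1"));
await step.sleep("wait-until-day-7", "4 days");
await step.do("send-day-7-reminder", () => sendEmail(user, "reminder-2"));
await step.sleep("wait-until-day-14", "7 days");
await step.do("send-day-14-reminder", () => sendEmail(user, "reminder-3"));
The instance consumes nothing during the sleeps, and because each send is its own step, a failed email retries without re-sending earlier ones.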
step.waitForEvent(): pause for external signals
This step type enables workflows that wait for external input such as human approvals, webhook callbacks, file uploads, or third-party notifications. The workflow pauses until a matching event arrives or a timeout expires.
const approval = await step.waitForEvent("manager-approval", {
  type: "approval-decision",
  timeout: "48 hours"
});
Before waitForEvent, workflows requiring external input needed cumbersome patterns where a Worker receives a webhook, looks up the corresponding workflow instance, correlates state, and sends a message to continue execution. You maintained the state machine while the platform just ran code. With waitForEvent, the workflow is the state machine: it pauses, the event arrives via HTTP API or from another Worker, and execution resumes with the event payload.
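On the sending side, resuming a waiting workflow is a single call on the instance stub. A sketch from the Worker that receives the approval webhook, assuming a Workflow binding named APPROVAL_WORKFLOW:
const instance = await env.APPROVAL_WORKFLOW.get(workflowInstanceId);
await instance.sendEvent({
  type: "approval-decision", // must match the type the waiting step expects
  payload: { approved: true, approver: "manager@example.com" }
});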
Understand the failure modes. If the event-sending system fails, the workflow waits until its timeout expires. If duplicate events arrive, the first matching event resumes execution; later duplicates are ignored or produce an error depending on the workflow's state. If an event arrives before the workflow reaches waitForEvent, it is buffered only briefly, so designing around early arrival requires careful coordination.
Always set timeouts on waitForEvent. "Wait forever" is rarely correct; it produces workflows that silently hang when external systems fail. A 48-hour timeout with escalation logic beats an infinite wait requiring manual intervention to discover.
The step granularity decision
One fundamental design question is how much work belongs in a single step, and this decision has very real consequences for your workflow's efficiency and resilience.
Too granular creates overhead: a workflow with 100 steps for a simple operation writes 100 checkpoints, meaning you're paying for durability granularity you don't need.
Too coarse bundles operations that should fail independently: a single step making five API calls retries all five if the fifth fails, and if any call has side effects, you risk duplicates.
The principle is simple: one step per side effect. Bundle reads freely since failure just means re-reading, but isolate writes ruthlessly. Multiple queries gathering data can share a step, but the moment you write, send, or charge, you need a step boundary.
What about operations that must succeed or fail atomically? If three services must all update or none should, you face a design choice. One step means handling partial failure yourself by calling compensating actions within the step if service 3 fails after 1 and 2 succeeded. Three steps means step 2's success is permanent, and if step 3 fails, you compensate at the workflow level. Neither approach is wrong; the choice depends on whether external services support transactions (rare) or you need saga-style compensation (common).
Parallel steps with Promise.all and Promise.race
Steps can run concurrently using Promise.all, allowing independent operations that don't depend on each other's results to execute in parallel and reduce total workflow duration:
const [userData, orderHistory, recommendations] = await Promise.all([
  step.do("fetch-user", () => fetchUser(userId)),
  step.do("fetch-orders", () => fetchOrders(userId)),
  step.do("fetch-recommendations", () => fetchRecommendations(userId))
]);
Promise.race and Promise.any require more care because the workflow engine may restart between steps and steps are cached by name. If you use Promise.race with steps directly, the cached result after a restart might differ from the original race winner, causing subtle bugs.
Wrap Promise.race or Promise.any calls in a containing step.do to ensure consistent caching across workflow restarts. Without the wrapper, the race result may change if the workflow hibernates and resumes.
// Correct: wrap the race in a step for consistent caching
const winner = await step.do("race-for-response", async () => {
  return await Promise.race([
    step.do("fast-source", () => fetchFromFastSource()),
    step.do("slow-source", () => fetchFromSlowSource())
  ]);
});
The outer step persists the race result, so on replay, the workflow retrieves the persisted winner rather than re-racing, ensuring deterministic behaviour regardless of which source happens to respond first on a given execution.
What goes wrong: non-idempotent steps
A team builds an order workflow where the email notification step calls their email service, which sends successfully. But then the step fails, perhaps because a subsequent line throws or a timeout occurs after the send but before acknowledgment. The workflow retries, the email sends again, and the customer receives two identical order confirmations.
Worse still: a payment capture step charges the card, then fails on response parsing, so retry charges again and the customer pays twice.
The root cause is treating steps as atomic when they're not. A step making an external call and then doing more work can fail after the external effect but before completion; the effect happened, and retry makes it happen again.
The fix is idempotency keys: unique identifiers passed to external services that prevent duplicate operations. Generate the key deterministically from workflow instance ID and step name, pass it with every external call, and on retry, the external service recognises the key and returns the original result.
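A sketch of the pattern; paymentApi and its idempotencyKey field are illustrative rather than any specific provider's SDK:
const charge = await step.do("capture-payment", async () => {
  // Same instance + same step name = same key on every retry attempt
  const idempotencyKey = `${instanceId}:capture-payment`;
  return await paymentApi.capture({
    amount: order.total,
    currency: order.currency,
    idempotencyKey // duplicate requests return the original charge
  });
});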
Every step that modifies external state needs an idempotency strategy. This isn't paranoia; retries are normal operation in durable execution systems. If your step can't safely run twice, it's not production-ready.
Failure handling
Steps fail, networks time out, and services return errors, so understanding how Workflows handles failure separates robust workflows from fragile ones.
Retry behaviour
When a step throws an exception or times out, Workflows catches the error, records it, and decides whether to retry using a default strategy of 5 attempts with exponential backoff: first retry after roughly 1 second, second after 2 seconds, doubling with jitter to prevent thundering herds.
Step code may run multiple times, but successful results persist exactly once, and retry state persists to storage so infrastructure failure during backoff doesn't lose track of where the workflow was.
Retry configuration adapts your workflow to the reliability characteristics of what it calls; defaults assume well-behaved services while customisation acknowledges reality.
Rate-limited APIs need longer backoff, since aggressive retries make things worse when an API returns 429 errors. Increase initial delay and let exponential backoff create breathing room.
Slow-failing services need shorter timeouts, since waiting for the default timeout wastes time if a service hangs 30 seconds before failing. Set explicit step timeouts to fail fast and retry sooner; the default step timeout of 15 minutes is appropriate for long-running operations but excessive for APIs that should respond in milliseconds.
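Both adjustments are per-step configuration passed as an options object to step.do. A sketch for a rate-limited upstream (fetchFromRateLimitedApi is hypothetical):
const data = await step.do(
  "call-rate-limited-api",
  {
    retries: { limit: 10, delay: "30 seconds", backoff: "exponential" },
    timeout: "30 seconds" // fail fast instead of waiting out the default
  },
  async () => {
    return await fetchFromRateLimitedApi();
  }
);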
Permanently failing operations shouldn't retry, since a payment declined for insufficient funds won't succeed on attempt six; validation failures, business rule violations, and malformed inputs are all permanent. Throw NonRetryableError to stop immediately:
import { NonRetryableError } from "cloudflare:workflows";

if (!validation.valid) {
  throw new NonRetryableError(`Invalid order: ${validation.reason}`);
}
Flaky but critical services might need more attempts if they fail often but eventually succeed and you can't fix them, so increase retry limits. But treat this as technical debt rather than a proper solution.
When failure is acceptable
Some operations are truly optional: enrichment services adding nice-to-have data, analytics calls that don't affect business logic, or notifications retried later through other means. Wrap these in try/catch and continue:
try {
  await step.do("optional-enrichment", () => enrichmentService.enhance(data));
} catch {
  // Enrichment unavailable; proceed without it
}
The workflow continues despite failure, so log it and perhaps alert on high failure rates, but don't let optional operations block critical paths.
Compensation and rollback
When a step fails after previous steps succeeded, you may need to undo earlier work: payment captured but shipping failed means refunding the payment; inventory reserved but payment failed means releasing the inventory.
This pattern is called the saga pattern: a sequence of operations with compensating actions for rollback. Every step that changes the world needs an undo button.
Track which operations succeeded, then in your catch block execute compensating steps for each. Workflows doesn't provide sagas automatically, so you implement compensation logic in your workflow code.
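A sketch of saga-style compensation; the payments, inventory, and shipping clients are placeholders for your own services:
const completed: string[] = [];
try {
  await step.do("capture-payment", () => payments.capture(order));
  completed.push("payment");
  await step.do("reserve-inventory", () => inventory.reserve(order));
  completed.push("inventory");
  await step.do("create-shipment", () => shipping.create(order));
} catch (error) {
  // Compensating actions are steps too, so they get the same durability and retries
  if (completed.includes("inventory")) {
    await step.do("release-inventory", () => inventory.release(order));
  }
  if (completed.includes("payment")) {
    await step.do("refund-payment", () => payments.refund(order));
  }
  throw error; // surface the original failure once cleanup is done
}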
Design compensation steps with their own retry logic because compensation that fails leaves you worse off than the original failure. Have a fallback for when compensation fails, usually alerting humans for manual intervention.
Workflows that succeed but are wrong
Not all failures throw errors; a payment step might return success while charging the wrong amount, or a reservation step might confirm inventory that doesn't exist, and the workflow completes with green checkmarks and incorrect business state.
These silent failures are harder to catch than exceptions because they require validation steps verifying that outcomes match expectations, reconciliation processes comparing workflow results against source-of-truth systems, and monitoring that detects anomalies in business metrics even when technical metrics look healthy.
Build verification into critical workflows by verifying that captured amounts match order totals after payment capture, confirming that reservations exist after inventory reservation. External systems often lie, usually unintentionally.
When Workflows get stuck
Workflows can sometimes get stuck, and these failure modes deserve explicit names so you can recognise them.
Poison step loop occurs when a step fails consistently but not with NonRetryableError, so the workflow retries according to configuration, exhausts attempts, and either fails or loops longer if you've configured more retries. A step calling a service that always returns 500 retries until something intervenes. The fix is recognising which failures are permanent and throwing NonRetryableError.
Event starvation happens when a waitForEvent step never receives its event because the external system fails, loses the event, or was never configured correctly, so the workflow waits until timeout or indefinitely if you didn't set one. Always set timeouts and design escalation paths for when events don't arrive.
Compensation cascade failure manifests when your rollback logic itself fails: step 3 failed, you try to compensate step 2, but compensation also fails, leaving you with partially captured payments and partially reserved inventory in a workflow stuck trying to clean up. Design compensation to be more reliable than original operations and have manual intervention as the ultimate fallback.
State size overflow hits when step results accumulate past the 1 MiB limit; a workflow passing 200 KiB between steps works fine for small jobs, but when a larger job returns 1.5 MiB from a step, the workflow fails hard. The fix is architectural: store large data externally and pass references.
Your intervention playbook breaks down into clear steps, with a code sketch after them:
Diagnosis first: The dashboard shows step-level progress, retry counts, and error messages, so before terminating, understand why the workflow is stuck (external service down, bug in your code, or event source failing to send).
Terminate stuck instances when recovery isn't possible, since termination is immediate and stops the workflow permanently. Use this for workflows that can't complete and shouldn't retry.
Send missing events for workflows waiting on external signals; if an approval event was lost, resend it programmatically, and if a webhook failed, replay it so the workflow resumes from where it paused.
Fix and redeploy for bugs, but remember that running instances continue with their persisted state and don't automatically pick up new code. See versioning below for how to handle this.
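Termination, inspection, and event injection can all be done from a Worker through the Workflow binding (or equivalently via Wrangler or the REST API). A sketch, assuming a binding named ORDER_WORKFLOW:
const instance = await env.ORDER_WORKFLOW.get(stuckInstanceId);
const { status } = await instance.status(); // confirm where it is stuck first
if (status === "errored") {
  await instance.terminate(); // permanent; the instance will not resume
} else {
  // A waiting instance starved of its event can be unblocked by resending it
  await instance.sendEvent({ type: "approval-decision", payload: { approved: false } });
}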
Versioning and deployments
One question that catches many teams off guard is what happens when you deploy new workflow code while instances are running.
Running workflows execute current code but with persisted state, meaning the workflow code runs fresh on each step even though step results already exist. Deploying doesn't restart running instances; it just changes what code runs for steps that haven't executed yet.
This creates categories of changes:
Additive changes (new steps after existing ones) deploy safely. Running instances that have already passed that point won't execute the new steps, which is often acceptable; new instances get the full workflow.
Logic changes within existing steps apply to future step executions in running instances, so a bug fix in step 5 takes effect when a running instance reaches step 5, but if it's already past step 5, the fix doesn't apply.
Structural changes (removed, reordered, or renamed steps) break running instances because the workflow tries to execute a step that no longer exists, or replay can't match persisted results to step definitions, so these require a new workflow type.
Breaking input/output changes also require new workflow types because if step 3 now expects different input than step 2 used to produce, running instances fail when reaching step 3.
The decision framework:
| Change Type | Running Instances | Approach |
|---|---|---|
| Bug fix in step logic | Future executions use new code | Deploy in place |
| New steps at end | Won't execute new steps | Deploy in place; accept limitation |
| New steps in middle | Break at insertion point | New workflow type |
| Removed steps | Break when reaching removed step | New workflow type |
| Renamed steps | Replay fails | New workflow type |
| Changed step return type | Downstream steps may fail | New workflow type |
When creating a new workflow type, you need a drainage strategy for the old one: stop creating new instances of the old type, monitor existing instances until they complete or timeout, and only then decommission the old definition. This might take days or weeks for long-running workflows.
Service binding compatibility
Service bindings add a subtle dependency challenge: a workflow's code is written against a bound Worker's interface and keeps calling it for as long as instances are running. If your workflow calls an RPC method on a bound Worker and you deploy a breaking change to that Worker's interface, running workflow instances still expect the old interface.
Service bindings called by workflows must maintain backward compatibility for as long as old workflow instances might exist; a workflow sleeping for 30 days before its final step means your service binding must support 30-day-old calling conventions, because breaking the RPC contract breaks running workflows.
The practical implication is treating service binding interfaces called by workflows as versioned APIs: add new methods rather than changing existing signatures, deprecate gracefully, and if you must make breaking changes, coordinate with workflow drainage to ensure old instances complete before removing old interface methods.
Deployment timing matters
Beyond semantic compatibility, deployment timing affects running workflows operationally because when you deploy a Worker, its associated Workflow redeploys too. Workflow instances actively executing steps at that moment may encounter transient errors, typically surfacing as "Attempt failed due to internal workflows error."
Workflows actively executing steps during deployment may fail and retry. Workflows in a waiting state (sleeping or awaiting events) are unaffected. Time deployments to minimise active step execution, or accept that some steps will retry.
The protection for waiting workflows explains why long-running processes with step.sleep() or step.waitForEvent() calls survive deployments gracefully, since a workflow sleeping for 30 days isn't executing code; it's hibernating. Deployment doesn't wake it or disrupt it; only workflows mid-step-execution face disruption.
For high-frequency workflows where instances are always executing, deployment strategies matter a great deal. Deploy during low-traffic windows when fewer instances are mid-step, use gradual rollouts if available, and accept that some percentage of active steps will retry while ensuring your steps handle retries correctly through idempotency.
Handling platform transients
Occasionally steps fail due to platform issues rather than your code, where internal errors, infrastructure hiccups, and deployment-related disruptions surface as step failures. These are typically transient and succeed on retry.
The danger arises when transient errors get misclassified; if error-handling code catches a platform error and wraps it in NonRetryableError, the workflow terminates permanently for what should have been a temporary condition. Defensive error handling distinguishes between errors you understand and errors you don't:
try {
  await step.do("critical-operation", async () => {
    return await performOperation();
  });
} catch (error) {
  if (isKnownPermanentFailure(error)) {
    // Only throw NonRetryableError for failures you understand
    throw new NonRetryableError(`Permanent failure: ${error.message}`);
  }
  // Let unknown errors retry; they might be transient platform issues
  throw error;
}
The principle is simple: be conservative about declaring failures non-retryable, since unknown errors deserve retry attempts while only errors you've explicitly identified as permanent (validation failures, business rule violations, malformed data) should bypass retry. Platform transients resolve themselves given time.
Observability in production
The dashboard shows workflow instances, step progress, and error messages, which is useful for debugging individual failures but insufficient for production operations.
Alerting on stuck workflows requires external monitoring. Export workflow metrics to your observability platform. Alert when workflows remain "running" beyond expected duration, when step retry counts exceed thresholds, when failed workflow rates increase.
Duration analysis identifies bottlenecks by answering questions like which step takes longest and whether it's consistently slow or shows variable latency. Track step durations over time since degradation often appears gradually before causing failures.
Business metrics matter more than technical metrics: a workflow completing successfully but taking 4 hours instead of 4 minutes might not trigger technical alerts, so monitor what actually matters: time from order placement to shipping label generation, approval workflow cycle time, and data synchronisation lag.
Log context aggressively: when a step fails at 3 AM, you'll want to know what inputs it received, what external calls it made, and what state preceded the failure. Include workflow instance ID in all logs for correlation.
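A minimal example of the kind of structured line worth emitting from inside a step (field names are only a suggestion):
await step.do("reserve-inventory", async () => {
  console.log(JSON.stringify({
    instanceId, // correlate every log line back to this workflow run
    step: "reserve-inventory",
    sku: item.sku,
    quantity: item.quantity
  }));
  return await inventory.reserve(item);
});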
Console logs from workflow steps don't stream in real-time. They're batched and flushed when the workflow instance completes or fails. If you're tailing logs for a stuck workflow, you won't see anything until it finishes. For debugging running workflows, rely on the dashboard's step-level visibility rather than expecting live log output.
Long-running process patterns
Workflows excel at processes spanning extended time (hours, days, or weeks), and these patterns share common characteristics: steps depending on each other, external waits where polling cost would be prohibitive, and the need for visibility into where processes stand.
Sequential with external waits
Order processing exemplifies this pattern: validate, capture payment, reserve inventory, wait for warehouse confirmation, generate the shipping label, wait for carrier pickup, notify the customer, wait for delivery, and request a review. The waits are where Workflows earns its keep because each waitForEvent hibernates the workflow, consuming nothing until the external signal arrives. Without this capability, you'd poll external systems, maintain status in a database, and run scheduled jobs to check for updates.
Approval chains with escalation
Document approvals need timeouts and escalation paths like waiting 48 hours for manager approval, escalating to director with 24 hours if timeout occurs, and auto-rejecting with notification if still no response. The workflow encodes the escalation policy directly with no external state machine, no cron jobs checking approval status, and no separate escalation service. The policy is readable in code, and changes deploy like any other code change.
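A sketch of that policy in workflow code. It assumes a timed-out waitForEvent surfaces as a thrown error; notifyDirector and rejectAndNotify are placeholders:
let decision;
try {
  decision = await step.waitForEvent("manager-approval", {
    type: "approval-decision",
    timeout: "48 hours"
  });
} catch {
  await step.do("escalate-to-director", () => notifyDirector(doc));
  try {
    decision = await step.waitForEvent("director-approval", {
      type: "approval-decision",
      timeout: "24 hours"
    });
  } catch {
    await step.do("auto-reject", () => rejectAndNotify(doc));
    return;
  }
}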
Chunked batch processing
Large datasets benefit from per-chunk checkpointing because a data migration processing 1 million records can't be a single step; failure at record 999,000 would restart from record 1. Chunk the data and process each chunk as a separate step so failure at chunk 47 resumes at chunk 47, not chunk 1.
Make each chunk a step and progress persists automatically. Chunk size balances checkpoint overhead (smaller chunks mean more checkpoints) against recovery cost (larger chunks mean more re-processing on failure); for most workloads, 1,000 to 10,000 items per chunk works well.
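A sketch of the chunking loop; source and destination stand in for wherever the records live:
const totalRecords = await step.do("count-records", () => source.count());
const chunkSize = 5_000;
for (let offset = 0; offset < totalRecords; offset += chunkSize) {
  await step.do(`migrate-chunk-${offset}`, async () => {
    const rows = await source.read(offset, chunkSize);
    await destination.write(rows);
    return { offset, migrated: rows.length }; // small result, well under 1 MiB
  });
}
The generated step names stay deterministic across replays because offset derives from a checkpointed count and a constant chunk size.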
When Workflows versus when alternatives
Workflows versus Queues
Chapter 8 covers Queues in depth, but the architectural choice between them needs to happen now.
Workflows orchestrate dependent steps: step 3 needs output from step 2, you need visibility into where a process is stuck, the process spans significant time, order matters, and partial failure requires compensation.
Queues distribute independent tasks: each of 10,000 items is separate, order doesn't matter, fire-and-forget is acceptable, and you want parallel processing.
A common pattern combines both, where Workflows orchestrates the overall process while Queues handles parallel fan-out within steps. An order workflow might queue 100 notification sends to process concurrently, wait for completion, and then continue.
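A sketch of that fan-out from inside a workflow step, assuming a Queue binding named NOTIFICATIONS_QUEUE; the consumer side is Chapter 8's territory:
await step.do("enqueue-notification-sends", async () => {
  await env.NOTIFICATIONS_QUEUE.sendBatch(
    recipients.map((recipient) => ({
      body: { orderId: order.id, recipient }
    }))
  );
  return { enqueued: recipients.length };
});
Waiting for the fan-out to finish can then be modelled as a waitForEvent that the consumer signals once the last message is processed.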
Workflows versus Durable Objects
Workflows is built on Durable Objects, so when should you use the abstraction versus the primitive?
Use Workflows when orchestration fits: you have sequential steps with persistence between them, human wait times measured in hours or days, and standard retry and timeout needs. You want the framework to handle replay, checkpointing, and step isolation.
Use Durable Objects directly when you need patterns Workflows doesn't express: real-time state access where clients query current state rather than just final results, WebSocket connections for live updates, complex conditional logic that doesn't map to linear steps, custom storage patterns beyond step results, or receiving messages while processing.
The test is straightforward: if you're fighting Workflows' step model to express your logic, you probably want a Durable Object, but if your process naturally decomposes into "do this, then this, then wait, then this," Workflows saves you from building that machinery yourself.
A third option combines both: the AgentWorkflow class (Chapter 18) lets AI agents delegate durable execution to Workflows while maintaining real-time WebSocket connections with clients. The agent handles interactive decision-making and user communication while the Workflow handles reliable multi-step execution with checkpointing and retry. This composition is particularly effective for human-in-the-loop patterns where users interact with an agent that triggers and monitors long-running processes.
Comparing to external orchestrators
If you're evaluating Workflows against AWS Step Functions or Temporal, you'll find the differences are architectural in nature.
| Aspect | Cloudflare Workflows | AWS Step Functions | Temporal |
|---|---|---|---|
| Definition | TypeScript code | JSON (ASL) or SDK | Code (multiple languages) |
| Execution Model | Durable Object per instance | Managed service | Self-hosted or Temporal Cloud |
| Max Duration | Weeks | 1 year | Unlimited |
| State Limit | 1 MiB per step | 256 KB total | Configurable |
| Cold Start | Sub-millisecond | ~50-100ms | Depends on hosting |
| Step Pricing | Per CPU time | Per state transition | Per action (cloud) |
Step Functions uses a JSON-based definition language (ASL) that constrains expression but integrates tightly with AWS services, with native connectors to Lambda, SNS, DynamoDB, and dozens of other services reducing integration code. Pricing per state transition can be expensive for step-heavy workflows, though if you're deep in AWS and need tight integration with Lambda and DynamoDB, Step Functions' native connectors save integration work despite ASL's constraints.
Temporal offers the most powerful model with unlimited duration, sophisticated replay, child workflows, and strong consistency guarantees, but you either operate infrastructure yourself or pay for Temporal Cloud. The concepts (activities, workflows, workers, task queues) carry a learning curve; teams report taking weeks to internalise them fully. If your workflows run for months or need sub-workflow spawning, parent-child relationships, and sophisticated replay, Temporal's power justifies its complexity.
Workflows optimises for simplicity: you write TypeScript, each instance is isolated, and cold starts are fast. The trade-off is fewer features: no built-in saga support, simpler retry policies, and a shorter maximum duration than Temporal. Step Functions optimises for AWS integration, Temporal for power users, and Workflows for teams wanting durability without needing a PhD in distributed systems.
Testing Workflows
Workflows testing was historically black-box because you could verify an instance reached completion, but intermediate steps were opaque; you couldn't know whether the payment step succeeded or the notification step received correct data without inspecting external systems. Worse, adding a Workflow forced you to disable isolated storage in vitest-pool-workers, meaning state could leak between tests.
Since late 2024, vitest-pool-workers provides introspection APIs solving both problems by exposing introspectWorkflowInstance for testing known instance IDs and introspectWorkflow for integration tests where IDs are generated dynamically through the cloudflare:test module. These run locally and offline, making tests fast and free of network dependencies.
The introspection APIs let you mock step results, inject events at precise moments, and disable sleeps for time-dependent tests, so a workflow normally sleeping seven days can have sleeps disabled entirely, and a step calling an external payment API can return a mocked response. You control exactly what each step produces and verify what subsequent steps receive.
For time-dependent behaviour, call disableSleeps() on the modifier object rather than parameterising durations; for waitForEvent testing, use mockEvent() to inject events without querying instance status manually; and for compensation logic, mock step failures to trigger rollback paths on demand.
The principle is straightforward: don't test Workflows' machinery because Cloudflare tests that steps persist and retries happen; instead, test your business logic by verifying that given these inputs, the right steps execute, and given this mocked failure, compensation happens correctly. The introspection APIs make these assertions straightforward.
What comes next
Workflows handle processes spanning time and surviving failures, but sometimes you need simpler asynchronous processing: work requiring reliable execution without the overhead of orchestration.
Chapter 8 covers Queues for decoupling producers from consumers, handling background work, and managing load through buffering. Where Workflows orchestrates, Queues distribute; the two complement each other with Workflows handling coordination and Queues handling fan-out.
Chapter 9 addresses Containers for workloads exceeding Workers' constraints entirely, since some Workflow steps need more memory or longer compute than Workers allow, and Containers fill that gap while remaining accessible from Workflows through the same binding model.
The patterns from this chapter (durable state, step isolation, compensating transactions) recur throughout stateful system design on Cloudflare, so master them here and you'll apply them to Durable Objects coordination, Queue consumers, and Container orchestration alike.