Chapter 25: Migration Playbooks
How do I move existing workloads to Cloudflare, and when should I?
The previous chapter helped determine whether Cloudflare fits your workload. This chapter addresses the practical question: how do you get there?
Migration is not a goal; faster applications, lower costs, and simpler operations are goals. Migration is a means and an expensive one. Before planning how to migrate, establish why and whether expected benefits justify the certain costs. This chapter provides playbooks for common migration scenarios, assuming you've already decided migration is worthwhile.
The migration principles
Successful migrations share common principles regardless of source or target. These aren't best practices to consider. They're requirements that distinguish success from cautionary tales.
Coexistence before cutover
Run old and new systems in parallel, routing some traffic to the new system while the old remains operational. Validate behaviour, compare results, and build confidence. Only after the new system proves itself do you cut over completely.
New systems always misbehave in ways you didn't anticipate: edge cases in data, traffic patterns you didn't test, integrations that assumed behaviours your new system doesn't provide. Coexistence gives you time to discover these problems without user-facing outages.
The cost of running two systems is real but bounded; the cost of a failed atomic cutover is unbounded. Coexistence is insurance worth paying for.
Incremental over atomic
Migrate one service, one data store, one capability at a time; each increment is a chance to learn, adjust, and verify. Atomic migrations compound risks and create uncertainty about what failed if something goes wrong.
Incrementalism also manages organisational risk: a team migrating one service learns lessons they apply to the next, whereas a team migrating everything at once learns lessons they can only apply to the post-mortem.
The objection is usually "but the systems are interconnected, we can't migrate piece by piece," which is sometimes true but more often a failure of imagination. Most systems can be decomposed; the question is whether you've tried. Hybrid architectures are valid intermediate states, not failures to complete migration.
Reversibility as requirement
Every migration step should be reversible. Keep S3 data until R2 proves reliable if you migrate from S3 to R2; maintain the ability to route traffic back if you migrate compute from Lambda to Workers; keep old DNS records available for quick revert if you migrate DNS.
Irreversible migrations are bets; reversible migrations are experiments. You want experiments. The confidence you have before migration is always less justified than you think, and the problems you'll discover are always different from those anticipated.
The cost of reversibility is maintaining old infrastructure during migration, the same cost as coexistence and equally worth paying. Set explicit timelines ("We'll maintain Lambda functions for 30 days after traffic reaches zero") to prevent indefinite accumulation while preserving rollback capability during the risk window.
Observation before and during
Establish baseline metrics before migration (latency distributions, error rates, costs, user experience measures), then monitor the same metrics during and after. Without baselines, you can't know if migration improved anything; without continuous monitoring, you can't catch regressions until users report them.
The metrics that matter depend on why you're migrating. For latency, measure p50, p95, and p99 from representative geographic locations; for cost, track resource consumption at sufficient granularity to compare; for operational simplicity, measure time-to-deploy, incident frequency, and mean time to recovery.
"We migrated successfully" is not a useful claim without metrics. "We reduced p99 latency from 450ms to 120ms while reducing compute costs by 35%" is useful and requires observation; plan for it.
Zero-downtime migration architecture
The principles above describe what to do. This section describes how to do it without users ever noticing. Zero-downtime migration isn't a luxury reserved for organisations with dedicated platform teams; Cloudflare's own infrastructure provides the building blocks that make it the default approach rather than an aspirational goal.
The core insight is that Cloudflare sits between your users and your infrastructure by design. Once traffic flows through Cloudflare's network, you control where it goes, how much goes where, and how quickly you can change your mind. This position makes Cloudflare uniquely suited to orchestrating its own adoption: the same network that will eventually run your application can manage the transition to get there.
The strangler fig pattern
The strangler fig is a tree that grows around its host, gradually replacing it while the host continues to function. The software equivalent, coined by Martin Fowler, describes incrementally replacing a legacy system by routing requests through a new layer that delegates to either new or old implementations.
Workers are natural strangler figs. Deploy a Worker in front of your existing infrastructure, initially proxying every request unchanged to your hyperscaler backend. A simple proxy Worker adds under 5ms of total latency to the request path and gives you a control point for everything that follows.
```typescript
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);

    // Migrated endpoints use new implementation
    if (isMigrated(url.pathname)) {
      return handleLocally(request, env);
    }

    // Everything else proxies to legacy infrastructure
    return fetch(env.LEGACY_ORIGIN + url.pathname + url.search, {
      method: request.method,
      headers: request.headers,
      body: request.body,
    });
  },
};
```
From this starting point, migration becomes a series of small decisions about which endpoints to handle locally. Each decision is independent and reversible. The isMigrated() function can check a KV namespace, allowing you to toggle endpoints without redeploying. Migrate /api/users on Tuesday, observe for a week, then migrate /api/orders the following Tuesday. At no point does any user experience an outage.
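The KV-backed isMigrated() check can be sketched as follows. The MIGRATED_ROUTES namespace name is hypothetical, and the extra parameter stands in for the env binding; only the slice of the KV interface actually used is modelled here, so the logic runs anywhere.

```typescript
// Minimal model of the Workers KV read interface used by this sketch.
interface KVLike {
  get(key: string): Promise<string | null>;
}

// One key per endpoint prefix, e.g. "route:/api/users" -> "1".
// Toggling a route on or off is a KV write, not a redeploy.
async function isMigrated(pathname: string, routes: KVLike): Promise<boolean> {
  const [, first, second] = pathname.split("/");
  const prefix = second ? `/${first}/${second}` : `/${first ?? ""}`;
  return (await routes.get(`route:${prefix}`)) === "1";
}
```

In a Worker, the second argument would be the KV binding from env; the prefix scheme (two path segments) is an assumption to adapt to your routing.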
The pattern works because Cloudflare's network already terminates TLS and handles DNS for your domain. Adding a Worker to the request path is a configuration change, not an infrastructure migration. Your users' experience is unchanged; only the routing logic behind Cloudflare's edge has shifted.
Deploy a Worker that proxies everything to your existing backend before migrating anything. This validates the request path, establishes baseline metrics through Cloudflare's analytics, and gives you the control point every subsequent step depends on.
Gradual deployments and traffic splitting
Once your Worker handles some endpoints natively, you need confidence before routing all traffic through new code paths. Cloudflare's gradual deployments solve this precisely.
Gradual deployments split traffic between Worker versions by percentage. Deploy a new version handling /api/users natively, then route 0.5% of traffic to it while 99.5% continues using the proxy-to-legacy version. Monitor error rates, latency distributions, and response correctness at each stage. If anything looks wrong, roll back instantly; "instantly" here means seconds, not the minutes or hours DNS propagation would require.
Cloudflare's own internal practice follows a staged pattern: 0.05% to 0.5% to 3% to 10% to 25% to 50% to 75% to 100%, with soak time between each stage. Each stage surfaces a different class of problem. At 0.05%, you catch crashes and obvious errors. At 3%, you catch performance regressions visible in p95 latency. At 25%, you catch capacity issues and rate-limiting interactions. At 50%, you catch edge cases in long-tail traffic patterns.
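The staged dial is managed by the platform, but the underlying mechanic is worth understanding: a deterministic percentage split assigns each stable key (a session ID, say) to a fixed bucket, so the same user sees the same version throughout a stage. This is an illustration of the idea, not Cloudflare's implementation.

```typescript
// FNV-1a 32-bit hash: a simple, stable string hash for bucketing.
function bucketOf(key: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h % 10000;
}

// percentage 0.5 means 0.5% of keys: 50 of the 10,000 buckets.
function routeToNewVersion(key: string, percentage: number): boolean {
  return bucketOf(key) < percentage * 100;
}
```

Because the assignment is deterministic, raising the dial from 3% to 10% only moves new buckets to the new version; keys already routed there stay there.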
The critical advantage over DNS-based traffic splitting is speed. DNS changes propagate over minutes to hours depending on TTL settings and resolver caching behaviour. Gradual deployments take effect within seconds because the split happens at Cloudflare's edge, not in the DNS layer. Rolling back a DNS change means waiting for caches to expire; rolling back a gradual deployment means clicking a button.
Workers retain up to 100 previous versions for rollback. During migration, this means you can revert to any recent known-good state, not just the immediately preceding version. If version 47 introduced a subtle data corruption bug discovered only at version 52, you can roll back to version 46 while you investigate.
Connecting to existing infrastructure
Zero-downtime migration requires your new infrastructure to communicate with your old infrastructure throughout the transition. You cannot migrate everything simultaneously, so Workers need to reach backends still running on hyperscalers. Cloudflare provides two mechanisms for this, each suited to different scenarios.
Cloudflare Tunnel creates an outbound-only encrypted connection from your existing infrastructure to Cloudflare's edge. Install a lightweight connector (cloudflared) in your AWS VPC, Azure VNet, or GCP VPC, and it establishes a persistent tunnel without requiring any inbound firewall rules. Workers route requests through the tunnel to reach internal services, databases, or APIs that aren't publicly accessible.
The operational benefit during migration is substantial. You need not expose internal services to the public internet, reconfigure security groups, or punch holes in firewalls. The tunnel connector runs alongside your existing infrastructure and can be removed when migration completes. Multiple connectors provide high availability, and failover is automatic.
Workers VPC Services build on Tunnel to provide binding-level access to specific internal services. Rather than giving a Worker access to your entire private network (which creates the SSRF risks Chapter 1 discussed), VPC Services let you bind a Worker to a specific internal endpoint. The Worker accesses env.LEGACY_API.fetch() and the request routes through the tunnel to exactly that service, nothing else.
This distinction matters for security during migration. A Worker handling user requests shouldn't be able to reach your internal monitoring infrastructure or administrative APIs just because a tunnel exists. VPC Services enforce least-privilege access at the binding level, limiting each Worker to the specific backends it needs.
For database connectivity during migration, Hyperdrive provides the bridge. Configure Hyperdrive to connect to your existing PostgreSQL or MySQL database, whether it runs on RDS, Cloud SQL, Azure Database, or self-hosted infrastructure. Workers access the database through a Hyperdrive binding with connection pooling, prepared statement caching, and global connection reuse handled automatically. Your database stays where it is; your compute migrates around it.
Smart Placement during transition
When your Workers connect to backends still running on hyperscalers, latency depends on the distance between Worker execution and backend location. A Worker running in Sydney that queries an RDS instance in us-east-1 pays for a trans-Pacific round trip on every database call.
Smart Placement addresses this automatically. Enable it on Workers that make backend calls, and Cloudflare analyses traffic patterns to determine optimal execution location. Smart Placement runs your Worker near the location handling your heaviest backend traffic, reducing database round trips to single-digit milliseconds while accepting slightly higher latency for users far from that region.
For known, fixed backend locations, explicit placement hints provide immediate optimisation without waiting for Smart Placement to learn patterns. Specify "aws:us-east-1" in your placement configuration, and your Worker executes in the Cloudflare data centre closest to that AWS region from the first request.
As migration progresses and backends move to Cloudflare (D1, R2, Durable Objects), Smart Placement's optimal location shifts. Eventually, when all backends are on Cloudflare, you disable Smart Placement and return to the default model: execution at the edge closest to each user. The transition happens naturally as you migrate backends; no manual placement reconfiguration required.
The phased approach
Combining these capabilities produces a migration architecture where zero downtime is the natural outcome rather than an engineering achievement. The phases overlap in practice, but the logical sequence provides structure.
Phase one: establish the edge. Put Cloudflare in front of your existing infrastructure. This might mean proxying through Workers, or simply enabling Cloudflare as a DNS proxy for your domain. At this stage, nothing changes for your users. You gain Cloudflare's DDoS protection, TLS management, and analytics as immediate benefits while establishing the control point for subsequent phases. If your domain already uses Cloudflare for CDN or security, this phase is already complete.
Phase two: migrate storage incrementally. Enable Sippy on an R2 bucket pointed at your S3 source. From this moment, every object request that reaches R2 either serves from R2 (if already migrated) or transparently fetches from S3, copies to R2, and serves. No application changes required; the R2 bucket behaves identically to S3 from the client's perspective, but objects accumulate in R2 over time. For the long tail of rarely-accessed objects, run Super Slurper to complete the migration in bulk. At no point does any request fail because an object hasn't migrated yet.
Phase three: migrate compute gradually. Deploy Workers handling migrated endpoints while proxying everything else to legacy infrastructure. Use gradual deployments to shift traffic incrementally: 0.5%, 3%, 10%, 25%, 50%, 100%. Enable Smart Placement so Workers execute near your legacy databases during the transition. Each endpoint migrates independently on its own timeline.
Phase four: migrate data selectively. With compute running on Workers and connecting to your existing database via Hyperdrive, evaluate whether data migration is necessary at all. Hyperdrive with an external PostgreSQL database is a valid permanent architecture. If you choose to migrate to D1, run dual writes during the transition: write to both old and new databases, read from the old, then switch reads to the new after validation. Chapter 11 covers D1's horizontal patterns for structuring the target schema.
Phase five: decommission. Remove legacy infrastructure after a validation period with full traffic on Cloudflare. Keep rollback capability (dormant Lambda functions, S3 buckets in read-only mode) for 30 days beyond the point where you're confident. The cost of maintaining idle infrastructure for a month is trivial compared to the cost of discovering you need it and finding it's gone.
Each phase is independently valuable. An organisation that completes only phase one still benefits from improved security and observability. One that reaches phase two saves on egress costs immediately. Phase three delivers latency improvements. You need not commit to the full sequence; each phase stands alone and can be the permanent stopping point if further migration doesn't justify the investment.
Comparing migration approaches across platforms
Migration tooling reveals a platform's architectural assumptions. Cloudflare's approach differs from hyperscaler migration paths in ways that reflect the platform's edge-native design.
Hyperscaler migrations typically involve lift-and-shift tooling designed to move workloads between similar environments. AWS Migration Hub, Azure Migrate, and Google Cloud's migration tools assume you're moving VMs, containers, or databases from one data centre to another. The workload structure stays the same; only the infrastructure underneath changes. This works well for homogeneous migrations (on-premise to cloud, cloud to cloud) but provides little help when the target platform has a fundamentally different execution model.
Cloudflare provides no equivalent lift-and-shift tooling because the concept doesn't apply. You cannot lift a Lambda function and shift it to Workers without understanding how the execution model differs. Instead, Cloudflare provides infrastructure-level migration tools (Sippy, Super Slurper for storage; Hyperdrive for database connectivity; Tunnel for network bridging) that handle the infrastructure layer while you handle the architectural translation.
This difference is honest about the work involved. A Lambda-to-Workers migration is not a configuration change; it's an architectural decision with implications for memory management, execution time, global distribution, and state coordination. Tools that pretend otherwise create false confidence. Cloudflare's tooling handles what can be automated (copying objects, pooling connections, routing traffic) and leaves the architectural decisions where they belong: with the engineering team.
The tradeoff is real. Hyperscaler-to-hyperscaler migrations can be faster for workloads that translate directly. Cloudflare migrations require more thought but produce architectures that exploit the platform's strengths rather than merely reproducing what existed before.
The playbooks that follow apply this zero-downtime architecture to specific migration scenarios. Each playbook's migration process is designed to be zero-downtime by default when you follow the steps in sequence. Where a product offers a distinctive zero-downtime mechanism beyond the general patterns described above (Sippy's transparent fallback for object storage, the dual-write pattern for databases, organic cache population for Redis, and AI Gateway's fallback routing for inference), the playbook covers that technique in detail.
Playbook: S3 to R2
Object storage migration is conceptually simple (copy files from one bucket to another), but it becomes operationally complex at scale with billions of objects, petabytes of data, and applications expecting zero downtime. R2 provides two migration tools for different scenarios.
Super Slurper: complete migration
Super Slurper copies all objects from a source bucket to R2 in a single migration job. Use it when you want everything migrated, can tolerate egress costs, and need migration completed within a predictable timeframe.
Configure Super Slurper through the Cloudflare dashboard: specify source credentials, source bucket, destination R2 bucket, and optional path filters. It handles parallelisation, retries, and progress reporting. Migration that would take weeks with sequential copying completes in hours or days.
Super Slurper supports S3-compatible sources beyond AWS: Google Cloud Storage, MinIO, Backblaze B2, Wasabi, DigitalOcean Spaces.
The process:
- Create destination R2 bucket
- Configure Super Slurper with source credentials and bucket details
- Optionally configure path filters for specific prefixes
- Start migration and monitor progress
- Validate migrated objects match source (spot-check counts and checksums)
- Update application configuration to point to R2
- Maintain source bucket during validation period (weeks, not days)
- Delete source bucket after confidence is established
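The spot-check in the validation step can be sketched as a comparison of the two buckets' listings. ObjectRecord and findMismatches are illustrative names; real records would come from the source and destination list APIs. Note that multipart-uploaded objects can legitimately carry different ETags after migration (see the compatibility notes later in this playbook), so mismatches warrant review rather than automatic failure.

```typescript
// Minimal record shape for a listed object: key plus checksum-style ETag.
interface ObjectRecord {
  key: string;
  etag: string;
}

// Returns a human-readable problem list: missing keys and ETag mismatches.
function findMismatches(source: ObjectRecord[], dest: ObjectRecord[]): string[] {
  const destByKey = new Map<string, string>(
    dest.map((o): [string, string] => [o.key, o.etag]),
  );
  const problems: string[] = [];
  for (const obj of source) {
    const destEtag = destByKey.get(obj.key);
    if (destEtag === undefined) problems.push(`missing: ${obj.key}`);
    else if (destEtag !== obj.etag) problems.push(`etag mismatch: ${obj.key}`);
  }
  return problems;
}
```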
Calculate egress costs before committing to a strategy. AWS charges $0.09/GB for S3 egress; migrating 10 TB costs $900 in egress fees alone, potentially more than several months of R2 storage. For massive buckets, Sippy's on-demand migration may prove more economical than Super Slurper's complete copy.
Sippy: incremental migration
Sippy migrates objects on demand. Configure R2 as a caching layer in front of your source bucket. When an object is requested from R2 and doesn't exist, Sippy fetches it from the source, stores it in R2, and returns it. Frequently accessed objects migrate first; rarely accessed objects migrate only when needed.
The benefit is economics. You pay egress only for objects actually requested, and frequently requested objects incur egress only once. For buckets where 10% of objects receive 90% of requests, Sippy can reduce migration egress costs by an order of magnitude.
The tradeoff is timeline. Migration completes only when all objects have been requested, which might be never. Objects never accessed remain in the source bucket indefinitely, continuing to incur storage costs.
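The economics can be made concrete with a back-of-envelope comparison using the $0.09/GB S3 egress rate quoted earlier. The hot fraction, the share of data that is ever actually requested, is the assumption that decides between the two tools.

```typescript
// S3 egress list price used in the chapter's examples ($/GB).
const S3_EGRESS_PER_GB = 0.09;

// Super Slurper: every byte crosses out of S3 once.
function fullCopyEgressCost(totalGb: number): number {
  return totalGb * S3_EGRESS_PER_GB;
}

// Sippy: only requested objects incur egress, and each only once.
// Objects never requested never migrate and never incur egress.
function sippyEgressCost(totalGb: number, hotFraction: number): number {
  return totalGb * hotFraction * S3_EGRESS_PER_GB;
}
```

For a 10,000 GB bucket where 10% of data is ever requested, the full copy costs roughly $900 in egress while Sippy costs roughly $90, the order-of-magnitude reduction described above.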
Use Sippy when:
- Egress costs for complete migration are prohibitive
- You want to serve frequently accessed content from R2 immediately
- Complete migration isn't required
- Most objects are rarely requested
The process:
- Create destination R2 bucket
- Configure Sippy with source bucket credentials
- Update application to request from R2 (Sippy handles fallback transparently)
- Monitor migration progress as objects copy on access
- Optionally run Super Slurper with "skip existing" to complete remaining objects
Combined strategy: Enable Sippy first to immediately serve frequently accessed objects from R2. After access patterns stabilise, run Super Slurper to migrate the long tail. This minimises egress costs while ensuring complete migration.
After migration: compatibility notes
R2 is S3-compatible, but compatibility isn't identity. Test thoroughly. Common differences:
ETags may differ. R2's ETag calculation matches S3 for single-part uploads, but Sippy may migrate multipart objects with different part sizes, producing different ETags. Applications validating ETags across migration will see mismatches.
Some S3 features aren't supported. S3 Object Lock, S3 Select, Requester Pays, and certain storage classes don't exist in R2. Check the compatibility matrix; missing features require application changes.
Endpoint URLs change. S3 endpoints follow bucket-name.s3.region.amazonaws.com. R2 follows account-id.r2.cloudflarestorage.com or custom domains. Update application configuration and SDK initialisations.
IAM policies don't transfer. R2 uses API tokens with specific permissions. Recreate your access control model using R2's authentication mechanisms.
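The endpoint change is usually a one-line difference at client construction. A sketch, using the R2 endpoint format quoted above; the account ID is a placeholder, and the commented S3Client configuration reflects R2's documented S3-compatible usage but should be checked against current documentation.

```typescript
// Build the S3-compatible R2 endpoint for an account.
function r2Endpoint(accountId: string): string {
  return `https://${accountId}.r2.cloudflarestorage.com`;
}

// With the AWS SDK v3, the migration is confined to client construction
// (shown as a comment to keep this sketch dependency-free):
//
//   const s3 = new S3Client({
//     region: "auto",
//     endpoint: r2Endpoint(ACCOUNT_ID),
//     credentials: { accessKeyId, secretAccessKey },
//   });
//
// All subsequent SDK calls (GetObject, PutObject, ListObjectsV2) are unchanged.
```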
Achieving zero downtime
Zero-downtime object storage migration is possible because requests for unmigrated objects can transparently fall back to the source bucket. The caller never sees a missing object; it sees either an object already in R2 or one fetched on demand from S3. This transparent fallback requires strict sequencing: Sippy must be active on your R2 bucket before any application configuration changes, because once the application points at R2, every request goes there. With Sippy active, unmigrated objects are fetched from S3, copied to R2, and returned in a single operation.
With Sippy active, update your application's S3 endpoint to point at R2. Since R2 supports the S3 API, most applications require only an endpoint URL and credential change; the SDK calls themselves remain identical. From this moment, all reads serve from R2 (directly or via Sippy fallback), and all writes go to R2. The S3 bucket becomes read-only from your application's perspective, serving only as Sippy's fallback source for objects not yet copied.
Run Super Slurper in "skip existing" mode to migrate remaining objects in bulk. Once Super Slurper completes and you've validated object counts and checksums, disable Sippy. Your R2 bucket now contains everything and operates independently. Keep the S3 bucket in read-only mode for 30 days as insurance; the storage cost is minimal compared to the value of having a rollback option.
At no point in this sequence does a request fail because an object hasn't migrated. Sippy handles the transition transparently; Super Slurper completes it efficiently; disabling Sippy finalises it cleanly. The only user-visible change is improved latency as objects serve from Cloudflare's edge rather than a single S3 region.
Playbook: Lambda to Workers
Moving serverless compute from Lambda to Workers requires understanding model differences, not just translating syntax.
The model differences
Geographic distribution: Lambda functions are regional (deploy to us-east-1, execute in us-east-1; global distribution requires deploying to multiple regions and managing routing through Route 53, CloudFront, or API Gateway). Workers are global by default; deploy once and code executes at whichever of Cloudflare's 300+ locations is closest to each user.
Resource limits: Lambda supports up to 10 GB memory and 15-minute execution. Workers have 128 MB memory and 30 seconds of CPU time for HTTP handlers by default, configurable up to 5 minutes. Cron Triggers with hourly or longer intervals get 15 minutes of CPU time. Queue consumers get up to 15 minutes of wall time but the same 5-minute CPU time ceiling as other Workers. This is the most common migration blocker.
Cold starts: Lambda cold starts range from 100ms to several seconds depending on runtime and VPC attachment. An entire ecosystem exists to mitigate them. Workers cold starts are under 5ms, typically imperceptible. Cold start mitigation strategies don't transfer because the problem doesn't exist.
Networking: Lambda connects to VPCs natively through ENI attachment. Workers connect to private resources through Cloudflare Tunnel or VPC Services integration.
Pricing: Lambda charges for GB-seconds (memory multiplied by wall-clock time). Workers charge for CPU time, with I/O wait free. Lambda charges while your function waits for a database response; Workers don't. I/O-heavy workloads typically cost less on Workers.
Assessment questions
Before migrating any Lambda function, answer these questions:
Does it fit Workers' constraints? If the function uses more than 128 MB memory, Workers isn't the right target without architectural changes. If it runs longer than 30 seconds for HTTP requests, you need Workflows, Queues, or a different approach.
What does it access? If the function accesses VPC resources (RDS, ElastiCache, internal services), how will Workers reach them? Cloudflare Tunnel works but adds latency and complexity. If using DynamoDB, will you migrate to D1, use Hyperdrive with an external database, or accept cross-cloud latency?
How is it triggered? HTTP triggers translate directly to Workers fetch handlers. SQS triggers require migrating to Cloudflare Queues or maintaining SQS with a polling Worker. EventBridge, Step Functions, and other AWS-specific triggers require alternative architectures.
What libraries does it use? Some npm packages assume Node.js APIs Workers don't provide: filesystem access, child processes, native modules. Test dependencies in the Workers environment before committing.
What's the quantified benefit? "Lower latency" isn't a benefit; "reducing p95 latency from 180ms to 45ms for European users" is. Without baseline metrics and continuous measurement, you can't know whether migration succeeded. If you can't articulate the expected improvement in numbers, question whether you should migrate at all.
Migration process
The process below follows the zero-downtime architecture described earlier in this chapter. Each step maintains a valid handler for every request; at no point does migrating a Lambda function require downtime or an atomic cutover.
For functions that pass assessment:
1. Translate the handler. Lambda handlers receive event and context objects with AWS-specific structure. Workers handlers receive standard Request objects and env for bindings. The translation is mechanical but requires attention to input parsing and output formatting.
```javascript
// Lambda
export const handler = async (event, context) => {
  const body = JSON.parse(event.body);
  const userId = event.pathParameters.userId;
  // ... business logic ...
  return {
    statusCode: 200,
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(result),
  };
};
```

```javascript
// Workers
export default {
  async fetch(request, env) {
    const body = await request.json();
    const url = new URL(request.url);
    const userId = url.pathname.split('/')[2]; // or use a router
    // ... business logic (largely unchanged) ...
    return Response.json(result);
  },
};
```
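The "or use a router" aside deserves a sketch: replacing Lambda's event.pathParameters means extracting named parameters yourself or adopting a router library. The ":name" pattern syntax below is the common convention, shown as an illustration rather than any specific library's API.

```typescript
// Match a pathname against a ":param" pattern, returning named parameters
// on success or null on no match.
function matchRoute(
  pattern: string,
  pathname: string,
): Record<string, string> | null {
  const pat = pattern.split("/").filter(Boolean);
  const path = pathname.split("/").filter(Boolean);
  if (pat.length !== path.length) return null;
  const params: Record<string, string> = {};
  for (let i = 0; i < pat.length; i++) {
    if (pat[i].startsWith(":")) params[pat[i].slice(1)] = path[i];
    else if (pat[i] !== path[i]) return null;
  }
  return params;
}
```

In the handler above, `matchRoute("/api/users/:userId", url.pathname)` would recover the userId that Lambda's pathParameters provided.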
2. Replace AWS SDK calls. DynamoDB becomes D1 or Hyperdrive. S3 becomes R2 bindings. SQS becomes Queue bindings. Secrets Manager becomes Wrangler secrets. Each requires understanding Cloudflare's semantics; similar but not identical.
3. Configure bindings. Resources IAM-attached in Lambda are binding-attached in Workers. Add appropriate bindings to wrangler.toml.
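A sketch of what the resulting configuration might look like. All names and IDs are placeholders, and the exact keys should be verified against current Wrangler documentation.

```toml
# Illustrative wrangler.toml fragment; names and IDs are placeholders.
name = "migrated-api"
main = "src/index.ts"
compatibility_date = "2025-01-01"

# Replaces DynamoDB access
[[d1_databases]]
binding = "DB"
database_name = "app-db"
database_id = "<database-id>"

# Replaces S3 access
[[r2_buckets]]
binding = "ASSETS"
bucket_name = "app-assets"

# Replaces SQS producers
[[queues.producers]]
binding = "JOBS"
queue = "background-jobs"

# Bridge to the legacy PostgreSQL database during the transition
[[hyperdrive]]
binding = "HYPERDRIVE"
id = "<hyperdrive-config-id>"
```

Each binding appears on env in the Worker (env.DB, env.ASSETS, env.JOBS), replacing the IAM-plus-SDK pattern from Lambda.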
4. Test locally. Use wrangler dev to verify behaviour against realistic inputs including edge cases from production logs.
5. Deploy to preview. Test with real Cloudflare infrastructure but without production traffic. Verify integrations, latency, and error handling.
6. Route incrementally. Use gradual deployments to shift traffic to the new Worker version. Start at 0.5%, monitor for a day, increase to 3%, then 10%, 25%, 50%, 100%. Each stage should soak long enough to encounter representative traffic patterns; a few hours catches obvious failures, but a full day catches time-zone-dependent patterns and scheduled jobs. Gradual deployments operate at Cloudflare's edge with instant effect, avoiding the propagation delays that DNS-based routing introduces. For Lambda functions that still call AWS services during the transition, Workers VPC Services provide secure connectivity to your VPC without exposing services publicly, and Smart Placement ensures Workers execute near the AWS region hosting your backends, minimising round-trip latency until those backends migrate to Cloudflare.
7. Monitor and compare. Compare latency distributions, error rates, and costs between versions. Geographic distribution changes latency patterns in expected ways: a Lambda function in us-east-1 serving a European user at 180ms might become a Worker serving from Frankfurt at 15ms, but the same user's requests to a backend in us-east-1 might show similar total latency until data migration completes. Understand which differences are improvements, which are expected consequences of the new architecture, and which indicate bugs.
8. Complete cutover. When metrics confirm the Worker performs as well or better across all traffic stages, route 100% to Workers. Keep Lambda functions deployed but not receiving traffic for rollback.
9. Decommission. After two to four weeks of stable operation, delete Lambda functions. The cost of maintaining dormant functions is minimal compared to having no rollback option.
What the economics look like after migration
Lambda charges for GB-seconds, which is memory allocation multiplied by wall-clock execution time. A function configured at 1 GB running for 500ms costs 0.5 GB-seconds regardless of whether the function spent 490ms waiting for a database response. Workers charge only for CPU milliseconds, the time your code actually executes on a processor rather than wall time. The same function, rewritten as a Worker, might consume 10ms of CPU time while waiting 490ms for a database response, and you pay for 10ms instead of 500ms.
For I/O-heavy workloads (API orchestration, webhook processing, database-backed APIs), this difference compounds. A Lambda function making five external API calls averaging 200ms each has a wall time of at least one second. At 1 GB memory, that's 1 GB-second per invocation. The equivalent Worker, consuming perhaps 15ms of CPU time across those calls, costs a fraction of that for the compute component alone.
Lambda's API Gateway adds $3.50 per million requests for REST APIs ($1.00 for HTTP APIs) on top of compute costs. Workers include HTTP handling in the base pricing; no separate API Gateway charge applies. For an API handling 50 million monthly requests through a REST API, this single line item accounts for $175 per month that disappears after migration.
The gap narrows for compute-intensive workloads. A function spending most of its execution time on computation rather than I/O pays for that CPU time on both platforms. Workers' per-CPU-millisecond pricing exceeds Lambda's GB-second pricing for sustained computation, particularly when Lambda functions are configured with higher memory allocations that include proportionally more CPU power. Model your specific workload rather than assuming Workers are always cheaper; the pricing advantage is architectural, not universal.
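To make the comparison concrete, here is a sketch of the cost model using list prices at the time of writing (verify against the current Lambda and Workers pricing pages; included allowances and the Workers subscription fee are ignored for simplicity):

```typescript
// List prices at the time of writing -- verify before relying on them:
// Lambda: $0.0000166667 per GB-second + $0.20 per million requests.
// Workers (Standard): $0.02 per million CPU-ms + $0.30 per million requests.
const LAMBDA_PER_GB_SECOND = 0.0000166667;
const LAMBDA_PER_MILLION_REQUESTS = 0.2;
const WORKERS_PER_MILLION_CPU_MS = 0.02;
const WORKERS_PER_MILLION_REQUESTS = 0.3;

// Lambda bills memory allocation multiplied by wall-clock time
function lambdaCost(requests: number, wallMs: number, memoryGb: number): number {
  const gbSeconds = requests * (wallMs / 1000) * memoryGb;
  return gbSeconds * LAMBDA_PER_GB_SECOND
    + (requests / 1e6) * LAMBDA_PER_MILLION_REQUESTS;
}

// Workers bill only CPU time, not time spent waiting on I/O
function workersCost(requests: number, cpuMs: number): number {
  return (requests * cpuMs / 1e6) * WORKERS_PER_MILLION_CPU_MS
    + (requests / 1e6) * WORKERS_PER_MILLION_REQUESTS;
}

// The I/O-bound example from above: 1 GB Lambda with 500ms wall time,
// versus a Worker spending 10ms of CPU across the same work.
const monthlyRequests = 10_000_000;
const lambda = lambdaCost(monthlyRequests, 500, 1); // dominated by GB-seconds
const worker = workersCost(monthlyRequests, 10);    // dominated by requests
```

Running this for ten million monthly requests shows the Lambda bill dominated by GB-seconds for time the function spent waiting, while the Workers bill reflects only the CPU actually consumed.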
Playbook: Vercel/Netlify to Workers
Vercel and Netlify provide managed deployment for frameworks like Next.js, SvelteKit, and Astro. Migration to Cloudflare means moving from managed platform to managed infrastructure with more control, more configuration, and different trade-offs.
What changes
Deployment model: Vercel and Netlify infer configuration from your framework (push to git and deployment happens), whereas Cloudflare requires explicit wrangler.toml configuration. The magic is replaced with configuration files you control.
Environment handling: Vercel's environment variables are managed through their dashboard. Cloudflare uses Wrangler secrets for sensitive values and wrangler.toml vars for non-sensitive configuration.
Serverless functions: Vercel Functions and Netlify Functions have their own conventions: file-based routing, specific handler signatures. These become Workers with explicit routing. Translation is straightforward but requires touching every function.
Edge functions: Vercel and Netlify Edge Functions are conceptually similar to Workers but have API differences. The concepts transfer; the syntax doesn't.
Data services: Vercel KV maps to Workers KV. Vercel Postgres migrates to D1 or connects through Hyperdrive. Blobs map to R2. Plan these as part of the overall effort.
The process
1. Choose your framework's Cloudflare path:
- Next.js: `@opennext/cloudflare` or `@cloudflare/next-on-pages` (check current documentation)
- React Router v7 (Remix): Official Cloudflare Vite plugin
- SvelteKit: `@sveltejs/adapter-cloudflare`
- Astro: `@astrojs/cloudflare`
- Nuxt: Nitro `cloudflare` preset
2. Create Cloudflare resources. Before migrating code, create needed infrastructure: KV namespaces, D1 databases or Hyperdrive connections, R2 buckets, Queues.
3. Configure wrangler.toml. Define bindings, build commands, and output configuration. This replaces Vercel's implicit configuration.
```toml
name = "my-app"
compatibility_date = "2025-01-15"
compatibility_flags = ["nodejs_compat"]

[build]
command = "npm run build"

[[kv_namespaces]]
binding = "CACHE"
id = "your-kv-namespace-id"

[[d1_databases]]
binding = "DB"
database_name = "my-app-db"
database_id = "your-d1-database-id"
```
4. Adapt environment variables. Move secrets using wrangler secret put. Move non-secret configuration to wrangler.toml vars or .dev.vars.
5. Adapt API routes and middleware. Vercel API routes become Workers handlers. Request/response model is similar but not identical.
6. Test thoroughly. Deploy to a preview URL. Test every route, API endpoint, authentication flow, edge case. Framework adapters handle most translation, but subtle differences surface in testing.
7. Configure DNS. If your domain is on Cloudflare, configure Workers routes. Otherwise, transfer the domain or serve from a workers.dev subdomain initially.
8. Compare metrics. Run Cloudflare alongside Vercel. Compare Core Web Vitals, time to first byte, error rates. Investigate differences before completing migration.
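As an illustration of the API route translation in step 5, here is a hypothetical Vercel `/api/hello` route rewritten as an explicit route inside a Worker fetch handler. The route table and handler shapes are illustrative, not a fixed convention; framework adapters generate equivalent plumbing for you:

```typescript
// File-based routing (api/hello.ts on Vercel) becomes an explicit route
// table inside a single Worker. Route paths and handlers are illustrative.
type Handler = (request: Request) => Response | Promise<Response>;

const routes: Record<string, Handler> = {
  "/api/hello": () =>
    new Response(JSON.stringify({ message: "hello" }), {
      headers: { "content-type": "application/json" },
    }),
};

// Pure route lookup, testable without a running Worker.
function matchRoute(pathname: string): Handler | undefined {
  return routes[pathname];
}

export default {
  async fetch(request: Request): Promise<Response> {
    const { pathname } = new URL(request.url);
    const handler = matchRoute(pathname);
    return handler
      ? handler(request)
      : new Response("Not found", { status: 404 });
  },
};
```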
Reasons not to migrate
Migration from Vercel or Netlify isn't always beneficial. Consider staying if:
Framework integration is tighter elsewhere. Some Vercel features (automatic ISR, preview deployments with database branching, integrated analytics) don't have direct Cloudflare equivalents. If these are central to your workflow, migration removes capabilities you depend on.
Managed simplicity has value. Vercel and Netlify excel at removing decisions. Cloudflare provides more control but requires more configuration. If your team values the managed experience and the current platform works well, migration adds operational burden without proportionate benefit.
Cost comparison favours the current platform. Don't assume Cloudflare is cheaper. Compare actual costs at your traffic levels, including engineering time for migration.
Migration effort exceeds benefit timeline. If migration takes three months and you're uncertain about the product's future beyond six months, payback may not justify investment.
Migrate when Cloudflare provides specific advantages you need: lower latency through global distribution, Durable Objects for coordination, Workers AI for inference, or demonstrated cost savings at your scale. Don't migrate because it feels like progress.
Playbook: Containers to Cloudflare
Organisations running containers on ECS, Kubernetes, or other orchestration platforms have two migration targets: Workers serve workloads that fit the isolate model, while Cloudflare Containers serve workloads that genuinely need container capabilities.
Deciding the target
Most containerised workloads can run as Workers. Containers often exist because the team knew containers, not because the workload required them.
Migrate to Workers when:
- Memory usage stays under 128 MB
- Execution completes within time limits
- Dependencies run in V8 (JavaScript, TypeScript, WebAssembly)
- No filesystem persistence required between requests
Migrate to Cloudflare Containers when:
- Memory requirements exceed 128 MB
- The workload requires filesystem access
- Dependencies include native binaries that won't compile to WebAssembly
- Runtime isn't JavaScript (Python ML models, Go services, etc.)
Workers should be the default assumption. Containers are for workloads that don't fit.
Container migration process
For workloads targeting Cloudflare Containers:
1. Optimise for cold starts. Container cold starts are 2-10 seconds; faster than traditional cloud containers but slower than Workers' sub-5ms. Minimise image size with slim base images and multi-stage builds. Defer initialisation where possible.
2. Configure the container in wrangler.toml:
```toml
[[containers]]
class_name = "MyContainer"
image = "./Dockerfile"
max_instances = 10
```
3. Create the routing Worker. Workers route requests to Containers, deciding which requests need container processing and which can be handled at the edge.
4. Handle the cold start experience. Unlike Workers, container cold starts are user-visible. Address through loading indicators, optimistic UI updates, or architectural changes that pre-warm containers for predictable traffic.
5. Deploy and test with realistic traffic patterns, paying attention to cold start frequency and duration.
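The routing decision in step 3 can be kept as a small, testable function. A minimal sketch, assuming an illustrative rule (paths under `/render` need the container) and leaving the actual forwarding to the container binding configured above, whose exact API should be checked against current Containers documentation:

```typescript
// Classify each request: serve what fits the isolate model at the edge,
// forward only genuine container work. The /render rule is illustrative.
function needsContainer(pathname: string): boolean {
  return pathname.startsWith("/render");
}

export default {
  async fetch(request: Request): Promise<Response> {
    const { pathname } = new URL(request.url);
    if (needsContainer(pathname)) {
      // In a real deployment, forward to the container instance via the
      // binding created by the [[containers]] configuration.
      return new Response("container work (sketch placeholder)", { status: 501 });
    }
    return new Response("handled at the edge");
  },
};
```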
Containers on Cloudflare serve workloads needing container capabilities deployed globally with Cloudflare's network benefits. They're not a general container platform.
Playbook: database migration
Compute migration often triggers database migration because Workers connecting to RDS in us-east-1 inherit that latency regardless of where the Worker runs. But database migration is not always the answer, and when it is, the target depends on your data model and access patterns.
The fundamental decision: migrate or connect
Before planning how to migrate data, decide whether to migrate it at all. Consider these factors:
Use Hyperdrive (keep your existing database) when:
- Your PostgreSQL or MySQL database works well and you've invested in its schema, indexes, and operational tooling
- Data volume exceeds D1's 10 GB limit per database and sharding adds unwanted complexity
- Your application relies on PostgreSQL-specific features: advanced JSON operators, full-text search with ranking, PostGIS spatial queries, stored procedures, or triggers
- Regulatory requirements mandate specific database platforms or locations
- The database serves multiple applications, not just the workload you're migrating
Hyperdrive provides connection pooling, prepared statement caching, and global connection reuse. Workers connect through Hyperdrive; the database stays where it is. This is a valid permanent architecture, not a migration stepping stone.
Migrate to D1 when:
- You want edge-native data access without cross-region latency
- Your data model fits SQLite's capabilities
- Total data per logical database stays under 10 GB
- You're building new applications or rebuilding existing ones
- Horizontal partitioning (database per tenant, per region, or per entity type) matches your domain
Migrate to KV when:
- Data is key-value shaped with simple access patterns
- Eventual consistency (up to 60 seconds) is acceptable
- You're replacing DynamoDB, Redis, or similar stores used primarily for caching or simple lookups
Schema translation: PostgreSQL and MySQL to D1
D1 runs SQLite. Most SQL translates directly, but PostgreSQL and MySQL features that don't exist in SQLite require application changes.
Data types that need translation:
| PostgreSQL/MySQL | SQLite/D1 | Notes |
|---|---|---|
| SERIAL, AUTO_INCREMENT | INTEGER PRIMARY KEY | SQLite auto-increments INTEGER PRIMARY KEY automatically |
| BOOLEAN | INTEGER | Use 0 and 1; SQLite has no native boolean |
| TIMESTAMP WITH TIME ZONE | TEXT or INTEGER | Store as ISO 8601 strings or Unix timestamps |
| JSON, JSONB | TEXT | SQLite stores JSON as text; use json_extract() for queries |
| ARRAY | TEXT or separate table | No native arrays; serialise or normalise |
| UUID | TEXT or BLOB | Store as string or 16-byte blob |
| ENUM | TEXT with CHECK constraint | No native enums |
| DECIMAL, NUMERIC | REAL or TEXT | REAL for calculations; TEXT for exact decimal preservation |
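Transformation scripts end up encoding the table above as small helpers. A sketch, assuming the common choices (booleans as 0/1, timestamps as ISO 8601 text, arrays serialised to JSON text):

```typescript
// PostgreSQL/MySQL value -> SQLite-compatible value, per the table above.

// BOOLEAN -> INTEGER (SQLite has no native boolean)
function toSqliteBoolean(value: boolean): number {
  return value ? 1 : 0;
}

// TIMESTAMP WITH TIME ZONE -> TEXT as ISO 8601
function toSqliteTimestamp(value: Date): string {
  return value.toISOString();
}

// ARRAY -> TEXT as JSON; read back with json_extract() or JSON.parse
function toSqliteArray<T>(values: T[]): string {
  return JSON.stringify(values);
}
```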
Features that don't translate:
Stored procedures, triggers, and functions don't exist in SQLite. Logic that runs inside the database must move to application code. This is often beneficial: business logic in Workers is testable, version-controlled, and doesn't hide in database definitions. But it requires rewriting, not just schema translation.
Foreign key constraints exist in SQLite but are disabled by default. D1 enables them, but behaviour differs subtly from PostgreSQL. Test constraint violations explicitly; don't assume identical behaviour.
Full-text search exists in SQLite via FTS5, but the syntax and ranking differ from PostgreSQL's tsvector/tsquery. If search quality matters, consider whether D1's FTS meets requirements or whether Vectorize with semantic search better serves the use case.
D1's horizontal model
D1 databases have a 10 GB limit, which isn't a temporary constraint but rather reflects D1's architecture as Durable Objects underneath. The platform assumes many small databases, not one large one.
When to shard from the start:
If your current database exceeds 10 GB, or will exceed it within a year, plan your D1 architecture around multiple databases. Common patterns:
- Database per tenant for multi-tenant SaaS. Each customer's data lives in isolation. The 10 GB limit applies per tenant, not globally. Cross-tenant queries become impossible, which is often a feature for data isolation.
- Database per entity type when different data has different access patterns. Users in one database, transactions in another, audit logs in a third. Each can scale independently.
- Database per time period for append-heavy data. Current month in one database, archives in others. Query routing adds complexity but prevents unbounded growth.
Routing queries to the right database:
Sharded architectures need routing logic. A Worker determines which database handles each request based on tenant ID, entity type, or date range. This logic is straightforward but must be consistent; routing errors corrupt data.
```typescript
function getDatabaseForTenant(env: Env, tenantId: string): D1Database {
  // Bindings named DB_TENANT_1, DB_TENANT_2, etc.
  const dbName = `DB_TENANT_${tenantId}`;
  // Env has no index signature, so widen the type for the dynamic lookup
  const db = (env as unknown as Record<string, D1Database | undefined>)[dbName];
  if (!db) throw new Error(`No database binding for tenant ${tenantId}`);
  return db;
}
```
For hundreds of tenants, binding-per-tenant becomes unwieldy. Store database IDs in a routing table (itself a D1 database or KV) and use the D1 API to connect dynamically.
Data migration process
For databases moving to D1:
1. Export schema and data. Use pg_dump, mysqldump, or equivalent. Export schema separately from data; you'll modify the schema before importing.
2. Translate the schema. Convert data types, remove unsupported features, add SQLite-specific syntax. Test the schema against an empty D1 database before importing data.
3. Transform the data. Convert boolean values, format timestamps, serialise arrays. Write transformation scripts that produce SQLite-compatible INSERT statements or CSV files.
4. Import incrementally. D1 has request size limits. Batch imports into chunks of a few thousand rows. Use transactions for consistency within batches.
```sh
# Split large SQL files for batch import
split -l 1000 data.sql data_chunk_
for chunk in data_chunk_*; do
  wrangler d1 execute my-database --file="$chunk"
done
```
5. Verify row counts and checksums. Compare counts per table. For critical data, compute checksums on source and destination to verify integrity.
6. Run application tests against D1. Schema compatibility doesn't guarantee application compatibility. Test every query path; SQLite's type affinity and NULL handling differ subtly from PostgreSQL.
7. Deploy with dual-write. Initially write to both old and new databases. Read from the old database. This lets you verify D1 handles production write patterns before switching reads.
8. Switch reads, then remove dual-write. Once D1 proves reliable under production load, switch reads to D1. After a validation period, remove writes to the old database.
DynamoDB to Cloudflare
DynamoDB's key-value and document model maps to either KV or D1, depending on how you use it.
DynamoDB as key-value store → KV:
If you use DynamoDB primarily for simple get/put operations with partition keys, KV is the natural target. Access patterns translate directly:
| DynamoDB | Workers KV |
|---|---|
| GetItem | kv.get(key) |
| PutItem | kv.put(key, value) |
| DeleteItem | kv.delete(key) |
KV's eventual consistency (up to 60 seconds for global propagation) differs from DynamoDB's strongly consistent reads option. If your application uses consistent reads, evaluate whether eventual consistency works or whether Durable Objects better fit the access pattern.
DynamoDB with complex queries → D1:
If you use DynamoDB's Query and Scan operations with sort keys, filters, and secondary indexes, D1's SQL model may actually simplify your code. GSIs and LSIs exist because DynamoDB's query model is limited; SQL handles the same access patterns natively.
Translate DynamoDB's single-table design back to normalised tables. The patterns that optimise DynamoDB access (composite keys, overloaded attributes, sparse indexes) don't apply to SQL databases and make schemas harder to understand.
DynamoDB Streams → Queues:
If you use DynamoDB Streams for change data capture, Workers don't have a direct equivalent for D1. Options include:
- Application-level events: publish to Queues when data changes
- Polling patterns: periodically query for changes using timestamps
- External CDC: if the source remains DynamoDB during transition, process streams in Lambda and forward to Cloudflare
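The application-level events option amounts to publishing a change record after every successful write. A sketch, assuming a hypothetical `CHANGES` Queues producer binding and an illustrative event shape:

```typescript
// Application-level change events replacing DynamoDB Streams: after a
// successful write, publish a change record to a queue for downstream
// consumers. Binding name and event shape are illustrative.
interface ChangeEvent {
  table: string;
  op: "insert" | "update" | "delete";
  key: string;
  at: string; // ISO 8601 timestamp
}

function makeChangeEvent(
  table: string,
  op: ChangeEvent["op"],
  key: string,
): ChangeEvent {
  return { table, op, key, at: new Date().toISOString() };
}

interface Env {
  CHANGES: { send(message: unknown): Promise<void> }; // Queues producer
}

async function updateUser(env: Env, id: string): Promise<void> {
  // ...perform the D1 write here, then publish the change event...
  await env.CHANGES.send(makeChangeEvent("users", "update", id));
}
```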
When not to migrate data
Database migration is expensive. The combination of schema translation, data transformation, application changes, and validation often exceeds compute migration effort by a factor of three or more.
Don't migrate data when:
Hyperdrive solves the latency problem. If your concern is Workers connecting to a distant database, Hyperdrive's connection pooling and caching may provide sufficient improvement without migration risk.
The database serves multiple applications. Migrating data used by systems outside your control creates coordination overhead that rarely justifies the benefit.
You're uncertain about D1's fit. Prototype with Hyperdrive first. If access patterns prove D1-compatible and the migration benefit becomes clear, migrate then. Premature data migration creates rollback complexity that compute migration doesn't.
Compliance requires specific platforms. Some regulations mandate specific database technologies or certifications. Verify D1's compliance status before assuming you can migrate.
Zero-downtime database migration
Database migration is the hardest component to achieve without downtime because it involves the one thing you cannot simply proxy or route around: consistency. Two systems writing the same data must agree on what that data is, and achieving agreement during transition requires careful sequencing.
The Hyperdrive bridge pattern provides the foundation. Start by connecting Workers to your existing database through Hyperdrive. At this point, your application reads and writes to the same database it always has; only the compute layer has changed. This state can persist indefinitely. Many production architectures run permanently with Workers connecting to external PostgreSQL through Hyperdrive, and doing so is neither a compromise nor a temporary measure.
If you decide to migrate to D1, the dual-write pattern provides the zero-downtime path. Deploy a Worker version that writes to both the existing database (via Hyperdrive) and the new D1 database, but reads exclusively from the existing database. This ensures D1 accumulates current data without any risk to production reads. Monitor D1 writes for failures, schema mismatches, or constraint violations; any issues affect only the shadow database, not production.
After the dual-write period validates D1's behaviour under production write patterns, backfill historical data that predates the dual-write start. Import this data from database exports, validating row counts and checksums against the source. Once backfill completes and dual writes have run long enough to cover your confidence threshold (typically one to two weeks of representative traffic), switch reads to D1 using a gradual deployment. Start reading from D1 for 1% of traffic while the remaining 99% continues reading from the existing database. Compare responses between the two paths; differences indicate migration bugs.
Escalate the read percentage through the same staged approach used for compute migration: 1% to 5% to 25% to 50% to 100%, with soak time at each stage. Once 100% of reads come from D1 and you're satisfied with the results, remove the dual-write logic and decommission the Hyperdrive connection.
The entire sequence requires no downtime because at every stage, every read and every write has a valid destination. The complexity lies in the dual-write logic and the comparison infrastructure, not in any service interruption.
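The staged read escalation works best when routing is deterministic: hash a stable identifier so each user consistently reads from the same path while the percentage ramps. A minimal sketch (the hash function is illustrative; any stable hash works):

```typescript
// Map a stable identifier to a bucket in 0-99, so the same user always
// lands in the same bucket as the rollout percentage increases.
function hashToPercent(id: string): number {
  let h = 0;
  for (let i = 0; i < id.length; i++) {
    h = (h * 31 + id.charCodeAt(i)) >>> 0;
  }
  return h % 100;
}

// True when this user's reads should come from D1 at the current stage.
function readsFromD1(userId: string, rolloutPercent: number): boolean {
  return hashToPercent(userId) < rolloutPercent;
}
```

At 1% only users in bucket 0 read from D1; at 100% everyone does, and no user flips back and forth between stages.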
Data migration should follow compute migration, not lead it. Prove the compute layer works with Hyperdrive; then evaluate whether native D1 access justifies migration investment.
Playbook: Redis and ElastiCache to KV and Durable Objects
Redis serves as caching layer, session store, rate limiter, pub/sub broker, and general-purpose coordination tool across most hyperscaler architectures. No single Cloudflare product replaces all of these roles. The migration target depends on which role Redis plays in your system.
Choosing the target
Redis usage falls into distinct categories, each mapping to a different Cloudflare primitive. Most production Redis instances serve multiple categories simultaneously, which means migration involves decomposing Redis into separate concerns handled by separate products.
Caching (read-heavy, tolerance for staleness) maps to KV. If you use Redis to cache database query results, API responses, or computed values with TTLs, KV provides the same pattern with global distribution. KV values can be up to 25 MiB, keys up to 512 bytes, and TTLs as short as 60 seconds. The critical difference is consistency: KV is eventually consistent with propagation taking up to 60 seconds globally, whereas Redis returns the most recent write immediately. For caching, eventual consistency is typically acceptable because cache data is inherently stale.
Session storage maps to KV or Durable Objects depending on your consistency requirements. Read-heavy sessions where occasional staleness is tolerable (a user might briefly see an outdated cart count) work well in KV. Sessions requiring strong consistency (a payment flow where the session must reflect the most recent state at every step) need Durable Objects, which provide single-threaded, strongly consistent access per object.
Rate limiting and counters map to Durable Objects. Atomic increment operations, sliding window rate limiters, and any pattern requiring read-modify-write consistency need Durable Objects. KV's eventual consistency makes it unsuitable for accurate counting; two concurrent increments could both read the same value and produce the same result, losing one increment.
Pub/sub maps to Queues or Durable Objects with WebSockets. Redis pub/sub for fan-out messaging translates to Queues for asynchronous distribution or Durable Objects with WebSocket hibernation for real-time broadcast. Neither is a direct replacement; the programming model changes.
The model differences
Redis operates as a single (or clustered) in-memory store with sub-millisecond latency from co-located clients. KV operates as a globally distributed store with single-digit millisecond reads at the edge but eventual consistency and a maximum of one write per second per key. These are fundamentally different performance profiles, and treating KV as "Redis but distributed" leads to architectural mistakes.
The write rate limit deserves emphasis. Redis handles thousands of writes per second to a single key. KV limits writes to one per second per key, returning 429 errors when exceeded. Applications updating the same key frequently (hit counters, real-time dashboards, high-frequency session updates) cannot use KV without architectural changes. Either distribute writes across multiple keys and aggregate on read, or use Durable Objects for write-heavy state.
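The write-distribution workaround looks like this in outline: spread increments across N shard keys, then sum the shards on read. Shard count and key scheme below are illustrative:

```typescript
// Write sharding for KV's one-write-per-second-per-key limit.
const SHARDS = 16;

function shardKeyFor(counter: string, shard: number): string {
  return `${counter}:shard:${shard}`;
}

// Each increment targets a random shard, spreading write load across keys.
function randomShardKey(counter: string): string {
  return shardKeyFor(counter, Math.floor(Math.random() * SHARDS));
}

// Aggregate on read: sum the per-shard values (each fetched from KV).
function totalOf(shardValues: number[]): number {
  return shardValues.reduce((sum, v) => sum + v, 0);
}
```

The read pays for N lookups and the count remains eventually consistent; when that is unacceptable, a Durable Object counter is the right tool.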
Durable Objects provide strong consistency and effectively unlimited write rates per object, but each object lives in a single location. A rate limiter implemented as a Durable Object provides perfect accuracy but requires all requests to route to that object's location. For global rate limiting, this introduces latency that Redis (co-located with the application) avoids. The tradeoff is accuracy versus latency; Durable Objects give you both only when users are near the object's location.
Migration process
1. Audit Redis usage patterns. Categorise every Redis operation in your codebase: pure caching, session reads, session writes, counters, rate limiting, pub/sub, sorted sets, Lua scripts. Each category migrates differently.
2. Migrate caching first. Caching is the lowest-risk migration because cache misses are handled by design. Deploy Workers that check KV before Redis, writing to KV on cache miss. Over time, KV absorbs the read load while Redis handles decreasing traffic. Since cache data is regenerable, there is no data migration; you simply let KV populate organically.
3. Migrate sessions with dual-write. Write new sessions to both KV (or Durable Objects) and Redis simultaneously. Read from Redis during the validation period; compare reads from both systems. Once you are satisfied that the new store handles sessions correctly, switch reads to the new system. Existing Redis sessions expire naturally through TTL; new sessions exist only in the new store.
4. Migrate coordination patterns last. Rate limiters, counters, and distributed locks require Durable Objects and represent the most significant architectural change. Implement new rate limiting in Durable Objects alongside existing Redis rate limiters. Compare decisions between both systems before trusting Durable Objects alone.
Zero-downtime caching migration
The cache migration pattern produces zero downtime by design because caches exist to improve performance, not to store authoritative data. A cache miss is a performance event, not a failure.
Deploy a Worker that reads from KV first. On cache hit, return immediately. On cache miss, read from the authoritative source (database, API, Redis if it is also caching), write the result to KV asynchronously using ctx.waitUntil(), and return. This pattern means KV starts empty and fills organically based on real traffic. Frequently accessed data migrates first; rarely accessed data migrates on demand. No bulk data copy required.
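The read-through pattern above can be sketched against a minimal KV interface so the logic is testable outside a Worker. In a real Worker you would pass `env.CACHE` and wrap the put in `ctx.waitUntil()` rather than awaiting it:

```typescript
// Minimal stand-in for the KV binding interface.
interface KvLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string, opts?: { expirationTtl?: number }): Promise<void>;
}

// Read from cache; on miss, load from the authoritative source and
// populate the cache for subsequent requests.
async function cachedFetch(
  kv: KvLike,
  key: string,
  loader: () => Promise<string>,
  ttlSeconds = 300,
): Promise<{ value: string; fromCache: boolean }> {
  const hit = await kv.get(key);
  if (hit !== null) return { value: hit, fromCache: true };
  const value = await loader();
  await kv.put(key, value, { expirationTtl: ttlSeconds }); // ctx.waitUntil() in production
  return { value, fromCache: false };
}
```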
For session migration, the dual-write pattern described in the zero-downtime database migration section applies. Write to both stores, read from the old, validate, then switch reads. Active sessions continue uninterrupted because both stores contain current data. Sessions created before dual-write started expire naturally through TTL; no backfill needed unless session lifetimes are very long.
The only scenario requiring careful handling is the transition from Redis-based rate limiting to Durable Objects. During the transition, run both rate limiters and use the more restrictive result (if either says "rate limited", enforce the limit). This prevents any requests from bypassing rate limiting during migration, at the cost of slightly more aggressive limiting during the overlap period.
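The "more restrictive result" rule reduces to a single conjunction: a request passes only if both limiters allow it. A sketch of the overlap-period decision:

```typescript
// During migration, consult both the Redis limiter and the Durable Object
// limiter; enforce the stricter answer so nothing bypasses rate limiting.
function combinedAllow(redisAllows: boolean, durableObjectAllows: boolean): boolean {
  return redisAllows && durableObjectAllows;
}
```

Once the Durable Object limiter's decisions consistently match Redis's, drop the Redis consultation and the conjunction collapses to the new limiter alone.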
Playbook: SQS, SNS, and Service Bus to Queues
Message queue migration carries a unique risk: losing messages. Unlike compute or storage migration where the worst outcome is degraded performance, queue migration can lose work that was submitted but never processed. The migration approach must guarantee that every message reaches exactly one consumer regardless of which system handles it.
The model differences
SQS uses a pull model: consumers poll for messages, process them, and explicitly delete them. Visibility timeouts hide messages from other consumers during processing. Dead letter queues catch messages that fail repeatedly. FIFO queues guarantee ordering and exactly-once delivery.
Cloudflare Queues uses a push model by default: Cloudflare invokes your Worker's queue() handler with batches of messages. The Worker processes the batch and returns; successful return acknowledges all messages. Individual messages can be retried by calling message.retry(). Pull-based consumption is also available for consumers outside the Workers ecosystem.
The differences that affect migration planning are significant. Queues lacks FIFO ordering guarantees and message deduplication. If your SQS usage relies on either of these, you need application-level workarounds. Queues does support dead letter queues natively: messages that fail after a configurable number of retries (up to 100) route to a designated DLQ rather than being discarded. This aligns with the SQS pattern, though the configuration differs.
Message size also differs: SQS supports up to 1 MiB per message (or larger with S3 offloading via the Extended Client Library), while Queues supports 128 KB. Messages exceeding 128 KB need a claim-check pattern: store the payload in R2 and send the R2 key as the message body.
Handling missing features
Message ordering. Queues delivers messages approximately in order but does not guarantee strict FIFO sequencing. If your application requires strict ordering per entity (all messages for order #1234 processed in sequence), route messages through a Durable Object keyed by entity ID. The Durable Object receives messages, buffers them, and processes them sequentially. This adds latency but guarantees ordering where it matters.
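The per-entity buffering a Durable Object would perform can be sketched as follows, assuming producers attach a monotonically increasing sequence number per entity (inside a real Durable Object, `expected` and `pending` would live in durable storage, not instance fields):

```typescript
// Hold out-of-order messages and release a contiguous run whenever the
// next expected sequence number arrives.
class SequenceBuffer {
  private expected = 1;
  private pending = new Map<number, string>();

  // Returns the messages now safe to process, in order.
  accept(seq: number, body: string): string[] {
    this.pending.set(seq, body);
    const ready: string[] = [];
    while (this.pending.has(this.expected)) {
      ready.push(this.pending.get(this.expected)!);
      this.pending.delete(this.expected);
      this.expected++;
    }
    return ready;
  }
}
```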
Deduplication. Implement idempotency at the consumer level. Include a unique message ID in each message body, check a D1 table or KV namespace before processing, and skip duplicates. This is a best practice regardless of platform.
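The consumer-side check is a few lines once the seen-ID store exists. A sketch, with an in-memory Set standing in for the D1 table or KV namespace:

```typescript
// Skip messages whose ID has already been processed; record new IDs only
// after the handler succeeds, so a failed handler can be retried.
function processOnce(
  seen: Set<string>,
  messageId: string,
  handler: () => void,
): boolean {
  if (seen.has(messageId)) return false; // duplicate: skip
  handler();
  seen.add(messageId);
  return true;
}
```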
Fan-out (SNS equivalent). SQS paired with SNS provides topic-based fan-out: one message published to a topic reaches multiple queues. Cloudflare Queues has no native topic/subscription model. Implement fan-out in the publishing Worker: when a message needs multiple consumers, publish to multiple Queues explicitly. This is more code but provides explicit control over which consumers receive which messages.
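Explicit fan-out means the publisher owns the subscription table that SNS used to manage. A sketch, with an illustrative type-to-queue mapping:

```typescript
// Producer binding shape (matches the Queues send() producer interface).
interface Producer { send(message: unknown): Promise<void>; }

// The subscription table SNS used to own, now explicit in the publisher.
const subscribers: Record<string, string[]> = {
  "order.created": ["FULFILMENT", "ANALYTICS"],
  "order.cancelled": ["FULFILMENT"],
};

function targetsFor(messageType: string): string[] {
  return subscribers[messageType] ?? [];
}

// Publish one logical message to every subscribed queue.
async function fanOut(
  queues: Record<string, Producer>,
  messageType: string,
  body: unknown,
): Promise<void> {
  await Promise.all(
    targetsFor(messageType).map((name) =>
      queues[name].send({ type: messageType, body }),
    ),
  );
}
```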
Migration process
1. Identify message patterns. Catalogue your SQS queues: throughput (messages per second), message sizes, consumer processing times, dead letter queue volumes, FIFO requirements. Queues supports up to 5,000 messages per second per queue; if any single SQS queue exceeds this, you will need multiple Cloudflare Queues with routing logic.
2. Implement the Cloudflare consumer. Write the Worker queue() handler that processes messages in the same way your SQS consumer does. Include DLQ handling if your SQS queues use dead letter queues. Test with representative message payloads.
3. Dual-publish during transition. Modify your producers to publish to both SQS and Queues simultaneously. Only the SQS consumer processes messages during this phase; the Queues consumer logs receipts to D1 for validation. Compare message counts between the two systems over several days. Any discrepancy indicates a producer or delivery issue to investigate before proceeding. For architectures with many producers, consider deploying an intermediary Worker that receives messages via HTTP and publishes to both SQS and Queues. Producers update their endpoint to the Worker; the Worker handles dual-publishing. This localises the migration logic to a single place and simplifies the eventual removal of SQS publishing.
4. Switch consumers. Disable the SQS consumer and enable the Queues consumer as the primary processor. Keep the SQS queue active and accumulating messages as insurance. If the Queues consumer encounters problems, re-enable the SQS consumer to process the backlog.
5. Drain and decommission. After the Queues consumer proves reliable under production load for one to two weeks, stop dual-publishing. Allow the SQS queue to drain naturally (messages hit their retention period), then delete it.
The process above produces zero downtime because queues are inherently asynchronous and consumers already tolerate processing delays. The dual-publish pattern in step 3 guarantees that every message reaches at least one system. Combined with idempotent consumers, no work is lost and no work is processed twice. The overlap period adds cost (you pay for messages in both systems), but eliminates the risk of a message falling between systems during cutover.
Playbook: Step Functions and Durable Functions to Workflows
Workflow orchestration migration is distinctive because orchestrators manage long-running processes that may span hours or days. You cannot simply cut over mid-execution; running Step Function state machines must complete on the old system while new executions start on the new one.
The model differences
Step Functions uses a declarative state machine model: you define states, transitions, and error handling in JSON (Amazon States Language). Each state is a node in a graph; execution follows edges between nodes. The model is visual, which aids understanding but limits expressiveness for complex conditional logic.
Cloudflare Workflows uses an imperative code model: you write a TypeScript class with a run() method containing sequential steps. Each step.do() call persists its result; if the Workflow fails and restarts, completed steps return their persisted results without re-executing. This model handles complex branching, loops, and dynamic logic naturally because it's just code, but loses the visual clarity of state machine diagrams.
The practical differences extend beyond programming model. Step Functions charges per state transition ($0.025 per 1,000 transitions), making deeply nested workflows expensive. Workflows has no per-step charge; you pay for the Worker invocation and step storage. Step Functions' Express Workflows offer lower latency but sacrifice durability. Cloudflare Workflows provides durability at all times.
Step Functions integrates deeply with AWS services through native integrations (invoke Lambda, read from DynamoDB, send to SQS, all without custom code). Workflows integrates with Cloudflare services through bindings and with external services through fetch(). If your Step Functions workflow orchestrates exclusively AWS services, migration requires replacing native integrations with explicit API calls.
Assessment before migration
Map every state machine to a Workflow class. Simple linear sequences translate directly. Parallel execution branches require care: Step Functions' Parallel state runs branches concurrently, while Workflows executes steps sequentially by default. Use Promise.all() within a single step for concurrency, or design multiple cooperating Workflows.
Check step limits. Workflows supports up to 1,024 steps per execution (sleep steps excluded). Step Functions has no explicit step limit. If your state machines have deeply nested loops or very long sequences, verify the step count fits within 1,024.
Evaluate wait patterns. Step Functions' Wait states pause execution for a specified time. Workflows' step.sleep() and step.sleepUntil() provide equivalent capability. Step Functions' Callback pattern (wait for external token) maps to step.waitForEvent(), which pauses the Workflow until an external system sends a named event.
Identify running executions. Before migration, understand how many Step Functions executions are active and their expected completion times. These must run to completion on the old system; they cannot be transferred mid-execution to Workflows.
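The callback-token pattern mentioned above maps onto waitForEvent(): the Workflow pauses on a named event and resumes when an external system delivers a payload. A toy sketch of the mechanics — the in-memory event bus is invented for illustration; the real runtime persists the paused Workflow durably:

```typescript
// Toy event bus modelling step.waitForEvent(): a workflow pauses on a
// named event and resumes when an external system sends that event.
const waiters = new Map<string, (payload: unknown) => void>();

function waitForEvent<T>(name: string): Promise<T> {
  return new Promise((resolve) => waiters.set(name, resolve as (p: unknown) => void));
}

function sendEvent(name: string, payload: unknown): void {
  const resolve = waiters.get(name);
  if (resolve) {
    waiters.delete(name);
    resolve(payload);
  }
}

async function approvalWorkflow(): Promise<string> {
  // ...earlier steps would run here...
  const approval = await waitForEvent<{ approvedBy: string }>("manager-approval");
  return `approved by ${approval.approvedBy}`;
}

const pending = approvalWorkflow();                    // pauses at the event
sendEvent("manager-approval", { approvedBy: "dana" }); // external callback fires
const outcome = await pending;                         // workflow resumes
```

The Step Functions equivalent requires plumbing a task token through the external system; here the event name plays that role.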
Migration process
1. Translate state machines to Workflow classes. Convert each state machine definition to TypeScript. Task states become step.do() calls. Wait states become step.sleep(). Choice states become if/else or switch statements. Parallel states become Promise.all() over concurrent operations within a single step. Map states become loops.
2. Handle AWS-native integrations. Step Functions' direct integrations with DynamoDB, SQS, SNS, and other AWS services become explicit API calls in Workflows. For integrations with Cloudflare services, use bindings (D1, KV, R2, Queues). For integrations with AWS services still in use during migration, use fetch() with appropriate authentication.
3. Test with production-representative inputs. Workflow behaviour under retry conditions matters more than the happy path. Test step failures, timeout handling, and the behaviour of waitForEvent() with delayed and missing events.
4. Run shadow executions. For each new Step Functions execution, trigger a parallel Workflows execution with the same input. The Workflows execution logs its results without affecting production. Compare outcomes between the two systems over one to two weeks.
5. Switch new executions to Workflows. After shadow execution validates correctness, route new executions to Workflows. Existing Step Functions executions continue running to completion on the old system; running executions cannot be migrated mid-flight, and this is a property of durable execution systems rather than a limitation to engineer around. This coexistence period can last days or weeks depending on the duration of your longest-running workflows. If your longest Step Functions execution typically runs for 48 hours, plan for at least 48 hours of coexistence after routing 100% of new executions to Workflows. In practice, add a generous buffer: a workflow that typically runs 48 hours might occasionally run for a week due to retries or external delays.
6. Decommission Step Functions. Once all legacy executions complete (monitor through the Step Functions console or API), disable the state machines. Maintain them in a stopped state for 30 days before deletion.
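Putting the translation rules from step 1 together, a small state machine (Task → Choice → Task) might become something like the following sketch. The `Step` interface is a simplified stand-in for the real Workflows types, and the step bodies and names are illustrative:

```typescript
// Simplified stand-in for the Workflows step interface (illustration only).
interface Step {
  do<T>(name: string, fn: () => Promise<T>): Promise<T>;
}

// A Step Functions machine of Task("score") -> Choice -> Task("approve"/"review")
// translated imperatively: Tasks become step.do(), the Choice becomes if/else.
async function runLoanWorkflow(step: Step, input: { amount: number }): Promise<string> {
  const score = await step.do("score-application", async () => {
    return input.amount <= 10_000 ? 720 : 610; // stand-in for a real scoring call
  });

  if (score >= 650) {
    return step.do("auto-approve", async () => "approved");
  }
  return step.do("manual-review", async () => "queued-for-review");
}

// A pass-through step implementation, enough to exercise the branching logic:
const step: Step = { do: (_name, fn) => fn() };
const small = await runLoanWorkflow(step, { amount: 5_000 });
const large = await runLoanWorkflow(step, { amount: 50_000 });
```

Notice that the Choice state's JSON comparison rules collapse into an ordinary `if`, which is exactly where the imperative model pays off for complex conditions.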
Workflow migration produces zero downtime naturally because each execution is independent and no execution shares state with another. Old executions finish on Step Functions; new executions start on Workflows. At no point does any execution lack a valid orchestrator.
Playbook: AI inference services to Workers AI, Vectorize, and AI Gateway
AI workload migration differs from other migrations because the services being replaced are not functionally equivalent. Workers AI offers a curated set of open-source models; SageMaker, Bedrock, and Azure OpenAI offer hundreds of models including proprietary ones. Migration is less about replacing infrastructure and more about evaluating whether the available models meet your quality requirements.
When Workers AI fits
Workers AI excels for inference workloads where latency and simplicity matter more than model selection breadth. Running Llama 3.3 70B at the edge provides lower latency than routing to a centralised SageMaker endpoint, and the operational overhead is a single binding rather than endpoint management, model versioning, and instance scaling.
The platform provides text generation (Llama, Mistral, Qwen families), embedding generation (BGE family supporting multiple languages), image generation (Flux, Stable Diffusion), speech-to-text (Whisper), text classification, and summarisation. For applications where these models provide sufficient quality, Workers AI eliminates the infrastructure management that SageMaker and Bedrock require.
When it doesn't
If your application depends on GPT-4, Claude, Gemini, or other proprietary models, Workers AI cannot replace them because it serves open-source models only. If you have fine-tuned models trained on domain-specific data, Workers AI does not currently support custom model deployment. If your quality benchmarks require models larger than those available on Workers AI, you will need to continue using external providers.
This is where AI Gateway becomes valuable. AI Gateway provides a unified proxy that routes requests to any supported provider (Workers AI, OpenAI, Anthropic, Azure OpenAI, Bedrock, Google AI Studio, and others) through a single endpoint. Rather than replacing your inference provider, AI Gateway gives you logging, caching, rate limiting, and fallback routing across all providers. Migration to AI Gateway provides operational benefits without requiring model changes.
Vector database migration
Vectorize replaces managed vector databases (Pinecone, Weaviate hosted, OpenSearch with knn plugin, pgvector) for applications within its capability envelope. Vectorize indexes support up to 10 million vectors with up to 1,536 dimensions, which covers most embedding models in common use.
The migration path for vector data is conceptual rather than operational. Re-embed your source documents using the embedding model you will use with Vectorize (the BGE family on Workers AI, or external models through AI Gateway). Insert the new embeddings into a Vectorize index. Compare search quality between old and new systems by running identical queries against both and evaluating relevance.
Re-embedding is necessary because embeddings are model-specific; vectors from one model are meaningless in an index built for another. If you are also changing embedding models during migration (for example, moving from a proprietary embedding API to BGE), quality comparison must account for the model change as well as the infrastructure change.
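One simple way to quantify the quality comparison is top-K overlap: run the same query against both systems and measure how much the two result sets agree. A sketch using Jaccard overlap of result IDs — the document IDs are invented, and any acceptance threshold you set on the score is a judgement call, not a standard:

```typescript
// Compare results from the old vector store and the new Vectorize index by
// measuring how much their top-K result sets overlap (Jaccard similarity).
function topKOverlap(oldIds: string[], newIds: string[]): number {
  const a = new Set(oldIds);
  const b = new Set(newIds);
  let shared = 0;
  for (const id of a) if (b.has(id)) shared++;
  const union = a.size + b.size - shared;
  return union === 0 ? 1 : shared / union;
}

// The same query run against both systems, compared document-ID by document-ID:
const fromOldIndex = ["doc-1", "doc-2", "doc-3", "doc-4"];
const fromVectorize = ["doc-1", "doc-3", "doc-2", "doc-9"];
const overlap = topKOverlap(fromOldIndex, fromVectorize); // 3 shared of 5 distinct
```

Averaged over a representative query set, this gives a single number to track as you tune the new index; low overlap is a prompt for human relevance review, not automatically a failure.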
Migration process
1. Deploy AI Gateway first. Route all existing inference requests through AI Gateway, pointing at your current provider. This adds logging, cost tracking, and caching without changing providers. The immediate benefit is visibility: you now know exactly how many inference requests you serve, at what latency, and at what cost.
2. Evaluate model quality. Run the same prompts through Workers AI models and your current provider. Compare output quality for your specific use case. Automated evaluation (BLEU scores, embedding similarity, classification accuracy) provides quantitative comparison; human evaluation provides qualitative assessment. If Workers AI models are insufficient, keep your current provider behind AI Gateway and skip steps 3 and 4.
3. Configure fallback routing. Set up AI Gateway to route requests to Workers AI with fallback to your current provider. If Workers AI returns an error or times out, the request automatically falls through to the fallback. This gives you the latency benefits of edge inference for successful requests while maintaining reliability through fallback.
4. Shift traffic gradually. Increase the proportion of requests routed to Workers AI as primary, monitoring quality metrics. AI Gateway's analytics show per-provider latency, error rates, and costs. If quality or latency degrades, adjust routing without code changes.
5. Migrate vector search independently. If moving to Vectorize, re-embed your corpus and build a new index. Run comparison queries against both old and new vector stores. Switch when search relevance meets your threshold.
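The gradual shift in step 4 happens in AI Gateway configuration, but the underlying idea is easy to sketch in code: route a deterministic percentage of requests to the new primary, keyed on a stable request attribute so any given user consistently gets the same path. The hash function and key scheme here are illustrative:

```typescript
// Deterministically route a percentage of requests to the new provider,
// keyed on a stable identifier so a given user always takes the same path.
function routeToNewProvider(requestKey: string, percentage: number): boolean {
  let hash = 0;
  for (const ch of requestKey) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple rolling hash
  }
  return hash % 100 < percentage;
}

const keys = Array.from({ length: 1000 }, (_, i) => `user-${i}`);
const atTen = keys.filter((k) => routeToNewProvider(k, 10)).length;     // ~10%
const atHundred = keys.filter((k) => routeToNewProvider(k, 100)).length; // all
const stable = routeToNewProvider("user-42", 10) === routeToNewProvider("user-42", 10);
```

Keying on a user or session identifier (rather than random sampling per request) keeps each user's experience consistent during the rollout, which makes quality regressions far easier to attribute.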
Zero-downtime AI migration
AI inference migration has a natural zero-downtime pattern because inference requests are stateless and independent. Each request contains its full context; there is no session state or ordering dependency between requests.
AI Gateway provides the zero-downtime mechanism. Configure it as a proxy in front of your current provider, then gradually add Workers AI as a primary route with your current provider as fallback. At no point does any request fail: if Workers AI is unavailable or returns an error, AI Gateway falls back to the configured alternative. The fallback adds latency (the request must fail on Workers AI before falling through), but no request goes unanswered.
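The fallback mechanism reduces to try-primary-then-fallback. A sketch with mock providers — AI Gateway performs this routing for you at the proxy layer, so this logic lives in configuration rather than your application code:

```typescript
// Mock providers standing in for Workers AI (primary) and the incumbent
// provider (fallback). AI Gateway implements this pattern at the proxy layer.
type Provider = (prompt: string) => Promise<string>;

const primary: Provider = async (prompt) => {
  if (prompt.includes("unsupported")) throw new Error("model error");
  return `primary: ${prompt}`;
};

const fallback: Provider = async (prompt) => `fallback: ${prompt}`;

async function inferWithFallback(prompt: string): Promise<string> {
  try {
    return await primary(prompt); // edge inference when it succeeds
  } catch {
    return fallback(prompt);      // no request goes unanswered
  }
}

const ok = await inferWithFallback("summarise this");
const rescued = await inferWithFallback("unsupported modality");
```

The failure path is where the extra latency mentioned above comes from: the rescued request pays for both attempts.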
For quality-sensitive applications, shadow mode provides additional safety. Route 100% of traffic to your current provider while simultaneously sending copies to Workers AI. Compare outputs without affecting production responses. This reveals quality differences before any user sees a Workers AI response, allowing you to evaluate fit without risk.
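Shadow mode reduces to: answer from the incumbent, record the candidate's answer out of band. A sketch with mock providers and an in-memory diff log — in a real Worker you would issue the shadow call inside ctx.waitUntil() so it never delays the response:

```typescript
// Shadow mode: production traffic is answered by the current provider while
// a copy goes to the candidate; differences are logged, never served.
type ShadowProvider = (prompt: string) => Promise<string>;

const current: ShadowProvider = async (p) => `current answer to: ${p}`;
const candidate: ShadowProvider = async (p) => `candidate answer to: ${p}`;

const diffs: Array<{ prompt: string; match: boolean }> = [];

async function serveWithShadow(prompt: string): Promise<string> {
  const [served, shadow] = await Promise.all([current(prompt), candidate(prompt)]);
  diffs.push({ prompt, match: served === shadow }); // compared offline, not served
  return served; // the user only ever sees the current provider's output
}

const response = await serveWithShadow("what is edge inference?");
```

The diff log is what you review over the evaluation window; only when it shows acceptable agreement (or acceptable divergence, for generative outputs) does the candidate become a serving path.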
The one scenario requiring care is embedding migration. If you change embedding models, existing vectors in your search index become incompatible with new query embeddings. Re-embed the entire corpus before switching query routing. During the re-embedding period, serve queries from the old index; switch to the new index only after re-embedding completes and quality validation passes. This is a batch operation that can run in the background without affecting production search.
Migration anti-patterns
Some migration approaches fail predictably. These aren't risks to manage; they're failure modes to avoid:
Big bang migration. Moving everything simultaneously maximises risk and minimises learning. Migrate incrementally; learn as you go.
Migration without baselines. If you don't measure current performance, you can't know if migration improved anything. "It feels faster" isn't evidence.
Migration without rollback. Every step should be reversible. Deleting source data before validating migration eliminates your recovery path. Keep rollback options open until confidence is established through production operation, not hope.
Migration as goal. "We migrated to Cloudflare" is an activity, not an achievement. Achievements are measurable improvements: reduced latency, lower costs, simpler operations. If you can't articulate what migration improves in quantifiable terms, question whether you should migrate.
Complete migration when partial suffices. Hybrid architectures are valid long-term states. Migration doesn't have to be total to be valuable.
Underestimating integration complexity. Compute migration is often the easy part. Integrations (authentication services, monitoring systems, CI/CD pipelines) frequently take longer than anticipated. Inventory all integrations before estimating effort.
Getting help
Cloudflare provides migration support for enterprise customers; Solutions Architects help assess migration candidates, design sequences, troubleshoot issues, and validate outcomes.
For non-enterprise migrations, Cloudflare's documentation is comprehensive (though sometimes overwhelming), community forum quality varies, and the Discord is active for specific questions. For edge cases, expect to dig through the documentation or draw on community expertise.
Migration is an investment: the hours spent migrating don't ship features; they recreate existing capability on new infrastructure. Invest wisely by quantifying expected benefits, migrating incrementally, maintaining rollback capability, and measuring outcomes. The goal isn't migration but the improvement migration enables.
What comes next
Chapter 26 closes this book by distilling twenty-five chapters into the principles that matter most. Whether you're starting your first Worker or leading a platform migration, these mental models provide the foundation for building well on Cloudflare.