Chapter 23: Multi-Tenant and Platform Architectures
How do I build a platform that serves multiple tenants securely and efficiently?
Multi-tenancy makes SaaS economics work but is also the architectural decision most likely to wake you at 3am. One codebase serves many customers, each isolated from others while sharing infrastructure; your month-one isolation strategy determines your year-three crises.
Get isolation right and you have a scalable business with predictable costs. Get it wrong and you face failures from noisy neighbour problems to data leaks that end your company; one missing WHERE clause means explaining to Customer A why they can see Customer B's data.
Row-level isolation is a discipline problem disguised as a technical solution. One query without a tenant filter, one ORM misconfiguration, one copy-pasted SQL snippet, and you have a data breach. Database-per-tenant provides architectural isolation: queries physically cannot access other tenants' data because that data exists in a different database.
Cloudflare's architecture naturally fits multi-tenant patterns. Many small databases rather than one large one, per-entity Durable Objects, Workers for Platforms: these primitives assume multi-tenancy as the default. But "naturally fits" doesn't mean "automatically correct". This chapter covers decision frameworks for isolation strategies, their operational implications, and common failure modes.
The isolation ladder
Isolation spans three dimensions: compute, data, and state. Stronger isolation always costs something in complexity, money, or operational overhead, so choose the minimum isolation that meets your actual requirements.
Think of isolation as a four-rung ladder where each step up increases strength, complexity, and cost. Most applications belong on the first or second rung.
Rung one: shared everything with logical separation. Workers execute identical code for all tenants. Authentication establishes tenant identity, which determines data access. All tenant data lives in shared D1 tables with a tenant identifier column; Durable Objects use tenant-prefixed names. Sufficient for most SaaS applications.
Rung two: physical data separation. Workers remain shared, but each tenant gets their own D1 database. State isolation uses the same tenant-prefixed Durable Object pattern. Suits regulated industries, enterprise customers demanding physical separation, or applications needing per-tenant schema customisation.
Rung three: compute separation via Workers for Platforms. Each tenant's code executes in a separate dispatch namespace with resource limits you control. Necessary only when tenants contribute code: integration platforms, webhook processors, extensible applications.
Rung four: complete physical separation. Dedicated Workers for Platforms accounts, dedicated databases, dedicated everything. Rare, expensive, and typically driven by contractual requirements. Government contracts and highly regulated financial services occasionally demand it.
The dimensions are independent. You can share compute with separated databases, or separate compute with shared databases. Most architectures sit cleanly on one rung, but understanding this independence helps when requirements don't fit neatly.
Choosing your data isolation strategy
Data isolation is where the hard decisions live. Two patterns dominate: row-level isolation in a shared database, and database-per-tenant. Each has profound implications for security, operations, and cost. Understand both thoroughly before committing.
Row-level isolation
Row-level isolation stores all tenant data in shared tables with a tenant identifier column, requiring every query to filter by tenant: add tenant_id to every table, include it in every relevant index, and enforce it in every WHERE clause.
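In a Worker backed by D1, the discipline looks like this; the DB binding and projects table are illustrative, not a prescribed schema:

// Every query carries the tenant filter explicitly; forget it once and the
// query silently returns other tenants' rows.
const projects = await env.DB.prepare(
  "SELECT id, name, created_at FROM projects WHERE tenant_id = ? ORDER BY created_at DESC"
).bind(tenantId).all();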
The most compelling case for row-level isolation is cross-tenant analytics. If your product includes reports comparing tenant performance to benchmarks or aggregating usage across your customer base, row-level isolation makes these queries trivial (a single SQL statement computes averages, percentiles, or aggregations across everyone), whereas database-per-tenant requires iterating through every database and aggregating results in application code.
The cost structure also favours row-level isolation at scale. One D1 database versus ten thousand affects both your bill and operational surface area. Monitoring one database is simpler than monitoring ten thousand, even with excellent automation.
Database-per-tenant
Database-per-tenant gives each tenant their own D1 database. Each binding connects to exactly one tenant's database, so a query cannot reach another tenant's data no matter what its WHERE clause says; the isolation is architectural rather than procedural.
This pattern suits applications where data sensitivity is high, regulatory requirements mandate physical separation, or per-tenant schema customisation is necessary. Healthcare, finance, and government contracts often require physical isolation rather than mere logical separation, allowing you to point to infrastructure and say "this database contains only this tenant's data" without relying on application logic correctness.
Schema flexibility is another driver. If tenants need custom fields, different indexing strategies, or schema variations, database-per-tenant accommodates this naturally. Each database evolves independently. Row-level isolation forces a single schema on all tenants, with customisation limited to nullable columns or JSON fields.
The operational cost is substantial. Schema migrations become distributed systems problems. Adding a column means running ten thousand ALTER TABLEs, tracking which succeeded, handling failures, and managing the period where different tenants run different schema versions. This requires tooling, monitoring, and operational discipline that row-level isolation doesn't demand.
The economics
Creating a thousand D1 databases costs less than one RDS instance; this isn't a workaround for D1's 10 GB limit but the intended architecture. Cloudflare assumes many databases where hyperscalers assume few. A single db.t3.medium RDS instance costs roughly $30 per month before storage, I/O, or backup, whereas a thousand D1 databases with modest usage (10 million reads and 1 million writes monthly across all tenants) cost roughly $10 total. The economics differ by orders of magnitude: database-per-tenant is virtually free at the infrastructure level, so the decision turns on operational readiness rather than spend.
On AWS, database-per-tenant typically means expensive RDS instances per tenant or DynamoDB with per-table isolation, neither cheap at scale. Cloudflare's pricing model makes database-per-tenant economically viable for applications where hyperscaler costs would be prohibitive.
Making the decision
Neither pattern is universally superior. Row-level isolation fits when data sensitivity is moderate, cross-tenant operations are common, and engineering resources for migration tooling are limited. If your product requires heavy cross-tenant analytics or you have thousands of tenants with limited operational capacity, row-level isolation is probably correct.
Database-per-tenant fits when data sensitivity is high, regulatory requirements mandate physical separation, per-tenant schema customisation is necessary, or breach consequences justify the operational investment. If you're in a regulated industry or serving enterprise customers with strict security requirements, database-per-tenant is probably correct.
Tenant count matters but isn't absolute. Managing schema migrations across fifty thousand databases differs qualitatively from managing fifty. The tooling investment scales stepwise as you cross operational thresholds. At a hundred tenants, manual intervention during migrations is annoying but feasible. At ten thousand, it's impossible.
Consider breach consequences honestly. For some applications, a data leak is embarrassing and expensive but survivable. For others, such as healthcare, finance, or applications handling children's data, a breach is potentially fatal. When the downside is catastrophic, database-per-tenant is insurance worth paying.
Tenant tiering
Many platforms use different isolation levels for different tenant tiers. Free tenants share a database; enterprise tenants get dedicated databases. This resolves the row-level versus database-per-tenant tension for applications where tenant value varies dramatically.
The pattern aligns isolation cost with tenant value. Free users who might churn tomorrow don't justify database provisioning complexity. Enterprise customers (typically those paying $10,000 or more annually) justify the operational overhead and often contractually require physical separation.
Implementation requires routing logic that determines database binding based on tenant tier; shared database for standard tenants, dedicated database for enterprise tenants. This adds complexity but concentrates operational overhead on tenants generating revenue to justify it.
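A sketch of that routing, assuming enterprise tenants carry the name of a dedicated D1 binding in their metadata while everyone else shares one database; SHARED_DB and the TenantMeta shape are illustrative:

// Hypothetical tier-based routing: standard tenants share one database with
// row-level isolation, enterprise tenants resolve to a dedicated binding.
interface TenantMeta {
  tier: "standard" | "enterprise";
  dbBinding?: string; // name of the dedicated D1 binding, if any
}

function resolveDatabase(env: Env, meta: TenantMeta): D1Database {
  if (meta.tier === "enterprise" && meta.dbBinding) {
    // Dedicated bindings are declared in Wrangler config and looked up by name.
    return (env as unknown as Record<string, D1Database>)[meta.dbBinding];
  }
  return env.SHARED_DB; // shared database; tenant_id filtering still applies
}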
Make tiering explicit in your pricing and contracts. Enterprise customers paying for dedicated databases should know they're getting physical isolation. This becomes a selling point, not just an operational detail.
Migration between models
Can you change your mind? Moving from row-level to database-per-tenant requires significant effort: provisioning databases for each tenant, migrating data from shared tables to dedicated databases, updating all application code to use tenant-specific bindings, and handling the transition period where some tenants are migrated and others aren't. Expect weeks to months depending on data volume and application complexity.
Moving from database-per-tenant to row-level is theoretically possible but rarely done. The scenarios driving database-per-tenant adoption (regulatory requirements, enterprise contracts) don't typically reverse.
Start with row-level isolation unless you have specific requirements demanding physical separation. Row-level is simpler to build, cheaper to operate, and migration to database-per-tenant is feasible if requirements change. Starting with database-per-tenant and discovering you didn't need it wastes substantial engineering effort.
Tenant metadata architecture
Tenant metadata, the registry of tenants, their configuration, feature flags, and billing status, differs from tenant data in one crucial way: your application needs it before knowing which tenant database to use. The authentication flow checks credentials, identifies the tenant, retrieves configuration, then routes to their data.
A shared D1 database works well for tenant metadata: tenant registry, domain mappings, feature flags, tier information, billing status. Row-level isolation concerns don't apply because knowing that tenant X exists and is on the enterprise tier reveals little.
Cache this metadata aggressively. Tenant configuration changes rarely; looking it up from D1 on every request wastes resources and adds latency. KV provides an effective caching layer with TTLs measured in minutes: check KV first, fall back to D1 on a cache miss, and write the result back to KV so subsequent requests are served from cache.
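A minimal read-through cache along those lines; the binding names and five-minute TTL are assumptions:

// Read-through cache for tenant metadata: KV first, D1 on a miss, then
// repopulate KV with a short TTL. TENANT_CACHE and METADATA_DB are illustrative.
async function getTenantConfig(tenantId: string, env: Env) {
  const cached = await env.TENANT_CACHE.get(`tenant:${tenantId}`, "json");
  if (cached) return cached;

  const row = await env.METADATA_DB.prepare(
    "SELECT id, tier, features, billing_status FROM tenants WHERE id = ?"
  ).bind(tenantId).first();
  if (!row) return null;

  await env.TENANT_CACHE.put(`tenant:${tenantId}`, JSON.stringify(row), {
    expirationTtl: 300, // five minutes; slightly stale tier or flag data is tolerable
  });
  return row;
}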
Consider what happens when tenant metadata is unavailable. If your metadata database is unreachable, you can't authenticate any requests. Address this through replication, caching with longer TTLs during outages, or graceful degradation that allows cached authentication to continue while metadata is unavailable.
How Cloudflare handles noisy neighbours
Workers isolates provide compute isolation by design. Each request executes in a V8 isolate with memory and CPU limits. A tenant triggering expensive computation consumes their own CPU allocation; they cannot steal CPU cycles from other tenants' requests. The isolate model that enables near-zero cold starts also provides tenant isolation for free.
D1 provides data isolation through its architecture. Each D1 database is a Durable Object running in Cloudflare's infrastructure. One tenant's expensive queries don't slow another tenant's database because they're separate objects, potentially on different machines.
Durable Objects provide similar state isolation. Each object is single-threaded and processes one request at a time, but objects are independent. Heavy load on one tenant's Durable Objects doesn't affect other tenants' objects.
The remaining noisy neighbour vectors are at the account level. Workers have per-account CPU limits. Subrequest limits are per-Worker-invocation, not per-tenant. If tenant abuse could exhaust account-level resources, you need application-level rate limiting.
Workers for Platforms addresses this for tenant-contributed code. Each dispatch namespace has configurable resource limits. Runaway code is terminated when it exceeds limits, protecting both your platform and other tenants.
State isolation
Durable Objects provide natural tenant isolation through their naming scheme. Including the tenant identifier in the object name ensures each tenant's state lives in separate objects:
const id = env.SESSION.idFromName(`${tenantId}:${sessionId}`);
This suffices for most applications. The single-threaded, globally-unique nature of Durable Objects means tenant state is already isolated; objects cannot share memory or accidentally access each other's data.
More complex isolation, such as separate Durable Object bindings per tenant, is rarely necessary and adds deployment complexity without meaningful security benefit. Reserve it for regulatory requirements mandating physical separation of compute resources, or when tenant-contributed code needs Durable Object access and you must prevent tenants from constructing object names for other tenants.
Workers for Platforms
Workers for Platforms solves a specific problem: executing tenant-provided code safely. When your platform allows customers to write JavaScript running on your infrastructure (webhooks, custom transformations, integration logic), this is how you isolate their code from yours and from each other.
When configuration suffices
Before reaching for Workers for Platforms, ask whether tenant customisation can be expressed as configuration. Code execution is a capability tax you pay forever; configuration is a one-time design investment.
Most "we need custom code" requirements, examined closely, reduce to "we need more flexible configuration." Tenants want to transform data, apply conditional logic, or customise behaviour. These needs often fit well-designed configuration systems.
Consider field mappings. A tenant wants to transform webhook payloads: "rename field A to field B, extract nested field C.D to top level." JSONPath expressions, field mapping rules, or template strings handle this without code execution.
Consider conditional logic. A tenant wants to route data based on field values: "field X equals Y" or "field X contains Y." A rule engine with predefined operators handles it. You don't need arbitrary JavaScript to evaluate "status equals approved."
Consider formatting. A tenant wants to customise notification messages. Mustache, Handlebars, or simple string interpolation let tenants customise output without arbitrary code.
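How far configuration stretches is easier to see in a sketch combining a field mapping with a predefined-operator rule; the shapes here are invented for illustration:

// Configuration-driven customisation: rename fields and evaluate a simple
// predicate, with no tenant code execution involved.
type Rule = { field: string; op: "equals" | "contains"; value: string };
type Mapping = { from: string; to: string };

function applyConfig(
  payload: Record<string, unknown>,
  mappings: Mapping[],
  rule: Rule,
): Record<string, unknown> | null {
  const actual = String(payload[rule.field] ?? "");
  const matches = rule.op === "equals" ? actual === rule.value : actual.includes(rule.value);
  if (!matches) return null; // the rule filtered this event out

  const out: Record<string, unknown> = { ...payload };
  for (const { from, to } of mappings) {
    out[to] = out[from];
    delete out[from];
  }
  return out;
}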
Workers for Platforms becomes necessary when customisation genuinely requires logic you can't anticipate: calling external APIs based on data values, implementing domain-specific business logic, transforming data in ways requiring loops or computation.
Integration platforms, workflow automation tools, and extensible applications fall into this category. The question isn't whether you can avoid code execution but whether your platform's value proposition requires it.
Architecture and implications
The dispatch model routes requests through your Worker to tenant Workers:
export default {
  async fetch(request: Request, env: Env) {
    const tenantId = getTenantFromRequest(request);
    const tenantWorker = env.TENANT_WORKERS.get(tenantId);
    return tenantWorker.fetch(request);
  }
};
This indirection is the security boundary. Authentication, rate limiting, and request validation happen in your dispatch Worker before tenant code executes. You control what data reaches tenant code and what tenant code can do with responses. But once you hand off, you're trusting the sandbox.
The sandbox provides strong isolation. Tenant code runs in separate V8 isolates with memory limits and CPU time constraints you configure. A malicious tenant cannot access other tenants' code, memory, or data; they cannot escape the isolate.
Trust modes
Workers for Platforms offers two isolation modes with different security implications:
Untrusted mode (the default) provides complete isolation between customer Workers, including separate cache namespaces. The caches.default API is disabled entirely, preventing tenants from polluting or reading each other's cached content. Use this when customers control deployed code (the normal case for integration platforms and extensibility scenarios).
Trusted mode allows shared cache access across the namespace. Only appropriate when you, the platform operator, control all Worker code deployed to the namespace. If you're deploying your own code variants rather than accepting customer code, trusted mode enables performance optimisations impossible with full isolation.
Outbound controls
Tenants can make outbound requests. By default, tenant code can fetch any URL. Outbound Workers intercept these requests, providing bidirectional traffic control:
export default {
  async fetch(request: Request, env: Env) {
    const url = new URL(request.url);

    // Block internal infrastructure
    if (isInternalHost(url.hostname)) {
      return new Response("Access denied", { status: 403 });
    }

    // Rate limit external calls per tenant
    const tenantId = request.headers.get("CF-Worker-Tenant-ID");
    if (!await checkOutboundRateLimit(tenantId, env)) {
      return new Response("Rate limited", { status: 429 });
    }

    // Audit logging
    await logOutboundRequest(tenantId, url, env);

    return fetch(request);
  }
};
Without outbound controls, tenant code can probe your internal network, make requests appearing to originate from your infrastructure, or exfiltrate data to arbitrary destinations. Configure outbound Workers explicitly.
The outbound Worker sees every fetch() call from tenant code. This gives you complete visibility and control over egress traffic: block access to internal services, inject authentication to backend APIs without exposing credentials to tenants, rate limit external calls, and maintain audit logs of all external requests.
The documentation obligation
When you build a platform, you inherit documentation obligations. Cloudflare documents the guarantees, limits, and failure modes of Workers. You must do the same for your platform.
Your tenants can't reason about their code's behaviour without understanding what your platform promises. Document resource limits: CPU time, memory, execution duration. Document available bindings and APIs. Document failure modes: what happens when limits are exceeded, when outbound requests fail, when your infrastructure has issues. Document what you log and monitor.
When a tenant's code fails mysteriously, they need documentation to diagnose the issue. When a tenant wants to push limits, they need to know what limits exist. This documentation isn't optional; it's part of running a platform.
Operational reality
Running tenant code creates operational challenges that shared Workers avoid.
Debugging is harder. When a tenant reports their Worker is failing, you're investigating code you didn't write. Your observability is limited to logs, error messages, and resource consumption. Building good debugging tools for tenants reduces support burden.
Version management is complex. Tenants update their code independently. Platform changes (new APIs, deprecated features, security patches) may break tenant Workers. You need a strategy for communicating changes, providing migration periods, and handling tenants who don't update.
Resource limits require tuning. Too restrictive, and legitimate tenant code fails. Too generous, and one tenant's expensive computation affects platform stability. Start conservative and increase limits based on observed legitimate usage.
Support burden increases. Tenants will ask for help with code that doesn't work. Draw the line between platform support and coding assistance explicitly. "Your code timed out" is platform support. "Why does my JavaScript throw a TypeError" is coding assistance.
Custom domains
Custom domains create operational dependencies on infrastructure you don't control. Enterprise tenants expect their branding (your API accessible at api.tenant.com rather than api.yourplatform.com/tenants/tenant), and customers pay more for white-labelling, but this couples your platform's reliability to your tenants' DNS hygiene.
Cloudflare for SaaS handles certificate issuance, renewal, and routing, while your responsibility covers the tenant experience, operational edge cases, and resilience to failures outside your control.
The routing foundation
Your Worker must identify tenants by domain:
const hostname = new URL(request.url).hostname;
const tenantId = await getTenantByDomain(hostname, env);

if (!tenantId) {
  return new Response("Unknown domain", { status: 404 });
}
Cache domain lookups aggressively; mappings change rarely. KV with TTLs measured in minutes works well. Invalidate explicitly when tenants change domain configuration.
Support fallback patterns. Subdomains of your platform (tenant.yourplatform.com) should coexist with custom domains, providing a working configuration while tenants sort out custom domain setup. When custom domain lookup fails, check whether the request arrived at a platform subdomain and route accordingly.
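A sketch of that fallback, assuming platform subdomains follow the tenant.yourplatform.com convention and reusing the cached custom-domain lookup above:

// Hypothetical fallback: if the hostname isn't a registered custom domain,
// try to extract a tenant slug from a platform subdomain instead.
const PLATFORM_SUFFIX = ".yourplatform.com";

async function resolveTenant(hostname: string, env: Env): Promise<string | null> {
  const byDomain = await getTenantByDomain(hostname, env);
  if (byDomain) return byDomain;
  if (hostname.endsWith(PLATFORM_SUFFIX)) {
    return hostname.slice(0, -PLATFORM_SUFFIX.length); // "acme" from "acme.yourplatform.com"
  }
  return null;
}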
Verification strategy
Adding a custom domain requires the tenant to prove ownership. The verification method affects tenant experience significantly.
HTTP validation is seamless when tenants have already configured DNS to point at your platform. They update DNS, Cloudflare validates automatically, certificates provision, and the domain works. This is the smoothest experience for technical tenants who understand DNS.
DNS TXT validation works before DNS points to your platform. Tenants add a verification record, you confirm it, then they update their A or CNAME records. More complex but allows verification before go-live; suits tenants who want to validate before cutting over production traffic.
For less technical customers, consider the support implications. DNS configuration confuses people who don't do it regularly. Documentation, validation status dashboards, and proactive support for stuck domains improve the experience. Monitor tenants who start custom domain setup but never complete it; they may be stuck.
Resilience to external failures
DNS propagation takes time. Tenants will configure their DNS and immediately complain that custom domains don't work. Set expectations explicitly: propagation takes minutes to hours depending on DNS provider and TTL settings. Provide tools to check configuration status.
Certificate validation can fail for reasons outside your control: DNS misconfigurations, CAA records blocking issuance, propagation delays. Monitor certificate status across tenants and alert on persistent failures. Distinguish expected delays from genuine problems.
Tenants forget to renew their domains. When a tenant's domain expires, their registrar may redirect traffic to a parking page, making your platform look broken. Monitor for unexpected DNS changes, certificate validation failures on previously working domains, or sudden traffic drops. Notify tenants proactively when domain configuration appears broken.
Build your platform to handle custom domain failures gracefully. When a custom domain stops working, the tenant's platform subdomain should still function. Don't rely solely on custom domain routing; maintain subdomain routes as fallback.
Usage tracking and billing
Usage tracking serves three purposes: billing customers accurately, enforcing quotas, and understanding platform capacity, though the precision required differs by purpose.
Accuracy requirements
How accurate does billing need to be? Monthly billing aggregates smooth out momentary inconsistencies; if a tenant makes a thousand requests across ten edge locations simultaneously, your count might be temporarily wrong by a few percent, but these errors average out by month end. For billing, approximate counts aggregated over time usually suffice.
The real risk is systematic bias. If metering consistently under-counts by 2%, you're leaving revenue on the table; if it over-counts, you'll face disputes. Random inconsistency is acceptable; directional bias is not.
For quota enforcement, the precision question differs. If a tenant is limited to ten thousand requests per day, does it matter if they occasionally get ten thousand and fifty? For most applications, soft enforcement is fine. Hard enforcement requires different architecture with higher latency cost.
The metering cost problem
Every usage increment is a storage operation. At high request volumes, metering itself becomes a significant cost and potential bottleneck; metering overhead can exceed the cost of the operations being metered.
A tenant making one million requests per day generates one million metering writes if you increment a counter per request. At D1's pricing, this costs more than the Worker invocations being metered. The metering tail wags the platform dog.
Batching reduces this dramatically. Accumulate counts in memory and flush periodically rather than writing on every request. The tradeoff: if the Worker instance terminates before flushing, those counts are lost. For most applications, losing a few percent of counts is acceptable given the cost savings.
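A minimal sketch of that batching, accumulating per-tenant counts in isolate memory and flushing once a size or age threshold is crossed; the thresholds, USAGE_DB binding, and usage table (with a unique tenant/day constraint) are assumptions, and anything unflushed when the isolate is evicted is lost:

// Per-isolate metering batch: counts accumulate in memory and flush to D1
// when 100 tenants are pending or 30 seconds have passed since the last flush.
const pending = new Map<string, number>();
let lastFlush = Date.now();

function recordRequest(tenantId: string, env: Env, ctx: ExecutionContext): void {
  pending.set(tenantId, (pending.get(tenantId) ?? 0) + 1);
  if (pending.size < 100 && Date.now() - lastFlush < 30_000) return;

  const batch = [...pending.entries()];
  pending.clear();
  lastFlush = Date.now();

  ctx.waitUntil(
    env.USAGE_DB.batch(
      batch.map(([tenant, count]) =>
        env.USAGE_DB.prepare(
          "INSERT INTO usage (tenant_id, day, requests) VALUES (?, date('now'), ?) " +
          "ON CONFLICT (tenant_id, day) DO UPDATE SET requests = requests + excluded.requests"
        ).bind(tenant, count)
      )
    )
  );
}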
Sampling works for high-volume metrics where precision matters less than trends. Sample one in a hundred requests and extrapolate. This reduces metering overhead by 99% while providing sufficient accuracy for capacity planning and billing at scale. Don't sample for quota enforcement or per-tenant billing at low volumes; error margin becomes significant when total counts are small.
Quota enforcement patterns
The naive approach reads current usage, compares to limit, then proceeds or rejects. It adds latency to every request and doesn't work correctly under concurrent load.
Optimistic enforcement reduces latency and handles concurrency. Process the request, then increment usage asynchronously. Check quota periodically or when approaching limits. If a tenant exceeds their quota, the overage is small; handle it through overage charges or rejection on the next request. This works when occasional overage is acceptable, which is most of the time.
For hard limits, use Durable Objects. A per-tenant rate limiting object tracks usage with strong consistency and sub-millisecond latency. Each request increments the counter atomically before proceeding; if the increment would exceed the limit, the request rejects immediately.
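A compact sketch of such an object, one instance per tenant addressed by tenant ID; the fixed limit and daily window handling are simplified for illustration:

// Hypothetical per-tenant quota object: increments atomically and rejects
// once the daily limit is reached. Single-threaded execution makes the
// read-modify-write safe without further locking.
export class TenantQuota {
  constructor(private state: DurableObjectState, private env: Env) {}

  async fetch(request: Request): Promise<Response> {
    const limit = 10_000; // in practice, read the tenant's limit from metadata
    const key = `count:${new Date().toISOString().slice(0, 10)}`;
    const count = (await this.state.storage.get<number>(key)) ?? 0;
    if (count >= limit) {
      return new Response("Quota exceeded", { status: 429 });
    }
    await this.state.storage.put(key, count + 1);
    return new Response("OK");
  }
}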
Reserve Durable Objects for quotas that genuinely must be hard: API rate limits where abuse is a concern, resource limits protecting your infrastructure, or contractual obligations requiring exact enforcement.
Database-per-tenant operations
If you choose database-per-tenant, several operational realities require planning before you have a thousand databases with no tools to manage them.
Migration as distributed systems
Schema migrations become distributed operations. Adding a column means executing the same change across every tenant database, tracking which succeeded, handling failures, and managing the transition period.
Design migrations to be idempotent. A migration that fails partway through and retries must not corrupt data or fail on already-migrated rows. Use IF NOT EXISTS clauses. Check whether changes already exist before applying. Ensure migrations can safely run multiple times.
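An idempotent migration step for a single tenant database might look like the following; note that SQLite has no ADD COLUMN IF NOT EXISTS, so the column addition checks the existing schema first (table and column names are illustrative):

// Safe to run repeatedly: table and index creation use IF NOT EXISTS, and the
// column addition is skipped when the column already exists.
async function applyMigration(db: D1Database): Promise<void> {
  await db.exec("CREATE TABLE IF NOT EXISTS invoices (id TEXT PRIMARY KEY, total INTEGER)");
  await db.exec("CREATE INDEX IF NOT EXISTS idx_invoices_total ON invoices (total)");

  const info = await db.prepare("PRAGMA table_info('invoices')").all();
  const hasCurrency = info.results.some((col) => (col as { name: string }).name === "currency");
  if (!hasCurrency) {
    await db.exec("ALTER TABLE invoices ADD COLUMN currency TEXT DEFAULT 'USD'");
  }
}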
Track migration status per tenant:
await env.MIGRATIONS.put(`${tenantId}:${migrationId}`, JSON.stringify({
  status: 'completed',
  timestamp: Date.now()
}));
Build tooling that expects partial failure. When migration 4,293 fails, you need to know immediately and understand why: data constraint violation, timeout, infrastructure issue. Your tooling should support both stopping to investigate and continuing with remaining tenants.
Consider lazy migration for non-critical changes. Rather than migrating all databases immediately, check schema version on first access and migrate then. This spreads load over time and ensures you only migrate active tenants. Dormant tenants migrate if and when they return.
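Lazy migration reduces to a version check on first access; a sketch, with the migration list and version table as assumptions:

// Hypothetical lazy migration: bring a tenant database forward only when the
// tenant is actually active. Each step is idempotent, as above.
const MIGRATIONS: string[] = [
  "CREATE TABLE IF NOT EXISTS projects (id TEXT PRIMARY KEY, name TEXT)",
  "CREATE INDEX IF NOT EXISTS idx_projects_name ON projects (name)",
];

async function ensureSchema(db: D1Database): Promise<void> {
  await db.exec("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)");
  const row = await db.prepare("SELECT version FROM schema_version LIMIT 1")
    .first<{ version: number }>();
  let current = row?.version ?? 0;

  while (current < MIGRATIONS.length) {
    await db.exec(MIGRATIONS[current]);
    current += 1;
    await db.prepare("INSERT OR REPLACE INTO schema_version (rowid, version) VALUES (1, ?)")
      .bind(current).run();
  }
}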
Cross-tenant analytics
You cannot query across D1 databases; no federation, no cross-database joins. If you need data from multiple tenants, iterate through databases and aggregate in application code.
For reporting and analytics, maintain aggregate tables separately. Process usage data from tenant databases into a shared analytics database periodically. Tenant data stays isolated, but aggregated, anonymised metrics live in a shared store optimised for cross-tenant queries.
Consider what aggregates you'll need before you need them. Adding aggregation after the fact requires backfilling from thousands of databases.
Workers default to 10,000 subrequests per invocation on paid plans, configurable up to 10 million through Wrangler's limits.subrequests setting. For cross-tenant aggregation across hundreds or thousands of databases, this configurability removes what was previously a hard constraint. A Worker iterating through 5,000 tenant databases can now set its subrequest limit accordingly rather than requiring queue-based batching. That said, queue-based aggregation remains the better pattern for very large tenant counts where wall time, not subrequests, becomes the bottleneck. A Worker querying ten thousand databases sequentially will likely hit wall time limits before subrequest limits. Design your aggregation strategy around whichever constraint binds first.
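A sketch of that iteration, assuming a caller-supplied lookup that returns the D1 handle for one tenant and an ANALYTICS_DB binding for the shared store:

// Hypothetical aggregation pass: sum a per-tenant metric across databases and
// write one row into the shared analytics database.
async function aggregateDailyUsage(
  tenantIds: string[],
  resolveTenantDb: (tenantId: string) => D1Database, // assumed lookup helper
  env: Env,
): Promise<void> {
  let totalEvents = 0;
  for (const tenantId of tenantIds) {
    const db = resolveTenantDb(tenantId);
    const row = await db.prepare("SELECT COUNT(*) AS events FROM events WHERE day = date('now')")
      .first<{ events: number }>();
    totalEvents += row?.events ?? 0;
  }
  await env.ANALYTICS_DB.prepare(
    "INSERT INTO platform_daily (day, events) VALUES (date('now'), ?)"
  ).bind(totalEvents).run();
}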
Tenant lifecycle
Provisioning creates resources that must be tracked, and a robust flow handles failure at any step. Database creation might succeed while schema application fails, leaving an empty database and a partially onboarded tenant; detect this and retry or clean up. The schema might apply while storing the tenant record fails, leaving a functional database you can't find; include the tenant identifier in the database name so orphans can be discovered through the API. The tenant record might store while subsequent setup fails, leaving a registry entry pointing at a valid database whose default data, welcome emails, or integrations never completed; design onboarding as a state machine that can resume from any point.
Offboarding is equally complex, with regulations potentially requiring data export before deletion, grace periods allowing tenants to return, and cascading deletes needing to clean up all resources without leaving orphans or dangling references.
Build reconciliation processes. Periodically enumerate all resources and verify they belong to active tenants. Flag orphans for investigation. This catches provisioning failures, offboarding bugs, and edge cases you haven't imagined.
What goes wrong
Multi-tenant failures are predictable in kind if not in timing. The question isn't whether you'll face data isolation bugs, migration failures, and resource exhaustion; it's whether you've built detection and mitigation before they become incidents.
Data isolation failures
In row-level isolation, the most common failure is a query without tenant filtering: typically in new code paths, rarely-exercised error handlers, or debugging sessions where someone runs a query manually.
Prevent it through multiple layers. ORM configurations can add tenant filters automatically; configure this as default, not opt-in. Database views can include tenant filters, ensuring raw table access is never needed in application code. Code review checklists should specifically verify tenant filtering. Integration tests should attempt cross-tenant access and verify failure.
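One lightweight layer is a query wrapper that refuses to run statements lacking a tenant placeholder, so unscoped access has to be deliberate rather than accidental; this is a sketch, not a feature of any particular ORM:

// Tenant-scoped query helper: every statement must reference ?1 as the tenant
// ID, and the wrapper binds it, so a missing filter fails loudly in testing.
class TenantDb {
  constructor(private db: D1Database, private tenantId: string) {
    if (!tenantId) throw new Error("TenantDb requires a tenant ID");
  }

  prepare(sql: string, ...params: unknown[]) {
    if (!sql.includes("tenant_id = ?1")) {
      throw new Error(`Query is not tenant-scoped: ${sql}`);
    }
    return this.db.prepare(sql).bind(this.tenantId, ...params);
  }
}

Usage looks like new TenantDb(env.DB, tenantId).prepare("SELECT * FROM projects WHERE tenant_id = ?1 AND status = ?2", "active"). The check is textual rather than a guarantee, but it converts a silent breach into a loud failure.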
Detect it through audit logging. Log all database queries with tenant context. Anomaly detection can flag queries returning data for multiple tenants or query patterns that don't match expected access.
Migration failures
In database-per-tenant, migrations fail partially. Some tenants migrate successfully; others fail. Causes vary: data violating new constraints, queries timing out on large tables, transient infrastructure issues.
Schema drift is a subtle variant. With lazy migration, different tenants run different schema versions. Your application must handle this (either supporting multiple schema versions simultaneously or forcing migration before operations requiring newer schemas). Design this handling explicitly; discovering schema incompatibility at runtime creates debugging challenges.
Resource exhaustion
Per-tenant resource limits in Workers and D1 provide natural isolation, but account-level limits are shared. A tenant triggering expensive operations can consume quota that affects all tenants.
Monitor per-tenant resource consumption. Track CPU time, subrequests, and database operations by tenant. Rate limit tenants approaching concerning thresholds before they impact the platform.
With Workers for Platforms, configure resource limits conservatively. Tenant code running at maximum allowed time on every request consumes substantial resources. Start tight and increase based on observed legitimate usage.
Scale inflection points
At ten tenants, you can operate manually (migrate databases by hand, investigate issues individually, know each tenant's situation).
At a hundred tenants, automation becomes necessary; manual migration is tedious and error-prone, and manual investigation doesn't scale.
At a thousand tenants, dedicated tooling is required because migrations become distributed systems problems, monitoring must aggregate and surface anomalies automatically, support must be tiered, and cross-tenant analytics queries take noticeable time.
At ten thousand tenants, operational overhead dominates: migration tooling must handle failures gracefully at scale, monitoring must be hierarchical (tenant-level detail available on demand, not in dashboards), and tenant metadata may itself need sharding. Plan for the scale you expect to reach, not the scale you have today, and build tooling before you need it urgently.
Reference architecture
The entry point is a dispatch Worker handling all incoming requests. It authenticates requests, identifies tenants from tokens or domains, retrieves tenant configuration from cached metadata, and routes to appropriate handlers. Tenant code never runs without authentication succeeding first.
Tenant metadata lives in a shared D1 database with aggressive KV caching: tenant registry, domain mappings, feature flags, tier information, billing status. Every request queries this data; cache accordingly.
Tenant data lives according to your isolation strategy: shared D1 with tenant filtering for row-level isolation, dynamic binding resolution for database-per-tenant, or routing logic selecting shared or dedicated databases for tiered isolation.
Tenant state uses Durable Objects with tenant-prefixed names. Sessions, real-time features, and coordination all route through objects named with tenant identifiers.
For platforms with tenant-contributed code, a dispatch namespace contains tenant Workers. The dispatch Worker routes to tenant code after authentication, with outbound Workers controlling external access.
Usage tracking aggregates in a dedicated analytics database, populated by background processing. Quota enforcement uses Durable Objects for hard limits, optimistic checking for soft limits.
Custom domains route through Cloudflare for SaaS, with domain-to-tenant mappings in the metadata database. Fallback to platform subdomains ensures availability when custom domain issues arise.
Your specific platform will differ, but the pattern of centralised authentication, cached metadata, appropriate data isolation, and explicit resource boundaries applies broadly.
What comes next
Chapter 24 provides the honest assessment every technical leader needs: when is Cloudflare not the right choice? Understanding limitations prevents costly mistakes and builds credibility when you do recommend the platform.