Chapter 20: Observability and Operations
How do I know what's happening in production and respond to issues?
On traditional servers, observability means installing agents, tailing log files, and SSHing in when things break. On Cloudflare, none of this applies. No servers to access, no filesystems where logs accumulate, no processes to attach debuggers to. Your Worker runs everywhere and nowhere, hundreds of locations simultaneously, each ephemeral. Capture evidence as it happens, or lose it.
Edge observability isn't harder than server observability; it's fundamentally different. Understanding the difference determines whether you're flying blind or seeing everything.
The inversion
Traditional observability assumes persistence. Logs write to disk, metrics accumulate in memory. When something breaks, you SSH in, grep the logs, attach a debugger, investigate. The evidence waits for you.
Edge observability assumes ephemerality. V8 isolates spin up, handle requests, and vanish. No disk to write to, no process to attach to, no server to access. Logs either stream somewhere in real time or disappear forever.
On servers, observability is an afterthought. On the edge, observability is architecture: what you don't plan for doesn't exist.
This inversion has three consequences.
First, you must decide what to observe before problems occur. On a server, you can enable verbose logging after discovering an issue. On the edge, if you didn't capture it, you cannot retrieve it. Observability becomes a design decision, not an operational afterthought.
Second, aggregation becomes automatic but correlation becomes hard. Every log from every location streams to a single place without infrastructure to manage. But a single user request might touch London, then Amsterdam, then Frankfurt. Without explicit correlation identifiers, you have disconnected logs from multiple continents with no way to link them.
Third, traditional tooling doesn't work. APM agents assume persistent processes: they instrument code, accumulate metrics in memory, and periodically flush to a backend. V8 isolates lack persistent processes. Profilers assume you can attach to a running process, but you cannot. Edge tracing is manual, or it is nothing.
What you can and cannot observe
Honesty about limitations prevents wasted debugging time. Some capabilities you take for granted elsewhere don't exist on the edge.
You cannot attach a debugger to a production Worker. The isolate handling a request exists only for that request's duration, potentially milliseconds. By the time you'd attach, it's gone.
You cannot get production flame graphs or CPU profiles. Profiling requires instrumentation that accumulates data over time within a process. V8 isolates don't persist long enough, and the multi-tenant architecture prevents instrumenting the runtime itself.
You cannot inspect memory state after a request completes. No heap dump, no core dump, no memory snapshot. If you needed to see what was in memory, you needed to log it during the request.
You cannot access the underlying system. No system metrics, no network statistics, no disk I/O measurements. The abstraction that enables global deployment hides infrastructure details completely.
You can't attach a debugger, can't get a heap dump, can't SSH in. Your logs are your debugging. Make them complete.
Local reproduction becomes essential. When production exhibits behaviour you can't explain, your only path forward is reproducing the issue locally where you can attach debuggers and profilers. Capture enough context in production logs to replay the request accurately: method, path, headers, body, query parameters.
The three layers of edge observability
Edge observability has three layers, each with different purposes and cost profiles. Understanding when to use each prevents over-investment and dangerous blind spots.
Real-time streams
The wrangler tail command streams logs from your Worker as they happen: console output, exceptions, and request metadata in real-time. This is your primary debugging tool during development and your first response during incidents.
# Stream all logs from production
wrangler tail --env production
# Filter to errors only
wrangler tail --env production --status error
# Filter by specific IP (useful for debugging user-reported issues)
wrangler tail --env production --ip 203.0.113.42
# Sample 10% of requests (necessary for high-traffic Workers)
wrangler tail --env production --sampling-rate 0.1
Real-time streaming is free but ephemeral. Nothing persists unless you're watching. A 3am production incident with no one tailing logs? Those logs are gone. Keep a terminal tailing production during deploys and investigations, but don't rely on it as your only observability layer.
For high-traffic Workers, sampling prevents overwhelming your terminal. But sampling means you might miss the specific request that failed. When debugging a specific user's issue, filter by their IP or a request header rather than sampling randomly.
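Beyond status and IP, wrangler tail supports further filters; the exact flag set varies by wrangler version, so treat this as a sketch and confirm with wrangler tail --help:
# Only show POST requests whose logs contain a specific marker
wrangler tail --env production --method POST --search "payment_failed"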
Persistent exports with Logpush
Workers Logs (covered later in this chapter) handles most operational logging needs with zero configuration beyond enablement. Logpush remains valuable for retention beyond seven days, integration with existing log platforms, or delivering log data in specific formats for compliance.
Where you send logs is a business decision, not a technical one. The same logging strategy can cost $5 or $900 monthly depending on destination.
Consider a moderately successful API handling ten million requests daily. With one kilobyte of structured logging per request, that's 300GB monthly. Stored in R2: roughly $4.50. Indexed in Datadog: around $900. Same logs, same volume, 200x cost difference.
Choosing a log destination
| If your situation is... | Choose... | Because... |
|---|---|---|
| Already paying for Splunk/Datadog with budget headroom | Log platform | Query capability worth the cost |
| Cost-sensitive, processing >1 TB/month | R2 + query infrastructure | Order of magnitude cheaper |
| Need real-time alerting on log content | Log platform | R2 requires building alerting yourself |
| Compliance requires long retention | R2 for archive, platform for recent window | Balance cost with query capability |
| Small scale, exploring | Start with R2 | Can always add platform later |
R2 is just storage; you need to build or buy query infrastructure. This works well if you already have a data pipeline (Spark, Athena, BigQuery) that can query files in object storage. It's a poor choice for querying logs interactively during an incident.
A middle path: Logpush to R2 for archival (cheap, complete) plus real-time streaming to a log platform for the most recent window (expensive, queryable). Most incidents involve recent events; historical analysis can tolerate slower queries.
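Setting up the archival leg is a one-time Logpush job pointed at R2. A hedged sketch of the account-level API call, with placeholders in angle brackets; check the Logpush documentation for the current destination string format and optional fields:
curl -X POST "https://api.cloudflare.com/client/v4/accounts/<ACCOUNT_ID>/logpush/jobs" \
  -H "Authorization: Bearer <API_TOKEN>" \
  -H "Content-Type: application/json" \
  --data '{
    "name": "workers-logs-archive",
    "dataset": "workers_trace_events",
    "destination_conf": "r2://my-log-bucket/workers?account-id=<ACCOUNT_ID>&access-key-id=<R2_ACCESS_KEY_ID>&secret-access-key=<R2_SECRET_ACCESS_KEY>",
    "enabled": true
  }'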
Analytics Engine: high-cardinality custom metrics
The dashboard's aggregated metrics and Workers Logs answer operational questions: error rates, latency distributions, request volumes. Analytics Engine answers business questions: revenue by customer, usage by feature, performance by geographic region. Dimensions that would explode traditional time-series databases.
Traditional metrics systems like Prometheus or InfluxDB struggle with high-cardinality labels. A metric labelled by user ID across millions of users creates millions of time series, each consuming memory and storage. The database slows, queries timeout, and you're forced to aggregate away the detail you wanted.
Analytics Engine handles exactly this pattern. Write millions of distinct data points with arbitrary cardinality, then query them with SQL. Cloudflare uses Analytics Engine internally to power the per-product metrics in the dashboard for D1, R2, and other services. The same infrastructure handles your custom metrics.
Writing data points
Configure an Analytics Engine dataset in your Wrangler configuration and write data points from your Worker:
[[analytics_engine_datasets]]
binding = "METRICS"
dataset = "application_metrics"
// Record a user action with rich dimensions
env.METRICS.writeDataPoint({
blobs: [userId, action, region, planType], // String dimensions
doubles: [latencyMs, requestSize], // Numeric values
indexes: [tenantId] // Sampling key
});
The blobs array holds string dimensions: user IDs, action names, countries, feature flags. The doubles array holds numeric values: latencies, counts, sizes, amounts. The indexes field determines sampling behaviour at extremely high volume.
Writes are non-blocking and don't require awaiting. The platform handles batching and delivery asynchronously. A failed write doesn't affect your Worker's response; the data point is simply lost, acceptable for metrics that aren't financial records.
Querying with SQL
Query your data through the SQL API, available via the dashboard or REST endpoint:
-- Revenue by plan type, last 7 days
SELECT
blob4 AS plan_type,
SUM(double1) AS total_revenue,
COUNT() AS transaction_count
FROM application_metrics
WHERE timestamp > NOW() - INTERVAL '7' DAY
AND blob2 = 'purchase'
GROUP BY plan_type
ORDER BY total_revenue DESC
The SQL dialect supports standard aggregations, filtering, grouping, and time-based operations. Recent enhancements added HAVING for filtering aggregated results, LIKE for pattern matching, and topK() for finding most frequent values.
Analytics Engine applies sampling at extreme data volumes, using the indexes field to ensure sampling is consistent within logical groups. Index by tenant ID, and all data points for a given tenant are either included or excluded together. Each row returned by a query carries a sample interval indicating the sampling ratio; weight counts by this interval to estimate true totals.
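In practice, prefer summing the sample interval over a raw COUNT() when estimating event totals. A sketch, assuming the SQL API exposes the interval as the _sample_interval column (verify the name against the current documentation):
-- Estimated purchase count per plan type, adjusted for sampling
SELECT
  blob4 AS plan_type,
  SUM(_sample_interval) AS estimated_purchases
FROM application_metrics
WHERE timestamp > NOW() - INTERVAL '7' DAY
  AND blob2 = 'purchase'
GROUP BY plan_type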
Use cases that fit
Analytics Engine excels at specific patterns:
Usage-based billing: track API calls, compute time, or storage per customer. Write a data point for each billable event; query monthly totals per customer at billing time.
Feature analytics: which features customers use, how frequently, with what parameters. High cardinality is the point.
Performance monitoring by customer: which customers experience slow responses and why. Correlate latency with customer characteristics to identify patterns.
Business metrics dashboards: expose custom analytics to your own customers. Query Analytics Engine from a Worker, format the results, return through your API.
The pattern that doesn't fit is real-time alerting. Analytics Engine is an analytics store, not a monitoring system. For alerts on specific conditions, use Workers Logs with external alerting or implement real-time checks in Tail Workers.
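When alerting needs are lightweight, a Tail Worker can watch invocations in real time and push failures to a webhook. A hedged sketch: ALERT_WEBHOOK_URL is a hypothetical secret, the Tail Worker is attached to the producer via tail_consumers in the producer's Wrangler configuration, and the exact TraceItem fields should be confirmed against the current workers-types:
interface AlertEnv {
  ALERT_WEBHOOK_URL: string;
}

export default {
  async tail(events: TraceItem[], env: AlertEnv, ctx: ExecutionContext) {
    // Collect invocations that threw or otherwise did not complete cleanly
    const failures = events.filter(e => e.outcome !== 'ok' || e.exceptions.length > 0);
    if (failures.length === 0) return;

    ctx.waitUntil(fetch(env.ALERT_WEBHOOK_URL, {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify({
        text: `${failures.length} failed Worker invocations`,
        samples: failures.slice(0, 3).map(e => ({
          script: e.scriptName,
          outcome: e.outcome,
          exceptions: e.exceptions.map(x => x.message)
        }))
      })
    }));
  }
};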
Workers Logs: the default starting point
The previous sections described a choice between real-time streaming (free but ephemeral) and Logpush exports (persistent but requiring external infrastructure). Workers Logs eliminates this tradeoff for most use cases.
Workers Logs automatically ingests, indexes, and stores all logs from your Workers for seven days. Enable it with a configuration change and redeploy:
[observability]
enabled = true
[observability.logs]
invocation_logs = true
head_sampling_rate = 1
No external log platform to configure, no Logpush jobs to create, no infrastructure to manage. Your console.log statements, errors, and request metadata persist for a week, queryable through the dashboard or API.
The cost model is straightforward: twenty million log events per month are included with paid Workers plans; additional events cost $0.60 per million. A Worker handling ten million requests monthly with two log events per request stays within the included allocation. Even high-traffic applications remain economical because you're paying for storage and indexing, not operational overhead.
Head-based sampling for cost control
High-traffic Workers can quickly exceed the included allocation. Head-based sampling captures a representative percentage of requests:
[observability]
enabled = true
[observability.logs]
invocation_logs = true
head_sampling_rate = 0.1 # Log 10% of requests
At 0.1, one in ten requests generates logs. All logs within a sampled request are captured. A Worker handling a billion requests monthly, with two log events per request, generates 200 million log events at 10% sampling; at $0.60 per million after the included allocation, that's roughly $108 monthly. The same Worker at 100% sampling would cost nearly $1,200.
The sampling decision happens at request arrival, before your code executes. You can't selectively sample based on response status or other outcomes. To capture all errors while sampling successes, implement that logic in your Worker code: log errors unconditionally but use conditional logging for successes.
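A minimal sketch of that pattern, keeping head_sampling_rate at 1 so the in-code decision is the only sampling applied; handle() is a hypothetical application handler:
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const requestId = request.headers.get('cf-ray') ?? crypto.randomUUID();
    try {
      const response = await handle(request, env);
      // Successes: sample 10% in code
      if (Math.random() < 0.1) {
        console.log(JSON.stringify({ level: 'info', requestId, status: response.status }));
      }
      return response;
    } catch (error) {
      // Errors: always logged
      console.error(JSON.stringify({
        level: 'error',
        requestId,
        message: error instanceof Error ? error.message : String(error)
      }));
      return new Response('Internal error', { status: 500 });
    }
  }
};
Note that with invocation logs enabled, every request still produces one invocation log event; in-code sampling only reduces console log volume.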
The Query Builder
Raw log storage is useful, but queryable log storage is powerful. The Query Builder provides a SQL-like interface for investigating your logs without external tooling.
The Query Builder operates through the Cloudflare dashboard at Workers & Pages → Observability → Investigate. Select fields to display, filters to apply, groupings for aggregation, and time ranges to search. The interface translates your selections into queries against the Workers Observability dataset.
Typical investigations:
Highest error rate endpoints? Filter to 5xx status codes, group by request path, visualise as bar chart.
Latency distribution for a specific Worker? Select wall time, filter to Worker name, visualise as histogram.
What happened to a specific user? Filter by user ID (assuming you log it), sort by timestamp, display as event list.
The Query Builder also exposes a REST API for programmatic access: listing available keys, running queries, listing unique values for a specific key. Build custom dashboards, integrate with existing monitoring systems, or automate investigation workflows.
When Workers Logs isn't enough
Workers Logs solves the common case: operational visibility with minimal configuration and reasonable cost. Some requirements exceed what it provides.
Retention beyond seven days requires exporting to another system. For 90-day compliance retention or historical analysis spanning months, Logpush to R2 remains appropriate. Workers Logs provides the operational window; R2 provides the archive.
Real-time alerting on log content requires external infrastructure. Workers Logs stores and indexes but doesn't watch and notify. For PagerDuty alerts when specific log patterns appear, you need a log platform with alerting or a Tail Worker monitoring in real-time.
Cross-system correlation with non-Cloudflare logs requires a unified platform. If your architecture includes Lambda functions, Kubernetes pods, and Workers, correlating traces across all three requires exporting to a system that ingests from all sources.
The decision framework: Workers Logs handles most needs by default. Do you have specific requirements (long retention, real-time alerting, cross-system correlation) that justify additional infrastructure?
For most applications, Workers Logs is the answer. Enable it, query it when needed, invest your engineering time elsewhere.
How much observability do you need?
Not every application needs the same observability investment. A side project and a revenue-critical API have different requirements.
Minimum Viable Observability
For low-stakes applications, internal tools, or early exploration:
- wrangler tail for active debugging
- Logpush to R2 (cheap archival, query when needed)
- Built-in dashboard metrics
- No custom instrumentation
Cost: nearly nothing. Sufficient when incidents are tolerable and investigation can be slow.
Production-Grade Observability
For revenue-generating systems where incidents have real cost:
- Logpush to a log platform (or R2 with query tooling)
- Structured logging with request ID propagation
- Custom metrics via Analytics Engine for business-critical paths
- Alerting on error rates and latency by endpoint
Cost: meaningful but proportional. Appropriate when fast detection and diagnosis matter.
Enterprise Observability
For systems where Cloudflare is core infrastructure:
- All of the above
- Custom tracing across service bindings
- Regional alerting (not just global aggregates)
- Integration with existing APM and incident management
- Dedicated observability budget and ownership
Cost: significant investment. Appropriate when operational excellence is a competitive advantage.
Invest in observability proportional to the cost of incidents. A side project can tolerate hours of debugging. A payment system cannot.
Logging strategy
Every log line costs money to store and attention to analyse. At edge scale, the question isn't what to log but what's worth logging.
What earns its place
Some events always warrant logging:
Request boundaries provide the skeleton for understanding system behaviour. Log when requests arrive and complete, with enough context to correlate them.
Errors and exceptions are obviously essential. You can't debug what you don't know occurred.
External service calls are where edge applications spend most of their time and encounter most of their failures. When your Worker calls D1, R2, an external API, or a Durable Object, log the call, its duration, and its outcome (a small wrapper sketch follows below).
Authentication decisions matter for security auditing. Who accessed what, and were they authorised?
Business-critical actions (purchases, account changes, data modifications) provide audit trails that outlive debugging needs.
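For the external service calls above, a small wrapper makes the discipline cheap to apply. A sketch; the helper name and JSON fields are illustrative, not a prescribed format:
// Times any external call and logs its name, duration, and outcome
async function timed<T>(requestId: string, name: string, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    const result = await fn();
    console.log(JSON.stringify({ level: 'info', requestId, call: name, durationMs: Date.now() - start, outcome: 'ok' }));
    return result;
  } catch (error) {
    console.error(JSON.stringify({
      level: 'error', requestId, call: name, durationMs: Date.now() - start,
      outcome: 'error', error: error instanceof Error ? error.message : String(error)
    }));
    throw error;
  }
}

// Usage:
// const user = await timed(requestId, 'd1_get_user', () => env.DB.prepare(query).first());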
Some data never earns its place: passwords, API tokens, session secrets, credit card numbers, government identifiers. Never log these, regardless of how useful they'd be; the debugging benefit never outweighs the security risk.
Structure for correlation
The request identifier is the most important field in edge logging. A single user request might generate logs from a Worker, then a Durable Object, then another Worker via service binding. Without a consistent identifier propagating through the chain, correlating these logs requires timestamp guessing and prayer.
interface RequestContext {
requestId: string; // Generate once, propagate everywhere
traceId?: string; // If integrating with external tracing
userId?: string; // After authentication
startTime: number; // For duration calculation
}
function createContext(request: Request): RequestContext {
return {
requestId: request.headers.get('cf-ray') || crypto.randomUUID(),
startTime: Date.now()
};
}
function log(ctx: RequestContext, level: string, message: string, data?: object) {
console.log(JSON.stringify({
timestamp: new Date().toISOString(),
level,
requestId: ctx.requestId,
userId: ctx.userId,
message,
...data
}));
}
When calling a Durable Object or another Worker via service binding, pass the request ID in a header. The receiving code extracts it and uses the same identifier in its logs. This simple discipline transforms disconnected log lines into coherent request traces.
The debuggable log line
Because you can't attach a debugger in production, error logs must contain everything needed to reproduce the issue locally. A typical log line captures what happened; a debuggable log line captures what happened and the complete input state.
Insufficient:
{"level": "error", "message": "Failed to process request", "error": "TypeError: Cannot read property 'id' of undefined"}
Debuggable:
{
"level": "error",
"requestId": "abc123",
"message": "Failed to process request",
"error": "TypeError: Cannot read property 'id' of undefined",
"stack": "at processUser (worker.js:42)...",
"input": {
"method": "POST",
"path": "/api/users",
"body": {"name": "Alice"},
"headers": {"content-type": "application/json"}
}
}
The second log line lets you reproduce the exact request locally. The first requires guessing. On systems where you can SSH in and inspect state, the first might suffice. On the edge, it's insufficient.
This is more logging than you'd typically do on a system where you could investigate in place, but it's the only debugging strategy that works.
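A sketch of producing that log line automatically at the top level. handle() is a hypothetical application handler; bodies are assumed to be small JSON payloads (truncate or skip large and binary bodies), and sensitive headers are stripped per the earlier guidance:
export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Clone before the handler consumes the body, so it can still be read on failure
    const forLogging = request.clone();
    try {
      return await handle(request, env);
    } catch (error) {
      const headers = Object.fromEntries(forLogging.headers);
      delete headers['authorization'];
      delete headers['cookie'];
      console.error(JSON.stringify({
        level: 'error',
        requestId: request.headers.get('cf-ray') ?? crypto.randomUUID(),
        message: 'Failed to process request',
        error: error instanceof Error ? error.message : String(error),
        stack: error instanceof Error ? error.stack : undefined,
        input: {
          method: request.method,
          path: new URL(request.url).pathname,
          headers,
          body: (await forLogging.text()).slice(0, 2048)
        }
      }));
      return new Response('Internal error', { status: 500 });
    }
  }
};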
Tracing across service boundaries
Traditional distributed tracing assumes services run in known locations. Edge tracing adds geographic distribution that traditional tools weren't designed for.
Your Worker runs wherever the user is closest. If it calls a Durable Object, that object lives in a specific location, potentially different from where the Worker executed. If the Durable Object calls another Worker via service binding, that Worker runs co-located with the object, not with the user. A single request might traverse multiple continents without explicit routing.
What to trace
Service binding calls, Durable Object invocations, and external API calls warrant tracing. These are the boundaries where requests cross locations or leave the Cloudflare network.
Internal computation within a single Worker usually doesn't need span-level tracing. Curious where time goes within a Worker? Profile locally. Production tracing should focus on distributed aspects that can't be reproduced locally.
Propagating context
When calling across service boundaries, propagate trace context explicitly:
// In the calling Worker
const response = await env.USER_SERVICE.fetch(
new Request(url, {
headers: {
'x-request-id': ctx.requestId,
'x-trace-id': ctx.traceId,
'x-parent-span': currentSpanId
}
})
);
// In the receiving Worker
export default {
async fetch(request: Request, env: Env) {
const ctx = {
requestId: request.headers.get('x-request-id') || crypto.randomUUID(),
traceId: request.headers.get('x-trace-id'),
parentSpan: request.headers.get('x-parent-span')
};
// Continue with same context
}
};
This manual work replaces what APM agents would handle automatically in traditional environments. The alternative, disconnected logs with no way to correlate requests across service boundaries, is far worse.
Integration with external tracing
Existing tracing platforms (Jaeger, Zipkin, Datadog APM) expect trace data in specific formats. Emit compatible data from Workers using waitUntil() to avoid blocking responses:
ctx.waitUntil(
fetch('https://your-collector.example.com/traces', {
method: 'POST',
body: JSON.stringify(spans)
})
);
Be realistic about what you'll get. You'll see edge-specific hops but not the automatic instrumentation that APM agents provide for traditional applications. Flame graphs, automatic dependency mapping, and code-level attribution require instrumentation that isn't possible in V8 isolates.
When to invest in custom tracing: Complex service binding chains where understanding request flow is difficult, or when integrating edge traces with existing distributed tracing infrastructure. Simpler architectures (a Worker calling D1 and maybe one external API) probably need only structured logging with request IDs.
Detecting named failure modes
Throughout this book, we've named specific failure patterns. Observability is how you detect them in production. Each failure mode has a characteristic signature; knowing the signature tells you what to log and alert on.
Timing assumption violations (Chapter 5) manifest as high wall-clock time with low CPU time. Your Worker waits on sequential database calls that were instant locally but add seconds in production. Detect this by logging timing breakdowns:
const dbStart = Date.now();
const result = await env.DB.prepare(query).all();
const dbDuration = Date.now() - dbStart;
log(ctx, 'info', 'database_query', {
query: query.substring(0, 100), // Truncate for safety
durationMs: dbDuration,
rowCount: result.results.length
});
Alert when database call duration consistently exceeds expectations, or when total external call time dominates request time.
Placement latency mismatch (Chapter 6) shows up as bimodal latency distributions for Durable Object access. Some requests complete in 10ms; others take 200ms. The difference is geography: users near the DO's placement are fast; distant users are slow. Log the requesting Worker's colo alongside each DO call and its duration:
log(ctx, 'info', 'do_invocation', {
objectId: id.toString(),
workerColo: request.cf?.colo,
durationMs: duration
});
Alert when geographic mismatch consistently correlates with latency spikes. If users in Asia always experience slow DO access, you have a placement problem.
Poison message loops (Chapter 8) appear as queues that process but never drain, with the same message IDs appearing repeatedly in consumer logs. Track retry counts per message:
async queue(batch: MessageBatch<QueueMessage>, env: Env) {
  // One logging context per batch; log() and RequestContext are defined earlier in this chapter
  const ctx: RequestContext = { requestId: crypto.randomUUID(), startTime: Date.now() };
  for (const message of batch.messages) {
    if (message.attempts > 3) {
      log(ctx, 'warn', 'high_retry_message', {
        messageId: message.id,
        attempts: message.attempts,
        body: JSON.stringify(message.body).substring(0, 200)
      });
    }
  }
}
Alert when any message exceeds a retry threshold. The message body in the log helps identify the poison.
Stale read after write (Chapter 14) with KV is visible when users report seeing old data after changes. Correlate write timestamps with subsequent read timestamps. If reads consistently return data older than recent writes from the same session, you're hitting KV's eventual consistency. The fix is architectural (use D1 or Durable Objects for consistency-sensitive data), but detection requires logging write and read events with consistent user identifiers.
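A sketch of that logging, reusing the log() helper and RequestContext from earlier; PROFILES is a hypothetical KV binding, and the version stored alongside the value is what makes staleness measurable:
async function writeProfile(env: Env, ctx: RequestContext, userId: string, profile: object) {
  const version = Date.now();
  await env.PROFILES.put(`profile:${userId}`, JSON.stringify({ version, ...profile }));
  log(ctx, 'info', 'kv_write', { key: `profile:${userId}`, userId, version });
}

async function readProfile(env: Env, ctx: RequestContext, userId: string) {
  const raw = await env.PROFILES.get(`profile:${userId}`);
  const value = raw ? JSON.parse(raw) : null;
  // A read that returns a version older than a recent write for the same user is a stale read
  log(ctx, 'info', 'kv_read', { key: `profile:${userId}`, userId, version: value?.version ?? null });
  return value;
}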
Generic "error rate high" alerts catch problems eventually; specific alerts for known failure modes catch them immediately.
Durable object observability
Durable Objects have different observability characteristics than stateless Workers. They persist between requests, maintain state, and have lifecycles extending beyond individual invocations.
State visibility
A Durable Object's in-memory state exists only while the object is active. Once it hibernates or is evicted, the in-memory state is gone. Persistent state in SQLite storage survives, but you cannot query it externally; no dashboard shows Durable Object storage contents.
Log state transitions, not continuous state (too verbose):
async updateStatus(newStatus: string) {
const oldStatus = this.status;
this.status = newStatus;
await this.ctx.storage.put('status', newStatus);
log(this.requestCtx, 'info', 'status_transition', {
objectId: this.id,
from: oldStatus,
to: newStatus
});
}
When something goes wrong, trace the sequence of state changes that led to the problem.
Connection monitoring
Durable Objects handling WebSocket connections need connection lifecycle logging: connections opened, connections closed, reasons for closure. Without this, you can't debug connection stability issues or understand capacity utilisation.
Hibernation complicates monitoring. A hibernating Durable Object isn't running code, so it cannot actively check connections. Log connection events when they occur:
async webSocketMessage(ws: WebSocket, message: string) {
// Process message
}
async webSocketClose(ws: WebSocket, code: number, reason: string) {
log(this.requestCtx, 'info', 'websocket_closed', {
objectId: this.id,
code,
reason,
connectionDuration: Date.now() - this.connectionStartTime
});
}
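The open side can log the matching event. A sketch using the WebSocket hibernation accept API; this.connectionStartTime, this.requestCtx, and this.id mirror the fields used in the close handler above and are assumed to be set up by the class:
async fetch(request: Request): Promise<Response> {
  const pair = new WebSocketPair();
  const [client, server] = Object.values(pair);
  // Hibernation-compatible accept: the runtime can evict the object and still deliver events
  this.ctx.acceptWebSocket(server);
  this.connectionStartTime = Date.now();
  log(this.requestCtx, 'info', 'websocket_opened', {
    objectId: this.id
  });
  return new Response(null, { status: 101, webSocket: client });
}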
Alarm observability
Durable Object alarms provide scheduled execution, but failed alarms can be difficult to debug. An alarm that throws an exception retries, but you might not notice the retries unless you're logging them.
async alarm() {
const alarmTime = await this.ctx.storage.getAlarm();
log(this.requestCtx, 'info', 'alarm_triggered', {
objectId: this.id,
scheduledTime: alarmTime,
actualTime: Date.now(),
drift: Date.now() - (alarmTime || 0)
});
try {
await this.processScheduledWork();
} catch (error) {
log(this.requestCtx, 'error', 'alarm_failed', {
objectId: this.id,
error: error.message,
willRetry: true
});
throw error; // Re-throw to trigger retry
}
}
Alerting for geographic distribution
Alert thresholds on the edge must account for distribution. A 2% error rate might mean a global problem or a submarine cable cut affecting one region. Your alerts should know the difference.
Regional vs global
The most useful distinction is between regional and global incidents. Regional: one geography affected, perhaps a network issue, a regional service degradation, or a deployment problem in specific locations. Global: everywhere simultaneously, typically a code bug, a configuration change, or a dependency failure.
Configure both: "Error rate > 5% globally" catches universal failures; "Error rate > 20% in any single region" catches localised issues that might not register in global aggregates. The second alert is crucial. Singapore at 50% errors but only 5% of traffic means global error rate might be 2.5%, below your threshold. Your Singapore users are suffering while your dashboard shows green.
Baseline calibration
Edge applications have different baseline characteristics. Cold starts are nearly non-existent, so latency distributions are tighter. But geographic distribution means some requests always have higher latency; users far from data dependencies pay a distance penalty.
Establish baselines per metric and per region before setting alert thresholds. P99 latency of 500ms might be concerning for requests served from the same continent as your database, but normal from the opposite hemisphere.
Threshold selection
Overly sensitive alerts create alert fatigue. An alert that fires three times daily and is usually false trains you to ignore it, and you'll miss the real incident. Overly lenient alerts miss real problems. An alert that fires only when 50% of requests fail means angry users before you know anything is wrong.
The right threshold is the lowest value that doesn't produce false positives during normal operation. Find it empirically: collect baseline data for a week, calculate normal variance, set thresholds above that variance. Error rate normally between 0.1% and 0.3%? Alerting at 0.5% catches real problems without noise. Alert at 0.2%? Constant pages for normal fluctuation.
Alert routing
Alerts should reach people who can understand the problem and take action. The team that wrote the code should receive alerts about that code's behaviour.
When developers carry operational responsibility for their own systems, they build systems that are easier to operate. When operations is someone else's problem, developers optimise for shipping features, not operational clarity. Structure alerting so consequences flow to decision-makers. New feature causes latency spikes? The team that shipped it should know first.
This isn't about blame; it's about learning. The fastest path to reliable systems is feeling the results of your choices directly.
Incident response
When alerts fire, speed matters. The difference between five-minute and five-hour incidents is often whether you had a plan before the alert fired.
Rollback first
Workers support instant rollback to previous versions. If a deployment preceded the incident, roll back immediately; investigate why the new code failed after restoring service.
# List recent deployments
wrangler deployments list
# Rollback to previous version
wrangler rollback
Rollback takes effect globally within seconds, faster than any other mitigation. Don't debug while users are suffering if you can roll back instead.
The practical requirement: maintain deployable rollback targets. If your previous version also had bugs, or database migrations make old code incompatible, rollback isn't an option. Test that rollback works before you need it.
Mitigation without deployment
Some incidents can't be solved by rollback. The problem is a dependency, data, or traffic pattern, not code. Feature flags enable mitigation without deployment.
Store feature flags in KV. When a feature causes problems, disable it by changing a KV value. The change propagates globally within a minute, faster than any deployment pipeline. This requires building the feature flag check into your code before the incident:
const newFeatureEnabled = await env.FLAGS.get('enable-new-checkout') === 'true';
if (newFeatureEnabled) {
return handleNewCheckout(request);
} else {
return handleLegacyCheckout(request);
}
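One refinement worth building in at the same time: KV reads accept a cacheTtl option, so the flag check doesn't add a fresh KV read to every request. The trade-off is propagation; a flipped flag can take up to the TTL to reach any given location.
// Cache the flag at the edge for 60 seconds (the minimum cacheTtl)
const newFeatureEnabled =
  (await env.FLAGS.get('enable-new-checkout', { cacheTtl: 60 })) === 'true';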
Plan for this during development, not during the incident.
Post-incident learning
After incidents resolve, understand what happened. Not to assign blame, but to improve systems.
What happened? Timeline of events, from first symptom to resolution. Be specific about times and actions.
Why didn't we detect it faster? Alerts missing or misconfigured? Failure mode not matching any alert? Should we add specific detection?
Why didn't we resolve it faster? Rollback not an option? Missing feature flags? On-call engineer lacking access or context?
What systemic change prevents recurrence? Not "person X will be more careful" but concrete changes: new alerts, updated runbooks, additional tests, architectural changes.
The output is action items with owners and deadlines. Post-incident review without action items is storytelling, not improvement.
Hyperscaler comparison
Technical leaders migrating from AWS, Azure, or GCP have expectations shaped by those platforms. Cloudflare differs in important ways.
| Capability | Cloudflare | AWS Lambda | Azure Functions |
|---|---|---|---|
| Built-in logging | Workers Logs (opt-in), console streaming, Logpush | CloudWatch Logs (automatic) | Azure Monitor (automatic) |
| Log persistence | Seven days via Workers Logs (opt-in); export for longer | Automatic retention | Automatic retention |
| APM/Tracing | Manual instrumentation only | X-Ray (agent-based, automatic) | Application Insights (automatic) |
| Debugger attachment | Not possible | Possible | Possible |
| Custom metrics | Analytics Engine | CloudWatch custom metrics | Azure Metrics |
| Geographic breakdown | Built into dashboard | Requires custom implementation | Requires custom implementation |
Key differences: Cloudflare persists nothing by default; you must enable Workers Logs (seven days) or configure Logpush, or logs disappear. Cloudflare doesn't support APM agents, so instrumentation is manual. Cloudflare doesn't allow debugger attachment, making local reproduction essential.
In exchange: global log aggregation without infrastructure management, real-time streaming from everywhere, Analytics Engine for high-cardinality metrics that would be expensive elsewhere, and geographic visibility by default. Understanding where requests come from is built into the dashboard, not something you build yourself.
For teams accustomed to CloudWatch or Application Insights, expect less automatic instrumentation but simpler infrastructure, more explicit logging but global visibility without additional configuration.
Maintaining observability
Observability infrastructure requires ongoing attention. Neglected, it degrades until it fails precisely when you need it most.
The pattern is predictable. A team sets up Logpush, configures alerts, builds dashboards, and moves on. Months pass. Log schemas evolve but Logpush configurations don't update. Alert thresholds that once made sense become too sensitive or too lenient as traffic patterns change. Dashboards show metrics for features that no longer exist while missing metrics for features that do.
This decay happens because observability doesn't demand attention until something breaks. Unlike production code, which fails visibly when wrong, observability infrastructure can be wrong for months without consequence. Then an incident reveals that alerts didn't fire, logs didn't capture the right context, or dashboards showed green while users experienced red.
Treat observability as a system requiring maintenance, not a project that gets completed.
Review alert effectiveness quarterly. Which alerts fired? Which were actionable? Which were noise? Alerts that never fire might be misconfigured. Alerts that fire frequently and get ignored are worse than useless; they train your team to dismiss alerts.
Audit log schemas when code changes. New features should log appropriately. Deprecated features should stop cluttering logs. Engineers shipping features should update observability as part of the work.
Test that alerts actually fire. An alert that has never triggered might be broken. Periodically verify your alerting pipeline end-to-end: generate a synthetic error, confirm the alert fires, confirm the notification reaches the right person.
Budget time for maintenance. Observability upkeep isn't glamorous and doesn't ship features, so it tends to be deferred indefinitely. An hour monthly keeps you current. A year of neglect becomes a project.
What comes next
Observability tells you what's happening. Chapter 21 covers security, compliance, and deployment: ensuring that what's happening is what should be happening, and that changes reach production safely.
Observability and secure deployment form the operational foundation. Observability without security leaves you watching breaches unfold; security without observability leaves you blind to whether controls are working.