Chapter 10: Realtime: Audio and Video at the Edge
How do I build applications with real-time audio and video?
Real-time means different things to different applications: a collaborative document editor needs sub-second text synchronisation, a stock trading platform needs millisecond price updates, and a video call needs continuous audio and video streams with latency low enough that conversation feels natural. These diverse problems require different solutions.
Durable Objects with WebSockets handle the first two cases elegantly with single-threaded coordination, global routing, and persistent connections (as Chapter 6 covered). But audio and video streaming is fundamentally different because the protocols differ (WebRTC rather than WebSockets), the infrastructure differs (media servers rather than application servers), and the operational complexity differs (NAT traversal, codec negotiation, adaptive bitrate). Cloudflare Realtime exists because Durable Objects are not designed for media.
WebSockets guarantee delivery and ordering, which matters for application state but adds latency. WebRTC prioritises low latency, accepting that some packets will never arrive. A dropped chat message is unacceptable; a dropped video frame is invisible. If your application involves voice or video between participants, you need WebRTC infrastructure regardless of how you optimise WebSockets.
This chapter covers when you need Cloudflare Realtime versus when Durable Objects suffice, explains why anycast SFUs change the trade-offs, describes the operational realities of running real-time media in production, and shows how to evaluate whether Cloudflare fits your requirements.
Two kinds of real-time
Understanding the distinction between data real-time and media real-time shapes every architectural decision you make.
Data real-time involves text, JSON, binary messages, and application state where latency tolerance ranges from tens of milliseconds to a few seconds depending on the use case. A chat message arriving 200ms late is imperceptible, and a collaborative cursor position updating 500ms late is noticeable but acceptable. WebSockets over TCP provide reliable, ordered delivery while Durable Objects coordinate state across participants, and the infrastructure is straightforward: your application logic runs in Workers and Durable Objects with persistent connections and messages flowing naturally.
Media real-time involves continuous audio and video streams where latency tolerance is measured in tens of milliseconds; beyond 150ms, conversation becomes awkward as participants talk over each other, and beyond 300ms, it feels broken. WebRTC provides the transport with UDP for speed (accepting occasional packet loss rather than waiting for retransmission), DTLS for encryption, and adaptive codecs that respond to network conditions. The infrastructure is specialised: Selective Forwarding Units (SFUs) route media between participants, TURN servers relay traffic when direct connections fail, and the entire system must operate closer to the network layer than typical application code.
The protocol reality
WebRTC and WebSockets serve fundamentally different purposes even though both enable "real-time" communication.
WebSockets run over TCP, where every packet arrives in order or the connection blocks waiting. That guarantee is essential for application state: you cannot skip a database write or drop a user action. But for media, waiting for a retransmission adds latency that compounds, and a video frame arriving 200ms late is worse than a frame that never arrives: the late frame causes visible stuttering, while a dropped frame is interpolated invisibly.
WebRTC runs primarily over UDP with DTLS encryption where packets arrive when they arrive and lost packets stay lost. The codec compensates: audio codecs use forward error correction to reconstruct missing data while video codecs use techniques like temporal scalability to gracefully degrade. A 2% packet loss rate that would cripple a WebSocket application is imperceptible in a well-implemented WebRTC call.
This isn't a choice between good and better protocols. It's a choice between protocols designed for different problems. Using WebSockets for video is like using a database for caching: technically possible, architecturally wrong.
When Durable Objects suffice
Many "real-time" applications don't actually need media infrastructure, so before reaching for Cloudflare Realtime, you should confirm that Durable Objects with WebSockets can't solve your problem.
Text chat and messaging: WebSockets through Durable Objects handle this perfectly. Messages are small, latency tolerance is generous, and the coordination model (one Durable Object per conversation) maps naturally to the domain. Cloudflare Realtime adds cost and complexity with no benefit.
Collaborative editing: Whether you're synchronising document state, whiteboard drawings, or design tool operations, WebSockets suffice. The data volumes are modest, sub-second latency is acceptable, and the interesting problems (conflict resolution, operational transformation) live in your application logic, not the transport layer.
Live dashboards and data feeds: Stock tickers, sports scores, IoT sensor readings, and analytics dashboards involve one-way data flow from server to many clients. WebSockets handle fan-out efficiently. Durable Objects coordinate the source of truth.
Multiplayer game state: Turn-based games, strategy games, and even many action games synchronise state through WebSockets. The game logic determines what "real-time" means; unless you're building a first-person shooter requiring sub-50ms latency, Durable Objects likely suffice.
Presence and cursor tracking: Showing who's online and where their cursor is involves small, frequent updates. WebSockets handle this trivially. The coordination challenge (ensuring all participants see consistent presence) is exactly what Durable Objects solve.
For all these cases, Chapter 6's patterns apply directly: create a Durable Object per logical entity (conversation, document, game session), connect participants via WebSockets, and let single-threaded execution handle coordination. The infrastructure is simpler, the cost is lower, and you maintain full control of the application logic.
When you need Cloudflare Realtime
Cloudflare Realtime becomes necessary when your application involves actual audio or video between participants rather than just data synchronisation.
Video conferencing: Multiple participants sending and receiving video streams. Each participant's video must reach every other participant with minimal latency. An SFU routes these streams efficiently; without one, you'd need mesh networking where each participant connects directly to every other participant. Mesh works for three or four people but bandwidth requirements grow quadratically: five participants means each sends four streams and receives four streams. By ten participants, the approach collapses entirely.
Voice calling: Audio-only calls have lower bandwidth requirements but the same latency sensitivity because conversation requires round-trip latency under 300ms to feel natural (ideally under 150ms). At 400ms, participants constantly interrupt each other, and at 500ms, they give up and switch to asynchronous communication.
Live streaming with interaction: A presenter broadcasting to an audience, with the audience able to respond via voice or video. The broadcast might tolerate higher latency (1-2 seconds is common for live streams), but interactive segments need the same low latency as a call.
AI voice agents: Applications where users speak to an AI and receive spoken responses. The pipeline runs speech-to-text, generates a response, and returns text-to-speech; low latency at every stage makes the interaction feel conversational rather than like talking to a voicemail system. End-to-end latency under 500ms is achievable with careful architecture.
Telehealth and remote assistance: Video calls with specific requirements around quality, reliability, and sometimes recording. Medical consultations, technical support with screen sharing, and remote expert guidance all need robust video infrastructure.
For these use cases, Durable Objects with WebSockets cannot help. You need WebRTC infrastructure: media servers, TURN relays, and client SDKs that handle codec negotiation, adaptive bitrate, and network condition changes.
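The bandwidth arithmetic behind the mesh-versus-SFU comparison in the video conferencing case above is easy to make concrete. A minimal sketch, with the per-stream bitrate as an assumed input:

```ts
// Sketch: per-participant upload bandwidth, mesh versus SFU.
// bitrateMbps is the assumed per-stream video bitrate (e.g. 2 Mbps for 720p).
function uploadMbps(participants: number, bitrateMbps: number) {
  return {
    // Mesh: each participant sends a separate copy of their stream to every peer.
    mesh: (participants - 1) * bitrateMbps,
    // SFU: each participant sends a single stream; the SFU fans it out.
    sfu: bitrateMbps,
  };
}

console.log(uploadMbps(5, 2));  // { mesh: 8, sfu: 2 }
console.log(uploadMbps(10, 2)); // { mesh: 18, sfu: 2 } — where mesh collapses
```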
The anycast SFU architecture
Traditional video infrastructure requires choosing where to place your media servers. AWS Chime operates in 14 regions. Azure Communication Services operates in 18. You provision capacity, select regions based on where you expect users, and accept that some users will connect to distant servers with higher latency.
Cloudflare Realtime works differently. The SFU runs on Cloudflare's entire network: over 300 locations worldwide. When a participant connects, BGP routing automatically directs them to the nearest location. There's no region selection, no capacity planning, no trade-off between cost and coverage.
This isn't just operational convenience. The architecture changes what's possible.
The traditional SFU problem
In a traditional SFU deployment, all participants in a call connect to the same server. If three people in London and one person in Sydney join a call, the SFU must be somewhere. Place it in London, and the Sydney participant has 300ms round-trip latency. Place it in Sydney, and the London participants suffer. Place it somewhere in between, and everyone has mediocre latency.
Some systems try to solve this by selecting SFU location dynamically based on the first participant or the geographic centre of all participants. This helps when participants cluster geographically, but creates suboptimal routing when they don't, and "geographic centre" between London and Sydney is somewhere in the Indian Ocean where no data centres exist.
Cloudflare's approach
Cloudflare's anycast architecture eliminates this trade-off. Each participant connects to their nearest Cloudflare location. The London participants connect to London; the Sydney participant connects to Sydney. Media flows between Cloudflare locations over Cloudflare's backbone network, which is optimised for exactly this traffic pattern.
The result: every participant gets optimal latency to their nearest point of presence, regardless of where other participants are. The geographic distribution of participants doesn't degrade anyone's experience.
This architecture also simplifies scaling. Traditional SFUs require load balancing, capacity monitoring, and scaling policies. Cloudflare's approach distributes load automatically across locations. A server in one location handles only the participants nearest to it; adding participants in other locations adds load to other servers, not the same server.
Cascading for large calls
For calls with many participants, Cloudflare uses cascading: media flows through a tree of servers rather than a single hub. Participants in London connect to London servers; those servers cascade to servers handling other regions. This keeps latency low while enabling calls with hundreds of participants.
The cascading topology forms automatically based on participant locations. You don't configure it; you don't even see it. From your application's perspective, participants join a call and can see and hear each other. The platform handles routing optimisation invisibly.
Latency characteristics
Measured latency in production deployments:
| Scenario | Typical Latency |
|---|---|
| Same city, both participants | 30-50ms round-trip |
| Same continent, different cities | 50-100ms round-trip |
| Cross-continent (US to Europe) | 100-150ms round-trip |
| Cross-continent (US to Asia) | 150-200ms round-trip |
| Worst case (distant regions) | 200-300ms round-trip |
These figures represent end-to-end media latency including encoding, network transit, and decoding. They're achievable because each participant connects to nearby infrastructure; the inter-region latency affects only the backbone hop, not the participant connection.
TURN: solving NAT traversal
WebRTC wants peer-to-peer connections. In practice, NATs and firewalls often prevent direct connectivity. TURN (Traversal Using Relays around NAT) servers act as intermediaries, relaying traffic when direct connection fails.
Without TURN, roughly 8-15% of connection attempts fail depending on network conditions. Users behind corporate firewalls, symmetric NATs, or certain mobile networks simply cannot establish direct connections. For a consumer application, losing 10% of users to connectivity failures is unacceptable.
Cloudflare provides TURN as part of Realtime. The same anycast architecture applies: TURN allocations route to the nearest Cloudflare location. The service handles the complexity of maintaining relay allocations, managing credentials, and ensuring connectivity.
STUN vs TURN
STUN (Session Traversal Utilities for NAT) helps peers discover their public IP addresses and attempt direct connection. It's lightweight and free on Cloudflare (stun.cloudflare.com). Most connections succeed with STUN alone: the peers discover their public addresses, exchange candidates through signalling, and establish direct media flow.
TURN relays actual media traffic when STUN-facilitated direct connection fails. This happens when both peers are behind symmetric NATs, when firewall rules block UDP, or when network policies prevent direct peer connections. TURN consumes bandwidth because it relays all media, and it costs money ($0.05 per GB).
The fallback sequence: try direct connection first, fall back to TURN only when necessary. Your application doesn't control this; the WebRTC stack handles it automatically. But understanding the sequence helps when debugging connectivity issues and predicting costs.
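What that fallback sequence looks like in configuration: the standard WebRTC ICE setup below lists STUN first and TURN as the relay fallback. The stun.cloudflare.com endpoint is the free one mentioned above; the TURN URL and credentials are placeholders, since in practice your backend issues short-lived TURN credentials and passes them to the client.

```ts
// Placeholder credentials — assume your backend issues these per session.
const turnUsername = '<issued-by-backend>';
const turnCredential = '<issued-by-backend>';

const pc = new RTCPeerConnection({
  iceServers: [
    // Cloudflare's free STUN endpoint; 3478 is the standard STUN port.
    { urls: 'stun:stun.cloudflare.com:3478' },
    // Illustrative TURN relay entry — only used when direct connection fails.
    {
      urls: 'turns:turn.example.com:443?transport=tcp',
      username: turnUsername,
      credential: turnCredential,
    },
  ],
  // 'all' (the default) tries direct candidates before falling back to relay;
  // 'relay' forces TURN, which is handy for testing the fallback path.
  iceTransportPolicy: 'all',
});
```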
WebRTC uses DTLS encryption end-to-end. When traffic routes through TURN, Cloudflare relays encrypted packets without access to the media content. This is inherent to the protocol, not a Cloudflare-specific implementation choice.
Estimating TURN usage
TURN usage varies dramatically by user base. Consumer applications with users on home networks: 5-10% TURN usage. Enterprise applications with users behind corporate firewalls: 20-40% TURN usage. Applications used primarily on mobile networks: 10-20% TURN usage.
For cost estimation, assume 15% of your traffic will route through TURN unless you have specific knowledge of your user base. If you're building for enterprise, budget for 30%. These estimates affect both cost projections and capacity planning.
RealtimeKit: the client SDK
Building WebRTC applications from scratch requires handling codec negotiation, network condition monitoring, audio level detection, camera and microphone permissions, connection state management, reconnection logic, and platform-specific quirks across browsers and mobile operating systems. RealtimeKit handles this complexity.
The SDK is available for JavaScript (web), React Native, Flutter, iOS (Swift), and Android (Kotlin). It provides two layers:
Core SDK: APIs for joining sessions, managing media tracks, handling events. You build your own UI but don't implement WebRTC negotiation, ICE candidate handling, or codec selection.
UI Kit: Pre-built components for common patterns. Video grids, participant lists, control bars. Useful for rapid prototyping or applications where custom UI isn't a differentiator.
Basic integration
import { RealtimeKit } from '@cloudflare/realtimekit';
// Initialize with authentication token from your backend
const rtk = new RealtimeKit({
token: authToken,
// Optional: configure quality preferences
videoConstraints: {
width: { ideal: 1280 },
height: { ideal: 720 },
frameRate: { ideal: 30 }
}
});
// Join a session
const session = await rtk.join({
sessionId: 'room-123',
displayName: 'Jamie'
});
// Access local media
await session.enableCamera();
await session.enableMicrophone();
// Handle remote participants
session.on('participantJoined', (participant) => {
console.log(`${participant.displayName} joined`);
// Attach video to DOM element
participant.videoTrack?.attach(videoElement);
participant.audioTrack?.attach(audioElement);
});
session.on('participantLeft', (participant) => {
console.log(`${participant.displayName} left`);
});
Connection state management
Real-world networks are unreliable. Connections drop, quality degrades, and users switch between WiFi and cellular. RealtimeKit exposes connection state for your application to respond appropriately.
session.on('connectionStateChanged', (state) => {
switch (state) {
case 'connecting':
showStatus('Connecting...');
break;
case 'connected':
hideStatus();
break;
case 'reconnecting':
showStatus('Connection lost, reconnecting...');
break;
case 'disconnected':
showError('Disconnected. Please check your network.');
break;
case 'failed':
showError('Connection failed. Unable to join call.');
break;
}
});
The SDK handles reconnection automatically for transient network issues. Your application receives state updates to inform the user. For persistent failures, the SDK eventually transitions to failed state, and your application should offer to retry or fall back gracefully.
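A sketch of what "offer to retry" might look like, reusing the hypothetical names from the examples above (rtk.join, showStatus, showError) and assuming session was declared with let so it can be replaced:

```ts
// Sketch: retry with backoff after the SDK reports a terminal failure.
let retryAttempts = 0;

session.on('connectionStateChanged', async (state) => {
  if (state !== 'failed') return;

  if (retryAttempts >= 3) {
    showError('Unable to reconnect. Please check your network and rejoin.');
    return;
  }

  retryAttempts += 1;
  const delayMs = 2000 * 2 ** (retryAttempts - 1); // 2s, 4s, 8s
  showStatus(`Connection failed, retrying in ${delayMs / 1000}s...`);
  await new Promise((resolve) => setTimeout(resolve, delayMs));

  // Rejoin and re-register event handlers on the new session object.
  session = await rtk.join({ sessionId: 'room-123', displayName: 'Jamie' });
});
```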
Adaptive quality
Network conditions change during a call. A user on WiFi walks to another room and loses signal. A mobile user enters an elevator. A home user's family member starts streaming video. RealtimeKit adapts automatically, but your application can respond to quality changes.
session.on('qualityChanged', (participant, quality) => {
// quality: 'excellent' | 'good' | 'fair' | 'poor'
if (quality === 'poor' && participant.isLocal) {
suggestDisablingVideo();
}
});
// Manual quality control
session.setVideoQuality('low'); // Reduce outgoing video quality
session.setVideoEnabled(false); // Disable video entirely
For bandwidth-constrained scenarios, simulcast allows receivers to choose quality independently. The sender transmits multiple quality layers; each receiver requests the layer matching their available bandwidth. This happens automatically but affects bandwidth usage: simulcast increases upload bandwidth by roughly 50% but enables better quality for receivers with good connections while maintaining stability for constrained receivers.
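RealtimeKit configures simulcast for you. For context, the equivalent with the browser's raw WebRTC API looks roughly like the sketch below; the layer names and bitrates are illustrative choices, not values from the SDK.

```ts
// Sketch: configuring simulcast layers with the standard WebRTC sendEncodings API.
const pc = new RTCPeerConnection();
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const [videoTrack] = stream.getVideoTracks();

pc.addTransceiver(videoTrack, {
  direction: 'sendonly',
  sendEncodings: [
    { rid: 'low',  scaleResolutionDownBy: 4, maxBitrate: 150_000 },  // ~180p
    { rid: 'mid',  scaleResolutionDownBy: 2, maxBitrate: 500_000 },  // ~360p
    { rid: 'high', maxBitrate: 2_500_000 },                          // full resolution
  ],
});
// The SFU forwards whichever layer each receiver's bandwidth can sustain.
```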
Error handling
Media applications fail in ways that web applications don't. Camera permissions denied, microphone in use by another application, hardware failures, browser restrictions. Handle these explicitly.
try {
await session.enableCamera();
} catch (error) {
if (error.name === 'NotAllowedError') {
showMessage('Camera permission denied. Enable in browser settings.');
} else if (error.name === 'NotFoundError') {
showMessage('No camera found. Please connect a camera.');
} else if (error.name === 'NotReadableError') {
showMessage('Camera in use by another application.');
} else {
showMessage('Unable to access camera.');
console.error('Camera error:', error);
}
}
Device enumeration helps users select the right input:
const devices = await navigator.mediaDevices.enumerateDevices();
const cameras = devices.filter(d => d.kind === 'videoinput');
const microphones = devices.filter(d => d.kind === 'audioinput');
// Let user select, then apply
await session.setCamera(selectedCameraId);
await session.setMicrophone(selectedMicrophoneId);
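Devices also appear and disappear mid-call (a Bluetooth headset disconnecting, a webcam being unplugged). The devicechange event is the standard way to notice; setMicrophone reuses the hypothetical SDK method above, and selectedMicrophoneId is the id chosen in the previous snippet.

```ts
// Sketch: falling back to another microphone when the current one disappears.
navigator.mediaDevices.addEventListener('devicechange', async () => {
  const devices = await navigator.mediaDevices.enumerateDevices();
  const microphones = devices.filter((d) => d.kind === 'audioinput');

  const currentGone = !microphones.some((d) => d.deviceId === selectedMicrophoneId);
  if (currentGone && microphones.length > 0) {
    await session.setMicrophone(microphones[0].deviceId);
    showMessage(`Microphone switched to ${microphones[0].label}`);
  }
});
```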
Recording
Recording video calls typically requires running a headless browser or custom media pipeline to capture streams. RealtimeKit includes built-in recording: enable it for a session, and recordings appear in R2 storage.
// Enable recording when creating the session (server-side)
const session = await realtimeApi.createSession({
recording: {
enabled: true,
outputPath: `recordings/${sessionId}/`,
format: 'mp4'
}
});
The recording pipeline runs on Cloudflare's infrastructure. Recordings include all participants' audio and video composited into a single file, or optionally separate tracks per participant for post-production flexibility.
This matters for compliance-driven applications (financial services requiring call recording), education (lecture capture), and content creation (podcast recording). What would otherwise require significant infrastructure becomes a configuration option.
Cost model and estimation
Real-time media pricing varies significantly across providers. Understanding the cost model prevents surprises.
Cloudflare pricing structure
Cloudflare Realtime charges based on bandwidth:
| Component | Price | Free Tier |
|---|---|---|
| SFU egress | $0.05/GB | 1,000 GB/month |
| TURN relay | $0.05/GB | Included in SFU allowance |
| Recording storage | R2 pricing | R2 free tier applies |
The bandwidth-based model creates different economics than per-minute pricing. Audio-only calls (roughly 100 Kbps per participant) cost dramatically less than video calls (1-4 Mbps per participant depending on resolution). Applications that adapt quality based on conditions naturally optimise cost.
Estimating bandwidth
For a typical video call:
| Quality | Bitrate | GB per hour (one direction) |
|---|---|---|
| Audio only | 50-100 Kbps | 0.023-0.045 GB |
| 360p video | 400-600 Kbps | 0.18-0.27 GB |
| 720p video | 1.5-2.5 Mbps | 0.68-1.13 GB |
| 1080p video | 3-5 Mbps | 1.35-2.25 GB |
For a four-person call at 720p for one hour, each participant sends one stream (to the SFU) and receives three streams (from the SFU). Total bandwidth per participant: roughly 3-4.5 GB. Total SFU egress for the call: roughly 12-18 GB, costing $0.60-$0.90.
At scale, a rough formula for monthly cost (call duration in hours, bitrate in bits per second, and streams per participant roughly equal to the participant count, i.e. one sent plus N−1 received, as in the four-person example above):
Monthly Cost ≈ (Calls Per Month × Avg Participants × Streams Per Participant × Bitrate × Avg Call Duration × 3600) / (8 × 1024³) × $0.05
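Turning that into code makes the inputs explicit. A sketch, with the 15% TURN share from the earlier estimate folded in as a default:

```ts
// Sketch: monthly SFU cost estimate following the formula above.
function estimateMonthlyCost(opts: {
  callsPerMonth: number;
  participants: number;
  durationHours: number;  // average call length
  bitrateBps: number;     // per-stream bitrate, e.g. 2_000_000 for 720p
  turnFraction?: number;  // share of traffic relayed via TURN
}): { egressGB: number; costUSD: number } {
  const { callsPerMonth, participants, durationHours, bitrateBps, turnFraction = 0.15 } = opts;
  const streamsPerParticipant = participants; // one sent plus (N − 1) received
  const bytes =
    (callsPerMonth * participants * streamsPerParticipant *
      bitrateBps * durationHours * 3600) / 8;
  const egressGB = (bytes / 1024 ** 3) * (1 + turnFraction);
  return { egressGB, costUSD: egressGB * 0.05 };
}

// 10,000 four-person, one-hour calls at ~2 Mbps per stream:
// roughly 154,000 GB and ~$7,700 for the month with the default TURN share.
console.log(estimateMonthlyCost({
  callsPerMonth: 10_000, participants: 4, durationHours: 1, bitrateBps: 2_000_000,
}));
```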
Comparison with alternatives
| Provider | Pricing Model | 4-person, 1-hour call |
|---|---|---|
| Cloudflare Realtime | $0.05/GB egress | ~$0.60-0.90 |
| AWS Chime SDK | $0.0017/participant-minute | $0.41 |
| Azure Communication Services | $0.004/participant-minute | $0.96 |
| Twilio Video | $0.004/participant-minute (group) | $0.96 |
Cloudflare's cost varies with video quality. The others charge flat per-minute rates regardless of quality. For audio-only calls, Cloudflare is dramatically cheaper. For high-quality video, costs converge.
Factor in:
- TURN relay: 10-30% additional bandwidth for users requiring relay
- Recording: Storage (R2 pricing) plus processing bandwidth
- Transcoding: If you're re-encoding recordings, compute costs apply
- Development time: Newer SDKs may require more integration effort than mature alternatives
Failure modes worth naming
Real-time media systems fail in ways that differ from typical web applications. Naming these failures creates vocabulary for debugging and post-incident analysis.
Connectivity failures
ICE negotiation timeout occurs when peers cannot establish a connection within the negotiation window (typically 30 seconds). Common causes: both peers behind symmetric NATs with TURN unavailable, firewall blocking UDP entirely, or incorrect TURN credentials. Symptoms: connectionState never reaches connected. The fix depends on the cause; ensure TURN is available and correctly configured.
TURN allocation exhaustion manifests when your TURN server runs out of relay allocations. Each concurrent connection through TURN consumes an allocation. Cloudflare's managed TURN scales automatically, but if you're running your own TURN infrastructure, monitor allocation counts. Symptoms: new connections fail while existing connections continue working.
Firewall state timeout affects long-running calls. Stateful firewalls track UDP sessions and drop state after inactivity (often 30 seconds). If media pauses (everyone mutes), the firewall may drop the session, requiring ICE restart. Symptoms: call drops after period of silence. Keep-alive packets mitigate this; RealtimeKit handles it automatically.
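RealtimeKit handles this recovery automatically, as noted above; on a raw connection, an ICE restart looks roughly like the sketch below, using the standard RTCPeerConnection API.

```ts
// Sketch: recovering a firewall-dropped session with an ICE restart.
const pc = new RTCPeerConnection(); // the call's existing peer connection

pc.addEventListener('connectionstatechange', () => {
  if (pc.connectionState === 'disconnected' || pc.connectionState === 'failed') {
    // Gathers fresh ICE candidates and renegotiates the network path.
    pc.restartIce();
    // A negotiationneeded event follows; re-run the offer/answer exchange
    // over your signalling channel to complete the restart.
  }
});
```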
Quality degradation
Bandwidth collapse occurs when available bandwidth drops below minimum viable quality. The codec can't reduce bitrate further without becoming unwatchable. Symptoms: extreme pixelation, then freezing, then disconnection. The only fix is more bandwidth or disabling video.
Jitter buffer overflow happens when network jitter exceeds the buffer's capacity to smooth it. Packets arrive too irregularly to maintain smooth playback. Symptoms: choppy audio, stuttering video. Increasing buffer size helps but adds latency; the trade-off is application-specific.
Asymmetric quality manifests when one participant has good upload but poor download (or vice versa). They can send clearly but receive poorly. Symptoms: "I can see you fine but you're frozen on my screen." Debugging requires checking both directions independently.
Application-level failures
Permission persistence problems occur when browsers revoke media permissions unexpectedly. Safari on iOS revokes permissions when the page backgrounds. Chrome may prompt again after browser updates. Symptoms: camera or microphone stops working mid-call. Handle the permission-denied error (NotAllowedError) gracefully and prompt users to re-grant.
Device switching failure happens when a user selects a new camera or microphone and the switch fails silently. Common with Bluetooth devices that disconnect unexpectedly. Symptoms: user thinks they switched devices but audio still comes from the old source. Always verify device changes took effect.
Session state desynchronisation occurs when your application's state diverges from the actual session state. Participants your app thinks are present have actually left; tracks your app thinks are active have ended. Symptoms: ghost participants, stale video. Reconcile state on reconnection and handle all participant/track events.
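A sketch of that reconciliation on reconnect. It assumes the SDK exposes a current-participants collection (session.participants), which is an assumption beyond the earlier examples, and the tile helpers are hypothetical UI functions.

```ts
// Sketch: reconciling UI state with the session after a reconnect.
session.on('connectionStateChanged', (state) => {
  if (state !== 'connected') return;

  const liveIds = new Set(session.participants.map((p) => p.id));

  // Remove ghost participants the UI still shows but who have actually left.
  for (const id of renderedParticipantIds()) {
    if (!liveIds.has(id)) removeParticipantTile(id);
  }

  // Render anyone who joined while we were reconnecting.
  for (const participant of session.participants) {
    if (!renderedParticipantIds().includes(participant.id)) {
      addParticipantTile(participant);
    }
  }
});
```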
The debugging playbook
- Check connectivity first. Can both peers reach the signalling server? Can they reach TURN? Are ICE candidates being exchanged? webrtc-internals in Chrome shows all of this.
- Examine network stats. RealtimeKit exposes RTCPeerConnection stats. Look for packet loss, jitter, and round-trip time. High packet loss (>5%) indicates network problems. High jitter indicates unstable network. High RTT indicates geographic distance or congested paths.
- Verify media tracks. Are tracks actually producing frames? A muted track still exists but produces silence. A failed track produces nothing. Check track.readyState and monitor for the ended event.
- Test in isolation. Connect two browsers on the same machine. If that fails, the problem is configuration. If that works, add network complexity incrementally until you reproduce the failure.
Operational considerations
Running real-time media in production requires monitoring and practices beyond typical web application operations.
Monitoring quality
Quality problems don't always surface as errors. A call might "work" but sound terrible. Monitor:
Media quality metrics: RTCPeerConnection provides statistics on packets sent/received, packets lost, jitter, and round-trip time. Export these to your observability platform. Alert when packet loss exceeds 5% or jitter exceeds 100ms.
Call quality score: Derive a composite score from network metrics. Many applications use Mean Opinion Score (MOS) estimation based on packet loss and latency. A MOS below 3.5 indicates noticeably degraded quality.
Connection success rate: Track what percentage of join attempts result in working media. A success rate below 95% warrants investigation.
Time to first frame: How long from clicking "join" until the user sees video? Target under 3 seconds. Longer delays suggest connection issues or slow backend authentication.
// Example: collecting stats for monitoring
setInterval(async () => {
const stats = await session.getStats();
metrics.gauge('rtc.packets_lost', stats.packetsLost);
metrics.gauge('rtc.jitter_ms', stats.jitter * 1000);
metrics.gauge('rtc.rtt_ms', stats.roundTripTime * 1000);
// Estimate MOS (simplified)
const effectiveLoss = stats.packetsLost + (stats.jitter * 100);
const mos = Math.max(1, Math.min(5, 4.5 - (effectiveLoss * 0.1)));
metrics.gauge('rtc.mos_estimate', mos);
}, 5000);
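Time to first frame can be measured with the same plumbing. The join timestamp and the video element's loadeddata event are standard browser APIs; attach() and metrics.gauge are the same hypothetical helpers used in the examples above.

```ts
// Sketch: measuring time to first frame for the first remote participant.
const joinStartedAt = performance.now();

session.on('participantJoined', (participant) => {
  participant.videoTrack?.attach(videoElement);
  videoElement.addEventListener('loadeddata', () => {
    const ttffMs = performance.now() - joinStartedAt;
    metrics.gauge('rtc.time_to_first_frame_ms', ttffMs); // alert if this creeps past ~3s
  }, { once: true });
});
```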
Capacity planning
Cloudflare handles SFU scaling automatically, but your backend systems need capacity planning:
Session management: Each active session requires state. A Durable Object per session scales naturally, but estimate your concurrent session count for database and cache sizing.
Authentication tokens: Token generation happens on every join. If you're calling external auth services, ensure they handle peak load.
Recording storage: If recording is enabled, estimate storage growth. A one-hour 720p recording is roughly 1-2 GB. A busy platform generating thousands of hours of recordings monthly needs terabytes of storage and a retention policy.
Post-processing pipelines: If you're transcribing, analysing, or transforming recordings, those workloads scale with call volume. Queue-based processing (Chapter 8) prevents overload during peak times.
Testing real-time applications
Real-time media quality depends heavily on network conditions. Testing on a developer's fast, stable connection misses problems users will encounter.
Network condition simulation: Chrome DevTools and WebRTC testing tools allow simulating packet loss, latency, and bandwidth constraints. Test at 3% packet loss and 200ms latency minimum; that's a normal mobile connection.
Geographic testing: Use actual connections from different regions. A VPN or cloud VM in distant locations reveals latency issues local testing misses.
Scale testing: Join many participants to a single call. Quality often degrades non-linearly; a call working fine with 5 participants may struggle at 15.
Device testing: Test on actual mobile devices, not just browser simulators. iOS Safari and Android Chrome have different WebRTC implementations with different quirks.
A complete example: video consultation platform
Bringing the pieces together: a telehealth platform supporting video consultations between patients and doctors, with recording for medical records and AI-assisted transcription.
Architecture overview
Patients and doctors join through RealtimeKit clients. A ConsultationSession Durable Object per consultation handles authorisation and lifecycle, Cloudflare Realtime carries the media, recordings land in R2, and a queue consumer transcribes them with Workers AI and updates the medical record.
Session coordination
The Durable Object manages session lifecycle, participant permissions, and integration with external systems.
import { DurableObject } from 'cloudflare:workers';
export class ConsultationSession extends DurableObject {
private state: ConsultationState;
constructor(ctx: DurableObjectState, env: Env) {
super(ctx, env);
this.state = {
consultationId: null,
participants: new Map(),
status: 'waiting',
startedAt: null,
recordingEnabled: false
};
}
async initialise(consultationId: string, config: ConsultationConfig) {
this.state.consultationId = consultationId;
this.state.recordingEnabled = config.recordingEnabled;
// Create Realtime session
const realtimeSession = await this.env.REALTIME.createSession({
sessionId: consultationId,
recording: config.recordingEnabled ? {
outputPath: `consultations/${consultationId}/`,
format: 'mp4'
} : undefined
});
await this.ctx.storage.put('state', this.state);
return { sessionId: consultationId, realtimeToken: realtimeSession.token };
}
async joinAsPatient(patientId: string): Promise<JoinResult> {
// Validate patient is authorised for this consultation
const consultation = await this.fetchConsultationDetails();
if (consultation.patientId !== patientId) {
throw new Error('Unauthorised');
}
const token = await this.generateParticipantToken(patientId, 'patient');
this.state.participants.set(patientId, { role: 'patient', joinedAt: Date.now() });
await this.ctx.storage.put('state', this.state);
return { token, role: 'patient' };
}
async joinAsDoctor(doctorId: string): Promise<JoinResult> {
// Validate doctor is assigned to this consultation
const consultation = await this.fetchConsultationDetails();
if (consultation.doctorId !== doctorId) {
throw new Error('Unauthorised');
}
// Doctor joining starts the consultation
if (this.state.status === 'waiting') {
this.state.status = 'active';
this.state.startedAt = Date.now();
}
const token = await this.generateParticipantToken(doctorId, 'doctor');
this.state.participants.set(doctorId, { role: 'doctor', joinedAt: Date.now() });
await this.ctx.storage.put('state', this.state);
return { token, role: 'doctor' };
}
async endConsultation(endedBy: string) {
this.state.status = 'ended';
const duration = Date.now() - this.state.startedAt;
// Queue post-processing
if (this.state.recordingEnabled) {
await this.env.PROCESSING_QUEUE.send({
type: 'consultation_ended',
consultationId: this.state.consultationId,
recordingPath: `consultations/${this.state.consultationId}/`,
duration
});
}
// Update external EHR
await this.updateEHR({
consultationId: this.state.consultationId,
status: 'completed',
duration,
participants: Array.from(this.state.participants.entries())
});
await this.ctx.storage.put('state', this.state);
}
}
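The Durable Object needs an entry point. Below is a minimal sketch of the Worker that routes requests to the right ConsultationSession instance; the CONSULTATION_SESSION binding name and the /consultations/:id/:action URL shape are illustrative, not part of the class above.

```ts
interface Env {
  CONSULTATION_SESSION: DurableObjectNamespace<ConsultationSession>;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    // Expected shape: /consultations/:id/:action
    const [, , consultationId, action] = url.pathname.split('/');
    if (!consultationId) return new Response('Not found', { status: 404 });

    // The same consultation id always maps to the same Durable Object instance.
    const stub = env.CONSULTATION_SESSION.get(
      env.CONSULTATION_SESSION.idFromName(consultationId),
    );

    if (action === 'join') {
      const { userId, role } = await request.json<{ userId: string; role: string }>();
      const result = role === 'doctor'
        ? await stub.joinAsDoctor(userId)
        : await stub.joinAsPatient(userId);
      return Response.json(result);
    }

    if (action === 'end') {
      const { userId } = await request.json<{ userId: string }>();
      await stub.endConsultation(userId);
      return Response.json({ ok: true });
    }

    return new Response('Not found', { status: 404 });
  },
};
```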
Post-processing pipeline
Recordings trigger a queue consumer that handles transcription and medical record updates.
export default {
async queue(batch: MessageBatch, env: Env) {
for (const message of batch.messages) {
const { type, consultationId, recordingPath } = message.body;
if (type === 'consultation_ended') {
// Fetch recording from R2
const recording = await env.RECORDINGS.get(`${recordingPath}recording.mp4`);
if (recording) {
// Transcribe with Workers AI
const transcription = await env.AI.run('@cf/openai/whisper', {
audio: [...new Uint8Array(await recording.arrayBuffer())]
});
// Store transcription
await env.D1.prepare(`
INSERT INTO consultation_records (consultation_id, transcription, created_at)
VALUES (?, ?, datetime('now'))
`).bind(consultationId, transcription.text).run();
// Generate summary (optional)
const summary = await env.AI.run('@cf/meta/llama-3-8b-instruct', {
prompt: `Summarise this medical consultation transcript for clinical notes:\n\n${transcription.text}`
});
await env.D1.prepare(`
UPDATE consultation_records SET summary = ? WHERE consultation_id = ?
`).bind(summary.response, consultationId).run();
}
}
// Ack regardless of message type so unrecognised messages aren't retried forever.
message.ack();
}
}
};
This architecture separates concerns cleanly: Durable Objects handle coordination and business logic, Realtime handles media transport, R2 stores recordings, Queues handle async processing, and Workers AI provides intelligence. Each primitive does what it does best.
Comparing to hyperscaler alternatives
Understanding what you're choosing between clarifies the decision.
| Aspect | Cloudflare Realtime | AWS Chime SDK | Azure Communication Services | Twilio Video |
|---|---|---|---|---|
| Global presence | 300+ PoPs, anycast | 14 regions | 18 regions | 7 regions |
| Pricing model | Per-GB egress | Per-minute | Per-minute | Per-minute |
| PSTN integration | No | Yes | Yes | Yes |
| Recording | Built-in to R2 | S3 integration | Azure Blob | Composition API |
| Client SDKs | JS, React Native, Flutter, iOS, Android | JS, iOS, Android, React Native | JS, iOS, Android | JS, iOS, Android, React Native |
| Max participants | 1,000 (SFU mode) | 250 | 350 | 50 (group rooms) |
| Minimum latency | Lower (anycast) | Regional | Regional | Regional |
| Platform maturity | Newer (2024) | Established (2020) | Established (2020) | Established (2017) |
When hyperscalers win
PSTN requirements: If users need to dial in via phone number, AWS Chime and Azure Communication Services offer integrated PSTN. Cloudflare Realtime does not; you'd need Twilio or similar for phone connectivity.
Existing AWS/Azure investment: If you're deeply integrated with AWS or Azure, the native video services integrate more smoothly with your existing authentication, storage, and monitoring.
Enterprise compliance: Large enterprise deployments often require specific certifications (FedRAMP, HIPAA BAA). Verify Cloudflare's current certification status against your requirements. The hyperscalers have longer compliance track records.
Predictable budgeting: Per-minute pricing is easier to budget. You know exactly what a 100,000-minute month costs. Bandwidth pricing varies with quality and TURN usage, making cost prediction harder.
When Cloudflare wins
Globally distributed users: If participants span continents, anycast routing provides better experience than choosing a single regional server.
Variable quality needs: If calls range from audio-only to high-definition video, bandwidth pricing rewards efficiency. Hyperscaler per-minute rates don't distinguish.
Integrated platform: If you're already using Workers, Durable Objects, R2, and Workers AI, Realtime integrates seamlessly. No cross-cloud networking, no separate billing, no additional authentication systems.
Cost at scale for audio: Audio-heavy applications (voice agents, conferencing without video) cost dramatically less on Cloudflare than per-minute alternatives.
Making the decision
The decision framework for real-time features:
| Question | If Yes | If No |
|---|---|---|
| Does the application involve audio or video between participants? | Evaluate Realtime | Use Durable Objects + WebSockets |
| Do users need to dial in via phone? | Use Chime/Azure/Twilio | Cloudflare viable |
| Are users geographically distributed? | Cloudflare's anycast helps | Regional SFU may suffice |
| Is the application primarily audio? | Cloudflare likely cheaper | Cost differences smaller |
| Do you need specific compliance certifications? | Verify Cloudflare status | Less concern |
| Are you already on Cloudflare's platform? | Integration is simpler | Evaluate independently |
Start with the simplest option that works. If your "real-time" feature involves text, JSON, or application state, Durable Objects with WebSockets almost certainly suffice. Don't add media infrastructure complexity for problems that don't require it.
Prototype before committing. The free tier (1,000 GB) supports meaningful prototyping. Build something, test with real users across real networks, measure actual latency and quality. Real-time media quality varies significantly with implementation details; benchmarks matter more than specifications.
Plan for operational complexity. Real-time media is harder to operate than request/response applications. Budget for monitoring, debugging tools, and user support for "the call isn't working" issues. This complexity exists regardless of provider choice.
What comes next
This chapter covered real-time audio and video infrastructure: when you need it, how Cloudflare's anycast SFU architecture provides advantages over traditional approaches, the operational realities of media applications, and how to evaluate provider choices.
The next chapter begins Part IV: Data & Storage, starting with how to choose the right storage primitive for your data. The coordination patterns from Durable Objects (Chapter 6), the compute options (Workers, Containers), and now real-time media combine with storage choices to form complete architectures. Real-time media adds another dimension but follows the same principle: match the primitive to the problem, understand its constraints, and design for production from the start.