Chapter 10: Realtime: Audio and Video at the Edge
How do I build applications with real-time audio and video?
Real-time means different things to different applications: a collaborative document editor needs sub-second text synchronisation, a stock trading platform needs millisecond price updates, and a video call needs continuous audio and video streams with latency low enough that conversation feels natural. These diverse problems require different solutions.
Durable Objects with WebSockets handle the first two cases elegantly with single-threaded coordination, global routing, and persistent connections (as Chapter 6 covered). But audio and video streaming is fundamentally different because the protocols differ (WebRTC rather than WebSockets), the infrastructure differs (media servers rather than application servers), and the operational complexity differs (NAT traversal, codec negotiation, adaptive bitrate). Cloudflare Realtime exists because Durable Objects are not designed for media.
WebSockets guarantee delivery and ordering, which matters for application state but adds latency. WebRTC prioritises low latency, accepting that some packets will never arrive. A dropped chat message is unacceptable; a dropped video frame is invisible. If your application involves voice or video between participants, you need WebRTC infrastructure regardless of how you optimise WebSockets.
This chapter covers when you need Cloudflare Realtime versus when Durable Objects suffice, explains why anycast SFUs change the trade-offs, describes the operational realities of running real-time media in production, and shows how to evaluate whether Cloudflare fits your requirements.
Two kinds of real-time
Understanding the distinction between data real-time and media real-time shapes every architectural decision you make.
Data real-time involves text, JSON, binary messages, and application state where latency tolerance ranges from tens of milliseconds to a few seconds depending on the use case. A chat message arriving 200ms late is imperceptible, and a collaborative cursor position updating 500ms late is noticeable but acceptable. WebSockets over TCP provide reliable, ordered delivery while Durable Objects coordinate state across participants, and the infrastructure is straightforward: your application logic runs in Workers and Durable Objects with persistent connections and messages flowing naturally.
Media real-time involves continuous audio and video streams where latency tolerance is measured in tens of milliseconds; beyond 150ms, conversation becomes awkward as participants talk over each other, and beyond 300ms, it feels broken. WebRTC provides the transport with UDP for speed (accepting occasional packet loss rather than waiting for retransmission), DTLS for encryption, and adaptive codecs that respond to network conditions. The infrastructure is specialised: Selective Forwarding Units (SFUs) route media between participants, TURN servers relay traffic when direct connections fail, and the entire system must operate closer to the network layer than typical application code.
The protocol reality
WebRTC and WebSockets serve fundamentally different purposes even though both enable "real-time" communication.
WebSockets run over TCP, where every packet arrives in order or the connection blocks waiting. That guarantee is essential for application state: you cannot skip a database write or drop a user action. But for media, waiting for a retransmission adds latency that compounds, and a video frame arriving 200ms late is worse than a frame that never arrives: the late frame causes visible stuttering, while a dropped frame is interpolated invisibly.
WebRTC runs primarily over UDP with DTLS encryption where packets arrive when they arrive and lost packets stay lost. The codec compensates: audio codecs use forward error correction to reconstruct missing data while video codecs use techniques like temporal scalability to gracefully degrade. A 2% packet loss rate that would cripple a WebSocket application is imperceptible in a well-implemented WebRTC call.
This isn't a choice between good and better protocols. It's a choice between protocols designed for different problems. Using WebSockets for video is like using a database for caching: technically possible, architecturally wrong.
When Durable Objects suffice
Many "real-time" applications don't actually need media infrastructure, so before reaching for Cloudflare Realtime, you should confirm that Durable Objects with WebSockets can't solve your problem.
Text chat and messaging: WebSockets through Durable Objects handle this perfectly. Messages are small, latency tolerance is generous, and the coordination model (one Durable Object per conversation) maps naturally to the domain. Cloudflare Realtime adds cost and complexity with no benefit.
Collaborative editing: Whether you're synchronising document state, whiteboard drawings, or design tool operations, WebSockets suffice. The data volumes are modest, sub-second latency is acceptable, and the interesting problems (conflict resolution, operational transformation) live in your application logic, not the transport layer.
Live dashboards and data feeds: Stock tickers, sports scores, IoT sensor readings, and analytics dashboards involve one-way data flow from server to many clients. WebSockets handle fan-out efficiently. Durable Objects coordinate the source of truth.
Multiplayer game state: Turn-based games, strategy games, and even many action games synchronise state through WebSockets. The game logic determines what "real-time" means; unless you're building a first-person shooter requiring sub-50ms latency, Durable Objects likely suffice.
Presence and cursor tracking: Showing who's online and where their cursor is involves small, frequent updates. WebSockets handle this trivially. The coordination challenge (ensuring all participants see consistent presence) is exactly what Durable Objects solve.
For all these cases, Chapter 6's patterns apply directly: create a Durable Object per logical entity (conversation, document, game session), connect participants via WebSockets, and let single-threaded execution handle coordination. The infrastructure is simpler, the cost is lower, and you maintain full control of the application logic.
When you need Cloudflare Realtime
Cloudflare Realtime becomes necessary when your application involves actual audio or video between participants rather than just data synchronisation.
Video conferencing: Multiple participants sending and receiving video streams. Each participant's video must reach every other participant with minimal latency. An SFU routes these streams efficiently; without one, you'd need mesh networking where each participant connects directly to every other participant. Mesh works for three or four people but bandwidth requirements grow quadratically: five participants means each sends four streams and receives four streams. By ten participants, the approach collapses entirely.
Voice calling: Audio-only calls have lower bandwidth requirements but the same latency sensitivity because conversation requires round-trip latency under 300ms to feel natural (ideally under 150ms). At 400ms, participants constantly interrupt each other, and at 500ms, they give up and switch to asynchronous communication.
Live streaming with interaction: A presenter broadcasting to an audience, with the audience able to respond via voice or video. The broadcast might tolerate higher latency (1-2 seconds is common for live streams), but interactive segments need the same low latency as a call.
AI voice agents: Applications where users speak to an AI and receive spoken responses. The pipeline runs speech-to-text, generates a response, and returns text-to-speech; low latency at every stage makes the interaction feel conversational rather than like talking to a voicemail system. End-to-end latency under 500ms is achievable with careful architecture.
Telehealth and remote assistance: Video calls with specific requirements around quality, reliability, and sometimes recording. Medical consultations, technical support with screen sharing, and remote expert guidance all need robust video infrastructure.
For these use cases, Durable Objects with WebSockets cannot help. You need WebRTC infrastructure: media servers, TURN relays, and client SDKs that handle codec negotiation, adaptive bitrate, and network condition changes.
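The bandwidth arithmetic behind the mesh-versus-SFU comparison in the video conferencing case above is easy to make concrete. A minimal sketch, with the per-stream bitrate as an assumed input:

```ts
// Sketch: per-participant upload bandwidth, mesh versus SFU.
// bitrateMbps is the assumed per-stream video bitrate (e.g. 2 Mbps for 720p).
function uploadMbps(participants: number, bitrateMbps: number) {
  return {
    // Mesh: each participant sends a separate copy of their stream to every peer.
    mesh: (participants - 1) * bitrateMbps,
    // SFU: each participant sends a single stream; the SFU fans it out.
    sfu: bitrateMbps,
  };
}

console.log(uploadMbps(5, 2));  // { mesh: 8, sfu: 2 }
console.log(uploadMbps(10, 2)); // { mesh: 18, sfu: 2 } — where mesh collapses
```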
The anycast SFU architecture
Traditional video infrastructure requires choosing where to place your media servers. AWS Chime operates in 14 regions. Azure Communication Services operates in 18. You provision capacity, select regions based on where you expect users, and accept that some users will connect to distant servers with higher latency.
Cloudflare Realtime works differently. The SFU runs on Cloudflare's entire network: over 300 locations worldwide. When a participant connects, BGP routing automatically directs them to the nearest location. There's no region selection, no capacity planning, no trade-off between cost and coverage.
This isn't just operational convenience. The architecture changes what's possible.
The traditional SFU problem
In a traditional SFU deployment, all participants in a call connect to the same server. If three people in London and one person in Sydney join a call, the SFU must be somewhere. Place it in London, and the Sydney participant has 300ms round-trip latency. Place it in Sydney, and the London participants suffer. Place it somewhere in between, and everyone has mediocre latency.
Some systems try to solve this by selecting SFU location dynamically based on the first participant or the geographic centre of all participants. This helps when participants cluster geographically, but creates suboptimal routing when they don't, and "geographic centre" between London and Sydney is somewhere in the Indian Ocean where no data centres exist.
Cloudflare's approach
Cloudflare's anycast architecture eliminates this trade-off. Each participant connects to their nearest Cloudflare location. The London participants connect to London; the Sydney participant connects to Sydney. Media flows between Cloudflare locations over Cloudflare's backbone network, which is optimised for exactly this traffic pattern.
The result: every participant gets optimal latency to their nearest point of presence, regardless of where other participants are. The geographic distribution of participants doesn't degrade anyone's experience.
This architecture also simplifies scaling. Traditional SFUs require load balancing, capacity monitoring, and scaling policies. Cloudflare's approach distributes load automatically across locations. A server in one location handles only the participants nearest to it; adding participants in other locations adds load to other servers, not the same server.
Cascading for large calls
For calls with many participants, Cloudflare uses cascading: media flows through a tree of servers rather than a single hub. Participants in London connect to London servers; those servers cascade to servers handling other regions. This keeps latency low while enabling calls with hundreds of participants.
The cascading topology forms automatically based on participant locations. You don't configure it; you don't even see it. From your application's perspective, participants join a call and can see and hear each other. The platform handles routing optimisation invisibly.
Latency characteristics
Measured latency in production deployments:
| Scenario | Typical Latency |
|---|---|
| Same city, both participants | 30-50ms round-trip |
| Same continent, different cities | 50-100ms round-trip |
| Cross-continent (US to Europe) | 100-150ms round-trip |
| Cross-continent (US to Asia) | 150-200ms round-trip |
| Worst case (distant regions) | 200-300ms round-trip |
These figures represent end-to-end media latency including encoding, network transit, and decoding. They're achievable because each participant connects to nearby infrastructure; the inter-region latency affects only the backbone hop, not the participant connection.
TURN: solving NAT traversal
WebRTC wants peer-to-peer connections. In practice, NATs and firewalls often prevent direct connectivity. TURN (Traversal Using Relays around NAT) servers act as intermediaries, relaying traffic when direct connection fails.
Without TURN, roughly 8-15% of connection attempts fail depending on network conditions. Users behind corporate firewalls, symmetric NATs, or certain mobile networks simply cannot establish direct connections. For a consumer application, losing 10% of users to connectivity failures is unacceptable.
Cloudflare provides TURN as part of Realtime. The same anycast architecture applies: TURN allocations route to the nearest Cloudflare location. The service handles the complexity of maintaining relay allocations, managing credentials, and ensuring connectivity.
STUN vs TURN
STUN (Session Traversal Utilities for NAT) helps peers discover their public IP addresses and attempt direct connection. It's lightweight and free on Cloudflare (stun.cloudflare.com). Most connections succeed with STUN alone: the peers discover their public addresses, exchange candidates through signalling, and establish direct media flow.
TURN relays actual media traffic when STUN-facilitated direct connection fails. This happens when both peers are behind symmetric NATs, when firewall rules block UDP, or when network policies prevent direct peer connections. TURN consumes bandwidth because it relays all media, and it costs money ($0.05 per GB).
The fallback sequence: try direct connection first, fall back to TURN only when necessary. Your application doesn't control this; the WebRTC stack handles it automatically. But understanding the sequence helps when debugging connectivity issues and predicting costs.
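What that fallback sequence looks like in configuration: the standard WebRTC ICE setup below lists STUN first and TURN as the relay fallback. The stun.cloudflare.com endpoint is the free one mentioned above; the TURN URL and credentials are placeholders, since in practice your backend issues short-lived TURN credentials and passes them to the client.

```ts
// Placeholder credentials — assume your backend issues these per session.
const turnUsername = '<issued-by-backend>';
const turnCredential = '<issued-by-backend>';

const pc = new RTCPeerConnection({
  iceServers: [
    // Cloudflare's free STUN endpoint; 3478 is the standard STUN port.
    { urls: 'stun:stun.cloudflare.com:3478' },
    // Illustrative TURN relay entry — only used when direct connection fails.
    {
      urls: 'turns:turn.example.com:443?transport=tcp',
      username: turnUsername,
      credential: turnCredential,
    },
  ],
  // 'all' (the default) tries direct candidates before falling back to relay;
  // 'relay' forces TURN, which is handy for testing the fallback path.
  iceTransportPolicy: 'all',
});
```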
WebRTC uses DTLS encryption end-to-end. When traffic routes through TURN, Cloudflare relays encrypted packets without access to the media content. This is inherent to the protocol, not a Cloudflare-specific implementation choice.
Estimating TURN usage
TURN usage varies dramatically by user base. Consumer applications with users on home networks: 5-10% TURN usage. Enterprise applications with users behind corporate firewalls: 20-40% TURN usage. Applications used primarily on mobile networks: 10-20% TURN usage.
For cost estimation, assume 15% of your traffic will route through TURN unless you have specific knowledge of your user base. If you're building for enterprise, budget for 30%. These estimates affect both cost projections and capacity planning.
RealtimeKit: the client SDK
Building WebRTC applications from scratch requires handling codec negotiation, network condition monitoring, audio level detection, camera and microphone permissions, connection state management, reconnection logic, and platform-specific quirks across browsers and mobile operating systems. RealtimeKit handles this complexity.
The SDK is available for JavaScript (web), React Native, Flutter, iOS (Swift), and Android (Kotlin). It provides two layers:
Core SDK: APIs for joining sessions, managing media tracks, handling events. You build your own UI but don't implement WebRTC negotiation, ICE candidate handling, or codec selection.
UI Kit: Pre-built components for common patterns. Video grids, participant lists, control bars. Useful for rapid prototyping or applications where custom UI isn't a differentiator.
Basic integration
import { RealtimeKit } from '@cloudflare/realtimekit';
// Initialize with authentication token from your backend
const rtk = new RealtimeKit({
token: authToken,
// Optional: configure quality preferences
videoConstraints: {
width: { ideal: 1280 },
height: { ideal: 720 },
frameRate: { ideal: 30 }
}
});
// Join a session
const session = await rtk.join({
sessionId: 'room-123',
displayName: 'Jamie'
});
// Access local media
await session.enableCamera();
await session.enableMicrophone();
// Handle remote participants
session.on('participantJoined', (participant) => {
console.log(`${participant.displayName} joined`);
// Attach video to DOM element
participant.videoTrack?.attach(videoElement);
participant.audioTrack?.attach(audioElement);
});
session.on('participantLeft', (participant) => {
console.log(`${participant.displayName} left`);
});
Connection state management
Real-world networks are unreliable. Connections drop, quality degrades, and users switch between WiFi and cellular. RealtimeKit exposes connection state for your application to respond appropriately.
session.on('connectionStateChanged', (state) => {
switch (state) {
case 'connecting':
showStatus('Connecting...');
break;
case 'connected':
hideStatus();
break;
case 'reconnecting':
showStatus('Connection lost, reconnecting...');
break;
case 'disconnected':
showError('Disconnected. Please check your network.');
break;
case 'failed':
showError('Connection failed. Unable to join call.');
break;
}
});
The SDK handles reconnection automatically for transient network issues. Your application receives state updates to inform the user. For persistent failures, the SDK eventually transitions to failed state, and your application should offer to retry or fall back gracefully.
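A sketch of what "offer to retry" might look like, reusing the hypothetical names from the examples above (rtk.join, showStatus, showError) and assuming session was declared with let so it can be replaced:

```ts
// Sketch: retry with backoff after the SDK reports a terminal failure.
let retryAttempts = 0;

session.on('connectionStateChanged', async (state) => {
  if (state !== 'failed') return;

  if (retryAttempts >= 3) {
    showError('Unable to reconnect. Please check your network and rejoin.');
    return;
  }

  retryAttempts += 1;
  const delayMs = 2000 * 2 ** (retryAttempts - 1); // 2s, 4s, 8s
  showStatus(`Connection failed, retrying in ${delayMs / 1000}s...`);
  await new Promise((resolve) => setTimeout(resolve, delayMs));

  // Rejoin and re-register event handlers on the new session object.
  session = await rtk.join({ sessionId: 'room-123', displayName: 'Jamie' });
});
```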
Adaptive quality
Network conditions change during a call. A user on WiFi walks to another room and loses signal. A mobile user enters an elevator. A home user's family member starts streaming video. RealtimeKit adapts automatically, but your application can respond to quality changes.
session.on('qualityChanged', (participant, quality) => {
// quality: 'excellent' | 'good' | 'fair' | 'poor'
if (quality === 'poor' && participant.isLocal) {
suggestDisablingVideo();
}
});
// Manual quality control
session.setVideoQuality('low'); // Reduce outgoing video quality
session.setVideoEnabled(false); // Disable video entirely
For bandwidth-constrained scenarios, simulcast allows receivers to choose quality independently. The sender transmits multiple quality layers; each receiver requests the layer matching their available bandwidth. This happens automatically but affects bandwidth usage: simulcast increases upload bandwidth by roughly 50% but enables better quality for receivers with good connections while maintaining stability for constrained receivers.
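RealtimeKit configures simulcast for you. For context, the equivalent with the browser's raw WebRTC API looks roughly like the sketch below; the layer names and bitrates are illustrative choices, not values from the SDK.

```ts
// Sketch: configuring simulcast layers with the standard WebRTC sendEncodings API.
const pc = new RTCPeerConnection();
const stream = await navigator.mediaDevices.getUserMedia({ video: true });
const [videoTrack] = stream.getVideoTracks();

pc.addTransceiver(videoTrack, {
  direction: 'sendonly',
  sendEncodings: [
    { rid: 'low',  scaleResolutionDownBy: 4, maxBitrate: 150_000 },  // ~180p
    { rid: 'mid',  scaleResolutionDownBy: 2, maxBitrate: 500_000 },  // ~360p
    { rid: 'high', maxBitrate: 2_500_000 },                          // full resolution
  ],
});
// The SFU forwards whichever layer each receiver's bandwidth can sustain.
```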
Error handling
Media applications fail in ways that web applications don't. Camera permissions denied, microphone in use by another application, hardware failures, browser restrictions. Handle these explicitly.
try {
await session.enableCamera();
} catch (error) {
if (error.name === 'NotAllowedError') {
showMessage('Camera permission denied. Enable in browser settings.');
} else if (error.name === 'NotFoundError') {
showMessage('No camera found. Please connect a camera.');
} else if (error.name === 'NotReadableError') {
showMessage('Camera in use by another application.');
} else {
showMessage('Unable to access camera.');
console.error('Camera error:', error);
}
}
Device enumeration helps users select the right input:
const devices = await navigator.mediaDevices.enumerateDevices();
const cameras = devices.filter(d => d.kind === 'videoinput');
const microphones = devices.filter(d => d.kind === 'audioinput');
// Let user select, then apply
await session.setCamera(selectedCameraId);
await session.setMicrophone(selectedMicrophoneId);
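Devices also appear and disappear mid-call (a Bluetooth headset disconnecting, a webcam being unplugged). The devicechange event is the standard way to notice; setMicrophone reuses the hypothetical SDK method above, and selectedMicrophoneId is the id chosen in the previous snippet.

```ts
// Sketch: falling back to another microphone when the current one disappears.
navigator.mediaDevices.addEventListener('devicechange', async () => {
  const devices = await navigator.mediaDevices.enumerateDevices();
  const microphones = devices.filter((d) => d.kind === 'audioinput');

  const currentGone = !microphones.some((d) => d.deviceId === selectedMicrophoneId);
  if (currentGone && microphones.length > 0) {
    await session.setMicrophone(microphones[0].deviceId);
    showMessage(`Microphone switched to ${microphones[0].label}`);
  }
});
```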
Recording
Recording video calls typically requires running a headless browser or custom media pipeline to capture streams. RealtimeKit includes built-in recording: enable it for a session, and recordings appear in R2 storage.
// Enable recording when creating the session (server-side)
const session = await realtimeApi.createSession({
recording: {
enabled: true,
outputPath: `recordings/${sessionId}/`,
format: 'mp4'
}
});
The recording pipeline runs on Cloudflare's infrastructure. Recordings include all participants' audio and video composited into a single file, or optionally separate tracks per participant for post-production flexibility.
This matters for compliance-driven applications (financial services requiring call recording), education (lecture capture), and content creation (podcast recording). What would otherwise require significant infrastructure becomes a configuration option.
Cost model and estimation
Real-time media pricing varies significantly across providers. Understanding the cost model prevents surprises.
Cloudflare pricing structure
Cloudflare Realtime charges based on bandwidth:
| Component | Price | Free Tier |
|---|---|---|
| SFU egress | $0.05/GB | 1,000 GB/month |
| TURN relay | $0.05/GB | Included in SFU allowance |
| Recording storage | R2 pricing | R2 free tier applies |
The bandwidth-based model creates different economics than per-minute pricing. Audio-only calls (roughly 100 Kbps per participant) cost dramatically less than video calls (1-4 Mbps per participant depending on resolution). Applications that adapt quality based on conditions naturally optimise cost.
Estimating bandwidth
For a typical video call:
| Quality | Bitrate | GB per hour (one direction) |
|---|---|---|
| Audio only | 50-100 Kbps | 0.023-0.045 GB |
| 360p video | 400-600 Kbps | 0.18-0.27 GB |
| 720p video | 1.5-2.5 Mbps | 0.68-1.13 GB |
| 1080p video | 3-5 Mbps | 1.35-2.25 GB |
For a four-person call at 720p for one hour, each participant sends one stream (to the SFU) and receives three streams (from the SFU). Total bandwidth per participant: roughly 3-4.5 GB. Total SFU egress for the call: roughly 12-18 GB, costing $0.60-$0.90.
At scale, a rough formula for monthly cost (call duration in hours, bitrate in bits per second, and streams per participant roughly equal to the participant count, i.e. one sent plus N−1 received, as in the four-person example above):
Monthly Cost ≈ (Calls Per Month × Avg Participants × Streams Per Participant × Bitrate × Avg Call Duration × 3600) / (8 × 1024³) × $0.05
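Turning that into code makes the inputs explicit. A sketch, with the 15% TURN share from the earlier estimate folded in as a default:

```ts
// Sketch: monthly SFU cost estimate following the formula above.
function estimateMonthlyCost(opts: {
  callsPerMonth: number;
  participants: number;
  durationHours: number;  // average call length
  bitrateBps: number;     // per-stream bitrate, e.g. 2_000_000 for 720p
  turnFraction?: number;  // share of traffic relayed via TURN
}): { egressGB: number; costUSD: number } {
  const { callsPerMonth, participants, durationHours, bitrateBps, turnFraction = 0.15 } = opts;
  const streamsPerParticipant = participants; // one sent plus (N − 1) received
  const bytes =
    (callsPerMonth * participants * streamsPerParticipant *
      bitrateBps * durationHours * 3600) / 8;
  const egressGB = (bytes / 1024 ** 3) * (1 + turnFraction);
  return { egressGB, costUSD: egressGB * 0.05 };
}

// 10,000 four-person, one-hour calls at ~2 Mbps per stream:
// roughly 154,000 GB and ~$7,700 for the month with the default TURN share.
console.log(estimateMonthlyCost({
  callsPerMonth: 10_000, participants: 4, durationHours: 1, bitrateBps: 2_000_000,
}));
```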
Comparison with alternatives
| Provider | Pricing Model | 4-person, 1-hour call |
|---|---|---|
| Cloudflare Realtime | $0.05/GB egress | ~$0.60-0.90 |
| AWS Chime SDK | $0.0017/participant-minute | $0.41 |
| Azure Communication Services | $0.004/participant-minute | $0.96 |
| Twilio Video | $0.004/participant-minute (group) | $0.96 |
Cloudflare's cost varies with video quality. The others charge flat per-minute rates regardless of quality. For audio-only calls, Cloudflare is dramatically cheaper. For high-quality video, costs converge.
Factor in:
- TURN relay: 10-30% additional bandwidth for users requiring relay
- Recording: Storage (R2 pricing) plus processing bandwidth
- Transcoding: If you're re-encoding recordings, compute costs apply
- Development time: Newer SDKs may require more integration effort than mature alternatives
Failure modes worth naming
Real-time media systems fail in ways that differ from typical web applications. Naming these failures creates vocabulary for debugging and post-incident analysis.
Connectivity failures
ICE negotiation timeout occurs when peers cannot establish a connection within the negotiation window (typically 30 seconds). Common causes: both peers behind symmetric NATs with TURN unavailable, firewall blocking UDP entirely, or incorrect TURN credentials. Symptoms: connectionState never reaches connected. The fix depends on the cause; ensure TURN is available and correctly configured.
TURN allocation exhaustion manifests when your TURN server runs out of relay allocations. Each concurrent connection through TURN consumes an allocation. Cloudflare's managed TURN scales automatically, but if you're running your own TURN infrastructure, monitor allocation counts. Symptoms: new connections fail while existing connections continue working.
Firewall state timeout affects long-running calls. Stateful firewalls track UDP sessions and drop state after inactivity (often 30 seconds). If media pauses (everyone mutes), the firewall may drop the session, requiring ICE restart. Symptoms: call drops after period of silence. Keep-alive packets mitigate this; RealtimeKit handles it automatically.
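RealtimeKit handles this recovery automatically, as noted above; on a raw connection, an ICE restart looks roughly like the sketch below, using the standard RTCPeerConnection API.

```ts
// Sketch: recovering a firewall-dropped session with an ICE restart.
const pc = new RTCPeerConnection(); // the call's existing peer connection

pc.addEventListener('connectionstatechange', () => {
  if (pc.connectionState === 'disconnected' || pc.connectionState === 'failed') {
    // Gathers fresh ICE candidates and renegotiates the network path.
    pc.restartIce();
    // A negotiationneeded event follows; re-run the offer/answer exchange
    // over your signalling channel to complete the restart.
  }
});
```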
Quality degradation
Bandwidth collapse occurs when available bandwidth drops below minimum viable quality. The codec can't reduce bitrate further without becoming unwatchable. Symptoms: extreme pixelation, then freezing, then disconnection. The only fix is more bandwidth or disabling video.
Jitter buffer overflow happens when network jitter exceeds the buffer's capacity to smooth it. Packets arrive too irregularly to maintain smooth playback. Symptoms: choppy audio, stuttering video. Increasing buffer size helps but adds latency; the trade-off is application-specific.
Asymmetric quality manifests when one participant has good upload but poor download (or vice versa). They can send clearly but receive poorly. Symptoms: "I can see you fine but you're frozen on my screen." Debugging requires checking both directions independently.
Application-level failures
Permission persistence problems occur when browsers revoke media permissions unexpectedly. Safari on iOS revokes permissions when the page backgrounds. Chrome may prompt again after browser updates. Symptoms: camera or microphone stops working mid-call. Handle the permission-denied error (NotAllowedError) gracefully and prompt users to re-grant.
Device switching failure happens when a user selects a new camera or microphone and the switch fails silently. Common with Bluetooth devices that disconnect unexpectedly. Symptoms: user thinks they switched devices but audio still comes from the old source. Always verify device changes took effect.
Session state desynchronisation occurs when your application's state diverges from the actual session state. Participants your app thinks are present have actually left; tracks your app thinks are active have ended. Symptoms: ghost participants, stale video. Reconcile state on reconnection and handle all participant/track events.
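A sketch of that reconciliation on reconnect. It assumes the SDK exposes a current-participants collection (session.participants), which is an assumption beyond the earlier examples, and the tile helpers are hypothetical UI functions.

```ts
// Sketch: reconciling UI state with the session after a reconnect.
session.on('connectionStateChanged', (state) => {
  if (state !== 'connected') return;

  const liveIds = new Set(session.participants.map((p) => p.id));

  // Remove ghost participants the UI still shows but who have actually left.
  for (const id of renderedParticipantIds()) {
    if (!liveIds.has(id)) removeParticipantTile(id);
  }

  // Render anyone who joined while we were reconnecting.
  for (const participant of session.participants) {
    if (!renderedParticipantIds().includes(participant.id)) {
      addParticipantTile(participant);
    }
  }
});
```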
The debugging playbook
- Check connectivity first. Can both peers reach the signalling server? Can they reach TURN? Are ICE candidates being exchanged? webrtc-internals in Chrome shows all of this.
- Examine network stats. RealtimeKit exposes RTCPeerConnection stats. Look for packet loss, jitter, and round-trip time. High packet loss (>5%) indicates network problems. High jitter indicates unstable network. High RTT indicates geographic distance or congested paths.
- Verify media tracks. Are tracks actually producing frames? A muted track still exists but produces silence. A failed track produces nothing. Check track.readyState and monitor for the ended event.
- Test in isolation. Connect two browsers on the same machine. If that fails, the problem is configuration. If that works, add network complexity incrementally until you reproduce the failure.
Operational considerations
Running real-time media in production requires monitoring and practices beyond typical web application operations.
Monitoring quality
Quality problems don't always surface as errors. A call might "work" but sound terrible. Monitor:
Media quality metrics: RTCPeerConnection provides statistics on packets sent/received, packets lost, jitter, and round-trip time. Export these to your observability platform. Alert when packet loss exceeds 5% or jitter exceeds 100ms.
Call quality score: Derive a composite score from network metrics. Many applications use Mean Opinion Score (MOS) estimation based on packet loss and latency. A MOS below 3.5 indicates noticeably degraded quality.
Connection success rate: Track what percentage of join attempts result in working media. A success rate below 95% warrants investigation.
Time to first frame: How long from clicking "join" until the user sees video? Target under 3 seconds. Longer delays suggest connection issues or slow backend authentication.
// Example: collecting stats for monitoring
setInterval(async () => {
const stats = await session.getStats();
metrics.gauge('rtc.packets_lost', stats.packetsLost);
metrics.gauge('rtc.jitter_ms', stats.jitter * 1000);
metrics.gauge('rtc.rtt_ms', stats.roundTripTime * 1000);
// Estimate MOS (simplified)
const effectiveLoss = stats.packetsLost + (stats.jitter * 100);
const mos = Math.max(1, Math.min(5, 4.5 - (effectiveLoss * 0.1)));
metrics.gauge('rtc.mos_estimate', mos);
}, 5000);
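Time to first frame can be measured with the same plumbing. The join timestamp and the video element's loadeddata event are standard browser APIs; attach() and metrics.gauge are the same hypothetical helpers used in the examples above.

```ts
// Sketch: measuring time to first frame for the first remote participant.
const joinStartedAt = performance.now();

session.on('participantJoined', (participant) => {
  participant.videoTrack?.attach(videoElement);
  videoElement.addEventListener('loadeddata', () => {
    const ttffMs = performance.now() - joinStartedAt;
    metrics.gauge('rtc.time_to_first_frame_ms', ttffMs); // alert if this creeps past ~3s
  }, { once: true });
});
```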
Capacity planning
Cloudflare handles SFU scaling automatically, but your backend systems need capacity planning:
Session management: Each active session requires state. A Durable Object per session scales naturally, but estimate your concurrent session count for database and cache sizing.
Authentication tokens: Token generation happens on every join. If you're calling external auth services, ensure they handle peak load.
Recording storage: If recording is enabled, estimate storage growth. A one-hour 720p recording is roughly 1-2 GB. A busy platform generating thousands of hours of recordings monthly needs terabytes of storage and a retention policy.
Post-processing pipelines: If you're transcribing, analysing, or transforming recordings, those workloads scale with call volume. Queue-based processing (Chapter 8) prevents overload during peak times.
Testing real-time applications
Real-time media quality depends heavily on network conditions. Testing on a developer's fast, stable connection misses problems users will encounter.
Network condition simulation: Chrome DevTools and WebRTC testing tools allow simulating packet loss, latency, and bandwidth constraints. Test at 3% packet loss and 200ms latency minimum; that's a normal mobile connection.
Geographic testing: Use actual connections from different regions. A VPN or cloud VM in distant locations reveals latency issues local testing misses.
Scale testing: Join many participants to a single call. Quality often degrades non-linearly; a call working fine with 5 participants may struggle at 15.
Device testing: Test on actual mobile devices, not just browser simulators. iOS Safari and Android Chrome have different WebRTC implementations with different quirks.
A complete example: video consultation platform
Bringing the pieces together: a telehealth platform supporting video consultations between patients and doctors, with recording for medical records and AI-assisted transcription.
Architecture overview
Patients and doctors join through RealtimeKit clients. A ConsultationSession Durable Object per consultation handles authorisation and lifecycle, Cloudflare Realtime carries the media, recordings land in R2, and a queue consumer transcribes them with Workers AI and updates the medical record.
Session coordination
The Durable Object manages session lifecycle, participant permissions, and integration with external systems.
import { DurableObject } from 'cloudflare:workers';
export class ConsultationSession extends DurableObject {
private state: ConsultationState;
constructor(ctx: DurableObjectState, env: Env) {
super(ctx, env);
this.state = {
consultationId: null,
participants: new Map(),
status: 'waiting',
startedAt: null,
recordingEnabled: false
};
}
async initialise(consultationId: string, config: ConsultationConfig) {
this.state.consultationId = consultationId;
this.state.recordingEnabled = config.recordingEnabled;
// Create Realtime session
const realtimeSession = await this.env.REALTIME.createSession({
sessionId: consultationId,
recording: config.recordingEnabled ? {
outputPath: `consultations/${consultationId}/`,
format: 'mp4'
} : undefined
});
await this.ctx.storage.put('state', this.state);
return { sessionId: consultationId, realtimeToken: realtimeSession.token };
}
async joinAsPatient(patientId: string): Promise<JoinResult> {
// Validate patient is authorised for this consultation
const consultation = await this.fetchConsultationDetails();
if (consultation.patientId !== patientId) {
throw new Error('Unauthorised');
}
const token = await this.generateParticipantToken(patientId, 'patient');
this.state.participants.set(patientId, { role: 'patient', joinedAt: Date.now() });
await this.ctx.storage.put('state', this.state);
return { token, role: 'patient' };
}
async joinAsDoctor(doctorId: string): Promise<JoinResult> {
// Validate doctor is assigned to this consultation
const consultation = await this.fetchConsultationDetails();
if (consultation.doctorId !== doctorId) {
throw new Error('Unauthorised');
}
// Doctor joining starts the consultation
if (this.state.status === 'waiting') {
this.state.status = 'active';
this.state.startedAt = Date.now();
}
const token = await this.generateParticipantToken(doctorId, 'doctor');
this.state.participants.set(doctorId, { role: 'doctor', joinedAt: Date.now() });
await this.ctx.storage.put('state', this.state);
return { token, role: 'doctor' };
}
async endConsultation(endedBy: string) {
this.state.status = 'ended';
const duration = Date.now() - this.state.startedAt;
// Queue post-processing
if (this.state.recordingEnabled) {
await this.env.PROCESSING_QUEUE.send({
type: 'consultation_ended',
consultationId: this.state.consultationId,
recordingPath: `consultations/${this.state.consultationId}/`,
duration
});
}
// Update external EHR
await this.updateEHR({
consultationId: this.state.consultationId,
status: 'completed',
duration,
participants: Array.from(this.state.participants.entries())
});
await this.ctx.storage.put('state', this.state);
}
}
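The Durable Object needs an entry point. Below is a minimal sketch of the Worker that routes requests to the right ConsultationSession instance; the CONSULTATION_SESSION binding name and the /consultations/:id/:action URL shape are illustrative, not part of the class above.

```ts
interface Env {
  CONSULTATION_SESSION: DurableObjectNamespace<ConsultationSession>;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const url = new URL(request.url);
    // Expected shape: /consultations/:id/:action
    const [, , consultationId, action] = url.pathname.split('/');
    if (!consultationId) return new Response('Not found', { status: 404 });

    // The same consultation id always maps to the same Durable Object instance.
    const stub = env.CONSULTATION_SESSION.get(
      env.CONSULTATION_SESSION.idFromName(consultationId),
    );

    if (action === 'join') {
      const { userId, role } = await request.json<{ userId: string; role: string }>();
      const result = role === 'doctor'
        ? await stub.joinAsDoctor(userId)
        : await stub.joinAsPatient(userId);
      return Response.json(result);
    }

    if (action === 'end') {
      const { userId } = await request.json<{ userId: string }>();
      await stub.endConsultation(userId);
      return Response.json({ ok: true });
    }

    return new Response('Not found', { status: 404 });
  },
};
```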
Post-processing pipeline
Recordings trigger a queue consumer that handles transcription and medical record updates.
export default {
async queue(batch: MessageBatch, env: Env) {
for (const message of batch.messages) {
const { type, consultationId, recordingPath } = message.body;
if (type === 'consultation_ended') {
// Fetch recording from R2
const recording = await env.RECORDINGS.get(`${recordingPath}recording.mp4`);
if (recording) {
// Transcribe with Workers AI
const transcription = await env.AI.run('@cf/openai/whisper', {
audio: [...new Uint8Array(await recording.arrayBuffer())]
});
// Store transcription
await env.D1.prepare(`
INSERT INTO consultation_records (consultation_id, transcription, created_at)
VALUES (?, ?, datetime('now'))
`).bind(consultationId, transcription.text).run();
// Generate summary (optional)
const summary = await env.AI.run('@cf/meta/llama-3-8b-instruct', {
prompt: `Summarise this medical consultation transcript for clinical notes:\n\n${transcription.text}`
});
await env.D1.prepare(`
UPDATE consultation_records SET summary = ? WHERE consultation_id = ?
`).bind(summary.response, consultationId).run();
}
}
// Ack regardless of message type so unrecognised messages aren't retried forever.
message.ack();
}
}
};
This architecture separates concerns cleanly: Durable Objects handle coordination and business logic, Realtime handles media transport, R2 stores recordings, Queues handle async processing, and Workers AI provides intelligence. Each primitive does what it does best.
Comparing to hyperscaler alternatives
Understanding what you're choosing between clarifies the decision.
| Aspect | Cloudflare Realtime | AWS Chime SDK | Azure Communication Services | Twilio Video |
|---|---|---|---|---|
| Global presence | 300+ PoPs, anycast | 14 regions | 18 regions | 7 regions |
| Pricing model | Per-GB egress | Per-minute | Per-minute | Per-minute |
| PSTN integration | No | Yes | Yes | Yes |
| Recording | Built-in to R2 | S3 integration | Azure Blob | Composition API |
| Client SDKs | JS, React Native, Flutter, iOS, Android | JS, iOS, Android, React Native | JS, iOS, Android | JS, iOS, Android, React Native |
| Max participants | 1,000 (SFU mode) | 250 | 350 | 50 (group rooms) |
| Minimum latency | Lower (anycast) | Regional | Regional | Regional |
| Platform maturity | Newer (2024) | Established (2020) | Established (2020) | Established (2017) |
When hyperscalers win
PSTN requirements: If users need to dial in via phone number, AWS Chime and Azure Communication Services offer integrated PSTN. Cloudflare Realtime does not; you'd need Twilio or similar for phone connectivity.
Existing AWS/Azure investment: If you're deeply integrated with AWS or Azure, the native video services integrate more smoothly with your existing authentication, storage, and monitoring.
Enterprise compliance: Large enterprise deployments often require specific certifications (FedRAMP, HIPAA BAA). Verify Cloudflare's current certification status against your requirements. The hyperscalers have longer compliance track records.
Predictable budgeting: Per-minute pricing is easier to budget. You know exactly what a 100,000-minute month costs. Bandwidth pricing varies with quality and TURN usage, making cost prediction harder.
When Cloudflare wins
Globally distributed users: If participants span continents, anycast routing provides better experience than choosing a single regional server.
Variable quality needs: If calls range from audio-only to high-definition video, bandwidth pricing rewards efficiency. Hyperscaler per-minute rates don't distinguish.
Integrated platform: If you're already using Workers, Durable Objects, R2, and Workers AI, Realtime integrates seamlessly. No cross-cloud networking, no separate billing, no additional authentication systems.
Cost at scale for audio: Audio-heavy applications (voice agents, conferencing without video) cost dramatically less on Cloudflare than per-minute alternatives.
Making the decision
The decision framework for real-time features:
| Question | If Yes | If No |
|---|---|---|
| Does the application involve audio or video between participants? | Evaluate Realtime | Use Durable Objects + WebSockets |
| Do users need to dial in via phone? | Use Chime/Azure/Twilio | Cloudflare viable |
| Are users geographically distributed? | Cloudflare's anycast helps | Regional SFU may suffice |
| Is the application primarily audio? | Cloudflare likely cheaper | Cost differences smaller |
| Do you need specific compliance certifications? | Verify Cloudflare status | Less concern |
| Are you already on Cloudflare's platform? | Integration is simpler | Evaluate independently |
Start with the simplest option that works. If your "real-time" feature involves text, JSON, or application state, Durable Objects with WebSockets almost certainly suffice. Don't add media infrastructure complexity for problems that don't require it.
Prototype before committing. The free tier (1,000 GB) supports meaningful prototyping. Build something, test with real users across real networks, measure actual latency and quality. Real-time media quality varies significantly with implementation details; benchmarks matter more than specifications.
Plan for operational complexity. Real-time media is harder to operate than request/response applications. Budget for monitoring, debugging tools, and user support for "the call isn't working" issues. This complexity exists regardless of provider choice.
What comes next
This chapter covered real-time audio and video infrastructure: when you need it, how Cloudflare's anycast SFU architecture provides advantages over traditional approaches, the operational realities of media applications, and how to evaluate provider choices.
The next chapter begins Part IV: Data & Storage, starting with how to choose the right storage primitive for your data. The coordination patterns from Durable Objects (Chapter 6), the compute options (Workers, Containers), and now real-time media combine with storage choices to form complete architectures. Real-time media adds another dimension but follows the same principle: match the primitive to the problem, understand its constraints, and design for production from the start.