Session Replay
Session replay refers to recorded recreations of agent sessions showing UI interactions, decisions, and outcomes for analysis. These recordings capture the complete sequence of actions, state changes, and visual representations of an agent's behavior during task execution, enabling detailed post-hoc analysis, debugging, and auditing.
In the context of computer-use agents and agentic UI systems, session replay serves as a critical observability tool that bridges the gap between high-level agent decisions and low-level UI interactions, providing a complete forensic trail of agent behavior.
Why Session Replay Matters
Session replay has become essential for operating production agentic systems because it enables several critical capabilities:
Debugging Complex Failures
Agent failures often emerge from subtle interaction chains that are impossible to reproduce from logs alone. Session replay enables developers to observe the exact sequence of UI states, mouse movements, keyboard inputs, and visual feedback that led to a failure. For example, an agent might fail to complete a checkout flow because it clicked a button before a loading spinner disappeared—a timing issue only visible through replay.
User Behavior Analysis
When agents interact with real user interfaces, understanding their decision-making process requires visual context. Session replay reveals why an agent chose one UI element over another, how it interpreted visual cues, and whether its actions aligned with intended behavior patterns. This analysis becomes crucial when optimizing agent performance or validating safety constraints.
Training Data Generation
High-quality session replays serve as valuable training data for improving agent models. By capturing successful task completions alongside the visual and interaction context, teams can create demonstration datasets that teach agents better interaction patterns, more efficient navigation strategies, and improved error recovery behaviors.
Compliance and Auditing
For regulated industries, session replay provides auditable evidence of agent actions. Financial services, healthcare, and legal applications often require proof that an agent performed specific actions in a specific sequence—session replay creates a durable, tamper-evident record that can satisfy such regulatory requirements.
Concrete Examples
DOM Recording with Event Streams
Modern web-based session replay systems capture the Document Object Model (DOM) structure and mutation events rather than video frames. This approach records:
{
  "timestamp": 1698765432000,
  "type": "dom_snapshot",
  "data": {
    "html": "<body><div class='checkout'>...</div></body>",
    "css": ["app.css", "theme.css"],
    "viewport": {"width": 1920, "height": 1080}
  }
}
Subsequent interactions are stored as deltas:
{
  "timestamp": 1698765433250,
  "type": "click",
  "target": "button[data-testid='submit-payment']",
  "coordinates": {"x": 450, "y": 320},
  "agent_reasoning": "Located primary CTA button for payment submission"
}
This approach can reduce storage requirements by roughly 95% compared to video while maintaining high fidelity for web interactions.
Time-Travel Debugging
Advanced replay systems enable developers to step backward and forward through agent sessions, inspecting state at any point. For example, debugging a failed form submission might reveal:
- T+0s: Agent navigates to form page
- T+1.2s: Agent fills email field (value visible in replay)
- T+1.8s: Agent attempts to fill phone field but encounters validation error
- T+2.1s: Agent retries with different format
- T+3.5s: Form submission fails due to CSRF token expiration
By stepping through this sequence, developers identify that the agent's retry logic took too long, allowing the CSRF token to expire—a root cause invisible in traditional logs.
Multi-Modal Capture
Computer-use agents often interact across multiple modalities. Comprehensive session replay captures:
- Visual screenshots: Full-resolution images at decision points
- Agent vision: Annotated screenshots showing what the vision model actually "saw" (bounding boxes, OCR results, detected UI elements)
- Action sequences: Structured data of clicks, typing, scrolling, navigation
- LLM reasoning traces: The agent's internal decision-making process
- Network activity: API calls, resource loads, timing information
Example replay data structure:
{
  "session_id": "sess_a1b2c3d4",
  "timestamp": "2024-01-15T14:23:11Z",
  "screenshot": {
    "url": "s3://replays/sess_a1b2c3d4/frame_042.png",
    "annotations": [
      {"type": "bbox", "coords": [100, 200, 50, 30], "label": "submit_button", "confidence": 0.95}
    ]
  },
  "action": {
    "type": "click",
    "element": "button#submit",
    "reasoning": "User requested form submission; detected primary CTA button"
  },
  "outcome": {
    "success": false,
    "error": "Element not clickable: overlaid by modal",
    "retry_strategy": "wait_for_modal_close"
  }
}
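For downstream tooling in Python, the same record can be given a typed schema. The sketch below mirrors the JSON above; the bbox coordinate order is an assumption, not something the format specifies:

from typing import List, Optional, TypedDict

class Annotation(TypedDict):
    type: str          # e.g. "bbox"
    coords: List[int]  # assumed [x, y, width, height]
    label: str
    confidence: float

class Screenshot(TypedDict):
    url: str
    annotations: List[Annotation]

class Action(TypedDict):
    type: str
    element: str
    reasoning: str

class Outcome(TypedDict):
    success: bool
    error: Optional[str]
    retry_strategy: Optional[str]

class ReplayEvent(TypedDict):
    session_id: str
    timestamp: str  # ISO 8601
    screenshot: Screenshot
    action: Action
    outcome: Outcome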
Common Pitfalls
Excessive Storage Costs
Naive session replay implementations can generate terabytes of data quickly. A single agent session might produce:
- 1,000+ screenshots at 500KB each = 500MB
- 10,000+ DOM events at 2KB each = 20MB
- 500+ network request/response pairs at 10KB each = 5MB
At scale this adds up quickly: 1,000 daily sessions at roughly 525MB each produce over 500GB of new data per day, about 15TB per month, making naive retention prohibitively expensive. Teams must implement aggressive retention policies, compression strategies, and selective capture rules.
Mitigation strategy: Implement tiered storage where recent sessions (< 7 days) retain full fidelity, medium-aged sessions (7-30 days) keep only screenshots at decision points, and old sessions (> 30 days) retain only structured metadata unless flagged for long-term retention.
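A minimal sketch of such a tiered policy, run as a periodic job. The artifact names and the `session` helper methods here are illustrative assumptions, not a real API:

from datetime import datetime, timezone

# Tier thresholds mirror the policy described above.
FULL_FIDELITY = {"screenshots", "dom_snapshots", "network", "metadata"}
DECISION_POINTS = {"screenshots", "metadata"}
METADATA_ONLY = {"metadata"}

def apply_retention(session) -> None:
    """Downgrade a session's stored artifacts as it ages."""
    if session.flagged_for_long_term_retention:
        return  # flagged sessions keep full fidelity indefinitely
    age_days = (datetime.now(timezone.utc) - session.created_at).days
    if age_days < 7:
        keep = FULL_FIDELITY
    elif age_days <= 30:
        keep = DECISION_POINTS
    else:
        keep = METADATA_ONLY
    session.delete_artifacts_except(keep)  # hypothetical storage helper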
PII in Replays
Session replays inherently capture everything an agent sees, including personally identifiable information (PII), credentials, financial data, and health records. Storing this data creates security and compliance risks.
Common PII exposure vectors:
- Form fields containing user names, emails, SSNs
- Screenshots showing account balances, medical records
- Network traffic capturing authentication tokens
- Chat transcripts with personal information
Mitigation strategy: Implement automatic PII redaction using pattern matching, named entity recognition, and field-level sanitization. For sensitive applications, store only anonymized replay data with the ability to map back to original sessions under strict access controls.
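A sketch of the pattern-matching layer. The regexes below cover a few common PII shapes and are illustrative, not exhaustive; production systems layer named-entity recognition and field-level rules on top:

import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def redact(value):
    """Recursively replace matched PII with typed placeholders."""
    if isinstance(value, str):
        for label, pattern in PII_PATTERNS.items():
            value = pattern.sub(f"[REDACTED:{label}]", value)
        return value
    if isinstance(value, dict):
        return {k: redact(v) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(v) for v in value]
    return value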
Playback Inconsistencies
Deterministic replay is challenging because web applications depend on:
- Dynamic external resources (CDNs, APIs)
- Timing-dependent behaviors (animations, async operations)
- Non-deterministic JavaScript execution
- Browser-specific rendering differences
An agent's original session might succeed, but the replay fails because an external API returns different data or a race condition resolves differently.
Mitigation strategy: Capture complete network responses during recording, inject them during playback to ensure consistency. Use DOM snapshots rather than re-executing JavaScript. Accept that perfect determinism is impossible for complex web applications and focus on "good enough" fidelity for debugging purposes.
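A sketch of this record-and-inject pattern using Playwright's routing API, which supports intercepting and fulfilling requests. Keying responses on URL alone is a simplification; real systems also match on method and body and handle repeated calls to the same endpoint:

from playwright.async_api import Page, Response, Route

recorded = {}  # url -> response body captured during the original run

async def start_recording(page: Page) -> None:
    """Record mode: stash every response body as the agent works."""
    async def on_response(response: Response) -> None:
        try:
            recorded[response.url] = await response.body()
        except Exception:
            pass  # some responses (e.g. redirects) carry no body
    page.on("response", on_response)

async def start_replay(page: Page) -> None:
    """Playback mode: serve recorded bodies instead of the live network."""
    async def on_route(route: Route) -> None:
        body = recorded.get(route.request.url)
        if body is not None:
            await route.fulfill(status=200, body=body)
        else:
            await route.continue_()  # unrecorded request: hit the network
    await page.route("**/*", on_route)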
Performance Impact on Agents
Heavyweight capture logic can slow down agent execution. Synchronous screenshot capture, DOM serialization, and event logging add latency to each action, potentially affecting agent behavior and task completion times.
Mitigation strategy: Use asynchronous capture mechanisms that don't block agent actions. Implement sampling strategies for high-frequency events (e.g., capture every 10th mouse movement rather than all movements). Provide configuration options to disable replay in performance-critical scenarios.
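A minimal sketch of a non-blocking capture buffer combining both ideas: the agent's hot path only enqueues events, a background task persists them, and high-frequency events are sampled 1-in-N. The storage backend is assumed:

import asyncio

class AsyncCaptureBuffer:
    """Decouple capture from execution so capture latency never
    delays the agent."""

    def __init__(self, move_sample_rate=10):
        self.queue = asyncio.Queue()
        self.move_sample_rate = move_sample_rate
        self._move_count = 0

    def record(self, event):
        """Called from the agent's hot path; never blocks."""
        if event.get("type") == "mouse.move":
            self._move_count += 1
            if self._move_count % self.move_sample_rate != 0:
                return  # drop all but every Nth mouse movement
        self.queue.put_nowait(event)

    async def drain(self, storage):
        """Run as a background task: asyncio.create_task(buf.drain(store))."""
        while True:
            event = await self.queue.get()
            await storage.write(event)  # storage backend is assumed
            self.queue.task_done()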
Implementation
Capture Strategies
Effective session replay requires capturing the right data at the right granularity:
Browser-based capture (for web agents):
import time

class SessionReplayCapture:
    def __init__(self, agent_session):
        self.session = agent_session
        self.events = []
        self.screenshot_interval = 1000  # ms between periodic captures (loop not shown)

    async def capture_interaction(self, action):
        """Capture a single agent interaction."""
        event = {
            "timestamp": time.time(),
            "action_type": action.type,
            "target": action.target,
            "screenshot_before": await self.capture_screenshot(),
            "dom_snapshot": await self.capture_dom(),
            "agent_reasoning": action.reasoning,
        }
        # Execute the action
        result = await action.execute()
        # Capture outcome
        event["screenshot_after"] = await self.capture_screenshot()
        event["result"] = result
        event["duration_ms"] = result.timing
        self.events.append(event)
        return result

    async def capture_screenshot(self):
        """Capture, compress, and upload a screenshot; return its URL."""
        screenshot = await self.session.screenshot(full_page=False)
        compressed = self.compress_image(screenshot, quality=85)
        url = await self.upload_to_storage(compressed)
        return url

    async def capture_dom(self):
        """Capture serialized DOM state."""
        dom = await self.session.evaluate("""
            () => {
                return {
                    html: document.documentElement.outerHTML,
                    url: window.location.href,
                    viewport: {
                        width: window.innerWidth,
                        height: window.innerHeight
                    }
                };
            }
        """)
        return dom
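In use, the capture object wraps each action the agent takes. A minimal driver sketch, assuming an iterable of action objects with the interface used above:

async def run_with_replay(agent_session, actions):
    capture = SessionReplayCapture(agent_session)
    for action in actions:
        await capture.capture_interaction(action)
    return capture.events  # hand these to the compressor shown later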
Desktop application capture (for computer-use agents):
import asyncio
import time

class DesktopReplayCapture:
    def __init__(self):
        self.recorder = ScreenRecorder()
        self.input_tracker = InputTracker()

    async def start_capture(self):
        """Begin capturing desktop session."""
        # Capture full screen at key moments
        self.recorder.start(
            mode="on_demand",  # Only capture when agent acts
            format="png",
            compression="lossless",
        )
        # Track all input events
        self.input_tracker.track_events([
            "mouse.click",
            "mouse.move",
            "keyboard.press",
            "window.focus",
        ])

    async def capture_action(self, agent_action):
        """Capture a computer-use action."""
        # Screenshot before action
        screen_before = await self.recorder.capture_frame()
        # Record action metadata
        event = {
            "timestamp": time.time(),
            "action": agent_action.to_dict(),
            "screen_before": screen_before,
            "window_title": await self.get_active_window(),
            "cursor_position": await self.input_tracker.get_cursor_pos(),
        }
        # Execute action
        result = await agent_action.execute()
        # Screenshot after action (with delay for UI updates)
        await asyncio.sleep(0.5)
        event["screen_after"] = await self.recorder.capture_frame()
        event["result"] = result
        return event
Compression Techniques
Efficient storage requires multi-layered compression:
Image compression (a short sketch follows this list):
- Convert screenshots to WebP format (30-40% smaller than PNG)
- Use progressive JPEG for large images
- Implement perceptual hashing to detect duplicate screenshots
- Store only diffs between consecutive frames when possible
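A Pillow-based sketch of the first and last items above, assuming screenshots are stored as PNG files on disk; exact savings vary with image content:

from PIL import Image, ImageChops

def to_webp(png_path, quality=85):
    """Re-encode a PNG screenshot as WebP (typically 30-40% smaller)."""
    webp_path = png_path.rsplit(".", 1)[0] + ".webp"
    Image.open(png_path).save(webp_path, "WEBP", quality=quality)
    return webp_path

def frames_identical(path_a, path_b):
    """True when two consecutive frames are pixel-identical, so the
    later frame can be stored as a reference to the earlier one."""
    a = Image.open(path_a).convert("RGB")
    b = Image.open(path_b).convert("RGB")
    return a.size == b.size and ImageChops.difference(a, b).getbbox() is None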
Data compression:
import json

import zstandard as zstd
from imagehash import phash
from PIL import Image

class ReplayCompressor:
    def __init__(self):
        self.compressor = zstd.ZstdCompressor(level=9)

    def compress_event(self, event):
        """Compress event data with zstandard."""
        # Serialize to compact JSON
        json_data = json.dumps(event, separators=(",", ":"))
        # Typical compression ratios: 5:1 to 10:1 for JSON event data
        return self.compressor.compress(json_data.encode("utf-8"))

    def deduplicate_screenshots(self, screenshots):
        """Use perceptual hashing to eliminate duplicate screenshots."""
        seen_hashes = {}
        unique_screenshots = []
        for screenshot in screenshots:
            img_hash = str(phash(Image.open(screenshot.path)))
            if img_hash not in seen_hashes:
                seen_hashes[img_hash] = screenshot
                unique_screenshots.append(screenshot)
            else:
                # Store a reference to the earlier identical frame
                screenshot.reference = seen_hashes[img_hash].id
        return unique_screenshots
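A quick round-trip check: the one-shot frames written by `ZstdCompressor.compress` embed their content size, so the module-level `zstd.decompress` used by the player below can decode them directly:

event = {"timestamp": 1698765433250, "type": "click", "target": "button#submit"}
compressed = ReplayCompressor().compress_event(event)
assert json.loads(zstd.decompress(compressed)) == event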
Playback Systems
Replay viewers enable interactive debugging:
import json
import time

import zstandard as zstd

class ReplayPlayer:
    def __init__(self, session_id):
        self.session = self.load_session(session_id)
        self.current_index = 0

    def load_session(self, session_id):
        """Load and decompress session data (storage client assumed)."""
        compressed_data = storage.get(f"replays/{session_id}")
        decompressed = zstd.decompress(compressed_data)
        return json.loads(decompressed)

    def play(self, speed=1.0):
        """Play session at specified speed."""
        for event in self.session["events"]:
            self.render_event(event)
            delay = event.get("delay_until_next", 1000) / speed
            time.sleep(delay / 1000)

    def step_forward(self):
        """Advance to the next event."""
        if self.current_index < len(self.session["events"]) - 1:
            self.current_index += 1
            return self.render_event(self.session["events"][self.current_index])

    def step_backward(self):
        """Go back to the previous event."""
        if self.current_index > 0:
            self.current_index -= 1
            return self.render_event(self.session["events"][self.current_index])

    def jump_to_timestamp(self, timestamp):
        """Jump to the first event at or after a timestamp."""
        for i, event in enumerate(self.session["events"]):
            if event["timestamp"] >= timestamp:
                self.current_index = i
                return self.render_event(event)

    def render_event(self, event):
        """Render event in viewer UI."""
        return {
            "screenshot": self.load_screenshot(event["screenshot_after"]),
            "action": event["action"],
            "reasoning": event.get("agent_reasoning", ""),
            "timestamp": event["timestamp"],
            "metadata": {
                "duration": event.get("duration_ms", 0),
                "success": event.get("result", {}).get("success", True),
            },
        }
Key Metrics
Effective session replay systems track several critical metrics:
Replay Coverage
Definition: Percentage of agent sessions with complete replay data available.
Calculation: (sessions_with_complete_replay / total_sessions) × 100
Targets:
- Production systems: > 99% coverage for failed sessions
- Development environments: > 95% coverage for all sessions
- High-value transactions: 100% coverage with extended retention
Why it matters: Gaps in replay coverage mean blind spots in debugging. A session that fails without replay data becomes nearly impossible to diagnose, especially for non-deterministic failures.
Storage Efficiency
Definition: Compression ratio and storage cost per session.
Calculation: original_data_size / compressed_data_size
Targets:
- Compression ratio: > 5:1 for event data
- Screenshot compression: > 10:1 (using deduplication + format optimization)
- Cost per session: < $0.10 for 30-day retention
Why it matters: Poor storage efficiency makes session replay unsustainable at scale. A 10,000-session-per-day system with inefficient storage can cost $30,000/month in cloud storage fees.
Debugging Time Reduction
Definition: Time saved in debugging by using session replay versus traditional logs.
Measurement approach: Track time-to-resolution for incidents with replay available versus without.
Typical results:
- Simple UI interaction bugs: 60-80% time reduction (30 min → 10 min)
- Complex multi-step failures: 70-90% time reduction (4 hours → 30 min)
- Non-deterministic issues: often the difference between undiagnosable and solvable (no reproduction from logs → ~1 hour with replay)
Why it matters: Developer time is expensive. If session replay saves 2 hours per week per engineer, it pays for itself quickly even with substantial infrastructure costs.
Playback Fidelity
Definition: How accurately the replay matches original agent behavior.
Calculation: (successful_playback_recreations / total_playback_attempts) × 100
Targets:
- DOM-based web replays: > 95% fidelity
- Video-based replays: > 99% fidelity
- Desktop application replays: > 90% fidelity
Why it matters: Low fidelity means developers can't trust the replay, reducing its value and forcing them to fall back to less effective debugging methods.
Retention Compliance
Definition: Percentage of replays properly handling PII redaction and retention policies.
Targets:
- PII redaction coverage: 100% of sensitive fields
- Retention policy compliance: 100% (auto-deletion after expiry)
- Access audit logging: 100% of replay access events logged
Why it matters: Compliance failures can result in regulatory fines, security breaches, and loss of customer trust. Session replay systems often contain the most sensitive data in your infrastructure.
Related Concepts
Session replay integrates with several complementary observability and debugging patterns:
- Proof-of-action: Cryptographic verification that specific actions occurred, often using session replay data as evidence
- Audit log: Structured records of agent actions, complemented by visual session replay
- Screenshots: Point-in-time captures that form the visual component of session replays
- Observability: Broader system monitoring strategy that includes session replay as one telemetry source
- Trace analysis: Distributed tracing through agent systems, enhanced by session replay context
- Error tracking: Bug monitoring systems that link errors to specific session replays
- Compliance monitoring: Regulatory adherence systems that use session replay for audit trails
Session replay serves as the visual, interactive layer that makes other observability data actionable, transforming abstract logs and metrics into concrete, understandable agent behavior.