Session Replay

Session replay refers to recorded recreations of agent sessions showing UI interactions, decisions, and outcomes for analysis. These recordings capture the complete sequence of actions, state changes, and visual representations of an agent's behavior during task execution, enabling detailed post-hoc analysis, debugging, and auditing.

In the context of computer-use agents and agentic UI systems, session replay serves as a critical observability tool that bridges the gap between high-level agent decisions and low-level UI interactions, providing a complete forensic trail of agent behavior.

Why Session Replay Matters

Session replay has become essential for operating production agentic systems due to several critical capabilities:

Debugging Complex Failures

Agent failures often emerge from subtle interaction chains that are impossible to reproduce from logs alone. Session replay enables developers to observe the exact sequence of UI states, mouse movements, keyboard inputs, and visual feedback that led to a failure. For example, an agent might fail to complete a checkout flow because it clicked a button before a loading spinner disappeared—a timing issue only visible through replay.

User Behavior Analysis

When agents interact with real user interfaces, understanding their decision-making process requires visual context. Session replay reveals why an agent chose one UI element over another, how it interpreted visual cues, and whether its actions aligned with intended behavior patterns. This analysis becomes crucial when optimizing agent performance or validating safety constraints.

Training Data Generation

High-quality session replays serve as valuable training data for improving agent models. By capturing successful task completions alongside the visual and interaction context, teams can create demonstration datasets that teach agents better interaction patterns, more efficient navigation strategies, and improved error recovery behaviors.

Compliance and Auditing

For regulated industries, session replay provides auditable evidence of agent actions. Financial services, healthcare, and legal applications often require proof that an agent performed specific actions in a specific sequence—session replay creates an immutable record that satisfies regulatory requirements.

Concrete Examples

DOM Recording with Event Streams

Modern web-based session replay systems capture the Document Object Model (DOM) structure and mutation events rather than video frames. This approach records:

{
  "timestamp": 1698765432000,
  "type": "dom_snapshot",
  "data": {
    "html": "<body><div class='checkout'>...</div></body>",
    "css": ["app.css", "theme.css"],
    "viewport": {"width": 1920, "height": 1080}
  }
}

Subsequent interactions are stored as deltas:

{
  "timestamp": 1698765433250,
  "type": "click",
  "target": "button[data-testid='submit-payment']",
  "coordinates": {"x": 450, "y": 320},
  "agent_reasoning": "Located primary CTA button for payment submission"
}

This approach typically reduces storage requirements by 90-95% compared to video while preserving high-fidelity reconstruction of web interactions.
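The snapshot-plus-delta idea can be sketched in Python using difflib as a stand-in for real DOM mutation records (the event shapes and timestamps here are illustrative, not a production wire format):

```python
import difflib

def record_stream(dom_states):
    """Turn a sequence of (timestamp, serialized DOM) pairs into
    one full snapshot followed by per-frame deltas."""
    events = []
    prev = None
    for ts, html in dom_states:
        if prev is None:
            # First frame: store the full snapshot.
            events.append({"timestamp": ts, "type": "dom_snapshot", "html": html})
        else:
            # Later frames: store only a unified diff against the previous state.
            delta = list(difflib.unified_diff(
                prev.splitlines(), html.splitlines(), lineterm=""))
            events.append({"timestamp": ts, "type": "dom_delta", "diff": delta})
        prev = html
    return events

states = [
    (1000, "<body><div class='checkout'>Cart</div></body>"),
    (2000, "<body><div class='checkout'>Cart</div><p>Paid</p></body>"),
]
events = record_stream(states)
```

Real recorders hook MutationObserver-style APIs instead of diffing serialized HTML, but the storage win comes from the same structure: one expensive snapshot, then cheap deltas.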

Time-Travel Debugging

Advanced replay systems enable developers to step backward and forward through agent sessions, inspecting state at any point. For example, debugging a failed form submission might reveal:

  1. T+0s: Agent navigates to form page
  2. T+1.2s: Agent fills email field (value visible in replay)
  3. T+1.8s: Agent attempts to fill phone field but encounters validation error
  4. T+2.1s: Agent retries with different format
  5. T+3.5s: Form submission fails due to CSRF token expiration

By stepping through this sequence, developers identify that the agent's retry logic took too long, allowing the session token to expire—a root cause invisible in traditional logs.
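Jumping to an arbitrary point in a timeline like the one above reduces to a binary search over event timestamps. A minimal sketch (event fields are illustrative):

```python
import bisect

class TimelineIndex:
    """Binary-searchable index over replay events, keyed by timestamp."""

    def __init__(self, events):
        # Events are assumed already sorted by timestamp, as captured.
        self.events = events
        self.timestamps = [e["timestamp"] for e in events]

    def at(self, t):
        """Return the last event at or before time t, i.e. the state visible then."""
        i = bisect.bisect_right(self.timestamps, t) - 1
        return self.events[i] if i >= 0 else None

events = [
    {"timestamp": 0.0, "step": "navigate"},
    {"timestamp": 1.2, "step": "fill_email"},
    {"timestamp": 1.8, "step": "fill_phone"},
    {"timestamp": 3.5, "step": "submit_failed"},
]
index = TimelineIndex(events)
```

Stepping forward and backward then becomes moving the returned index by one, which is what the replay player later in this article does.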

Multi-Modal Capture

Computer-use agents often interact across multiple modalities. Comprehensive session replay captures:

  • Visual screenshots: Full-resolution images at decision points
  • Agent vision: Annotated screenshots showing what the vision model actually "saw" (bounding boxes, OCR results, detected UI elements)
  • Action sequences: Structured data of clicks, typing, scrolling, navigation
  • LLM reasoning traces: The agent's internal decision-making process
  • Network activity: API calls, resource loads, timing information

Example replay data structure:

{
  "session_id": "sess_a1b2c3d4",
  "timestamp": "2024-01-15T14:23:11Z",
  "screenshot": {
    "url": "s3://replays/sess_a1b2c3d4/frame_042.png",
    "annotations": [
      {"type": "bbox", "coords": [100, 200, 50, 30], "label": "submit_button", "confidence": 0.95}
    ]
  },
  "action": {
    "type": "click",
    "element": "button#submit",
    "reasoning": "User requested form submission; detected primary CTA button"
  },
  "outcome": {
    "success": false,
    "error": "Element not clickable: overlayed by modal",
    "retry_strategy": "wait_for_modal_close"
  }
}

Common Pitfalls

Excessive Storage Costs

Naive session replay implementations can generate terabytes of data quickly. A single agent session might produce:

  • 1,000+ screenshots at 500KB each = 500MB
  • 10,000+ DOM events at 2KB each = 20MB
  • 500+ network request/response pairs at 10KB each = 5MB

At scale, storing 1,000 daily sessions becomes prohibitively expensive. Teams must implement aggressive retention policies, compression strategies, and selective capture rules.

Mitigation strategy: Implement tiered storage where recent sessions (< 7 days) retain full fidelity, medium-aged sessions (7-30 days) keep only screenshots at decision points, and old sessions (> 30 days) retain only structured metadata unless flagged for long-term retention.
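The tiered policy above can be expressed as a small decision function; the thresholds mirror the 7/30-day tiers described and would be tuned per deployment:

```python
def retention_tier(session_age_days, flagged=False):
    """Decide which replay artifacts to keep for a session of a given age.

    Returns a dict of artifact -> keep? Flagged sessions keep full fidelity
    regardless of age (long-term retention).
    """
    if flagged or session_age_days < 7:
        # Recent or flagged: full fidelity (events, all screenshots, DOM).
        return {"events": True, "screenshots": True, "dom": True}
    if session_age_days <= 30:
        # Medium-aged: structured events plus decision-point screenshots only.
        return {"events": True, "screenshots": True, "dom": False}
    # Old: structured metadata only.
    return {"events": True, "screenshots": False, "dom": False}
```

A nightly job would walk stored sessions, compute their age, and delete whatever artifacts the returned dict marks as droppable.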

PII in Replays

Session replays inherently capture everything an agent sees, including personally identifiable information (PII), credentials, financial data, and health records. Storing this data creates security and compliance risks.

Common PII exposure vectors:

  • Form fields containing user names, emails, SSNs
  • Screenshots showing account balances, medical records
  • Network traffic capturing authentication tokens
  • Chat transcripts with personal information

Mitigation strategy: Implement automatic PII redaction using pattern matching, named entity recognition, and field-level sanitization. For sensitive applications, store only anonymized replay data with the ability to map back to original sessions under strict access controls.
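The pattern-matching layer of that redaction pipeline can be sketched with a few regexes. These patterns are illustrative only; production systems layer named entity recognition and field-level sanitization on top, since regexes alone miss context-dependent PII:

```python
import re

# (pattern, replacement token) pairs, applied in order. Illustrative, not exhaustive.
PII_PATTERNS = [
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "[EMAIL]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "[CARD]"),
]

def redact(text):
    """Replace common PII patterns before a replay event is persisted."""
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text
```

Redaction must run at capture time, before data reaches storage; redacting after the fact still leaves a window in which raw PII was persisted.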

Playback Inconsistencies

Deterministic replay is challenging because web applications depend on:

  • Dynamic external resources (CDNs, APIs)
  • Timing-dependent behaviors (animations, async operations)
  • Non-deterministic JavaScript execution
  • Browser-specific rendering differences

An agent's original session might succeed, but the replay fails because an external API returns different data or a race condition resolves differently.

Mitigation strategy: Capture complete network responses during recording and inject them during playback to ensure consistency. Use DOM snapshots rather than re-executing JavaScript. Accept that perfect determinism is impossible for complex web applications and focus on "good enough" fidelity for debugging purposes.
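The record-then-inject pattern for network responses can be sketched as a simple cache keyed by request; real implementations also match on request bodies and headers, and hook the browser's network layer rather than a plain dict:

```python
class NetworkCache:
    """Record network responses during capture; serve them verbatim at playback."""

    def __init__(self):
        self.responses = {}

    def record(self, method, url, body):
        # Keyed by (method, url) for simplicity; production matching is richer.
        self.responses[(method, url)] = body

    def replay(self, method, url):
        """Return the recorded response so playback never hits the live backend."""
        key = (method, url)
        if key not in self.responses:
            raise KeyError(f"no recorded response for {method} {url}")
        return self.responses[key]

cache = NetworkCache()
cache.record("GET", "/api/cart", {"items": 2, "total": 41.98})
```

Because playback reads from the cache, an external API that has since changed its responses cannot cause the replayed session to diverge from the original.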

Performance Impact on Agents

Heavyweight capture logic can slow down agent execution. Synchronous screenshot capture, DOM serialization, and event logging add latency to each action, potentially affecting agent behavior and task completion times.

Mitigation strategy: Use asynchronous capture mechanisms that don't block agent actions. Implement sampling strategies for high-frequency events (e.g., capture every 10th mouse movement rather than all movements). Provide configuration options to disable replay in performance-critical scenarios.
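The sampling idea can be sketched as a filter that always keeps meaningful actions but thins out noisy, high-frequency events (the event-type names here are illustrative):

```python
class EventSampler:
    """Keep every meaningful action; sample high-frequency noise like mouse moves."""

    def __init__(self, sample_every=10):
        self.sample_every = sample_every
        self.counters = {}

    def should_capture(self, event_type):
        if event_type != "mouse.move":
            # Clicks, key presses, navigation, etc. are always captured.
            return True
        n = self.counters.get(event_type, 0)
        self.counters[event_type] = n + 1
        # Keep only every Nth mouse movement.
        return n % self.sample_every == 0
```

The filter is cheap enough to run inline; the captured events would then be handed to an asynchronous writer queue so storage I/O never blocks the agent's next action.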

Implementation

Capture Strategies

Effective session replay requires capturing the right data at the right granularity:

Browser-based capture (for web agents):

import time

class SessionReplayCapture:
    def __init__(self, agent_session):
        self.session = agent_session
        self.events = []
        self.screenshot_interval = 1000  # ms

    async def capture_interaction(self, action):
        """Capture a single agent interaction."""
        event = {
            "timestamp": time.time(),
            "action_type": action.type,
            "target": action.target,
            "screenshot_before": await self.capture_screenshot(),
            "dom_snapshot": await self.capture_dom(),
            "agent_reasoning": action.reasoning
        }

        # Execute the action
        result = await action.execute()

        # Capture outcome
        event["screenshot_after"] = await self.capture_screenshot()
        event["result"] = result
        event["duration_ms"] = result.timing

        self.events.append(event)
        return result

    async def capture_screenshot(self):
        """Capture and compress screenshot."""
        screenshot = await self.session.screenshot(full_page=False)
        compressed = self.compress_image(screenshot, quality=85)
        url = await self.upload_to_storage(compressed)
        return url

    async def capture_dom(self):
        """Capture serialized DOM state."""
        dom = await self.session.evaluate("""
            () => {
                return {
                    html: document.documentElement.outerHTML,
                    url: window.location.href,
                    viewport: {
                        width: window.innerWidth,
                        height: window.innerHeight
                    }
                };
            }
        """)
        return dom

Desktop application capture (for computer-use agents):

import asyncio
import time

class DesktopReplayCapture:
    def __init__(self):
        self.recorder = ScreenRecorder()
        self.input_tracker = InputTracker()

    async def start_capture(self):
        """Begin capturing desktop session."""
        # Capture full screen at key moments
        self.recorder.start(
            mode="on_demand",  # Only capture when agent acts
            format="png",
            compression="lossless"
        )

        # Track all input events
        self.input_tracker.track_events([
            "mouse.click",
            "mouse.move",
            "keyboard.press",
            "window.focus"
        ])

    async def capture_action(self, agent_action):
        """Capture a computer-use action."""
        # Screenshot before action
        screen_before = await self.recorder.capture_frame()

        # Record action metadata
        event = {
            "timestamp": time.time(),
            "action": agent_action.to_dict(),
            "screen_before": screen_before,
            "window_title": await self.get_active_window(),
            "cursor_position": await self.input_tracker.get_cursor_pos()
        }

        # Execute action
        result = await agent_action.execute()

        # Screenshot after action (with delay for UI updates)
        await asyncio.sleep(0.5)
        event["screen_after"] = await self.recorder.capture_frame()
        event["result"] = result

        return event

Compression Techniques

Efficient storage requires multi-layered compression:

Image compression:

  • Convert screenshots to WebP format (30-40% smaller than PNG)
  • Use progressive JPEG for large images
  • Implement perceptual hashing to detect duplicate screenshots
  • Store only diffs between consecutive frames when possible

Data compression:

import zstandard as zstd
import json

class ReplayCompressor:
    def __init__(self):
        self.compressor = zstd.ZstdCompressor(level=9)

    def compress_event(self, event):
        """Compress event data with zstandard."""
        # Serialize to JSON
        json_data = json.dumps(event, separators=(',', ':'))

        # Compress
        compressed = self.compressor.compress(json_data.encode('utf-8'))

        # Typical compression ratios: 5:1 to 10:1 for JSON event data
        return compressed

    def deduplicate_screenshots(self, screenshots):
        """Use perceptual hashing to eliminate duplicate screenshots."""
        from imagehash import phash
        from PIL import Image

        seen_hashes = {}
        unique_screenshots = []

        for screenshot in screenshots:
            img = Image.open(screenshot.path)
            img_hash = str(phash(img))

            if img_hash not in seen_hashes:
                seen_hashes[img_hash] = screenshot
                unique_screenshots.append(screenshot)
            else:
                # Reference the duplicate
                screenshot.reference = seen_hashes[img_hash].id

        return unique_screenshots

Playback Systems

Replay viewers enable interactive debugging:

import json
import time

import zstandard as zstd

class ReplayPlayer:
    def __init__(self, session_id):
        self.session = self.load_session(session_id)
        self.current_index = 0

    def load_session(self, session_id):
        """Load compressed session data (`storage` is an injected blob-store client)."""
        compressed_data = storage.get(f"replays/{session_id}")
        decompressed = zstd.decompress(compressed_data)
        return json.loads(decompressed)

    def play(self, speed=1.0):
        """Play session at specified speed."""
        for event in self.session["events"]:
            self.render_event(event)
            delay = event.get("delay_until_next", 1000) / speed
            time.sleep(delay / 1000)

    def step_forward(self):
        """Advance to next event."""
        if self.current_index < len(self.session["events"]) - 1:
            self.current_index += 1
            return self.render_event(
                self.session["events"][self.current_index]
            )

    def step_backward(self):
        """Go back to previous event."""
        if self.current_index > 0:
            self.current_index -= 1
            return self.render_event(
                self.session["events"][self.current_index]
            )

    def jump_to_timestamp(self, timestamp):
        """Jump to specific timestamp."""
        for i, event in enumerate(self.session["events"]):
            if event["timestamp"] >= timestamp:
                self.current_index = i
                return self.render_event(event)

    def render_event(self, event):
        """Render event in viewer UI."""
        return {
            "screenshot": self.load_screenshot(event["screenshot_after"]),
            "action": event["action"],
            "reasoning": event.get("agent_reasoning", ""),
            "timestamp": event["timestamp"],
            "metadata": {
                "duration": event.get("duration_ms", 0),
                "success": event.get("result", {}).get("success", True)
            }
        }

Key Metrics

Effective session replay systems track several critical metrics:

Replay Coverage

Definition: Percentage of agent sessions with complete replay data available.

Calculation: (sessions_with_complete_replay / total_sessions) × 100

Targets:

  • Production systems: > 99% coverage for failed sessions
  • Development environments: > 95% coverage for all sessions
  • High-value transactions: 100% coverage with extended retention

Why it matters: Gaps in replay coverage mean blind spots in debugging. A session that fails without replay data becomes nearly impossible to diagnose, especially for non-deterministic failures.
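The coverage calculation above, split into the overall and failed-session views the targets distinguish, can be sketched as (the session record fields are illustrative):

```python
def replay_coverage(sessions):
    """Percentage of sessions with complete replay data, overall and for failures."""
    total = len(sessions)
    covered = sum(1 for s in sessions if s["has_replay"])
    failed = [s for s in sessions if not s["success"]]
    failed_covered = sum(1 for s in failed if s["has_replay"])
    return {
        "overall_pct": 100.0 * covered / total if total else 0.0,
        "failed_pct": 100.0 * failed_covered / len(failed) if failed else 100.0,
    }
```

Tracking the failed-session percentage separately matters because those are exactly the sessions where missing replay data hurts most.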

Storage Efficiency

Definition: Compression ratio and storage cost per session.

Calculation: original_data_size / compressed_data_size

Targets:

  • Compression ratio: > 5:1 for event data
  • Screenshot compression: > 10:1 (using deduplication + format optimization)
  • Cost per session: < $0.10 for 30-day retention

Why it matters: Poor storage efficiency makes session replay unsustainable at scale. A 10,000-session-per-day system with inefficient storage can cost $30,000/month in cloud storage fees.

Debugging Time Reduction

Definition: Time saved in debugging by using session replay versus traditional logs.

Measurement approach: Track time-to-resolution for incidents with replay available versus without.

Typical results:

  • Simple UI interaction bugs: 60-80% time reduction (30 min → 10 min)
  • Complex multi-step failures: 70-90% time reduction (4 hours → 30 min)
  • Non-deterministic issues: often intractable with logs alone → ~1 hour with replay

Why it matters: Developer time is expensive. If session replay saves 2 hours per week per engineer, it pays for itself quickly even with substantial infrastructure costs.

Playback Fidelity

Definition: How accurately the replay matches original agent behavior.

Calculation: (successful_playback_recreations / total_playback_attempts) × 100

Targets:

  • DOM-based web replays: > 95% fidelity
  • Video-based replays: > 99% fidelity
  • Desktop application replays: > 90% fidelity

Why it matters: Low fidelity means developers can't trust the replay, reducing its value and forcing them to fall back to less effective debugging methods.

Retention Compliance

Definition: Percentage of replays properly handling PII redaction and retention policies.

Targets:

  • PII redaction coverage: 100% of sensitive fields
  • Retention policy compliance: 100% (auto-deletion after expiry)
  • Access audit logging: 100% of replay access events logged

Why it matters: Compliance failures can result in regulatory fines, security breaches, and loss of customer trust. Session replay systems often contain the most sensitive data in your infrastructure.

Related Concepts

Session replay integrates with several complementary observability and debugging patterns:

  • Proof-of-action: Cryptographic verification that specific actions occurred, often using session replay data as evidence
  • Audit log: Structured records of agent actions, complemented by visual session replay
  • Screenshots: Point-in-time captures that form the visual component of session replays
  • Observability: Broader system monitoring strategy that includes session replay as one telemetry source
  • Trace analysis: Distributed tracing through agent systems, enhanced by session replay context
  • Error tracking: Bug monitoring systems that link errors to specific session replays
  • Compliance monitoring: Regulatory adherence systems that use session replay for audit trails

Session replay serves as the visual, interactive layer that makes other observability data actionable, transforming abstract logs and metrics into concrete, understandable agent behavior.