Screenshots

Visual captures of application state used by agents for context understanding and as proof-of-action artifacts.

Screenshots serve as the primary visual input mechanism for computer-use agents and provide verifiable evidence of agent actions in agentic UI systems. They enable vision-language models to understand UI state, validate action outcomes, and create auditable records of agent behavior.

Why It Matters

Screenshots are fundamental to modern agent architectures for three critical reasons:

Visual Debugging and Error Analysis

When agents fail, screenshots provide immediate visual context that log files cannot capture. A screenshot showing a modal dialog blocking the intended click target immediately explains why an agent's action failed, whereas textual logs might only show "click failed at coordinates (450, 300)." Teams using screenshot-based debugging report 60-70% faster root-cause identification compared to log-only approaches.

User Verification and Trust Building

Users need to verify that agents are performing actions correctly before trusting them with sensitive operations. Screenshots serve as proof-of-action artifacts that allow users to audit agent behavior. In enterprise deployments, screenshot-based audit trails are often required for compliance, enabling security teams to review exactly what agents did and when.
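
As a sketch of what one entry in such an audit trail can look like, assuming a simple append-only JSONL store (the AuditRecord shape and record_action helper here are illustrative, not any specific product's API):

import hashlib
import json
import time
from dataclasses import asdict, dataclass

@dataclass
class AuditRecord:
    action: str              # what the agent did, e.g. "submit_transfer"
    screenshot_sha256: str   # content hash ties the entry to the exact image
    captured_at: float       # Unix timestamp of the capture
    session_id: str          # correlates entries across one agent session

def record_action(log_path: str, action: str, screenshot: bytes, session_id: str) -> AuditRecord:
    """Append a proof-of-action entry to a JSONL audit log."""
    record = AuditRecord(
        action=action,
        screenshot_sha256=hashlib.sha256(screenshot).hexdigest(),
        captured_at=time.time(),
        session_id=session_id,
    )
    with open(log_path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
    return record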

Vision Model Input for Decision Making

Vision-language models like GPT-4V and Claude require screenshots to understand UI state and make intelligent decisions. Unlike DOM-based approaches that only work for web applications, screenshots enable agents to interact with any visual interface: desktop applications, legacy systems, embedded terminals, or canvas-based web apps. The screenshot becomes the universal interface description language.
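
A minimal sketch of feeding a screenshot to a vision-language model, here using the Anthropic Messages API; the model name and prompt are placeholders:

import base64

import anthropic

client = anthropic.Anthropic()

def describe_ui_state(screenshot_png: bytes) -> str:
    """Ask a vision-language model to describe the current UI state."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder; use your deployed model
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": base64.b64encode(screenshot_png).decode("utf-8"),
                    },
                },
                {"type": "text", "text": "Describe the current UI state and any blocking dialogs."},
            ],
        }],
    )
    return response.content[0].text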

Concrete Examples

Checkpoint Screenshots for Multi-Step Workflows

# Capture screenshots at each critical step
import hashlib
import time

async def transfer_funds_workflow(agent, amount, recipient):
    # Screenshot 1: Initial balance
    initial_state = await agent.capture_screenshot(
        label="balance_before_transfer",
        metadata={"step": 1, "amount": amount}
    )

    await agent.navigate_to_transfers()

    # Screenshot 2: Transfer form filled
    form_state = await agent.capture_screenshot(
        label="transfer_form_filled",
        metadata={"step": 2, "recipient": hash(recipient)}
    )

    await agent.submit_transfer()

    # Screenshot 3: Confirmation screen
    confirmation = await agent.capture_screenshot(
        label="transfer_confirmed",
        metadata={"step": 3, "confirmation_id": "pending"}
    )

    return {
        "screenshots": [initial_state, form_state, confirmation],
        "audit_trail": f"transfer_{timestamp}.zip"
    }

Error State Captures with Context

// Capture screenshots when errors occur with surrounding context
class AgentErrorHandler {
  async captureErrorContext(error: AgentError) {
    const screenshot = await this.browser.screenshot({
      fullPage: false,  // Viewport only for faster capture
      quality: 85       // Balanced quality/size
    });

    const context = {
      screenshot: screenshot,
      accessibility_tree: await this.browser.getAccessibilityTree(),
      mouse_position: await this.browser.getMouseCoordinates(),
      active_element: await this.browser.getActiveElement(),
      timestamp: Date.now(),
      error_message: error.message
    };

    // Store with correlation ID for debugging
    await this.storage.saveErrorArtifact(
      `error_${error.id}`,
      context
    );
  }
}

Before/After Comparison for Validation

# Validate UI changes by comparing screenshots
async def validate_profile_update(agent, new_name):
    # Capture before state
    before = await agent.capture_screenshot(region="profile_section")
    before_hash = hash_screenshot(before)

    # Perform update
    await agent.update_profile_name(new_name)
    await agent.wait_for_network_idle()

    # Capture after state
    after = await agent.capture_screenshot(region="profile_section")
    after_hash = hash_screenshot(after)

    # Verify change occurred
    if before_hash == after_hash:
        raise ValidationError(
            "Screenshot unchanged after update",
            artifacts={"before": before, "after": after}
        )

    # Use vision model to verify correct change
    verification = await vision_model.compare(
        before_image=before,
        after_image=after,
        expected_change=f"Name changed to '{new_name}'"
    )

    return verification.success
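
The hash_screenshot helper above is left undefined; a minimal version appears below. Note that an exact content hash is brittle (one pixel of anti-aliasing noise changes the digest), so a perceptual hash such as imagehash.phash is usually the better default for this comparison:

import hashlib
import io

import imagehash  # third-party: pip install imagehash
from PIL import Image

def hash_screenshot(screenshot: bytes, perceptual: bool = True) -> str:
    """Hash a screenshot for change detection.

    An exact SHA-256 flags any byte-level difference; a perceptual hash
    tolerates compression noise and tiny rendering variations.
    """
    if perceptual:
        image = Image.open(io.BytesIO(screenshot))
        return str(imagehash.phash(image))
    return hashlib.sha256(screenshot).hexdigest()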

Common Pitfalls

Unredacted PII in Screenshots

The most critical pitfall is capturing screenshots containing personally identifiable information without proper redaction. Screenshots may inadvertently capture:

  • Credit card numbers in payment forms
  • Social security numbers in profile pages
  • Email addresses and phone numbers
  • User names and profile photos
  • Medical records or financial statements

Many organizations discover PII leakage only after screenshots are stored in logs, sent to third-party vision APIs, or exposed in debugging sessions. Implement PII detection and redaction before screenshots leave the capture context (see PII Detection and Masking under Implementation below).

Excessive File Sizes Degrading Performance

Full-page 4K screenshots with maximum quality settings can exceed 5-10 MB per image. An agent taking 100 screenshots during a session generates 500 MB-1 GB of data, creating:

  • Slow upload times to vision APIs (adding 2-5 seconds per screenshot)
  • High cloud storage costs ($0.023/GB/month on S3, compounding over millions of screenshots)
  • Memory pressure in browser contexts
  • Bandwidth exhaustion in network-constrained environments

Teams often start with quality: 100 and fullPage: true as defaults, discovering performance issues only after production deployment; the Compression and Optimization section below shows safer defaults.

Missing Context Making Screenshots Unusable

A screenshot without metadata is difficult to understand weeks later during debugging. Critical missing context includes:

  • What action the agent was attempting
  • What the expected outcome was
  • Where the mouse cursor was positioned
  • What accessibility tree state existed
  • What network requests were pending

A screenshot labeled screenshot_1234567890.png with no additional context becomes artifact noise rather than valuable debugging data. A lightweight metadata envelope, sketched below, keeps captures interpretable.
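
A minimal sketch of such an envelope as a Python dataclass; the field set mirrors the list above, and the names are illustrative:

import time
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ScreenshotContext:
    """Metadata stored alongside every screenshot so it stays interpretable later."""
    attempted_action: str                          # e.g. "click #submit-button"
    expected_outcome: str                          # e.g. "confirmation dialog appears"
    cursor_position: Optional[Tuple[int, int]] = None
    accessibility_tree: Optional[str] = None       # serialized snapshot, if available
    pending_requests: List[str] = field(default_factory=list)
    captured_at: float = field(default_factory=time.time)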

Implementation

Capture Timing Strategies

import time
from enum import Enum
from typing import List

class ScreenshotStrategy(Enum):
    BEFORE_ACTION = "before"  # Capture state before each action
    AFTER_ACTION = "after"    # Capture state after each action
    ON_CHANGE = "on_change"   # Capture when viewport changes
    ON_ERROR = "on_error"     # Capture only on failures
    PERIODIC = "periodic"     # Capture every N seconds

class SmartScreenshotCapture:
    def __init__(self, strategies: List[ScreenshotStrategy]):
        self.strategies = strategies
        self.last_screenshot_hash = None
        self.last_capture_time = 0.0
        self.screenshot_interval = 5.0  # seconds

    async def should_capture(self, event: AgentEvent) -> bool:
        # Before/after action captures
        if event.type in ["action_start", "action_end"]:
            strategy = (ScreenshotStrategy.BEFORE_ACTION
                       if event.type == "action_start"
                       else ScreenshotStrategy.AFTER_ACTION)
            if strategy in self.strategies:
                return True

        # On-change detection (avoid duplicates)
        if ScreenshotStrategy.ON_CHANGE in self.strategies:
            current_hash = await self.get_viewport_hash()
            if current_hash != self.last_screenshot_hash:
                self.last_screenshot_hash = current_hash
                return True

        # Error-only mode
        if (ScreenshotStrategy.ON_ERROR in self.strategies
            and event.type == "error"):
            return True

        # Periodic captures
        if ScreenshotStrategy.PERIODIC in self.strategies:
            if time.time() - self.last_capture_time > self.screenshot_interval:
                self.last_capture_time = time.time()
                return True

        return False
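
A hypothetical usage sketch, assuming an agent event loop that emits AgentEvent objects with a type field and an agent exposing capture_screenshot:

capture = SmartScreenshotCapture(strategies=[
    ScreenshotStrategy.AFTER_ACTION,
    ScreenshotStrategy.ON_ERROR,
])

async def handle_event(agent, event):
    # Consult the strategy before paying the capture cost
    if await capture.should_capture(event):
        await agent.capture_screenshot(label=event.type)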

Compression and Optimization

import sharp from 'sharp';

interface ScreenshotOptions {
  format: 'png' | 'jpeg' | 'webp';
  quality: number;  // 1-100
  maxWidth?: number;
  maxHeight?: number;
  optimize: boolean;
}

class OptimizedScreenshotCapture {
  // Adaptive quality based on content
  async captureOptimized(options: Partial<ScreenshotOptions> = {}) {
    const defaults: ScreenshotOptions = {
      format: 'jpeg',  // JPEG for photos, PNG for UI
      quality: 85,     // Sweet spot for size/quality
      maxWidth: 1920,  // Downsample 4K to 1080p
      maxHeight: 1080,
      optimize: true
    };

    const config = { ...defaults, ...options };

    // Capture at native resolution
    const screenshot = await this.browser.screenshot({
      type: config.format,
      quality: config.quality
    });

    // Downsample if needed
    let optimized = screenshot;
    if (config.maxWidth || config.maxHeight) {
      optimized = await sharp(screenshot)
        .resize(config.maxWidth, config.maxHeight, {
          fit: 'inside',
          withoutEnlargement: true
        })
        .toBuffer();
    }

    // Further optimize
    if (config.optimize) {
      if (config.format === 'jpeg') {
        optimized = await sharp(optimized)
          .jpeg({ quality: config.quality, progressive: true })
          .toBuffer();
      } else if (config.format === 'png') {
        optimized = await sharp(optimized)
          .png({ compressionLevel: 9, adaptiveFiltering: true })
          .toBuffer();
      } else if (config.format === 'webp') {
        optimized = await sharp(optimized)
          .webp({ quality: config.quality, effort: 6 })
          .toBuffer();
      }
    }

    return optimized;
  }

  // Size comparison: typical results
  // PNG (unoptimized):    2.4 MB
  // PNG (optimized):      1.8 MB  (25% reduction)
  // JPEG (quality 85):    0.4 MB  (83% reduction)
  // WebP (quality 85):    0.3 MB  (87% reduction)
}

PII Detection and Masking

import io
import re
from typing import List, Tuple

import numpy as np
from PIL import Image, ImageDraw
import easyocr

class PIIRedactor:
    def __init__(self):
        self.reader = easyocr.Reader(['en'])

        # PII detection patterns
        self.patterns = {
            'ssn': re.compile(r'\b\d{3}-\d{2}-\d{4}\b'),
            'credit_card': re.compile(r'\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b'),
            'email': re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'),
            'phone': re.compile(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'),
        }

    async def detect_and_redact(
        self,
        screenshot: bytes
    ) -> Tuple[bytes, List[dict]]:
        """
        Detect PII in screenshot and return redacted version.
        Returns: (redacted_screenshot, detected_pii_metadata)
        """
        # Convert to PIL Image
        image = Image.open(io.BytesIO(screenshot))

        # Run OCR to extract text and bounding boxes
        ocr_results = self.reader.readtext(np.array(image))

        # Detect PII matches
        pii_detections = []
        redaction_boxes = []

        for (bbox, text, confidence) in ocr_results:
            if confidence < 0.5:  # Skip low-confidence OCR
                continue

            # Check each PII pattern
            for pii_type, pattern in self.patterns.items():
                if pattern.search(text):
                    pii_detections.append({
                        'type': pii_type,
                        'text': text,  # For logging only
                        'bbox': bbox,
                        'confidence': confidence
                    })
                    redaction_boxes.append(bbox)

        # Redact by drawing black rectangles over PII
        if redaction_boxes:
            draw = ImageDraw.Draw(image)
            for bbox in redaction_boxes:
                # bbox format: [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]
                x_coords = [point[0] for point in bbox]
                y_coords = [point[1] for point in bbox]

                # Draw filled black rectangle
                draw.rectangle(
                    [min(x_coords), min(y_coords),
                     max(x_coords), max(y_coords)],
                    fill='black'
                )

        # Convert back to bytes (JPEG cannot store alpha, so flatten RGBA first)
        output = io.BytesIO()
        image.convert('RGB').save(output, format='JPEG')
        redacted_screenshot = output.getvalue()

        # Return redacted image and metadata (without actual PII text)
        safe_metadata = [
            {'type': d['type'], 'bbox': d['bbox']}
            for d in pii_detections
        ]

        return redacted_screenshot, safe_metadata

# Usage in agent workflow
async def capture_with_pii_protection(agent):
    redactor = PIIRedactor()

    # Capture raw screenshot
    raw_screenshot = await agent.browser.screenshot()

    # Detect and redact PII
    safe_screenshot, pii_found = await redactor.detect_and_redact(
        raw_screenshot
    )

    # Log PII detection without exposing actual values
    if pii_found:
        agent.log_security_event(
            event="pii_detected_in_screenshot",
            pii_types=[d['type'] for d in pii_found],
            count=len(pii_found)
        )

    # Use safe screenshot for vision API
    return safe_screenshot

Key Metrics

Screenshot File Size

  • Target: < 500 KB per screenshot for optimal API upload speed
  • Measurement: Track P50, P95, and P99 file sizes across all captures
  • Optimization: Use JPEG/WebP at quality 80-85, downsample to 1920x1080
  • Alert threshold: P95 > 1 MB indicates compression settings need adjustment
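
A minimal sketch of tracking those percentiles in-process with numpy; the alert check encodes the threshold above:

import numpy as np

class FileSizeTracker:
    def __init__(self):
        self.sizes_kb = []

    def record(self, screenshot: bytes) -> None:
        self.sizes_kb.append(len(screenshot) / 1024)

    def percentiles(self) -> dict:
        """P50/P95/P99 file sizes in KB across all captures so far."""
        p50, p95, p99 = np.percentile(self.sizes_kb, [50, 95, 99])
        return {"p50_kb": p50, "p95_kb": p95, "p99_kb": p99}

    def compression_needs_adjustment(self) -> bool:
        # Alert threshold from above: P95 > 1 MB
        return self.percentiles()["p95_kb"] > 1024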

PII Detection Accuracy

  • Precision: Percentage of redacted regions that actually contain PII (target > 90%)
  • Recall: Percentage of actual PII that was detected and redacted (target > 95%)
  • False positive rate: Over-redaction can make screenshots unusable (target < 5%)
  • Measurement: Manual audit of random sample (100 screenshots/week), compare with security team review
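
The precision, recall, and false-positive targets above reduce to simple counts from the weekly audit sample; a sketch:

def pii_redaction_metrics(true_positives: int, false_positives: int, false_negatives: int) -> dict:
    """TP = redacted regions that contained PII, FP = redacted regions without PII
    (over-redaction), FN = PII regions that were missed."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return {
        "precision": precision,               # target > 0.90
        "recall": recall,                     # target > 0.95
        "false_positive_rate": 1 - precision, # share of over-redaction, target < 0.05
    }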

Storage Costs per Agent Session

  • Baseline: 50-200 screenshots per agent session at 400 KB average = 20-80 MB per session
  • Monthly cost: 1M sessions × 50 MB × $0.023/GB = $1,150/month on S3 Standard
  • Optimization: Move to S3 Glacier after 30 days (90% cost reduction), delete after 90 days
  • Cost per screenshot: ~$0.00001/month storage + $0.0004 vision API processing ≈ $0.0004 total
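
The arithmetic above as a small, checkable cost model; the price is the S3 Standard figure already quoted, and everything else is a parameter (50 MB/session corresponds to roughly 125 screenshots at 400 KB):

def monthly_storage_cost(
    sessions_per_month: int = 1_000_000,
    screenshots_per_session: int = 125,
    avg_size_kb: float = 400.0,
    price_per_gb_month: float = 0.023,  # S3 Standard, as quoted above
) -> float:
    """Estimated monthly screenshot storage cost in USD (decimal GB, matching the quote)."""
    total_gb = sessions_per_month * screenshots_per_session * avg_size_kb / 1e6
    return total_gb * price_per_gb_month

# monthly_storage_cost() -> 1150.0, matching the $1,150/month figure above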

Screenshot Capture Latency

  • Target: < 200ms for viewport screenshots, < 1s for full-page
  • Impact: Each screenshot adds to total agent action latency
  • Measurement: Track capture time distribution, correlate with image dimensions
  • Optimization: Use viewport-only captures during actions, full-page only at checkpoints

Screenshot-to-Vision-API Pipeline Latency

  • Target: End-to-end < 2 seconds (capture + upload + API processing)
  • Bottlenecks: Network upload time (depends on file size), API queue time
  • Measurement: time_to_vision_result = capture_time + upload_time + api_processing_time
  • Cost/latency tradeoff: Smaller screenshots upload faster but may reduce vision model accuracy
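
A sketch of instrumenting that breakdown; capture_screenshot, upload, and analyze are hypothetical stand-ins for your capture and vision-API calls:

import time

async def timed_vision_call(agent, vision_api):
    timings = {}

    t0 = time.monotonic()
    screenshot = await agent.capture_screenshot()      # hypothetical capture call
    timings["capture_time_s"] = time.monotonic() - t0

    t0 = time.monotonic()
    upload_ref = await vision_api.upload(screenshot)   # hypothetical upload call
    timings["upload_time_s"] = time.monotonic() - t0

    t0 = time.monotonic()
    result = await vision_api.analyze(upload_ref)      # hypothetical processing call
    timings["api_processing_time_s"] = time.monotonic() - t0

    timings["time_to_vision_result_s"] = sum(timings.values())
    return result, timings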

Related Concepts

  • Proof of Action - Screenshots serve as primary proof-of-action artifacts
  • Observability - Screenshots are critical observability data for agent debugging
  • PII Redaction - Essential pre-processing step before storing or transmitting screenshots
  • Proof of Action Artifact - Screenshots are the most common artifact type for verification