Screenshots

Visual captures of application state used by agents for context understanding and as proof-of-action artifacts.

Screenshots serve as the primary visual input mechanism for computer-use agents and provide verifiable evidence of agent actions in agentic UI systems. They enable vision-language models to understand UI state, validate action outcomes, and create auditable records of agent behavior.

Why It Matters

Screenshots are fundamental to modern agent architectures for three critical reasons:

Visual Debugging and Error Analysis

When agents fail, screenshots provide immediate visual context that log files cannot capture. A screenshot showing a modal dialog blocking the intended click target immediately explains why an agent's action failed, whereas textual logs might only show "click failed at coordinates (450, 300)." Teams using screenshot-based debugging report 60-70% faster root-cause identification compared to log-only approaches.

User Verification and Trust Building

Users need to verify that agents are performing actions correctly before trusting them with sensitive operations. Screenshots serve as proof-of-action artifacts that allow users to audit agent behavior. In enterprise deployments, screenshot-based audit trails are often required for compliance, enabling security teams to review exactly what agents did and when.
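
As a sketch of what one entry in such an audit trail can look like, assuming a simple append-only JSONL store (the AuditRecord shape and record_action helper here are illustrative, not any specific product's API):

import hashlib
import json
import time
from dataclasses import asdict, dataclass

@dataclass
class AuditRecord:
    action: str              # what the agent did, e.g. "submit_transfer"
    screenshot_sha256: str   # content hash ties the entry to the exact image
    captured_at: float       # Unix timestamp of the capture
    session_id: str          # correlates entries across one agent session

def record_action(log_path: str, action: str, screenshot: bytes, session_id: str) -> AuditRecord:
    """Append a proof-of-action entry to a JSONL audit log."""
    record = AuditRecord(
        action=action,
        screenshot_sha256=hashlib.sha256(screenshot).hexdigest(),
        captured_at=time.time(),
        session_id=session_id,
    )
    with open(log_path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")
    return record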

Vision Model Input for Decision Making

Vision-language models like GPT-4V and Claude require screenshots to understand UI state and make intelligent decisions. Unlike DOM-based approaches that only work for web applications, screenshots enable agents to interact with any visual interface: desktop applications, legacy systems, embedded terminals, or canvas-based web apps. The screenshot becomes the universal interface description language.
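
A minimal sketch of feeding a screenshot to a vision-language model, here using the Anthropic Messages API; the model name and prompt are placeholders:

import base64

import anthropic

client = anthropic.Anthropic()

def describe_ui_state(screenshot_png: bytes) -> str:
    """Ask a vision-language model to describe the current UI state."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder; use your deployed model
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": base64.b64encode(screenshot_png).decode("utf-8"),
                    },
                },
                {"type": "text", "text": "Describe the current UI state and any blocking dialogs."},
            ],
        }],
    )
    return response.content[0].text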

Concrete Examples

Checkpoint Screenshots for Multi-Step Workflows

# Capture screenshots at each critical step
import hashlib
import time

async def transfer_funds_workflow(agent, amount, recipient):
    # Screenshot 1: Initial balance
    initial_state = await agent.capture_screenshot(
        label="balance_before_transfer",
        metadata={"step": 1, "amount": amount}
    )

    await agent.navigate_to_transfers()

    # Screenshot 2: Transfer form filled
    form_state = await agent.capture_screenshot(
        label="transfer_form_filled",
        metadata={"step": 2, "recipient": hash(recipient)}
    )

    await agent.submit_transfer()

    # Screenshot 3: Confirmation screen
    confirmation = await agent.capture_screenshot(
        label="transfer_confirmed",
        metadata={"step": 3, "confirmation_id": "pending"}
    )

    return {
        "screenshots": [initial_state, form_state, confirmation],
        "audit_trail": f"transfer_{timestamp}.zip"
    }

Error State Captures with Context

// Capture screenshots when errors occur with surrounding context
class AgentErrorHandler {
  async captureErrorContext(error: AgentError) {
    const screenshot = await this.browser.screenshot({
      fullPage: false,  // Viewport only for faster capture
      quality: 85       // Balanced quality/size
    });

    const context = {
      screenshot: screenshot,
      accessibility_tree: await this.browser.getAccessibilityTree(),
      mouse_position: await this.browser.getMouseCoordinates(),
      active_element: await this.browser.getActiveElement(),
      timestamp: Date.now(),
      error_message: error.message
    };

    // Store with correlation ID for debugging
    await this.storage.saveErrorArtifact(
      `error_${error.id}`,
      context
    );
  }
}

Before/After Comparison for Validation

# Validate UI changes by comparing screenshots
async def validate_profile_update(agent, new_name):
    # Capture before state
    before = await agent.capture_screenshot(region="profile_section")
    before_hash = hash_screenshot(before)

    # Perform update
    await agent.update_profile_name(new_name)
    await agent.wait_for_network_idle()

    # Capture after state
    after = await agent.capture_screenshot(region="profile_section")
    after_hash = hash_screenshot(after)

    # Verify change occurred
    if before_hash == after_hash:
        raise ValidationError(
            "Screenshot unchanged after update",
            artifacts={"before": before, "after": after}
        )

    # Use vision model to verify correct change
    verification = await vision_model.compare(
        before_image=before,
        after_image=after,
        expected_change=f"Name changed to '{new_name}'"
    )

    return verification.success
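
The hash_screenshot helper above is left undefined; a minimal version appears below. Note that an exact content hash is brittle (one pixel of anti-aliasing noise changes the digest), so a perceptual hash such as imagehash.phash is usually the better default for this comparison:

import hashlib
import io

import imagehash  # third-party: pip install imagehash
from PIL import Image

def hash_screenshot(screenshot: bytes, perceptual: bool = True) -> str:
    """Hash a screenshot for change detection.

    An exact SHA-256 flags any byte-level difference; a perceptual hash
    tolerates compression noise and tiny rendering variations.
    """
    if perceptual:
        image = Image.open(io.BytesIO(screenshot))
        return str(imagehash.phash(image))
    return hashlib.sha256(screenshot).hexdigest()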

Common Pitfalls

Unredacted PII in Screenshots

The most critical pitfall is capturing screenshots containing personally identifiable information without proper redaction. Screenshots may inadvertently capture:

  • Credit card numbers in payment forms
  • Social security numbers in profile pages
  • Email addresses and phone numbers
  • User names and profile photos
  • Medical records or financial statements

Many organizations discover PII leakage only after screenshots are stored in logs, sent to third-party vision APIs, or exposed in debugging sessions. Implement PII detection and redaction before screenshots leave the capture context (see PII Detection and Masking under Implementation below).

Excessive File Sizes Degrading Performance

Full-page 4K screenshots with maximum quality settings can exceed 5-10 MB per image. An agent taking 100 screenshots during a session generates 500 MB-1 GB of data, creating:

  • Slow upload times to vision APIs (adding 2-5 seconds per screenshot)
  • High cloud storage costs ($0.023/GB/month on S3, compounding over millions of screenshots)
  • Memory pressure in browser contexts
  • Bandwidth exhaustion in network-constrained environments

Teams often start with quality: 100 and fullPage: true as defaults, discovering performance issues only after production deployment; the Compression and Optimization section below shows safer defaults.

Missing Context Making Screenshots Unusable

A screenshot without metadata is difficult to understand weeks later during debugging. Critical missing context includes:

  • What action the agent was attempting
  • What the expected outcome was
  • Where the mouse cursor was positioned
  • What accessibility tree state existed
  • What network requests were pending

A screenshot labeled screenshot_1234567890.png with no additional context becomes artifact noise rather than valuable debugging data. A lightweight metadata envelope, sketched below, keeps captures interpretable.
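
A minimal sketch of such an envelope as a Python dataclass; the field set mirrors the list above, and the names are illustrative:

import time
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ScreenshotContext:
    """Metadata stored alongside every screenshot so it stays interpretable later."""
    attempted_action: str                          # e.g. "click #submit-button"
    expected_outcome: str                          # e.g. "confirmation dialog appears"
    cursor_position: Optional[Tuple[int, int]] = None
    accessibility_tree: Optional[str] = None       # serialized snapshot, if available
    pending_requests: List[str] = field(default_factory=list)
    captured_at: float = field(default_factory=time.time)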

Implementation

Capture Timing Strategies

import time
from enum import Enum
from typing import List

class ScreenshotStrategy(Enum):
    BEFORE_ACTION = "before"  # Capture state before each action
    AFTER_ACTION = "after"    # Capture state after each action
    ON_CHANGE = "on_change"   # Capture when viewport changes
    ON_ERROR = "on_error"     # Capture only on failures
    PERIODIC = "periodic"     # Capture every N seconds

class SmartScreenshotCapture:
    def __init__(self, strategies: List[ScreenshotStrategy]):
        self.strategies = strategies
        self.last_screenshot_hash = None
        self.last_capture_time = 0.0
        self.screenshot_interval = 5.0  # seconds

    async def should_capture(self, event: AgentEvent) -> bool:
        # Before/after action captures
        if event.type in ["action_start", "action_end"]:
            strategy = (ScreenshotStrategy.BEFORE_ACTION
                       if event.type == "action_start"
                       else ScreenshotStrategy.AFTER_ACTION)
            if strategy in self.strategies:
                return True

        # On-change detection (avoid duplicates)
        if ScreenshotStrategy.ON_CHANGE in self.strategies:
            current_hash = await self.get_viewport_hash()
            if current_hash != self.last_screenshot_hash:
                self.last_screenshot_hash = current_hash
                return True

        # Error-only mode
        if (ScreenshotStrategy.ON_ERROR in self.strategies
            and event.type == "error"):
            return True

        # Periodic captures
        if ScreenshotStrategy.PERIODIC in self.strategies:
            if time.time() - self.last_capture_time > self.screenshot_interval:
                self.last_capture_time = time.time()
                return True

        return False
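
A hypothetical usage sketch, assuming an agent event loop that emits AgentEvent objects with a type field and an agent exposing capture_screenshot:

capture = SmartScreenshotCapture(strategies=[
    ScreenshotStrategy.AFTER_ACTION,
    ScreenshotStrategy.ON_ERROR,
])

async def handle_event(agent, event):
    # Consult the strategy before paying the capture cost
    if await capture.should_capture(event):
        await agent.capture_screenshot(label=event.type)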

Compression and Optimization

import sharp from 'sharp';

interface ScreenshotOptions {
  format: 'png' | 'jpeg' | 'webp';
  quality: number;  // 1-100
  maxWidth?: number;
  maxHeight?: number;
  optimize: boolean;
}

class OptimizedScreenshotCapture {
  // Adaptive quality based on content
  async captureOptimized(options: Partial<ScreenshotOptions> = {}) {
    const defaults: ScreenshotOptions = {
      format: 'jpeg',  // JPEG for photos, PNG for UI
      quality: 85,     // Sweet spot for size/quality
      maxWidth: 1920,  // Downsample 4K to 1080p
      maxHeight: 1080,
      optimize: true
    };

    const config = { ...defaults, ...options };

    // Capture at native resolution
    const screenshot = await this.browser.screenshot({
      type: config.format,
      quality: config.quality
    });

    // Downsample if needed
    let optimized = screenshot;
    if (config.maxWidth || config.maxHeight) {
      optimized = await sharp(screenshot)
        .resize(config.maxWidth, config.maxHeight, {
          fit: 'inside',
          withoutEnlargement: true
        })
        .toBuffer();
    }

    // Further optimize
    if (config.optimize) {
      if (config.format === 'jpeg') {
        optimized = await sharp(optimized)
          .jpeg({ quality: config.quality, progressive: true })
          .toBuffer();
      } else if (config.format === 'png') {
        optimized = await sharp(optimized)
          .png({ compressionLevel: 9, adaptiveFiltering: true })
          .toBuffer();
      } else if (config.format === 'webp') {
        optimized = await sharp(optimized)
          .webp({ quality: config.quality, effort: 6 })
          .toBuffer();
      }
    }

    return optimized;
  }

  // Size comparison: typical results
  // PNG (unoptimized):    2.4 MB
  // PNG (optimized):      1.8 MB  (25% reduction)
  // JPEG (quality 85):    0.4 MB  (83% reduction)
  // WebP (quality 85):    0.3 MB  (87% reduction)
}

PII Detection and Masking

import io
import re
from typing import List, Tuple

import numpy as np
from PIL import Image, ImageDraw
import easyocr

class PIIRedactor:
    def __init__(self):
        self.reader = easyocr.Reader(['en'])

        # PII detection patterns
        self.patterns = {
            'ssn': re.compile(r'\b\d{3}-\d{2}-\d{4}\b'),
            'credit_card': re.compile(r'\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b'),
            'email': re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'),
            'phone': re.compile(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'),
        }

    async def detect_and_redact(
        self,
        screenshot: bytes
    ) -> Tuple[bytes, List[dict]]:
        """
        Detect PII in screenshot and return redacted version.
        Returns: (redacted_screenshot, detected_pii_metadata)
        """
        # Convert to PIL Image
        image = Image.open(io.BytesIO(screenshot))

        # Run OCR to extract text and bounding boxes
        ocr_results = self.reader.readtext(np.array(image))

        # Detect PII matches
        pii_detections = []
        redaction_boxes = []

        for (bbox, text, confidence) in ocr_results:
            if confidence < 0.5:  # Skip low-confidence OCR
                continue

            # Check each PII pattern
            for pii_type, pattern in self.patterns.items():
                if pattern.search(text):
                    pii_detections.append({
                        'type': pii_type,
                        'text': text,  # For logging only
                        'bbox': bbox,
                        'confidence': confidence
                    })
                    redaction_boxes.append(bbox)

        # Redact by drawing black rectangles over PII
        if redaction_boxes:
            draw = ImageDraw.Draw(image)
            for bbox in redaction_boxes:
                # bbox format: [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]
                x_coords = [point[0] for point in bbox]
                y_coords = [point[1] for point in bbox]

                # Draw filled black rectangle
                draw.rectangle(
                    [min(x_coords), min(y_coords),
                     max(x_coords), max(y_coords)],
                    fill='black'
                )

        # Convert back to bytes (JPEG cannot store alpha, so flatten RGBA first)
        output = io.BytesIO()
        image.convert('RGB').save(output, format='JPEG')
        redacted_screenshot = output.getvalue()

        # Return redacted image and metadata (without actual PII text)
        safe_metadata = [
            {'type': d['type'], 'bbox': d['bbox']}
            for d in pii_detections
        ]

        return redacted_screenshot, safe_metadata

# Usage in agent workflow
async def capture_with_pii_protection(agent):
    redactor = PIIRedactor()

    # Capture raw screenshot
    raw_screenshot = await agent.browser.screenshot()

    # Detect and redact PII
    safe_screenshot, pii_found = await redactor.detect_and_redact(
        raw_screenshot
    )

    # Log PII detection without exposing actual values
    if pii_found:
        agent.log_security_event(
            event="pii_detected_in_screenshot",
            pii_types=[d['type'] for d in pii_found],
            count=len(pii_found)
        )

    # Use safe screenshot for vision API
    return safe_screenshot

Key Metrics

Screenshot File Size

  • Target: < 500 KB per screenshot for optimal API upload speed
  • Measurement: Track P50, P95, and P99 file sizes across all captures
  • Optimization: Use JPEG/WebP at quality 80-85, downsample to 1920x1080
  • Alert threshold: P95 > 1 MB indicates compression settings need adjustment
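
A minimal sketch of tracking those percentiles in-process with numpy; the alert check encodes the threshold above:

import numpy as np

class FileSizeTracker:
    def __init__(self):
        self.sizes_kb = []

    def record(self, screenshot: bytes) -> None:
        self.sizes_kb.append(len(screenshot) / 1024)

    def percentiles(self) -> dict:
        """P50/P95/P99 file sizes in KB across all captures so far."""
        p50, p95, p99 = np.percentile(self.sizes_kb, [50, 95, 99])
        return {"p50_kb": p50, "p95_kb": p95, "p99_kb": p99}

    def compression_needs_adjustment(self) -> bool:
        # Alert threshold from above: P95 > 1 MB
        return self.percentiles()["p95_kb"] > 1024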

PII Detection Accuracy

  • Precision: Percentage of redacted regions that actually contain PII (target > 90%)
  • Recall: Percentage of actual PII that was detected and redacted (target > 95%)
  • False positive rate: Over-redaction can make screenshots unusable (target < 5%)
  • Measurement: Manual audit of random sample (100 screenshots/week), compare with security team review
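
The precision, recall, and false-positive targets above reduce to simple counts from the weekly audit sample; a sketch:

def pii_redaction_metrics(true_positives: int, false_positives: int, false_negatives: int) -> dict:
    """TP = redacted regions that contained PII, FP = redacted regions without PII
    (over-redaction), FN = PII regions that were missed."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return {
        "precision": precision,               # target > 0.90
        "recall": recall,                     # target > 0.95
        "false_positive_rate": 1 - precision, # share of over-redaction, target < 0.05
    }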

Storage Costs per Agent Session

  • Baseline: 50-200 screenshots per agent session at 400 KB average = 20-80 MB per session
  • Monthly cost: 1M sessions × 50 MB × $0.023/GB = $1,150/month on S3 Standard
  • Optimization: Move to S3 Glacier after 30 days (90% cost reduction), delete after 90 days
  • Cost per screenshot: ~$0.00001/month storage + $0.0004 vision API processing ≈ $0.0004 total
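
The arithmetic above as a small, checkable cost model; the price is the S3 Standard figure already quoted, and everything else is a parameter (50 MB/session corresponds to roughly 125 screenshots at 400 KB):

def monthly_storage_cost(
    sessions_per_month: int = 1_000_000,
    screenshots_per_session: int = 125,
    avg_size_kb: float = 400.0,
    price_per_gb_month: float = 0.023,  # S3 Standard, as quoted above
) -> float:
    """Estimated monthly screenshot storage cost in USD (decimal GB, matching the quote)."""
    total_gb = sessions_per_month * screenshots_per_session * avg_size_kb / 1e6
    return total_gb * price_per_gb_month

# monthly_storage_cost() -> 1150.0, matching the $1,150/month figure above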

Screenshot Capture Latency

  • Target: < 200ms for viewport screenshots, < 1s for full-page
  • Impact: Each screenshot adds to total agent action latency
  • Measurement: Track capture time distribution, correlate with image dimensions
  • Optimization: Use viewport-only captures during actions, full-page only at checkpoints

Screenshot-to-Vision-API Pipeline Latency

  • Target: End-to-end < 2 seconds (capture + upload + API processing)
  • Bottlenecks: Network upload time (depends on file size), API queue time
  • Measurement: time_to_vision_result = capture_time + upload_time + api_processing_time
  • Cost/latency tradeoff: Smaller screenshots upload faster but may reduce vision model accuracy
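
A sketch of instrumenting that breakdown; capture_screenshot, upload, and analyze are hypothetical stand-ins for your capture and vision-API calls:

import time

async def timed_vision_call(agent, vision_api):
    timings = {}

    t0 = time.monotonic()
    screenshot = await agent.capture_screenshot()      # hypothetical capture call
    timings["capture_time_s"] = time.monotonic() - t0

    t0 = time.monotonic()
    upload_ref = await vision_api.upload(screenshot)   # hypothetical upload call
    timings["upload_time_s"] = time.monotonic() - t0

    t0 = time.monotonic()
    result = await vision_api.analyze(upload_ref)      # hypothetical processing call
    timings["api_processing_time_s"] = time.monotonic() - t0

    timings["time_to_vision_result_s"] = sum(timings.values())
    return result, timings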

Related Concepts

  • Proof of Action - Screenshots serve as primary proof-of-action artifacts
  • Observability - Screenshots are critical observability data for agent debugging
  • PII Redaction - Essential pre-processing step before storing or transmitting screenshots
  • Proof of Action Artifact - Screenshots are the most common artifact type for verification