Screenshots
Visual captures of application state used by agents for context understanding and as proof-of-action artifacts.
Screenshots serve as the primary visual input mechanism for computer-use agents and provide verifiable evidence of agent actions in agentic UI systems. They enable vision-language models to understand UI state, validate action outcomes, and create auditable records of agent behavior.
Why It Matters
Screenshots are fundamental to modern agent architectures for three critical reasons:
Visual Debugging and Error Analysis
When agents fail, screenshots provide immediate visual context that log files cannot capture. A screenshot showing a modal dialog blocking the intended click target immediately explains why an agent's action failed, whereas textual logs might only show "click failed at coordinates (450, 300)." Teams using screenshot-based debugging often report 60-70% faster root-cause identification than with log-only approaches.
User Verification and Trust Building
Users need to verify that agents are performing actions correctly before trusting them with sensitive operations. Screenshots serve as proof-of-action artifacts that allow users to audit agent behavior. In enterprise deployments, screenshot-based audit trails are often required for compliance, enabling security teams to review exactly what agents did and when.
Vision Model Input for Decision Making
Vision-language models like GPT-4V and Claude require screenshots to understand UI state and make intelligent decisions. Unlike DOM-based approaches that only work for web applications, screenshots enable agents to interact with any visual interface: desktop applications, legacy systems, embedded terminals, or canvas-based web apps. The screenshot becomes the universal interface description language.
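In practice, the screenshot is passed to the model as an image content block alongside a textual question. A minimal sketch using the Anthropic Python SDK; the model ID is a placeholder, so substitute a current vision-capable model:

# Ask a vision-language model about the current UI state (sketch)
import base64
import anthropic

def describe_ui_state(screenshot_png: bytes, question: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # placeholder model ID
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": base64.b64encode(screenshot_png).decode(),
                }},
                {"type": "text", "text": question},
            ],
        }],
    )
    return response.content[0].text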
Concrete Examples
Checkpoint Screenshots for Multi-Step Workflows
# Capture screenshots at each critical step
import time

async def transfer_funds_workflow(agent, amount, recipient):
    timestamp = int(time.time())

    # Screenshot 1: Initial balance
    initial_state = await agent.capture_screenshot(
        label="balance_before_transfer",
        metadata={"step": 1, "amount": amount}
    )

    await agent.navigate_to_transfers()

    # Screenshot 2: Transfer form filled
    form_state = await agent.capture_screenshot(
        label="transfer_form_filled",
        metadata={"step": 2, "recipient": hash(recipient)}
    )

    await agent.submit_transfer()

    # Screenshot 3: Confirmation screen
    confirmation = await agent.capture_screenshot(
        label="transfer_confirmed",
        metadata={"step": 3, "confirmation_id": "pending"}
    )

    return {
        "screenshots": [initial_state, form_state, confirmation],
        "audit_trail": f"transfer_{timestamp}.zip"
    }
Error State Captures with Context
// Capture screenshots when errors occur with surrounding context
class AgentErrorHandler {
  async captureErrorContext(error: AgentError) {
    const screenshot = await this.browser.screenshot({
      fullPage: false, // Viewport only for faster capture
      quality: 85      // Balanced quality/size
    });

    const context = {
      screenshot: screenshot,
      dom_snapshot: await this.browser.getAccessibilityTree(),
      mouse_position: await this.browser.getMouseCoordinates(),
      active_element: await this.browser.getActiveElement(),
      timestamp: Date.now(),
      error_message: error.message
    };

    // Store with correlation ID for debugging
    await this.storage.saveErrorArtifact(
      `error_${error.id}`,
      context
    );
  }
}
Before/After Comparison for Validation
# Validate UI changes by comparing screenshots
async def validate_profile_update(agent, new_name):
    # Capture before state
    before = await agent.capture_screenshot(region="profile_section")
    before_hash = hash_screenshot(before)

    # Perform update
    await agent.update_profile_name(new_name)
    await agent.wait_for_network_idle()

    # Capture after state
    after = await agent.capture_screenshot(region="profile_section")
    after_hash = hash_screenshot(after)

    # Verify change occurred
    if before_hash == after_hash:
        raise ValidationError(
            "Screenshot unchanged after update",
            artifacts={"before": before, "after": after}
        )

    # Use vision model to verify correct change
    verification = await vision_model.compare(
        before_image=before,
        after_image=after,
        expected_change=f"Name changed to '{new_name}'"
    )
    return verification.success
Common Pitfalls
Unredacted PII in Screenshots
The most critical pitfall is capturing screenshots containing personally identifiable information without proper redaction. Screenshots may inadvertently capture:
- Credit card numbers in payment forms
- Social security numbers in profile pages
- Email addresses and phone numbers
- User names and profile photos
- Medical records or financial statements
Many organizations discover PII leakage only after screenshots are stored in logs, sent to third-party vision APIs, or exposed in debugging sessions. Implement PII detection and redaction before screenshots leave the capture context (see PII Detection and Masking under Implementation below).
Excessive File Sizes Degrading Performance
Full-page 4K screenshots with maximum quality settings can exceed 5-10 MB per image. An agent taking 100 screenshots during a session generates 500 MB-1 GB of data, creating:
- Slow upload times to vision APIs (adding 2-5 seconds per screenshot)
- High cloud storage costs ($0.023/GB/month on S3, compounding over millions of screenshots)
- Memory pressure in browser contexts
- Bandwidth exhaustion in network-constrained environments
Teams often start with quality: 100 and fullPage: true as defaults, and discover the performance impact only after production deployment; a size guard like the one sketched below catches oversized captures early.
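A minimal sketch of such a guard using Pillow; the 500 KB budget and the quality floor are illustrative assumptions, not part of any specific library:

# Guard against oversized captures before storage or upload (illustrative)
import io
from PIL import Image

MAX_SCREENSHOT_BYTES = 500_000  # assumed budget: ~500 KB per capture

def enforce_size_budget(screenshot: bytes, start_quality: int = 85) -> bytes:
    """Recompress a screenshot as JPEG until it fits the size budget."""
    if len(screenshot) <= MAX_SCREENSHOT_BYTES:
        return screenshot
    image = Image.open(io.BytesIO(screenshot)).convert("RGB")
    best = screenshot
    for quality in range(start_quality, 40, -10):  # e.g. 85, 75, 65, 55, 45
        buffer = io.BytesIO()
        image.save(buffer, format="JPEG", quality=quality)
        best = buffer.getvalue()
        if len(best) <= MAX_SCREENSHOT_BYTES:
            break
    return best  # best effort: lowest-quality attempt if nothing fit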
Missing Context Making Screenshots Unusable
A screenshot without metadata is difficult to understand weeks later during debugging. Critical missing context includes:
- What action the agent was attempting
- What the expected outcome was
- Where the mouse cursor was positioned
- What accessibility tree state existed
- What network requests were pending
A screenshot labeled screenshot_1234567890.png with no additional context becomes useless artifact noise rather than valuable debugging data.
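A lightweight metadata envelope captured alongside every image solves this. The field set below mirrors the list above and is an illustrative assumption, not a fixed schema:

# Bundle each screenshot with the context needed to interpret it later
import time
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class ScreenshotArtifact:
    image: bytes
    attempted_action: str                   # e.g. "click #submit-button"
    expected_outcome: str                   # e.g. "confirmation modal appears"
    mouse_position: Optional[tuple] = None
    accessibility_tree: Optional[dict] = None
    pending_requests: list = field(default_factory=list)
    captured_at: float = field(default_factory=time.time)

    def manifest(self) -> dict:
        """Everything except the raw image, suitable for a JSON sidecar file."""
        data = asdict(self)
        data.pop("image")
        return data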
Implementation
Capture Timing Strategies
import time
from enum import Enum
from typing import List

class ScreenshotStrategy(Enum):
    BEFORE_ACTION = "before"     # Capture state before each action
    AFTER_ACTION = "after"       # Capture state after each action
    ON_CHANGE = "on_change"      # Capture when viewport changes
    ON_ERROR = "on_error"        # Capture only on failures
    PERIODIC = "periodic"        # Capture every N seconds

class SmartScreenshotCapture:
    def __init__(self, strategies: List[ScreenshotStrategy]):
        self.strategies = strategies
        self.last_screenshot_hash = None
        self.last_capture_time = 0.0
        self.screenshot_interval = 5.0  # seconds

    async def should_capture(self, event: AgentEvent) -> bool:
        # Before/after action captures
        if event.type in ["action_start", "action_end"]:
            strategy = (ScreenshotStrategy.BEFORE_ACTION
                        if event.type == "action_start"
                        else ScreenshotStrategy.AFTER_ACTION)
            if strategy in self.strategies:
                return True

        # On-change detection (avoid duplicates)
        if ScreenshotStrategy.ON_CHANGE in self.strategies:
            current_hash = await self.get_viewport_hash()
            if current_hash != self.last_screenshot_hash:
                self.last_screenshot_hash = current_hash
                return True

        # Error-only mode
        if (ScreenshotStrategy.ON_ERROR in self.strategies
                and event.type == "error"):
            return True

        # Periodic captures
        if ScreenshotStrategy.PERIODIC in self.strategies:
            if time.time() - self.last_capture_time > self.screenshot_interval:
                return True

        return False
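Wiring this into an agent loop might look like the following; the AgentEvent delivery and the last_capture_time bookkeeping are assumptions about the surrounding harness:

# Hypothetical wiring into an agent's event loop
capture = SmartScreenshotCapture([
    ScreenshotStrategy.AFTER_ACTION,
    ScreenshotStrategy.ON_ERROR,
])

async def on_agent_event(agent, event):
    if await capture.should_capture(event):
        await agent.capture_screenshot(label=event.type)
        capture.last_capture_time = time.time()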
Compression and Optimization
import sharp from 'sharp';

interface ScreenshotOptions {
  format: 'png' | 'jpeg' | 'webp';
  quality: number; // 1-100
  maxWidth?: number;
  maxHeight?: number;
  optimize: boolean;
}

class OptimizedScreenshotCapture {
  // Adaptive quality based on content
  async captureOptimized(options: Partial<ScreenshotOptions> = {}) {
    const defaults: ScreenshotOptions = {
      format: 'jpeg',   // JPEG for photos, PNG for UI
      quality: 85,      // Sweet spot for size/quality
      maxWidth: 1920,   // Downsample 4K to 1080p
      maxHeight: 1080,
      optimize: true
    };
    const config = { ...defaults, ...options };

    // Capture at native resolution
    const screenshot = await this.browser.screenshot({
      type: config.format,
      quality: config.quality
    });

    // Downsample if needed
    let optimized = screenshot;
    if (config.maxWidth || config.maxHeight) {
      optimized = await sharp(screenshot)
        .resize(config.maxWidth, config.maxHeight, {
          fit: 'inside',
          withoutEnlargement: true
        })
        .toBuffer();
    }

    // Further optimize per format
    if (config.optimize) {
      if (config.format === 'jpeg') {
        optimized = await sharp(optimized)
          .jpeg({ quality: config.quality, progressive: true })
          .toBuffer();
      } else if (config.format === 'png') {
        optimized = await sharp(optimized)
          .png({ compressionLevel: 9, adaptiveFiltering: true })
          .toBuffer();
      } else if (config.format === 'webp') {
        optimized = await sharp(optimized)
          .webp({ quality: config.quality, effort: 6 })
          .toBuffer();
      }
    }

    return optimized;
  }

  // Size comparison: typical results
  // PNG (unoptimized): 2.4 MB
  // PNG (optimized):   1.8 MB (25% reduction)
  // JPEG (quality 85): 0.4 MB (83% reduction)
  // WebP (quality 85): 0.3 MB (87% reduction)
}
PII Detection and Masking
import io
import re
from typing import List, Tuple

import numpy as np
from PIL import Image, ImageDraw
import easyocr

class PIIRedactor:
    def __init__(self):
        self.reader = easyocr.Reader(['en'])
        # PII detection patterns
        self.patterns = {
            'ssn': re.compile(r'\b\d{3}-\d{2}-\d{4}\b'),
            'credit_card': re.compile(r'\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b'),
            'email': re.compile(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'),
            'phone': re.compile(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'),
        }

    async def detect_and_redact(
        self,
        screenshot: bytes
    ) -> Tuple[bytes, List[dict]]:
        """
        Detect PII in screenshot and return redacted version.
        Returns: (redacted_screenshot, detected_pii_metadata)
        """
        # Convert to PIL Image (force RGB so the JPEG save below cannot fail)
        image = Image.open(io.BytesIO(screenshot)).convert('RGB')

        # Run OCR to extract text and bounding boxes
        ocr_results = self.reader.readtext(np.array(image))

        # Detect PII matches
        pii_detections = []
        redaction_boxes = []
        for (bbox, text, confidence) in ocr_results:
            if confidence < 0.5:  # Skip low-confidence OCR
                continue
            # Check each PII pattern
            for pii_type, pattern in self.patterns.items():
                if pattern.search(text):
                    pii_detections.append({
                        'type': pii_type,
                        'text': text,  # For logging only
                        'bbox': bbox,
                        'confidence': confidence
                    })
                    redaction_boxes.append(bbox)

        # Redact by drawing black rectangles over PII
        if redaction_boxes:
            draw = ImageDraw.Draw(image)
            for bbox in redaction_boxes:
                # bbox format: [[x1,y1], [x2,y2], [x3,y3], [x4,y4]]
                x_coords = [point[0] for point in bbox]
                y_coords = [point[1] for point in bbox]
                # Draw filled black rectangle
                draw.rectangle(
                    [min(x_coords), min(y_coords),
                     max(x_coords), max(y_coords)],
                    fill='black'
                )

        # Convert back to bytes
        output = io.BytesIO()
        image.save(output, format='JPEG')
        redacted_screenshot = output.getvalue()

        # Return redacted image and metadata (without actual PII text)
        safe_metadata = [
            {'type': d['type'], 'bbox': d['bbox']}
            for d in pii_detections
        ]
        return redacted_screenshot, safe_metadata

# Usage in agent workflow
async def capture_with_pii_protection(agent):
    redactor = PIIRedactor()

    # Capture raw screenshot
    raw_screenshot = await agent.browser.screenshot()

    # Detect and redact PII
    safe_screenshot, pii_found = await redactor.detect_and_redact(
        raw_screenshot
    )

    # Log PII detection without exposing actual values
    if pii_found:
        agent.log_security_event(
            event="pii_detected_in_screenshot",
            pii_types=[d['type'] for d in pii_found],
            count=len(pii_found)
        )

    # Use safe screenshot for vision API
    return safe_screenshot
Key Metrics
Screenshot File Size
- Target: < 500 KB per screenshot for optimal API upload speed
- Measurement: Track P50, P95, and P99 file sizes across all captures
- Optimization: Use JPEG/WebP at quality 80-85, downsample to 1920x1080
- Alert threshold: P95 > 1 MB indicates compression settings need adjustment
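A sketch of tracking those percentiles in-process; the numpy dependency and the alert method name are assumptions:

# Track capture sizes and flag when P95 drifts past the alert threshold
import numpy as np

class ScreenshotSizeMetrics:
    def __init__(self, alert_p95_bytes: int = 1_000_000):
        self.sizes: list[int] = []
        self.alert_p95_bytes = alert_p95_bytes

    def record(self, screenshot: bytes) -> None:
        self.sizes.append(len(screenshot))

    def percentiles(self) -> dict:
        p50, p95, p99 = np.percentile(self.sizes, [50, 95, 99])
        return {"p50": int(p50), "p95": int(p95), "p99": int(p99)}

    def compression_needs_tuning(self) -> bool:
        # P95 > 1 MB indicates compression settings need adjustment
        return self.percentiles()["p95"] > self.alert_p95_bytes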
PII Detection Accuracy
- Precision: Percentage of redacted regions that actually contain PII (target > 90%)
- Recall: Percentage of actual PII that was detected and redacted (target > 95%)
- False positive rate: Over-redaction can make screenshots unusable (target < 5%)
- Measurement: Manual audit of random sample (100 screenshots/week), compare with security team review
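Computing precision and recall from the weekly audit reduces to a few counts; the example tallies below are hypothetical:

# Precision/recall for PII redaction from manual audit counts
def redaction_metrics(true_pos: int, false_pos: int, false_neg: int) -> dict:
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return {
        "precision": precision,  # target > 0.90
        "recall": recall,        # target > 0.95
    }

# e.g. 92 correctly redacted regions, 6 over-redactions, 3 misses
# -> precision ~0.94, recall ~0.97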
Storage Costs per Agent Session
- Baseline: 50-200 screenshots per agent session at 400 KB average = 20-80 MB per session
- Monthly cost: 1M sessions × 50 MB × $0.023/GB = $1,150/month on S3 Standard
- Optimization: Move to S3 Glacier after 30 days (90% cost reduction), delete after 90 days
- Cost per screenshot: roughly $0.00001 storage + $0.0004 vision API processing ≈ $0.0004 total (processing dominates; storage is negligible)
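The monthly figure follows from straightforward arithmetic; a reproducible sketch using the rates assumed above:

# Back-of-envelope storage cost model using the rates assumed above
SESSIONS_PER_MONTH = 1_000_000
AVG_SESSION_MB = 50            # ~125 screenshots at 400 KB each
S3_STANDARD_USD_PER_GB = 0.023

total_gb = SESSIONS_PER_MONTH * AVG_SESSION_MB / 1000
monthly_usd = total_gb * S3_STANDARD_USD_PER_GB
print(f"{total_gb:,.0f} GB -> ${monthly_usd:,.0f}/month")  # 50,000 GB -> $1,150/month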
Screenshot Capture Latency
- Target: < 200ms for viewport screenshots, < 1s for full-page
- Impact: Each screenshot adds to total agent action latency
- Measurement: Track capture time distribution, correlate with image dimensions
- Optimization: Use viewport-only captures during actions, full-page only at checkpoints
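A thin wrapper around the capture call makes the latency/dimension correlation measurable; the metrics sink and the width/height attributes on the returned screenshot are assumptions:

# Record capture latency together with image dimensions for correlation
import time

async def capture_with_timing(agent, metrics_sink, **kwargs):
    start = time.monotonic()
    shot = await agent.capture_screenshot(**kwargs)
    latency_ms = (time.monotonic() - start) * 1000
    metrics_sink.record(                       # hypothetical metrics sink
        latency_ms=latency_ms,
        width=getattr(shot, "width", None),
        height=getattr(shot, "height", None),
    )
    return shot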
Screenshot-to-Vision-API Pipeline Latency
- Target: End-to-end < 2 seconds (capture + upload + API processing)
- Bottlenecks: Network upload time (depends on file size), API queue time
- Measurement: time_to_vision_result = capture_time + upload_time + api_processing_time
- Cost/latency tradeoff: Smaller screenshots upload faster but may reduce vision model accuracy
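Instrumenting each stage separately shows where the two-second budget goes; the upload/analyze client methods are hypothetical:

# Decompose end-to-end latency into capture, upload, and processing stages
import time

async def timed_vision_call(agent, vision_client, prompt):
    t0 = time.monotonic()
    screenshot = await agent.capture_screenshot()
    t1 = time.monotonic()
    upload = await vision_client.upload(screenshot)       # hypothetical API
    t2 = time.monotonic()
    result = await vision_client.analyze(upload, prompt)  # hypothetical API
    t3 = time.monotonic()
    return result, {
        "capture_time": t1 - t0,
        "upload_time": t2 - t1,
        "api_processing_time": t3 - t2,
        "time_to_vision_result": t3 - t0,  # target < 2 s end-to-end
    }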
Related Concepts
- Proof of Action - Screenshots serve as primary proof-of-action artifacts
- Observability - Screenshots are critical observability data for agent debugging
- PII Redaction - Essential pre-processing step before storing or transmitting screenshots
- Proof of Action Artifact - Screenshots are the most common artifact type for verification