Telemetry
Automated collection and transmission of agent metrics, events, and diagnostic data for analysis.
Telemetry systems continuously gather quantitative and qualitative data from running agents, providing the foundation for observability, debugging, and performance optimization. Unlike logging, which records individual occurrences as free-form text, telemetry encompasses structured metrics, traces, and events that flow through dedicated pipelines for real-time and historical analysis.
Why It Matters
Production Insights
Telemetry provides visibility into agent behavior in production environments where traditional debugging isn't feasible. By continuously collecting metrics on task completion rates, error frequencies, and resource utilization, teams gain empirical evidence of system health. For computer-use agents operating across distributed environments, telemetry reveals patterns like regional performance variations, browser compatibility issues, or infrastructure constraints that only manifest at scale.
Performance Optimization
Quantitative telemetry data identifies optimization opportunities invisible through casual observation. Histogram distributions of action execution times reveal outliers and bottlenecks. Percentile metrics (p50, p95, p99) expose long-tail latencies affecting user experience. By correlating performance metrics with contextual dimensions—model version, task complexity, time of day—teams pinpoint specific conditions causing degradation and validate that optimizations deliver measurable improvements.
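As a concrete illustration, nearest-rank percentiles can be computed from raw duration samples in a few lines; the helper below is a sketch, not part of any particular telemetry SDK:
// Nearest-rank percentile over raw duration samples (illustrative helper)
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank))];
}

const durationsMs = [120, 95, 3400, 110, 130, 105, 98, 5100, 115];
percentile(durationsMs, 50); // p50: typical latency
percentile(durationsMs, 95); // p95: long-tail latency
percentile(durationsMs, 99); // p99: worst-case latency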
Anomaly Detection
Telemetry enables automated detection of abnormal behavior through baseline comparison and statistical analysis. Sudden spikes in error rates, unexpected drops in throughput, or deviations in resource consumption patterns trigger alerts before users report issues. For agentic systems, telemetry can identify subtle degradations like increased retry rates, longer planning times, or reduced action success rates that signal underlying problems with model performance, API dependencies, or infrastructure.
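A simple form of that baseline comparison is a rolling z-score check; the sketch below is illustrative and assumes a stream of per-interval values such as error counts per minute:
// Flag values that deviate sharply from a rolling baseline (z-score check)
class BaselineDetector {
  private history: number[] = [];

  constructor(private windowSize = 60, private threshold = 3) {}

  // Returns true when the new value is anomalous relative to recent history
  observe(value: number): boolean {
    const anomalous = this.isAnomalous(value);
    this.history.push(value);
    if (this.history.length > this.windowSize) this.history.shift();
    return anomalous;
  }

  private isAnomalous(value: number): boolean {
    if (this.history.length < 10) return false; // wait for a baseline to form
    const mean = this.history.reduce((a, b) => a + b, 0) / this.history.length;
    const variance = this.history.reduce((a, b) => a + (b - mean) ** 2, 0) / this.history.length;
    const stddev = Math.sqrt(variance);
    if (stddev === 0) return value !== mean;
    return Math.abs(value - mean) / stddev > this.threshold;
  }
}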
Concrete Examples
Metrics Collection
Counters track cumulative values that only increase, like total tasks completed, API calls made, or errors encountered:
// Counter for tracking agent actions
telemetry.counter('agent.actions.executed', {
action_type: 'click',
success: true,
browser: 'chrome'
});
// Counter for error tracking
telemetry.counter('agent.errors', {
error_type: 'ElementNotFoundException',
severity: 'warning',
recovery_attempted: true
});
Histograms capture distributions of values, revealing patterns beyond simple averages:
// Histogram for action execution duration
telemetry.histogram('agent.action.duration_ms', durationMs, {
action_type: 'form_fill',
complexity: 'high'
});
// Histogram for token consumption
telemetry.histogram('agent.tokens.consumed', tokenCount, {
model: 'claude-3-opus',
task_type: 'planning'
});
Gauges represent point-in-time measurements that can increase or decrease:
// Gauge for active agent sessions
telemetry.gauge('agent.sessions.active', activeCount);
// Gauge for queue depth
telemetry.gauge('agent.task_queue.depth', queueSize, {
priority: 'high'
});
Event Tracking
Events capture discrete occurrences with rich contextual information:
// Track significant agent lifecycle events
telemetry.event('agent.task.started', {
task_id: 'task_abc123',
task_type: 'form_automation',
user_id: 'user_xyz789',
estimated_complexity: 0.7,
context_tokens: 1500
});
telemetry.event('agent.recovery.triggered', {
task_id: 'task_abc123',
failure_reason: 'timeout',
recovery_strategy: 'retry_with_simpler_approach',
attempt_number: 2
});
telemetry.event('agent.task.completed', {
task_id: 'task_abc123',
duration_ms: 3420,
actions_executed: 7,
retries: 1,
success: true,
cost_usd: 0.042
});
Distributed Tracing
Traces link related operations across services, showing request flow and timing:
// Create parent span for agent task
const taskSpan = telemetry.startSpan('agent.execute_task', {
task_id: 'task_abc123',
task_type: 'data_extraction'
});
// Child span for planning phase
const planSpan = telemetry.startSpan('agent.planning', {
parent: taskSpan,
model: 'claude-3-sonnet'
});
await generatePlan();
planSpan.end();
// Child span for each action
const actionSpan = telemetry.startSpan('agent.action.execute', {
parent: taskSpan,
action_type: 'click',
selector: '#submit-button'
});
await executeAction();
actionSpan.end();
// Complete parent span
taskSpan.end();
This creates a trace showing the full execution timeline with nested operations, enabling identification of bottlenecks in multi-step agent workflows.
Common Pitfalls
High Cardinality
Problem: Including unbounded or high-cardinality values (user IDs, timestamps, unique identifiers) as metric dimensions creates a combinatorial explosion of metric time series, overwhelming storage and query systems.
Example of high cardinality (problematic):
telemetry.counter('agent.action.executed', {
user_id: 'user_12345', // Unbounded dimension
timestamp: Date.now(), // Unique for every call
session_id: sessionId // High cardinality
});
Solution: Use fixed-cardinality dimensions (action type, status, region) in metrics, and include high-cardinality identifiers only in events or trace attributes. For analysis requiring user-level granularity, query event logs rather than dimensional metrics:
// Low cardinality dimensions for metrics
telemetry.counter('agent.action.executed', {
action_type: 'click',
status: 'success',
region: 'us-east-1'
});
// High cardinality data in separate event
telemetry.event('action.detail', {
user_id: 'user_12345',
session_id: sessionId,
action_id: actionId
});
PII Leakage
Problem: Telemetry systems may inadvertently capture personally identifiable information (PII) or sensitive data in metric labels, event fields, or trace attributes, creating compliance and privacy risks.
Vulnerable areas:
- Form input values captured in action parameters
- URLs containing email addresses or account identifiers
- Error messages exposing user data
- Screenshot metadata or page titles
Solution: Implement strict filtering and sanitization at collection time:
class TelemetryClient {
private sanitizeUrl(url: string): string {
const parsed = new URL(url);
// Remove query parameters that might contain PII
const safeParams = ['page', 'tab', 'view'];
    const filtered = new URLSearchParams();
    safeParams.forEach(key => {
      if (parsed.searchParams.has(key)) {
        filtered.set(key, parsed.searchParams.get(key)!);
      }
    });
    const query = filtered.toString();
    return query ? `${parsed.origin}${parsed.pathname}?${query}` : `${parsed.origin}${parsed.pathname}`;
}
recordAction(action: AgentAction) {
telemetry.event('agent.action', {
type: action.type,
url: this.sanitizeUrl(action.url),
// Never include: input values, credentials, tokens
success: action.success
});
}
}
Establish allowlists for fields permitted in telemetry and automatically redact or hash sensitive data.
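A minimal sketch of such a filter, assuming Node's crypto module; the field names are illustrative:
// Allowlist filter: drop any event field not explicitly permitted,
// and hash identifiers needed for correlation but not for reading
import { createHash } from 'crypto';

const ALLOWED_FIELDS = new Set(['type', 'url', 'success', 'duration_ms']);
const HASHED_FIELDS = new Set(['user_id', 'session_id']);

function sanitizeEventFields(fields: Record<string, unknown>): Record<string, unknown> {
  const safe: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    if (HASHED_FIELDS.has(key)) {
      safe[key] = createHash('sha256').update(String(value)).digest('hex').slice(0, 16);
    } else if (ALLOWED_FIELDS.has(key)) {
      safe[key] = value;
    }
    // Any other field is silently dropped
  }
  return safe;
}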
Excessive Data Volume
Problem: Collecting telemetry for every operation generates overwhelming data volumes, increasing costs and reducing the signal-to-noise ratio. High-frequency metrics and verbose events can produce terabytes of data daily in busy systems.
Impact:
- Storage costs scale linearly with volume
- Query performance degrades on large datasets
- Important signals buried in noise
- Increased latency from telemetry overhead
Solution: Implement intelligent sampling and aggregation:
class AdaptiveTelemetry {
private shouldSample(eventType: string): boolean {
// Sample successful operations at lower rate
if (eventType === 'action.success') {
return Math.random() < 0.1; // 10% sampling
}
// Always capture errors and anomalies
if (eventType === 'action.error') {
return true; // 100% sampling
}
    // Adaptive sampling based on load (clamped so the probability stays <= 1)
    const currentLoad = this.getSystemLoad();
    return Math.random() < Math.min(1.0, 1.0 / currentLoad);
}
recordEvent(type: string, data: object) {
if (this.shouldSample(type)) {
telemetry.event(type, {
...data,
sample_rate: this.getSampleRate(type)
});
}
// Always update aggregated metrics
this.updateAggregates(type, data);
}
}
Use head-based sampling for traces (decide at start), tail-based sampling for errors (capture full traces containing failures), and pre-aggregation for high-frequency metrics.
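Head-based sampling can be made consistent across services by hashing the trace ID, so every span in a trace receives the same keep/drop decision; a sketch under that assumption:
// Head-based sampling: decide once per trace by hashing the trace ID,
// so every service observing the trace reaches the same verdict
function shouldSampleTrace(traceId: string, sampleRate: number): boolean {
  let hash = 0;
  for (const char of traceId) {
    hash = (hash * 31 + char.charCodeAt(0)) >>> 0; // simple 32-bit rolling hash
  }
  return hash / 0xffffffff < sampleRate;
}

shouldSampleTrace('4bf92f3577b34da6', 0.05); // same answer everywhere for this trace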
Implementation
Collection Agents
Telemetry collection typically uses lightweight agents embedded in application code or deployed as sidecars:
// Initialize telemetry provider
const telemetry = new TelemetryProvider({
endpoint: 'https://telemetry.example.com',
service: 'agent-runtime',
environment: process.env.NODE_ENV,
version: process.env.APP_VERSION,
// Buffer and batch configuration
maxBatchSize: 100,
flushInterval: 10000, // 10 seconds
// Resource attributes
attributes: {
'service.name': 'agent-runtime',
'deployment.region': 'us-west-2',
'host.id': getHostId()
}
});
// Instrument agent operations
class InstrumentedAgent {
async executeTask(task: Task) {
return telemetry.trace('agent.task', async (span) => {
span.setAttribute('task.id', task.id);
span.setAttribute('task.type', task.type);
try {
const result = await this.agent.execute(task);
span.setAttribute('task.success', true);
telemetry.counter('agent.tasks.completed').add(1, {
task_type: task.type,
status: 'success'
});
return result;
} catch (error) {
span.recordException(error);
span.setAttribute('task.success', false);
telemetry.counter('agent.tasks.completed').add(1, {
task_type: task.type,
status: 'error'
});
throw error;
}
});
}
}
For browser-based computer-use agents, use the OpenTelemetry browser SDK or vendor-specific libraries to collect telemetry from client-side operations.
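As a rough sketch, client-side setup with the OpenTelemetry web packages might look like the following; package layout and registration options vary across SDK versions, so treat this as an outline rather than exact setup:
// Browser-side tracing with the OpenTelemetry web SDK (outline; adjust to your SDK version)
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
import { trace } from '@opentelemetry/api';

const provider = new WebTracerProvider();
provider.register(); // register as the global tracer provider

const tracer = trace.getTracer('agent-runtime-web');

// Wrap a client-side agent action in a span
const span = tracer.startSpan('agent.browser_action', {
  attributes: { action_type: 'click', page: 'checkout' }
});
// ... perform the browser action ...
span.end();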
Batching
Batching aggregates multiple telemetry records before transmission, reducing network overhead and backend load:
class BatchingCollector {
private batch: TelemetryRecord[] = [];
private timer: NodeJS.Timeout | null = null;
  constructor(
    private endpoint: string,
    private maxBatchSize: number = 100,
    private maxWaitMs: number = 10000
  ) {}
collect(record: TelemetryRecord) {
this.batch.push(record);
// Flush if batch size reached
if (this.batch.length >= this.maxBatchSize) {
this.flush();
return;
}
// Schedule flush if not already scheduled
if (!this.timer) {
this.timer = setTimeout(() => this.flush(), this.maxWaitMs);
}
}
private async flush() {
if (this.batch.length === 0) return;
const toSend = this.batch.splice(0, this.batch.length);
if (this.timer) {
clearTimeout(this.timer);
this.timer = null;
}
try {
await this.transmit(toSend);
} catch (error) {
// Handle transmission failure - retry, drop, or buffer
this.handleTransmissionError(error, toSend);
}
}
private async transmit(records: TelemetryRecord[]) {
await fetch(this.endpoint, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
records,
metadata: {
batch_size: records.length,
timestamp: Date.now()
}
})
});
}
  private handleTransmissionError(error: unknown, records: TelemetryRecord[]) {
    // Minimal recovery strategy: re-queue the failed records for the next flush
    this.batch.unshift(...records);
  }
}
Batching trades latency for efficiency: data arrives in backend systems with a slight delay, but throughput increases significantly. Critical alerts may require immediate transmission outside batch cycles.
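One way to handle that is a priority path that bypasses the batch cycle. The sketch below assumes records carry a severity field and reuses the BatchingCollector above; the endpoint constant is illustrative:
// Assumed for illustration: a severity field on records and a known endpoint
const TELEMETRY_ENDPOINT = 'https://telemetry.example.com';

async function collectWithPriority(record: TelemetryRecord, batcher: BatchingCollector) {
  if (record.severity === 'critical') {
    // Critical alerts skip batching and transmit immediately
    await fetch(TELEMETRY_ENDPOINT, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ records: [record] })
    });
  } else {
    // Everything else follows the normal batch cycle
    batcher.collect(record);
  }
}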
Sampling Strategies
Sampling reduces telemetry volume while preserving statistical validity:
Uniform Random Sampling: Sample a fixed percentage of events:
const SAMPLE_RATE = 0.01; // 1%
function shouldSample(): boolean {
return Math.random() < SAMPLE_RATE;
}
Rate Limiting: Cap events per time window:
class RateLimitedTelemetry {
  private counts = new Map<string, number>();
  private windowMs = 60000; // 1 minute
  private windowStart = Date.now();

  record(key: string, maxPerWindow: number): boolean {
    // Reset all counters when the window rolls over
    if (Date.now() - this.windowStart >= this.windowMs) {
      this.counts.clear();
      this.windowStart = Date.now();
    }
    const count = this.counts.get(key) || 0;
    if (count < maxPerWindow) {
      this.counts.set(key, count + 1);
      return true;
    }
    return false;
  }
}
Importance Sampling: Sample based on event significance:
function getSampleRate(event: TelemetryEvent): number {
if (event.type === 'error') return 1.0; // 100%
if (event.type === 'warning') return 0.5; // 50%
if (event.duration > 5000) return 0.8; // 80% for slow operations
return 0.1; // 10% for normal events
}
Tail-Based Sampling: Retain complete traces containing errors:
class TailSamplingProcessor {
private traceBuffers = new Map<string, Span[]>();
processSpan(span: Span) {
const traceId = span.traceId;
if (!this.traceBuffers.has(traceId)) {
this.traceBuffers.set(traceId, []);
}
this.traceBuffers.get(traceId)!.push(span);
    // On trace completion, decide whether to keep
    // (simplified: assumes the root span ends after all of its children)
    if (span.isRootSpan && span.ended) {
const trace = this.traceBuffers.get(traceId)!;
const hasError = trace.some(s => s.status === 'error');
const isSlow = trace.some(s => s.duration > 5000);
if (hasError || isSlow || Math.random() < 0.05) {
// Keep trace - send all spans
this.exportTrace(trace);
}
this.traceBuffers.delete(traceId);
}
}
}
Key Metrics
Monitor telemetry system health with these meta-metrics:
Telemetry Coverage
Definition: Percentage of agent operations producing telemetry data.
Measurement:
const coverage = (instrumentedOperations / totalOperations) * 100;
Targets:
- Critical paths: 100% coverage
- Standard operations: 95%+ coverage
- Low-priority operations: 80%+ coverage
Why it matters: Gaps in coverage create blind spots where failures and performance issues go undetected. Track coverage by operation type, service, and code path to identify instrumentation gaps.
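A minimal sketch of per-operation coverage tracking; the class and its granularity are illustrative:
// Track, per operation type, how many operations actually emitted telemetry
class CoverageTracker {
  private total = new Map<string, number>();
  private instrumented = new Map<string, number>();

  recordOperation(opType: string, emittedTelemetry: boolean) {
    this.total.set(opType, (this.total.get(opType) ?? 0) + 1);
    if (emittedTelemetry) {
      this.instrumented.set(opType, (this.instrumented.get(opType) ?? 0) + 1);
    }
  }

  coveragePercent(opType: string): number {
    const total = this.total.get(opType) ?? 0;
    if (total === 0) return 0;
    return ((this.instrumented.get(opType) ?? 0) / total) * 100;
  }
}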
Data Freshness
Definition: Latency between event occurrence and availability for query.
Measurement:
const freshness = queryTimestamp - event.timestamp;
Targets:
- Real-time alerts: < 30 seconds
- Operational dashboards: < 2 minutes
- Historical analysis: < 15 minutes
Why it matters: Stale telemetry delays problem detection and incident response. Monitor end-to-end pipeline latency from collection through ingestion to query availability. Spikes in freshness latency indicate collection bottlenecks, network issues, or backend congestion.
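One way to measure this end to end is a synthetic heartbeat probe: emit a timestamped event, then poll until it becomes queryable. The query callback and metric names below are illustrative assumptions:
// Probe pipeline freshness: emit a heartbeat event, poll until it is queryable
async function measureFreshness(isQueryable: (probeId: string) => Promise<boolean>) {
  const probeId = `probe_${Date.now()}`;
  const emittedAt = Date.now();
  telemetry.event('telemetry.heartbeat', { probe_id: probeId });

  while (!(await isQueryable(probeId))) {
    await new Promise(resolve => setTimeout(resolve, 1000)); // poll once per second
  }

  const freshnessMs = Date.now() - emittedAt;
  telemetry.histogram('telemetry.pipeline.freshness_ms', freshnessMs);
  return freshnessMs;
}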
Cardinality Management
Definition: Number of unique metric time series and event attribute combinations.
Measurement:
// Monitor active time series per metric
const cardinality = uniqueTimeSeries.length;
// Alert on cardinality explosion
if (cardinality > THRESHOLD) {
alert(`High cardinality detected: ${cardinality} series`);
}
Targets:
- Per-metric cardinality: < 1000 time series
- Total system cardinality: < 100,000 active series
- Cardinality growth rate: < 5% week-over-week
Why it matters: Uncontrolled cardinality growth degrades query performance and inflates costs, since each new dimension value multiplies the number of time series. Set up automated cardinality tracking and alerting:
interface CardinalityReport {
metric: string;
timeSeriesCount: number;
dimensions: Record<string, number>; // Unique values per dimension
growth: {
daily: number;
weekly: number;
};
}
function analyzeCardinality(metric: string): CardinalityReport {
const series = getActiveTimeSeries(metric);
const dimensions = analyzeDimensions(series);
return {
metric,
timeSeriesCount: series.length,
dimensions,
growth: {
daily: calculateGrowthRate(series, '24h'),
weekly: calculateGrowthRate(series, '7d')
}
};
}
Regularly audit high-cardinality metrics and refactor to reduce dimensionality or move to event-based collection.
Related Concepts
- Observability - Broader system visibility practice encompassing telemetry, logs, and traces
- Instrumentation - Code-level implementation of telemetry collection points
- Audit Log - Immutable record of security-relevant events for compliance
- Latency SLO - Service level objectives measured using telemetry data