Telemetry
Automated collection and transmission of agent metrics, events, and diagnostic data for analysis.
Telemetry systems continuously gather quantitative and qualitative data from running agents, providing the foundation for observability, debugging, and performance optimization. Unlike logging, which records individual occurrences as free-form text, telemetry encompasses structured metrics, traces, and events that flow through dedicated pipelines for real-time and historical analysis.
Why It Matters
Production Insights
Telemetry provides visibility into agent behavior in production environments where traditional debugging isn't feasible. By continuously collecting metrics on task completion rates, error frequencies, and resource utilization, teams gain empirical evidence of system health. For computer-use agents operating across distributed environments, telemetry reveals patterns like regional performance variations, browser compatibility issues, or infrastructure constraints that only manifest at scale.
Performance Optimization
Quantitative telemetry data identifies optimization opportunities invisible through casual observation. Histogram distributions of action execution times reveal outliers and bottlenecks. Percentile metrics (p50, p95, p99) expose long-tail latencies affecting user experience. By correlating performance metrics with contextual dimensions—model version, task complexity, time of day—teams pinpoint specific conditions causing degradation and validate that optimizations deliver measurable improvements.
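As a concrete illustration, nearest-rank percentiles can be computed from raw duration samples in a few lines; the helper below is a sketch, not part of any particular telemetry SDK:
// Nearest-rank percentile over raw duration samples (illustrative helper)
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.min(sorted.length - 1, Math.max(0, rank))];
}

const durationsMs = [120, 95, 3400, 110, 130, 105, 98, 5100, 115];
percentile(durationsMs, 50); // p50: typical latency
percentile(durationsMs, 95); // p95: long-tail latency
percentile(durationsMs, 99); // p99: worst-case latency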
Anomaly Detection
Telemetry enables automated detection of abnormal behavior through baseline comparison and statistical analysis. Sudden spikes in error rates, unexpected drops in throughput, or deviations in resource consumption patterns trigger alerts before users report issues. For agentic systems, telemetry can identify subtle degradations like increased retry rates, longer planning times, or reduced action success rates that signal underlying problems with model performance, API dependencies, or infrastructure.
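A simple form of that baseline comparison is a rolling z-score check; the sketch below is illustrative and assumes a stream of per-interval values such as error counts per minute:
// Flag values that deviate sharply from a rolling baseline (z-score check)
class BaselineDetector {
  private history: number[] = [];

  constructor(private windowSize = 60, private threshold = 3) {}

  // Returns true when the new value is anomalous relative to recent history
  observe(value: number): boolean {
    const anomalous = this.isAnomalous(value);
    this.history.push(value);
    if (this.history.length > this.windowSize) this.history.shift();
    return anomalous;
  }

  private isAnomalous(value: number): boolean {
    if (this.history.length < 10) return false; // wait for a baseline to form
    const mean = this.history.reduce((a, b) => a + b, 0) / this.history.length;
    const variance = this.history.reduce((a, b) => a + (b - mean) ** 2, 0) / this.history.length;
    const stddev = Math.sqrt(variance);
    if (stddev === 0) return value !== mean;
    return Math.abs(value - mean) / stddev > this.threshold;
  }
}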
Concrete Examples
Metrics Collection
Counters track cumulative values that only increase, like total tasks completed, API calls made, or errors encountered:
// Counter for tracking agent actions
telemetry.counter('agent.actions.executed', {
action_type: 'click',
success: true,
browser: 'chrome'
});
// Counter for error tracking
telemetry.counter('agent.errors', {
error_type: 'ElementNotFoundException',
severity: 'warning',
recovery_attempted: true
});
Histograms capture distributions of values, revealing patterns beyond simple averages:
// Histogram for action execution duration
telemetry.histogram('agent.action.duration_ms', durationMs, {
action_type: 'form_fill',
complexity: 'high'
});
// Histogram for token consumption
telemetry.histogram('agent.tokens.consumed', tokenCount, {
model: 'claude-3-opus',
task_type: 'planning'
});
Gauges represent point-in-time measurements that can increase or decrease:
// Gauge for active agent sessions
telemetry.gauge('agent.sessions.active', activeCount);
// Gauge for queue depth
telemetry.gauge('agent.task_queue.depth', queueSize, {
priority: 'high'
});
Event Tracking
Events capture discrete occurrences with rich contextual information:
// Track significant agent lifecycle events
telemetry.event('agent.task.started', {
task_id: 'task_abc123',
task_type: 'form_automation',
user_id: 'user_xyz789',
estimated_complexity: 0.7,
context_tokens: 1500
});
telemetry.event('agent.recovery.triggered', {
task_id: 'task_abc123',
failure_reason: 'timeout',
recovery_strategy: 'retry_with_simpler_approach',
attempt_number: 2
});
telemetry.event('agent.task.completed', {
task_id: 'task_abc123',
duration_ms: 3420,
actions_executed: 7,
retries: 1,
success: true,
cost_usd: 0.042
});
Distributed Tracing
Traces link related operations across services, showing request flow and timing:
// Create parent span for agent task
const taskSpan = telemetry.startSpan('agent.execute_task', {
task_id: 'task_abc123',
task_type: 'data_extraction'
});
// Child span for planning phase
const planSpan = telemetry.startSpan('agent.planning', {
parent: taskSpan,
model: 'claude-3-sonnet'
});
await generatePlan();
planSpan.end();
// Child span for each action
const actionSpan = telemetry.startSpan('agent.action.execute', {
parent: taskSpan,
action_type: 'click',
selector: '#submit-button'
});
await executeAction();
actionSpan.end();
// Complete parent span
taskSpan.end();
This creates a trace showing the full execution timeline with nested operations, enabling identification of bottlenecks in multi-step agent workflows.
Common Pitfalls
High Cardinality
Problem: Including unbounded or high-cardinality values (user IDs, timestamps, unique identifiers) as metric dimensions creates a combinatorial explosion of metric time series, overwhelming storage and query systems.
Example of high cardinality (problematic):
telemetry.counter('agent.action.executed', {
user_id: 'user_12345', // Unbounded dimension
timestamp: Date.now(), // Unique for every call
session_id: sessionId // High cardinality
});
Solution: Use fixed-cardinality dimensions (action type, status, region) in metrics, and include high-cardinality identifiers only in events or trace attributes. For analysis requiring user-level granularity, query event logs rather than dimensional metrics:
// Low cardinality dimensions for metrics
telemetry.counter('agent.action.executed', {
action_type: 'click',
status: 'success',
region: 'us-east-1'
});
// High cardinality data in separate event
telemetry.event('action.detail', {
user_id: 'user_12345',
session_id: sessionId,
action_id: actionId
});
PII Leakage
Problem: Telemetry systems may inadvertently capture personally identifiable information (PII) or sensitive data in metric labels, event fields, or trace attributes, creating compliance and privacy risks.
Vulnerable areas:
- Form input values captured in action parameters
- URLs containing email addresses or account identifiers
- Error messages exposing user data
- Screenshot metadata or page titles
Solution: Implement strict filtering and sanitization at collection time:
class TelemetryClient {
private sanitizeUrl(url: string): string {
const parsed = new URL(url);
// Remove query parameters that might contain PII
const safeParams = ['page', 'tab', 'view'];
    const filtered = new URLSearchParams();
    safeParams.forEach(key => {
      if (parsed.searchParams.has(key)) {
        filtered.set(key, parsed.searchParams.get(key)!);
      }
    });
    const query = filtered.toString();
    return query ? `${parsed.origin}${parsed.pathname}?${query}` : `${parsed.origin}${parsed.pathname}`;
}
recordAction(action: AgentAction) {
telemetry.event('agent.action', {
type: action.type,
url: this.sanitizeUrl(action.url),
// Never include: input values, credentials, tokens
success: action.success
});
}
}
Establish allowlists for fields permitted in telemetry and automatically redact or hash sensitive data.
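A minimal sketch of such a filter, assuming Node's crypto module; the field names are illustrative:
// Allowlist filter: drop any event field not explicitly permitted,
// and hash identifiers needed for correlation but not for reading
import { createHash } from 'crypto';

const ALLOWED_FIELDS = new Set(['type', 'url', 'success', 'duration_ms']);
const HASHED_FIELDS = new Set(['user_id', 'session_id']);

function sanitizeEventFields(fields: Record<string, unknown>): Record<string, unknown> {
  const safe: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    if (HASHED_FIELDS.has(key)) {
      safe[key] = createHash('sha256').update(String(value)).digest('hex').slice(0, 16);
    } else if (ALLOWED_FIELDS.has(key)) {
      safe[key] = value;
    }
    // Any other field is silently dropped
  }
  return safe;
}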
Excessive Data Volume
Problem: Collecting telemetry for every operation generates overwhelming data volumes, increasing costs and reducing the signal-to-noise ratio. High-frequency metrics and verbose events can produce terabytes of data daily in busy systems.
Impact:
- Storage costs scale linearly with volume
- Query performance degrades on large datasets
- Important signals buried in noise
- Increased latency from telemetry overhead
Solution: Implement intelligent sampling and aggregation:
class AdaptiveTelemetry {
private shouldSample(eventType: string): boolean {
// Sample successful operations at lower rate
if (eventType === 'action.success') {
return Math.random() < 0.1; // 10% sampling
}
// Always capture errors and anomalies
if (eventType === 'action.error') {
return true; // 100% sampling
}
    // Adaptive sampling based on load (clamped so the probability stays <= 1)
    const currentLoad = this.getSystemLoad();
    return Math.random() < Math.min(1.0, 1.0 / currentLoad);
}
recordEvent(type: string, data: object) {
if (this.shouldSample(type)) {
telemetry.event(type, {
...data,
sample_rate: this.getSampleRate(type)
});
}
// Always update aggregated metrics
this.updateAggregates(type, data);
}
}
Use head-based sampling for traces (decide at start), tail-based sampling for errors (capture full traces containing failures), and pre-aggregation for high-frequency metrics.
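Head-based sampling can be made consistent across services by hashing the trace ID, so every span in a trace receives the same keep/drop decision; a sketch under that assumption:
// Head-based sampling: decide once per trace by hashing the trace ID,
// so every service observing the trace reaches the same verdict
function shouldSampleTrace(traceId: string, sampleRate: number): boolean {
  let hash = 0;
  for (const char of traceId) {
    hash = (hash * 31 + char.charCodeAt(0)) >>> 0; // simple 32-bit rolling hash
  }
  return hash / 0xffffffff < sampleRate;
}

shouldSampleTrace('4bf92f3577b34da6', 0.05); // same answer everywhere for this trace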
Implementation
Collection Agents
Telemetry collection typically uses lightweight agents embedded in application code or deployed as sidecars:
// Initialize telemetry provider
const telemetry = new TelemetryProvider({
endpoint: 'https://telemetry.example.com',
service: 'agent-runtime',
environment: process.env.NODE_ENV,
version: process.env.APP_VERSION,
// Buffer and batch configuration
maxBatchSize: 100,
flushInterval: 10000, // 10 seconds
// Resource attributes
attributes: {
'service.name': 'agent-runtime',
'deployment.region': 'us-west-2',
'host.id': getHostId()
}
});
// Instrument agent operations
class InstrumentedAgent {
async executeTask(task: Task) {
return telemetry.trace('agent.task', async (span) => {
span.setAttribute('task.id', task.id);
span.setAttribute('task.type', task.type);
try {
const result = await this.agent.execute(task);
span.setAttribute('task.success', true);
telemetry.counter('agent.tasks.completed').add(1, {
task_type: task.type,
status: 'success'
});
return result;
} catch (error) {
span.recordException(error);
span.setAttribute('task.success', false);
telemetry.counter('agent.tasks.completed').add(1, {
task_type: task.type,
status: 'error'
});
throw error;
}
});
}
}
For browser-based computer-use agents, use the OpenTelemetry browser SDK or vendor-specific libraries to collect telemetry from client-side operations.
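As a rough sketch, client-side setup with the OpenTelemetry web packages might look like the following; package layout and registration options vary across SDK versions, so treat this as an outline rather than exact setup:
// Browser-side tracing with the OpenTelemetry web SDK (outline; adjust to your SDK version)
import { WebTracerProvider } from '@opentelemetry/sdk-trace-web';
import { trace } from '@opentelemetry/api';

const provider = new WebTracerProvider();
provider.register(); // register as the global tracer provider

const tracer = trace.getTracer('agent-runtime-web');

// Wrap a client-side agent action in a span
const span = tracer.startSpan('agent.browser_action', {
  attributes: { action_type: 'click', page: 'checkout' }
});
// ... perform the browser action ...
span.end();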
Batching
Batching aggregates multiple telemetry records before transmission, reducing network overhead and backend load:
class BatchingCollector {
private batch: TelemetryRecord[] = [];
private timer: NodeJS.Timeout | null = null;
  constructor(
    private endpoint: string,
    private maxBatchSize: number = 100,
    private maxWaitMs: number = 10000
  ) {}
collect(record: TelemetryRecord) {
this.batch.push(record);
// Flush if batch size reached
if (this.batch.length >= this.maxBatchSize) {
this.flush();
return;
}
// Schedule flush if not already scheduled
if (!this.timer) {
this.timer = setTimeout(() => this.flush(), this.maxWaitMs);
}
}
private async flush() {
if (this.batch.length === 0) return;
const toSend = this.batch.splice(0, this.batch.length);
if (this.timer) {
clearTimeout(this.timer);
this.timer = null;
}
try {
await this.transmit(toSend);
} catch (error) {
// Handle transmission failure - retry, drop, or buffer
this.handleTransmissionError(error, toSend);
}
}
private async transmit(records: TelemetryRecord[]) {
await fetch(this.endpoint, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
records,
metadata: {
batch_size: records.length,
timestamp: Date.now()
}
})
});
}
  private handleTransmissionError(error: unknown, records: TelemetryRecord[]) {
    // Minimal recovery strategy: re-queue the failed records for the next flush
    this.batch.unshift(...records);
  }
}
Batching trades latency for efficiency: data arrives in backend systems with a slight delay, but throughput increases significantly. Critical alerts may require immediate transmission outside batch cycles.
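One way to handle that is a priority path that bypasses the batch cycle. The sketch below assumes records carry a severity field and reuses the BatchingCollector above; the endpoint constant is illustrative:
// Assumed for illustration: a severity field on records and a known endpoint
const TELEMETRY_ENDPOINT = 'https://telemetry.example.com';

async function collectWithPriority(record: TelemetryRecord, batcher: BatchingCollector) {
  if (record.severity === 'critical') {
    // Critical alerts skip batching and transmit immediately
    await fetch(TELEMETRY_ENDPOINT, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ records: [record] })
    });
  } else {
    // Everything else follows the normal batch cycle
    batcher.collect(record);
  }
}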
Sampling Strategies
Sampling reduces telemetry volume while preserving statistical validity:
Uniform Random Sampling: Sample a fixed percentage of events:
const SAMPLE_RATE = 0.01; // 1%
function shouldSample(): boolean {
return Math.random() < SAMPLE_RATE;
}
Rate Limiting: Cap events per time window:
class RateLimitedTelemetry {
  private counts = new Map<string, number>();
  private windowMs = 60000; // 1 minute
  private windowStart = Date.now();

  record(key: string, maxPerWindow: number): boolean {
    // Reset all counters when the window rolls over
    if (Date.now() - this.windowStart >= this.windowMs) {
      this.counts.clear();
      this.windowStart = Date.now();
    }
    const count = this.counts.get(key) || 0;
    if (count < maxPerWindow) {
      this.counts.set(key, count + 1);
      return true;
    }
    return false;
  }
}
Importance Sampling: Sample based on event significance:
function getSampleRate(event: TelemetryEvent): number {
if (event.type === 'error') return 1.0; // 100%
if (event.type === 'warning') return 0.5; // 50%
if (event.duration > 5000) return 0.8; // 80% for slow operations
return 0.1; // 10% for normal events
}
Tail-Based Sampling: Retain complete traces containing errors:
class TailSamplingProcessor {
private traceBuffers = new Map<string, Span[]>();
processSpan(span: Span) {
const traceId = span.traceId;
if (!this.traceBuffers.has(traceId)) {
this.traceBuffers.set(traceId, []);
}
this.traceBuffers.get(traceId)!.push(span);
    // On trace completion, decide whether to keep
    // (simplified: assumes the root span ends after all of its children)
    if (span.isRootSpan && span.ended) {
const trace = this.traceBuffers.get(traceId)!;
const hasError = trace.some(s => s.status === 'error');
const isSlow = trace.some(s => s.duration > 5000);
if (hasError || isSlow || Math.random() < 0.05) {
// Keep trace - send all spans
this.exportTrace(trace);
}
this.traceBuffers.delete(traceId);
}
}
}
Key Metrics
Monitor telemetry system health with these meta-metrics:
Telemetry Coverage
Definition: Percentage of agent operations producing telemetry data.
Measurement:
const coverage = (instrumentedOperations / totalOperations) * 100;
Targets:
- Critical paths: 100% coverage
- Standard operations: 95%+ coverage
- Low-priority operations: 80%+ coverage
Why it matters: Gaps in coverage create blind spots where failures and performance issues go undetected. Track coverage by operation type, service, and code path to identify instrumentation gaps.
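A minimal sketch of per-operation coverage tracking; the class and its granularity are illustrative:
// Track, per operation type, how many operations actually emitted telemetry
class CoverageTracker {
  private total = new Map<string, number>();
  private instrumented = new Map<string, number>();

  recordOperation(opType: string, emittedTelemetry: boolean) {
    this.total.set(opType, (this.total.get(opType) ?? 0) + 1);
    if (emittedTelemetry) {
      this.instrumented.set(opType, (this.instrumented.get(opType) ?? 0) + 1);
    }
  }

  coveragePercent(opType: string): number {
    const total = this.total.get(opType) ?? 0;
    if (total === 0) return 0;
    return ((this.instrumented.get(opType) ?? 0) / total) * 100;
  }
}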
Data Freshness
Definition: Latency between event occurrence and availability for query.
Measurement:
const freshness = queryTimestamp - event.timestamp;
Targets:
- Real-time alerts: < 30 seconds
- Operational dashboards: < 2 minutes
- Historical analysis: < 15 minutes
Why it matters: Stale telemetry delays problem detection and incident response. Monitor end-to-end pipeline latency from collection through ingestion to query availability. Spikes in freshness latency indicate collection bottlenecks, network issues, or backend congestion.
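One way to measure this end to end is a synthetic heartbeat probe: emit a timestamped event, then poll until it becomes queryable. The query callback and metric names below are illustrative assumptions:
// Probe pipeline freshness: emit a heartbeat event, poll until it is queryable
async function measureFreshness(isQueryable: (probeId: string) => Promise<boolean>) {
  const probeId = `probe_${Date.now()}`;
  const emittedAt = Date.now();
  telemetry.event('telemetry.heartbeat', { probe_id: probeId });

  while (!(await isQueryable(probeId))) {
    await new Promise(resolve => setTimeout(resolve, 1000)); // poll once per second
  }

  const freshnessMs = Date.now() - emittedAt;
  telemetry.histogram('telemetry.pipeline.freshness_ms', freshnessMs);
  return freshnessMs;
}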
Cardinality Management
Definition: Number of unique metric time series and event attribute combinations.
Measurement:
// Monitor active time series per metric
const cardinality = uniqueTimeSeries.length;
// Alert on cardinality explosion
if (cardinality > THRESHOLD) {
alert(`High cardinality detected: ${cardinality} series`);
}
Targets:
- Per-metric cardinality: < 1000 time series
- Total system cardinality: < 100,000 active series
- Cardinality growth rate: < 5% week-over-week
Why it matters: Uncontrolled cardinality growth degrades query performance and inflates costs, since each new dimension value multiplies the number of time series. Set up automated cardinality tracking and alerting:
interface CardinalityReport {
metric: string;
timeSeriesCount: number;
dimensions: Record<string, number>; // Unique values per dimension
growth: {
daily: number;
weekly: number;
};
}
function analyzeCardinality(metric: string): CardinalityReport {
const series = getActiveTimeSeries(metric);
const dimensions = analyzeDimensions(series);
return {
metric,
timeSeriesCount: series.length,
dimensions,
growth: {
daily: calculateGrowthRate(series, '24h'),
weekly: calculateGrowthRate(series, '7d')
}
};
}
Regularly audit high-cardinality metrics and refactor to reduce dimensionality or move to event-based collection.
Related Concepts
- Observability - Broader system visibility practice encompassing telemetry, logs, and traces
- Instrumentation - Code-level implementation of telemetry collection points
- Audit Log - Immutable record of security-relevant events for compliance
- Latency SLO - Service level objectives measured using telemetry data