Observability (agents)

Observability is the capability to understand agent behavior and system state through metrics, logs, traces, and monitoring. In agentic systems, observability enables operators to answer critical questions about what agents are doing, why they're making specific decisions, and how well they're performing their tasks.

Unlike traditional monitoring, which focuses on predefined metrics, observability provides the ability to ask arbitrary questions about system behavior without anticipating those questions in advance. For autonomous agents that make decisions and take actions with minimal human oversight, comprehensive observability is not optional—it's essential for safe production deployment.

Why It Matters

Observability is fundamental to operating agent systems reliably and safely in production environments:

Debugging Production Issues: When an agent fails or produces unexpected results, observability tools provide the visibility needed to reconstruct what happened. Without detailed traces showing the agent's decision-making process, reasoning steps, and tool invocations, debugging becomes nearly impossible. In production incidents where agents interact with external systems or user data, comprehensive logs and traces are often the only way to understand root causes.

Performance Optimization: Agents often perform complex multi-step tasks that involve multiple API calls, tool invocations, and reasoning loops. Observability metrics reveal bottlenecks—whether an agent spends most time waiting on external APIs, performing expensive computations, or iterating through retry logic. Distributed tracing shows exactly where latency occurs in agent workflows, enabling targeted optimization efforts.

Understanding Failure Modes: Agents fail in ways that differ from traditional software. They might misinterpret instructions, make incorrect tool choices, produce malformed outputs, or enter infinite loops. Observability data helps identify patterns in these failures: Do certain types of prompts consistently lead to errors? Do specific tool combinations cause issues? Are there environmental conditions that trigger unexpected behavior? Understanding these patterns is crucial for improving agent reliability.

Safety and Compliance: For agents that take autonomous actions—especially those interacting with production systems, financial transactions, or user data—observability provides the audit trail necessary for compliance and security investigations. Every action an agent takes should be traceable back through its decision-making process, with timestamps, inputs, outputs, and context preserved.

Cost Management: Agent systems that use language models incur costs per token processed. Observability metrics tracking token usage, API calls, and tool invocations enable cost monitoring and optimization. Without this visibility, organizations can face unexpectedly high bills from runaway agents or inefficient task decomposition.

Concrete Examples

Distributed Tracing for Agent Tasks: A customer support agent receives a query, searches a knowledge base, calls three different APIs to gather information, reasons about the results, and generates a response. Distributed tracing captures this entire flow as a trace with multiple spans:

import { context, trace } from '@opentelemetry/api';

const tracer = trace.getTracer('agent-system');

// Parent span for the entire agent task
const taskSpan = tracer.startSpan('agent.task.execute', {
  attributes: {
    'agent.task.id': taskId,
    'agent.task.type': 'customer_support',
    'agent.input.length': input.length,
  },
});

// Start child spans within the parent's context so they nest under the task span
const taskContext = trace.setSpan(context.active(), taskSpan);

// Child span for knowledge base search
const searchSpan = tracer.startSpan('agent.tool.knowledge_search', {
  attributes: {
    'tool.name': 'knowledge_base',
    'search.query': query,
  },
}, taskContext);
// ... execute search ...
searchSpan.setAttributes({
  'search.results.count': results.length,
  'search.latency.ms': latency,
});
searchSpan.end();

// Child span for LLM reasoning
const reasoningSpan = tracer.startSpan('agent.llm.reason', {
  attributes: {
    'llm.provider': 'anthropic',
    'llm.model': 'claude-3-5-sonnet',
    'llm.input.tokens': inputTokens,
  },
}, taskContext);
// ... LLM call ...
reasoningSpan.setAttributes({
  'llm.output.tokens': outputTokens,
  'llm.cost.usd': cost,
});
reasoningSpan.end();

taskSpan.end();

This trace structure allows operators to see the complete execution path, identify slow operations, and understand how the agent decomposed the task.

Metrics Dashboards: Production agent systems require real-time metrics dashboards showing:

  • Task completion rate: Percentage of agent tasks that complete successfully versus those that fail or timeout
  • Average task duration: P50, P95, and P99 latencies for different task types
  • Token usage: Total tokens processed per hour, broken down by model and task type
  • Tool invocation patterns: Which tools agents use most frequently and their success rates
  • Error rates: Failed LLM calls, timeout errors, tool errors, validation failures
  • Cost metrics: Hourly and daily spending on LLM APIs, external tool calls, and infrastructure

A dashboard might show that 2% of agent tasks are failing due to timeout errors, with P95 latency of 45 seconds—significantly higher than the 30-second timeout threshold. Drilling into traces reveals these tasks involve repeated retries of a slow external API.

Log Aggregation: Structured logging captures agent reasoning and decision-making:

logger.info('Agent reasoning step', {
  task_id: taskId,
  step_number: 3,
  reasoning_type: 'tool_selection',
  thought_process: 'User query requires current information; knowledge base may be stale',
  selected_tool: 'web_search',
  alternative_tools: ['knowledge_base', 'database_query'],
  confidence_score: 0.87,
  context: {
    user_query_type: 'factual',
    temporal_reference: 'current',
    required_freshness: 'high',
  }
});

When aggregated, these logs enable queries like "Show me all cases where the agent selected tool X with confidence < 0.7" or "Find tasks where the agent changed tool selection mid-execution."
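
As an illustration, here is a minimal sketch of the first query expressed as a filter over structured log events. The event shape, field names, and the in-memory batch are assumptions for this example; in practice the query would run in your log platform's own query language.

// Hypothetical structured log record, mirroring the fields logged above
interface ToolSelectionEvent {
  task_id: string;
  reasoning_type: string;
  selected_tool: string;
  confidence_score: number;
}

// Find low-confidence selections of a given tool in an aggregated batch of events
function lowConfidenceSelections(
  events: ToolSelectionEvent[],
  tool: string,
  threshold = 0.7,
): ToolSelectionEvent[] {
  return events.filter(
    (e) =>
      e.reasoning_type === 'tool_selection' &&
      e.selected_tool === tool &&
      e.confidence_score < threshold,
  );
}

// Example: all web_search selections with confidence below 0.7
// (allEvents stands in for whatever batch your log store returns)
const suspicious = lowConfidenceSelections(allEvents, 'web_search');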

Real-time Monitoring and Alerting: Observability platforms trigger alerts when agent behavior deviates from normal patterns; a minimal sketch of evaluating one such rule follows the list:

  • Alert when error rate exceeds 5% over a 10-minute window
  • Alert when average task duration increases by 50% compared to baseline
  • Alert when an individual agent task exceeds 100 LLM calls (potential infinite loop)
  • Alert when cost per completed task exceeds $1.00 (inefficiency indicator)
  • Alert when an agent attempts to use deprecated or disabled tools
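
As a concrete illustration of the first rule, here is a minimal sketch of evaluating an error-rate threshold over a sliding window. The data structures and default values are assumptions; real deployments usually express this as an alert rule in the monitoring backend rather than in application code.

// Hypothetical task outcome record kept for windowed evaluation
interface TaskOutcome {
  timestamp: number;   // epoch milliseconds
  success: boolean;
}

// Returns true when the error rate over the last `windowMs` exceeds `threshold`
function errorRateAlert(
  outcomes: TaskOutcome[],
  windowMs = 10 * 60 * 1000,   // 10-minute window
  threshold = 0.05,            // 5% error rate
): boolean {
  const cutoff = Date.now() - windowMs;
  const recent = outcomes.filter((o) => o.timestamp >= cutoff);
  if (recent.length === 0) return false;
  const errors = recent.filter((o) => !o.success).length;
  return errors / recent.length > threshold;
}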

Common Pitfalls

Insufficient Instrumentation: The most common pitfall is instrumenting only high-level success/failure metrics without capturing the internal agent reasoning process. When issues occur, teams find they cannot reconstruct what the agent was thinking or why it made specific decisions. Every significant decision point in agent execution should emit structured events.

High Cardinality Metrics: Agent systems naturally produce high-cardinality data—unique task IDs, variable user inputs, diverse tool combinations. Naively adding all context as metric labels can overwhelm metrics systems:

// PROBLEMATIC: High cardinality
metrics.counter('agent_tasks_completed', {
  user_id: userId,           // Potentially millions of unique values
  input_text: inputText,     // Essentially infinite cardinality
  task_id: taskId,          // Every task is unique
});

// BETTER: Bounded cardinality
metrics.counter('agent_tasks_completed', {
  task_type: taskType,      // Limited set of types
  user_tier: userTier,      // Small set of tiers
  success: success,         // Boolean
});
// Store high-cardinality data in traces and logs instead

High-cardinality metrics cause performance degradation, increased storage costs, and query slowdowns.

Missing Correlation IDs: Agent tasks often span multiple internal services and external APIs. Without correlation IDs propagated through the entire request path, connecting related events becomes impossible. Every log line, metric, and trace span should include identifiers that allow correlation (see the sketch after this list):

  • task_id: Unique identifier for the agent task
  • session_id: Identifier for multi-turn agent sessions
  • trace_id: Distributed tracing identifier
  • user_id: For user-initiated tasks
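
A minimal sketch of carrying these identifiers as a single correlation context attached to every log line. The shape and helper names are illustrative assumptions, not a specific library's API.

// Correlation identifiers propagated with every log line, metric, and span
interface CorrelationContext {
  task_id: string;
  session_id: string;
  trace_id: string;
  user_id?: string;
}

// Wrap a logger so every event automatically includes the correlation fields
function withCorrelation(
  logger: { info: (msg: string, attrs: Record<string, unknown>) => void },
  ctx: CorrelationContext,
) {
  return {
    info: (msg: string, attrs: Record<string, unknown> = {}) =>
      logger.info(msg, { ...ctx, ...attrs }),
  };
}

// Usage: every log line from this task now carries the same identifiers
const taskLogger = withCorrelation(logger, {
  task_id: taskId,
  session_id: sessionId,
  trace_id: traceId,
});
taskLogger.info('Agent reasoning step', { step_number: 3 });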

Over-reliance on Sampling: To reduce costs, some teams aggressively sample traces—capturing only 1% or 0.1% of agent executions. This works for high-throughput web services but is problematic for agents, where rare failure modes matter significantly. A bug affecting 0.5% of tasks might never appear in sampled data. Consider tail-based sampling that keeps all errors and slow requests, plus a percentage of successful requests.
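
A minimal sketch of that sampling decision, applied after a trace completes so the error status and latency are known. In practice this is typically configured in a collector's tail-sampling processor rather than written in application code, and the field names and thresholds here are assumptions.

// Summary of a completed trace used to make a keep/drop decision
interface CompletedTrace {
  traceId: string;
  hadError: boolean;
  durationMs: number;
}

// Keep every error and slow trace; sample a fixed fraction of the rest
function shouldKeepTrace(
  t: CompletedTrace,
  slowThresholdMs = 30_000,
  baselineRate = 0.05,
): boolean {
  if (t.hadError) return true;
  if (t.durationMs >= slowThresholdMs) return true;
  return Math.random() < baselineRate;
}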

Logging Sensitive Data: Agents often process user data, API credentials, and proprietary information. Logging agent reasoning verbatim can inadvertently capture sensitive data:

// DANGEROUS: May log sensitive data
logger.info('Agent input', { input: userInput });

// SAFER: Redact or summarize
logger.info('Agent input', {
  input_length: userInput.length,
  input_type: classifyInputType(userInput),
  contains_pii: detectPII(userInput),
});

Implement automatic redaction for known sensitive patterns (API keys, SSNs, credit cards) and consider separate audit logs with enhanced access controls for full data capture.
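
A minimal sketch of pattern-based redaction for a few well-known formats. The regular expressions are deliberately simple and illustrative; production redaction usually combines broader pattern sets with allow-lists and dedicated scanning tools.

// Simple, illustrative redaction patterns; real deployments need broader coverage
const REDACTION_PATTERNS: Array<[RegExp, string]> = [
  [/\b\d{3}-\d{2}-\d{4}\b/g, '[REDACTED_SSN]'],         // US SSN format
  [/\b(?:\d[ -]?){13,16}\b/g, '[REDACTED_CARD]'],       // credit-card-like digit runs
  [/\bsk-[A-Za-z0-9]{20,}\b/g, '[REDACTED_API_KEY]'],   // common API key prefix
];

function redact(text: string): string {
  return REDACTION_PATTERNS.reduce(
    (acc, [pattern, replacement]) => acc.replace(pattern, replacement),
    text,
  );
}

// Usage: redact before logging; keep full text only in a restricted audit log
logger.info('Agent input', { input_preview: redact(userInput).slice(0, 200) });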

Alert Fatigue: Teams often create too many alerts with improper thresholds, leading to constant noise that gets ignored. Start with alerts for genuine production issues: complete system failures, severe performance degradation, and safety violations. Tune thresholds based on actual baseline behavior rather than guesses.

Ignoring Tool Observability: Agents invoke external tools and APIs. If these tools lack observability, diagnosing agent issues becomes difficult. When an agent's tool call fails, you need visibility into whether the failure was due to the tool itself, network issues, authentication problems, or incorrect agent usage.

Implementation

Implementing comprehensive observability for agent systems requires coordinated instrumentation across three pillars:

Three Pillars of Observability

Metrics: Numerical measurements aggregated over time windows. Metrics answer questions like "How many agent tasks are completing?" and "What's the average token usage?" Use metrics for the following (a sketch of defining these instruments appears after the list):

  • Counters: Total tasks, successful completions, errors, tool invocations
  • Gauges: Active agent tasks, queue depth, concurrent LLM calls
  • Histograms: Task duration distribution, token counts, cost per task
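
A minimal sketch of defining such instruments with the OpenTelemetry metrics API. The instrument names and attribute keys are assumptions chosen for this example; gauge-style measurements map here to an UpDownCounter, with ObservableGauge as the alternative for sampled values.

import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('agent-system');

// Counter: monotonically increasing totals
const tasksCompleted = meter.createCounter('agent.tasks.completed', {
  description: 'Completed agent tasks, by type and outcome',
});

// UpDownCounter: current number of in-flight tasks (rises and falls)
const activeTasks = meter.createUpDownCounter('agent.tasks.active', {
  description: 'Agent tasks currently executing',
});

// Histogram: distribution of task durations
const taskDuration = meter.createHistogram('agent.task.duration', {
  description: 'End-to-end agent task duration',
  unit: 'ms',
});

// Record values with bounded-cardinality attributes
tasksCompleted.add(1, { task_type: 'customer_support', success: true });
activeTasks.add(-1);
taskDuration.record(12_400, { task_type: 'customer_support' });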

Logs: Timestamped text records of discrete events. Logs answer questions like "What was the agent thinking at step 5?" and "Why did this specific task fail?" Use structured logging with consistent schemas:

interface AgentLogEvent {
  timestamp: string;
  level: 'debug' | 'info' | 'warn' | 'error';
  task_id: string;
  trace_id: string;
  event_type: string;
  message: string;
  attributes: Record<string, any>;
}

Traces: Records of request flow through distributed systems. Traces answer questions like "Where is latency occurring in this agent workflow?" and "What sequence of operations led to this outcome?" Implement distributed tracing using OpenTelemetry or similar standards.

Monitoring Stack Setup

A production-ready agent observability stack typically includes:

Collection Layer:

  • OpenTelemetry SDKs instrumented throughout agent code
  • Log shippers (Fluent Bit, Vector) aggregating logs from all services
  • Metrics exporters (Prometheus exporters, StatsD clients)

Storage and Processing:

  • Time-series database for metrics (Prometheus, VictoriaMetrics, Grafana Mimir)
  • Distributed tracing backend (Jaeger, Tempo, Lightstep)
  • Log aggregation platform (Elasticsearch, Loki, Datadog)

Visualization and Analysis:

  • Dashboards (Grafana, Datadog, Honeycomb)
  • Trace analysis tools with powerful querying capabilities
  • Log search interfaces with full-text search and filtering

Example OpenTelemetry Setup:

import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'https://observability-backend.example.com/v1/traces',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'https://observability-backend.example.com/v1/metrics',
    }),
    exportIntervalMillis: 10000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations(),
    new AgentInstrumentation(),  // Project-specific custom instrumentation (not shown here)
  ],
  serviceName: 'agent-system',
});

sdk.start();

Alerting Strategy

Effective alerting focuses on actionable signals:

Critical Alerts (page immediately):

  • Agent system completely down (0 successful tasks in 5 minutes)
  • Error rate > 25% sustained for 10 minutes
  • Agent attempting prohibited actions (security violations)
  • Cost rate exceeding 3x normal baseline

Warning Alerts (notify during business hours):

  • Error rate > 10% sustained for 30 minutes
  • P95 latency > 2x baseline for 20 minutes
  • Unusual tool usage patterns
  • Token usage trending 50% above forecast

Informational Alerts (aggregate into reports):

  • Individual task failures (unless clustered)
  • Temporary degradation that self-recovers
  • Optimization opportunities identified

Implement alert routing that considers severity, time of day, and on-call schedules. Use alert aggregation to prevent flooding on-call engineers with hundreds of related alerts from a single incident.
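
A minimal sketch of the aggregation idea: group related alerts under one incident key and suppress duplicates within a window. The key scheme and window are assumptions; alertmanager-style tools provide this grouping out of the box.

// Suppress repeated alerts that share an incident key within a time window
const lastFired = new Map<string, number>();

function shouldNotify(
  alertName: string,
  groupKey: string,                   // e.g. task_type or failing tool
  suppressWindowMs = 15 * 60 * 1000,  // 15-minute suppression window
): boolean {
  const key = `${alertName}:${groupKey}`;
  const now = Date.now();
  const last = lastFired.get(key);
  if (last !== undefined && now - last < suppressWindowMs) {
    return false;                     // already notified for this incident recently
  }
  lastFired.set(key, now);
  return true;
}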

Key Metrics

Essential metrics for agent system observability:

MTTD (Mean Time to Detect): The average time between when an issue begins and when it's detected through monitoring and alerting. For agent systems, target MTTD < 5 minutes for critical failures, < 15 minutes for degraded performance. Low MTTD requires comprehensive instrumentation and well-tuned alerts that trigger quickly without false positives.

Calculate MTTD as: (Sum of detection times) / (Number of incidents)

Example: An agent system experiences 5 incidents in a month with detection times of 3, 7, 15, 4, and 11 minutes. MTTD = (3 + 7 + 15 + 4 + 11) / 5 = 8 minutes.

MTTR (Mean Time to Resolve): The average time between detecting an issue and fully resolving it. For agent systems, resolution time often depends on whether issues are self-healing (retry logic succeeds), require configuration changes, or need code fixes. Target MTTR varies by severity: critical issues < 1 hour, high priority < 4 hours, medium priority < 24 hours.

Calculate MTTR as: (Sum of resolution times) / (Number of incidents)

High MTTR often indicates insufficient observability—engineers spend time gathering information rather than fixing issues. Improving trace detail and log quality typically reduces MTTR significantly.
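
A minimal sketch of computing both metrics from incident records, assuming each incident stores when it began, when it was detected, and when it was resolved (field names are assumptions).

interface Incident {
  startedAt: number;    // epoch ms: when the issue began
  detectedAt: number;   // epoch ms: when monitoring/alerting caught it
  resolvedAt: number;   // epoch ms: when it was fully resolved
}

// Mean of a list of millisecond durations, expressed in minutes
const meanMinutes = (values: number[]): number =>
  values.length === 0
    ? 0
    : values.reduce((sum, v) => sum + v, 0) / values.length / 60_000;

function mttdMinutes(incidents: Incident[]): number {
  return meanMinutes(incidents.map((i) => i.detectedAt - i.startedAt));
}

function mttrMinutes(incidents: Incident[]): number {
  return meanMinutes(incidents.map((i) => i.resolvedAt - i.detectedAt));
}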

Observability Coverage: The percentage of agent operations and decision points that are instrumented with appropriate telemetry. Calculate coverage by identifying:

  • Total critical decision points in agent code (tool selection, reasoning steps, safety checks, error handling)
  • Decision points with instrumentation (logs, traces, or metrics emitted)
  • Coverage = (Instrumented points / Total points) × 100%

Target coverage > 80% for production systems. Coverage < 60% indicates blind spots where issues may occur without visibility. Perform periodic coverage audits by walking through agent code paths and identifying gaps.

Task Success Rate: Percentage of agent tasks that complete successfully without errors or timeouts. Calculate overall and per-task-type:

  • Success Rate = (Successful tasks / Total tasks) × 100%
  • Track trends over time to identify regressions
  • Compare across different agent versions, models, and configurations

Target success rate depends on task complexity and user tolerance, but generally aim for > 95% for production agents handling critical workflows.

Token Efficiency: Average tokens consumed per completed task. Lower is better, indicating efficient prompting and tool usage:

  • Token Efficiency = Total tokens used / Successful tasks
  • Compare across agent versions to measure optimization impact
  • Break down by model (different models have different optimal patterns)

A sudden increase in tokens per task may indicate prompt regressions, infinite loops, or inefficient tool usage patterns.

Tool Error Rate: Percentage of tool invocations that result in errors:

  • Tool Error Rate = (Failed tool calls / Total tool calls) × 100%
  • Track per tool to identify unreliable integrations
  • Distinguish between tool failures (tool problem) and agent misuse (agent problem)

High tool error rates suggest either unreliable external dependencies or agents using tools incorrectly. Use traces to determine which.

Cost per Task: Total cost (LLM API calls, tool API calls, infrastructure) divided by completed tasks:

  • Cost per Task = Total costs / Successful tasks
  • Monitor trends to catch cost regressions early
  • Compare across task types to identify expensive workflows

Cost per task increasing over time may indicate model changes, less efficient prompting, or agents taking more steps to accomplish the same goals.

Related Concepts

Understanding agent observability requires familiarity with several related concepts:

  • Proof of Action: Immutable records of agent actions that complement observability by providing verifiable evidence of what agents did
  • Instrumentation: The technical implementation of adding observability hooks throughout agent code
  • Audit Log: Specialized logging focused on compliance and security requirements for agent actions
  • Telemetry: The broader practice of collecting and transmitting observability data from agent systems

Additional context:

  • Distributed Tracing: Essential for understanding multi-step agent workflows across services
  • Structured Logging: Enables powerful querying and analysis of agent reasoning
  • Service Level Objectives (SLOs): Define target reliability metrics for agent systems based on observability data
  • Error Budgets: Quantify acceptable failure rates derived from SLOs and measured through observability
  • Anomaly Detection: Machine learning techniques applied to observability data to identify unusual agent behavior