Failure Modes

Failure modes are categorized types of agent failure, each with associated detection, handling, and recovery strategies. In agentic systems, each failure mode represents a distinct pattern of breakdown with its own characteristics, root causes, and appropriate remediation. Rather than treating all errors uniformly, a failure-modes framework enables systematic classification and a tailored response to each failure scenario.

Why It Matters

Proactive Failure Management

Understanding failure modes transforms error handling from reactive firefighting to proactive system design. By categorizing potential failures before they occur, teams can:

  • Design for graceful degradation: Build fallback mechanisms specific to each failure category (see the sketch after this list)
  • Reduce mean time to recovery (MTTR): Pre-defined recovery strategies enable faster incident resolution
  • Improve user experience: Different failure modes warrant different user communications and recovery paths
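
A minimal sketch of category-specific fallbacks is shown below. The category names mirror the TRANSIENT/CONFIGURATION/PERMANENT grouping used later in this article; the policy fields and fallback actions are illustrative rather than taken from any particular framework.

// Illustrative sketch: degradation policies keyed by failure category
type FailureCategory = 'TRANSIENT' | 'CONFIGURATION' | 'PERMANENT';

interface DegradationPolicy {
  fallback: string;              // what the agent does instead of the failed step
  userMessage: string;           // how the failure is communicated to the user
  targetRecoverySeconds: number; // feeds MTTR expectations for this category
}

const degradationPolicies: Record<FailureCategory, DegradationPolicy> = {
  TRANSIENT: {
    fallback: 'retry_with_backoff',
    userMessage: 'Temporary issue detected; retrying automatically.',
    targetRecoverySeconds: 30
  },
  CONFIGURATION: {
    fallback: 'use_cached_result',
    userMessage: 'A configuration problem needs attention; showing the last known result.',
    targetRecoverySeconds: 600
  },
  PERMANENT: {
    fallback: 'skip_and_report',
    userMessage: 'This step cannot be completed; it was skipped and logged.',
    targetRecoverySeconds: 0
  }
};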

Failure Mode and Effects Analysis (FMEA)

FMEA methodology applied to agentic systems involves:

  1. Identification: Enumerate all potential failure modes across the agent lifecycle
  2. Severity assessment: Rate the impact of each failure type on system functionality
  3. Detection mechanisms: Implement monitoring to catch failures early
  4. Mitigation strategies: Design preventive and corrective measures for high-priority modes

This systematic approach prevents catastrophic failures and enables data-driven prioritization of reliability improvements.
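
Classic FMEA ranks failure modes by a risk priority number (RPN): the product of severity, occurrence, and detection scores, each typically rated 1-10, where a higher detection score means the failure is harder to catch before it causes impact. The worksheet entries below are hypothetical, but the scoring follows the standard method:

// FMEA-style risk scoring for agent failure modes (entries are hypothetical)
interface FmeaEntry {
  mode: string;
  severity: number;    // 1-10: impact when the failure occurs
  occurrence: number;  // 1-10: expected frequency
  detection: number;   // 1-10: 10 = very unlikely to be detected before impact
}

const riskPriorityNumber = (e: FmeaEntry): number =>
  e.severity * e.occurrence * e.detection;

const worksheet: FmeaEntry[] = [
  { mode: 'SELECTOR_STALE', severity: 5, occurrence: 7, detection: 3 },
  { mode: 'EXPIRED_TOKEN', severity: 6, occurrence: 4, detection: 2 },
  { mode: 'STATE_DESYNC', severity: 8, occurrence: 3, detection: 8 }
];

// Highest-RPN modes are the first candidates for mitigation work
const prioritized = [...worksheet]
  .sort((a, b) => riskPriorityNumber(b) - riskPriorityNumber(a));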

System Reliability and Trust

Agent systems operate in unpredictable environments with multiple points of potential failure. A comprehensive failure modes framework:

  • Establishes reliability boundaries and expected behavior under stress
  • Enables SLA commitments backed by measured failure characteristics (see the error-budget sketch after this list)
  • Builds user trust through predictable, well-handled failure scenarios
  • Supports compliance requirements in regulated domains
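
As one way to back SLA commitments with measured data, failure counts can be compared against an SLO error budget. The sketch below assumes a single success-rate objective and is deliberately simplified:

// Minimal sketch: comparing measured failures against an SLO error budget
interface SloConfig {
  targetSuccessRate: number; // e.g. 0.995 for a 99.5% objective
}

function remainingErrorBudget(
  slo: SloConfig,
  totalOperations: number,
  failedOperations: number
): number {
  const allowedFailures = totalOperations * (1 - slo.targetSuccessRate);
  return allowedFailures - failedOperations; // negative => budget exhausted
}

// Example: 10,000 operations under a 99.5% objective allow 50 failures
const budget = remainingErrorBudget({ targetSuccessRate: 0.995 }, 10_000, 37); // ≈ 13 remaining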

Concrete Examples

Network Failures

Characteristics: Intermittent connectivity, timeouts, partial responses

// Network failure detection and classification
class NetworkFailureHandler {
  classifyNetworkError(error: Error): NetworkFailureMode {
    if (error.message.includes('ETIMEDOUT')) {
      return {
        mode: 'TIMEOUT',
        retryable: true,
        backoffStrategy: 'exponential',
        maxRetries: 3
      };
    }

    if (error.message.includes('ECONNREFUSED')) {
      return {
        mode: 'CONNECTION_REFUSED',
        retryable: false,
        fallback: 'offline_mode',
        userNotification: true
      };
    }

    // Node-style DNS failures surface as ENOTFOUND or EAI_AGAIN
    if (error.message.includes('ENOTFOUND') || error.message.includes('EAI_AGAIN')) {
      return {
        mode: 'DNS_FAILURE',
        retryable: true,
        backoffStrategy: 'linear',
        maxRetries: 2
      };
    }

    return {
      mode: 'UNKNOWN_NETWORK_ERROR',
      retryable: false,
      escalate: true
    };
  }
}

Selector Breakage

Characteristics: UI changes invalidate locator strategies, elements not found

// Multi-layered selector strategy with failure detection
interface SelectorFailureMode {
  mode: 'PRIMARY_FAILED' | 'SECONDARY_FAILED' | 'ALL_FAILED';
  failedSelectors: string[];
  snapshot?: string;
  suggestedUpdate?: string;
}

async function robustElementLocation(
  page: Page,
  selectors: string[]
): Promise<Element | SelectorFailureMode> {
  const failures: string[] = [];

  for (const selector of selectors) {
    try {
      const element = await page.waitForSelector(selector, {
        timeout: 2000
      });

      if (element) {
        // Log successful fallback for monitoring
        if (failures.length > 0) {
          logger.warn('Primary selectors failed, used fallback', {
            failed: failures,
            successful: selector
          });
        }
        return element;
      }
    } catch (error) {
      failures.push(selector);
    }
  }

  // All selectors failed - capture diagnostics
  return {
    mode: 'ALL_FAILED',
    failedSelectors: failures,
    snapshot: await page.screenshot({ encoding: 'base64' }),
    suggestedUpdate: await generateSelectorSuggestion(page)
  };
}

Timeout Errors

Characteristics: Operations exceed expected duration, resource exhaustion

// Context-aware timeout handling
class TimeoutFailureManager {
  private readonly timeoutPolicies = {
    navigation: { limit: 30000, retryable: true, escalate: false },
    api_call: { limit: 10000, retryable: true, escalate: false },
    llm_inference: { limit: 120000, retryable: false, escalate: true },
    user_action: { limit: 5000, retryable: true, escalate: false }
  };

  async executeWithTimeout<T>(
    operation: () => Promise<T>,
    context: keyof TimeoutFailureManager['timeoutPolicies']
  ): Promise<T> {
    const policy = this.timeoutPolicies[context];

    try {
      return await Promise.race([
        operation(),
        this.timeoutPromise(policy.limit)
      ]);
    } catch (error) {
      if (error instanceof TimeoutError) {
        return this.handleTimeout(error, context, policy);
      }
      throw error;
    }
  }

  private async handleTimeout(
    error: TimeoutError,
    context: string,
    policy: TimeoutPolicy
  ) {
    this.metrics.recordTimeout(context, error.duration);

    if (policy.escalate) {
      await this.escalateToHuman({
        reason: `${context} exceeded timeout limit`,
        duration: error.duration,
        limit: policy.limit
      });
    }

    if (policy.retryable) {
      return this.retryWithBackoff(error.operation, context);
    }

    throw new UnrecoverableTimeoutError(context, policy.limit);
  }
}

Authorization Failures

Characteristics: Permission denied, expired credentials, scope limitations

// Authorization failure classification and recovery
enum AuthFailureMode {
  EXPIRED_TOKEN = 'expired_token',
  INSUFFICIENT_PERMISSIONS = 'insufficient_permissions',
  INVALID_CREDENTIALS = 'invalid_credentials',
  RATE_LIMITED = 'rate_limited',
  ACCOUNT_LOCKED = 'account_locked'
}

class AuthFailureHandler {
  async handleAuthFailure(
    error: AuthError,
    context: OperationContext
  ): Promise<RecoveryAction> {
    const mode = this.classifyAuthFailure(error);

    switch (mode) {
      case AuthFailureMode.EXPIRED_TOKEN:
        // Automatic recovery: refresh token
        return {
          action: 'REFRESH_TOKEN',
          automated: true,
          retryOriginalOperation: true
        };

      case AuthFailureMode.INSUFFICIENT_PERMISSIONS:
        // User intervention: request elevated permissions
        return {
          action: 'REQUEST_PERMISSIONS',
          automated: false,
          userPrompt: `This operation requires additional permissions: ${context.requiredScopes}`,
          fallback: 'SKIP_OPERATION'
        };

      case AuthFailureMode.RATE_LIMITED: {
        // Backoff and retry, honoring any Retry-After header (value may be a string)
        const retryAfter = Number(error.headers['retry-after']) || 60;
        return {
          action: 'DELAY_RETRY',
          automated: true,
          delaySeconds: retryAfter,
          notifyUser: retryAfter > 30
        };
      }

      case AuthFailureMode.ACCOUNT_LOCKED:
        // Unrecoverable: escalate to human
        return {
          action: 'ESCALATE',
          automated: false,
          severity: 'HIGH',
          userPrompt: 'Account access locked. Please contact support.',
          abortWorkflow: true
        };

      default:
        // Unclassified modes (e.g. INVALID_CREDENTIALS) default to human escalation
        return {
          action: 'ESCALATE',
          automated: false,
          severity: 'MEDIUM',
          userPrompt: 'Authentication failed. Please re-enter credentials or contact support.'
        };
    }
  }
}

Common Pitfalls

Treating All Failures the Same

Problem: Uniform error handling obscures important distinctions and prevents targeted recovery.

// Anti-pattern: Generic error handling
try {
  await agent.executeTask(task);
} catch (error) {
  console.error('Task failed:', error);
  return { success: false, error: error.message };
}

// Better: Mode-specific handling
try {
  await agent.executeTask(task);
} catch (error) {
  const failureMode = classifyFailure(error);

  switch (failureMode.category) {
    case 'TRANSIENT':
      return await retryWithBackoff(task, failureMode);
    case 'CONFIGURATION':
      return await escalateForConfiguration(task, failureMode);
    case 'PERMANENT':
      return await skipWithFallback(task, failureMode);
  }
}

Missing Failure Categories

Problem: Incomplete failure taxonomy leads to unhandled edge cases and surprise failures.

Prevention strategies:

  • Conduct FMEA workshops with cross-functional teams
  • Analyze production incident logs to identify undocumented failure patterns
  • Review third-party API documentation for all possible error conditions
  • Implement catch-all handlers that flag unclassified failures for investigation

// Comprehensive failure taxonomy
const failureTaxonomy = {
  INFRASTRUCTURE: [
    'NETWORK_TIMEOUT',
    'DNS_FAILURE',
    'SSL_ERROR',
    'CLOUD_SERVICE_OUTAGE'
  ],
  APPLICATION: [
    'SELECTOR_NOT_FOUND',
    'ELEMENT_NOT_INTERACTIVE',
    'UNEXPECTED_NAVIGATION',
    'STATE_DESYNC'
  ],
  AUTHENTICATION: [
    'EXPIRED_TOKEN',
    'INVALID_CREDENTIALS',
    'INSUFFICIENT_PERMISSIONS',
    'RATE_LIMITED'
  ],
  RESOURCE: [
    'MEMORY_EXHAUSTED',
    'DISK_FULL',
    'CPU_THROTTLED',
    'QUOTA_EXCEEDED'
  ],
  LOGIC: [
    'INVALID_INPUT',
    'PRECONDITION_FAILED',
    'INVARIANT_VIOLATED',
    'DEADLOCK_DETECTED'
  ]
};

// Flag unknown failures for taxonomy expansion
class UnknownFailureDetector {
  handle(error: Error) {
    const isKnown = Object.values(failureTaxonomy)
      .flat()
      .some(mode => error.message.includes(mode));

    if (!isKnown) {
      this.alertForTaxonomyReview(error);
    }
  }
}

No Recovery Strategies

Problem: Detecting failures without recovery mechanisms leaves the system in failed states.

// Anti-pattern: Detection without recovery
function detectFailure(error: Error): FailureMode {
  // ... classification logic ...
  return failureMode;
}

// Better: Recovery-first approach
interface FailureModeDefinition {
  detection: (error: Error) => boolean;
  recovery: RecoveryStrategy[];
  escalation?: EscalationPolicy;
}

const failureModes: Record<string, FailureModeDefinition> = {
  NETWORK_TIMEOUT: {
    detection: (e) => e.message.includes('ETIMEDOUT'),
    recovery: [
      { type: 'retry', maxAttempts: 3, backoff: 'exponential' },
      { type: 'fallback', action: 'use_cached_data' },
      { type: 'degrade', action: 'skip_non_critical_operations' }
    ],
    escalation: { threshold: 5, window: '5m', action: 'alert_oncall' }
  }
};

Implementation

Failure Taxonomy

Establish a hierarchical classification system that balances granularity with manageability:

// Structured failure mode taxonomy
interface FailureMode {
  id: string;
  category: FailureCategory;
  subcategory?: string;
  severity: 'LOW' | 'MEDIUM' | 'HIGH' | 'CRITICAL';
  retriability: {
    retryable: boolean;
    maxAttempts?: number;
    backoffStrategy?: 'linear' | 'exponential' | 'fixed';
  };
  detection: {
    patterns: string[];
    statusCodes?: number[];
    customDetector?: (error: Error) => boolean;
  };
  recovery: RecoveryStrategy[];
  escalation?: EscalationPolicy;
  monitoring: {
    alertThreshold?: number;
    sloImpact: boolean;
  };
}

const failureModeRegistry: FailureMode[] = [
  {
    id: 'SELECTOR_STALE',
    category: 'APPLICATION',
    subcategory: 'UI_INTERACTION',
    severity: 'MEDIUM',
    retriability: {
      retryable: true,
      maxAttempts: 3,
      backoffStrategy: 'linear'
    },
    detection: {
      patterns: ['StaleElementReference', 'Element is not attached'],
      customDetector: (e) => e.name === 'StaleElementError'
    },
    recovery: [
      { type: 'relocate_element', strategy: 'refresh_and_find' },
      { type: 'page_reload', condition: 'if_relocated_fails' },
      { type: 'escalate', condition: 'all_failed' }
    ],
    monitoring: {
      alertThreshold: 10,
      sloImpact: false
    }
  }
];

Detection Patterns

Implement multi-layered failure detection combining proactive and reactive approaches:

// Proactive failure detection
class FailureDetectionSystem {
  // Pattern-based detection
  detectByPattern(error: Error): FailureMode | null {
    for (const mode of failureModeRegistry) {
      const matched = mode.detection.patterns.some(pattern =>
        error.message.includes(pattern)
      );

      if (matched) {
        return mode;
      }

      if (mode.detection.customDetector?.(error)) {
        return mode;
      }
    }

    return null;
  }

  // Anomaly-based detection
  detectByAnomaly(metrics: OperationMetrics): FailureMode | null {
    // Detect abnormal patterns even without explicit errors
    if (metrics.duration > metrics.p99Baseline * 3) {
      return failureModeRegistry.find(m => m.id === 'PERFORMANCE_DEGRADATION');
    }

    if (metrics.memoryUsage > 0.9) {
      return failureModeRegistry.find(m => m.id === 'MEMORY_PRESSURE');
    }

    return null;
  }

  // State-based detection
  detectByState(systemState: SystemState): FailureMode | null {
    // Check for invalid state combinations
    if (systemState.userLoggedIn && !systemState.sessionToken) {
      return failureModeRegistry.find(m => m.id === 'STATE_INCONSISTENCY');
    }

    return null;
  }
}

Recovery Workflows

Design failure handling as recovery workflows with progressive fallback strategies:

// Comprehensive recovery workflow engine
class RecoveryWorkflowEngine {
  async executeRecovery(
    failureMode: FailureMode,
    context: OperationContext
  ): Promise<RecoveryResult> {
    const workflow = this.buildRecoveryWorkflow(failureMode);

    for (const step of workflow.steps) {
      try {
        const result = await this.executeRecoveryStep(step, context);

        if (result.success) {
          this.recordRecovery(failureMode, step, result);
          return { recovered: true, method: step.type };
        }
      } catch (stepError) {
        this.recordRecoveryFailure(failureMode, step, stepError);
        // Continue to next recovery step
      }
    }

    // All recovery attempts exhausted
    return this.handleUnrecoverable(failureMode, context);
  }

  private buildRecoveryWorkflow(mode: FailureMode): RecoveryWorkflow {
    return {
      steps: [
        // Level 1: Automatic retry
        ...this.generateRetrySteps(mode.retriability),

        // Level 2: Fallback mechanisms
        ...mode.recovery.filter(r => r.type === 'fallback'),

        // Level 3: Graceful degradation
        ...mode.recovery.filter(r => r.type === 'degrade'),

        // Level 4: Human escalation
        ...this.generateEscalationSteps(mode.escalation)
      ],
      timeout: this.calculateWorkflowTimeout(mode),
      abortConditions: this.defineAbortConditions(mode)
    };
  }

  private async executeRecoveryStep(
    step: RecoveryStep,
    context: OperationContext
  ): Promise<StepResult> {
    switch (step.type) {
      case 'retry':
        await this.delay(step.backoffMs);
        return await this.retryOriginalOperation(context);

      case 'fallback':
        return await this.executeFallback(step.action, context);

      case 'degrade':
        return await this.degradeService(step.degradationLevel, context);

      case 'escalate':
        return await this.escalateToHuman(step.escalationType, context);
    }
  }

  private async handleUnrecoverable(
    mode: FailureMode,
    context: OperationContext
  ): Promise<RecoveryResult> {
    // Log for post-mortem analysis
    await this.logUnrecoverableFailure(mode, context);

    // Trigger alerts based on severity
    if (mode.severity === 'CRITICAL') {
      await this.triggerCriticalAlert(mode, context);
    }

    // Apply circuit breaker if needed
    if (this.shouldApplyCircuitBreaker(mode)) {
      await this.openCircuitBreaker(context.operation);
    }

    return {
      recovered: false,
      action: 'ABORT',
      userMessage: this.generateUserMessage(mode),
      postMortemId: await this.createIncident(mode, context)
    };
  }
}

Key Metrics

Failure Rate by Category

Track the frequency of each failure mode to identify systemic issues:

// Failure rate tracking and analysis
interface FailureMetrics {
  failureMode: string;
  count: number;
  rate: number; // failures per operation
  trend: 'increasing' | 'stable' | 'decreasing';
}

class FailureRateMonitor {
  calculateFailureRates(timeWindow: string): FailureMetrics[] {
    const operations = this.getOperations(timeWindow);
    const failures = this.getFailures(timeWindow);

    // failureModeRegistry is an array, so map over its entries rather than Object.keys
    return failureModeRegistry.map(({ id: modeId }) => {
      const modeFailures = failures.filter(f => f.mode === modeId);
      const rate = modeFailures.length / operations.length;

      return {
        failureMode: modeId,
        count: modeFailures.length,
        rate: rate,
        trend: this.calculateTrend(modeId, timeWindow)
      };
    });
  }

  // Alert when failure rate exceeds threshold
  monitorFailureRateThresholds() {
    const rates = this.calculateFailureRates('1h');

    for (const metric of rates) {
      const mode = failureModeRegistry.find(m => m.id === metric.failureMode);

      if (mode?.monitoring.alertThreshold &&
          metric.count > mode.monitoring.alertThreshold) {
        this.alertHighFailureRate(mode, metric);
      }

      // Track for SLO compliance
      if (mode?.monitoring.sloImpact) {
        this.recordSloImpact(mode, metric);
      }
    }
  }
}

// Example dashboard query
const failureRateQuery = `
  SELECT
    failure_mode,
    COUNT(*) AS failure_count,
    COUNT(*) * 1.0 / (SELECT COUNT(*) FROM operations WHERE timestamp > NOW() - INTERVAL '1 hour') AS failure_rate,
    AVG(recovery_time_ms) AS avg_recovery_time
  FROM failures
  WHERE timestamp > NOW() - INTERVAL '1 hour'
  GROUP BY failure_mode
  -- Standard SQL does not allow referencing a SELECT alias in HAVING, so repeat the expression
  HAVING COUNT(*) * 1.0 / (SELECT COUNT(*) FROM operations WHERE timestamp > NOW() - INTERVAL '1 hour') > 0.01  -- alert if >1% failure rate
  ORDER BY failure_rate DESC;
`;

Mean Time to Recovery (MTTR) by Mode

Measure recovery effectiveness for each failure type:

// MTTR tracking and optimization
interface MTTRMetrics {
  failureMode: string;
  mttr: number; // milliseconds
  p50: number;
  p95: number;
  p99: number;
  recoveryMethod: Record<string, number>; // which recovery worked
}

class MTTRAnalyzer {
  calculateMTTR(failureMode: string, timeWindow: string): MTTRMetrics {
    const incidents = this.getRecoveredIncidents(failureMode, timeWindow);

    const recoveryTimes = incidents.map(i =>
      i.recoveryTimestamp - i.failureTimestamp
    );

    const methodCounts = incidents.reduce((acc, i) => {
      acc[i.recoveryMethod] = (acc[i.recoveryMethod] || 0) + 1;
      return acc;
    }, {} as Record<string, number>);

    return {
      failureMode,
      mttr: this.mean(recoveryTimes),
      p50: this.percentile(recoveryTimes, 0.5),
      p95: this.percentile(recoveryTimes, 0.95),
      p99: this.percentile(recoveryTimes, 0.99),
      recoveryMethod: methodCounts
    };
  }

  // Identify recovery optimization opportunities
  optimizeRecoveryStrategies() {
    const allModes = failureModeRegistry.map(m => m.id);

    for (const mode of allModes) {
      const metrics = this.calculateMTTR(mode, '7d');

      // If MTTR is high, investigate recovery workflow
      if (metrics.mttr > 5000) { // > 5 seconds
        this.flagForOptimization(mode, metrics);
      }

      // Optimize recovery step ordering based on success rates
      const bestRecoveryMethod = Object.entries(metrics.recoveryMethod)
        .sort(([,a], [,b]) => b - a)[0][0];

      this.suggestRecoveryReordering(mode, bestRecoveryMethod);
    }
  }
}

Recovery Success Rate

Track the percentage of failures successfully recovered vs. requiring escalation:

// Recovery effectiveness monitoring
interface RecoverySuccessMetrics {
  failureMode: string;
  totalFailures: number;
  autoRecovered: number;
  humanEscalated: number;
  unrecovered: number;
  successRate: number; // 0-1, where 1 = 100%
}

class RecoverySuccessMonitor {
  calculateRecoverySuccess(
    failureMode: string,
    timeWindow: string
  ): RecoverySuccessMetrics {
    const allFailures = this.getFailures(failureMode, timeWindow);

    const autoRecovered = allFailures.filter(
      f => f.outcome === 'AUTO_RECOVERED'
    ).length;

    const humanEscalated = allFailures.filter(
      f => f.outcome === 'HUMAN_ESCALATED'
    ).length;

    const unrecovered = allFailures.filter(
      f => f.outcome === 'UNRECOVERED'
    ).length;

    return {
      failureMode,
      totalFailures: allFailures.length,
      autoRecovered,
      humanEscalated,
      unrecovered,
      successRate: autoRecovered / allFailures.length
    };
  }

  // Set recovery success targets
  enforceRecoveryTargets() {
    const targets = {
      INFRASTRUCTURE: 0.95,    // transient infrastructure issues: 95% should auto-recover
      AUTHENTICATION: 0.50,    // credential/configuration issues: 50% auto-recovery acceptable
      LOGIC: 0.10              // permanent logic failures: little auto-recovery expected
    };

    for (const [category, target] of Object.entries(targets)) {
      const modes = failureModeRegistry.filter(m => m.category === category);

      for (const mode of modes) {
        const metrics = this.calculateRecoverySuccess(mode.id, '24h');

        if (metrics.successRate < target) {
          this.alertBelowTarget(mode, metrics, target);
        }
      }
    }
  }

  // Generate recovery improvement recommendations
  generateRecoveryReport(): RecoveryReport {
    const allMetrics = failureModeRegistry.map(m =>
      this.calculateRecoverySuccess(m.id, '7d')
    );

    return {
      summary: {
        overallSuccessRate: this.calculateOverallRate(allMetrics),
        totalFailures: allMetrics.reduce((sum, m) => sum + m.totalFailures, 0),
        autoRecoveryRate: this.calculateAutoRecoveryRate(allMetrics)
      },
      recommendations: this.generateRecommendations(allMetrics),
      modesNeedingImprovement: allMetrics
        .filter(m => m.successRate < 0.7)
        .sort((a, b) => a.successRate - b.successRate)
    };
  }
}

Related Concepts

  • Error recovery: Techniques and patterns for recovering from failures automatically or with minimal human intervention
  • Observability: System visibility enabling failure detection, diagnosis, and impact assessment
  • Fail-safes: Protective mechanisms that prevent failures from cascading or causing data corruption
  • Limitations and fallbacks: Recognition of system boundaries and graceful degradation strategies when limits are reached