Error Recovery
Error recovery refers to the strategies and mechanisms for detecting failures and restoring agents to a functional state. In agentic systems, robust error recovery ensures that temporary failures, unexpected conditions, or partial execution problems don't cascade into complete system breakdowns. Effective recovery mechanisms distinguish production-ready agents from brittle prototypes.
Why Error Recovery Matters
Error recovery is fundamental to building reliable agentic systems for several critical reasons:
Task Completion Rates: Agents operating in real-world environments encounter failures constantly: network timeouts, rate limits, element-not-found errors, permission issues. Without recovery mechanisms, a single transient error terminates the entire task. Effective recovery can improve completion rates from 60-70% to 95%+ by handling expected failure modes gracefully.
User Experience: Users expect systems to handle problems intelligently. When an agent encounters an error, transparent recovery with clear status updates maintains trust. Silent failures or cryptic error messages erode confidence. Recovery mechanisms that explain what went wrong and what action was taken transform frustrating failures into reassuring demonstrations of robustness.
System Resilience: Agentic systems often orchestrate complex multi-step workflows involving external services, UI interactions, and state management. Each integration point represents a potential failure mode. Recovery strategies prevent localized failures from propagating through the system, maintaining overall availability even when individual components experience issues.
Cost Efficiency: Failed agent executions waste computational resources, API credits, and human time. Recovery mechanisms that checkpoint progress and resume from the last known good state avoid repeating expensive operations. For agents using premium LLM APIs, effective recovery can reduce costs by 30-50% by avoiding re-execution of the entire task from the beginning.
Concrete Examples
Automatic Retries with Exponential Backoff
When an agent encounters a transient error like a network timeout or rate limit, immediate retry often fails again. Exponential backoff introduces increasing delays between attempts:
async function executeWithRetry(action: () => Promise<void>, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await action();
      return;
    } catch (error) {
      if (attempt === maxAttempts) throw error;
      const delay = Math.pow(2, attempt) * 1000; // 2s, 4s, 8s
      console.log(`Attempt ${attempt} failed, retrying in ${delay}ms`);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}
This pattern handles temporary API unavailability, network glitches, and rate limiting without manual intervention.
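A refinement worth noting: not every error deserves a retry. The sketch below, a variation on the function above, only retries failures it can classify as transient; the optional numeric status field and the specific status codes are assumptions for illustration, not the contract of any particular SDK.
const TRANSIENT_STATUS_CODES = new Set([408, 429, 500, 502, 503, 504]);

// Assumed error shape: many HTTP clients attach a numeric status to thrown errors.
function isTransientError(error: unknown): boolean {
  const status = (error as { status?: number })?.status;
  return status !== undefined && TRANSIENT_STATUS_CODES.has(status);
}

async function executeWithSelectiveRetry(action: () => Promise<void>, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await action();
      return;
    } catch (error) {
      // Permanent failures (bad input, missing permissions) surface immediately
      if (!isTransientError(error) || attempt === maxAttempts) throw error;
      await new Promise(resolve => setTimeout(resolve, Math.pow(2, attempt) * 1000));
    }
  }
}
Classifying errors up front keeps the retry budget focused on failures that waiting can actually fix.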
Checkpoint Restoration
For long-running agent workflows, checkpoints enable recovery without restarting from the beginning:
interface WorkflowCheckpoint {
  step: number;
  completedActions: string[];
  currentState: Record<string, unknown>;
  timestamp: number;
}
async function executeWorkflow(steps: Step[], checkpoint?: WorkflowCheckpoint) {
  const startIndex = checkpoint?.step ?? 0;
  let state = checkpoint?.currentState ?? {};
  for (let i = startIndex; i < steps.length; i++) {
    try {
      state = await steps[i].execute(state);
      // Save checkpoint after each step
      await saveCheckpoint({
        step: i + 1,
        completedActions: steps.slice(0, i + 1).map(s => s.name),
        currentState: state,
        timestamp: Date.now()
      });
    } catch (error) {
      console.error(`Step ${i} failed:`, error);
      // Attach a checkpoint reflecting actual progress so the caller can resume from step i
      throw new RecoverableError(`Can resume from step ${i}`, {
        step: i,
        completedActions: steps.slice(0, i).map(s => s.name),
        currentState: state,
        timestamp: Date.now()
      });
    }
  }
}
If the workflow fails at step 7 of 10, recovery resumes from step 7 with the preserved state, avoiding redundant work.
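A resume entry point might look like the following sketch; loadCheckpoint is an assumed helper mirroring the saveCheckpoint call above, returning the most recent WorkflowCheckpoint for a task or undefined if none exists.
// Hypothetical resume path: look up the latest checkpoint and hand it to executeWorkflow.
async function resumeWorkflow(taskId: string, steps: Step[]) {
  const checkpoint = await loadCheckpoint(taskId); // assumed counterpart to saveCheckpoint
  if (checkpoint) {
    console.log(`Resuming at step ${checkpoint.step}, ${checkpoint.completedActions.length} steps already complete`);
  }
  return executeWorkflow(steps, checkpoint);
}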
Alternative Path Exploration
When primary approaches fail, intelligent agents explore alternative strategies:
async function clickElement(selector: string) {
  const strategies = [
    () => page.click(selector),
    () => page.evaluate(sel => document.querySelector<HTMLElement>(sel)?.click(), selector),
    () => page.$eval(selector, el => el.dispatchEvent(new Event('click', { bubbles: true }))),
    () => findSimilarElementAndClick(selector)
  ];
  for (const [index, strategy] of strategies.entries()) {
    try {
      await strategy();
      if (index > 0) {
        console.log(`Primary method failed, succeeded with fallback ${index}`);
      }
      return;
    } catch (error) {
      if (index === strategies.length - 1) throw error;
    }
  }
}
This graceful degradation tries multiple approaches before declaring failure, significantly improving success rates in dynamic UIs.
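The final fallback above, findSimilarElementAndClick, was left undefined. One plausible sketch, assuming a Puppeteer-style page object and a selector built around an id or class token, relaxes the match to elements whose id or class merely contains that token, which can survive minor markup changes such as hashed class suffixes:
// Illustrative fuzzy fallback, not a standard Puppeteer API.
async function findSimilarElementAndClick(selector: string) {
  const token = selector.replace(/^[#.]/, ''); // assumes "#id" or ".class" style selectors
  const clicked = await page.evaluate((t) => {
    const candidates = Array.from(document.querySelectorAll<HTMLElement>('*'));
    const match = candidates.find(
      el => el.id.includes(t) || (el.getAttribute('class') ?? '').includes(t)
    );
    if (match) {
      match.click();
      return true;
    }
    return false;
  }, token);
  if (!clicked) throw new Error(`No similar element found for ${selector}`);
}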
Common Pitfalls
Infinite Recovery Loops
Without proper termination conditions, recovery mechanisms can create infinite loops:
// PROBLEMATIC: No maximum attempts
while (true) {
  try {
    await performAction();
    break;
  } catch (error) {
    console.log("Retrying...");
    // Infinite loop if action always fails
  }
}
// BETTER: Bounded retries with escalation
let attempts = 0;
const MAX_ATTEMPTS = 5;
while (attempts < MAX_ATTEMPTS) {
  try {
    await performAction();
    break;
  } catch (error) {
    attempts++;
    if (attempts >= MAX_ATTEMPTS) {
      await escalateToHuman(error);
      throw new UnrecoverableError("Max recovery attempts exceeded");
    }
    await backoff(attempts);
  }
}
Always implement maximum retry limits and escalation paths for genuinely unrecoverable errors.
State Corruption During Recovery
Recovery attempts that don't properly restore or validate state can leave systems in inconsistent conditions:
// PROBLEMATIC: Partial state updates
try {
  await updateDatabase(newData);
  await updateCache(newData);
  await notifySubscribers(newData);
} catch (error) {
  // Database updated but cache and subscribers not updated
  // State is now inconsistent
}
// BETTER: Transactional recovery with rollback
const transaction = await beginTransaction();
try {
  await transaction.updateDatabase(newData);
  await transaction.updateCache(newData);
  await transaction.notifySubscribers(newData);
  await transaction.commit();
} catch (error) {
  await transaction.rollback();
  console.log("All changes rolled back, state remains consistent");
  throw error;
}
Use transactional patterns and state validation to ensure recovery doesn't introduce new problems.
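Even with transactions, it pays to validate state explicitly after a recovery completes rather than assume the rollback left everything consistent. A minimal sketch, where the version-reading and rebuild helpers are hypothetical stand-ins for whatever consistency checks your system supports:
// Post-recovery invariant check built on hypothetical helpers.
async function validateStateAfterRecovery(): Promise<void> {
  const dbVersion = await readDatabaseVersion();   // assumed helper
  const cacheVersion = await readCacheVersion();   // assumed helper
  if (dbVersion !== cacheVersion) {
    // Rebuild derived state from the source of truth instead of continuing
    // with data the failed update may have left stale.
    await rebuildCacheFromDatabase();              // assumed helper
  }
}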
Cascading Failures
Recovery mechanisms that trigger additional operations can amplify rather than resolve problems:
// PROBLEMATIC: Recovery triggers more expensive operations
async function fetchData() {
  try {
    return await quickAPI.getData();
  } catch (error) {
    // Fallback to expensive operation during high load
    return await expensiveAPI.getDataWithFullScan();
  }
}
// BETTER: Circuit breaker prevents cascading load
class CircuitBreaker {
  private failures = 0;
  private isOpen = false;

  async execute(operation: () => Promise<any>) {
    if (this.isOpen) {
      throw new Error("Circuit breaker open, operation blocked");
    }
    try {
      const result = await operation();
      this.failures = 0; // Reset on success
      return result;
    } catch (error) {
      this.failures++;
      if (this.failures >= 5) {
        this.isOpen = true;
        setTimeout(() => this.isOpen = false, 60000); // Allow requests again after 1 min
      }
      throw error;
    }
  }
}
Circuit breakers prevent recovery attempts from overwhelming already-struggling systems.
Implementation Strategies
Hierarchical Recovery Strategies
Implement recovery at multiple levels with increasing intervention:
enum RecoveryLevel {
  AUTOMATIC_RETRY = 1,      // Silent retry with backoff
  ALTERNATIVE_METHOD = 2,   // Try different approach
  CHECKPOINT_RESTORE = 3,   // Restore from saved state
  GRACEFUL_DEGRADATION = 4, // Partial functionality
  HUMAN_ESCALATION = 5      // Require human intervention
}
class RecoveryOrchestrator {
  async executeWithRecovery(task: Task) {
    let currentLevel = RecoveryLevel.AUTOMATIC_RETRY;
    while (currentLevel <= RecoveryLevel.HUMAN_ESCALATION) {
      try {
        switch (currentLevel) {
          case RecoveryLevel.AUTOMATIC_RETRY:
            return await this.retryWithBackoff(task);
          case RecoveryLevel.ALTERNATIVE_METHOD:
            return await this.tryAlternativeApproach(task);
          case RecoveryLevel.CHECKPOINT_RESTORE:
            return await this.restoreFromCheckpoint(task);
          case RecoveryLevel.GRACEFUL_DEGRADATION:
            return await this.partialExecution(task);
          case RecoveryLevel.HUMAN_ESCALATION:
            return await this.escalateToHuman(task);
        }
      } catch (error) {
        console.log(`Recovery level ${currentLevel} failed, escalating`);
        currentLevel++;
      }
    }
    throw new UnrecoverableError("All recovery strategies exhausted");
  }
}
This layered approach maximizes automatic recovery while ensuring genuine problems reach human attention.
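Most of these levels map onto patterns shown elsewhere in this section; graceful degradation is the one not yet illustrated. A minimal sketch of what partialExecution might do, assuming a task that distinguishes required from optional steps (that split is an assumption, not part of the Task type above):
// Hypothetical graceful-degradation step: run only the required steps and
// report which optional ones were skipped, so the caller can surface the gap.
async function partialExecution(task: { required: Step[]; optional: Step[] }) {
  let state: Record<string, unknown> = {};
  for (const step of task.required) {
    state = await step.execute(state); // required steps must still succeed
  }
  return {
    state,
    degraded: true,
    skippedSteps: task.optional.map(s => s.name)
  };
}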
State Management for Recovery
Maintain sufficient state information to enable intelligent recovery decisions:
interface ExecutionContext {
  taskId: string;
  startTime: number;
  currentStep: string;
  completedSteps: string[];
  failureHistory: FailureRecord[];
  environmentSnapshot: EnvironmentState;
  recoveryAttempts: number;
}
interface FailureRecord {
  step: string;
  errorType: string;
  errorMessage: string;
  timestamp: number;
  recoveryStrategy: RecoveryStrategy;
  successful: boolean;
}
class StatefulRecovery {
  private context: ExecutionContext;

  async recordFailure(error: Error, strategy: RecoveryStrategy, successful: boolean) {
    this.context.failureHistory.push({
      step: this.context.currentStep,
      errorType: error.constructor.name,
      errorMessage: error.message,
      timestamp: Date.now(),
      recoveryStrategy: strategy,
      successful
    });
    // Detect repeated failures
    const recentFailures = this.context.failureHistory.filter(
      f => f.timestamp > Date.now() - 60000 && !f.successful
    );
    if (recentFailures.length >= 5) {
      throw new UnrecoverableError("Too many failures in short period");
    }
  }

  getOptimalRecoveryStrategy(error: Error): RecoveryStrategy {
    // Learn from history
    const similarPastFailures = this.context.failureHistory.filter(
      f => f.errorType === error.constructor.name
    );
    const successfulStrategies = similarPastFailures.filter(f => f.successful);
    if (successfulStrategies.length > 0) {
      // Use previously successful strategy
      return successfulStrategies[0].recoveryStrategy;
    }
    return this.getDefaultStrategy(error);
  }
}
Rich context enables adaptive recovery that learns from past failures.
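Both RecoveryStrategy and getDefaultStrategy are left abstract above. One illustrative way to fill them in, sketched here as a free function for brevity, where the strategy names and error class names are assumptions rather than a standard taxonomy:
// Illustrative strategy type and default mapping; adjust names to your system.
type RecoveryStrategy = 'retry_with_backoff' | 'alternative_method' | 'checkpoint_restore' | 'escalate';

function getDefaultStrategy(error: Error): RecoveryStrategy {
  switch (error.constructor.name) {
    case 'TimeoutError':
    case 'RateLimitError':
      return 'retry_with_backoff';   // transient: waiting usually helps
    case 'ElementNotFoundError':
      return 'alternative_method';   // try a different locator or interaction path
    case 'StateCorruptionError':
      return 'checkpoint_restore';   // roll back to the last known good state
    default:
      return 'escalate';             // unknown errors go to a human
  }
}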
Circuit Breaker Pattern
Prevent repeated attempts to operations likely to fail:
interface CircuitBreakerConfig {
  failureThreshold: number;  // Open after N failures
  resetTimeout: number;      // Try again after N ms
  monitoringWindow: number;  // Consider failures in last N ms
}
class CircuitBreaker {
  private state: 'CLOSED' | 'OPEN' | 'HALF_OPEN' = 'CLOSED';
  private failures: number[] = [];
  private nextAttemptTime: number = 0;

  constructor(private config: CircuitBreakerConfig) {}

  async execute<T>(operation: () => Promise<T>): Promise<T> {
    if (this.state === 'OPEN') {
      if (Date.now() < this.nextAttemptTime) {
        throw new Error('Circuit breaker open, operation blocked');
      }
      this.state = 'HALF_OPEN';
    }
    try {
      const result = await operation();
      if (this.state === 'HALF_OPEN') {
        this.state = 'CLOSED';
        this.failures = [];
      }
      return result;
    } catch (error) {
      this.recordFailure();
      const recentFailures = this.failures.filter(
        t => t > Date.now() - this.config.monitoringWindow
      );
      if (recentFailures.length >= this.config.failureThreshold) {
        this.state = 'OPEN';
        this.nextAttemptTime = Date.now() + this.config.resetTimeout;
      }
      throw error;
    }
  }

  private recordFailure() {
    this.failures.push(Date.now());
  }
}
// Usage
const apiCircuitBreaker = new CircuitBreaker({
  failureThreshold: 5,
  resetTimeout: 60000,     // 1 minute
  monitoringWindow: 120000 // 2 minutes
});
await apiCircuitBreaker.execute(() => externalAPI.call());
Circuit breakers protect both your system and downstream services from cascading failures.
Key Metrics
Measuring recovery effectiveness requires tracking several critical metrics:
Recovery Success Rate: The percentage of errors successfully recovered without human intervention. Calculate as:
Recovery Success Rate = (Successful Recoveries / Total Errors) × 100
Target: > 85% for production systems. Low rates indicate either unpredictable environments or insufficient recovery strategies.
Mean Time to Recovery (MTTR): Average time from error detection to successful recovery. Measure separately for automatic and manual recovery:
MTTR = Σ(Recovery Time) / Number of Recovery Events
Target: < 5 seconds for automatic recovery, < 30 minutes for human-escalated issues. High MTTR suggests slow detection or inefficient recovery procedures.
Error Escalation Rate: Percentage of errors requiring human intervention:
Error Escalation Rate = (Human-Escalated Errors / Total Errors) × 100
Target: < 15% for mature systems. High escalation rates indicate gaps in automatic recovery coverage.
Recovery Attempt Distribution: Histogram showing how many recovery attempts each error required:
- 1 attempt: 70% of errors
- 2 attempts: 20% of errors
- 3 attempts: 7% of errors
- 4+ attempts: 3% of errors
Most errors should resolve on first or second attempt. High multi-attempt rates suggest suboptimal retry strategies.
State Consistency Score: Percentage of recovery events that maintain valid system state:
State Consistency Score = (Valid States After Recovery / Total Recoveries) × 100
Target: 100%. Any value < 100% indicates state corruption issues requiring immediate attention.
Recovery Cost Ratio: Computational cost of recovery relative to normal execution:
Recovery Cost Ratio = Total Recovery Time / Total Successful Execution Time
Target: < 0.3. High ratios suggest recovery mechanisms are too expensive or errors too frequent.
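These metrics fall out directly from logged recovery events. A minimal sketch, reusing the FailureRecord shape from the state-management example above and assuming an additional escalated flag marks records that reached a human:
// Computes two of the headline metrics from a log of recovery attempts.
function computeRecoveryMetrics(records: (FailureRecord & { escalated?: boolean })[]) {
  const totalErrors = records.length;
  if (totalErrors === 0) {
    return { recoverySuccessRate: 100, escalationRate: 0 };
  }
  const recovered = records.filter(r => r.successful).length;
  const escalated = records.filter(r => r.escalated).length;
  return {
    recoverySuccessRate: (recovered / totalErrors) * 100, // target > 85%
    escalationRate: (escalated / totalErrors) * 100       // target < 15%
  };
}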
Related Concepts
Error recovery operates within a broader ecosystem of reliability patterns:
- Retries and Backoff: Specific timing strategies for recovery attempts, controlling when and how often to retry failed operations
- Failure Modes: Understanding the types of failures that require different recovery approaches
- Fail-safes: Preventive mechanisms that reduce the need for recovery by avoiding errors proactively
- Rollback/Undo: Reversing partial changes to restore consistent state after failures
Effective error recovery combines all these concepts into cohesive reliability strategies that keep agentic systems running despite inevitable failures.