Retries and backoff

Retries and backoff are strategies for handling transient failures by reattempting operations with increasing delays between attempts. In computer-use agents and agentic UI systems, these mechanisms are critical for managing temporary issues like network instability, slow-loading UI elements, and rate-limited APIs without overwhelming the system or degrading user experience.

Why it matters

Agent systems operate in inherently unreliable environments where failures are not exceptional—they're expected. Retries and backoff strategies transform brittle agents into resilient ones by distinguishing between permanent failures that require human intervention and transient failures that will resolve on their own.

Network flakiness

Network requests fail constantly due to packet loss, DNS resolution delays, or temporary connectivity issues. A computer-use agent attempting to fetch data from an API might encounter a timeout that resolves itself within milliseconds. Without retries, the agent would report failure and abort the task. With intelligent retry logic, the agent silently handles the hiccup and continues execution.

Modern cloud environments exhibit tail latencies where 99% of requests complete quickly but 1% take significantly longer. Retries help agents achieve high success rates despite these unpredictable delays.

UI loading delays

Web applications and desktop interfaces load asynchronously, with elements appearing at unpredictable times. An agent attempting to click a button immediately after navigation might fail because the DOM element hasn't rendered yet. Strategic retries with appropriate backoff periods give the interface time to stabilize before the agent declares failure.

Single-page applications (SPAs) present particular challenges, as content loads progressively and JavaScript frameworks manipulate the DOM dynamically. Agents need retry logic that accounts for these loading patterns without waiting unnecessarily when elements are already available.

API rate limits

Many services impose rate limits to protect infrastructure and ensure fair usage. When an agent exceeds these limits, the API returns HTTP 429 (Too Many Requests) responses. Without backoff strategies, the agent would continue hammering the endpoint, wasting resources and potentially triggering IP bans. Exponential backoff respects rate limits by progressively increasing wait times, allowing the agent to resume operations once the rate limit window resets.

Rate limit headers often indicate when capacity will be restored, enabling smart backoff implementations to wait exactly the required duration rather than guessing.

Concrete examples

Exponential backoff for API calls

import time
import random

def api_call_with_backoff(endpoint, max_retries=5):
    base_delay = 1  # Start with 1 second

    for attempt in range(max_retries):
        delay = base_delay * (2 ** attempt)
        try:
            response = make_api_request(endpoint)
            if response.status_code == 200:
                return response.json()
            elif response.status_code == 429:
                # Rate limited - honor the Retry-After header (in seconds) if present
                retry_after = int(response.headers.get('Retry-After', 0))
                if retry_after > 0:
                    delay = retry_after
            elif response.status_code >= 500:
                # Server error - fall through to exponential backoff
                pass
            else:
                # Client error - don't retry
                raise Exception(f"API error: {response.status_code}")
        except ConnectionError:
            # Network issue - fall through to exponential backoff
            pass

        if attempt < max_retries - 1:
            # Jitter spreads out retries from agents that failed at the same moment
            jitter = random.uniform(0, 0.1 * delay)
            time.sleep(delay + jitter)

    raise Exception(f"Failed after {max_retries} attempts")

This implementation demonstrates exponential backoff: the delay doubles with each retry (1s, 2s, 4s, 8s across the five attempts), and the final failure raises immediately rather than sleeping one more time. The added jitter prevents retry storms when multiple agents fail simultaneously.

DOM element waiting

async function waitForElement(
  selector: string,
  maxAttempts: number = 10,
  initialDelay: number = 100
): Promise<Element> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const element = document.querySelector(selector);

    if (element) {
      return element;
    }

    // Linear backoff for UI elements (100ms, 200ms, 300ms...)
    // Exponential would wait too long for fast-loading UIs
    const delay = initialDelay * (attempt + 1);
    await new Promise(resolve => setTimeout(resolve, delay));
  }

  throw new Error(`Element ${selector} not found after ${maxAttempts} attempts`);
}

For UI elements, linear backoff often works better than exponential because interface loading times are bounded. Waiting 32 seconds for a button that will never appear wastes agent execution time.

File upload retries

import os
import time

def upload_with_retry(file_path, upload_url, chunk_size=1024*1024):
    max_retries = 3
    file_size = os.path.getsize(file_path)
    uploaded_bytes = 0

    with open(file_path, 'rb') as f:
        while uploaded_bytes < file_size:
            chunk = f.read(chunk_size)

            for attempt in range(max_retries):
                try:
                    response = upload_chunk(
                        upload_url,
                        chunk,
                        offset=uploaded_bytes,
                        total_size=file_size
                    )
                    uploaded_bytes += len(chunk)
                    break  # Success - move to next chunk
                except UploadError:
                    if attempt == max_retries - 1:
                        raise  # Final attempt failed

                    # Exponential backoff between chunk retries; the chunk is
                    # already held in memory, so the file pointer stays put
                    delay = 2 ** attempt
                    time.sleep(delay)
File uploads benefit from retry logic at the chunk level rather than restarting the entire transfer. This approach combines retries with resumable uploads to handle intermittent network failures efficiently.

Common pitfalls

Infinite retry loops

The most dangerous pitfall is retrying permanently failed operations indefinitely. An agent attempting to click a button that doesn't exist will retry forever unless bounded by a maximum attempt count or timeout. This wastes compute resources and prevents the agent from reporting actionable errors.

Solution: Always enforce maximum retry limits and distinguish between retryable errors (HTTP 503, network timeouts) and permanent failures (HTTP 404, authentication errors).
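
A minimal sketch of that distinction, assuming failures surface either as HTTP status codes or as standard network exceptions (the exact code sets below are illustrative, not exhaustive):

RETRYABLE_STATUS_CODES = {429, 500, 502, 503, 504}   # transient - worth retrying
PERMANENT_STATUS_CODES = {400, 401, 403, 404}        # permanent - fail immediately

def is_retryable(status_code=None, exception=None):
    # Network-level failures (dropped connections, timeouts) are transient
    if exception is not None:
        return isinstance(exception, (ConnectionError, TimeoutError))
    if status_code in RETRYABLE_STATUS_CODES:
        return True
    if status_code in PERMANENT_STATUS_CODES:
        return False
    # Treat unknown 5xx codes as transient, everything else as permanent
    return status_code is not None and status_code >= 500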

Insufficient backoff

Retrying too quickly creates several problems. First, it wastes CPU and network bandwidth attempting operations that haven't had time to succeed. Second, rapid retries can trigger rate limiting or DDoS protection mechanisms, converting a transient failure into a permanent one.

Agents that retry API calls every 100ms after a rate limit will simply receive more rate limit responses, burning through their retry budget without progress.

Solution: Use exponential or linear backoff with minimum delays appropriate for the operation type. Network requests need longer backoffs (1-2 seconds) than UI element checks (100-500ms).
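
One way to honor those minimums is a per-operation backoff profile that the retry loop consults before sleeping; the names and values below are illustrative defaults rather than recommendations:

# Illustrative backoff profiles per operation type (delays in seconds)
BACKOFF_PROFILES = {
    "api_call":   {"strategy": "exponential", "base_delay": 1.0, "max_delay": 60.0},
    "ui_element": {"strategy": "linear",      "base_delay": 0.1, "max_delay": 2.0},
    "file_io":    {"strategy": "exponential", "base_delay": 0.5, "max_delay": 30.0},
}

def backoff_delay(operation_type, attempt):
    profile = BACKOFF_PROFILES[operation_type]
    if profile["strategy"] == "exponential":
        delay = profile["base_delay"] * (2 ** attempt)
    else:  # linear
        delay = profile["base_delay"] * (attempt + 1)
    return min(delay, profile["max_delay"])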

Retry storms

When multiple agents fail simultaneously—such as when a shared API endpoint experiences downtime—synchronized retries can create a thundering herd that overwhelms the recovering service. If 1,000 agents all wait exactly 30 seconds before retrying, they'll generate 1,000 simultaneous requests when the service comes back online.

Solution: Add random jitter to backoff delays. Instead of waiting exactly 2 seconds, wait 2 seconds plus a random value between 0 and 200ms. This spreads retries over time and prevents coordinated load spikes.

Cascading failures

Agents that retry aggressively can trigger cascading failures in downstream services. When Service A is slow, agents retry their requests, multiplying the load on Service A and causing it to slow down further. This positive feedback loop can crash entire systems.

Solution: Implement circuit breakers that stop retries when failure rates exceed thresholds, giving overwhelmed services time to recover.

Implementation

Backoff algorithms

Exponential backoff doubles the delay between retries, typically with a formula like delay = base_delay * (2 ^ attempt). This approach quickly spaces out retries, making it ideal for API calls and network operations where failures indicate capacity issues.

delays = [1 * (2 ** n) for n in range(5)]
# Results in: [1, 2, 4, 8, 16] seconds

Exponential backoff with jitter adds randomness to prevent synchronization:

import random

def exponential_backoff_with_jitter(attempt, base_delay=1, max_delay=60):
    delay = min(base_delay * (2 ** attempt), max_delay)
    jitter = random.uniform(0, delay * 0.1)  # 10% jitter
    return delay + jitter

Linear backoff increases delays by a constant amount, useful for UI operations:

delays = [0.1 * (n + 1) for n in range(5)]
# Results in: [0.1, 0.2, 0.3, 0.4, 0.5] seconds

Decorrelated jitter (recommended by AWS) bases each delay on the previous one rather than on the attempt number, which spreads retries more evenly:

def decorrelated_jitter(previous_delay, base_delay=1, max_delay=60):
    # Each delay is drawn between the base delay and 3x the previous delay;
    # seed the first call with previous_delay=base_delay, then feed each result back in
    return min(max_delay, random.uniform(base_delay, previous_delay * 3))

Circuit breakers

Circuit breakers prevent retry storms by monitoring failure rates and "opening" the circuit when failures exceed a threshold. Once open, requests fail fast without attempting retries, giving the downstream system time to recover.

import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.opened_at = None
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            # Check if timeout has elapsed
            if time.time() - self.opened_at > self.timeout:
                self.state = "HALF_OPEN"
            else:
                raise Exception("Circuit breaker is OPEN")

        try:
            result = func(*args, **kwargs)
            # Success - reset failure count
            if self.state == "HALF_OPEN":
                self.state = "CLOSED"
            self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1

            if self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.time()

            raise
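
As a usage sketch, the breaker wraps the retrying call itself, so an open circuit fails fast instead of sitting through another backoff cycle. This reuses api_call_with_backoff from the earlier example; report_failure is a hypothetical hook for surfacing the error to the agent's planner.

breaker = CircuitBreaker(failure_threshold=5, timeout=60)

def fetch_data(endpoint):
    try:
        return breaker.call(api_call_with_backoff, endpoint)
    except Exception as e:
        # Circuit open or retries exhausted - report instead of retrying further
        report_failure(endpoint, e)  # hypothetical error-reporting hook
        return None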

Retry budgets

Retry budgets limit the total number of retries across all operations in a time window, preventing agents from spending excessive time on retries when success rates are low.

import time

class RetryBudget:
    def __init__(self, budget_per_minute=30):
        self.budget = budget_per_minute
        self.used = 0
        self.window_start = time.time()

    def can_retry(self):
        # Reset budget every minute
        if time.time() - self.window_start > 60:
            self.used = 0
            self.window_start = time.time()

        return self.used < self.budget

    def use_retry(self):
        if not self.can_retry():
            raise Exception("Retry budget exhausted")
        self.used += 1
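
A usage sketch, reusing exponential_backoff_with_jitter from above and assuming the budget object is shared across the agent's operations (run_with_budget and its callable argument are illustrative):

budget = RetryBudget(budget_per_minute=30)

def run_with_budget(operation, max_retries=3):
    for attempt in range(max_retries):
        try:
            return operation()
        except ConnectionError:
            # Give up if this was the last attempt or the shared budget is spent
            if attempt == max_retries - 1 or not budget.can_retry():
                raise
            budget.use_retry()
            time.sleep(exponential_backoff_with_jitter(attempt))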

Key metrics to track

Monitoring retry behavior is essential for tuning agent reliability and identifying systemic issues.

Retry success rate

The percentage of operations that succeed after retrying, calculated as:

retry_success_rate = (successful_retries / total_retries) * 100

A high retry success rate (> 80%) indicates that retries effectively handle transient failures. Low rates (< 50%) suggest either insufficient backoff delays or permanent failures being retried unnecessarily.

Mean retries per task

The average number of retry attempts per agent task:

mean_retries = total_retry_attempts / total_tasks_completed

This metric should remain low (< 2) for healthy systems. Values above 3-4 indicate systemic reliability issues requiring investigation. Track this metric per operation type (API calls, UI interactions, file operations) to identify specific problem areas.
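
A minimal tracking sketch, assuming the agent records one event per retry attempt and one per completed task (the class and method names are illustrative); it yields the retry success rate per operation type and the overall mean retries per task:

from collections import defaultdict

class RetryMetrics:
    def __init__(self):
        self.retries = defaultdict(int)             # retry attempts per operation type
        self.successful_retries = defaultdict(int)  # retries that ended in success
        self.tasks_completed = 0

    def record_retry(self, operation_type, succeeded):
        self.retries[operation_type] += 1
        if succeeded:
            self.successful_retries[operation_type] += 1

    def record_task_completed(self):
        self.tasks_completed += 1

    def retry_success_rate(self, operation_type):
        total = self.retries[operation_type]
        return 100.0 * self.successful_retries[operation_type] / total if total else 0.0

    def mean_retries_per_task(self):
        return sum(self.retries.values()) / max(self.tasks_completed, 1)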

P95 and P99 retry latency

The 95th and 99th percentile time spent retrying operations:

p95_retry_latency = percentile(retry_durations, 95)

High P99 latency (> 30 seconds) suggests agents are waiting too long for failing operations. This often indicates either overly generous maximum retry counts or operations that should fail fast being retried aggressively.

Circuit breaker trips

The number of times circuit breakers open, indicating repeated failures:

circuit_breaker_trip_rate = trips_per_hour

Frequent trips (> 10/hour) signal downstream service instability or capacity issues. This metric serves as an early warning system for infrastructure problems before they affect agent success rates significantly.

Retry budget exhaustion rate

The percentage of time windows where agents exhaust their retry budget:

budget_exhaustion_rate = (windows_with_exhaustion / total_windows) * 100

Values above 5% suggest either insufficient retry budgets or excessive failure rates requiring architectural fixes rather than more retries.

Related concepts

Understanding retries and backoff requires familiarity with complementary reliability patterns:

  • Selector stability: Stable DOM selectors reduce the need for retries by minimizing false negatives when locating UI elements
  • Latency SLO: Service level objectives inform appropriate retry timeouts and maximum attempt counts
  • Limitations and fallbacks: Fallback strategies handle cases where retries fail after exhausting all attempts
  • Error recovery: Broader error handling patterns that complement retry mechanisms for comprehensive agent resilience