Retries and backoff
Retries and backoff are strategies for handling transient failures by reattempting operations with increasing delays between attempts. In computer-use agents and agentic UI systems, these mechanisms are critical for managing temporary issues like network instability, slow-loading UI elements, and rate-limited APIs without overwhelming the system or degrading user experience.
Why it matters
Agent systems operate in inherently unreliable environments where failures are not exceptional—they're expected. Retries and backoff strategies transform brittle agents into resilient ones by distinguishing between permanent failures that require human intervention and transient failures that will resolve on their own.
Network flakiness
Network requests fail routinely due to packet loss, DNS resolution delays, or temporary connectivity issues. A computer-use agent fetching data from an API might hit a timeout caused by a momentary glitch that clears within milliseconds. Without retries, the agent would report failure and abort the task. With intelligent retry logic, the agent silently handles the hiccup and continues execution.
Modern cloud environments exhibit tail latencies where 99% of requests complete quickly but 1% take significantly longer. Retries help agents achieve high success rates despite these unpredictable delays.
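As a rough illustration, assume attempts are independent and each has a 1% chance of exceeding its timeout; the chance that every attempt lands in the slow tail shrinks geometrically with each retry (the numbers below are illustrative only):

p_fail = 0.01  # illustrative per-attempt chance of hitting the slow tail
for attempts in range(1, 4):
    print(f"{attempts} attempt(s): {p_fail ** attempts:.4%} chance all attempts are slow")
# 1 attempt(s): 1.0000%, 2 attempt(s): 0.0100%, 3 attempt(s): 0.0001%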
UI loading delays
Web applications and desktop interfaces load asynchronously, with elements appearing at unpredictable times. An agent attempting to click a button immediately after navigation might fail because the DOM element hasn't rendered yet. Strategic retries with appropriate backoff periods give the interface time to stabilize before the agent declares failure.
Single-page applications (SPAs) present particular challenges, as content loads progressively and JavaScript frameworks manipulate the DOM dynamically. Agents need retry logic that accounts for these loading patterns without waiting unnecessarily when elements are already available.
API rate limits
Many services impose rate limits to protect infrastructure and ensure fair usage. When an agent exceeds these limits, the API returns HTTP 429 (Too Many Requests) responses. Without backoff strategies, the agent would continue hammering the endpoint, wasting resources and potentially triggering IP bans. Exponential backoff respects rate limits by progressively increasing wait times, allowing the agent to resume operations once the rate limit window resets.
Rate limit headers often indicate when capacity will be restored, enabling smart backoff implementations to wait exactly the required duration rather than guessing.
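A minimal sketch of that idea, assuming a requests-style headers mapping; Retry-After is a standard HTTP header, while names like X-RateLimit-Reset are common conventions that vary by provider:

import time

def delay_from_rate_limit_headers(headers, fallback_delay):
    # Prefer an explicit server hint over a computed backoff
    if "Retry-After" in headers:
        return float(headers["Retry-After"])  # seconds (may also be an HTTP date, not handled here)
    if "X-RateLimit-Reset" in headers:
        # Commonly a Unix timestamp marking when the rate limit window resets
        return max(0.0, float(headers["X-RateLimit-Reset"]) - time.time())
    return fallback_delay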
Concrete examples
Exponential backoff for API calls
import time
import random

def api_call_with_backoff(endpoint, max_retries=5):
    base_delay = 1  # Start with 1 second
    for attempt in range(max_retries):
        try:
            response = make_api_request(endpoint)
            if response.status_code == 200:
                return response.json()
            elif response.status_code == 429:
                # Rate limited - use Retry-After header if available
                retry_after = int(response.headers.get('Retry-After', 0))
                delay = retry_after if retry_after > 0 else base_delay * (2 ** attempt)
                jitter = random.uniform(0, 0.1 * delay)
                time.sleep(delay + jitter)
            elif response.status_code >= 500:
                # Server error - exponential backoff
                delay = base_delay * (2 ** attempt)
                jitter = random.uniform(0, 0.1 * delay)
                time.sleep(delay + jitter)
            else:
                # Client error - don't retry
                raise Exception(f"API error: {response.status_code}")
        except ConnectionError:
            # Network issue - retry with backoff
            delay = base_delay * (2 ** attempt)
            jitter = random.uniform(0, 0.1 * delay)
            time.sleep(delay + jitter)
    raise Exception(f"Failed after {max_retries} attempts")
This implementation demonstrates exponential backoff where delays double with each retry (1s, 2s, 4s, 8s, 16s). The addition of jitter prevents retry storms when multiple agents fail simultaneously.
DOM element waiting
async function waitForElement(
  selector: string,
  maxAttempts: number = 10,
  initialDelay: number = 100
): Promise<Element> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const element = document.querySelector(selector);
    if (element) {
      return element;
    }
    // Linear backoff for UI elements (100ms, 200ms, 300ms...)
    // Exponential would wait too long for fast-loading UIs
    const delay = initialDelay * (attempt + 1);
    await new Promise(resolve => setTimeout(resolve, delay));
  }
  throw new Error(`Element ${selector} not found after ${maxAttempts} attempts`);
}
For UI elements, linear backoff often works better than exponential because interface loading times are bounded. Waiting 32 seconds for a button that will never appear wastes agent execution time.
File upload retries
import os
import time

def upload_with_retry(file_path, upload_url, chunk_size=1024 * 1024):
    max_retries = 3
    file_size = os.path.getsize(file_path)
    uploaded_bytes = 0
    with open(file_path, 'rb') as f:
        while uploaded_bytes < file_size:
            chunk = f.read(chunk_size)
            for attempt in range(max_retries):
                try:
                    response = upload_chunk(  # assumed helper that sends one chunk
                        upload_url,
                        chunk,
                        offset=uploaded_bytes,
                        total_size=file_size
                    )
                    uploaded_bytes += len(chunk)
                    break  # Success - move to next chunk
                except UploadError:
                    if attempt == max_retries - 1:
                        raise  # Final attempt failed
                    # Exponential backoff between chunk retries; the chunk is
                    # still in memory, so there is no need to re-read it from disk
                    delay = 2 ** attempt
                    time.sleep(delay)
File uploads benefit from retry logic at the chunk level rather than restarting the entire transfer. This approach combines retries with resumable uploads to handle intermittent network failures efficiently.
Common pitfalls
Infinite retry loops
The most dangerous pitfall is retrying permanently failed operations indefinitely. An agent attempting to click a button that doesn't exist will retry forever unless bounded by a maximum attempt count or timeout. This wastes compute resources and prevents the agent from reporting actionable errors.
Solution: Always enforce maximum retry limits and distinguish between retryable errors (HTTP 503, network timeouts) and permanent failures (HTTP 404, authentication errors).
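One way to make that distinction explicit is a small classification helper; the status-code sets below mirror the examples above and should be adjusted for the specific APIs involved:

RETRYABLE_STATUS = {429, 500, 502, 503, 504}  # rate limits and server errors
PERMANENT_STATUS = {400, 401, 403, 404}       # bad requests, auth failures, missing resources

def is_retryable(status_code=None, exception=None):
    # Network-level failures (timeouts, dropped connections) are worth retrying
    if exception is not None:
        return isinstance(exception, (ConnectionError, TimeoutError))
    if status_code in PERMANENT_STATUS:
        return False
    return status_code in RETRYABLE_STATUS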
Insufficient backoff
Retrying too quickly creates several problems. First, it wastes CPU and network bandwidth on attempts made before the underlying condition has had time to clear. Second, rapid retries can trigger rate limiting or DDoS protection mechanisms, converting a transient failure into a permanent one.
Agents that retry API calls every 100ms after a rate limit will simply receive more rate limit responses, burning through their retry budget without progress.
Solution: Use exponential or linear backoff with minimum delays appropriate for the operation type. Network requests need longer backoffs (1-2 seconds) than UI element checks (100-500ms).
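One way to encode that guidance is a small per-operation policy table; the names and values below are illustrative starting points rather than fixed constants:

# Starting delays per operation type, in seconds
BACKOFF_POLICY = {
    "api_call":   {"base_delay": 1.0, "strategy": "exponential"},
    "ui_element": {"base_delay": 0.1, "strategy": "linear"},
}

def next_delay(operation, attempt):
    policy = BACKOFF_POLICY[operation]
    if policy["strategy"] == "exponential":
        return policy["base_delay"] * (2 ** attempt)
    return policy["base_delay"] * (attempt + 1)  # linear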
Retry storms
When multiple agents fail simultaneously—such as when a shared API endpoint experiences downtime—synchronized retries can create a thundering herd that overwhelms the recovering service. If 1,000 agents all wait exactly 30 seconds before retrying, they'll generate 1,000 simultaneous requests when the service comes back online.
Solution: Add random jitter to backoff delays. Instead of waiting exactly 2 seconds, wait 2 seconds plus a random value between 0 and 200ms. This spreads retries over time and prevents coordinated load spikes.
Cascading failures
Agents that retry aggressively can trigger cascading failures in downstream services. When Service A is slow, agents retry their requests, multiplying the load on Service A and causing it to slow down further. This positive feedback loop can crash entire systems.
Solution: Implement circuit breakers that stop retries when failure rates exceed thresholds, giving overwhelmed services time to recover.
Implementation
Backoff algorithms
Exponential backoff doubles the delay between retries, typically with a formula like delay = base_delay * (2 ^ attempt). This approach quickly spaces out retries, making it ideal for API calls and network operations where failures indicate capacity issues.
delays = [1 * (2 ** n) for n in range(5)]
# Results in: [1, 2, 4, 8, 16] seconds
Exponential backoff with jitter adds randomness to prevent synchronization:
import random

def exponential_backoff_with_jitter(attempt, base_delay=1, max_delay=60):
    delay = min(base_delay * (2 ** attempt), max_delay)
    jitter = random.uniform(0, delay * 0.1)  # 10% jitter
    return delay + jitter
Linear backoff increases delays by a constant amount, useful for UI operations:
delays = [0.1 * (n + 1) for n in range(5)]
# Results in: [0.1, 0.2, 0.3, 0.4, 0.5] seconds
Decorrelated jitter (AWS recommendation) provides better distribution:
def decorrelated_jitter(base_delay=1, max_delay=60, previous_delay=None):
    # Each delay is drawn relative to the previous delay rather than the attempt number
    previous_delay = previous_delay or base_delay
    return min(max_delay, random.uniform(base_delay, previous_delay * 3))
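Because each delay depends on the previous one rather than on the attempt count, the previous value has to be carried between iterations. A minimal sketch, assuming a hypothetical try_operation() that returns True on success:

import time

delay = None
for attempt in range(5):
    if try_operation():  # hypothetical operation; returns True on success
        break
    delay = decorrelated_jitter(previous_delay=delay)
    time.sleep(delay)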
Circuit breakers
Circuit breakers prevent retry storms by monitoring failure rates and "opening" the circuit when failures exceed a threshold. Once open, requests fail fast without attempting retries, giving the downstream system time to recover.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.opened_at = None
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            # Check if timeout has elapsed
            if time.time() - self.opened_at > self.timeout:
                self.state = "HALF_OPEN"
            else:
                raise Exception("Circuit breaker is OPEN")
        try:
            result = func(*args, **kwargs)
            # Success - close the circuit and reset the failure count
            if self.state == "HALF_OPEN":
                self.state = "CLOSED"
            self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.state = "OPEN"
                self.opened_at = time.time()
            raise
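A minimal usage sketch, reusing the make_api_request placeholder from the earlier example: the breaker wraps the call, and anything raised while the circuit is open is treated as a fast failure rather than another retry.

breaker = CircuitBreaker(failure_threshold=5, timeout=60)

def guarded_api_call(endpoint):
    try:
        return breaker.call(make_api_request, endpoint)
    except Exception:
        # While the breaker is OPEN this fails immediately, giving the
        # downstream service room to recover instead of piling on retries
        return None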
Retry budgets
Retry budgets limit the total number of retries across all operations in a time window, preventing agents from spending excessive time on retries when success rates are low.
import time

class RetryBudget:
    def __init__(self, budget_per_minute=30):
        self.budget = budget_per_minute
        self.used = 0
        self.window_start = time.time()

    def can_retry(self):
        # Reset the budget at the start of each one-minute window
        if time.time() - self.window_start > 60:
            self.used = 0
            self.window_start = time.time()
        return self.used < self.budget

    def use_retry(self):
        if not self.can_retry():
            raise Exception("Retry budget exhausted")
        self.used += 1
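Combined with a per-operation retry loop, the budget caps total retries across the agent. A sketch, assuming operation is any callable that raises ConnectionError on transient failure:

import time

budget = RetryBudget(budget_per_minute=30)

def call_with_budget(operation, max_retries=3):
    for attempt in range(max_retries):
        try:
            return operation()
        except ConnectionError:
            # Stop early if either the local attempts or the global budget run out
            if attempt == max_retries - 1 or not budget.can_retry():
                raise
            budget.use_retry()
            time.sleep(2 ** attempt)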
Key metrics to track
Monitoring retry behavior is essential for tuning agent reliability and identifying systemic issues.
Retry success rate
The percentage of operations that succeed after retrying, calculated as:
retry_success_rate = (successful_retries / total_retries) * 100
A high retry success rate (> 80%) indicates that retries effectively handle transient failures. Low rates (< 50%) suggest either insufficient backoff delays or permanent failures being retried unnecessarily.
Mean retries per task
The average number of retry attempts per agent task:
mean_retries = total_retry_attempts / total_tasks_completed
This metric should remain low (< 2) for healthy systems. Values above 3-4 indicate systemic reliability issues requiring investigation. Track this metric per operation type (API calls, UI interactions, file operations) to identify specific problem areas.
P95 and P99 retry latency
The 95th and 99th percentile time spent retrying operations:
p95_retry_latency = percentile(retry_durations, 95)
High P99 latency (> 30 seconds) suggests agents are waiting too long for failing operations. This often indicates either overly generous maximum retry counts or operations that should fail fast being retried aggressively.
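The percentile() call above is pseudocode; with the Python standard library, the same numbers can be computed from recorded retry durations, for example:

import statistics

def retry_latency_percentiles(retry_durations):
    # quantiles(..., n=100) returns the 99 cut points between percentiles 1-99
    cuts = statistics.quantiles(retry_durations, n=100)
    return {"p95": cuts[94], "p99": cuts[98]}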
Circuit breaker trips
The number of times circuit breakers open, indicating repeated failures:
circuit_breaker_trip_rate = trips_per_hour
Frequent trips (> 10/hour) signal downstream service instability or capacity issues. This metric serves as an early warning system for infrastructure problems before they affect agent success rates significantly.
Retry budget exhaustion rate
The percentage of time windows where agents exhaust their retry budget:
budget_exhaustion_rate = (windows_with_exhaustion / total_windows) * 100
Values above 5% suggest either insufficient retry budgets or excessive failure rates requiring architectural fixes rather than more retries.
Related concepts
Understanding retries and backoff requires familiarity with complementary reliability patterns:
- Selector stability: Stable DOM selectors reduce the need for retries by minimizing false negatives when locating UI elements
- Latency SLO: Service level objectives inform appropriate retry timeouts and maximum attempt counts
- Limitations and fallbacks: Fallback strategies handle cases where retries fail after exhausting all attempts
- Error recovery: Broader error handling patterns that complement retry mechanisms for comprehensive agent resilience