Latency SLO

A latency SLO (Service Level Objective) defines the acceptable response-time thresholds that agent operations must meet to maintain user experience. These objectives establish quantifiable targets for how quickly agentic systems must respond to user requests, execute tasks, and provide feedback across different operation types and contexts.

Why It Matters

Latency SLOs are critical for maintaining trust and usability in agentic systems:

User Expectations: Different task types create different latency expectations. Users expect simple acknowledgments within 100-200ms and interactive query responses within about 5 seconds, but may accept 30-60 seconds for complex multi-step operations. Without explicit SLOs, systems risk delivering inconsistent experiences that erode user confidence.

Task Prioritization: SLOs enable intelligent request routing and resource allocation. High-priority interactive tasks can be fast-tracked while batch operations consume remaining capacity. When multiple agents compete for compute resources, SLOs provide objective criteria for scheduling decisions and preemption policies.
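
As a rough illustration of how SLO targets can drive scheduling, the sketch below ranks queued work by how much of its latency budget has already been spent waiting; ScheduledTask, P95_TARGET_MS, and pickNext are illustrative names, and the targets mirror the LATENCY_SLOS table under Concrete Examples.

// Hypothetical SLO-aware scheduler: tasks that have consumed more of their
// latency budget while queued are dequeued first, so interactive work is
// effectively fast-tracked ahead of batch operations.
interface ScheduledTask {
  id: string;
  type: 'acknowledgment' | 'simpleQuery' | 'complexTask' | 'batchOperation';
  enqueuedAt: number;  // epoch ms
}

// Illustrative P95 targets in ms (see LATENCY_SLOS below)
const P95_TARGET_MS: Record<ScheduledTask['type'], number> = {
  acknowledgment: 200,
  simpleQuery: 3000,
  complexTask: 30000,
  batchOperation: 300000
};

function pickNext(queue: ScheduledTask[]): ScheduledTask | undefined {
  // Urgency = fraction of the SLO budget already spent waiting in the queue
  const urgency = (t: ScheduledTask) =>
    (Date.now() - t.enqueuedAt) / P95_TARGET_MS[t.type];
  return [...queue].sort((a, b) => urgency(b) - urgency(a))[0];
}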

Capacity Planning: Historical SLO compliance data reveals when systems approach saturation. If P95 latencies consistently trend toward SLO boundaries, teams can provision additional infrastructure before violations occur. SLOs also quantify the performance impact of new features, helping teams make informed trade-offs between functionality and speed.
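
A minimal sketch of that saturation check, assuming recent P95 samples per operation type are already available from the metrics store; findSaturationRisks and the 80% headroom threshold are assumptions for illustration.

// Hypothetical capacity check: flag operation types whose recent P95
// latencies are trending toward the SLO boundary.
function findSaturationRisks(
  recentP95ByType: Record<string, number[]>,  // e.g. hourly P95 samples
  sloP95ByType: Record<string, number>,
  headroomThreshold = 0.8
): string[] {
  return Object.entries(recentP95ByType)
    .filter(([type, samples]) => {
      const target = sloP95ByType[type];
      if (!target || samples.length === 0) return false;
      const avgP95 = samples.reduce((sum, v) => sum + v, 0) / samples.length;
      // Flag when the average recent P95 exceeds 80% of the SLO target
      return avgP95 > target * headroomThreshold;
    })
    .map(([type]) => type);
}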

Concrete Examples

P50/P95/P99 Targets for Different Task Types

// Latency SLO definitions for different operation types
const LATENCY_SLOS = {
  acknowledgment: {
    p50: 100,  // 100ms
    p95: 200,  // 200ms
    p99: 500   // 500ms
  },
  simpleQuery: {
    p50: 1000,   // 1s
    p95: 3000,   // 3s
    p99: 5000    // 5s
  },
  complexTask: {
    p50: 10000,  // 10s
    p95: 30000,  // 30s
    p99: 60000   // 60s
  },
  batchOperation: {
    p50: 120000,  // 2min
    p95: 300000,  // 5min
    p99: 600000   // 10min
  }
};

SLO Violation Handling

type SLOStatus =
  | { status: 'healthy'; elapsed: number }
  | { status: 'at-risk'; elapsed: number; remainingBudget: number }
  | { status: 'violated'; elapsed: number; overage: number };

class LatencyMonitor {
  constructor(
    public taskType: string,
    public startTime: number,
    public sloTarget: number
  ) {}

  checkSLO(): SLOStatus {
    const elapsed = Date.now() - this.startTime;
    const timeRemaining = this.sloTarget - elapsed;

    if (timeRemaining < 0) {
      return {
        status: 'violated',
        elapsed,
        overage: Math.abs(timeRemaining)
      };
    } else if (timeRemaining < this.sloTarget * 0.2) {
      // Less than 20% of the latency budget remains
      return {
        status: 'at-risk',
        elapsed,
        remainingBudget: timeRemaining
      };
    }

    return { status: 'healthy', elapsed };
  }
}

// Wrap task execution with SLO tracking and error-budget accounting
async function executeWithSLO(task: Task, slo: LatencySLO) {
  const monitor = new LatencyMonitor(task.type, Date.now(), slo.p95);

  try {
    const result = await task.execute();

    const status = monitor.checkSLO();
    if (status.status === 'violated') {
      // Log SLO violation for monitoring
      logSLOViolation(task, status.elapsed, slo.p95);
      // Consume error budget
      consumeErrorBudget(task.type, status.overage);
    }

    return result;
  } catch (error) {
    // Failures also count against SLO
    logSLOViolation(task, monitor.checkSLO().elapsed, slo.p95);
    throw error;
  }
}

Error Budgets

// Error budget system for SLO management
class ErrorBudget {
  private readonly targetCompliance = 0.99;  // 99% of requests must meet SLO
  private totalRequests = 0;
  private sloViolations = 0;

  get complianceRate(): number {
    if (this.totalRequests === 0) return 1.0;
    return 1 - (this.sloViolations / this.totalRequests);
  }

  get budgetRemaining(): number {
    const allowedViolations = this.totalRequests * (1 - this.targetCompliance);
    return Math.max(0, allowedViolations - this.sloViolations);
  }

  get budgetHealthy(): boolean {
    return this.complianceRate >= this.targetCompliance;
  }

  recordRequest(metSLO: boolean): void {
    this.totalRequests++;
    if (!metSLO) {
      this.sloViolations++;

      if (!this.budgetHealthy) {
        // Trigger incident response
        this.alertOnBudgetExhaustion();
      }
    }
  }

  private alertOnBudgetExhaustion(): void {
    console.error(`SLO error budget exhausted: ${this.complianceRate * 100}% compliance`);
    // Pause non-critical deployments, scale resources, etc.
  }
}

Common Pitfalls

Unrealistic Targets: Setting overly aggressive SLOs (e.g., P99 < 1s for LLM-based tasks) creates unachievable goals that demoralize teams. SLOs must account for inherent platform limitations like model inference time (typically 2-10s for complex reasoning), network latency (50-200ms round-trip), and external API dependencies. Base targets on measured baseline performance, not aspirational ideals.
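
One way to keep targets grounded is to derive them from a measured baseline. The sketch below sets the target at the observed P95 plus a margin; deriveSLOTarget and the 20% default margin are illustrative assumptions.

// Derive an SLO target from measured baseline latencies rather than an
// aspirational number, leaving a configurable margin for variance.
function deriveSLOTarget(baselineLatenciesMs: number[], margin = 1.2): number {
  if (baselineLatenciesMs.length === 0) {
    throw new Error('Need baseline measurements before setting an SLO');
  }
  const sorted = [...baselineLatenciesMs].sort((a, b) => a - b);
  // Nearest-rank P95 of the baseline
  const p95Index = Math.max(0, Math.ceil(0.95 * sorted.length) - 1);
  return Math.round(sorted[p95Index] * margin);
}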

Missing Percentile Tracking: Tracking only average latency masks critical user experience problems. A system with 500ms average but 30s P99 means 1% of users experience unacceptable delays. Always monitor P50, P95, and P99 separately, as they reveal different system characteristics: P50 shows typical performance, P95 catches common degradation, and P99 exposes tail latency from retries, garbage collection, or resource contention.
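
A minimal nearest-rank percentile helper makes the point concrete; the sample data below is invented purely for illustration.

// Nearest-rank percentile over raw latency samples; P50, P95, and P99 are
// computed separately because each reveals a different failure mode.
function percentileMs(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) return NaN;
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.max(0, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[rank];
}

// 98 fast requests and 2 very slow ones: the mean (~992ms) looks healthy,
// but P99 shows that the slowest users wait 30 seconds.
const samplesMs = [...Array(98).fill(400), 30000, 30000];
console.log(percentileMs(samplesMs, 50));  // 400
console.log(percentileMs(samplesMs, 99));  // 30000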

Ignoring Outliers: Dismissing P99 violations as "acceptable outliers" ignores that even 1% of users experiencing 30-second delays can generate significant support burden and churn. Tail latencies often indicate systemic issues like inefficient retry logic, missing timeouts, or resource starvation. Treat persistent P99 violations as architectural problems requiring investigation, not statistical noise to ignore.

Single Global SLO: Applying one latency target across all operation types creates misaligned incentives. Simple acknowledgments can easily achieve sub-100ms response while complex multi-step reasoning may legitimately require 20-30 seconds. Define operation-specific SLOs that match user expectations for each task category.

No SLO Review Process: User expectations evolve as competitors set new standards and your system capabilities improve. Review SLOs quarterly to ensure they remain relevant. Tighten targets when infrastructure improvements create headroom; relax targets when new features add complexity that users value more than speed.

Implementation

Latency Measurement Points

// Comprehensive latency instrumentation
class LatencyTracker {
  trackAgentLatency(requestId: string) {
    const measurements = {
      // Baseline timestamp recorded when the request is acknowledged
      acknowledgment: performance.now(),

      // Time to start processing
      processingStart: null as number | null,

      // Time to first meaningful response
      firstToken: null as number | null,

      // Time to complete task
      completion: null as number | null,

      // Breakdown of time spent in components
      components: {
        llmInference: 0,
        toolExecution: 0,
        dataRetrieval: 0,
        rendering: 0
      }
    };

    return {
      markProcessingStart() {
        measurements.processingStart = performance.now();
      },

      markFirstToken() {
        measurements.firstToken = performance.now();
      },

      markCompletion() {
        measurements.completion = performance.now();
        this.recordMetrics(requestId, measurements);
      },

      trackComponent(component: keyof typeof measurements.components, duration: number) {
        measurements.components[component] += duration;
      },

      recordMetrics(id: string, data: typeof measurements) {
        // Send to observability platform
        metrics.emit({
          requestId: id,
          ackTimestamp: data.acknowledgment,  // absolute baseline, not a duration
          processingLatency: data.processingStart ?
            data.processingStart - data.acknowledgment : null,
          firstTokenLatency: data.firstToken ?
            data.firstToken - data.acknowledgment : null,
          totalLatency: data.completion ?
            data.completion - data.acknowledgment : null,
          componentBreakdown: data.components
        });
      }
    };
  }
}

SLO Tracking Dashboards

// Dashboard query patterns for SLO monitoring
const SLO_QUERIES = {
  // Calculate percentile latencies
  percentiles: `
    SELECT
      operation_type,
      PERCENTILE_CONT(0.50) WITHIN GROUP (ORDER BY latency_ms) as p50,
      PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY latency_ms) as p95,
      PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY latency_ms) as p99
    FROM agent_requests
    WHERE timestamp > NOW() - INTERVAL '1 hour'
    GROUP BY operation_type
  `,

  // Calculate SLO compliance
  compliance: `
    SELECT
      operation_type,
      COUNT(*) as total_requests,
      SUM(CASE WHEN latency_ms <= slo_target_ms THEN 1 ELSE 0 END) as met_slo,
      CAST(SUM(CASE WHEN latency_ms <= slo_target_ms THEN 1 ELSE 0 END) AS FLOAT) /
        COUNT(*) as compliance_rate
    FROM agent_requests
    JOIN slo_targets USING (operation_type)
    WHERE timestamp > NOW() - INTERVAL '24 hours'
    GROUP BY operation_type
  `,

  // Identify worst offenders
  violators: `
    SELECT
      request_id,
      operation_type,
      latency_ms,
      slo_target_ms,
      latency_ms - slo_target_ms as overage_ms,
      error_message
    FROM agent_requests
    JOIN slo_targets USING (operation_type)
    WHERE latency_ms > slo_target_ms
      AND timestamp > NOW() - INTERVAL '1 hour'
    ORDER BY overage_ms DESC
    LIMIT 100
  `
};

// Real-time SLO dashboard component
interface SLODashboard {
  renderSLOStatus(operationType: string): {
    currentP50: number;
    currentP95: number;
    currentP99: number;
    targetP95: number;
    complianceRate: number;
    errorBudgetRemaining: number;
    trend: 'improving' | 'stable' | 'degrading';
  };
}

Alerting Thresholds

// Multi-level alerting for SLO violations
interface SLOAlert {
  severity: 'warning' | 'critical' | 'emergency';
  condition: string;
  action: string;
}

const SLO_ALERTS: SLOAlert[] = [
  {
    severity: 'warning',
    condition: 'P95 latency exceeds SLO for 5 consecutive minutes',
    action: 'Notify on-call engineer; investigate causes'
  },
  {
    severity: 'critical',
    condition: 'P95 latency exceeds SLO by 50% for 10 minutes',
    action: 'Page on-call; begin incident response; consider scaling'
  },
  {
    severity: 'emergency',
    condition: 'Error budget exhausted (compliance < 99%)',
    action: 'Halt deployments; emergency scaling; executive notification'
  }
];

// Alert configuration
class SLOAlerting {
  private readonly alertRules = [
    {
      name: 'P95 SLO Violation',
      query: 'p95_latency_ms > slo_target_p95',
      duration: '5m',
      severity: 'warning'
    },
    {
      name: 'Sustained SLO Violation',
      query: 'p95_latency_ms > slo_target_p95 * 1.5',
      duration: '10m',
      severity: 'critical'
    },
    {
      name: 'Error Budget Exhausted',
      query: 'slo_compliance_rate < 0.99',
      duration: '1m',
      severity: 'emergency'
    },
    {
      name: 'P99 Degradation',
      query: 'p99_latency_ms > slo_target_p99 * 2',
      duration: '15m',
      severity: 'warning'
    }
  ];

  // evaluateCondition and createAlert (rule matching and alert construction)
  // are omitted here; they depend on the observability backend in use.
  evaluateAlerts(metrics: LatencyMetrics): Alert[] {
    return this.alertRules
      .filter(rule => this.evaluateCondition(rule, metrics))
      .map(rule => this.createAlert(rule, metrics));
  }
}

Key Metrics

Track these metrics to monitor latency SLO compliance:

P50 Latency: Median response time representing typical user experience. For interactive queries, target P50 < 2s; for complex tasks, P50 < 15s. This metric shows baseline performance under normal conditions.

P95 Latency: 95th percentile latency capturing most user experiences while excluding extreme outliers. This is typically the primary SLO enforcement boundary. For interactive operations, target P95 < 5s; for complex workflows, P95 < 30s.

P99 Latency: 99th percentile revealing tail latency issues from retries, timeouts, and resource contention. While SLOs may allow higher P99 values, sustained violations indicate architectural problems. Target P99 < 10s for interactive tasks, P99 < 60s for complex operations.

SLO Compliance %: Percentage of requests meeting SLO targets, typically measured over rolling 24-hour or 7-day windows. Maintain 99%+ compliance for customer-facing operations, 95%+ for internal tools. Track per operation type and user tier.

Error Budget Consumption: Rate at which SLO violations consume allocated error budget. Calculate as (1 - compliance_rate) / (1 - target_compliance). Budget consumption > 1.0 indicates SLO targets are at risk; consumption > 2.0 requires immediate intervention.
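
As a worked example of that formula (the 99% target and the sample compliance rate are illustrative):

// Burn rate of the error budget: (1 - compliance_rate) / (1 - target_compliance)
function errorBudgetConsumption(complianceRate: number, targetCompliance = 0.99): number {
  return (1 - complianceRate) / (1 - targetCompliance);
}

// 98.5% compliance against a 99% target burns budget at ~1.5x, meaning the
// budget will be exhausted before the measurement window ends.
console.log(errorBudgetConsumption(0.985).toFixed(2));  // "1.50"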

Time to First Token: Latency from request submission to first response chunk, critical for perceived responsiveness. Target TTFT < 500ms for streaming responses to provide immediate feedback while full response generates.

Component Latency Breakdown: Time spent in LLM inference, tool execution, data retrieval, and rendering. Identifies optimization opportunities by revealing which components contribute most to total latency.

Related Concepts

  • UX Latency - User-perceived latency and response time expectations
  • Retries and Backoff - Retry strategies that impact latency and SLO compliance
  • Observability - Monitoring and instrumentation for tracking SLO metrics
  • Telemetry - Data collection systems for measuring latency across agent operations