AI Gateways: Production LLM Routing & Cost Control
Introduction
Production AI systems fail when LLM costs spiral unpredictably and latency spikes crater user experience. The root cause is rarely the models themselves—it's the absence of intelligent traffic management between them. An AI gateway is the critical infrastructure layer that routes requests across multiple LLMs, enforces token budgets, shapes traffic patterns, and maintains sub-second p95 latency while capping spend.
This article delivers field-tested patterns for building and operating AI gateways at scale: from request classification algorithms to prompt firewalls, from budget enforcement mechanisms to failover orchestration. You'll leave with concrete implementation code, diagnostic runbooks, and a decision framework for gateway selection. For teams building comprehensive AI observability alongside their gateway, native LLM tracing with OpenTelemetry provides the telemetry foundation needed for effective routing decisions.
Failure scenario: A fintech startup running customer-facing chatbot inference saw monthly OpenAI costs jump 340% in two weeks. Investigation revealed no traffic increase—instead, a marketing campaign triggered a surge in long-form legal queries hitting GPT-4 instead of cached GPT-3.5 responses. Without request classification or spend caps, the gateway-less architecture hemorrhaged budget until manual intervention. Recovery took 72 hours; customer trust took longer.
Executive Summary
TL;DR: An AI gateway routes LLM requests through policy-driven classification, budget-aware model selection, and adaptive failover. Depending on workload mix, this can reduce inference costs by roughly 40–70% while keeping p95 latency under 500ms.
Key Takeaways
- Request classification is the control plane: Semantic analysis of prompts (complexity, sensitivity, context window needs) determines routing decisions before token generation begins.
- Token budget enforcement prevents spend explosions: Hard caps per user/session with graceful degradation to cheaper models or cached responses.
- Latency-aware routing requires predictive modeling: Historical p99 latency by model, region, and time-of-day drives real-time selection, not just static rules.
- Prompt firewalls block cost attacks: Input validation, PII detection, and prompt injection filtering occur at the edge, before expensive model calls.
- Circuit breakers isolate provider failures: Sub-5-second detection of degraded endpoints with automatic traffic shift and queue draining.
- Observability is non-negotiable: Per-request cost attribution, token-level tracing, and budget burn-down dashboards enable proactive intervention.
Direct Answers (Q→A)
- How does an AI gateway route requests across multiple LLMs to control cost and latency? — It classifies prompt complexity, checks token budgets, selects the cheapest capable model using predictive latency models, and fails over automatically when providers degrade.
- What prevents runaway LLM spending in production? — Per-entity token budgets with hard enforcement, spend-rate alerting, and automatic model downgrading when thresholds approach.
- When should I use a dedicated gateway versus direct provider SDKs? — At >100K monthly requests, multi-provider requirements, or when cost predictability is a business requirement.
How AI Gateways for LLM Routing and Cost Control Work Under the Hood
Architecture Overview
An AI gateway operates as a reverse proxy with deep semantic awareness. Unlike traditional API gateways that route by URL patterns, an LLM routing layer analyzes request payload characteristics to make model selection decisions. The core components form a pipeline:
1. Request Ingress & Normalization
Incoming requests are validated against OpenAPI schemas, normalized to a canonical internal format (typically OpenAI-compatible), and assigned a correlation ID for distributed tracing. This layer handles authentication, rate limiting, and initial PII detection.
2. Prompt Classification Engine
This stage is what separates an AI gateway from a traditional API gateway. A lightweight classifier (often a small fine-tuned model or a heuristic ensemble) scores each prompt on:
- Complexity score (0–1): Reasoning depth required, extracted via prompt structure analysis and few-shot classification
- Context sensitivity: Hallucination risk for factual queries, determined by domain keyword matching
- Token demand estimate: Projected input+output tokens based on prompt length, conversation history, and output format specifications
- Latency tolerance: Explicit or inferred from request headers (e.g., streaming vs. batch indicators)
3. Policy Engine & Model Selection
Classification scores feed into a rules engine evaluating:
- Per-user/tenant token budget remaining
- Model capability matrix (which models can satisfy complexity requirements)
- Real-time cost and latency telemetry
- Business priority tiers (free vs. paid user differentiation)
The selection algorithm minimizes cost subject to latency and quality constraints—a constrained optimization solved via weighted scoring or integer programming for complex multi-objective scenarios.
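One way to write this selection problem down is sketched below. The symbols are assumptions introduced here for illustration, not notation from any particular gateway: $M$ is the set of candidate models, $c_m$ the estimated cost of serving the request on model $m$, $\hat{\ell}_m$ its predicted tail latency, $q_m$ a capability score, $B$ the remaining budget, and $L_{\max}$, $q_{\min}$ the request's latency and quality requirements.

```latex
\begin{aligned}
m^{*} = \arg\min_{m \in M} \quad & c_m \\
\text{subject to} \quad & \hat{\ell}_m \le L_{\max}, \\
                        & q_m \ge q_{\min}, \\
                        & c_m \le B .
\end{aligned}
\qquad
\text{Weighted-scoring variant:} \quad
s_m = w_c \, \frac{c_m}{c_{\max}} + w_\ell \, \frac{\hat{\ell}_m}{\ell_{\max}},
\quad w_c + w_\ell = 1 .
```

In the weighted-scoring variant, the gateway picks the feasible model with the smallest $s_m$. The `_score` method in Pattern 2 is an instance of this form with $w_c = 0.6$, $w_\ell = 0.4$, $c_{\max} = \$0.10$, and $\ell_{\max} = 10000$ ms.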
4. Budget Enforcement & Quota Management
Token budget enforcement operates at multiple time horizons:
- Request-level: Pre-flight check against remaining budget; reject or downgrade if insufficient
- Streaming-level: Accumulating token counter with mid-stream cutoff when limits exceeded
- Session-level: Conversation state tracking with budget carry-forward
- Periodic reset: Daily/weekly/monthly quota replenishment with notification hooks
Budget state is typically stored in Redis or similar with sub-millisecond access, using atomic decrement operations to prevent race conditions in high-concurrency scenarios.
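The check-then-deduct step described above can be sketched as follows. This is an in-memory illustration only: `InMemoryBudgetLedger` is a hypothetical name, the lock stands in for the atomicity that a single Redis operation or Lua script would provide in production, and amounts are held as integer micro-dollars to avoid floating-point drift.

```python
import threading


class InMemoryBudgetLedger:
    """Toy stand-in for a Redis-backed budget store (hypothetical name).

    The lock makes check-and-deduct one indivisible step, which is the
    property an atomic decrement or Lua script gives you in Redis.
    """

    def __init__(self, budgets_micro_usd: dict):
        self._remaining = dict(budgets_micro_usd)  # user_id -> micro-USD left
        self._lock = threading.Lock()

    def reserve(self, user_id: str, amount: int) -> bool:
        """Deduct `amount` if, and only if, enough budget remains."""
        with self._lock:
            if self._remaining.get(user_id, 0) < amount:
                return False
            self._remaining[user_id] -= amount
            return True

    def settle(self, user_id: str, reserved: int, actual: int) -> None:
        """Refund the unused part of a reservation (or charge the overage)."""
        with self._lock:
            self._remaining[user_id] += reserved - actual

    def remaining(self, user_id: str) -> int:
        with self._lock:
            return self._remaining.get(user_id, 0)


if __name__ == "__main__":
    ledger = InMemoryBudgetLedger({"user-1": 1_000_000})  # $1.00
    results = []
    # 20 concurrent requests each try to reserve $0.10.
    threads = [
        threading.Thread(target=lambda: results.append(ledger.reserve("user-1", 100_000)))
        for _ in range(20)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(sum(results), "reservations granted;", ledger.remaining("user-1"), "micro-USD left")
```

Reserving a pessimistic amount up front and settling against actual usage afterwards keeps concurrent requests from all passing the same pre-flight check.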
5. Execution & Adaptive Routing
The selected model receives the request through a provider-specific adapter. The gateway maintains connection pools, handles retries with exponential backoff, and implements LLM traffic shaping through:
- Request batching for compatible queries
- Priority queuing with weighted fair queuing (WFQ) algorithms
- Backpressure propagation when downstream providers throttle
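The priority-queuing item above can be illustrated with a small scheduler. This is a sketch of a self-clocked approximation to weighted fair queuing, not a full WFQ implementation; `WeightedFairQueue`, the flow names, and the weights are all assumptions made for the example. Each request gets a virtual finish tag of `start + cost / weight`, and the queue always serves the smallest tag.

```python
import heapq
import itertools


class WeightedFairQueue:
    """Self-clocked approximation of weighted fair queuing (illustrative)."""

    def __init__(self, weights: dict):
        self._weights = weights                       # flow -> relative share
        self._last_finish = {flow: 0.0 for flow in weights}
        self._virtual_time = 0.0
        self._heap = []                               # (finish_tag, seq, flow, item)
        self._seq = itertools.count()                 # tie-breaker, keeps FIFO order

    def enqueue(self, flow: str, item, cost: float = 1.0) -> None:
        # A flow's next request starts when its previous one "finishes",
        # or now (in virtual time) if the flow has been idle.
        start = max(self._virtual_time, self._last_finish[flow])
        finish = start + cost / self._weights[flow]
        self._last_finish[flow] = finish
        heapq.heappush(self._heap, (finish, next(self._seq), flow, item))

    def dequeue(self):
        finish, _, flow, item = heapq.heappop(self._heap)
        self._virtual_time = finish                   # advance the virtual clock
        return flow, item

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    queue = WeightedFairQueue({"paid": 4.0, "free": 1.0})
    for i in range(6):
        queue.enqueue("paid", f"paid-{i}")
    for i in range(6):
        queue.enqueue("free", f"free-{i}")
    while len(queue):
        print(queue.dequeue())
```

With these weights, a backlogged paid tier is served roughly four requests for every free-tier request, yet the free tier is never starved. Passing estimated tokens as `cost` would make the shares token-weighted rather than request-weighted.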
6. Response Processing & Feedback
Output passes through content filtering, format normalization, and cost attribution. Token counts and latency measurements feed back into predictive models for subsequent routing decisions.
Prompt Firewall: Security at the Edge
The prompt firewall layer executes before any model invocation, preventing both security exploits and cost attacks:
| Filter Category | Detection Method | Response Action |
|---|---|---|
| Prompt Injection | Delimiter analysis, instruction conflict detection, embedding similarity to known attack patterns | Block + alert; sanitized rewrite for low-confidence |
| PII Exposure | Regex + NER models for SSN, credit cards, health identifiers | Redaction or request rejection based on policy |
| Cost Attack (Token Bomb) | Unusual token density patterns, repetitive structure designed to maximize output length | Hard cap enforcement + user flagging |
| Jailbreak Attempts | Role-play framing, encoding obfuscation, refusal suppression patterns | Block + security audit log |
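Two rows of the table, PII exposure and cost attacks, lend themselves to a cheap first-pass check before any model call. The sketch below is a minimal regex-only version under stated assumptions: the thresholds, the `screen_prompt` and `FirewallResult` names, and the redaction markers are hypothetical, and a production firewall would add NER models and injection detection on top.

```python
import re
from dataclasses import dataclass, field

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens
CARD_RE = re.compile(r"(?<!\d)(?:\d[ -]?){12,18}\d(?!\d)")

MAX_WORDS = 4000             # assumed hard cap on prompt length
MIN_WORDS_FOR_RATIO = 200    # skip the variety check on short prompts
MIN_UNIQUE_RATIO = 0.05      # assumed floor for lexical variety


@dataclass
class FirewallResult:
    action: str                          # "allow", "redact", or "block"
    prompt: str
    flags: list = field(default_factory=list)


def luhn_valid(digits: str) -> bool:
    """Luhn checksum, used to cut false positives on long digit runs."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def screen_prompt(prompt: str) -> FirewallResult:
    words = prompt.split()
    # Cost checks first: they are the cheapest and can short-circuit.
    if len(words) > MAX_WORDS:
        return FirewallResult("block", prompt, ["cost:too_long"])
    if len(words) >= MIN_WORDS_FOR_RATIO and len(set(words)) / len(words) < MIN_UNIQUE_RATIO:
        return FirewallResult("block", prompt, ["cost:repetitive"])

    flags = []

    def redact_card(match):
        if luhn_valid(re.sub(r"\D", "", match.group())):
            flags.append("pii:card")
            return "[REDACTED_CARD]"
        return match.group()

    cleaned = CARD_RE.sub(redact_card, prompt)
    cleaned, ssn_count = SSN_RE.subn("[REDACTED_SSN]", cleaned)
    if ssn_count:
        flags.append("pii:ssn")
    return FirewallResult("redact" if flags else "allow", cleaned, flags)
```

Whether a PII hit leads to redaction (as here) or outright rejection is a policy decision, as the table notes.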
Implementation: Production Patterns
Pattern 1: Basic Gateway with Static Routing
Starting point for teams with single-provider usage seeking cost visibility:
import asyncio
import time
from dataclasses import dataclass
from typing import Optional


class BudgetExceededError(Exception):
    """Raised when a user's budget is exhausted."""


@dataclass
class RoutingDecision:
    provider: str
    model: str
    estimated_cost_usd: float
    latency_budget_ms: int


class StaticAIGateway:
    """Simple gateway with model selection based on prompt complexity heuristics."""

    MODEL_MATRIX = {
        'simple': {'provider': 'openai', 'model': 'gpt-3.5-turbo', 'cost_per_1k': 0.002},
        'complex': {'provider': 'openai', 'model': 'gpt-4', 'cost_per_1k': 0.06},
        'code': {'provider': 'anthropic', 'model': 'claude-3-sonnet', 'cost_per_1k': 0.015}
    }

    def __init__(self, budget_manager):
        # budget_manager is expected to provide `async get_remaining(user_id)` in USD
        self.budget_manager = budget_manager
        self.request_log = []

    async def classify_prompt(self, prompt: str, context: Optional[list] = None) -> str:
        """Heuristic classification: token count + keyword matching."""
        token_estimate = len(prompt) // 4  # Rough approximation
        lowered = prompt.lower()
        # Complexity indicators
        code_keywords = ['def ', 'class ', 'import ', 'function', 'algorithm']
        reasoning_keywords = ['explain', 'analyze', 'compare', 'evaluate', 'synthesize']
        if any(kw in lowered for kw in code_keywords):
            return 'code'
        elif token_estimate > 500 or any(kw in lowered for kw in reasoning_keywords):
            return 'complex'
        return 'simple'

    async def route(self, user_id: str, prompt: str,
                    context: Optional[list] = None) -> RoutingDecision:
        # Budget check
        remaining = await self.budget_manager.get_remaining(user_id)
        if remaining <= 0:
            raise BudgetExceededError(f"User {user_id} budget depleted")

        # Classification and routing
        complexity = await self.classify_prompt(prompt, context)
        config = self.MODEL_MATRIX[complexity]

        # Cost estimation and enforcement
        estimated_tokens = len(prompt) // 4 + 500  # Input + assumed output
        estimated_cost = (estimated_tokens / 1000) * config['cost_per_1k']
        if estimated_cost > remaining * 0.5:  # Conservative: don't spend >50% remaining
            # Downgrade to cheaper model if possible
            if complexity != 'simple':
                complexity = 'simple'  # keeps the latency budget below in sync
                config = self.MODEL_MATRIX['simple']
                estimated_cost = (estimated_tokens / 1000) * config['cost_per_1k']

        return RoutingDecision(
            provider=config['provider'],
            model=config['model'],
            estimated_cost_usd=estimated_cost,
            latency_budget_ms=2000 if complexity == 'simple' else 8000
        )
Pattern 2: Dynamic Latency-Aware Routing
Production-grade implementation with predictive latency modeling and circuit breakers:
import statistics
import time
from collections import deque
from enum import Enum


class NoAvailableModelError(Exception):
    """Raised when no model satisfies capability, latency, and budget constraints."""


def estimate_tokens(text: str) -> int:
    """Rough estimate (~4 characters per token); use a real tokenizer in production."""
    return len(text) // 4


class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing, reject fast
    HALF_OPEN = "half_open"  # Testing recovery


class LatencyPredictor:
    """Exponentially weighted moving average for per-model latency."""

    def __init__(self, alpha=0.3, window_size=100):
        self.alpha = alpha    # Decay factor
        self.windows = {}     # model -> deque of (latency_ms, success)
        self.ewma = {}        # model -> current EWMA
        self.window_size = window_size

    def record(self, model: str, latency_ms: float, success: bool):
        if model not in self.windows:
            self.windows[model] = deque(maxlen=self.window_size)
            self.ewma[model] = latency_ms
        self.windows[model].append((latency_ms, success))
        # Update EWMA
        if success:
            self.ewma[model] = (self.alpha * latency_ms +
                                (1 - self.alpha) * self.ewma[model])

    def predict_p99(self, model: str) -> float:
        """Conservative tail-latency estimate using the recent window."""
        window = self.windows.get(model)
        if window is None or len(window) < 10:
            return float('inf')  # Unknown = risky
        # Successful calls among the 20 most recent samples
        vals = [lat for lat, ok in list(window)[-20:] if ok]
        if len(vals) < 5:
            return float('inf')
        # Conservative: mean + 3 standard deviations as a stand-in for p99
        return statistics.mean(vals) + 3 * statistics.stdev(vals)


class CircuitBreaker:
    """Per-provider circuit breaker with adaptive thresholds."""

    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.states = {}            # provider -> CircuitState
        self.failure_counts = {}    # provider -> consecutive failures
        self.last_failure_time = {}

    def call(self, provider: str) -> bool:
        """Returns True if call should proceed."""
        state = self.states.get(provider, CircuitState.CLOSED)
        if state == CircuitState.OPEN:
            # Check if recovery timeout elapsed
            last_fail = self.last_failure_time.get(provider, 0)
            if time.time() - last_fail > self.recovery_timeout:
                self.states[provider] = CircuitState.HALF_OPEN
                return True
            return False
        return True

    def record_result(self, provider: str, success: bool):
        if success:
            if self.states.get(provider) == CircuitState.HALF_OPEN:
                self.states[provider] = CircuitState.CLOSED
            self.failure_counts[provider] = 0
        else:
            self.failure_counts[provider] = self.failure_counts.get(provider, 0) + 1
            self.last_failure_time[provider] = time.time()
            if self.failure_counts[provider] >= self.failure_threshold:
                self.states[provider] = CircuitState.OPEN
                # Alert: provider circuit opened


class DynamicAIGateway:
    """Production gateway with predictive routing and resilience patterns."""

    def __init__(self, budget_manager=None):
        self.latency_predictor = LatencyPredictor()
        self.circuit_breaker = CircuitBreaker()
        # TokenBudgetManager (Redis-backed) is assumed to be defined elsewhere
        self.budget_manager = budget_manager or TokenBudgetManager()
        self.model_profiles = self._load_model_profiles()

    def _load_model_profiles(self):
        """Capability and cost matrix for available models."""
        return {
            'gpt-3.5-turbo': {
                'provider': 'openai',
                'max_tokens': 4096,
                'cost_input_1k': 0.0015,
                'cost_output_1k': 0.002,
                'strengths': ['simple_qa', 'summarization', 'formatting'],
                'context_window': 16385
            },
            'gpt-4': {
                'provider': 'openai',
                'max_tokens': 8192,
                'cost_input_1k': 0.03,
                'cost_output_1k': 0.06,
                'strengths': ['reasoning', 'analysis', 'complex_instruction'],
                'context_window': 8192
            },
            'claude-3-sonnet': {
                'provider': 'anthropic',
                'max_tokens': 4096,
                'cost_input_1k': 0.003,
                'cost_output_1k': 0.015,
                'strengths': ['reasoning', 'analysis', 'long_context'],
                'context_window': 200000
            },
            'claude-3-haiku': {
                'provider': 'anthropic',
                'max_tokens': 4096,
                'cost_input_1k': 0.00025,
                'cost_output_1k': 0.00125,
                'strengths': ['speed', 'simple_qa', 'classification'],
                'context_window': 200000
            }
        }

    async def route(self, request: dict) -> dict:
        """Multi-objective optimization: minimize cost, meet latency, satisfy quality."""
        # Extract requirements
        required_strengths = request.get('required_capabilities', ['simple_qa'])
        max_latency_ms = request.get('max_latency_ms', 3000)
        user_id = request['user_id']
        prompt_tokens = estimate_tokens(request['prompt'])

        # Budget check
        remaining_budget = await self.budget_manager.check_and_reserve(
            user_id, estimated_max_cost=0.50  # Reserve $0.50
        )

        # Filter: capability match + circuit closed
        candidates = []
        for model_id, profile in self.model_profiles.items():
            provider = profile['provider']
            # Circuit breaker check
            if not self.circuit_breaker.call(provider):
                continue
            # Capability match
            if not all(s in profile['strengths'] for s in required_strengths):
                continue
            # Context window check
            if prompt_tokens > profile['context_window'] * 0.8:
                continue
            # Latency prediction. NOTE: models with fewer than 10 recorded samples
            # predict inf and are skipped, so seed the predictor before serving traffic.
            predicted_latency = self.latency_predictor.predict_p99(model_id)
            if predicted_latency > max_latency_ms * 1.2:  # tolerate up to 20% over target
                continue
            # Cost calculation
            estimated_output = min(request.get('max_tokens', 500), profile['max_tokens'])
            cost = (prompt_tokens / 1000 * profile['cost_input_1k'] +
                    estimated_output / 1000 * profile['cost_output_1k'])
            if cost > remaining_budget:
                continue
            candidates.append({
                'model_id': model_id,
                'provider': provider,
                'predicted_latency_ms': predicted_latency,
                'estimated_cost_usd': cost,
                'score': self._score(cost, predicted_latency, profile)
            })

        if not candidates:
            raise NoAvailableModelError("No models meet constraints")

        # Select optimal: minimize cost, break ties on latency.
        # Use key=lambda x: x['score'] instead to trade cost against latency.
        best = min(candidates, key=lambda x: (x['estimated_cost_usd'], x['predicted_latency_ms']))
        return {
            'model_id': best['model_id'],
            'provider': best['provider'],
            'estimated_cost_usd': best['estimated_cost_usd'],
            'predicted_latency_ms': best['predicted_latency_ms'],
            'fallback_chain': self._build_fallbacks(best, candidates)
        }

    def _score(self, cost, latency, profile) -> float:
        """Composite score: lower is better. Weights configurable by use case."""
        cost_weight = 0.6
        latency_weight = 0.4
        # Normalize: assume max cost $0.10, max latency 10000ms
        return (cost / 0.10) * cost_weight + (latency / 10000) * latency_weight

    def _build_fallbacks(self, primary, all_candidates):
        """Ordered fallback chain for resilience."""
        fallbacks = [c for c in all_candidates
                     if c['model_id'] != primary['model_id']]
        # Simplified: every candidate already passed the capability filter,
        # so order the two cheapest alternatives by cost
        return sorted(fallbacks, key=lambda x: x['estimated_cost_usd'])[:2]
Pattern 3: Enterprise Gateway with Full Policy Engine
Advanced implementation with semantic classification, content filtering, and audit logging:
import time
import uuid

# Assumed to be defined elsewhere: load_fine_tuned_classifier, ContentSafetyFilter,
# MultiTenantBudgetManager, StructuredAuditLogger, calculate_actual_cost,
# SafetyViolationError, BudgetExceededError, and DynamicAIGateway (Pattern 2).
# The _normalize_request and _execute_with_resilience methods are omitted for brevity.


def generate_uuid() -> str:
    return str(uuid.uuid4())


def elapsed_ms(start_time: float) -> float:
    """Milliseconds elapsed since a time.monotonic() reading."""
    return (time.monotonic() - start_time) * 1000


class EnterpriseAIGateway:
    """
    Production gateway with:
    - Fine-tuned prompt classifier
    - Content safety filtering
    - Comprehensive audit logging
    - Multi-tenant budget isolation
    """

    def __init__(self, config):
        self.classifier = load_fine_tuned_classifier(config.classifier_path)
        self.safety_filter = ContentSafetyFilter(config.safety_rules)
        self.budget_manager = MultiTenantBudgetManager(
            redis_pool=config.redis,
            default_daily_budget=config.default_budget_usd
        )
        self.audit_logger = StructuredAuditLogger(config.log_endpoint)
        self.routing_engine = DynamicAIGateway()  # From Pattern 2

    async def process_request(self, request: dict, tenant_id: str) -> dict:
        request_id = generate_uuid()
        start_time = time.monotonic()
        try:
            # Phase 1: Validation & Normalization
            normalized = self._normalize_request(request)

            # Phase 2: Prompt Firewall (security & safety)
            safety_result = await self.safety_filter.analyze(normalized['prompt'])
            if safety_result.action == 'block':
                await self.audit_logger.log({
                    'request_id': request_id,
                    'tenant_id': tenant_id,
                    'action': 'blocked',
                    'reason': safety_result.block_reason,
                    'latency_ms': elapsed_ms(start_time)
                })
                raise SafetyViolationError(safety_result.block_reason)

            # Phase 3: Semantic Classification
            classification = await self.classifier.classify(
                normalized['prompt'],
                conversation_history=normalized.get('messages', [])
            )
            # Returns: complexity_score, domain, sensitivity_level, estimated_tokens

            # Phase 4: Budget Enforcement with Tenant Isolation
            budget_status = await self.budget_manager.check(
                tenant_id=tenant_id,
                user_id=normalized.get('user_id'),
                estimated_cost=classification.estimated_cost,
                sensitivity=classification.sensitivity_level
            )
            if budget_status.action == 'reject':
                raise BudgetExceededError(budget_status.message)
            elif budget_status.action == 'downgrade':
                classification.target_tier = 'economy'

            # Phase 5: Route Selection
            routing_request = {
                'required_capabilities': classification.required_capabilities,
                'max_latency_ms': normalized.get('max_latency_ms', 3000),
                'target_tier': classification.target_tier,
                'prompt': normalized['prompt'],
                'estimated_tokens': classification.estimated_tokens,
                'user_id': normalized.get('user_id')
            }
            route = await self.routing_engine.route(routing_request)

            # Phase 6: Execute with Circuit Breaker & Retry
            result = await self._execute_with_resilience(route, normalized)

            # Phase 7: Post-processing & Cost Attribution
            actual_cost = calculate_actual_cost(
                route['model_id'],
                result['usage']['prompt_tokens'],
                result['usage']['completion_tokens']
            )
            await self.budget_manager.commit_spend(tenant_id, actual_cost)

            # Phase 8: Audit & Telemetry
            await self.audit_logger.log({
                'request_id': request_id,
                'tenant_id': tenant_id,
                'model_id': route['model_id'],
                'classification': classification.to_dict(),
                'routing_decision': route,
                'actual_cost_usd': actual_cost,
                'latency_ms': elapsed_ms(start_time),
                'tokens': result['usage'],
                'safety_flags': safety_result.flags
            })
            return {
                'response': result['content'],
                'model_used': route['model_id'],
                'cost_usd': actual_cost,
                'latency_ms': elapsed_ms(start_time)
            }
        except SafetyViolationError:
            raise  # Already audited as 'blocked' above; avoid a duplicate log entry
        except Exception as e:
            await self.audit_logger.log({
                'request_id': request_id,
                'tenant_id': tenant_id,
                'action': 'error',
                'error_type': type(e).__name__,
                'latency_ms': elapsed_ms(start_time)
            })
            raise
Comparisons & Decision Framework
Gateway Implementation Options
| Approach | Best For | Time to Production | Control Level | Ongoing Cost |
|---|---|---|---|---|
| Open-source (e.g., LiteLLM) | Teams already running Kubernetes that need multi-provider support quickly | 1–2 weeks | Medium (configurable rules) | Infrastructure only |
| Commercial (Kong AI Gateway, Cloudflare AI Gateway) | Enterprise needing SLA, compliance, minimal ops | Days | Low-Medium (vendor-defined) | Per-request + platform fees |
| Custom Build (Patterns above) | Unique classification needs, deep cost optimization | 6–10 weeks | High (full customization) | Team + infrastructure |
| Provider-Native (e.g., AI gateway capabilities in Azure API Management) | Single-provider deployments, existing Azure estate | 1–2 weeks | Low (provider-locked) | Azure consumption |
Decision Checklist
Use this framework when evaluating gateway approaches:
- Scale threshold: >100K monthly requests justifies dedicated gateway; <10K use provider SDKs with client-side caching
- Multi-provider requirement: If yes, eliminate provider-native options
- Classification complexity: Need custom semantic routing? → Custom build or extensible open-source
- Compliance requirements: SOC 2, GDPR, HIPAA? → Commercial with audit guarantees or custom with legal review
- Team bandwidth: <2 FTEs available? → Commercial or managed open-source
- Latency sensitivity: p99 <200ms requirement? → Custom with edge deployment, avoid multi-hop commercial
- Budget volatility tolerance: >50% monthly variance unacceptable? → Mandatory custom budget enforcement
Those operating agentic systems at scale should reference field-tested production observability frameworks for agentic AI to ensure gateway telemetry integrates with broader system health monitoring.
Failure Modes & Edge Cases
Critical Failure Scenarios
1. Classification Cascade Failure
Symptom: Simple queries routed to expensive models, costs spike 3–5x without traffic increase.
Root cause: Prompt classifier drift or poisoning; training data no longer represents production distribution.
Diagnostics: Monitor classification confidence scores; alert on entropy spike. Track routing distribution by complexity bucket.
Mitigation: Fallback to heuristic classifier when ML confidence <0.7. A/B test classifier updates with 5% traffic shadowing.
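The diagnostics and mitigation above can be sketched in a few lines. This assumes that "entropy" refers to the Shannon entropy of the routing distribution across complexity buckets, and that the ML classifier returns a `(label, confidence)` pair; both are interpretations made for the example, and the function names are hypothetical.

```python
import math
from collections import Counter


def routing_entropy(tiers) -> float:
    """Shannon entropy (in bits) of the tier distribution over recent requests."""
    counts = Counter(tiers)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Counter values are always >= 1 here, so log2 is well defined.
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def classify_with_fallback(prompt, ml_classify, heuristic_classify,
                           min_confidence: float = 0.7):
    """Use the ML classifier unless its confidence drops below the floor."""
    label, confidence = ml_classify(prompt)
    if confidence < min_confidence:
        return heuristic_classify(prompt), "heuristic"
    return label, "ml"


if __name__ == "__main__":
    baseline = ["simple"] * 80 + ["complex"] * 20
    drifted = ["simple"] * 40 + ["complex"] * 60
    print(round(routing_entropy(baseline), 3), round(routing_entropy(drifted), 3))
```

Comparing the entropy of a recent window against a stored baseline gives a single number to alert on. Tracking the per-bucket shares alongside it shows which direction the routing mix moved.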
2. Budget Race Condition
Symptom: Users exceed stated limits; spend continues post-budget-exhaustion.
Root cause: Non-atomic budget checks; concurrent requests pass check before any commits spend.
Diagnostics: Redis MONITOR for competing decrements; compare sum(spend) vs. budget limit.
Mitigation: Atomic check-and-decrement in a Redis Lua script, or pessimistic reservation (check_and_reserve in Pattern 2 above).
3. Circuit Breaker Oscillation
Symptom: Rapid provider switching; latency worse than single-provider baseline.
Root cause: Overly sensitive failure detection (threshold too low) or a recovery timeout that is too short, so the circuit reopens before the provider has stabilized.
Diagnostics: Circuit state transition frequency; correlate with error rate (not just latency).
Mitigation: Adaptive thresholds based on historical error rate; minimum 60-second open duration.
4. Token Count Estimation Drift
Symptom: Budget exhaustion before predicted; actual costs 20–40% higher than estimates.
Root cause: Character-based estimation fails for code, non-English text, or special tokens.
Diagnostics: Track estimation error distribution by content type; flag systematic bias.
Mitigation: Use provider-specific token counting for pre-flight estimation (tiktoken for OpenAI models, Anthropic's token-counting API for Claude models); keep a 15% safety margin on budgets.
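The diagnostic step, tracking estimation error by content type, could look roughly like the sketch below. `EstimationErrorTracker` and its methods are hypothetical names; the 15% margin comes from the mitigation above, while correcting for observed bias before applying it is an added assumption.

```python
import math
from collections import defaultdict
from statistics import mean


class EstimationErrorTracker:
    """Tracks relative token-estimation error per content type (illustrative)."""

    def __init__(self, safety_margin: float = 0.15):
        self.safety_margin = safety_margin
        self._rel_errors = defaultdict(list)   # content_type -> [relative errors]

    def record(self, content_type: str, estimated_tokens: int, actual_tokens: int) -> None:
        # Positive values mean the estimate was too low.
        self._rel_errors[content_type].append(
            (actual_tokens - estimated_tokens) / estimated_tokens
        )

    def bias(self, content_type: str) -> float:
        """Mean relative error; 0.0 when nothing has been recorded yet."""
        errors = self._rel_errors.get(content_type)
        return mean(errors) if errors else 0.0

    def padded_estimate(self, content_type: str, estimated_tokens: int) -> int:
        """Correct for systematic underestimation, then add the safety margin."""
        correction = 1.0 + max(self.bias(content_type), 0.0)
        return math.ceil(estimated_tokens * correction * (1.0 + self.safety_margin))


if __name__ == "__main__":
    tracker = EstimationErrorTracker()
    tracker.record("code", estimated_tokens=100, actual_tokens=140)
    tracker.record("code", estimated_tokens=200, actual_tokens=260)
    print(tracker.bias("code"), tracker.padded_estimate("code", 100))
```

A consistently positive bias for one content type (code, for instance) is the "systematic bias" signal worth flagging.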
5. Streaming Budget Overrun
Symptom: Streaming responses exceed token limits mid-generation; difficult to truncate gracefully.
Root cause: Budget check at request start doesn't account for unconstrained output generation.
Diagnostics: Streaming token rate vs. budget burn-down; mid-stream termination frequency.
Mitigation: Accumulating token counter with hard stop; pre-negotiate max_tokens conservatively; graceful truncation message.
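A minimal sketch of the accumulating counter with a hard stop, written as a generator that wraps a provider's chunk stream. The function names, the truncation message, and the four-characters-per-token default are assumptions; in practice the counter would use the provider's tokenizer or the usage figures reported in the stream.

```python
from typing import Callable, Iterable, Iterator

TRUNCATION_NOTICE = "\n[Response truncated: token budget reached]"


def approx_tokens(text: str) -> int:
    """Crude estimate: about four characters per token, minimum one."""
    return max(1, len(text) // 4)


def budgeted_stream(chunks: Iterable[str], max_tokens: int,
                    count_tokens: Callable[[str], int] = approx_tokens) -> Iterator[str]:
    """Relay chunks until the next one would exceed max_tokens, then stop."""
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            # Hard stop: emit a graceful notice instead of the chunk.
            # The caller should also cancel the upstream provider request
            # so that no further tokens are billed.
            yield TRUNCATION_NOTICE
            return
        used += cost
        yield chunk


if __name__ == "__main__":
    fake_stream = ["The ", "quick ", "brown ", "fox ", "jumps"]
    print("".join(budgeted_stream(fake_stream, max_tokens=3)))
```

Setting the provider's max_tokens parameter conservatively remains the first line of defense; the wrapper covers the case where the budget shrinks while a response is already streaming.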
Performance & Scaling
Latency Budgets by Component
Production AI gateway latency decomposition (p95 targets):
- Request normalization: 5–10ms
- Prompt classification: 20–50ms (ML) or 5–10ms (heuristic)
- Budget check (Redis): 2–5ms
- Routing decision: 5–15ms
- Provider connection establishment: 50–200ms (mitigated via connection pooling)
- Model time-to-first-token (TTFT): 100–800ms (provider-dependent, dominant factor)
Total gateway overhead target: <100ms p95, excluding model TTFT.
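As a rough sanity check on that target, the upper ends of the component ranges above can be added up. Per-component p95 values do not sum exactly to the p95 of the whole pipeline, so this is a budgeting aid rather than a prediction; the dictionary below simply restates the figures from the list, assuming ML classification and pooled connections.

```python
# (low_ms, high_ms) p95 ranges restated from the component list above.
# Connection establishment is excluded on the assumption that pooling
# keeps it off the hot path; model TTFT is excluded by definition.
GATEWAY_COMPONENTS_MS = {
    "request_normalization": (5, 10),
    "prompt_classification_ml": (20, 50),
    "budget_check_redis": (2, 5),
    "routing_decision": (5, 15),
}
OVERHEAD_TARGET_MS = 100

best_case = sum(low for low, _ in GATEWAY_COMPONENTS_MS.values())
worst_case = sum(high for _, high in GATEWAY_COMPONENTS_MS.values())
headroom = OVERHEAD_TARGET_MS - worst_case

print(f"overhead range: {best_case}-{worst_case} ms, headroom: {headroom} ms")
```

The worst case leaves about 20ms of headroom, which a single unpooled connection setup (50–200ms) would exhaust. That is why connection pooling is treated as a requirement rather than an optimization.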
Scaling Patterns
Horizontal Scaling: Gateway instances are stateless; scale by request rate. Budget state externalized to Redis Cluster with hash tagging by tenant for locality.
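Hash tagging can be illustrated with a small key-naming helper. In Redis Cluster, only the text inside the first non-empty `{...}` in a key is hashed, so keys that share a tenant tag land in the same slot. The key layout and function names below are assumptions; the slot calculation is a sketch of the CRC16-mod-16384 rule, and a real client library computes it for you.

```python
import binascii


def budget_key(tenant_id: str, period: str) -> str:
    """Key layout (hypothetical) with the tenant ID as the hash tag."""
    return f"budget:{{{tenant_id}}}:{period}"


def cluster_slot(key: str) -> int:
    """Approximate Redis Cluster slot: CRC16 of the hash tag, mod 16384."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:   # ignore an empty "{}" tag
            data = data[start + 1:end]
    # crc_hqx uses the CRC-CCITT polynomial; with an initial value of 0
    # it corresponds to the XMODEM variant used for cluster key hashing.
    return binascii.crc_hqx(data, 0) % 16384


if __name__ == "__main__":
    for period in ("daily", "monthly"):
        key = budget_key("acme", period)
        print(key, cluster_slot(key))
```

Because both keys hash on `acme`, a multi-key operation or Lua script that touches a tenant's daily and monthly counters stays on a single node.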
Regional Deployment: Deploy gateway edge nodes in 3+ regions with provider endpoint affinity. Latency predictor models trained per region; don't assume US-East latency applies to APAC.
Connection Pool Tuning: Maintain 50–200 persistent connections per provider endpoint per gateway instance. HTTP/2 multiplexing essential for high throughput.
Monitoring KPIs
| Metric | Target | Alert Threshold |
|---|---|---|
| Gateway p99 latency (excl. model) | <150ms | >300ms |
| End-to-end p99 latency | <2000ms | >5000ms |
| Cost per 1K requests | Baseline -5% | >Baseline +20% |
| Budget exhaustion rate | <1% of users/month | >5% |
| Circuit breaker open frequency | <0.1% of requests | >1% |
| Classification accuracy (vs. human) | >90% | <85% |
Organizations seeking to optimize broader infrastructure costs alongside AI spend should examine multi-cloud Kubernetes cost optimization strategies, as gateway infrastructure typically deploys on container platforms with similar efficiency opportunities.
Production Best Practices
Security
- Prompt injection defense: Layered filtering—structural analysis first (fast), embedding similarity second (thorough). Never rely solely on provider-side filtering.
- Credential isolation: Provider API keys in dedicated secret store (HashiCorp Vault, AWS Secrets Manager); gateway uses short-lived tokens with automatic rotation.
- Audit retention: 90-day minimum for routing decisions, classification inputs, and spend attribution; encrypted at rest with key rotation.
Testing
- Shadow traffic: Route 1% production traffic through new classifier versions; compare routing decisions without user impact.
- Chaos engineering: Regularly inject provider latency spikes and failures; verify circuit breaker and fallback behavior.
- Cost regression tests: Fixed prompt corpus with known optimal routing; alert when gateway selects suboptimal model.
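A cost regression check over a fixed corpus might be sketched as follows. The tier ranking mirrors the relative prices in Pattern 1's MODEL_MATRIX; the function name, the corpus, and the toy router are hypothetical and stand in for the gateway's real classification step.

```python
# Cheapest to most expensive, following Pattern 1's per-1K prices.
TIER_COST_RANK = {"simple": 0, "code": 1, "complex": 2}


def find_cost_regressions(corpus, route_fn, cost_rank=TIER_COST_RANK):
    """Return (prompt, expected, actual) wherever routing got more expensive."""
    regressions = []
    for prompt, expected_tier in corpus:
        actual_tier = route_fn(prompt)
        if cost_rank[actual_tier] > cost_rank[expected_tier]:
            regressions.append((prompt, expected_tier, actual_tier))
    return regressions


if __name__ == "__main__":
    corpus = [
        ("What are your opening hours?", "simple"),
        ("Compare these two contracts clause by clause.", "complex"),
    ]

    def toy_router(prompt: str) -> str:
        # Deliberately over-eager: sends everything to the expensive tier.
        return "complex"

    for prompt, expected, actual in find_cost_regressions(corpus, toy_router):
        print(f"REGRESSION: {prompt!r} expected {expected}, got {actual}")
```

Only routing to a more expensive tier counts as a regression here. Routing to a cheaper tier than expected is a quality risk and would need its own check.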
Rollout
- Canary by tenant tier: Free users first (accept higher error rate), paid users after 48-hour stability validation.
- Budget buffer: Maintain 20% emergency reserve for unexpected traffic patterns; auto-scale reserve with monthly growth.
Runbooks
Cost Spike Response (P1):
- Identify tenant/user causing spike via cost attribution dashboard
- Apply emergency rate limit (10 RPM) to identified entity
- Switch entity to "economy mode" (GPT-3.5 only, no streaming)
- Notify account team; preserve audit trail
- Post-incident: classify root cause (legitimate growth vs. misuse vs. classifier failure)
Provider Outage Response (P0):
- Confirm circuit breaker opened automatically
- Verify traffic shifted to fallback providers
- If all providers degraded: queue non-critical requests, degrade to cached responses for FAQs
- Communicate status page update
- Post-recovery: analyze latency impact, tune circuit thresholds if needed
Further Reading & References
- OpenAI API Documentation: "Managing Rate Limits and Token Usage" — https://platform.openai.com/docs/guides/rate-limits
- Anthropic Documentation: "Token Counting and Cost Optimization" — https://docs.anthropic.com/en/docs/build-with-claude/token-counting
- LiteLLM Gateway: Open-source unified LLM API with routing — https://docs.litellm.ai/docs/proxy/quick_start
- Cloudflare AI Gateway: Edge-deployed LLM routing and caching — https://developers.cloudflare.com/ai-gateway/
- Kong AI Gateway: Enterprise AI traffic management — https://docs.konghq.com/gateway/latest/ai-gateway/
- "Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection", Greshake et al., 2023. arXiv:2302.12173