AI Gateways: Production LLM Routing & Cost Control

Introduction

Figure: AI gateway routing requests to multiple LLMs, with cost controls and monitoring.

Production AI systems fail when LLM costs spiral unpredictably and latency spikes crater user experience. The root cause is rarely the models themselves—it's the absence of intelligent traffic management between them. An AI gateway is the critical infrastructure layer that routes requests across multiple LLMs, enforces token budgets, shapes traffic patterns, and maintains sub-second p95 latency while capping spend.

This article delivers field-tested patterns for building and operating AI gateways at scale: from request classification algorithms to prompt firewalls, from budget enforcement mechanisms to failover orchestration. You'll leave with concrete implementation code, diagnostic runbooks, and a decision framework for gateway selection. For teams building comprehensive AI observability alongside their gateway, native LLM tracing with OpenTelemetry provides the telemetry foundation needed for effective routing decisions.

Failure scenario: A fintech startup running customer-facing chatbot inference saw monthly OpenAI costs jump 340% in two weeks. Investigation revealed no traffic increase—instead, a marketing campaign triggered a surge in long-form legal queries hitting GPT-4 instead of cached GPT-3.5 responses. Without request classification or spend caps, the gateway-less architecture hemorrhaged budget until manual intervention. Recovery took 72 hours; customer trust took longer.

Executive Summary

TL;DR: An AI gateway routes LLM requests through policy-driven classification, budget-aware model selection, and adaptive failover—typically reducing inference costs 40–70% while maintaining p95 latency under 500ms.

Key Takeaways

  • Request classification is the control plane: Semantic analysis of prompts (complexity, sensitivity, context window needs) determines routing decisions before token generation begins.
  • Token budget enforcement prevents spend explosions: Hard caps per user/session with graceful degradation to cheaper models or cached responses.
  • Latency-aware routing requires predictive modeling: Historical p99 latency by model, region, and time-of-day drives real-time selection, not just static rules.
  • Prompt firewalls block cost attacks: Input validation, PII detection, and prompt injection filtering occur at the edge, before expensive model calls.
  • Circuit breakers isolate provider failures: Sub-5-second detection of degraded endpoints with automatic traffic shift and queue draining.
  • Observability is non-negotiable: Per-request cost attribution, token-level tracing, and budget burn-down dashboards enable proactive intervention.

Direct Answers (Q→A)

  • How does an AI gateway route requests across multiple LLMs to control cost and latency? — It classifies prompt complexity, checks token budgets, selects the cheapest capable model using predictive latency models, and fails over automatically when providers degrade.
  • What prevents runaway LLM spending in production? — Per-entity token budgets with hard enforcement, spend-rate alerting, and automatic model downgrading when thresholds approach.
  • When should I use a dedicated gateway versus direct provider SDKs? — At >100K monthly requests, multi-provider requirements, or when cost predictability is a business requirement.

How AI Gateways for LLM Routing and Cost Control Work Under the Hood

Architecture Overview

An AI gateway operates as a reverse proxy with deep semantic awareness. Unlike traditional API gateways that route by URL patterns, an LLM routing layer analyzes request payload characteristics to make model selection decisions. The core components form a pipeline:

1. Request Ingress & Normalization
Incoming requests are validated against OpenAPI schemas, normalized to a canonical internal format (typically OpenAI-compatible), and assigned a correlation ID for distributed tracing. This layer handles authentication, rate limiting, and initial PII detection.

2. Prompt Classification Engine
The critical differentiation point. A lightweight classifier (often a small fine-tuned model or heuristic ensemble) scores prompts on:

  • Complexity score (0–1): Reasoning depth required, extracted via prompt structure analysis and few-shot classification
  • Context sensitivity: Hallucination risk for factual queries, determined by domain keyword matching
  • Token demand estimate: Projected input+output tokens based on prompt length, conversation history, and output format specifications
  • Latency tolerance: Explicit or inferred from request headers (e.g., streaming vs. batch indicators)

3. Policy Engine & Model Selection
Classification scores feed into a rules engine evaluating:

  • Per-user/tenant token budget remaining
  • Model capability matrix (which models can satisfy complexity requirements)
  • Real-time cost and latency telemetry
  • Business priority tiers (free vs. paid user differentiation)

The selection algorithm minimizes cost subject to latency and quality constraints—a constrained optimization solved via weighted scoring or integer programming for complex multi-objective scenarios.

4. Budget Enforcement & Quota Management

Token budget enforcement operates at multiple time horizons:

  • Request-level: Pre-flight check against remaining budget; reject or downgrade if insufficient
  • Streaming-level: Accumulating token counter with mid-stream cutoff when limits exceeded
  • Session-level: Conversation state tracking with budget carry-forward
  • Periodic reset: Daily/weekly/monthly quota replenishment with notification hooks

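Budget state is typically stored in Redis or a similar in-memory store, using atomic check-and-decrement operations to prevent race conditions under high concurrency. The server-side operation itself takes well under a millisecond; the network round trip accounts for most of the latency the gateway observes.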
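As a rough illustration of the atomic check-and-decrement idea, the sketch below pairs a Lua script that a Redis-backed gateway could run via EVAL with an in-memory stand-in that mirrors its semantics behind a lock. The key name, the whole-token units, and the `InMemoryBudgetStore` and `simulate_concurrent_spend` names are assumptions made for this example, not part of any specific gateway.

```python
import threading

# Lua script for Redis EVAL: the check and the decrement run as one atomic step.
# Assumes the budget is stored as an integer token count under KEYS[1].
CHECK_AND_DECREMENT_LUA = """
local remaining = tonumber(redis.call('GET', KEYS[1]) or '0')
local cost = tonumber(ARGV[1])
if remaining >= cost then
    redis.call('DECRBY', KEYS[1], cost)
    return 1
end
return 0
"""


class InMemoryBudgetStore:
    """Hypothetical stand-in for Redis; a lock gives the same atomicity."""

    def __init__(self):
        self._lock = threading.Lock()
        self._remaining = {}

    def set_budget(self, key: str, tokens: int) -> None:
        with self._lock:
            self._remaining[key] = tokens

    def try_spend(self, key: str, tokens: int) -> bool:
        """Deduct `tokens` only if enough budget remains, as one atomic step."""
        with self._lock:
            remaining = self._remaining.get(key, 0)
            if remaining >= tokens:
                self._remaining[key] = remaining - tokens
                return True
            return False

    def remaining(self, key: str) -> int:
        with self._lock:
            return self._remaining.get(key, 0)


def simulate_concurrent_spend(store, key: str, cost: int, workers: int) -> int:
    """Fire `workers` threads at the same budget and count the approvals."""
    approvals = []

    def worker():
        if store.try_spend(key, cost):
            approvals.append(1)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(approvals)
```

With a 1,000-token budget and ten concurrent 300-token requests, only three can be approved, however the threads interleave. A separate read followed by a separate write would not give that guarantee.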

5. Execution & Adaptive Routing
The selected model receives the request via a provider-specific adapter. The gateway maintains connection pools, handles retries with exponential backoff, and implements LLM traffic shaping through:

  • Request batching for compatible queries
  • Priority queuing with weighted fair queuing (WFQ) algorithms
  • Backpressure propagation when downstream providers throttle
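The retry schedule mentioned above can be sketched in a few lines. This is a minimal example assuming a 0.5-second base delay and an 8-second cap (both illustrative values); `backoff_delays` is a hypothetical helper name. Passing a random generator enables "full jitter", which spreads retries out so that many clients do not hit a recovering provider at the same instant.

```python
import random
from typing import List, Optional


def backoff_delays(
    attempts: int,
    base_s: float = 0.5,
    cap_s: float = 8.0,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Return the sleep before each retry: base * 2^n, capped at cap_s.

    If `rng` is given, each delay is drawn uniformly from [0, ceiling]
    (full jitter); otherwise the deterministic ceiling is returned.
    """
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling) if rng else ceiling)
    return delays
```

Without jitter, five attempts yield delays of 0.5, 1, 2, 4, and 8 seconds. Any further attempts stay at the cap.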

6. Response Processing & Feedback
Output passes through content filtering, format normalization, and cost attribution. Token counts and latency measurements feed back into predictive models for subsequent routing decisions.

Prompt Firewall: Security at the Edge

The prompt firewall layer executes before any model invocation, preventing both security exploits and cost attacks:

| Filter Category | Detection Method | Response Action |
|---|---|---|
| Prompt Injection | Delimiter analysis, instruction conflict detection, embedding similarity to known attack patterns | Block + alert; sanitized rewrite for low-confidence matches |
| PII Exposure | Regex + NER models for SSNs, credit cards, health identifiers | Redaction or request rejection based on policy |
| Cost Attack (Token Bomb) | Unusual token density patterns, repetitive structure designed to maximize output length | Hard cap enforcement + user flagging |
| Jailbreak Attempts | Role-play framing, encoding obfuscation, refusal suppression patterns | Block + security audit log |
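To make the regex half of PII detection concrete, here is a minimal sketch that redacts US-style SSNs and card numbers before a prompt reaches any model. The patterns, placeholder strings, and function names are assumptions for illustration. A production firewall would combine this with NER models, as the PII row describes, and with far broader patterns.

```python
import re

# Illustrative patterns only; real deployments need broader coverage.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")  # 13 to 19 digits


def luhn_valid(digits: str) -> bool:
    """Luhn checksum, used to cut false positives on long digit runs."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_pii(text: str):
    """Return (redacted_text, findings) for SSNs and Luhn-valid card numbers."""
    findings = []

    def _ssn(match):
        findings.append("ssn")
        return "[REDACTED_SSN]"

    def _card(match):
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):
            findings.append("credit_card")
            return "[REDACTED_CARD]"
        return match.group()  # Not a plausible card number; leave untouched

    text = SSN_RE.sub(_ssn, text)
    text = CARD_RE.sub(_card, text)
    return text, findings
```

Whether a finding leads to redaction or outright rejection is a policy decision. Returning the findings list alongside the text lets the policy engine make that call.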

Implementation: Production Patterns

Pattern 1: Basic Gateway with Static Routing

Starting point for teams with single-provider usage seeking cost visibility:

import asyncio
import time
from dataclasses import dataclass
from typing import Optional


class BudgetExceededError(Exception):
    """Raised when a user or tenant has no remaining budget."""

@dataclass
class RoutingDecision:
    provider: str
    model: str
    estimated_cost_usd: float
    latency_budget_ms: int

class StaticAIGateway:
    """Simple gateway with model selection based on prompt complexity heuristics."""
    
    MODEL_MATRIX = {
        'simple': {'provider': 'openai', 'model': 'gpt-3.5-turbo', 'cost_per_1k': 0.002},
        'complex': {'provider': 'openai', 'model': 'gpt-4', 'cost_per_1k': 0.06},
        'code': {'provider': 'anthropic', 'model': 'claude-3-sonnet', 'cost_per_1k': 0.015}
    }
    
    def __init__(self, budget_manager):
        self.budget_manager = budget_manager
        self.request_log = []
    
    async def classify_prompt(self, prompt: str, context: Optional[list] = None) -> str:
        """Heuristic classification: token count + keyword matching."""
        token_estimate = len(prompt) // 4  # Rough approximation
        
        # Complexity indicators
        code_keywords = ['def ', 'class ', 'import ', 'function', 'algorithm']
        reasoning_keywords = ['explain', 'analyze', 'compare', 'evaluate', 'synthesize']
        
        if any(kw in prompt.lower() for kw in code_keywords):
            return 'code'
        elif token_estimate > 500 or any(kw in prompt.lower() for kw in reasoning_keywords):
            return 'complex'
        return 'simple'
    
    async def route(self, user_id: str, prompt: str, context: Optional[list] = None):
        # Budget check
        remaining = await self.budget_manager.get_remaining(user_id)
        if remaining <= 0:
            raise BudgetExceededError(f"User {user_id} budget depleted")
        
        # Classification and routing
        complexity = await self.classify_prompt(prompt, context)
        config = self.MODEL_MATRIX[complexity]
        
        # Cost estimation and enforcement
        estimated_tokens = len(prompt) // 4 + 500  # Input + assumed output
        estimated_cost = (estimated_tokens / 1000) * config['cost_per_1k']
        
        if estimated_cost > remaining * 0.5:  # Conservative: don't spend >50% remaining
            # Downgrade to cheaper model if possible
            if complexity != 'simple':
                config = self.MODEL_MATRIX['simple']
                estimated_cost = (estimated_tokens / 1000) * config['cost_per_1k']
        
        return RoutingDecision(
            provider=config['provider'],
            model=config['model'],
            estimated_cost_usd=estimated_cost,
            latency_budget_ms=2000 if complexity == 'simple' else 8000
        )

Pattern 2: Dynamic Latency-Aware Routing

Production-grade implementation with predictive latency modeling and circuit breakers:

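import statistics
import time
from collections import deque
from enum import Enum


class NoAvailableModelError(Exception):
    """Raised when no model satisfies the capability, latency, and budget constraints."""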

class CircuitState(Enum):
    CLOSED = "closed"      # Normal operation
    OPEN = "open"          # Failing, reject fast
    HALF_OPEN = "half_open"  # Testing recovery

class LatencyPredictor:
    """Exponentially weighted moving average for per-model latency."""
    
    def __init__(self, alpha=0.3, window_size=100):
        self.alpha = alpha  # Smoothing factor: weight given to the newest sample
        self.windows = {}   # model -> deque of latencies
        self.ewma = {}      # model -> current EWMA
        self.window_size = window_size
    
    def record(self, model: str, latency_ms: float, success: bool):
        if model not in self.windows:
            self.windows[model] = deque(maxlen=self.window_size)
            self.ewma[model] = latency_ms
        
        self.windows[model].append((latency_ms, success))
        
        # Update EWMA
        if success:
            self.ewma[model] = (self.alpha * latency_ms + 
                               (1 - self.alpha) * self.ewma[model])
    
    def predict_p99(self, model: str) -> float:
        """Conservative tail-latency estimate from recent successful calls."""
        if model not in self.windows or len(self.windows[model]) < 10:
            return float('inf')  # Unknown = risky
        
        # Use only successful calls among the 20 most recent samples; failed
        # calls (fast errors or timeouts) would skew the estimate.
        recent = list(self.windows[model])[-20:]
        vals = [lat for lat, ok in recent if ok]
        if len(vals) < 5:
            return float('inf')
        
        mean = statistics.mean(vals)
        std = statistics.stdev(vals)  # Safe: at least 5 samples
        
        # Mean + 3 std dev deliberately overshoots a Gaussian p99
        # (roughly mean + 2.33 std dev) to leave headroom for heavy tails.
        return mean + 3 * std

class CircuitBreaker:
    """Per-provider circuit breaker with a fixed failure threshold and recovery timeout."""
    
    def __init__(self, failure_threshold=5, recovery_timeout=30):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.states = {}      # provider -> CircuitState
        self.failure_counts = {}  # provider -> consecutive failures
        self.last_failure_time = {}
    
    def call(self, provider: str) -> bool:
        """Returns True if call should proceed."""
        state = self.states.get(provider, CircuitState.CLOSED)
        
        if state == CircuitState.OPEN:
            # Check if recovery timeout elapsed
            last_fail = self.last_failure_time.get(provider, 0)
            if time.time() - last_fail > self.recovery_timeout:
                self.states[provider] = CircuitState.HALF_OPEN
                return True
            return False
        
        return True
    
    def record_result(self, provider: str, success: bool):
        if success:
            if self.states.get(provider) == CircuitState.HALF_OPEN:
                self.states[provider] = CircuitState.CLOSED
                self.failure_counts[provider] = 0
            else:
                self.failure_counts[provider] = 0
        else:
            self.failure_counts[provider] = self.failure_counts.get(provider, 0) + 1
            self.last_failure_time[provider] = time.time()
            
            if self.failure_counts[provider] >= self.failure_threshold:
                self.states[provider] = CircuitState.OPEN
                # Alert: provider circuit opened

class DynamicAIGateway:
    """Production gateway with predictive routing and resilience patterns."""
    
    def __init__(self):
        self.latency_predictor = LatencyPredictor()
        self.circuit_breaker = CircuitBreaker()
        self.budget_manager = TokenBudgetManager()  # Redis-backed
        self.model_profiles = self._load_model_profiles()
    
    def _load_model_profiles(self):
        """Capability and cost matrix for available models."""
        return {
            'gpt-3.5-turbo': {
                'provider': 'openai',
                'max_tokens': 4096,
                'cost_input_1k': 0.0015,
                'cost_output_1k': 0.002,
                'strengths': ['simple_qa', 'summarization', 'formatting'],
                'context_window': 16385
            },
            'gpt-4': {
                'provider': 'openai',
                'max_tokens': 8192,
                'cost_input_1k': 0.03,
                'cost_output_1k': 0.06,
                'strengths': ['reasoning', 'analysis', 'complex_instruction'],
                'context_window': 8192
            },
            'claude-3-sonnet': {
                'provider': 'anthropic',
                'max_tokens': 4096,
                'cost_input_1k': 0.003,
                'cost_output_1k': 0.015,
                'strengths': ['reasoning', 'analysis', 'long_context'],
                'context_window': 200000
            },
            'claude-3-haiku': {
                'provider': 'anthropic',
                'max_tokens': 4096,
                'cost_input_1k': 0.00025,
                'cost_output_1k': 0.00125,
                'strengths': ['speed', 'simple_qa', 'classification'],
                'context_window': 200000
            }
        }
    
    async def route(self, request: dict) -> dict:
        """Multi-objective optimization: minimize cost, meet latency, satisfy quality."""
        
        # Extract requirements
        required_strengths = request.get('required_capabilities', ['simple_qa'])
        max_latency_ms = request.get('max_latency_ms', 3000)
        user_id = request['user_id']
        prompt_tokens = estimate_tokens(request['prompt'])
        
        # Budget check
        remaining_budget = await self.budget_manager.check_and_reserve(
            user_id, estimated_max_cost=0.50  # Reserve $0.50
        )
        
        # Filter: capability match + circuit closed
        candidates = []
        for model_id, profile in self.model_profiles.items():
            provider = profile['provider']
            
            # Circuit breaker check
            if not self.circuit_breaker.call(provider):
                continue
            
            # Capability match
            if not all(s in profile['strengths'] for s in required_strengths):
                continue
            
            # Context window check
            if prompt_tokens > profile['context_window'] * 0.8:
                continue
            
            # Latency prediction
            predicted_latency = self.latency_predictor.predict_p99(model_id)
            if predicted_latency > max_latency_ms * 1.2:  # 20% headroom
                continue
            
            # Cost calculation
            estimated_output = min(request.get('max_tokens', 500), profile['max_tokens'])
            cost = (prompt_tokens / 1000 * profile['cost_input_1k'] +
                   estimated_output / 1000 * profile['cost_output_1k'])
            
            if cost > remaining_budget:
                continue
            
            candidates.append({
                'model_id': model_id,
                'provider': provider,
                'predicted_latency_ms': predicted_latency,
                'estimated_cost_usd': cost,
                'score': self._score(cost, predicted_latency, profile)
            })
        
        if not candidates:
            raise NoAvailableModelError("No models meet constraints")
        
        # Select optimal: lowest composite score (weighted cost and latency, see _score)
        best = min(candidates, key=lambda x: x['score'])
        
        return {
            'model_id': best['model_id'],
            'provider': best['provider'],
            'estimated_cost_usd': best['estimated_cost_usd'],
            'predicted_latency_ms': best['predicted_latency_ms'],
            'fallback_chain': self._build_fallbacks(best, candidates)
        }
    
    def _score(self, cost, latency, profile) -> float:
        """Composite score: lower is better. Weights configurable by use case."""
        cost_weight = 0.6
        latency_weight = 0.4
        # Normalize: assume max cost $0.10, max latency 10000ms
        return (cost / 0.10) * cost_weight + (latency / 10000) * latency_weight
    
    def _build_fallbacks(self, primary, all_candidates):
        """Ordered fallback chain for resilience."""
        fallbacks = [c for c in all_candidates 
                    if c['model_id'] != primary['model_id']]
        # Cheapest remaining candidates first (capability similarity is not modeled here)
        return sorted(fallbacks, key=lambda x: x['estimated_cost_usd'])[:2]

Pattern 3: Enterprise Gateway with Full Policy Engine

Advanced implementation with semantic classification, content filtering, and audit logging:

class EnterpriseAIGateway:
    """
    Production gateway with:
    - Fine-tuned prompt classifier
    - Content safety filtering
    - Comprehensive audit logging
    - Multi-tenant budget isolation
    """
    
    def __init__(self, config):
        self.classifier = load_fine_tuned_classifier(config.classifier_path)
        self.safety_filter = ContentSafetyFilter(config.safety_rules)
        self.budget_manager = MultiTenantBudgetManager(
            redis_pool=config.redis,
            default_daily_budget=config.default_budget_usd
        )
        self.audit_logger = StructuredAuditLogger(config.log_endpoint)
        self.routing_engine = DynamicAIGateway()  # From Pattern 2
    
    async def process_request(self, request: dict, tenant_id: str) -> dict:
        request_id = generate_uuid()
        start_time = time.monotonic()
        
        try:
            # Phase 1: Validation & Normalization
            normalized = self._normalize_request(request)
            
            # Phase 2: Prompt Firewall (security & safety)
            safety_result = await self.safety_filter.analyze(normalized['prompt'])
            if safety_result.action == 'block':
                await self.audit_logger.log({
                    'request_id': request_id,
                    'tenant_id': tenant_id,
                    'action': 'blocked',
                    'reason': safety_result.block_reason,
                    'latency_ms': elapsed_ms(start_time)
                })
                raise SafetyViolationError(safety_result.block_reason)
            
            # Phase 3: Semantic Classification
            classification = await self.classifier.classify(
                normalized['prompt'],
                conversation_history=normalized.get('messages', [])
            )
            # Expected fields: complexity_score, domain, sensitivity_level, estimated_tokens,
            # estimated_cost, required_capabilities, target_tier
            
            # Phase 4: Budget Enforcement with Tenant Isolation
            budget_status = await self.budget_manager.check(
                tenant_id=tenant_id,
                user_id=normalized.get('user_id'),
                estimated_cost=classification.estimated_cost,
                sensitivity=classification.sensitivity_level
            )
            
            if budget_status.action == 'reject':
                raise BudgetExceededError(budget_status.message)
            elif budget_status.action == 'downgrade':
                classification.target_tier = 'economy'
            
            # Phase 5: Route Selection
            routing_request = {
                'required_capabilities': classification.required_capabilities,
                'max_latency_ms': normalized.get('max_latency_ms', 3000),
                'target_tier': classification.target_tier,
                'prompt': normalized['prompt'],
                'estimated_tokens': classification.estimated_tokens,
                'user_id': normalized.get('user_id')
            }
            
            route = await self.routing_engine.route(routing_request)
            
            # Phase 6: Execute with Circuit Breaker & Retry
            result = await self._execute_with_resilience(route, normalized)
            
            # Phase 7: Post-processing & Cost Attribution
            actual_cost = calculate_actual_cost(
                route['model_id'], 
                result['usage']['prompt_tokens'],
                result['usage']['completion_tokens']
            )
            await self.budget_manager.commit_spend(tenant_id, actual_cost)
            
            # Phase 8: Audit & Telemetry
            await self.audit_logger.log({
                'request_id': request_id,
                'tenant_id': tenant_id,
                'model_id': route['model_id'],
                'classification': classification.to_dict(),
                'routing_decision': route,
                'actual_cost_usd': actual_cost,
                'latency_ms': elapsed_ms(start_time),
                'tokens': result['usage'],
                'safety_flags': safety_result.flags
            })
            
            return {
                'response': result['content'],
                'model_used': route['model_id'],
                'cost_usd': actual_cost,
                'latency_ms': elapsed_ms(start_time)
            }
            
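        except SafetyViolationError:
            raise  # Already audited as 'blocked' in Phase 2; avoid a duplicate log entry
        except Exception as e: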
            await self.audit_logger.log({
                'request_id': request_id,
                'tenant_id': tenant_id,
                'action': 'error',
                'error_type': type(e).__name__,
                'latency_ms': elapsed_ms(start_time)
            })
            raise

Comparisons & Decision Framework

Gateway Implementation Options

| Approach | Best For | Time to Production | Control Level | Ongoing Cost |
|---|---|---|---|---|
| Open-source (e.g., LiteLLM) | Teams with existing K8s that need multi-provider support quickly | 1–2 weeks | Medium (configurable rules) | Infrastructure only |
| Commercial (Kong AI Gateway, Cloudflare AI Gateway) | Enterprises needing SLAs, compliance, minimal ops | Days | Low-Medium (vendor-defined) | Per-request + platform fees |
| Custom Build (Patterns above) | Unique classification needs, deep cost optimization | 6–10 weeks | High (full customization) | Team + infrastructure |
| Provider-Native (Azure AI Gateway) | Single-provider deployments, existing Azure estate | 1–2 weeks | Low (provider-locked) | Azure consumption |

Decision Checklist

Use this framework when evaluating gateway approaches:

  1. Scale threshold: >100K monthly requests justifies dedicated gateway; <10K use provider SDKs with client-side caching
  2. Multi-provider requirement: If yes, eliminate provider-native options
  3. Classification complexity: Need custom semantic routing? → Custom build or extensible open-source
  4. Compliance requirements: SOC 2, GDPR, HIPAA? → Commercial with audit guarantees or custom with legal review
  5. Team bandwidth: <2 FTEs available? → Commercial or managed open-source
  6. Latency sensitivity: p99 <200ms requirement? → Custom with edge deployment, avoid multi-hop commercial
  7. Budget volatility tolerance: >50% monthly variance unacceptable? → Mandatory custom budget enforcement

Those operating agentic systems at scale should reference field-tested production observability frameworks for agentic AI to ensure gateway telemetry integrates with broader system health monitoring.

Failure Modes & Edge Cases

Critical Failure Scenarios

1. Classification Cascade Failure
Symptom: Simple queries routed to expensive models, costs spike 3–5x without traffic increase.
Root cause: Prompt classifier drift or poisoning; training data no longer represents production distribution.
Diagnostics: Monitor classification confidence scores; alert on entropy spike. Track routing distribution by complexity bucket.
Mitigation: Fallback to heuristic classifier when ML confidence <0.7. A/B test classifier updates with 5% traffic shadowing.
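A minimal sketch of the confidence fallback and the routing-distribution check follows. It assumes the ML classifier returns a `(label, confidence)` pair and that "entropy" here means the Shannon entropy of the labels routed in a recent window. Both are interpretations for this example, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Iterable, Tuple

CONFIDENCE_FLOOR = 0.7  # Threshold taken from the mitigation above


def classify_with_fallback(
    prompt: str,
    ml_classifier: Callable[[str], Tuple[str, float]],
    heuristic_classifier: Callable[[str], str],
    floor: float = CONFIDENCE_FLOOR,
) -> Tuple[str, str]:
    """Return (label, source); use the heuristic when ML confidence is low."""
    label, confidence = ml_classifier(prompt)
    if confidence < floor:
        return heuristic_classifier(prompt), "heuristic"
    return label, "ml"


def routing_entropy(labels: Iterable[str]) -> float:
    """Shannon entropy in bits of the routing distribution over a window."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

A window routed entirely to one bucket has zero entropy, and an even split across two buckets has one bit. An alert would compare the current window against a historical baseline rather than a fixed value.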

2. Budget Race Condition
Symptom: Users exceed stated limits; spend continues post-budget-exhaustion.
Root cause: Non-atomic budget checks; concurrent requests pass check before any commits spend.
Diagnostics: Redis MONITOR for competing decrements; compare sum(spend) vs. budget limit.
Mitigation: Lua script atomic check-and-decrement; or pessimistic reservation (as with check_and_reserve in Pattern 2 above).

3. Circuit Breaker Oscillation
Symptom: Rapid provider switching; latency worse than single-provider baseline.
Root cause: Overly sensitive failure detection (threshold too low) or aggressive recovery timeout.
Diagnostics: Circuit state transition frequency; correlate with error rate (not just latency).
Mitigation: Adaptive thresholds based on historical error rate; minimum 60-second open duration.

4. Token Count Estimation Drift
Symptom: Budget exhaustion before predicted; actual costs 20–40% higher than estimates.
Root cause: Character-based estimation fails for code, non-English text, or special tokens.
Diagnostics: Track estimation error distribution by content type; flag systematic bias.
Mitigation: Use provider-specific tokenizers (tiktoken, Anthropic tokenizer) for pre-flight estimation; 15% safety margin on budgets.
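The diagnostics and the safety margin can be sketched without committing to a particular tokenizer. The example below assumes the gateway records its pre-flight estimate alongside the provider-reported token count for each request. `EstimationErrorTracker` and `with_safety_margin` are hypothetical names, and the 15% default mirrors the mitigation above.

```python
import math
from collections import defaultdict
from statistics import mean


def with_safety_margin(estimated_tokens: int, margin_pct: int = 15) -> int:
    """Inflate a pre-flight estimate by a percentage margin, rounding up."""
    return math.ceil(estimated_tokens * (100 + margin_pct) / 100)


class EstimationErrorTracker:
    """Tracks relative estimation error per content type to expose bias."""

    def __init__(self):
        self._errors = defaultdict(list)

    def record(self, content_type: str, estimated: int, actual: int) -> None:
        # Positive error means the estimate was too low (under-budgeted).
        if actual > 0:
            self._errors[content_type].append((actual - estimated) / actual)

    def bias(self, content_type: str) -> float:
        """Mean relative error; 0.0 when nothing has been recorded."""
        errors = self._errors.get(content_type)
        return mean(errors) if errors else 0.0
```

A persistently positive bias for one content type, such as code, suggests the character-based heuristic is undercounting there and that a provider tokenizer should take over for those requests.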

5. Streaming Budget Overrun
Symptom: Streaming responses exceed token limits mid-generation; difficult to truncate gracefully.
Root cause: Budget check at request start doesn't account for unconstrained output generation.
Diagnostics: Streaming token rate vs. budget burn-down; mid-stream termination frequency.
Mitigation: Accumulating token counter with hard stop; pre-negotiate max_tokens conservatively; graceful truncation message.
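The accumulating counter with a hard stop can be sketched as a generator that wraps the provider stream. In this example the whitespace word count stands in for a real tokenizer, and the truncation notice text and the `budgeted_stream` name are assumptions.

```python
from typing import Callable, Iterable, Iterator

TRUNCATION_NOTICE = "\n[Response truncated: token budget reached]"


def budgeted_stream(
    chunks: Iterable[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
    notice: str = TRUNCATION_NOTICE,
) -> Iterator[str]:
    """Yield chunks until the next one would exceed max_tokens, then stop.

    The default count_tokens is a crude word count used only for this
    sketch; pass a provider tokenizer in practice.
    """
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            yield notice  # Graceful truncation message instead of a silent cut
            return
        used += cost
        yield chunk
```

Returning from the generator stops consumption of the upstream iterator. A real gateway would also need to close the provider connection so generation, and billing, actually ends.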

Performance & Scaling

Latency Budgets by Component

Production AI gateway latency decomposition (p95 targets):

  • Request normalization: 5–10ms
  • Prompt classification: 20–50ms (ML) or 5–10ms (heuristic)
  • Budget check (Redis): 2–5ms
  • Routing decision: 5–15ms
  • Provider connection establishment: 50–200ms (mitigated via connection pooling)
  • Model time-to-first-token (TTFT): 100–800ms (provider-dependent, dominant factor)

Total gateway overhead target: <100ms p95, excluding model TTFT.
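As a quick sanity check on that target, the component ranges above can be summed per classifier type. The sketch assumes connection setup is amortized by pooling and therefore excluded. Adding per-component p95 values gives only a rough upper bound on the combined p95, not an exact figure, and the dictionary keys and function names are illustrative.

```python
# (low_ms, high_ms) p95 targets copied from the component list above.
COMPONENT_BUDGET_MS = {
    "normalization": (5, 10),
    "classification_ml": (20, 50),
    "classification_heuristic": (5, 10),
    "budget_check": (2, 5),
    "routing_decision": (5, 15),
}


def overhead_range(classifier: str):
    """Sum gateway-side components for 'ml' or 'heuristic' classification."""
    parts = [
        "normalization",
        f"classification_{classifier}",
        "budget_check",
        "routing_decision",
    ]
    low = sum(COMPONENT_BUDGET_MS[p][0] for p in parts)
    high = sum(COMPONENT_BUDGET_MS[p][1] for p in parts)
    return low, high


def fits_target(classifier: str, target_ms: int = 100) -> bool:
    """True when even the high end of the range stays under the target."""
    return overhead_range(classifier)[1] < target_ms
```

The ML path comes to 32 to 80 ms and the heuristic path to 17 to 40 ms, so both fit under 100 ms only as long as pooled connections keep the 50 to 200 ms connection setup off the request path.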

Scaling Patterns

Horizontal Scaling: Gateway instances are stateless; scale by request rate. Budget state externalized to Redis Cluster with hash tagging by tenant for locality.
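Hash tagging works because Redis Cluster hashes only the substring between the first `{` and the next `}` when that substring is non-empty, so every key carrying the same tag lands in the same slot. The sketch below reproduces the slot calculation (CRC16 in its XMODEM variant, modulo 16384) with the standard library. The `budget:{tenant}:user:...` key layout and the function names are assumptions for illustration.

```python
import binascii


def hash_slot(key: str) -> int:
    """Compute the Redis Cluster slot for a key, honoring hash tags."""
    data = key.encode()
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end > start + 1:  # Ignore empty tags like "{}"
            data = data[start + 1:end]
    # crc_hqx with an initial value of 0 is CRC16/XMODEM, as used by Redis Cluster.
    return binascii.crc_hqx(data, 0) % 16384


def budget_key(tenant_id: str, user_id: str) -> str:
    """Hypothetical key layout: the tenant ID is the hash tag."""
    return f"budget:{{{tenant_id}}}:user:{user_id}"
```

Because all of a tenant's keys share a slot, a multi-key Lua script or transaction over that tenant's budgets stays on a single node.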

Regional Deployment: Deploy gateway edge nodes in 3+ regions with provider endpoint affinity. Latency predictor models trained per region; don't assume US-East latency applies to APAC.

Connection Pool Tuning: Maintain 50–200 persistent connections per provider endpoint per gateway instance. HTTP/2 multiplexing essential for high throughput.

Monitoring KPIs

| Metric | Target | Alert Threshold |
|---|---|---|
| Gateway p99 latency (excl. model) | <150ms | >300ms |
| End-to-end p99 latency | <2000ms | >5000ms |
| Cost per 1K requests | Baseline -5% | >Baseline +20% |
| Budget exhaustion rate | <1% of users/month | >5% |
| Circuit breaker open frequency | <0.1% of requests | >1% |
| Classification accuracy (vs. human) | >90% | <85% |

Organizations seeking to optimize broader infrastructure costs alongside AI spend should examine multi-cloud Kubernetes cost optimization strategies, as gateway infrastructure typically deploys on container platforms with similar efficiency opportunities.

Production Best Practices

Security

  • Prompt injection defense: Layered filtering—structural analysis first (fast), embedding similarity second (thorough). Never rely solely on provider-side filtering.
  • Credential isolation: Provider API keys in dedicated secret store (HashiCorp Vault, AWS Secrets Manager); gateway uses short-lived tokens with automatic rotation.
  • Audit retention: 90-day minimum for routing decisions, classification inputs, and spend attribution; encrypted at rest with key rotation.

Testing

  • Shadow traffic: Route 1% production traffic through new classifier versions; compare routing decisions without user impact.
  • Chaos engineering: Regularly inject provider latency spikes and failures; verify circuit breaker and fallback behavior.
  • Cost regression tests: Fixed prompt corpus with known optimal routing; alert when gateway selects suboptimal model.
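A shadow-traffic comparison might look like the sketch below: requests are sampled deterministically by ID, both classifier versions run on the sample, and only the live decision is served. The function names and the CRC-based sampling scheme are assumptions for this example.

```python
import zlib
from typing import Callable, List, Sequence, Tuple


def in_shadow_sample(request_id: str, percent: int = 1) -> bool:
    """Deterministically pick about `percent`% of requests by hashing the ID."""
    return zlib.crc32(request_id.encode()) % 100 < percent


def shadow_compare(
    prompts: Sequence[str],
    current: Callable[[str], str],
    candidate: Callable[[str], str],
) -> Tuple[float, List[Tuple[str, str, str]]]:
    """Return (disagreement_rate, [(prompt, live_label, shadow_label), ...])."""
    disagreements = []
    for prompt in prompts:
        live, shadow = current(prompt), candidate(prompt)
        if live != shadow:
            disagreements.append((prompt, live, shadow))
    rate = len(disagreements) / len(prompts) if prompts else 0.0
    return rate, disagreements
```

Reviewing the disagreement list by hand, and not only the rate, shows whether the candidate is moving prompts toward cheaper or more expensive models.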

Rollout

  • Canary by tenant tier: Free users first (accept higher error rate), paid users after 48-hour stability validation.
  • Budget buffer: Maintain 20% emergency reserve for unexpected traffic patterns; auto-scale reserve with monthly growth.

Runbooks

Cost Spike Response (P1):

  1. Identify tenant/user causing spike via cost attribution dashboard
  2. Apply emergency rate limit (10 RPM) to identified entity
  3. Switch entity to "economy mode" (GPT-3.5 only, no streaming)
  4. Notify account team; preserve audit trail
  5. Post-incident: classify root cause (legitimate growth vs. misuse vs. classifier failure)
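The emergency rate limit in step 2 could be enforced with a per-entity sliding window like the sketch below. The class name is hypothetical and the state is in-process for brevity. A multi-instance gateway would need to keep these counters in a shared store.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowRateLimiter:
    """Allows at most `limit` requests per entity in any `window_s` span."""

    def __init__(
        self,
        limit: int = 10,
        window_s: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock  # Injectable so tests can control time
        self._events = defaultdict(deque)

    def allow(self, entity_id: str) -> bool:
        now = self.clock()
        events = self._events[entity_id]
        # Drop timestamps that have aged out of the window.
        while events and now - events[0] >= self.window_s:
            events.popleft()
        if len(events) >= self.limit:
            return False
        events.append(now)
        return True
```

With the defaults, an entity's eleventh request within 60 seconds is rejected while other tenants are unaffected.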

Provider Outage Response (P0):

  1. Confirm circuit breaker opened automatically
  2. Verify traffic shifted to fallback providers
  3. If all providers degraded: queue non-critical requests, degrade to cached responses for FAQs
  4. Communicate status page update
  5. Post-recovery: analyze latency impact, tune circuit thresholds if needed

Further Reading & References

  1. OpenAI API Documentation: "Managing Rate Limits and Token Usage" — https://platform.openai.com/docs/guides/rate-limits
  2. Anthropic Documentation: "Token Counting and Cost Optimization" — https://docs.anthropic.com/en/docs/build-with-claude/token-counting
  3. LiteLLM Gateway: Open-source unified LLM API with routing — https://docs.litellm.ai/docs/proxy/quick_start
  4. Cloudflare AI Gateway: Edge-deployed LLM routing and caching — https://developers.cloudflare.com/ai-gateway/
  5. Kong AI Gateway: Enterprise AI traffic management — https://docs.konghq.com/gateway/latest/ai-gateway/
  6. Greshake et al., 2023. "Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection." arXiv:2302.12173