LLM Eval CI: Versioned Test Suites & Golden Datasets
Introduction
Production LLM systems fail silently. A prompt change that improved coherence on Tuesday degrades factual accuracy by Thursday. A model version bump introduces regressions in structured output parsing. Without systematic evaluation, these failures reach users before engineers detect them—if they detect them at all.
This article delivers a complete engineering framework for evaluation-as-code: versioned test suites, deterministic replay pipelines, golden datasets, and CI gatekeeping that blocks regressions at merge time. You will build systems that catch LLM quality degradation in minutes, not days, with audit trails that satisfy compliance and debugging requirements.
Failure scenario: A fintech team upgraded from GPT-4 to GPT-4o-mini for cost reduction. Unit tests passed. Integration tests passed. Three weeks later, customer support noted a 340% increase in hallucinated fee explanations. Root cause: the cheaper model systematically omitted disclaimers on regulatory-sensitive outputs. No golden dataset existed for that output class. No CI gate had evaluated semantic correctness. The rollback required 47 hours of incident response and a regulatory filing.
Executive Summary
TL;DR: Treat LLM evaluation as infrastructure code—versioned, reproducible, and gatekept in CI—to prevent silent regressions in production systems where stochastic outputs and model drift make traditional testing insufficient.
- Key takeaway 1: Golden datasets must be immutable, semantically labeled, and cover failure modes, not just happy paths—regressions hide at distribution tails.
- Key takeaway 2: Deterministic replay requires freezing model version, temperature=0, seed, and prompt template hash; anything less produces non-reproducible evaluation.
- Key takeaway 3: CI gatekeeping needs both pass/fail thresholds and statistical process control—single-point failures are noise, trend violations are signal.
- Key takeaway 4: Evaluation latency dominates pipeline cost; design for p99 < 5 min on merge-blocking runs via parallelization, caching, and tiered evaluation depth.
- Key takeaway 5: LLM-as-judge evaluators require their own calibration and golden datasets; uncalibrated judges amplify bias.
- Key takeaway 6: Version test suites with the same rigor as application code—git history must reconstruct any past evaluation state exactly.
Direct answers to likely queries:
- Q: What makes LLM evaluation different from traditional software testing? A: Non-deterministic outputs, model version drift, and absence of ground truth for generative tasks require statistical evaluation over fixed test cases rather than deterministic pass/fail assertions.
- Q: How large should a golden dataset be for production CI? A: 500–5,000 examples covering all output classes and known failure modes, stratified so that rare (tail) input classes are represented; compute bootstrap confidence intervals on the mean score to check that the sample is large enough to detect the regressions you care about.
- Q: What threshold should block a CI merge for LLM evaluation? A: Block on any metric falling below its rolling baseline minus 2σ, or on catastrophic failure in safety-critical classes; never block on single-example variance alone.
How Evaluation-as-Code for LLMs Works Under the Hood
The Core Abstraction: Evaluation as a Deterministic Function
Traditional software testing assumes deterministic execution: same input → same output → same assertion result. LLMs violate this assumption at multiple layers. Evaluation-as-code restores determinism by controlling every variable that affects output distribution.
The evaluation function E can be modeled as:
E(input, model, prompt_template, parameters, context) → (output, metadata)
For deterministic replay, we freeze:
- Model version: Exact API version or model checkpoint hash, not "GPT-4" or "Claude 3"
- Prompt template: Content and whitespace, versioned by SHA-256
- Parameters: temperature=0 (or fixed seed with temperature>0), max_tokens, top_p, presence/frequency penalties
- Context window: System prompt, retrieved documents, conversation history—all hashed
- Infrastructure: API endpoint region, fallback behavior, timeout configuration
Any unfrozen variable introduces entropy that invalidates comparison across evaluations. In practice, many production regressions trace to "minor" prompt edits or undocumented API behavior changes rather than to deliberate model updates.
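The frozen variables listed above can be collapsed into a single replay fingerprint. The following is a minimal sketch, assuming the parameters and context are JSON-serializable; the function name `replay_fingerprint` and the field names are illustrative, not part of any library.

```python
import hashlib
import json


def replay_fingerprint(model_version: str, prompt_template: str,
                       parameters: dict, context: dict) -> str:
    """Hash every frozen input; two runs are comparable only if this matches."""
    payload = {
        "model_version": model_version,
        # Hash the raw template so whitespace-only edits change the result.
        "prompt_sha256": hashlib.sha256(prompt_template.encode("utf-8")).hexdigest(),
        "parameters": parameters,
        "context": context,
    }
    # Canonical JSON (sorted keys, fixed separators) keeps the hash stable.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


params = {"temperature": 0, "seed": 42, "max_tokens": 4096, "top_p": 1}
ctx = {"system_prompt": "You explain fees.", "documents": []}
a = replay_fingerprint("gpt-4-0125-preview", "Explain the fee: {fee}", params, ctx)
b = replay_fingerprint("gpt-4-0125-preview", "Explain the fee: {fee} ", params, ctx)
print(a == b)  # False: a trailing space in the template changes the fingerprint
```

Storing this value next to every evaluation result makes it cheap to refuse comparisons between runs whose inputs differ.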
Architecture: The Evaluation Pipeline
A production evaluation pipeline has four stages with distinct reliability requirements:
Stage 1: Golden Dataset Management
Golden datasets are the ground truth repository. Each record contains:
{
  "id": "uuid-v5-of-input-hash",
  "version": "dataset-v3.2.1",
  "input": { /* complete prompt context */ },
  "expected_output": { /* structured or reference text */ },
  "evaluation_criteria": ["factual_accuracy", "tone_compliance", "structured_output_valid"],
  "failure_mode_tags": ["edge_case_regulatory", "ambiguous_entity_reference"],
  "source": "incident-2024-0312-fee-disclaimer-missing",
  "annotator_confidence": 0.95,
  "last_validated": "2024-06-15T09:23:00Z"
}
Immutability is enforced at the storage layer. Updates create new dataset versions; old versions remain addressable for historical comparison. This audit property is essential for regulated deployments and root-cause analysis.
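One way to picture write-once versioning is an in-memory store that refuses to overwrite a published version and verifies a content hash on load. This is only a sketch: `GoldenDatasetStore` is a hypothetical stand-in, and a real deployment would more likely rely on object-store versioning or write-once buckets.

```python
import hashlib
import json


class GoldenDatasetStore:
    """Hypothetical in-memory stand-in for a write-once dataset store."""

    def __init__(self):
        self._versions = {}  # version -> (sha256 digest, serialized bytes)

    def publish(self, version: str, records: list) -> str:
        if version in self._versions:
            raise ValueError(f"{version} already exists; publish a new version instead")
        blob = json.dumps(records, sort_keys=True).encode("utf-8")
        digest = hashlib.sha256(blob).hexdigest()
        # Keep the serialized bytes so later mutation of `records` cannot leak in.
        self._versions[version] = (digest, blob)
        return digest

    def load(self, version: str) -> list:
        digest, blob = self._versions[version]
        if hashlib.sha256(blob).hexdigest() != digest:
            raise RuntimeError(f"{version} failed its integrity check")
        return json.loads(blob)


store = GoldenDatasetStore()
records = [{"id": "a", "input": "What is the wire fee?"}]
digest = store.publish("dataset-v3.2.1", records)
```

Recording the returned digest alongside evaluation results lets an auditor confirm which exact dataset contents a historical run used.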
Stage 2: Execution Engine
The execution engine runs inference over golden datasets with frozen parameters. Critical design decisions:
- Parallelization strategy: Shard by example count, not by batch size, to control per-evaluation latency. Target p99 < 5 min for CI blocking.
- Retry policy: Exponential backoff with jitter for API failures; distinguish infrastructure errors (retry) from content policy blocks (record and continue).
- Caching layer: Hash all inputs including model version; cache responses for 24 hours minimum to reduce cost and variance in repeated evaluations.
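The retry policy above can be sketched as follows. The two exception classes are hypothetical placeholders for whatever errors your client library raises, and the `sleep` argument exists only so the delay can be stubbed out in tests.

```python
import random
import time


class InfrastructureError(Exception):
    """Transient failure (timeout, rate limit, 5xx): worth retrying."""


class ContentPolicyBlock(Exception):
    """The provider refused the request: retrying will not help."""


def call_with_retry(fn, max_attempts=5, base_delay=0.5, max_delay=30.0,
                    sleep=time.sleep):
    for attempt in range(max_attempts):
        try:
            return {"status": "ok", "value": fn()}
        except ContentPolicyBlock as exc:
            # Record the block as a result and let the run continue.
            return {"status": "blocked", "reason": str(exc)}
        except InfrastructureError:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random time up to the capped exponential delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, cap))
```

Returning a "blocked" record instead of raising keeps content-policy refusals visible in the scores rather than hiding them as infrastructure noise.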
Stage 3: Scoring & Aggregation
Scores derive from multiple evaluator types:
- Programmatic: JSON schema validation, regex extraction, exact match for deterministic outputs
- Embedding similarity: Cosine distance to expected output for semantic drift detection
- LLM-as-judge: Calibrated rubric-based evaluation for qualitative dimensions
- Human-in-the-loop: Spot-check validation for judge calibration drift
Aggregation uses stratified statistics: overall mean, per-class mean, and tail metrics (p5/p95 of per-example scores). The tail reveals concentrated failures that averages hide.
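A small sketch of this aggregation, assuming each per-example result is a dict with a `class` label and a numeric `score` (both field names are illustrative). The percentile helper uses the nearest-rank method, which is adequate for illustration but coarse on small samples.

```python
import math
from collections import defaultdict
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def aggregate(results):
    """results: list of {"class": str, "score": float} per-example records."""
    by_class = defaultdict(list)
    for r in results:
        by_class[r["class"]].append(r["score"])
    scores = [r["score"] for r in results]
    return {
        "overall_mean": mean(scores),
        "class_means": {c: mean(s) for c, s in by_class.items()},
        "p5": percentile(scores, 5),
        "p95": percentile(scores, 95),
    }


# 18 healthy examples and 2 failing regulatory examples.
results = ([{"class": "general", "score": 0.9}] * 18
           + [{"class": "regulatory", "score": 0.1}] * 2)
summary = aggregate(results)
```

Here the overall mean stays near 0.82 while the regulatory class mean and the p5 score both sit at 0.1, which is the concentrated failure the average hides.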
Stage 4: CI Gatekeeping
The gate compares current evaluation against rolling baseline with statistical process control:
IF any_class_mean < baseline_mean - 2 * baseline_std:
    BLOCK_MERGE
ELIF overall_p5_score < historical_p5_min:
    BLOCK_MERGE
ELIF catastrophic_failure_count > 0:
    BLOCK_MERGE
ELSE:
    PASS_WITH_WARNING_IF_variance_elevated
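The gate logic above might translate to Python roughly as follows. The dictionary layout and the `variance_warn_ratio` threshold are assumptions made for this sketch; the pseudocode does not define what "elevated" variance means.

```python
def gate(current: dict, baseline: dict, variance_warn_ratio: float = 1.5):
    """Return (decision, reason), where decision is BLOCK, WARN, or PASS."""
    for cls, class_mean in current["class_means"].items():
        ref = baseline["classes"].get(cls)
        # Classes without a baseline entry are skipped here; a stricter gate
        # could treat them as a failure instead.
        if ref and class_mean < ref["mean"] - 2 * ref["std"]:
            return "BLOCK", f"class {cls} mean below baseline minus 2 sigma"
    if current["p5"] < baseline["historical_p5_min"]:
        return "BLOCK", "p5 score below historical minimum"
    if current["catastrophic_failures"] > 0:
        return "BLOCK", "catastrophic failure detected"
    if current["std"] > variance_warn_ratio * baseline["std"]:
        return "WARN", "variance elevated without a blocking mean shift"
    return "PASS", "all gates cleared"


baseline = {
    "classes": {"fees": {"mean": 0.90, "std": 0.02}},
    "historical_p5_min": 0.50,
    "std": 0.05,
}
current = {"class_means": {"fees": 0.85}, "p5": 0.60,
           "catastrophic_failures": 0, "std": 0.05}
decision, reason = gate(current, baseline)
print(decision)  # BLOCK: 0.85 is below 0.90 - 2 * 0.02 = 0.86
```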
Deterministic Replay: The Critical Detail
Deterministic replay requires more than temperature=0. OpenAI's API, for example, does not guarantee determinism; it offers best-effort reproducibility, and only when all of the following hold:
- temperature=0
- seed parameter set (introduced 2023-11)
- Identical request content (messages and every sampling parameter), with the same system_fingerprint returned in the responses
- Same pinned model version ("gpt-4-0125-preview", not an alias such as "gpt-4-turbo")
Even with these controls, API-side optimizations (batching, speculative decoding) can introduce variance. For highest-fidelity replay, capture and record the API response's system_fingerprint field; divergence in this field between evaluations indicates infrastructure-level variance.
For local or self-hosted models, deterministic replay requires:
- Fixed random seed across all layers (PyTorch, CUDA, numpy)
- Deterministic CUDA operations (sacrificing some performance)
- Frozen weights checkpoint, not just model architecture
- Identical batch size and sequence length (affects attention optimizations)
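For self-hosted models, the seeding steps above are often wrapped in one helper. The sketch below is hedged: `seed_everything` is an illustrative name, NumPy and PyTorch are treated as optional so the function still runs where they are not installed, and frozen weights and fixed batch shapes still have to be handled separately.

```python
import os
import random


def seed_everything(seed: int) -> None:
    """Seed the Python, NumPy, and PyTorch RNGs and request deterministic kernels."""
    # cuBLAS needs this set before CUDA initializes to run deterministically.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
    except ImportError:
        return
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # Safe to call without a GPU
    # Raises at runtime for ops that have no deterministic implementation.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False  # Disable autotuned kernel selection


seed_everything(1234)
first_draw = random.random()
seed_everything(1234)
second_draw = random.random()
print(first_draw == second_draw)  # True
```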
Implementation: Production Patterns
Pattern 1: Minimal Viable Evaluation Pipeline
Start with golden dataset definition and basic CI integration. This Python example uses a simple structure compatible with most CI systems:
import hashlib
import json
import time
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional

import openai


@dataclass(frozen=True)
class EvaluationConfig:
    model: str  # Pinned snapshot, e.g. "gpt-4-0125-preview"
    temperature: float = 0.0
    seed: int = 42
    max_tokens: int = 4096

    def fingerprint(self) -> str:
        # Short, stable hash of every sampling parameter; part of the cache key.
        return hashlib.sha256(
            json.dumps(asdict(self), sort_keys=True).encode()
        ).hexdigest()[:16]


@dataclass(frozen=True)
class GoldenExample:
    id: str
    input_messages: List[Dict[str, str]]
    expected_schema: Dict[str, Any]
    evaluation_rubric: List[str]

    def input_hash(self) -> str:
        return hashlib.sha256(
            json.dumps(self.input_messages, sort_keys=True).encode()
        ).hexdigest()[:16]


class DeterministicEvaluator:
    def __init__(self, config: EvaluationConfig,
                 cache: Optional[Dict[str, Dict]] = None):
        self.config = config
        # `cache or {}` would discard an empty dict supplied by the caller.
        self.cache = cache if cache is not None else {}
        self.client = openai.OpenAI()

    def evaluate_single(self, example: GoldenExample) -> Dict:
        cache_key = f"{example.input_hash()}:{self.config.fingerprint()}"
        if cache_key in self.cache:
            return self.cache[cache_key]

        started = time.perf_counter()
        # JSON mode requires the messages to mention JSON explicitly.
        response = self.client.chat.completions.create(
            model=self.config.model,
            messages=example.input_messages,
            temperature=self.config.temperature,
            seed=self.config.seed,
            max_tokens=self.config.max_tokens,
            response_format={"type": "json_object"},
        )
        # The response carries no latency field, so time the call client-side.
        latency_ms = (time.perf_counter() - started) * 1000
        output = response.choices[0].message.content or ""

        result = {
            "example_id": example.id,
            "cache_key": cache_key,
            "system_fingerprint": response.system_fingerprint,
            "output": output,
            "schema_valid": self._validate_schema(output, example.expected_schema),
            "latency_ms": latency_ms,
        }
        self.cache[cache_key] = result
        return result

    def _validate_schema(self, output: str, expected_schema: Dict[str, Any]) -> bool:
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            return False
        # Minimal check: a top-level object containing every required key.
        # Swap in a full JSON Schema validator for type and nesting checks.
        if not isinstance(parsed, dict):
            return False
        return all(k in parsed for k in expected_schema.get("required", []))
Pattern 2: LLM-as-Judge with Calibration
Uncalibrated LLM judges produce inconsistent scores. Implement calibration through few-shot exemplars and inter-judge agreement metrics:
import re
import statistics
from typing import Dict, List, Optional

import openai


class CalibratedJudge:
    REQUIRED_KEYS = {"output", "rubric", "human_score", "reasoning"}

    def __init__(self, judge_model: str, calibration_examples: List[Dict]):
        self.judge_model = judge_model
        self.calibration_examples = calibration_examples
        self.client = openai.OpenAI()
        self._validate_calibration()

    def _validate_calibration(self) -> None:
        """Ensure calibration examples carry human-verified scores and reasoning."""
        for ex in self.calibration_examples:
            missing = self.REQUIRED_KEYS - ex.keys()
            if missing:
                raise ValueError(f"Calibration example missing keys: {sorted(missing)}")

    def score(self, candidate_output: str, rubric: str) -> Dict:
        messages = [
            {"role": "system", "content": (
                "You are an expert evaluator. Score outputs 1-5. "
                'Reply in the form "Score: N. <reasoning>".')},
            *self._format_calibration_shots(),
            {"role": "user", "content": f"Output to evaluate: {candidate_output}\nRubric: {rubric}"},
        ]
        response = self.client.chat.completions.create(
            model=self.judge_model,
            messages=messages,
            temperature=0,
            seed=42,
        )
        score = self._extract_score(response.choices[0].message.content or "")
        return {
            "raw_score": score,
            "calibrated_score": self._apply_calibration_transform(score),
            "confidence": self._estimate_confidence(score),
        }

    def _format_calibration_shots(self) -> List[Dict]:
        shots = []
        for ex in self.calibration_examples[:5]:  # 5-shot typical
            shots.extend([
                {"role": "user", "content": f"Output: {ex['output']}\nRubric: {ex['rubric']}"},
                {"role": "assistant", "content": f"Score: {ex['human_score']}. {ex['reasoning']}"},
            ])
        return shots

    def _extract_score(self, text: str) -> int:
        # Matches the "Score: N" format used in the calibration shots.
        match = re.search(r"Score:\s*([1-5])", text)
        if match is None:
            raise ValueError(f"No score found in judge response: {text!r}")
        return int(match.group(1))

    def _apply_calibration_transform(self, score: int) -> float:
        # Placeholder (identity). Replace with a mapping fitted on human-scored data.
        return float(score)

    def _estimate_confidence(self, score: int) -> Optional[float]:
        # Placeholder: not estimated here; one option is agreement across repeated calls.
        return None

    def benchmark_judge_reliability(self, test_set: List[Dict]) -> Dict:
        """Measure human-judge agreement and detect calibration drift."""
        human_scores = [ex["human_score"] for ex in test_set]
        judge_scores = [self.score(ex["output"], ex["rubric"])["calibrated_score"]
                        for ex in test_set]
        # Pearson r; statistics.correlation requires Python 3.10+.
        r = statistics.correlation(human_scores, judge_scores)
        return {
            "human_judge_correlation": r,
            "acceptable": r > 0.85,
            "drift_detected": r < 0.75,  # Trigger recalibration
        }
Pattern 3: CI Integration with Statistical Gatekeeping
GitHub Actions example with proper statistical thresholds:
name: LLM Evaluation Gate
on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/llm/**'
      - 'eval/golden/**'
jobs:
  evaluate:
    runs-on: ubuntu-latest
    timeout-minutes: 15  # Fail fast on infrastructure issues
    env:
      RESULTS_PATH: eval/results/pr-${{ github.event.number }}.json
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        # Assumes dependencies are listed in requirements.txt
        run: pip install -r requirements.txt
      - name: Load Baseline Metrics
        id: baseline
        run: |
          python - <<'PY'
          import json, os
          with open('eval/baselines/main.json') as f:
              baseline = json.load(f)
          # Step outputs go to $GITHUB_OUTPUT; the ::set-output command is deprecated
          with open(os.environ['GITHUB_OUTPUT'], 'a') as out:
              out.write(f"overall_mean={baseline['overall_mean']}\n")
              out.write(f"overall_std={baseline['overall_std']}\n")
              out.write(f"class_mins={json.dumps(baseline['class_mins'])}\n")
          PY
      - name: Run Evaluation Suite
        id: eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python -m eval.runner \
            --dataset eval/golden/v3.2.1 \
            --config eval/configs/production.yaml \
            --output "$RESULTS_PATH" \
            --parallel 32
      - name: Statistical Gate Check
        run: |
          python - <<'PY'
          import json, os, sys
          with open(os.environ['RESULTS_PATH']) as f:
              results = json.load(f)
          with open('eval/baselines/main.json') as f:
              baseline = json.load(f)
          # Check overall mean (2-sigma threshold)
          floor = baseline['overall_mean'] - 2 * baseline['overall_std']
          if results['overall_mean'] < floor:
              print(f"FAIL: overall_mean {results['overall_mean']:.3f} < {floor:.3f}")
              sys.exit(1)
          # Check per-class minimums; a class missing from the results counts as 0
          for cls, min_score in baseline['class_mins'].items():
              score = results['class_means'].get(cls, 0)
              if score < min_score:
                  print(f"FAIL: class {cls} mean {score:.3f} < {min_score}")
                  sys.exit(1)
          # Check catastrophic failures
          if results.get('catastrophic_failures', 0) > 0:
              print(f"FAIL: {results['catastrophic_failures']} catastrophic failures detected")
              sys.exit(1)
          print('PASS: All statistical gates cleared')
          PY
      - name: Upload Results Artifact
        if: always()  # Keep results for debugging even when the gate fails
        uses: actions/upload-artifact@v4
        with:
          name: eval-results-pr-${{ github.event.number }}
          path: eval/results/pr-${{ github.event.number }}.json
Pattern 4: Advanced — Tiered Evaluation Depth
Not all changes require full evaluation. Implement tiered evaluation to manage cost and latency:
from typing import Dict, List


class TieredEvaluator:
    TIERS = {
        "smoke": {"examples": 50, "evaluators": ["schema", "exact"]},
        "standard": {"examples": 500, "evaluators": ["schema", "embedding", "judge-lite"]},
        "full": {"examples": 5000, "evaluators": ["schema", "embedding", "judge-full", "human-spot"]},
        # "all" is a sentinel that _run_eval expands to every registered evaluator
        "release": {"examples": 5000, "evaluators": "all", "replicates": 3},
    }

    def select_tier(self, change_impact: str, file_paths: List[str]) -> str:
        """Determine evaluation depth from change scope."""
        if any(p.startswith("prompts/critical/") for p in file_paths):
            return "release"  # Regulatory-sensitive prompts
        elif change_impact == "major":
            return "full"
        elif change_impact == "minor":
            return "standard"
        else:
            return "smoke"

    def evaluate(self, tier: str, dataset: "GoldenDataset") -> Dict:
        # `dataset` is any object exposing stratified_sample(n)
        config = self.TIERS[tier]
        subset = dataset.stratified_sample(config["examples"])
        # Replicate for variance estimation on release tier
        replicates = config.get("replicates", 1)
        if replicates > 1:
            results = [self._run_eval(subset, config["evaluators"])
                       for _ in range(replicates)]
            return self._aggregate_with_variance(results)
        return self._run_eval(subset, config["evaluators"])

    def _run_eval(self, subset, evaluators) -> Dict:
        # Project-specific: run the named evaluators over the subset
        raise NotImplementedError

    def _aggregate_with_variance(self, results: List[Dict]) -> Dict:
        # Project-specific: combine replicate results and report their spread
        raise NotImplementedError
This tiering reduces p95 CI time from 45 minutes (full) to 3 minutes (smoke) for low-risk changes, enabling rapid iteration without sacrificing safety on critical paths.
Comparisons & Decision Framework
Evaluator Type Selection
| Evaluator | Cost | Latency | Subjectivity | Best For | Caveat |
|---|---|---|---|---|---|
| Programmatic (schema/regex) | $0 | <1ms | None | Structured outputs, API contracts | Cannot assess quality, only validity |
| Embedding similarity | Low | ~100ms | Low | Semantic drift detection, paraphrase tolerance | Insensitive to factual errors with similar wording |
| LLM-as-judge (calibrated) | Medium | 1-5s | Medium | Tone, coherence, reasoning quality | Requires ongoing calibration; judge model drift |
| LLM-as-judge (uncalibrated) | Medium | 1-5s | High | None—avoid | Unreliable; produces false confidence |
| Human evaluation | High | Hours-days | Reference | Ground truth creation, judge calibration | Not scalable for CI; use for dataset building |
Decision Checklist: When to Block Merge
- Schema/contract violations: ALWAYS block. These are unambiguous regressions.
- Overall mean score decrease > 2σ from baseline: Block. Statistically significant degradation.
- Any class mean below minimum threshold: Block. Concentrated failure in specific use case.
- p5 score below historical minimum: Block. Worst-case outputs are worse than ever before.
- Catastrophic failures (safety, compliance, security): ALWAYS block, regardless of other metrics.
- Variance increase without mean shift: Warn, don't block. Investigate for emerging instability.
- Single-example outliers: Don't block. Record for dataset enrichment.
Golden Dataset Sizing Guidance
| Use Case | Minimum Examples | Stratification Requirement | Validation |
|---|---|---|---|
| Prototype/POC | 50-100 | Cover 3+ output classes | Visual inspection |
| Production MVP | 500 | Per-class proportional to production traffic | Bootstrap 95% CI on mean scores |
| Regulated industry (finance, healthcare) | 2,000-5,000 | Mandatory coverage of all failure modes from incident history | Human re-annotation quarterly; inter-annotator agreement > 0.9 |
| Multi-model evaluation (selection) | 1,000+ | Balanced across easy/medium/hard by human difficulty rating | Statistical power analysis: 80% power to detect 5% mean difference |
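The "Bootstrap 95% CI on mean scores" validation in the table can be sketched with a percentile bootstrap. This is an illustrative implementation using only the standard library; the resample count and the fixed seed are arbitrary choices made for reproducibility.

```python
import random


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


scores = [0.9] * 150 + [0.4] * 50  # 200 per-example scores, mean 0.775
lo, hi = bootstrap_ci(scores)
print(f"95% CI for the mean: [{lo:.3f}, {hi:.3f}]")
```

If the interval is wider than the smallest regression you want the gate to catch, the dataset is too small (or too noisy) for that threshold.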
Failure Modes & Edge Cases
Failure Mode 1: Phantom Regressions from API Non-Determinism
Symptom: Scores fluctuate ±5% between identical evaluations. Baseline comparison triggers false blocks.
Diagnosis: Check system_fingerprint variance across runs. Compare cache hit rates. Verify that temperature=0 and seed actually reach the API request; wrapper code that silently drops them is a common bug.
Mitigation: Implement triple-evaluation with majority voting for borderline cases. Cache aggressively. If API non-determinism persists, switch to local evaluation for gating, API for final validation.
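A sketch of triple evaluation with majority voting, applied only to borderline scores. The `margin` that defines "borderline" is an assumption made for illustration, and `run_eval` stands for any callable that executes one evaluation and returns a score.

```python
from collections import Counter


def gate_with_replicates(run_eval, threshold, margin=0.02, replicates=3):
    """Return True if the gate passes; re-run only when the first score is borderline."""
    first = run_eval()
    if abs(first - threshold) > margin:
        return first >= threshold  # Clear pass or clear fail: no extra runs
    votes = [first >= threshold]
    votes += [run_eval() >= threshold for _ in range(replicates - 1)]
    # With an odd number of replicates the vote cannot tie.
    return Counter(votes).most_common(1)[0][0]


# Simulated scores from three runs of the same evaluation.
simulated = iter([0.79, 0.81, 0.82])
passed = gate_with_replicates(lambda: next(simulated), threshold=0.80)
print(passed)  # True: two of three runs clear the threshold
```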
Failure Mode 2: Judge Calibration Drift
Symptom: Scores trend upward over weeks without corresponding quality improvement. Human spot-checks disagree with judge.
Diagnosis: Run benchmark_judge_reliability() weekly. Track human-judge correlation metric.
Mitigation: Recalibrate when correlation drops below 0.75. Rotate judge models quarterly to avoid model-specific bias accumulation. For critical evaluations, use ensemble of 2+ judge models with disagreement flagging.
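Disagreement flagging across an ensemble of judges can be as simple as comparing the spread of their scores. In this sketch the judge names and the `max_spread` cutoff are hypothetical; scores are assumed to be on the 1-5 rubric scale used earlier.

```python
from statistics import mean


def ensemble_score(judge_scores: dict, max_spread: float = 1.0) -> dict:
    """judge_scores maps a judge name to its score on the 1-5 rubric scale."""
    values = list(judge_scores.values())
    spread = max(values) - min(values)
    return {
        "score": mean(values),
        "spread": spread,
        # Large disagreement routes the example to human review.
        "needs_human_review": spread > max_spread,
    }


agree = ensemble_score({"judge_a": 4, "judge_b": 4})
disagree = ensemble_score({"judge_a": 5, "judge_b": 2})
print(agree["needs_human_review"], disagree["needs_human_review"])  # False True
```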
Failure Mode 3: Golden Dataset Rot
Symptom: Evaluation passes but production failures increase. Dataset no longer represents current traffic.
Diagnosis: Monitor production-to-dataset KL divergence on input embeddings. Track incident-to-dataset coverage: what % of production failures have analogous examples in golden set?
Mitigation: Quarterly dataset audit with automated suggestions from production failures. Maintain "recent incidents" fast-lane: add production failures to dataset within 48 hours, evaluate in next CI run.
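KL divergence needs discrete distributions, so one workable approach is to assign input embeddings to buckets first (for example with a clustering step, not shown) and compare bucket frequencies. The sketch below assumes that bucketing has already happened and that the labels are plain strings; the smoothing constant avoids division by zero for buckets absent from one side.

```python
import math
from collections import Counter


def kl_divergence(prod_labels, dataset_labels, smoothing=1e-6):
    """KL(production || dataset) over discrete bucket labels, in nats."""
    buckets = set(prod_labels) | set(dataset_labels)
    p_counts, q_counts = Counter(prod_labels), Counter(dataset_labels)
    p_total = len(prod_labels) + smoothing * len(buckets)
    q_total = len(dataset_labels) + smoothing * len(buckets)
    kl = 0.0
    for b in buckets:
        p = (p_counts[b] + smoothing) / p_total
        q = (q_counts[b] + smoothing) / q_total
        kl += p * math.log(p / q)
    return kl


dataset = ["fees"] * 50 + ["transfers"] * 50
matching_traffic = ["fees"] * 50 + ["transfers"] * 50
shifted_traffic = ["fees"] * 20 + ["transfers"] * 30 + ["crypto"] * 50
print(kl_divergence(matching_traffic, dataset))  # 0.0
print(kl_divergence(shifted_traffic, dataset) > 1.0)  # True: new bucket dominates
```

A rising value signals that production traffic contains input types the golden set underrepresents; the buckets with the largest per-term contribution are the first candidates for new examples.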
Failure Mode 4: CI Cost Explosion
Symptom: Evaluation spend exceeds inference spend. Engineers skip evaluation to save money.
Diagnosis: Track evaluation cost per PR. Break down by tier usage, judge model selection, cache efficiency.
Mitigation: Implement tiered evaluation (Pattern 4). Use cheaper judge models for initial screening, expensive models only on disagreement. Cache with 7-day TTL. Consider local evaluation for schema/embedding checks.
Failure Mode 5: Prompt Injection in Evaluation Inputs
Symptom: Evaluation produces anomalous outputs. Judge scores are manipulated. Potential security issue.
Diagnosis: Review golden dataset examples for suspicious patterns: "ignore previous instructions", delimiter flooding, role confusion.
Mitigation: Sanitize all dataset inputs with production-grade input filters. Evaluate with same security controls as production. For security-critical applications, see NIST IR 8596 controls for LLM security evaluation to align evaluation security posture with organizational risk framework.
Performance & Scaling
Latency Targets & Measurement
| Pipeline Stage | p50 Target | p95 Target | p99 Target | Measurement Method |
|---|---|---|---|---|
| Single inference (API) | 500ms | 2s | 5s | API response time, not wall clock |
| Golden dataset evaluation (500 ex, parallel) | 30s | 2min | 5min | CI job duration |
| Full evaluation (5,000 ex) | 5min | 15min | 30min | CI job with caching |
| Judge evaluation (per example) | 1s | 3s | 5s | Judge API response time |
Cost Optimization
Evaluation cost scales with dataset size × model cost × evaluation depth. For GPT-4-class models at $30/1M tokens:
- Smoke tier (50 examples, 2K tokens each): ~$3 per evaluation
- Standard tier (500 examples): ~$30 per evaluation
- Full tier (5,000 examples, with judge): ~$500-800 per evaluation
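The arithmetic behind these estimates, as a small helper. It assumes 2K tokens per example for every tier (the list states this only for the smoke tier) and covers inference alone; judge calls are extra, which is why the full tier lands above the raw figure.

```python
def evaluation_cost(examples: int, tokens_per_example: int,
                    usd_per_million_tokens: float) -> float:
    """Inference cost in USD for one evaluation run."""
    return examples * tokens_per_example * usd_per_million_tokens / 1_000_000


smoke = evaluation_cost(50, 2_000, 30)
standard = evaluation_cost(500, 2_000, 30)
full = evaluation_cost(5_000, 2_000, 30)
print(smoke, standard, full)  # 3.0 30.0 300.0
```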
Optimization strategies:
- Cache layer: Redis or S3 with content-addressable keys. Typical 60-80% hit rate for unchanged prompts.
- Model tiering: Run schema and embedding checks locally without an LLM call, use a cheaper model (for example GPT-3.5) for first-pass judging, and reserve GPT-4 for judge disagreement resolution.
- Parallelization: Shard across 32-64 workers. API rate limits are the bottleneck; implement token bucket with exponential backoff.
- Incremental evaluation: Only evaluate examples affected by prompt change, determined by input hash prefix matching.
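The token bucket mentioned under parallelization can be sketched in a few lines. This version is single-threaded and non-blocking (callers back off and retry when `try_acquire` returns False); the injectable `clock` is an assumption added so the behavior can be tested without sleeping.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.updated = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate_per_sec=10, capacity=32)
granted = sum(bucket.try_acquire() for _ in range(32))
print(granted)  # 32: a full burst is allowed immediately
```

With 32-64 workers sharing one provider quota, a shared limiter like this (protected by a lock or hosted in Redis) keeps the pool under the rate limit instead of discovering it through 429 responses.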
For organizations managing evaluation spend across multiple teams, FinOps practices for LLM token economics provide chargeback frameworks and unit cost visibility that prevent evaluation from becoming an uncontrolled cost center.
Monitoring & Alerting
Instrument the evaluation pipeline itself:
evaluation_pipeline_metrics = {
    "cache_hit_rate": Gauge,             # Alert if < 0.5
    "inference_p99_latency": Histogram,  # Alert if > 10s
    "judge_human_correlation": Gauge,    # Alert if < 0.8
    "cost_per_evaluation": Counter,      # Alert if weekly increase > 20%
    "gate_block_rate": Gauge,            # Alert if > 0.3 (indicates instability)
    "dataset_coverage_score": Gauge      # Alert if quarterly drop
}
Production Best Practices
Security & Compliance
- Dataset access control: Golden datasets may contain PII from production incidents. Apply same classification as source data.
- Evaluation isolation: Run evaluations under separate API keys/accounts from production to prevent quota contention and to contain the blast radius of a leaked or misused key.
- Audit logging: Log every evaluation run with config fingerprint, dataset version, and result hash. Immutable storage for compliance.
- Model version pinning: Never use "latest" or date-range aliases in evaluation configs. Exact version only.
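The audit-logging item above could produce a record like the one sketched here. The field names are illustrative; the point is that the config fingerprint and result hash are derived from canonical JSON so that identical inputs always yield identical hashes.

```python
import hashlib
import json
from datetime import datetime, timezone


def _digest(obj) -> str:
    """SHA-256 of the canonical JSON form of a JSON-serializable object."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def audit_record(config: dict, dataset_version: str, results: dict) -> dict:
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "config_fingerprint": _digest(config),
        "dataset_version": dataset_version,
        "result_hash": _digest(results),
    }


record = audit_record(
    config={"model": "gpt-4-0125-preview", "temperature": 0, "seed": 42},
    dataset_version="dataset-v3.2.1",
    results={"overall_mean": 0.91, "catastrophic_failures": 0},
)
print(json.dumps(record, indent=2))
```

Appending these records to write-once storage gives auditors a way to confirm that a stored result file matches what the gate actually evaluated.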
In security-critical domains like threat intelligence, evaluation practices must integrate with broader AI security controls. The domain-specific evaluation patterns for cyber threat intelligence demonstrate how to adapt this framework for specialized outputs where hallucination carries operational risk, combining vector database retrieval verification with calibrated LLM-as-judge assessment.
Runbook: Evaluation Pipeline Outage
- Detection: CI job timeout or cache layer unavailability.
- Immediate: Switch to degraded mode—run smoke tier only, with 10x sample from most critical class.
- Communication: Notify #ml-ops with estimated restoration time; do not bypass gate without explicit sign-off.
- Restoration: Restore cache layer, verify with manual full evaluation run on recent baseline.
- Post-incident: Add evaluation pipeline health to on-call dashboard; implement circuit breaker for API dependency.
Versioning Strategy
Version all artifacts together or not at all:
eval-suite/
  v3.2.1/
    dataset/        # Immutable golden examples
    config.yaml     # Model, parameters, thresholds
    rubrics/        # Judge scoring criteria
    baselines.json  # Statistical reference points
    code/           # Evaluation runner at this version
Git tag: eval-v3.2.1 references exact state. Reproduce any historical evaluation with: git checkout eval-v3.2.1 && python -m eval.runner --dataset eval-suite/v3.2.1/dataset.
Integration with Threat Intelligence Workflows
For teams building automated threat intelligence pipelines, evaluation-as-code ensures that GenAI-generated assessments maintain accuracy as threat landscapes evolve. The workflow automation patterns for threat intelligence with GenAI show how to embed these evaluation gates within entity resolution and report generation pipelines, where structured output validity and source attribution accuracy are non-negotiable requirements.
Further Reading & References
- OpenAI API Reference, Deterministic Outputs: https://platform.openai.com/docs/guides/reproducible-outputs. Official guidance on seed parameter and system_fingerprint for deterministic inference.
- "Evaluating Large Language Model Evaluations" (Chang et al., 2024): Comprehensive analysis of LLM-as-judge reliability, calibration requirements, and failure modes. Critical for designing judge components.
- ISO/IEC 25010:2023 Systems and Software Quality Models: Standard framework for operationalizing quality characteristics (functional suitability, reliability, security) that evaluation rubrics should reference.
- MLflow Documentation, Model Evaluation: https://mlflow.org/docs/latest/models.html#model-evaluation. Production patterns for experiment tracking and evaluation artifact management.
- "Best Practices for ML Engineering" (Google, 2017): Foundational principles for test infrastructure, though predating LLM-specific concerns; still relevant for CI integration patterns.
- Weights & Biases, LLM Evaluation Tools: https://docs.wandb.ai/guides/prompts. Practical implementation reference for visualization and traceability of evaluation runs.
The framework presented here is not theoretical. It is the minimum viable rigor for production LLM systems where silent regressions carry business, regulatory, or safety consequences. Start with deterministic replay and schema validation. Add calibrated judges. Build statistical gates. Version everything. The cost of this infrastructure is measured in hours; the cost of undetected regressions is measured in incidents, customer trust, and compliance findings.