OpenTelemetry AI: Native LLM Tracing & Observability
Introduction
Problem statement (production-framed): Observability for LLM-based systems is noisy, high-cardinality, and protocol-diverse; teams need a reproducible, vendor-agnostic way to collect, enrich, and forward LLM traces and metrics into existing monitoring platforms without breaking performance SLAs. For a broader view of how specialized platforms handle these challenges, see our comparison of AI observability platforms including Braintrust, Arize Phoenix, and Langfuse.
What this article delivers: a pragmatic, production-tested pattern for OpenTelemetry-Native AI observability that covers semantic conventions for LLMs, LLM tracing integration patterns, exporter and collector configurations (Datadog and New Relic examples), failure modes and diagnostics, and scaling guidance.
Failure scenario: imagine an inference service in production that serves agent orchestration and LLM completions. Latency spikes to p95=3s and p99=7s after a model rollout. Observability is split: traces are sent to vendor A, metrics to vendor B, and prompt-level context (prompts, few-shot examples, prompt tokens) are logged in ad-hoc application logs. The incident response team lacks prompt-level trace spans, token counts, and request/response hashes needed to triage a regression. This article describes how to instrument, enrich, route, and troubleshoot LLM traces using OpenTelemetry so that the next regression is diagnosable within your existing observability ecosystem.
Executive Summary
TL;DR: Use OpenTelemetry AI semantic conventions to create structured LLM spans, pipe them through an OpenTelemetry Collector (for enrichment, sampling, and routing), and export to Datadog or New Relic via OTLP or vendor exporters for correlation with metrics and logs.
- Key takeaway 1: Model-level spans (prompt → model.call → completion) and semantic attributes (model.name, prompt.token_count, completion.token_count, provider) are essential for root-cause analysis.
- Key takeaway 2: Run an OpenTelemetry Collector as a dedicated sidecar/central pipeline to perform adaptive sampling, PII-safe enrichment, and vendor routing.
- Key takeaway 3: Use deterministic hashing for prompts to enable de-duplication and privacy-preserving correlation across traces, logs, and metrics.
- Key takeaway 4: Export to Datadog or New Relic by sending OTLP to a local agent or the collector with proper headers (API keys); prefer batching and asynchronous export to avoid tail latency costs.
- Key takeaway 5: Monitor p95 and p99 trace export latency and collector CPU; enforce backpressure with bounded queues and circuit-breaking exporters.
Three likely questions and answers
- Q: How do I send LLM traces from OpenTelemetry to Datadog or New Relic? A: Export spans via OTLP to an OpenTelemetry Collector that uses either the Datadog exporter/agent or the New Relic OTLP endpoint with required API headers; see the Deployment examples below.
- Q: What semantic attributes should I add for LLM spans? A: Follow OpenTelemetry semantic conventions for LLMs: model identifiers, provider, prompt/response token counts, sampling probability, latency_ms, and deterministic prompt hashes.
- Q: How do I avoid PII leakage in prompt tracing? A: Hash prompts deterministically before sending, redact full prompt text in production traces, and store full text only in a secure, audited store if absolutely necessary.
How OpenTelemetry-Native AI Observability Integration Works Under the Hood
Architecture summary: instrument application code to create LLM-related spans and attributes → use SDKs to batch and export spans to an OpenTelemetry Collector → Collector performs enrichment (sampling, attribute transformation, PII redaction), then routes OTLP to one or more backends (Datadog, New Relic, or others). The collector is the canonical control plane for vendor-neutral telemetry processing and is central to most production deployments.
Key components and data flows (textual diagram):
- Application (Python/Go/Node): creates spans for the request lifecycle: request.received → prompt.parse → llm.call → llm.postprocess → response.sent
- Local SDK: attaches semantic attributes (llm.model.name, llm.provider, llm.prompt.tokens, trace.hash, user.id_safe) and performs local batching and non-blocking export to Collector.
- OpenTelemetry Collector: receives OTLP, performs processor steps (attributes, sampling, batching, resource detection), and routes to exporters (the Datadog Agent's OTLP receiver, the Datadog exporter, or New Relic's OTLP ingest endpoint).
- Vendor Backend: correlates traces with metrics and logs, applies vendor-specific enrichment (APM UI, traces maps), and surfaces latency, error, and cost insights.
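The "local batching and non-blocking export" step in the flow above can be pictured with a small sketch. This is not the OpenTelemetry SDK's implementation; `BoundedSpanQueue`, `offer`, and `drain_batch` are hypothetical names, and the sizes are illustrative. The point it shows is that the request path never blocks: when the buffer is full, the span is dropped and counted.

```python
import queue
import threading


class BoundedSpanQueue:
    """Hypothetical stand-in for an SDK export buffer that never blocks the caller."""

    def __init__(self, max_size: int = 2048, batch_size: int = 512):
        self._queue: queue.Queue = queue.Queue(maxsize=max_size)
        self._batch_size = batch_size
        self._lock = threading.Lock()
        self.dropped = 0  # expose as a metric so silent data loss is visible

    def offer(self, span: dict) -> bool:
        """Enqueue without blocking; drop and count the span if the queue is full."""
        try:
            self._queue.put_nowait(span)
            return True
        except queue.Full:
            with self._lock:
                self.dropped += 1
            return False

    def drain_batch(self) -> list:
        """Pull up to batch_size spans for a single export request."""
        batch = []
        while len(batch) < self._batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch


# Tiny sizes so the drop behavior is easy to see.
buffer = BoundedSpanQueue(max_size=3, batch_size=2)
accepted = [buffer.offer({"name": "llm.call", "seq": i}) for i in range(5)]
first_batch = buffer.drain_batch()
```

In a real service a background thread would call something like `drain_batch()` on a timer and hand the batch to the exporter, which is roughly the role the SDK's batch span processor plays.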
Protocols and exporters: OTLP/gRPC and OTLP/HTTP are the modern, vendor-neutral transports. Datadog supports receiving OTLP through the Datadog agent (and also offers a Datadog exporter). New Relic accepts OTLP with the license key header (they also publish OpenTelemetry documentation). When vendor exporters exist, prefer them for less mapping friction, but OTLP keeps you vendor-portable.
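Semantic conventions for LLMs: the OpenTelemetry community maintains generative AI semantic conventions, which are still evolving and use the gen_ai.* namespace (for example gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens). The llm.* keys below are illustrative names used throughout this article; map them to the convention names in the OpenTelemetry version you run: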
- llm.model.name / llm.model.version
- llm.provider (e.g., openai, anthropic, local)
- llm.request.prompt_token_count, llm.response.completion_token_count
- llm.call.latency_ms and llm.call.success (bool)
- llm.request.prompt_hash (deterministic SHA256 or HMAC to avoid PII leakage)
These attributes enable slice-and-dice queries: by model, by provider, by prompt pattern (via hash), and by token cost.
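As a rough illustration of the attribute set listed above, the sketch below assembles one span's attributes into a plain dictionary. `build_llm_attributes` and the demo key are assumptions made for this example, and the keys are the article's illustrative llm.* names rather than official convention names. The raw prompt only ever enters the keyed hash.

```python
import hashlib
import hmac


def prompt_hash(prompt_text: str, key: bytes) -> str:
    """Keyed, deterministic, one-way digest of the prompt (HMAC-SHA256)."""
    return hmac.new(key, prompt_text.encode("utf-8"), hashlib.sha256).hexdigest()


def build_llm_attributes(
    *,
    model_name: str,
    model_version: str,
    provider: str,
    prompt: str,
    prompt_tokens: int,
    completion_tokens: int,
    latency_ms: float,
    success: bool,
    key: bytes,
) -> dict:
    """Return span attributes for one LLM call; the raw prompt is never included."""
    return {
        "llm.model.name": model_name,
        "llm.model.version": model_version,
        "llm.provider": provider,
        "llm.request.prompt_token_count": prompt_tokens,
        "llm.response.completion_token_count": completion_tokens,
        "llm.call.latency_ms": latency_ms,
        "llm.call.success": success,
        "llm.request.prompt_hash": prompt_hash(prompt, key),
    }


demo_key = b"demo-key-load-from-secret-manager"  # placeholder, not a real secret
demo_prompt = "Summarize the incident report."
attrs = build_llm_attributes(
    model_name="example-model",
    model_version="2024-01",
    provider="local",
    prompt=demo_prompt,
    prompt_tokens=6,
    completion_tokens=42,
    latency_ms=812.5,
    success=True,
    key=demo_key,
)
```

Each entry could then be passed to `span.set_attribute(name, value)` on the active span.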
Implementation: Production Patterns
This section gives step-by-step patterns: basic instrumentation → collector pipeline → vendor exports → advanced sampling and redaction.
Basic instrumentation (Python example)
import hashlib
import hmac

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Register the SDK tracer provider and batch spans to the local Collector.
# BatchSpanProcessor exports on a background thread, so requests are not blocked.
tracer_provider = TracerProvider()
otlp_exporter = OTLPSpanExporter(endpoint="localhost:4317", insecure=True)
tracer_provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(tracer_provider)
tracer = trace.get_tracer(__name__)


def prompt_hash(prompt_text, salt="your-hmac-salt"):
    # Deterministic, one-way keyed hash; do not send raw prompt_text in prod.
    # Load the salt from a secret manager instead of hard-coding it.
    return hmac.new(salt.encode(), prompt_text.encode(), hashlib.sha256).hexdigest()


def call_llm(prompt, model_name, provider):
    with tracer.start_as_current_span("llm.call") as span:
        span.set_attribute("llm.model.name", model_name)
        span.set_attribute("llm.provider", provider)
        span.set_attribute("llm.request.prompt_hash", prompt_hash(prompt))
        span.set_attribute("llm.request.prompt_token_count", estimate_tokens(prompt))
        try:
            # Call the model (pseudo): invoke_model() is a placeholder.
            response, latency_ms, completion_tokens = invoke_model(prompt, model_name)
        except Exception:
            # The context manager records the exception on the span when it propagates.
            span.set_attribute("llm.call.success", False)
            raise
        span.set_attribute("llm.call.latency_ms", latency_ms)
        span.set_attribute("llm.response.completion_token_count", completion_tokens)
        span.set_attribute("llm.call.success", True)
        return response
Notes: estimate_tokens() and invoke_model() are placeholders; prefer library tokenizers (e.g., tiktoken) for accurate token counts.
Collector pipeline (minimum production-ready)
Run the Collector as a sidecar in each pod, as a per-node agent (a DaemonSet on Kubernetes), or as a central gateway (a Deployment behind a load balancer). A sidecar or node agent reduces network hops and provides local control; a central gateway reduces operational overhead and is easier to scale horizontally for high-volume telemetry. Typical production pattern: sidecar for high-cardinality services + central collector for cross-service routing and long-term enrichment.
# opentelemetry-collector.yaml (extract)
receivers:
  otlp:
    protocols:
      grpc: {}
      http: {}

processors:
  batch: {}
  attributes/redact_prompts:
    actions:
      - key: llm.request.prompt_text
        action: delete
      - key: llm.request.prompt
        action: delete

exporters:
  otlp/datadog:
    endpoint: "http://datadog-agent:4317"
    tls:
      insecure: true   # plaintext gRPC to an in-cluster Agent
  otlp/newrelic:
    endpoint: "https://otlp.nr-data.net:4317"
    headers:
      api-key: "${env:NEW_RELIC_API_KEY}"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/redact_prompts, batch]
      exporters: [otlp/datadog, otlp/newrelic]
Important: do not send raw user prompts in production. Use attributes processor to redact or replace prompt text with a hash. If you must store raw prompts for debugging, write them to a secure, audited store and reference them in trace attributes with a reference ID.
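The redact-and-reference pattern described above can be sketched in application code as follows. The Collector's attributes processor does the deletion declaratively; this Python version only illustrates the logic. `redact_span_attributes`, the `llm.request.prompt_ref` key, and the in-memory `audited_store` dictionary are hypothetical stand-ins (a real deployment would use an encrypted, access-controlled store).

```python
import hashlib
import hmac
import uuid
from typing import Optional

# Attribute keys that may carry raw prompt text (same keys the Collector config deletes).
SENSITIVE_KEYS = ("llm.request.prompt_text", "llm.request.prompt")


def redact_span_attributes(attributes: dict, key: bytes, raw_store: Optional[dict] = None) -> dict:
    """Return a copy with raw prompt text removed and replaced by a hash (and optional reference ID)."""
    cleaned = dict(attributes)
    for name in SENSITIVE_KEYS:
        raw = cleaned.pop(name, None)
        if raw is None:
            continue
        # Keep a keyed hash so traces can still be grouped by prompt.
        cleaned.setdefault(
            "llm.request.prompt_hash",
            hmac.new(key, str(raw).encode("utf-8"), hashlib.sha256).hexdigest(),
        )
        # Optionally park the raw text in a secure store and keep only a reference ID.
        if raw_store is not None and "llm.request.prompt_ref" not in cleaned:
            ref_id = uuid.uuid4().hex
            raw_store[ref_id] = raw
            cleaned["llm.request.prompt_ref"] = ref_id
    return cleaned


audited_store = {}  # stand-in for a secure, audited store
span_attrs = {
    "llm.model.name": "example-model",
    "llm.request.prompt_text": "My account number is 000-00-0000",
}
safe_attrs = redact_span_attributes(span_attrs, key=b"demo-key", raw_store=audited_store)
```

Only `safe_attrs` would be attached to the exported span; the reference ID lets an authorized engineer look up the raw prompt during an incident.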
Sending LLM traces to Datadog
Options:
- Send OTLP to Datadog agent (recommended for on-host setups) — Datadog agent accepts OTLP and upstreams to Datadog APM.
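- Send OTLP directly to Datadog's intake with the API key header configured in the Collector's exporter; confirm in Datadog's documentation that direct OTLP intake covers traces for your account before relying on it.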
- Use Datadog's native tracing SDK if you need deep automatic instrumentation for certain frameworks, then correlate with OTLP traces via trace IDs.
Collector exporter snippet for Datadog (example):
exporters:
  datadog:
    api:
      site: "datadoghq.com"
      key: "${env:DD_API_KEY}"
# When sending OTLP to the Agent instead, use an otlp exporter
# pointed at the local Agent address:
#   otlp/datadog:
#     endpoint: "http://datadog-agent:4317"
Sending LLM traces to New Relic
New Relic supports OTLP; set the api-key header in the Collector exporter to your New Relic license key. When sending to New Relic, map your attributes to the names New Relic expects if you want to use its built-in dashboards.
exporters:
  otlp/newrelic:
    endpoint: "https://otlp.nr-data.net:4317"
    headers:
      api-key: "YOUR_NEW_RELIC_LICENSE_KEY"
Advanced: adaptive sampling and cost controls
- Use a probabilistic sampler for high-volume endpoints; use a tail-based sampler for latency/error-focused retention (capturing all slow or error traces).
- Profile token-count cardinality: keep one span with token metrics per request and avoid recording every token or embedding vector in spans; those live elsewhere.
- Apply rate-limiting processors in the Collector and circuit breakers on exporters to prevent runaway costs or vendor saturations.
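The sampling policy in the list above (keep every slow or failed trace, keep a fixed fraction of the rest) can be sketched as a single decision function. This is a simplified illustration, not the Collector's tail sampling processor; `keep_trace`, the 2000 ms threshold, and the 5% baseline are assumptions chosen for the example. The baseline decision is derived from the trace ID so that every component reaches the same verdict for the same trace.

```python
TRACE_ID_MASK = (1 << 64) - 1  # use the low 64 bits of the 128-bit trace ID


def keep_trace(
    trace_id: int,
    duration_ms: float,
    has_error: bool,
    latency_threshold_ms: float = 2000.0,
    baseline_ratio: float = 0.05,
) -> bool:
    """Decide retention once a trace is complete (tail-based decision)."""
    # Always retain traces that matter for incident triage.
    if has_error or duration_ms >= latency_threshold_ms:
        return True
    # Otherwise keep a deterministic fraction, keyed on the trace ID.
    bound = int(baseline_ratio * (1 << 64))
    return (trace_id & TRACE_ID_MASK) < bound


slow_kept = keep_trace(TRACE_ID_MASK, duration_ms=3000.0, has_error=False)
error_kept = keep_trace(TRACE_ID_MASK, duration_ms=120.0, has_error=True)
fast_high_id_kept = keep_trace(TRACE_ID_MASK, duration_ms=120.0, has_error=False)
fast_low_id_kept = keep_trace(0, duration_ms=120.0, has_error=False)
```

Because the decision needs the full trace duration and error status, it has to run where complete traces are visible, which is why this kind of sampling usually lives in the Collector rather than in the SDK.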
Comparisons & Decision Framework
Choice surface: Datadog vs New Relic vs vendor-agnostic observability. Decision factors: existing tooling, required feature set (APM UI, trace flamegraphs, cost analytics), retention and privacy, and ingestion cost.
Checklist for selecting a routing strategy
- Existing investments: If you already use Datadog or New Relic for APM and logs, route LLM traces to the same vendor to leverage correlation UIs.
- Privacy needs: If strict PII controls exist, ensure the collector can redact and you can restrict where raw prompts are stored.
- Cost transparency: Choose exporters and retention rules that expose token counts per trace; prefer sampling that keeps token-heavy traces.
- Portability: If vendor lock-in is a concern, prefer OTLP to a central collector with multi-exporter routing.
- Dev/Prod split: Send full traces to dev/staging and sampled/redacted traces to prod to balance debuggability vs cost/privacy.
Tradeoffs at a glance
- Datadog: strong out-of-the-box APM and vendor SDKs; easier to map traces to infra metrics. Slightly higher cost for high-cardinality events.
- New Relic: competitive APM and good support for OTLP; may offer different pricing models and dataset analytics (depends on contract).
- Vendor-agnostic (OTLP + Collector): gives portability and central control at the cost of extra operational overhead to manage the collector.
For an enterprise assessment of AI observability platforms and deeper trade-offs between managed vendors, see our market comparison of platform approaches in our AI observability platforms review.
Failure Modes & Edge Cases
Concrete diagnostics and mitigations:
- Failure mode: Spans never arrive at backend. Diagnostics: check SDK local exporter queue size, collector receiver logs, and network connectivity to exporter endpoints. Mitigation: enable local disk-backed queue in collector, configure retry policies, and monitor span export error counters.
- Failure mode: High tail latency from exporter. Diagnostics: measure exporter latency at p95/p99 (from SDK or Collector self-telemetry), collector CPU/memory, and exporter HTTP/gRPC retry backoffs. Mitigation: increase batch sizes, use asynchronous exporters, or add a sidecar collector to reduce network hops.
- Failure mode: PII leakage via prompt text in traces. Diagnostics: search traces for llm.request.prompt_text or long string attributes. Mitigation: apply attributes processor to delete/replace these attributes and run a one-time cleansing job to remove existing traces if vendor supports it.
- Failure mode: Unbounded cardinality due to prompt_hash uniqueness. Diagnostics: monitor attribute cardinality and APM billing. Mitigation: use coarse-grained prompt bucketing (e.g., a prompt template ID instead of a hash of the fully rendered prompt) and retain only a sample of distinct prompt hashes long term.
- Failure mode: Token count inflation and cost misattribution. Diagnostics: reconcile model billing from provider with aggregated llm.response.completion_token_count metrics. Mitigation: ensure token estimation uses the same tokenizer as the model provider and attach provider.request_id attributes for reconciliation.
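The token reconciliation diagnostic in the last item above might look like the following sketch. The span dictionaries, the `billed_by_model` totals, and the 2% tolerance are made-up inputs for illustration; in practice the traced totals would come from your backend's aggregation and the billed totals from the provider's usage report.

```python
def token_drift(traced_tokens: int, billed_tokens: int) -> float:
    """Relative difference between traced and billed token totals."""
    if billed_tokens == 0:
        return 0.0 if traced_tokens == 0 else float("inf")
    return abs(traced_tokens - billed_tokens) / billed_tokens


def reconcile(spans: list, billed_by_model: dict, tolerance: float = 0.02) -> dict:
    """Aggregate completion tokens per model from spans and compare with provider billing."""
    traced: dict = {}
    for span in spans:
        model = span["llm.model.name"]
        traced[model] = traced.get(model, 0) + span["llm.response.completion_token_count"]
    report = {}
    for model, billed in billed_by_model.items():
        drift = token_drift(traced.get(model, 0), billed)
        report[model] = {
            "traced": traced.get(model, 0),
            "billed": billed,
            "drift": drift,
            "ok": drift <= tolerance,
        }
    return report


spans = [
    {"llm.model.name": "model-a", "llm.response.completion_token_count": 100},
    {"llm.model.name": "model-a", "llm.response.completion_token_count": 150},
    {"llm.model.name": "model-b", "llm.response.completion_token_count": 90},
]
report = reconcile(spans, billed_by_model={"model-a": 250, "model-b": 100})
```

A model flagged as not `ok` is a cue to check whether the local tokenizer matches the provider's, or whether sampling dropped spans that should have been counted.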
Performance & Scaling
Scaling rules of thumb and KPIs:
- Telemetry volume: expect spans per inference request to be O(1) (1–5 spans). If you instrument every agent action, spans can grow O(N) where N is agent steps; cap instrumentation per request or use sampling by agent step index.
- Latency budget: keep SDK export and Collector path tail latency < 5–10% of your p95 inference latency. For example, if p95 inference is 1s, the budget is 50–100ms; aim for telemetry p99 export impact < 50ms to stay at the conservative end.
- Batch sizing: batch span export sizes of 50–500 spans per HTTP/gRPC request typically maximize throughput without increasing tail latency; tune by measurement.
- p95/p99 guidance: measure both trace ingestion latency and collector CPU per 1k req/s. Example benchmark: a Collector node with 4 CPU and 8GB can handle ~5k OTLP spans/sec with batch processor and reasonable attributes; scale horizontally as telemetry grows.
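- Retention and cost: track token_count per trace. Estimate monthly model spend as total tokens (sum of token_count across requests) * price_per_token, and estimate telemetry ingestion cost separately as span volume * the vendor's price_per_unit; use sampled retention for long-term archiving.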
Benchmarks: run a small-scale test with realistic prompt sizes. Example target: for an inference service serving 500 RPS with average 250 tokens/request, expect at least 500 spans/sec (one llm.call span per request) and up to 2,500 spans/sec at 5 spans per request; provision Collector capacity accordingly and set adaptive sampling for long-term storage.
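The arithmetic behind these rules of thumb can be written down as a small worked example. The 5,000 spans/sec per-node figure is the article's example benchmark, not a measured guarantee, and the 50% headroom factor and the function names are assumptions added for this sketch.

```python
import math


def telemetry_budget_ms(p95_inference_ms: float, fraction: float = 0.05) -> float:
    """Tail-latency budget for the telemetry path as a fraction of p95 inference latency."""
    return p95_inference_ms * fraction


def spans_per_second(requests_per_second: float, spans_per_request: float) -> float:
    """Expected span rate before sampling."""
    return requests_per_second * spans_per_request


def collector_nodes(span_rate: float, per_node_capacity: float = 5000.0, headroom: float = 0.5) -> int:
    """Nodes needed if each node is only loaded to `headroom` of its benchmarked capacity."""
    usable = per_node_capacity * headroom
    return max(1, math.ceil(span_rate / usable))


budget_low = telemetry_budget_ms(1000.0, 0.05)   # conservative end of the 5-10% range
budget_high = telemetry_budget_ms(1000.0, 0.10)  # upper end of the range
rate_min = spans_per_second(500, 1)              # one llm.call span per request
rate_max = spans_per_second(500, 5)              # five spans per request
nodes_needed = collector_nodes(rate_max)
agent_heavy_nodes = collector_nodes(spans_per_second(500, 40))  # e.g. 40 agent steps per request
```

Under these assumptions the 500 RPS service fits on a single Collector node, while an agent workload emitting 40 spans per request would need several; replace the capacity figure with your own benchmark before relying on the result.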
Production Best Practices
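- Security: never send raw prompt content to third-party vendors without explicit consent. Hash prompts with HMAC-SHA256 (or SHA256 with a secret salt). Store the key or salt in your secret manager and rotate it periodically; hashes computed under different keys do not match, so record a key version alongside the hash if you need to correlate across a rotation.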
- Testing: create a synthetic load test that exercises typical prompt sizes and agent orchestration steps; measure telemetry throughput and end-to-end export latency.
- Rollout: start with full trace capture in staging, then enable sampling and redaction in production. Use feature flags to toggle verbosity per service.
- Runbooks: create clear runbooks to diagnose LLM regressions: verify model provider request IDs, correlate trace ID → model provider logs, check token_count drift, and validate prompt_hash collisions or mismatches.
- Observability hygiene: keep attribute cardinality low — use enums or coarse buckets (e.g., model tier: small/medium/large), and avoid capturing dynamic user IDs directly in span attributes.
- Audit & compliance: keep a policy for how long raw prompts can be stored, who can access them, and whether they are encrypted at rest with strict KMS controls.
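The cardinality guidance in the list above (enums and coarse buckets instead of raw values) can be sketched as two small helpers. The bucket edges, the tier mapping, the model names, and the `*_bucket`/`tier` attribute keys are all hypothetical; pick boundaries that match your own traffic.

```python
import bisect

# Upper edges (inclusive) of each token-count bucket; the last label catches everything larger.
TOKEN_BUCKET_EDGES = (128, 512, 2048, 8192)
TOKEN_BUCKET_LABELS = ("<=128", "129-512", "513-2048", "2049-8192", ">8192")

# Hypothetical mapping from concrete model names to a low-cardinality tier.
MODEL_TIERS = {
    "example-small-v1": "small",
    "example-medium-v1": "medium",
    "example-large-v1": "large",
}


def token_bucket(token_count: int) -> str:
    """Map an exact token count to one of five coarse labels."""
    return TOKEN_BUCKET_LABELS[bisect.bisect_left(TOKEN_BUCKET_EDGES, token_count)]


def model_tier(model_name: str) -> str:
    """Map a model name to its tier; unmapped models share a single 'unknown' value."""
    return MODEL_TIERS.get(model_name, "unknown")


low_cardinality_attrs = {
    "llm.model.tier": model_tier("example-medium-v1"),
    "llm.request.prompt_token_bucket": token_bucket(640),
}
```

The exact token count can still be recorded as a numeric attribute for cost math; the bucket is what you group and facet by, which keeps the number of distinct tag values small.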
For agentic systems where action sequences produce high-span volume, our field-tested HOTL framework is a practical blueprint for traceable agent observability: see the HOTL framework for agentic production observability for patterns on sampling agent traces and correlating actions to LLM calls.
Further Reading & References
- OpenTelemetry Spec & Semantic Conventions — OpenTelemetry community repo and semantic conventions for AI/LLMs. (Search "OpenTelemetry LLM semantic conventions" in the OpenTelemetry docs.)
- OpenTelemetry Collector — official Collector configuration and processors documentation.
- Datadog APM + OTLP — Datadog docs for receiving OTLP traces and configuring the Datadog Agent.
- New Relic OTLP — New Relic docs for OTLP ingestion and API key setup.
- High-cardinality metrics guidance — vendor best practices for cardinality control and sampling.
For infrastructure-level tracing research that helps when you need system-level observability (e.g., kernel-level tracing of model server behavior), see our writeup on eBPF AI observability and trace-model inference which complements OpenTelemetry tracing of application-layer LLM calls.
Selected external reading
- OpenTelemetry Project — https://opentelemetry.io/
- OpenTelemetry Collector GitHub — https://github.com/open-telemetry/opentelemetry-collector
- Datadog APM and OTLP docs — https://docs.datadoghq.com/
- New Relic OTLP ingestion docs — https://docs.newrelic.com/
Closing notes
OpenTelemetry AI observability integrates low-level trace hygiene with high-level model observability: semantic conventions for LLMs make traces actionable, the Collector centralizes control (redaction, sampling, routing), and exporting to vendors like Datadog or New Relic is straightforward when you standardize on OTLP. Prioritize privacy-first defaults (prompt hashing and redaction), enforce bounded resource usage for exporters, and validate via realistic load tests. If you're architecting an enterprise AI factory, combine these practices with platform-level infrastructure (Kubernetes, sidecars, centralized collectors) to scale safely, and consult our enterprise infrastructure perspectives in our Enterprise AI Factories article for integration patterns across CI/CD, model deployment, and observability.
Author: MAKB — Lead Editor & Principal Engineer-Author. Evidence-led, practical guidance for production AI systems.