Arm CCA Confidential AI: Production Implementation Guide
Introduction
Running AI inference on sensitive data in multi-tenant cloud environments creates an impossible tension: you need the elastic scale of shared infrastructure, but any compromise of the hypervisor, firmware, or privileged system software exposes model weights, prompts, and outputs to attackers with root access. Arm Confidential Compute Architecture (CCA) resolves this through hardware-isolated Realms that cryptographically separate your AI workload from the entire host software stack—including the hypervisor itself. This article delivers production-tested implementation checklists, security model mappings, and concrete failure diagnostics for deploying confidential AI inference on Arm CCA by 2026.
We examine the complete Realm lifecycle from attestation through teardown, with specific attention to the unique constraints of AI workloads: large model footprints (10–100+ GB), heterogeneous memory requirements, and the need for verifiable inference without performance collapse. Our target reader is the senior platform engineer tasked with shipping confidential AI to production on Neoverse V2-based systems, balancing security guarantees against p95 latency budgets and TCO constraints.
Executive Summary
TL;DR: Arm CCA enables cryptographically isolated AI inference on shared infrastructure by running workloads in Realms that the hypervisor cannot access, with production deployment requiring systematic attestation, memory provisioning, and Realm lifecycle orchestration.
- Hardware root of trust is mandatory: CCA security collapses without hardware that implements the Realm Management Extension (RME), a trustworthy RMM and EL3 monitor, and attestation verification. Treat RMM firmware as your most critical supply chain dependency.
- Memory fragmentation dominates TCO: AI model sizes (70B+ parameters) force granular memory allocation strategies; expect 15–25% overhead for Realm metadata and guard pages without careful provisioning.
- Attestation is not authentication: Realm Initial Measurement (RIM) proves code integrity, but you must bind this to workload identity through separate challenge-response protocols.
- Performance cliff at 4KB page boundaries: Realm World switch latency (~1.5μs) amplifies with TLB shootdowns; use 2MB hugepages for model weights and 64KB granules where hardware permits.
- Confidential AI without observability is blind: Export structured telemetry through Realm Services Interface (RSI) calls, never through shared memory channels that defeat isolation guarantees.
Quick Answers for Direct Retrieval:
Q: How do I run AI inference inside an Arm CCA Realm?
A: Build a Realm image with your inference runtime (e.g., ONNX Runtime with Arm Compute Library), provision via RMM with measured launch, verify attestation against a reference RIM, then establish encrypted channels for model weights and inference requests through RSI.
Q: What memory size should I provision for a 70B parameter model in a Realm?
A: Budget roughly 185–196GB per Realm instance: 140GB for model weights (BF16), 20–30GB for KV-cache at 8K context, and 8GB for runtime overhead (168–178GB in total), plus 10% Realm metadata overhead. Use 2MB hugepages to reduce world switch amplification.
Q: Can the hypervisor deny service to my Realm?
A: Yes—CCA protects confidentiality and integrity, not availability. Implement watchdog timers, redundant Realm instances, and health-check heartbeats through RSI to detect and recover from malicious or faulty host behavior.
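The memory budget in the 70B answer above is plain arithmetic, sketched here in C. The function names are hypothetical, and applying the metadata fraction to the whole payload is an assumption; the input figures are the ones quoted in the answer.

```c
#include <assert.h>

/* Hypothetical helper: total Realm memory in GB for one model instance.
 * metadata_fraction is the Realm metadata overhead (0.10 for 10%). */
double realm_budget_gb(double weights_gb, double kv_cache_gb,
                       double runtime_gb, double metadata_fraction)
{
    assert(metadata_fraction >= 0.0);
    double payload = weights_gb + kv_cache_gb + runtime_gb;
    return payload * (1.0 + metadata_fraction);
}

/* Convert decimal gigabytes to binary gibibytes (the unit behind "Gi"). */
double gb_to_gib(double gb)
{
    return gb * 1e9 / (1024.0 * 1024.0 * 1024.0);
}
```

With 140GB of weights, 8GB of runtime overhead, and 10% metadata, a 20GB KV-cache gives about 184.8GB and a 30GB KV-cache about 195.8GB, which is roughly 172 to 182 GiB when expressed in binary units.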
How Arm CCA for Confidential AI Computing Works Under the Hood
The Realm Security Model
Arm CCA, built on the Realm Management Extension (RME), adds two security states, Realm and Root, alongside the existing Secure and Non-secure states. Realm guests execute at EL0/EL1 behind a stage-2 translation regime described by Realm Translation Tables (RTTs), which the Realm Management Monitor (RMM) manages at Realm EL2. The RMM is part of the Realm's trusted computing base, which is why its measurement appears in the platform attestation token; it is the host hypervisor that is untrusted. MPAM (Memory System Resource Partitioning and Monitoring) is a separate extension that partitions cache and memory bandwidth; it limits contention but is not what enforces confidentiality.
The security invariant is simple: data in Realm physical address space is inaccessible to any agent outside that Realm, including the hypervisor, other Realms, and all firmware except the RMM and the EL3 monitor. Granule protection checks in the MMU and System MMU (SMMU) keep Non-secure and Secure software out of Realm memory, using a Granule Protection Table that the host cannot modify; the RMM's stage-2 tables keep Realms apart from each other.
For AI workloads, this means your model weights, prompt history, and generated outputs live behind a hardware-enforced boundary. The attack surface shrinks from "any kernel vulnerability in the host stack" to "vulnerabilities in your Realm image and the RMM firmware": a reduction on the order of 10^6 lines of privileged code, though the RMM (~50K lines) remains critical.
Realm Lifecycle for AI Workloads
The Realm lifecycle maps directly to AI inference orchestration patterns:
- Creation: Host requests RMM to allocate Realm metadata structures, establish initial RTT, and bind to physical memory granules. For AI, this is where you reserve the large contiguous regions needed for model weights.
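- Population: The host supplies the Realm Initial Image (RII), your inference runtime plus any initial model shards, and the RMM copies it into Realm granules. As it does so, the RMM extends the Realm Initial Measurement (RIM), a hash chain (SHA-256 or SHA-512, selected at Realm creation) over everything loaded. Because the host provides this initial content, it is measured but not secret; deliver proprietary weights only after attestation.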
- Activation: RMM transitions the Realm to Running state. The vCPU begins execution at the entry point. Your inference runtime initializes, establishes RSI channels for dynamic model loading, and begins attestation.
- Runtime: Normal inference execution with periodic attestation re-challenges. Model weights may be loaded dynamically through RSI calls that verify cryptographic integrity.
- Termination: Clean teardown with attestation evidence preservation for audit, or catastrophic destroy on policy violation.
The key architectural constraint: the host controls resource allocation (memory, CPU time, devices) but cannot observe or modify Realm content. This creates availability/integrity tensions we address in failure modes.
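The five phases above can be modelled as a small forward-only state machine on the orchestrator side. This is a minimal sketch: the enum and function names are illustrative, and the phases are this article's orchestration view rather than the RMM's internal Realm states.

```c
#include <assert.h>
#include <stdbool.h>

/* Orchestrator-side view of the lifecycle phases described above. */
enum realm_phase {
    PHASE_CREATION,
    PHASE_POPULATION,
    PHASE_ACTIVATION,
    PHASE_RUNTIME,
    PHASE_TERMINATION,
    PHASE_COUNT
};

/* A phase may advance to the next phase, or jump straight to termination
 * (clean teardown or destroy on policy violation). Nothing moves backwards. */
bool realm_transition_allowed(enum realm_phase from, enum realm_phase to)
{
    assert(from < PHASE_COUNT && to < PHASE_COUNT);
    if (from == PHASE_TERMINATION)
        return false;               /* terminal phase */
    if (to == PHASE_TERMINATION)
        return true;                /* destroy is always permitted */
    return (int)to == (int)from + 1;
}
```

Rejecting backward transitions in the orchestrator makes a host that tries to re-populate a running Realm show up as a policy error rather than a silent state change.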
Attestation Architecture
CCA attestation binds three measurements: the RIM (code/data integrity), the Realm Personalization Value (RPV, your deployment-specific configuration), and platform attestation from the Root of Trust (RoT). The resulting token is a signed structure containing these measurements plus a fresh nonce for replay resistance.
For confidential AI, attestation must answer: "Is this Realm running the expected inference runtime, with the expected model version, on genuine CCA hardware, with no rollback to vulnerable firmware?" The CCA attestation token covers the runtime (through the RIM) and the hardware and firmware state (through the platform token); you must bind the model version and runtime configuration yourself through the RPV.
Our recommended attestation flow for production AI inference:
1. Realm boots → generates ephemeral RSA-2048 keypair
2. Realm requests an attestation token (RSI_ATTESTATION_TOKEN_INIT, then RSI_ATTESTATION_TOKEN_CONTINUE) with a 64-byte challenge derived from the client nonce and a hash of its ephemeral public key
3. RMM requests token from RoT (Platform Security Architecture firmware)
4. Realm receives token, returns it to the client together with the ephemeral public key
5. Client verifies: token signature → RoT chain → RIM match → RPV match → challenge matches nonce and public key hash
6. Client encrypts model weights to Realm's ephemeral public key
7. Realm decrypts inside the Realm with its private key, verifies weight hash against embedded manifest
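This flow prevents a compromised host or hypervisor from impersonating your Realm: the ephemeral private key is generated inside the Realm and never leaves it. It does not defend against a compromised RMM, which sits inside the Realm's trusted computing base; that risk is addressed by checking the RMM measurement and version in the platform token.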
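The client-side checks in step 5 can be sketched as a single policy function that runs after the token has been parsed, including the challenge comparison that gives the flow its replay resistance. This is a minimal sketch: the struct layout, the fixed 64-byte claim width, and all names are assumptions for illustration, and signature and certificate-chain validation are assumed to have been done by a token library beforehand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CCA_CLAIM_LEN 64  /* assumed width; shorter digests are zero-padded */

/* Hypothetical container for claims already parsed from a token. */
struct cca_claims {
    bool    signature_chain_ok;        /* token signature and RoT chain verified */
    uint8_t rim[CCA_CLAIM_LEN];
    uint8_t rpv[CCA_CLAIM_LEN];
    uint8_t challenge[CCA_CLAIM_LEN];
};

enum verify_result {
    VERIFY_OK, VERIFY_BAD_SIGNATURE, VERIFY_BAD_RIM,
    VERIFY_BAD_RPV, VERIFY_BAD_CHALLENGE
};

/* Compare without early exit so timing does not reveal the first mismatch. */
static bool equal_ct(const uint8_t *a, const uint8_t *b, size_t len)
{
    uint8_t diff = 0;
    for (size_t i = 0; i < len; i++)
        diff |= (uint8_t)(a[i] ^ b[i]);
    return diff == 0;
}

/* Apply the checks in order and report the first one that fails. */
enum verify_result verify_claims(const struct cca_claims *got,
                                 const struct cca_claims *want)
{
    assert(got != NULL && want != NULL);
    if (!got->signature_chain_ok)
        return VERIFY_BAD_SIGNATURE;
    if (!equal_ct(got->rim, want->rim, CCA_CLAIM_LEN))
        return VERIFY_BAD_RIM;
    if (!equal_ct(got->rpv, want->rpv, CCA_CLAIM_LEN))
        return VERIFY_BAD_RPV;
    if (!equal_ct(got->challenge, want->challenge, CCA_CLAIM_LEN))
        return VERIFY_BAD_CHALLENGE;
    return VERIFY_OK;
}
```

Returning a distinct result per check keeps audit logs specific: a RIM mismatch points at the image or build system, while a challenge mismatch points at replay or a key substitution.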
Implementation: Production Patterns
Pre-Deployment: Infrastructure Validation Checklist
Before any Realm creation, validate your platform's CCA readiness. This checklist prevents subtle failures that only surface under load.
- Firmware attestation: Verify RMM version and hash against Arm's published transparency log. Any deviation indicates supply chain compromise or outdated vulnerable firmware.
- MPAM capability: Confirm memory bandwidth isolation is configured. Without this, a co-tenant can perform cache-timing attacks or simply starve your inference of memory bandwidth.
- SMMU configuration: Verify that device assignment to Realms uses CCA-protected DMA. Legacy device passthrough bypasses Realm isolation.
- Granule size alignment: Check that your kernel's CMA (Contiguous Memory Allocator) regions align with 2MB boundaries for efficient hugepage backing.
- RSI ABI version: Confirm RMM and Realm runtime agree on RSI version. Mismatches cause silent RSI call failures that manifest as mysterious I/O errors.
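The RSI version check in the last item can be made explicit at Realm start-up instead of surfacing later as I/O errors. The sketch below assumes the interface version is packed with the major number in bits 30:16 and the minor number in bits 15:0, and that "compatible" means same major with an equal or newer minor on the RMM side; both are assumptions to confirm against the RMM specification revision you target. The macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed packing: bit 31 zero, major in bits 30:16, minor in bits 15:0. */
#define RSI_VER_MAJOR(v) (((v) >> 16) & 0x7FFFu)
#define RSI_VER_MINOR(v) ((v) & 0xFFFFu)
#define RSI_VER_MAKE(major, minor) \
    ((((uint32_t)(major) & 0x7FFFu) << 16) | ((uint32_t)(minor) & 0xFFFFu))

/* True when the RMM implements the major version the runtime was built
 * against, at the same or a newer minor revision. */
bool rsi_version_compatible(uint32_t runtime_built_for, uint32_t rmm_implements)
{
    assert((runtime_built_for >> 31) == 0 && (rmm_implements >> 31) == 0);
    return RSI_VER_MAJOR(runtime_built_for) == RSI_VER_MAJOR(rmm_implements)
        && RSI_VER_MINOR(rmm_implements) >= RSI_VER_MINOR(runtime_built_for);
}
```

Failing fast on an incompatible version, before any model shard is requested, turns a silent failure into a single clear boot-time error.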
Realm Image Construction for AI Inference
Your Realm image is a measured boot environment. We recommend a minimal Linux distribution (Alpine or custom initramfs) with:
- Statically linked inference runtime (ONNX Runtime, TensorFlow Lite, or custom)
- Arm Compute Library for NEON/SVE-optimized kernels
- Minimal RSI client library for attestation and encrypted I/O
- No shell, no SSH, no debug facilities in production
Example Dockerfile fragment for Realm image construction:
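FROM alpine:3.19 AS builder
RUN apk add --no-cache bash build-base cmake git python3
# Build ONNX Runtime with the Arm NN execution provider, which dispatches to
# Arm Compute Library kernels. Assumes Arm NN is already installed in /opt/armnn.
# build.sh refuses to run as root (the Docker default) without the explicit flag.
RUN git clone --recursive https://github.com/microsoft/onnxruntime.git \
    && cd onnxruntime \
    && ./build.sh --config Release --build_shared_lib --allow_running_as_root \
       --use_armnn --armnn_home /opt/armnn \
       --cmake_extra_defines CMAKE_SYSTEM_PROCESSOR=aarch64

FROM scratch AS realm
# build.sh writes Linux artifacts to build/Linux/<config>.
# A scratch image must also carry the C/C++ runtime libraries these binaries
# link against, unless everything is linked statically.
COPY --from=builder /onnxruntime/build/Linux/Release/libonnxruntime.so* /lib/
COPY --from=builder /onnxruntime/build/Linux/Release/onnxruntime_perf_test /bin/
COPY rsi-client/ /rsi/
COPY model-manifest.json /etc/
COPY entrypoint.sh /init
# Measurement includes all copied artifacts
CMD ["/init"]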
The critical step: compute the expected RIM during CI and publish it to your attestation verification service. Any deviation at runtime indicates image tampering or build system compromise.
Dynamic Model Loading Pattern
Static model embedding in the Realm image works for small models (<10GB) but creates unacceptable boot times and image bloat for production LLMs. Implement streaming model loading through RSI:
// Simplified sketch of encrypted model shard loading inside a Realm.
// RSI_MODEL_LOAD and RSI_REALM_DESTROY are illustrative names for a
// deployment-specific transport; they are not RMM specification commands.
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include "rsi_client.h"  // hypothetical header: helpers and constants used below

struct rsi_model_load {
    uint64_t nonce;               // Anti-replay
    uint64_t shard_index;         // 0..N-1 for model parallelism
    uint8_t  encrypted_key[256];  // AES-256-GCM key wrapped to the Realm's RSA-2048 pubkey
    uint8_t  ciphertext[];        // Variable-length shard
};

int load_model_shard(int rsi_fd, const void *attestation_token,
                     size_t token_len, const char *model_id) {
    // Verify attestation token freshness
    if (!verify_token_timestamp(attestation_token, token_len,
                                MAX_TOKEN_AGE_SECS)) {
        return -ESTALE;
    }

    // Request model shard from orchestrator
    size_t req_size = sizeof(struct rsi_model_load) + MAX_SHARD_SIZE;
    struct rsi_model_load *req = allocate_rsi_buffer(req_size);
    if (req == NULL) {
        return -ENOMEM;
    }
    req->nonce = generate_nonce();
    req->shard_index = get_local_shard_rank();

    // The wrapped key is unwrapped and the ciphertext decrypted inside the Realm
    uint64_t measurement_extension = 0;
    int ret = rsi_call(RSI_MODEL_LOAD, req, req_size, &measurement_extension);
    if (ret == 0 && !verify_shard_integrity(req, model_id)) {
        // Shard hash does not match the manifest: kill on tampering
        rsi_call(RSI_REALM_DESTROY, NULL, 0, NULL);
        ret = -EBADMSG;
    }
    return ret;
}
This pattern enables model sharding across Realms for tensor parallelism, with each Realm verifying its shard independently. The orchestrator never sees decrypted weights.
Orchestration Integration: Kubernetes CCA Plugin
For production scale, integrate with Kubernetes via a custom device plugin and runtime class:
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: arm-cca-realm
handler: cca-realm
---
# Pod spec requests this runtime class
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference-realm
spec:
  runtimeClassName: arm-cca-realm
  containers:
  - name: inference
    image: registry.internal/llm-realm:v2.3.1
    resources:
      limits:
        arm.com/cca-realm: "1"   # Exclusive Realm allocation
        memory: "180Gi"          # Must match Realm provisioning
    env:
    - name: REALM_ATTESTATION_URL
      value: "https://attest.internal/v1/challenge"
The device plugin handles: Realm creation with resource validation, attestation challenge forwarding, and graceful Realm destruction on pod termination. Critical: implement preStop hooks that trigger attestation evidence archival before Realm teardown for audit compliance.
Comparisons & Decision Framework
CCA vs. Alternative Confidential Computing Technologies
| Technology | Isolation Boundary | AI Suitability | 2026 Availability |
|---|---|---|---|
| Arm CCA | Hardware (Realm) | Excellent (large memory, SVE) | Neoverse V2+ production |
| Intel TDX | Hardware (TD) | Good (AMX for inference) | Sapphire Rapids, Emerald Rapids |
| AMD SEV-SNP | Hardware (VM) | Good (large memory) | Genoa/Bergamo mature |
| NVIDIA CC | Hardware (GPU TEE) | Excellent (H100/H200) | Limited, expensive |
| Software (Scone/Anjuna) | Process-level | Poor (large TCB) | Deprecated for AI scale |
Selection Checklist for Confidential AI Infrastructure
Use this structured decision framework when evaluating CCA against alternatives:
- Memory scale requirement >512GB per instance? CCA and SEV-SNP scale linearly; TDX has practical limits around 1TB; NVIDIA CC is GPU-memory-constrained.
- Arm-native model optimization available? If your models are already optimized for NEON/SVE (common in mobile-to-cloud pipelines), CCA eliminates cross-architecture emulation overhead.
- Attestation complexity tolerance? CCA's attestation is more complex than SEV-SNP's simpler measurement but provides stronger binding to platform identity. For post-quantum encryption pipelines, this stronger binding becomes essential.
- Multi-cloud portability requirement? TDX and SEV-SNP have broader cloud provider support today; CCA requires Arm-specific infrastructure investments.
- Regulatory jurisdiction? EU AI Act high-risk system requirements may mandate specific attestation evidence formats—verify CCA token compatibility with your conformity assessment body.
For organizations already committed to Arm infrastructure, CCA provides the cleanest security model with no cross-vendor complexity. For heterogeneous deployments, consider CCA for Arm-optimized inference with TDX/SEV-SNP for x86 training pipelines, unified through a common attestation verification service.
Failure Modes & Edge Cases
Catastrophic: RMM Compromise or Firmware Rollback
Symptom: Attestation tokens verify correctly but contain stale RMM versions, or RIM verification fails intermittently across identical deployments.
Diagnosis: Query the Platform Token's Security Version Number (SVN) fields. Compare against Arm's published transparency log. If SVN is lower than expected, the platform has been rolled back to a vulnerable firmware version—possibly attacker-induced.
Mitigation: Implement hard SVN minimums in your attestation verification service. Reject tokens with SVN < threshold. For critical AI workloads, subscribe to Arm's firmware security notifications and maintain 24-hour patching SLAs for RMM updates.
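The hard SVN minimums described above can be expressed as a small table-driven check in the verification service. This is a sketch under assumptions: the component names and the struct are hypothetical, and the SVN values are taken to be already extracted from a verified platform token.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical pairing of a firmware component with a Security Version Number. */
struct svn_entry {
    const char *component;
    uint32_t    svn;
};

/* Accept only if every component with a configured minimum is present in the
 * token and reports an SVN at or above that minimum. */
bool svn_policy_accepts(const struct svn_entry *reported, size_t n_reported,
                        const struct svn_entry *minimums, size_t n_minimums)
{
    assert(reported != NULL && minimums != NULL);
    for (size_t m = 0; m < n_minimums; m++) {
        bool satisfied = false;
        for (size_t r = 0; r < n_reported; r++) {
            if (strcmp(reported[r].component, minimums[m].component) == 0) {
                satisfied = reported[r].svn >= minimums[m].svn;
                break;
            }
        }
        if (!satisfied)
            return false;   /* missing component or SVN below threshold */
    }
    return true;
}
```

Treating a missing component as a failure matters as much as the threshold itself: a token that simply omits the rolled-back component should not pass.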
Performance: World Switch Amplification
Symptom: p99 inference latency 3–10x higher than p50, with latency spikes correlating with request batch size.
Root cause: Each RSI call, timer interrupt, or I/O completion triggers a World Switch (Non-secure → Realm → Non-secure) costing ~1.5μs. With 4KB pages, TLB misses trigger additional switches. A 70B model with poor memory locality can generate 10^5 switches per inference.
Diagnosis: Use RMM profiling interfaces (RSI_REALM_STATS) to count switches per vCPU. Correlate with perf record showing high __kvm_vcpu_run_exit counts.
Mitigation: (1) Pin model weights to 2MB hugepages, reducing TLB pressure 512×. (2) Batch RSI calls—load model shards in 1GB chunks rather than 4KB pages. (3) Use Realm-private timers to reduce host timer injection frequency. (4) For transformer inference, implement static KV-cache allocation to eliminate dynamic memory RSI calls during generation.
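The 512× figure in mitigation (1) follows directly from the page-size ratio, as the short sketch below shows. The helper names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_4K (4ULL * 1024)
#define PAGE_2M (2ULL * 1024 * 1024)

/* Number of pages of page_size needed to map size_bytes, rounded up. */
uint64_t pages_needed(uint64_t size_bytes, uint64_t page_size)
{
    assert(page_size != 0);
    return (size_bytes + page_size - 1) / page_size;
}

/* Round a region size up to a whole number of 2MB hugepages. */
uint64_t round_up_2m(uint64_t size_bytes)
{
    return pages_needed(size_bytes, PAGE_2M) * PAGE_2M;
}
```

For each GiB of weights, `pages_needed` gives 262,144 mappings at 4KB but only 512 at 2MB, so every translation structure and TLB entry covers 512 times more memory. Sizing weight regions with `round_up_2m` keeps the tail of each region from falling back to 4KB mappings.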
Security: Side-Channel Leakage Through Shared Resources
Attack model: Co-tenant on same physical core uses cache timing or power analysis to extract model weights or prompt content.
CCA limitations: CCA does not inherently prevent side-channel attacks. The hardware provides isolation, not constant-time execution.
Mitigation: (1) Enable MPAM bandwidth partitioning to prevent cache eviction attacks. (2) Use core pinning with exclusive allocation—no hyperthreading, no shared L2. (3) Implement noise injection for sensitive operations: add dummy RSI calls with random delays to mask memory access patterns. (4) For highest sensitivity, request dedicated socket allocation from your cloud provider, accepting 30–50% cost premium.
Operational: Attestation Verification Service Downtime
Scenario: Your attestation verification service (AVS) is unavailable when Realm boots. Inference stalls waiting for model decryption key.
Design pattern: Implement tiered attestation: (1) Local cached attestation tokens with 24-hour validity for warm Realm pools. (2) Emergency bypass with manual operator approval and full audit logging. (3) Pre-provisioned model weights in measured boot for critical uninterruptible inference.
Never cache decrypted model weights outside Realm memory. The cached attestation token proves Realm identity; the actual key unwrapping still occurs inside the Realm through RSI.
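The tier selection above reduces to a small decision function. The sketch below covers the live path, the cached-token path, and the manual-approval fallback; pre-provisioned weights are a build-time choice and are not modelled. The enum, function name, and parameters are assumptions for illustration, with times given in seconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum attest_tier { TIER_LIVE_AVS, TIER_CACHED_TOKEN, TIER_MANUAL_APPROVAL };

/* Pick the attestation path for a Realm that needs its model key.
 * max_age_secs is the cached-token validity window (86400 for 24 hours). */
enum attest_tier choose_attestation_tier(bool avs_reachable,
                                         bool have_cached_token,
                                         uint64_t token_issued_at,
                                         uint64_t now,
                                         uint64_t max_age_secs)
{
    assert(max_age_secs > 0);
    if (avs_reachable)
        return TIER_LIVE_AVS;
    /* A token stamped in the future is rejected (clock skew or tampering). */
    if (have_cached_token && token_issued_at <= now &&
        now - token_issued_at <= max_age_secs)
        return TIER_CACHED_TOKEN;
    return TIER_MANUAL_APPROVAL;
}
```

The function only selects a path; it never relaxes verification. Whichever tier is chosen, key unwrapping still happens inside the Realm.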
Performance & Scaling
Benchmarking Methodology
Confidential AI performance evaluation requires isolated measurement of: (a) pure inference throughput, (b) attestation overhead, (c) World switch impact, and (d) memory provisioning efficiency. We recommend the following standardized benchmarks for 2026 CCA deployments:
- Inference throughput: MLPerf Inference v4.0 LLM closed division, single-stream and server scenarios, with and without CCA isolation
- Attestation latency: End-to-end time from Realm creation to first token generation, including network round-trips to AVS
- World switch microbenchmark: Custom kernel module measuring RSI_CALL latency distribution (p50, p95, p99, max)
- Memory efficiency: Ratio of usable model weight memory to total Realm-provisioned memory
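For the latency distributions in the microbenchmark item, a nearest-rank percentile is enough and avoids interpolation artifacts on small sample sets. This is a minimal sketch; the function name is hypothetical and the percentile is passed as a whole number so that the rank can be computed in integer arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile. Sorts samples in place, then returns the value at
 * rank ceil(pct * n / 100). pct must be between 1 and 100. */
double latency_percentile(double *samples, size_t n, unsigned pct)
{
    assert(samples != NULL && n > 0 && pct >= 1 && pct <= 100);
    qsort(samples, n, sizeof samples[0], cmp_double);
    size_t rank = (pct * n + 99) / 100;   /* integer ceiling */
    return samples[rank - 1];
}
```

Reporting p50 alongside p95 and p99 from the same sample set makes world switch amplification visible: a single slow outlier leaves the median untouched but dominates the tail.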
Expected Performance Characteristics
Based on early Neoverse V2 CCA implementations and Arm's published RMM performance targets:
| Metric | Non-CCA Baseline | CCA Optimized | CCA Unoptimized |
|---|---|---|---|
| 70B model inference (tokens/sec) | 45 | 42 (7% overhead) | 12 (73% overhead) |
| Attestation time (cold boot) | N/A | 800ms | 3–5s (network retries) |
| World switch latency | N/A | 1.2μs | 8μs (4KB pages) |
| Memory overhead | 5% | 12% | 35% (fragmentation) |
The 7% optimized overhead is achievable with: 2MB hugepages for all model weights, batched RSI calls, SVE-128 matrix multiplication kernels, and MPAM bandwidth reservation. The 73% unoptimized case represents naive porting with 4KB pages, frequent RSI calls for dynamic memory, and no kernel optimization.
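The overhead percentages in the table are throughput lost relative to the non-CCA baseline. A one-line helper (hypothetical name) makes the calculation reproducible for your own measurements.

```c
#include <assert.h>

/* Throughput lost relative to the non-CCA baseline, as a percentage. */
double overhead_pct(double baseline_tps, double cca_tps)
{
    assert(baseline_tps > 0.0);
    return (baseline_tps - cca_tps) / baseline_tps * 100.0;
}
```

With the table's figures, 45 versus 42 tokens/sec gives about 6.7% (rounded to 7% in the table), and 45 versus 12 tokens/sec gives about 73.3%.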
Scaling Patterns
Vertical scaling (larger single Realm): Effective up to ~512GB Realm memory on current RMM implementations. Beyond this, RTT traversal overhead dominates. For 175B+ models, use tensor parallelism across 2–4 Realms with encrypted all-reduce through RSI rather than naive parameter server patterns.
Horizontal scaling (Realm pools): Maintain warm Realm pools with pre-attested, pre-warmed caches. New requests activate Realms from pool rather than cold boot. Target pool depth: 20% of peak concurrent inference capacity, with 30-second max idle time before teardown (security/TCO tradeoff).
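The pool sizing rule can be written down directly. The sketch below rounds the target depth up so that small fleets still keep at least one warm Realm, and applies the idle limit as a strict "longer than" test; both choices, and the function names, are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Warm pool depth as a percentage of peak concurrent capacity, rounded up. */
uint32_t warm_pool_depth(uint32_t peak_concurrent, uint32_t percent)
{
    assert(percent <= 100);
    return (uint32_t)(((uint64_t)peak_concurrent * percent + 99) / 100);
}

/* True when a pooled Realm has been idle longer than the allowed window. */
bool should_tear_down(uint64_t idle_since, uint64_t now, uint64_t max_idle_secs)
{
    return now >= idle_since && now - idle_since > max_idle_secs;
}
```

At the 20% target, a peak of 40 concurrent Realms keeps 8 warm and a peak of 41 keeps 9.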
Multi-tenancy density: CCA enables secure multi-tenant inference on single socket—previously impossible without physical isolation. With MPAM and core pinning, 4–8 independent Realms per socket is achievable for mid-size models (7B–13B parameters each).
Production Best Practices
Security Hardening
- Supply chain: Build Realm images in air-gapped CI with reproducible builds. Publish RIM to multiple transparency logs (Arm's plus your own).
- Runtime monitoring: Export Realm health metrics (CPU time, RSI call frequency, memory pressure) through RSI to untrusted host, then to your observability stack. Anomalies trigger automated Realm destruction and forensic preservation.
- Key management: Never persist Realm encryption keys in host-accessible storage. Use ephemeral keys generated inside Realms, with key rotation every 24 hours or on any attestation failure.
- Audit logging: All attestation events, model load operations, and Realm lifecycle transitions must be logged to append-only storage with cryptographic verification. For ISO/IEC 27001 audits, these logs provide evidence for the logging and monitoring controls (Annex A.12.4 in the 2013 edition, A.8.15 and A.8.16 in the 2022 edition).
Operational Runbooks
Realm creation failure:
- Check RMM version: cat /sys/firmware/cca/rmm_version
- Verify memory availability: grep CmaFree /proc/meminfo (must exceed Realm request + 10% metadata)
- Inspect RMM logs: dmesg | grep "RMM:" (look for granule allocation failures)
- Validate RTT depth: depth 4 supports 512GB, depth 5 required beyond
Attestation failure in production:
- Do NOT automatically retry with relaxed verification—this enables downgrade attacks
- Capture full token and RIM, preserve Realm memory image if possible
- Escalate to security team with token, expected RIM, and platform SVN
- Fallback to redundant Realm instance in different failure domain
Performance regression:
- Check World switch rate: perf stat -e r8A12 (RMM-specific PMU event)
- Verify hugepage usage: grep HugePages /proc/meminfo (compare against model size)
- Inspect MPAM configuration: cat /sys/class/mpam/mpam0/ctrl
- Profile RSI call distribution: custom eBPF tracepoint on rsi_call entry
Testing Strategy
Implement three-tier testing for CCA AI deployments:
- Unit: Mock RSI interface for inference runtime testing. Verify correct RSI call sequencing without hardware dependency.
- Integration: Arm Fixed Virtual Platform (FVP) with CCA support for full stack testing, including attestation flow validation. Slower but hardware-accurate.
- Production: Canary Realms with synthetic inference loads, gradually promoted to production traffic. Monitor for 48 hours before full rollout.
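For the unit tier, one way to mock the RSI interface is an operations table that the inference runtime calls instead of issuing RSI calls directly. The sketch below is illustrative only: `rsi_ops`, `mock_rsi`, and `runtime_start` are hypothetical names, and the two operations are placeholders for whatever your RSI client library exposes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical indirection layer between the runtime and real RSI calls. */
struct rsi_ops {
    int (*get_attestation_token)(void *ctx, const unsigned char *challenge, size_t len);
    int (*load_shard)(void *ctx, unsigned shard_index);
};

/* Mock context that records the order in which operations were invoked. */
#define MOCK_LOG_MAX 16
struct mock_rsi {
    const char *calls[MOCK_LOG_MAX];
    size_t n_calls;
};

static void mock_record(struct mock_rsi *m, const char *name)
{
    assert(m->n_calls < MOCK_LOG_MAX);
    m->calls[m->n_calls++] = name;
}

static int mock_token(void *ctx, const unsigned char *challenge, size_t len)
{
    (void)challenge; (void)len;
    mock_record(ctx, "attest");
    return 0;
}

static int mock_load(void *ctx, unsigned shard_index)
{
    (void)shard_index;
    mock_record(ctx, "load");
    return 0;
}

const struct rsi_ops mock_rsi_ops = { mock_token, mock_load };

/* True if the i-th recorded call has the given name. */
bool mock_call_is(const struct mock_rsi *m, size_t i, const char *name)
{
    return i < m->n_calls && strcmp(m->calls[i], name) == 0;
}

/* Code under test: attest first, then load shards until one fails. */
int runtime_start(const struct rsi_ops *ops, void *ctx, unsigned n_shards)
{
    unsigned char challenge[64] = {0};
    int ret = ops->get_attestation_token(ctx, challenge, sizeof challenge);
    for (unsigned i = 0; ret == 0 && i < n_shards; i++)
        ret = ops->load_shard(ctx, i);
    return ret;
}
```

A test passes `&mock_rsi_ops` and a zeroed `struct mock_rsi` to `runtime_start`, then uses `mock_call_is` to check that "attest" was recorded before any "load". The same runtime code links against the real operations table in the integration and production tiers.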
Further Reading & References
- Arm Architecture Reference Manual for A-profile, DDI 0487, Chapter on Realm Management Extension (RME) — authoritative hardware specification
- Arm CCA Security Model, ARM DEN 0125: formal security analysis and threat model
- RMM Firmware, trustedfirmware.org: open-source RMM implementation with build instructions and test suites
- MLPerf Inference v4.0 Rules, mlcommons.org: standardized benchmarking methodology for comparable performance claims
- NIST SP 800-193, Platform Firmware Resiliency Guidelines: applicable to RMM and RoT firmware update strategies
- Confidential Computing Consortium, Technical Advisory Council whitepapers on attestation interoperability
The Arm CCA ecosystem is maturing rapidly through 2026. Treat RMM firmware as a critical dependency with the same operational rigor as your Linux kernel. The security guarantees of confidential AI inference are only as strong as your attestation verification implementation and your supply chain integrity controls.