The measurement layer that AI governance now legally requires.
AI is no longer an experimental technology operating at the margins of policy. It now sits squarely — and uncomfortably — at the centre of global legislative attention. Regulators have drawn a line.
In August 2025, the General Purpose AI obligations under the EU Artificial Intelligence Act became enforceable. Any provider placing a GPAI model on the market must now produce technical documentation, demonstrate model evaluation, and conduct adversarial testing. They must disclose capabilities and limitations with a degree of clarity that has never previously been required. Models judged to pose systemic risk are subject to even stricter obligations. The penalty regime — fines reaching three percent of global annual turnover — is active. These are not aspirational standards. They are binding mandates.
The General Purpose AI Code of Practice, published in July 2025, sets out the operational framework for compliance. It demands state-of-the-art evaluation, including adversarial testing and systematic assessment of model behaviour. In parallel, the NIST AI Risk Management Framework calls for quantitative evidence of validity, reliability, robustness, and fairness. Both frameworks assume the existence of measurement instruments capable of producing the evidence they require. Those instruments do not yet exist.
Aggregate accuracy on benchmark suites cannot satisfy Article 15’s robustness requirements. Confidence calibration does not meet Article 9’s risk management obligations. Standard documentation practices cannot report what has not been measured. The regulatory architecture now demands a level of epistemic visibility that current evaluation methods cannot provide.
AI must become measurable, governable, and certifiable. This marks the beginning of a multi-decade transformation in how digital systems are built, evaluated, and trusted.
AIDA Research provides the scientific foundation for that transformation. Our Adaptive Inference Decision Architecture — AIDA — fills a structural void at the heart of AI governance. Its three convergent instruments, each reading per-sample and per-model epistemic evidence, create the measurement ecosystem that modern regulation presupposes.
This is not an incremental improvement to existing evaluation practice. It is the missing measurement layer upon which the next era of AI governance must be built.
An AIDA certificate does not say “this model scored 80%”. It says precisely which answers arise from genuine structural knowledge, which are brittle, and which are epistemically hollow — with measured evidence for every claim.
Every model–question pair is classified into one of six epistemic regimes. The certificate identifies exactly which answers can be trusted and which cannot — not as a statistical estimate, but as a measured geometric fact.
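As a concrete sketch, a per-sample record on such a certificate might look like the following. The field names and the integer regime coding are placeholders of ours, since this page describes the six regimes only qualitatively (structural knowledge, brittle, hollow).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SampleAssessment:
    """One model-question pair on an AIDA-style certificate.

    Field names are illustrative placeholders: the page states only
    that each pair falls into one of six epistemic regimes and that
    the certificate says which answers can be trusted.
    """
    question_id: str
    predicted: str
    correct: bool   # outcome-accuracy contribution
    regime: int     # 1..6; the page does not name the six regimes
    trusted: bool   # derived from the regime, not from a confidence score

def trusted_share(records: list[SampleAssessment]) -> float:
    """Fraction of answers the certificate marks as trustworthy."""
    return sum(r.trusted for r in records) / len(records)
```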
FEST systematically stress-tests each answer across nine configurations. The certificate reports the model’s accuracy range, distractor dependence, and binary advantage — exposing format-dependent performance invisible to aggregate benchmarks.
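A minimal sketch of how such a sweep could be aggregated. The harness hook `evaluate` and the configuration names, including a `binary` two-option variant and a `standard` baseline, are assumptions; the page does not specify how FEST's nine configurations are constructed.

```python
from statistics import mean

def fest_summary(evaluate, configurations, questions):
    """Aggregate a FEST-style sweep over format configurations.

    evaluate(config, questions) -> list[bool] is a hypothetical harness
    hook returning per-question correctness under one configuration.
    """
    per_config = {cfg: mean(evaluate(cfg, questions)) for cfg in configurations}
    accs = list(per_config.values())
    report = {
        "per_config_accuracy": per_config,
        "accuracy_range": (min(accs), max(accs)),  # spread = format sensitivity
    }
    # Binary advantage: accuracy gained when the question is reduced to a
    # two-option form, if the sweep includes such a variant.
    if "binary" in per_config and "standard" in per_config:
        report["binary_advantage"] = per_config["binary"] - per_config["standard"]
    return report
```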
Layer-by-layer trajectory analysis measures decision depth, flip count, entropy sharpening, and view concordance. The certificate defines the model’s stability envelope — the conditions under which it reasons cleanly and where it fails.
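Two of the named quantities, flip count and decision depth, can be reconstructed from per-layer readouts, for instance a logit-lens pass over the answer options. The definitions below are plausible reconstructions of ours, not AIDA's published ones.

```python
import math

def trajectory_metrics(layer_probs):
    """layer_probs: one probability vector over answer options per layer.

    Flip count: how often the leading answer changes across layers.
    Decision depth: first layer from which the final answer stays fixed.
    Entropy sharpening: how much answer entropy falls from first to last
    layer. All three definitions are assumed, not taken from AIDA.
    """
    leaders = [max(range(len(p)), key=p.__getitem__) for p in layer_probs]
    flip_count = sum(a != b for a, b in zip(leaders, leaders[1:]))

    decision_depth = 0  # 0 means the final answer leads from the first layer
    for i, leader in enumerate(leaders):
        if leader != leaders[-1]:
            decision_depth = i + 1  # first layer after the last disagreement

    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0)

    sharpening = entropy(layer_probs[0]) - entropy(layer_probs[-1])
    return {"flip_count": flip_count,
            "decision_depth": decision_depth,
            "entropy_sharpening": sharpening}
```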
Certification thresholds are not chosen. They are discovered.
The AIDA framework has identified threshold values that arise directly from the geometry of ensemble correlation structures. These natural constants are not hyperparameters, heuristics, or tuneable knobs. They are structural features of the epistemic manifold — invariant across model families, datasets, and domains.
The entropy constant yields 96.6% accuracy on MMLU-Med and 96.5% on MedQA: agreement to within 0.1 percentage points across two independent medical benchmarks with different sample sizes. On MMLU-Pro, the same threshold yields 92.9%, a systematic shift attributable to option-count effects. These are not fitted results. They are emergent geometric properties.
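In operational terms, the constant acts as a certification gate: samples whose answer-distribution entropy falls below it are accepted, and accuracy is read off the accepted set. A minimal sketch follows; the value of `TAU` is a placeholder, as the page does not publish the constant itself.

```python
import math

TAU = 0.5  # placeholder; the actual entropy constant is not given here

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def gated_accuracy(samples, tau=TAU):
    """samples: list of (answer_probs, is_correct) pairs.

    Returns coverage and accuracy on the low-entropy subset, the
    quantity in which figures such as 96.6% on MMLU-Med are stated.
    """
    accepted = [ok for probs, ok in samples if entropy(probs) < tau]
    coverage = len(accepted) / len(samples)
    accuracy = sum(accepted) / len(accepted) if accepted else float("nan")
    return coverage, accuracy
```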
AIDA certification follows the precedent of established assurance bodies: Lloyd’s Register for maritime, Bureau Veritas for industrial safety, ICANN for internet governance. Independent, technically rigorous, commercially neutral.
Full epistemic assessment of a model against a defined question corpus. The certificate, based on up to 858,000 analysis records per assessment, reports outcome accuracy, structural correctness, epistemic gap, trajectory distribution, stability profile, and FEST fragility classification.
After any retraining, fine-tuning, LoRA adaptation, or quantisation, the model’s epistemic profile may have changed — even if surface accuracy is preserved. Interim certification re-assesses the modified model to determine whether clean epistemic regimes have been maintained or degraded.
Training and evaluation datasets can themselves be assessed for epistemic quality: distractor strength distributions, question difficulty profiles, and susceptibility to format-dependent performance artefacts.
AIDA is the first framework capable of epistemic certification across every stage of the model lifecycle.
Does the pre-trained model encode genuine structural knowledge, or only rote associations?
Does fine-tuning preserve clean epistemic regimes, or does it inflate accuracy while widening the epistemic gap?
Do adapters introduce drift, override, or fusion behaviours invisible to accuracy metrics?
Does reduced precision alter collapse layers, gold windows, or harmonic structure?
Our data shows instruction tuning can widen the epistemic gap by 6.9 percentage points while appearing to improve the model (a worked example follows this list).
Has any modification degraded epistemic stability, even when surface-level accuracy remains unchanged?
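To make that 6.9-point figure concrete, here is the arithmetic an interim re-certification would run, assuming the epistemic gap is outcome accuracy minus structural correctness. The page lists all three quantities on a certificate but does not state the formula, and the numbers below are invented for illustration, not AIDA data.

```python
def epistemic_gap(outcome_accuracy, structural_correctness):
    """Assumed definition: the share of answers that are right without
    genuine structural knowledge behind them."""
    return outcome_accuracy - structural_correctness

# Invented illustration: accuracy rises after tuning, yet the gap widens.
base_gap  = epistemic_gap(outcome_accuracy=74.2, structural_correctness=71.5)  # 2.7 pp
tuned_gap = epistemic_gap(outcome_accuracy=78.1, structural_correctness=68.5)  # 9.6 pp
assert round(tuned_gap - base_gap, 1) == 6.9
```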
AIDA certificates are designed to be cryptographically signed, machine-readable, and independently verifiable. The long-term architecture includes a public certificate registry, revocation capabilities, and a real-time verification API — enabling downstream consumers, regulators, and procurement teams to confirm a model’s epistemic status at any point in its lifecycle.
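A sketch of the sign-and-verify flow such a registry could rest on, using Ed25519 via the Python `cryptography` package. The payload fields are illustrative, not AIDA's published schema.

```python
import json
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Illustrative payload; field names are not AIDA's published schema.
certificate = {
    "model": "meta-llama/Meta-Llama-3-8B",
    "corpus": "mmlu_med",
    "issued": "2026-03-10",
    "outcome_accuracy": 0.966,
}

# Canonical serialisation so signer and verifier hash identical bytes.
payload = json.dumps(certificate, sort_keys=True, separators=(",", ":")).encode()

issuer_key = Ed25519PrivateKey.generate()  # in practice, a registry-held signing key
signature = issuer_key.sign(payload)

# A verifier (regulator, auditor, procurement team) needs only the
# issuer's public key, e.g. fetched from the public certificate registry.
try:
    issuer_key.public_key().verify(signature, payload)
    print("certificate signature valid")
except InvalidSignature:
    print("certificate has been altered")
```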
Clinicians need to know which answers they can attach their name and accountability to. AIDA provides the assurance that an answer was produced through a clean, stable process.
Regulatory compliance, risk assessment, and fiduciary responsibility all require epistemic visibility into model behaviour — not just aggregate performance numbers.
Legal professionals deploying AI for research, contract analysis, or case preparation need per-query confidence that goes beyond statistical averages.
Mission-critical applications cannot tolerate epistemically hollow answers. The stability envelope must be known and certified before deployment.
The EU AI Act and NIST AI RMF presuppose measurement instruments that produce the evidence conformity assessment requires. AIDA provides those instruments.
A certified model is worth more than an uncertified one. Epistemic certification is a competitive differentiator — and, increasingly, a market requirement.
The regulatory architecture demands epistemic evidence. AIDA provides the instruments to produce it.
Discuss Certification