The measurement layer that AI governance now legally requires.
AI is no longer an experimental technology operating at the margins of policy. It now sits squarely — and uncomfortably — at the centre of global legislative attention. Regulators have drawn a line.
In August 2025, the General Purpose AI obligations under the EU Artificial Intelligence Act became enforceable. Any provider placing a GPAI model on the market must now produce technical documentation, demonstrate model evaluation, and conduct adversarial testing. They must disclose capabilities and limitations with a degree of clarity that has never previously been required. Models judged to pose systemic risk are subject to even stricter obligations. The penalty regime — fines reaching three percent of global annual turnover — is active. These are not aspirational standards. They are binding mandates.
The General Purpose AI Code of Practice, published in July 2025, sets out the operational framework for compliance. It demands state-of-the-art evaluation, including adversarial testing and systematic assessment of model behaviour. In parallel, the NIST AI Risk Management Framework calls for quantitative evidence of validity, reliability, robustness, and fairness. Both frameworks assume the existence of measurement instruments capable of producing the evidence they require. Those instruments do not yet exist.
Aggregate accuracy on benchmark suites cannot satisfy Article 15’s robustness requirements. Confidence calibration does not meet Article 9’s risk management obligations. Standard documentation practices cannot report what has not been measured. The regulatory architecture now demands a level of epistemic visibility that current evaluation methods cannot provide.
AI must become measurable, governable, and certifiable. This marks the beginning of a multi-decade transformation in how digital systems are built, evaluated, and trusted.
AIDA Research provides the scientific foundation for that transformation. Our Adaptive Inference Decision Architecture — AIDA — fills a structural void at the heart of AI governance. Its three convergent instruments, each reading per-sample and per-model epistemic evidence, create the measurement ecosystem that modern regulation presupposes.
This is not an incremental improvement to existing evaluation practice. It is the missing measurement layer upon which the next era of AI governance must be built.
An AIDA certificate does not say “this model scored 80%”. It says precisely which answers arise from genuine structural knowledge, which are brittle, and which are epistemically hollow — with measured evidence for every claim.
Every model–question pair is classified into one of six epistemic regimes. The certificate identifies exactly which answers can be trusted and which cannot — not as a statistical estimate, but as a measured geometric fact.
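As a concrete sketch, a per-sample record on such a certificate might look like the following. The field names and the integer regime coding are placeholders of ours, since this page describes the six regimes only qualitatively (structural knowledge, brittle, hollow).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SampleAssessment:
    """One model-question pair on an AIDA-style certificate.

    Field names are illustrative placeholders: the page states only
    that each pair falls into one of six epistemic regimes and that
    the certificate says which answers can be trusted.
    """
    question_id: str
    predicted: str
    correct: bool   # outcome-accuracy contribution
    regime: int     # 1..6; the page does not name the six regimes
    trusted: bool   # derived from the regime, not from a confidence score

def trusted_share(records: list[SampleAssessment]) -> float:
    """Fraction of answers the certificate marks as trustworthy."""
    return sum(r.trusted for r in records) / len(records)
```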
FEST systematically stress-tests each answer across nine configurations. The certificate reports the model’s accuracy range, distractor dependence, and binary advantage — exposing format-dependent performance invisible to aggregate benchmarks.
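A minimal sketch of how such a sweep could be aggregated. The harness hook `evaluate` and the configuration names, including a `binary` two-option variant and a `standard` baseline, are assumptions; the page does not specify how FEST's nine configurations are constructed.

```python
from statistics import mean

def fest_summary(evaluate, configurations, questions):
    """Aggregate a FEST-style sweep over format configurations.

    evaluate(config, questions) -> list[bool] is a hypothetical harness
    hook returning per-question correctness under one configuration.
    """
    per_config = {cfg: mean(evaluate(cfg, questions)) for cfg in configurations}
    accs = list(per_config.values())
    report = {
        "per_config_accuracy": per_config,
        "accuracy_range": (min(accs), max(accs)),  # spread = format sensitivity
    }
    # Binary advantage: accuracy gained when the question is reduced to a
    # two-option form, if the sweep includes such a variant.
    if "binary" in per_config and "standard" in per_config:
        report["binary_advantage"] = per_config["binary"] - per_config["standard"]
    return report
```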
Layer-by-layer trajectory analysis measures decision depth, flip count, entropy sharpening, and view concordance. The certificate defines the model’s stability envelope — the conditions under which it reasons cleanly and where it fails.
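Two of the named quantities, flip count and decision depth, can be reconstructed from per-layer readouts, for instance a logit-lens pass over the answer options. The definitions below are plausible reconstructions of ours, not AIDA's published ones.

```python
import math

def trajectory_metrics(layer_probs):
    """layer_probs: one probability vector over answer options per layer.

    Flip count: how often the leading answer changes across layers.
    Decision depth: first layer from which the final answer stays fixed.
    Entropy sharpening: how much answer entropy falls from first to last
    layer. All three definitions are assumed, not taken from AIDA.
    """
    leaders = [max(range(len(p)), key=p.__getitem__) for p in layer_probs]
    flip_count = sum(a != b for a, b in zip(leaders, leaders[1:]))

    decision_depth = 0  # 0 means the final answer leads from the first layer
    for i, leader in enumerate(leaders):
        if leader != leaders[-1]:
            decision_depth = i + 1  # first layer after the last disagreement

    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0)

    sharpening = entropy(layer_probs[0]) - entropy(layer_probs[-1])
    return {"flip_count": flip_count,
            "decision_depth": decision_depth,
            "entropy_sharpening": sharpening}
```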
Certification thresholds are not chosen. They are discovered.
The AIDA framework has identified threshold values that arise directly from the geometry of ensemble correlation structures. These natural constants are not hyperparameters, heuristics, or tuneable knobs. They are structural features of the epistemic manifold — invariant across model families, datasets, and domains.
The entropy constant yields 96.6% accuracy on MMLU-Med and 96.5% on MedQA: agreement to within 0.1 percentage points across two independent medical benchmarks with different sample sizes. On MMLU-Pro, the same threshold yields 92.9%, a systematic shift attributable to option-count effects. These are not fitted results. They are emergent geometric properties.
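In operational terms, the constant acts as a certification gate: samples whose answer-distribution entropy falls below it are accepted, and accuracy is read off the accepted set. A minimal sketch follows; the value of `TAU` is a placeholder, as the page does not publish the constant itself.

```python
import math

TAU = 0.5  # placeholder; the actual entropy constant is not given here

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

def gated_accuracy(samples, tau=TAU):
    """samples: list of (answer_probs, is_correct) pairs.

    Returns coverage and accuracy on the low-entropy subset, the
    quantity in which figures such as 96.6% on MMLU-Med are stated.
    """
    accepted = [ok for probs, ok in samples if entropy(probs) < tau]
    coverage = len(accepted) / len(samples)
    accuracy = sum(accepted) / len(accepted) if accepted else float("nan")
    return coverage, accuracy
```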
AIDA certification follows the precedent of established assurance bodies: Lloyd’s Register for maritime, Bureau Veritas for industrial safety, ICANN for internet governance. Independent, technically rigorous, commercially neutral.
Full epistemic assessment of a model against a defined question corpus. The certificate, based on up to 858,000 analysis records per assessment, reports outcome accuracy, structural correctness, epistemic gap, trajectory distribution, stability profile, and FEST fragility classification.
After any retraining, fine-tuning, LoRA adaptation, or quantisation, the model’s epistemic profile may have changed — even if surface accuracy is preserved. Interim certification re-assesses the modified model to determine whether clean epistemic regimes have been maintained or degraded.
Training and evaluation datasets can themselves be assessed for epistemic quality: distractor strength distributions, question difficulty profiles, and susceptibility to format-dependent performance artefacts.
AIDA is the first framework capable of epistemic certification across every stage of the model lifecycle.
Does the pre-trained model encode genuine structural knowledge, or only rote associations?
Does fine-tuning preserve clean epistemic regimes, or does it inflate accuracy while widening the epistemic gap?
Do adapters introduce drift, override, or fusion behaviours invisible to accuracy metrics?
Does reduced precision alter collapse layers, gold windows, or harmonic structure?
Our data shows instruction tuning can widen the epistemic gap by 6.9 percentage points while appearing to improve the model (a worked example follows this list).
Has any modification degraded epistemic stability, even when surface-level accuracy remains unchanged?
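To make that 6.9-point figure concrete, here is the arithmetic an interim re-certification would run, assuming the epistemic gap is outcome accuracy minus structural correctness. The page lists all three quantities on a certificate but does not state the formula, and the numbers below are invented for illustration, not AIDA data.

```python
def epistemic_gap(outcome_accuracy, structural_correctness):
    """Assumed definition: the share of answers that are right without
    genuine structural knowledge behind them."""
    return outcome_accuracy - structural_correctness

# Invented illustration: accuracy rises after tuning, yet the gap widens.
base_gap  = epistemic_gap(outcome_accuracy=74.2, structural_correctness=71.5)  # 2.7 pp
tuned_gap = epistemic_gap(outcome_accuracy=78.1, structural_correctness=68.5)  # 9.6 pp
assert round(tuned_gap - base_gap, 1) == 6.9
```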
AIDA certificates are designed to be cryptographically signed, machine-readable, and independently verifiable. The long-term architecture includes a public certificate registry, revocation capabilities, and a real-time verification API — enabling downstream consumers, regulators, and procurement teams to confirm a model’s epistemic status at any point in its lifecycle.
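A sketch of the sign-and-verify flow such a registry could rest on, using Ed25519 via the Python `cryptography` package. The payload fields are illustrative, not AIDA's published schema.

```python
import json
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Illustrative payload; field names are not AIDA's published schema.
certificate = {
    "model": "meta-llama/Meta-Llama-3-8B",
    "corpus": "mmlu_med",
    "issued": "2026-03-10",
    "outcome_accuracy": 0.966,
}

# Canonical serialisation so signer and verifier hash identical bytes.
payload = json.dumps(certificate, sort_keys=True, separators=(",", ":")).encode()

issuer_key = Ed25519PrivateKey.generate()  # in practice, a registry-held signing key
signature = issuer_key.sign(payload)

# A verifier (regulator, auditor, procurement team) needs only the
# issuer's public key, e.g. fetched from the public certificate registry.
try:
    issuer_key.public_key().verify(signature, payload)
    print("certificate signature valid")
except InvalidSignature:
    print("certificate has been altered")
```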
Clinicians need to know which answers they can attach their name and accountability to. AIDA provides the assurance that an answer was produced through a clean, stable process.
Regulatory compliance, risk assessment, and fiduciary responsibility all require epistemic visibility into model behaviour — not just aggregate performance numbers.
Legal professionals deploying AI for research, contract analysis, or case preparation need per-query confidence that goes beyond statistical averages.
Mission-critical applications cannot tolerate epistemically hollow answers. The stability envelope must be known and certified before deployment.
The EU AI Act and NIST AI RMF presuppose measurement instruments that produce the evidence conformity assessment requires. AIDA provides those instruments.
A certified model is worth more than an uncertified one. Epistemic certification is a competitive differentiator — and, increasingly, a market requirement.
The regulatory architecture demands epistemic evidence. AIDA provides the instruments to produce it.
Discuss Certification