Publications

Papers

Why Correctness Is Not Cognition: From Benchmark Accuracy to Autonomous Epistemic Governance of Large Language Models

Tim Hayes · AIDA Research · March 2026 · arXiv preprint

Why Correctness Is Not Cognition â€” Tim Hayes, AIDA Research, March 2026

Hayes, T. · AIDA Research
March 2026

Large language models in safety-critical settings are evaluated almost exclusively by outcome accuracy. This paper argues that correctness is a necessary but radically insufficient measure: a correct answer may arise from genuine structural knowledge or from a brittle internal process that happened, by chance, to land on the right token. The two are indistinguishable at Level A1. This paper introduces the instruments to tell them apart.

Three instruments — ETC, ASCOL, and FEST — decompose aggregate accuracy into six epistemic regimes with distinct geometric signatures. The most important discovery underlying them is that the probability dynamics observed through the logit lens are not amplification events but rotational realignments in high-dimensional representational space. The proof is direct: norm ratios move by 0.002 while answer probabilities shift by 56 percentage points. Nothing was amplified. The analytical camera moved. This finding reframes what the field has been measuring with the logit lens for years.

A carrier–content decomposition follows from this: every transformer’s output is the product of a position-dependent carrier signal (derivable from frozen weights, zero data required) and a content signal representing actual knowledge. Accuracy differentials exceeding 20 percentage points between answer positions arise from this carrier — not from what the model knows. Inference-time correction exploiting the decomposition raises Llama-3-8B accuracy from 67.9% to 71.1% on 1,089 medical licensing questions without any modification to model parameters.

FEST analysis of 274 failures establishes that 260 (94.9%) are architecturally recoverable. Only 14 samples (1.29%) represent genuine knowledge gaps. The model’s true knowledge ceiling is 98.71%. The gap between the 67.9% baseline and that ceiling is not a measure of what the model does not know. It is a measure of how severely the inference architecture prevents the model’s knowledge from reaching the output.

The framework is validated across nine models (7B–14B parameters) from six suppliers, producing 3,246,256 attention probes and 6,411,576 layer probes across 283,289 inferences — production instrumentation, not a research prototype.

Show PDF CC BY‑NC 4.0

Licence

Papers in this section are published under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) licence.

You are free to share (copy and redistribute in any medium or format) and adapt (remix, transform, and build upon) this material, provided you give appropriate attribution, include a link to the licence, and indicate if changes were made. Commercial use is not permitted without prior written permission from AIDA Research.

Cite as:
Hayes, T. (2026). Why Correctness Is Not Cognition: From Benchmark Accuracy to Autonomous Epistemic Governance of Large Language Models. AIDA Research. arXiv preprint. CC BY-NC 4.0

Forthcoming

Cross-Vendor Epistemic Comparison: Ministral-14B vs. Gemma-2-9B

Tim Hayes · AIDA Research · 2026 · In preparation

A three-way comparative analysis across base models, instruct variants, and vendor architectures, demonstrating that the epistemic manifold’s structure is invariant across model families. Both models share 42-layer transformer architectures, enabling direct layer-by-layer comparison.

REGENT: Regime-Gated Adaptive Training for Epistemic Improvement

Tim Hayes · AIDA Research · 2026 · In preparation

The first training system that uses the epistemic manifold to generate prescriptive, regime-specific interventions. Demonstrates up to 55% fine-tuning cost reduction through per-sample freeze maps and ensemble-derived epistemic teaching.

Assessment Reports

AIDA produces auditable internal assessment reports for each model evaluated. Each report is generated from approximately 750,000 analysis records and includes trajectory classification, stability analysis, FEST fragility profiles, and certification-grade diagnostics.