Six instruments, one framework. Each measures a different dimension of what a model knows, how it reasons, and where it fails.
Adaptive Inference Decision Architecture (Patent Filed 10 March 2026). The overarching assessment framework. AIDA reconstructs the layer-wise trajectory of internal representations across the full transformer depth, classifying each model–question pair into one of six epistemic regimes using dual geometric and logit views. A single model assessment produces approximately 750,000 analysis records — auditable, certificate-grade diagnostics from an operational system, not a research prototype.
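The idea of a layer-wise trajectory with dual geometric and logit views can be sketched in a few lines. The sketch below uses simulated residual-stream states; `layer_trajectory`, `classify_regime`, the answer-direction projection, and the crossing-point rule are illustrative assumptions, not AIDA's actual classifier or its six regimes.

```python
import numpy as np

def layer_trajectory(hidden_states, answer_direction):
    """hidden_states: (n_layers, d_model) residual-stream vectors for one token.
    Returns a logit view (projection onto a hypothetical answer direction)
    and a geometric view (cosine similarity between consecutive layers)."""
    logit_view = hidden_states @ answer_direction
    norms = np.linalg.norm(hidden_states, axis=1)
    geo_view = np.sum(hidden_states[:-1] * hidden_states[1:], axis=1) / (norms[:-1] * norms[1:])
    return logit_view, geo_view

def classify_regime(logit_view, threshold=0.0):
    """Toy regime rule: where in the stack the answer signal first crosses threshold."""
    above = np.where(logit_view > threshold)[0]
    if len(above) == 0:
        return "never-crystallised"
    frac = above[0] / len(logit_view)
    return "early-crystallisation" if frac < 0.5 else "late-crystallisation"

rng = np.random.default_rng(0)
states = np.cumsum(rng.normal(size=(32, 64)), axis=0)  # simulated 32-layer trajectory
direction = rng.normal(size=64)                        # stand-in answer direction
logits, geometry = layer_trajectory(states, direction)
print(classify_regime(logits))
```

In a real assessment the states would come from the model itself (e.g. per-layer hidden states at the answer token), and the classification criteria would be the framework's own, not this toy crossing rule.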
When we first opened the internal states of a training model, we had no idea what we would find. What emerged was not noise but structure — phase transitions, harmonic oscillations, convergence hierarchies, and synchronised instability events that no existing framework had predicted or described. These are some of the first images from that archaeological journey.
Want to see the model reason in real time? Watch probability bars shift layer by layer.
Try the Interactive Demo

Instruction tuning added 7.3 pp of accuracy but just 0.3 pp of structural correctness. The epistemic gap widened by 6.9 pp. Almost every additional correct answer is epistemically hollow.
0% of Differentiated Correct samples show fusion. 100% of Late Crystallisation samples do. This is not a statistical tendency — it is a categorical boundary in the manifold.
Threshold values arising from manifold geometry predict correctness with 96–97% reliability across independent benchmarks. Same value, different datasets. Discovered, not tuned.
At identical accuracy, LoRA produces predominantly fused knowledge. Full fine-tuning produces predominantly rote knowledge. A difference in kind, invisible to accuracy.
Correct Overridden trajectories prove the model possessed the right answer at intermediate layers — then suppressed it. Knowledge exists but is inaccessible through the standard path.
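One way to detect such trajectories in principle is a logit-lens-style readout: score each answer option at every layer and check whether the correct option leads somewhere in the middle of the stack but loses at the output. A minimal sketch on simulated per-layer scores; the function name and the simple leads-then-loses rule are illustrative, not AIDA's actual criterion.

```python
import numpy as np

def correct_overridden(layer_logits, answer_idx):
    """layer_logits: (n_layers, n_options) per-layer scores for each option.
    True if the correct option leads at some intermediate layer but not at
    the final layer: knowledge present, then suppressed on the way out."""
    leads = np.argmax(layer_logits, axis=1) == answer_idx
    return bool(leads[:-1].any() and not leads[-1])

# Simulated 6-layer, 4-option trajectory: option 2 leads mid-stack,
# option 0 overtakes it by the final layer.
traj = np.array([
    [0.1, 0.0, 0.0, 0.0],
    [0.0, 0.1, 0.9, 0.0],
    [0.0, 0.2, 1.1, 0.1],
    [0.5, 0.2, 0.8, 0.1],
    [0.9, 0.2, 0.4, 0.1],
    [1.2, 0.2, 0.3, 0.1],
])
print(correct_overridden(traj, answer_idx=2))  # prints True
```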
FEST shows a 20–30 pp accuracy range on the same questions depending on which options are present. Aggregate accuracy is not intrinsic to the model — it is an artefact of the test format.
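The option-sensitivity effect can be stated operationally: score the same question under different option subsets and observe the outcome flip. A toy sketch, assuming a stand-in per-option scoring function; `accuracy_spread` and the example scores are hypothetical, not the FEST protocol.

```python
from itertools import combinations

def accuracy_spread(score_option, options, answer, k=3):
    """Evaluate one question under every k-option format that includes the
    correct answer; return (worst, best) outcome across formats."""
    distractors = [o for o in options if o != answer]
    results = []
    for subset in combinations(distractors, k - 1):
        shown = [answer, *subset]
        picked = max(shown, key=score_option)  # model "chooses" highest-scored option
        results.append(picked == answer)
    return min(results), max(results)

# Hypothetical option scores: the right answer loses whenever a stronger
# distractor is on the sheet, and wins whenever it is absent.
score = {"Paris": 0.6, "Lyon": 0.7, "Berlin": 0.2, "Rome": 0.1}.get
print(accuracy_spread(score, ["Paris", "Lyon", "Berlin", "Rome"], "Paris"))  # (False, True)
```

The same question is answered correctly or incorrectly depending purely on which distractors appear, which is the sense in which aggregate accuracy is an artefact of the test format.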
The instruments produce quantitative, auditable diagnostics from the model’s own internal states. No prompting tricks, no self-report, no approximation.
Read the Paper