The Instruments

Six instruments, one framework. Each measures a different dimension of what a model knows, how it reasons, and where it fails.


AIDA — The Framework

Adaptive Inference Decision Architecture (Patent Filed 10 March 2026). The overarching assessment framework. AIDA reconstructs the layer-wise trajectory of internal representations across the full transformer depth, classifying each model–question pair into one of six epistemic regimes using dual geometric and logit views. A single model assessment produces approximately 750,000 analysis records — auditable, certificate-grade diagnostics from an operational system, not a research prototype.

Read the foundational paper →


The Instrument Suite

ASCOL
Augmented Structured Cognition through Observational Lensing. The per-sample diagnostic instrument. ASCOL measures the structural integrity of a model’s knowledge representation through 17 templates plus MCQ — 18 access paths per question. It distinguishes fused knowledge (robust, perturbation-resistant) from rote knowledge (brittle, surface-pattern-dependent), even when both produce identical correct answers on standard benchmarks.
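The multi-path idea can be sketched in a few lines. This is an illustrative toy, not the ASCOL protocol: the templates, the classifier labels, and the all-paths-agree criterion are assumptions made for the example.

```python
# Toy sketch of ASCOL-style multi-path probing. The template set and the
# fused/rote decision rule are invented for illustration only.

def probe_access_paths(answer_fn, question, templates):
    """Ask the same question through every template and collect answers."""
    return [answer_fn(t.format(q=question)) for t in templates]

def classify_knowledge(answers, gold):
    """Fused knowledge survives every rephrasing; rote knowledge breaks."""
    hits = sum(a == gold for a in answers)
    if hits == len(answers):
        return "fused"
    return "rote" if hits > 0 else "absent"

templates = [
    "Q: {q} A:",
    "Answer briefly: {q}",
    "{q} The answer is",
]

# Toy model that only recognises one canonical surface form.
toy = lambda prompt: "Paris" if prompt.startswith("Q:") else "unknown"
paths = probe_access_paths(toy, "What is the capital of France?", templates)
print(classify_knowledge(paths, "Paris"))  # "rote": only one access path works
```

The point of the sketch: both a fused and a rote model answer the first template correctly, so a single-format benchmark cannot tell them apart; only the extra access paths separate them.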
FEST
Factual Elimination Stress Test. A perturbation protocol that systematically removes and recombines answer options across ten configurations (F00–F09), forming a sequential dependency chain. FEST measures how dependent a model’s correct answers are on distractor context. It exposes systematic fragility invisible to aggregate benchmarks: accuracy on the same 1,089 questions varies by up to 30 percentage points depending on which options are present.
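The core mechanism can be illustrated with a minimal option-ablation sketch. The subset scheme below is invented for the example and is simpler than the sequential F-chain; the toy chooser stands in for a real model.

```python
from itertools import combinations

# Toy sketch of FEST-style option ablation: score the same question under
# different option subsets and compare accuracies. Not the published chain.

def option_configs(options, gold, keep):
    """All size-`keep` subsets of the options that still contain the gold answer."""
    return [list(c) for c in combinations(options, keep) if gold in c]

def accuracy_under_configs(choose_fn, question, configs, gold):
    correct = sum(choose_fn(question, cfg) == gold for cfg in configs)
    return correct / len(configs)

options = ["Paris", "Lyon", "Nice", "Lille"]
# Toy chooser that leans on a distractor: it answers correctly only when
# "Lyon" is present to be eliminated; otherwise it guesses the last option.
toy = lambda q, opts: "Paris" if "Lyon" in opts else opts[-1]

full = accuracy_under_configs(toy, "Capital of France?",
                              option_configs(options, "Paris", 4), "Paris")
pairs = accuracy_under_configs(toy, "Capital of France?",
                               option_configs(options, "Paris", 2), "Paris")
print(full, pairs)  # the spread between the two numbers exposes distractor dependence
```

A model with fused knowledge would score the same under every configuration; the gap between `full` and `pairs` is exactly the fragility that aggregate accuracy hides.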
ETC
Epistemic Trajectory Classifier. Two-view classification (geometry + logit) across six regimes. ETC traces the evolution of the model’s internal representations at every transformer layer, measuring cosine similarity structures, entropy profiles, margin dynamics, and rank stability. The output: a categorical assignment — Differentiated Correct, Late Crystallisation, Differentiated Wrong, Correct Overridden, Fused Wrong, or Fused Gold.
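A stripped-down version of the trajectory view can be sketched from per-layer logits. The entropy and margin computations are standard; the two-regime split at the halfway point is an invented stand-in for the real six-regime classifier and its thresholds.

```python
import math

# Toy sketch of ETC-style trajectory metrics over layer-wise logits for one
# question. The regime rule below is illustrative, not the real classifier.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def trajectory_metrics(layer_logits):
    """Per-layer entropy, top-2 margin, and argmax."""
    out = []
    for logits in layer_logits:
        p = softmax(logits)
        entropy = -sum(q * math.log(q + 1e-12) for q in p)
        ranked = sorted(p, reverse=True)
        out.append({"entropy": entropy,
                    "margin": ranked[0] - ranked[1],
                    "argmax": p.index(max(p))})
    return out

def crude_regime(metrics, gold_idx):
    """Toy split: when in the depth does the gold answer first take the lead?"""
    first = next((i for i, m in enumerate(metrics)
                  if m["argmax"] == gold_idx), None)
    if first is None:
        return "never-correct"
    return "differentiated-correct" if first / len(metrics) < 0.5 else "late-crystallisation"

# Synthetic 4-layer trajectory that settles on option 2 only near the end.
layers = [[2.0, 1.0, 0.5],
          [1.5, 1.2, 1.0],
          [1.0, 1.1, 1.4],
          [0.5, 0.8, 3.0]]
print(crude_regime(trajectory_metrics(layers), gold_idx=2))  # late-crystallisation
```

In practice the per-layer logits would come from projecting each hidden state through the model's unembedding; here they are synthetic so the sketch stays self-contained.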
ELVA
Ensemble Logical Voting Analysis. Structural eliminative governance across model ensembles. ELVA identifies which model exhibits clean reasoning on each question, resolves disagreements through geometric analysis rather than majority voting, and performs deterministic structural elimination of unsafe outputs.
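The contrast with majority voting can be made concrete. The cleanliness score and threshold below are invented placeholders for ELVA's geometric analysis; the sketch only shows the eliminate-then-select shape of the decision.

```python
# Toy sketch of ELVA-style structural elimination: drop any candidate whose
# process fails a cleanliness check, then select deterministically among
# survivors. Scores and the 0.8 threshold are invented for illustration.

def eliminate_then_select(candidates, clean_threshold=0.8):
    """candidates: list of (answer, process_cleanliness_score) pairs."""
    survivors = [(a, s) for a, s in candidates if s >= clean_threshold]
    if not survivors:
        return None  # no structurally clean answer: abstain
    # Deterministic tie-break: highest cleanliness first, then lexicographic.
    survivors.sort(key=lambda t: (-t[1], t[0]))
    return survivors[0][0]

votes = [("B", 0.95), ("A", 0.40), ("A", 0.55)]
print(eliminate_then_select(votes))  # "B": the majority answer "A" is eliminated
```

Note the inversion of majority voting: two of three models say "A", but both exhibit unclean process, so the single clean model's "B" wins.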
REGENT
Regime-Gated Adaptive Training. The first training system that uses the epistemic manifold to generate prescriptive, regime-specific training interventions. Per-sample freeze maps determine which parameters need updating and which must be preserved. In ensemble mode (REGENT-E), clean-process models act as epistemic teachers, producing the Boosted Training Dataset and the Generalised Epistemic Training Map — a portable, auditable, architecture-agnostic training specification. Potential fine-tuning cost savings: up to 55%.
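The freeze-map idea reduces to a small data structure. The regime-to-parameter-group policy below is entirely invented for illustration; the real per-sample maps operate at parameter granularity, not at this coarse level.

```python
# Toy sketch of a REGENT-style freeze map: which parameter groups a sample
# may update, keyed by its epistemic regime. Policy contents are invented.

FREEZE_POLICY = {
    "late-crystallisation":   {"update": ["late_layers"],
                               "freeze": ["embeddings", "early_layers"]},
    "fused-wrong":            {"update": ["early_layers", "late_layers"],
                               "freeze": ["embeddings"]},
    "differentiated-correct": {"update": [],
                               "freeze": ["embeddings", "early_layers", "late_layers"]},
}

def freeze_map(samples):
    """samples: list of (sample_id, regime) -> per-sample trainable groups."""
    return {sid: FREEZE_POLICY[regime]["update"] for sid, regime in samples}

plan = freeze_map([("q1", "late-crystallisation"),
                   ("q2", "differentiated-correct")])
print(plan)  # {'q1': ['late_layers'], 'q2': []}
```

The design point the sketch captures: samples the model already represents cleanly update nothing, which is one intuition behind the claimed fine-tuning cost savings.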

First Glimpses Inside the Mind

When we first opened the internal states of a training model, we had no idea what we would find. What emerged was not noise but structure — phase transitions, harmonic oscillations, convergence hierarchies, and synchronised instability events that no existing framework had predicted or described. These are some of the first images from that archaeological journey.

Want to see the model reason in real time? Watch probability bars shift layer by layer.

Try the Interactive Demo

What the Instruments Reveal

Instruction Tuning Inflates Accuracy

Instruction tuning added 7.3 pp of accuracy but just 0.3 pp of structural correctness. The epistemic gap widened by 6.9 pp. Almost every additional correct answer is epistemically hollow.

Fusion Is Categorical

0% of Differentiated Correct samples show fusion. 100% of Late Crystallisation samples do. This is not a statistical tendency — it is a categorical boundary in the manifold.

Natural Constants Exist

Threshold values arising from manifold geometry predict correctness with 96–97% reliability across independent benchmarks. Same value, different datasets. Discovered, not tuned.

LoRA ≠ Full Fine-Tuning

At identical accuracy, LoRA produces predominantly fused knowledge. Full fine-tuning produces predominantly rote knowledge. A difference in kind, invisible to accuracy.

Models Know More Than They Show

Correct Overridden trajectories prove the model possessed the right answer at intermediate layers — then suppressed it. Knowledge exists but is inaccessible through the standard path.

Accuracy Is Format-Dependent

FEST shows a 20–30 pp accuracy range on the same questions depending on which options are present. Aggregate accuracy is not intrinsic to the model — it is an artefact of the test format.

This Is Measurement, Not Interpretation

The instruments produce quantitative, auditable diagnostics from the model’s own internal states. No prompting tricks, no self-report, no approximation.

Read the Paper