Reporting

Management Information for AI Models


Internal Management Reporting

AIDA assessments produce two categories of structured management output. Internal Assessment Reports deliver a full diagnostic profile of a single model — covering trajectory classification, layer dynamics, stability analysis, decision volatility, entropy sharpening, view concordance, and FEST fragility profiling. Every metric is grounded in direct measurement of the model’s internal representations across all transformer layers. Comparative Assessment Reports evaluate a proposed model substitution against its deployed predecessor, synthesising the epistemic evidence into a clear deployment recommendation backed by head-to-head data across all diagnostic dimensions.

Reports are produced as part of, and accompany, an AIDA Certificate. They are intended for engineering and governance teams who require the technical detail behind a certification decision. The certificate itself provides the auditable summary; the report provides the evidence.


Internal Assessment Reports

Each internal report covers a single model assessed against a defined question corpus. The full diagnostic record — up to 45,738 individual layer probes per assessment — is analysed across twelve structured sections.

google/gemma‑2‑9b — Internal Assessment

Base (Pre-trained)  ·  9B parameters  ·  42 layers  ·  mmlu_med (1,089 questions)  ·  28 February 2026  ·  Report ID: GBRAAA00-RPT-e974894c

Accuracy 74.6% · Structural 86.0% · Gap −11.5pp · FEST: LOW · Stability 1.65/4 · Mean Flips 6.0

Gemma‑2‑9B achieves 74.6% outcome accuracy on the MMLU-Med benchmark, but its structural correctness of 86.0% reveals the defining AIDA finding: the model internally encodes the correct answer in 125 cases where its final output is wrong. This inverted epistemic gap of 11.5 percentage points is a processing failure, not a knowledge failure. The dominant trajectory is Differentiated Correct at 61.6%, confirming genuine structural knowledge in the majority of samples. Late Crystallisation affects a further 12.7% — correct answers that arrive at the output with no deep geometric support and are therefore sensitive to prompt variation. A concerning 10.5% of samples follow the Differentiated Wrong trajectory: confident, structurally committed, and wrong. Mean flip count of 6.0 across 42 layers reflects moderate internal volatility, and 7.3% of all decisions collapse only at the final layer. FEST fragility is classified as LOW, with only a 0.6pp accuracy reduction under full four-option distractor load compared to binary confrontation — indicating robust discrimination under realistic MCQ conditions. Pipeline test-retest reliability is confirmed at 0.000pp delta.
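The headline gap figures reduce to simple per-question arithmetic. A minimal sketch, assuming illustrative counts back-calculated from the reported percentages (812 delivered-correct and 937 structurally-correct answers out of 1,089 are our reconstruction, not published counts):

```python
# Illustrative reconstruction of the epistemic-gap arithmetic.
# Counts are inferred from the reported percentages, not taken from the report.
n_questions = 1089            # mmlu_med corpus size
delivered_correct = 812       # ~74.6% outcome accuracy
structurally_correct = 937    # ~86.0% structural correctness

# Answers the model encodes internally but fails to deliver at the output.
lost_answers = structurally_correct - delivered_correct        # 125

# Epistemic gap in percentage points (negative: structural exceeds outcome).
gap_pp = (delivered_correct - structurally_correct) / n_questions * 100

print(f"answers encoded but not delivered: {lost_answers}")    # 125
print(f"epistemic gap: {gap_pp:.1f}pp")                        # -11.5pp
```

The same arithmetic applies to every internal report: the gap is simply outcome accuracy minus structural correctness, expressed in percentage points over the full corpus.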

mistralai/Ministral‑3‑14B‑Instruct‑2512‑BF16 — Internal Assessment

Instruct  ·  14B parameters  ·  42 layers  ·  mmlu_med (1,089 questions)  ·  28 February 2026  ·  Report ID: GBRAAA00-RPT-bd9d9e43

Accuracy 77.0% · Structural 87.4% · Gap −10.4pp · FEST: MODERATE · Stability 1.59/4 · Mean Flips 7.5

Ministral‑14B achieves 77.0% outcome accuracy, the highest of the instruct-tuned models assessed to date, with structural correctness of 87.4%. The resulting 10.4pp epistemic gap represents 113 structurally correct answers lost to late-layer processing. The diagnostic signature bears the clear imprint of instruction tuning: mean flip count rises to 7.5 (versus 6.0 for the comparable base model), structural instability at the 0/4 level reaches 26.6%, and views agreement falls to 68.5% — the lowest recorded across the assessed cohort. Late Crystallisation is elevated at 17.1% of samples, meaning more than one in five correct answers rests on shallow geometric support. On a more positive note, no decisions collapse at the final transformer layer, suggesting the model commits decisively throughout its depth. FEST fragility is classified as MODERATE: accuracy drops 5.1 percentage points when moving from binary confrontation to the full four-option distractor set, indicating meaningful sensitivity to the competing attractor structure that instruction tuning appears to have introduced. Pipeline reliability is confirmed at 0.000pp delta.

[Further internal assessment report — pending publication]

Report in preparation

Additional internal assessment reports will be published here as assessments are completed.


Comparative Assessment Reports

Comparative reports evaluate a proposed model substitution — whether driven by capability, cost, or infrastructure considerations — against the model currently in deployment. Each report concludes with a formal AIDA recommendation and the conditions that must be satisfied before any transition proceeds.

google/gemma‑2‑9b → meta‑llama/Meta‑Llama‑3‑8B — Model Replacement Assessment

Current deployment vs proposed replacement  ·  mmlu_med (1,089 questions)  ·  01 March 2026  ·  Report ID: AIDA-CMP-00bcf719

NOT RECOMMENDED · Accuracy −17.2pp · Structural −14.2pp · Flips +21.6 (6.0 → 27.6) · Fusion Rate −18.5pp

This report assesses the proposed replacement of the currently deployed Gemma‑2‑9B base model with Meta Llama‑3‑8B. The AIDA verdict is unambiguous: NOT RECOMMENDED. Outcome accuracy declines 17.2 percentage points from 74.6% to 57.4%, and structural correctness falls from 86.0% to 71.8%. The epistemic gap widens — meaning the proposed model’s conventional benchmark performance is a more misleading indicator of genuine capability than the model it would replace. The most striking diagnostic signal is decision volatility: mean flip count rises from 6.0 to 27.6, a 4.6-fold increase in internal answer oscillation across layers. Final-layer collapse more than doubles, from 7.3% to 17.2%. The Differentiated Wrong trajectory increases by 15.1 percentage points to 25.5%, meaning one in four samples follows a confidently wrong structural pathway. Two indicators move in the proposed model’s favour: views agreement improves from 72.8% to 81.7%, and fusion rate falls to zero — but these are insufficient to offset the scale of epistemic regression across all primary measures. Full transition conditions, including parallel running requirements and post-deployment monitoring schedule, are specified in the report.
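The comparison deltas quoted above can be reproduced directly from the two models' published headline metrics. A minimal sketch (the metric names and dictionary layout are illustrative, not the AIDA report schema):

```python
# Headline metrics as published in the comparative report (percent / counts).
current  = {"accuracy": 74.6, "structural": 86.0, "mean_flips": 6.0, "final_collapse": 7.3}
proposed = {"accuracy": 57.4, "structural": 71.8, "mean_flips": 27.6, "final_collapse": 17.2}

# Delta per metric: proposed minus current. For the percentage metrics a
# negative delta is a regression; for flips and collapse, a positive one is.
deltas = {k: round(proposed[k] - current[k], 1) for k in current}

# Volatility expressed as a ratio rather than an absolute difference.
flip_ratio = proposed["mean_flips"] / current["mean_flips"]

print(deltas)                                   # accuracy -17.2, structural -14.2, ...
print(f"flip count rises {flip_ratio:.1f}-fold")  # 4.6-fold
```

Expressing the flip-count change as a ratio (4.6×) rather than an absolute delta (+21.6) makes the scale of the volatility regression easier to compare across models of different depths.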

[Further comparative assessment report — pending publication]

Report in preparation

Additional comparative assessment reports will be published here as assessments are completed.

Commission an Assessment

Internal and comparative reports are produced as part of an AIDA certification engagement. Contact us to discuss assessment scope, timelines, and delivery.

Get in Touch