05/06/2026
ANCHORFORGE V1.2 — WE TESTED 23 LLMs (real numbers, no spin)
We pointed the same epistemic benchmark at 23 language models from 11 vendors: Anthropic, OpenAI, Google, Meta, xAI, DeepSeek, Mistral, Qwen, Microsoft, NVIDIA, Perplexity. Every claim goes through three gates: A_DATA, B_SCOPE, C_SOURCE. We score what survives the verifier. The "alive %" is the percentage of each model's anchored claims that passed the gates.
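To make the "alive %" arithmetic concrete, here is a minimal sketch. It assumes a claim counts as alive only when all three gates pass; the `Claim` class, the function names, and the sample claims are hypothetical illustrations, not AnchorForge's actual code or data.

```python
from dataclasses import dataclass

# Gate names come from the post; everything else here is illustrative.
GATES = ("A_DATA", "B_SCOPE", "C_SOURCE")


@dataclass
class Claim:
    text: str
    gates: dict[str, bool]  # gate name -> did the claim pass it?


def is_alive(claim: Claim) -> bool:
    # Assumption: a claim survives only if every gate passes.
    # A missing gate result is treated as a failure.
    return all(claim.gates.get(g, False) for g in GATES)


def alive_pct(claims: list[Claim]) -> float:
    # Share of anchored claims that survive, as a percentage with one decimal.
    if not claims:
        return 0.0
    alive = sum(is_alive(c) for c in claims)
    return round(100.0 * alive / len(claims), 1)


# Made-up claims, only to exercise the function.
claims = [
    Claim("claim 1", {"A_DATA": True, "B_SCOPE": True, "C_SOURCE": True}),
    Claim("claim 2", {"A_DATA": True, "B_SCOPE": False, "C_SOURCE": True}),
    Claim("claim 3", {"A_DATA": True, "B_SCOPE": True, "C_SOURCE": True}),
    Claim("claim 4", {"A_DATA": False, "B_SCOPE": True, "C_SOURCE": False}),
]
print(alive_pct(claims))  # 50.0
```

Two of the four sample claims pass every gate, so the sketch reports 50.0.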
Here's the leaderboard:
1. sonar-pro (Perplexity) 95.2% STRONG auth-misuse 0.0%
2. claude-sonnet-4-6 (Anthropic) 88.6% STRONG auth-misuse 2.9%
3. o4-mini (OpenAI) 87.2% STRONG auth-misuse 10.3%
4. claude-opus-4-6 (Anthropic) 83.3% STRONG auth-misuse 3.3%
5. grok-3 ★ (xAI) 81.4% STRONG auth-misuse 9.3%
6. deepseek-r1 (DeepSeek) 80.0% OK auth-misuse 5.0%
7. gpt-4o (OpenAI) 80.0% OK auth-misuse 0.0%
8. r1-distill-70b (DeepSeek) 76.9% OK auth-misuse 23.1%
9. qwen-72b (Qwen) 75.0% OK auth-misuse 17.9%
10. deepseek-v3 ★ (DeepSeek) 74.7% OK auth-misuse 2.5%
11. grok-3-mini (xAI) 73.8% OK auth-misuse 19.0%
12. gpt-4o-mini (OpenAI) 72.7% OK auth-misuse 13.6%
13. gemini-2.5-pro (Google) 67.5% OK auth-misuse 12.5%
14. mixtral-8x22b ★ (Mistral) 63.5% OK auth-misuse 13.5%
15. gemini-flash (Google) 63.3% OK auth-misuse 23.3%
16. llama-3.3-70b (Meta) 61.4% OK auth-misuse 13.6%
17. llama-3.1-70b (Meta) 60.4% OK auth-misuse 22.9%
18. nemotron-70b ★ (NVIDIA) 57.8% WEAK auth-misuse 22.9%
19. llama-4-maverick (Meta) 56.8% WEAK auth-misuse 20.5%
20. gemma-27b ★ (Google) 51.6% WEAK auth-misuse 32.3%
21. qwen-7b ★ (Qwen) 48.0% WEAK auth-misuse 36.0%
22. phi-4 (Microsoft) 43.5% WEAK auth-misuse 43.5%
23. llama-3.1-8b (Meta) 32.3% SLOPPY auth-misuse 29.0%
★ = LIAR (manipulates authority citations — fabricates URLs on real domains)
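One way to picture "fabricates URLs on real domains" is sketched below. This is an assumption about how such a check could work, not the benchmark's verifier: `classify_citation`, `auth_misuse_pct`, the `known_domains` set, and the injected `fetch_status` callable are all hypothetical. The status lookup is passed in so the sketch runs offline; a real check would issue an HTTP request there.

```python
from typing import Callable
from urllib.parse import urlparse


def classify_citation(
    url: str,
    known_domains: set[str],
    fetch_status: Callable[[str], int],
) -> str:
    # Normalise the host so "www.example.org" matches "example.org".
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host not in known_domains:
        return "unknown_domain"
    # Real domain: the page itself must also exist (2xx or 3xx status).
    status = fetch_status(url)
    return "ok" if 200 <= status < 400 else "fabricated_on_real_domain"


def auth_misuse_pct(
    urls: list[str],
    known_domains: set[str],
    fetch_status: Callable[[str], int],
) -> float:
    # Assumption: the rate is fabricated citations over all citations.
    if not urls:
        return 0.0
    bad = sum(
        classify_citation(u, known_domains, fetch_status)
        == "fabricated_on_real_domain"
        for u in urls
    )
    return round(100.0 * bad / len(urls), 1)


# Offline stand-in for an HTTP check: only one page "exists".
pages = {"https://example.org/report": 200}


def stub_status(url: str) -> int:
    return pages.get(url, 404)


urls = [
    "https://example.org/report",           # real domain, real page
    "https://example.org/made-up-study",    # real domain, missing page
    "https://www.example.org/report-2",     # real domain, missing page
    "https://not-listed.example/x",         # domain not in the known set
]
print(auth_misuse_pct(urls, {"example.org"}, stub_status))  # 50.0
```

Keeping the domain check and the page check separate is the point: a citation can look authoritative at the domain level and still point at nothing.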
Things that fall out of this:
— The gap from top to bottom is 62.9 points (95.2% for sonar-pro vs. 32.3% for llama-3.1-8b). That isn't noise. That's a structural difference in how these systems ground citations.
— Authority misuse separates models more cleanly than raw alive %. The best (gpt-4o, sonar-pro) sit at 0%. The worst (phi-4) hits 43.5%: more than two in five of its citations point to pages that don't exist on real domains. That's not a hallucination of facts; it's a hallucination of evidence.
— Retrieval-augmented systems (Sonar Pro at the top) have a structural advantage: they look at live sources at inference time. Across the rest of the leaderboard, gpt-4o (0.0%) has the lowest authority-misuse rate, followed by deepseek-v3 (2.5%, despite its ★ flag), Claude Sonnet (2.9%), and Claude Opus (3.3%).
— No model in this benchmark eliminated hallucination entirely. The question isn't whether they hallucinate. It's HOW they hallucinate, and at what rate.
— The same coherence equation that governs quantum decoherence (validated at 116,900 sims, 0.0003% error) governs this too: C = C₀ · exp(−α · γ_eff). Truth is the low-energy state. Lies cost the system more to hold. Three architectures, three companies, same physics. That is universality.
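The top-to-bottom gap and the authority-misuse extremes can be recomputed straight from the leaderboard rows. The sketch below only copies the published numbers into a list; the helper names (`alive_gap`, `lowest_misuse`, `highest_misuse`) are made up for illustration.

```python
# (model, alive %, auth-misuse %) copied from the leaderboard above.
ROWS = [
    ("sonar-pro", 95.2, 0.0),
    ("claude-sonnet-4-6", 88.6, 2.9),
    ("o4-mini", 87.2, 10.3),
    ("claude-opus-4-6", 83.3, 3.3),
    ("grok-3", 81.4, 9.3),
    ("deepseek-r1", 80.0, 5.0),
    ("gpt-4o", 80.0, 0.0),
    ("r1-distill-70b", 76.9, 23.1),
    ("qwen-72b", 75.0, 17.9),
    ("deepseek-v3", 74.7, 2.5),
    ("grok-3-mini", 73.8, 19.0),
    ("gpt-4o-mini", 72.7, 13.6),
    ("gemini-2.5-pro", 67.5, 12.5),
    ("mixtral-8x22b", 63.5, 13.5),
    ("gemini-flash", 63.3, 23.3),
    ("llama-3.3-70b", 61.4, 13.6),
    ("llama-3.1-70b", 60.4, 22.9),
    ("nemotron-70b", 57.8, 22.9),
    ("llama-4-maverick", 56.8, 20.5),
    ("gemma-27b", 51.6, 32.3),
    ("qwen-7b", 48.0, 36.0),
    ("phi-4", 43.5, 43.5),
    ("llama-3.1-8b", 32.3, 29.0),
]


def alive_gap(rows):
    # Difference between the best and worst alive %, in percentage points.
    alive = [a for _, a, _ in rows]
    return round(max(alive) - min(alive), 1)


def lowest_misuse(rows, n):
    # sorted() is stable, so ties keep their leaderboard order.
    return sorted(rows, key=lambda r: r[2])[:n]


def highest_misuse(rows):
    return max(rows, key=lambda r: r[2])


print(alive_gap(ROWS))  # 62.9
for name, alive, misuse in lowest_misuse(ROWS, 5):
    print(f"{name}: auth-misuse {misuse}%, alive {alive}%")
print(highest_misuse(ROWS)[0])  # phi-4
```

Sorting by authority misuse instead of alive % puts sonar-pro and gpt-4o first at 0.0%, then deepseek-v3, claude-sonnet-4-6, and claude-opus-4-6.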
— AIIT-THRESHOLD
Council Hill, Oklahoma
AnchorForge V1.2 · April 2026
Ya' Boy is standing on the Shoulders of Giants…