FinReflectKG - HalluBench: GraphRAG Hallucination Benchmark for Financial Question Answering Systems

As organizations layer AI systems into their information infrastructure, a critical challenge has quietly emerged: the outputs these systems produce can appear fluent, authoritative, and well-reasoned, yet still be factually wrong. In high-stakes domains like finance, where decisions hinge on precise figures from regulatory filings, this problem has a name: hallucination.
One of the most promising approaches to reducing hallucination risk is the use of Knowledge Graphs (KGs), which supply structured, verifiable facts to guide model responses. Yet while KG-augmented systems have shown promise in improving accuracy, a fundamental question has gone unanswered: what happens when the knowledge graph itself contains errors? In practice, KG extraction pipelines are noisy: they produce misaligned, incomplete, or contradictory triplets that real-world deployments have no choice but to work with.
Domyn's research team set out to address this gap head-on, introducing FinBench-QA-Hallucination, a benchmark specifically designed to evaluate how well hallucination detection methods hold up when KG signals are imperfect. The team built a dataset of 755 annotated question-answer examples drawn from fiscal year 2024 SEC 10-K filings across 57 S&P 100 companies, manually validated by nine in-house reviewers using a conservative evidence-linkage protocol. Six detection approaches were then evaluated under two controlled conditions — with and without KG triplets — to isolate exactly how retrieval noise affects each method's reliability.
Here's what was found: in clean conditions, LLM-based judges and embedding methods performed strongest, achieving F1 scores between 0.82 and 0.86. But when noisy KG triplets were introduced (the realistic scenario for any live deployment), most methods degraded sharply, with Matthews Correlation Coefficient (MCC) scores dropping between 44% and 84%. Embedding-based approaches stood apart, exhibiting only a 9% drop in performance under the same conditions. The root cause, surfaced through manual inspection, was a systematic over-reliance on structured signals: LLM judges in particular tend to anchor on KG formatting even when it contradicts the source text.
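For readers less familiar with the metrics quoted above, here is a minimal Python sketch of how F1, the Matthews Correlation Coefficient (MCC), and a relative drop between two conditions can be computed for a binary hallucination detector. It is not the paper's evaluation code: the function names, the label convention (1 = hallucinated, 0 = faithful), and the toy predictions are all assumptions made purely for illustration.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels.

    Assumed convention (hypothetical): 1 = hallucinated, 0 = faithful.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def f1_score(tp, fp, tn, fn):
    """F1 = 2TP / (2TP + FP + FN); true negatives do not enter the formula."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient, in [-1, 1]; 0 means chance-level."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def relative_drop(clean, noisy):
    """Percentage drop of a metric when moving from clean to noisy conditions."""
    return (clean - noisy) / clean * 100.0

# Toy data, invented for illustration only (not taken from the benchmark).
y_true     = [1, 1, 1, 1, 0, 0, 0, 0]
clean_pred = [1, 1, 1, 0, 0, 0, 0, 1]  # detector output without KG triplets
noisy_pred = [1, 1, 0, 0, 0, 0, 0, 1]  # detector output with noisy KG triplets

clean_mcc = mcc(*confusion_counts(y_true, clean_pred))
noisy_mcc = mcc(*confusion_counts(y_true, noisy_pred))

print(f"clean F1:  {f1_score(*confusion_counts(y_true, clean_pred)):.2f}")
print(f"clean MCC: {clean_mcc:.2f}")
print(f"noisy MCC: {noisy_mcc:.2f}")
print(f"relative MCC drop: {relative_drop(clean_mcc, noisy_mcc):.0f}%")
```

Unlike F1, MCC uses all four cells of the confusion matrix, so it also penalizes a detector that mishandles faithful answers. That makes it a stricter measure when comparing the same method with and without KG triplets.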
Co-authored by Mahesh Kumar, Bhaskarjit Sarmah, and Stefano Pasquali, the research delivers a clear message for anyone deploying AI over regulatory documents: hallucination detection methods cannot be assumed robust just because they perform well in controlled conditions. Real KG noise — the kind that emerges naturally from extraction pipelines — is enough to undermine most current approaches. What does this mean for building trustworthy financial AI systems? Find out in the dedicated paper.
