FinReflectKG - HalluBench: GraphRAG Hallucination Benchmark for Financial Question Answering Systems

From compliance monitoring to risk assessment and investment analysis, AI-powered question-answering systems are becoming deeply embedded in financial workflows.

As organizations layer these systems into their information infrastructure, a critical challenge has quietly emerged: the outputs that AI produces can appear fluent, authoritative, and well-reasoned, but still be factually wrong. In high-stakes domains like finance, where decisions hinge on precise figures from regulatory filings, this problem has a name: hallucination.

One of the most promising approaches to reducing hallucination risk is the use of Knowledge Graphs (KGs), which supply structured, verifiable facts to guide model responses. Yet while KG-augmented systems have shown promise in improving accuracy, a fundamental question has gone unanswered: what happens when the knowledge graph itself contains errors? In practice, KG extraction pipelines are noisy: they produce misaligned, incomplete, or contradictory triplets that real-world deployments have no choice but to work with.

Domyn's research team set out to address this gap head-on, introducing FinBench-QA-Hallucination, a benchmark specifically designed to evaluate how well hallucination detection methods hold up when KG signals are imperfect. The team built a dataset of 755 annotated question-answer examples drawn from fiscal year 2024 SEC 10-K filings across 57 S&P 100 companies, manually validated by nine in-house reviewers using a conservative evidence-linkage protocol. Six detection approaches were then evaluated under two controlled conditions, with and without KG triplets, to isolate how retrieval noise affects each method's reliability.

Here's what was found: in clean conditions, LLM-based judges and embedding methods performed strongest, achieving F1 scores between 0.82 and 0.86. But when noisy KG triplets were introduced, which is the realistic scenario for any live deployment, most methods degraded sharply, with Matthews Correlation Coefficient (MCC) scores dropping by between 44% and 84%. Embedding-based approaches stood apart, exhibiting only a 9% drop in performance under the same conditions. The root cause, surfaced through manual inspection, was an over-reliance on structured signals: LLM judges in particular tend to anchor on KG formatting even when it contradicts the source text.
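To make the two metrics concrete, here is a minimal sketch of how F1, MCC, and a relative drop between a clean and a noisy condition can be computed from confusion-matrix counts. The function names and the counts are hypothetical and chosen purely for illustration; they are not taken from the paper's data or code.

```python
import math


def f1_score(tp: int, tn: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN). TN is accepted but unused."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def relative_drop(clean: float, noisy: float) -> float:
    """Fraction of the clean-condition score lost under the noisy condition."""
    return (clean - noisy) / clean


# Hypothetical counts for one detector (NOT results from the paper).
clean_counts = dict(tp=80, tn=90, fp=10, fn=20)  # without KG triplets
noisy_counts = dict(tp=60, tn=70, fp=30, fn=40)  # with noisy KG triplets

f1_clean = f1_score(**clean_counts)
mcc_clean = mcc(**clean_counts)
mcc_noisy = mcc(**noisy_counts)
mcc_drop = relative_drop(mcc_clean, mcc_noisy)

print(f"F1 (clean):  {f1_clean:.3f}")
print(f"MCC (clean): {mcc_clean:.3f}")
print(f"MCC (noisy): {mcc_noisy:.3f}")
print(f"Relative MCC drop: {mcc_drop:.1%}")
```

Unlike F1, MCC uses all four cells of the confusion matrix, so it also penalizes errors on the negative class; that makes it a stricter measure to compare across the two conditions.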

Co-authored by Mahesh Kumar, Bhaskarjit Sarmah, and Stefano Pasquali, the research delivers a clear message for anyone deploying AI over regulatory documents: hallucination detection methods cannot be assumed robust just because they perform well in controlled conditions. Real KG noise, the kind that emerges naturally from extraction pipelines, is enough to undermine most current approaches. What does this mean for building trustworthy financial AI systems? Find out in the dedicated paper.

