FinReflectKG - MultiHop: Financial QA Benchmark for Reasoning with Knowledge Graph Evidence
In the complex world of financial data, answering a single question often requires more than just reading a document: it demands connecting the dots across multiple years, companies, and filings.
Imagine an analyst asking: “How did a raw material shortage in 2023 affect a company’s revenue in 2025?”
To answer, an AI system must link events, risks, and outcomes – a process called multi-step reasoning, or in researchers’ jargon, multi-hop reasoning. Yet most large language models (LLMs) struggle with this type of complex analysis because the evidence is scattered across vast amounts of unstructured text, leading to inefficiency, high token usage, and frequent errors.
To tackle this challenge, Domyn’s research team – Abhinav Arun, Reetu Raj Harsh, Bhaskarjit Sarmah, and Stefano Pasquali – introduced FinReflectKG–MultiHop, a new benchmark designed specifically for multi-hop question answering (QA) in finance. Built on top of FinReflectKG, a financial knowledge graph, the benchmark connects structured entities such as companies, financial metrics, risk factors, ESG topics, and temporal data extracted from corporate filings like 10-K reports. By grounding QA tasks in the knowledge graph, the benchmark lets models retrieve evidence far more precisely than sifting through thousands of text tokens. In short, it offers a systematic, finance-specific benchmark that supports complex, analyst-style reasoning grounded in structured, time-aware data.
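To make this concrete, here is a minimal sketch of what KG-grounded evidence for the analyst’s question might look like. The schema below – the entity names, relation labels, and `Triple` structure – is purely illustrative, not the benchmark’s actual data format.

```python
# A minimal, hypothetical sketch of KG-grounded evidence (illustrative
# schema, not the benchmark's actual format): facts are subject-relation-
# object triples tagged with the fiscal year of the filing they came from.
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str
    year: int  # fiscal year of the 10-K the fact was extracted from

# A two-hop evidence chain linking a 2023 risk factor to 2025 revenue.
evidence_path = [
    Triple("AcmeCorp", "faces_risk", "raw_material_shortage", 2023),
    Triple("raw_material_shortage", "impacts_metric", "revenue", 2025),
]

# Multi-hop QA amounts to composing such edges: each hop narrows the
# question from one entity to a connected fact in a later filing.
for hop, t in enumerate(evidence_path, start=1):
    print(f"hop {hop}: {t.subject} --{t.relation}--> {t.obj} ({t.year})")
```

Each hop corresponds to a single edge traversal in the graph, so the model’s context contains only the handful of facts it actually needs rather than pages of filing text.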
After building the dataset, Domyn’s researchers designed a meticulous pipeline to keep the questions realistic. First, they extracted common reasoning “patterns” across the S&P 100, then generated thousands of question-answer pairs based on these links. Each question comes with three different evidence settings: precise KG-linked paths, text-only snippets, and noisy document windows containing distractors. This structure allowed them to measure not just how well models reason, but also how efficiently they handle retrieval in structured versus noisy contexts.
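One way to picture this is a single benchmark item carrying the same question under all three evidence settings. The field names in the sketch below are assumptions made for illustration, not the dataset’s published schema.

```python
# One illustrative benchmark item under the three evidence settings
# (field names are assumptions, not the dataset's published schema).
qa_item = {
    "question": "How did the 2023 raw material shortage affect "
                "AcmeCorp's revenue in 2025?",
    "answer": "Revenue declined as input shortages constrained sales.",
    "evidence": {
        # Setting 1: the precise KG-linked path (compact, on-topic).
        "kg_path": [
            ("AcmeCorp", "faces_risk", "raw_material_shortage", 2023),
            ("raw_material_shortage", "impacts_metric", "revenue", 2025),
        ],
        # Setting 2: only the source sentences behind those triples.
        "text_snippets": [
            "In fiscal 2023 we experienced shortages of key raw materials ...",
            "Fiscal 2025 revenue declined, reflecting supply constraints ...",
        ],
        # Setting 3: wide document windows padded with distractor passages,
        # simulating traditional chunk-based text retrieval.
        "noisy_windows": [
            "<pages of surrounding 10-K text containing the snippets above>",
        ],
    },
}

# The settings differ mainly in how much irrelevant context the model
# must wade through to reach the same underlying facts.
for setting, payload in qa_item["evidence"].items():
    print(f"{setting}: {len(payload)} piece(s) of evidence")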
When tested across a range of open-source and proprietary LLMs, the results were striking. Models given KG-linked evidence scored about 24% higher on correctness while using around 84.5% fewer tokens than those relying on traditional text-based retrieval. Even smaller models benefited from the structured evidence, demonstrating that better retrieval – not necessarily bigger models – can significantly improve reasoning accuracy on financial tasks. In contrast, models that relied solely on raw text often drowned in irrelevant data, producing inconsistent or incorrect answers.
For the finance industry, the implications are clear: integrating structured retrieval through knowledge graphs can make AI-powered financial analysis more accurate, more efficient, and easier to interpret. And that’s just the beginning. Moving far beyond surface-level reading, FinReflectKG–MultiHop is a crucial step toward deeper, evidence-based reasoning, enhancing the value of specialized LLMs as strategic assets across the industry.