FinReflectKG - EvalBench: Benchmarking Financial KG with Multi-Dimensional Evaluation

Large Language Models (LLMs) do far more than generate text. In the financial services industry, they are particularly useful for extracting structured knowledge from complex, unstructured financial documents such as SEC filings. Despite great progress in knowledge extraction methods, the field has lacked a standardized, rigorous way to benchmark and evaluate financial Knowledge Graphs (KGs) derived from such texts. This absence makes it difficult to compare approaches, validate extracted facts, or trust LLM outputs in high-stakes financial applications.
Domyn’s research team introduces FinReflectKG - EvalBench, the first systematic evaluation framework designed specifically for financial KG extraction. Built from U.S. SEC Form 10-K filings of S&P 100 companies – documents known for their density and ambiguity – the benchmark addresses a critical gap in evaluating how reliably LLMs can extract structured triples from real-world financial disclosures. EvalBench brings multiple extraction strategies into a unified framework, including single-pass, multi-pass, and reflection-based methods, enabling direct and meaningful comparison across approaches.
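To make the comparison concrete, here is a minimal Python sketch of how the three strategies differ in shape. The `llm` and `critic` callables, the prompt wording, and the triple format are illustrative assumptions, not the benchmark's actual implementation.

```python
from typing import Callable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

def single_pass(llm: Callable[[str], List[Triple]], chunk: str) -> List[Triple]:
    """One prompt, one extraction pass: conservative and closely tied to the text."""
    return llm(f"Extract (subject, relation, object) triples:\n{chunk}")

def multi_pass(llm: Callable[[str], List[Triple]], chunk: str, passes: int = 3) -> List[Triple]:
    """Repeat extraction and union the results to improve coverage."""
    seen: set = set()
    for _ in range(passes):
        seen.update(llm(f"Extract (subject, relation, object) triples:\n{chunk}"))
    return list(seen)

def reflection_based(llm, critic, chunk: str, rounds: int = 2) -> List[Triple]:
    """Extract once, then iteratively critique and revise the triple set."""
    triples = single_pass(llm, chunk)
    for _ in range(rounds):
        feedback = critic(chunk, triples)  # e.g. "a disclosed risk factor is missing"
        triples = llm(f"Revise the triples using this feedback: {feedback}\n{chunk}")
    return triples
```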
Instead of relying on a single metric, the research team broke down performance into several dimensions, such as faithfulness, precision, relevance, and comprehensiveness. This multi-dimensional view reflects the reality that no single metric can fully capture the quality of a financial KG. An extraction method may produce many facts, but those facts vary in how well they are grounded in the source text, how clearly they are expressed, and how useful they are for downstream tasks such as compliance monitoring, investment research, credit risk assessment, or portfolio analysis.
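As a rough illustration of why the dimensions are kept separate, the Python sketch below reports a rate per dimension rather than one blended score. The per-triple boolean verdict format is a hypothetical simplification, not the framework's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class TripleVerdict:
    faithfulness: bool  # is the fact grounded in the source passage?
    precision: bool     # is the triple expressed cleanly and unambiguously?
    relevance: bool     # is it useful for downstream financial analysis?

def dimension_rates(verdicts: List[TripleVerdict]) -> Dict[str, float]:
    """Report each quality dimension separately instead of collapsing them."""
    n = len(verdicts) or 1
    return {
        "faithfulness": sum(v.faithfulness for v in verdicts) / n,
        "precision": sum(v.precision for v in verdicts) / n,
        "relevance": sum(v.relevance for v in verdicts) / n,
    }
```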
To ensure evaluations are robust, scalable, and reproducible, the framework employs an LLM-as-Judge protocol with explicit bias controls. EvalBench introduces a deterministic commit-then-justify judging procedure that mitigates notorious LLM evaluation biases such as position effects, leniency, verbosity preferences, and reliance on world knowledge beyond the source text. When evidence is ambiguous, the judge follows a conservatism principle, defaulting to a negative decision rather than risking over-generation, so that trustworthiness takes priority. Structured verdicts and warning signals further support transparent error analysis and iterative improvement of extraction pipelines.
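The judging procedure can be pictured roughly as follows. The `judge_llm` callable, prompt wording, and verdict schema are assumptions made for illustration, not the framework's exact interface.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Verdict:
    accepted: bool
    justification: str
    warnings: List[str] = field(default_factory=list)

def judge_triple(judge_llm: Callable[..., str], source_text: str,
                 triple: Tuple[str, str, str]) -> Verdict:
    """Commit to a yes/no verdict first, then ask for the justification,
    so the explanation cannot retroactively soften the decision."""
    commit = judge_llm(
        "Using ONLY the passage below (no outside knowledge), answer YES or NO: "
        f"is the triple {triple} supported by the text?\n\n{source_text}",
        temperature=0.0,  # deterministic decoding for reproducible verdicts
    ).strip().upper()

    # Conservatism principle: anything other than an explicit YES is rejected.
    accepted = commit == "YES"

    justification = judge_llm(
        f"The verdict for {triple} is {'YES' if accepted else 'NO'}. "
        "Quote the passage evidence (or note its absence) that supports this verdict.",
        temperature=0.0,
    )
    warnings = [] if accepted else ["unsupported_or_ambiguous"]
    return Verdict(accepted=accepted, justification=justification, warnings=warnings)
```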
The results reveal important trade-offs between extraction strategies. Reflection-based methods generally perform best in terms of precision, relevance, and comprehensiveness, suggesting that iterative reasoning improves coverage and structural quality. However, simpler single-pass extraction achieves the highest faithfulness, indicating a more conservative alignment with the source text. This counterintuitive finding highlights why multi-dimensional evaluation is essential: increasing coverage can come at the cost of strict factual grounding.
Co-authored by Fabrizio Dimino, Abhinav Arun, Bhaskarjit Sarmah, and Stefano Pasquali, this research provides a critical foundation for comparing methods, diagnosing errors, and advancing both transparency and governance in automated financial knowledge extraction. Ultimately, it aims to improve the reliability and performance of financial AI.