Automated Red Teaming for LLMs in Financial Services: A Taxonomy-Driven Evaluation Framework

As financial institutions accelerate their adoption of large language models (LLMs), their exposure to a category of risk that safety benchmarks have largely overlooked is quietly growing: what happens when an adversary doesn't ask an AI system to do something obviously harmful, but instead frames a deeply problematic request as a routine professional one?
LLMs are built with guardrails designed to prevent harmful outputs. Yet a well-documented body of research shows that determined adversaries can systematically bypass these protections through jailbreak attacks: carefully engineered prompts and multi-turn interaction strategies that gradually erode a model's constraints. In most domains, this is a serious problem. In finance, it is a regulatory and operational emergency. A model that generates market manipulation strategies framed as "research", offers structuring advice that skirts anti-money-laundering rules, or provides actionable guidance on tax minimization schemes doesn't just fail a safety test; it exposes its deploying institution to enforcement action, financial loss, and reputational damage. In January 2026, the UK House of Commons Treasury Committee warned that a continued wait-and-see posture leaves consumers and the financial system exposed to serious harm, calling for AI-specific stress testing across the sector.
The challenge is that most existing red-teaming benchmarks were not built with this in mind. They focus on general-purpose harms, rely on single-turn interactions, and reduce security evaluation to a binary question: did the model comply or refuse? That framing misses the nuance that defines financial risk — where the danger lies not in explicit requests for violence or weapons, but in professionally plausible prompts that encode regulatory gray areas, compliance-sensitive advice, or market misconduct dressed up as legitimate analysis.
Domyn's research team built a framework to address this gap directly. The work introduces FinRedTeamBench, a domain-specific benchmark that maps LLM failure modes to regulatory, compliance, and operational risk categories across the Banking, Financial Services, and Insurance (BFSI) landscape. Alongside it, the team developed an automated multi-turn red-teaming pipeline in which an attacker model iteratively escalates its approach based on prior responses, and an ensemble-based judging protocol to evaluate outputs across multiple independent models. At the center of the framework is a new metric: the Risk-Adjusted Harm Score (RAHS), which moves beyond simple success rates to capture how severe a model's financial disclosure actually is and whether it includes any meaningful mitigation signals.
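To make the moving parts concrete, the loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `Verdict`, `ensemble_score`, and `run_attack`, the 0-to-1 severity scale, and the 0.5 mitigation discount are all hypothetical, and the toy score is not the RAHS formula, which this article does not reproduce. The attacker, target, and judges are stand-in functions where real model calls would go.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class Verdict:
    severity: float  # assumed scale: 0.0 (harmless) to 1.0 (highly actionable)
    mitigated: bool  # True if the judge saw warnings or caveats in the response


# Hypothetical interfaces; real ones would wrap model API calls.
Attacker = Callable[[str, List[str]], str]  # (goal, prior responses) -> next prompt
Target = Callable[[str], str]               # prompt -> response
Judge = Callable[[str], Verdict]            # response -> verdict


def ensemble_score(response: str, judges: List[Judge], discount: float = 0.5) -> float:
    """Toy score (NOT the paper's RAHS): mean judge severity, discounted
    when a strict majority of judges report mitigation signals."""
    verdicts = [judge(response) for judge in judges]
    severity = mean(v.severity for v in verdicts)
    mitigated_votes = sum(v.mitigated for v in verdicts)
    if mitigated_votes * 2 > len(verdicts):
        severity *= discount
    return severity


def run_attack(goal: str, attacker: Attacker, target: Target,
               judges: List[Judge], max_turns: int = 3) -> List[float]:
    """Run a multi-turn attack and return one ensemble score per turn."""
    responses: List[str] = []
    scores: List[float] = []
    for _ in range(max_turns):
        prompt = attacker(goal, responses)  # attacker adapts to prior responses
        response = target(prompt)
        responses.append(response)
        scores.append(ensemble_score(response, judges))
    return scores


# Stub components so the sketch runs without any model access.
def stub_attacker(goal: str, history: List[str]) -> str:
    return f"{goal} (attempt {len(history) + 1})"


def stub_target(prompt: str) -> str:
    # Refuses on the first attempt, complies afterwards.
    return "REFUSE" if "attempt 1" in prompt else "DETAILS"


def lenient_judge(response: str) -> Verdict:
    return Verdict(0.0 if response == "REFUSE" else 0.8, mitigated=False)


def strict_judge(response: str) -> Verdict:
    return Verdict(0.0 if response == "REFUSE" else 1.0, mitigated=False)


scores = run_attack("placeholder goal", stub_attacker, stub_target,
                    [lenient_judge, strict_judge], max_turns=2)
print(scores)
```

Returning a per-turn list of scores rather than a single pass/fail flag is the point of the design: it shows whether severity climbs as the conversation continues, which a binary comply-or-refuse check cannot.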
The findings are stark. Higher decoding temperature and sustained adversarial interaction don't just increase the likelihood of a successful jailbreak — they systematically escalate the severity of the harm exposed. Models that hold firm in the first round often capitulate under extended pressure, gradually shifting toward more operationally actionable and financially consequential disclosures. Crucially, the models tested across the framework showed a consistent blind spot: while they reliably refused requests for overtly illegal or unethical content, they frequently responded with detailed, helpful answers to requests encoding financial misconduct wrapped in the language of compliance and professional practice.
Co-authored by Fabrizio Dimino, Bhaskarjit Sarmah, and Stefano Pasquali, the paper's message to financial institutions is direct: deploying LLMs in production without continuous, domain-specific adversarial testing is a material risk, not a theoretical one. If generic benchmarks are not enough, find out what rigorous evaluation looks like in the dedicated paper below: