EvasionBench: Detecting Evasive Answers in Financial Q&A via Multi-Model Consensus and LLM-as-Judge
Abstract
EvasionBench is a large-scale benchmark for detecting evasive responses in earnings calls. It is built with a multi-model annotation framework that leverages disagreement between frontier language models to surface hard examples, yielding a compact model that is highly accurate at a significantly reduced inference cost.
Detecting evasive answers in earnings calls is critical for financial transparency, yet progress is hindered by the lack of large-scale benchmarks. We introduce EvasionBench, comprising 30,000 training samples and 1,000 human-annotated test samples (Cohen's kappa = 0.835) across three evasion levels. Our key contribution is a multi-model annotation framework built on a core insight: disagreement between frontier LLMs signals the hard examples most valuable for training. We mine boundary cases where two strong annotators conflict and use a judge to resolve their labels. This approach outperforms single-model distillation by 2.4 percent, with judge-resolved samples improving generalization despite higher training loss (0.421 vs. 0.393), evidence that disagreement mining acts as implicit regularization. Our trained model, Eva-4B (4B parameters), achieves 81.3 percent accuracy, outperforming its base model by 25 percentage points and approaching frontier LLM performance at a fraction of the inference cost.
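To make the annotation framework concrete, below is a minimal sketch of the disagreement-mining loop the abstract describes. The annotator and judge callables, the label names, and the `QAPair` fields are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the multi-model annotation loop: two strong annotators
# label each Q&A pair, and disagreements (boundary cases) are resolved by a judge.
from dataclasses import dataclass
from typing import Callable, List

# Assumed names for the three evasion levels; the paper's exact labels may differ.
LABELS = ["direct", "partially_evasive", "evasive"]

@dataclass
class QAPair:
    question: str
    answer: str

def annotate(
    pairs: List[QAPair],
    annotator_a: Callable[[QAPair], str],      # frontier LLM #1 -> label
    annotator_b: Callable[[QAPair], str],      # frontier LLM #2 -> label
    judge: Callable[[QAPair, str, str], str],  # resolves conflicting labels
) -> List[dict]:
    """Label each Q&A pair; route annotator disagreements to the judge."""
    records = []
    for pair in pairs:
        label_a, label_b = annotator_a(pair), annotator_b(pair)
        if label_a == label_b:
            # Easy case: both strong annotators agree on the label.
            records.append({"pair": pair, "label": label_a, "boundary": False})
        else:
            # Boundary case: disagreement signals a hard, high-value example.
            resolved = judge(pair, label_a, label_b)
            records.append({"pair": pair, "label": resolved, "boundary": True})
    return records
```

The boundary flag makes it easy to inspect or upweight the judge-resolved samples that, per the abstract, drive the generalization gains.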
Community
Thanks for featuring our work! EvasionBench aims to bridge the gap in financial transparency. We've released the Eva-4B model and the 1k human-annotated test set.
Paper: https://arxiv.org/abs/2601.09142
Model: https://huggingface.co/FutureMa/Eva-4B
Demo: https://huggingface.co/spaces/FutureMa/financial-evasion-detection
Feel free to ask any questions!
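If you want to try the released checkpoint directly, here is a minimal usage sketch. It assumes Eva-4B loads as a standard causal LM via transformers; the prompt template and the expected output format are illustrative guesses, so check the model card or the demo Space for the exact setup.

```python
# Minimal, hypothetical usage sketch for the released Eva-4B checkpoint.
# Assumes a standard causal-LM checkpoint; prompt format is a guess.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "FutureMa/Eva-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = (
    "Question: What is your margin outlook for next quarter?\n"
    "Answer: We remain focused on long-term value creation for shareholders.\n"
    "How evasive is this answer?"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```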