Don’t Make Your LLM an Evaluation Benchmark Cheater

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission · Readers: Everyone
Abstract: To assess the capacity of large language models (LLMs), a typical approach is to construct evaluation benchmarks that measure their abilities in different aspects. Although a surge of high-quality benchmarks has been released, concerns about the appropriate use of benchmarks and fair comparison are growing. In this paper, we discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results. In particular, we focus on a specific issue that leads to inappropriate evaluation, i.e., benchmark leakage, which refers to data related to the evaluation sets being occasionally used for model training. This phenomenon has become more common since pre-training data is often prepared before models are tested. We conduct extensive experiments to study the effect of benchmark leakage, and find that it can dramatically boost evaluation results, ultimately leading to an unreliable assessment of model performance. We hope this work can draw attention to the appropriate training and evaluation of LLMs.
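
The abstract does not state how leakage between pre-training corpora and evaluation sets might be detected; the snippet below is a minimal illustrative sketch (not the authors' method) that flags training documents sharing long n-grams with benchmark examples. The 13-gram window and whitespace tokenization are assumptions chosen for illustration.

# Minimal sketch of an n-gram overlap check for benchmark leakage.
# The window size (13) and whitespace tokenization are illustrative
# assumptions, not a procedure prescribed by the paper.

def ngrams(text: str, n: int = 13) -> set:
    """Return the set of lowercased whitespace n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_doc: str, eval_examples: list, n: int = 13) -> bool:
    """Flag a training document that shares any n-gram with an evaluation example."""
    train_grams = ngrams(train_doc, n)
    return any(train_grams & ngrams(example, n) for example in eval_examples)

# Usage: report (or drop) flagged documents before pre-training.
eval_set = ["The quick brown fox jumps over the lazy dog near the quiet river bank today."]
corpus = [
    "Unrelated web text about cooking.",
    "The quick brown fox jumps over the lazy dog near the quiet river bank today, said the post.",
]
flagged = [doc for doc in corpus if is_contaminated(doc, eval_set)]
print(f"{len(flagged)} of {len(corpus)} documents flagged for possible leakage")

In practice, such filtering would be applied to the pre-training corpus against every evaluation set of interest; the paper's point is that skipping this kind of check can silently inflate benchmark scores.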
Paper Type: short
Research Area: Resources and Evaluation
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: English