LAiW: A Chinese Legal Large Language Models Benchmark

ACL ARR 2024 April Submission 801 Authors

16 Apr 2024 (modified: 08 May 2024) · ACL ARR 2024 April Submission · CC BY 4.0
Abstract: General and legal-domain LLMs have demonstrated strong performance on various LegalAI tasks. However, current evaluations of these LLMs in LegalAI are not aligned with legal logic, making it difficult for legal experts to understand and trust them. To address this challenge, we build LAiW, the first Chinese legal LLM benchmark grounded in the logic of legal syllogism. We categorize the legal capabilities of LLMs into three levels that align with the reasoning process of legal experts and legal syllogism: basic information retrieval, legal foundation inference, and complex legal application. Each level collects and tailors multiple tasks to ensure a comprehensive evaluation. Through automatic evaluation of current general and legal-domain LLMs on our benchmark, we find that although LLMs can answer complex legal questions, they do not follow the rigorous logical process inherent in legal syllogism, which may hinder their acceptance by legal experts. To further examine LLM behavior in legal application, we incorporate manual evaluation by legal experts. The results not only confirm this conclusion but also reveal the important role of pretraining in strengthening legal logic, which may guide the future development of legal LLMs.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Large Language Models, Legal Syllogism, Evaluation, Benchmark
Contribution Types: Data resources
Languages Studied: English, Chinese
Submission Number: 801