Discursive Socratic Questioning: Evaluating the Faithfulness of Language Models’ Understanding of Discourse Relations

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission · Readers: Everyone
TL;DR: We propose a new measure to evaluate language models' faithfulness in understanding discourse relations.
Abstract: While large language models have significantly enhanced the effectiveness of discourse relation classification, it remains unclear whether their comprehension is faithful and reliable. We present DISQ, a new method for evaluating the faithfulness of discourse comprehension based on question answering. We first employ in-context learning to annotate the reasoning behind discourse comprehension, based on the connections among key events within the discourse. DISQ then interrogates the model with a sequence of questions to assess its grasp of core event relations, its resilience to counterfactual queries, and its consistency with its previous responses. We evaluate language models with different architectural designs using DISQ and find that: (1) DISQ presents a significant challenge for all models, with the top-performing GPT model attaining only 42% of the ideal performance; (2) open-source models generally lag behind their closed-source GPT counterparts, with the notable exception of models enhanced with chat and code/math capabilities, which demonstrate zero-shot discourse comprehension; (3) our analysis reveals asymmetries in how LLMs grasp particular relations, and validates the effectiveness of explicitly signalled discourse connectives, the role of contextual information, and the benefits of integrating historical QA data.
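To make the question-answering protocol described in the abstract concrete, the sketch below illustrates (under assumptions not taken from the paper; the function names, question templates, and scoring are hypothetical, and `query_model` stands in for any call to the model under evaluation) how a DISQ-style probe might pose a targeted question about an annotated event relation, a counterfactual question the model should reject, and a consistency re-check of its own answer.

```python
# Hypothetical sketch of a DISQ-style probe: targeted, counterfactual, and
# consistency questions about a single event relation, aggregated into a score.
from typing import Callable


def ask_yes_no(query_model: Callable[[str], str], question: str) -> bool:
    """Pose a yes/no question to the model and parse the first token."""
    answer = query_model(question + " Answer yes or no.")
    return answer.strip().lower().startswith("yes")


def probe_relation(query_model: Callable[[str], str],
                   context: str, event_a: str, event_b: str,
                   relation: str, wrong_relation: str) -> float:
    """Return a score in [0, 1]; 1.0 corresponds to the ideal performance."""
    prefix = f"Context: {context}\n"
    # Targeted question: does the model recognise the annotated relation?
    targeted = ask_yes_no(
        query_model,
        prefix + f"Is '{event_a}' related to '{event_b}' by {relation}?")
    # Counterfactual question: an unsupported relation should be rejected.
    counterfactual = not ask_yes_no(
        query_model,
        prefix + f"Is '{event_a}' related to '{event_b}' by {wrong_relation}?")
    # Consistency question: a paraphrase should elicit the same answer as before.
    paraphrase = ask_yes_no(
        query_model,
        prefix + f"Do '{event_a}' and '{event_b}' stand in a {relation} relation?")
    consistent = paraphrase == targeted
    return (targeted + counterfactual + consistent) / 3.0


# Example usage with a trivial stub model that always answers "yes".
if __name__ == "__main__":
    stub = lambda prompt: "yes"
    print(probe_relation(stub, "It rained, so the match was cancelled.",
                         "it rained", "the match was cancelled",
                         "cause", "contrast"))
```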
Paper Type: long
Research Area: Discourse and Pragmatics
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English