Data Multiplication for Cross-Document Event Coreference with Large Language Models

Anonymous

16 Dec 2023 · ACL ARR 2023 December Blind Submission · Readers: Everyone
Abstract: Creating a Cross-Document Event Coreference (CDEC) dataset is complex and labor-intensive. As a result, existing CDEC datasets are small in the number of event mentions and limited in the range of event types they cover. This is a substantial hurdle to training robust CDEC systems. In this paper, we propose to leverage large language models (LLMs) to address this bottleneck. Specifically, to enrich trigger variety and word-order variation, we introduce two data multiplication techniques that employ GPT-4 to generate realistic synthetic training data, effectively increasing the volume of existing annotated CDEC datasets with high-quality annotations. We demonstrate the effectiveness of our approach with experiments on the ECB+ and Aylien Covid datasets, showing that adding LLM-generated CDEC data improves performance on the two benchmarks by up to 1.8 and 3 CoNLL F1 points, respectively. We believe our method is generally applicable to other tasks and underscores the potential of LLMs for addressing data scarcity in natural language processing.
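To make the trigger-variety idea concrete, here is a minimal sketch of what multiplying a single annotated mention might look like. The abstract states that GPT-4 performs the rewriting; in this illustration a plain string substitution stands in for the LLM call, and the function name, sentence, trigger alternatives, and chain identifier are all hypothetical examples, not taken from the paper.

```python
# Illustrative sketch of trigger-variety data multiplication.
# Each augmented copy replaces the event trigger with an alternative
# surface form while keeping the coreference chain label unchanged,
# so the new examples inherit the original gold annotation.
# (In the paper, GPT-4 generates the rewrites; here we substitute directly.)

def multiply_trigger(sentence, trigger, alternatives, chain_id):
    """Return (sentence, chain_id) pairs with the trigger swapped."""
    augmented = []
    for alt in alternatives:
        if trigger in sentence:
            # Replace only the first occurrence, which is the annotated trigger.
            augmented.append((sentence.replace(trigger, alt, 1), chain_id))
    return augmented

# Hypothetical annotated mention: sentence, trigger span, coreference chain id.
copies = multiply_trigger(
    "Rescuers evacuated residents after the flood.",
    "evacuated",
    ["removed", "relocated"],
    "chain_7",
)
# Each copy is a new training example linked to the same event chain.
```

A second technique along the same lines would vary word order (e.g., rephrasing active constructions as passive ones) rather than the trigger itself, again preserving the gold coreference links.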
Paper Type: long
Research Area: Information Extraction
Contribution Types: Data resources
Languages Studied: English