Finding the Right Recipe for Low Resource Domain Adaptation in Neural Machine Translation

Anonymous

16 Oct 2021 (modified: 05 May 2023), ACL ARR 2021 October Blind Submission
Abstract: Despite the considerable amount of parallel data used to train neural machine translation models, they can still struggle to generate fluent translations in technical domains. In-domain parallel data is often very scarce, and synthetic in-domain data generated via back-translation is frequently of lower quality. To guide machine translation practitioners and characterize the effectiveness of domain adaptation methods under different data availability scenarios, we conduct an in-depth empirical exploration of monolingual and parallel data approaches to domain adaptation. We compare mixed-domain fine-tuning, traditional back-translation, tagged back-translation, and shallow fusion with domain-specific language models, both in isolation and in combination. We study method effectiveness in very low-resource (8k parallel examples) and moderately low-resource (46k parallel examples) conditions. We demonstrate the advantages of augmenting clean in-domain parallel data with noisy mined in-domain parallel data and propose an ensemble approach to alleviate reductions in original-domain translation quality. Our work covers three domains (consumer electronics, clinical, and biomedical) and spans four language pairs: Zh-En, Ja-En, Es-En, and Ru-En. We make concrete recommendations for achieving high in-domain performance. We release our consumer electronics and clinical domain datasets for all languages and make our code publicly available.
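For reference, the shallow fusion method named above is conventionally formulated (in the literature generally; the submission's exact interpolation and notation, including the weight λ used here, may differ) as scoring each decoding step by combining the translation model's probability with an in-domain language model's probability in log space:

\[ \mathrm{score}(y_t) = \log p_{\mathrm{MT}}(y_t \mid y_{<t}, x) + \lambda \, \log p_{\mathrm{LM}}(y_t \mid y_{<t}) \]

where x is the source sentence and λ controls how strongly the domain-specific language model influences the translation.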
