Beyond Scale: The Diversity Coefficient as a Data Quality Metric for Variability in Natural Language Data

ICLR 2024 Workshop DMLR Submission 85 Authors

Published: 04 Mar 2024, Last Modified: 02 May 2024 · DMLR @ ICLR 2024 · CC BY 4.0
Keywords: data-centric machine learning, data quality, foundation models, pre-training, data diversity, diversity, large language models, LLMs
TL;DR: We propose a formal data diversity metric and show, through interventional experiments, that it can be used to improve downstream performance.
Abstract: Current trends in pre-training capable Large Language Models (LLMs) primarily focus on the scaling of model and dataset size. However, the *quality* of pre-training data is likewise an important factor for training powerful LLMs, yet it remains a nebulous concept that has not been rigorously characterized. We use the recently proposed Task2Vec diversity coefficient to ground, characterize, and understand formal aspects of data quality and to move beyond scale alone. We offer a formalization of one key aspect of data quality---the *variability* of natural language data---through the diversity coefficient. We first build confidence in the diversity coefficient through interpretability experiments and find that the coefficient aligns with intuitive properties of diversity and variability, e.g., it increases as the number of latent concepts increases. Then we measure the diversity coefficient of publicly available pre-training datasets and demonstrate that their formal diversity is high when compared to theoretical lower and upper bounds. Finally, we conduct a comprehensive set of controlled *interventional* experiments with GPT-2 and LLaMAv2 that demonstrate the diversity coefficient of pre-training data is a meaningful driver of downstream model evaluation performance---totaling 44 models of various sizes (51M to 7B parameters) trained from scratch. We conclude that formal diversity is an important aspect of data quality that captures variability and causally leads to improved evaluation performance, thereby moving discussions of pre-training data beyond scale alone.
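
The sketch below illustrates, at a high level, how a Task2Vec-style diversity coefficient of the kind described in the abstract might be computed: each text batch is embedded via the (diagonal) Fisher information of a probe network evaluated on that batch, and the diversity coefficient is the expected pairwise cosine distance between batch embeddings. This is not the authors' implementation; the `probe`, `batch`, and `loss_fn` inputs and both helper functions are hypothetical, and the Fisher diagonal is approximated from a single gradient pass purely for illustration.

```python
# Minimal sketch (assumptions: a small probe network, batches with "input"
# and "target" tensors, and a task loss; not the paper's actual code).
import torch
import torch.nn.functional as F

def task2vec_embedding(probe, batch, loss_fn):
    """Approximate a Task2Vec embedding for one batch: the diagonal of the
    Fisher Information Matrix of the probe's parameters, estimated from
    squared gradients of the loss on that batch (hypothetical helper)."""
    probe.zero_grad()
    logits = probe(batch["input"])
    loss = loss_fn(logits, batch["target"])
    loss.backward()
    # Squared gradients give a crude estimate of the Fisher diagonal.
    fisher_diag = torch.cat([p.grad.detach().pow(2).flatten()
                             for p in probe.parameters() if p.grad is not None])
    return fisher_diag

def diversity_coefficient(embeddings):
    """Diversity coefficient as the expected pairwise cosine distance
    between Task2Vec batch embeddings (hypothetical helper)."""
    dists = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            cos = F.cosine_similarity(embeddings[i], embeddings[j], dim=0)
            dists.append(1.0 - cos)
    return torch.stack(dists).mean()
```

Under this reading, a higher coefficient means randomly drawn batches of the dataset induce more dissimilar Fisher embeddings, i.e., the data exhibits more variability in the latent tasks it represents.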
Primary Subject Area: Role of data in foundation models: pre-training, prompting, fine-tuning
Paper Type: Research paper: up to 8 pages
DMLR For Good Track: Participate in DMLR for Good Track
Participation Mode: Virtual
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Submission Number: 85