Synthetic Data-Driven Benchmarking of Integrative Approaches for Multi-Modal Omics Analysis in Personalized Medicine

Published: 13 Dec 2023, Last Modified: 13 Dec 2023 · NLDL 2024 Abstract Track
Keywords: Data Integration; Synthetic Data; Omics; Personalised Medicine; Oncohematology
TL;DR: We propose an algorithm for generating synthetic multi-modal data and a novel learning rule for clinical data integration
Abstract: Recent technological developments have made it possible to generate and collect large amounts of molecular measurements. Such measurements are often referred to as "omics." They include, for example, information on the genetic content of a sample and its functionality. The analysis of omics data has revolutionized both biology and medicine, providing profound insights into intricate biological processes and laying the foundation for personalized medicine [1]. Furthermore, when combined with clinical records and medical imaging, omics offers an unprecedented view of an individual patient's health. Nevertheless, there remains a pressing need for robust and interpretable statistical and machine learning approaches to seamlessly integrate multi-omics data, particularly in basic biology and translational research [2, 3]. Moreover, given the inherent complexity and variability of real-world datasets, coupled with the absence of a consensus on the best methods for data integration, there is a compelling need for a synthetic data generation algorithm: such an algorithm is vital for benchmarking the plethora of available integration methods, ensuring their reliability and efficiency in a controlled environment. Our work addresses these multifaceted challenges, emphasizing the integration of clinical data, omics information, and medical imaging to deepen our understanding of complex conditions, including cancer and rare hematological disorders. Traditional machine learning and statistical models [4, 5] offer various methods for data integration but often lack the flexibility needed to handle the interplay of multi-modal data. Moreover, existing rules tend to rest on assumptions that may not hold in the complex healthcare landscape.
In response to these limitations, our research introduces a novel learning rule based on mutual information. This rule is designed to adapt to diverse data types and scales, account for interactions between various data forms, manage large-scale datasets efficiently, and shed light on the most influential features or data types in predictions. Building on this foundation, our model generates specialized, localized embeddings for each data modality. These local embeddings condense the information within each specific domain. Subsequently, the localized embeddings are synthesized into a unified, global embedding. The conditional mutual information between the local embeddings guides the formation of this global embedding, ensuring that the global representation captures the interaction between modalities in a manner that is sensitive to the specific clinical question at hand. This approach offers the benefits of both dimensionality reduction and the preservation of essential information across multiple modalities. Moreover, one of the strengths of our approach lies in the interpretability of the resulting models. By leveraging localized and global embeddings, we can dissect the contributions of each data type to the final prediction, facilitating a better understanding of how different modalities contribute to the model's decision-making process. We applied our method to a comprehensive internal dataset incorporating omics (transcriptomics, chromosomal and somatic alterations), clinical, and medical imaging data (histological images). Synthetic data is a cornerstone for rigorously testing and validating our multi-omics integration model. By generating synthetic data from a simulated latent distribution, we can create a controlled yet complex environment where the influence of each type of data (clinical, omics, and medical imaging) on patient outcomes can be studied separately and collectively.
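The local-to-global embedding pipeline described above can be illustrated with a minimal sketch. Note that everything here is an assumption for illustration: the random linear maps stand in for trained encoders, the fusion is a simple concatenation, and mutual information is estimated under a joint-Gaussian assumption rather than with the conditional-mutual-information learning rule of the actual method.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_embedding(X, dim, rng):
    """Condense one modality into a low-dimensional local embedding.
    A random linear projection stands in for a trained encoder."""
    W = rng.standard_normal((X.shape[1], dim)) / np.sqrt(X.shape[1])
    return X @ W

def gaussian_mi(a, b):
    """Mutual information between two 1-D variables under a
    joint-Gaussian assumption: I = -0.5 * log(1 - rho^2)."""
    rho = np.corrcoef(a, b)[0, 1]
    return -0.5 * np.log(1.0 - rho**2 + 1e-12)

# Three toy modalities for n patients (clinical, omics, imaging features).
n = 500
clinical = rng.standard_normal((n, 10))
omics = rng.standard_normal((n, 200))
imaging = rng.standard_normal((n, 50))

# Localized embeddings condense each modality...
local_embs = [local_embedding(X, 4, rng) for X in (clinical, omics, imaging)]

# ...and are fused into a unified global embedding (here: concatenation).
global_emb = np.concatenate(local_embs, axis=1)

# Scoring each local embedding's MI with a toy outcome gives a rough,
# interpretable measure of how much each modality contributes.
outcome = clinical[:, 0] + 0.1 * rng.standard_normal(n)
scores = [gaussian_mi(L[:, 0], outcome) for L in local_embs]
```

In this toy setup only the clinical modality drives the outcome, so its MI score should dominate; in the full method the conditional mutual information between local embeddings, rather than this marginal Gaussian estimate, shapes the global embedding.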
This enables us to test the robustness of our models under various conditions, including edge cases that are rarely observed in real-world datasets. Synthetic instances of each data type are generated from a simulated latent distribution that characterizes each patient. The synthetic data is created in a high-dimensional feature space that closely resembles the complexity of actual patient data, and care is taken to preserve the correlations and dependencies observed in real-world data. In conclusion, our research presents an innovative approach to the challenge of multi-omics data integration in healthcare. By employing machine learning and statistical methods, our model can process many data types, from clinical and omics to medical imaging. Our integration strategy utilizes localized and globally adaptive embeddings, providing a nuanced and flexible system that adjusts according to the clinical question.
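The generation scheme described above can be sketched as follows. This is a hypothetical minimal version, not the paper's algorithm: each patient is characterized by a shared latent vector, and each modality is emitted as a random linear readout of that latent plus noise, so cross-modal correlations and a latent-driven outcome arise by construction.

```python
import numpy as np

rng = np.random.default_rng(42)

def generate_patients(n, latent_dim=5, dims=(10, 200, 50), noise=0.1, rng=rng):
    """Generate synthetic multi-modal patient data.

    Each patient is characterized by a simulated latent vector Z; every
    modality (e.g. clinical, omics, imaging features) is a linear readout
    of Z plus noise, so dependencies between modalities are preserved."""
    Z = rng.standard_normal((n, latent_dim))
    modalities = []
    for d in dims:
        W = rng.standard_normal((latent_dim, d))
        modalities.append(Z @ W + noise * rng.standard_normal((n, d)))
    # A toy binary outcome driven by the latent state, so every
    # modality carries partial information about it.
    outcome = Z[:, 0] > 0
    return Z, modalities, outcome

Z, (clinical, omics, imaging), y = generate_patients(300)
```

Because all modalities are read out from the same latent vector, the influence of each modality on the outcome can be dialed up or down (via the readout weights and noise level), which is exactly what makes such data useful for benchmarking integration methods in a controlled environment.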
Submission Number: 23