Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models

Published: 04 Mar 2024, Last Modified: 02 May 2024, DPFM 2024 Poster, License: CC BY 4.0
Keywords: multimodal; open-ended dialogue; dataset; multi-image; GPT-assisted; instruction-following models
TL;DR: We introduce GPT-assisted training and evaluation dialogue datasets to facilitate chats across multiple images and dialogue turns, and propose accompanying multimodal instruction-following models.
Abstract: Large language models exhibit enhanced zero-shot performance on various tasks when fine-tuned with instruction-following data. Multimodal instruction-following models extend these capabilities by integrating both text and images. However, existing models such as MiniGPT-4 and LLaVA face challenges in maintaining dialogue coherence in scenarios involving multiple images, largely because no specialized dataset exists for this critical application. To bridge this gap, we introduce SparklesDialogue, the first machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions. Furthermore, we construct SparklesEval, a GPT-assisted benchmark for quantitatively assessing a model's conversational competence across multiple images and dialogue turns. We then present SparklesChat, a multimodal instruction-following model for open-ended dialogues across multiple images. Our experiments validate the effectiveness of training SparklesChat with SparklesDialogue on top of MiniGPT-4 and LLaVA-v1.5: the resulting models show enhanced comprehension across multiple images and dialogue turns without compromising single-image understanding. Qualitative evaluations further demonstrate SparklesChat's generality in handling real-world applications. All resources related to this study are publicly available at https://github.com/HYPJUDY/Sparkles.
Submission Number: 46