Exploring Fine-Grained Human Motion Video Captioning

ACL ARR 2024 April Submission 658 Authors

16 Apr 2024 (modified: 01 Jun 2024) · ACL ARR 2024 April Submission · CC BY 4.0
Abstract: Fine-grained human motion descriptions are crucial for fitness training, which brings into focus the problem of fine-grained human motion video-to-text generation. Previous video captioning models struggle to capture the fine-grained semantics of videos: the descriptions they generate are often coarse-grained and give little detail about human motion. Given this, and the scarcity of datasets annotated with fine-grained long-form text, we believe existing methods still have room for improvement. In this paper, we build a fine-grained human motion video captioning dataset named BoFit (Body Fitness Training). We also implement a state-of-the-art baseline named PoseGPT, which extracts angular representations of the videos and prompts LLMs to generate fine-grained descriptions of human motion. Results show that PoseGPT outperforms previous methods on comprehensive metrics. We aim for this dataset to serve as a useful evaluation set for visio-linguistic models and to drive further progress in this field.
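The abstract does not spell out how the angular representations or the prompting work; the following is a minimal, hypothetical sketch of what that pipeline could look like. The function name `joint_angle`, the keypoint coordinates, and the prompt wording are all illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle in degrees at keypoint b, formed by segments b->a and b->c."""
    a, b, c = map(np.asarray, (a, b, c))
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Hypothetical example: elbow angle from (shoulder, elbow, wrist) 2D keypoints
# produced by a pose estimator for one video frame.
shoulder, elbow, wrist = (0.42, 0.30), (0.48, 0.45), (0.40, 0.58)
angle = joint_angle(shoulder, elbow, wrist)

# Assumed prompting scheme: angular features are serialized into text and
# passed to an LLM to elicit a fine-grained motion description.
prompt = (
    f"Frame 12: right elbow flexion = {angle:.1f} degrees.\n"
    "Describe the exercise movement in fine-grained detail."
)
print(prompt)
```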
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal content generation, cross-modal application, video processing, multimodality
Contribution Types: Approaches to low-resource settings, Data resources
Languages Studied: English
Submission Number: 658