Exploring Fine-Grained Human Motion Video Captioning

ACL ARR 2024 April Submission 658 Authors

16 Apr 2024 (modified: 01 Jun 2024) · ACL ARR 2024 April Submission · CC BY 4.0
Abstract: Fine-grained human motion descriptions are crucial for fitness training, which brings into focus the problem of fine-grained human motion video-to-text generation. Previous video captioning models struggle to capture the fine-grained semantics of videos: the descriptions they generate are often coarse-grained and give little detail about human motion. Given this, and the scarcity of datasets annotated with fine-grained long-form text, we believe existing methods still have room for improvement. In this paper, we build a fine-grained human motion video captioning dataset named BoFit (Body Fitness Training). We also implement a state-of-the-art baseline named PoseGPT, which extracts angular representations of the videos and prompts LLMs to generate fine-grained descriptions of human motion. Results show that PoseGPT outperforms previous methods on comprehensive metrics. We aim for this dataset to serve as a useful evaluation set for visio-linguistic models and to drive further progress in this field.
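The abstract does not spell out how the angular representations or the prompting work; the following is a minimal, hypothetical sketch of what that pipeline could look like. The function name `joint_angle`, the keypoint coordinates, and the prompt wording are all illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle in degrees at keypoint b, formed by segments b->a and b->c."""
    a, b, c = map(np.asarray, (a, b, c))
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Hypothetical example: elbow angle from (shoulder, elbow, wrist) 2D keypoints
# produced by a pose estimator for one video frame.
shoulder, elbow, wrist = (0.42, 0.30), (0.48, 0.45), (0.40, 0.58)
angle = joint_angle(shoulder, elbow, wrist)

# Assumed prompting scheme: angular features are serialized into text and
# passed to an LLM to elicit a fine-grained motion description.
prompt = (
    f"Frame 12: right elbow flexion = {angle:.1f} degrees.\n"
    "Describe the exercise movement in fine-grained detail."
)
print(prompt)
```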
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: cross-modal content generation, cross-modal application, video processing, multimodality
Contribution Types: Approaches to low-resource settings, Data resources
Languages Studied: English
Submission Number: 658