Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech

Published: 20 Dec 2019, Last Modified: 22 Oct 2023. ICLR 2020 Conference Blind Submission. Readers: Everyone
TL;DR: Vector quantization layers incorporated into a self-supervised neural model of speech audio learn hierarchical and discrete linguistic units (phone-like, word-like) when trained with a visual-grounding objective.
Abstract: In this paper, we present a method for learning discrete linguistic units by incorporating vector quantization layers into neural models of visually grounded speech. We show that our method is capable of capturing both word-level and sub-word units, depending on how it is configured. What differentiates this paper from prior work on speech unit learning is the choice of training objective. Rather than using a reconstruction-based loss, we use a discriminative, multimodal grounding objective which forces the learned units to be useful for semantic image retrieval. We evaluate the sub-word units on the ZeroSpeech 2019 challenge, achieving a 27.3% reduction in ABX error rate over the top-performing submission, while keeping the bitrate approximately the same. We also present experiments demonstrating the noise robustness of these units. Finally, we show that a model with multiple quantizers can simultaneously learn phone-like detectors at a lower layer and word-like detectors at a higher layer. We show that these detectors are highly accurate, discovering 279 words with an F1 score greater than 0.5.
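To make the setup described in the abstract concrete, here is a minimal sketch of its two ingredients: a vector-quantization layer with a straight-through estimator, and a discriminative audio-image retrieval objective over in-batch negatives. This assumes PyTorch; the names `VectorQuantizer` and `retrieval_loss` and all hyperparameters are illustrative, not taken from the ResDAVEnet-VQ code.

```python
# Illustrative sketch only -- not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Quantize each frame vector to its nearest codebook entry."""
    def __init__(self, num_codes: int, dim: int, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta  # commitment-cost weight (assumed value)

    def forward(self, z):                      # z: (batch, time, dim)
        flat = z.reshape(-1, z.size(-1))       # (batch*time, dim)
        # Squared L2 distance from each frame to every codebook vector.
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        idx = d.argmin(dim=1)                  # discrete unit ids
        q = self.codebook(idx).view_as(z)
        # Codebook + commitment losses, then straight-through gradient copy.
        vq_loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        q = z + (q - z).detach()
        return q, idx.view(z.shape[:-1]), vq_loss

def retrieval_loss(audio_emb, image_emb, margin: float = 1.0):
    """Triplet-margin semantic retrieval loss over in-batch negatives."""
    sim = audio_emb @ image_emb.t()            # (batch, batch) similarities
    pos = sim.diag().unsqueeze(1)
    # Push each matched audio-image pair above mismatched pairs by a margin.
    cost_a = F.relu(margin + sim - pos)        # wrong image for each audio
    cost_i = F.relu(margin + sim - pos.t())    # wrong audio for each image
    mask = 1.0 - torch.eye(sim.size(0), device=sim.device)
    return ((cost_a + cost_i) * mask).mean()
```

In the configuration the abstract describes, quantizers of this kind would sit at different depths of the audio branch, so lower layers tend toward phone-like codes and higher layers toward word-like codes, while the retrieval loss on the joint audio-image embedding takes the place of any reconstruction term.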
Keywords: visually-grounded speech, self-supervised learning, discrete representation learning, vision and language, vision and speech, hierarchical representation learning
Code: https://github.com/wnhsu/ResDAVEnet-VQ
Community Implementations: 1 code implementation via CatalyzeX (https://www.catalyzex.com/paper/arxiv:1911.09602/code)