Language Modeling from Visually Grounded Speech

Author(s)
Lai, Cheng-I Jeff
Thesis PDF (4.362 MB)
Advisor
Glass, James R.
Terms of use
In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Recent advancements in spoken language processing have significantly reduced automatic speech recognition (ASR) error rates, driven by large-scale supervised training on paired speech–text data and, more recently, self-supervised pre-training on unpaired speech and audio. These methods have facilitated robust transfer learning across diverse speech and audio tasks. However, fully leveraging multimodal inputs, particularly visual context, remains underexplored. This thesis addresses this gap by developing novel language modeling techniques directly from visually grounded speech. We first introduce the Audio-Visual Neural Syntax Learner (AV-NSL), an unsupervised parser that recovers constituency trees directly from raw speech paired with images, demonstrating how visual context effectively bootstraps grammar induction without textual supervision. Next, we investigate Audio-Visual Word Discovery for Speech Translation, using the Fisher Spanish–English corpus to train a series of speech-to-speech translation models based on pseudo-word units discovered via audio-visual grounding. This study highlights that simplistic acoustic tokens and limited training data degrade re-synthesis and translation quality, underscoring two crucial missing ingredients: richer semantic tokens and large-scale training. Guided by these insights, we present Audio-Visual Gemma (AV-Gemma), a family of multimodal foundation models that condition jointly on images and learned semantic speech tokens. At scale, AV-Gemma generates visually coherent spoken captions and transfers robustly to tasks such as video-to-speech generation and spoken visual question answering, significantly advancing multimodal spoken-language processing.
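The final contribution described above, AV-Gemma, conditions a language model jointly on images and learned semantic speech tokens. As a rough illustration only, the following minimal PyTorch sketch shows one way such conditioning can be wired up: a projected image embedding is prepended as a prefix to embedded discrete speech units, and a causal Transformer predicts the next unit. The class name, dimensions, tokenizer, and prefix-conditioning scheme are illustrative assumptions, not the architecture used in the thesis.

# Minimal sketch (all names and sizes are illustrative assumptions):
# an autoregressive model over discrete speech tokens, conditioned on an
# image embedding used as a prefix, in the spirit of the image + speech-token
# conditioning described in the abstract.
import torch
import torch.nn as nn


class ImageConditionedSpeechLM(nn.Module):
    def __init__(self, num_speech_tokens=1024, d_model=256, image_dim=512,
                 n_layers=4, n_heads=4):
        super().__init__()
        # Embed discrete speech units (e.g. from a self-supervised tokenizer).
        self.token_emb = nn.Embedding(num_speech_tokens, d_model)
        # Project precomputed image features into the model space; they act
        # as a one-position conditioning prefix.
        self.image_proj = nn.Linear(image_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, num_speech_tokens)

    def forward(self, image_feats, speech_tokens):
        # image_feats: (batch, image_dim); speech_tokens: (batch, seq_len) int64
        prefix = self.image_proj(image_feats).unsqueeze(1)            # (B, 1, D)
        x = torch.cat([prefix, self.token_emb(speech_tokens)], dim=1)
        seq_len = x.size(1)
        # Causal mask: each position attends only to the prefix and earlier tokens.
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h = self.decoder(x, mask=mask)
        # Logits at position i predict speech token i (drop the final position).
        return self.lm_head(h[:, :-1])


if __name__ == "__main__":
    model = ImageConditionedSpeechLM()
    imgs = torch.randn(2, 512)                  # stand-in image embeddings
    toks = torch.randint(0, 1024, (2, 20))      # stand-in speech-unit ids
    logits = model(imgs, toks)                  # (2, 20, 1024) next-token logits
    print(logits.shape)

In this sketch a single image vector stands in for whatever visual features a real system would extract, and the model is a small toy; the thesis models are large multimodal foundation models trained at scale.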
Date issued
2025-09
URI
https://hdl.handle.net/1721.1/164660
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Doctoral Theses
