DSpace@MIT

Steering Vision at Scale: From the Model Weights to Training Data

Author(s)
Materzyńska, Joanna
Download
Thesis PDF (262.9 MB)
Advisor
Torralba, Antonio
Terms of use
In Copyright - Educational Use Permitted. Copyright retained by author(s). https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
We study the interpretability and controllability of multimodal and generative models, with a particular focus on text–image representation models and text-to-image diffusion systems. We begin by addressing limitations in CLIP’s multimodal embeddings, specifically the entanglement between visual and textual concepts within images. We demonstrate the consequences of this entanglement in both generative and discriminative tasks, and introduce a method for disentangling visual and textual representations. We showcase the utility of these disentangled embeddings in typographic attack resistance, improved image generation, and robust out-of-domain OCR detection. Building on this foundation, we explore methods to enhance the controllability of diffusion models. First, we tackle the challenge of unwanted concept generation. We introduce a technique to remove specific visual concepts using only their names, leveraging negative prompts and guidance to suppress target content without modifying training data or requiring model retraining. This approach enhances ethical alignment and enables greater user control in generative systems. We then turn to the complementary problem: incorporating new concepts. We present a few-shot motion customization technique for video generation models, which transfers motion patterns from a small set of examples to novel subjects. This method maintains the generalization capabilities of the base model while enabling consistent, subject-agnostic animation that preserves both identity and temporal coherence. To improve the fine-grained control of visual outputs, we propose a method for continuous manipulation of image attributes. This framework introduces smooth, intuitive controls that allow for dynamic, continuous steering of generated images. Unlike prompt engineering or token-level interventions, our approach offers real-time adjustment without sacrificing output realism. Finally, we examine whether artistic styles in diffusion models require large-scale pretraining or can be learned in a lightweight, post-training manner. To this end, we train a base model on art-free data and introduce a compact adapter method that learns stylistic concepts from a small set of exemplar artworks. Our findings suggest that artistic domains can be integrated efficiently and ethically, without reliance on web-scale scraped datasets.
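The concept-removal contribution above rests on negative prompts and guidance applied at sampling time. As a rough illustration of that general mechanism only (not the thesis's specific method), the sketch below shows how a negative-prompt embedding can take the place of the usual unconditional embedding in classifier-free guidance; the names denoiser, cond_emb, and neg_emb are hypothetical placeholders rather than identifiers from the work.

    def guided_noise_prediction(denoiser, x_t, t, cond_emb, neg_emb, guidance_scale=7.5):
        # denoiser(x, t, emb) is assumed to return the predicted noise for latent x
        # at timestep t, conditioned on a text embedding emb (hypothetical signature).
        eps_cond = denoiser(x_t, t, cond_emb)  # prediction for the desired prompt
        eps_neg = denoiser(x_t, t, neg_emb)    # prediction for the concept to suppress
        # Classifier-free guidance: start from the negative-prompt prediction and push
        # the sample toward the conditional one, steering generation away from the
        # suppressed concept without modifying training data or retraining the model.
        return eps_neg + guidance_scale * (eps_cond - eps_neg)

One hedged reading of "using only their names" is that neg_emb is simply the text encoder's embedding of the unwanted concept's name; in the standard formulation of classifier-free guidance, neg_emb would instead be the empty-prompt embedding.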
Date issued
2025-09
URI
https://hdl.handle.net/1721.1/164645
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Doctoral Theses
