| dc.description.abstract | We study the interpretability and controllability of multimodal and generative models, with a particular focus on text–image representation models and text-to-image diffusion systems. We begin by addressing limitations in CLIP’s multimodal embeddings, specifically the entanglement between visual and textual concepts within images. We demonstrate the consequences of this entanglement in both generative and discriminative tasks, and introduce a method for disentangling visual and textual representations. We showcase the utility of these disentangled embeddings in typographic attack resistance, improved image generation, and robust out-of-domain OCR detection. Building on this foundation, we explore methods to enhance the controllability of diffusion models. First, we tackle the challenge of unwanted concept generation. We introduce a technique to remove specific visual concepts using only their names, leveraging negative prompts and guidance to suppress target content without modifying training data or retraining the model. This approach enhances ethical alignment and enables greater user control in generative systems. We then turn to the complementary problem: incorporating new concepts. We present a few-shot motion customization technique for video generation models, which transfers motion patterns from a small set of examples to novel subjects. This method maintains the generalization capabilities of the base model while enabling consistent, subject-agnostic animation that preserves both identity and temporal coherence. To improve fine-grained control over visual outputs, we propose a method for continuous manipulation of image attributes. This framework introduces smooth, intuitive controls that allow dynamic, continuous steering of generated images. Unlike prompt engineering or token-level interventions, our approach offers real-time adjustment without sacrificing output realism. Finally, we examine whether artistic styles in diffusion models require large-scale pretraining or can be learned in a lightweight, post-training manner. To this end, we train a base model on art-free data and introduce a compact adapter method that learns stylistic concepts from a small set of exemplar artworks. Our findings suggest that artistic domains can be integrated efficiently and ethically, without reliance on web-scale scraped datasets. | |