| dc.description.abstract | We study the interpretability and controllability of multimodal and generative models, with a particular focus on text–image representation models and text-to-image diffusion systems. We begin by addressing limitations in CLIP’s multimodal embeddings, specifically the entanglement between visual and textual concepts within images. We demonstrate the consequences of this entanglement in both generative and discriminative tasks, and introduce a method for disentangling visual and textual representations. We showcase the utility of these disentangled embeddings in typographic attack resistance, improved image generation, and robust out-of-domain OCR detection. Building on this foundation, we explore methods to enhance the controllability of diffusion models. First, we tackle the challenge of unwanted concept generation. We introduce a technique to remove specific visual concepts using only their names, leveraging negative prompts and guidance to suppress target content without modifying training data or retraining the model. This approach enhances ethical alignment and enables greater user control in generative systems. We then turn to the complementary problem: incorporating new concepts. We present a few-shot motion customization technique for video generation models, which transfers motion patterns from a small set of examples to novel subjects. This method maintains the generalization capabilities of the base model while enabling consistent, subject-agnostic animation that preserves both identity and temporal coherence. To improve fine-grained control over visual outputs, we propose a method for continuous manipulation of image attributes. This framework introduces smooth, intuitive controls that allow dynamic, continuous steering of generated images. Unlike prompt engineering or token-level interventions, our approach offers real-time adjustment without sacrificing output realism. Finally, we examine whether artistic styles in diffusion models require large-scale pretraining or can be learned in a lightweight, post-training manner. To this end, we train a base model on art-free data and introduce a compact adapter method that learns stylistic concepts from a small set of exemplar artworks. Our findings suggest that artistic domains can be integrated efficiently and ethically, without reliance on web-scale scraped datasets. | |