DSpace@MIT
How Data Drives ML Models Performance

Author(s)
Khaddaj, Alaa
Download
Thesis PDF (20.85 MB)
Advisor
Mądry, Aleksander
Terms of use
In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Data has been playing an increasingly important role in the machine learning (ML) pipeline. This thesis deepens our understanding of the effect of data on model performance and reliability. First, we study how the choice of training data affects model performance. We consider a transfer learning setting and present a framework for selecting, from a large pool of data, a pretraining subset that improves model performance on downstream tasks. Our approach, however, requires training multiple target models, which becomes prohibitively expensive at large scale. To that end, we explore using smaller (and cheaper) proxy models to approximate large-model behavior and select the pretraining data using the cheaper model. We show the effectiveness of this approach in two dataset selection settings: language modeling and imitation learning. Second, we explore the role of data in model reliability and consider two different threat models: backdoor attacks and malicious data editing. In the first threat model, an adversary injects a few doctored samples into the training set to control model predictions at inference time. We study the effect of these malicious samples on model behavior and then propose a framework for detecting and removing them from the training data. In the second threat model, an adversary leverages generative models, such as diffusion models, to maliciously modify personal data and generate harmful digital content. We focus on image editing and investigate how we can imperceptibly modify personal images to mitigate editing using diffusion models and raise the cost of harmful content generation. Overall, this thesis contributes to the understanding of the role of data in driving model behavior. Through these efforts, we aim to provide mechanisms for training models that (i) perform better and (ii) are more reliable when deployed in the real world.
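
To make the proxy-model idea in the abstract concrete, here is a minimal, hypothetical sketch of proxy-based pretraining data selection: a small, cheap proxy model supplies a per-example utility estimate, and the top-scoring subset is kept for pretraining the large model. The proxy_utility function and the candidate format are illustrative assumptions, not the thesis's actual estimator.

def select_pretraining_subset(candidates, proxy_utility, k):
    # proxy_utility(x) -> float: a cheap proxy model's estimate of how much
    # example x improves downstream-task performance (a stand-in for the
    # thesis's estimator, which approximates large-model behavior).
    ranked = sorted(candidates, key=proxy_utility, reverse=True)
    return ranked[:k]  # keep the k most useful candidates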
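The backdoor threat model described above can likewise be illustrated with a toy sketch. Assumed details (the corner trigger patch, NumPy image arrays, and the 1% poisoning rate) are illustrative; the thesis studies and defends against such attacks rather than prescribing this particular construction.

import numpy as np

def inject_backdoor(images, labels, target_label, rate=0.01, patch=3):
    # Stamp a small bright patch onto a random fraction of training images
    # and relabel them; a model trained on this data tends to predict
    # target_label whenever the patch appears at inference time.
    images, labels = images.copy(), labels.copy()
    n_poison = max(1, int(rate * len(images)))
    idx = np.random.choice(len(images), n_poison, replace=False)
    images[idx, -patch:, -patch:] = 1.0  # trigger in the bottom-right corner
    labels[idx] = target_label
    return images, labels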
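Finally, the image-immunization idea (imperceptibly perturbing a photo so that diffusion-based editors handle it poorly) can be sketched as projected gradient descent against an image encoder. In this sketch, encoder is a placeholder for, e.g., a latent diffusion model's image encoder, and the hyperparameters are illustrative assumptions rather than the thesis's settings.

import torch
import torch.nn.functional as F

def immunize(image, encoder, eps=8/255, step=1/255, iters=40):
    # Find a bounded perturbation that pushes the image's latent encoding
    # as far as possible from the original, so edits built on that latent
    # degrade: an L-infinity PGD sketch.
    with torch.no_grad():
        original_latent = encoder(image)
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(iters):
        loss = F.mse_loss(encoder(image + delta), original_latent)
        loss.backward()
        with torch.no_grad():
            delta += step * delta.grad.sign()  # ascend: maximize latent drift
            delta.clamp_(-eps, eps)            # keep perturbation imperceptible
            delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()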
Date issued
2025-09
URI
https://hdl.handle.net/1721.1/164640
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Doctoral Theses
