DSpace@MIT
How Data Drives ML Models Performance

Author(s)
Khaddaj, Alaa
Download
Thesis PDF (20.85 MB)
Advisor
Mądry, Aleksander
Terms of use
In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Data has been playing an increasingly important role in the machine learning (ML) pipeline. This thesis deepens our understanding of the effect of data on model performance and reliability. First, we study how the choice of training data affects model performance. We consider a transfer learning setting and present a framework for selecting, from a large pool of data, a pretraining subset that improves model performance on downstream tasks. Our approach, however, requires training multiple target models, which becomes prohibitively expensive at large scale. To that end, we explore using smaller (and cheaper) proxy models to approximate large-model behavior and select the pretraining data using the cheaper model. We show the effectiveness of this approach in two dataset selection settings: language modeling and imitation learning. Second, we explore the role of data in model reliability and consider two different threat models: backdoor attacks and malicious data editing. In the first threat model, an adversary injects a few doctored samples into the training set to control model predictions at inference time. We study the effect of these malicious samples on model behavior and then propose a framework for detecting and removing them from the training data. In the second threat model, an adversary leverages generative models, such as diffusion models, to maliciously modify personal data and generate harmful digital content. We focus on image editing and investigate how we can imperceptibly modify personal images to mitigate editing using diffusion models and raise the cost of harmful content generation. Overall, this thesis contributes to the understanding of the role of data in driving model behavior. Through these efforts, we aim to provide mechanisms for training models that (i) perform better and (ii) are more reliable when deployed in the real world.
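
To make the proxy-model idea in the abstract concrete, here is a minimal, hypothetical sketch of proxy-based pretraining data selection: a small, cheap proxy model supplies a per-example utility estimate, and the top-scoring subset is kept for pretraining the large model. The proxy_utility function and the candidate format are illustrative assumptions, not the thesis's actual estimator.

def select_pretraining_subset(candidates, proxy_utility, k):
    # proxy_utility(x) -> float: a cheap proxy model's estimate of how much
    # example x improves downstream-task performance (a stand-in for the
    # thesis's estimator, which approximates large-model behavior).
    ranked = sorted(candidates, key=proxy_utility, reverse=True)
    return ranked[:k]  # keep the k most useful candidates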
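The backdoor threat model described above can likewise be illustrated with a toy sketch. Assumed details (the corner trigger patch, NumPy image arrays, and the 1% poisoning rate) are illustrative; the thesis studies and defends against such attacks rather than prescribing this particular construction.

import numpy as np

def inject_backdoor(images, labels, target_label, rate=0.01, patch=3):
    # Stamp a small bright patch onto a random fraction of training images
    # and relabel them; a model trained on this data tends to predict
    # target_label whenever the patch appears at inference time.
    images, labels = images.copy(), labels.copy()
    n_poison = max(1, int(rate * len(images)))
    idx = np.random.choice(len(images), n_poison, replace=False)
    images[idx, -patch:, -patch:] = 1.0  # trigger in the bottom-right corner
    labels[idx] = target_label
    return images, labels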
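Finally, the image-immunization idea (imperceptibly perturbing a photo so that diffusion-based editors handle it poorly) can be sketched as projected gradient descent against an image encoder. In this sketch, encoder is a placeholder for, e.g., a latent diffusion model's image encoder, and the hyperparameters are illustrative assumptions rather than the thesis's settings.

import torch
import torch.nn.functional as F

def immunize(image, encoder, eps=8/255, step=1/255, iters=40):
    # Find a bounded perturbation that pushes the image's latent encoding
    # as far as possible from the original, so edits built on that latent
    # degrade: an L-infinity PGD sketch.
    with torch.no_grad():
        original_latent = encoder(image)
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(iters):
        loss = F.mse_loss(encoder(image + delta), original_latent)
        loss.backward()
        with torch.no_grad():
            delta += step * delta.grad.sign()  # ascend: maximize latent drift
            delta.clamp_(-eps, eps)            # keep perturbation imperceptible
            delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()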
Date issued
2025-09
URI
https://hdl.handle.net/1721.1/164640
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Doctoral Theses
