Show simple item record

dc.contributor.advisor: Katz, Boris
dc.contributor.author: Zhang, Chris
dc.date.accessioned: 2025-10-06T17:40:29Z
dc.date.available: 2025-10-06T17:40:29Z
dc.date.issued: 2025-05
dc.date.submitted: 2025-06-23T14:04:43.565Z
dc.identifier.uri: https://hdl.handle.net/1721.1/163030
dc.description.abstract: As modern machine learning systems grow in scale, the inefficiencies of training on large, noisy, and imbalanced datasets have become increasingly pronounced, particularly in computer vision, where real-world data often contain labeling errors, occlusions, and redundancy. While large models can partially compensate by training exhaustively on massive datasets, this indiscriminate approach is computationally expensive and often inefficient. Active data selection offers a more efficient alternative by prioritizing examples that contribute most to model improvement. However, existing selection strategies (such as Rho Loss) still fall short of the optimal achievable performance. In this work, we propose the Gradient Informed Selection Technique (GIST), an active data selection method that prioritizes examples based on their gradient alignment with a small, fixed holdout set. At each training step, GIST computes per-example gradients and selects those that are most aligned with the holdout gradient, thereby guiding model updates toward better generalization. We evaluate GIST on noisy (Clothing1M) and clean (ImageNet) datasets and show that it consistently outperforms baselines across a range of selection ratios, that is, the proportion of a batch of data that the model selects to update weights on. To address the computational overhead of gradient-based selection, we introduce efficient variants using restricted-layer gradients, low-rank approximations, and gradient quantization. We also analyze GIST's selection behavior, showing that it implicitly balances classes and repeatedly selects high-utility examples, two factors that enhance both robustness and learning efficiency. Our findings suggest that a more effective data curriculum is both discoverable and practical, and that GIST is a step toward achieving it.
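The abstract's core selection rule (score each batch example by how well its gradient aligns with the holdout-set gradient, then keep the top fraction) can be sketched as follows. This is a minimal illustration of gradient-alignment selection, not the thesis's actual implementation: the function name `gist_select`, the use of cosine similarity, and the top-k rounding are all assumptions made for the sketch.

```python
import numpy as np

def gist_select(per_example_grads, holdout_grad, select_ratio):
    """Return indices of the batch examples whose gradients are most
    aligned (by cosine similarity) with the holdout-set gradient.

    Hypothetical sketch: the real method operates on model gradients
    during training and includes efficiency variants (restricted layers,
    low-rank approximations, quantization) not shown here.
    """
    g = np.asarray(per_example_grads, dtype=float)   # shape (n, d)
    h = np.asarray(holdout_grad, dtype=float)        # shape (d,)
    # Cosine similarity between each example's gradient and the holdout gradient.
    sims = (g @ h) / (np.linalg.norm(g, axis=1) * np.linalg.norm(h) + 1e-12)
    # Keep the top `select_ratio` fraction of the batch (at least one example).
    k = max(1, int(round(select_ratio * len(g))))
    return np.argsort(-sims)[:k]

# Toy example: example 0 points along the holdout gradient, example 2 against it.
grads = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
holdout = [1.0, 0.0]
print(gist_select(grads, holdout, 0.34))  # selects the best-aligned example
```

In practice the selected indices would determine which examples of the batch contribute to the weight update, with the holdout gradient recomputed (or cached) as training progresses.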
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Pushing the Limits of Active Data Selection with Gradient Matching
dc.type: Thesis
dc.description.degree: M.Eng.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Master
thesis.degree.name: Master of Engineering in Electrical Engineering and Computer Science


