Pushing the Limits of Active Data Selection with Gradient Matching
Author(s)
Zhang, Chris
Advisor
Katz, Boris
Abstract
As modern machine learning systems grow in scale, the inefficiencies of training on large, noisy, and imbalanced datasets have become increasingly pronounced—particularly in computer vision, where real-world data often contain labeling errors, occlusions, and redundancy. While large models can partially compensate by training exhaustively on massive datasets, this indiscriminate approach is computationally expensive and often inefficient. Active data selection offers a more efficient alternative by prioritizing examples that contribute most to model improvement. However, existing selection strategies (such as Rho Loss) still fall short of the optimal achievable performance. In this work, we propose the Gradient Informed Selection Technique (GIST), an active data selection method that prioritizes examples based on their gradient alignment with a small, fixed holdout set. At each training step, GIST computes per-example gradients and selects those that are most aligned with the holdout gradient, thereby guiding model updates toward better generalization. We evaluate GIST on noisy (Clothing1M) and clean (ImageNet) datasets and show that it consistently outperforms baselines across a range of selection ratios—that is, the proportion of each batch that the model selects to update its weights on. To address the computational overhead of gradient-based selection, we introduce efficient variants using restricted-layer gradients, low-rank approximations, and gradient quantization. We also analyze GIST’s selection behavior, showing that it implicitly balances classes and repeatedly selects high-utility examples—two factors that enhance both robustness and learning efficiency. Our findings suggest that a more effective data curriculum is both discoverable and practical, and that GIST is a step toward achieving it.
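The selection rule described above—scoring each candidate example by how well its gradient aligns with the mean gradient of a fixed holdout set, then keeping the top fraction of the batch—can be sketched as follows. This is a minimal toy illustration using a linear regression model and cosine-similarity scoring, not the thesis's actual implementation; all variable names and the choice of model are illustrative assumptions.

```python
import numpy as np

# Toy sketch of gradient-alignment selection (illustrative, not the thesis code).
# Model: linear regression w, per-example loss 0.5 * (w.x - y)^2,
# so the per-example gradient is (w.x - y) * x.

rng = np.random.default_rng(0)
d, n_batch, n_holdout = 5, 32, 8
k = 8  # number of examples kept; selection ratio = k / n_batch

w = rng.normal(size=d)
X_batch = rng.normal(size=(n_batch, d))
y_batch = rng.normal(size=n_batch)
X_hold = rng.normal(size=(n_holdout, d))
y_hold = rng.normal(size=n_holdout)

# Per-example gradients for the candidate batch: shape (n_batch, d).
g_batch = (X_batch @ w - y_batch)[:, None] * X_batch

# Mean gradient on the small, fixed holdout set: shape (d,).
g_hold = ((X_hold @ w - y_hold)[:, None] * X_hold).mean(axis=0)

# Alignment score: cosine similarity between each example's gradient
# and the holdout gradient.
scores = (g_batch @ g_hold) / (
    np.linalg.norm(g_batch, axis=1) * np.linalg.norm(g_hold) + 1e-12
)

# Keep the k most-aligned examples; only these would be used for the update.
selected = np.argsort(scores)[-k:]
```

In a deep-learning setting, the same scoring step would be applied to per-example network gradients (or, for efficiency, to the restricted-layer, low-rank, or quantized variants the abstract mentions), since materializing full per-example gradients for a large model is the dominant cost.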
Date issued
2025-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology