Pushing the Limits of Active Data Selection with Gradient Matching
Author(s)
Zhang, Chris
Advisor
Katz, Boris
Abstract
As modern machine learning systems grow in scale, the inefficiencies of training on large, noisy, and imbalanced datasets have become increasingly pronounced—particularly in computer vision, where real-world data often contain labeling errors, occlusions, and redundancy. While large models can partially compensate by training exhaustively on massive datasets, this indiscriminate approach is computationally expensive and often inefficient. Active data selection offers a more efficient alternative by prioritizing examples that contribute most to model improvement. However, existing selection strategies (such as Rho Loss) still fall short of the optimal achievable performance. In this work, we propose the Gradient Informed Selection Technique (GIST), an active data selection method that prioritizes examples based on their gradient alignment with a small, fixed holdout set. At each training step, GIST computes per-example gradients and selects those that are most aligned with the holdout gradient, thereby guiding model updates toward better generalization. We evaluate GIST on noisy (Clothing1M) and clean (ImageNet) datasets and show that it consistently outperforms baselines across a range of selection ratios—that is, the proportion of each batch that the model selects to update its weights on. To address the computational overhead of gradient-based selection, we introduce efficient variants using restricted-layer gradients, low-rank approximations, and gradient quantization. We also analyze GIST’s selection behavior, showing that it implicitly balances classes and repeatedly selects high-utility examples—two factors that enhance both robustness and learning efficiency. Our findings suggest that a more effective data curriculum is both discoverable and practical, and that GIST is a step toward achieving it.
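The selection rule described above—scoring each candidate example by how well its gradient aligns with the mean gradient of a fixed holdout set, then keeping the top fraction of the batch—can be sketched as follows. This is a minimal toy illustration using a linear regression model and cosine-similarity scoring, not the thesis's actual implementation; all variable names and the choice of model are illustrative assumptions.

```python
import numpy as np

# Toy sketch of gradient-alignment selection (illustrative, not the thesis code).
# Model: linear regression w, per-example loss 0.5 * (w.x - y)^2,
# so the per-example gradient is (w.x - y) * x.

rng = np.random.default_rng(0)
d, n_batch, n_holdout = 5, 32, 8
k = 8  # number of examples kept; selection ratio = k / n_batch

w = rng.normal(size=d)
X_batch = rng.normal(size=(n_batch, d))
y_batch = rng.normal(size=n_batch)
X_hold = rng.normal(size=(n_holdout, d))
y_hold = rng.normal(size=n_holdout)

# Per-example gradients for the candidate batch: shape (n_batch, d).
g_batch = (X_batch @ w - y_batch)[:, None] * X_batch

# Mean gradient on the small, fixed holdout set: shape (d,).
g_hold = ((X_hold @ w - y_hold)[:, None] * X_hold).mean(axis=0)

# Alignment score: cosine similarity between each example's gradient
# and the holdout gradient.
scores = (g_batch @ g_hold) / (
    np.linalg.norm(g_batch, axis=1) * np.linalg.norm(g_hold) + 1e-12
)

# Keep the k most-aligned examples; only these would be used for the update.
selected = np.argsort(scores)[-k:]
```

In a deep-learning setting, the same scoring step would be applied to per-example network gradients (or, for efficiency, to the restricted-layer, low-rank, or quantized variants the abstract mentions), since materializing full per-example gradients for a large model is the dominant cost.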
Date issued
2025-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology