Show simple item record

dc.contributor.advisor: Katz, Boris
dc.contributor.author: Zhang, Chris
dc.date.accessioned: 2025-10-06T17:40:29Z
dc.date.available: 2025-10-06T17:40:29Z
dc.date.issued: 2025-05
dc.date.submitted: 2025-06-23T14:04:43.565Z
dc.identifier.uri: https://hdl.handle.net/1721.1/163030
dc.description.abstract: As modern machine learning systems grow in scale, the inefficiencies of training on large, noisy, and imbalanced datasets have become increasingly pronounced, particularly in computer vision, where real-world data often contain labeling errors, occlusions, and redundancy. While large models can partially compensate by training exhaustively on massive datasets, this indiscriminate approach is computationally expensive and often inefficient. Active data selection offers a more efficient alternative by prioritizing examples that contribute most to model improvement. However, existing selection strategies (such as Rho Loss) still fall short of the optimal achievable performance. In this work, we propose the Gradient Informed Selection Technique (GIST), an active data selection method that prioritizes examples based on their gradient alignment with a small, fixed holdout set. At each training step, GIST computes per-example gradients and selects those that are most aligned with the holdout gradient, thereby guiding model updates toward better generalization. We evaluate GIST on noisy (Clothing1M) and clean (ImageNet) datasets and show that it consistently outperforms baselines across a range of selection ratios, that is, the proportion of a batch of data that the model selects to update weights on. To address the computational overhead of gradient-based selection, we introduce efficient variants using restricted-layer gradients, low-rank approximations, and gradient quantization. We also analyze GIST's selection behavior, showing that it implicitly balances classes and repeatedly selects high-utility examples, two factors that enhance both robustness and learning efficiency. Our findings suggest that a more effective data curriculum is both discoverable and practical, and that GIST is a step toward achieving it.
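The abstract's core selection rule (score each batch example by how well its gradient aligns with the holdout-set gradient, then keep the top fraction) can be sketched as follows. This is a minimal illustration of gradient-alignment selection, not the thesis's actual implementation: the function name `gist_select`, the use of cosine similarity, and the top-k rounding are all assumptions made for the sketch.

```python
import numpy as np

def gist_select(per_example_grads, holdout_grad, select_ratio):
    """Return indices of the batch examples whose gradients are most
    aligned (by cosine similarity) with the holdout-set gradient.

    Hypothetical sketch: the real method operates on model gradients
    during training and includes efficiency variants (restricted layers,
    low-rank approximations, quantization) not shown here.
    """
    g = np.asarray(per_example_grads, dtype=float)   # shape (n, d)
    h = np.asarray(holdout_grad, dtype=float)        # shape (d,)
    # Cosine similarity between each example's gradient and the holdout gradient.
    sims = (g @ h) / (np.linalg.norm(g, axis=1) * np.linalg.norm(h) + 1e-12)
    # Keep the top `select_ratio` fraction of the batch (at least one example).
    k = max(1, int(round(select_ratio * len(g))))
    return np.argsort(-sims)[:k]

# Toy example: example 0 points along the holdout gradient, example 2 against it.
grads = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
holdout = [1.0, 0.0]
print(gist_select(grads, holdout, 0.34))  # selects the best-aligned example
```

In practice the selected indices would determine which examples of the batch contribute to the weight update, with the holdout gradient recomputed (or cached) as training progresses.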
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Pushing the Limits of Active Data Selection with Gradient Matching
dc.type: Thesis
dc.description.degree: M.Eng.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Master
thesis.degree.name: Master of Engineering in Electrical Engineering and Computer Science


