MIT Libraries logoDSpace@MIT

MIT
View Item 
  • DSpace@MIT Home
  • MIT Libraries
  • MIT Theses
  • Graduate Theses
  • View Item
  • DSpace@MIT Home
  • MIT Libraries
  • MIT Theses
  • Graduate Theses
  • View Item
JavaScript is disabled for your browser. Some features of this site may not work without it.

AutoDiff: A Scalable Framework for Automated Model Comparison

Author(s)
Woo, Andrew Kyoungwan
Thumbnail
DownloadThesis PDF (1.492Mb)
Advisor
Torralba, Antonio
Terms of use
In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/
Metadata
Show full item record
Abstract
Post-training adaptations such as supervised fine-tuning, quantization, and reinforcement learning can cause large language models (LLMs) with identical architectures to exhibit divergent behaviors. However, the mechanisms driving these behavioral shifts remain largely opaque, limiting the reliability and interpretability of adapted models. AutoDiff is a scalable, automated framework for tracing model divergence on a per-neuron basis. It exhaustively profiles every feed-forward (MLP) unit across a pair of models, identifies the neurons with the largest activation gaps, and links these differences to downstream behavioral changes. The pipeline identifies exemplars that maximize between-model activation divergence and clusters the highest-gap neurons into an interpretable, queryable difference report. Proof-ofconcept experiments on GPT-2 small validate AutoDiff’s ability to rediscover synthetic perturbations without manual supervision. A larger case study on Llama3.1–8B contrasts the base model with several adapted variants, surfacing neurons whose behavioral shifts align with observed topic-level gains and losses. By uncovering these mechanistic divergences, AutoDiff transforms black-box model updates into actionable insights, enabling safer deployment, principled debugging, and interpretable model evaluation.
Date issued
2025-05
URI
https://hdl.handle.net/1721.1/163027
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses

Browse

All of DSpaceCommunities & CollectionsBy Issue DateAuthorsTitlesSubjectsThis CollectionBy Issue DateAuthorsTitlesSubjects

My Account

Login

Statistics

OA StatisticsStatistics by CountryStatistics by Department
MIT Libraries
PrivacyPermissionsAccessibilityContact us
MIT
Content created by the MIT Libraries, CC BY-NC unless otherwise noted. Notify us about copyright concerns.