AutoDiff: A Scalable Framework for Automated Model Comparison
Author(s)
Woo, Andrew Kyoungwan
Advisor
Torralba, Antonio
Abstract
Post-training adaptations such as supervised fine-tuning, quantization, and reinforcement learning can cause large language models (LLMs) with identical architectures to exhibit divergent behaviors. However, the mechanisms driving these behavioral shifts remain largely opaque, limiting the reliability and interpretability of adapted models. AutoDiff is a scalable, automated framework for tracing model divergence on a per-neuron basis. It exhaustively profiles every feed-forward (MLP) unit across a pair of models, identifies the neurons with the largest activation gaps, and links these differences to downstream behavioral changes. The pipeline surfaces exemplars that maximize between-model activation divergence and clusters the highest-gap neurons into an interpretable, queryable difference report. Proof-of-concept experiments on GPT-2 small validate AutoDiff's ability to rediscover synthetic perturbations without manual supervision. A larger case study on Llama 3.1 8B contrasts the base model with several adapted variants, surfacing neurons whose behavioral shifts align with observed topic-level gains and losses. By uncovering these mechanistic divergences, AutoDiff transforms black-box model updates into actionable insights, enabling safer deployment, principled debugging, and interpretable model evaluation.
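The abstract describes a per-neuron comparison over MLP units of two architecturally identical models. The thesis implementation is not reproduced here; the following is a minimal sketch, assuming a PyTorch/Transformers setup, GPT-2 as the model pair, forward hooks on each block's MLP activation module, and mean absolute activation difference as the gap statistic. The model names, prompts, and helper function are illustrative placeholders, not the thesis's actual pipeline.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_mlp_activations(model, tok, prompts, device="cpu"):
    """Return {layer_idx: mean activation per MLP hidden unit} over the prompts."""
    sums, counts, hooks = {}, {}, []

    def make_hook(idx):
        def hook(_module, _inp, out):
            # out has shape (batch, seq, d_mlp); accumulate per-neuron sums over tokens.
            flat = out.detach().float().reshape(-1, out.shape[-1])
            sums[idx] = sums.get(idx, 0) + flat.sum(dim=0)
            counts[idx] = counts.get(idx, 0) + flat.shape[0]
        return hook

    # GPT-2 exposes the post-nonlinearity MLP activation via transformer.h[i].mlp.act.
    for i, block in enumerate(model.transformer.h):
        hooks.append(block.mlp.act.register_forward_hook(make_hook(i)))

    model.eval()
    with torch.no_grad():
        for p in prompts:
            ids = tok(p, return_tensors="pt").to(device)
            model(**ids)

    for h in hooks:
        h.remove()
    return {i: sums[i] / counts[i] for i in sums}

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")
    base = AutoModelForCausalLM.from_pretrained("gpt2")
    variant = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for an adapted checkpoint
    prompts = ["The capital of France is", "def fibonacci(n):"]  # illustrative probe set

    acts_a = mean_mlp_activations(base, tok, prompts)
    acts_b = mean_mlp_activations(variant, tok, prompts)

    # Rank neurons by absolute mean-activation gap between the two models.
    gaps = []
    for layer in acts_a:
        diff = (acts_a[layer] - acts_b[layer]).abs()
        top = torch.topk(diff, k=5)
        gaps.extend((layer, int(n), float(g)) for g, n in zip(top.values, top.indices))
    for layer, neuron, gap in sorted(gaps, key=lambda t: -t[2])[:10]:
        print(f"layer {layer:2d} neuron {neuron:5d} gap {gap:.4f}")

In this sketch the ranking is a simple top-k over mean-activation gaps; the framework described above additionally searches for divergence-maximizing exemplars and clusters the highest-gap neurons into a queryable report.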
Date issued
2025-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology