AutoDiff: A Scalable Framework for Automated Model Comparison
Author(s)
Woo, Andrew Kyoungwan
Advisor
Torralba, Antonio
Abstract
Post-training adaptations such as supervised fine-tuning, quantization, and reinforcement learning can cause large language models (LLMs) with identical architectures to exhibit divergent behaviors. However, the mechanisms driving these behavioral shifts remain largely opaque, limiting the reliability and interpretability of adapted models. AutoDiff is a scalable, automated framework for tracing model divergence on a per-neuron basis. It exhaustively profiles every feed-forward (MLP) unit across a pair of models, identifies the neurons with the largest activation gaps, and links these differences to downstream behavioral changes. The pipeline surfaces exemplars that maximize between-model activation divergence and clusters the highest-gap neurons into an interpretable, queryable difference report. Proof-of-concept experiments on GPT-2 small validate AutoDiff's ability to rediscover synthetic perturbations without manual supervision. A larger case study on Llama 3.1 8B contrasts the base model with several adapted variants, surfacing neurons whose behavioral shifts align with observed topic-level gains and losses. By uncovering these mechanistic divergences, AutoDiff transforms black-box model updates into actionable insights, enabling safer deployment, principled debugging, and interpretable model evaluation.
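The abstract describes a per-neuron comparison over MLP units of two architecturally identical models. The thesis implementation is not reproduced here; the following is a minimal sketch, assuming a PyTorch/Transformers setup, GPT-2 as the model pair, forward hooks on each block's MLP activation module, and mean absolute activation difference as the gap statistic. The model names, prompts, and helper function are illustrative placeholders, not the thesis's actual pipeline.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_mlp_activations(model, tok, prompts, device="cpu"):
    """Return {layer_idx: mean activation per MLP hidden unit} over the prompts."""
    sums, counts, hooks = {}, {}, []

    def make_hook(idx):
        def hook(_module, _inp, out):
            # out has shape (batch, seq, d_mlp); accumulate per-neuron sums over tokens.
            flat = out.detach().float().reshape(-1, out.shape[-1])
            sums[idx] = sums.get(idx, 0) + flat.sum(dim=0)
            counts[idx] = counts.get(idx, 0) + flat.shape[0]
        return hook

    # GPT-2 exposes the post-nonlinearity MLP activation via transformer.h[i].mlp.act.
    for i, block in enumerate(model.transformer.h):
        hooks.append(block.mlp.act.register_forward_hook(make_hook(i)))

    model.eval()
    with torch.no_grad():
        for p in prompts:
            ids = tok(p, return_tensors="pt").to(device)
            model(**ids)

    for h in hooks:
        h.remove()
    return {i: sums[i] / counts[i] for i in sums}

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("gpt2")
    base = AutoModelForCausalLM.from_pretrained("gpt2")
    variant = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for an adapted checkpoint
    prompts = ["The capital of France is", "def fibonacci(n):"]  # illustrative probe set

    acts_a = mean_mlp_activations(base, tok, prompts)
    acts_b = mean_mlp_activations(variant, tok, prompts)

    # Rank neurons by absolute mean-activation gap between the two models.
    gaps = []
    for layer in acts_a:
        diff = (acts_a[layer] - acts_b[layer]).abs()
        top = torch.topk(diff, k=5)
        gaps.extend((layer, int(n), float(g)) for g, n in zip(top.values, top.indices))
    for layer, neuron, gap in sorted(gaps, key=lambda t: -t[2])[:10]:
        print(f"layer {layer:2d} neuron {neuron:5d} gap {gap:.4f}")

In this sketch the ranking is a simple top-k over mean-activation gaps; the framework described above additionally searches for divergence-maximizing exemplars and clusters the highest-gap neurons into a queryable report.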
Date issued
2025-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology