dc.contributor.advisor: Torralba, Antonio
dc.contributor.author: Woo, Andrew Kyoungwan
dc.date.accessioned: 2025-10-06T17:40:21Z
dc.date.available: 2025-10-06T17:40:21Z
dc.date.issued: 2025-05
dc.date.submitted: 2025-06-23T14:04:15.144Z
dc.identifier.uri: https://hdl.handle.net/1721.1/163027
dc.description.abstract: Post-training adaptations such as supervised fine-tuning, quantization, and reinforcement learning can cause large language models (LLMs) with identical architectures to exhibit divergent behaviors. However, the mechanisms driving these behavioral shifts remain largely opaque, limiting the reliability and interpretability of adapted models. AutoDiff is a scalable, automated framework for tracing model divergence on a per-neuron basis. It exhaustively profiles every feed-forward (MLP) unit across a pair of models, identifies the neurons with the largest activation gaps, and links these differences to downstream behavioral changes. The pipeline identifies exemplars that maximize between-model activation divergence and clusters the highest-gap neurons into an interpretable, queryable difference report. Proof-of-concept experiments on GPT-2 small validate AutoDiff’s ability to rediscover synthetic perturbations without manual supervision. A larger case study on Llama-3.1-8B contrasts the base model with several adapted variants, surfacing neurons whose behavioral shifts align with observed topic-level gains and losses. By uncovering these mechanistic divergences, AutoDiff transforms black-box model updates into actionable insights, enabling safer deployment, principled debugging, and interpretable model evaluation.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: AutoDiff: A Scalable Framework for Automated Model Comparison
dc.type: Thesis
dc.description.degree: M.Eng.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Master
thesis.degree.name: Master of Engineering in Electrical Engineering and Computer Science
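The ranking step summarized in the abstract can be illustrated with a short sketch. It profiles the same MLP units in two models, ranks them by activation gap, and picks the input that diverges most for each top neuron. This is a minimal illustration under stated assumptions, not AutoDiff's implementation. It assumes activations have already been extracted into arrays of shape (inputs, neurons) with one scalar per input and neuron, and it uses mean absolute difference as the gap metric. The thesis's clustering step is replaced here by a simple up/down split. The function name `rank_neuron_gaps` and the synthetic data are hypothetical.

```python
import numpy as np


def rank_neuron_gaps(acts_a, acts_b, top_k=10):
    """Hypothetical sketch: rank neurons by mean |activation gap| between two models.

    acts_a, acts_b: arrays of shape (n_inputs, n_neurons), e.g. MLP activations
    pooled over tokens, recorded from two same-architecture models on the same inputs.
    """
    acts_a = np.asarray(acts_a, dtype=float)
    acts_b = np.asarray(acts_b, dtype=float)
    if acts_a.shape != acts_b.shape:
        raise ValueError("both models must be profiled on the same inputs and neurons")

    diff = acts_b - acts_a                   # signed gap per (input, neuron)
    gap = np.abs(diff).mean(axis=0)          # mean absolute gap per neuron
    top = np.argsort(gap)[::-1][:top_k]      # neuron indices, largest gap first

    report = []
    for j in top:
        # Exemplar: the input on which the two models disagree most for neuron j.
        exemplar = int(np.argmax(np.abs(diff[:, j])))
        report.append({
            "neuron": int(j),
            "mean_abs_gap": float(gap[j]),
            # Simplification standing in for clustering: more vs. less active in model B.
            "direction": "up" if diff[:, j].mean() > 0 else "down",
            "exemplar": exemplar,
        })
    return report


# Usage: rediscover synthetic perturbations in a toy activation matrix.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 64))
tuned = base.copy()
tuned[:, 3] += 2.0    # neuron 3 amplified on every input
tuned[:, 7] -= 1.5    # neuron 7 suppressed on every input
tuned[42, 7] -= 5.0   # input 42 diverges far more on neuron 7

report = rank_neuron_gaps(base, tuned, top_k=2)
for row in report:
    print(row)
```

Here only neurons 3 and 7 differ between the two matrices, so they are the two entries in the report, with input 42 as the exemplar for neuron 7. Mean absolute difference is used because it is the simplest gap measure that does not let positive and negative shifts cancel.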

