DSpace@MIT


Learning to Interpret Language Model Diffs

Author(s)
Goel, Avichal
Advisor
Kim, Yoon
Terms of use
In Copyright - Educational Use Permitted Copyright retained by author(s) https://rightsstatements.org/page/InC-EDU/1.0/
Abstract
Finetuning-induced changes to a model’s weights (a “model diff”) are semantically meaningful but often difficult to interpret. This makes us wonder: can we describe the content of an unknown model diff using natural language? We introduce diff interpretation training, a method that teaches a model to describe its own finetuning-induced modifications. Our approach uses synthetic model diffs to train a lightweight adapter, which in turn can be applied to a compatible finetuned model to make it self-describing. Using two simple task settings, we demonstrate that our method can successfully decode model diffs into accurate natural language descriptions.
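To make the term "model diff" concrete, the following is a minimal sketch that assumes toy weights stored as plain Python lists. The function names (`weight_diff`, `apply_diff`) and the parameter name `layer.weight` are hypothetical and chosen only for illustration. The abstract does not specify the thesis's models, adapter architecture, or training procedure, and none of them are represented here.

```python
# Illustrative only: weights are dicts mapping a parameter name to a list of floats.
# weight_diff and apply_diff are hypothetical helpers, not code from the thesis.

def weight_diff(base, finetuned):
    """Return the per-parameter difference finetuned - base (a "model diff")."""
    return {
        name: [f - b for f, b in zip(finetuned[name], base[name])]
        for name in base
    }

def apply_diff(weights, diff):
    """Add a diff onto a set of weights; parameters missing from diff are unchanged."""
    return {
        name: [
            w + d
            for w, d in zip(weights[name], diff.get(name, [0.0] * len(weights[name])))
        ]
        for name in weights
    }

# Toy base and finetuned weights (assumed values).
base = {"layer.weight": [0.5, -1.0, 2.0]}
finetuned = {"layer.weight": [0.75, -1.0, 1.5]}

diff = weight_diff(base, finetuned)   # what finetuning changed
restored = apply_diff(base, diff)     # base + diff recovers the finetuned weights
```

In this toy setting the diff is a short list of numbers that can be read directly. For a real model it spans millions of parameters or more, which is why the abstract frames describing it in natural language as a separate problem.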
Date issued
2025-09
URI
https://hdl.handle.net/1721.1/164839
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology

Collections
  • Graduate Theses
