Show simple item record

dc.contributor.advisor: Kim, Yoon
dc.contributor.author: Goel, Avichal
dc.date.accessioned: 2026-02-12T17:13:52Z
dc.date.available: 2026-02-12T17:13:52Z
dc.date.issued: 2025-09
dc.date.submitted: 2025-09-15T14:56:30.033Z
dc.identifier.uri: https://hdl.handle.net/1721.1/164839
dc.description.abstract: Finetuning-induced changes to a model’s weights (a “model diff”) are semantically meaningful but often difficult to interpret. This makes us wonder: can we describe the content of an unknown model diff using natural language? We introduce diff interpretation training, a method that teaches a model to describe its own finetuning-induced modifications. Our approach uses synthetic model diffs to train a lightweight adapter, which in turn can be applied to a compatible finetuned model to make it self-describing. Using two simple task settings, we demonstrate that our method can successfully decode model diffs into accurate natural language descriptions.
dc.publisher: Massachusetts Institute of Technology
dc.rights: In Copyright - Educational Use Permitted
dc.rights: Copyright retained by author(s)
dc.rights.uri: https://rightsstatements.org/page/InC-EDU/1.0/
dc.title: Learning to Interpret Language Model Diffs
dc.type: Thesis
dc.description.degree: M.Eng.
dc.contributor.department: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
mit.thesis.degree: Master
thesis.degree.name: Master of Engineering in Electrical Engineering and Computer Science


