| dc.contributor.advisor | Kim, Yoon | |
| dc.contributor.author | Goel, Avichal | |
| dc.date.accessioned | 2026-02-12T17:13:52Z | |
| dc.date.available | 2026-02-12T17:13:52Z | |
| dc.date.issued | 2025-09 | |
| dc.date.submitted | 2025-09-15T14:56:30.033Z | |
| dc.identifier.uri | https://hdl.handle.net/1721.1/164839 | |
| dc.description.abstract | Finetuning-induced changes to a model’s weights (a “model diff”) are semantically meaningful but often difficult to interpret. This makes us wonder: can we describe the content of an unknown model diff using natural language? We introduce diff interpretation training, a method that teaches a model to describe its own finetuning-induced modifications. Our approach uses synthetic model diffs to train a lightweight adapter, which in turn can be applied to a compatible finetuned model to make it self-describing. Using two simple task settings, we demonstrate that our method can successfully decode model diffs into accurate natural language descriptions. | |
| dc.publisher | Massachusetts Institute of Technology | |
| dc.rights | In Copyright - Educational Use Permitted | |
| dc.rights | Copyright retained by author(s) | |
| dc.rights.uri | https://rightsstatements.org/page/InC-EDU/1.0/ | |
| dc.title | Learning to Interpret Language Model Diffs | |
| dc.type | Thesis | |
| dc.description.degree | M.Eng. | |
| dc.contributor.department | Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science | |
| mit.thesis.degree | Master | |
| thesis.degree.name | Master of Engineering in Electrical Engineering and Computer Science | |