Learning to Interpret Language Model Diffs
Author(s)
Goel, Avichal
Advisor
Kim, Yoon
Abstract
Finetuning-induced changes to a model’s weights (a “model diff”) are semantically meaningful but often difficult to interpret. This raises a question: can we describe the content of an unknown model diff using natural language? We introduce diff interpretation training, a method that teaches a model to describe its own finetuning-induced modifications. Our approach uses synthetic model diffs to train a lightweight adapter, which can then be applied to a compatible finetuned model to make it self-describing. In two simple task settings, we demonstrate that our method decodes model diffs into accurate natural language descriptions.
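
To make the notion of a “model diff” concrete, here is a minimal Python sketch using PyTorch and Hugging Face Transformers. The model name "gpt2" and the use of two copies of the same checkpoint are illustrative assumptions only; the thesis does not specify this setup.

import torch
from transformers import AutoModelForCausalLM

# Illustrative stand-ins for a base model and a finetuned variant of it
# (in practice the finetuned checkpoint would differ from the base).
base = AutoModelForCausalLM.from_pretrained("gpt2")
finetuned = AutoModelForCausalLM.from_pretrained("gpt2")

# A model diff is the elementwise difference of matching parameters.
base_params = dict(base.named_parameters())
diff = {
    name: p_ft.detach() - base_params[name].detach()
    for name, p_ft in finetuned.named_parameters()
}

# Adding the diff back onto a compatible base recovers the finetuned
# weights, so a diff can be stored and studied on its own.
with torch.no_grad():
    for name, param in base.named_parameters():
        param.add_(diff[name])

Per the abstract, the method then trains a lightweight adapter on synthetic diffs of this form so that a model carrying both a diff and the adapter can answer questions about its own modification in natural language.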
Date issued
2025-09
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology