Learning to Interpret Language Model Diffs
Author(s)
Goel, Avichal
Advisor
Kim, Yoon
Abstract
Finetuning-induced changes to a model’s weights (a “model diff”) are semantically meaningful but often difficult to interpret. This raises a question: can we describe the content of an unknown model diff using natural language? We introduce diff interpretation training, a method that teaches a model to describe its own finetuning-induced modifications. Our approach uses synthetic model diffs to train a lightweight adapter, which can then be applied to a compatible finetuned model to make it self-describing. In two simple task settings, we demonstrate that our method decodes model diffs into accurate natural language descriptions.
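
To make the notion of a “model diff” concrete, here is a minimal Python sketch using PyTorch and Hugging Face Transformers. The model name "gpt2" and the use of two copies of the same checkpoint are illustrative assumptions only; the thesis does not specify this setup.

import torch
from transformers import AutoModelForCausalLM

# Illustrative stand-ins for a base model and a finetuned variant of it
# (in practice the finetuned checkpoint would differ from the base).
base = AutoModelForCausalLM.from_pretrained("gpt2")
finetuned = AutoModelForCausalLM.from_pretrained("gpt2")

# A model diff is the elementwise difference of matching parameters.
base_params = dict(base.named_parameters())
diff = {
    name: p_ft.detach() - base_params[name].detach()
    for name, p_ft in finetuned.named_parameters()
}

# Adding the diff back onto a compatible base recovers the finetuned
# weights, so a diff can be stored and studied on its own.
with torch.no_grad():
    for name, param in base.named_parameters():
        param.add_(diff[name])

Per the abstract, the method then trains a lightweight adapter on synthetic diffs of this form so that a model carrying both a diff and the adapter can answer questions about its own modification in natural language.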
Date issued
2025-09
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology