Explaining Black-Box Classifiers by Implicitly Learning Decision Trees
Author(s)
Lange, Jane
Advisor
Rubinfeld, Ronitt
Abstract
We present algorithms for finding two types of objects that explain the classification of a black-box model f : {±1}^d → {±1} on an instance x ∈ {±1}^d. The first is a certificate: a small set of x’s features that in conjunction essentially determines f(x). The second is a counterfactual: a nearest instance x′ for which f(x′) ≠ f(x). We obtain both algorithms via a connection to the problem of implicitly learning decision trees. The implicit nature of this learning task allows for efficient algorithms even when the complexity of f necessitates an intractably large surrogate decision tree. We solve the implicit learning task by bringing together techniques from learning theory, local computation algorithms, and complexity theory. Our approach of “explaining by implicit learning” shares elements of two previously disparate methods for post-hoc explanations, global and local explanations, and we make the case that it enjoys advantages of both. Our certification algorithm runs in time poly(d, C(f)) and outputs a certificate of size poly(C(f)), where C(f) is the “average certificate complexity” of f. Our counterfactual algorithm runs in time S(f)^{O(Δ_f(x))} · log d, where S(f) is the sensitivity of f (a discrete analogue of the Lipschitz constant) and Δ_f(x) is the distance from x to its nearest counterfactual. We further prove a lower bound of S(f)^{Ω(Δ_f(x))} + Ω(log d) for finding counterfactuals, thereby showing that the guarantees of our algorithm are essentially optimal.
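To make the two explanation objects concrete, here is a minimal brute-force sketch in Python. The helper names (find_certificate, find_counterfactual) and the toy majority function are illustrative assumptions, not from the thesis, and exhaustive enumeration like this runs in time exponential in d; the thesis achieves the poly(d, C(f)) and S(f)^{O(Δ_f(x))} · log d guarantees above via implicit decision-tree learning instead.

```python
# Brute-force illustrations of the two objects defined in the abstract.
# NOT the thesis's algorithms: these enumerate subsets exhaustively.
from itertools import combinations, product

def find_certificate(f, x):
    """Smallest set of x's coordinates that, held fixed, determines f(x)."""
    d, fx = len(x), f(x)
    for k in range(d + 1):                      # try certificate sizes 0, 1, ..., d
        for S in combinations(range(d), k):
            free = [i for i in range(d) if i not in S]
            pinned = True
            for vals in product((-1, 1), repeat=len(free)):
                y = list(x)
                for i, v in zip(free, vals):
                    y[i] = v                    # vary the coordinates outside S
                if f(tuple(y)) != fx:
                    pinned = False
                    break
            if pinned:
                return S                        # x's values on S force f to output f(x)

def find_counterfactual(f, x):
    """Nearest x' in Hamming distance with f(x') != f(x); None if f is constant."""
    d, fx = len(x), f(x)
    for dist in range(1, d + 1):                # grow the search radius Δ_f(x)
        for S in combinations(range(d), dist):
            y = list(x)
            for i in S:
                y[i] = -y[i]                    # flip the coordinates in S
            if f(tuple(y)) != fx:
                return tuple(y)
    return None

# Example: a toy black-box f computing the majority of the first three coordinates.
f = lambda z: 1 if z[0] + z[1] + z[2] > 0 else -1
x = (1, 1, -1, -1, 1)
print(find_certificate(f, x))      # (0, 1): x_0 = x_1 = +1 already forces f(x) = +1
print(find_counterfactual(f, x))   # (-1, 1, -1, -1, 1): flipping x_0 flips f
```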
Date issued
2025-05
Department
Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Publisher
Massachusetts Institute of Technology