Transformer Module Networks for Systematic Generalization in Visual Question Answering

Yamada, Moyuru; D'Amario, Vanessa; Takemoto, Kentaro; Boix, Xavier; Sasaki, Tomotake

Download: CBMM-Memo-121.pdf (1.060Mb)

Abstract

Transformer-based models achieve great performance on Visual Question Answering (VQA). However, when we evaluate them on systematic generalization, i.e., handling novel combinations of known concepts, their performance degrades. Neural Module Networks (NMNs) are a promising approach for systematic generalization that consists of composing modules, i.e., neural networks that each tackle a sub-task. Inspired by Transformers and NMNs, we propose Transformer Module Network (TMN), a novel Transformer-based model for VQA that dynamically composes modules into a question-specific Transformer network. TMNs achieve state-of-the-art systematic generalization performance on three VQA datasets, namely CLEVR-CoGenT, CLOSURE, and GQA-SGL, in some cases improving by more than 30% over standard Transformers.

Date issued

2022-02-03

URI

https://hdl.handle.net/1721.1/139843

Publisher

Center for Brains, Minds and Machines (CBMM)

Series/Report no.

CBMM Memo;121

Collections

CBMM Memo Series