James Mullenbach


Review of "Neural Module Networks" (CVPR 2016)

04 Oct 2017

Paper link

The paper introduces the neural module network, a framework for composing neural networks to solve compositional reasoning tasks. Applied to VQA, the idea is that the compositionality of the model should reflect the compositional nature of the language used to ask the question. Given a novel combination of concepts, the model could in theory answer it by assembling a network from modules in a composition never seen in training. This model could be thought of as a generalized version of “knowing when to look” in image captioning - in effect the authors try to learn when to look (attend), when to classify, when to combine image features, etc. They also introduce a dataset of synthetic images (SHAPES) to focus on modeling compositional concepts rather than visual recognition.
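To make the idea concrete for myself, here is a minimal sketch of what composing modules according to a question-specific layout could look like. This is my own illustration, not the authors' code: the module types (attend, classify) come from the paper, but the internal architectures, names like `Attend`, `Classify`, `run_layout`, and the feature dimensions are simplified stand-ins.

```python
import torch
import torch.nn as nn

class Attend(nn.Module):
    """Produces an attention map over image feature locations for one concept."""
    def __init__(self, feat_dim):
        super().__init__()
        self.conv = nn.Conv2d(feat_dim, 1, kernel_size=1)

    def forward(self, image_feats):                      # (B, C, H, W)
        return torch.sigmoid(self.conv(image_feats))     # (B, 1, H, W)

class Classify(nn.Module):
    """Pools attended image features and predicts a label distribution."""
    def __init__(self, feat_dim, num_labels):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_labels)

    def forward(self, image_feats, attention):
        attended = (image_feats * attention).sum(dim=(2, 3))  # (B, C)
        return self.fc(attended)

# One module instance per (type, argument) pair; a layout picks which to compose.
modules = {
    ("attend", "bird"): Attend(feat_dim=512),
    ("classify", "color"): Classify(feat_dim=512, num_labels=11),
}

def run_layout(image_feats):
    # Layout for "what color is the bird?": classify[color](attend[bird])
    att = modules[("attend", "bird")](image_feats)
    return modules[("classify", "color")](image_feats, att)
```

The key point is that the same `Attend` and `Classify` instances can be reused in many different layouts, which is where the hoped-for generalization to novel compositions comes from.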

The main strength of this paper is the model formulation. The idea is well motivated and well explained at a high level. It seems like a good candidate to be the basis of lots of future work, although I don’t know whether that has been the case in the almost two years since its release. The authors seem to have lots of ideas for extending their model, and helpfully share them with the reader. The visualizations are strong, and figure 2(b) is a great summary of the model execution. They also share useful practical details, like how instances with similar high-level compositional structure can be batched together for efficient training. The strongest part of the model is its flexibility - more modules could be added or removed, the interior network of each module could be altered, the idea could be applied to a different task entirely, and the input language need not even be a human language. It is also helpful to see statistics in table 1 on how ‘compositional’ the questions are in VQA and SHAPES. Time and further applications of this framework will be the judge of the strength of this model, however.
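The batching trick is simple but worth spelling out; here is a rough sketch of how I imagine it, with an assumed example record format (the `layout`, `image_id`, and `answer` fields are my own placeholders, not the paper's data schema).

```python
from collections import defaultdict

def group_by_layout(examples):
    """Group training instances that share the same module layout string,
    so the network assembled for that layout can process them as one batch."""
    groups = defaultdict(list)
    for ex in examples:
        groups[ex["layout"]].append(ex)
    return groups

batches = group_by_layout([
    {"layout": "classify[color](attend[bird])", "image_id": 1, "answer": "red"},
    {"layout": "classify[color](attend[bird])", "image_id": 2, "answer": "blue"},
    {"layout": "measure[is](attend[circle])",   "image_id": 3, "answer": "yes"},
])
# Each value in `batches` can now be stacked and run through the network
# instantiated for that particular layout.
```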

One thing this paper needs is an appendix, as many likely important details behind the model’s success on the VQA task are left out. Though they provide software for their work, they don’t explicitly describe how they translate from parse trees to structured queries. It’s not clear whether they use a pre-trained LeNet or whether it is jointly trained with the rest of the model. When doing classify[color], is that softmax only over the colors? If so, how is the set of colors derived? How big is the vocabulary of this network-structure ‘language’? Including a whole measure module could also be seen as reaching. I would like to have seen a comparison with a more robust parsing model that can capture differences between e.g. “what is flying?” and “what are flying?” Finally, almost no details are given about the generation of the SHAPES dataset.
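To illustrate what I mean about the missing parse-to-query step: the paper relies on a dependency parser plus rules, but the rules themselves are not spelled out. The toy version below is purely my guess at the target format - it pattern-matches raw question strings rather than parse trees, and the rule set, function names, and fallback layout are all hypothetical.

```python
import re

# Illustrative only: maps a few question templates to module layouts like
# classify[color](attend[bird]). The real system works over parse trees.
RULES = [
    (re.compile(r"what color is the (\w+)"), lambda m: f"classify[color](attend[{m.group(1)}])"),
    (re.compile(r"is there a (\w+)"),        lambda m: f"measure[is](attend[{m.group(1)}])"),
    (re.compile(r"how many (\w+)"),          lambda m: f"measure[count](attend[{m.group(1)}])"),
]

def question_to_layout(question):
    q = question.lower().rstrip("?")
    for pattern, build in RULES:
        m = pattern.search(q)
        if m:
            return build(m)
    return "classify[answer](attend[object])"   # hypothetical fallback layout

print(question_to_layout("What color is the bird?"))
# classify[color](attend[bird])
```

Even a short description at this level of detail would have made the pipeline much easier to reproduce.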

Overall, this approach to learning a complex, multimodal task is highly thought-provoking. The results seem promising for their toy dataset, and we will see if this proves useful for other problems in AI in the future.

The view of VQA as a multi-task learning problem is the most thought-provoking idea to me. Multi-task learning has its own community of researchers, which could provide some useful insights for future work. I also wonder what the statistics of tasks would even look like on the VQA dataset. Could we annotate each question with the “task” required to answer it, and get an idea of how many tasks we need to learn to do VQA properly? That problem seems hard in itself, and it reflects one of the main difficulties of the VQA challenge: the open-endedness of the questions.