Towards Unsupervised Visual Reasoning: Do off-the-shelf features know how to reason?

Monika Wysoczańska

supervisor: Tomasz Trzciński



Recent advances in visual representation learning have produced a plethora of powerful features that are ready to use for numerous downstream tasks. In contrast to existing representation evaluations, the goal of this work is to assess how well these features preserve information about objects, such as their spatial location, their visual properties, and their mutual relationships. We propose to do so by evaluating them in the context of visual reasoning, where multiple objects with complex relationships and different attributes are at play. Our underlying assumption is that reasoning performance is strongly correlated with the quality of the visual representations. More specifically, we introduce a protocol for evaluating visual representations on the task of Visual Question Answering. To decouple visual feature extraction from reasoning, we design an attention-based reasoning module of limited capacity that is trained on top of the frozen visual representations to be evaluated, in a spirit similar to standard feature evaluations relying on shallow networks. This involves constraining both the complexity of the reasoning module and the size of the visual features. Using the proposed evaluation framework, we compare two types of off-the-shelf visual representations: densely extracted local features and object-centric ones.
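
To make the evaluation protocol concrete, below is a minimal, hypothetical sketch (in PyTorch) of a limited-capacity reasoning probe trained on frozen visual features: a single cross-attention block in which a question representation attends over visual tokens (dense local features or object-centric ones) before answer classification. All class names, dimensions, and the answer-vocabulary size are illustrative assumptions, not the exact architecture used in this work.

```python
import torch
import torch.nn as nn

class ShallowReasoningProbe(nn.Module):
    """Limited-capacity attention-based reasoning head trained on top of
    frozen visual features (illustrative sketch; dimensions are placeholders)."""

    def __init__(self, visual_dim=256, question_dim=256, hidden_dim=128,
                 num_heads=4, num_answers=28):
        super().__init__()
        # Project frozen visual features and the question into a shared space.
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.question_proj = nn.Linear(question_dim, hidden_dim)
        # Single cross-attention block: the question attends over visual tokens.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads,
                                                batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, visual_feats, question_emb):
        # visual_feats: (B, N, visual_dim) frozen local or object-centric features
        # question_emb: (B, question_dim) pooled question representation
        v = self.visual_proj(visual_feats)                  # (B, N, H)
        q = self.question_proj(question_emb).unsqueeze(1)   # (B, 1, H)
        attended, _ = self.cross_attn(query=q, key=v, value=v)
        return self.classifier(attended.squeeze(1))         # (B, num_answers)


# Usage: features come from a frozen, off-the-shelf extractor; only the probe is trained.
probe = ShallowReasoningProbe()
visual_feats = torch.randn(8, 36, 256)   # e.g., 36 object-centric feature vectors per image
question_emb = torch.randn(8, 256)
logits = probe(visual_feats, question_emb)  # (8, 28) answer scores
```

In this spirit, the probe's capacity (hidden width, number of attention blocks) is deliberately kept small so that differences in VQA accuracy can be attributed to the frozen features rather than to the reasoning module itself.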