Dissertations / Theses: 'Graph Attention Networks'

1

Guo, Dalu. "Attention Networks in Visual Question Answering and Visual Dialog." Thesis, The University of Sydney, 2021. https://hdl.handle.net/2123/25079.


Abstract:

Attention is an essential mechanism by which humans process massive amounts of data: it omits the trivial parts and focuses on the important ones. For example, to reconstruct the source we need only remember the keywords of a long sentence and the principal objects in an image. It is therefore crucial to build attention networks that let artificial intelligence solve problems as humans do. This mechanism has been thoroughly explored in text-based tasks, such as language translation, reading comprehension, and sentiment analysis, as well as in vision-based tasks, such as image recognition, object detection, and action recognition. In this work, we explore the attention mechanism in multi-modal tasks that take both text and images as input, namely visual question answering and visual dialog. Both tasks involve three vital components: the input question (with history, for visual dialog), the given image, and the generated answers. Three kinds of relationships should therefore be investigated step by step. We first build attention between words and objects to generate a joint representation of them; we then model the relationship between this representation and the answers, for cases where general word embeddings do not work properly; and finally we model the relationship between the representation and the attributes of the answers, for few-shot learning. First, the bilinear graph networks revisit, from a graph perspective, the relationship between the words of the question and the objects in the image in the visual question answering task. Classical bilinear attention networks build a bilinear attention map to extract the joint representation of words and objects, but they do not fully explore the relationships between words that complex reasoning requires. In contrast, our networks model the context of the joint embeddings of words and objects. Two kinds of graphs are investigated, namely the image-graph and the question-graph.
The image-graph transfers features of the detected objects to their related query words, so that the output nodes carry both semantic and factual information. The question-graph exchanges information between these output nodes of the image-graph to amplify the implicit yet important relationships between objects. The two kinds of graphs cooperate, so the resulting model can capture the relationships and dependencies between objects, which enables multi-step reasoning. Next, our novel image-question-answer synergistic network emphasizes the role of the answer for precise visual dialog. We extend the traditional one-stage solution to a two-stage solution. In the first stage, candidate answers are coarsely scored according to their relevance to the image and question pair. In the second stage, the answers with a high probability of being correct are re-ranked by synergizing with the image and question. Finally, we propose to learn representations of attributes from answers that have enough data; these are later composed to constrain the learning of the few-shot answers. We generate a few-shot VQA dataset with a variety of answers and extract their attributes without any human effort. With this dataset, we build our attribute network to disentangle the attributes by learning their features from parts of the image instead of the whole image.
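The two-stage scheme described in the abstract (coarse scoring of all candidate answers, then re-ranking of the most promising ones) can be illustrated with a minimal sketch. This is not the thesis's model: the function name `two_stage_rank`, the parameter `top_k`, and both toy scoring functions are hypothetical stand-ins for the learned networks, and the vectors are made-up data chosen only to show the control flow.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Scorer = Callable[[Vector, Vector], float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product, used here as a toy relevance measure."""
    return sum(x * y for x, y in zip(a, b))


def two_stage_rank(
    context: Vector,
    candidates: List[Vector],
    coarse_score: Scorer,
    fine_score: Scorer,
    top_k: int,
) -> List[int]:
    """Return candidate indices ordered best first (hypothetical helper).

    Stage 1 scores every candidate with the cheap scorer.
    Stage 2 re-orders only the top_k shortlisted candidates with the
    more expensive scorer; the rest keep their stage-1 order.
    """
    # Stage 1: coarse relevance of each candidate to the context.
    coarse = [coarse_score(context, c) for c in candidates]
    order = sorted(range(len(candidates)), key=lambda i: coarse[i], reverse=True)

    # Split into the shortlist and the remainder.
    head, tail = order[:top_k], order[top_k:]

    # Stage 2: re-rank only the shortlist with the finer scorer.
    head = sorted(head, key=lambda i: fine_score(context, candidates[i]), reverse=True)
    return head + tail


# Toy data (assumption): a 2-d "image + question" embedding and four answers.
context = [1.0, 0.0]
candidates = [[0.9, 0.1], [0.8, 0.9], [0.1, 0.2], [0.7, 0.0]]


def toy_fine(ctx: Vector, cand: Vector) -> float:
    # Stand-in for a joint image-question-answer model.
    return dot(ctx, cand) + cand[1]


ranking = two_stage_rank(context, candidates, dot, toy_fine, top_k=2)
print(ranking)
```

In this toy run the coarse stage shortlists candidates 0 and 1, and the fine stage swaps their order; candidates outside the shortlist are never rescored, which is the point of restricting the expensive scorer to a small set.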