Journal articles on the topic 'Visual question answering (VQA)'
Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles
Consult the top 50 journal articles for your research on the topic 'Visual question answering (VQA).'
Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.
You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.
Browse journal articles on a wide variety of disciplines and organise your bibliography correctly.
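To illustrate how one record can be rendered in several of the styles mentioned above, here is a minimal sketch in Python. The `record` fields and the formatting helpers (`apa`, `mla`) are simplified illustrations of APA 7 and MLA 9 journal-article patterns, not the actual generator behind the 'Add to bibliography' button.

```python
# Hypothetical sketch: rendering one bibliographic record in two styles.
# Field names and formatting rules are simplified illustrations.

record = {
    "authors": ["Agrawal, Aishwarya", "Lu, Jiasen", "Antol, Stanislaw"],
    "title": "VQA: Visual Question Answering",
    "journal": "International Journal of Computer Vision",
    "volume": 123, "issue": 1, "pages": "4-31", "year": 2016,
    "doi": "10.1007/s11263-016-0966-6",
}

def apa(r):
    # APA 7 shape: Authors (Year). Title. Journal, Volume(Issue), Pages. DOI
    names = ", ".join(r["authors"][:-1]) + ", & " + r["authors"][-1]
    return (f'{names} ({r["year"]}). {r["title"]}. {r["journal"]}, '
            f'{r["volume"]}({r["issue"]}), {r["pages"]}. '
            f'https://doi.org/{r["doi"]}')

def mla(r):
    # MLA 9 shape: first author inverted; "et al." when there are 3+ authors
    lead = r["authors"][0]
    names = lead + ", et al." if len(r["authors"]) > 2 else lead
    return (f'{names} "{r["title"]}." {r["journal"]}, vol. {r["volume"]}, '
            f'no. {r["issue"]}, {r["year"]}, pp. {r["pages"]}.')

print(apa(record))
print(mla(record))
```

The same record dictionary feeds every style function, which is the essential design of any multi-style citation generator: structured metadata in, style-specific string out.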
Agrawal, Aishwarya, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Devi Parikh, and Dhruv Batra. "VQA: Visual Question Answering." International Journal of Computer Vision 123, no. 1 (November 8, 2016): 4–31. http://dx.doi.org/10.1007/s11263-016-0966-6.
Lei, Chenyi, Lei Wu, Dong Liu, Zhao Li, Guoxin Wang, Haihong Tang, and Houqiang Li. "Multi-Question Learning for Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 11328–35. http://dx.doi.org/10.1609/aaai.v34i07.6794.
Shah, Sanket, Anand Mishra, Naganand Yadati, and Partha Pratim Talukdar. "KVQA: Knowledge-Aware Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 8876–84. http://dx.doi.org/10.1609/aaai.v33i01.33018876.
Guo, Zihan, Dezhi Han, and Kuan-Ching Li. "Double-layer affective visual question answering network." Computer Science and Information Systems, no. 00 (2020): 38. http://dx.doi.org/10.2298/csis200515038g.
Wu, Chenfei, Jinlai Liu, Xiaojie Wang, and Ruifan Li. "Differential Networks for Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 8997–9004. http://dx.doi.org/10.1609/aaai.v33i01.33018997.
Zhou, Yiyi, Rongrong Ji, Jinsong Su, Xiaoshuai Sun, and Weiqiu Chen. "Dynamic Capsule Attention for Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 9324–31. http://dx.doi.org/10.1609/aaai.v33i01.33019324.
Moholkar, K. P., et al. "Visual Question Answering using Convolutional Neural Networks." Turkish Journal of Computer and Mathematics Education (TURCOMAT) 12, no. 1S (April 11, 2021): 170–75. http://dx.doi.org/10.17762/turcomat.v12i1s.1602.
Guo, Wenya, Ying Zhang, Xiaoping Wu, Jufeng Yang, Xiangrui Cai, and Xiaojie Yuan. "Re-Attention for Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 01 (April 3, 2020): 91–98. http://dx.doi.org/10.1609/aaai.v34i01.5338.
Boukhers, Zeyd, Timo Hartmann, and Jan Jürjens. "COIN: Counterfactual Image Generation for Visual Question Answering Interpretation." Sensors 22, no. 6 (March 14, 2022): 2245. http://dx.doi.org/10.3390/s22062245.
Li, Qun, Fu Xiao, Bir Bhanu, Biyun Sheng, and Richang Hong. "Inner Knowledge-based Img2Doc Scheme for Visual Question Answering." ACM Transactions on Multimedia Computing, Communications, and Applications 18, no. 3 (August 31, 2022): 1–21. http://dx.doi.org/10.1145/3489142.
Qiu, Yue, Yutaka Satoh, Ryota Suzuki, Kenji Iwata, and Hirokatsu Kataoka. "Multi-View Visual Question Answering with Active Viewpoint Selection." Sensors 20, no. 8 (April 17, 2020): 2281. http://dx.doi.org/10.3390/s20082281.
R, Lokesh, Madhusudan C, Darshan T, and Sunil Kumar N. "Visual Questioning and Answering." International Journal of Innovative Research in Advanced Engineering 9, no. 8 (August 12, 2022): 312–15. http://dx.doi.org/10.26562/ijirae.2022.v0908.29.
Lee, Doyup, Yeongjae Cheon, and Wook-Shin Han. "Regularizing Attention Networks for Anomaly Detection in Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 3 (May 18, 2021): 1845–53. http://dx.doi.org/10.1609/aaai.v35i3.16279.
Kim, Incheol. "Visual Experience-Based Question Answering with Complex Multimodal Environments." Mathematical Problems in Engineering 2020 (November 19, 2020): 1–18. http://dx.doi.org/10.1155/2020/8567271.
Li, Haiyan, and Dezhi Han. "Multimodal encoders and decoders with gate attention for visual question answering." Computer Science and Information Systems 18, no. 3 (2021): 1023–40. http://dx.doi.org/10.2298/csis201120032l.
He, Shirong, and Dezhi Han. "An Effective Dense Co-Attention Networks for Visual Question Answering." Sensors 20, no. 17 (August 30, 2020): 4897. http://dx.doi.org/10.3390/s20174897.
Alizadeh, Mehrdad, and Barbara Di Eugenio. "Incorporating Verb Semantic Information in Visual Question Answering Through Multitask Learning Paradigm." International Journal of Semantic Computing 14, no. 02 (June 2020): 223–48. http://dx.doi.org/10.1142/s1793351x20400085.
Garcia, Noa, Mayu Otani, Chenhui Chu, and Yuta Nakashima. "KnowIT VQA: Answering Knowledge-Based Questions about Videos." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 10826–34. http://dx.doi.org/10.1609/aaai.v34i07.6713.
Yan, Feng, Wushouer Silamu, and Yanbing Li. "Deep Modular Bilinear Attention Network for Visual Question Answering." Sensors 22, no. 3 (January 28, 2022): 1045. http://dx.doi.org/10.3390/s22031045.
Yang, Cheng, Weijia Wu, Yuxing Wang, and Hong Zhou. "Multi-Modality Global Fusion Attention Network for Visual Question Answering." Electronics 9, no. 11 (November 9, 2020): 1882. http://dx.doi.org/10.3390/electronics9111882.
Li, Qifeng, Xinyi Tang, and Yi Jian. "Adversarial Learning with Bidirectional Attention for Visual Question Answering." Sensors 21, no. 21 (October 28, 2021): 7164. http://dx.doi.org/10.3390/s21217164.
Zhang, Pufen, Hong Lan, and Muhammad Asim Khan. "Multiple Context Learning Networks for Visual Question Answering." Scientific Programming 2022 (February 9, 2022): 1–11. http://dx.doi.org/10.1155/2022/4378553.
Guo, Zihan, and Dezhi Han. "Multi-Modal Explicit Sparse Attention Networks for Visual Question Answering." Sensors 20, no. 23 (November 26, 2020): 6758. http://dx.doi.org/10.3390/s20236758.
Shen, Xiang, Dezhi Han, Chongqing Chen, Gaofeng Luo, and Zhongdai Wu. "An effective spatial relational reasoning networks for visual question answering." PLOS ONE 17, no. 11 (November 28, 2022): e0277693. http://dx.doi.org/10.1371/journal.pone.0277693.
Jing, Chenchen, Yuwei Wu, Xiaoxun Zhang, Yunde Jia, and Qi Wu. "Overcoming Language Priors in VQA via Decomposed Linguistic Representations." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 11181–88. http://dx.doi.org/10.1609/aaai.v34i07.6776.
Cui, Yanqing, Guangjie Han, and Hongbo Zhu. "A Novel Online Teaching Effect Evaluation Model Based on Visual Question Answering." Journal of Internet Technology 23, no. 1 (January 2022): 93–100. http://dx.doi.org/10.53106/160792642022012301009.
Huang, Jia-Hong, Cuong Duc Dao, Modar Alfadly, and Bernard Ghanem. "A Novel Framework for Robustness Analysis of Visual QA Models." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 8449–56. http://dx.doi.org/10.1609/aaai.v33i01.33018449.
Xiang, Yingxin, Chengyuan Zhang, Zhichao Han, Hao Yu, Jiaye Li, and Lei Zhu. "Path-Wise Attention Memory Network for Visual Question Answering." Mathematics 10, no. 18 (September 7, 2022): 3244. http://dx.doi.org/10.3390/math10183244.
Ben-younes, Hedi, Remi Cadene, Nicolas Thome, and Matthieu Cord. "BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 8102–9. http://dx.doi.org/10.1609/aaai.v33i01.33018102.
Wang, Ruiping, Shihong Wu, and Xiaoping Wang. "The Core of Smart Cities: Knowledge Representation and Descriptive Framework Construction in Knowledge-Based Visual Question Answering." Sustainability 14, no. 20 (October 14, 2022): 13236. http://dx.doi.org/10.3390/su142013236.
Zhu, Han, Xiaohai He, Meiling Wang, Mozhi Zhang, and Linbo Qing. "Medical visual question answering via corresponding feature fusion combined with semantic attention." Mathematical Biosciences and Engineering 19, no. 10 (2022): 10192–212. http://dx.doi.org/10.3934/mbe.2022478.
Liu, Yibing, Yangyang Guo, Jianhua Yin, Xuemeng Song, Weifeng Liu, Liqiang Nie, and Min Zhang. "Answer Questions with Right Image Regions: A Visual Attention Regularization Approach." ACM Transactions on Multimedia Computing, Communications, and Applications 18, no. 4 (November 30, 2022): 1–18. http://dx.doi.org/10.1145/3498340.
Zhang, Xu, DeZhi Han, and Chin-Chen Chang. "RDMMFET: Representation of Dense Multimodality Fusion Encoder Based on Transformer." Mobile Information Systems 2021 (October 18, 2021): 1–9. http://dx.doi.org/10.1155/2021/2662064.
Acharya, Manoj, Kushal Kafle, and Christopher Kanan. "TallyQA: Answering Complex Counting Questions." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 8076–84. http://dx.doi.org/10.1609/aaai.v33i01.33018076.
Lei, Zhi, Guixian Zhang, Lijuan Wu, Kui Zhang, and Rongjiao Liang. "A Multi-level Mesh Mutual Attention Model for Visual Question Answering." Data Science and Engineering 7, no. 4 (October 30, 2022): 339–53. http://dx.doi.org/10.1007/s41019-022-00200-9.
Park, Sungho, Sunhee Hwang, Jongkwang Hong, and Hyeran Byun. "Fair-VQA: Fairness-Aware Visual Question Answering Through Sensitive Attribute Prediction." IEEE Access 8 (2020): 215091–99. http://dx.doi.org/10.1109/access.2020.3041503.
Liu, Feng, Tao Xiang, Timothy M. Hospedales, Wankou Yang, and Changyin Sun. "Inverse Visual Question Answering: A New Benchmark and VQA Diagnosis Tool." IEEE Transactions on Pattern Analysis and Machine Intelligence 42, no. 2 (February 1, 2020): 460–74. http://dx.doi.org/10.1109/tpami.2018.2880185.
Yang, Zhengyuan, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang. "An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 3 (June 28, 2022): 3081–89. http://dx.doi.org/10.1609/aaai.v36i3.20215.
Cao, Qingqing, Prerna Khanna, Nicholas D. Lane, and Aruna Balasubramanian. "MobiVQA." Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 6, no. 2 (July 4, 2022): 1–23. http://dx.doi.org/10.1145/3534619.
Li, Mingxiao, and Marie-Francine Moens. "Dynamic Key-Value Memory Enhanced Multi-Step Graph Reasoning for Knowledge-Based Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (June 28, 2022): 10983–92. http://dx.doi.org/10.1609/aaai.v36i10.21346.
Narayanan, Abhishek, Abijna Rao, Abhishek Prasad, and Natarajan S. "VQA as a factoid question answering problem: A novel approach for knowledge-aware and explainable visual question answering." Image and Vision Computing 116 (December 2021): 104328. http://dx.doi.org/10.1016/j.imavis.2021.104328.
Liang, Haotian, and Zhanqing Wang. "Hierarchical Attention Networks for Multimodal Machine Learning." Journal of Physics: Conference Series 2218, no. 1 (March 1, 2022): 012020. http://dx.doi.org/10.1088/1742-6596/2218/1/012020.
Yuan, Desen, Lei Wang, Qingbo Wu, Fanman Meng, King Ngi Ngan, and Linfeng Xu. "Language Bias-Driven Self-Knowledge Distillation with Generalization Uncertainty for Reducing Language Bias in Visual Question Answering." Applied Sciences 12, no. 15 (July 28, 2022): 7588. http://dx.doi.org/10.3390/app12157588.
Gaidamavičius, Dainius, and Tomas Iešmantas. "Deep learning method for visual question answering in the digital radiology domain." Mathematical Models in Engineering 8, no. 2 (June 26, 2022): 58–71. http://dx.doi.org/10.21595/mme.2022.22737.
Li, Xuewei, Dezhi Han, and Chin-Chen Chang. "Pre-training Model Based on Parallel Cross-Modality Fusion Layer." PLOS ONE 17, no. 2 (February 3, 2022): e0260784. http://dx.doi.org/10.1371/journal.pone.0260784.
Lobry, S., D. Marcos, B. Kellenberger, and D. Tuia. "Better Generic Objects Counting When Asking Questions to Images: A Multitask Approach for Remote Sensing Visual Question Answering." ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences V-2-2020 (August 3, 2020): 1021–27. http://dx.doi.org/10.5194/isprs-annals-v-2-2020-1021-2020.
Wu, Jialin, Jiasen Lu, Ashish Sabharwal, and Roozbeh Mottaghi. "Multi-Modal Answer Validation for Knowledge-Based VQA." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 3 (June 28, 2022): 2712–21. http://dx.doi.org/10.1609/aaai.v36i3.20174.
Li, Qifeng, Xinyi Tang, and Yi Jian. "Learning to Reason on Tree Structures for Knowledge-Based Visual Question Answering." Sensors 22, no. 4 (February 17, 2022): 1575. http://dx.doi.org/10.3390/s22041575.
Ma, Zhiyang, Wenfeng Zheng, Xiaobing Chen, and Lirong Yin. "Joint embedding VQA model based on dynamic word vector." PeerJ Computer Science 7 (March 3, 2021): e353. http://dx.doi.org/10.7717/peerj-cs.353.
Zhu, He, Ren Togo, Takahiro Ogawa, and Miki Haseyama. "Diversity Learning Based on Multi-Latent Space for Medical Image Visual Question Generation." Sensors 23, no. 3 (January 17, 2023): 1057. http://dx.doi.org/10.3390/s23031057.