Academic literature on the topic 'Visual question answering (VQA)'
Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles
Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Visual question answering (VQA).'
You can also download the full text of the academic publication as a PDF and read its abstract online whenever these are available in the metadata.
Journal articles on the topic "Visual question answering (VQA)"
Agrawal, Aishwarya, Jiasen Lu, Stanislaw Antol, Margaret Mitchell, C. Lawrence Zitnick, Devi Parikh, and Dhruv Batra. "VQA: Visual Question Answering." International Journal of Computer Vision 123, no. 1 (November 8, 2016): 4–31. http://dx.doi.org/10.1007/s11263-016-0966-6.
Lei, Chenyi, Lei Wu, Dong Liu, Zhao Li, Guoxin Wang, Haihong Tang, and Houqiang Li. "Multi-Question Learning for Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (April 3, 2020): 11328–35. http://dx.doi.org/10.1609/aaai.v34i07.6794.
Shah, Sanket, Anand Mishra, Naganand Yadati, and Partha Pratim Talukdar. "KVQA: Knowledge-Aware Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 8876–84. http://dx.doi.org/10.1609/aaai.v33i01.33018876.
Guo, Zihan, Dezhi Han, and Kuan-Ching Li. "Double-layer affective visual question answering network." Computer Science and Information Systems, no. 00 (2020): 38. http://dx.doi.org/10.2298/csis200515038g.
Wu, Chenfei, Jinlai Liu, Xiaojie Wang, and Ruifan Li. "Differential Networks for Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 8997–9004. http://dx.doi.org/10.1609/aaai.v33i01.33018997.
Zhou, Yiyi, Rongrong Ji, Jinsong Su, Xiaoshuai Sun, and Weiqiu Chen. "Dynamic Capsule Attention for Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 9324–31. http://dx.doi.org/10.1609/aaai.v33i01.33019324.
Moholkar, K. P., et al. "Visual Question Answering using Convolutional Neural Networks." Turkish Journal of Computer and Mathematics Education (TURCOMAT) 12, no. 1S (April 11, 2021): 170–75. http://dx.doi.org/10.17762/turcomat.v12i1s.1602.
Guo, Wenya, Ying Zhang, Xiaoping Wu, Jufeng Yang, Xiangrui Cai, and Xiaojie Yuan. "Re-Attention for Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 01 (April 3, 2020): 91–98. http://dx.doi.org/10.1609/aaai.v34i01.5338.
Boukhers, Zeyd, Timo Hartmann, and Jan Jürjens. "COIN: Counterfactual Image Generation for Visual Question Answering Interpretation." Sensors 22, no. 6 (March 14, 2022): 2245. http://dx.doi.org/10.3390/s22062245.
Li, Qun, Fu Xiao, Bir Bhanu, Biyun Sheng, and Richang Hong. "Inner Knowledge-based Img2Doc Scheme for Visual Question Answering." ACM Transactions on Multimedia Computing, Communications, and Applications 18, no. 3 (August 31, 2022): 1–21. http://dx.doi.org/10.1145/3489142.
Dissertations / Theses on the topic "Visual question answering (VQA)"
Chowdhury, Muhammad Iqbal Hasan. "Question-answering on image/video content." Thesis, Queensland University of Technology, 2020. https://eprints.qut.edu.au/205096/1/Muhammad%20Iqbal%20Hasan_Chowdhury_Thesis.pdf.
Strub, Florian. "Développement de modèles multimodaux interactifs pour l'apprentissage du langage dans des environnements visuels" [Development of interactive multimodal models for language learning in visual environments]. Thesis, Lille 1, 2020. http://www.theses.fr/2020LIL1I030.
Full textWhile our representation of the world is shaped by our perceptions, our languages, and our interactions, they have traditionally been distinct fields of study in machine learning. Fortunately, this partitioning started opening up with the recent advents of deep learning methods, which standardized raw feature extraction across communities. However, multimodal neural architectures are still at their beginning, and deep reinforcement learning is often limited to constrained environments. Yet, we ideally aim to develop large-scale multimodal and interactive models towards correctly apprehending the complexity of the world. As a first milestone, this thesis focuses on visually grounded language learning for three reasons (i) they are both well-studied modalities across different scientific fields (ii) it builds upon deep learning breakthroughs in natural language processing and computer vision (ii) the interplay between language and vision has been acknowledged in cognitive science. More precisely, we first designed the GuessWhat?! game for assessing visually grounded language understanding of the models: two players collaborate to locate a hidden object in an image by asking a sequence of questions. We then introduce modulation as a novel deep multimodal mechanism, and we show that it successfully fuses visual and linguistic representations by taking advantage of the hierarchical structure of neural networks. Finally, we investigate how reinforcement learning can support visually grounded language learning and cement the underlying multimodal representation. We show that such interactive learning leads to consistent language strategies but gives raise to new research issues
Mahendru, Aroma. "Role of Premises in Visual Question Answering." Master of Science thesis, Virginia Tech, 2017. http://hdl.handle.net/10919/78030.
Malinowski, Mateusz [Verfasser], and Mario [Akademischer Betreuer] Fritz. "Towards holistic machines : From visual recognition to question answering about real-world images / Mateusz Malinowski ; Betreuer: Mario Fritz." Saarbrücken : Saarländische Universitäts- und Landesbibliothek, 2017. http://d-nb.info/1136607889/34.
Full textDushi, Denis. "Using Deep Learning to Answer Visual Questions from Blind People." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-247910.
A natural application of artificial intelligence is to help blind people with their daily visual challenges through AI-based assistive technology. In this regard, one of the most promising tasks is Visual Question Answering (VQA): the model is presented with an image and a question about that image, and must then predict the correct answer. The VizWiz dataset, a collection of images and corresponding questions originating from blind people, was recently introduced. Being the first VQA dataset to come from a natural setting, it has many limitations and peculiarities. More specifically, the observed characteristics are: high uncertainty in the answers, an informal conversational tone in the questions, a relatively small size, and an imbalance between answerable and unanswerable classes. These characteristics can also be observed, individually or together, in other VQA datasets, and they pose particular challenges when solving the VQA task. Preprocessing techniques from the field of data science are especially well suited to handling these aspects of the data. To contribute to the VQA task, we therefore answered the question "Can preprocessing techniques from the field of data science contribute to solving the VQA task?" by proposing and studying the effect of four different preprocessing techniques. To handle the high uncertainty in the answers, we used a preprocessing step in which we computed the uncertainty of each answer and used this measure to weight the model's outputs during training. Using this "uncertainty-aware" training procedure improved the predictive accuracy of our model by 10%, with which we reached state-of-the-art results when the model was evaluated on the test split of the VizWiz dataset. To overcome the limited amount of data, we designed and tested a new preprocessing procedure that nearly doubles the number of data points by computing the cosine similarity between answer vectors. We also addressed the informal conversational tone of the questions, which were collected from real-world verbal conversations, by proposing an alternative way of preprocessing the questions in which conversational terms are removed. This led to a further improvement: from a predictive accuracy of 0.516 with the standard way of processing the questions, we reached 0.527 predictive accuracy with the new preprocessing. Finally, we handled the imbalance between answerable and unanswerable classes by predicting whether a visual question has a possible answer. We tested two standard preprocessing techniques for adjusting the class distribution of the dataset: oversampling and undersampling. Oversampling gave an improvement, albeit a small one, in both average precision and F1 score.
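As an illustration of the uncertainty-aware weighting described in this abstract, the sketch below derives a soft target for each answer from annotator agreement and trains against it with a binary cross-entropy loss. It assumes PyTorch, and the agreement-based min(count/3, 1) rule is an assumption inspired by the standard VQA/VizWiz soft-accuracy convention; the function names are hypothetical and this is not the thesis's exact procedure.

# Rough sketch of an "uncertainty-aware" target weighting for VQA training,
# assuming PyTorch. The agreement-based soft score is an assumption inspired
# by the standard VQA/VizWiz evaluation, not the thesis's exact method.
from collections import Counter
import torch
import torch.nn.functional as F

def soft_targets(annotator_answers, answer_vocab):
    # annotator_answers: list of answer strings given by the annotators
    # answer_vocab: dict mapping answer string -> class index
    counts = Counter(annotator_answers)
    target = torch.zeros(len(answer_vocab))
    for ans, c in counts.items():
        if ans in answer_vocab:
            # Less annotator agreement -> lower weight on that answer.
            target[answer_vocab[ans]] = min(c / 3.0, 1.0)
    return target

def vqa_loss(logits, targets):
    # Binary cross-entropy against the soft, uncertainty-weighted targets.
    return F.binary_cross_entropy_with_logits(logits, targets)

# Usage
vocab = {"yes": 0, "no": 1, "unanswerable": 2}
t = soft_targets(["yes", "yes", "no", "yes"], vocab)  # -> tensor([1.0000, 0.3333, 0.0000])
loss = vqa_loss(torch.randn(3), t)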
Ben-Younes, Hedi. "Multi-modal representation learning towards visual reasoning." Electronic Thesis or Diss., Sorbonne université, 2019. http://www.theses.fr/2019SORUS173.
The quantity of images that populate the Internet is dramatically increasing. It becomes of critical importance to develop the technology for a precise and automatic understanding of visual contents. As image recognition systems are becoming more and more relevant, researchers in artificial intelligence now seek the next generation of vision systems that can perform high-level scene understanding. In this thesis, we are interested in Visual Question Answering (VQA), which consists in building models that answer any natural language question about any image. Because of its nature and complexity, VQA is often considered a proxy for visual reasoning. Classically, VQA architectures are designed as trainable systems that are provided with images, questions about them, and their answers. To tackle this problem, typical approaches involve modern Deep Learning (DL) techniques. In the first part, we focus on developing multi-modal fusion strategies to model the interactions between image and question representations. More specifically, we explore bilinear fusion models and exploit concepts from tensor analysis to provide tractable and expressive factorizations of parameters. These fusion mechanisms are studied under the widely used visual attention framework: the answer to the question is provided by focusing only on the relevant image regions. In the last part, we move away from the attention mechanism and build a more advanced scene understanding architecture where we consider objects and their spatial and semantic relations. All models are thoroughly evaluated experimentally on standard datasets, and the results are competitive with the literature.
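The bilinear fusion with tensor factorization discussed in this abstract can be sketched as a low-rank approximation: instead of a full three-way interaction tensor between question features, image features, and outputs, each modality is projected into a small common space and combined with an element-wise product. The PyTorch sketch below is a generic illustration of this family of factorizations (MLB/MUTAN-style); the class name and dimensions are assumptions, not the thesis's exact architecture.

# Compact sketch of low-rank bilinear fusion between question and image
# representations, assuming PyTorch. Illustrative of MLB/MUTAN-style
# factorizations in general, not the exact model from the thesis.
import torch
import torch.nn as nn

class LowRankBilinearFusion(nn.Module):
    def __init__(self, q_dim: int, v_dim: int, rank: int, out_dim: int):
        super().__init__()
        # Factor matrices replace the full (q_dim x v_dim x out_dim) interaction tensor.
        self.proj_q = nn.Linear(q_dim, rank)
        self.proj_v = nn.Linear(v_dim, rank)
        self.proj_out = nn.Linear(rank, out_dim)

    def forward(self, q: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
        # Element-wise product in the low-rank space approximates the
        # bilinear interaction between the two modalities.
        fused = torch.tanh(self.proj_q(q)) * torch.tanh(self.proj_v(v))
        return self.proj_out(fused)

# Usage: fuse a 2400-d question encoding with a 2048-d image feature.
fusion = LowRankBilinearFusion(q_dim=2400, v_dim=2048, rank=512, out_dim=3000)
scores = fusion(torch.randn(8, 2400), torch.randn(8, 2048))  # (8, 3000) answer scores

The rank parameter controls the trade-off between the expressiveness of the interaction and the number of parameters, which is the tractability concern the abstract raises.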
Lin, Xiao. "Leveraging Multimodal Perspectives to Learn Common Sense for Vision and Language Tasks." Ph.D. dissertation, Virginia Tech, 2017. http://hdl.handle.net/10919/79521.
Huang, Jia-Hong. "Robustness Analysis of Visual Question Answering Models by Basic Questions." Thesis, 2017. http://hdl.handle.net/10754/626314.
Anderson, Peter James. "Vision and Language Learning: From Image Captioning and Visual Question Answering towards Embodied Agents." PhD thesis, 2018. http://hdl.handle.net/1885/164018.
Full text"Compressive Visual Question Answering." Master's thesis, 2017. http://hdl.handle.net/2286/R.I.45952.
Full textDissertation/Thesis
Masters Thesis Computer Engineering 2017
Books on the topic "Visual question answering (VQA)"
Wu, Qi, Peng Wang, Xin Wang, Xiaodong He, and Wenwu Zhu. Visual Question Answering. Singapore: Springer Nature Singapore, 2022. http://dx.doi.org/10.1007/978-981-19-0964-1.
Visual Question Answering: From Theory to Application. Springer Singapore Pte. Limited, 2022.
Book chapters on the topic "Visual question answering (VQA)"
Wu, Qi, Peng Wang, Xin Wang, Xiaodong He, and Wenwu Zhu. "Medical VQA." In Visual Question Answering, 165–76. Singapore: Springer Nature Singapore, 2022. http://dx.doi.org/10.1007/978-981-19-0964-1_11.
Wu, Qi, Peng Wang, Xin Wang, Xiaodong He, and Wenwu Zhu. "Embodied VQA." In Visual Question Answering, 147–64. Singapore: Springer Nature Singapore, 2022. http://dx.doi.org/10.1007/978-981-19-0964-1_10.
Wu, Qi, Peng Wang, Xin Wang, Xiaodong He, and Wenwu Zhu. "Knowledge-Based VQA." In Visual Question Answering, 73–90. Singapore: Springer Nature Singapore, 2022. http://dx.doi.org/10.1007/978-981-19-0964-1_5.
Wu, Qi, Peng Wang, Xin Wang, Xiaodong He, and Wenwu Zhu. "Text-Based VQA." In Visual Question Answering, 177–87. Singapore: Springer Nature Singapore, 2022. http://dx.doi.org/10.1007/978-981-19-0964-1_12.
Wu, Qi, Peng Wang, Xin Wang, Xiaodong He, and Wenwu Zhu. "Vision-and-Language Pretraining for VQA." In Visual Question Answering, 91–107. Singapore: Springer Nature Singapore, 2022. http://dx.doi.org/10.1007/978-981-19-0964-1_6.
Goel, Vatsal, Mohit Chandak, Ashish Anand, and Prithwijit Guha. "IQ-VQA: Intelligent Visual Question Answering." In Pattern Recognition. ICPR International Workshops and Challenges, 357–70. Cham: Springer International Publishing, 2021. http://dx.doi.org/10.1007/978-3-030-68790-8_28.
Gokhale, Tejas, Pratyay Banerjee, Chitta Baral, and Yezhou Yang. "VQA-LOL: Visual Question Answering Under the Lens of Logic." In Computer Vision – ECCV 2020, 379–96. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-58589-1_23.
Seenivasan, Lalithkumar, Mobarakol Islam, Adithya K. Krishna, and Hongliang Ren. "Surgical-VQA: Visual Question Answering in Surgical Scenes Using Transformer." In Lecture Notes in Computer Science, 33–43. Cham: Springer Nature Switzerland, 2022. http://dx.doi.org/10.1007/978-3-031-16449-1_4.
Narayanan, Abhishek, Abijna Rao, Abhishek Prasad, and S. Natarajan. "Towards Open Ended and Free Form Visual Question Answering: Modeling VQA as a Factoid Question Answering Problem." In Emerging Technologies in Data Mining and Information Security, 749–59. Singapore: Springer Singapore, 2021. http://dx.doi.org/10.1007/978-981-15-9774-9_69.
Salewski, Leonard, A. Sophia Koepke, Hendrik P. A. Lensch, and Zeynep Akata. "CLEVR-X: A Visual Reasoning Dataset for Natural Language Explanations." In xxAI - Beyond Explainable AI, 69–88. Cham: Springer International Publishing, 2022. http://dx.doi.org/10.1007/978-3-031-04083-2_5.
Full textConference papers on the topic "Visual question answering (VQA)"
Antol, Stanislaw, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. "VQA: Visual Question Answering." In 2015 IEEE International Conference on Computer Vision (ICCV). IEEE, 2015. http://dx.doi.org/10.1109/iccv.2015.279.
Mishra, Aakansha, Ashish Anand, and Prithwijit Guha. "CQ-VQA: Visual Question Answering on Categorized Questions." In 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 2020. http://dx.doi.org/10.1109/ijcnn48605.2020.9206913.
Liu, Fei, Jing Liu, Zhiwei Fang, Richang Hong, and Hanqing Lu. "Densely Connected Attention Flow for Visual Question Answering." In Twenty-Eighth International Joint Conference on Artificial Intelligence {IJCAI-19}. California: International Joint Conferences on Artificial Intelligence Organization, 2019. http://dx.doi.org/10.24963/ijcai.2019/122.
Liu, Yuhang, Wei Wei, Daowan Peng, and Feida Zhu. "Declaration-based Prompt Tuning for Visual Question Answering." In Thirty-First International Joint Conference on Artificial Intelligence {IJCAI-22}. California: International Joint Conferences on Artificial Intelligence Organization, 2022. http://dx.doi.org/10.24963/ijcai.2022/453.
Qiao, Yanyuan, Zheng Yu, and Jing Liu. "VC-VQA: Visual Calibration Mechanism For Visual Question Answering." In 2020 IEEE International Conference on Image Processing (ICIP). IEEE, 2020. http://dx.doi.org/10.1109/icip40778.2020.9190828.
Lao, Mingrui, Yanming Guo, Wei Chen, Nan Pu, and Michael S. Lew. "VQA-BC: Robust Visual Question Answering Via Bidirectional Chaining." In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022. http://dx.doi.org/10.1109/icassp43922.2022.9746493.
Lin, Yuetan, Zhangyang Pang, Donghui Wang, and Yueting Zhuang. "Feature Enhancement in Attention for Visual Question Answering." In Twenty-Seventh International Joint Conference on Artificial Intelligence {IJCAI-18}. California: International Joint Conferences on Artificial Intelligence Organization, 2018. http://dx.doi.org/10.24963/ijcai.2018/586.
Song, Jingkuan, Pengpeng Zeng, Lianli Gao, and Heng Tao Shen. "From Pixels to Objects: Cubic Visual Attention for Visual Question Answering." In Twenty-Seventh International Joint Conference on Artificial Intelligence {IJCAI-18}. California: International Joint Conferences on Artificial Intelligence Organization, 2018. http://dx.doi.org/10.24963/ijcai.2018/126.
Gao, Chenyu, Qi Zhu, Peng Wang, and Qi Wu. "Chop Chop BERT: Visual Question Answering by Chopping VisualBERT's Heads." In Thirtieth International Joint Conference on Artificial Intelligence {IJCAI-21}. California: International Joint Conferences on Artificial Intelligence Organization, 2021. http://dx.doi.org/10.24963/ijcai.2021/92.
Cascante-Bonilla, Paola, Hui Wu, Letao Wang, Rogerio Feris, and Vicente Ordonez. "Sim VQA: Exploring Simulated Environments for Visual Question Answering." In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022. http://dx.doi.org/10.1109/cvpr52688.2022.00500.