Dissertations / Theses on the topic 'Visual question answering (VQA)'
Consult the top 17 dissertations and theses on the topic 'Visual question answering (VQA)' for your research.
Chowdhury, Muhammad Iqbal Hasan. "Question-answering on image/video content." Thesis, Queensland University of Technology, 2020. https://eprints.qut.edu.au/205096/1/Muhammad%20Iqbal%20Hasan_Chowdhury_Thesis.pdf.
Strub, Florian. "Développement de modèles multimodaux interactifs pour l'apprentissage du langage dans des environnements visuels" [Development of interactive multimodal models for language learning in visual environments]. Thesis, Lille 1, 2020. http://www.theses.fr/2020LIL1I030.
While our representation of the world is shaped by our perceptions, our languages, and our interactions, these have traditionally been distinct fields of study in machine learning. Fortunately, this partitioning started opening up with the recent advent of deep learning methods, which standardized raw feature extraction across communities. However, multimodal neural architectures are still in their infancy, and deep reinforcement learning is often limited to constrained environments. Yet we ideally aim to develop large-scale multimodal and interactive models that correctly apprehend the complexity of the world. As a first milestone, this thesis focuses on visually grounded language learning, for three reasons: (i) vision and language are both well-studied modalities across different scientific fields; (ii) the work builds upon deep learning breakthroughs in natural language processing and computer vision; and (iii) the interplay between language and vision has been acknowledged in cognitive science. More precisely, we first designed the GuessWhat?! game for assessing visually grounded language understanding of the models: two players collaborate to locate a hidden object in an image by asking a sequence of questions. We then introduce modulation as a novel deep multimodal mechanism, and we show that it successfully fuses visual and linguistic representations by taking advantage of the hierarchical structure of neural networks. Finally, we investigate how reinforcement learning can support visually grounded language learning and cement the underlying multimodal representation. We show that such interactive learning leads to consistent language strategies but gives rise to new research issues.
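To make the modulation idea concrete, here is a minimal sketch of feature-wise modulation in the spirit the abstract describes (the language representation conditionally scales and shifts intermediate visual features); the layer sizes, names, and PyTorch framing are illustrative assumptions, not the thesis's actual implementation.

    import torch
    import torch.nn as nn

    class ModulationBlock(nn.Module):
        # A question embedding predicts a per-channel scale (gamma) and
        # shift (beta) applied to image feature maps, so language
        # conditions the visual computation at intermediate layers.
        def __init__(self, q_dim=256, n_channels=128):
            super().__init__()
            self.film = nn.Linear(q_dim, 2 * n_channels)

        def forward(self, visual_feats, q_emb):
            # visual_feats: (batch, channels, H, W); q_emb: (batch, q_dim)
            gamma, beta = self.film(q_emb).chunk(2, dim=1)
            gamma = gamma.unsqueeze(-1).unsqueeze(-1)  # broadcast over H, W
            beta = beta.unsqueeze(-1).unsqueeze(-1)
            return gamma * visual_feats + beta

Stacking such blocks exploits the network hierarchy the abstract mentions: early blocks can modulate low-level features, later ones more abstract representations.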
Mahendru, Aroma. "Role of Premises in Visual Question Answering." Thesis, Virginia Tech, 2017. http://hdl.handle.net/10919/78030.
Master of Science
Malinowski, Mateusz [Author], and Mario [Academic supervisor] Fritz. "Towards holistic machines: From visual recognition to question answering about real-world images / Mateusz Malinowski; Betreuer: Mario Fritz." Saarbrücken: Saarländische Universitäts- und Landesbibliothek, 2017. http://d-nb.info/1136607889/34.
Dushi, Denis. "Using Deep Learning to Answer Visual Questions from Blind People." Thesis, KTH, Skolan för elektroteknik och datavetenskap (EECS), 2019. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-247910.
A natural application of artificial intelligence is to help blind people with their daily visual challenges through AI-based assistive technology. In this regard, one of the most promising tasks is Visual Question Answering (VQA): the model is presented with an image and a question about that image, and must predict the correct answer. Recently, the VizWiz dataset was introduced: a collection of images and questions about them originating from blind people. Being the first VQA dataset to originate from a natural setting, it has many limitations and peculiarities. More specifically, the observed characteristics are: high uncertainty in the answers, an informal conversational tone in the questions, a relatively small size, and an imbalance between answerable and unanswerable classes. These characteristics can also be observed, individually or together, in other VQA datasets, and they pose particular challenges when solving the VQA task. Preprocessing techniques from the field of data science are particularly suited to handling these aspects of the data. To contribute to the VQA task, we therefore answered the question "Can preprocessing techniques from data science contribute to solving the VQA task?" by proposing and studying the effect of four different preprocessing techniques. To handle the high uncertainty in the answers, we used a preprocessing step in which we computed the uncertainty of each answer and used this measure to weight the model's outputs during training. Using an "uncertainty-aware" training procedure raised the predictive accuracy of our model by 10%, with which we reached a state-of-the-art result when the model was evaluated on the test split of the VizWiz dataset. To overcome the limited amount of data, we designed and tested a new preprocessing procedure that almost doubles the number of data points by computing the cosine similarity between answer vectors. We also addressed the informal conversational tone of the questions, collected from real-world verbal conversations, by proposing an alternative way of preprocessing the questions in which conversational terms are removed. This led to a further improvement: from a predictive accuracy of 0.516 with the standard way of processing the questions, we achieved a predictive accuracy of 0.527 with the new preprocessing. Finally, we addressed the imbalance between answerable and unanswerable classes by predicting whether a visual question has a possible answer. We tested two standard preprocessing techniques for adjusting the class distribution of the dataset: oversampling and undersampling. Oversampling gave a small improvement in both average precision and F1 score.
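As one concrete reading of the uncertainty-weighted training step, the sketch below turns disagreeing annotator answers into soft targets and trains against them; the min(count/3, 1) weighting follows the common VQA soft-accuracy convention and is an assumption about the thesis's exact measure.

    import torch
    import torch.nn.functional as F

    def soft_targets(answer_counts, vocab_size):
        # answer_counts: list of (answer_index, n_annotators) pairs for
        # one question. Answers that more annotators agree on receive a
        # larger target weight, capped at 1.
        target = torch.zeros(vocab_size)
        for idx, count in answer_counts:
            target[idx] = min(count / 3.0, 1.0)
        return target

    def uncertainty_aware_loss(logits, target):
        # Binary cross-entropy against the soft targets: confidently
        # agreed answers dominate the gradient, ambiguous ones are
        # down-weighted.
        return F.binary_cross_entropy_with_logits(logits, target)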
Ben-Younes, Hedi. "Multi-modal representation learning towards visual reasoning." Electronic Thesis or Diss., Sorbonne université, 2019. http://www.theses.fr/2019SORUS173.
The quantity of images that populate the Internet is dramatically increasing. It becomes of critical importance to develop the technology for a precise and automatic understanding of visual contents. As image recognition systems are becoming more and more relevant, researchers in artificial intelligence now seek the next generation of vision systems that can perform high-level scene understanding. In this thesis, we are interested in Visual Question Answering (VQA), which consists in building models that answer any natural language question about any image. Because of its nature and complexity, VQA is often considered as a proxy for visual reasoning. Classically, VQA architectures are designed as trainable systems that are provided with images, questions about them, and their answers. To tackle this problem, typical approaches involve modern Deep Learning (DL) techniques. In the first part, we focus on developing multi-modal fusion strategies to model the interactions between image and question representations. More specifically, we explore bilinear fusion models and exploit concepts from tensor analysis to provide tractable and expressive factorizations of parameters. These fusion mechanisms are studied under the widely used visual attention framework: the answer to the question is provided by focusing only on the relevant image regions. In the last part, we move away from the attention mechanism and build a more advanced scene understanding architecture where we consider objects and their spatial and semantic relations. All models are thoroughly evaluated on standard datasets, and the results are competitive with the literature.
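A minimal sketch of the kind of factorized bilinear fusion the abstract alludes to (low-rank projections of each modality combined multiplicatively, in the spirit of Tucker-style decompositions such as MUTAN); the dimensions and rank are illustrative assumptions.

    import torch
    import torch.nn as nn

    class FactorizedBilinearFusion(nn.Module):
        # A full bilinear map between a 2400-d question vector and a
        # 2048-d image vector would need billions of parameters;
        # factorizing each mode down to a small rank keeps the
        # interaction expressive yet tractable.
        def __init__(self, q_dim=2400, v_dim=2048, rank=320, n_answers=3000):
            super().__init__()
            self.q_proj = nn.Linear(q_dim, rank)
            self.v_proj = nn.Linear(v_dim, rank)
            self.out = nn.Linear(rank, n_answers)

        def forward(self, q, v):
            # The elementwise product of the projected modes realizes a
            # low-rank bilinear interaction between question and image.
            return self.out(torch.tanh(self.q_proj(q)) * torch.tanh(self.v_proj(v)))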
Lin, Xiao. "Leveraging Multimodal Perspectives to Learn Common Sense for Vision and Language Tasks." Diss., Virginia Tech, 2017. http://hdl.handle.net/10919/79521.
Ph.D.
Huang, Jia-Hong. "Robustness Analysis of Visual Question Answering Models by Basic Questions." Thesis, 2017. http://hdl.handle.net/10754/626314.
Anderson, Peter James. "Vision and Language Learning: From Image Captioning and Visual Question Answering towards Embodied Agents." PhD thesis, 2018. http://hdl.handle.net/1885/164018.
Full text"Compressive Visual Question Answering." Master's thesis, 2017. http://hdl.handle.net/2286/R.I.45952.
Master's thesis, Computer Engineering, 2017.
"Towards Supporting Visual Question and Answering Applications." Doctoral diss., 2017. http://hdl.handle.net/2286/R.I.45546.
Doctoral dissertation, Computer Science, 2017.
Pan, Wei-Zhi (潘韋志). "Learning from Noisy Labels for Visual Question Answering." Thesis, 2018. http://ndltd.ncl.edu.tw/handle/jnaynq.
National Chiao Tung University, Institute of Multimedia Engineering, 2018 (ROC academic year 107).
This thesis conducts a study of learning algorithms that address the noisy label issues inherent in Visual Question Answering (VQA) tasks. Noisy labelling in VQA refers to the phenomenon of collecting different answers to the same image-question pair from different human subjects, which often arises because some image-question pairs create an ambiguous context that leads to indefinite answers. When trained with such noisy supervision, the performance of the VQA model suffers. To address noisy label issues, we first survey three mainstream families of algorithms for learning from noisy labels: (1) loss correction, (2) label cleansing, and (3) graphical models. We then implement these algorithms on top of a dual attention VQA network (which we call the base VQA model) and test their performance on the VirginiaTech VQA dataset. Experimental results show that (1) the performance of the loss-correction algorithms relies heavily on accurate estimation of the noise-induced label transition probabilities or accurate detection of the noise level, (2) the label cleansing algorithms require enough verified labels to perform effectively, and (3) the graphical models need to differentiate the noise level of each QA input to work well. In addition, the capability of the base VQA model can have a profound effect on the performance of these noisy-label learning algorithms.
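For the loss-correction family the thesis surveys, a minimal sketch of "forward" correction is shown below: the model's clean posterior is pushed through an estimated label transition matrix before the loss is computed. The matrix T is assumed given here; the experiments' point is precisely that estimating it accurately is hard.

    import torch
    import torch.nn.functional as F

    def forward_corrected_loss(logits, noisy_labels, T):
        # T[i, j] = estimated probability that true answer i is recorded
        # as answer j by an annotator (rows sum to 1). Mixing the model's
        # clean posterior with T lets cross-entropy be taken against the
        # noisy labels directly.
        clean_probs = F.softmax(logits, dim=1)   # (batch, n_answers)
        noisy_probs = clean_probs @ T            # (batch, n_answers)
        return F.nll_loss(torch.log(noisy_probs + 1e-8), noisy_labels)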
Pahuja, Vardaan. "Visual question answering with modules and language modeling." Thesis, 2019. http://hdl.handle.net/1866/22534.
Farazi, Moshiur. "Visual Reasoning and Image Understanding: A Question Answering Approach." PhD thesis, 2020. http://hdl.handle.net/1885/216465.
Hajič, Jakub. "Zodpovídání dotazů o obrázcích" [Question answering about images]. Master's thesis, 2017. http://www.nusl.cz/ntk/nusl-365173.
Xu, Huijuan. "Vision and language understanding with localized evidence." Thesis, 2018. https://hdl.handle.net/2144/34790.
Bahdanau, Dzmitry. "On sample efficiency and systematic generalization of grounded language understanding with deep learning." Thesis, 2020. http://hdl.handle.net/1866/23943.
By using the methodology of deep learning that advocates relying more on data and flexible neural models rather than on the expert's knowledge of the domain, the research community has recently achieved remarkable progress in natural language understanding and generation. Nevertheless, it remains unclear whether simply scaling up existing deep learning methods will be sufficient to achieve the goal of using natural language for human-computer interaction. We focus on two related aspects in which current methods appear to require major improvements. The first such aspect is the data inefficiency of deep learning systems: they are known to require extreme amounts of data to perform well. The second aspect is their limited ability to generalize systematically, namely to understand language in situations when the data distribution changes yet the principles of syntax and semantics remain the same. In this thesis, we present four case studies in which we seek to provide more clarity regarding the aforementioned data efficiency and systematic generalization aspects of deep learning approaches to language understanding, as well as to facilitate further work on these topics. In order to separate the problem of representing open-ended real-world knowledge from the problem of core language learning, we conduct all these studies using synthetic languages that are grounded in simple visual environments. In the first article, we study how to train agents to follow compositional instructions in environments with a restricted form of supervision. Namely, for every instruction and initial environment configuration, we provide only a goal state instead of a complete trajectory with actions at all steps. We adapt adversarial imitation learning methods to this setting and demonstrate that such a restricted form of data is sufficient to learn compositional meanings of the instructions. Our second article also focuses on instruction following. We develop the BabyAI platform to facilitate further, more extensive and rigorous studies of this setup. The platform features a compositional Baby language with $10^{19}$ instructions, whose semantics is precisely defined in a partially-observable gridworld environment. We report baseline results on how much supervision is required to teach the agent certain subsets of the Baby language with different training methods, such as reinforcement learning and imitation learning. In the third article, we study systematic generalization of visual question answering (VQA) models. In the VQA setting, the system must answer compositional questions about images. We construct a dataset of spatial questions about object pairs and evaluate how well different models perform on questions about pairs of objects that never occurred in the same question in the training distribution. We show that models in which word meanings are represented by separate modules that perform independent computation generalize much better than models whose design is not explicitly modular. The modular models, however, generalize well only when the modules are connected in an appropriate layout, and our experiments highlight the challenges of learning the layout by end-to-end learning on the training distribution. In our fourth and final article, we also study generalization of VQA models to questions outside of the training distribution, but this time using the popular CLEVR dataset of complex questions about 3D-rendered scenes as the platform.
We generate novel CLEVR-like questions by using similarity-based references (e.g. "the ball that has the same color as ...") in contexts that occur in CLEVR questions but only with location-based references (e.g. "the ball that is to the left of ..."). We call the resulting benchmark CLOSURE and analyze zero- and few-shot generalization to it after training on CLEVR, for a number of existing models as well as a novel one.
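To illustrate the modular designs that the abstract finds to generalize best, here is a toy sketch of two word-level modules that perform independent computation and are chained according to a question layout; the module granularity, names, and parameterization are assumptions for illustration only.

    import torch
    import torch.nn as nn

    class Find(nn.Module):
        # Attends to image regions matching one word, e.g. find[ball].
        def __init__(self, feat_dim=64):
            super().__init__()
            self.score = nn.Linear(feat_dim, 1)

        def forward(self, region_feats):   # region_feats: (batch, regions, dim)
            return torch.softmax(self.score(region_feats).squeeze(-1), dim=1)

    class Relate(nn.Module):
        # Shifts an attention map by a spatial relation, e.g. relate[left_of].
        def __init__(self, n_regions=49):
            super().__init__()
            self.shift = nn.Linear(n_regions, n_regions)

        def forward(self, attention):      # attention: (batch, regions)
            return torch.softmax(self.shift(attention), dim=1)

    # "the ball that is to the left of the cube" might be assembled as a
    # layout chaining independent modules:
    #   attn = Relate()(Find()(region_feats))  # find the cube, shift left
    # with the answer then read out from the attended regions.

Whether such a layout is given or must be inferred end-to-end is exactly the difficulty the abstract's experiments highlight.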