To see the other types of publications on this topic, follow the link: Visual question generation.

Journal articles on the topic 'Visual question generation'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Consult the top 50 journal articles for your research on the topic 'Visual question generation.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles across a wide variety of disciplines and organise your bibliography correctly.

1

Patil, Charulata, and Manasi Patwardhan. "Visual Question Generation." ACM Computing Surveys 53, no. 3 (2020): 1–22. http://dx.doi.org/10.1145/3383465.

2

Liu, Hongfei, Jiali Chen, Wenhao Fang, Jiayuan Xie, and Yi Cai. "Category-Guided Visual Question Generation (Student Abstract)." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 13 (2023): 16262–63. http://dx.doi.org/10.1609/aaai.v37i13.26991.

Abstract:
Visual question generation aims to generate high-quality questions related to images. Generating questions from images alone reduces labor costs and is therefore easy to apply in practice. However, existing methods tend to generate similar, generic questions that fail to ask about the specific content of each image scene. In this paper, we propose a category-guided visual question generation model that can generate questions of multiple categories focusing on different objects in an image. Specifically, our model first selects an appropriate question category based on the objects in the image and the relationships among them. Then, we generate the corresponding questions based on the selected question categories. Experiments conducted on the TDIUC dataset show that our proposed model outperforms existing models in terms of diversity and quality.
3

Xie, Jiayuan, Mengqiu Cheng, Xinting Zhang, et al. "Explicitly Guided Difficulty-Controllable Visual Question Generation." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 24 (2025): 25552–60. https://doi.org/10.1609/aaai.v39i24.34745.

Abstract:
Visual question generation (VQG) aims to generate questions from images automatically. While existing studies primarily focus on the quality of generated questions, such as fluency and relevance, the difficulty of the questions is also a crucial factor in assessing their quality. Question difficulty directly impacts the effectiveness of VQG systems in applications like education and human-computer interaction, where appropriately challenging questions can stimulate learning interest and improve interaction experiences. However, accurately defining and controlling question difficulty is a challenging task due to its multidimensional and subjective nature. In this paper, we propose a new definition of the difficulty of questions, i.e., being positively correlated with the number of reasoning steps required to answer a question. For our definition, we construct a corresponding dataset and propose a benchmark as a foundation for future research. Our benchmark is designed to progressively increase the reasoning steps involved in generating questions. Specifically, we first extract the relationships among objects in the image to form a reasoning chain, then gradually increase the difficulty by rewriting the generated question to include more reasoning sub-chains. Experimental results on our constructed dataset show that our benchmark significantly outperforms existing baselines in controlling the reasoning chains of generated questions, producing questions with varying difficulty levels.
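As a rough, illustrative sketch of the difficulty definition described above (more reasoning hops means a harder question), the following Python snippet counts hops over hypothetical scene-graph triples; the triple format and the breadth-first traversal are assumptions for illustration, not the authors' code.

```python
# Minimal sketch (not the authors' code): difficulty as the length of the
# reasoning chain needed to reach a target object from a start object.
from collections import deque

def reasoning_steps(triples, start, target):
    """triples: (subject, relation, object) tuples extracted from an image.
    Returns the number of relation hops from `start` to `target`, or None."""
    graph = {}
    for subj, _rel, obj in triples:
        graph.setdefault(subj, []).append(obj)

    queue, seen = deque([(start, 0)]), {start}
    while queue:
        node, steps = queue.popleft()
        if node == target:
            return steps
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + 1))
    return None

# Hypothetical scene-graph triples: a longer chain implies a harder question.
triples = [("man", "holds", "cup"), ("cup", "on", "saucer"), ("saucer", "near", "laptop")]
print(reasoning_steps(triples, "man", "laptop"))  # 3 reasoning steps
```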
4

Mi, Li, Syrielle Montariol, Javiera Castillo Navarro, Xianjie Dai, Antoine Bosselut, and Devis Tuia. "ConVQG: Contrastive Visual Question Generation with Multimodal Guidance." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 5 (2024): 4207–15. http://dx.doi.org/10.1609/aaai.v38i5.28216.

Abstract:
Asking questions about visual environments is a crucial way for intelligent agents to understand rich multi-faceted scenes, raising the importance of Visual Question Generation (VQG) systems. Apart from being grounded to the image, existing VQG systems can use textual constraints, such as expected answers or knowledge triplets, to generate focused questions. These constraints allow VQG systems to specify the question content or leverage external commonsense knowledge that can not be obtained from the image content only. However, generating focused questions using textual constraints while enforcing a high relevance to the image content remains a challenge, as VQG systems often ignore one or both forms of grounding. In this work, we propose Contrastive Visual Question Generation (ConVQG), a method using a dual contrastive objective to discriminate questions generated using both modalities from those based on a single one. Experiments on both knowledge-aware and standard VQG benchmarks demonstrate that ConVQG outperforms the state-of-the-art methods and generates image-grounded, text-guided, and knowledge-rich questions. Our human evaluation results also show preference for ConVQG questions compared to non-contrastive baselines.
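The dual contrastive objective described above is not specified in detail here; the following PyTorch sketch shows one plausible InfoNCE-style formulation under stated assumptions (a reference question embedding as the positive, single-modality generations as negatives), not the authors' implementation.

```python
# Hedged sketch of a contrastive objective in the spirit of ConVQG: pull the
# question generated from image+text toward a reference embedding and push it
# away from questions generated from a single modality.
import torch
import torch.nn.functional as F

def dual_contrastive_loss(q_both, q_img_only, q_txt_only, q_reference, temperature=0.1):
    """Each argument: (batch, dim) question embeddings."""
    q_both = F.normalize(q_both, dim=-1)
    q_reference = F.normalize(q_reference, dim=-1)
    negatives = F.normalize(torch.cat([q_img_only, q_txt_only], dim=0), dim=-1)

    pos = (q_both * q_reference).sum(dim=-1, keepdim=True)   # (B, 1) positive similarity
    neg = q_both @ negatives.t()                              # (B, 2B) negative similarities
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(q_both.size(0), dtype=torch.long, device=q_both.device)
    return F.cross_entropy(logits, labels)                    # InfoNCE-style objective

loss = dual_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256),
                             torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```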
5

Sarrouti, Mourad, Asma Ben Abacha, and Dina Demner-Fushman. "Goal-Driven Visual Question Generation from Radiology Images." Information 12, no. 8 (2021): 334. http://dx.doi.org/10.3390/info12080334.

Abstract:
Visual Question Generation (VQG) from images is a rising research topic in both fields of natural language processing and computer vision. Although there are some recent efforts towards generating questions from images in the open domain, the VQG task in the medical domain has not been well-studied so far due to the lack of labeled data. In this paper, we introduce a goal-driven VQG approach for radiology images called VQGRaD that generates questions targeting specific image aspects such as modality and abnormality. In particular, we study generating natural language questions based on the visual content of the image and on additional information such as the image caption and the question category. VQGRaD encodes the dense vectors of different inputs into two latent spaces, which allows generating, for a specific question category, relevant questions about the images, with or without their captions. We also explore the impact of domain knowledge incorporation (e.g., medical entities and semantic types) and data augmentation techniques on visual question generation in the medical domain. Experiments performed on the VQA-RAD dataset of clinical visual questions showed that VQGRaD achieves 61.86% BLEU score and outperforms strong baselines. We also performed a blinded human evaluation of the grammaticality, fluency, and relevance of the generated questions. The human evaluation demonstrated the better quality of VQGRaD outputs and showed that incorporating medical entities improves the quality of the generated questions. Using the test data and evaluation process of the ImageCLEF 2020 VQA-Med challenge, we found that relying on the proposed data augmentation technique to generate new training samples by applying different kinds of transformations, can mitigate the lack of data, avoid overfitting, and bring a substantial improvement in medical VQG.
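For readers unfamiliar with the reported BLEU figure, the snippet below shows the general shape of such an evaluation with NLTK; tokenisation and smoothing choices are assumptions, not the paper's exact protocol.

```python
# Sketch of the kind of BLEU evaluation reported above, using NLTK
# (tokenisation and smoothing here are assumptions, not the paper's setup).
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

references = [[["what", "imaging", "modality", "was", "used"]]]  # one list of references per sample
hypotheses = [["what", "modality", "was", "used"]]               # generated questions, tokenised

score = corpus_bleu(references, hypotheses,
                    smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.4f}")
```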
6

Pang, Wei, and Xiaojie Wang. "Visual Dialogue State Tracking for Question Generation." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (2020): 11831–38. http://dx.doi.org/10.1609/aaai.v34i07.6856.

Abstract:
GuessWhat?! is a visual dialogue task between a guesser and an oracle. The guesser aims to locate an object that the oracle has in mind in an image by asking a sequence of Yes/No questions. Asking proper questions as the dialogue progresses is vital for achieving a successful final guess. As a result, the progress of the dialogue should be properly represented and tracked. Previous models for question generation pay less attention to the representation and tracking of dialogue states and are therefore prone to asking low-quality questions such as repeated questions. This paper proposes a visual dialogue state tracking (VDST) based method for question generation. A visual dialogue state is defined as the distribution over objects in the image together with representations of those objects. Representations of objects are updated as the distribution over objects changes. An object-difference based attention is used to decode the new question. The distribution over objects is updated by comparing the question-answer pair with the objects. Experimental results on the GuessWhat?! dataset show that our model significantly outperforms existing methods and achieves new state-of-the-art performance. It is also noticeable that our model reduces the rate of repeated questions from more than 50% to 21.9% compared with previous state-of-the-art methods.
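A minimal sketch of the kind of dialogue-state update described above, assuming a simple compatibility-weighted renormalisation over candidate objects (not the authors' object-difference attention):

```python
# Hedged sketch of a dialogue-state update over candidate objects: objects
# compatible with the latest answer keep their probability mass, the rest are
# down-weighted, and the distribution is renormalised after every turn.
import numpy as np

def update_object_distribution(probs, compatibility, damping=0.1):
    """probs: (N,) current distribution over objects.
    compatibility: (N,) values in [0, 1] for how well each object fits the Q/A pair."""
    updated = probs * (compatibility + damping)   # damping keeps every object alive
    return updated / updated.sum()

probs = np.full(4, 0.25)                        # uniform belief over 4 objects
compatibility = np.array([1.0, 1.0, 0.0, 0.0])  # answer "yes" to "is it on the left?"
print(update_object_distribution(probs, compatibility))
```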
7

Srinivas, Dr Rhea. "VISUAL QUESTION ANSWERING." International Scientific Journal of Engineering and Management 04, no. 04 (2025): 1–7. https://doi.org/10.55041/isjem03029.

Abstract:
Vision-Language Pre-Training (VLP) significantly improves performance on a variety of multimodal tasks. However, existing models are often specialized in either understanding or generation, which limits their versatility. Furthermore, reliance on noisy web text remains a suboptimal source of supervision. To address these challenges, we propose VLX, a unified VLP framework that handles both vision-language understanding and generation tasks. VLX introduces a new data optimization strategy: a generator creates high-quality synthetic training data, a filter identifies and removes noisy samples, and the web-collected data are thereby used more efficiently. Our framework achieves state-of-the-art results on important benchmarks, including image-text retrieval (+3.1% average recall@1), visual question answering (+2.0% accuracy), and image captioning (+2.5% CIDEr). Additionally, VLX demonstrates robust zero-shot transferability to video-language tasks without any additional fine-tuning. Code, models, and datasets are released to promote future research.
8

Kamala, M. "Visual Question Generation from Remote Sensing Images Using Gemini API." International Journal for Research in Applied Science and Engineering Technology 12, no. 3 (2024): 2924–29. http://dx.doi.org/10.22214/ijraset.2024.59537.

Abstract:
Visual question generation plays a vital role in understanding and extracting information from aerial and satellite images. The proposed approach combines Bidirectional Encoder Representations from Transformers (BERT), the Gemini Application Programming Interface (API), and Convolutional Neural Networks (CNNs). First, the methodology employs a CNN to extract high-level features from remote sensing images, capturing spatial data for question generation. The Gemini API then integrates contextual understanding into the question-generation process by providing relevant environmental data. Lastly, BERT functions as a language model employed to enhance and refine the generated questions, taking into account both syntax and semantics. By combining these techniques, we are able to generate the required relevant questions from remote sensing images in an enhanced and efficient way.
9

Kachare, Atul, Mukesh Kalla, and Ashutosh Gupta. "Visual Question Generation Answering (VQG-VQA) using Machine Learning Models." WSEAS TRANSACTIONS ON SYSTEMS 22 (June 28, 2023): 663–70. http://dx.doi.org/10.37394/23202.2023.22.67.

Abstract:
The presented automated visual question-answer system generates image-based question-answer pairs. The system consists of Visual Question Generation (VQG) and Visual Question Answering (VQA) modules. VQG generates questions based on visual cues, and VQA provides matching answers to the VQG module. The VQG module generates questions using an LSTM and the VGG19 model, training its parameters and predicting the word with the highest probability at each output step. The VQA module uses the VGG-19 convolutional neural network for image encoding and embedding, and a multilayer perceptron to produce high-quality responses. The proposed system reduces the need for human annotation and thus supports the traditional education sector by significantly reducing the human intervention required to generate text queries. The system can be used in interactive interfaces to help young children learn.
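A hedged PyTorch sketch of the described VGG19-plus-LSTM question generator is given below; layer sizes, vocabulary, and wiring are illustrative assumptions rather than the paper's exact configuration.

```python
# Hedged sketch of a VGG19 + LSTM question generator (sizes and wiring are
# assumptions): image features initialise the LSTM, which predicts the next
# word of the question at each step.
import torch
import torch.nn as nn
from torchvision.models import vgg19

class VQGSketch(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=256, hidden_dim=512):
        super().__init__()
        backbone = vgg19(weights=None)   # image encoder (untrained here)
        self.features = nn.Sequential(backbone.features, nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.img_proj = nn.Linear(512, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, tokens):
        h0 = torch.tanh(self.img_proj(self.features(images))).unsqueeze(0)  # init hidden state
        c0 = torch.zeros_like(h0)
        hidden, _ = self.lstm(self.embed(tokens), (h0, c0))
        return self.out(hidden)          # greedy decoding takes the argmax word per step

logits = VQGSketch()(torch.randn(2, 3, 224, 224), torch.randint(0, 5000, (2, 12)))
print(logits.shape)  # (2, 12, 5000)
```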
10

Sandhya, Vidyashankar, Vahi Rakshit, Karkhanis Yash, and Srinivasa Gowri. "Vis Quelle: Visual Question-based Elementary Learning Companion a system to Facilitate Learning Word-Object Associations." International Journal of Innovative Technology and Exploring Engineering (IJITEE) 11, no. 1 (2021): 41–49. https://doi.org/10.35940/ijitee.A9599.1111121.

Abstract:
We present an automated, visual question answering based companion – Vis Quelle – to facilitate elementary learning of word-object associations. In particular, we attempt to harness the power of machine learning models for object recognition and the combined processing of image and text data from visual question answering to provide variety and nuance in the images associated with the letters or words presented to the elementary learner. We incorporate elements such as gamification to motivate the learner by recording scores, errors, etc., to track the learner's progress. Translation is also provided to reinforce word-object associations in the user's native tongue if the learner is using Vis Quelle to learn a second language. Keywords: visual question answering; object recognition; question generation; question answering; word-object association.
11

Zhu, He, Ren Togo, Takahiro Ogawa, and Miki Haseyama. "Diversity Learning Based on Multi-Latent Space for Medical Image Visual Question Generation." Sensors 23, no. 3 (2023): 1057. http://dx.doi.org/10.3390/s23031057.

Abstract:
Auxiliary clinical diagnosis has been researched to address the uneven and insufficient distribution of clinical resources. However, auxiliary diagnosis is still dominated by human physicians, and how to make intelligent systems more involved in the diagnosis process is gradually becoming a concern. An interactive automated clinical diagnosis with a question-answering system and a question generation system can capture a patient's conditions from multiple perspectives with less physician involvement by asking different questions to drive and guide the diagnosis. This clinical diagnosis process requires diverse information to evaluate a patient from different perspectives and obtain an accurate diagnosis. Recently proposed medical question generation systems have not considered diversity. Thus, we propose a diversity learning-based visual question generation model using a multi-latent space to generate informative question sets from medical images. The proposed method generates various questions by embedding visual and language information in different latent spaces, whose diversity is encouraged by our newly proposed loss. We have also added control over the categories of generated questions, making the generated questions directional. Furthermore, we use a new metric named similarity to accurately evaluate the proposed model's performance. The experimental results on the Slake and VQA-RAD datasets demonstrate that the proposed method can generate questions with diverse information. Our model works with an answering model for interactive automated clinical diagnosis and generates datasets to replace the annotation process that incurs huge labor costs.
12

Boukhers, Zeyd, Timo Hartmann, and Jan Jürjens. "COIN: Counterfactual Image Generation for Visual Question Answering Interpretation." Sensors 22, no. 6 (2022): 2245. http://dx.doi.org/10.3390/s22062245.

Abstract:
Due to the significant advancement of Natural Language Processing and Computer Vision-based models, Visual Question Answering (VQA) systems are becoming more intelligent and advanced. However, they are still error-prone when dealing with relatively complex questions. Therefore, it is important to understand the behaviour of the VQA models before adopting their results. In this paper, we introduce an interpretability approach for VQA models by generating counterfactual images. Specifically, the generated image is supposed to have the minimal possible change to the original image and leads the VQA model to give a different answer. In addition, our approach ensures that the generated image is realistic. Since quantitative metrics cannot be employed to evaluate the interpretability of the model, we carried out a user study to assess different aspects of our approach. In addition to interpreting the result of VQA models on single images, the obtained results and the discussion provides an extensive explanation of VQA models’ behaviour.
13

Yu, Ting, Zixuan Tong, Jun Yu, and Ke Zhang. "Fine-grained Adaptive Visual Prompt for Generative Medical Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 9 (2025): 9662–70. https://doi.org/10.1609/aaai.v39i9.33047.

Abstract:
Medical Visual Question Answering (MedVQA) serves as an automated medical assistant, capable of answering patient queries and aiding physician diagnoses based on medical images and questions. Recent advancements have shown that incorporating Large Language Models (LLMs) into MedVQA tasks significantly enhances the capability for answer generation. However, for tasks requiring fine-grained organ-level precise localization, relying solely on language prompts struggles to accurately locate relevant regions within medical images due to substantial background noise. To address this challenge, we explore the use of visual prompts in MedVQA tasks for the first time and propose fine-grained adaptive visual prompts to enhance generative MedVQA. Specifically, we introduce an Adaptive Visual Prompt Creator that adaptively generates region-level visual prompts based on image characteristics of various organs, providing fine-grained references for LLMs during answer retrieval and generation from the medical domain, thereby improving the model's precise cross-modal localization capabilities on original images. Furthermore, we incorporate a Hierarchical Answer Generator with Parameter-Efficient Fine-Tuning (PEFT) techniques, significantly enhancing the model's understanding of spatial and contextual information with minimal parameter increase, promoting the alignment of representation learning with the medical space. Extensive experiments on VQA-RAD, SLAKE, and DME datasets validate the effectiveness of our proposed method, demonstrating its potential in generative MedVQA.
14

Cai, Shuo, Xinzhe Han, and Shuhui Wang. "Divide-and-Conquer: Tree-structured Strategy with Answer Distribution Estimator for Goal-Oriented Visual Dialogue." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 2 (2025): 1917–25. https://doi.org/10.1609/aaai.v39i2.32187.

Abstract:
Goal-oriented visual dialogue involves multi-round interaction between artificial agents and has attracted remarkable attention due to its wide applications. Given a visual scene, this task occurs when a Questioner asks an action-oriented question and an Answerer responds with the intent of letting the Questioner know the correct action to take. The quality of questions affects the accuracy and efficiency of the target search progress. However, existing methods lack a clear strategy to guide the generation of questions, resulting in randomness in the search process and non-convergent results. We propose a Tree-Structured Strategy with Answer Distribution Estimator (TSADE) which guides question generation by excluding half of the current candidate objects in each round. The above process is implemented by maximizing a binary reward inspired by the "divide-and-conquer" paradigm. We further design a candidate-minimization reward which encourages the model to narrow down the scope of candidate objects toward the end of the dialogue. We experimentally demonstrate that our method enables the agents to achieve high task-oriented accuracy with fewer repeated questions and rounds compared to traditional ergodic question generation approaches. Qualitative results further show that TSADE facilitates agents to generate higher-quality questions.
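A minimal sketch of a divide-and-conquer binary reward in the spirit of TSADE, assuming the reward fires when an answer eliminates roughly half of the remaining candidates (the paper's exact reward definition may differ):

```python
# Hedged sketch of a "divide-and-conquer" binary reward: a question is rewarded
# when its answer lets the agent discard about half of the current candidates.
def binary_split_reward(candidates_before, candidates_after, tolerance=0.1):
    """Both arguments are sets of candidate object ids still compatible with the dialogue."""
    kept = len(candidates_after) / max(len(candidates_before), 1)
    return 1.0 if abs(kept - 0.5) <= tolerance else 0.0

before = {1, 2, 3, 4, 5, 6, 7, 8}
after = {1, 2, 3, 4}                       # the answer ruled out half of the objects
print(binary_split_reward(before, after))  # 1.0
```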
15

Shridhar, Mohit, Dixant Mittal, and David Hsu. "INGRESS: Interactive visual grounding of referring expressions." International Journal of Robotics Research 39, no. 2-3 (2020): 217–32. http://dx.doi.org/10.1177/0278364919897133.

Abstract:
This article presents INGRESS, a robot system that follows human natural language instructions to pick and place everyday objects. The key question here is to ground referring expressions: understand expressions about objects and their relationships from image and natural language inputs. INGRESS allows unconstrained object categories and rich language expressions. Further, it asks questions to clarify ambiguous referring expressions interactively. To achieve these, we take the approach of grounding by generation and propose a two-stage neural-network model for grounding. The first stage uses a neural network to generate visual descriptions of objects, compares them with the input language expressions, and identifies a set of candidate objects. The second stage uses another neural network to examine all pairwise relations between the candidates and infers the most likely referred objects. The same neural networks are used for both grounding and question generation for disambiguation. Experiments show that INGRESS outperformed a state-of-the-art method on the RefCOCO dataset and in robot experiments with humans. The INGRESS source code is available at https://github.com/MohitShridhar/ingress .
16

Kim, Incheol. "Visual Experience-Based Question Answering with Complex Multimodal Environments." Mathematical Problems in Engineering 2020 (November 19, 2020): 1–18. http://dx.doi.org/10.1155/2020/8567271.

Abstract:
This paper proposes a novel visual experience-based question answering problem (VEQA) and the corresponding dataset for embodied intelligence research that requires an agent to do actions, understand 3D scenes from successive partial input images, and answer natural language questions about its visual experiences in real time. Unlike the conventional visual question answering (VQA), the VEQA problem assumes both partial observability and dynamics of a complex multimodal environment. To address this VEQA problem, we propose a hybrid visual question answering system, VQAS, integrating a deep neural network-based scene graph generation model and a rule-based knowledge reasoning system. The proposed system can generate more accurate scene graphs for dynamic environments with some uncertainty. Moreover, it can answer complex questions through knowledge reasoning with rich background knowledge. Results of experiments using a photo-realistic 3D simulated environment, AI2-THOR, and the VEQA benchmark dataset prove the high performance of the proposed system.
17

Guo, Zihan, Dezhi Han, and Kuan-Ching Li. "Double-layer affective visual question answering network." Computer Science and Information Systems, no. 00 (2020): 38. http://dx.doi.org/10.2298/csis200515038g.

Abstract:
Visual Question Answering (VQA) has attracted much attention recently in both the natural language processing and computer vision communities, as it offers insight into the relationships between two relevant sources of information. Tremendous advances have been made in the field of VQA due to the success of deep learning. Building upon these advances, the Affective Visual Question Answering Network (AVQAN) enriches the understanding and analysis of VQA models by making use of the emotional information contained in the images to produce sensitive answers, while maintaining the same level of accuracy as ordinary VQA baseline models. Integrating the emotional information contained in images into VQA is a reasonably new task. However, it is challenging to separate question-guided attention from mood-guided attention due to the concatenation of the question words and the mood labels in AVQAN, and it is believed that this type of concatenation is harmful to the performance of the model. To mitigate this effect, we propose the Double-Layer Affective Visual Question Answering Network (DAVQAN), which divides the task of generating emotional answers in VQA into two simpler subtasks, the generation of non-emotional responses and the production of mood labels, and utilizes two independent layers to tackle these subtasks. Comparative experiments conducted on a preprocessed dataset show that the overall performance of DAVQAN is 7.6% higher than that of AVQAN, demonstrating the effectiveness of the proposed model. We also introduce a more advanced word embedding method and a more fine-grained image feature extractor into AVQAN and DAVQAN to further improve their performance, obtaining better results than the original models, which shows that VQA integrated with affective computing can improve the performance of the whole model by improving these two modules, just as in general VQA.
18

Singh, Anjali, Ruhi Sharma Mittal, Shubham Atreja, et al. "Automatic Generation of Leveled Visual Assessments for Young Learners." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 9713–20. http://dx.doi.org/10.1609/aaai.v33i01.33019713.

Abstract:
Images are an essential tool for communicating with children, particularly at younger ages when they are still developing their emergent literacy skills. Hence, assessments that use images to assess their conceptual knowledge and visual literacy, are an important component of their learning process. Creating assessments at scale is a challenging task, which has led to several techniques being proposed for automatic generation of textual assessments. However, none of them focuses on generating image-based assessments. To understand the manual process of creating visual assessments, we interviewed primary school teachers. Based on the findings from the preliminary study, we present a novel approach which uses image semantics to generate visual multiple choice questions (VMCQs) for young learners, wherein options are presented in the form of images. We propose a metric to measure the semantic similarity between two images, which we use to identify the four options – one answer and three distractor images – for a given question. We also use this metric for generating VMCQs at two difficulty levels – easy and hard. Through a quantitative evaluation, we show that the system-generated VMCQs are comparable to VMCQs created by experts, hence establishing the effectiveness of our approach.
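The semantic-similarity-based distractor selection can be illustrated with a small NumPy sketch; the embedding source and the hard/easy split by cosine similarity are assumptions, not the authors' metric.

```python
# Hedged sketch of distractor selection by image-embedding similarity: hard
# VMCQs use the most similar images as distractors, easy ones the least similar.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def pick_distractors(answer_vec, candidate_vecs, k=3, hard=True):
    sims = [(i, cosine(answer_vec, v)) for i, v in enumerate(candidate_vecs)]
    sims.sort(key=lambda x: x[1], reverse=hard)   # hard: most similar first
    return [i for i, _ in sims[:k]]

rng = np.random.default_rng(0)
answer, pool = rng.normal(size=64), [rng.normal(size=64) for _ in range(10)]
print(pick_distractors(answer, pool, hard=False))  # indices of three "easy" distractors
```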
19

Long, Xinwei, Zhiyuan Ma, Ermo Hua, Kaiyan Zhang, Biqing Qi, and Bowen Zhou. "Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 23 (2025): 24723–31. https://doi.org/10.1609/aaai.v39i23.34653.

Abstract:
Retrieval-augmented generation (RAG) has emerged to address the knowledge-intensive visual question answering (VQA) task. Current methods mainly employ separate retrieval and generation modules to acquire external knowledge and generate answers, respectively. We propose ReAuSE, an alternative to the previous RAG model for the knowledge-based VQA task, which seamlessly integrates knowledge retriever into the generative multi-modal large language model, serving as a built-in search engine. Specifically, our model functions both as a generative retriever and an accurate answer generator. It not only helps retrieve documents from the knowledge base by producing identifier for each document, but it also answers visual questions based on the retrieved documents. Furthermore, we also propose a reinforced retrieval calibration module from relevance feedback to improve retrieval performance and align with the preferences for accurate answer generation. Extensive experiments on two representative OKVQA and A-OKVQA datasets demonstrate significant improvements ranging from 2.9% to 9.6% across all evaluation metrics when compared to strong baselines.
20

Kim, Jung-Jun, Dong-Gyu Lee, Jialin Wu, Hong-Gyu Jung, and Seong-Whan Lee. "Visual question answering based on local-scene-aware referring expression generation." Neural Networks 139 (July 2021): 158–67. http://dx.doi.org/10.1016/j.neunet.2021.02.001.

21

Liu, Yuhang, Daowan Peng, Wei Wei, Yuanyuan Fu, Wenfeng Xie, and Dangyang Chen. "Detection-Based Intermediate Supervision for Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 12 (2024): 14061–68. http://dx.doi.org/10.1609/aaai.v38i12.29315.

Abstract:
Recently, neural module networks (NMNs) have yielded ongoing success in answering compositional visual questions, especially those involving multi-hop visual and logical reasoning. NMNs decompose the complex question into several sub-tasks using instance-modules from the reasoning paths of that question and then exploit intermediate supervisions to guide answer prediction, thereby improving inference interpretability. However, their performance may be hindered due to sketchy modeling of intermediate supervisions. For instance, (1) a prior assumption that each instance-module refers to only one grounded object yet overlooks other potentially associated grounded objects, impeding full cross-modal alignment learning; (2) IoU-based intermediate supervisions may introduce noise signals as the bounding box overlap issue might guide the model's focus towards irrelevant objects. To address these issues, a novel method, Detection-based Intermediate Supervision (DIS), is proposed, which adopts a generative detection framework to facilitate multiple grounding supervisions via sequence generation. As such, DIS offers more comprehensive and accurate intermediate supervisions, thereby boosting answer prediction performance. Furthermore, by considering intermediate results, DIS enhances the consistency in answering compositional questions and their sub-questions. Extensive experiments demonstrate the superiority of our proposed DIS, showcasing both improved accuracy and state-of-the-art reasoning consistency compared to prior approaches.
22

Feng, Chun-Mei, Yang Bai, Tao Luo, et al. "VQA4CIR: Boosting Composed Image Retrieval with Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 3 (2025): 2942–50. https://doi.org/10.1609/aaai.v39i3.32301.

Abstract:
Albeit progress has been made in Composed Image Retrieval (CIR), we empirically find that a certain percentage of failure retrieval results are not consistent with their relative captions. To address this issue, this work provides a Visual Question Answering (VQA) perspective to boost the performance of CIR. The resulting VQA4CIR is a post-processing approach and can be directly plugged into existing CIR methods. Given the top-C retrieved images by a CIR method, VQA4CIR aims to decrease the adverse effect of the failure retrieval results being inconsistent with the relative caption. To find the retrieved images inconsistent with the relative caption, we resort to the "QA generation → VQA" self-verification pipeline. For QA generation, we suggest fine-tuning LLM (e.g., LLaMA) to generate several pairs of questions and answers from each relative caption. We then fine-tune LVLM (e.g., LLaVA) to obtain the VQA model. By feeding the retrieved image and question to the VQA model, one can find the images inconsistent with relative caption when the answer by VQA is inconsistent with the answer in the QA pair. Consequently, the CIR performance can be boosted by modifying the ranks of inconsistently retrieved images. Experimental results show that our proposed method outperforms state-of-the-art CIR methods on the CIRR and Fashion-IQ datasets.
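A hedged sketch of the "QA generation → VQA" self-verification step follows; `vqa_model` and the exact agreement rule are hypothetical stand-ins, not the fine-tuned models used in the paper.

```python
# Hedged sketch of the self-verification filter described above. `vqa_model` is
# a hypothetical callable (image, question) -> answer string; `qa_pairs` are the
# question/answer pairs generated from the relative caption.
def verify_retrievals(retrieved_images, qa_pairs, vqa_model, min_agreement=1.0):
    consistent, inconsistent = [], []
    for image in retrieved_images:
        agree = sum(vqa_model(image, q).strip().lower() == a.strip().lower()
                    for q, a in qa_pairs)
        (consistent if agree / len(qa_pairs) >= min_agreement else inconsistent).append(image)
    # Inconsistent images are demoted to the bottom of the final ranking.
    return consistent + inconsistent

fake_vqa = lambda image, question: "red" if "color" in question else "yes"
ranking = verify_retrievals(["img_a", "img_b"], [("what color is the dress?", "red")], fake_vqa)
print(ranking)
```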
23

Ghosh, Akash, Arkadeep Acharya, Raghav Jain, Sriparna Saha, Aman Chadha, and Setu Sinha. "CLIPSyntel: CLIP and LLM Synergy for Multimodal Question Summarization in Healthcare." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 20 (2024): 22031–39. http://dx.doi.org/10.1609/aaai.v38i20.30206.

Abstract:
In the era of modern healthcare, swiftly generating medical question summaries is crucial for informed and timely patient care. Despite the increasing complexity and volume of medical data, existing studies have focused solely on text-based summarization, neglecting the integration of visual information. Recognizing the untapped potential of combining textual queries with visual representations of medical conditions, we introduce the Multimodal Medical Question Summarization (MMQS) Dataset. This dataset, a major contribution of our work, pairs medical queries with visual aids, facilitating a richer and more nuanced understanding of patient needs. We also propose a framework, utilizing the power of Contrastive Language Image Pretraining(CLIP) and Large Language Models(LLMs), consisting of four modules that identify medical disorders, generate relevant context, filter medical concepts, and craft visually aware summaries. Our comprehensive framework harnesses the power of CLIP, a multimodal foundation model, and various general-purpose LLMs, comprising four main modules: the medical disorder identification module, the relevant context generation module, the context filtration module for distilling relevant medical concepts and knowledge, and finally, a general-purpose LLM to generate visually aware medical question summaries. Leveraging our MMQS dataset, we showcase how visual cues from images enhance the generation of medically nuanced summaries. This multimodal approach not only enhances the decision-making process in healthcare but also fosters a more nuanced understanding of patient queries, laying the groundwork for future research in personalized and responsive medical care. Disclaimer: The article features graphic medical imagery, a result of the subject's inherent requirements.
24

闫, 婧昕. "Sub-Med VQA: A Medical Visual Question Answering Model Integrating Sub-Question Generation and Multimodal Reasoning." Statistics and Application 14, no. 02 (2025): 115–25. https://doi.org/10.12677/sa.2025.142041.

25

Zhang, Lizong, Haojun Yin, Bei Hui, Sijuan Liu, and Wei Zhang. "Knowledge-Based Scene Graph Generation with Visual Contextual Dependency." Mathematics 10, no. 14 (2022): 2525. http://dx.doi.org/10.3390/math10142525.

Abstract:
Scene graph generation is the basis of various computer vision applications, including image retrieval, visual question answering, and image captioning. Previous studies have relied on visual features or incorporated auxiliary information to predict object relationships. However, the rich semantics of external knowledge have not yet been fully utilized, and the combination of visual and auxiliary information can lead to visual dependencies, which impacts relationship prediction among objects. Therefore, we propose a novel knowledge-based model with adjustable visual contextual dependency. Our model has three key components. The first module extracts the visual features and bounding boxes in the input image. The second module uses two encoders to fully integrate visual information and external knowledge. Finally, visual context loss and visual relationship loss are introduced to adjust the visual dependency of the model. The difference between the initial prediction results and the visual dependency results is calculated to generate the dependency-corrected results. The proposed model can obtain better global and contextual information for predicting object relationships, and the visual dependencies can be adjusted through the two loss functions. The results of extensive experiments show that our model outperforms most existing methods.
26

Zhang, Weifeng, Jing Yu, Wenhong Zhao, and Chuan Ran. "DMRFNet: Deep Multimodal Reasoning and Fusion for Visual Question Answering and explanation generation." Information Fusion 72 (August 2021): 70–79. http://dx.doi.org/10.1016/j.inffus.2021.02.006.

27

Lim, Youngsun, Hojun Choi, and Hyunjung Shim. "Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 25 (2025): 26290–98. https://doi.org/10.1609/aaai.v39i25.34827.

Abstract:
Despite the impressive success of text-to-image (TTI) models, existing studies overlook the issue of whether these models accurately convey factual information. In this paper, we focus on the problem of image hallucination, where images created by TTI models fail to faithfully depict factual content. To address this, we introduce I-HallA (Image Hallucination evaluation with Question Answering), a novel automated evaluation metric that measures the factuality of generated images through visual question answering (VQA). We also introduce I-HallA v1.0, a curated benchmark dataset for this purpose. As part of this process, we develop a pipeline that generates high-quality question-answer pairs using multiple GPT-4 Omni-based agents, with human judgments to ensure accuracy. Our evaluation protocols measure image hallucination by testing if images from existing TTI models can correctly respond to these questions. The I-HallA v1.0 dataset comprises 1.2K diverse image-text pairs across nine categories with 1,000 rigorously curated questions covering various compositional challenges. We evaluate five TTI models using I-HallA and reveal that these state-of-the-art models often fail to accurately convey factual information. Moreover, we validate the reliability of our metric by demonstrating a strong Spearman correlation (ρ=0.95) with human judgments. Our benchmark dataset and metric can serve as a foundation for developing factually accurate TTI models.
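The reported agreement with human judgments can be checked with a Spearman correlation as sketched below (toy scores only; the actual I-HallA and human ratings are not reproduced here).

```python
# Sketch of the Spearman-correlation check reported above, with toy scores.
from scipy.stats import spearmanr

i_halla_scores = [0.91, 0.34, 0.78, 0.12, 0.66]   # hypothetical metric scores per image
human_scores   = [5, 2, 4, 1, 3]                  # hypothetical human factuality ratings

rho, p_value = spearmanr(i_halla_scores, human_scores)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")
```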
28

Zhu, He, Ren Togo, Takahiro Ogawa, and Miki Haseyama. "Multimodal Natural Language Explanation Generation for Visual Question Answering Based on Multiple Reference Data." Electronics 12, no. 10 (2023): 2183. http://dx.doi.org/10.3390/electronics12102183.

Abstract:
As deep learning research continues to advance, interpretability is becoming as important as model performance. Conducting interpretability studies to understand the decision-making processes of deep learning models can improve performance and provide valuable insights for humans. The interpretability of visual question answering (VQA), a crucial task for human–computer interaction, has garnered the attention of researchers due to its wide range of applications. The generation of natural language explanations for VQA that humans can better understand has gradually supplanted heatmap representations as the mainstream focus in the field. Humans typically answer questions by first identifying the primary objects in an image and then referring to various information sources, both within and beyond the image, including prior knowledge. However, previous studies have only considered input images, resulting in insufficient information that can lead to incorrect answers and implausible explanations. To address this issue, we introduce multiple references in addition to the input image. Specifically, we propose a multimodal model that generates natural language explanations for VQA. We introduce outside knowledge using the input image and question and incorporate object information into the model through an object detection module. By increasing the information available during the model generation process, we significantly improve VQA accuracy and the reliability of the generated explanations. Moreover, we employ a simple and effective feature fusion joint vector to combine information from multiple modalities while maximizing information preservation. Qualitative and quantitative evaluation experiments demonstrate that the proposed method can generate more reliable explanations than state-of-the-art methods while maintaining answering accuracy.
29

Yi, Ziruo, Ting Xiao, and Mark V. Albert. "A Survey on Multimodal Large Language Models in Radiology for Report Generation and Visual Question Answering." Information 16, no. 2 (2025): 136. https://doi.org/10.3390/info16020136.

Abstract:
Large language models (LLMs) and large vision models (LVMs) have driven significant advancements in natural language processing (NLP) and computer vision (CV), establishing a foundation for multimodal large language models (MLLMs) to integrate diverse data types in real-world applications. This survey explores the evolution of MLLMs in radiology, focusing on radiology report generation (RRG) and radiology visual question answering (RVQA), where MLLMs leverage the combined capabilities of LLMs and LVMs to improve clinical efficiency. We begin by tracing the history of radiology and the development of MLLMs, followed by an overview of MLLM applications in RRG and RVQA, detailing core datasets, evaluation metrics, and leading MLLMs that demonstrate their potential in generating radiology reports and answering image-based questions. We then discuss the challenges MLLMs face in radiology, including dataset scarcity, data privacy and security, and issues within MLLMs such as bias, toxicity, hallucinations, catastrophic forgetting, and limitations in traditional evaluation metrics. Finally, this paper proposes future research directions to address these challenges, aiming to help AI researchers and radiologists overcome these obstacles and advance the study of MLLMs in radiology.
30

Kruchinin, Vladimir, and Vladimir Kuzovkin. "Overview of Existing Methods for Automatic Generation of Tasks with Conditions in Natural Language." Computer tools in education, no. 1 (March 28, 2022): 85–96. http://dx.doi.org/10.32603/2071-2340-2022-1-85-96.

Abstract:
The paper considers the main algorithms for generating school problems of closed and open type across various subjects. Some of these algorithms (e.g., question answering, visual question answering) use artificial intelligence and some do not (e.g., AND/OR trees, templates). It is shown that methods for generating tests using artificial intelligence have high potential, but they require further development, in particular the creation of a large question-answer database in the Russian language.
31

ELSHAMY, Ghada, Marco ALFONSE, Islam HEGAZY, and Mostafa AREF. "A multi-modal transformer-based model for generative visual dialog system." Applied Computer Science 21, no. 1 (2025): 1–17. https://doi.org/10.35784/acs_6856.

Abstract:
Recent advancements in generative artificial intelligence have boosted significant interest in conversational agents. The visual dialog task, a synthesis of visual question-answering and dialog systems, requires agents capable of both seeing and chatting in natural language interactions. These agents must effectively understand cross-modal contextual information and generate coherent, human-like responses to a sequence of questions about a given visual scene. Despite progress, previous approaches often required complex architectures and substantial resources. This paper introduces a generative dialog agent that effectively addresses these challenges while maintaining a relatively simple architecture, dataset, and resource requirements. The proposed model employs an encoder-decoder architecture, incorporating ViLBERT for cross-modal information grounding and GPT-2 for autoregressive answer generation. This is the first visual dialog agent solely reliant on an autoregressive decoder for text generation. Evaluated on the VisDial dataset, the model achieves promising results, with scores of 64.05, 62.67, 70.17, and 15.37 on normalized discounted cumulative gain (NDCG), rank@5, rank@10, and the mean, respectively. These outcomes underscore the effectiveness of this approach, particularly considering its efficiency in terms of dataset size, architecture complexity, and generation process. The code and dataset are available at https://github.com/GhadaElshamy/MS-GPT-visdial.git , complete with usage instructions to facilitate replication of these experiments.
32

Li, Xiaochuan, Baoyu Fan, Runze Zhang, et al. "Image Content Generation with Causal Reasoning." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 12 (2024): 13646–54. http://dx.doi.org/10.1609/aaai.v38i12.29269.

Abstract:
The emergence of ChatGPT has once again sparked research in generative artificial intelligence (GAI). While people have been amazed by the generated results, they have also noticed the reasoning potential reflected in the generated textual content. However, this current ability for causal reasoning is primarily limited to the domain of language generation, such as in models like GPT-3. In visual modality, there is currently no equivalent research. Considering causal reasoning in visual content generation is significant. This is because visual information contains infinite granularity. Particularly, images can provide more intuitive and specific demonstrations for certain reasoning tasks, especially when compared to coarse-grained text. Hence, we propose a new image generation task called visual question answering with image (VQAI) and establish a dataset of the same name based on the classic Tom and Jerry animated series. Additionally, we develop a new paradigm for image generation to tackle the challenges of this task. Finally, we perform extensive experiments and analyses, including visualizations of the generated content and discussions on the potentials and limitations. The code and data are publicly available under the license of CC BY-NC-SA 4.0 for academic and non-commercial usage at: https://github.com/IEIT-AGI/MIX-Shannon/blob/main/projects/VQAI/lgd_vqai.md.
33

Tanaka, Ryota, Kyosuke Nishida, and Sen Yoshida. "VisualMRC: Machine Reading Comprehension on Document Images." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 15 (2021): 13878–88. http://dx.doi.org/10.1609/aaai.v35i15.17635.

Abstract:
Recent studies on machine reading comprehension have focused on text-level understanding but have not yet reached the level of human understanding of the visual layout and content of real-world documents. In this study, we introduce a new visual machine reading comprehension dataset, named VisualMRC, wherein given a question and a document image, a machine reads and comprehends texts in the image to answer the question in natural language. Compared with existing visual question answering datasets that contain texts in images, VisualMRC focuses more on developing natural language understanding and generation abilities. It contains 30,000+ pairs of a question and an abstractive answer for 10,000+ document images sourced from multiple domains of webpages. We also introduce a new model that extends existing sequence-to-sequence models, pre-trained with large-scale text corpora, to take into account the visual layout and content of documents. Experiments with VisualMRC show that this model outperformed the base sequence-to-sequence models and a state-of-the-art VQA model. However, its performance is still below that of humans on most automatic evaluation metrics. The dataset will facilitate research aimed at connecting vision and language understanding.
34

Kamala Mekala. "Enhancing VQA with SELM: A Multi-Model Approach Using SBERT." Journal of Information Systems Engineering and Management 10, no. 41s (2025): 795–808. https://doi.org/10.52783/jisem.v10i41s.8003.

Abstract:
In VQA, or Visual Question Answering, a model is provided with an image and a natural language question related to it. For the model to generate appropriate answers, it must be able to understand both textual and visual input. However, two key challenges persist in VQA. The first is the inconsistency between the answers and the explanations provided by current approaches. The second is bridging the semantic gap between images and questions, which results in less accurate explanations. Our goal is to reduce the mismatch between an image's visual components and the generated text, while also compensating for imbalance. We propose a novel approach named the System of Ensemble Learning Model (SELM). The proposed approach utilizes stacked models for the extraction of text and image features. The outputs of the stacked models are taken as input to the multi-model fusion transformer (Similarity BERT). The SBERT model compares the predicted output with the actual ground-truth results. The proposed SBERT achieves 95% accuracy, making it better than state-of-the-art methods. In the future, this model may be extended to different domains such as healthcare, geospatial analysis, and satellite imagery.
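A hedged sketch of comparing a predicted answer with the ground truth via sentence embeddings is shown below; the sentence-transformers model name is an assumption, not necessarily the one used in SELM.

```python
# Hedged sketch of answer comparison via sentence embeddings; the model name
# "all-MiniLM-L6-v2" is an assumption for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
predicted = "a man riding a brown horse"
ground_truth = "a person on a horse"

emb = model.encode([predicted, ground_truth], convert_to_tensor=True)
similarity = util.cos_sim(emb[0], emb[1]).item()
print(f"answer similarity: {similarity:.3f}")
```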
35

Wörgötter, Florentin, Ernst Niebur, and Christof Koch. "Generation of Direction Selectivity by Isotropic Intracortical Connections." Neural Computation 4, no. 3 (1992): 332–40. http://dx.doi.org/10.1162/neco.1992.4.3.332.

Abstract:
To what extent do the mechanisms generating different receptive field properties of neurons depend on each other? We investigated this question theoretically within the context of orientation and direction tuning of simple cells in the mammalian visual cortex. In our model a cortical cell of the "simple" type receives its orientation tuning by afferent convergence of aligned receptive fields of the lateral geniculate nucleus (Hubel and Wiesel 1962). We sharpen this orientation bias by postulating a special type of radially symmetric long-range lateral inhibition called circular inhibition. Surprisingly, this isotropic mechanism leads to the emergence of a strong bias for the direction of motion of a bar. We show that this directional anisotropy is neither caused by the probabilistic nature of the connections nor is it a consequence of the specific columnar structure chosen but that it is an inherent feature of the architecture of visual cortex.
36

Zhu, Qiaoyi. "A Study of the Aesthetic Art of New Patriotism in Red Film and Television Drama." International Journal of Education, Humanities and Social Sciences 1, no. 1 (2024): 16–22. http://dx.doi.org/10.70088/zb1sr964.

Abstract:
Red cinema and television works have long been pivotal components in the propagation of Chinese culture and the inculcation of patriotic education. These productions chronicle the trajectory of historical evolution, conveying the spirit of patriotism in the new era through intense emotions and strikingly impactful visuals. Against the backdrop of globalization, where cultural exchanges among nations are increasingly prevalent, the question of how China's red cinema and television can inspire the younger generation through the aesthetics of new patriotism has become a focal point of attention. From early revolutionary story films to modern epic masterpieces, red cinema and television employ unique aesthetic expressions, innovative narrative techniques, and visual arts to seamlessly blend patriotism with art, offering audiences of the new era a series of visual and spiritual feasts.
37

Wang, Junjue, Zhuo Zheng, Zihang Chen, Ailong Ma, and Yanfei Zhong. "EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 6 (2024): 5481–89. http://dx.doi.org/10.1609/aaai.v38i6.28357.

Abstract:
Earth vision research typically focuses on extracting geospatial object locations and categories but neglects the exploration of relations between objects and comprehensive reasoning. Based on city planning needs, we develop a multi-modal multi-task VQA dataset (EarthVQA) to advance relational reasoning-based judging, counting, and comprehensive analysis. The EarthVQA dataset contains 6000 images, corresponding semantic masks, and 208,593 QA pairs with urban and rural governance requirements embedded. As objects are the basis for complex relational reasoning, we propose a Semantic OBject Awareness framework (SOBA) to advance VQA in an object-centric way. To preserve refined spatial locations and semantics, SOBA leverages a segmentation network for object semantics generation. The object-guided attention aggregates object interior features via pseudo masks, and bidirectional cross-attention further models object external relations hierarchically. To optimize object counting, we propose a numerical difference loss that dynamically adds difference penalties, unifying the classification and regression tasks. Experimental results show that SOBA outperforms both advanced general and remote sensing methods. We believe this dataset and framework provide a strong benchmark for Earth vision's complex analysis. The project page is at https://Junjue-Wang.github.io/homepage/EarthVQA.
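The numerical difference loss is described only at a high level above; the PyTorch sketch below is one plausible interpretation (cross-entropy over count bins plus a penalty on the expected-count error), not the authors' exact formulation.

```python
# Hedged interpretation of a "numerical difference loss" for object counting:
# standard cross-entropy over count bins plus a penalty that grows with the
# numeric distance between the expected and true counts.
import torch
import torch.nn.functional as F

def numerical_difference_loss(logits, target_counts, alpha=0.5):
    """logits: (B, C) scores over count classes 0..C-1; target_counts: (B,) long tensor."""
    ce = F.cross_entropy(logits, target_counts)
    expected = (F.softmax(logits, dim=-1) *
                torch.arange(logits.size(1), device=logits.device)).sum(dim=-1)
    diff_penalty = (expected - target_counts.float()).abs().mean()
    return ce + alpha * diff_penalty

loss = numerical_difference_loss(torch.randn(4, 10), torch.tensor([2, 5, 0, 9]))
print(loss.item())
```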
38

BELZ, A., T. L. BERG, and L. YU. "From image to language and back again." Natural Language Engineering 24, no. 3 (2018): 325–62. http://dx.doi.org/10.1017/s1351324918000086.

Abstract:
Work in computer vision and natural language processing involving images and text has been experiencing explosive growth over the past decade, with a particular boost coming from the neural network revolution. The present volume brings together five research articles from several different corners of the area: multilingual multimodal image description (Frank et al.), multimodal machine translation (Madhyastha et al., Frank et al.), image caption generation (Madhyastha et al., Tanti et al.), visual scene understanding (Silberer et al.), and multimodal learning of high-level attributes (Sorodoc et al.). In this article, we touch upon all of these topics as we review work involving images and text under the three main headings of image description (Section 2), visually grounded referring expression generation (REG) and comprehension (Section 3), and visual question answering (VQA) (Section 4).
39

Abrecht, Stephanie, Lydia Gauerhof, Christoph Gladisch, Konrad Groh, Christian Heinzemann, and Matthias Woehrle. "Testing Deep Learning-based Visual Perception for Automated Driving." ACM Transactions on Cyber-Physical Systems 5, no. 4 (2021): 1–28. http://dx.doi.org/10.1145/3450356.

Full text
Abstract:
Due to the impressive performance of deep neural networks (DNNs) for visual perception, there is an increased demand for their use in automated systems. However, to use deep neural networks in practice, novel approaches are needed, e.g., for testing. In this work, we focus on the question of how to test deep learning-based visual perception functions for automated driving. Classical approaches for testing are not sufficient: a purely statistical approach based on a dataset split is not enough, as testing needs to address various purposes and not only average-case performance. Additionally, a complete specification is elusive due to the complexity of the perception task in the open context of automated driving. In this article, we review and discuss existing work on testing DNNs for visual perception, with a special focus on automated driving, covering test input and test oracle generation as well as test adequacy. We conclude that testing of DNNs in this domain requires several diverse test sets. We show how such test sets can be constructed with the presented approaches to address different purposes, and we identify open research questions.
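
As a concrete illustration of purpose-specific test sets (a simplified assumption rather than any method from the article), one can slice a labeled sample pool by scenario metadata so that each suite probes a different purpose; the attribute names below are hypothetical.

    # Simplified illustration: build several purpose-specific test suites by
    # slicing a labeled pool on scenario metadata instead of one random split.
    # The metadata keys and suite names are hypothetical.
    from collections import defaultdict

    def build_test_suites(samples, purposes):
        # samples: iterable of dicts with an image reference plus metadata
        # purposes: mapping suite name -> predicate over a sample's metadata
        suites = defaultdict(list)
        for sample in samples:
            for name, predicate in purposes.items():
                if predicate(sample):
                    suites[name].append(sample)
        return suites

    suites = build_test_suites(
        samples=[],  # plug in the labeled pool here
        purposes={
            "night": lambda s: s.get("time_of_day") == "night",
            "heavy_occlusion": lambda s: s.get("occlusion", 0.0) > 0.5,
            "rare_classes": lambda s: s.get("label") in {"stroller", "scooter"},
        },
    )
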
APA, Harvard, Vancouver, ISO, and other styles
40

Cheng, Zesen, Kehan Li, Peng Jin, et al. "Parallel Vertex Diffusion for Unified Visual Grounding." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 2 (2024): 1326–34. http://dx.doi.org/10.1609/aaai.v38i2.27896.

Full text
Abstract:
Unified visual grounding (UVG) capitalizes on a wealth of task-related knowledge across various grounding tasks via one-shot training, which curtails retraining costs and task-specific architecture design efforts. Vertex generation-based UVG methods achieve this versatility by unified modeling object box and contour prediction and provide a text-powered interface to vast related multi-modal tasks, e.g., visual question answering and captioning. However, these methods typically generate vertexes sequentially through autoregression, which is prone to be trapped in error accumulation and heavy computation, especially for high-dimension sequence generation in complex scenarios. In this paper, we develop Parallel Vertex Diffusion (PVD) based on the parallelizability of diffusion models to accurately and efficiently generate vertexes in a parallel and scalable manner. Since the coordinates fluctuate greatly, it typically encounters slow convergence when training diffusion models without geometry constraints. Therefore, we consummate our PVD by two critical components, i.e., center anchor mechanism and angle summation loss, which serve to normalize coordinates and adopt a differentiable geometry descriptor from the point-in-polygon problem of computational geometry to constrain the overall difference of prediction and label vertexes. These innovative designs empower our PVD to demonstrate its superiority with state-of-the-art performance across various grounding tasks.
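
For intuition, the sketch below implements a differentiable winding-angle descriptor from the point-in-polygon problem and compares it between predicted and ground-truth polygons; it reflects our reading of the abstract and is not the paper's exact angle summation loss.

    # Sketch of a differentiable point-in-polygon descriptor: the signed angles
    # subtended by consecutive polygon edges at a probe point sum to ~2*pi for
    # interior points and ~0 for exterior ones. Comparing this field between the
    # predicted and label polygons gives a geometry-aware training signal.
    import torch

    def winding_angle_sum(vertices, points):
        # vertices: (N, 2) polygon vertices in order; points: (M, 2) probe points
        d = vertices.unsqueeze(0) - points.unsqueeze(1)       # (M, N, 2) point-to-vertex vectors
        d_next = d.roll(-1, dims=1)                           # vectors to the next vertex
        cross = d[..., 0] * d_next[..., 1] - d[..., 1] * d_next[..., 0]
        dot = (d * d_next).sum(dim=-1)
        return torch.atan2(cross, dot).sum(dim=1)             # (M,) signed angle sums

    def angle_summation_loss(pred_vertices, gt_vertices, probe_points):
        return (winding_angle_sum(pred_vertices, probe_points)
                - winding_angle_sum(gt_vertices, probe_points)).abs().mean()
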
APA, Harvard, Vancouver, ISO, and other styles
41

Pettitt, Joanne. "Visual-Textual Encounters with a German Grandfather: The Work of Angela Findlay." Jewish Film & New Media: An International Journal 11, no. 1 (2023): 90–115. http://dx.doi.org/10.1353/jfn.2023.a937530.

Full text
Abstract:
In this article, I consider the artistic and literary work of Angela Findlay, the daughter of a German mother and English father, and the granddaughter of a highly decorated Wehrmacht soldier. Working as a visual artist, public speaker and writer, Findlay moves between representational forms in order to express the complexities associated with her dual heritage and the "legacy of shame" that she carries with her. I take Findlay's In My Grandfather's Shadow as a case study that foregrounds new forms of visual-textual witnessing in the descendants of the war generation. I conclude that, by moving between different artistic practices, Findlay is able to encapsulate the complex and transnational experience of being a descendant of both Germany and the Allied nations; in so doing, she challenges the reader to question the conceptual boundaries of testimony, straightforward perceptions of perpetration, and the diluted, but nevertheless affected and affective, experiences of subsequent generations.
APA, Harvard, Vancouver, ISO, and other styles
42

Khademi, Mahmoud, and Oliver Schulte. "Deep Generative Probabilistic Graph Neural Networks for Scene Graph Generation." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (2020): 11237–45. http://dx.doi.org/10.1609/aaai.v34i07.6783.

Full text
Abstract:
We propose a new algorithm, called Deep Generative Probabilistic Graph Neural Networks (DG-PGNN), to generate a scene graph for an image. The input to DG-PGNN is an image, together with a set of region-grounded captions and object bounding-box proposals for the image. To generate the scene graph, DG-PGNN constructs and updates a new model, called a Probabilistic Graph Network (PGN). A PGN can be thought of as a scene graph with uncertainty: it represents each node and each edge by a CNN feature vector and defines a probability mass function (PMF) for node-type (object category) of each node and edge-type (predicate class) of each edge. The DG-PGNN sequentially adds a new node to the current PGN by learning the optimal ordering in a Deep Q-learning framework, where states are partial PGNs, actions choose a new node, and rewards are defined based on the ground-truth. After adding a node, DG-PGNN uses message passing to update the feature vectors of the current PGN by leveraging contextual relationship information, object co-occurrences, and language priors from captions. The updated features are then used to fine-tune the PMFs. Our experiments show that the proposed algorithm significantly outperforms the state-of-the-art results on the Visual Genome dataset for scene graph generation. We also show that the scene graphs constructed by DG-PGNN improve performance on the visual question answering task, for questions that need reasoning about objects and their interactions in the scene context.
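
The notion of a "scene graph with uncertainty" can be pictured with the minimal data structure below, in which every node and edge stores a feature vector and a probability mass function over its type; this is a hedged sketch of the idea, not the authors' code.

    # Hedged sketch of a Probabilistic Graph Network: nodes and edges carry a
    # feature vector plus a probability mass function (PMF) over their type,
    # which later message passing can refine.
    from dataclasses import dataclass, field
    import numpy as np

    @dataclass
    class PGNNode:
        feature: np.ndarray      # CNN feature vector for the region
        type_pmf: np.ndarray     # PMF over object categories

    @dataclass
    class PGNEdge:
        source: int
        target: int
        feature: np.ndarray
        type_pmf: np.ndarray     # PMF over predicate classes

    @dataclass
    class ProbabilisticGraphNetwork:
        nodes: list = field(default_factory=list)
        edges: list = field(default_factory=list)

        def add_node(self, feature, num_categories):
            # a newly added node starts with a uniform belief over categories
            self.nodes.append(PGNNode(feature, np.full(num_categories, 1.0 / num_categories)))
            return len(self.nodes) - 1
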
APA, Harvard, Vancouver, ISO, and other styles
43

Liu, Xiulong, Sudipta Paul, Moitreya Chatterjee, and Anoop Cherian. "CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 4 (2024): 3765–73. http://dx.doi.org/10.1609/aaai.v38i4.28167.

Full text
Abstract:
Audio-visual navigation of an agent towards locating an audio goal is a challenging task, especially when the audio is sporadic or the environment is noisy. In this paper, we present CAVEN, a Conversation-based Audio-Visual Embodied Navigation framework in which the agent may interact with a human/oracle for solving the task of navigating to an audio goal. Specifically, CAVEN is modeled as a budget-aware partially observable semi-Markov decision process that implicitly learns the uncertainty in the audio-based navigation policy to decide when and how the agent may interact with the oracle. Our CAVEN agent can engage in fully-bidirectional natural language conversations by producing relevant questions and interpreting free-form, potentially noisy responses from the oracle based on the audio-visual context. To enable such a capability, CAVEN is equipped with: (i) a trajectory forecasting network that is grounded in audio-visual cues to produce a potential trajectory to the estimated goal, and (ii) a natural language based question generation and reasoning network to pose an interactive question to the oracle or interpret the oracle's response to produce navigation instructions. To train the interactive modules, we present a large-scale dataset, AVN-Instruct, based on the Landmark-RxR dataset. To substantiate the usefulness of conversations, we present experiments on the benchmark audio-goal task using the SoundSpaces simulator under various noisy settings. Our results reveal that our fully-conversational approach leads to nearly an order-of-magnitude improvement in success rate, especially in localizing new sound sources and against methods that use only uni-directional interaction.
APA, Harvard, Vancouver, ISO, and other styles
44

Zhao, Chengfang, Mingwei Tang, Yanxi Zheng, and Chaocong Ran. "An Adaptive Multimodal Fusion Network Based on Multilinear Gradients for Visual Question Answering." Electronics 14, no. 1 (2024): 9. https://doi.org/10.3390/electronics14010009.

Full text
Abstract:
As an interdisciplinary field of natural language processing and computer vision, Visual Question Answering (VQA) has emerged as a prominent research focus in artificial intelligence. The core of the VQA task is to combine natural language understanding and image analysis to infer answers by extracting meaningful features from textual and visual inputs. However, most current models struggle to fully capture the deep semantic relationships between images and text owing to their limited capacity to comprehend feature interactions, which constrains their performance. To address these challenges, this paper proposes an innovative Trilinear Multigranularity and Multimodal Adaptive Fusion algorithm (TriMMF) that is designed to improve the efficiency of multimodal feature extraction and fusion in VQA tasks. Specifically, the TriMMF consists of three key modules: (1) an Answer Generation Module, which generates candidate answers by extracting fused features and leveraging question features to focus on critical regions within the image; (2) a Fine-grained and Coarse-grained Interaction Module, which achieves multimodal interaction between question and image features at different granularities and incorporates implicit answer information to capture complex multimodal correlations; and (3) an Adaptive Weight Fusion Module, which selectively integrates coarse-grained and fine-grained interaction features based on task requirements, thereby enhancing the model’s robustness and generalization capability. Experimental results demonstrate that the proposed TriMMF significantly outperforms existing methods on the VQA v1.0 and VQA v2.0 datasets, achieving state-of-the-art performance in question–answer accuracy. These findings indicate that the TriMMF effectively captures the deep semantic associations between images and text. The proposed approach provides new insights into multimodal interaction and fusion research, combining domain adaptation techniques to address a broader range of cross-domain visual question answering tasks.
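
One way to picture the Adaptive Weight Fusion Module, under our own simplifying assumptions rather than the released TriMMF code, is a learned gate that blends coarse-grained and fine-grained interaction features:

    # Assumed reading of the Adaptive Weight Fusion Module: a sigmoid gate,
    # computed from both inputs, blends the coarse- and fine-grained features.
    import torch
    import torch.nn as nn

    class AdaptiveWeightFusion(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

        def forward(self, coarse, fine):
            # coarse, fine: (B, dim) interaction features at the two granularities
            g = self.gate(torch.cat([coarse, fine], dim=-1))
            return g * coarse + (1.0 - g) * fine
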
APA, Harvard, Vancouver, ISO, and other styles
45

Zhou, Luowei, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, and Jianfeng Gao. "Unified Vision-Language Pre-Training for Image Captioning and VQA." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (2020): 13041–49. http://dx.doi.org/10.1609/aaai.v34i07.7005.

Full text
Abstract:
This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large amount of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequence-to-sequence (seq2seq) masked vision-language prediction. The two tasks differ solely in what context the prediction conditions on. This is controlled by utilizing specific self-attention masks for the shared transformer network. To the best of our knowledge, VLP is the first reported model that achieves state-of-the-art results on both vision-language generation and understanding tasks, as disparate as image captioning and visual question answering, across three challenging benchmark datasets: COCO Captions, Flickr30k Captions, and VQA 2.0. The code and the pre-trained models are available at https://github.com/LuoweiZhou/VLP.
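
The mask-controlled distinction between the bidirectional and seq2seq objectives can be sketched as follows, assuming a token layout of [visual tokens | text tokens]; the layout and helper are illustrative, not the released VLP code.

    # Illustrative mask construction for a shared transformer over the sequence
    # [visual tokens | text tokens]: full attention for the bidirectional
    # objective, visual-context plus left-to-right text attention for seq2seq.
    import torch

    def build_attention_mask(num_visual, num_text, seq2seq):
        n = num_visual + num_text
        if not seq2seq:
            return torch.ones(n, n, dtype=torch.bool)     # bidirectional: everything attends to everything
        mask = torch.zeros(n, n, dtype=torch.bool)
        mask[:, :num_visual] = True                       # every position sees the visual context
        causal = torch.tril(torch.ones(num_text, num_text, dtype=torch.bool))
        mask[num_visual:, num_visual:] = causal           # text positions attend only to earlier text
        return mask
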
APA, Harvard, Vancouver, ISO, and other styles
46

Katz, Chaim N., Kramay Patel, Omid Talakoub, David Groppe, Kari Hoffman, and Taufik A. Valiante. "Differential Generation of Saccade, Fixation, and Image-Onset Event-Related Potentials in the Human Mesial Temporal Lobe." Cerebral Cortex 30, no. 10 (2020): 5502–16. http://dx.doi.org/10.1093/cercor/bhaa132.

Full text
Abstract:
Event-related potentials (ERPs) are a commonly used electrophysiological signature for studying mesial temporal lobe (MTL) function during visual memory tasks. The ERPs associated with the onset of visual stimuli (image-onset) and eye movements (saccades and fixations) provide insights into the mechanisms of their generation. We hypothesized that since eye movements and image-onset provide MTL structures with salient visual information, perhaps they both engage similar neural mechanisms. To explore this question, we used intracranial electroencephalographic data from the MTLs of 11 patients with medically refractory epilepsy who participated in a visual search task. We characterized the electrophysiological responses of MTL structures to saccades, fixations, and image-onset. We demonstrated that the image-onset response is an evoked/additive response with a low-frequency power increase. In contrast, ERPs following eye movements appeared to arise from phase resetting of higher frequencies than the image-onset ERP. Intriguingly, this reset was associated with saccade onset and not termination (fixation), suggesting it is likely the MTL response to a corollary discharge, rather than a response to visual stimulation. We discuss the distinct mechanistic underpinnings of these responses which shed light on the underlying neural circuitry involved in visual memory processing.
APA, Harvard, Vancouver, ISO, and other styles
47

Kumar Singh, Ashutosh, Anish Khobragade, and Vikas Kanake. "Image Story: Enhanced Cognitive Visual Narrative System." International Journal of Advanced Research 13, no. 06 (2025): 1218–31. https://doi.org/10.21474/ijar01/21185.

Full text
Abstract:
This paper presents the Enhanced Cognitive Visual Narrative System (ECVNS), a sophisticated multi-modal artificial intelligence framework designed for automated visual storytelling. The system integrates multiple state-of-the-art deep learning models including OWLv2 for object detection, BLIP for image captioning and visual question answering, CLIP for emotional analysis, and ViLT for scene understanding. The framework demonstrates the capability to generate coherent, contextually relevant narratives in six languages based on comprehensive visual analysis. Our approach combines computer vision techniques with natural language generation to create a unified system that can understand visual content at multiple semantic levels and translate this understanding into creative storytelling. The system achieves high accuracy in object detection, scene understanding, and emotional inference, resulting in narratives that demonstrate both technical precision and creative quality. This work contributes to the advancing field of multimodal AI and has applications in content creation, accessibility, education, and entertainment.
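
A minimal sketch of such a multi-model analysis stage, assuming the publicly available Hugging Face checkpoints for the model families named above (the authors' exact models, prompts, and narrative generator may differ), could look like this:

    # Minimal sketch using public Hugging Face checkpoints for the named model
    # families; candidate labels, prompts, and the downstream story generator
    # are placeholders.
    from transformers import pipeline
    from PIL import Image

    detector = pipeline("zero-shot-object-detection", model="google/owlv2-base-patch16-ensemble")
    captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
    vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
    emotion = pipeline("zero-shot-image-classification", model="openai/clip-vit-base-patch32")

    def analyze(image_path):
        image = Image.open(image_path)
        objects = detector(image, candidate_labels=["person", "car", "dog", "tree"])
        caption = captioner(image)[0]["generated_text"]
        scene = vqa(image=image, question="Where was this photo taken?")[0]["answer"]
        mood = emotion(image, candidate_labels=["joyful", "calm", "tense", "sad"])[0]["label"]
        # the collected signals would then be passed to a text generator for the story
        return {"objects": objects, "caption": caption, "scene": scene, "mood": mood}
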
APA, Harvard, Vancouver, ISO, and other styles
48

Reddy, Revant Gangi, Xilin Rui, Manling Li, et al. "MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (2022): 11200–11208. http://dx.doi.org/10.1609/aaai.v36i10.21370.

Full text
Abstract:
Recently, there has been an increasing interest in building question answering (QA) models that reason across multiple modalities, such as text and images. However, QA using images is often limited to just picking the answer from a pre-defined set of options. In addition, images in the real world, especially in news, have objects that are co-referential to the text, with complementary information from both modalities. In this paper, we present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text. Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to and then predicting a span from the news body text to answer the question. In addition, we introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that can provide weak supervision for this task. We evaluate both pipeline-based and end-to-end pretraining-based multimedia QA models on our benchmark, and show that they achieve promising performance, while considerably lagging behind human performance, hence leaving large room for future work on this challenging new task.
APA, Harvard, Vancouver, ISO, and other styles
49

Sejati, Sadewa Purba, and Ifnu Rifki Nurhidayanto. "Peningkatan Literasi Sumber Daya Air Tanah Menggunakan Media Interaktif Berbasis Android [Improving Groundwater Resource Literacy Using Android-Based Interactive Media]." Dinamisia : Jurnal Pengabdian Kepada Masyarakat 6, no. 6 (2022): 1454–60. http://dx.doi.org/10.31849/dinamisia.v6i6.11118.

Full text
Abstract:
Groundwater is one of the elements of the geosphere that plays an important role in achieving sustainable development. Declines in groundwater quantity and quality are very likely to occur due to intensive anthropogenic activity that often ignores environmental rules, and this neglect is rooted in a lack of literacy and knowledge of groundwater science. Groundwater literacy therefore needs to reach all levels of society, especially the younger generation as the successors of sustainable development, so that groundwater sustainability is maintained. Literacy resources for building this insight need to contain visual elements, animations, and descriptions, and should be accessible from Android-based smartphones. Solutions to the partners' problems were realized through training, discussion, and question-and-answer activities. The training activities provided an understanding of how to download, install, and use the Groundwater App on a smartphone. The discussion and question-and-answer activities examined the visual and interactive substance presented by the application. The activities that have been carried out have increased the younger generation's insight into groundwater resources.
APA, Harvard, Vancouver, ISO, and other styles
50

Restrepo, David, Chenwei Wu, Zhengxu Tang, et al. "Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 27 (2025): 28321–30. https://doi.org/10.1609/aaai.v39i27.35053.

Full text
Abstract:
Current ophthalmology clinical workflows are plagued by over-referrals, long waits, and complex and heterogeneous medical records. Large language models (LLMs) present a promising solution to automate various procedures such as triaging, preliminary tests like visual acuity assessment, and report summaries. However, LLMs have demonstrated significantly varied performance across different languages in natural language question-answering tasks, potentially exacerbating healthcare disparities in Low and Middle-Income Countries (LMICs). This study introduces the first multilingual ophthalmological question-answering benchmark with manually curated questions parallel across languages, allowing for direct cross-lingual comparisons. Our evaluation of 6 popular LLMs across 7 different languages reveals substantial bias across different languages, highlighting risks for clinical deployment of LLMs in LMICs. Existing debiasing methods such as Translation Chain-of-Thought or retrieval-augmented generation (RAG) by themselves fall short of closing this performance gap, often failing to improve performance across all languages and lacking specificity for the medical domain. To address this issue, we propose CLARA (Cross-Lingual Reflective Agentic system), a novel inference-time debiasing method leveraging retrieval-augmented generation and self-verification. Our approach not only improves performance across all languages but also significantly reduces the multilingual bias gap, facilitating equitable LLM application across the globe.
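
Our reading of the described inference-time procedure, retrieval-augmented answering followed by a reflective self-verification pass, can be sketched as below; llm, translate, and retrieve are hypothetical helpers standing in for whatever backends are actually used.

    # Rough sketch of the described idea; llm, translate, and retrieve are
    # hypothetical callables, and the prompts are placeholders.
    def clara_answer(question, source_lang, llm, translate, retrieve, max_rounds=2):
        q_en = translate(question, source_lang, "en")        # normalize the query language
        context = retrieve(q_en)                             # domain-specific retrieval
        answer = llm(f"Context: {context}\nQuestion: {q_en}\nAnswer:")
        for _ in range(max_rounds):                          # reflective self-verification
            verdict = llm(f"Context: {context}\nQuestion: {q_en}\nAnswer: {answer}\n"
                          "If the answer is supported by the context, reply OK; "
                          "otherwise reply with a corrected answer.")
            if verdict.strip().upper().startswith("OK"):
                break
            answer = verdict
        return translate(answer, "en", source_lang)          # return in the user's language
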
APA, Harvard, Vancouver, ISO, and other styles