Log in

Relevant bibliographies by topics / Visual question generation / Journal articles

To see the other types of publications on this topic, follow the link: Visual question generation.

Journal articles on the topic 'Visual question generation'

Author: Grafiati

Published: 25 May 2024

Last updated: 31 July 2025

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Select a source type:

Consult the top 50 journal articles for your research on the topic 'Visual question generation.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press on it, and we will generate automatically the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as pdf and read online its abstract whenever available in the metadata.

Browse journal articles on a wide variety of disciplines and organise your bibliography correctly.

1

Patil, Charulata, and Manasi Patwardhan. "Visual Question Generation." ACM Computing Surveys 53, no. 3 (2020): 1–22. http://dx.doi.org/10.1145/3383465.

Full text

APA, Harvard, Vancouver, ISO, and other styles

2

Liu, Hongfei, Jiali Chen, Wenhao Fang, Jiayuan Xie, and Yi Cai. "Category-Guided Visual Question Generation (Student Abstract)." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 13 (2023): 16262–63. http://dx.doi.org/10.1609/aaai.v37i13.26991.

Full text

Abstract:

Visual question generation aims to generate high-quality questions related to images. Generating questions based only on images can better reduce labor costs and thus be easily applied. However, their methods tend to generate similar general questions that fail to ask questions about the specific content of each image scene. In this paper, we propose a category-guided visual question generation model that can generate questions with multiple categories that focus on different objects in an image. Specifically, our model first selects the appropriate question category based on the objects in th

APA, Harvard, Vancouver, ISO, and other styles

3

Xie, Jiayuan, Mengqiu Cheng, Xinting Zhang, et al. "Explicitly Guided Difficulty-Controllable Visual Question Generation." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 24 (2025): 25552–60. https://doi.org/10.1609/aaai.v39i24.34745.

Full text

Abstract:

Visual question generation (VQG) aims to generate questions from images automatically. While existing studies primarily focus on the quality of generated questions, such as fluency and relevance, the difficulty of the questions is also a crucial factor in assessing their quality. Question difficulty directly impacts the effectiveness of VQG systems in applications like education and human-computer interaction, where appropriately challenging questions can stimulate learning interest and improve interaction experiences. However, accurately defining and controlling question difficulty is a chall

APA, Harvard, Vancouver, ISO, and other styles

4

Mi, Li, Syrielle Montariol, Javiera Castillo Navarro, Xianjie Dai, Antoine Bosselut, and Devis Tuia. "ConVQG: Contrastive Visual Question Generation with Multimodal Guidance." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 5 (2024): 4207–15. http://dx.doi.org/10.1609/aaai.v38i5.28216.

Full text

Abstract:

Asking questions about visual environments is a crucial way for intelligent agents to understand rich multi-faceted scenes, raising the importance of Visual Question Generation (VQG) systems. Apart from being grounded to the image, existing VQG systems can use textual constraints, such as expected answers or knowledge triplets, to generate focused questions. These constraints allow VQG systems to specify the question content or leverage external commonsense knowledge that can not be obtained from the image content only. However, generating focused questions using textual constraints while enfo

APA, Harvard, Vancouver, ISO, and other styles

5

Pang, Wei, and Xiaojie Wang. "Visual Dialogue State Tracking for Question Generation." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (2020): 11831–38. http://dx.doi.org/10.1609/aaai.v34i07.6856.

Full text

Abstract:

GuessWhat?! is a visual dialogue task between a guesser and an oracle. The guesser aims to locate an object supposed by the oracle oneself in an image by asking a sequence of Yes/No questions. Asking proper questions with the progress of dialogue is vital for achieving successful final guess. As a result, the progress of dialogue should be properly represented and tracked. Previous models for question generation pay less attention on the representation and tracking of dialogue states, and therefore are prone to asking low quality questions such as repeated questions. This paper proposes visual

APA, Harvard, Vancouver, ISO, and other styles

6

Sarrouti, Mourad, Asma Ben Abacha, and Dina Demner-Fushman. "Goal-Driven Visual Question Generation from Radiology Images." Information 12, no. 8 (2021): 334. http://dx.doi.org/10.3390/info12080334.

Full text

Abstract:

Visual Question Generation (VQG) from images is a rising research topic in both fields of natural language processing and computer vision. Although there are some recent efforts towards generating questions from images in the open domain, the VQG task in the medical domain has not been well-studied so far due to the lack of labeled data. In this paper, we introduce a goal-driven VQG approach for radiology images called VQGRaD that generates questions targeting specific image aspects such as modality and abnormality. In particular, we study generating natural language questions based on the vis

APA, Harvard, Vancouver, ISO, and other styles

7

Srinivas, Dr Rhea. "VISUAL QUESTION ANSWERING." International Scientific Journal of Engineering and Management 04, no. 04 (2025): 1–7. https://doi.org/10.55041/isjem03029.

Full text

Abstract:

Abstract - Vision-Language Pre-Training (VLP) significantly improves performance for a variety of multimodal tasks. However, existing models are often specialized in understanding or generation, which limits their versatility. Furthermore, trust in text data for large, loud web text remains the optimal approach for monitoring. To address these challenges, we propose VLX, a uniform VLP framework that distinguishes both vision languages and generation tasks. VLX introduces a new type of data optimization strategy. This strategy allows the generator to create high-quality synthetic training data,

APA, Harvard, Vancouver, ISO, and other styles

8

Kamala, M. "Visual Question Generation from Remote Sensing Images Using Gemini API." International Journal for Research in Applied Science and Engineering Technology 12, no. 3 (2024): 2924–29. http://dx.doi.org/10.22214/ijraset.2024.59537.

Full text

Abstract:

Abstract: Visual Question Generation Extracting Information from Remote Sensing Images Remote Sensing Images plays a vital role in understanding and extracting information from aerial and satellite images. Utilizing Bidirectional Encoder Representation from Transformers (BERT) for extracting valuable insights from remote sensing images. Gemini Application Programming Interface(API), and Convolution Neural Networks (CNNs) are used. First, The proposed methodology employs CNN to extract high-level features from remote sensing images, capturing spatial data and generatingquestions. Similarly, the

APA, Harvard, Vancouver, ISO, and other styles

9

Kachare, Atul, Mukesh Kalla, and Ashutosh Gupta. "Visual Question Generation Answering (VQG-VQA) using Machine Learning Models." WSEAS TRANSACTIONS ON SYSTEMS 22 (June 28, 2023): 663–70. http://dx.doi.org/10.37394/23202.2023.22.67.

Full text

Abstract:

Presented automated visual question-answer system generates graphics-based question-answer pairs. The system consists of the Visual Query Generation (VQG) and Visual Question Answer (VQA) modules. VQG generates questions based on visual cues, and VQA provides matching answers to the VQG modules. VQG system generates questions using LSTM and VGG19 model, training parameters, and predicting words with the highest probability for output. VQA uses VGG-19 convolutional neural network for image encoding, embedding, and multilayer perceptron for high-quality responses. The proposed system reduces the

APA, Harvard, Vancouver, ISO, and other styles

10

Sandhya, Vidyashankar, Vahi Rakshit, Karkhanis Yash, and Srinivasa Gowri. "Vis Quelle: Visual Question-based Elementary Learning Companion a system to Facilitate Learning Word-Object Associations." International Journal of Innovative Technology and Exploring Engineering (IJITEE) 11, no. 1 (2021): 41–49. https://doi.org/10.35940/ijitee.A9599.1111121.

Full text

Abstract:

We present an automated, visual question answering based companion – Vis Quelle - to facilitate elementary learning of word-object associations. In particular, we attempt to harness the power of machine learning models for object recognition and the understanding of combined processing of images and text data from visual-question answering to provide variety and nuance in the images associated with letters or words presented to the elementary learner. We incorporate elements such as gamification to motivate the learner by recording scores, errors, etc., to track the learner’s progr

APA, Harvard, Vancouver, ISO, and other styles

11

Zhu, He, Ren Togo, Takahiro Ogawa, and Miki Haseyama. "Diversity Learning Based on Multi-Latent Space for Medical Image Visual Question Generation." Sensors 23, no. 3 (2023): 1057. http://dx.doi.org/10.3390/s23031057.

Full text

Abstract:

Auxiliary clinical diagnosis has been researched to solve unevenly and insufficiently distributed clinical resources. However, auxiliary diagnosis is still dominated by human physicians, and how to make intelligent systems more involved in the diagnosis process is gradually becoming a concern. An interactive automated clinical diagnosis with a question-answering system and a question generation system can capture a patient’s conditions from multiple perspectives with less physician involvement by asking different questions to drive and guide the diagnosis. This clinical diagnosis process requi

APA, Harvard, Vancouver, ISO, and other styles

12

Boukhers, Zeyd, Timo Hartmann, and Jan Jürjens. "COIN: Counterfactual Image Generation for Visual Question Answering Interpretation." Sensors 22, no. 6 (2022): 2245. http://dx.doi.org/10.3390/s22062245.

Full text

Abstract:

Due to the significant advancement of Natural Language Processing and Computer Vision-based models, Visual Question Answering (VQA) systems are becoming more intelligent and advanced. However, they are still error-prone when dealing with relatively complex questions. Therefore, it is important to understand the behaviour of the VQA models before adopting their results. In this paper, we introduce an interpretability approach for VQA models by generating counterfactual images. Specifically, the generated image is supposed to have the minimal possible change to the original image and leads the V

APA, Harvard, Vancouver, ISO, and other styles

13

Yu, Ting, Zixuan Tong, Jun Yu, and Ke Zhang. "Fine-grained Adaptive Visual Prompt for Generative Medical Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 9 (2025): 9662–70. https://doi.org/10.1609/aaai.v39i9.33047.

Full text

Abstract:

Medical Visual Question Answering (MedVQA) serves as an automated medical assistant, capable of answering patient queries and aiding physician diagnoses based on medical images and questions. Recent advancements have shown that incorporating Large Language Models (LLMs) into MedVQA tasks significantly enhances the capability for answer generation. However, for tasks requiring fine-grained organ-level precise localization, relying solely on language prompts struggles to accurately locate relevant regions within medical images due to substantial background noise. To address this challenge, we ex

APA, Harvard, Vancouver, ISO, and other styles

14

Cai, Shuo, Xinzhe Han, and Shuhui Wang. "Divide-and-Conquer: Tree-structured Strategy with Answer Distribution Estimator for Goal-Oriented Visual Dialogue." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 2 (2025): 1917–25. https://doi.org/10.1609/aaai.v39i2.32187.

Full text

Abstract:

Goal-oriented visual dialogue involves multi-round interaction between artificial agents, which has been of remarkable attention due to its wide applications. Given a visual scene, this task occurs when a Questioner asks an action-oriented question and an Answerer responds with the intent of letting the Questioner know the correct action to take. The quality of questions affects the accuracy and efficiency of the target search progress. However, existing methods lack a clear strategy to guide the generation of questions, resulting in the randomness in the search process and inconvergent result

APA, Harvard, Vancouver, ISO, and other styles

15

Shridhar, Mohit, Dixant Mittal, and David Hsu. "INGRESS: Interactive visual grounding of referring expressions." International Journal of Robotics Research 39, no. 2-3 (2020): 217–32. http://dx.doi.org/10.1177/0278364919897133.

Full text

Abstract:

This article presents INGRESS, a robot system that follows human natural language instructions to pick and place everyday objects. The key question here is to ground referring expressions: understand expressions about objects and their relationships from image and natural language inputs. INGRESS allows unconstrained object categories and rich language expressions. Further, it asks questions to clarify ambiguous referring expressions interactively. To achieve these, we take the approach of grounding by generation and propose a two-stage neural-network model for grounding. The first stage uses

APA, Harvard, Vancouver, ISO, and other styles

16

Kim, Incheol. "Visual Experience-Based Question Answering with Complex Multimodal Environments." Mathematical Problems in Engineering 2020 (November 19, 2020): 1–18. http://dx.doi.org/10.1155/2020/8567271.

Full text

Abstract:

This paper proposes a novel visual experience-based question answering problem (VEQA) and the corresponding dataset for embodied intelligence research that requires an agent to do actions, understand 3D scenes from successive partial input images, and answer natural language questions about its visual experiences in real time. Unlike the conventional visual question answering (VQA), the VEQA problem assumes both partial observability and dynamics of a complex multimodal environment. To address this VEQA problem, we propose a hybrid visual question answering system, VQAS, integrating a deep neu

APA, Harvard, Vancouver, ISO, and other styles

17

Guo, Zihan, Dezhi Han, and Kuan-Ching Li. "Double-layer affective visual question answering network." Computer Science and Information Systems, no. 00 (2020): 38. http://dx.doi.org/10.2298/csis200515038g.

Full text

Abstract:

Visual Question Answering (VQA) has attracted much attention recently in both natural language processing and computer vision communities, as it offers insight into the relationships between two relevant sources of information. Tremendous advances are seen in the field of VQA due to the success of deep learning. Based upon advances and improvements, the Affective Visual Question Answering Network (AVQAN) enriches the understanding and analysis of VQA models by making use of the emotional information contained in the images to produce sensitive answers, while maintaining the same level of accur

APA, Harvard, Vancouver, ISO, and other styles

18

Singh, Anjali, Ruhi Sharma Mittal, Shubham Atreja, et al. "Automatic Generation of Leveled Visual Assessments for Young Learners." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 9713–20. http://dx.doi.org/10.1609/aaai.v33i01.33019713.

Full text

Abstract:

Images are an essential tool for communicating with children, particularly at younger ages when they are still developing their emergent literacy skills. Hence, assessments that use images to assess their conceptual knowledge and visual literacy, are an important component of their learning process. Creating assessments at scale is a challenging task, which has led to several techniques being proposed for automatic generation of textual assessments. However, none of them focuses on generating image-based assessments. To understand the manual process of creating visual assessments, we interview

APA, Harvard, Vancouver, ISO, and other styles

19

Long, Xinwei, Zhiyuan Ma, Ermo Hua, Kaiyan Zhang, Biqing Qi, and Bowen Zhou. "Retrieval-Augmented Visual Question Answering via Built-in Autoregressive Search Engines." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 23 (2025): 24723–31. https://doi.org/10.1609/aaai.v39i23.34653.

Full text

Abstract:

Retrieval-augmented generation (RAG) has emerged to address the knowledge-intensive visual question answering (VQA) task. Current methods mainly employ separate retrieval and generation modules to acquire external knowledge and generate answers, respectively. We propose ReAuSE, an alternative to the previous RAG model for the knowledge-based VQA task, which seamlessly integrates knowledge retriever into the generative multi-modal large language model, serving as a built-in search engine. Specifically, our model functions both as a generative retriever and an accurate answer generator. It not o

APA, Harvard, Vancouver, ISO, and other styles

20

Kim, Jung-Jun, Dong-Gyu Lee, Jialin Wu, Hong-Gyu Jung, and Seong-Whan Lee. "Visual question answering based on local-scene-aware referring expression generation." Neural Networks 139 (July 2021): 158–67. http://dx.doi.org/10.1016/j.neunet.2021.02.001.

Full text

APA, Harvard, Vancouver, ISO, and other styles

21

Liu, Yuhang, Daowan Peng, Wei Wei, Yuanyuan Fu, Wenfeng Xie, and Dangyang Chen. "Detection-Based Intermediate Supervision for Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 12 (2024): 14061–68. http://dx.doi.org/10.1609/aaai.v38i12.29315.

Full text

Abstract:

Recently, neural module networks (NMNs) have yielded ongoing success in answering compositional visual questions, especially those involving multi-hop visual and logical reasoning. NMNs decompose the complex question into several sub-tasks using instance-modules from the reasoning paths of that question and then exploit intermediate supervisions to guide answer prediction, thereby improving inference interpretability. However, their performance may be hindered due to sketchy modeling of intermediate supervisions. For instance, (1) a prior assumption that each instance-module refers to only one

APA, Harvard, Vancouver, ISO, and other styles

22

Feng, Chun-Mei, Yang Bai, Tao Luo, et al. "VQA4CIR: Boosting Composed Image Retrieval with Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 3 (2025): 2942–50. https://doi.org/10.1609/aaai.v39i3.32301.

Full text

Abstract:

Albeit progress has been made in Composed Image Retrieval (CIR), we empirically find that a certain percentage of failure retrieval results are not consistent with their relative captions. To address this issue, this work provides a Visual Question Answering (VQA) perspective to boost the performance of CIR. The resulting VQA4CIR is a post-processing approach and can be directly plugged into existing CIR methods. Given the top-C retrieved images by a CIR method, VQA4CIR aims to decrease the adverse effect of the failure retrieval results being inconsistent with the relative caption. To find th

APA, Harvard, Vancouver, ISO, and other styles

23

Ghosh, Akash, Arkadeep Acharya, Raghav Jain, Sriparna Saha, Aman Chadha, and Setu Sinha. "CLIPSyntel: CLIP and LLM Synergy for Multimodal Question Summarization in Healthcare." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 20 (2024): 22031–39. http://dx.doi.org/10.1609/aaai.v38i20.30206.

Full text

Abstract:

In the era of modern healthcare, swiftly generating medical question summaries is crucial for informed and timely patient care. Despite the increasing complexity and volume of medical data, existing studies have focused solely on text-based summarization, neglecting the integration of visual information. Recognizing the untapped potential of combining textual queries with visual representations of medical conditions, we introduce the Multimodal Medical Question Summarization (MMQS) Dataset. This dataset, a major contribution of our work, pairs medical queries with visual aids, facilitating a r

APA, Harvard, Vancouver, ISO, and other styles

24

闫, 婧昕. "Sub-Med VQA: A Medical Visual Question Answering Model Integrating Sub-Question Generation and Multimodal Reasoning." Statistics and Application 14, no. 02 (2025): 115–25. https://doi.org/10.12677/sa.2025.142041.

Full text

APA, Harvard, Vancouver, ISO, and other styles

25

Zhang, Lizong, Haojun Yin, Bei Hui, Sijuan Liu, and Wei Zhang. "Knowledge-Based Scene Graph Generation with Visual Contextual Dependency." Mathematics 10, no. 14 (2022): 2525. http://dx.doi.org/10.3390/math10142525.

Full text

Abstract:

Scene graph generation is the basis of various computer vision applications, including image retrieval, visual question answering, and image captioning. Previous studies have relied on visual features or incorporated auxiliary information to predict object relationships. However, the rich semantics of external knowledge have not yet been fully utilized, and the combination of visual and auxiliary information can lead to visual dependencies, which impacts relationship prediction among objects. Therefore, we propose a novel knowledge-based model with adjustable visual contextual dependency. Our

APA, Harvard, Vancouver, ISO, and other styles

26

Zhang, Weifeng, Jing Yu, Wenhong Zhao, and Chuan Ran. "DMRFNet: Deep Multimodal Reasoning and Fusion for Visual Question Answering and explanation generation." Information Fusion 72 (August 2021): 70–79. http://dx.doi.org/10.1016/j.inffus.2021.02.006.

Full text

APA, Harvard, Vancouver, ISO, and other styles

27

Lim, Youngsun, Hojun Choi, and Hyunjung Shim. "Evaluating Image Hallucination in Text-to-Image Generation with Question-Answering." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 25 (2025): 26290–98. https://doi.org/10.1609/aaai.v39i25.34827.

Full text

Abstract:

Despite the impressive success of text-to-image (TTI) models, existing studies overlook the issue of whether these models accurately convey factual information. In this paper, we focus on the problem of image hallucination, where images created by TTI models fail to faithfully depict factual content. To address this, we introduce I-HallA (Image Hallucination evaluation with Question Answering), a novel automated evaluation metric that measures the factuality of generated images through visual question answering (VQA). We also introduce I-HallA v1.0, a curated benchmark dataset for this purpose

APA, Harvard, Vancouver, ISO, and other styles

28

Zhu, He, Ren Togo, Takahiro Ogawa, and Miki Haseyama. "Multimodal Natural Language Explanation Generation for Visual Question Answering Based on Multiple Reference Data." Electronics 12, no. 10 (2023): 2183. http://dx.doi.org/10.3390/electronics12102183.

Full text

Abstract:

As deep learning research continues to advance, interpretability is becoming as important as model performance. Conducting interpretability studies to understand the decision-making processes of deep learning models can improve performance and provide valuable insights for humans. The interpretability of visual question answering (VQA), a crucial task for human–computer interaction, has garnered the attention of researchers due to its wide range of applications. The generation of natural language explanations for VQA that humans can better understand has gradually supplanted heatmap representa

APA, Harvard, Vancouver, ISO, and other styles

29

Yi, Ziruo, Ting Xiao, and Mark V. Albert. "A Survey on Multimodal Large Language Models in Radiology for Report Generation and Visual Question Answering." Information 16, no. 2 (2025): 136. https://doi.org/10.3390/info16020136.

Full text

Abstract:

Large language models (LLMs) and large vision models (LVMs) have driven significant advancements in natural language processing (NLP) and computer vision (CV), establishing a foundation for multimodal large language models (MLLMs) to integrate diverse data types in real-world applications. This survey explores the evolution of MLLMs in radiology, focusing on radiology report generation (RRG) and radiology visual question answering (RVQA), where MLLMs leverage the combined capabilities of LLMs and LVMs to improve clinical efficiency. We begin by tracing the history of radiology and the developm

APA, Harvard, Vancouver, ISO, and other styles

30

Kruchinin, Vladimir, and Vladimir Kuzovkin. "Overview of Existing Methods for Automatic Generation of Tasks with Conditions in Natural Language." Computer tools in education, no. 1 (March 28, 2022): 85–96. http://dx.doi.org/10.32603/2071-2340-2022-1-85-96.

Full text

Abstract:

The paper considers the main algorithms for generating various school subject problems of closed and open type. Some of these algorythms (i.e. question answering, Visual question answering) use artificial intelligence and some not (i.e. sets of AND/OR tree, templates). It was shown that methods for generating tests using artificial intelligence have a high potential, but they require further development, in particular, the creation of large question-answer database in russian language.

APA, Harvard, Vancouver, ISO, and other styles

31

ELSHAMY, Ghada, Marco ALFONSE, Islam HEGAZY, and Mostafa AREF. "A multi-modal transformer-based model for generative visual dialog system." Applied Computer Science 21, no. 1 (2025): 1–17. https://doi.org/10.35784/acs_6856.

Full text

Abstract:

Recent advancements in generative artificial intelligence have boosted significant interest in conversational agents. The visual dialog task, a synthesis of visual question-answering and dialog systems, requires agents capable of both seeing and chatting in natural language interactions. These agents must effectively understand cross-modal contextual information and generate coherent, human-like responses to a sequence of questions about a given visual scene. Despite progress, previous approaches often required complex architectures and substantial resources. This paper introduces a generative

APA, Harvard, Vancouver, ISO, and other styles

32

Li, Xiaochuan, Baoyu Fan, Runze Zhang, et al. "Image Content Generation with Causal Reasoning." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 12 (2024): 13646–54. http://dx.doi.org/10.1609/aaai.v38i12.29269.

Full text

Abstract:

The emergence of ChatGPT has once again sparked research in generative artificial intelligence (GAI). While people have been amazed by the generated results, they have also noticed the reasoning potential reflected in the generated textual content. However, this current ability for causal reasoning is primarily limited to the domain of language generation, such as in models like GPT-3. In visual modality, there is currently no equivalent research. Considering causal reasoning in visual content generation is significant. This is because visual information contains infinite granularity. Particul

APA, Harvard, Vancouver, ISO, and other styles

33

Tanaka, Ryota, Kyosuke Nishida, and Sen Yoshida. "VisualMRC: Machine Reading Comprehension on Document Images." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 15 (2021): 13878–88. http://dx.doi.org/10.1609/aaai.v35i15.17635.

Full text

Abstract:

Recent studies on machine reading comprehension have focused on text-level understanding but have not yet reached the level of human understanding of the visual layout and content of real-world documents. In this study, we introduce a new visual machine reading comprehension dataset, named VisualMRC, wherein given a question and a document image, a machine reads and comprehends texts in the image to answer the question in natural language. Compared with existing visual question answering datasets that contain texts in images, VisualMRC focuses more on developing natural language understanding

APA, Harvard, Vancouver, ISO, and other styles

34

Kamala Mekala. "Enhancing VQA with SELM: A Multi-Model Approach Using SBERT." Journal of Information Systems Engineering and Management 10, no. 41s (2025): 795–808. https://doi.org/10.52783/jisem.v10i41s.8003.

Full text

Abstract:

In VQA or Visual Question Answering, a model is provided with an image and a natural language question related to it. For the model to generate appropriate answers, it must be able to understand both textual and visual input. However, there are still we have two key challenges persist in VQA.The first challenge is the inconsistency of answers and explanations provided by current approaches. The second is bridging the semantic gap that exists in between images and questions, resulting in explanations that are less accurate. Our goal is to reduce problems between image (any image) visual compone

APA, Harvard, Vancouver, ISO, and other styles

35

Wörgötter, Florentin, Ernst Niebur, and Christof Koch. "Generation of Direction Selectivity by Isotropic Intracortical Connections." Neural Computation 4, no. 3 (1992): 332–40. http://dx.doi.org/10.1162/neco.1992.4.3.332.

Full text

Abstract:

To what extent do the mechanisms generating different receptive field properties of neurons depend on each other? We investigated this question theoretically within the context of orientation and direction tuning of simple cells in the mammalian visual cortex. In our model a cortical cell of the "simple" type receives its orientation tuning by afferent convergence of aligned receptive fields of the lateral geniculate nucleus (Hubel and Wiesel 1962). We sharpen this orientation bias by postulating a special type of radially symmetric long-range lateral inhibition called circular inhibition. Sur

APA, Harvard, Vancouver, ISO, and other styles

36

Zhu, Qiaoyi. "A Study of the Aesthetic Art of New Patriotism in Red Film and Television Drama." International Journal of Education, Humanities and Social Sciences 1, no. 1 (2024): 16–22. http://dx.doi.org/10.70088/zb1sr964.

Full text

Abstract:

Red cinema and television works have long been pivotal components in the propagation of Chinese culture and the inculcation of patriotic education. These productions chronicle the trajectory of historical evolution, conveying the spirit of patriotism in the new era through intense emotions and strikingly impactful visuals. Against the backdrop of globalization, where cultural exchanges among nations are increasingly prevalent, the question of how China's red cinema and television can inspire the younger generation through the aesthetics of new patriotism has become a focal point of attention.

APA, Harvard, Vancouver, ISO, and other styles

37

Wang, Junjue, Zhuo Zheng, Zihang Chen, Ailong Ma, and Yanfei Zhong. "EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 6 (2024): 5481–89. http://dx.doi.org/10.1609/aaai.v38i6.28357.

Full text

Abstract:

Earth vision research typically focuses on extracting geospatial object locations and categories but neglects the exploration of relations between objects and comprehensive reasoning. Based on city planning needs, we develop a multi-modal multi-task VQA dataset (EarthVQA) to advance relational reasoning-based judging, counting, and comprehensive analysis. The EarthVQA dataset contains 6000 images, corresponding semantic masks, and 208,593 QA pairs with urban and rural governance requirements embedded. As objects are the basis for complex relational reasoning, we propose a Semantic OBject Aware

APA, Harvard, Vancouver, ISO, and other styles

38

BELZ, A., T. L. BERG, and L. YU. "From image to language and back again." Natural Language Engineering 24, no. 3 (2018): 325–62. http://dx.doi.org/10.1017/s1351324918000086.

Full text

Abstract:

Work in computer vision and natural language processing involving images and text has been experiencing explosive growth over the past decade, with a particular boost coming from the neural network revolution. The present volume brings together five research articles from several different corners of the area: multilingual multimodal image description (Franket al.), multimodal machine translation (Madhyasthaet al., Franket al.), image caption generation (Madhyasthaet al., Tantiet al.), visual scene understanding (Silbereret al.), and multimodal learning of high-level attributes (Sorodocet al.)

APA, Harvard, Vancouver, ISO, and other styles

39

Abrecht, Stephanie, Lydia Gauerhof, Christoph Gladisch, Konrad Groh, Christian Heinzemann, and Matthias Woehrle. "Testing Deep Learning-based Visual Perception for Automated Driving." ACM Transactions on Cyber-Physical Systems 5, no. 4 (2021): 1–28. http://dx.doi.org/10.1145/3450356.

Full text

Abstract:

Due to the impressive performance of deep neural networks (DNNs) for visual perception, there is an increased demand for their use in automated systems. However, to use deep neural networks in practice, novel approaches are needed, e.g., for testing. In this work, we focus on the question of how to test deep learning-based visual perception functions for automated driving. Classical approaches for testing are not sufficient: A purely statistical approach based on a dataset split is not enough, as testing needs to address various purposes and not only average case performance. Additionally, a c

APA, Harvard, Vancouver, ISO, and other styles

40

Cheng, Zesen, Kehan Li, Peng Jin, et al. "Parallel Vertex Diffusion for Unified Visual Grounding." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 2 (2024): 1326–34. http://dx.doi.org/10.1609/aaai.v38i2.27896.

Full text

Abstract:

Unified visual grounding (UVG) capitalizes on a wealth of task-related knowledge across various grounding tasks via one-shot training, which curtails retraining costs and task-specific architecture design efforts. Vertex generation-based UVG methods achieve this versatility by unified modeling object box and contour prediction and provide a text-powered interface to vast related multi-modal tasks, e.g., visual question answering and captioning. However, these methods typically generate vertexes sequentially through autoregression, which is prone to be trapped in error accumulation and heavy co

APA, Harvard, Vancouver, ISO, and other styles

41

Pettitt, Joanne. "Visual-Textual Encounters with a German Grandfather: The Work of Angela Findlay." Jewish Film & New Media: An International Journal 11, no. 1 (2023): 90–115. http://dx.doi.org/10.1353/jfn.2023.a937530.

Full text

Abstract:

ABSTRACT: In this article, I consider the artistic and literary work of Angela Findlay, the daughter of a German mother and English father, and the granddaughter of a highly decorated Wehrmacht soldier. Working as a visual artist, public speaker and writer, Findlay moves between representational forms in order to express the complexities associated with her dual heritage and the "legacy of shame" that she carries with her. I take Findlay's In My Grandfather's Shadow as a case study that foregrounds new forms of visual-textual witnessing in the descendants of the war generation. I conclude that

APA, Harvard, Vancouver, ISO, and other styles

42

Khademi, Mahmoud, and Oliver Schulte. "Deep Generative Probabilistic Graph Neural Networks for Scene Graph Generation." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (2020): 11237–45. http://dx.doi.org/10.1609/aaai.v34i07.6783.

Full text

Abstract:

We propose a new algorithm, called Deep Generative Probabilistic Graph Neural Networks (DG-PGNN), to generate a scene graph for an image. The input to DG-PGNN is an image, together with a set of region-grounded captions and object bounding-box proposals for the image. To generate the scene graph, DG-PGNN constructs and updates a new model, called a Probabilistic Graph Network (PGN). A PGN can be thought of as a scene graph with uncertainty: it represents each node and each edge by a CNN feature vector and defines a probability mass function (PMF) for node-type (object category) of each node an

APA, Harvard, Vancouver, ISO, and other styles

43

Liu, Xiulong, Sudipta Paul, Moitreya Chatterjee, and Anoop Cherian. "CAVEN: An Embodied Conversational Agent for Efficient Audio-Visual Navigation in Noisy Environments." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 4 (2024): 3765–73. http://dx.doi.org/10.1609/aaai.v38i4.28167.

Full text

Abstract:

Audio-visual navigation of an agent towards locating an audio goal is a challenging task especially when the audio is sporadic or the environment is noisy. In this paper, we present CAVEN, a Conversation-based Audio-Visual Embodied Navigation framework in which the agent may interact with a human/oracle for solving the task of navigating to an audio goal. Specifically, CAVEN is modeled as a budget-aware partially observable semi-Markov decision process that implicitly learns the uncertainty in the audio-based navigation policy to decide when and how the agent may interact with the oracle. Our

APA, Harvard, Vancouver, ISO, and other styles

44

Zhao, Chengfang, Mingwei Tang, Yanxi Zheng, and Chaocong Ran. "An Adaptive Multimodal Fusion Network Based on Multilinear Gradients for Visual Question Answering." Electronics 14, no. 1 (2024): 9. https://doi.org/10.3390/electronics14010009.

Full text

Abstract:

As an interdisciplinary field of natural language processing and computer vision, Visual Question Answering (VQA) has emerged as a prominent research focus in artificial intelligence. The core of the VQA task is to combine natural language understanding and image analysis to infer answers by extracting meaningful features from textual and visual inputs. However, most current models struggle to fully capture the deep semantic relationships between images and text owing to their limited capacity to comprehend feature interactions, which constrains their performance. To address these challenges,

APA, Harvard, Vancouver, ISO, and other styles

45

Zhou, Luowei, Hamid Palangi, Lei Zhang, Houdong Hu, Jason Corso, and Jianfeng Gao. "Unified Vision-Language Pre-Training for Image Captioning and VQA." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (2020): 13041–49. http://dx.doi.org/10.1609/aaai.v34i07.7005.

Full text

Abstract:

This paper presents a unified Vision-Language Pre-training (VLP) model. The model is unified in that (1) it can be fine-tuned for either vision-language generation (e.g., image captioning) or understanding (e.g., visual question answering) tasks, and (2) it uses a shared multi-layer transformer network for both encoding and decoding, which differs from many existing methods where the encoder and decoder are implemented using separate models. The unified VLP model is pre-trained on a large amount of image-text pairs using the unsupervised learning objectives of two tasks: bidirectional and sequ

APA, Harvard, Vancouver, ISO, and other styles

46

Katz, Chaim N., Kramay Patel, Omid Talakoub, David Groppe, Kari Hoffman, and Taufik A. Valiante. "Differential Generation of Saccade, Fixation, and Image-Onset Event-Related Potentials in the Human Mesial Temporal Lobe." Cerebral Cortex 30, no. 10 (2020): 5502–16. http://dx.doi.org/10.1093/cercor/bhaa132.

Full text

Abstract:

Abstract Event-related potentials (ERPs) are a commonly used electrophysiological signature for studying mesial temporal lobe (MTL) function during visual memory tasks. The ERPs associated with the onset of visual stimuli (image-onset) and eye movements (saccades and fixations) provide insights into the mechanisms of their generation. We hypothesized that since eye movements and image-onset provide MTL structures with salient visual information, perhaps they both engage similar neural mechanisms. To explore this question, we used intracranial electroencephalographic data from the MTLs of 11 pa

APA, Harvard, Vancouver, ISO, and other styles

47

Kumar Singh, Ashutosh, Anish Khobragade, and Vikas Kanake. "IMAGE STORY: ENHANCED COGNITIVE VISUAL NARRATIVE SYSTEM." International Journal of Advanced Research 13, no. 06 (2025): 1218–31. https://doi.org/10.21474/ijar01/21185.

Full text

Abstract:

This paper presents the Enhanced Cognitive Visual Narrative System (ECVNS), a sophisticated multi-modal artificial intelligence framework designed for automated visual storytelling. The system integrates multiple state-of-the-art deep learning models including OWLv2for object detection, BLIP for image captioning and visual question answering, CLIP for emotional analysis, and ViLT for scene understanding. The framework demonstrates the capability to generate coherent, contextually relevant narratives in six languages based on comprehensive visual analysis. Our approach combines computer vision

APA, Harvard, Vancouver, ISO, and other styles

48

Reddy, Revant Gangi, Xilin Rui, Manling Li, et al. "MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (2022): 11200–11208. http://dx.doi.org/10.1609/aaai.v36i10.21370.

Full text

Abstract:

Recently, there has been an increasing interest in building question answering (QA) models that reason across multiple modalities, such as text and images. However, QA using images is often limited to just picking the answer from a pre-defined set of options. In addition, images in the real world, especially in news, have objects that are co-referential to the text, with complementary information from both modalities. In this paper, we present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text. Specifically, t

APA, Harvard, Vancouver, ISO, and other styles

49

Sejati, Sadewa Purba, and Ifnu Rifki Nurhidayanto. "Peningkatan Literasi Sumber Daya Air Tanah Menggunakan Media Interaktif Berbasis Android." Dinamisia : Jurnal Pengabdian Kepada Masyarakat 6, no. 6 (2022): 1454–60. http://dx.doi.org/10.31849/dinamisia.v6i6.11118.

Full text

Abstract:

Groundwater is one of the elements of the geosphere that plays an important role in achieving sustainable development. The decrease in the quantity and quality of groundwater is very likely to occur due to intensive anthropogenic dynamics which often ignore environmental rules. The neglect of environmental rules that have the potential to reduce the quantity and quality of groundwater resources is caused by a lack of literacy and knowledge of groundwater science. Literacy of groundwater resources needs to be applied to all levels of society, especially the younger generation as the successor o

APA, Harvard, Vancouver, ISO, and other styles

50

Restrepo, David, Chenwei Wu, Zhengxu Tang, et al. "Multi-OphthaLingua: A Multilingual Benchmark for Assessing and Debiasing LLM Ophthalmological QA in LMICs." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 27 (2025): 28321–30. https://doi.org/10.1609/aaai.v39i27.35053.

Full text

Abstract:

Current ophthalmology clinical workflows are plagued by over-referrals, long waits, and complex and heterogeneous medical records. Large language models (LLMs) present a promising solution to automate various procedures such as triaging, preliminary tests like visual acuity assessment, and report summaries. However, LLMs have demonstrated significantly varied performance across different languages in natural language question-answering tasks, potentially exacerbating healthcare disparities in Low and Middle-Income Countries (LMICs). This study introduces the first multilingual ophthalmological

APA, Harvard, Vancouver, ISO, and other styles

We offer discounts on all premium plans for authors whose works are included in thematic literature selections. Contact us to get a unique promo code!