Journal articles on the topic 'Multi-Modal representations'


Consult the top 50 journal articles for your research on the topic 'Multi-Modal representations.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles in a wide variety of disciplines and organise your bibliography correctly.

1

Wu, Lianlong, Seewon Choi, Daniel Raggi, et al. "Generation of Visual Representations for Multi-Modal Mathematical Knowledge." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 21 (2024): 23850–52. http://dx.doi.org/10.1609/aaai.v38i21.30586.

Abstract:
In this paper we introduce MaRE, a tool designed to generate representations in multiple modalities for a given mathematical problem while ensuring the correctness and interpretability of the transformations between different representations. The theoretical foundation for this tool is Representational Systems Theory (RST), a mathematical framework for studying the structure and transformations of representations. In MaRE’s web front-end user interface, a set of probability equations in Bayesian Notation can be rigorously transformed into Area Diagrams, Contingency Tables, and Probability Tree
2

Zhang, Yi, Mingyuan Chen, Jundong Shen, and Chongjun Wang. "Tailor Versatile Multi-Modal Learning for Multi-Label Emotion Recognition." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 8 (2022): 9100–9108. http://dx.doi.org/10.1609/aaai.v36i8.20895.

Abstract:
Multi-modal Multi-label Emotion Recognition (MMER) aims to identify various human emotions from heterogeneous visual, audio and text modalities. Previous methods mainly focus on projecting multiple modalities into a common latent space and learning an identical representation for all labels, which neglects the diversity of each modality and fails to capture richer semantic information for each label from different perspectives. Besides, associated relationships of modalities and labels have not been fully exploited. In this paper, we propose versaTile multi-modAl learning for multI-labeL emOti
3

Zhang, Yichi, Zhuo Chen, Lingbing Guo, et al. "Tokenization, Fusion, and Augmentation: Towards Fine-grained Multi-modal Entity Representation." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 12 (2025): 13322–30. https://doi.org/10.1609/aaai.v39i12.33454.

Abstract:
Multi-modal knowledge graph completion (MMKGC) aims to discover unobserved knowledge from given multi-modal knowledge graphs (MMKG), collaboratively leveraging structural information from the triples and multi-modal information of the entities to overcome the inherent incompleteness. Existing MMKGC methods usually extract multi-modal features with pre-trained models and employ fusion modules to integrate multi-modal features for the entities. This often results in coarse handling of multi-modal entity information, overlooking the nuanced, fine-grained semantic details and their complex interac
4

Dixitha, Bandi. "Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition." International Journal of Scientific Research in Engineering and Management 9, no. 6 (2025): 1–9. https://doi.org/10.55041/ijsrem49325.

Abstract:
In this paper, we propose a novel Multi-Stage Multi-Modal Pre-Training framework for Automatic Speech Recognition (ASR) that effectively leverages the complementary information from multiple modalities, such as audio, text, and visual context, to enhance model performance. Our approach consists of three sequential pre-training stages: (1) a Masked Audio Encoding (MAE) stage that learns robust acoustic representations by reconstructing masked segments of speech, (2) a Cross-Modal Learning Regularization (CLR) stage that aligns acoustic and visual-textual representations using a contr
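
The Masked Audio Encoding stage named in this abstract follows the general masked-prediction recipe used in self-supervised speech models. The snippet below is a minimal NumPy sketch of that recipe, not the paper's implementation: spans of a toy acoustic feature sequence are masked and the reconstruction loss is computed only on the masked positions. `mask_spans` and the noisy stand-in "reconstruction" are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mask_spans(num_frames, span=10, mask_ratio=0.3):
    """Randomly mask contiguous spans of frames until the target ratio is reached."""
    mask = np.zeros(num_frames, dtype=bool)
    while mask.mean() < mask_ratio:
        start = rng.integers(0, num_frames - span)
        mask[start:start + span] = True
    return mask

def masked_reconstruction_loss(features, reconstruction, mask):
    """MSE measured only on the masked frames, as in masked-prediction objectives."""
    diff = features[mask] - reconstruction[mask]
    return float(np.mean(diff ** 2))

# Toy stand-in for 200 frames of 80-dim log-mel features.
features = rng.normal(size=(200, 80))
mask = mask_spans(len(features))
# A real audio encoder would predict the masked frames; a noisy copy stands in here.
reconstruction = features + 0.1 * rng.normal(size=features.shape)
print(int(mask.sum()), round(masked_reconstruction_loss(features, reconstruction, mask), 4))
```
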
5

Zhang, Dong, Suzhong Wei, Shoushan Li, Hanqian Wu, Qiaoming Zhu, and Guodong Zhou. "Multi-modal Graph Fusion for Named Entity Recognition with Targeted Visual Guidance." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 16 (2021): 14347–55. http://dx.doi.org/10.1609/aaai.v35i16.17687.

Abstract:
Multi-modal named entity recognition (MNER) aims to discover named entities in free text and classify them into pre-defined types with images. However, dominant MNER models do not fully exploit fine-grained semantic correspondences between semantic units of different modalities, which have the potential to refine multi-modal representation learning. To deal with this issue, we propose a unified multi-modal graph fusion (UMGF) approach for MNER. Specifically, we first represent the input sentence and image using a unified multi-modal graph, which captures various semantic relationships between
6

Liu, Hao, Jindong Han, Yanjie Fu, Jingbo Zhou, Xinjiang Lu, and Hui Xiong. "Multi-modal transportation recommendation with unified route representation learning." Proceedings of the VLDB Endowment 14, no. 3 (2020): 342–50. http://dx.doi.org/10.14778/3430915.3430924.

Abstract:
Multi-modal transportation recommendation aims to provide the most appropriate travel route with various transportation modes according to certain criteria. After analyzing large-scale navigation data, we find that route representations exhibit two patterns: spatio-temporal autocorrelations within transportation networks and the semantic coherence of route sequences. However, there are few studies that consider both patterns when developing multi-modal transportation systems. To this end, in this paper, we study multi-modal transportation recommendation with unified route representation learni
7

Wang, Huansha, Qinrang Liu, Ruiyang Huang, and Jianpeng Zhang. "Multi-Modal Entity Alignment Method Based on Feature Enhancement." Applied Sciences 13, no. 11 (2023): 6747. http://dx.doi.org/10.3390/app13116747.

Abstract:
Multi-modal entity alignment refers to identifying equivalent entities between two different multi-modal knowledge graphs that consist of multi-modal information such as structural triples and descriptive images. Most previous multi-modal entity alignment methods have mainly used corresponding encoders of each modality to encode entity information and then perform feature fusion to obtain the multi-modal joint representation. However, this approach does not fully utilize the multi-modal information of aligned entities. To address this issue, we propose MEAFE, a multi-modal entity alignment met
8

Hu, Shizhe, Jiahao Fan, Guoliang Zou, and Yangdong Ye. "Multi-aspect Self-guided Deep Information Bottleneck for Multi-modal Clustering." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 16 (2025): 17314–22. https://doi.org/10.1609/aaai.v39i16.33903.

Abstract:
Deep multi-modal clustering can extract useful information among modals, thus benefiting the final clustering and many related fields. However, existing multi-modal clustering methods have two major limitations. First, they often ignore different levels of guiding information from both the feature representations and cluster assignments, which thus are difficult in learning discriminative representations. Second, most methods fail to effectively eliminate redundant information between multi-modal data, negatively affecting clustering results. In this paper, we propose a novel multi-aspect self
9

Wu, Tianxing, Chaoyu Gao, Lin Li, and Yuxiang Wang. "Leveraging Multi-Modal Information for Cross-Lingual Entity Matching across Knowledge Graphs." Applied Sciences 12, no. 19 (2022): 10107. http://dx.doi.org/10.3390/app121910107.

Abstract:
In recent years, the scale of knowledge graphs and the number of entities have grown rapidly. Entity matching across different knowledge graphs has become an urgent problem to be solved for knowledge fusion. With the importance of entity matching being increasingly evident, the use of representation learning technologies to find matched entities has attracted extensive attention due to the computability of vector representations. However, existing studies on representation learning technologies cannot make full use of knowledge graph relevant multi-modal information. In this paper, we propose
10

Sun, Shuoji, Miao Yu, and Xu Yu. "Diversified Interpretable Compatibility Modeling Based on Multi-modal Disentanglement." Applied and Computational Engineering 163, no. 1 (2025): 66–78. https://doi.org/10.54254/2755-2721/2025.24500.

Abstract:
In recent years, compatibility modeling for evaluating whether fashion items match has received widespread attention. The existing compatibility modeling methods typically model the compatibility between fashion items based on multi-modal information. However, these methods often fail to disentangle the rich attribute information in the high-dimensional continuous representations of items, resulting in a lack of interpretability in recommendations. At the same time, they also overlook the diverse matching methods among the attributes of complementary items. This article proposes a Diversified
11

Huang, Yufeng, Jiji Tang, Zhuo Chen, et al. "Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-Modal Structured Representations." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 3 (2024): 2417–25. http://dx.doi.org/10.1609/aaai.v38i3.28017.

Abstract:
Large-scale vision-language pre-training has achieved significant performance in multi-modal understanding and generation tasks. However, existing methods often perform poorly on image-text matching tasks that require structured representations, i.e., representations of objects, attributes, and relations. The models cannot make a distinction between "An astronaut rides a horse" and "A horse rides an astronaut". This is because they fail to fully leverage structured knowledge when learning multi-modal representations. In this paper, we present an end-to-end framework Structure-CLIP, which integ
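
The "astronaut rides a horse" failure quoted above is commonly attacked with structured hard negatives: captions whose subject and object roles are swapped and which the model must rank below the original. Below is a hedged, generic sketch of that idea with a hinge loss; `encode_image` and the hashed bag-of-words `encode_text` are hypothetical stand-ins, chosen precisely because an order-insensitive text encoder cannot separate the two captions, which is the failure Structure-CLIP targets.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_image(image_id, dim=64):
    """Hypothetical image encoder: returns a unit-norm embedding."""
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def encode_text(caption, dim=64):
    """Hypothetical order-insensitive text encoder (hashed bag of words)."""
    v = np.zeros(dim)
    for token in caption.lower().split():
        v[hash(token) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-8)

def swap_roles(caption, subj, obj):
    """Structured hard negative: exchange subject and object in the caption."""
    return caption.replace(subj, "<tmp>").replace(obj, subj).replace("<tmp>", obj)

def hinge(image_vec, pos_vec, neg_vec, margin=0.2):
    """The positive caption should beat the swapped negative by at least `margin`."""
    return max(0.0, margin - float(image_vec @ pos_vec - image_vec @ neg_vec))

caption = "astronaut rides horse"
negative = swap_roles(caption, "astronaut", "horse")   # "horse rides astronaut"
img = encode_image("astronaut_horse.jpg")
print(negative, round(hinge(img, encode_text(caption), encode_text(negative)), 3))
# Both captions get identical bag-of-words embeddings, so the loss equals the
# full margin (0.2): exactly the word-order blindness the paper addresses.
```
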
12

Han, Ning, Jingjing Chen, Hao Zhang, Huanwen Wang, and Hao Chen. "Adversarial Multi-Grained Embedding Network for Cross-Modal Text-Video Retrieval." ACM Transactions on Multimedia Computing, Communications, and Applications 18, no. 2 (2022): 1–23. http://dx.doi.org/10.1145/3483381.

Abstract:
Cross-modal retrieval between texts and videos has received consistent research interest in the multimedia community. Existing studies follow a trend of learning a joint embedding space to measure the distance between text and video representations. In common practice, video representation is constructed by feeding clips into 3D convolutional neural networks for a coarse-grained global visual feature extraction. In addition, several studies have attempted to align the local objects of video with the text. However, these representations share a drawback of neglecting rich fine-grained relation
13

Ying, Qichao, Xiaoxiao Hu, Yangming Zhou, Zhenxing Qian, Dan Zeng, and Shiming Ge. "Bootstrapping Multi-View Representations for Fake News Detection." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 4 (2023): 5384–92. http://dx.doi.org/10.1609/aaai.v37i4.25670.

Abstract:
Previous researches on multimedia fake news detection include a series of complex feature extraction and fusion networks to gather useful information from the news. However, how cross-modal consistency relates to the fidelity of news and how features from different modalities affect the decision-making are still open questions. This paper presents a novel scheme of Bootstrapping Multi-view Representations (BMR) for fake news detection. Given a multi-modal news, we extract representations respectively from the views of the text, the image pattern and the image semantics. Improved Multi-gate Mix
14

Kiela, Douwe, and Stephen Clark. "Learning Neural Audio Embeddings for Grounding Semantics in Auditory Perception." Journal of Artificial Intelligence Research 60 (December 26, 2017): 1003–30. http://dx.doi.org/10.1613/jair.5665.

Abstract:
Multi-modal semantics, which aims to ground semantic representations in perception, has relied on feature norms or raw image data for perceptual input. In this paper we examine grounding semantic representations in raw auditory data, using standard evaluations for multi-modal semantics. After having shown the quality of such auditorily grounded representations, we show how they can be applied to tasks where auditory perception is relevant, including two unsupervised categorization experiments, and provide further analysis. We find that features transferred from deep neural networks outperform b
15

Cui, Xiaohui, Xiaolong Qu, Dongmei Li, Yu Yang, Yuxun Li, and Xiaoping Zhang. "MKGCN: Multi-Modal Knowledge Graph Convolutional Network for Music Recommender Systems." Electronics 12, no. 12 (2023): 2688. http://dx.doi.org/10.3390/electronics12122688.

Abstract:
With the emergence of online music platforms, music recommender systems are becoming increasingly crucial in music information retrieval. Knowledge graphs (KGs) are a rich source of semantic information for entities and relations, allowing for improved modeling and analysis of entity relations to enhance recommendations. Existing research has primarily focused on the modeling and analysis of structural triples, while largely ignoring the representation and information processing capabilities of multi-modal data such as music videos and lyrics, which has hindered the improvement and user experi
16

Li, Yehao, Jiahao Fan, Yingwei Pan, Ting Yao, Weiyao Lin, and Tao Mei. "Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training." ACM Transactions on Multimedia Computing, Communications, and Applications 18, no. 2 (2022): 1–16. http://dx.doi.org/10.1145/3473140.

Abstract:
Vision-language pre-training has been an emerging and fast-developing research topic, which transfers multi-modal knowledge from rich-resource pre-training task to limited-resource downstream tasks. Unlike existing works that predominantly learn a single generic encoder, we present a pre-trainable Universal Encoder-DEcoder Network (Uni-EDEN) to facilitate both vision-language perception (e.g., visual question answering) and generation (e.g., image captioning). Uni-EDEN is a two-stream Transformer-based structure, consisting of three modules: object and sentence encoders that separately learns
17

van Tulder, Gijs, and Marleen de Bruijne. "Learning Cross-Modality Representations From Multi-Modal Images." IEEE Transactions on Medical Imaging 38, no. 2 (2019): 638–48. http://dx.doi.org/10.1109/tmi.2018.2868977.

18

Dong, Bin, Songlei Jian, and Kai Lu. "Learning Multimodal Representations by Symmetrically Transferring Local Structures." Symmetry 12, no. 9 (2020): 1504. http://dx.doi.org/10.3390/sym12091504.

Abstract:
Multimodal representations play an important role in multimodal learning tasks, including cross-modal retrieval and intra-modal clustering. However, existing multimodal representation learning approaches focus on building one common space by aligning different modalities and ignore the complementary information across the modalities, such as the intra-modal local structures. In other words, they only focus on the object-level alignment and ignore structure-level alignment. To tackle the problem, we propose a novel symmetric multimodal representation learning framework by transferring local str
19

Gu, Zhihao, Jiangning Zhang, Liang Liu, et al. "Rethinking Reverse Distillation for Multi-Modal Anomaly Detection." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 8 (2024): 8445–53. http://dx.doi.org/10.1609/aaai.v38i8.28687.

Abstract:
In recent years, there has been significant progress in employing color images for anomaly detection in industrial scenarios, but it is insufficient for identifying anomalies that are invisible in RGB images alone. As a supplement, introducing extra modalities such as depth and surface normal maps can be helpful to detect these anomalies. To this end, we present a novel Multi-Modal Reverse Distillation (MMRD) paradigm that consists of a frozen multi-modal teacher encoder to generate distillation targets and a learnable student decoder targeting to restore multi-modal representations from the t
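
Reverse distillation scores anomalies by how poorly a student can restore the frozen teacher's features. The following is a minimal, generic sketch of that scoring step, not the MMRD architecture: the per-location anomaly map is one minus the cosine similarity between teacher features and the student's restored features. The toy 8x8 feature grid and the injected anomalous location are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def cosine_anomaly_map(teacher_feats, restored_feats):
    """Per-location anomaly score: 1 - cosine similarity between the frozen
    teacher's features and the student's restored features."""
    t = teacher_feats / (np.linalg.norm(teacher_feats, axis=-1, keepdims=True) + 1e-8)
    s = restored_feats / (np.linalg.norm(restored_feats, axis=-1, keepdims=True) + 1e-8)
    return 1.0 - np.sum(t * s, axis=-1)

# Toy teacher features on an 8x8 spatial grid, 32 dims per location; concatenating
# extra modalities (depth, surface normals) would simply widen the channel axis.
teacher = rng.normal(size=(8, 8, 32))
restored = teacher + 0.05 * rng.normal(size=teacher.shape)  # normal regions restore well
restored[2, 3] = rng.normal(size=32)                        # one region the student cannot restore
scores = cosine_anomaly_map(teacher, restored)
print(np.unravel_index(scores.argmax(), scores.shape))      # expected: (2, 3), flagged as anomalous
```
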
20

Wang, Zi, Chenglong Li, Aihua Zheng, Ran He, and Jin Tang. "Interact, Embed, and EnlargE: Boosting Modality-Specific Representations for Multi-Modal Person Re-identification." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 3 (2022): 2633–41. http://dx.doi.org/10.1609/aaai.v36i3.20165.

Abstract:
Multi-modal person Re-ID introduces more complementary information to assist the traditional Re-ID task. Existing multi-modal methods ignore the importance of modality-specific information in the feature fusion stage. To this end, we propose a novel method to boost modality-specific representations for multi-modal person Re-ID: Interact, Embed, and EnlargE (IEEE). First, we propose a cross-modal interacting module to exchange useful information between different modalities in the feature extraction phase. Second, we propose a relation-based embedding module to enhance the richness of feature d
21

Liang, Meiyu, Junping Du, Zhengyang Liang, Yongwang Xing, Wei Huang, and Zhe Xue. "Self-Supervised Multi-Modal Knowledge Graph Contrastive Hashing for Cross-Modal Search." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 12 (2024): 13744–53. http://dx.doi.org/10.1609/aaai.v38i12.29280.

Abstract:
Deep cross-modal hashing technology provides an effective and efficient cross-modal unified representation learning solution for cross-modal search. However, the existing methods neglect the implicit fine-grained multimodal knowledge relations between these modalities such as when the image contains information that is not directly described in the text. To tackle this problem, we propose a novel self-supervised multi-grained multi-modal knowledge graph contrastive hashing method for cross-modal search (CMGCH). Firstly, in order to capture implicit fine-grained cross-modal semantic association
22

He, Qibin. "Prompting Multi-Modal Image Segmentation with Semantic Grouping." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 3 (2024): 2094–102. http://dx.doi.org/10.1609/aaai.v38i3.27981.

Abstract:
Multi-modal image segmentation is one of the core issues in computer vision. The main challenge lies in integrating common information between modalities while retaining specific patterns for each modality. Existing methods typically perform full fine-tuning on RGB-based pre-trained parameters to inherit the powerful representation of the foundation model. Although effective, such paradigm is not optimal due to weak transferability and scarce downstream data. Inspired by the recent success of prompt learning in language models, we propose the Grouping Prompt Tuning Framework (GoPT), which intr
23

Sikdar, Aniruddh, Jayant Teotia, and Suresh Sundaram. "OGP-Net: Optical Guidance Meets Pixel-Level Contrastive Distillation for Robust Multi-Modal and Missing Modality Segmentation." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 7 (2025): 6922–30. https://doi.org/10.1609/aaai.v39i7.32743.

Abstract:
Enhancing the performance of semantic segmentation models with multi-spectral images (RGB-IR) is crucial, particularly for low-light and adverse environments. While multi-modal fusion techniques aim to learn cross-modality features for generating fused images or engage in knowledge distillation, they often treat multi-modal and missing modality scenarios as separate challenges, which is not an optimal approach. To address this, a novel multi-modal fusion approach called Optically-Guided Pixel-level contrastive learning Network (OGP-Net) is proposed, which uses Distillation with Multi-View Cont
24

Wróblewska, Anna, Jacek Dąbrowski, Michał Pastuszak, et al. "Designing Multi-Modal Embedding Fusion-Based Recommender." Electronics 11, no. 9 (2022): 1391. http://dx.doi.org/10.3390/electronics11091391.

Abstract:
Recommendation systems have lately been popularised globally. However, often they need to be adapted to particular data and the use case. We have developed a machine learning-based recommendation system, which can be easily applied to almost any items and/or actions domain. Contrary to existing recommendation systems, our system supports multiple types of interaction data with various modalities of metadata through a multi-modal fusion of different data representations. We deployed the system into numerous e-commerce stores, e.g., food and beverages, shoes, fashion items, and telecom operators
25

Liu, Hao, Ting Li, Renjun Hu, Yanjie Fu, Jingjing Gu, and Hui Xiong. "Joint Representation Learning for Multi-Modal Transportation Recommendation." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 1036–43. http://dx.doi.org/10.1609/aaai.v33i01.33011036.

Abstract:
Multi-modal transportation recommendation has a goal of recommending a travel plan which considers various transportation modes, such as walking, cycling, automobile, and public transit, and how to connect among these modes. The successful development of multi-modal transportation recommendation systems can help to satisfy the diversified needs of travelers and improve the efficiency of transport networks. However, existing transport recommender systems mainly focus on unimodal transport planning. To this end, in this paper, we propose a joint representation learning framework for multi-modal
26

Bodapati, Jyostna Devi, Veeranjaneyulu Naralasetti, Shaik Nagur Shareef, et al. "Blended Multi-Modal Deep ConvNet Features for Diabetic Retinopathy Severity Prediction." Electronics 9, no. 6 (2020): 914. http://dx.doi.org/10.3390/electronics9060914.

Abstract:
Diabetic Retinopathy (DR) is one of the major causes of visual impairment and blindness across the world. It is usually found in patients who suffer from diabetes for a long period. The major focus of this work is to derive optimal representation of retinal images that further helps to improve the performance of DR recognition models. To extract optimal representation, features extracted from multiple pre-trained ConvNet models are blended using proposed multi-modal fusion module. These final representations are used to train a Deep Neural Network (DNN) used for DR identification and severity
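
The blending idea in this entry, pooling features from several pre-trained ConvNets into one representation before a downstream classifier, can be illustrated with nothing more than normalisation and concatenation. The sketch below is a simplification under that assumption; the paper's fusion module is more involved, and the feature dimensions here are placeholders typical of the named backbone families.

```python
import numpy as np

rng = np.random.default_rng(0)

def blend_features(*modal_features):
    """Simplest blending: L2-normalise each backbone's pooled feature vector
    and concatenate them into a single joint representation."""
    normed = [f / (np.linalg.norm(f) + 1e-8) for f in modal_features]
    return np.concatenate(normed)

# Placeholder pooled features from different pre-trained ConvNets
# (e.g. 2048-d ResNet, 1280-d EfficientNet, 1024-d DenseNet outputs).
resnet_feat = rng.normal(size=2048)
effnet_feat = rng.normal(size=1280)
densenet_feat = rng.normal(size=1024)

joint = blend_features(resnet_feat, effnet_feat, densenet_feat)
print(joint.shape)   # (4352,) -- this vector would feed the downstream DNN grader
```
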
27

Yang, Fan, Wei Li, Menglong Yang, Binbin Liang, and Jianwei Zhang. "Multi-Modal Disordered Representation Learning Network for Description-Based Person Search." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 15 (2024): 16316–24. http://dx.doi.org/10.1609/aaai.v38i15.29567.

Abstract:
Description-based person search aims to retrieve images of the target identity via textual descriptions. One of the challenges for this task is to extract discriminative representation from images and descriptions. Most existing methods apply the part-based split method or external models to explore the fine-grained details of local features, which ignore the global relationship between partial information and cause network instability. To overcome these issues, we propose a Multi-modal Disordered Representation Learning Network (MDRL) for description-based person search to fully extract the v
28

Jüttner, Martin, and Ingo Rentschler. "Imagery in multi-modal object learning." Behavioral and Brain Sciences 25, no. 2 (2002): 197–98. http://dx.doi.org/10.1017/s0140525x0238004x.

Abstract:
Spatial objects may not only be perceived visually but also by touch. We report recent experiments investigating to what extent prior object knowledge acquired in either the haptic or visual sensory modality transfers to a subsequent visual learning task. Results indicate that even mental object representations learnt in one sensory modality may attain a multi-modal quality. These findings seem incompatible with picture-based reasoning schemas but leave open the possibility of modality-specific reasoning mechanisms.
29

Tao, Rui, Meng Zhu, Haiyan Cao, and Honge Ren. "Fine-Grained Cross-Modal Semantic Consistency in Natural Conservation Image Data from a Multi-Task Perspective." Sensors 24, no. 10 (2024): 3130. http://dx.doi.org/10.3390/s24103130.

Abstract:
Fine-grained representation is fundamental to species classification based on deep learning, and in this context, cross-modal contrastive learning is an effective method. The diversity of species coupled with the inherent contextual ambiguity of natural language poses a primary challenge in the cross-modal representation alignment of conservation area image data. Integrating cross-modal retrieval tasks with generation tasks contributes to cross-modal representation alignment based on contextual understanding. However, during the contrastive learning process, apart from learning the differences
30

Yan, Facheng, Mingshu Zhang, and Bin Wei. "Multimodal integration for fake news detection on social media platforms." MATEC Web of Conferences 395 (2024): 01013. http://dx.doi.org/10.1051/matecconf/202439501013.

Abstract:
The widespread dissemination of fake news on social media platforms can cause serious social impact, making the detection of fake news on social media platforms an urgent problem to be solved. Up to now, scholars have proposed various methods ranging from traditional manual feature extraction to deep learning algorithms for detecting fake news. However, these methods still have some limitations and two difficult problems: (1) How to learn informative news feature representations without losing information as much as possible? (2) How to effectively fuse multi-modal information to obtain high-o
31

Escobar-Grisales, Daniel, Cristian David Ríos-Urrego, and Juan Rafael Orozco-Arroyave. "Deep Learning and Artificial Intelligence Applied to Model Speech and Language in Parkinson’s Disease." Diagnostics 13, no. 13 (2023): 2163. http://dx.doi.org/10.3390/diagnostics13132163.

Abstract:
Parkinson’s disease (PD) is the second most prevalent neurodegenerative disorder in the world, and it is characterized by the production of different motor and non-motor symptoms which negatively affect speech and language production. For decades, the research community has been working on methodologies to automatically model these biomarkers to detect and monitor the disease; however, although speech impairments have been widely explored, language remains underexplored despite being a valuable source of information, especially to assess cognitive impairments associated with non-motor symptoms
32

Geng, Shijie, Peng Gao, Moitreya Chatterjee, et al. "Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 2 (2021): 1415–23. http://dx.doi.org/10.1609/aaai.v35i2.16231.

Abstract:
Given an input video, its associated audio, and a brief caption, the audio-visual scene aware dialog (AVSD) task requires an agent to indulge in a question-answer dialog with a human about the audio-visual content. This task thus poses a challenging multi-modal representation learning and reasoning scenario, advancements into which could influence several human-machine interaction applications. To solve this task, we introduce a semantics-controlled multi-modal shuffled Transformer reasoning framework, consisting of a sequence of Transformer modules, each taking a modality as input and produci
33

Pugeault, Nicolas, Florentin Wörgötter, and Norbert Krüger. "Disambiguating Multi-Modal Scene Representations Using Perceptual Grouping Constraints." PLoS ONE 5, no. 6 (2010): e10663. http://dx.doi.org/10.1371/journal.pone.0010663.

34

Lara, Bruno, Juan Manuel Rendon-Mancha, and Marcos A. Capistran. "Prediction of Undesired Situations based on Multi-Modal Representations." IEEE Latin America Transactions 5, no. 2 (2007): 103–8. http://dx.doi.org/10.1109/tla.2007.4381351.

35

Yang, Yiying, Fukun Yin, Wen Liu, et al. "PM-INR: Prior-Rich Multi-Modal Implicit Large-Scale Scene Neural Representation." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 7 (2024): 6594–602. http://dx.doi.org/10.1609/aaai.v38i7.28481.

Abstract:
Recent advancements in implicit neural representations have contributed to high-fidelity surface reconstruction and photorealistic novel view synthesis. However, with the expansion of the scene scale, such as block or city level, existing methods will encounter challenges because traditional sampling cannot cope with the cubically growing sampling space. To alleviate the dependence on filling the sampling space, we explore using multi-modal priors to assist individual points to obtain more global semantic information and propose a prior-rich multi-modal implicit neural representation network, P
36

Zhai, Hanming, Xiaojun Lv, Zhiwen Hou, Xin Tong, and Fanliang Bu. "MLSFF: Multi-level structural features fusion for multi-modal knowledge graph completion." Mathematical Biosciences and Engineering 20, no. 8 (2023): 14096–116. http://dx.doi.org/10.3934/mbe.2023630.

Abstract:
<abstract><p>With the rise of multi-modal methods, multi-modal knowledge graphs have become a better choice for storing human knowledge. However, knowledge graphs often suffer from the problem of incompleteness due to the infinite and constantly updating nature of knowledge, and thus the task of knowledge graph completion has been proposed. Existing multi-modal knowledge graph completion methods mostly rely on either embedding-based representations or graph neural networks, and there is still room for improvement in terms of interpretability and the ability to handle multi-hop task
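
For readers new to multi-modal knowledge graph completion, the core mechanic is scoring candidate triples with entity embeddings that fuse structural, visual, and textual information. The sketch below is a generic illustration using a simple weighted fusion and a TransE-style score; it is not MLSFF's multi-level structural feature fusion, and all embeddings and weights are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 50

def fuse_entity(structural, visual, textual, weights=(0.5, 0.25, 0.25)):
    """Generic multi-modal entity embedding: weighted sum of per-modality vectors."""
    w_s, w_v, w_t = weights
    return w_s * structural + w_v * visual + w_t * textual

def transe_score(head, relation, tail):
    """TransE plausibility: a smaller ||h + r - t|| means a more plausible triple."""
    return -float(np.linalg.norm(head + relation - tail))

# Toy embeddings for two entities and one relation.
h = fuse_entity(rng.normal(size=DIM), rng.normal(size=DIM), rng.normal(size=DIM))
t = fuse_entity(rng.normal(size=DIM), rng.normal(size=DIM), rng.normal(size=DIM))
r = t - h + 0.05 * rng.normal(size=DIM)   # make (h, r, t) roughly hold for the demo
corrupt_t = rng.normal(size=DIM)          # negative sample with a corrupted tail
print(transe_score(h, r, t) > transe_score(h, r, corrupt_t))  # True: the real triple scores higher
```
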
37

Hua, Yan, Yingyun Yang, and Jianhe Du. "Deep Multi-Modal Metric Learning with Multi-Scale Correlation for Image-Text Retrieval." Electronics 9, no. 3 (2020): 466. http://dx.doi.org/10.3390/electronics9030466.

Abstract:
Multi-modal retrieval is a challenge due to heterogeneous gap and a complex semantic relationship between different modal data. Typical research map different modalities into a common subspace with a one-to-one correspondence or similarity/dissimilarity relationship of inter-modal data, in which the distances of heterogeneous data can be compared directly; thus, inter-modal retrieval can be achieved by the nearest neighboring search. However, most of them ignore intra-modal relations and complicated semantics between multi-modal data. In this paper, we propose a deep multi-modal metric learnin
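
The abstract's point, that inter-modal correspondence alone is not enough and intra-modal relations matter too, can be made concrete with a two-term triplet loss in a shared subspace. The snippet below is a hedged sketch of that idea, not the paper's multi-scale correlation model; all embeddings are synthetic stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def triplet(anchor, positive, negative, margin=0.3):
    """Triplet hinge on cosine similarity: the positive must be closer to the
    anchor than the negative by at least `margin`."""
    cos = lambda a, b: float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return max(0.0, margin - (cos(anchor, positive) - cos(anchor, negative)))

# Synthetic embeddings already projected into a common 128-d subspace.
img_anchor = rng.normal(size=128)
txt_pos = img_anchor + 0.1 * rng.normal(size=128)         # matching caption, nearby
txt_neg = rng.normal(size=128)                            # non-matching caption
img_same_class = img_anchor + 0.2 * rng.normal(size=128)  # intra-modal positive
img_other = rng.normal(size=128)                          # intra-modal negative

# Inter-modal term (image vs. text) plus an intra-modal term (image vs. image).
loss = triplet(img_anchor, txt_pos, txt_neg) + triplet(img_anchor, img_same_class, img_other)
print(round(loss, 3))
```
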
38

Hill, Felix, Roi Reichart, and Anna Korhonen. "Multi-Modal Models for Concrete and Abstract Concept Meaning." Transactions of the Association for Computational Linguistics 2 (December 2014): 285–96. http://dx.doi.org/10.1162/tacl_a_00183.

Abstract:
Multi-modal models that learn semantic representations from both linguistic and perceptual input outperform language-only models on a range of evaluations, and better reflect human concept acquisition. Most perceptual input to such models corresponds to concrete noun concepts and the superiority of the multi-modal approach has only been established when evaluating on such concepts. We therefore investigate which concepts can be effectively learned by multi-modal models. We show that concreteness determines both which linguistic features are most informative and the impact of perceptual input i
39

Liu, Xuanwu, Guoxian Yu, Carlotta Domeniconi, Jun Wang, Yazhou Ren, and Maozu Guo. "Ranking-Based Deep Cross-Modal Hashing." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 4400–4407. http://dx.doi.org/10.1609/aaai.v33i01.33014400.

Abstract:
Cross-modal hashing has been receiving increasing interests for its low storage cost and fast query speed in multi-modal data retrievals. However, most existing hashing methods are based on hand-crafted or raw level features of objects, which may not be optimally compatible with the coding process. Besides, these hashing methods are mainly designed to handle simple pairwise similarity. The complex multilevel ranking semantic structure of instances associated with multiple labels has not been well explored yet. In this paper, we propose a ranking-based deep cross-modal hashing approach (RDCMH).
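
Cross-modal hashing earns its low storage cost and fast query speed by binarising projections of each modality into a shared Hamming space. The sketch below is a generic, untrained illustration of that retrieval mechanic, with random projections standing in for learned deep hashing layers; it does not implement RDCMH's ranking-based semantics.

```python
import numpy as np

rng = np.random.default_rng(0)
CODE_BITS, IMG_DIM, TXT_DIM = 32, 512, 300

# Hypothetical projections into a shared Hamming space (in a real deep hashing
# model these would be the learned output layers of two modality networks).
W_img = rng.normal(size=(IMG_DIM, CODE_BITS))
W_txt = rng.normal(size=(TXT_DIM, CODE_BITS))

def hash_code(features, projection):
    """Binarise a real-valued embedding with a sign threshold to get a compact code."""
    return (features @ projection > 0).astype(np.uint8)

def hamming(a, b):
    """Hamming distance between two binary codes (lower = more similar)."""
    return int(np.count_nonzero(a != b))

image_code = hash_code(rng.normal(size=IMG_DIM), W_img)
text_codes = hash_code(rng.normal(size=(100, TXT_DIM)), W_txt)   # toy text database
distances = [hamming(image_code, c) for c in text_codes]
print("closest text item:", int(np.argmin(distances)), "at distance", min(distances))
```
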
40

Gou, Yingdong, Kexin Wang, Siwen Wei, and Changxin Shi. "GMDA: GCN-Based Multi-Modal Domain Adaptation for Real-Time Disaster Detection." International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 31, no. 06 (2023): 957–73. http://dx.doi.org/10.1142/s0218488523500435.

Abstract:
Nowadays, with the rapid expansion of social media as a means of quick communication, real-time disaster information is widely disseminated through these platforms. Determining which real-time and multi-modal disaster information can effectively support humanitarian aid has become a major challenge. In this paper, we propose a novel end-to-end model, named GCN-based Multi-modal Domain Adaptation (GMDA), which consists of three essential modules: the GCN-based feature extraction module, the attention-based fusion module and the MMD domain adaptation module. The GCN-based feature extraction modu
41

Hu, Lianyu, Liqing Gao, Zekang Liu, Chi-Man Pun, and Wei Feng. "COMMA: Co-articulated Multi-Modal Learning." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 3 (2024): 2238–46. http://dx.doi.org/10.1609/aaai.v38i3.27997.

Abstract:
Pretrained large-scale vision-language models such as CLIP have demonstrated excellent generalizability over a series of downstream tasks. However, they are sensitive to the variation of input text prompts and need a selection of prompt templates to achieve satisfactory performance. Recently, various methods have been proposed to dynamically learn the prompts as the textual inputs to avoid the requirements of laboring hand-crafted prompt engineering in the fine-tuning process. We notice that these methods are suboptimal in two aspects. First, the prompts of the vision and language branches in
42

Sezerer, Erhan, and Selma Tekir. "Incorporating Concreteness in Multi-Modal Language Models with Curriculum Learning." Applied Sciences 11, no. 17 (2021): 8241. http://dx.doi.org/10.3390/app11178241.

Abstract:
Over the last few years, there has been an increase in the studies that consider experiential (visual) information by building multi-modal language models and representations. It is shown by several studies that language acquisition in humans starts with learning concrete concepts through images and then continues with learning abstract ideas through the text. In this work, the curriculum learning method is used to teach the model concrete/abstract concepts through images and their corresponding captions to accomplish multi-modal language modeling/representation. We use the BERT and Resnet-152
43

Jang, Jiho, Chaerin Kong, DongHyeon Jeon, Seonhoon Kim, and Nojun Kwak. "Unifying Vision-Language Representation Space with Single-Tower Transformer." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 1 (2023): 980–88. http://dx.doi.org/10.1609/aaai.v37i1.25178.

Abstract:
Contrastive learning is a form of distance learning that aims to learn invariant features from two related representations. In this work, we explore the hypothesis that an image and caption can be regarded as two different views of the underlying mutual information, and train a model to learn a unified vision-language representation space that encodes both modalities at once in a modality-agnostic manner. We first identify difficulties in learning a one-tower model for vision-language pretraining (VLP), and propose One Representation (OneR) as a simple yet effective framework for our goal. We
44

Fang, Feiyi, Tao Zhou, Zhenbo Song, and Jianfeng Lu. "MMCAN: Multi-Modal Cross-Attention Network for Free-Space Detection with Uncalibrated Hyperspectral Sensors." Remote Sensing 15, no. 4 (2023): 1142. http://dx.doi.org/10.3390/rs15041142.

Abstract:
Free-space detection plays a pivotal role in autonomous vehicle applications, and its state-of-the-art algorithms are typically based on semantic segmentation of road areas. Recently, hyperspectral images have proven useful supplementary information in multi-modal segmentation for providing more texture details to the RGB representations, thus performing well in road segmentation tasks. Existing multi-modal segmentation methods assume that all the inputs are well-aligned, and then the problem is converted to fuse feature maps from different modalities. However, there exist cases where sensors
45

Qian, Shengsheng, Dizhan Xue, Huaiwen Zhang, Quan Fang, and Changsheng Xu. "Dual Adversarial Graph Neural Networks for Multi-label Cross-modal Retrieval." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 3 (2021): 2440–48. http://dx.doi.org/10.1609/aaai.v35i3.16345.

Abstract:
Cross-modal retrieval has become an active study field with the expanding scale of multimodal data. To date, most existing methods transform multimodal data into a common representation space where semantic similarities between items can be directly measured across different modalities. However, these methods typically suffer from following limitations: 1) They usually attempt to bridge the modality gap by designing losses in the common representation space which may not be sufficient to eliminate potential heterogeneity of different modalities in the common space. 2) They typically treat labe
46

Kabir, Anowarul, and Amarda Shehu. "GOProFormer: A Multi-Modal Transformer Method for Gene Ontology Protein Function Prediction." Biomolecules 12, no. 11 (2022): 1709. http://dx.doi.org/10.3390/biom12111709.

Abstract:
Protein Language Models (PLMs) are shown to be capable of learning sequence representations useful for various prediction tasks, from subcellular localization, evolutionary relationships, family membership, and more. They have yet to be demonstrated useful for protein function prediction. In particular, the problem of automatic annotation of proteins under the Gene Ontology (GO) framework remains open. This paper makes two key contributions. It debuts a novel method that leverages the transformer architecture in two ways. A sequence transformer encodes protein sequences in a task-agnostic feat
47

Alam, Mohammad Arif Ul. "College Student Retention Risk Analysis from Educational Database Using Multi-Task Multi-Modal Neural Fusion." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 11 (2022): 12689–97. http://dx.doi.org/10.1609/aaai.v36i11.21545.

Abstract:
We develop a Multimodal Spatiotemporal Neural Fusion network for MTL (MSNF-MTCL) to predict 5 important students' retention risks: future dropout, next semester dropout, type of dropout, duration of dropout and cause of dropout. First, we develop a general purpose multi-modal neural fusion network model MSNF for learning students' academic information representation by fusing spatial and temporal unstructured advising notes with spatiotemporal structured data. MSNF combines a Bidirectional Encoder Representations from Transformers (BERT)-based document embedding framework to represent each adv
48

Bao, Peijun, Wenhan Yang, Boon Poh Ng, Meng Hwa Er, and Alex C. Kot. "Cross-Modal Label Contrastive Learning for Unsupervised Audio-Visual Event Localization." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 1 (2023): 215–22. http://dx.doi.org/10.1609/aaai.v37i1.25093.

Abstract:
This paper for the first time explores audio-visual event localization in an unsupervised manner. Previous methods tackle this problem in a supervised setting and require segment-level or video-level event category ground-truth to train the model. However, building large-scale multi-modality datasets with category annotations is human-intensive and thus not scalable to real-world applications. To this end, we propose cross-modal label contrastive learning to exploit multi-modal information among unlabeled audio and visual streams as self-supervision signals. At the feature representation level
49

Lu, Lyujian, Saad Elbeleidy, Lauren Zoe Baker, and Hua Wang. "Learning Multi-Modal Biomarker Representations via Globally Aligned Longitudinal Enrichments." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 01 (2020): 817–24. http://dx.doi.org/10.1609/aaai.v34i01.5426.

Abstract:
Alzheimer's Disease (AD) is a chronic neurodegenerative disease that severely impacts patients' thinking, memory and behavior. To aid automatic AD diagnoses, many longitudinal learning models have been proposed to predict clinical outcomes and/or disease status, which, though, often fail to consider missing temporal phenotypic records of the patients that can convey valuable information of AD progressions. Another challenge in AD studies is how to integrate heterogeneous genotypic and phenotypic biomarkers to improve diagnosis prediction. To cope with these challenges, in this paper we propose
50

Zhang, Heng, Vishal M. Patel, and Rama Chellappa. "Low-Rank and Joint Sparse Representations for Multi-Modal Recognition." IEEE Transactions on Image Processing 26, no. 10 (2017): 4741–52. http://dx.doi.org/10.1109/tip.2017.2721838.
