
Journal articles on the topic 'Multimodal Embeddings'


Consult the top 50 journal articles for your research on the topic 'Multimodal Embeddings.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles from a wide variety of disciplines and organise your bibliography correctly.

1

Tyshchuk, Kirill, Polina Karpikova, Andrew Spiridonov, Anastasiia Prutianova, Anton Razzhigaev, and Alexander Panchenko. "On Isotropy of Multimodal Embeddings." Information 14, no. 7 (2023): 392. http://dx.doi.org/10.3390/info14070392.

Full text
Abstract:
Embeddings, i.e., vector representations of objects, such as texts, images, or graphs, play a key role in deep learning methodologies nowadays. Prior research has shown the importance of analyzing the isotropy of textual embeddings for transformer-based text encoders, such as the BERT model. Anisotropic word embeddings do not use the entire space, instead concentrating on a narrow cone in such a pretrained vector space, negatively affecting the performance of applications, such as textual semantic similarity. Transforming a vector space to optimize isotropy has been shown to be beneficial for
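The isotropy notion discussed in this abstract can be probed with a few lines of code. The sketch below is a generic illustration rather than the paper's method: the function name isotropy_report and the toy data are assumptions, and the embeddings are assumed to be available as a NumPy array.

import numpy as np

def isotropy_report(emb: np.ndarray):
    """Two rough isotropy indicators for an (n, d) embedding matrix (illustrative only)."""
    # 1) Average pairwise cosine similarity of the raw vectors:
    #    close to 0 for an isotropic cloud, close to 1 when vectors sit in a narrow cone.
    unit = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-12)
    cos = unit @ unit.T
    n = len(emb)
    avg_cos = (cos.sum() - n) / (n * (n - 1))
    # 2) Ratio of largest to smallest singular value of the centered matrix:
    #    values near 1 mean all directions carry similar variance, large values mean a few dominate.
    s = np.linalg.svd(emb - emb.mean(axis=0), compute_uv=False)
    return avg_cos, s[0] / (s[-1] + 1e-12)

avg_cos, sv_ratio = isotropy_report(np.random.randn(1000, 256))  # roughly isotropic toy data
print(f"mean cosine: {avg_cos:.3f}, singular-value ratio: {sv_ratio:.2f}")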
2

Guo, Zhiqiang, Jianjun Li, Guohui Li, Chaoyang Wang, Si Shi, and Bin Ruan. "LGMRec: Local and Global Graph Learning for Multimodal Recommendation." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 8 (2024): 8454–62. http://dx.doi.org/10.1609/aaai.v38i8.28688.

Full text
Abstract:
The multimodal recommendation has gradually become the infrastructure of online media platforms, enabling them to provide personalized service to users through a joint modeling of user historical behaviors (e.g., purchases, clicks) and item various modalities (e.g., visual and textual). The majority of existing studies typically focus on utilizing modal features or modal-related graph structure to learn user local interests. Nevertheless, these approaches encounter two limitations: (1) Shared updates of user ID embeddings result in the consequential coupling between collaboration and multimoda
3

Shang, Bin, Yinliang Zhao, Jun Liu, and Di Wang. "LAFA: Multimodal Knowledge Graph Completion with Link Aware Fusion and Aggregation." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 8 (2024): 8957–65. http://dx.doi.org/10.1609/aaai.v38i8.28744.

Full text
Abstract:
Recently, an enormous amount of research has emerged on multimodal knowledge graph completion (MKGC), which seeks to extract knowledge from multimodal data and predict the most plausible missing facts to complete a given multimodal knowledge graph (MKG). However, existing MKGC approaches largely ignore that visual information may introduce noise and lead to uncertainty when adding them to the traditional KG embeddings due to the contribution of each associated image to entity is different in diverse link scenarios. Moreover, treating each triple independently when learning entity embeddings le
4

Sun, Zhongkai, Prathusha Sarma, William Sethares, and Yingyu Liang. "Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (2020): 8992–99. http://dx.doi.org/10.1609/aaai.v34i05.6431.

Full text
Abstract:
Multimodal language analysis often considers relationships between features based on text and those based on acoustical and visual properties. Text features typically outperform non-text features in sentiment analysis or emotion recognition tasks in part because the text features are derived from advanced language models or word embeddings trained on massive data sources while audio and video features are human-engineered and comparatively underdeveloped. Given that the text, audio, and video are describing the same utterance in different ways, we hypothesize that the multimodal sentiment anal
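As a hedged illustration of correlating two views of the same utterances (plain linear CCA, not the deep canonical correlation method used in the paper), the snippet below uses scikit-learn's CCA on toy text and audio feature matrices; the feature dimensions and random data are assumptions standing in for precomputed features.

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n_utterances = 500
text_feats = rng.normal(size=(n_utterances, 300))   # e.g., averaged word embeddings (toy data)
audio_feats = rng.normal(size=(n_utterances, 74))    # e.g., acoustic descriptors (toy data)

# Linear canonical correlation analysis: finds paired projections that maximise
# the correlation between the two views of the same utterances.
cca = CCA(n_components=10)
text_proj, audio_proj = cca.fit_transform(text_feats, audio_feats)

# Per-component correlation between the projected views.
corrs = [np.corrcoef(text_proj[:, k], audio_proj[:, k])[0, 1] for k in range(10)]
print([f"{c:.2f}" for c in corrs])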
5

Merkx, Danny, and Stefan L. Frank. "Learning semantic sentence representations from visually grounded language without lexical knowledge." Natural Language Engineering 25, no. 4 (2019): 451–66. http://dx.doi.org/10.1017/s1351324919000196.

Full text
Abstract:
Current approaches to learning semantic representations of sentences often use prior word-level knowledge. The current study aims to leverage visual information in order to capture sentence level semantics without the need for word embeddings. We use a multimodal sentence encoder trained on a corpus of images with matching text captions to produce visually grounded sentence embeddings. Deep Neural Networks are trained to map the two modalities to a common embedding space such that for an image the corresponding caption can be retrieved and vice versa. We show that our model achieves re
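The common-embedding-space idea described in this abstract can be sketched generically in PyTorch: two small projection heads map precomputed image and caption features into one space, and a symmetric contrastive loss pulls matching pairs together. The class name TwoTowerEncoder, the dimensions, and the random batch are assumptions for illustration; this is not the paper's architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerEncoder(nn.Module):
    """Projects precomputed image and caption features into one shared space (toy sketch)."""
    def __init__(self, img_dim=2048, txt_dim=1024, joint_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)
        self.txt_proj = nn.Linear(txt_dim, joint_dim)

    def forward(self, img_feats, txt_feats):
        return (F.normalize(self.img_proj(img_feats), dim=-1),
                F.normalize(self.txt_proj(txt_feats), dim=-1))

def symmetric_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Matching image/caption pairs sit on the diagonal of the similarity matrix.
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(len(img_emb))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

model = TwoTowerEncoder()
img_batch, txt_batch = torch.randn(32, 2048), torch.randn(32, 1024)
loss = symmetric_contrastive_loss(*model(img_batch, txt_batch))
loss.backward()  # in a real training loop, an optimizer step would follow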
6

Mateev, Mihail. "Comparative Analysis on Implementing Embeddings for Image Analysis." Journal of Information Systems Engineering and Management 10, no. 17s (2025): 89–102. https://doi.org/10.52783/jisem.v10i17s.2710.

Full text
Abstract:
This research explores how artificial intelligence enhances construction maintenance and diagnostics, achieving 95% accuracy on a dataset of 10,000 cases. The findings highlight AI's potential to revolutionize predictive maintenance in the industry. The growing adoption of image embeddings has transformed visual data processing across AI applications. This study evaluates embedding implementations in major platforms, including Azure AI, OpenAI's GPT-4 Vision, and frameworks like Hugging Face, Replicate, and Eden AI. It assesses their scalability, accuracy, cost-effectiveness, and integration f
7

Tang, Zhenchao, Jiehui Huang, Guanxing Chen, and Calvin Yu-Chian Chen. "Comprehensive View Embedding Learning for Single-Cell Multimodal Integration." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 14 (2024): 15292–300. http://dx.doi.org/10.1609/aaai.v38i14.29453.

Full text
Abstract:
Motivation: Advances in single-cell measurement techniques provide rich multimodal data, which helps us to explore the life state of cells more deeply. However, multimodal integration, or, learning joint embeddings from multimodal data remains a current challenge. The difficulty in integrating unpaired single-cell multimodal data is that different modalities have different feature spaces, which easily leads to information loss in joint embedding. And few existing methods have fully exploited and fused the information in single-cell multimodal data. Result: In this study, we propose CoVEL, a de
8

Zhang, Linhai, Deyu Zhou, Yulan He, and Zeng Yang. "MERL: Multimodal Event Representation Learning in Heterogeneous Embedding Spaces." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 16 (2021): 14420–27. http://dx.doi.org/10.1609/aaai.v35i16.17695.

Full text
Abstract:
Previous work has shown the effectiveness of using event representations for tasks such as script event prediction and stock market prediction. It is however still challenging to learn the subtle semantic differences between events based solely on textual descriptions of events often represented as (subject, predicate, object) triples. As an alternative, images offer a more intuitive way of understanding event semantics. We observe that event described in text and in images show different abstraction levels and therefore should be projected onto heterogeneous embedding spaces, as opposed to wh
9

Sah, Shagan, Sabarish Gopalakishnan, and Raymond Ptucha. "Aligned attention for common multimodal embeddings." Journal of Electronic Imaging 29, no. 02 (2020): 1. http://dx.doi.org/10.1117/1.jei.29.2.023013.

Full text
10

Alkaabi, Hussein, Ali Kadhim Jasim, and Ali Darroudi. "From Static to Contextual: A Survey of Embedding Advances in NLP." PERFECT: Journal of Smart Algorithms 2, no. 2 (2025): 57–66. https://doi.org/10.62671/perfect.v2i2.77.

Full text
Abstract:
Embedding techniques have been a cornerstone of Natural Language Processing (NLP), enabling machines to represent textual data in a form that captures semantic and syntactic relationships. Over the years, the field has witnessed a significant evolution—from static word embeddings, such as Word2Vec and GloVe, which represent words as fixed vectors, to dynamic, contextualized embeddings like BERT and GPT, which generate word representations based on their surrounding context. This survey provides a comprehensive overview of embedding techniques, tracing their development from early methods to st
11

Zhang, Rongchao, Yiwei Lou, Dexuan Xu, Yongzhi Cao, Hanpin Wang, and Yu Huang. "A Learnable Discrete-Prior Fusion Autoencoder with Contrastive Learning for Tabular Data Synthesis." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 15 (2024): 16803–11. http://dx.doi.org/10.1609/aaai.v38i15.29621.

Full text
Abstract:
The actual collection of tabular data for sharing involves confidentiality and privacy constraints, leaving the potential risks of machine learning for interventional data analysis unsafely averted. Synthetic data has emerged recently as a privacy-protecting solution to address this challenge. However, existing approaches regard discrete and continuous modal features as separate entities, thus falling short in properly capturing their inherent correlations. In this paper, we propose a novel contrastive learning guided Gaussian Transformer autoencoder, termed GTCoder, to synthesize photo-realis
12

Lin, Kaiyi, Xing Xu, Lianli Gao, Zheng Wang, and Heng Tao Shen. "Learning Cross-Aligned Latent Embeddings for Zero-Shot Cross-Modal Retrieval." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (2020): 11515–22. http://dx.doi.org/10.1609/aaai.v34i07.6817.

Full text
Abstract:
Zero-Shot Cross-Modal Retrieval (ZS-CMR) is an emerging research hotspot that aims to retrieve data of new classes across different modality data. It is challenging for not only the heterogeneous distributions across different modalities, but also the inconsistent semantics across seen and unseen classes. A handful of recently proposed methods typically borrow the idea from zero-shot learning, i.e., exploiting word embeddings of class labels (i.e., class-embeddings) as common semantic space, and using generative adversarial network (GAN) to capture the underlying multimodal data structures, as
13

Khalifa, Omar Yasser Ibrahim, and Muhammad Zafran Muhammad Zaly Shah. "MultiPhishNet: A Multimodal Approach of QR Code Phishing Detection using Multi-Head Attention and Multilingual Embeddings." International Journal of Innovative Computing 15, no. 1 (2025): 53–61. https://doi.org/10.11113/ijic.v15n1.512.

Full text
Abstract:
Phishing attacks leveraging QR codes have become a significant threat due to their increasing use in contactless services. These attacks are challenging to detect since QR codes typically encode URLs leading to phishing websites designed to steal sensitive information. Existing detection methods often rely on blacklists or handcrafted features, which are inadequate for handling obfuscated URLs and multilingual content. This paper proposes MultiPhishNet, a multimodal phishing detection model that integrates advanced embedding techniques, Convolutional Neural Networks (CNNs), and multi-head atte
14

Waqas, Asim, Aakash Tripathi, Mia Naeini, Paul A. Stewart, Matthew B. Schabath, and Ghulam Rasool. "Abstract 991: PARADIGM: an embeddings-based multimodal learning framework with foundation models and graph neural networks." Cancer Research 85, no. 8_Supplement_1 (2025): 991. https://doi.org/10.1158/1538-7445.am2025-991.

Full text
Abstract:
Introduction: Cancer research faces significant challenges in integrating heterogeneous data across varying spatial and temporal scales, limiting the ability to gain a comprehensive understanding of the disease. PARADIGM (Pan-Cancer Embeddings Representation using Advanced Multimodal Learning with Graph-based Modeling) addresses this challenge by providing a framework leveraging foundation models (FMs) and Graph Neural Networks (GNN). The PARADIGM framework generates embeddings from multi-resolution datasets using modality-specific FMs, aggregates sample embeddings, fuses them into a unifi
15

Li, Xiaolong, Yang Dong, Yunfei Yi, Zhixun Liang, and Shuqi Yan. "Hypergraph Neural Network for Multimodal Depression Recognition." Electronics 13, no. 22 (2024): 4544. http://dx.doi.org/10.3390/electronics13224544.

Full text
Abstract:
Deep learning-based approaches for automatic depression recognition offer advantages of low cost and high efficiency. However, depression symptoms are challenging to detect and vary significantly between individuals. Traditional deep learning methods often struggle to capture and model these nuanced features effectively, leading to lower recognition accuracy. This paper introduces a novel multimodal depression recognition method, HYNMDR, which utilizes hypergraphs to represent the complex, high-order relationships among patients with depression. HYNMDR comprises two primary components: a tempo
16

Zhu, Chaoyu, Zhihao Yang, Xiaoqiong Xia, Nan Li, Fan Zhong, and Lei Liu. "Multimodal reasoning based on knowledge graph embedding for specific diseases." Bioinformatics 38, no. 8 (2022): 2235–45. http://dx.doi.org/10.1093/bioinformatics/btac085.

Full text
Abstract:
Motivation: Knowledge Graph (KG) is becoming increasingly important in the biomedical field. Deriving new and reliable knowledge from existing knowledge by KG embedding technology is a cutting-edge method. Some add a variety of additional information to aid reasoning, namely multimodal reasoning. However, few works based on the existing biomedical KGs are focused on specific diseases. Results: This work develops a construction and multimodal reasoning process of Specific Disease Knowledge Graphs (SDKGs). We construct SDKG-11, a SDKG set including five cancers, six non-cancer diseases, a
17

Tripathi, Aakash Gireesh, Asim Waqas, Yasin Yilmaz, Matthew B. Schabath, and Ghulam Rasool. "Abstract 3641: Predicting treatment outcomes using cross-modality correlations in multimodal oncology data." Cancer Research 85, no. 8_Supplement_1 (2025): 3641. https://doi.org/10.1158/1538-7445.am2025-3641.

Full text
Abstract:
Accurate prediction of treatment outcomes in oncology requires modeling the intricate relationships across diverse data modalities. This study investigates cross-modality correlations by leveraging imaging and clinical data curated through the Multimodal Integration of Oncology Data System (MINDS) and HoneyBee frameworks to uncover actionable patterns for personalized treatment strategies. Using data from over 10,000 cancer patients, we developed a machine learning pipeline that employs advanced embedding techniques to capture associations between radiological imaging phenotypes and
18

Tripathi, Aakash, Asim Waqas, Yasin Yilmaz, and Ghulam Rasool. "Abstract 4905: Multimodal transformer model improves survival prediction in lung cancer compared to unimodal approaches." Cancer Research 84, no. 6_Supplement (2024): 4905. http://dx.doi.org/10.1158/1538-7445.am2024-4905.

Full text
Abstract:
Integrating multimodal lung data including clinical notes, medical images, and molecular data is critical for predictive modeling tasks like survival prediction, yet effectively aligning these disparate data types remains challenging. We present a novel method to integrate heterogeneous lung modalities by first thoroughly analyzing various domain-specific models and selecting the optimal model for embedding feature extraction per data type based on performance on representative pretrained tasks. For clinical notes, the GatorTron models showed the lowest regression loss on an initial e
19

Xu, Jinfeng, Zheyu Chen, Shuo Yang, Jinze Li, Hewei Wang, and Edith C. H. Ngai. "MENTOR: Multi-level Self-supervised Learning for Multimodal Recommendation." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 12 (2025): 12908–17. https://doi.org/10.1609/aaai.v39i12.33408.

Full text
Abstract:
As multimedia information proliferates, multimodal recommendation systems have garnered significant attention. These systems leverage multimodal information to alleviate the data sparsity issue inherent in recommendation systems, thereby enhancing the accuracy of recommendations. Due to the natural semantic disparities among multimodal features, recent research has primarily focused on cross-modal alignment using self-supervised learning to bridge these gaps. However, aligning different modal features might result in the loss of valuable interaction information, distancing them from ID embeddi
20

Ota, Kosuke, Keiichiro Shirai, Hidetoshi Miyao, and Minoru Maruyama. "Multimodal Analogy-Based Image Retrieval by Improving Semantic Embeddings." Journal of Advanced Computational Intelligence and Intelligent Informatics 26, no. 6 (2022): 995–1003. http://dx.doi.org/10.20965/jaciii.2022.p0995.

Full text
Abstract:
In this work, we study the application of multimodal analogical reasoning to image retrieval. Multimodal analogy questions are given in a form of tuples of words and images, e.g., “cat”:“dog”::[an image of a cat sitting on a bench]:?, to search for an image of a dog sitting on a bench. Retrieving desired images given these tuples can be seen as a task of finding images whose relation between the query image is close to that of query words. One way to achieve the task is building a common vector space that exhibits analogical regularities. To learn such an embedding, we propose a quadruple neur
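Once words and images share one embedding space, the analogy query described in this abstract reduces to vector arithmetic plus nearest-neighbour search. The sketch below is a generic illustration under that assumption, not the quadruple network proposed in the paper; analogy_retrieve and the random toy vectors are made up for the example.

import numpy as np

def analogy_retrieve(word_a, word_b, query_img, image_index, word_vecs):
    """Answer word_a : word_b :: query_img : ? by nearest-neighbour search (illustrative)."""
    offset = word_vecs[word_b] - word_vecs[word_a]        # e.g., "cat" -> "dog"
    target = query_img + offset                            # shift the image embedding by the word offset
    target = target / np.linalg.norm(target)
    keys, mat = zip(*image_index.items())
    mat = np.stack(mat)
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)
    scores = mat @ target                                  # cosine similarity against every indexed image
    return keys[int(np.argmax(scores))]

# Toy example with random vectors standing in for a learned joint space.
rng = np.random.default_rng(0)
word_vecs = {"cat": rng.normal(size=128), "dog": rng.normal(size=128)}
image_index = {f"img_{i}": rng.normal(size=128) for i in range(100)}
print(analogy_retrieve("cat", "dog", rng.normal(size=128), image_index, word_vecs))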
21

Yi, Moung-Ho, Keun-Chang Kwak, and Ju-Hyun Shin. "KoHMT: A Multimodal Emotion Recognition Model Integrating KoELECTRA, HuBERT with Multimodal Transformer." Electronics 13, no. 23 (2024): 4674. http://dx.doi.org/10.3390/electronics13234674.

Full text
Abstract:
With the advancement of human-computer interaction, the role of emotion recognition has become increasingly significant. Emotion recognition technology provides practical benefits across various industries, including user experience enhancement, education, and organizational productivity. For instance, in educational settings, it enables real-time understanding of students’ emotional states, facilitating tailored feedback. In workplaces, monitoring employees’ emotions can contribute to improved job performance and satisfaction. Recently, emotion recognition has also gained attention in media a
22

Mai, Sijie, Haifeng Hu, and Songlong Xing. "Modality to Modality Translation: An Adversarial Representation Learning and Graph Fusion Network for Multimodal Fusion." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 01 (2020): 164–72. http://dx.doi.org/10.1609/aaai.v34i01.5347.

Full text
Abstract:
Learning joint embedding space for various modalities is of vital importance for multimodal fusion. Mainstream modality fusion approaches fail to achieve this goal, leaving a modality gap which heavily affects cross-modal fusion. In this paper, we propose a novel adversarial encoder-decoder-classifier framework to learn a modality-invariant embedding space. Since the distributions of various modalities vary in nature, to reduce the modality gap, we translate the distributions of source modalities into that of target modality via their respective encoders using adversarial training. Furthermore
23

Wagh, Kapil Adhar. "A Review: Word Embedding Models with Machine Learning Based Context Depend and Context Independent Techniques." Advances in Nonlinear Variational Inequalities 28, no. 3s (2024): 251–58. https://doi.org/10.52783/anvi.v28.2928.

Full text
Abstract:
Natural language processing (NLP) has been transformed by word embedding models, which convert text into meaningful numerical representations. These models fall into two general categories: context-dependent methods like ELMo, BERT, and GPT, and context-independent methods like Word2Vec, GloVe, and FastText. Although static word representations are provided by context-independent models, polysemy and contextual subtleties are difficult for them to capture. These issues are addressed by context-dependent approaches that make use of sophisticated deep learning architectures to produce dynamic em
24

Kim, Donghyun, Kuniaki Saito, Kate Saenko, Stan Sclaroff, and Bryan Plummer. "MULE: Multimodal Universal Language Embedding." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (2020): 11254–61. http://dx.doi.org/10.1609/aaai.v34i07.6785.

Full text
Abstract:
Existing vision-language methods typically support two languages at a time at most. In this paper, we present a modular approach which can easily be incorporated into existing vision-language methods in order to support many languages. We accomplish this by learning a single shared Multimodal Universal Language Embedding (MULE) which has been visually-semantically aligned across all languages. Then we learn to relate MULE to visual data as if it were a single language. Our method is not architecture specific, unlike prior work which typically learned separate branches for each language, enabli
25

Singh, Vijay Vaibhav. "Vector Embeddings: The Mathematical Foundation of Modern AI Systems." International Journal of Scientific Research in Computer Science, Engineering and Information Technology 11, no. 1 (2025): 2408–17. https://doi.org/10.32628/cseit251112257.

Full text
Abstract:
This comprehensive article examines vector embeddings as a fundamental component of modern artificial intelligence systems, detailing their mathematical foundations, key properties, implementation techniques, and practical applications. The article traces the evolution from basic word embeddings to sophisticated transformer-based architectures, highlighting how these representations enable machines to capture and process semantic relationships in human language and visual data. The article encompasses both theoretical frameworks and practical implementations, from the groundbreaking Word2Vec a
26

Wehrmann, Jônatas, Anderson Mattjie, and Rodrigo C. Barros. "Order embeddings and character-level convolutions for multimodal alignment." Pattern Recognition Letters 102 (January 2018): 15–22. http://dx.doi.org/10.1016/j.patrec.2017.11.020.

Full text
27

Mithun, Niluthpol C., Juncheng Li, Florian Metze, and Amit K. Roy-Chowdhury. "Joint embeddings with multimodal cues for video-text retrieval." International Journal of Multimedia Information Retrieval 8, no. 1 (2019): 3–18. http://dx.doi.org/10.1007/s13735-018-00166-3.

Full text
28

Fodor, Ádám, András Lőrincz, and Rachid R. Saboundji. "Enhancing apparent personality trait analysis with cross-modal embeddings." Annales Universitatis Scientiarum Budapestinensis de Rolando Eötvös Nominatae. Sectio computatorica 57 (2024): 167–85. https://doi.org/10.71352/ac.57.167.

Full text
Abstract:
Automatic personality trait assessment is essential for high-quality human-machine interactions. Systems capable of human behavior analysis could be used for self-driving cars, medical research, and surveillance, among many others. We present a multimodal deep neural network with a distance learning network extension for apparent personality trait prediction trained on short video recordings and exploiting modality invariant embeddings. Acoustic, visual, and textual information are utilized to reach high-performance solutions in this task. Due to the highly centralized target distribution of th
29

Nayak, Roshan, B. S. Ullas Kannantha, S. Kruthi, and C. Gururaj. "Multimodal Offensive Meme Classification Using Transformers and BiLSTM." International Journal of Engineering and Advanced Technology (IJEAT) 11, no. 3 (2022): 96–102. https://doi.org/10.35940/ijeat.C3392.0211322.

Full text
Abstract:
<strong>Abstract:</strong> Nowadays memes have become a way in which people express their ideas on social media. These memes can convey various views including offensive ones. Memes can be intended for a personal attack, homophobic abuse, racial abuse, attack on minority etc. The memes are implicit and multi-modal in nature. Here we analyze the meme by categorizing them as offensive or not offensive and this becomes a binary classification problem. We propose a novel offensive meme classification using the transformer-based image encoder, BiLSTM for text with mean pooling as text encoder and a
30

Nayak, Roshan, B. S. Ullas Kannantha, Kruthi S, and C. Gururaj. "Multimodal Offensive Meme Classification using Transformers and BiLSTM." International Journal of Engineering and Advanced Technology 11, no. 3 (2022): 96–102. http://dx.doi.org/10.35940/ijeat.c3392.0211322.

Full text
Abstract:
Nowadays memes have become a way in which people express their ideas on social media. These memes can convey various views including offensive ones. Memes can be intended for a personal attack, homophobic abuse, racial abuse, attack on minority etc. The memes are implicit and multi-modal in nature. Here we analyze the meme by categorizing them as offensive or not offensive and this becomes a binary classification problem. We propose a novel offensive meme classification using the transformer-based image encoder, BiLSTM for text with mean pooling as text encoder and a Feed-Forward Network as a
31

Chen, Weijia, Zhijun Lu, Lijue You, Lingling Zhou, Jie Xu, and Ken Chen. "Artificial Intelligence–Based Multimodal Risk Assessment Model for Surgical Site Infection (AMRAMS): Development and Validation Study." JMIR Medical Informatics 8, no. 6 (2020): e18186. http://dx.doi.org/10.2196/18186.

Full text
Abstract:
Background: Surgical site infection (SSI) is one of the most common types of health care–associated infections. It increases mortality, prolongs hospital length of stay, and raises health care costs. Many institutions developed risk assessment models for SSI to help surgeons preoperatively identify high-risk patients and guide clinical intervention. However, most of these models had low accuracies. Objective: We aimed to provide a solution in the form of an Artificial intelligence–based Multimodal Risk Assessment Model for Surgical site infection (AMRAMS) for inpatients undergoing operations, us
32

Smelik, N. D. "Multimodal topic model for texts and images utilizing their embeddings." Machine Learning and Data Analysis 2, no. 4 (2016): 421–41. http://dx.doi.org/10.21469/22233792.2.4.05.

Full text
33

Abdou, Ahmed, Ekta Sood, Philipp Müller, and Andreas Bulling. "Gaze-enhanced Crossmodal Embeddings for Emotion Recognition." Proceedings of the ACM on Human-Computer Interaction 6, ETRA (2022): 1–18. http://dx.doi.org/10.1145/3530879.

Full text
Abstract:
Emotional expressions are inherently multimodal -- integrating facial behavior, speech, and gaze -- but their automatic recognition is often limited to a single modality, e.g. speech during a phone call. While previous work proposed crossmodal emotion embeddings to improve monomodal recognition performance, despite its importance, an explicit representation of gaze was not included. We propose a new approach to emotion recognition that incorporates an explicit representation of gaze in a crossmodal emotion embedding framework. We show that our method outperforms the previous state of the art f
34

Hu, Wenbo, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, and Zhuowen Tu. "BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 3 (2024): 2256–64. http://dx.doi.org/10.1609/aaai.v38i3.27999.

Full text
Abstract:
Vision Language Models (VLMs), which extend Large Language Models (LLM) by incorporating visual understanding capability, have demonstrated significant advancements in addressing open-ended visual question-answering (VQA) tasks. However, these models cannot accurately interpret images infused with text, a common occurrence in real-world scenarios. Standard procedures for extracting information from images often involve learning a fixed set of query embeddings. These embeddings are designed to encapsulate image contexts and are later used as soft prompt inputs in LLMs. Yet, this process is limi
35

Chen, Qihua, Xuejin Chen, Chenxuan Wang, Yixiong Liu, Zhiwei Xiong, and Feng Wu. "Learning Multimodal Volumetric Features for Large-Scale Neuron Tracing." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 2 (2024): 1174–82. http://dx.doi.org/10.1609/aaai.v38i2.27879.

Full text
Abstract:
The current neuron reconstruction pipeline for electron microscopy (EM) data usually includes automatic image segmentation followed by extensive human expert proofreading. In this work, we aim to reduce human workload by predicting connectivity between over-segmented neuron pieces, taking both microscopy image and 3D morphology features into account, similar to human proofreading workflow. To this end, we first construct a dataset, named FlyTracing, that contains millions of pairwise connections of segments expanding the whole fly brain, which is three orders of magnitude larger than existing
36

Shen, Aili, Bahar Salehi, Jianzhong Qi, and Timothy Baldwin. "A General Approach to Multimodal Document Quality Assessment." Journal of Artificial Intelligence Research 68 (July 22, 2020): 607–32. http://dx.doi.org/10.1613/jair.1.11647.

Full text
Abstract:
&#x0D; &#x0D; &#x0D; The perceived quality of a document is affected by various factors, including grammat- icality, readability, stylistics, and expertise depth, making the task of document quality assessment a complex one. In this paper, we explore this task in the context of assessing the quality of Wikipedia articles and academic papers. Observing that the visual rendering of a document can capture implicit quality indicators that are not present in the document text — such as images, font choices, and visual layout — we propose a joint model that combines the text content with a visual re
37

Sata, Ikumi, Motoki Amagasaki, and Masato Kiyama. "Multimodal Retrieval Method for Images and Diagnostic Reports Using Cross-Attention." AI 6, no. 2 (2025): 38. https://doi.org/10.3390/ai6020038.

Full text
Abstract:
Background: Conventional medical image retrieval methods treat images and text as independent embeddings, limiting their ability to fully utilize the complementary information from both modalities. This separation often results in suboptimal retrieval performance, as the intricate relationships between images and text remain underexplored. Methods: To address this limitation, we propose a novel retrieval method that integrates medical image and text embeddings using a cross-attention mechanism. Our approach creates a unified representation by directly modeling the interactions between the two
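The cross-attention mechanism summarised in this abstract can be sketched with PyTorch's built-in multi-head attention, assuming token-level report embeddings and patch-level image embeddings have already been projected to a shared dimension. The shapes and pooling choice below are illustrative assumptions, not the retrieval model described in the paper.

import torch
import torch.nn as nn

# Toy shapes: a batch of 4 reports with 64 text tokens and 4 images with 49 patches,
# both already projected to a shared 256-dimensional space.
text_tokens = torch.randn(4, 64, 256)
image_patches = torch.randn(4, 49, 256)

# Text attends to image patches: queries come from the report, keys/values from the image.
cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
fused, attn_weights = cross_attn(query=text_tokens, key=image_patches, value=image_patches)

# Pool the fused token representations into a single joint embedding per report-image pair.
joint_embedding = fused.mean(dim=1)                # shape: (4, 256)
print(joint_embedding.shape, attn_weights.shape)   # attention weights: (4, 64, 49)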
38

Chitturi, Kiran. "Demystifying Multimodal AI: A Technical Deep Dive." International Journal of Scientific Research in Computer Science, Engineering and Information Technology 10, no. 6 (2024): 2011–17. https://doi.org/10.32628/cseit2410612394.

Full text
Abstract:
This article explores the transformative impact of multimodal AI systems in bridging diverse data types and processing capabilities. It examines how these systems have revolutionized various domains through their ability to handle multiple modalities simultaneously, from visual-linguistic understanding to complex search operations. The article delves into the technical foundations of multimodal embeddings, analyzes leading models like CLIP and MUM, and investigates their real-world applications across different sectors. Through a detailed examination of current implementations, challenges, and
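As a concrete, hedged example of the kind of multimodal embedding model discussed above, the snippet below scores an image against candidate captions with OpenAI's CLIP through the Hugging Face transformers library; the local image path "example.jpg" and the caption strings are placeholders.

from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")   # placeholder path to a local image
captions = ["a diagram of a neural network", "a photo of a dog", "a city skyline at night"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-to-text similarity scores, turned into a probability over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")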
39

Tokar, Tomas, and Scott Sanner. "ICE-T: Interactions-aware Cross-column Contrastive Embedding for Heterogeneous Tabular Datasets." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 20 (2025): 20904–11. https://doi.org/10.1609/aaai.v39i20.35385.

Full text
Abstract:
Finding high-quality representations of heterogeneous tabular datasets is crucial for their effective use in downstream machine learning tasks. Contrastive representation learning (CRL) methods have been previously shown to provide a straightforward way to learn such representations across various data domains. Current tabular CRL methods learn joint embeddings of data instances (tabular rows) by minimizing a contrastive loss between the original instance and its perturbations. Unlike existing tabular CRL methods, we propose leveraging frameworks established in multimodal representation learni
40

Ma, Shukui, Pengyuan Ma, Shuaichao Feng, Fei Ma, and Guangping Zhuo. "Multimodal Data-Based Text Generation Depression Classification Model." International Journal of Computer Science and Information Technology 5, no. 1 (2025): 175–93. https://doi.org/10.62051/ijcsit.v5n1.16.

Full text
Abstract:
Depression classification often relies on multimodal features, but existing models struggle to capture the similarity between multimodal features. Moreover, the social stigma surrounding depression leads to limited availability of datasets, which constrains model accuracy. This study aims to improve multimodal depression recognition methods by proposing a Multimodal Generation-Text Depression Classification Model. The model introduces a Multimodal-Deep-Extract-Feature Net to capture both long- and short-term sequential features. A Dual Text Contrastive Learning Module is employed to generate e
41

Zhang, Jianqiang, Renyao Chen, Shengwen Li, Tailong Li, and Hong Yao. "MGKGR: Multimodal Semantic Fusion for Geographic Knowledge Graph Representation." Algorithms 17, no. 12 (2024): 593. https://doi.org/10.3390/a17120593.

Full text
Abstract:
Geographic knowledge graph representation learning embeds entities and relationships in geographic knowledge graphs into a low-dimensional continuous vector space, which serves as a basic method that bridges geographic knowledge graphs and geographic applications. Previous geographic knowledge graph representation methods primarily learn the vectors of entities and their relationships from their spatial attributes and relationships, which ignores various semantics of entities, resulting in poor embeddings on geographic knowledge graphs. This study proposes a two-stage multimodal geographic kno
42

Arora, Jyoti, Priyal Khapekar, and Rakhi Pal. "Multimodal Sentiment Analysis using LSTM and RoBerta." Advanced Innovations in Computer Programming Languages 5, no. 2 (2023): 24–35. https://doi.org/10.5281/zenodo.8130701.

Full text
Abstract:
<em>Social media is a valuable data source for understanding people&#39;s thoughts and feelings. Sentiment analysis and affective computing help analyze sentiment and emotions in social media posts. Our research paper proposes a model for tweet emotions analysis using LSTM, GloVe embeddings, and RoBERTa. This model captures sequential dependencies in tweets, leverages semantic representations, and enhances contextual understanding. We evaluate the model on a tweet emotions dataset, demonstrating its effectiveness in accurately classifying emotions in tweets.Through evaluation on a tweet emotio
43

Tseng, Shao-Yen, Shrikanth Narayanan, and Panayiotis Georgiou. "Multimodal Embeddings From Language Models for Emotion Recognition in the Wild." IEEE Signal Processing Letters 28 (2021): 608–12. http://dx.doi.org/10.1109/lsp.2021.3065598.

Full text
44

Jing, Xuebin, Liang He, Zhida Song, and Shaolei Wang. "Audio–Visual Fusion Based on Interactive Attention for Person Verification." Sensors 23, no. 24 (2023): 9845. http://dx.doi.org/10.3390/s23249845.

Full text
Abstract:
With the rapid development of multimedia technology, personnel verification systems have become increasingly important in the security field and identity verification. However, unimodal verification systems have performance bottlenecks in complex scenarios, thus triggering the need for multimodal feature fusion methods. The main problem with audio–visual multimodal feature fusion is how to effectively integrate information from different modalities to improve the accuracy and robustness of the system for individual identity. In this paper, we focus on how to improve multimodal person verificat
45

Azeroual, Saadia, Zakaria Hamane, Rajaa Sebihi, and Fatima-Ezzahraa Ben-Bouazza. "Toward Improved Glioma Mortality Prediction: A Multimodal Framework Combining Radiomic and Clinical Features." International Journal of Online and Biomedical Engineering (iJOE) 21, no. 05 (2025): 31–46. https://doi.org/10.3991/ijoe.v21i05.52691.

Full text
Abstract:
Gliomas, especially diffuse gliomas, remain a major challenge in neuro-oncology due to their highly heterogeneous nature and poor prognosis. Accurately predicting patient mortality is essential for improving treatment strategies and outcomes, yet current models often fail to fully utilize the wealth of available multimodal data. To address this, we developed a novel multimodal predictive model that integrates diverse magnetic resonance imaging (MRI) sequences—T1, T2, FLAIR, DWI, SWI, and advanced diffusion metrics such as high angular resolution diffusion imaging (HARDI)—with detailed clinical
46

Salin, Emmanuelle, Badreddine Farah, Stéphane Ayache, and Benoit Favre. "Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (2022): 11248–57. http://dx.doi.org/10.1609/aaai.v36i10.21375.

Full text
Abstract:
In recent years, joint text-image embeddings have significantly improved thanks to the development of transformer-based Vision-Language models. Despite these advances, we still need to better understand the representations produced by those models. In this paper, we compare pre-trained and fine-tuned representations at a vision, language and multimodal level. To that end, we use a set of probing tasks to evaluate the performance of state-of-the-art Vision-Language models and introduce new datasets specifically for multimodal probing. These datasets are carefully designed to address a range of
47

Peruka, Bikshapathy. "Sentemonet: A Comprehensive Framework for Multimodal Sentiment Analysis from Text and Emotions." Journal of Information Systems Engineering and Management 10, no. 34s (2025): 569–87. https://doi.org/10.52783/jisem.v10i34s.5852.

Full text
Abstract:
Sentiment analysis, a crucial aspect of Natural Language Processing (NLP), plays a pivotal role in understanding public opinion, customer feedback, and user sentiments in various domains. In this study, we present a comprehensive approach to sentiment analysis that incorporates both textual and emoji data, leveraging diverse datasets from sources such as social media, customer reviews, and surveys. Our methodology consists of several key steps, including data collection, pre-processing, feature extraction, feature fusion, and feature selection. For data pre-processing, we apply techniques such
48

Skantze, Gabriel, and Bram Willemsen. "CoLLIE: Continual Learning of Language Grounding from Language-Image Embeddings." Journal of Artificial Intelligence Research 74 (July 9, 2022): 1201–23. http://dx.doi.org/10.1613/jair.1.13689.

Full text
Abstract:
This paper presents CoLLIE: a simple, yet effective model for continual learning of how language is grounded in vision. Given a pre-trained multimodal embedding model, where language and images are projected in the same semantic space (in this case CLIP by OpenAI), CoLLIE learns a transformation function that adjusts the language embeddings when needed to accommodate new language use. This is done by predicting the difference vector that needs to be applied, as well as a scaling factor for this vector, so that the adjustment is only applied when needed. Unlike traditional few-shot learning, th
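The adjustment mechanism summarised above, predicting a difference vector plus a scaling factor and applying it only when needed, can be sketched in a few lines of PyTorch. The module below is a simplified illustration based on that description, not the released CoLLIE code; the class name EmbeddingAdjuster, the dimensions, and the random stand-in for frozen CLIP text embeddings are assumptions.

import torch
import torch.nn as nn

class EmbeddingAdjuster(nn.Module):
    """Learns a difference vector and a gate for a frozen language embedding (sketch)."""
    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.delta = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.gate = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, text_emb):
        # The gate scales the predicted difference vector, so embeddings that need no
        # adjustment can pass through almost unchanged.
        return text_emb + self.gate(text_emb) * self.delta(text_emb)

adjuster = EmbeddingAdjuster()
frozen_clip_text_emb = torch.randn(8, 512)   # stand-in for frozen CLIP text embeddings
adjusted = adjuster(frozen_clip_text_emb)
print(adjusted.shape)                         # torch.Size([8, 512])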
49

Li, Wenxiang, Longyuan Ding, Yuliang Zhang, and Ziyuan Pu. "Understanding multimodal travel patterns based on semantic embeddings of human mobility trajectories." Journal of Transport Geography 124 (April 2025): 104169. https://doi.org/10.1016/j.jtrangeo.2025.104169.

Full text
50

Wang, Jenq-Haur, Mehdi Norouzi, and Shu Ming Tsai. "Augmenting Multimodal Content Representation with Transformers for Misinformation Detection." Big Data and Cognitive Computing 8, no. 10 (2024): 134. http://dx.doi.org/10.3390/bdcc8100134.

Full text
Abstract:
Information sharing on social media has become a common practice for people around the world. Since it is difficult to check user-generated content on social media, huge amounts of rumors and misinformation are being spread with authentic information. On the one hand, most of the social platforms identify rumors through manual fact-checking, which is very inefficient. On the other hand, with an emerging form of misinformation that contains inconsistent image–text pairs, it would be beneficial if we could compare the meaning of multimodal content within the same post for detecting image–text in