Journal articles on the topic "Multimodal Embeddings"

Author: Grafiati

Published: October 26, 2024

Last modified: July 31, 2025

See the top 50 journal articles for research on the topic "Multimodal Embeddings".

1

Tyshchuk, Kirill, Polina Karpikova, Andrew Spiridonov, Anastasiia Prutianova, Anton Razzhigaev, and Alexander Panchenko. "On Isotropy of Multimodal Embeddings." Information 14, no. 7 (2023): 392. http://dx.doi.org/10.3390/info14070392.

Abstract:

Embeddings, i.e., vector representations of objects, such as texts, images, or graphs, play a key role in deep learning methodologies nowadays. Prior research has shown the importance of analyzing the isotropy of textual embeddings for transformer-based text encoders, such as the BERT model. Anisotropic word embeddings do not use the entire space, instead concentrating on a narrow cone in such a pretrained vector space, negatively affecting the performance of applications, such as textual semantic similarity. Transforming a vector space to optimize isotropy has been shown to be beneficial for

2

Guo, Zhiqiang, Jianjun Li, Guohui Li, Chaoyang Wang, Si Shi, and Bin Ruan. "LGMRec: Local and Global Graph Learning for Multimodal Recommendation." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 8 (2024): 8454–62. http://dx.doi.org/10.1609/aaai.v38i8.28688.

Abstract:

The multimodal recommendation has gradually become the infrastructure of online media platforms, enabling them to provide personalized service to users through a joint modeling of user historical behaviors (e.g., purchases, clicks) and item various modalities (e.g., visual and textual). The majority of existing studies typically focus on utilizing modal features or modal-related graph structure to learn user local interests. Nevertheless, these approaches encounter two limitations: (1) Shared updates of user ID embeddings result in the consequential coupling between collaboration and multimoda

3

Shang, Bin, Yinliang Zhao, Jun Liu, and Di Wang. "LAFA: Multimodal Knowledge Graph Completion with Link Aware Fusion and Aggregation." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 8 (2024): 8957–65. http://dx.doi.org/10.1609/aaai.v38i8.28744.

Abstract:

Recently, an enormous amount of research has emerged on multimodal knowledge graph completion (MKGC), which seeks to extract knowledge from multimodal data and predict the most plausible missing facts to complete a given multimodal knowledge graph (MKG). However, existing MKGC approaches largely ignore that visual information may introduce noise and lead to uncertainty when adding them to the traditional KG embeddings due to the contribution of each associated image to entity is different in diverse link scenarios. Moreover, treating each triple independently when learning entity embeddings le

4

Sun, Zhongkai, Prathusha Sarma, William Sethares, and Yingyu Liang. "Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 05 (2020): 8992–99. http://dx.doi.org/10.1609/aaai.v34i05.6431.

Abstract:

Multimodal language analysis often considers relationships between features based on text and those based on acoustical and visual properties. Text features typically outperform non-text features in sentiment analysis or emotion recognition tasks in part because the text features are derived from advanced language models or word embeddings trained on massive data sources while audio and video features are human-engineered and comparatively underdeveloped. Given that the text, audio, and video are describing the same utterance in different ways, we hypothesize that the multimodal sentiment anal

5

Merkx, Danny, and Stefan L. Frank. "Learning semantic sentence representations from visually grounded language without lexical knowledge." Natural Language Engineering 25, no. 4 (2019): 451–66. http://dx.doi.org/10.1017/s1351324919000196.

Abstract:

Current approaches to learning semantic representations of sentences often use prior word-level knowledge. The current study aims to leverage visual information in order to capture sentence level semantics without the need for word embeddings. We use a multimodal sentence encoder trained on a corpus of images with matching text captions to produce visually grounded sentence embeddings. Deep Neural Networks are trained to map the two modalities to a common embedding space such that for an image the corresponding caption can be retrieved and vice versa. We show that our model achieves re

6

Mihail Mateev. "Comparative Analysis on Implementing Embeddings for Image Analysis." Journal of Information Systems Engineering and Management 10, no. 17s (2025): 89–102. https://doi.org/10.52783/jisem.v10i17s.2710.

Abstract:

This research explores how artificial intelligence enhances construction maintenance and diagnostics, achieving 95% accuracy on a dataset of 10,000 cases. The findings highlight AI's potential to revolutionize predictive maintenance in the industry. The growing adoption of image embeddings has transformed visual data processing across AI applications. This study evaluates embedding implementations in major platforms, including Azure AI, OpenAI's GPT-4 Vision, and frameworks like Hugging Face, Replicate, and Eden AI. It assesses their scalability, accuracy, cost-effectiveness, and integration f

7

Tang, Zhenchao, Jiehui Huang, Guanxing Chen, and Calvin Yu-Chian Chen. "Comprehensive View Embedding Learning for Single-Cell Multimodal Integration." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 14 (2024): 15292–300. http://dx.doi.org/10.1609/aaai.v38i14.29453.

Abstract:

Motivation: Advances in single-cell measurement techniques provide rich multimodal data, which helps us to explore the life state of cells more deeply. However, multimodal integration, or, learning joint embeddings from multimodal data remains a current challenge. The difficulty in integrating unpaired single-cell multimodal data is that different modalities have different feature spaces, which easily leads to information loss in joint embedding. And few existing methods have fully exploited and fused the information in single-cell multimodal data. Result: In this study, we propose CoVEL, a de

8

Zhang, Linhai, Deyu Zhou, Yulan He, and Zeng Yang. "MERL: Multimodal Event Representation Learning in Heterogeneous Embedding Spaces." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 16 (2021): 14420–27. http://dx.doi.org/10.1609/aaai.v35i16.17695.

Abstract:

Previous work has shown the effectiveness of using event representations for tasks such as script event prediction and stock market prediction. It is however still challenging to learn the subtle semantic differences between events based solely on textual descriptions of events often represented as (subject, predicate, object) triples. As an alternative, images offer a more intuitive way of understanding event semantics. We observe that event described in text and in images show different abstraction levels and therefore should be projected onto heterogeneous embedding spaces, as opposed to wh

9

Sah, Shagan, Sabarish Gopalakishnan, and Raymond Ptucha. "Aligned attention for common multimodal embeddings." Journal of Electronic Imaging 29, no. 02 (2020): 1. http://dx.doi.org/10.1117/1.jei.29.2.023013.

10

Alkaabi, Hussein, Ali Kadhim Jasim, and Ali Darroudi. "From Static to Contextual: A Survey of Embedding Advances in NLP." PERFECT: Journal of Smart Algorithms 2, no. 2 (2025): 57–66. https://doi.org/10.62671/perfect.v2i2.77.

Abstract:

Embedding techniques have been a cornerstone of Natural Language Processing (NLP), enabling machines to represent textual data in a form that captures semantic and syntactic relationships. Over the years, the field has witnessed a significant evolution—from static word embeddings, such as Word2Vec and GloVe, which represent words as fixed vectors, to dynamic, contextualized embeddings like BERT and GPT, which generate word representations based on their surrounding context. This survey provides a comprehensive overview of embedding techniques, tracing their development from early methods to st

11

Zhang, Rongchao, Yiwei Lou, Dexuan Xu, Yongzhi Cao, Hanpin Wang, and Yu Huang. "A Learnable Discrete-Prior Fusion Autoencoder with Contrastive Learning for Tabular Data Synthesis." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 15 (2024): 16803–11. http://dx.doi.org/10.1609/aaai.v38i15.29621.

Abstract:

The actual collection of tabular data for sharing involves confidentiality and privacy constraints, leaving the potential risks of machine learning for interventional data analysis unsafely averted. Synthetic data has emerged recently as a privacy-protecting solution to address this challenge. However, existing approaches regard discrete and continuous modal features as separate entities, thus falling short in properly capturing their inherent correlations. In this paper, we propose a novel contrastive learning guided Gaussian Transformer autoencoder, termed GTCoder, to synthesize photo-realis

12

Lin, Kaiyi, Xing Xu, Lianli Gao, Zheng Wang, and Heng Tao Shen. "Learning Cross-Aligned Latent Embeddings for Zero-Shot Cross-Modal Retrieval." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (2020): 11515–22. http://dx.doi.org/10.1609/aaai.v34i07.6817.

Abstract:

Zero-Shot Cross-Modal Retrieval (ZS-CMR) is an emerging research hotspot that aims to retrieve data of new classes across different modality data. It is challenging for not only the heterogeneous distributions across different modalities, but also the inconsistent semantics across seen and unseen classes. A handful of recently proposed methods typically borrow the idea from zero-shot learning, i.e., exploiting word embeddings of class labels (i.e., class-embeddings) as common semantic space, and using generative adversarial network (GAN) to capture the underlying multimodal data structures, as

13

Khalifa, Omar Yasser Ibrahim, and Muhammad Zafran Muhammad Zaly Shah. "MultiPhishNet: A Multimodal Approach of QR Code Phishing Detection using Multi-Head Attention and Multilingual Embeddings." International Journal of Innovative Computing 15, no. 1 (2025): 53–61. https://doi.org/10.11113/ijic.v15n1.512.

Abstract:

Phishing attacks leveraging QR codes have become a significant threat due to their increasing use in contactless services. These attacks are challenging to detect since QR codes typically encode URLs leading to phishing websites designed to steal sensitive information. Existing detection methods often rely on blacklists or handcrafted features, which are inadequate for handling obfuscated URLs and multilingual content. This paper proposes MultiPhishNet, a multimodal phishing detection model that integrates advanced embedding techniques, Convolutional Neural Networks (CNNs), and multi-head atte

14

Waqas, Asim, Aakash Tripathi, Mia Naeini, Paul A. Stewart, Matthew B. Schabath, and Ghulam Rasool. "Abstract 991: PARADIGM: an embeddings-based multimodal learning framework with foundation models and graph neural networks." Cancer Research 85, no. 8_Supplement_1 (2025): 991. https://doi.org/10.1158/1538-7445.am2025-991.

Abstract:

Introduction: Cancer research faces significant challenges in integrating heterogeneous data across varying spatial and temporal scales, limiting the ability to gain a comprehensive understanding of the disease. PARADIGM (Pan-Cancer Embeddings Representation using Advanced Multimodal Learning with Graph-based Modeling) addresses this challenge by providing a framework leveraging foundation models (FMs) and Graph Neural Networks (GNN). PARADIGM framework generates embeddings from multi-resolution datasets using modality-specific FMs, aggregates sample embeddings, fuses them into a unifi

15

Li, Xiaolong, Yang Dong, Yunfei Yi, Zhixun Liang, and Shuqi Yan. "Hypergraph Neural Network for Multimodal Depression Recognition." Electronics 13, no. 22 (2024): 4544. http://dx.doi.org/10.3390/electronics13224544.

Abstract:

Deep learning-based approaches for automatic depression recognition offer advantages of low cost and high efficiency. However, depression symptoms are challenging to detect and vary significantly between individuals. Traditional deep learning methods often struggle to capture and model these nuanced features effectively, leading to lower recognition accuracy. This paper introduces a novel multimodal depression recognition method, HYNMDR, which utilizes hypergraphs to represent the complex, high-order relationships among patients with depression. HYNMDR comprises two primary components: a tempo

16

Zhu, Chaoyu, Zhihao Yang, Xiaoqiong Xia, Nan Li, Fan Zhong, and Lei Liu. "Multimodal reasoning based on knowledge graph embedding for specific diseases." Bioinformatics 38, no. 8 (2022): 2235–45. http://dx.doi.org/10.1093/bioinformatics/btac085.

Abstract:

Motivation: Knowledge Graph (KG) is becoming increasingly important in the biomedical field. Deriving new and reliable knowledge from existing knowledge by KG embedding technology is a cutting-edge method. Some add a variety of additional information to aid reasoning, namely multimodal reasoning. However, few works based on the existing biomedical KGs are focused on specific diseases. Results: This work develops a construction and multimodal reasoning process of Specific Disease Knowledge Graphs (SDKGs). We construct SDKG-11, a SDKG set including five cancers, six non-cancer diseases, a

17

Tripathi, Aakash Gireesh, Asim Waqas, Yasin Yilmaz, Matthew B. Schabath, and Ghulam Rasool. "Abstract 3641: Predicting treatment outcomes using cross-modality correlations in multimodal oncology data." Cancer Research 85, no. 8_Supplement_1 (2025): 3641. https://doi.org/10.1158/1538-7445.am2025-3641.

Abstract:

Accurate prediction of treatment outcomes in oncology requires modeling the intricate relationships across diverse data modalities. This study investigates cross-modality correlations by leveraging imaging and clinical data curated through the Multimodal Integration of Oncology Data System (MINDS) and HoneyBee frameworks to uncover actionable patterns for personalized treatment strategies. Using data from over 10,000 cancer patients, we developed a machine learning pipeline that employs advanced embedding techniques to capture associations between radiological imaging phenotypes and

18

Tripathi, Aakash, Asim Waqas, Yasin Yilmaz, and Ghulam Rasool. "Abstract 4905: Multimodal transformer model improves survival prediction in lung cancer compared to unimodal approaches." Cancer Research 84, no. 6_Supplement (2024): 4905. http://dx.doi.org/10.1158/1538-7445.am2024-4905.

Abstract:

Integrating multimodal lung data including clinical notes, medical images, and molecular data is critical for predictive modeling tasks like survival prediction, yet effectively aligning these disparate data types remains challenging. We present a novel method to integrate heterogeneous lung modalities by first thoroughly analyzing various domain-specific models and selecting the optimal model for embedding feature extraction per data type based on performance on representative pretrained tasks. For clinical notes, the GatorTron models showed the lowest regression loss on an initial e

19

Xu, Jinfeng, Zheyu Chen, Shuo Yang, Jinze Li, Hewei Wang, and Edith C. H. Ngai. "MENTOR: Multi-level Self-supervised Learning for Multimodal Recommendation." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 12 (2025): 12908–17. https://doi.org/10.1609/aaai.v39i12.33408.

Abstract:

As multimedia information proliferates, multimodal recommendation systems have garnered significant attention. These systems leverage multimodal information to alleviate the data sparsity issue inherent in recommendation systems, thereby enhancing the accuracy of recommendations. Due to the natural semantic disparities among multimodal features, recent research has primarily focused on cross-modal alignment using self-supervised learning to bridge these gaps. However, aligning different modal features might result in the loss of valuable interaction information, distancing them from ID embeddi

20

Ota, Kosuke, Keiichiro Shirai, Hidetoshi Miyao, and Minoru Maruyama. "Multimodal Analogy-Based Image Retrieval by Improving Semantic Embeddings." Journal of Advanced Computational Intelligence and Intelligent Informatics 26, no. 6 (2022): 995–1003. http://dx.doi.org/10.20965/jaciii.2022.p0995.

Abstract:

In this work, we study the application of multimodal analogical reasoning to image retrieval. Multimodal analogy questions are given in a form of tuples of words and images, e.g., “cat”:“dog”::[an image of a cat sitting on a bench]:?, to search for an image of a dog sitting on a bench. Retrieving desired images given these tuples can be seen as a task of finding images whose relation between the query image is close to that of query words. One way to achieve the task is building a common vector space that exhibits analogical regularities. To learn such an embedding, we propose a quadruple neur

21

Yi, Moung-Ho, Keun-Chang Kwak, and Ju-Hyun Shin. "KoHMT: A Multimodal Emotion Recognition Model Integrating KoELECTRA, HuBERT with Multimodal Transformer." Electronics 13, no. 23 (2024): 4674. http://dx.doi.org/10.3390/electronics13234674.

Abstract:

With the advancement of human-computer interaction, the role of emotion recognition has become increasingly significant. Emotion recognition technology provides practical benefits across various industries, including user experience enhancement, education, and organizational productivity. For instance, in educational settings, it enables real-time understanding of students’ emotional states, facilitating tailored feedback. In workplaces, monitoring employees’ emotions can contribute to improved job performance and satisfaction. Recently, emotion recognition has also gained attention in media a

22

Mai, Sijie, Haifeng Hu, and Songlong Xing. "Modality to Modality Translation: An Adversarial Representation Learning and Graph Fusion Network for Multimodal Fusion." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 01 (2020): 164–72. http://dx.doi.org/10.1609/aaai.v34i01.5347.

Abstract:

Learning joint embedding space for various modalities is of vital importance for multimodal fusion. Mainstream modality fusion approaches fail to achieve this goal, leaving a modality gap which heavily affects cross-modal fusion. In this paper, we propose a novel adversarial encoder-decoder-classifier framework to learn a modality-invariant embedding space. Since the distributions of various modalities vary in nature, to reduce the modality gap, we translate the distributions of source modalities into that of target modality via their respective encoders using adversarial training. Furthermore

23

Kapil Adhar Wagh. "A Review: Word Embedding Models with Machine Learning Based Context Depend and Context Independent Techniques." Advances in Nonlinear Variational Inequalities 28, no. 3s (2024): 251–58. https://doi.org/10.52783/anvi.v28.2928.

Abstract:

Natural language processing (NLP) has been transformed by word embedding models, which convert text into meaningful numerical representations. These models fall into two general categories: context-dependent methods like ELMo, BERT, and GPT, and context-independent methods like Word2Vec, GloVe, and FastText. Although static word representations are provided by context-independent models, polysemy and contextual subtleties are difficult for them to capture. These issues are addressed by context-dependent approaches that make use of sophisticated deep learning architectures to produce dynamic em

24

Kim, Donghyun, Kuniaki Saito, Kate Saenko, Stan Sclaroff, and Bryan Plummer. "MULE: Multimodal Universal Language Embedding." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (2020): 11254–61. http://dx.doi.org/10.1609/aaai.v34i07.6785.

Abstract:

Existing vision-language methods typically support two languages at a time at most. In this paper, we present a modular approach which can easily be incorporated into existing vision-language methods in order to support many languages. We accomplish this by learning a single shared Multimodal Universal Language Embedding (MULE) which has been visually-semantically aligned across all languages. Then we learn to relate MULE to visual data as if it were a single language. Our method is not architecture specific, unlike prior work which typically learned separate branches for each language, enabli

25

Vijay Vaibhav Singh. "Vector Embeddings: The Mathematical Foundation of Modern AI Systems." International Journal of Scientific Research in Computer Science, Engineering and Information Technology 11, no. 1 (2025): 2408–17. https://doi.org/10.32628/cseit251112257.

Abstract:

This comprehensive article examines vector embeddings as a fundamental component of modern artificial intelligence systems, detailing their mathematical foundations, key properties, implementation techniques, and practical applications. The article traces the evolution from basic word embeddings to sophisticated transformer-based architectures, highlighting how these representations enable machines to capture and process semantic relationships in human language and visual data. The article encompasses both theoretical frameworks and practical implementations, from the groundbreaking Word2Vec a

26

Wehrmann, Jônatas, Anderson Mattjie, and Rodrigo C. Barros. "Order embeddings and character-level convolutions for multimodal alignment." Pattern Recognition Letters 102 (January 2018): 15–22. http://dx.doi.org/10.1016/j.patrec.2017.11.020.

27

Mithun, Niluthpol C., Juncheng Li, Florian Metze, and Amit K. Roy-Chowdhury. "Joint embeddings with multimodal cues for video-text retrieval." International Journal of Multimedia Information Retrieval 8, no. 1 (2019): 3–18. http://dx.doi.org/10.1007/s13735-018-00166-3.

28

Fodor, Ádám, András Lőrincz, and Rachid R. Saboundji. "Enhancing apparent personality trait analysis with cross-modal embeddings." Annales Universitatis Scientiarum Budapestinensis de Rolando Eötvös Nominatae. Sectio computatorica 57 (2024): 167–85. https://doi.org/10.71352/ac.57.167.

Abstract:

Automatic personality trait assessment is essential for high-quality human-machine interactions. Systems capable of human behavior analysis could be used for self-driving cars, medical research, and surveillance, among many others. We present a multimodal deep neural network with a distance learning network extension for apparent personality trait prediction trained on short video recordings and exploiting modality invariant embeddings. Acoustic, visual, and textual information are utilized to reach high-performance solutions in this task. Due to the highly centralized target distribution of th

29

Roshan, Nayak, S. Ullas Kannantha B, S. Kruthi, and Gururaj C. "Multimodal Offensive Meme Classification Using Transformers and BiLSTM." International Journal of Engineering and Advanced Technology (IJEAT) 11, no. 3 (2022): 96–102. https://doi.org/10.35940/ijeat.C3392.0211322.

Abstract:

Nowadays memes have become a way in which people express their ideas on social media. These memes can convey various views including offensive ones. Memes can be intended for a personal attack, homophobic abuse, racial abuse, attack on minority etc. The memes are implicit and multi-modal in nature. Here we analyze the meme by categorizing them as offensive or not offensive and this becomes a binary classification problem. We propose a novel offensive meme classification using the transformer-based image encoder, BiLSTM for text with mean pooling as text encoder and a

30

Nayak, Roshan, B. S. Ullas Kannantha, Kruthi S, and C. Gururaj. "Multimodal Offensive Meme Classification using Transformers and BiLSTM." International Journal of Engineering and Advanced Technology 11, no. 3 (2022): 96–102. http://dx.doi.org/10.35940/ijeat.c3392.0211322.

Abstract:

Nowadays memes have become a way in which people express their ideas on social media. These memes can convey various views including offensive ones. Memes can be intended for a personal attack, homophobic abuse, racial abuse, attack on minority etc. The memes are implicit and multi-modal in nature. Here we analyze the meme by categorizing them as offensive or not offensive and this becomes a binary classification problem. We propose a novel offensive meme classification using the transformer-based image encoder, BiLSTM for text with mean pooling as text encoder and a Feed-Forward Network as a

31

Chen, Weijia, Zhijun Lu, Lijue You, Lingling Zhou, Jie Xu, and Ken Chen. "Artificial Intelligence–Based Multimodal Risk Assessment Model for Surgical Site Infection (AMRAMS): Development and Validation Study." JMIR Medical Informatics 8, no. 6 (2020): e18186. http://dx.doi.org/10.2196/18186.

Abstract:

Background: Surgical site infection (SSI) is one of the most common types of health care–associated infections. It increases mortality, prolongs hospital length of stay, and raises health care costs. Many institutions developed risk assessment models for SSI to help surgeons preoperatively identify high-risk patients and guide clinical intervention. However, most of these models had low accuracies. Objective: We aimed to provide a solution in the form of an Artificial intelligence–based Multimodal Risk Assessment Model for Surgical site infection (AMRAMS) for inpatients undergoing operations, us

32

N.D., Smelik. "Multimodal topic model for texts and images utilizing their embeddings." Machine Learning and Data Analysis 2, no. 4 (2016): 421–41. http://dx.doi.org/10.21469/22233792.2.4.05.

33

Abdou, Ahmed, Ekta Sood, Philipp Müller, and Andreas Bulling. "Gaze-enhanced Crossmodal Embeddings for Emotion Recognition." Proceedings of the ACM on Human-Computer Interaction 6, ETRA (2022): 1–18. http://dx.doi.org/10.1145/3530879.

Abstract:

Emotional expressions are inherently multimodal -- integrating facial behavior, speech, and gaze -- but their automatic recognition is often limited to a single modality, e.g. speech during a phone call. While previous work proposed crossmodal emotion embeddings to improve monomodal recognition performance, despite its importance, an explicit representation of gaze was not included. We propose a new approach to emotion recognition that incorporates an explicit representation of gaze in a crossmodal emotion embedding framework. We show that our method outperforms the previous state of the art f

34

Hu, Wenbo, Yifan Xu, Yi Li, Weiyue Li, Zeyuan Chen, and Zhuowen Tu. "BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 3 (2024): 2256–64. http://dx.doi.org/10.1609/aaai.v38i3.27999.

Abstract:

Vision Language Models (VLMs), which extend Large Language Models (LLM) by incorporating visual understanding capability, have demonstrated significant advancements in addressing open-ended visual question-answering (VQA) tasks. However, these models cannot accurately interpret images infused with text, a common occurrence in real-world scenarios. Standard procedures for extracting information from images often involve learning a fixed set of query embeddings. These embeddings are designed to encapsulate image contexts and are later used as soft prompt inputs in LLMs. Yet, this process is limi

35

Chen, Qihua, Xuejin Chen, Chenxuan Wang, Yixiong Liu, Zhiwei Xiong, and Feng Wu. "Learning Multimodal Volumetric Features for Large-Scale Neuron Tracing." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 2 (2024): 1174–82. http://dx.doi.org/10.1609/aaai.v38i2.27879.

Abstract:

The current neuron reconstruction pipeline for electron microscopy (EM) data usually includes automatic image segmentation followed by extensive human expert proofreading. In this work, we aim to reduce human workload by predicting connectivity between over-segmented neuron pieces, taking both microscopy image and 3D morphology features into account, similar to human proofreading workflow. To this end, we first construct a dataset, named FlyTracing, that contains millions of pairwise connections of segments expanding the whole fly brain, which is three orders of magnitude larger than existing

36

Shen, Aili, Bahar Salehi, Jianzhong Qi, and Timothy Baldwin. "A General Approach to Multimodal Document Quality Assessment." Journal of Artificial Intelligence Research 68 (July 22, 2020): 607–32. http://dx.doi.org/10.1613/jair.1.11647.

Abstract:

The perceived quality of a document is affected by various factors, including grammaticality, readability, stylistics, and expertise depth, making the task of document quality assessment a complex one. In this paper, we explore this task in the context of assessing the quality of Wikipedia articles and academic papers. Observing that the visual rendering of a document can capture implicit quality indicators that are not present in the document text — such as images, font choices, and visual layout — we propose a joint model that combines the text content with a visual re

37

Sata, Ikumi, Motoki Amagasaki, and Masato Kiyama. "Multimodal Retrieval Method for Images and Diagnostic Reports Using Cross-Attention." AI 6, no. 2 (2025): 38. https://doi.org/10.3390/ai6020038.

Abstract:

Background: Conventional medical image retrieval methods treat images and text as independent embeddings, limiting their ability to fully utilize the complementary information from both modalities. This separation often results in suboptimal retrieval performance, as the intricate relationships between images and text remain underexplored. Methods: To address this limitation, we propose a novel retrieval method that integrates medical image and text embeddings using a cross-attention mechanism. Our approach creates a unified representation by directly modeling the interactions between the two

38

Chitturi, Kiran. "Demystifying Multimodal AI: A Technical Deep Dive." International Journal of Scientific Research in Computer Science, Engineering and Information Technology 10, no. 6 (2024): 2011–17. https://doi.org/10.32628/cseit2410612394.

Abstract:

This article explores the transformative impact of multimodal AI systems in bridging diverse data types and processing capabilities. It examines how these systems have revolutionized various domains through their ability to handle multiple modalities simultaneously, from visual-linguistic understanding to complex search operations. The article delves into the technical foundations of multimodal embeddings, analyzes leading models like CLIP and MUM, and investigates their real-world applications across different sectors. Through a detailed examination of current implementations, challenges, and

39

Tokar, Tomas, and Scott Sanner. "ICE-T: Interactions-aware Cross-column Contrastive Embedding for Heterogeneous Tabular Datasets." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 20 (2025): 20904–11. https://doi.org/10.1609/aaai.v39i20.35385.

Abstract:

Finding high-quality representations of heterogeneous tabular datasets is crucial for their effective use in downstream machine learning tasks. Contrastive representation learning (CRL) methods have been previously shown to provide a straightforward way to learn such representations across various data domains. Current tabular CRL methods learn joint embeddings of data instances (tabular rows) by minimizing a contrastive loss between the original instance and its perturbations. Unlike existing tabular CRL methods, we propose leveraging frameworks established in multimodal representation learni

40

Ma, Shukui, Pengyuan Ma, Shuaichao Feng, Fei Ma, and Guangping Zhuo. "Multimodal Data-Based Text Generation Depression Classification Model." International Journal of Computer Science and Information Technology 5, no. 1 (2025): 175–93. https://doi.org/10.62051/ijcsit.v5n1.16.

Abstract:

Depression classification often relies on multimodal features, but existing models struggle to capture the similarity between multimodal features. Moreover, the social stigma surrounding depression leads to limited availability of datasets, which constrains model accuracy. This study aims to improve multimodal depression recognition methods by proposing a Multimodal Generation-Text Depression Classification Model. The model introduces a Multimodal-Deep-Extract-Feature Net to capture both long- and short-term sequential features. A Dual Text Contrastive Learning Module is employed to generate e

41

Zhang, Jianqiang, Renyao Chen, Shengwen Li, Tailong Li, and Hong Yao. "MGKGR: Multimodal Semantic Fusion for Geographic Knowledge Graph Representation." Algorithms 17, no. 12 (2024): 593. https://doi.org/10.3390/a17120593.

Abstract:

Geographic knowledge graph representation learning embeds entities and relationships in geographic knowledge graphs into a low-dimensional continuous vector space, which serves as a basic method that bridges geographic knowledge graphs and geographic applications. Previous geographic knowledge graph representation methods primarily learn the vectors of entities and their relationships from their spatial attributes and relationships, which ignores various semantics of entities, resulting in poor embeddings on geographic knowledge graphs. This study proposes a two-stage multimodal geographic kno

42

Jyoti, Arora, Khapekar Priyal, and Pal Rakhi. "Multimodal Sentiment Analysis using LSTM and RoBerta." Advanced Innovations in Computer Programming Languages 5, no. 2 (2023): 24–35. https://doi.org/10.5281/zenodo.8130701.

Abstract:

Social media is a valuable data source for understanding people's thoughts and feelings. Sentiment analysis and affective computing help analyze sentiment and emotions in social media posts. Our research paper proposes a model for tweet emotions analysis using LSTM, GloVe embeddings, and RoBERTa. This model captures sequential dependencies in tweets, leverages semantic representations, and enhances contextual understanding. We evaluate the model on a tweet emotions dataset, demonstrating its effectiveness in accurately classifying emotions in tweets. Through evaluation on a tweet emotio

43

Tseng, Shao-Yen, Shrikanth Narayanan, and Panayiotis Georgiou. "Multimodal Embeddings From Language Models for Emotion Recognition in the Wild." IEEE Signal Processing Letters 28 (2021): 608–12. http://dx.doi.org/10.1109/lsp.2021.3065598.

44

Jing, Xuebin, Liang He, Zhida Song, and Shaolei Wang. "Audio–Visual Fusion Based on Interactive Attention for Person Verification." Sensors 23, no. 24 (2023): 9845. http://dx.doi.org/10.3390/s23249845.

Abstract:

With the rapid development of multimedia technology, personnel verification systems have become increasingly important in the security field and identity verification. However, unimodal verification systems have performance bottlenecks in complex scenarios, thus triggering the need for multimodal feature fusion methods. The main problem with audio–visual multimodal feature fusion is how to effectively integrate information from different modalities to improve the accuracy and robustness of the system for individual identity. In this paper, we focus on how to improve multimodal person verificat

45

Azeroual, Saadia, Zakaria Hamane, Rajaa Sebihi, and Fatima-Ezzahraa Ben-Bouazza. "Toward Improved Glioma Mortality Prediction: A Multimodal Framework Combining Radiomic and Clinical Features." International Journal of Online and Biomedical Engineering (iJOE) 21, no. 05 (2025): 31–46. https://doi.org/10.3991/ijoe.v21i05.52691.

Abstract:

Gliomas, especially diffuse gliomas, remain a major challenge in neuro-oncology due to their highly heterogeneous nature and poor prognosis. Accurately predicting patient mortality is essential for improving treatment strategies and outcomes, yet current models often fail to fully utilize the wealth of available multimodal data. To address this, we developed a novel multimodal predictive model that integrates diverse magnetic resonance imaging (MRI) sequences—T1, T2, FLAIR, DWI, SWI, and advanced diffusion metrics such as high angular resolution diffusion imaging (HARDI)—with detailed clinical

46

Salin, Emmanuelle, Badreddine Farah, Stéphane Ayache, and Benoit Favre. "Are Vision-Language Transformers Learning Multimodal Representations? A Probing Perspective." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (2022): 11248–57. http://dx.doi.org/10.1609/aaai.v36i10.21375.

Abstract:

In recent years, joint text-image embeddings have significantly improved thanks to the development of transformer-based Vision-Language models. Despite these advances, we still need to better understand the representations produced by those models. In this paper, we compare pre-trained and fine-tuned representations at a vision, language and multimodal level. To that end, we use a set of probing tasks to evaluate the performance of state-of-the-art Vision-Language models and introduce new datasets specifically for multimodal probing. These datasets are carefully designed to address a range of

47

Peruka, Bikshapathy. "Sentemonet: A Comprehensive Framework for Multimodal Sentiment Analysis from Text and Emotions." Journal of Information Systems Engineering and Management 10, no. 34s (2025): 569–87. https://doi.org/10.52783/jisem.v10i34s.5852.

Abstract:

Sentiment analysis, a crucial aspect of Natural Language Processing (NLP), plays a pivotal role in understanding public opinion, customer feedback, and user sentiments in various domains. In this study, we present a comprehensive approach to sentiment analysis that incorporates both textual and emoji data, leveraging diverse datasets from sources such as social media, customer reviews, and surveys. Our methodology consists of several key steps, including data collection, pre-processing, feature extraction, feature fusion, and feature selection. For data pre-processing, we apply techniques such

48

Skantze, Gabriel, and Bram Willemsen. "CoLLIE: Continual Learning of Language Grounding from Language-Image Embeddings." Journal of Artificial Intelligence Research 74 (July 9, 2022): 1201–23. http://dx.doi.org/10.1613/jair.1.13689.

Abstract:

This paper presents CoLLIE: a simple, yet effective model for continual learning of how language is grounded in vision. Given a pre-trained multimodal embedding model, where language and images are projected in the same semantic space (in this case CLIP by OpenAI), CoLLIE learns a transformation function that adjusts the language embeddings when needed to accommodate new language use. This is done by predicting the difference vector that needs to be applied, as well as a scaling factor for this vector, so that the adjustment is only applied when needed. Unlike traditional few-shot learning, th

49

Li, Wenxiang, Longyuan Ding, Yuliang Zhang, and Ziyuan Pu. "Understanding multimodal travel patterns based on semantic embeddings of human mobility trajectories." Journal of Transport Geography 124 (April 2025): 104169. https://doi.org/10.1016/j.jtrangeo.2025.104169.

50

Wang, Jenq-Haur, Mehdi Norouzi, and Shu Ming Tsai. "Augmenting Multimodal Content Representation with Transformers for Misinformation Detection." Big Data and Cognitive Computing 8, no. 10 (2024): 134. http://dx.doi.org/10.3390/bdcc8100134.

Abstract:

Information sharing on social media has become a common practice for people around the world. Since it is difficult to check user-generated content on social media, huge amounts of rumors and misinformation are being spread with authentic information. On the one hand, most of the social platforms identify rumors through manual fact-checking, which is very inefficient. On the other hand, with an emerging form of misinformation that contains inconsistent image–text pairs, it would be beneficial if we could compare the meaning of multimodal content within the same post for detecting image–text in
