Follow this link to see other types of publications on the topic: Multimodal embedding space.

Journal articles on the topic "Multimodal embedding space"


Consult the top 50 journal articles for your research on the topic "Multimodal embedding space".


You can also download the full text of each academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Explore journal articles on a wide variety of disciplines and organize your bibliography correctly.

1

Tyshchuk, Kirill, Polina Karpikova, Andrew Spiridonov, Anastasiia Prutianova, Anton Razzhigaev, and Alexander Panchenko. "On Isotropy of Multimodal Embeddings." Information 14, no. 7 (2023): 392. http://dx.doi.org/10.3390/info14070392.

Full text
Abstract
Embeddings, i.e., vector representations of objects, such as texts, images, or graphs, play a key role in deep learning methodologies nowadays. Prior research has shown the importance of analyzing the isotropy of textual embeddings for transformer-based text encoders, such as the BERT model. Anisotropic word embeddings do not use the entire space, instead concentrating on a narrow cone in such a pretrained vector space, negatively affecting the performance of applications, such as textual semantic similarity. Transforming a vector space to optimize isotropy has been shown to be beneficial for improving performance in text processing tasks. This paper is the first comprehensive investigation of the distribution of multimodal embeddings using the example of OpenAI’s CLIP pretrained model. We aimed to deepen the understanding of the embedding space of multimodal embeddings, which has previously been unexplored in this respect, and study the impact on various end tasks. Our initial efforts were focused on measuring the alignment of image and text embedding distributions, with an emphasis on their isotropic properties. In addition, we evaluated several gradient-free approaches to enhance these properties, establishing their efficiency in improving the isotropy/alignment of the embeddings and, in certain cases, the zero-shot classification accuracy. Significantly, our analysis revealed that both CLIP and BERT models yielded embeddings situated within a cone immediately after initialization and preceding training. However, they were mostly isotropic in the local sense. We further extended our investigation to the structure of multilingual CLIP text embeddings, confirming that the observed characteristics were language-independent. By computing the few-shot classification accuracy and point-cloud metrics, we provide evidence of a strong correlation among multilingual embeddings. Transforming embeddings using the methods described in this article makes them easier to visualize. At the same time, multiple experiments that we conducted showed that downstream task performance on the transformed embeddings does not drop substantially (and is sometimes even improved). This means that one could obtain an easily visualizable embedding space without substantially losing quality on downstream tasks.
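
The gradient-free transforms evaluated in this paper can be illustrated with a short sketch. The snippet below is an illustrative approximation (not the authors' code): it scores anisotropy by average pairwise cosine similarity and applies mean-centering plus PCA whitening to random stand-in vectors, for which real CLIP or BERT embeddings could be substituted.

```python
# Minimal sketch: estimating anisotropy of an embedding matrix and applying a
# simple gradient-free transform (mean-centering + PCA whitening).
import numpy as np

def avg_cosine(E):
    """Mean pairwise cosine similarity; values near 0 indicate isotropy."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E @ E.T
    n = len(E)
    return (sims.sum() - n) / (n * (n - 1))   # exclude the diagonal

def whiten(E, eps=1e-8):
    """Mean-center and PCA-whiten the embeddings."""
    E = E - E.mean(axis=0)
    cov = np.cov(E, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    return E @ vecs / np.sqrt(vals + eps)

rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 512)) + 5.0        # constant offset mimics a "narrow cone"
print("before:", avg_cosine(E), "after:", avg_cosine(whiten(E)))
```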
2

Mai, Sijie, Haifeng Hu, and Songlong Xing. "Modality to Modality Translation: An Adversarial Representation Learning and Graph Fusion Network for Multimodal Fusion." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 01 (2020): 164–72. http://dx.doi.org/10.1609/aaai.v34i01.5347.

Full text
Abstract
Learning a joint embedding space for various modalities is of vital importance for multimodal fusion. Mainstream modality fusion approaches fail to achieve this goal, leaving a modality gap which heavily affects cross-modal fusion. In this paper, we propose a novel adversarial encoder-decoder-classifier framework to learn a modality-invariant embedding space. Since the distributions of various modalities vary in nature, to reduce the modality gap, we translate the distributions of source modalities into that of the target modality via their respective encoders using adversarial training. Furthermore, we exert additional constraints on the embedding space by introducing a reconstruction loss and a classification loss. Then we fuse the encoded representations using a hierarchical graph neural network which explicitly explores unimodal, bimodal, and trimodal interactions in multiple stages. Our method achieves state-of-the-art performance on multiple datasets. Visualization of the learned embeddings suggests that the joint embedding space learned by our method is discriminative.
3

Zhang, Linhai, Deyu Zhou, Yulan He, and Zeng Yang. "MERL: Multimodal Event Representation Learning in Heterogeneous Embedding Spaces." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 16 (2021): 14420–27. http://dx.doi.org/10.1609/aaai.v35i16.17695.

Full text
Abstract
Previous work has shown the effectiveness of using event representations for tasks such as script event prediction and stock market prediction. It is however still challenging to learn the subtle semantic differences between events based solely on textual descriptions of events, often represented as (subject, predicate, object) triples. As an alternative, images offer a more intuitive way of understanding event semantics. We observe that events described in text and in images show different abstraction levels and therefore should be projected onto heterogeneous embedding spaces, as opposed to what has been done in previous approaches, which project signals from different modalities onto a homogeneous space. In this paper, we propose a Multimodal Event Representation Learning framework (MERL) to learn event representations based on both text and image modalities simultaneously. Event textual triples are projected as Gaussian density embeddings by a dual-path Gaussian triple encoder, while event images are projected as point embeddings by a visual event component-aware image encoder. Moreover, a novel score function motivated by statistical hypothesis testing is introduced to coordinate the two embedding spaces. Experiments are conducted on various multimodal event-related tasks and the results show that MERL outperforms a number of unimodal and multimodal baselines, demonstrating the effectiveness of the proposed framework.
4

Guo, Zhiqiang, Jianjun Li, Guohui Li, Chaoyang Wang, Si Shi, and Bin Ruan. "LGMRec: Local and Global Graph Learning for Multimodal Recommendation." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 8 (2024): 8454–62. http://dx.doi.org/10.1609/aaai.v38i8.28688.

Full text
Abstract
Multimodal recommendation has gradually become the infrastructure of online media platforms, enabling them to provide personalized service to users through joint modeling of user historical behaviors (e.g., purchases, clicks) and various item modalities (e.g., visual and textual). The majority of existing studies typically focus on utilizing modal features or modal-related graph structure to learn user local interests. Nevertheless, these approaches encounter two limitations: (1) shared updates of user ID embeddings result in the consequential coupling between collaboration and multimodal signals; (2) a lack of exploration into robust global user interests to alleviate the sparse interaction problems faced by local interest modeling. To address these issues, we propose a novel Local and Global Graph Learning-guided Multimodal Recommender (LGMRec), which jointly models local and global user interests. Specifically, we present a local graph embedding module to independently learn collaborative-related and modality-related embeddings of users and items with local topological relations. Moreover, a global hypergraph embedding module is designed to capture global user and item embeddings by modeling insightful global dependency relations. The global embeddings acquired within the hypergraph embedding space can then be combined with two decoupled local embeddings to improve the accuracy and robustness of recommendations. Extensive experiments conducted on three benchmark datasets demonstrate the superiority of our LGMRec over various state-of-the-art recommendation baselines, showcasing its effectiveness in modeling both local and global user interests.
5

Moon, Jucheol, Nhat Anh Le, Nelson Hebert Minaya, and Sang-Il Choi. "Multimodal Few-Shot Learning for Gait Recognition." Applied Sciences 10, no. 21 (2020): 7619. http://dx.doi.org/10.3390/app10217619.

Full text
Abstract
A person’s gait is a behavioral trait that is uniquely associated with each individual and can be used to recognize the person. As information about the human gait can be captured by wearable devices, a few studies have led to the proposal of methods to process gait information for identification purposes. Despite recent advances in gait recognition, the open set gait recognition problem presents challenges to current approaches. To address the open set gait recognition problem, a system should be able to deal with unseen subjects who have not been included in the training dataset. In this paper, we propose a system that learns a mapping from a multimodal time series collected using an insole to a latent (embedding vector) space to address the open set gait recognition problem. The distance between two embedding vectors in the latent space corresponds to the similarity between two multimodal time series. Using the characteristics of the human gait pattern, multimodal time series are sliced into unit steps. The system maps unit steps to embedding vectors using an ensemble consisting of a convolutional neural network and a recurrent neural network. To recognize each individual, the system learns a decision function using a one-class support vector machine from a few embedding vectors of the person in the latent space; the system then determines whether an unknown unit step is recognized as belonging to a known individual. Our experiments demonstrate that the proposed framework recognizes individuals with high accuracy regardless of whether they have been registered or not. In an environment in which all people wear the insole, the framework could be widely used for user verification.
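
As a rough illustration of the enrollment-and-rejection step described above, the sketch below fits scikit-learn's OneClassSVM on a handful of placeholder embedding vectors; the CNN/RNN ensemble that produces the unit-step embeddings is assumed and not reproduced here.

```python
# Minimal sketch, not the paper's implementation: enroll a known subject from a
# few unit-step embeddings with a one-class SVM, then decide whether new
# embeddings belong to that subject. Random vectors stand in for real embeddings.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
enrolled = rng.normal(loc=0.0, size=(10, 64))      # few embeddings of the known person
probe_known = rng.normal(loc=0.0, size=(5, 64))    # more steps from the same person
probe_unknown = rng.normal(loc=3.0, size=(5, 64))  # steps from an unseen subject

clf = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(enrolled)
print(clf.predict(probe_known))    # +1 = recognized as the enrolled person
print(clf.predict(probe_unknown))  # -1 = rejected as unknown
```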
6

Zhang, Rongchao, Yiwei Lou, Dexuan Xu, Yongzhi Cao, Hanpin Wang, and Yu Huang. "A Learnable Discrete-Prior Fusion Autoencoder with Contrastive Learning for Tabular Data Synthesis." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 15 (2024): 16803–11. http://dx.doi.org/10.1609/aaai.v38i15.29621.

Full text
Abstract
The actual collection of tabular data for sharing involves confidentiality and privacy constraints, leaving the potential risks of machine learning for interventional data analysis unsafely averted. Synthetic data has emerged recently as a privacy-protecting solution to address this challenge. However, existing approaches regard discrete and continuous modal features as separate entities, thus falling short in properly capturing their inherent correlations. In this paper, we propose a novel contrastive learning guided Gaussian Transformer autoencoder, termed GTCoder, to synthesize photo-realistic multimodal tabular data for scientific research. Our approach introduces a transformer-based fusion module that seamlessly integrates multimodal features, permitting the mining of more informative latent representations. The attention within the fusion module directs the integrated output features to focus on critical components that facilitate the task of generating latent embeddings. Moreover, we formulate a contrastive learning strategy to implicitly constrain the embeddings from discrete features in the latent feature space by pulling similar discrete feature distributions closer while pushing dissimilar ones further away, in order to better enhance the representation of the latent embedding. Experimental results indicate that GTCoder is effective in generating photo-realistic synthetic data, with interactive interpretation of the latent embedding, and performs favorably against some baselines on most real-world and simulated datasets.
7

Merkx, Danny, and Stefan L. Frank. "Learning semantic sentence representations from visually grounded language without lexical knowledge." Natural Language Engineering 25, no. 4 (2019): 451–66. http://dx.doi.org/10.1017/s1351324919000196.

Full text
Abstract
Current approaches to learning semantic representations of sentences often use prior word-level knowledge. The current study aims to leverage visual information in order to capture sentence-level semantics without the need for word embeddings. We use a multimodal sentence encoder trained on a corpus of images with matching text captions to produce visually grounded sentence embeddings. Deep neural networks are trained to map the two modalities to a common embedding space such that for an image the corresponding caption can be retrieved and vice versa. We show that our model achieves results comparable to the current state of the art on two popular image-caption retrieval benchmark datasets: Microsoft Common Objects in Context (MSCOCO) and Flickr8k. We evaluate the semantic content of the resulting sentence embeddings using the data from the Semantic Textual Similarity (STS) benchmark task and show that the multimodal embeddings correlate well with human semantic similarity judgements. The system achieves state-of-the-art results on several of these benchmarks, which shows that a system trained solely on multimodal data, without assuming any word representations, is able to capture sentence-level semantics. Importantly, this result shows that we do not need prior knowledge of lexical-level semantics in order to model sentence-level semantics. These findings demonstrate the importance of visual information in semantics.
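
A common way to train such a joint image-caption space is a bidirectional hinge ranking loss; the PyTorch sketch below is a generic version of that objective (assumed details, not the authors' released code).

```python
# Minimal sketch: a bidirectional hinge ranking loss that pulls matching
# image/caption embeddings together in a shared space and pushes mismatched
# in-batch pairs at least a margin apart.
import torch
import torch.nn.functional as F

def ranking_loss(img_emb, cap_emb, margin=0.2):
    img = F.normalize(img_emb, dim=1)
    cap = F.normalize(cap_emb, dim=1)
    scores = img @ cap.t()                                 # cosine similarities, (B, B)
    diag = scores.diag().view(-1, 1)
    cost_cap = (margin + scores - diag).clamp(min=0)       # caption-retrieval direction
    cost_img = (margin + scores - diag.t()).clamp(min=0)   # image-retrieval direction
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    return (cost_cap.masked_fill(mask, 0).sum()
            + cost_img.masked_fill(mask, 0).sum())

loss = ranking_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```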
8

Gaikwad, Yogesh J. "Stress Detection using Multimodal Representation Learning, Fusion Techniques, and Applications." Journal of Information Systems Engineering and Management 10, no. 16s (2025): 245–70. https://doi.org/10.52783/jisem.v10i16s.2593.

Full text
Abstract
The fields of speech recognition, image identification, and natural language processing have undergone a paradigm shift with the advent of machine learning and deep learning approaches. Although these tasks rely primarily on a single modality for input signals, the artificial intelligence field has various applications that necessitate the use of several modalities. In recent years, academics have placed a growing emphasis on the intricate topic of modelling and learning across various modalities. This has attracted the interest of the scientific community. This technical article provides a comprehensive analysis of the models and learning methods available for multimodal intelligence. Specifically, this work concentrates on the fusion of video and language processing modalities, which has become a crucial area in both computer vision and natural language research. In this article, we explore recent research on multimodal deep learning from three different perspectives: learning multimodal representations, combining multimodal inputs at different levels, and multimodal applications. Regarding the learning of multimodal representations, the article delves into the concept of embedding, which involves the combination of different types of signals into a unified vector space. This enables cross-modal signal processing, which has significant implications for various applications. Moreover, several forms of embedding created and trained for common downstream tasks are examined. Regarding multimodal fusion, the research focuses on specific designs that merge representations of unimodal inputs for a specific purpose.
9

Zhang, Kaifan, Lihuo He, Xin Jiang, Wen Lu, Di Wang, and Xinbo Gao. "CognitionCapturer: Decoding Visual Stimuli from Human EEG Signal with Multimodal Information." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 13 (2025): 14486–93. https://doi.org/10.1609/aaai.v39i13.33587.

Full text
Abstract
Electroencephalogram (EEG) signals have attracted significant attention from researchers due to their non-invasive nature and high temporal sensitivity in decoding visual stimuli. However, most recent studies have focused solely on the relationship between EEG and image data pairs, neglecting the valuable "beyond-image-modality" information embedded in EEG signals. This results in the loss of critical multimodal information in EEG. To address this limitation, this paper proposes a unified framework that fully leverages multimodal data to represent EEG signals, named CognitionCapturer. Specifically, CognitionCapturer trains modality expert encoders for each modality to extract cross-modal information from the EEG modality. Then, it introduces a diffusion prior to map the EEG embedding space to the CLIP embedding space; combined with a pretrained generative model, the proposed framework can reconstruct visual stimuli with high semantic and structural fidelity. Notably, the framework does not require any fine-tuning of the generative models and can be extended to incorporate more modalities. Through extensive experiments, we demonstrate that CognitionCapturer outperforms state-of-the-art methods both qualitatively and quantitatively.
10

Subbotin, Sergey A., and Fedir A. Shmalko. "Partitioning the data space before applying hashing using clustering algorithms." Herald of Advanced Information Technology 8, no. 1 (2025): 28–42. https://doi.org/10.15276/hait.8.2025.2.

Full text
Abstract
This research presents a locality-sensitive hashing framework that enhances approximate nearest neighbor search efficiency by integrating adaptive encoding trees and BERT-based clusterization. The proposed method optimizes data space partitioning before applying hashing, improving retrieval accuracy while reducing computational complexity. First, multimodal data, such as images and textual descriptions, are transformed into a unified semantic space using pre-trained bidirectional encoder representations from transformers embeddings. This ensures cross-modal consistency and facilitates high-dimensional similarity comparisons. Second, dimensionality reduction techniques like Uniform Manifold Approximation and Projection or t-distributed stochastic neighbor embedding are applied to mitigate the curse of dimensionality while preserving key relationships between data points. Third, an adaptive locality-sensitive hashing encoding tree is constructed, dynamically segmenting the data space based on statistical distribution, thereby enabling efficient hierarchical clustering. Each data point is converted into a symbolic representation, allowing fast retrieval using structured hashing. Fourth, locality-sensitive hashing is applied to the encoded dataset, leveraging p-stable distributions to maintain high search precision while reducing index size. The combination of encoding trees and locality-sensitive hashing enables efficient candidate selection while minimizing search overhead. Experimental evaluations on the CarDD dataset, which includes car damage images and annotations, demonstrate that the proposed method outperforms state-of-the-art approximate nearest neighbor techniques in both indexing efficiency and retrieval accuracy. The results highlight its adaptability to large-scale, high-dimensional, and multimodal datasets, making it suitable for diagnostic models and real-time retrieval tasks.
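
The p-stable hashing step in the fourth stage can be sketched in a few lines. The snippet below is a minimal illustration of Gaussian (2-stable) LSH with assumed parameters; it omits the BERT embedding, dimensionality reduction, and encoding-tree stages described above.

```python
# Minimal sketch of p-stable (Gaussian) locality-sensitive hashing: nearby
# vectors tend to fall into the same bucket tuple.
import numpy as np

class PStableLSH:
    def __init__(self, dim, n_hashes=8, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        self.a = rng.normal(size=(n_hashes, dim))  # Gaussian = 2-stable projections
        self.b = rng.uniform(0, w, size=n_hashes)
        self.w = w

    def hash(self, x):
        # Each component: floor((a·x + b) / w)
        return tuple(np.floor((self.a @ x + self.b) / self.w).astype(int))

lsh = PStableLSH(dim=128)
rng = np.random.default_rng(1)
x = rng.normal(size=128)
print(lsh.hash(x))
print(lsh.hash(x + 1e-3 * rng.normal(size=128)))  # near-duplicate, likely same bucket
```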
11

Fan, Yunpeng, Wenyou Du, Yingwei Zhang, and Xiaogang Wang. "Fault Detection for Multimodal Process Using Quality-Relevant Kernel Neighborhood Preserving Embedding." Mathematical Problems in Engineering 2015 (2015): 1–15. http://dx.doi.org/10.1155/2015/210125.

Full text
Abstract
A new method named quality-relevant kernel neighborhood preserving embedding (QKNPE) has been proposed. Quality variables have been considered for the first time in the kernel neighborhood preserving embedding (KNPE) method for monitoring multimodal processes. In summary, the whole algorithm is a two-step process: first, to improve the manifold structure and to deal with the multimodal nonlinearity problem, the neighborhood preserving embedding technique is introduced; and second, to monitor the complete production process, the product quality variables are added to the objective function. Compared with conventional monitoring methods, the proposed method has the following advantages: (1) the hidden manifold related to the character of the industrial process has been embedded into a low-dimensional space, and the identifying information of the different modes of the monitored system has been extracted; (2) product quality, as an important factor, has been considered for the first time in a manifold method. In the experiment section, we applied this method to the electrofused magnesia furnace (EFMF) process, which is a representative case study. The experimental results show the effectiveness of the proposed method.
12

Zhang, Jianqiang, Renyao Chen, Shengwen Li, Tailong Li, and Hong Yao. "MGKGR: Multimodal Semantic Fusion for Geographic Knowledge Graph Representation." Algorithms 17, no. 12 (2024): 593. https://doi.org/10.3390/a17120593.

Full text
Abstract
Geographic knowledge graph representation learning embeds entities and relationships in geographic knowledge graphs into a low-dimensional continuous vector space, which serves as a basic method that bridges geographic knowledge graphs and geographic applications. Previous geographic knowledge graph representation methods primarily learn the vectors of entities and their relationships from their spatial attributes and relationships, which ignores various semantics of entities, resulting in poor embeddings on geographic knowledge graphs. This study proposes a two-stage multimodal geographic knowledge graph representation (MGKGR) model that integrates multiple kinds of semantics to improve the embedding learning of geographic knowledge graph representation. Specifically, in the first stage, a spatial feature fusion method for modality enhancement is proposed to combine the structural features of geographic knowledge graphs with two modal semantic features. In the second stage, a multi-level modality feature fusion method is proposed to integrate heterogeneous features from different modalities. By fusing the semantics of text and images, the performance of geographic knowledge graph representation is improved, providing accurate representations for downstream geographic intelligence tasks. Extensive experiments on two datasets show that the proposed MGKGR model outperforms the baselines. Moreover, the results demonstrate that integrating textual and image data into geographic knowledge graphs can effectively enhance the model’s performance.
13

Zhang, Sensen, Xun Liang, Simin Niu, et al. "Integrating Large Language Models and Möbius Group Transformations for Temporal Knowledge Graph Embedding on the Riemann Sphere." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 12 (2025): 13277–85. https://doi.org/10.1609/aaai.v39i12.33449.

Full text
Abstract
The significance of Temporal Knowledge Graphs (TKGs) in Artificial Intelligence (AI) lies in their capacity to incorporate time-dimensional information, support complex reasoning and prediction, optimize decision-making processes, enhance the accuracy of recommendation systems, promote multimodal data integration, and strengthen knowledge management and updates. This provides a robust foundation for various AI applications. To effectively learn and apply both static and dynamic temporal patterns for reasoning, a range of embedding methods and large language models (LLMs) have been proposed in the literature. However, these methods often rely on a single underlying embedding space, whose geometric properties severely limit their ability to model intricate temporal patterns, such as hierarchical and ring structures. To address this limitation, this paper proposes embedding TKGs into projective geometric space and leverages LLMs technology to extract crucial temporal node information, thereby constructing the 5EL model. By embedding TKGs into projective geometric space and utilizing Möbius Group transformations, we effectively model various temporal patterns. Subsequently, LLMs technology is employed to process the trained TKGs. We adopt a parameter-efficient fine-tuning strategy to align LLMs with specific task requirements, thereby enhancing the model's ability to recognize structural information of key nodes in historical chains and enriching the representation of central entities. Experimental results on five advanced TKG datasets demonstrate that our proposed 5EL model significantly outperforms existing models.
14

Ota, Kosuke, Keiichiro Shirai, Hidetoshi Miyao, and Minoru Maruyama. "Multimodal Analogy-Based Image Retrieval by Improving Semantic Embeddings." Journal of Advanced Computational Intelligence and Intelligent Informatics 26, no. 6 (2022): 995–1003. http://dx.doi.org/10.20965/jaciii.2022.p0995.

Full text
Abstract
In this work, we study the application of multimodal analogical reasoning to image retrieval. Multimodal analogy questions are given in the form of tuples of words and images, e.g., “cat”:“dog”::[an image of a cat sitting on a bench]:?, to search for an image of a dog sitting on a bench. Retrieving desired images given these tuples can be seen as a task of finding images whose relation to the query image is close to that of the query words. One way to achieve the task is building a common vector space that exhibits analogical regularities. To learn such an embedding, we propose a quadruple neural network called the multimodal siamese network. The network consists of recurrent neural networks and convolutional neural networks based on the siamese architecture. We also introduce an effective procedure to generate analogy examples from an image-caption dataset for training our network. In our experiments, we test our model on analogy-based image retrieval tasks. The results show that our method outperforms the previous work in qualitative evaluation.
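
Analogy-based retrieval in a common multimodal space is often realized with vector offsets. The sketch below illustrates that idea with placeholder vectors and cosine ranking; it is not the paper's multimodal siamese network.

```python
# Minimal sketch: analogy retrieval "a : b :: query_image : ?" by applying the
# word offset (b - a) to the query image embedding and ranking candidates by
# cosine similarity in an assumed common multimodal space.
import numpy as np

def retrieve(word_a, word_b, query_img, candidate_imgs):
    target = query_img + (word_b - word_a)             # analogical offset
    cand = candidate_imgs / np.linalg.norm(candidate_imgs, axis=1, keepdims=True)
    target = target / np.linalg.norm(target)
    return np.argsort(-cand @ target)                  # best match first

rng = np.random.default_rng(0)
ranking = retrieve(rng.normal(size=64), rng.normal(size=64),
                   rng.normal(size=64), rng.normal(size=(100, 64)))
print(ranking[:5])
```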
15

Kim, Jongseok, Youngjae Yu, Hoeseong Kim, and Gunhee Kim. "Dual Compositional Learning in Interactive Image Retrieval." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 2 (2021): 1771–79. http://dx.doi.org/10.1609/aaai.v35i2.16271.

Full text
Abstract
We present an approach named Dual Composition Network (DCNet) for interactive image retrieval that searches for the best target image for a natural language query and a reference image. To accomplish this task, existing methods have focused on learning a composite representation of the reference image and the text query that is as close to the embedding of the target image as possible. We refer to this approach as the Composition Network. In this work, we propose to close the loop with a Correction Network that models the difference between the reference and target image in the embedding space and matches it with the embedding of the text query. That is, we consider two cyclic directional mappings for triplets of (reference image, text query, target image) by using both the Composition Network and the Correction Network. We also propose a joint training loss that can further improve the robustness of multimodal representation learning. We evaluate the proposed model on three benchmark datasets for multimodal retrieval: Fashion-IQ, Shoes, and Fashion200K. Our experiments show that our DCNet achieves new state-of-the-art performance on all three datasets, and the addition of the Correction Network consistently improves multiple existing methods that are solely based on the Composition Network. Moreover, an ensemble of our model won first place in the Fashion-IQ 2020 challenge held in a CVPR 2020 workshop.
16

Chen, Yatong, Chenzhi Hu, Tomoyoshi Kimura, et al. "SemiCMT: Contrastive Cross-Modal Knowledge Transfer for IoT Sensing with Semi-Paired Multi-Modal Signals." Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, no. 4 (2024): 1–30. http://dx.doi.org/10.1145/3699779.

Full text
Abstract
This paper proposes a novel contrastive cross-modal knowledge transfer framework, SemiCMT, for multi-modal IoT sensing applications. It effectively transfers the feature extraction capability (also called knowledge) learned from a source modality (e.g., acoustic signals) with abundant unlabeled training data, to a target modality (e.g., seismic signals) that lacks enough training data, in a self-supervised manner with the help of only a small set of synchronized multi-modal pairs. The transferred model can be quickly finetuned to downstream target-modal tasks with only limited labels. The key design consists of three aspects: First, we factorize the latent embedding of each modality into shared and private components and perform knowledge transfer considering both the modality information commonality and gaps. Second, we enforce structural correlation constraints between the source modality and the target modality, to push the target modal embedding space symmetric to the source modal embedding space, with the anchoring of additional source-modal samples, which expands the existing modal-matching objective in current multi-modal contrastive frameworks. Finally, we conduct downstream task finetuning in the spherical space with a KNN classifier to better align with the structured modality embedding space. Extensive evaluations on five multimodal IoT datasets are performed to validate the effectiveness of SemiCMT in cross-modal knowledge transfer, including a new self-collected dataset using seismic and acoustic signals for office activity monitoring. SemiCMT consistently outperforms existing self-supervised and knowledge transfer approaches by up to 36.47% in the finetuned target-modal classification tasks. The code and the self-collected dataset will be released at https://github.com/SJTU-RTEAS/SemiCMT.
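
The final fine-tuning step, a KNN classifier over spherical (L2-normalized) embeddings, can be sketched as follows; the transferred encoder is assumed and replaced by random feature vectors.

```python
# Minimal sketch: KNN classification with cosine distance on L2-normalized
# (spherical) target-modal embeddings and a few labels.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
train_emb = normalize(rng.normal(size=(200, 128)))   # placeholder target-modal embeddings
train_y = rng.integers(0, 4, size=200)               # limited labels
test_emb = normalize(rng.normal(size=(20, 128)))

knn = KNeighborsClassifier(n_neighbors=5, metric="cosine").fit(train_emb, train_y)
print(knn.predict(test_emb))
```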
17

He, Yuxuan, Kunda Wang, Qicheng Song, Huixin Li, and Bozhi Zhang. "Specific Emitter Identification Algorithm Based on Time–Frequency Sequence Multimodal Feature Fusion Network." Electronics 13, no. 18 (2024): 3703. http://dx.doi.org/10.3390/electronics13183703.

Full text
Abstract
Specific emitter identification is a challenge in the field of radar signal processing. It aims to extract individual fingerprint features of the signal. However, early works are all designed using either the signal or the time–frequency image and heavily rely on the calculation of hand-crafted features or complex interactions in high-dimensional feature space. This paper introduces the time–frequency multimodal feature fusion network, a novel architecture based on multimodal feature interaction. Specifically, we designed a time–frequency signal feature encoding module, a WVD image feature encoding module, and a multimodal feature fusion module. Additionally, we propose a feature point filtering mechanism named FMM for signal embedding. Our algorithm demonstrates high performance in comparison with the state-of-the-art mainstream identification methods. The results indicate that our algorithm outperforms others, achieving the highest accuracy, precision, recall, and F1-score, surpassing the second-best by 9.3%, 8.2%, 9.2%, and 9%, respectively. Notably, the visual results show that the proposed method aligns with the signal generation mechanism, effectively capturing the distinctive fingerprint features of radar data. This paper establishes a foundational architecture for subsequent multimodal research in SEI tasks.
18

Abiyev, Rahib H., Mohamad Ziad Altabel, Manal Darwish, and Abdulkader Helwan. "A Multimodal Transformer Model for Recognition of Images from Complex Laparoscopic Surgical Videos." Diagnostics 14, no. 7 (2024): 681. http://dx.doi.org/10.3390/diagnostics14070681.

Full text
Abstract
The determination of the potential role and advantages of artificial intelligence-based models in the field of surgery remains uncertain. This research marks an initial stride towards creating a multimodal model, inspired by the Video-Audio-Text Transformer, that aims to reduce negative occurrences and enhance patient safety. The model employs text and image embedding state-of-the-art models (ViT and BERT) to assess their efficacy in extracting the hidden and distinct features from the surgery video frames. These features are then used as inputs for convolution-free Transformer architectures to extract comprehensive multidimensional representations. A joint space is then used to combine the text and image features extracted from both Transformer encoders. This joint space ensures that the relationships between the different modalities are preserved during the combination process. The entire model was trained and tested on laparoscopic cholecystectomy (LC) videos encompassing various levels of complexity. Experimentally, a mean accuracy of 91.0%, a precision of 81%, and a recall of 83% were reached by the model when tested on 30 videos out of 80 from the Cholec80 dataset.
19

Tripathi, Aakash Gireesh, Asim Waqas, Yasin Yilmaz, Matthew B. Schabath, and Ghulam Rasool. "Abstract 3641: Predicting treatment outcomes using cross-modality correlations in multimodal oncology data." Cancer Research 85, no. 8_Supplement_1 (2025): 3641. https://doi.org/10.1158/1538-7445.am2025-3641.

Full text
Abstract
Accurate prediction of treatment outcomes in oncology requires modeling the intricate relationships across diverse data modalities. This study investigates cross-modality correlations by leveraging imaging and clinical data curated through the Multimodal Integration of Oncology Data System (MINDS) and HoneyBee frameworks to uncover actionable patterns for personalized treatment strategies. Using data from over 10,000 cancer patients, we developed a machine learning pipeline that employs advanced embedding techniques to capture associations between radiological imaging phenotypes and histopathological features, aiming to predict survival outcomes and treatment efficacy. The HoneyBee framework played a key role in preprocessing and feature extraction, using foundation models for imaging and text data. Radiological imaging, such as computed tomography (CT) and magnetic resonance imaging (MRI), was transformed into high-dimensional embeddings with vision transformers and convolutional neural networks. Clinical data, including electronic health records and pathology reports, was structured into embeddings using biomedical language models. These embeddings ensured standardization and cross-modal alignment while capturing modality-specific features. HoneyBee also applied data augmentation and preprocessing, such as stain normalization and tokenization, to maintain data quality. MINDS provided a scalable, metadata-driven architecture for integrating datasets like patient demographics and treatment histories, supporting seamless ingestion, harmonization, and storage. Together, these frameworks created a unified, multimodal dataset optimized for predictive modeling and resilient to missing data. In training, embeddings were combined through late-stage fusion techniques to generate unified patient-level representations in latent space. These representations were used to train machine learning models, including gradient-boosted trees and neural networks, for survival analysis and treatment response prediction. Testing was conducted using stratified cross-validation to ensure generalizability across cancer subtypes and demographic groups. Results demonstrated significant predictive correlations, such as imaging-derived phenotypes linked to histopathological indicators of treatment response. The machine learning pipeline achieved a 12% improvement in concordance indices over unimodal approaches, underscoring the enhanced accuracy of cross-modality learning. Citation Format: Aakash Gireesh Tripathi, Asim Waqas, Yasin Yilmaz, Matthew B. Schabath, Ghulam Rasool. Predicting treatment outcomes using cross-modality correlations in multimodal oncology data [abstract]. In: Proceedings of the American Association for Cancer Research Annual Meeting 2025; Part 1 (Regular Abstracts); 2025 Apr 25-30; Chicago, IL. Philadelphia (PA): AACR; Cancer Res 2025;85(8_Suppl_1):Abstract nr 3641.
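
Late-stage fusion followed by gradient-boosted models, as described above, can be illustrated roughly as below; the embeddings, labels, and dimensions are invented stand-ins rather than MINDS/HoneyBee outputs.

```python
# Minimal sketch: late fusion by concatenating per-modality embeddings into one
# patient-level vector, then training a gradient-boosted classifier.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300
imaging_emb = rng.normal(size=(n, 64))      # e.g., CT/MRI embeddings
text_emb = rng.normal(size=(n, 32))         # e.g., pathology-report embeddings
y = rng.integers(0, 2, size=n)              # responder / non-responder labels

fused = np.concatenate([imaging_emb, text_emb], axis=1)   # late-stage fusion
clf = GradientBoostingClassifier()
print(cross_val_score(clf, fused, y, cv=5).mean())
```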
20

Skantze, Gabriel, and Bram Willemsen. "CoLLIE: Continual Learning of Language Grounding from Language-Image Embeddings." Journal of Artificial Intelligence Research 74 (July 9, 2022): 1201–23. http://dx.doi.org/10.1613/jair.1.13689.

Full text
Abstract
This paper presents CoLLIE: a simple, yet effective model for continual learning of how language is grounded in vision. Given a pre-trained multimodal embedding model, where language and images are projected in the same semantic space (in this case CLIP by OpenAI), CoLLIE learns a transformation function that adjusts the language embeddings when needed to accommodate new language use. This is done by predicting the difference vector that needs to be applied, as well as a scaling factor for this vector, so that the adjustment is only applied when needed. Unlike traditional few-shot learning, the model does not just learn new classes and labels, but can also generalize to similar language use and leverage semantic compositionality. We verify the model’s performance on two different tasks of identifying the targets of referring expressions, where it has to learn new language use. The results show that the model can efficiently learn and generalize from only a few examples, with little interference with the model’s original zero-shot performance.
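
One plausible reading of the described adjustment (a predicted difference vector gated by a scaling factor) is sketched below in PyTorch; the module name and layer sizes are assumptions, not the released CoLLIE implementation.

```python
# Minimal sketch: a module that predicts a difference vector and a scaling
# factor for a frozen text embedding, so the embedding is only shifted when
# the gate deems an adjustment necessary.
import torch
import torch.nn as nn

class EmbeddingAdjuster(nn.Module):
    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.delta = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
        self.scale = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1),
                                   nn.Sigmoid())   # 0 = keep original, 1 = full shift

    def forward(self, text_emb):
        return text_emb + self.scale(text_emb) * self.delta(text_emb)

adjusted = EmbeddingAdjuster()(torch.randn(4, 512))
print(adjusted.shape)   # torch.Size([4, 512])
```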
21

Zhang, Linhao, Li Jin, Xian Sun, et al. "TOT: Topology-Aware Optimal Transport for Multimodal Hate Detection." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 4 (2023): 4884–92. http://dx.doi.org/10.1609/aaai.v37i4.25614.

Full text
Abstract
Multimodal hate detection, which aims to identify harmful content online such as memes, is crucial for building a wholesome internet environment. Previous work has made enlightening exploration in detecting explicit hate remarks. However, most of their approaches neglect the analysis of implicit harm, which is particularly challenging as explicit text markers and demographic visual cues are often twisted or missing. The leveraged cross-modal attention mechanisms also suffer from the distributional modality gap and lack logical interpretability. To address these semantic gap issues, we propose TOT: a topology-aware optimal transport framework to decipher the implicit harm in the memes scenario, which formulates the cross-modal aligning problem as solutions for optimal transportation plans. Specifically, we leverage an optimal transport kernel method to capture complementary information from multiple modalities. The kernel embedding provides a non-linear transformation into a reproducing kernel Hilbert space (RKHS), which is significant for eliminating the distributional modality gap. Moreover, we perceive the topology information based on aligned representations to conduct bipartite graph path reasoning. The newly achieved state-of-the-art performance on two publicly available benchmark datasets, together with further visual analysis, demonstrates the superiority of TOT in capturing implicit cross-modal alignment.
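
Optimal transport plans of the kind this framework builds on are commonly computed with entropy-regularized Sinkhorn iterations; the snippet below is a generic illustration with toy cost matrices and marginals, not the TOT kernel method itself.

```python
# Minimal sketch of entropy-regularized optimal transport (Sinkhorn iterations)
# between two sets of features from different modalities.
import numpy as np

def sinkhorn(cost, r, c, reg=0.1, n_iter=200):
    K = np.exp(-cost / reg)                 # Gibbs kernel
    u = np.ones_like(r)
    for _ in range(n_iter):
        v = c / (K.T @ u)
        u = r / (K @ v)
    return u[:, None] * K * v[None, :]      # transport plan

rng = np.random.default_rng(0)
text_feats, image_feats = rng.normal(size=(5, 16)), rng.normal(size=(7, 16))
cost = np.linalg.norm(text_feats[:, None, :] - image_feats[None, :, :], axis=-1)
plan = sinkhorn(cost, np.full(5, 1 / 5), np.full(7, 1 / 7))
print(plan.sum(axis=1), plan.sum(axis=0))   # ≈ the prescribed marginals
```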
22

Zhang, Yachao, Runze Hu, Ronghui Li, Yanyun Qu, Yuan Xie, and Xiu Li. "Cross-Modal Match for Language Conditioned 3D Object Grounding." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 7 (2024): 7359–67. http://dx.doi.org/10.1609/aaai.v38i7.28566.

Full text
Abstract
Language conditioned 3D object grounding aims to find the object within the 3D scene mentioned by natural language descriptions, which mainly depends on the matching between vision and natural language. Considerable improvement in grounding performance is achieved by improving the multimodal fusion mechanism or bridging the gap between detection and matching. However, several mismatches are ignored, i.e., the mismatch between local visual representations and global sentence representations, and the mismatch between the visual space and the corresponding label word space. In this paper, we propose crossmodal match for 3D grounding from the perspective of mitigating these mismatches. Specifically, to match local visual features with the global description sentence, we propose a BEV (Bird’s-eye-view) based global information embedding module. It projects multiple object proposal features into the BEV, and the relations of different objects are accessed by a visual transformer which can model both positions and features with long-range dependencies. To circumvent the mismatch in feature spaces of different modalities, we propose crossmodal consistency learning. It performs cross-modal consistency constraints to convert the visual feature space into the label word feature space, resulting in easier matching. Besides, we introduce a label distillation loss and a global distillation loss to drive the learning of these matches in a distillation manner. We evaluate our method in mainstream evaluation settings on three datasets, and the results demonstrate the effectiveness of the proposed method.
23

Liang, Meiyu, Junping Du, Zhengyang Liang, Yongwang Xing, Wei Huang, and Zhe Xue. "Self-Supervised Multi-Modal Knowledge Graph Contrastive Hashing for Cross-Modal Search." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 12 (2024): 13744–53. http://dx.doi.org/10.1609/aaai.v38i12.29280.

Full text
Abstract
Deep cross-modal hashing technology provides an effective and efficient cross-modal unified representation learning solution for cross-modal search. However, existing methods neglect the implicit fine-grained multimodal knowledge relations between these modalities, such as when the image contains information that is not directly described in the text. To tackle this problem, we propose a novel self-supervised multi-grained multi-modal knowledge graph contrastive hashing method for cross-modal search (CMGCH). Firstly, in order to capture implicit fine-grained cross-modal semantic associations, a multi-modal knowledge graph is constructed, which represents the implicit multimodal knowledge relations between the image and text as inter-modal and intra-modal semantic associations. Secondly, a cross-modal graph contrastive attention network is proposed to reason on the multi-modal knowledge graph to sufficiently learn the implicit fine-grained inter-modal and intra-modal knowledge relations. Thirdly, a cross-modal multi-granularity contrastive embedding learning mechanism is proposed, which fuses the global coarse-grained and local fine-grained embeddings by a multihead attention mechanism for inter-modal and intra-modal contrastive learning, so as to enhance the cross-modal unified representations with stronger discriminativeness and semantic consistency preserving power. With the joint training of intra-modal and inter-modal contrast, the invariant and modal-specific information of different modalities can be maintained in the final unified cross-modal hash space. Extensive experiments on several cross-modal benchmark datasets demonstrate that the proposed CMGCH outperforms the state-of-the-art methods.
24

Chen, Meng, Kai Zhang, Zhenying He, Yinan Jing, and X. Sean Wang. "RoarGraph: A Projected Bipartite Graph for Efficient Cross-Modal Approximate Nearest Neighbor Search." Proceedings of the VLDB Endowment 17, no. 11 (2024): 2735–49. http://dx.doi.org/10.14778/3681954.3681959.

Full text
Abstract
Approximate Nearest Neighbor Search (ANNS) is a fundamental and critical component in many applications, including recommendation systems and large language model-based applications. With the advancement of multimodal neural models, which transform data from different modalities into a shared high-dimensional space as feature vectors, cross-modal ANNS aims to use the data vector from one modality (e.g., texts) as the query to retrieve the most similar items from another (e.g., images or videos). However, there is an inherent distribution gap between embeddings from different modalities, and cross-modal queries become Out-of-Distribution (OOD) with respect to the base data. Consequently, state-of-the-art ANNS approaches suffer poor performance for OOD workloads. In this paper, we quantitatively analyze the properties of the OOD workloads to gain an understanding of their ANNS efficiency. Unlike single-modal workloads, we reveal that OOD queries spatially deviate from the base data, and the k-nearest neighbors of an OOD query are distant from each other in the embedding space. This property breaks the assumptions of existing ANNS approaches and mismatches their design for efficient search. With the insights from the OOD workloads, we propose the pRojected bipartite Graph (RoarGraph), an efficient ANNS graph index that is built under the guidance of the query distribution. Extensive experiments show that RoarGraph significantly outperforms state-of-the-art approaches on modern cross-modal datasets, achieving up to 3.56× faster search speed at a 90% recall rate for OOD queries.
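
The OOD property reported above (queries far from the base data) can be probed with a simple nearest-neighbor diagnostic; the sketch below uses synthetic vectors and is not the RoarGraph index.

```python
# Minimal sketch: compare nearest-neighbor distances of in-distribution vs.
# cross-modal (OOD) queries against a base set.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
base = rng.normal(size=(5000, 64))            # e.g., image embeddings
in_dist = rng.normal(size=(100, 64))          # queries from the same modality
ood = rng.normal(loc=1.5, size=(100, 64))     # e.g., shifted text-embedding queries

nn = NearestNeighbors(n_neighbors=10).fit(base)
d_in, _ = nn.kneighbors(in_dist)
d_ood, _ = nn.kneighbors(ood)
print("mean 10-NN distance, in-dist:", d_in.mean(), "OOD:", d_ood.mean())
```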
25

Akalya, Devi C., Renuka D. Karthika, T. Harisudhan, V. K. Jeevanantham, J. Jhanani, and Varshini S. Kavi. "Text emotion recognition using fast text word embedding in bi-directional gated recurrent unit." i-manager's Journal on Information Technology 11, no. 4 (2022): 1. http://dx.doi.org/10.26634/jit.11.4.19119.

Full text
Abstract
Emotions are states of readiness in the mind that result from evaluations of one's own thinking or events. Although almost all of the important events in our lives are marked by emotions, the nature, causes, and effects of emotions are some of the least understood parts of the human experience. Emotion recognition is playing a promising role in the domains of human-computer interaction and artificial intelligence. A human's emotions can be detected using a variety of methods, including facial gestures, blood pressure, body movements, heart rate, and textual data. From an application standpoint, the ability to identify human emotions in text is becoming more and more crucial in computational linguistics. In this work, we present a classification methodology based on deep neural networks. The Bi-directional Gated Recurrent Unit (Bi-GRU) employed here demonstrates its effectiveness on the Multimodal Emotion Lines Dataset (MELD) when compared to Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM). For word encoding, a comparison of three pre-trained word embeddings namely Glove, Word2Vec, and fastText is made. The findings from the MELD corpus support the conclusion that fastText is the best word embedding for the proposed Bi-GRU model. The experiment utilized the "glove.6B.300d" vector space. It consists of two million word representations in 300 dimensions trained on Common Crawl with sub-word information (600 billion tokens). The accuracy scores of GloVe, Word2Vec, and fastText (300 dimensions each) are tabulated and studied in order to highlight the improved results with fastText on the MELD dataset tested. It is observed that the Bidirectional Gated Recurrent Unit (Bi-GRU) with fastText word embedding outperforms GloVe and Word2Vec with an accuracy of 79.7%.
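
A bi-directional GRU classifier over pretrained word vectors can be sketched as follows; this PyTorch version is an assumed analogue of the paper's setup, with a random embedding matrix standing in for fastText weights and seven output classes mirroring MELD's emotion labels.

```python
# Minimal sketch: Bi-GRU text-emotion classifier over word-vector embeddings.
import torch
import torch.nn as nn

class BiGRUClassifier(nn.Module):
    def __init__(self, vocab_size=20000, emb_dim=300, hidden=128, n_classes=7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # pretrained vectors could be loaded here
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, token_ids):
        x = self.embed(token_ids)
        _, h = self.gru(x)                               # h: (2, batch, hidden)
        h = torch.cat([h[0], h[1]], dim=1)               # concatenate both directions
        return self.fc(h)

logits = BiGRUClassifier()(torch.randint(0, 20000, (4, 30)))
print(logits.shape)   # torch.Size([4, 7])
```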
26

Simhal, Anish K., Rena Elkin, Ross S. Firestone, Jung Hun Oh, and Joseph O. Deasy. "Abstract A031: Unsupervised graph-based visualization of variational autoencoder latent spaces reveals hidden multiple myeloma subtypes." Clinical Cancer Research 31, no. 13_Supplement (2025): A031. https://doi.org/10.1158/1557-3265.aimachine-a031.

Full text
Abstract
Latent space representations learned through variational autoencoders (VAEs) offer a powerful, unsupervised means of capturing nonlinear structure in high-dimensional oncology data. The latent embedding spaces often encode information that differs from traditional bioinformatics methods such as t-SNE or UMAP. However, a persistent challenge remains: how to meaningfully visualize and interpret these latent variables. Common dimensionality reduction techniques like UMAP and t-SNE, while effective, can obscure graph-theoretic relationships that may underlie important biological patterns. We present a novel approach for intuitive latent space interpretation using NetFlow, a method that visualizes the organizational structure of samples as a graph derived from their latent embeddings. NetFlow constructs a topological representation based on the metric structure of the latent space, drawing on concepts from network analysis, optimal mass transport, topological data analysis, and lineage tracing. The result is an interpretable graph in which nodes represent individual subjects and edges reflect local and global similarity among the samples. We applied this method to multiple myeloma (MM), a hematologic malignancy marked by malignant plasma cell proliferation and inevitable relapse. To uncover hidden disease subtypes, we trained a VAE on multimodal data from 659 patients in the MMRF CoMMpass dataset (IA19), integrating transcriptomic, genomic, and clinical features. Direct clustering of latent space vectors failed to yield subgroups with significant differences in progression-free survival (PFS). In contrast, NetFlow generated a latent space graph that, when clustered using Louvain community detection, identified three distinct subtypes: one high-risk and two low-risk groups. The high-risk group exhibited a median PFS 1.5 years shorter than the low-risk groups (p<0.001) and was enriched for known poor prognostic markers including gain 1q21 (59%), MAF translocations (17%), and t(4;14) (66%). Although the two low-risk groups had similar PFS outcomes, they differed in their molecular profiles, suggesting they may benefit from different therapeutic strategies. These preliminary results demonstrate that variational autoencoders and NetFlow graph analysis can reveal latent substructures missed by traditional clustering, thereby advancing latent space explainability and enabling improved subtype discovery in MM. Our framework offers a generalizable pipeline for interpreting deep generative models in cancer genomics. Citation Format: Anish K. Simhal, Rena Elkin, Ross S. Firestone, Jung Hun Oh, Joseph O. Deasy. Unsupervised graph-based visualization of variational autoencoder latent spaces reveals hidden multiple myeloma subtypes [abstract]. In: Proceedings of the AACR Special Conference in Cancer Research: Artificial Intelligence and Machine Learning; 2025 Jul 10-12; Montreal, QC, Canada. Philadelphia (PA): AACR; Clin Cancer Res 2025;31(13_Suppl):Abstract nr A031.
27

Li, Yihua, Hongyue Chen, Yiqing Li, and Yetong Xin. "PoeSpin: A Human-AI Dance to Poetry System for Movement-Based Verse Generation." Proceedings of the ACM on Computer Graphics and Interactive Techniques 8, no. 3 (2025): 1–13. https://doi.org/10.1145/3736781.

Full text
Abstract
This paper presents PoeSpin, a human–AI co-writing system that transforms pole dance movements into poetry. Inspired by pole dance and computational linguistics, this project reimagines dance as a form of embodied poetic creation situated within the traditions of spatial and concrete poetry. Drawing from these traditions, PoeSpin treats physical motion as a generative force in the poetic process. We implemented three movement-to-poetry strategies. Among them, the second approach—embedding pole dance trajectories into a reduced-dimensional semantic space—served as the foundation for a real-time interactive installation. Building on this implementation and user feedback, we also proposed an integrated strategy for future interaction design. Our primary method employs the reduction of dimensionality of word vectors, mapping dance trajectories into a semantic space constructed upon Yeats’s poetic corpus. Rather than merely describing physical movement, the system generates evocative verse that extends beyond literal translation, unleashing creative possibilities that bridge the corporeal and the literary. This work offers both a technical framework for multimodal dance-poetry generation and a critical reframing of pole dance aesthetics, aiming to liberate this art form from prevailing societal dismissive gaze.
28

Hnini, Ghizlane, Jamal Riffi, Mohamed Adnane Mahraz, Ali Yahyaouy, and Hamid Tairi. "MMPC-RF: A Deep Multimodal Feature-Level Fusion Architecture for Hybrid Spam E-mail Detection." Applied Sciences 11, no. 24 (2021): 11968. http://dx.doi.org/10.3390/app112411968.

Full text
Abstract
Hybrid spam is an undesirable e-mail (electronic mail) that contains both image and text parts. It is more harmful and complex as compared to image-based and text-based spam e-mail. Thus, an efficient and intelligent approach is required to distinguish between spam and ham. To our knowledge, a small number of studies have been aimed at detecting hybrid spam e-mails. Most of these multimodal architectures adopted the decision-level fusion method, whereby the classification scores of each modality were concatenated and fed to another classification model to make a final decision. Unfortunately, this method not only demands many learning steps, but it also loses correlation in mixed feature space. In this paper, we propose a deep multimodal feature-level fusion architecture that concatenates two embedding vectors to have a strong representation of e-mails and increase the performance of the classification. The paragraph vector distributed bag of words (PV-DBOW) and the convolutional neural network (CNN) were used as feature extraction techniques for text and image parts, respectively, of the same e-mail. The extracted feature vectors were concatenated and fed to the random forest (RF) model to classify a hybrid e-mail as either spam or ham. The experiments were conducted on three hybrid datasets made using three publicly available corpora: Enron, Dredze, and TREC 2007. According to the obtained results, the proposed model provides a higher accuracy of 99.16% compared to recent state-of-the-art methods.
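
Feature-level fusion of PV-DBOW text vectors with CNN image features, followed by a random forest, can be approximated as below; gensim's Doc2Vec with dm=0 implements PV-DBOW, while the image features and the toy corpus here are placeholders, not the paper's pipeline.

```python
# Minimal sketch: PV-DBOW document vectors concatenated with placeholder CNN
# image features and fed to a random forest spam/ham classifier.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.ensemble import RandomForestClassifier

texts = [["win", "free", "money", "now"], ["meeting", "agenda", "attached"]] * 50
docs = [TaggedDocument(words, [i]) for i, words in enumerate(texts)]
d2v = Doc2Vec(docs, vector_size=50, dm=0, min_count=1, epochs=20)   # dm=0 -> PV-DBOW

rng = np.random.default_rng(0)
text_vecs = np.array([d2v.dv[i] for i in range(len(texts))])
img_vecs = rng.normal(size=(len(texts), 64))        # stand-in for CNN image features
X = np.concatenate([text_vecs, img_vecs], axis=1)   # feature-level fusion
y = np.array([1, 0] * 50)                           # 1 = spam, 0 = ham

rf = RandomForestClassifier(n_estimators=100).fit(X, y)
print(rf.score(X, y))
```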
29

Wang, Kaijie, Tiejun Wang, Xiaoran Guo, Kui Xu, and Jiao Wu. "Thangka Image—Text Matching Based on Adaptive Pooling Layer and Improved Transformer." Applied Sciences 14, no. 2 (2024): 807. http://dx.doi.org/10.3390/app14020807.

Full text
Abstract
Image–text matching is a research hotspot in the multimodal task of integrating image and text processing. In order to solve the difficult problem of associating image and text data in the multimodal knowledge graph of Thangka, we propose an image and text matching method based on the Visual Semantic Embedding (VSE) model. The method introduces an adaptive pooling layer to improve the feature extraction capability of semantic associations between Thangka images and texts. We also improved the traditional Transformer architecture by combining bidirectional residual concatenation and mask attention mechanisms to improve the stability of the matching process and the ability to extract semantic information. In addition, we designed a multi-granularity tag alignment module that maps global and local features of images and text into a common coding space, leveraging inter- and intra-modal semantic associations to improve image and text accuracy. Comparative experiments on the Thangka dataset show that our method achieves significant improvements compared to the VSE baseline method. Specifically, our method improves the recall by 9.4% and 10.5% for image-matching text and text-matching images, respectively. Furthermore, without any large-scale corpus pre-training, our method outperforms all models without pre-training and outperforms two out of four pre-trained models on the Flickr30k public dataset. Also, the execution efficiency of our model is an order of magnitude higher than that of the pre-trained models, which highlights the superior performance and efficiency of our model in the image–text matching task.
30

Zhang, Yutong, Jiantao Wu, Li Sun, and Guoan Yang. "Contrastive Learning-Based Cross-Modal Fusion for Product Form Imagery Recognition: A Case Study on New Energy Vehicle Front-End Design." Sustainability 17, no. 10 (2025): 4432. https://doi.org/10.3390/su17104432.

Abstract
Fine-grained feature extraction and affective semantic mapping remain significant challenges in product form analysis. To address these issues, this study proposes a contrastive learning-based cross-modal fusion approach for product form imagery recognition, using the front-end design of new energy vehicles (NEVs) as a case study. The proposed method first employs the Biterm Topic Model (BTM) and Analytic Hierarchy Process (AHP) to extract thematic patterns and compute weight distributions from consumer review texts, thereby identifying key imagery style labels. These labels are then leveraged for image annotation, facilitating the construction of a multimodal dataset. Next, ResNet-50 and Transformer architectures serve as the image and text encoders, respectively, to extract and represent multimodal features. To ensure effective alignment and deep fusion of textual and visual representations in a shared embedding space, a contrastive learning mechanism is introduced, optimizing cosine similarity between positive and negative sample pairs. Finally, a fully connected multilayer network is integrated at the output of the Transformer and ResNet with Contrastive Learning (TRCL) model to enhance classification accuracy and reliability. Comparative experiments against various deep convolutional neural networks (DCNNs) demonstrate that the TRCL model effectively integrates semantic and visual information, significantly improving the accuracy and robustness of complex product form imagery recognition. These findings suggest that the proposed method holds substantial potential for large-scale product appearance evaluation and affective cognition research. Moreover, this data-driven fusion underpins sustainable product form design by streamlining evaluation and optimizing resource use.
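A minimal sketch of the contrastive alignment step described above, assuming image and text features have already been projected into a shared embedding space; the symmetric InfoNCE loss over cosine similarities shown here is a generic stand-in, not the TRCL implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over cosine similarities: matched image/text pairs
    are positives, all other pairs in the batch are negatives."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) cosine similarities
    targets = torch.arange(img.size(0))
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

img_emb = torch.randn(8, 256)   # e.g. projected ResNet-50 features
txt_emb = torch.randn(8, 256)   # e.g. projected Transformer features
print(contrastive_loss(img_emb, txt_emb).item())
```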
31

Biswas, Rajarshi, Michael Barz, and Daniel Sonntag. "Towards Explanatory Interactive Image Captioning Using Top-Down and Bottom-Up Features, Beam Search and Re-ranking." KI - Künstliche Intelligenz 34, no. 4 (2020): 571–84. http://dx.doi.org/10.1007/s13218-020-00679-2.

Abstract
Image captioning is a challenging multimodal task. Significant improvements could be obtained by deep learning. Yet, captions generated by humans are still considered better, which makes it an interesting application for interactive machine learning and explainable artificial intelligence methods. In this work, we aim at improving the performance and explainability of the state-of-the-art method Show, Attend and Tell by augmenting their attention mechanism using additional bottom-up features. We compute visual attention on the joint embedding space formed by the union of high-level features and the low-level features obtained from the object specific salient regions of the input image. We embed the content of bounding boxes from a pre-trained Mask R-CNN model. This delivers state-of-the-art performance, while it provides explanatory features. Further, we discuss how interactive model improvement can be realized through re-ranking caption candidates using beam search decoders and explanatory features. We show that interactive re-ranking of beam search candidates has the potential to outperform the state-of-the-art in image captioning.
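The re-ranking idea can be illustrated with a small, hypothetical scorer that mixes each beam candidate's length-normalized log-probability with an external relevance score (standing in for the explanatory features or user feedback); the weighting and scores below are assumptions, not the authors' method.

```python
from typing import List, Tuple

def rerank_beam_candidates(
    candidates: List[Tuple[str, float, float]],  # (caption, log_prob, relevance)
    weight: float = 0.5,
) -> List[Tuple[str, float]]:
    """Re-rank beam search captions by mixing the decoder's length-normalized
    log-probability with an external relevance score. Purely illustrative."""
    scored = []
    for caption, log_prob, relevance in candidates:
        norm_lp = log_prob / max(len(caption.split()), 1)
        scored.append((caption, (1 - weight) * norm_lp + weight * relevance))
    return sorted(scored, key=lambda x: x[1], reverse=True)

beam = [("a dog runs on the beach", -4.2, 0.9),
        ("a dog on a beach", -3.1, 0.4),
        ("an animal outdoors", -2.8, 0.2)]
for caption, score in rerank_beam_candidates(beam):
    print(f"{score:+.3f}  {caption}")
```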
32

Meo, Giuseppe, Pilar M. Ferraro, Marta Cillerai, et al. "MND Phenotypes Differentiation: The Role of Multimodal Characterization at the Time of Diagnosis." Life 12, no. 10 (2022): 1506. http://dx.doi.org/10.3390/life12101506.

Abstract
Pure/predominant upper motor neuron (pUMN) and lower motor neuron (pLMN) diseases have significantly better prognosis compared to amyotrophic lateral sclerosis (ALS), but their early differentiation is often challenging. We therefore tested whether a multimodal characterization approach embedding clinical, cognitive/behavioral, genetic, and neurophysiological data may improve the differentiation of pUMN and pLMN from ALS already by the time of diagnosis. Dunn’s and chi-squared tests were used to compare data from 41 ALS, 34 pLMN, and 19 pUMN cases with diagnoses confirmed throughout a 2-year observation period. Area under the curve (AUC) analyses were implemented to identify the finest tools for phenotypes discrimination. Relative to ALS, pLMN showed greater lower limbs weakness, lower UMN burden, and progression rate (p < 0.001–0.04). PUMN showed a greater frequency of lower limbs onset, higher UMN burden, lower ALSFRS-r and MRC progression rates (p < 0.001–0.03), and greater ulnar compound muscle action potential (CMAP) amplitude and tibial central motor conduction time (CMCT) (p = 0.05–0.03). The UMN progression rate was the finest measure to identify pLMN cases (AUC = 90%), while the MRC progression rate was the finest tool to identify pUMN (AUC = 82%). Detailed clinical and neurophysiological examinations may significantly improve MNDs differentiation, facilitating prognosis estimation and ameliorating stratification strategies for clinical trials enrollment.
33

Malitesta, Daniele. "Graph Neural Networks for Recommendation Leveraging Multimodal Information." ACM SIGIR Forum 58, no. 1 (2024): 1–2. http://dx.doi.org/10.1145/3687273.3687295.

Abstract
Recommender systems act as filtering algorithms to provide users with items that might meet their interests according to the expressed preferences and items' characteristics. As of today, the collaborative filtering paradigm, along with deep learning techniques to learn high-quality users' and items' representations, constitute the de facto standard for personalized recommendation, showing remarkable recommendation accuracy performance. Nevertheless, recommendation remains a highly-challenging task. Among the most debated open issues in the community, this thesis considers two algorithmic and conceptual ones, namely: (i) the inexplicable nature of users' preferences, especially when they come in the form of implicit feedback; (ii) the effective exploitation of the collaborative information in the designing and training of recommendation models. In domains such as fashion, food, and media content recommendation, the shallow item's profile can be enhanced through the multimodal characteristics describing items [Malitesta et al., 2023]. Driven by these assumptions, in the first part of this thesis, we apply multimodal deep learning strategies for multimedia recommendation; the scope is to study and design recommendation algorithms based upon the principles of multimodality to possibly match each item's characteristic to the implicit preference expressed by the user [Deldjoo et al., 2022], thus addressing the (i) issue. Recent collaborative filtering approaches profile users and items through embedding vectors in the latent space. However, such models disregard structural properties naturally encoded into the user-item interaction data. Indeed, recommendation datasets are easily describable under the topology of a bipartite and undirected graph, with users and items being the graph nodes connected at multiple distance hops. In this respect, the application of graph neural networks , recent deep learning techniques specifically tailored to learn from non-euclidean data, can provide a refined representation of users and items to mine near- and long-distance relationships in the user-item graphs [Anelli et al., 2023b]. Indeed, this is one possible solution to exploit the collaborative information, which is effectively propagated within the user-item graph, addressing the (ii) issue. Conclusively, this thesis aims to match the two families of recommendation strategies by leveraging graph neural networks and multimodal information data [Anelli et al., 2022]. In doing so, other numerous micro-aspects within the two macro-areas (introduced above) are examined. Indeed, the thesis is a systematic compendium of careful analyses regarding, among others, reproducibility, novel evaluation dimensions [Anelli et al., 2023a], and tasks/scenarios complementary to recommendation. Awarded by : Politecnico di Bari, Bari, Italy on 30 January 2024. Supervised by : Tommaso Di Noia. Available at : https://hdl.handle.net/11589/264941.
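As one concrete instance of propagating collaborative information over the bipartite user-item graph, the sketch below implements LightGCN-style embedding propagation with symmetric normalization; it is illustrative only and not the specific models studied in the thesis.

```python
import torch

def lightgcn_propagate(user_emb, item_emb, interactions, n_layers=2):
    """Propagate embeddings over a bipartite user-item graph with symmetric
    normalization (LightGCN-style), one simple instance of the collaborative
    propagation discussed above."""
    n_users, n_items = user_emb.size(0), item_emb.size(0)
    A = torch.zeros(n_users + n_items, n_users + n_items)
    for u, i in interactions:                         # undirected user-item edges
        A[u, n_users + i] = 1.0
        A[n_users + i, u] = 1.0
    deg = A.sum(dim=1).clamp(min=1.0)
    norm_A = A / torch.sqrt(deg.unsqueeze(0) * deg.unsqueeze(1))

    emb = torch.cat([user_emb, item_emb], dim=0)
    layers = [emb]
    for _ in range(n_layers):
        emb = norm_A @ emb
        layers.append(emb)
    final = torch.stack(layers).mean(dim=0)           # average over layers
    return final[:n_users], final[n_users:]

users, items = torch.randn(5, 16), torch.randn(7, 16)
u_out, i_out = lightgcn_propagate(users, items, [(0, 1), (0, 3), (2, 4), (4, 6)])
print(u_out.shape, i_out.shape)
```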
34

Balabin, Helena, Charles Tapley Hoyt, Colin Birkenbihl, et al. "STonKGs: a sophisticated transformer trained on biomedical text and knowledge graphs." Bioinformatics 38, no. 6 (2022): 1648–56. http://dx.doi.org/10.1093/bioinformatics/btac001.

Abstract
Motivation: The majority of biomedical knowledge is stored in structured databases or as unstructured text in scientific publications. This vast amount of information has led to numerous machine learning-based biological applications using either text through natural language processing (NLP) or structured data through knowledge graph embedding models. However, representations based on a single modality are inherently limited. Results: To generate better representations of biological knowledge, we propose STonKGs, a Sophisticated Transformer trained on biomedical text and Knowledge Graphs (KGs). This multimodal Transformer uses combined input sequences of structured information from KGs and unstructured text data from biomedical literature to learn joint representations in a shared embedding space. First, we pre-trained STonKGs on a knowledge base assembled by the Integrated Network and Dynamical Reasoning Assembler consisting of millions of text-triple pairs extracted from biomedical literature by multiple NLP systems. Then, we benchmarked STonKGs against three baseline models trained on either one of the modalities (i.e. text or KG) across eight different classification tasks, each corresponding to a different biological application. Our results demonstrate that STonKGs outperforms both baselines, especially on the more challenging tasks with respect to the number of classes, improving upon the F1-score of the best baseline by up to 0.084 (i.e. from 0.881 to 0.965). Finally, our pre-trained model as well as the model architecture can be adapted to various other transfer learning applications. Availability and implementation: We make the source code and the Python package of STonKGs available at GitHub (https://github.com/stonkgs/stonkgs) and PyPI (https://pypi.org/project/stonkgs/). The pre-trained STonKGs models and the task-specific classification models are respectively available at https://huggingface.co/stonkgs/stonkgs-150k and https://zenodo.org/communities/stonkgs. Supplementary information: Supplementary data are available at Bioinformatics online.
35

Yuan, Xinpan, Xinxin Mao, Wei Xia, Zhiqi Zhang, Shaojun Xie, and Chengyuan Zhang. "PTF-SimCM: A Simple Contrastive Model with Polysemous Text Fusion for Visual Similarity Metric." Complexity 2022 (September 16, 2022): 1–14. http://dx.doi.org/10.1155/2022/2343707.

Abstract
Image similarity metric, also known as metric learning (ML) in computer vision, is a significant step in various advanced image tasks. Nevertheless, existing well-performing approaches for image similarity measurement only focus on the image itself without utilizing the information of other modalities, while pictures always appear with the described text. Furthermore, those methods need human supervision, yet most images are unlabeled in the real world. Considering the above problems comprehensively, we present a novel visual similarity metric model named PTF-SimCM. It adopts a self-supervised contrastive structure like SimSiam and incorporates a multimodal fusion module to utilize textual modality correlated to the image. We apply a cross-modal model for text modality rather than a standard unimodal text encoder to improve late fusion productivity. In addition, the proposed model employs Sentence PIE-Net to solve the issue caused by polysemous sentences. For simplicity and efficiency, our model learns a specific embedding space where distances directly correspond to the similarity. Experimental results on MSCOCO, Flickr 30k, and Pascal Sentence datasets show that our model overall outperforms all the compared methods in this work, which illustrates that the model can effectively address the issues faced and enhance the performances on unsupervised visual similarity measuring relatively.
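The SimSiam-style objective that the model builds on can be sketched as a negative cosine similarity with a stop-gradient on the target branch; the multimodal text-fusion module of PTF-SimCM is omitted here, so this is only a partial illustration.

```python
import torch
import torch.nn.functional as F

def simsiam_loss(p1, z1, p2, z2):
    """Negative cosine similarity with stop-gradient on the target branch,
    as in SimSiam. p* are predictor outputs, z* are encoder outputs of two
    augmented views; the text-fusion part of PTF-SimCM is not shown."""
    def d(p, z):
        return -F.cosine_similarity(p, z.detach(), dim=-1).mean()
    return 0.5 * d(p1, z2) + 0.5 * d(p2, z1)

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)   # encoder outputs of two views
p1, p2 = torch.randn(8, 128), torch.randn(8, 128)   # predictor outputs
print(simsiam_loss(p1, z1, p2, z2).item())
```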
36

Tang, Zhenchao, Jiehui Huang, Guanxing Chen, and Calvin Yu-Chian Chen. "Comprehensive View Embedding Learning for Single-Cell Multimodal Integration." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 14 (2024): 15292–300. http://dx.doi.org/10.1609/aaai.v38i14.29453.

Abstract
Motivation: Advances in single-cell measurement techniques provide rich multimodal data, which helps us to explore the life state of cells more deeply. However, multimodal integration, or, learning joint embeddings from multimodal data remains a current challenge. The difficulty in integrating unpaired single-cell multimodal data is that different modalities have different feature spaces, which easily leads to information loss in joint embedding. And few existing methods have fully exploited and fused the information in single-cell multimodal data. Result: In this study, we propose CoVEL, a deep learning method for unsupervised integration of single-cell multimodal data. CoVEL learns single-cell representations from a comprehensive view, including regulatory relationships between modalities, fine-grained representations of cells, and relationships between different cells. The comprehensive view embedding enables CoVEL to remove the gap between modalities while protecting biological heterogeneity. Experimental results on multiple public datasets show that CoVEL is accurate and robust to single-cell multimodal integration. Data availability: https://github.com/shapsider/scintegration.
37

Chen, Ziwei, Shaokun An, Xiangqi Bai, Fuzhou Gong, Liang Ma, and Lin Wan. "DensityPath: an algorithm to visualize and reconstruct cell state-transition path on density landscape for single-cell RNA sequencing data." Bioinformatics 35, no. 15 (2018): 2593–601. http://dx.doi.org/10.1093/bioinformatics/bty1009.

Abstract
Motivation: Visualizing and reconstructing cell developmental trajectories intrinsically embedded in high-dimensional expression profiles of single-cell RNA sequencing (scRNA-seq) snapshot data are computationally intriguing, but challenging. Results: We propose DensityPath, an algorithm allowing (i) visualization of the intrinsic structure of scRNA-seq data on an embedded 2-d space and (ii) reconstruction of an optimal cell state-transition path on the density landscape. DensityPath powerfully handles high dimensionality and heterogeneity of scRNA-seq data by (i) revealing the intrinsic structures of data, while adopting a non-linear dimension reduction algorithm, termed elastic embedding, which can preserve both local and global structures of the data; and (ii) extracting the topological features of high-density, level-set clusters from a single-cell multimodal density landscape of transcriptional heterogeneity, as the representative cell states. DensityPath reconstructs the optimal cell state-transition path by finding the geodesic minimum spanning tree of representative cell states on the density landscape, establishing a least action path with the minimum-transition-energy of cell fate decisions. We demonstrate that DensityPath can ably reconstruct complex trajectories of cell development, e.g. those with multiple bifurcating and trifurcating branches, while maintaining computational efficiency. Moreover, DensityPath has high accuracy for pseudotime calculation and branch assignment on real scRNA-seq, as well as simulated datasets. DensityPath is robust to parameter choices, as well as permutations of data. Availability and implementation: DensityPath software is available at https://github.com/ucasdp/DensityPath. Supplementary information: Supplementary data are available at Bioinformatics online.
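The path-reconstruction step can be approximated with a minimum spanning tree over representative cell states; the sketch below uses plain Euclidean distances in place of DensityPath's geodesic, density-based edge weights, so it illustrates the structure of the computation rather than the algorithm itself.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
# Stand-ins for representative cell states (e.g. density-peak cluster centres
# on the 2-d embedding); DensityPath would weight edges by a geodesic /
# least-action cost on the density landscape, which is simplified away here.
states = rng.normal(size=(12, 2))

dist = squareform(pdist(states))              # pairwise Euclidean distances
mst = minimum_spanning_tree(dist).toarray()   # sparse MST as a dense matrix

edges = [(int(i), int(j)) for i, j in zip(*np.nonzero(mst))]
print("state-transition path edges:", edges)
```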
38

Yin, Ziyi, Muchao Ye, Tianrong Zhang, et al. "VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 7 (2024): 6755–63. http://dx.doi.org/10.1609/aaai.v38i7.28499.

Abstract
Visual Question Answering (VQA) is a fundamental task in the computer vision and natural language processing fields. Although the “pre-training & fine-tuning” learning paradigm significantly improves the VQA performance, the adversarial robustness of such a learning paradigm has not been explored. In this paper, we delve into a new problem: using a pre-trained multimodal source model to create adversarial image-text pairs and then transferring them to attack the target VQA models. Correspondingly, we propose a novel VQATTACK model, which can iteratively generate both image and text perturbations with the designed modules: the large language model (LLM)-enhanced image attack and the cross-modal joint attack module. At each iteration, the LLM-enhanced image attack module first optimizes the latent representation-based loss to generate feature-level image perturbations. Then it incorporates an LLM to further enhance the image perturbations by optimizing the designed masked answer anti-recovery loss. The cross-modal joint attack module will be triggered at a specific iteration, which updates the image and text perturbations sequentially. Notably, the text perturbation updates are based on both the learned gradients in the word embedding space and word synonym-based substitution. Experimental results on two VQA datasets with five validated models demonstrate the effectiveness of the proposed VQATTACK in the transferable attack setting, compared with state-of-the-art baselines. This work reveals a significant blind spot in the “pre-training & fine-tuning” paradigm on VQA tasks. The source code can be found at https://github.com/ericyinyzy/VQAttack.
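VQAttack's specific modules are not reproduced here; the sketch below shows only a generic projected-gradient (PGD-style) image-perturbation step of the kind such transferable attacks iterate, with a placeholder gradient.

```python
import torch

def pgd_step(image, grad, delta, epsilon=8 / 255, alpha=2 / 255):
    """One projected-gradient step for an image perturbation: move along the
    sign of the loss gradient, then clip back into the epsilon ball and the
    valid pixel range. A generic building block, not VQAttack's full pipeline."""
    delta = delta + alpha * grad.sign()
    delta = delta.clamp(-epsilon, epsilon)
    return (image + delta).clamp(0.0, 1.0) - image   # re-project onto valid pixels

image = torch.rand(1, 3, 224, 224)      # benign image in [0, 1]
delta = torch.zeros_like(image)
grad = torch.randn_like(image)          # placeholder for d(loss)/d(image)
delta = pgd_step(image, grad, delta)
print(delta.abs().max().item() <= 8 / 255 + 1e-6)
```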
39

Widrich, Michael, Anooj Patel, Peter Ulz, et al. "Abstract A045: Unlocking deep learning for cell-free DNA-based early colorectal cancer detection." Clinical Cancer Research 31, no. 13_Supplement (2025): A045. https://doi.org/10.1158/1557-3265.aimachine-a045.

Abstract
Abstract Introduction: Colorectal cancer (CRC) is the second most common cause of cancer-related death in the US. Screening reduces cancer mortality through early detection, but only 59% of eligible individuals are up to date with recommended CRC screening. Non-invasive and more convenient tests can increase adherence to screening guidelines, and blood tests using next generation sequencing to detect cancer-associated methylation patterns in cell-free DNA (cfDNA) have recently shown great promise and are on track to or have recently achieved FDA approval. Such tests produce data for millions of cfDNA fragments (and billions of bases) per sample and require sophisticated featurization and classification algorithms. Deep learning (DL) outperforms traditional machine learning (ML) given enough training data, but application to blood-based cancer screening remains challenging due to the immense input space and low training case counts. Here, we present an interpretable fragment-level DL model that outperforms a state-of-the-art ML approach. Methods: Each sample yields millions of fragments represented as a multimodal feature based on nucleotide sequence, CpG methylation pattern at single-base resolution, and other biologically relevant characteristics. Our DL model first learns a fragment embedding; then, a specialized attention mechanism uses cancer-indicative fragments to learn a sample embedding. Finally, the sample embedding is used to predict CRC status. To assess classification accuracy, we used two independent test sets: challenging contrived positive material (plasma from an advanced-CRC subject diluted into plasma from healthy controls to a level just above the detection limit, n = 148); and a research cohort of patients with CRC (n = 211) or harder-to-detect advanced precancerous lesions (APLs, n = 388). Results: We trained DL models using 70% (DL1) or 100% (DL2) of available training data (925 cases; 3,469 controls). For each model, we set a classification threshold to yield 90% specificity in test-set controls (n = 331). DL2 was more sensitive than the ML model in contrived positives (82% vs 70%), CRCs (90% vs 88%), and APLs (30% vs 27%). Further, DL2 improved on DL1 in each positive sample type (82% vs 72%, 90% vs 89%, and 30% vs 28%, respectively), showing that DL model performance increases with volume of training data. Conclusion: A DL model that operates on millions of fragments per subject outperformed a state-of-the-art ML method when applied to an independent test cohort and exhibited improved performance as training data volume increased. For interpretability, the model can be analyzed via attention values and contribution analysis at the fragment level, providing insight into previously unrecognized cancer-associated fragment characteristics, and the sample embedding can be used to visualize sample distributions and assess model generalizability. Together, these results pave the way for effective DL in blood-based early cancer detection. Citation Format: Michael Widrich, Anooj Patel, Peter Ulz, Kaitlyn Coil, Thomas Royce, Jimmy Lin, Richard Bourgon, Anindita Dutta. Unlocking deep learning for cell-free DNA-based early colorectal cancer detection [abstract]. In: Proceedings of the AACR Special Conference in Cancer Research: Artificial Intelligence and Machine Learning; 2025 Jul 10-12; Montreal, QC, Canada. Philadelphia (PA): AACR; Clin Cancer Res 2025;31(13_Suppl):Abstract nr A045.
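The fragment-to-sample aggregation can be illustrated with a generic attention-pooling (attention-based multiple-instance learning) module: per-fragment embeddings are weighted by learned attention scores and summed into a sample embedding used for classification. The module below is an assumption-laden sketch, not the authors' model.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Aggregate per-fragment embeddings (N, D) into one sample embedding via
    learned attention weights, in the spirit of attention-based multiple
    instance learning. Minimal illustrative sketch only."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, 1))
        self.classifier = nn.Linear(dim, 1)

    def forward(self, fragments):                    # (N, D) fragments of one sample
        weights = torch.softmax(self.score(fragments), dim=0)   # (N, 1)
        sample_emb = (weights * fragments).sum(dim=0)           # (D,)
        logit = self.classifier(sample_emb)
        return logit, weights                        # weights are interpretable

model = AttentionPooling(dim=32)
fragments = torch.randn(1000, 32)                    # embeddings of cfDNA fragments
logit, weights = model(fragments)
print(logit.shape, weights.sum().item())             # torch.Size([1]), ~1.0
```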
40

Lin, Kaiyi, Xing Xu, Lianli Gao, Zheng Wang, and Heng Tao Shen. "Learning Cross-Aligned Latent Embeddings for Zero-Shot Cross-Modal Retrieval." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (2020): 11515–22. http://dx.doi.org/10.1609/aaai.v34i07.6817.

Abstract
Zero-Shot Cross-Modal Retrieval (ZS-CMR) is an emerging research hotspot that aims to retrieve data of new classes across different modalities. It is challenging due to not only the heterogeneous distributions across different modalities, but also the inconsistent semantics across seen and unseen classes. A handful of recently proposed methods typically borrow the idea from zero-shot learning, i.e., exploiting word embeddings of class labels (i.e., class-embeddings) as the common semantic space, and using a generative adversarial network (GAN) to capture the underlying multimodal data structures, as well as to strengthen relations between input data and the semantic space to generalize across seen and unseen classes. In this paper, we propose a novel method termed Learning Cross-Aligned Latent Embeddings (LCALE) as an alternative to these GAN-based methods for ZS-CMR. Unlike methods that use the class-embeddings as the semantic space, our method seeks a shared low-dimensional latent space of input multimodal features and class-embeddings via modality-specific variational autoencoders. Notably, we align the distributions learned from multimodal input features and from class-embeddings to construct latent embeddings that contain the essential cross-modal correlation associated with unseen classes. Effective cross-reconstruction and cross-alignment criteria are further developed to preserve class-discriminative information in the latent space, which benefits retrieval efficiency and enables knowledge transfer to unseen classes. We evaluate our model using four benchmark datasets on image-text retrieval tasks and one large-scale dataset on image-sketch retrieval tasks. The experimental results show that our method establishes new state-of-the-art performance for both tasks on all datasets.
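A compact sketch of the core construction, modality-specific variational autoencoders mapping inputs into a shared latent space with reconstruction, cross-reconstruction, alignment, and KL terms, is given below; the dimensions and the simple mean-matching alignment are placeholders rather than the paper's exact losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityVAE(nn.Module):
    """A tiny VAE branch: encodes one modality (or the class-embeddings) into a
    shared low-dimensional latent space. Dimensions are placeholders."""
    def __init__(self, in_dim, latent_dim=64):
        super().__init__()
        self.mu = nn.Linear(in_dim, latent_dim)
        self.logvar = nn.Linear(in_dim, latent_dim)
        self.dec = nn.Linear(latent_dim, in_dim)

    def encode(self, x):
        mu, logvar = self.mu(x), self.logvar(x)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return z, mu, logvar

img_vae, txt_vae = ModalityVAE(2048), ModalityVAE(300)
img, txt = torch.randn(8, 2048), torch.randn(8, 300)   # paired modality features

z_i, mu_i, lv_i = img_vae.encode(img)
z_t, mu_t, lv_t = txt_vae.encode(txt)

recon = F.mse_loss(img_vae.dec(z_i), img) + F.mse_loss(txt_vae.dec(z_t), txt)
cross = F.mse_loss(img_vae.dec(z_t), img) + F.mse_loss(txt_vae.dec(z_i), txt)  # cross-reconstruction
align = F.mse_loss(mu_i, mu_t)                       # crude latent-distribution alignment
kld = (-0.5 * torch.mean(1 + lv_i - mu_i.pow(2) - lv_i.exp())
       - 0.5 * torch.mean(1 + lv_t - mu_t.pow(2) - lv_t.exp()))
loss = recon + cross + align + kld
print(loss.item())
```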
41

Neagu, Maria-Ionela. "INTRODUCTION. MULTIMODAL DIMENSIONS OF IDENTITY IN CULTURE, SOCIETY, AND THE ARTS." JOURNAL OF LINGUISTIC AND INTERCULTURAL EDUCATION 17, no. 2 (2024): 7–13. https://doi.org/10.29302/jolie.2024.17.2.1.

Abstract
The concept of identity has been approached from multiple perspectives, as the self always relates to everything and everybody that surrounds it, getting adjusted by every experience it passes through, in a continuous attempt to gain self-apprehension and to recover its sense of belonging. Place, time, emotions, culture are only a few of the factors that impact upon the self, reconfiguring it, as a result of the “troubled condition of the individual, displaced and oscillating between cultures” (Dobrinescu 2017: 156). It is the quest for personal identity, against the background of social relations, that urges the individual to accept or to reject some configurations and representations of his/her self. This personal-social dichotomy has received scholarly attention, engendering numerous theories that aim to integrate the eclectic nature of identity into a coherent picture. Nevertheless, identity should rather be viewed in its own making, as a process, emerging from constant and fluctuating identities (Hall 1997) and leading to permanent or volatile identity fragments (Norris 2011). As Lawler (2014) argues, despite having a stable core embedding both sameness and difference at the same time, identity is produced in the flow of social relationships. Moreover, as Simon (2004) would add, identity is not only socially constructed and negotiated, but also represented and conceptualised at a cognitive level. Thus, identity pertains to the individual’s own perception of him/herself, to the way s/he wants to be perceived by the others, and to the feedback s/he receives throughout the social interactions. Therefore, the individual will get the complete picture of his/her identity once s/he manages to bind “untold and repressed stories” and “the actual stories the subject can take up to and hold as constitutive of his personal identity” (Ricoeur 1984: 74). The contributions in this special issue surpass the boundaries imposed by the Self-Other dichotomy that pervades scholarly research, pinpointing to the multifaceted nature of identity. Its versatility is clearly reflected in the semiotic resources people use to express their identity. Regardless of whether they have it acknowledged by the others or not, they adopt different “stylistic resources” resulting into social semiotic manifestations that best reflect their identity. In Van Leeuwen’s (2022: 2) words: “…not only stable but also hybrid and conflicted or confused identities manifest themselves through different uses of shape, colour, texture, timbre and movement” that are “socially and culturally valued and regulated”. Such signifiers of identity distinguish or, on the contrary, unite different categories of people. “Différance is the systematic play of differences, of the traces of differences, of the spacing by means of which elements are related to each other”, as Derrida (1981: 27) explains it. And he goes on saying that “Differences are the effects of transformations…”. Identity production is such a transformation that results into manifest, perceptible difference. Broadly speaking, the contributions in this special issue highlight the sociolinguistic significance of the personal, the relational, and the collective sense of self, as outlined by a variety of genres, such as postmodern autobiographies, film adaptations, essays, novels and short stories, linguistic usage guidebooks, the discourse of education focused on the teaching of stylistic devices, and political cartoons. As surveyed by numerous studies (e.g. 
Hecht 1993, Brewer and Gardner 1996, Jenkins 2008), the three perspectives on identity pinpoint either to the psychological approaches that mainly focus on the individual and group membership level, or to the interactional approaches that delve into the interpersonal level of identity construction and negotiation, emphasising its sociopragmatic dimensions, such as identity positioning and (mis)management of face (Spencer-Oatey 2007) or the dialogic nature of identity (Feller 2014). This special issue of the Journal of Linguistic and Intercultural Education proceeds with Tidita Abdurrahmani’s (Bedër University College, Tirana, Albania) contribution entitled “Otherness and contemporaneity of identities in black female autobiography at the turn of the 21st century”. Drawing on a couple of feminist studies by Teresa de Lauretis (1986) or Simone de Beauvoir (1973), as well as on the postmodernist view of alterity as propounded by Gergen (1993) and Vegas-Gonzáles (2001), the study provides an insightful analysis of the multiplicity and specificity of the ethnic self as revealed by two autobiographical writings, namely: Zami: A New Spelling of My Name by Audre Lorde and Bone Black: Memories of Girlhood by bell hooks. Aware of their matrilineal heritage, the protagonists find the power to acknowledge the conflictual selves or even the Otherness within themselves. Thus, it is argued that a key aspect of the postmodern Self is the continuous pursuit of wholeness and the reconciliation of its fragmented nature. In his contribution, Franck Colotte (Université Clermont Auvergne, France) explores the transfer of meaning and ideology from Balzac’s novel Illusions perdues to Xavier Giannoli’s film Lost Illusions. These shattered illusions belong to the young man from the provinces, Lucien Chardon, who is mesmerised, challenged, and finally defeated by the Parisian mirage in his quest for social recognition. Scholars often debate the extent to which an adaptation should remain faithful to the source material. Some adaptations aim for a high degree of fidelity, closely mirroring the original work, while others embrace transformation, interpreting the source in new and innovative ways. Frank Colotte delves into the techniques employed by Xavier Giannoli to structure the Balzacian narrative and to engage the viewers. By focusing the film on the central portion of the novel, which details Lucien’s time in Paris, and shifting the plot into the background while highlighting the interactions among characters, Giannoli effectively immerses the audience in the ruthless world of the press, outlining economic and social struggles, along with the relentless pursuit of social success. The study conducted by Anca Dobrinescu (Petroleum-Gas University of Ploiesti, Romania) critically explores Virginia Woolf’s multimodal techniques that she employs in her essay Three Guineas, including drawing, painting, photography, along with the literary ones, to demonstrate how the interplay of text and image enhances the impact and memorability of the conveyed message, thereby fostering a lasting effect on the reader. The article not only validates Woolf’s masterful experimentation across artistic boundaries, but also underscores her acute awareness of contemporary social issues, such as gender disparities, education, and war, all of which are explored as (dis)connections between the Public and the Private. 
“Otherness from a Chinese Perspective and Mo Yan’s Hallucinatory Realism” is an expository piece of writing, aimed at conveying the author’s preoccupations with the unnecessarily rigid understanding of “Otherness”, especially in the context of Chinese literary theory. Marius Virgil Florea (Shanghai International Studies University, China) creates a correlation between historical, cultural and geopolitical perspectives and their impact on literature in the perceived chasm of East and West. The paper starts with a detailed account of the rise and development of realism in China, a movement focused on topics such as society, morals, economy, and history. In order to highlight the connection between Chinese and Western literature, the author chooses to examine the Chinese critical reception of Nobel Prize laureate Mo Yan, whose work has elicited polarized interpretations. Mo Yan’s style, characterised by a combination of magic realism, modernist elements and influences from both traditional Chinese literature and Western literature, led to two divergent critical opinions, one viewing his work as a continuation of the great Chinese literary tradition with minimal foreign influence, and the other characterising it as an imitation of Western literature. Nevertheless, as Marius Virgil Florea points out, Mo Yan’s work serves as the most effective means to challenge the enduring myth of the incompatibility between West and East, demonstrating that the two cultures can coexist and mutually influence one another without contradiction. Loredana Netedu’s (Petroleum-Gas University of Ploiesti, Romania) contribution represents an excellent study of decoding the meaning in contemporary Romanian comic strips by means of a diligent semiotic analysis. Even though the extant literature refers to it in various terms, such as “hybrid genre” (Kaindl 2004), “graphic art” (Inge 1990), or “visual narrative” (Eisner 2008), all studies acknowledge the multimodal nature of the genre, with its dramatic qualities underscored by the dynamic action and the character-driven message delivery. The corpus consists of the comic strips produced by HAC! magazine, representing the reconstruction of one of the traditional Romanian fairy tale written by Ion Creangă, namely Povestea lui Harap Alb (The Story of the White Moor). Thus, HAC! stands for Harap Alb continuă (The White Moor Continues) and involves the transposition of the source text into a successful piece of fanfiction and a metacomic. The research shows how traditional Romanian values can be revived and brought to the attention of both the young and the old generations by using a modern and attractive form of communication, in which the visual and the verbal narratives intertwine. The analysis and interpretation of the data is thoroughly and vividly presented, while the rewriting of canonic texts is clearly explained. Advocating a sociolinguistic perspective complemented by Fillmore’s (1975) frame semantics and Langacker’s (1990) profile/base theory, Adina Oana Nicolae (Petroleum-Gas University of Ploiesti) investigates a small corpus of nominal pairs employed in British and American English, selected from several usage guidebooks. The study highlights the semantic differences that emerge between apparently synonymous lexical items, which, although profiling a shared concept, convey distinct meanings shaped by the cognitive framing influenced by cultural, social, or legal contexts. 
The analysis accounts for the way in which seemingly equivalent nominal phrases belonging to British and American English give prominence to various features of the same object or action, leading to their different conceptualisation and implicitly to different interpretations, as a result of the background knowledge they activate in the human cognitive domain. Therefore, such an approach to dialectal variation also underscores the mental images speakers project via the lexical choices they make, thereby revealing a wide range of ideas and experiences that shape our communication and the way we present ourselves to others. In their joint contribution, Irena Shehu (University College Beder, Albania) and Enkeleda Jata (Agricultural University of Tirana, Albania) argue that stylistic devices such as zeugma, puns, and oxymoron, combined with artistic elements like humour and media, contribute to the multimodal construction of identity in the classroom by engaging students emotionally and intellectually. These devices help create a learning environment where students not only develop language skills but they also express their identities. The integration of these elements allows students to relate new information to their own experiences, facilitating deeper connections with the content and fostering a sense of belonging in the classroom. This aligns with the broader theme of identity construction, as the multimodal approach enriches the learners’ self-expression and engagement. In her study, Ágnes Virág (Institute of Fine Arts and Art Theory, Eszterházy Károly Catholic University, Eger, Hungary) examines the visual representation of corruption in political cartoons, with a focus on metaphorical depiction of the European Union and the figure of the Hungarian Prime Minister Viktor Orbán. While research on corruption imagery remains limited, political cartoons frequently employ source domains such as poison, disease, and natural disasters to illustrate its destructive nature. A dataset of 57 Hungarian and 15 international cartoons was analysed, with 25 Hungarian and 14 international illustrations from 2012-2023 selected for their emphasis on corruption. The analysis reveals that Hungarian cartoons often depict Brussels, EU politicians, and the European People’s Party metonymically, portraying them as corrupt or threatening entities. In contrast, international cartoons tend to use official EU symbols such as the stars on a blue background, euro signs, and the EU flag. Key metaphors highlight that the EU is frequently represented as a human figure, appearing as a doctor, lion tamer, enemy, investor, or banker, depending on the cartoon’s political stance. While Hungarian illustrations emphasise a power struggle between Orbán and the EU, international cartoons focus more on financial themes, portraying the EU as a treasury or bank whose primary role is distributing or withholding funds. Through the analysis of these visual and narrative techniques, the study highlights how political cartoons reinforce ideological perspectives on corruption and European politics. This special issue concludes with a book review by Jana Bérešová (Trnava University, Slovakia) on Maria-Ionela Neagu’s (2020) edited volume Voyage and Emotions across Genres (Berlin: Peter Lang). 
While the first part of the volume – Voyage across Literary Studies – delves into the insightful journeys experienced by various characters, as depicted by Jonathan Swift, Sandra Cisneros, Flaubert, or Petronius, the second part of the volume – Space and Emotions. A Discursive Approach – adopts a cognitive, psychological, and/or educational perspective in order to explore a wide range of emotions that pervade the intercultural space. On account of the aforementioned, situated at the crossroads of cultural studies, sociolinguistics, and cognitive linguistics, all studies featured in the current special issue contribute original research on the multifaceted aspects of identity, construed and negotiated in the discursive space of literary texts, films, comic art, and political cartoons. The editor expresses her sincere gratitude to all contributors for their rigorously conducted scholarly work, which paves the way for new avenues of research.
42

Xu, Xing, Jialin Tian, Kaiyi Lin, Huimin Lu, Jie Shao, and Heng Tao Shen. "Zero-shot Cross-modal Retrieval by Assembling AutoEncoder and Generative Adversarial Network." ACM Transactions on Multimedia Computing, Communications, and Applications 17, no. 1s (2021): 1–17. http://dx.doi.org/10.1145/3424341.

Abstract
Conventional cross-modal retrieval models mainly assume the same scope of the classes for both the training set and the testing set. This assumption limits their extensibility on zero-shot cross-modal retrieval (ZS-CMR), where the testing set consists of unseen classes that are disjoint with seen classes in the training set. The ZS-CMR task is more challenging due to the heterogeneous distributions of different modalities and the semantic inconsistency between seen and unseen classes. A few of recently proposed approaches are inspired by zero-shot learning to estimate the distribution underlying multimodal data by generative models and make the knowledge transfer from seen classes to unseen classes by leveraging class embeddings. However, directly borrowing the idea from zero-shot learning (ZSL) is not fully adaptive to the retrieval task, since the core of the retrieval task is learning the common space. To address the above issues, we propose a novel approach named Assembling AutoEncoder and Generative Adversarial Network (AAEGAN), which combines the strength of AutoEncoder (AE) and Generative Adversarial Network (GAN), to jointly incorporate common latent space learning, knowledge transfer, and feature synthesis for ZS-CMR. Besides, instead of utilizing class embeddings as common space, the AAEGAN approach maps all multimodal data into a learned latent space with the distribution alignment via three coupled AEs. We empirically show the remarkable improvement for ZS-CMR task and establish the state-of-the-art or competitive performance on four image-text retrieval datasets.
43

Larin, Ilya, and Alexander Karabelsky. "Riemannian Manifolds for Biological Imaging Applications Based on Unsupervised Learning." Journal of Imaging 11, no. 4 (2025): 103. https://doi.org/10.3390/jimaging11040103.

Abstract
The development of neural networks has made the introduction of multimodal systems inevitable. Computer vision methods are still not widely used in biological research, despite their importance. It is time to recognize the significance of advances in feature extraction and real-time analysis of information from cells. Teacherless learning for the image clustering task is of great interest. In particular, the clustering of single cells is of great interest. This study will evaluate the feasibility of using latent representation and clustering of single cells in various applications in the fields of medicine and biotechnology. Of particular interest are embeddings, which relate to the morphological characterization of cells. Studies of C2C12 cells will reveal more about aspects of muscle differentiation by using neural networks. This work focuses on analyzing the applicability of the latent space to extract morphological features. Like many researchers in this field, we note that obtaining high-quality latent representations for phase-contrast or bright-field images opens new frontiers for creating large visual-language models. Graph structures are the main approaches to non-Euclidean manifolds. Graph-based segmentation has a long history, e.g., the normalized cuts algorithm treated segmentation as a graph partitioning problem—but only recently have such ideas merged with deep learning in an unsupervised manner. Recently, a number of works have shown the advantages of hyperbolic embeddings in vision tasks, including clustering and classification based on the Poincaré ball model. One area worth highlighting is unsupervised segmentation, which we believe is undervalued, particularly in the context of non-Euclidean spaces. In this approach, we aim to mark the beginning of our future work on integrating visual information and biological aspects of individual cells to multimodal space in comparative studies in vitro.
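Since the abstract points to hyperbolic embeddings on the Poincaré ball, the sketch below shows the standard Poincaré distance used to compare such embeddings; the example vectors are hypothetical cell embeddings, not data from the study.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincaré ball:
    d(u, v) = arccosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))."""
    uu = 1.0 - np.dot(u, u)
    vv = 1.0 - np.dot(v, v)
    duv = np.dot(u - v, u - v)
    return np.arccosh(1.0 + 2.0 * duv / max(uu * vv, eps))

# Two hypothetical cell embeddings inside the unit ball.
a = np.array([0.1, 0.2, -0.05])
b = np.array([0.6, -0.3, 0.4])
print(poincare_distance(a, b))
```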
44

Rakhi Madhukararao Joshi. "Enhancing Vehicle Tracking and Recognition Across Multiple Cameras with Multimodal Contrastive Domain Sharing GAN and Topological Embeddings." Panamerican Mathematical Journal 34, no. 1 (2024): 114–27. http://dx.doi.org/10.52783/pmj.v34.i1.910.

Abstract
Using Multimodal Contrastive Domain Sharing Generative Adversarial Networks (GANs) and topological embeddings, this study presents a new way to improve vehicle tracking and classification across multiple camera feeds. Differing camera angles and lighting conditions can prevent current vehicle tracking systems from working correctly; this study addresses these problems. Common Objects in Context (COCO) and ImageNet are the two datasets used for training. A Multimodal Contrastive Domain Sharing GAN is used for detection and tracking; it facilitates cross-modal learning by sharing information across different camera viewpoints. This framework lets the model learn shared representations, making it better at recognizing vehicles across a wider range of visual domains. The Topological Information Embedded Convolutional Neural Network (TIE-CNN) is then used to re-identify a vehicle after it has been detected and tracked. This network embeds vehicle trajectories into a continuous latent space, preserving the spatial relationships needed for accurate tracking. Experimental validation on real-world multi-camera datasets shows that tracking accuracy and recognition performance are substantially better than with standard methods. The proposed framework performs well in challenging situations such as occluded views and sudden lighting changes, showing that it is reliable in complex surveillance settings. This study advances multi-camera vehicle tracking and identification by combining geometric data analysis with deep learning methods. The approach uses the Multimodal Contrastive Domain Sharing GAN and topological embeddings to improve the temporal and spatial coherence of tracking results, and it sets the stage for future improvements in monitoring and self-driving systems.
45

Vijaya Kamble. "Design of an Iterative Method for Enhanced Multimodal Time Series Analysis Using Graph Attention Networks, Variational Graph Autoencoders, and Transfer Learning." Journal of Electrical Systems 20, no. 5s (2024): 2579–98. http://dx.doi.org/10.52783/jes.2699.

Abstract
In the ever-evolving landscape of data analysis, the need to efficiently and accurately interpret multimodal time series data has become paramount. Traditional methods often fall short in addressing the complex dependencies and dynamics inherent in such data, limiting their effectiveness in real-world applications. This work introduces a comprehensive approach that leverages Graph Attention Networks (GATs), Variational Graph Autoencoders (VGAEs), transfer learning with pretrained transformers, and Bayesian state-space models to overcome these limitations. GATs are selected for their ability to dynamically focus on relevant modalities through attention mechanisms, thereby capturing the intricate relationships between different data modalities. This method significantly enhances the model's ability to integrate multimodal information, leading to notable improvements in classification, prediction, and anomaly detection tasks. VGAEs are utilized to learn latent representations within a graph-based framework, promoting unsupervised learning while unveiling the underlying data structure. The resultant embeddings are pivotal for downstream tasks like clustering and visualization, encapsulating the interactions within multimodal time series data effectively. Furthermore, this work incorporates transfer learning with pretrained transformers to harness extensive knowledge from large datasets, adapting it to multimodal time series analysis. This strategy excels in capturing long-range dependencies, thereby augmenting generalization and performance in data-scarce scenarios. Bayesian state-space models are employed to elucidate the temporal dynamics and uncertainties of time series data, offering a robust framework for probabilistic inference and enhancing the interpretability and reliability of forecasting and anomaly detection. The efficacy of the proposed model is rigorously evaluated using diverse datasets, including the Yahoo! Stock Dataset, Forest Cover Dataset, and an empirical collection of 100k time series data samples. The results demonstrate a significant leap in performance metrics, including a 9.5% increase in precision, 8.5% boost in accuracy, 8.3% rise in recall, 10.4% reduction in delay, 9.4% enhancement in AUC, and a 5.9% improvement in specificity, alongside superior pre-emption capabilities compared to existing methods. This work not only addresses the pressing need for advanced multimodal time series analysis techniques but also sets a new benchmark for efficiency and accuracy. The integration of GATs, VGAEs, transfer learning with pretrained transformers, and Bayesian state-space models presents a formidable approach that significantly advances the field, offering profound impacts on a wide array of applications.
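The graph-attention component can be illustrated with a single-head GAT-style layer over a dense adjacency matrix, as sketched below; the node semantics (modalities or time steps) and dimensions are assumptions, and the full pipeline (VGAEs, pretrained transformers, Bayesian state-space models) is not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head graph attention layer over a dense adjacency matrix,
    showing how attention weights re-weight neighbouring nodes (here, nodes
    could represent modalities or time steps). Minimal sketch only."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)
        self.a = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x, adj):                       # x: (N, F), adj: (N, N) 0/1
        h = self.W(x)                                # (N, out_dim)
        n = h.size(0)
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1), 0.2)      # (N, N) raw scores
        e = e.masked_fill(adj == 0, float("-inf"))            # attend to neighbours only
        alpha = torch.softmax(e, dim=-1)
        return alpha @ h                             # attention-weighted aggregation

x = torch.randn(5, 16)
adj = torch.eye(5) + torch.diag(torch.ones(4), 1) + torch.diag(torch.ones(4), -1)
layer = GATLayer(16, 8)
print(layer(x, adj).shape)                            # torch.Size([5, 8])
```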
46

Adams, Brittany, Nance S. Wilson, and Gillian E. Mertens. "dmQAR: Mapping Metacognition in Digital Spaces onto Question–Answer Relationship." Education Sciences 15, no. 6 (2025): 751. https://doi.org/10.3390/educsci15060751.

Abstract
This paper proposes the Digital Metacognitive Question–Answer Relationship (dmQAR) Framework, an adaptation of traditional QAR models for the complexities of digital reading environments. In response to the nonlinear, multimodal, and algorithmically curated nature of online texts, the dmQAR Framework scaffolds purposeful metacognitive questioning to support comprehension, evaluation, and critical engagement. Drawing on research in metacognition, critical literacy, and digital reading, the framework reinterprets “Right There,” “Think and Search,” “Author and Me,” and “On My Own” question categories to align with the demands of digital spaces. Practical instructional strategies, including think-alouds, student-generated questioning, digital annotation, and reflection journals, are detailed to support implementation across diverse educational contexts. The paper emphasizes that developing self-regulated questioning is essential for fostering critical literacy and resisting surface-level engagement with digital texts. Implications for instruction highlight the need for explicit metacognitive scaffolding and equitable access to digital literacy tools. Future research directions include empirical validation of the framework’s impact on digital reading comprehension and exploration of developmental differences in metacognitive questioning practices. In an era of widespread misinformation and algorithmic bias, embedding metacognitive questioning into literacy education is vital for preparing students to navigate digital landscapes critically and reflectively.
47

Zhang, Rongchao, Yu Huang, Yiwei Lou, et al. "Exploit Your Latents: Coarse-Grained Protein Backmapping with Latent Diffusion Models." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 1 (2025): 1111–19. https://doi.org/10.1609/aaai.v39i1.32098.

Abstract
Coarse-grained (CG) molecular dynamics of proteins is a preferred approach to studying large molecules on extended time scales: the entire atomic model is condensed into a limited number of pseudo-atoms while the thermodynamic properties of the system are preserved. However, the significantly increased efficiency impedes the analysis of substantial physicochemical information, since high-resolution atomic details are sacrificed to accelerate simulation. In this paper, we propose LatCPB, a generative approach based on diffusion that enables high-resolution backmapping of CG proteins. Specifically, our model encodes an all-atom structure into discrete latent embeddings, aligned with learnable multimodal discrete priors to circumvent posterior collapse and maintain the discrete properties of the protein sequence. During generation, we further design a latent diffusion process within the continuous latent space to account for the potential stochasticity in the data. Moreover, LatCPB applies a contrastive learning strategy in the latent space to separate the feature representations of different molecules and of different conformations of the same molecule, thus enhancing the comprehension of molecular representational diversity. Experimental results demonstrate that LatCPB is able to backmap CG proteins effectively and achieves outstanding performance.
48

Han, Kezhen, Shaohang Lu, Zhengce Liu, and Zipeng Wang. "Active Fault Isolation for Multimode Fault Systems Based on a Set Separation Indicator." Entropy 25, no. 6 (2023): 876. http://dx.doi.org/10.3390/e25060876.

Abstract
This paper considers the active fault isolation problem for a class of uncertain multimode fault systems with a high-dimensional state-space model. It has been observed that the existing approaches in the literature based on a steady-state active fault isolation method are often accompanied by a large delay in making the correct isolation decision. To reduce such fault isolation latency significantly, this paper proposes a fast online active fault isolation method based on the construction of residual transient-state reachable set and transient-state separating hyperplane. The novelty and benefit of this strategy lies in the embedding of a new component called the set separation indicator, which is designed offline to distinguish the residual transient-state reachable sets of different system configurations at any given moment. Based on the results delivered by the set separation indicator, one can determine the specific moments at which the deterministic isolation is to be implemented during online diagnostics. Meanwhile, some alternative constant inputs can also be evaluated for isolation effects to determine better auxiliary excitation signals with smaller amplitudes and more differentiated separating hyperplanes. The validity of these results is verified by both a numerical comparison and an FPGA-in-loop experiment.
49

Weiner, Pascal, Caterina Neef, Yoshihisa Shibata, Yoshihiko Nakamura, and Tamim Asfour. "An Embedded, Multi-Modal Sensor System for Scalable Robotic and Prosthetic Hand Fingers." Sensors 20, no. 1 (2019): 101. http://dx.doi.org/10.3390/s20010101.

Abstract
Grasping and manipulation with anthropomorphic robotic and prosthetic hands presents a scientific challenge regarding mechanical design, sensor system, and control. Apart from the mechanical design of such hands, embedding sensors needed for closed-loop control of grasping tasks remains a hard problem due to limited space and required high level of integration of different components. In this paper we present a scalable design model of artificial fingers, which combines mechanical design and embedded electronics with a sophisticated multi-modal sensor system consisting of sensors for sensing normal and shear force, distance, acceleration, temperature, and joint angles. The design is fully parametric, allowing automated scaling of the fingers to arbitrary dimensions in the human hand spectrum. To this end, the electronic parts are composed of interchangeable modules that facilitate the mechanical scaling of the fingers and are fully enclosed by the mechanical parts of the finger. The resulting design model allows deriving freely scalable and multimodally sensorised fingers for robotic and prosthetic hands. Four physical demonstrators are assembled and tested to evaluate the approach.
50

Du, Zhicheng, Hui-Yan Luo, Lijin Lian, Vijay Kumar Pandey, Jiansong Ji, and Peiwu Qin. "Abstract A049: Development of a tri-modal contrast learning model integrating pathology-text and CT-text for clinical oncology tasks." Clinical Cancer Research 31, no. 13_Supplement (2025): A049. https://doi.org/10.1158/1557-3265.aimachine-a049.

Abstract
Abstract This study aims to develop a path-image-text tri-modal representation learning framework (Tri-MCR) without paired data by integrating pathology-text and CT-text models to improve the performance of clinical cancer tasks. Aiming at the challenge of scarce multi-modal data pairing in the cancer field, we project the pre-trained pathology-text and CT-text models into a shared semantic space by using text (i.e., Electronic Health Record) as an intermediate modality and optimize the cross-modal alignment by using a semantic enhancement strategy. Tri-MCR projects pre-trained pathology-text and CT-text models into a shared semantic space, leveraging text as an intermediary modality. Cross-modal alignment is optimized via a semantic enhancement strategy involving two key components: (1) Dynamic Semantic Enhancement: Gaussian noise injection and cross-modal attention mechanisms are employed to dynamically aggregate text-guided pathology and radiology features, thereby enhancing the semantic integrity of the embeddings. (2) Dual-Alignment Strategy: Cross-modal contrastive loss enforces semantic consistency between pathology-text and CT-text representations within the shared space, while intra-modal contrastive loss mitigates representation shifts between pathology and CT modalities. Furthermore, an Interpretability Mapping technique visualizes pathology-CT-text semantic associations through a cross-modal similarity matrix, offering biological insights for clinical decision-making. In the validation of the colorectal cancer single-center cohort (N=1015) of the ChangKang (Healthy Bowel) project, Tri-MCR significantly outperforms the baseline model in tumor biomarker prediction and mortality risk prediction tasks. This method provides a new idea for efficient representation learning of tumor multimodal data, provides an interpretable multimodal analysis tool for tumor precision diagnosis and treatment, and reduces the dependence on large-scale paired data, which has important clinical application value. In the future, further model multi-task performance evaluation and reliability verification in multi-center and multi-cancer cohorts are considered. Citation Format: Zhicheng Du, Hui-Yan Luo, Lijin Lian, Vijay Kumar. Pandey, Jiansong Ji, Peiwu Qin. Development of a tri-modal contrast learning model integrating pathology-text and CT-text for clinical oncology tasks [abstract]. In: Proceedings of the AACR Special Conference in Cancer Research: Artificial Intelligence and Machine Learning; 2025 Jul 10-12; Montreal, QC, Canada. Philadelphia (PA): AACR; Clin Cancer Res 2025;31(13_Suppl):Abstract nr A049.