Journal articles on the topic 'Multimodal embedding space'

Author: Grafiati

Published: 25 May 2024

Consult the top 50 journal articles for your research on the topic 'Multimodal embedding space.'

1

Tyshchuk, Kirill, Polina Karpikova, Andrew Spiridonov, Anastasiia Prutianova, Anton Razzhigaev, and Alexander Panchenko. "On Isotropy of Multimodal Embeddings." Information 14, no. 7 (2023): 392. http://dx.doi.org/10.3390/info14070392.

Abstract:

Embeddings, i.e., vector representations of objects, such as texts, images, or graphs, play a key role in deep learning methodologies nowadays. Prior research has shown the importance of analyzing the isotropy of textual embeddings for transformer-based text encoders, such as the BERT model. Anisotropic word embeddings do not use the entire space, instead concentrating on a narrow cone in such a pretrained vector space, negatively affecting the performance of applications, such as textual semantic similarity. Transforming a vector space to optimize isotropy has been shown to be beneficial for improving performance in text processing tasks. This paper is the first comprehensive investigation of the distribution of multimodal embeddings using the example of OpenAI’s CLIP pretrained model. We aimed to deepen the understanding of the embedding space of multimodal embeddings, which has previously been unexplored in this respect, and study the impact on various end tasks. Our initial efforts were focused on measuring the alignment of image and text embedding distributions, with an emphasis on their isotropic properties. In addition, we evaluated several gradient-free approaches to enhance these properties, establishing their efficiency in improving the isotropy/alignment of the embeddings and, in certain cases, the zero-shot classification accuracy. Significantly, our analysis revealed that both CLIP and BERT models yielded embeddings situated within a cone immediately after initialization and preceding training. However, they were mostly isotropic in the local sense. We further extended our investigation to the structure of multilingual CLIP text embeddings, confirming that the observed characteristics were language-independent. By computing the few-shot classification accuracy and point-cloud metrics, we provide evidence of a strong correlation among multilingual embeddings. 
Transforming embeddings with the methods described in this article makes them easier to visualize. At the same time, multiple experiments that we conducted showed that, for the transformed embeddings, downstream task performance does not drop substantially (and sometimes even improves). This means that one can obtain an easily visualizable embedding space without substantially losing quality on downstream tasks.
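
The cone effect described in this abstract can be illustrated with a small sketch. It assumes mean pairwise cosine similarity as the anisotropy measure and mean-centering as one simple gradient-free transformation; the abstract does not name the specific measures or methods the authors evaluated, so both choices, along with the function names and toy vectors, are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs.
    Values near 1 suggest the vectors sit in a narrow cone."""
    total, pairs = 0.0, 0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            total += cosine(vectors[i], vectors[j])
            pairs += 1
    return total / pairs

def center(vectors):
    """Subtract the mean vector (an assumed, gradient-free transformation)."""
    dim = len(vectors[0])
    mean = [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]
    return [[v[k] - mean[k] for k in range(dim)] for v in vectors]

# Hypothetical toy embeddings clustered around the direction (10, 10).
toy = [[10.0, 11.0], [11.0, 10.0], [9.0, 10.0], [10.0, 9.0]]
before = mean_pairwise_cosine(toy)
after = mean_pairwise_cosine(center(toy))
```

On this toy data the vectors are almost parallel before centering and spread across all directions afterwards, which is the qualitative change an isotropy-improving transformation aims for.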

2

Mai, Sijie, Haifeng Hu, and Songlong Xing. "Modality to Modality Translation: An Adversarial Representation Learning and Graph Fusion Network for Multimodal Fusion." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 01 (2020): 164–72. http://dx.doi.org/10.1609/aaai.v34i01.5347.

Abstract:

Learning a joint embedding space for various modalities is of vital importance for multimodal fusion. Mainstream modality fusion approaches fail to achieve this goal, leaving a modality gap which heavily affects cross-modal fusion. In this paper, we propose a novel adversarial encoder-decoder-classifier framework to learn a modality-invariant embedding space. Since the distributions of various modalities vary in nature, to reduce the modality gap, we translate the distributions of source modalities into that of the target modality via their respective encoders using adversarial training. Furthermore, we exert additional constraints on the embedding space by introducing a reconstruction loss and a classification loss. Then we fuse the encoded representations using a hierarchical graph neural network which explicitly explores unimodal, bimodal and trimodal interactions in multiple stages. Our method achieves state-of-the-art performance on multiple datasets. Visualization of the learned embeddings suggests that the joint embedding space learned by our method is discriminative.

3

Zhang, Linhai, Deyu Zhou, Yulan He, and Zeng Yang. "MERL: Multimodal Event Representation Learning in Heterogeneous Embedding Spaces." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 16 (2021): 14420–27. http://dx.doi.org/10.1609/aaai.v35i16.17695.

Abstract:

Previous work has shown the effectiveness of using event representations for tasks such as script event prediction and stock market prediction. It is however still challenging to learn the subtle semantic differences between events based solely on textual descriptions of events often represented as (subject, predicate, object) triples. As an alternative, images offer a more intuitive way of understanding event semantics. We observe that events described in text and in images show different abstraction levels and therefore should be projected onto heterogeneous embedding spaces, as opposed to what has been done in previous approaches, which project signals from different modalities onto a homogeneous space. In this paper, we propose a Multimodal Event Representation Learning framework (MERL) to learn event representations based on both text and image modalities simultaneously. Event textual triples are projected as Gaussian density embeddings by a dual-path Gaussian triple encoder, while event images are projected as point embeddings by a visual event component-aware image encoder. Moreover, a novel score function motivated by statistical hypothesis testing is introduced to coordinate the two embedding spaces. Experiments are conducted on various multimodal event-related tasks and results show that MERL outperforms a number of unimodal and multimodal baselines, demonstrating the effectiveness of the proposed framework.

4

Guo, Zhiqiang, Jianjun Li, Guohui Li, Chaoyang Wang, Si Shi, and Bin Ruan. "LGMRec: Local and Global Graph Learning for Multimodal Recommendation." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 8 (2024): 8454–62. http://dx.doi.org/10.1609/aaai.v38i8.28688.

Abstract:

Multimodal recommendation has gradually become the infrastructure of online media platforms, enabling them to provide personalized service to users through joint modeling of users' historical behaviors (e.g., purchases, clicks) and various item modalities (e.g., visual and textual). The majority of existing studies typically focus on utilizing modal features or modal-related graph structure to learn user local interests. Nevertheless, these approaches encounter two limitations: (1) Shared updates of user ID embeddings result in the consequential coupling between collaboration and multimodal signals; (2) Lack of exploration into robust global user interests to alleviate the sparse interaction problems faced by local interest modeling. To address these issues, we propose a novel Local and Global Graph Learning-guided Multimodal Recommender (LGMRec), which jointly models local and global user interests. Specifically, we present a local graph embedding module to independently learn collaborative-related and modality-related embeddings of users and items with local topological relations. Moreover, a global hypergraph embedding module is designed to capture global user and item embeddings by modeling insightful global dependency relations. The global embeddings acquired within the hypergraph embedding space can then be combined with two decoupled local embeddings to improve the accuracy and robustness of recommendations. Extensive experiments conducted on three benchmark datasets demonstrate the superiority of our LGMRec over various state-of-the-art recommendation baselines, showcasing its effectiveness in modeling both local and global user interests.

5

Moon, Jucheol, Nhat Anh Le, Nelson Hebert Minaya, and Sang-Il Choi. "Multimodal Few-Shot Learning for Gait Recognition." Applied Sciences 10, no. 21 (2020): 7619. http://dx.doi.org/10.3390/app10217619.

Abstract:

A person’s gait is a behavioral trait that is uniquely associated with each individual and can be used to recognize the person. As information about the human gait can be captured by wearable devices, a few studies have led to the proposal of methods to process gait information for identification purposes. Despite recent advances in gait recognition, an open set gait recognition problem presents challenges to current approaches. To address the open set gait recognition problem, a system should be able to deal with unseen subjects who are not included in the training dataset. In this paper, we propose a system that learns a mapping from a multimodal time series collected using an insole to a latent (embedding vector) space to address the open set gait recognition problem. The distance between two embedding vectors in the latent space corresponds to the similarity between two multimodal time series. Using the characteristics of the human gait pattern, multimodal time series are sliced into unit steps. The system maps unit steps to embedding vectors using an ensemble consisting of a convolutional neural network and a recurrent neural network. To recognize each individual, the system learns a decision function using a one-class support vector machine from a few embedding vectors of the person in the latent space; the system then determines whether an unknown unit step is recognized as belonging to a known individual. Our experiments demonstrate that the proposed framework recognizes individuals with high accuracy regardless of whether they have been registered or not. If we had an environment in which all people were wearing the insole, the framework could be used widely for user verification.
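
The open-set decision step can be sketched as follows. The abstract states that the authors use a one-class support vector machine; to stay dependency-free, this sketch swaps in a much simpler stand-in (a centroid plus a distance threshold), so it illustrates only the accept/reject idea, not the authors' classifier. The function names, the `margin` parameter, and the toy vectors are all hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def fit_enrollment(vectors, margin=1.5):
    """Build a per-person model from a few enrolled embeddings.
    Simplified stand-in for a one-class SVM: accept anything within
    `margin` times the largest enrolled distance from the centroid."""
    c = centroid(vectors)
    radius = margin * max(euclidean(v, c) for v in vectors)
    return c, radius

def accepts(model, vector):
    """True if the embedding falls inside the person's accepted region."""
    c, radius = model
    return euclidean(vector, c) <= radius

# Hypothetical embeddings of three unit steps from one enrolled person.
enrolled = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]
model = fit_enrollment(enrolled)
known_step = accepts(model, [1.1, 1.0])     # close to the enrolled steps
unknown_step = accepts(model, [3.0, 3.0])   # far away: unseen subject
```

Because each person gets an independent model, a step that no model accepts can be rejected as coming from an unregistered subject, which is what makes the setting open-set.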

6

Zhang, Rongchao, Yiwei Lou, Dexuan Xu, Yongzhi Cao, Hanpin Wang, and Yu Huang. "A Learnable Discrete-Prior Fusion Autoencoder with Contrastive Learning for Tabular Data Synthesis." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 15 (2024): 16803–11. http://dx.doi.org/10.1609/aaai.v38i15.29621.

Abstract:

The actual collection of tabular data for sharing involves confidentiality and privacy constraints, leaving the potential risks of machine learning for interventional data analysis unsafely averted. Synthetic data has emerged recently as a privacy-protecting solution to address this challenge. However, existing approaches regard discrete and continuous modal features as separate entities, thus falling short in properly capturing their inherent correlations. In this paper, we propose a novel contrastive learning guided Gaussian Transformer autoencoder, termed GTCoder, to synthesize photo-realistic multimodal tabular data for scientific research. Our approach introduces a transformer-based fusion module that seamlessly integrates multimodal features, permitting the mining of more informative latent representations. The attention within the fusion module directs the integrated output features to focus on critical components that facilitate the task of generating latent embeddings. Moreover, we formulate a contrastive learning strategy to implicitly constrain the embeddings from discrete features in the latent feature space by pulling similar discrete feature distributions closer while pushing dissimilar ones further away, in order to enhance the representation of the latent embedding. Experimental results indicate that GTCoder is effective at generating photo-realistic synthetic data, with interactive interpretation of the latent embedding, and performs favorably against some baselines on most real-world and simulated datasets.

7

Merkx, Danny, and Stefan L. Frank. "Learning semantic sentence representations from visually grounded language without lexical knowledge." Natural Language Engineering 25, no. 4 (2019): 451–66. http://dx.doi.org/10.1017/s1351324919000196.

Abstract:

Current approaches to learning semantic representations of sentences often use prior word-level knowledge. The current study aims to leverage visual information in order to capture sentence level semantics without the need for word embeddings. We use a multimodal sentence encoder trained on a corpus of images with matching text captions to produce visually grounded sentence embeddings. Deep Neural Networks are trained to map the two modalities to a common embedding space such that for an image the corresponding caption can be retrieved and vice versa. We show that our model achieves results comparable to the current state of the art on two popular image-caption retrieval benchmark datasets: Microsoft Common Objects in Context (MSCOCO) and Flickr8k. We evaluate the semantic content of the resulting sentence embeddings using the data from the Semantic Textual Similarity (STS) benchmark task and show that the multimodal embeddings correlate well with human semantic similarity judgements. The system achieves state-of-the-art results on several of these benchmarks, which shows that a system trained solely on multimodal data, without assuming any word representations, is able to capture sentence level semantics. Importantly, this result shows that we do not need prior knowledge of lexical level semantics in order to model sentence level semantics. These findings demonstrate the importance of visual information in semantics.
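
Image-caption retrieval in a common embedding space is commonly scored with recall@k: the fraction of images whose matching caption appears among the k most similar captions. The sketch below assumes cosine similarity as the ranking score and index-aligned image/caption pairs; the abstract does not state the exact metric or similarity used, and the function names and toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def recall_at_k(image_embs, caption_embs, k):
    """Fraction of images whose own caption (same index) is ranked
    among the k captions most similar to the image embedding."""
    hits = 0
    for i, image in enumerate(image_embs):
        ranked = sorted(
            range(len(caption_embs)),
            key=lambda j: cosine(image, caption_embs[j]),
            reverse=True,
        )
        if i in ranked[:k]:
            hits += 1
    return hits / len(image_embs)

# Hypothetical embeddings; image i is paired with caption i.
images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
captions = [[0.9, 0.1], [0.2, 0.8], [0.0, 1.0]]
r1 = recall_at_k(images, captions, k=1)
r3 = recall_at_k(images, captions, k=3)
```

Caption-to-image retrieval ("vice versa" in the abstract) uses the same function with the two argument lists swapped.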

8

Fan, Yunpeng, Wenyou Du, Yingwei Zhang, and Xiaogang Wang. "Fault Detection for Multimodal Process Using Quality-Relevant Kernel Neighborhood Preserving Embedding." Mathematical Problems in Engineering 2015 (2015): 1–15. http://dx.doi.org/10.1155/2015/210125.

Abstract:

A new method named quality-relevant kernel neighborhood preserving embedding (QKNPE) has been proposed. Quality variables have been considered for the first time in the kernel neighborhood preserving embedding (KNPE) method for monitoring multimodal processes. In summary, the whole algorithm is a two-step process: first, to improve the manifold structure and to deal with the multimodal nonlinearity problem, the neighborhood preserving embedding technique is introduced; and second, to monitor the complete production process, the product quality variables are added to the objective function. Compared with the conventional monitoring method, the proposed method has the following advantages: (1) the hidden manifold, which is related to the character of the industrial process, has been embedded into a low-dimensional space, and the identifying information of the different modes of the monitored system has been extracted; (2) product quality, as an important factor, has been considered for the first time in a manifold method. In the experiment section, we applied this method to the electrofused magnesia furnace (EFMF) process, which is a representative case study. The experimental results show the effectiveness of the proposed method.

9

Ota, Kosuke, Keiichiro Shirai, Hidetoshi Miyao, and Minoru Maruyama. "Multimodal Analogy-Based Image Retrieval by Improving Semantic Embeddings." Journal of Advanced Computational Intelligence and Intelligent Informatics 26, no. 6 (2022): 995–1003. http://dx.doi.org/10.20965/jaciii.2022.p0995.

Abstract:

In this work, we study the application of multimodal analogical reasoning to image retrieval. Multimodal analogy questions are given in the form of tuples of words and images, e.g., “cat”:“dog”::[an image of a cat sitting on a bench]:?, to search for an image of a dog sitting on a bench. Retrieving desired images given these tuples can be seen as a task of finding images whose relation to the query image is close to that between the query words. One way to achieve the task is building a common vector space that exhibits analogical regularities. To learn such an embedding, we propose a quadruple neural network called the multimodal siamese network. The network consists of recurrent neural networks and convolutional neural networks based on the siamese architecture. We also introduce an effective procedure to generate analogy examples from an image-caption dataset for training our network. In our experiments, we test our model on analogy-based image retrieval tasks. The results show that our method outperforms the previous work in qualitative evaluation.
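
If a common vector space does exhibit analogical regularities, the “cat”:“dog”::image:? query can be answered with vector arithmetic followed by a nearest-neighbor search. The sketch below shows that reading only; it is not the authors' trained network, and the three hand-made dimensions, embeddings, and function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy_query(word_a, word_b, image_a):
    """Shift the query image by the word offset: image_a + (word_b - word_a)."""
    return [i + (b - a) for a, b, i in zip(word_a, word_b, image_a)]

def retrieve(query, gallery):
    """Name of the gallery image whose embedding is most similar to the query."""
    return max(gallery, key=lambda name: cosine(query, gallery[name]))

# Hypothetical shared space with dimensions [cat-ness, dog-ness, bench-ness].
words = {"cat": [1.0, 0.0, 0.0], "dog": [0.0, 1.0, 0.0]}
cat_on_bench = [1.0, 0.0, 1.0]
gallery = {
    "dog_on_bench": [0.0, 1.0, 1.0],
    "dog_on_grass": [0.0, 1.0, 0.0],
    "cat_on_sofa": [1.0, 0.0, 0.2],
}
query = analogy_query(words["cat"], words["dog"], cat_on_bench)
best = retrieve(query, gallery)
```

In this toy space the word offset replaces the "cat" component with "dog" while leaving the "bench" component of the image untouched, so the nearest gallery item is the dog on a bench.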

10

Kim, Jongseok, Youngjae Yu, Hoeseong Kim, and Gunhee Kim. "Dual Compositional Learning in Interactive Image Retrieval." Proceedings of the AAAI Conference on Artificial Intelligence 35, no. 2 (2021): 1771–79. http://dx.doi.org/10.1609/aaai.v35i2.16271.

Abstract:

We present an approach named Dual Composition Network (DCNet) for interactive image retrieval that searches for the best target image for a natural language query and a reference image. To accomplish this task, existing methods have focused on learning a composite representation of the reference image and the text query to be as close to the embedding of the target image as possible. We refer to this approach as Composition Network. In this work, we propose to close the loop with Correction Network, which models the difference between the reference and target image in the embedding space and matches it with the embedding of the text query. That is, we consider two cyclic directional mappings for triplets of (reference image, text query, target image) by using both Composition Network and Correction Network. We also propose a joint training loss that can further improve the robustness of multimodal representation learning. We evaluate the proposed model on three benchmark datasets for multimodal retrieval: Fashion-IQ, Shoes, and Fashion200K. Our experiments show that our DCNet achieves new state-of-the-art performance on all three datasets, and the addition of Correction Network consistently improves multiple existing methods that are solely based on Composition Network. Moreover, an ensemble of our model won first place in the Fashion-IQ 2020 challenge held in a CVPR 2020 workshop.

11

Abiyev, Rahib H., Mohamad Ziad Altabel, Manal Darwish, and Abdulkader Helwan. "A Multimodal Transformer Model for Recognition of Images from Complex Laparoscopic Surgical Videos." Diagnostics 14, no. 7 (2024): 681. http://dx.doi.org/10.3390/diagnostics14070681.

Abstract:

The determination of the potential role and advantages of artificial intelligence-based models in the field of surgery remains uncertain. This research marks an initial stride towards creating a multimodal model, inspired by the Video-Audio-Text Transformer, that aims to reduce negative occurrences and enhance patient safety. The model employs text and image embedding state-of-the-art models (ViT and BERT) to assess their efficacy in extracting the hidden and distinct features from the surgery video frames. These features are then used as inputs for convolution-free Transformer architectures to extract comprehensive multidimensional representations. A joint space is then used to combine the text and image features extracted from both Transformer encoders. This joint space ensures that the relationships between the different modalities are preserved during the combination process. The entire model was trained and tested on laparoscopic cholecystectomy (LC) videos encompassing various levels of complexity. Experimentally, a mean accuracy of 91.0%, a precision of 81%, and a recall of 83% were reached by the model when tested on 30 videos out of 80 from the Cholec80 dataset.

12

Skantze, Gabriel, and Bram Willemsen. "CoLLIE: Continual Learning of Language Grounding from Language-Image Embeddings." Journal of Artificial Intelligence Research 74 (July 9, 2022): 1201–23. http://dx.doi.org/10.1613/jair.1.13689.

Abstract:

This paper presents CoLLIE: a simple, yet effective model for continual learning of how language is grounded in vision. Given a pre-trained multimodal embedding model, where language and images are projected in the same semantic space (in this case CLIP by OpenAI), CoLLIE learns a transformation function that adjusts the language embeddings when needed to accommodate new language use. This is done by predicting the difference vector that needs to be applied, as well as a scaling factor for this vector, so that the adjustment is only applied when needed. Unlike traditional few-shot learning, the model does not just learn new classes and labels, but can also generalize to similar language use and leverage semantic compositionality. We verify the model’s performance on two different tasks of identifying the targets of referring expressions, where it has to learn new language use. The results show that the model can efficiently learn and generalize from only a few examples, with little interference with the model’s original zero-shot performance.
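
The adjustment described here, a predicted difference vector applied with a predicted scaling factor, amounts to a gated vector addition. In CoLLIE both quantities are predicted by learned models; in the minimal sketch below they are passed in as plain arguments, and the function name and toy values are hypothetical.

```python
def adjust_embedding(text_embedding, difference, scale):
    """Return text_embedding + scale * difference.
    A scale near 0 leaves the original embedding (and hence the
    zero-shot behaviour) untouched; a scale near 1 applies the full shift."""
    return [e + scale * d for e, d in zip(text_embedding, difference)]

# Hypothetical language embedding and predicted difference vector.
original = [0.2, 0.5, 0.1]
difference = [0.3, -0.1, 0.0]
unchanged = adjust_embedding(original, difference, scale=0.0)
shifted = adjust_embedding(original, difference, scale=1.0)
```

Keeping the scale separate from the difference vector is what lets the adjustment be applied only when needed, which is consistent with the reported low interference with zero-shot performance.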

13

Zhang, Linhao, Li Jin, Xian Sun, et al. "TOT: Topology-Aware Optimal Transport for Multimodal Hate Detection." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 4 (2023): 4884–92. http://dx.doi.org/10.1609/aaai.v37i4.25614.

Abstract:

Multimodal hate detection, which aims to identify harmful online content such as memes, is crucial for building a wholesome internet environment. Previous work has made enlightening explorations in detecting explicit hate remarks. However, most of these approaches neglect the analysis of implicit harm, which is particularly challenging as explicit text markers and demographic visual cues are often twisted or missing. The cross-modal attention mechanisms they leverage also suffer from the distributional modality gap and lack logical interpretability. To address these semantic gap issues, we propose TOT: a topology-aware optimal transport framework to decipher the implicit harm in the meme scenario, which formulates the cross-modal aligning problem as solutions for optimal transportation plans. Specifically, we leverage an optimal transport kernel method to capture complementary information from multiple modalities. The kernel embedding provides a non-linear transformation into a reproducing kernel Hilbert space (RKHS), which is significant for eliminating the distributional modality gap. Moreover, we perceive the topology information based on aligned representations to conduct bipartite graph path reasoning. The newly achieved state-of-the-art performance on two publicly available benchmark datasets, together with further visual analysis, demonstrates the superiority of TOT in capturing implicit cross-modal alignment.

14

Liang, Meiyu, Junping Du, Zhengyang Liang, Yongwang Xing, Wei Huang, and Zhe Xue. "Self-Supervised Multi-Modal Knowledge Graph Contrastive Hashing for Cross-Modal Search." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 12 (2024): 13744–53. http://dx.doi.org/10.1609/aaai.v38i12.29280.

Abstract:

Deep cross-modal hashing technology provides an effective and efficient cross-modal unified representation learning solution for cross-modal search. However, the existing methods neglect the implicit fine-grained multimodal knowledge relations between these modalities, such as when the image contains information that is not directly described in the text. To tackle this problem, we propose a novel self-supervised multi-grained multi-modal knowledge graph contrastive hashing method for cross-modal search (CMGCH). Firstly, in order to capture implicit fine-grained cross-modal semantic associations, a multi-modal knowledge graph is constructed, which represents the implicit multimodal knowledge relations between the image and text as inter-modal and intra-modal semantic associations. Secondly, a cross-modal graph contrastive attention network is proposed to reason on the multi-modal knowledge graph to sufficiently learn the implicit fine-grained inter-modal and intra-modal knowledge relations. Thirdly, a cross-modal multi-granularity contrastive embedding learning mechanism is proposed, which fuses the global coarse-grained and local fine-grained embeddings by a multi-head attention mechanism for inter-modal and intra-modal contrastive learning, so as to enhance the cross-modal unified representations with stronger discriminativeness and semantic consistency preserving power. With the joint training of intra-modal and inter-modal contrast, the invariant and modal-specific information of different modalities can be maintained in the final unified cross-modal hash space. Extensive experiments on several cross-modal benchmark datasets demonstrate that the proposed CMGCH outperforms the state-of-the-art methods.

15

Zhang, Yachao, Runze Hu, Ronghui Li, Yanyun Qu, Yuan Xie, and Xiu Li. "Cross-Modal Match for Language Conditioned 3D Object Grounding." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 7 (2024): 7359–67. http://dx.doi.org/10.1609/aaai.v38i7.28566.

Abstract:

Language conditioned 3D object grounding aims to find the object within the 3D scene mentioned by natural language descriptions, which mainly depends on the matching between visual and natural language. Considerable improvement in grounding performance is achieved by improving the multimodal fusion mechanism or bridging the gap between detection and matching. However, several mismatches are ignored, i.e., the mismatch between local visual representation and global sentence representation, and the mismatch between visual space and the corresponding label word space. In this paper, we propose crossmodal match for 3D grounding from the perspective of mitigating these mismatches. Specifically, to match local visual features with the global description sentence, we propose a BEV (Bird’s-eye-view) based global information embedding module. It projects multiple object proposal features into the BEV, and the relations of different objects are accessed by the visual transformer, which can model both positions and features with long-range dependencies. To circumvent the mismatch in feature spaces of different modalities, we propose crossmodal consistency learning. It applies cross-modal consistency constraints to convert the visual feature space into the label word feature space, resulting in easier matching. Besides, we introduce a label distillation loss and a global distillation loss to drive the learning of these matches in a distillation way. We evaluate our method in mainstream evaluation settings on three datasets, and the results demonstrate the effectiveness of the proposed method.

16

Akalya, Devi C., Renuka D. Karthika, T. Harisudhan, V. K. Jeevanantham, J. Jhanani, and Varshini S. Kavi. "Text emotion recognition using fast text word embedding in bi-directional gated recurrent unit." i-manager's Journal on Information Technology 11, no. 4 (2022): 1. http://dx.doi.org/10.26634/jit.11.4.19119.

Abstract:

Emotions are states of readiness in the mind that result from evaluations of one's own thinking or events. Although almost all of the important events in our lives are marked by emotions, the nature, causes, and effects of emotions are some of the least understood parts of the human experience. Emotion recognition is playing a promising role in the domains of human-computer interaction and artificial intelligence. A human's emotions can be detected using a variety of methods, including facial gestures, blood pressure, body movements, heart rate, and textual data. From an application standpoint, the ability to identify human emotions in text is becoming more and more crucial in computational linguistics. In this work, we present a classification methodology based on deep neural networks. The Bi-directional Gated Recurrent Unit (Bi-GRU) employed here demonstrates its effectiveness on the Multimodal Emotion Lines Dataset (MELD) when compared to Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM). For word encoding, a comparison of three pre-trained word embeddings namely Glove, Word2Vec, and fastText is made. The findings from the MELD corpus support the conclusion that fastText is the best word embedding for the proposed Bi-GRU model. The experiment utilized the "glove.6B.300d" vector space. It consists of two million word representations in 300 dimensions trained on Common Crawl with sub-word information (600 billion tokens). The accuracy scores of GloVe, Word2Vec, and fastText (300 dimensions each) are tabulated and studied in order to highlight the improved results with fastText on the MELD dataset tested. It is observed that the Bidirectional Gated Recurrent Unit (Bi-GRU) with fastText word embedding outperforms GloVe and Word2Vec with an accuracy of 79.7%.

17

Hnini, Ghizlane, Jamal Riffi, Mohamed Adnane Mahraz, Ali Yahyaouy, and Hamid Tairi. "MMPC-RF: A Deep Multimodal Feature-Level Fusion Architecture for Hybrid Spam E-mail Detection." Applied Sciences 11, no. 24 (2021): 11968. http://dx.doi.org/10.3390/app112411968.

Abstract:

Hybrid spam is an undesirable e-mail (electronic mail) that contains both image and text parts. It is more harmful and complex as compared to image-based and text-based spam e-mail. Thus, an efficient and intelligent approach is required to distinguish between spam and ham. To our knowledge, a small number of studies have been aimed at detecting hybrid spam e-mails. Most of these multimodal architectures adopted the decision-level fusion method, whereby the classification scores of each modality were concatenated and fed to another classification model to make a final decision. Unfortunately, this method not only demands many learning steps, but it also loses correlation in mixed feature space. In this paper, we propose a deep multimodal feature-level fusion architecture that concatenates two embedding vectors to have a strong representation of e-mails and increase the performance of the classification. The paragraph vector distributed bag of words (PV-DBOW) and the convolutional neural network (CNN) were used as feature extraction techniques for text and image parts, respectively, of the same e-mail. The extracted feature vectors were concatenated and fed to the random forest (RF) model to classify a hybrid e-mail as either spam or ham. The experiments were conducted on three hybrid datasets made using three publicly available corpora: Enron, Dredze, and TREC 2007. According to the obtained results, the proposed model provides a higher accuracy of 99.16% compared to recent state-of-the-art methods.

18

Wang, Kaijie, Tiejun Wang, Xiaoran Guo, Kui Xu, and Jiao Wu. "Thangka Image—Text Matching Based on Adaptive Pooling Layer and Improved Transformer." Applied Sciences 14, no. 2 (2024): 807. http://dx.doi.org/10.3390/app14020807.

Abstract:

Image–text matching is a research hotspot in the multimodal task of integrating image and text processing. In order to solve the difficult problem of associating image and text data in the multimodal knowledge graph of Thangka, we propose an image and text matching method based on the Visual Semantic Embedding (VSE) model. The method introduces an adaptive pooling layer to improve the feature extraction capability of semantic associations between Thangka images and texts. We also improved the traditional Transformer architecture by combining bidirectional residual concatenation and mask attention mechanisms to improve the stability of the matching process and the ability to extract semantic information. In addition, we designed a multi-granularity tag alignment module that maps global and local features of images and text into a common coding space, leveraging inter- and intra-modal semantic associations to improve image and text accuracy. Comparative experiments on the Thangka dataset show that our method achieves significant improvements compared to the VSE baseline method. Specifically, our method improves the recall by 9.4% and 10.5% for image-matching text and text-matching images, respectively. Furthermore, without any large-scale corpus pre-training, our method outperforms all models without pre-training and outperforms two out of four pre-trained models on the Flickr30k public dataset. Also, the execution efficiency of our model is an order of magnitude higher than that of the pre-trained models, which highlights the superior performance and efficiency of our model in the image–text matching task.

19

Meo, Giuseppe, Pilar M. Ferraro, Marta Cillerai, et al. "MND Phenotypes Differentiation: The Role of Multimodal Characterization at the Time of Diagnosis." Life 12, no. 10 (2022): 1506. http://dx.doi.org/10.3390/life12101506.

Abstract:

Pure/predominant upper motor neuron (pUMN) and lower motor neuron (pLMN) diseases have significantly better prognosis compared to amyotrophic lateral sclerosis (ALS), but their early differentiation is often challenging. We therefore tested whether a multimodal characterization approach embedding clinical, cognitive/behavioral, genetic, and neurophysiological data may improve the differentiation of pUMN and pLMN from ALS already by the time of diagnosis. Dunn’s and chi-squared tests were used to compare data from 41 ALS, 34 pLMN, and 19 pUMN cases with diagnoses confirmed throughout a 2-year observation period. Area under the curve (AUC) analyses were implemented to identify the finest tools for phenotypes discrimination. Relative to ALS, pLMN showed greater lower limbs weakness, lower UMN burden, and progression rate (p < 0.001–0.04). PUMN showed a greater frequency of lower limbs onset, higher UMN burden, lower ALSFRS-r and MRC progression rates (p < 0.001–0.03), and greater ulnar compound muscle action potential (CMAP) amplitude and tibial central motor conduction time (CMCT) (p = 0.05–0.03). The UMN progression rate was the finest measure to identify pLMN cases (AUC = 90%), while the MRC progression rate was the finest tool to identify pUMN (AUC = 82%). Detailed clinical and neurophysiological examinations may significantly improve MNDs differentiation, facilitating prognosis estimation and ameliorating stratification strategies for clinical trials enrollment.

20

Biswas, Rajarshi, Michael Barz, and Daniel Sonntag. "Towards Explanatory Interactive Image Captioning Using Top-Down and Bottom-Up Features, Beam Search and Re-ranking." KI - Künstliche Intelligenz 34, no. 4 (2020): 571–84. http://dx.doi.org/10.1007/s13218-020-00679-2.

Abstract:

Image captioning is a challenging multimodal task. Significant improvements could be obtained by deep learning. Yet, captions generated by humans are still considered better, which makes it an interesting application for interactive machine learning and explainable artificial intelligence methods. In this work, we aim at improving the performance and explainability of the state-of-the-art method Show, Attend and Tell by augmenting their attention mechanism using additional bottom-up features. We compute visual attention on the joint embedding space formed by the union of high-level features and the low-level features obtained from the object specific salient regions of the input image. We embed the content of bounding boxes from a pre-trained Mask R-CNN model. This delivers state-of-the-art performance, while it provides explanatory features. Further, we discuss how interactive model improvement can be realized through re-ranking caption candidates using beam search decoders and explanatory features. We show that interactive re-ranking of beam search candidates has the potential to outperform the state-of-the-art in image captioning.

21

Balabin, Helena, Charles Tapley Hoyt, Colin Birkenbihl, et al. "STonKGs: a sophisticated transformer trained on biomedical text and knowledge graphs." Bioinformatics 38, no. 6 (2022): 1648–56. http://dx.doi.org/10.1093/bioinformatics/btac001.

Abstract:

Motivation: The majority of biomedical knowledge is stored in structured databases or as unstructured text in scientific publications. This vast amount of information has led to numerous machine learning-based biological applications using either text through natural language processing (NLP) or structured data through knowledge graph embedding models. However, representations based on a single modality are inherently limited. Results: To generate better representations of biological knowledge, we propose STonKGs, a Sophisticated Transformer trained on biomedical text and Knowledge Graphs (KGs). This multimodal Transformer uses combined input sequences of structured information from KGs and unstructured text data from biomedical literature to learn joint representations in a shared embedding space. First, we pre-trained STonKGs on a knowledge base assembled by the Integrated Network and Dynamical Reasoning Assembler consisting of millions of text-triple pairs extracted from biomedical literature by multiple NLP systems. Then, we benchmarked STonKGs against three baseline models trained on either one of the modalities (i.e. text or KG) across eight different classification tasks, each corresponding to a different biological application. Our results demonstrate that STonKGs outperforms both baselines, especially on the more challenging tasks with respect to the number of classes, improving upon the F1-score of the best baseline by up to 0.084 (i.e. from 0.881 to 0.965). Finally, our pre-trained model as well as the model architecture can be adapted to various other transfer learning applications. Availability and implementation: We make the source code and the Python package of STonKGs available at GitHub (https://github.com/stonkgs/stonkgs) and PyPI (https://pypi.org/project/stonkgs/). The pre-trained STonKGs models and the task-specific classification models are respectively available at https://huggingface.co/stonkgs/stonkgs-150k and https://zenodo.org/communities/stonkgs. Supplementary information: Supplementary data are available at Bioinformatics online.

22

Yuan, Xinpan, Xinxin Mao, Wei Xia, Zhiqi Zhang, Shaojun Xie, and Chengyuan Zhang. "PTF-SimCM: A Simple Contrastive Model with Polysemous Text Fusion for Visual Similarity Metric." Complexity 2022 (September 16, 2022): 1–14. http://dx.doi.org/10.1155/2022/2343707.

Abstract:

Image similarity metric, also known as metric learning (ML) in computer vision, is a significant step in various advanced image tasks. Nevertheless, existing well-performing approaches for image similarity measurement only focus on the image itself without utilizing the information of other modalities, while pictures always appear with the described text. Furthermore, those methods need human supervision, yet most images are unlabeled in the real world. Considering the above problems comprehensively, we present a novel visual similarity metric model named PTF-SimCM. It adopts a self-supervised contrastive structure like SimSiam and incorporates a multimodal fusion module to utilize textual modality correlated to the image. We apply a cross-modal model for text modality rather than a standard unimodal text encoder to improve late fusion productivity. In addition, the proposed model employs Sentence PIE-Net to solve the issue caused by polysemous sentences. For simplicity and efficiency, our model learns a specific embedding space where distances directly correspond to the similarity. Experimental results on MSCOCO, Flickr 30k, and Pascal Sentence datasets show that our model overall outperforms all the compared methods in this work, which illustrates that the model can effectively address the issues faced and enhance the performances on unsupervised visual similarity measuring relatively.

23

Tang, Zhenchao, Jiehui Huang, Guanxing Chen, and Calvin Yu-Chian Chen. "Comprehensive View Embedding Learning for Single-Cell Multimodal Integration." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 14 (2024): 15292–300. http://dx.doi.org/10.1609/aaai.v38i14.29453.

Abstract:

Motivation: Advances in single-cell measurement techniques provide rich multimodal data, which helps us to explore the life state of cells more deeply. However, multimodal integration, or, learning joint embeddings from multimodal data remains a current challenge. The difficulty in integrating unpaired single-cell multimodal data is that different modalities have different feature spaces, which easily leads to information loss in joint embedding. And few existing methods have fully exploited and fused the information in single-cell multimodal data. Result: In this study, we propose CoVEL, a deep learning method for unsupervised integration of single-cell multimodal data. CoVEL learns single-cell representations from a comprehensive view, including regulatory relationships between modalities, fine-grained representations of cells, and relationships between different cells. The comprehensive view embedding enables CoVEL to remove the gap between modalities while protecting biological heterogeneity. Experimental results on multiple public datasets show that CoVEL is accurate and robust to single-cell multimodal integration. Data availability: https://github.com/shapsider/scintegration.

24

Chen, Ziwei, Shaokun An, Xiangqi Bai, Fuzhou Gong, Liang Ma, and Lin Wan. "DensityPath: an algorithm to visualize and reconstruct cell state-transition path on density landscape for single-cell RNA sequencing data." Bioinformatics 35, no. 15 (2018): 2593–601. http://dx.doi.org/10.1093/bioinformatics/bty1009.

Abstract:

Motivation: Visualizing and reconstructing cell developmental trajectories intrinsically embedded in high-dimensional expression profiles of single-cell RNA sequencing (scRNA-seq) snapshot data are computationally intriguing, but challenging. Results: We propose DensityPath, an algorithm allowing (i) visualization of the intrinsic structure of scRNA-seq data on an embedded 2-d space and (ii) reconstruction of an optimal cell state-transition path on the density landscape. DensityPath powerfully handles high dimensionality and heterogeneity of scRNA-seq data by (i) revealing the intrinsic structures of data, while adopting a non-linear dimension reduction algorithm, termed elastic embedding, which can preserve both local and global structures of the data; and (ii) extracting the topological features of high-density, level-set clusters from a single-cell multimodal density landscape of transcriptional heterogeneity, as the representative cell states. DensityPath reconstructs the optimal cell state-transition path by finding the geodesic minimum spanning tree of representative cell states on the density landscape, establishing a least action path with the minimum-transition-energy of cell fate decisions. We demonstrate that DensityPath can ably reconstruct complex trajectories of cell development, e.g. those with multiple bifurcating and trifurcating branches, while maintaining computational efficiency. Moreover, DensityPath has high accuracy for pseudotime calculation and branch assignment on real scRNA-seq, as well as simulated datasets. DensityPath is robust to parameter choices, as well as permutations of data. Availability and implementation: DensityPath software is available at https://github.com/ucasdp/DensityPath. Supplementary information: Supplementary data are available at Bioinformatics online.

25

Yin, Ziyi, Muchao Ye, Tianrong Zhang, et al. "VQAttack: Transferable Adversarial Attacks on Visual Question Answering via Pre-trained Models." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 7 (2024): 6755–63. http://dx.doi.org/10.1609/aaai.v38i7.28499.

Abstract:

Visual Question Answering (VQA) is a fundamental task in computer vision and natural language process fields. Although the “pre-training & finetuning” learning paradigm significantly improves the VQA performance, the adversarial robustness of such a learning paradigm has not been explored. In this paper, we delve into a new problem: using a pre-trained multimodal source model to create adversarial image-text pairs and then transferring them to attack the target VQA models. Correspondingly, we propose a novel VQATTACK model, which can iteratively generate both image and text perturbations with the designed modules: the large language model (LLM)-enhanced image attack and the cross-modal joint attack module. At each iteration, the LLM-enhanced image attack module first optimizes the latent representation-based loss to generate feature-level image perturbations. Then it incorporates an LLM to further enhance the image perturbations by optimizing the designed masked answer anti-recovery loss. The cross-modal joint attack module will be triggered at a specific iteration, which updates the image and text perturbations sequentially. Notably, the text perturbation updates are based on both the learned gradients in the word embedding space and word synonym-based substitution. Experimental results on two VQA datasets with five validated models demonstrate the effectiveness of the proposed VQATTACK in the transferable attack setting, compared with state-of-the-art baselines. This work reveals a significant blind spot in the “pre-training & fine-tuning” paradigm on VQA tasks. The source code can be found in the link https://github.com/ericyinyzy/VQAttack.

26

Lin, Kaiyi, Xing Xu, Lianli Gao, Zheng Wang, and Heng Tao Shen. "Learning Cross-Aligned Latent Embeddings for Zero-Shot Cross-Modal Retrieval." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (2020): 11515–22. http://dx.doi.org/10.1609/aaai.v34i07.6817.

Abstract:

Zero-Shot Cross-Modal Retrieval (ZS-CMR) is an emerging research hotspot that aims to retrieve data of new classes across different modality data. It is challenging for not only the heterogeneous distributions across different modalities, but also the inconsistent semantics across seen and unseen classes. A handful of recently proposed methods typically borrow the idea from zero-shot learning, i.e., exploiting word embeddings of class labels (i.e., class-embeddings) as common semantic space, and using generative adversarial network (GAN) to capture the underlying multimodal data structures, as well as strengthen relations between input data and semantic space to generalize across seen and unseen classes. In this paper, we propose a novel method termed Learning Cross-Aligned Latent Embeddings (LCALE) as an alternative to these GAN based methods for ZS-CMR. Unlike using the class-embeddings as the semantic space, our method seeks for a shared low-dimensional latent space of input multimodal features and class-embeddings by modality-specific variational autoencoders. Notably, we align the distributions learned from multimodal input features and from class-embeddings to construct latent embeddings that contain the essential cross-modal correlation associated with unseen classes. Effective cross-reconstruction and cross-alignment criterions are further developed to preserve class-discriminative information in latent space, which benefits the efficiency for retrieval and enable the knowledge transfer to unseen classes. We evaluate our model using four benchmark datasets on image-text retrieval tasks and one large-scale dataset on image-sketch retrieval tasks. The experimental results show that our method establishes the new state-of-the-art performance for both tasks on all datasets.

27

Xu, Xing, Jialin Tian, Kaiyi Lin, Huimin Lu, Jie Shao, and Heng Tao Shen. "Zero-shot Cross-modal Retrieval by Assembling AutoEncoder and Generative Adversarial Network." ACM Transactions on Multimedia Computing, Communications, and Applications 17, no. 1s (2021): 1–17. http://dx.doi.org/10.1145/3424341.

Abstract:

Conventional cross-modal retrieval models mainly assume the same scope of the classes for both the training set and the testing set. This assumption limits their extensibility on zero-shot cross-modal retrieval (ZS-CMR), where the testing set consists of unseen classes that are disjoint with seen classes in the training set. The ZS-CMR task is more challenging due to the heterogeneous distributions of different modalities and the semantic inconsistency between seen and unseen classes. A few of recently proposed approaches are inspired by zero-shot learning to estimate the distribution underlying multimodal data by generative models and make the knowledge transfer from seen classes to unseen classes by leveraging class embeddings. However, directly borrowing the idea from zero-shot learning (ZSL) is not fully adaptive to the retrieval task, since the core of the retrieval task is learning the common space. To address the above issues, we propose a novel approach named Assembling AutoEncoder and Generative Adversarial Network (AAEGAN), which combines the strength of AutoEncoder (AE) and Generative Adversarial Network (GAN), to jointly incorporate common latent space learning, knowledge transfer, and feature synthesis for ZS-CMR. Besides, instead of utilizing class embeddings as common space, the AAEGAN approach maps all multimodal data into a learned latent space with the distribution alignment via three coupled AEs. We empirically show the remarkable improvement for ZS-CMR task and establish the state-of-the-art or competitive performance on four image-text retrieval datasets.

28

Vijaya Kamble. "Design of an Iterative Method for Enhanced Multimodal Time Series Analysis Using Graph Attention Networks, Variational Graph Autoencoders, and Transfer Learning." Journal of Electrical Systems 20, no. 5s (2024): 2579–98. http://dx.doi.org/10.52783/jes.2699.

Abstract:

In the ever-evolving landscape of data analysis, the need to efficiently and accurately interpret multimodal time series data has become paramount. Traditional methods often fall short in addressing the complex dependencies and dynamics inherent in such data, limiting their effectiveness in real-world applications. This work introduces a comprehensive approach that leverages Graph Attention Networks (GATs), Variational Graph Autoencoders (VGAEs), transfer learning with pretrained transformers, and Bayesian state-space models to overcome these limitations. GATs are selected for their ability to dynamically focus on relevant modalities through attention mechanisms, thereby capturing the intricate relationships between different data modalities. This method significantly enhances the model's ability to integrate multimodal information, leading to notable improvements in classification, prediction, and anomaly detection tasks. VGAEs are utilized to learn latent representations within a graph-based framework, promoting unsupervised learning while unveiling the underlying data structure. The resultant embeddings are pivotal for downstream tasks like clustering and visualization, encapsulating the interactions within multimodal time series data effectively. Furthermore, this work incorporates transfer learning with pretrained transformers to harness extensive knowledge from large datasets, adapting it to multimodal time series analysis. This strategy excels in capturing long-range dependencies, thereby augmenting generalization and performance in data-scarce scenarios. Bayesian state-space models are employed to elucidate the temporal dynamics and uncertainties of time series data, offering a robust framework for probabilistic inference and enhancing the interpretability and reliability of forecasting and anomaly detection. The efficacy of the proposed model is rigorously evaluated using diverse datasets, including the Yahoo! Stock Dataset, Forest Cover Dataset, and an empirical collection of 100k time series data samples. The results demonstrate a significant leap in performance metrics, including a 9.5% increase in precision, 8.5% boost in accuracy, 8.3% rise in recall, 10.4% reduction in delay, 9.4% enhancement in AUC, and a 5.9% improvement in specificity, alongside superior pre-emption capabilities compared to existing methods. This work not only addresses the pressing need for advanced multimodal time series analysis techniques but also sets a new benchmark for efficiency and accuracy. The integration of GATs, VGAEs, transfer learning with pretrained transformers, and Bayesian state-space models presents a formidable approach that significantly advances the field, offering profound impacts on a wide array of applications.

29

Han, Kezhen, Shaohang Lu, Zhengce Liu, and Zipeng Wang. "Active Fault Isolation for Multimode Fault Systems Based on a Set Separation Indicator." Entropy 25, no. 6 (2023): 876. http://dx.doi.org/10.3390/e25060876.

Abstract:

This paper considers the active fault isolation problem for a class of uncertain multimode fault systems with a high-dimensional state-space model. It has been observed that the existing approaches in the literature based on a steady-state active fault isolation method are often accompanied by a large delay in making the correct isolation decision. To reduce such fault isolation latency significantly, this paper proposes a fast online active fault isolation method based on the construction of residual transient-state reachable set and transient-state separating hyperplane. The novelty and benefit of this strategy lies in the embedding of a new component called the set separation indicator, which is designed offline to distinguish the residual transient-state reachable sets of different system configurations at any given moment. Based on the results delivered by the set separation indicator, one can determine the specific moments at which the deterministic isolation is to be implemented during online diagnostics. Meanwhile, some alternative constant inputs can also be evaluated for isolation effects to determine better auxiliary excitation signals with smaller amplitudes and more differentiated separating hyperplanes. The validity of these results is verified by both a numerical comparison and an FPGA-in-loop experiment.

30

Weiner, Pascal, Caterina Neef, Yoshihisa Shibata, Yoshihiko Nakamura, and Tamim Asfour. "An Embedded, Multi-Modal Sensor System for Scalable Robotic and Prosthetic Hand Fingers." Sensors 20, no. 1 (2019): 101. http://dx.doi.org/10.3390/s20010101.

Abstract:

Grasping and manipulation with anthropomorphic robotic and prosthetic hands presents a scientific challenge regarding mechanical design, sensor system, and control. Apart from the mechanical design of such hands, embedding sensors needed for closed-loop control of grasping tasks remains a hard problem due to limited space and required high level of integration of different components. In this paper we present a scalable design model of artificial fingers, which combines mechanical design and embedded electronics with a sophisticated multi-modal sensor system consisting of sensors for sensing normal and shear force, distance, acceleration, temperature, and joint angles. The design is fully parametric, allowing automated scaling of the fingers to arbitrary dimensions in the human hand spectrum. To this end, the electronic parts are composed of interchangeable modules that facilitate the mechanical scaling of the fingers and are fully enclosed by the mechanical parts of the finger. The resulting design model allows deriving freely scalable and multimodally sensorised fingers for robotic and prosthetic hands. Four physical demonstrators are assembled and tested to evaluate the approach.

31

Myles, David, David Milne, and Jonathan D. Shephard. "Scanned Mask Imaging Ablative DPSS UV Laser Process for 2μm L/S RDL." Additional Conferences (Device Packaging, HiTEC, HiTEN, and CICMT) 2015, DPC (2015): 000554–89. http://dx.doi.org/10.4071/2015dpc-tp21.

Abstract:

Laser embedding conductors within a dielectric offers numerous advantages in fabricating redistribution layers (RDLs) for chip packages. Ablation of features down to 2μm L/S gives more routing space per layer and addresses the technology gap between semiconductor and PCB technologies. Microvias are made in the same process step as the circuitry, facilitating near padless vias further increasing the routing space available per layer. For a given package, this reduces the layer count and conductor path length required reducing the height profile of the package and improving signal integrity. Embedding the conductor can also improve its adhesion to the substrate and improve the co-planarity of subsequent layers in the build up. It also removes the need for the wet photochemistry associated with lithographic techniques. This presentation analyses the results of a novel UV, diode pumped solid state (DPSS), ablative mask imaging laser system for cost effective, high volume, 3D structuring of organic dielectrics. Two methods are widely used to micro-structure materials by laser: mask projection and direct write. Excimer lasers are typically used in mask projection systems, where their high pulse energy and low coherence make them well suited to imaging. These systems can achieve the required ablation quality with 2–3μm line width and space, however excimer lasers have a high capital cost and require regular and costly maintenance when compared with DPSS lasers. The high beam quality and lower pulse energy of DPSS lasers makes them better suited to a direct write approach. A galvanometer scan head used in conjunction with an f-theta scan lens can be used to scan a focused beam across a substrate. Since the pattern is defined by a CAD file, these systems are very flexible and thus appropriate for low volume prototyping. 
However, complicated control systems are required to accurately control the ablated depth, and constraints in the circuit design are imposed by the direct write approach. Also, because each feature is marked sequentially, the process time is proportional to the pattern complexity, which makes these tools prohibitively slow for high volume manufacture of the high density RDLs required in the next generation of device packages. This presentation outlines a scanned mask imaging system, wherein a low maintenance, cost efficient, frequency tripled, nanosecond, multimode UV solid state laser is used to illuminate a binary reticle. The multimode beam has an approximately Gaussian beam profile which is homogenised to form a square, flat top profile. A galvanometer scan head is used to raster scan the binary reticle. The reticle is subsequently imaged onto the substrate by a projection lens. Ablation of various features down to 2μm L/S in a variety of low K organic dielectrics is demonstrated. Accurate registration of pads with vias down to 5μm diameter highlights the feasibility of the process for high density RDLs and micro-vias for organic interposers. The process can achieve an ablation quality comparable to that of an excimer laser system, but with the advantage of significant cost saving and ease of maintenance in an industrial environment.

32

Suguitan, Michael, Nick DePalma, Guy Hoffman, and Jessica Hodgins. "Face2Gesture: Translating Facial Expressions Into Robot Movements Through Shared Latent Space Neural Networks." ACM Transactions on Human-Robot Interaction, October 4, 2023. http://dx.doi.org/10.1145/3623386.

Abstract:

In this work, we present a method for personalizing human-robot interaction by using emotive facial expressions to generate affective robot movements. Movement is an important medium for robots to communicate affective states, but the expertise and time required to craft new robot movements promotes a reliance on fixed preprogrammed behaviors. Enabling robots to respond to multimodal user input with newly generated movements could stave off staleness of interaction and convey a deeper degree of affective understanding than current retrieval-based methods. We use autoencoder neural networks to compress robot movement data and facial expression images into a shared latent embedding space. Then, we use a reconstruction loss to generate movements from these embeddings and triplet loss to align the embeddings by emotion classes rather than data modality. To subjectively evaluate our method, we conducted a user survey and found that generated happy and sad movements could be matched to their source face images. However, angry movements were most often mismatched to sad images. This multimodal data-driven generative method can expand an interactive agent’s behavior library and could be adopted for other multimodal affective applications.
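
The triplet loss mentioned in this abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the Euclidean distance, the margin of 1.0, and the two-dimensional toy embeddings are all assumptions made here for clarity.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor: Sequence[float],
                 positive: Sequence[float],
                 negative: Sequence[float],
                 margin: float = 1.0) -> float:
    """Zero once the positive is closer to the anchor than the negative by at least `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical embeddings: a happy face image, a happy movement, a sad movement.
face_happy = [1.0, 0.0]
move_happy = [0.9, 0.1]
move_sad = [-1.0, 0.0]

# The happy movement already sits much closer to the happy face than the
# sad movement does, so this triplet contributes no loss.
print(triplet_loss(face_happy, move_happy, move_sad))
```

Choosing the positive from the same emotion class but a different modality is what pulls embeddings together by emotion rather than by data type.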

33

Wen, Jun, Xiang Zhang, Everett Rush, et al. "Multimodal representation learning for predicting molecule–disease relations." Bioinformatics 39, no. 2 (2023). http://dx.doi.org/10.1093/bioinformatics/btad085.

Abstract:

Motivation: Predicting molecule–disease indications and side effects is important for drug development and pharmacovigilance. Comprehensively mining molecule–molecule, molecule–disease and disease–disease semantic dependencies can potentially improve prediction performance. Methods: We introduce a Multi-Modal REpresentation Mapping Approach to Predicting molecular-disease relations (M2REMAP) by incorporating clinical semantics learned from electronic health records (EHR) of 12.6 million patients. Specifically, M2REMAP first learns a multimodal molecule representation that synthesizes chemical property and clinical semantic information by mapping molecule chemicals via a deep neural network onto the clinical semantic embedding space shared by drugs, diseases and other common clinical concepts. To infer molecule–disease relations, M2REMAP combines multimodal molecule representation and disease semantic embedding to jointly infer indications and side effects. Results: We extensively evaluate M2REMAP on molecule indications, side effects and interactions. Results show that incorporating EHR embeddings improves performance significantly, for example, attaining an improvement over the baseline models by 23.6% in PRC-AUC on indications and 23.9% on side effects. Further, M2REMAP overcomes the limitation of existing methods and effectively predicts drugs for novel diseases and emerging pathogens. Availability and implementation: The code is available at https://github.com/celehs/M2REMAP, and prediction results are provided at https://shiny.parse-health.org/drugs-diseases-dev/. Supplementary information: Supplementary data are available at Bioinformatics online.

34

Chang, Jun Qing, Deepu Rajan, and Nicholas Vun. "Multimodal few-shot classification without attribute embedding." EURASIP Journal on Image and Video Processing 2024, no. 1 (2024). http://dx.doi.org/10.1186/s13640-024-00620-9.

Abstract:

Multimodal few-shot learning aims to exploit complementary information inherent in multiple modalities for vision tasks in low data scenarios. Most of the current research focuses on a suitable embedding space for the various modalities. While solutions based on embedding provide state-of-the-art results, they reduce the interpretability of the model. Separate visualization approaches enable the models to become more transparent. In this paper, a multimodal few-shot learning framework that is inherently interpretable is presented. This is achieved by using the textual modality in the form of attributes without embedding them. This enables the model to directly explain which attributes caused it to classify an image into a particular class. The model consists of a variational autoencoder to learn the visual latent representation, which is combined with a semantic latent representation that is learnt from a normal autoencoder, which calculates a semantic loss between the latent representation and a binary attribute vector. A decoder reconstructs the original image from concatenated latent vectors. The proposed model outperforms other multimodal methods when all test classes are used, e.g., 50 classes in a 50-way 1-shot setting, and is comparable for lesser number of ways. Since raw text attributes are used, the datasets for evaluation are CUB, SUN and AWA2. The effectiveness of interpretability provided by the model is evaluated by analyzing how well it has learnt to identify the attributes.

35

Elhoseiny, Mohamed, Jingen Liu, Hui Cheng, Harpreet Sawhney, and Ahmed Elgammal. "Zero-Shot Event Detection by Multimodal Distributional Semantic Embedding of Videos." Proceedings of the AAAI Conference on Artificial Intelligence 30, no. 1 (2016). http://dx.doi.org/10.1609/aaai.v30i1.10458.

Abstract:

We propose a new zero-shot Event-Detection method by Multi-modal Distributional Semantic embedding of videos. Our model embeds object and action concepts as well as other available modalities from videos into a distributional semantic space. To our knowledge, this is the first Zero-Shot event detection model that is built on top of distributional semantics and extends it in the following directions: (a) semantic embedding of multimodal information in videos (with focus on the visual modalities), (b) semantic embedding of concepts definitions, and (c) retrieve videos by free text event query (e.g., "changing a vehicle tire") based on their content. We first embed the video into the multi-modal semantic space and then measure the similarity between videos with the event query in free text form. We validated our method on the large TRECVID MED (Multimedia Event Detection) challenge. Using only the event title as a query, our method outperformed the state-of-the-art that uses big descriptions from 12.6% to 13.5% with MAP metric and from 0.73 to 0.83 with ROC-AUC metric. It is also an order of magnitude faster.
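The retrieval step described above (embed videos and a free-text query in one semantic space, then compare them) can be sketched as a cosine-similarity ranking. This is only an illustration under stated assumptions: the abstract does not name the similarity measure, and the vectors, video ids, and function names below are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_videos(query_vec, video_vecs):
    """Return video ids sorted by decreasing similarity to the query embedding."""
    return sorted(video_vecs, key=lambda vid: cosine(query_vec, video_vecs[vid]), reverse=True)

# Hypothetical embeddings in a shared 3-D semantic space.
videos = {
    "tire_change": [0.9, 0.1, 0.0],
    "birthday": [0.0, 0.2, 0.9],
    "car_wash": [0.6, 0.6, 0.1],
}
query = [1.0, 0.0, 0.0]  # stands in for the embedded text "changing a vehicle tire"

ranking = rank_videos(query, videos)
```

Because the query is only embedded and compared, no training example of the event itself is required, which is what makes the setting zero-shot.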

36

Feng, Duoduo, Xiangteng He, and Yuxin Peng. "MKVSE: Multimodal Knowledge Enhanced Visual-Semantic Embedding for Image-Text Retrieval." ACM Transactions on Multimedia Computing, Communications, and Applications, January 19, 2023. http://dx.doi.org/10.1145/3580501.

Abstract:

Image-text retrieval aims to take the text (image) query to retrieve the semantically relevant images (texts), which is fundamental and critical in the search system, online shopping, and social network. Existing works have shown the effectiveness of visual-semantic embedding and unimodal knowledge exploiting (e.g. textual knowledge) in connecting the image and text. However, they neglect the implicit multimodal knowledge relations between these two modalities when the image contains information that is not directly described in the text, hindering the ability to connect the image and text with the implicit semantic relations. For instance, an image shows a person next to the “tap” but the pairing text description may only include the word “wash”, missing the washing tool “tap”. The implicit semantic relation between image object “tap” and text word “wash” can help to connect the above image and text. To sufficiently utilize the implicit multimodal knowledge relations, we propose a Multimodal Knowledge enhanced Visual-Semantic Embedding (MKVSE) approach building a multimodal knowledge graph to explicitly represent the implicit multimodal knowledge relations and injecting it to visual-semantic embedding for image-text retrieval task. The contributions in this paper can be summarized as follows: (1) Multimodal Knowledge Graph (MKG) is proposed to explicitly represent the implicit multimodal knowledge relations between the image and text as intra-modal semantic relations and inter-modal co-occurrence relations. Intra-modal semantic relations provide synonymy information that is implicit in the unimodal data such as the text corpus. And inter-modal co-occurrence relations characterize the co-occurrence correlations (such as temporal, causal, and logical) which are implicit in image-text pairs. These two relations help establishing reliable image-text connections in the higher-level semantic space.
(2) Multimodal Graph Convolution Networks (MGCN) is proposed to reason on the MKG in two steps to sufficiently utilize the implicit multimodal knowledge relations. In the first step, MGCN focuses on the intra-modal relations to distinguish other entities in the semantic space. In the second step, MGCN focuses on the inter-modal relations to connect multimodal entities based on co-occurrence correlations. The two-step reasoning manner can sufficiently utilize the implicit semantic relations between two modal entities to enhance the embeddings of the image and text. Extensive experiments are conducted on two widely-used datasets, namely Flickr30K and MSCOCO, to demonstrate the superiority of the proposed MKVSE approach in achieving state-of-the-art performances. The codes are available at https://github.com/PKU-ICST-MIPL/MKVSE-TOMM2023.

37

Rivas, Ryan, Sudipta Paul, Vagelis Hristidis, Evangelos E. Papalexakis, and Amit K. Roy-Chowdhury. "Task-agnostic representation learning of multimodal twitter data for downstream applications." Journal of Big Data 9, no. 1 (2022). http://dx.doi.org/10.1186/s40537-022-00570-x.

Abstract:

Twitter is a frequent target for machine learning research and applications. Many problems, such as sentiment analysis, image tagging, and location prediction have been studied on Twitter data. Much of the prior work that addresses these problems within the context of Twitter focuses on a subset of the types of data available, e.g. only text, or text and image. However, a tweet can have several additional components, such as the location and the author, that can also provide useful information for machine learning tasks. In this work, we explore the problem of jointly modeling several tweet components in a common embedding space via task-agnostic representation learning, which can then be used to tackle various machine learning applications. To address this problem, we propose a deep neural network framework that combines text, image, and graph representations to learn joint embeddings for 5 tweet components: body, hashtags, images, user, and location. In our experiments, we use a large dataset of tweets to learn a joint embedding model and use it in multiple tasks to evaluate its performance vs. state-of-the-art baselines specific to each task. Our results show that our proposed generic method has similar or superior performance to specialized application-specific approaches, including accuracy of 52.43% vs. 48.88% for location prediction and recall of up to 15.93% vs. 12.12% for hashtag recommendation.

38

Dong, Shanshan, Tianzi Niu, Xin Luo, Wu Liu, and Xin-Shun Xu. "Semantic Embedding Guided Attention with Explicit Visual Feature Fusion for Video Captioning." ACM Transactions on Multimedia Computing, Communications, and Applications, July 22, 2022. http://dx.doi.org/10.1145/3550276.

Abstract:

Video captioning, which bridges vision and language, is a fundamental yet challenging task in computer vision. To generate accurate and comprehensive sentences, both visual and semantic information is quite important. However, most existing methods simply concatenate different types of features and ignore the interactions between them. In addition, there is a large semantic gap between visual feature space and semantic embedding space, making the task much more challenging. To address these issues, we propose a framework named semantic embedding guided attention with Explicit visual Feature Fusion for vidEo CapTioning, EFFECT for short, in which we design an explicit visual-feature fusion (EVF) scheme to capture the pairwise interactions between multiple visual modalities and fuse multimodal visual features of videos in an explicit way. Furthermore, we propose a novel attention mechanism called semantic embedding guided attention (SEGA), which cooperates with the temporal attention to generate a joint attention map. Specifically, in SEGA, the semantic word embedding information is leveraged to guide the model to pay more attention to the most correlated visual features at each decoding stage. In this way, the semantic gap between visual and semantic space is alleviated to some extent. To evaluate the proposed model, we conduct extensive experiments on two widely used datasets, i.e. MSVD and MSR-VTT. The experimental results demonstrate that our approach achieves state-of-the-art results in terms of four evaluation metrics.

39

Chang, Jinho, and Jong Chul Ye. "Bidirectional generation of structure and properties through a single molecular foundation model." Nature Communications 15, no. 1 (2024). http://dx.doi.org/10.1038/s41467-024-46440-3.

Abstract:

Recent successes of foundation models in artificial intelligence have prompted the emergence of large-scale chemical pre-trained models. Despite the growing interest in large molecular pre-trained models that provide informative representations for downstream tasks, attempts for multimodal pre-training approaches on the molecule domain were limited. To address this, here we present a multimodal molecular pre-trained model that incorporates the modalities of structure and biochemical properties, drawing inspiration from recent advances in multimodal learning techniques. Our proposed model pipeline of data handling and training objectives aligns the structure/property features in a common embedding space, which enables the model to regard bidirectional information between the molecules’ structure and properties. These contributions emerge synergistic knowledge, allowing us to tackle both multimodal and unimodal downstream tasks through a single model. Through extensive experiments, we demonstrate that our model has the capabilities to solve various meaningful chemical challenges, including conditional molecule generation, property prediction, molecule classification, and reaction prediction.

40

Ghodsizad, Talayeh, Hamid Behnam, Emad Fatemizadeh, Taraneh Faghihi Langroudi, and Fariba Bayat. "Temporal Registration of Cardiac Multimodal Images Using Locally Linear Embedding Algorithm." Frontiers in Biomedical Technologies, November 15, 2021. http://dx.doi.org/10.18502/fbt.v8i4.7757.

Abstract:

Purpose: Multimodal Cardiac Image (MCI) registration is one of the evolving fields in the diagnostic methods of Cardiovascular Diseases (CVDs). Since the heart has nonlinear and dynamic behavior, Temporal Registration (TR) is the fundamental step for the spatial registration and fusion of MCIs to integrate the heart's anatomical and functional information into a single and more informative display. Therefore, in this study, a TR framework is proposed to align MCIs in the same cardiac phase. Materials and Methods: A manifold learning-based method is proposed for the TR of MCIs. The Euclidean distance among consecutive samples lying on the Locally Linear Embedding (LLE) of MCIs is computed. By considering cardiac volume pattern concepts from distance plots of LLEs, six cardiac phases (end-diastole, rapid-ejection, end-systole, rapid-filling, reduced-filling, and atrial-contraction) are temporally registered. Results: The validation of the proposed method proceeds by collecting the data of Computed Tomography Coronary Angiography (CTCA) and Transthoracic Echocardiography (TTE) from ten patients in four acquisition views. The Correlation Coefficient (CC) between the frame number resulted from the proposed method and manually selected by an expert is analyzed. Results show that the average CC between two resulted frame numbers is about 0.82±0.08 for six cardiac phases. Moreover, the maximum Mean Absolute Error (MAE) value of two slice extraction methods is about 0.17 for four acquisition views. Conclusion: By extracting the intrinsic parameters of MCIs, and finding the relationship among them in a lower-dimensional space, a fast, fully automatic, and user-independent framework for TR of MCIs is presented. The proposed method is more accurate compared to Electrocardiogram (ECG) signal labeling or time-series processing methods which can be helpful in different MCI fusion methods.

41

Ikegawa, Yuya, Ryohei Fukuma, Hidenori Sugano, et al. "Text and image generation from intracranial electroencephalography using an embedding space for text and images." Journal of Neural Engineering, April 22, 2024. http://dx.doi.org/10.1088/1741-2552/ad417a.

Abstract:

Objective: Invasive brain-computer interfaces (BCIs) are promising communication devices for severely paralyzed patients. Recent advances in intracranial electroencephalography (iEEG) coupled with natural language processing have enhanced communication speed and accuracy. It should be noted that such a speech BCI uses signals from the motor cortex. However, BCIs based on motor cortical activities may experience signal deterioration in users with motor cortical degenerative diseases such as amyotrophic lateral sclerosis (ALS). An alternative approach to using iEEG of the motor cortex is necessary to support patients with such conditions. Approach: In this study, a multimodal embedding of text and images was used to decode visual semantic information from iEEG signals of the visual cortex to generate text and images. We used contrastive language-image pretraining (CLIP) embedding to represent images presented to 17 patients implanted with electrodes in the occipital and temporal cortexes. A CLIP image vector was inferred from the high-γ power of the iEEG signals recorded while viewing the images. Main results: Text was generated by CLIPCAP from the inferred CLIP vector with better-than-chance accuracy. Then, an image was created from the generated text using StableDiffusion with significant accuracy. Significance: The text and images generated from iEEG through the CLIP embedding vector can be used for improved communication.

42

Hu, Yue, Ghalia Rehawi, Lambert Moyon, et al. "Network Embedding Across Multiple Tissues and Data Modalities Elucidates the Context of Host Factors Important for COVID-19 Infection." Frontiers in Genetics 13 (July 8, 2022). http://dx.doi.org/10.3389/fgene.2022.909714.

Abstract:

COVID-19 is a heterogeneous disease caused by SARS-CoV-2. Aside from infections of the lungs, the disease can spread throughout the body and damage many other tissues, leading to multiorgan failure in severe cases. The highly variable symptom severity is influenced by genetic predispositions and preexisting diseases which have not been investigated in a large-scale multimodal manner. We present a holistic analysis framework, setting previously reported COVID-19 genes in context with prepandemic data, such as gene expression patterns across multiple tissues, polygenetic predispositions, and patient diseases, which are putative comorbidities of COVID-19. First, we generate a multimodal network using the prior-based network inference method KiMONo. We then embed the network to generate a meaningful lower-dimensional representation of the data. The input data are obtained via the Genotype-Tissue Expression project (GTEx), containing expression data from a range of tissues with genomic and phenotypic information of over 900 patients and 50 tissues. The generated network consists of nodes, that is, genes and polygenic risk scores (PRS) for several diseases/phenotypes, as well as for COVID-19 severity and hospitalization, and links between them if they are statistically associated in a regularized linear model by feature selection. Applying network embedding on the generated multimodal network allows us to perform efficient network analysis by identifying nodes close by in a lower-dimensional space that correspond to entities which are statistically linked. By determining the similarity between COVID-19 genes and other nodes through embedding, we identify disease associations to tissues, like the brain and gut. We also find strong associations between COVID-19 genes and various diseases such as ischemic heart disease, cerebrovascular disease, and hypertension. 
Moreover, we find evidence linking PTPN6 to a range of comorbidities along with the genetic predisposition of COVID-19, suggesting that this kinase is a central player in severe cases of COVID-19. In conclusion, our holistic network inference coupled with network embedding of multimodal data enables the contextualization of COVID-19-associated genes with respect to tissues, disease states, and genetic risk factors. Such contextualization can be exploited to further elucidate the biological importance of known and novel genes for severity of the disease in patients.

43

Axås, Joar, and George Haller. "Model reduction for nonlinearizable dynamics via delay-embedded spectral submanifolds." Nonlinear Dynamics, July 16, 2023. http://dx.doi.org/10.1007/s11071-023-08705-2.

Abstract:

Delay embedding is a commonly employed technique in a wide range of data-driven model reduction methods for dynamical systems, including the dynamic mode decomposition, the Hankel alternative view of the Koopman decomposition (HAVOK), nearest-neighbor predictions and the reduction to spectral submanifolds (SSMs). In developing these applications, multiple authors have observed that delay embedding appears to separate the data into modes, whose orientations depend only on the spectrum of the sampled system. In this work, we make this observation precise by proving that the eigenvectors of the delay-embedded linearized system at a fixed point are determined solely by the corresponding eigenvalues, even for multi-dimensional observables. This implies that the tangent space of a delay-embedded invariant manifold can be predicted a priori using an estimate of the eigenvalues. We apply our results to three datasets to identify multimodal SSMs and analyse their nonlinear modal interactions. While SSMs are the focus of our study, these results generalize to any delay-embedded invariant manifold tangent to a set of eigenvectors at a fixed point. Therefore, we expect this theory to be applicable to a number of data-driven model reduction methods.
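For readers unfamiliar with delay embedding, the basic construction stacks time-shifted copies of an observable into vectors. The sketch below shows only that generic construction for a scalar series; the function name and the `dim` and `lag` parameters are illustrative assumptions, and it does not reproduce the paper's analysis of spectral submanifolds.

```python
def delay_embed(series, dim, lag=1):
    """Return delay vectors [x[t], x[t + lag], ..., x[t + (dim - 1) * lag]]."""
    if dim < 1 or lag < 1:
        raise ValueError("dim and lag must be positive integers")
    span = (dim - 1) * lag  # samples consumed beyond the first coordinate
    return [
        [series[t + k * lag] for k in range(dim)]
        for t in range(len(series) - span)
    ]

# Hypothetical scalar observable sampled at six time steps.
signal = [0, 1, 2, 3, 4, 5]
embedded = delay_embed(signal, dim=3, lag=2)
```

With `dim=3` and `lag=2`, each delay vector spans five consecutive samples, so a six-sample series yields two vectors.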

44

Deng, Li. "Deep learning: from speech recognition to language and multimodal processing." APSIPA Transactions on Signal and Information Processing 5 (2016). http://dx.doi.org/10.1017/atsip.2015.22.

Abstract:

While artificial neural networks have been in existence for over half a century, it was not until year 2010 that they had made a significant impact on speech recognition with a deep form of such networks. This invited paper, based on my keynote talk given at Interspeech conference in Singapore in September 2014, will first reflect on the historical path to this transformative success, after providing brief reviews of earlier studies on (shallow) neural networks and on (deep) generative models relevant to the introduction of deep neural networks (DNN) to speech recognition several years ago. The role of well-timed academic-industrial collaboration is highlighted, so are the advances of big data, big compute, and the seamless integration between the application-domain knowledge of speech and general principles of deep learning. Then, an overview is given on sweeping achievements of deep learning in speech recognition since its initial success. Such achievements, summarized into six major areas in this article, have resulted in across-the-board, industry-wide deployment of deep learning in speech recognition systems. Next, more challenging applications of deep learning, natural language and multimodal processing, are selectively reviewed and analyzed. Examples include machine translation, knowledgebase completion, information retrieval, and automatic image captioning, where fresh ideas from deep learning, continuous-space embedding in particular, are shown to be revolutionizing these application areas albeit with less rapid pace than for speech and image recognition. Finally, a number of key issues in deep learning are discussed, and future directions are analyzed for perceptual tasks such as speech, image, and video, as well as for cognitive tasks involving natural language.

45

Qayyum, Abdul, Imran Razzak, M. Tanveer, and Moona Mazher. "Spontaneous Facial Behavior Analysis using Deep Transformer Based Framework for Child–Computer Interaction." ACM Transactions on Multimedia Computing, Communications, and Applications, May 26, 2022. http://dx.doi.org/10.1145/3539577.

Abstract:

A fascinating challenge in robotics-human interaction is imitating the emotion recognition capability of humans to robots with the aim to make human-robotics interaction natural, genuine and intuitive. To achieve the natural interaction in affective robots, human-machine interfaces, and autonomous vehicles, understanding our attitudes and opinions is very important, and it provides a practical and feasible path to realize the connection between machine and human. Multimodal interface that includes voice along with facial expression can manifest a large range of nuanced emotions compared to purely textual interfaces and provide a great value to improve the intelligence level of effective communication. Interfaces that fail to manifest or ignore user emotions may significantly impact the performance and risk being perceived as cold, socially inept, untrustworthy, and incompetent. To equip a child well for life, we need to help our children identify their feelings, manage them well, and express their needs in healthy, respectful, and direct ways. Early identification of emotional deficits can help to prevent low social functioning in children. In this work, we analyzed the child’s spontaneous behavior using multimodal facial expression and voice signal presenting multimodal transformer-based last feature fusion for facial behavior analysis in children to extract contextualized representations from RGB video sequence and Hematoxylin and eosin video sequence and then using these representations followed by pairwise concatenations of contextualized representations using cross-feature fusion technique to predict users emotions. To validate the performance of the proposed framework, we have performed experiments with the different pairwise concatenations of contextualized representations that showed significantly better performance than state of the art method. 
Besides, we perform t-distributed stochastic neighbor embedding visualization to visualize the discriminative feature in lower dimension space and probability density estimation to visualize the prediction capability of our proposed model.

46

Zhang, Qing, Jing Zhang, Xiangdong Su, Feilong Bao, and Guanglai Gao. "Contour detection network for zero-shot sketch-based image retrieval." Complex & Intelligent Systems, June 2, 2023. http://dx.doi.org/10.1007/s40747-023-01096-2.

Abstract:

Zero-shot sketch-based image retrieval (ZS-SBIR) is a challenging task that involves searching natural images related to a given hand-drawn sketch under the zero-shot scene. The previous approach projected image and sketch features into a low-dimensional common space for retrieval, and used semantic features to transfer the knowledge of seen to unseen classes. However, it is not effective enough to align multimodal features when projecting them into a common space, since the styles and contents of sketches and natural images are different and they are not one-to-one correspondence. To solve this problem, we propose a novel three-branch joint training network with contour detection network (called CDNNet) for the ZS-SBIR task, which uses contour maps as a bridge to align sketches and natural images to alleviate the domain gap. Specifically, we use semantic metrics to constrain the relationship between contour images and natural images and between contour images and sketches, so that natural image and sketch features can be aligned in the common space. Meanwhile, we further employ second-order attention to capture target subject information to increase the performance of retrieval descriptors. In addition, we use a teacher model and word embedding method to transfer the knowledge of the seen to the unseen classes. Extensive experiments on two large-scale datasets demonstrate that our proposed approach outperforms state-of-the-art CNN-based models: it improves by 2.6% on the Sketchy and 1.2% on TU-Berlin datasets in terms of mAP.

47

Shickel, Benjamin, Brandon Silva, Tezcan Ozrazgat-Baslanti, et al. "Multi-dimensional patient acuity estimation with longitudinal EHR tokenization and flexible transformer networks." Frontiers in Digital Health 4 (November 9, 2022). http://dx.doi.org/10.3389/fdgth.2022.1029191.

Abstract:

Transformer model architectures have revolutionized the natural language processing (NLP) domain and continue to produce state-of-the-art results in text-based applications. Prior to the emergence of transformers, traditional NLP models such as recurrent and convolutional neural networks demonstrated promising utility for patient-level predictions and health forecasting from longitudinal datasets. However, to our knowledge only few studies have explored transformers for predicting clinical outcomes from electronic health record (EHR) data, and in our estimation, none have adequately derived a health-specific tokenization scheme to fully capture the heterogeneity of EHR systems. In this study, we propose a dynamic method for tokenizing both discrete and continuous patient data, and present a transformer-based classifier utilizing a joint embedding space for integrating disparate temporal patient measurements. We demonstrate the feasibility of our clinical AI framework through multi-task ICU patient acuity estimation, where we simultaneously predict six mortality and readmission outcomes. Our longitudinal EHR tokenization and transformer modeling approaches resulted in more accurate predictions compared with baseline machine learning models, which suggest opportunities for future multimodal data integrations and algorithmic support tools using clinical transformer networks.

48

Gkikas, Stefanos, Nikolaos S. Tachos, Stelios Andreadis, et al. "Multimodal automatic assessment of acute pain through facial videos and heart rate signals utilizing transformer-based architectures." Frontiers in Pain Research 5 (March 27, 2024). http://dx.doi.org/10.3389/fpain.2024.1372814.

Abstract:

Accurate and objective pain evaluation is crucial in developing effective pain management protocols, aiming to alleviate distress and prevent patients from experiencing decreased functionality. A multimodal automatic assessment framework for acute pain utilizing video and heart rate signals is introduced in this study. The proposed framework comprises four pivotal modules: the Spatial Module, responsible for extracting embeddings from videos; the Heart Rate Encoder, tasked with mapping heart rate signals into a higher dimensional space; the AugmNet, designed to create learning-based augmentations in the latent space; and the Temporal Module, which utilizes the extracted video and heart rate embeddings for the final assessment. The Spatial-Module undergoes pre-training on a two-stage strategy: first, with a face recognition objective learning universal facial features, and second, with an emotion recognition objective in a multitask learning approach, enabling the extraction of high-quality embeddings for the automatic pain assessment. Experiments with the facial videos and heart rate extracted from electrocardiograms of the BioVid database, along with a direct comparison to 29 studies, demonstrate state-of-the-art performances in unimodal and multimodal settings, maintaining high efficiency. Within the multimodal context, 82.74% and 39.77% accuracy were achieved for the binary and multi-level pain classification task, respectively, utilizing 9.62 million parameters for the entire framework.

49

Du, Jin-Hong, Zhanrui Cai, and Kathryn Roeder. "Robust probabilistic modeling for single-cell multimodal mosaic integration and imputation via scVAEIT." Proceedings of the National Academy of Sciences 119, no. 49 (2022). http://dx.doi.org/10.1073/pnas.2214414119.

Abstract:

Recent advances in single-cell technologies enable joint profiling of multiple omics. These profiles can reveal the complex interplay of different regulatory layers in single cells; still, new challenges arise when integrating datasets with some features shared across experiments and others exclusive to a single source; combining information across these sources is called mosaic integration. The difficulties lie in imputing missing molecular layers to build a self-consistent atlas, finding a common latent space, and transferring learning to new data sources robustly. Existing mosaic integration approaches based on matrix factorization cannot efficiently adapt to nonlinear embeddings for the latent cell space and are not designed for accurate imputation of missing molecular layers. By contrast, we propose a probabilistic variational autoencoder model, scVAEIT, to integrate and impute multimodal datasets with mosaic measurements. A key advance is the use of a missing mask for learning the conditional distribution of unobserved modalities and features, which makes scVAEIT flexible to combine different panels of measurements from multimodal datasets accurately and in an end-to-end manner. Imputing the masked features serves as a supervised learning procedure while preventing overfitting by regularization. Focusing on gene expression, protein abundance, and chromatin accessibility, we validate that scVAEIT robustly imputes the missing modalities and features of cells biologically different from the training data. scVAEIT also adjusts for batch effects while maintaining the biological variation, which provides better latent representations for the integrated datasets. We demonstrate that scVAEIT significantly improves integration and imputation across unseen cell types, different technologies, and different tissues.

50

Lu, Shanghui, Yong Liang, Le Li, et al. "Inferring circRNA-drug sensitivity associations via dual hierarchical attention networks and multiple kernel fusion." BMC Genomics 24, no. 1 (2023). http://dx.doi.org/10.1186/s12864-023-09899-w.

Abstract:

Increasing evidence has shown that the expression of circular RNAs (circRNAs) can affect the drug sensitivity of cells and significantly influence drug efficacy. Therefore, research into the relationships between circRNAs and drugs can be of great significance in increasing the comprehension of circRNAs function, as well as contributing to the discovery of new drugs and the repurposing of existing drugs. However, it is time-consuming and costly to validate the function of circRNA with traditional medical research methods. Therefore, the development of efficient and accurate computational models that can assist in discovering the potential interactions between circRNAs and drugs is urgently needed. In this study, a novel method is proposed, called DHANMKF, that aims to predict potential circRNA-drug sensitivity interactions for further biomedical screening and validation. Firstly, multimodal networks were constructed by DHANMKF using multiple sources of information on circRNAs and drugs. Secondly, comprehensive intra-type and inter-type node representations were learned using bi-typed multi-relational heterogeneous graphs, which are attention-based encoders utilizing a hierarchical process. Thirdly, the multi-kernel fusion method was used to fuse intra-type embedding and inter-type embedding. Finally, the Dual Laplacian Regularized Least Squares method (DLapRLS) was used to predict the potential circRNA-drug sensitivity associations using the combined kernel in circRNA and drug spaces. Compared with the other methods, DHANMKF obtained the highest AUC value on two datasets. Code is available at https://github.com/cuntjx/DHANMKF.
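The kernel-fusion step mentioned in this abstract can be sketched generically as follows. This is not the DHANMKF code: the Gaussian kernel, the `gamma` values, the equal default weights, and the function names are assumptions for illustration only. The sketch builds one kernel matrix per embedding type and combines them as a normalized weighted sum.

```python
import numpy as np


def gaussian_kernel(features, gamma):
    """Gaussian (RBF) kernel matrix between the rows of `features`."""
    sq_norms = (features ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * features @ features.T
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off
    return np.exp(-gamma * sq_dists)


def fuse_kernels(kernels, weights=None):
    """Combine same-shaped kernel matrices as a normalized weighted sum.

    Equal weights are an assumed default; a real method may learn them.
    """
    kernels = np.stack(kernels)
    if weights is None:
        weights = np.full(len(kernels), 1.0 / len(kernels))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return np.tensordot(weights, kernels, axes=1)


# Toy embeddings for two nodes; values are made up for the example.
intra_type = np.array([[0.0, 0.0], [1.0, 0.0]])
inter_type = np.array([[0.0, 1.0], [0.0, 3.0]])
k_intra = gaussian_kernel(intra_type, gamma=1.0)
k_inter = gaussian_kernel(inter_type, gamma=0.5)
k_fused = fuse_kernels([k_intra, k_inter])
```

A weighted sum with non-negative weights keeps the fused matrix symmetric and positive semidefinite, so it can be passed to a kernel-based predictor in place of any single kernel.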
