
Journal articles on the topic 'Multi-modal dataset'



Consult the top 50 journal articles for your research on the topic 'Multi-modal dataset.'


You can also download the full text of each publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles across a wide variety of disciplines and organise your bibliography correctly.

1

Jeong, Changhoon, Sung-Eun Jang, Sanghyuck Na, and Juntae Kim. "Korean Tourist Spot Multi-Modal Dataset for Deep Learning Applications." Data 4, no. 4 (2019): 139. http://dx.doi.org/10.3390/data4040139.

Abstract:
Recently, deep learning-based methods for solving multi-modal tasks such as image captioning, multi-modal classification, and cross-modal retrieval have attracted much attention. To apply deep learning for such tasks, large amounts of data are needed for training. However, although there are several Korean single-modal datasets, there are not enough Korean multi-modal datasets. In this paper, we introduce a KTS (Korean tourist spot) dataset for Korean multi-modal deep-learning research. The KTS dataset has four modalities (image, text, hashtags, and likes) and consists of 10 classes related to
2

Wang, Fang, Shenglin Yin, Xiaoying Bai, Minghao Hu, Tianwei Yan, and Yi Liang. "M^3EL: A Multi-task Multi-topic Dataset for Multi-modal Entity Linking." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 12 (2025): 12712–20. https://doi.org/10.1609/aaai.v39i12.33386.

Abstract:
Multi-modal Entity Linking (MEL) is a fundamental component for various downstream tasks. However, existing MEL datasets suffer from small scale, scarcity of topic types and limited coverage of tasks, making them incapable of effectively enhancing the entity linking capabilities of multi-modal models. To address these obstacles, we propose a dataset construction pipeline and publish M^3EL, a large-scale dataset for MEL. M^3EL includes 79,625 instances, covering 9 diverse multi-modal tasks, and 5 different topics. In addition, to further improve the model's adaptability to multi-modal tasks, We
3

Ma’sum, Muhammad Anwar. "Intelligent Clustering and Dynamic Incremental Learning to Generate Multi-Codebook Fuzzy Neural Network for Multi-Modal Data Classification." Symmetry 12, no. 4 (2020): 679. http://dx.doi.org/10.3390/sym12040679.

Abstract:
Classification of multi-modal data is one of the challenges in the machine learning field. Multi-modal data need special treatment, as their features are distributed across several areas. This study proposes multi-codebook fuzzy neural networks that use intelligent clustering and dynamic incremental learning for multi-modal data classification. The study utilizes intelligent K-means clustering based on anomalous patterns and intelligent K-means clustering based on histogram information. Clustering is used to generate codebook candidates before the training process, while in
4

Chen, Delong, Jianfeng Liu, Wenliang Dai, and Baoyuan Wang. "Visual Instruction Tuning with Polite Flamingo." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 16 (2024): 17745–53. http://dx.doi.org/10.1609/aaai.v38i16.29727.

Abstract:
Recent research has demonstrated that the multi-task fine-tuning of multi-modal Large Language Models (LLMs) using an assortment of annotated downstream vision-language datasets significantly enhances their performance. Yet, during this process, a side effect, which we termed as the "multi-modal alignment tax", surfaces. This side effect negatively impacts the model's ability to format responses appropriately - for instance, its "politeness" - due to the overly succinct and unformatted nature of raw annotations, resulting in reduced human preference. In this paper, we introduce Polite Flamingo
5

Dai, Yin, Yumeng Song, Weibin Liu, et al. "Multi-Focus Image Fusion Based on Convolution Neural Network for Parkinson’s Disease Image Classification." Diagnostics 11, no. 12 (2021): 2379. http://dx.doi.org/10.3390/diagnostics11122379.

Abstract:
Parkinson’s disease (PD) is a common neurodegenerative disease that has a significant impact on people’s lives. Early diagnosis is imperative since proper treatment stops the disease’s progression. With the rapid development of CAD techniques, there have been numerous applications of computer-aided diagnostic (CAD) techniques in the diagnosis of PD. In recent years, image fusion has been applied in various fields and is valuable in medical diagnosis. This paper mainly adopts a multi-focus image fusion method primarily based on deep convolutional neural networks to fuse magnetic resonance image
6

Ma’sum, Muhammad Anwar, Hadaiq Rolis Sanabila, Petrus Mursanto, and Wisnu Jatmiko. "Clustering versus Incremental Learning Multi-Codebook Fuzzy Neural Network for Multi-Modal Data Classification." Computation 8, no. 1 (2020): 6. http://dx.doi.org/10.3390/computation8010006.

Abstract:
One of the challenges in machine learning is classification of multi-modal data. The problem needs a customized method, as the data have features spread across several areas. This study proposed multi-codebook fuzzy neural network classifiers using clustering and incremental learning approaches to deal with multi-modal data classification. The clustering methods used are K-means and GMM clustering. In experiments on a synthetic dataset, the proposed method achieved the highest performance with 84.76% accuracy, whereas on the benchmark dataset, the proposed method has the highest perfo
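
To make the clustering-generated codebook idea concrete, the following minimal sketch (an illustration assuming scikit-learn and a toy two-regions-per-class dataset, not the authors' fuzzy neural network) builds a per-class codebook with K-means and classifies each sample by its nearest codeword:

import numpy as np
from sklearn.cluster import KMeans

def build_codebooks(X, y, codewords_per_class=4):
    # one K-means codebook per class; centroids act as codewords
    codebooks = {}
    for label in np.unique(y):
        km = KMeans(n_clusters=codewords_per_class, n_init=10, random_state=0)
        km.fit(X[y == label])
        codebooks[label] = km.cluster_centers_
    return codebooks

def predict(X, codebooks):
    # assign each sample to the class owning its nearest codeword
    labels = list(codebooks)
    dists = np.stack(
        [np.linalg.norm(X[:, None, :] - codebooks[c][None, :, :], axis=-1).min(axis=1)
         for c in labels], axis=1)
    return np.array(labels)[dists.argmin(axis=1)]

# toy "multi-modal" data: each class occupies two separate regions
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (50, 2)) for c in ([0, 0], [4, 4], [0, 4], [4, 0])])
y = np.array([0] * 100 + [1] * 100)
print("train accuracy:", (predict(X, build_codebooks(X, y)) == y).mean())
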
7

Suryani, Dewi, Valentino Ekaputra, and Andry Chowanda. "Multi-modal Asian Conversation Mobile Video Dataset for Recognition Task." International Journal of Electrical and Computer Engineering (IJECE) 8, no. 5 (2018): 4042. http://dx.doi.org/10.11591/ijece.v8i5.pp4042-4046.

Abstract:
Images, audio, and videos have been used by researchers for a long time to develop several tasks regarding human facial recognition and emotion detection. Most of the available datasets usually focus on either static expression, a short video of changing emotion from neutral to peak emotion, or difference in sounds to detect the current emotion of a person. Moreover, the common datasets were collected and processed in the United States (US) or Europe, and only several datasets were originated from Asia. In this paper, we present our effort to create a unique dataset that can fill in the gap by
8

Dewi, Suryani, Ekaputra Valentino, and Chowanda Andry. "Multi-modal Asian Conversation Mobile Video Dataset for Recognition Task." International Journal of Electrical and Computer Engineering (IJECE) 8, no. 5 (2018): 4042–46. https://doi.org/10.11591/ijece.v8i5.pp4042-4046.

Abstract:
Images, audio, and videos have been used by researchers for a long time to develop several tasks regarding human facial recognition and emotion detection. Most of the available datasets usually focus on either static expression, a short video of changing emotion from neutral to peak emotion, or difference in sounds to detect the current emotion of a person. Moreover, the common datasets were collected and processed in the United States (US) or Europe, and only several datasets were originated from Asia. In this paper, we present our effort to create a unique dataset that can fill in the gap by
9

Guan, Wenhao, Yishuang Li, Tao Li, et al. "MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 16 (2024): 18117–25. http://dx.doi.org/10.1609/aaai.v38i16.29769.

Abstract:
The style transfer task in Text-to-Speech (TTS) refers to the process of transferring style information into text content to generate corresponding speech with a specific style. However, most existing style transfer approaches are either based on fixed emotional labels or reference speech clips, which cannot achieve flexible style transfer. Recently, some methods have adopted text descriptions to guide style transfer. In this paper, we propose a more flexible multi-modal and style controllable TTS framework named MM-TTS. It can utilize any modality as the prompt in unified multi-modal prompt s
10

Wang, Bingbing, Yiming Du, Bin Liang, et al. "A New Formula for Sticker Retrieval: Reply with Stickers in Multi-Modal and Multi-Session Conversation." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 24 (2025): 25327–35. https://doi.org/10.1609/aaai.v39i24.34720.

Abstract:
Stickers are widely used in online chatting, which can vividly express someone's intention, emotion, or attitude. Existing conversation research typically retrieves stickers based on a single session or the previous textual information, which can not adapt to the multi-modal and multi-session nature of the real-world conversation. To this end, we introduce MultiChat, a new dataset for sticker retrieval facing the multi-modal and multi-session conversation, comprising 1,542 sessions, featuring 50,192 utterances and 2,182 stickers. Based on the created dataset, we propose a novel Intent-Guided S
11

Wang, Zi, Chenglong Li, Aihua Zheng, Ran He, and Jin Tang. "Interact, Embed, and EnlargE: Boosting Modality-Specific Representations for Multi-Modal Person Re-identification." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 3 (2022): 2633–41. http://dx.doi.org/10.1609/aaai.v36i3.20165.

Abstract:
Multi-modal person Re-ID introduces more complementary information to assist the traditional Re-ID task. Existing multi-modal methods ignore the importance of modality-specific information in the feature fusion stage. To this end, we propose a novel method to boost modality-specific representations for multi-modal person Re-ID: Interact, Embed, and EnlargE (IEEE). First, we propose a cross-modal interacting module to exchange useful information between different modalities in the feature extraction phase. Second, we propose a relation-based embedding module to enhance the richness of feature d
12

Hegh, Abya Newton, Adekunle Adedotun Adeyelu, Aamo Iorliam, and Samera U. Otor. "Multi-Modal Emotion Recognition Model Using Generative Adversarial Networks (GANs) for Augmenting Facial Expressions and Physiological Signals." FUDMA Journal of Sciences 9, no. 5 (2025): 277–90. https://doi.org/10.33003/fjs-2025-0905-3412.

Abstract:
Emotion recognition is a critical area of research with applications in healthcare, human-computer interaction (HCI), security, and entertainment. This study addressed the limitations of single-modal emotion recognition systems by developing a multi-modal emotion recognition model that integrates facial expressions and physiological signals, enhanced by Generative Adversarial Networks (GANs). It aims at improving accuracy, reliability, and robustness in emotion detection, particularly underrepresented emotions. The study utilized the FER-2013 dataset for facial expressions and the DEAP dataset
13

Zuo, Jialong, Ying Nie, Tianyu Guo, et al. "L-Man: A Large Multi-modal Model Unifying Human-centric Tasks." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 10 (2025): 11095–103. https://doi.org/10.1609/aaai.v39i10.33206.

Abstract:
Large language models (LLMs) have recently shown notable progress in unifying various visual tasks with an open-ended form. However, when transferred to human-centric tasks, despite their remarkable multi-modal understanding ability in general domains, they lack further human-related domain knowledge and show unsatisfactory performance. Meanwhile, current human-centric unified models are mostly restricted to a pre-defined form and lack open-ended task capability. Therefore, it is necessary to propose a large multi-modal model which utilizes LLMs to unify various human-centric tasks. We forge a
14

Wang, Yueqian, Xiaojun Meng, Yuxuan Wang, Jianxin Liang, Qun Liu, and Dongyan Zhao. "Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 24 (2025): 25425–33. https://doi.org/10.1609/aaai.v39i24.34731.

Abstract:
Multi-modal multi-party conversation (MMC) is a less studied yet important research topic because it fits real-world scenarios well and thus potentially has more widely used applications. Compared with traditional multi-modal conversations, MMC requires stronger character-centered understanding abilities, as many interlocutors appear in both the visual and textual context. To facilitate the study of this problem, we present Friends-MMC in this paper, an MMC dataset that contains 24,000+ unique utterances paired with video context. To explore the character-centered unders
15

Islam, Kh Tohidul, Sudanthi Wijewickrema, and Stephen O’Leary. "A Deep Learning Framework for Segmenting Brain Tumors Using MRI and Synthetically Generated CT Images." Sensors 22, no. 2 (2022): 523. http://dx.doi.org/10.3390/s22020523.

Abstract:
Multi-modal three-dimensional (3-D) image segmentation is used in many medical applications, such as disease diagnosis, treatment planning, and image-guided surgery. Although multi-modal images provide information that no single image modality alone can provide, integrating such information to be used in segmentation is a challenging task. Numerous methods have been introduced to solve the problem of multi-modal medical image segmentation in recent years. In this paper, we propose a solution for the task of brain tumor segmentation. To this end, we first introduce a method of enhancing an exis
16

Tabassum, Israt, and Vimala Nunavath. "A Hybrid Deep Learning Approach for Multi-Class Cyberbullying Classification Using Multi-Modal Social Media Data." Applied Sciences 14, no. 24 (2024): 12007. https://doi.org/10.3390/app142412007.

Abstract:
Cyberbullying involves the use of social media platforms to harm or humiliate people online. Victims may resort to self-harm due to the abuse they experience on these platforms, where users can remain anonymous and spread malicious content. This highlights an urgent need for efficient systems to identify and classify cyberbullying. Many researchers have approached this problem using various methods such as binary and multi-class classification, focusing on text, image, or multi-modal data. While deep learning has advanced cyberbullying detection and classification, the multi-class classificati
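
As a rough illustration of the kind of multi-modal fusion such classifiers rely on, the sketch below (an assumed PyTorch architecture with arbitrary feature sizes and class count, not the paper's hybrid model) projects pre-extracted text and image features into a shared space and concatenates them for multi-class prediction:

import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    # hypothetical feature sizes: 768-d text embeddings, 2048-d image embeddings
    def __init__(self, text_dim=768, image_dim=2048, hidden=256, num_classes=5):
        super().__init__()
        self.text_proj = nn.Sequential(nn.Linear(text_dim, hidden), nn.ReLU())
        self.image_proj = nn.Sequential(nn.Linear(image_dim, hidden), nn.ReLU())
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, text_feat, image_feat):
        # concatenate the projected modalities, then classify
        fused = torch.cat([self.text_proj(text_feat), self.image_proj(image_feat)], dim=-1)
        return self.head(fused)

model = LateFusionClassifier()
logits = model(torch.randn(8, 768), torch.randn(8, 2048))
print(logits.shape)  # torch.Size([8, 5])
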
17

Li, Yangning, Tingwei Lu, Hai-Tao Zheng, et al. "MESED: A Multi-Modal Entity Set Expansion Dataset with Fine-Grained Semantic Classes and Hard Negative Entities." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 8 (2024): 8697–706. http://dx.doi.org/10.1609/aaai.v38i8.28715.

Abstract:
The Entity Set Expansion (ESE) task aims to expand a handful of seed entities with new entities belonging to the same semantic class. Conventional ESE methods are based on mono-modality (i.e., literal modality), which struggle to deal with complex entities in the real world such as (1) Negative entities with fine-grained semantic differences. (2) Synonymous entities. (3) Polysemous entities. (4) Long-tailed entities. These challenges prompt us to propose novel Multi-modal Entity Set Expansion (MESE), where models integrate information from multiple modalities to represent entities. Intuitively
18

Park, Jiho, Kwangryeol Park, and Dongho Kim. "DGU-HAU: A Dataset for 3D Human Action Analysis on Utterances." Electronics 12, no. 23 (2023): 4793. http://dx.doi.org/10.3390/electronics12234793.

Abstract:
Constructing diverse and complex multi-modal datasets is crucial for advancing human action analysis research, providing ground truth annotations for training deep learning networks, and enabling the development of robust models across real-world scenarios. Generating natural and contextually appropriate nonverbal gestures is essential for enhancing immersive and effective human–computer interactions in various applications. These applications include video games, embodied virtual assistants, and conversations within a metaverse. However, existing speech-related human datasets are focused on s
19

Chang, Xin, and Władysław Skarbek. "Multi-Modal Residual Perceptron Network for Audio–Video Emotion Recognition." Sensors 21, no. 16 (2021): 5452. http://dx.doi.org/10.3390/s21165452.

Abstract:
Emotion recognition is an important research field for human–computer interaction. Audio–video emotion recognition is now tackled with deep neural network modeling tools. Published papers, as a rule, show only cases where multi-modality is superior to the audio-only or video-only modality; however, cases where a single modality is superior can also be found. In our research, we hypothesize that for fuzzy categories of emotional events, the within-modal and inter-modal noisy information represented indirectly in the parameters of the modeling neural network impedes better
20

Das, Mithun, Rohit Raj, Punyajoy Saha, Binny Mathew, Manish Gupta, and Animesh Mukherjee. "HateMM: A Multi-Modal Dataset for Hate Video Classification." Proceedings of the International AAAI Conference on Web and Social Media 17 (June 2, 2023): 1014–23. http://dx.doi.org/10.1609/icwsm.v17i1.22209.

Abstract:
Hate speech has become one of the most significant issues in modern society, having implications in both the online and the offline world. Due to this, hate speech research has recently gained a lot of traction. However, most of the work has primarily focused on text media, with relatively little work on images and even less on videos. Thus, early-stage automated video moderation techniques are needed to handle the videos being uploaded and keep platforms safe and healthy. With a view to detecting and removing hateful content from video-sharing platforms, our work focuses on hate vi
21

Li, Chenrui, Kun Gao, Zibo Hu, et al. "CSMR: A Multi-Modal Registered Dataset for Complex Scenarios." Remote Sensing 17, no. 5 (2025): 844. https://doi.org/10.3390/rs17050844.

Abstract:
Complex scenarios pose challenges to tasks in computer vision, including image fusion, object detection, and image-to-image translation. On the one hand, complex scenarios involve fluctuating weather or lighting conditions, where even images of the same scenarios appear to be different. On the other hand, the large amount of textural detail in the given images introduces considerable interference that can conceal the useful information contained in them. An effective solution to these problems is to use the complementary details present in multi-modal images, such as visible-light and infrared
22

Citak, Erol, and Mine Elif Karsligil. "Multi-Modal Low-Data-Based Learning for Video Classification." Applied Sciences 14, no. 10 (2024): 4272. http://dx.doi.org/10.3390/app14104272.

Abstract:
Video classification is a challenging task in computer vision that requires analyzing the content of a video to assign it to one or more predefined categories. However, due to the vast amount of visual data contained in videos, the classification process is often computationally expensive and requires a significant amount of annotated data. Because of these reasons, the low-data-based video classification area, which consists of few-shot and zero-shot tasks, is proposed as a potential solution to overcome traditional video classification-oriented challenges. However, existing low-data area dat
23

He, Yanzhong, Yanjiao Zhang, and Lin Zhu. "Improving Chinese cross-modal retrieval with multi-modal transportation data." Journal of Physics: Conference Series 2813, no. 1 (2024): 012014. http://dx.doi.org/10.1088/1742-6596/2813/1/012014.

Abstract:
As societal development progresses and individual travel needs evolve, the demand for acquiring multi-modal transportation status information has steadily increased. The present domain of transportation status encompasses a wealth of multi-modal information, including vehicle trajectory data, traffic condition visual imagery, and textual information. Acquiring multi-modal transportation status information facilitates a rapid understanding of the prevailing traffic conditions at a given location. In this study, we investigate multi-modal transportation status data encompassing trajecto
24

Qin, Jinghui, Changsong Liu, Tianchi Tang, et al. "Mental-Perceiver: Audio-Textual Multi-Modal Learning for Estimating Mental Disorders." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 23 (2025): 25029–37. https://doi.org/10.1609/aaai.v39i23.34687.

Abstract:
Mental disorders, such as anxiety and depression, have become a global concern that affects people of all ages. Early detection and treatment are crucial to mitigate the negative effects these disorders can have on daily life. Although AI-based detection methods show promise, progress is hindered by the lack of publicly available large-scale datasets. To address this, we introduce the Multi-Modal Psychological assessment corpus (MMPsy), a large-scale dataset containing audio recordings and transcripts from Mandarin-speaking adolescents undergoing automated anxiety/depression assessment intervi
25

Ni, Peizhou, Xu Li, Wang Xu, Xiaojing Zhou, Tao Jiang, and Weiming Hu. "Robust 3D Semantic Segmentation Method Based on Multi-Modal Collaborative Learning." Remote Sensing 16, no. 3 (2024): 453. http://dx.doi.org/10.3390/rs16030453.

Abstract:
Since camera and LiDAR sensors provide complementary information for the 3D semantic segmentation of intelligent vehicles, extensive efforts have been invested to fuse information from multi-modal data. Despite considerable advantages, fusion-based methods still have inevitable limitations: field-of-view disparity between two modal inputs, demanding precise paired data as inputs in both the training and inferring stages, and consuming more resources. These limitations pose significant obstacles to the practical application of fusion-based methods in real-world scenarios. Therefore, we propose
26

Doyle, Daniel, and Ovidiu Şerban. "Interruption Audio & Transcript: Derived from Group Affect and Performance Dataset." Data 9, no. 9 (2024): 104. http://dx.doi.org/10.3390/data9090104.

Abstract:
Despite the widespread development and use of chatbots, there is a lack of audio-based interruption datasets. This study provides a dataset of 200 manually annotated interruptions from a broader set of 355 data points of overlapping utterances. The dataset is derived from the Group Affect and Performance dataset managed by the University of the Fraser Valley, Canada. It includes both audio files and transcripts, allowing for multi-modal analysis. Given the extensive literature and the varied definitions of interruptions, it was necessary to establish precise definitions. The study aims to prov
27

Pan, Xuran, Kexing Xu, Shuhao Yang, Yukun Liu, Rui Zhang, and Ping He. "SDA-Net: A Spatially Optimized Dual-Stream Network with Adaptive Global Attention for Building Extraction in Multi-Modal Remote Sensing Images." Sensors 25, no. 7 (2025): 2112. https://doi.org/10.3390/s25072112.

Abstract:
Building extraction plays a pivotal role in enabling rapid and accurate construction of urban maps, thereby supporting urban planning, smart city development, and urban management. Buildings in remote sensing imagery exhibit diverse morphological attributes and spectral signatures, yet their reliable interpretation through single-modal data remains constrained by heterogeneous terrain conditions, occlusions, and spatially variable illumination effects inherent to complex geographical landscapes. The integration of multi-modal data for building extraction offers significant advantages by levera
28

Jiang, Jiali. "Multimodal Emotion Recognition Based on Deep Learning." International Journal of Computer Science and Information Technology 5, no. 2 (2025): 71–80. https://doi.org/10.62051/ijcsit.v5n2.10.

Abstract:
In recent years, multitask learning-based joint analysis of multiple emotions has emerged as a significant research topic in natural language processing and artificial intelligence. This approach aims to identify multiple emotion categories expressed in discourse by integrating multimodal information and leveraging shared knowledge across related tasks. Sentiment analysis, emotion recognition, and sarcasm detection constitute three closely interconnected tasks in affective computing. This paper focuses on these three tasks - sentiment analysis, emotion recognition, and sarcasm detection - whil
29

Wei, Haoran, Pranav Chopada, and Nasser Kehtarnavaz. "C-MHAD: Continuous Multimodal Human Action Dataset of Simultaneous Video and Inertial Sensing." Sensors 20, no. 10 (2020): 2905. http://dx.doi.org/10.3390/s20102905.

Abstract:
Existing public domain multi-modal datasets for human action recognition only include actions of interest that have already been segmented from action streams. These datasets cannot be used to study a more realistic action recognition scenario where actions of interest occur randomly and continuously among actions of non-interest or no actions. It is more challenging to recognize actions of interest in continuous action streams since the starts and ends of these actions are not known and need to be determined in an on-the-fly manner. Furthermore, there exists no public domain multi-modal datas
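
Recognizing actions of interest in continuous streams typically starts from on-the-fly windowing of the sensor data. The snippet below is a minimal sketch of that step (the window length, stride, and stand-in scoring rule are arbitrary assumptions, not part of the dataset or the authors' pipeline):

import numpy as np

def sliding_windows(stream, window=150, stride=30):
    # yield (start_index, segment) pairs over a (T, channels) stream
    for start in range(0, len(stream) - window + 1, stride):
        yield start, stream[start:start + window]

stream = np.random.randn(2000, 6)  # e.g., a 6-axis inertial recording
for start, segment in sliding_windows(stream):
    score = np.abs(segment).mean()  # stand-in for a real classifier's confidence
    if score > 0.8:
        print(f"candidate action of interest at samples {start}-{start + len(segment)}")
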
30

Ruiz de Oña, Esteban, Inés Barbero-García, Diego González-Aguilera, Fabio Remondino, Pablo Rodríguez-Gonzálvez, and David Hernández-López. "PhotoMatch: An Open-Source Tool for Multi-View and Multi-Modal Feature-Based Image Matching." Applied Sciences 13, no. 9 (2023): 5467. http://dx.doi.org/10.3390/app13095467.

Abstract:
The accurate and reliable extraction and matching of distinctive features (keypoints) in multi-view and multi-modal datasets is still an open research topic in the photogrammetric and computer vision communities. However, one of the main milestones is selecting which method is a suitable choice for specific applications. This encourages us to develop an educational tool that encloses different hand-crafted and learning-based feature-extraction methods. This article presents PhotoMatch, a didactical, open-source tool for multi-view and multi-modal feature-based image matching. The software incl
31

Tao, Rui, Meng Zhu, Haiyan Cao, and Honge Ren. "Fine-Grained Cross-Modal Semantic Consistency in Natural Conservation Image Data from a Multi-Task Perspective." Sensors 24, no. 10 (2024): 3130. http://dx.doi.org/10.3390/s24103130.

Abstract:
Fine-grained representation is fundamental to species classification based on deep learning, and in this context, cross-modal contrastive learning is an effective method. The diversity of species coupled with the inherent contextual ambiguity of natural language poses a primary challenge in the cross-modal representation alignment of conservation area image data. Integrating cross-modal retrieval tasks with generation tasks contributes to cross-modal representation alignment based on contextual understanding. However, during the contrastive learning process, apart from learning the differences
32

Chen, Yatong, Chenzhi Hu, Tomoyoshi Kimura, et al. "SemiCMT: Contrastive Cross-Modal Knowledge Transfer for IoT Sensing with Semi-Paired Multi-Modal Signals." Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, no. 4 (2024): 1–30. http://dx.doi.org/10.1145/3699779.

Abstract:
This paper proposes a novel contrastive cross-modal knowledge transfer framework, SemiCMT, for multi-modal IoT sensing applications. It effectively transfers the feature extraction capability (also called knowledge) learned from a source modality (e.g., acoustic signals) with abundant unlabeled training data, to a target modality (e.g., seismic signals) that lacks enough training data, in a self-supervised manner with the help of only a small set of synchronized multi-modal pairs. The transferred model can be quickly finetuned to downstream target-modal tasks with only limited labels. The key
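
The generic building block behind such contrastive cross-modal transfer is a symmetric InfoNCE-style objective over synchronized modality pairs. Below is a minimal sketch under assumed settings (PyTorch, illustrative embedding sizes and temperature), not the SemiCMT implementation:

import torch
import torch.nn.functional as F

def cross_modal_info_nce(z_source, z_target, temperature=0.07):
    # matched (synchronized) pairs sit on the diagonal of the similarity matrix;
    # the loss pulls them together and pushes mismatched pairs apart
    z_s = F.normalize(z_source, dim=-1)
    z_t = F.normalize(z_target, dim=-1)
    logits = z_s @ z_t.t() / temperature
    targets = torch.arange(z_s.size(0), device=z_s.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = cross_modal_info_nce(torch.randn(16, 128), torch.randn(16, 128))
print(loss.item())
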
33

Zhu, Wenming, Jia Zhou, Zizhe Wang, et al. "Three-Dimensional Object Detection Network Based on Multi-Layer and Multi-Modal Fusion." Electronics 13, no. 17 (2024): 3512. http://dx.doi.org/10.3390/electronics13173512.

Abstract:
Cameras and LiDAR are important sensors in autonomous driving systems that can provide complementary information to each other. However, most LiDAR-only methods outperform the fusion method on the main benchmark datasets. Current studies attribute the reasons for this to misalignment of views and difficulty in matching heterogeneous features. Specially, using the single-stage fusion method, it is difficult to fully fuse the features of the image and point cloud. In this work, we propose a 3D object detection network based on the multi-layer and multi-modal fusion (3DMMF) method. 3DMMF works by
34

Xu, Yangshuyi, Lin Zhang, and Xiang Shen. "Multi-modal adaptive gated mechanism for visual question answering." PLOS ONE 18, no. 6 (2023): e0287557. http://dx.doi.org/10.1371/journal.pone.0287557.

Abstract:
Visual Question Answering (VQA) is a multimodal task that uses natural language to ask and answer questions based on image content. For multimodal tasks, obtaining accurate modality feature information is crucial. Existing research on visual question answering models mainly starts from the perspective of attention mechanisms and multimodal fusion, and tends to ignore the impact on overall model performance of modal interaction learning and of the noise introduced during modal fusion. This paper proposes a novel and efficient multimodal adaptive
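
A gated fusion of visual and textual features, of the general kind referred to above, can be sketched as follows (an assumed toy PyTorch module with an arbitrary feature size, not the paper's adaptive gated mechanism):

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    # a learned sigmoid gate decides, per dimension, how much of the visual
    # versus textual representation is passed on
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, visual, textual):
        g = torch.sigmoid(self.gate(torch.cat([visual, textual], dim=-1)))
        return g * visual + (1 - g) * textual

fused = GatedFusion()(torch.randn(4, 512), torch.randn(4, 512))
print(fused.shape)  # torch.Size([4, 512])
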
35

Flores Fernández, Alberto, Jonas Wurst, Eduardo Sánchez Morales, Michael Botsch, Christian Facchi, and Andrés García Higuera. "Probabilistic Traffic Motion Labeling for Multi-Modal Vehicle Route Prediction." Sensors 22, no. 12 (2022): 4498. http://dx.doi.org/10.3390/s22124498.

Abstract:
The prediction of the motion of traffic participants is a crucial aspect for the research and development of Automated Driving Systems (ADSs). Recent approaches are based on multi-modal motion prediction, which requires the assignment of a probability score to each of the multiple predicted motion hypotheses. However, there is a lack of ground truth for this probability score in the existing datasets. This implies that current Machine Learning (ML) models evaluate the multiple predictions by comparing them with the single real trajectory labeled in the dataset. In this work, a novel data-based
36

Liu, K., A. Wu, X. Wan, and S. Li. "MRSSC: A Benchmark Dataset for Multimodal Remote Sensing Scene Classification." International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLIII-B2-2021 (June 28, 2021): 785–92. http://dx.doi.org/10.5194/isprs-archives-xliii-b2-2021-785-2021.

Abstract:
Scene classification based on multi-source remote sensing imagery is important for image interpretation and has many applications, such as change detection, visual navigation and image retrieval. Deep learning has become a research hotspot in the field of remote sensing scene classification, and datasets are an important driving force in its development. Most remote sensing scene classification datasets consist of optical images, and multimodal datasets are relatively rare. Existing datasets that contain both optical and SAR data, such as SARptical and WHU-SEN-City, which mainly
37

Guo, Zihan, Xiang Shen, and Chongqing Chen. "TBKIN: Threshold-based explicit selection for enhanced cross-modal semantic alignments." PLOS One 20, no. 6 (2025): e0325543. https://doi.org/10.1371/journal.pone.0325543.

Abstract:
Vision-language models aim to seamlessly integrate visual and linguistic information for multi-modal tasks, demanding precise semantic alignments between image-text pairs while minimizing the influence of irrelevant data. While existing methods leverage intra-modal and cross-modal knowledge to enhance alignments, they often fall short in sufficiently reducing interference, which ultimately constrains model performance. To address this gap, we propose a novel vision-language model, the threshold-based knowledge integration network (TBKIN), designed to effectively capture intra-modal and cross-m
38

Wang, Jian, Haisen Li, Guanying Huo, Chao Li, and Yuhang Wei. "Multi-Modal Multi-Stage Underwater Side-Scan Sonar Target Recognition Based on Synthetic Images." Remote Sensing 15, no. 5 (2023): 1303. http://dx.doi.org/10.3390/rs15051303.

Abstract:
Due to the small sample size of underwater acoustic data and the strong noise interference caused by seabed reverberation, recognizing underwater targets in Side-Scan Sonar (SSS) images is challenging. Using a transfer-learning-based recognition method to train the backbone network on a large optical dataset (ImageNet) and fine-tuning the head network with a small SSS image dataset can improve the classification of sonar images. However, optical and sonar images have different statistical characteristics, directly affecting transfer-learning-based target recognition. In order to improve the ac
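
The generic transfer-learning recipe mentioned above, freezing an ImageNet-pretrained backbone and retraining only a small head on the target images, can be sketched as follows (the torchvision backbone choice and the four-class head are assumptions for illustration, not the authors' network):

import torch.nn as nn
from torchvision import models

# load an ImageNet-pretrained backbone and freeze its feature extractor
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in backbone.parameters():
    p.requires_grad = False
# replace the head with a small classifier for the (hypothetical) four sonar classes
backbone.fc = nn.Linear(backbone.fc.in_features, 4)
trainable = [p for p in backbone.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters")
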
39

Zhang, Pengyu, Dong Wang, and Huchuan Lu. "Multi-modal visual tracking: Review and experimental comparison." Computational Visual Media 10, no. 2 (2024): 193–214. http://dx.doi.org/10.1007/s41095-023-0345-5.

Abstract:
Visual object tracking has been drawing increasing attention in recent years, as a fundamental task in computer vision. To extend the range of tracking applications, researchers have been introducing information from multiple modalities to handle specific scenes, with promising research prospects for emerging methods and benchmarks. To provide a thorough review of multi-modal tracking, different aspects of multi-modal tracking algorithms are summarized under a unified taxonomy, with specific focus on visible-depth (RGB-D) and visible-thermal (RGB-T) tracking. Subsequently, a detailed d
40

Stephan, Benedict, Mona Köhler, Steffen Müller, Yan Zhang, Horst-Michael Gross, and Gunther Notni. "OHO: A Multi-Modal, Multi-Purpose Dataset for Human-Robot Object Hand-Over." Sensors 23, no. 18 (2023): 7807. http://dx.doi.org/10.3390/s23187807.

Abstract:
In the context of collaborative robotics, handing over hand-held objects to a robot is a safety-critical task. Therefore, a robust distinction between human hands and presented objects in image data is essential to avoid contact with robotic grippers. To be able to develop machine learning methods for solving this problem, we created the OHO (Object Hand-Over) dataset of tools and other everyday objects being held by human hands. Our dataset consists of color, depth, and thermal images with the addition of pose and shape information about the objects in a real-world scenario. Although the focu
41

Barbato, Mirko Paolo, Flavio Piccoli, and Paolo Napoletano. "Ticino: A multi-modal remote sensing dataset for semantic segmentation." Expert Systems with Applications 249 (September 2024): 123600. http://dx.doi.org/10.1016/j.eswa.2024.123600.

42

Xiao, Yun, Dan Cao, Chenglong Li, Bo Jiang, and Jin Tang. "A benchmark dataset for high-altitude UAV multi-modal tracking." Journal of Image and Graphics 30, no. 2 (2025): 361–74. https://doi.org/10.11834/jig.240040.

43

Singh, Apoorva, Soumyodeep Dey, Anamitra Singha, and Sriparna Saha. "Sentiment and Emotion-Aware Multi-Modal Complaint Identification." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 11 (2022): 12163–71. http://dx.doi.org/10.1609/aaai.v36i11.21476.

Abstract:
The expression of displeasure on a consumer's behalf towards an organization, product, or event is denoted via the speech act known as complaint. Customers typically post reviews on retail websites and various social media platforms about the products or services they purchase, and the reviews may include complaints about the products or services. Automatic detection of consumers' complaints about items or services they buy can be critical for organizations and online merchants since they can use this insight to meet the customers' requirements, including handling and addressing the complaints
44

Yang, Junhuan, Yuzhou Zhang, Yi Sheng, Youzuo Lin, and Lei Yang. "A Novel Diffusion Model for Pairwise Geoscience Data Generation with Unbalanced Training Dataset." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 20 (2025): 21965–73. https://doi.org/10.1609/aaai.v39i20.35504.

Abstract:
Recently, the advent of generative AI technologies has made transformational impacts on our daily lives, yet its application in scientific applications remains in its early stages. Data scarcity is a major, well-known barrier in data-driven scientific computing, so physics-guided generative AI holds significant promise. In scientific computing, most tasks study the conversion of multiple data modalities to describe physical phenomena, for example, spatial and waveform in seismic imaging, time and frequency in signal processing, and temporal and spectral in climate modeling; as such, multi-moda
45

Yang, Fan, Xiaozhi Men, Yangsheng Liu, et al. "Estimation of Landslide and Mudslide Susceptibility with Multi-Modal Remote Sensing Data and Semantics: The Case of Yunnan Mountain Area." Land 12, no. 10 (2023): 1949. http://dx.doi.org/10.3390/land12101949.

Abstract:
Landslide and mudslide susceptibility predictions play a crucial role in environmental monitoring, ecological protection, settlement planning, etc. Currently, multi-modal remote sensing data have been used for precise landslide and mudslide disaster prediction with spatial details, spectral information, or terrain attributes. However, features regarding landslide and mudslide susceptibility are often hidden in multi-modal remote sensing images, beyond the features extracted and learnt by deep learning approaches. This paper reports our efforts to conduct landslide and mudslide susceptibility p
46

Doan, Huong-Giang, and Ngoc-Trung Nguyen. "End-to-end multiple modals deep learning system for hand posture recognition." Indonesian Journal of Electrical Engineering and Computer Science 27, no. 1 (2022): 214–21. https://doi.org/10.11591/ijeecs.v27.i1.pp214-221.

Abstract:
A multi-modal or multi-view dataset is captured from various sources (e.g., RGB and depth) of a subject at the same time. Combining different cues still faces many challenges related to unique data and complementary information. In addition, typical methods for multi-modality recognition consist of discrete blocks, such as extracting features from separate data flows, combining features, and classifying gestures. To address these challenges, we proposed two novel end-to-end hand posture recognition frameworks, which integrate all steps into a convolutional neural network
47

Gao, Jingsheng, Jiacheng Ruan, Suncheng Xiang, et al. "LAMM: Label Alignment for Multi-Modal Prompt Learning." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 3 (2024): 1815–23. http://dx.doi.org/10.1609/aaai.v38i3.27950.

Abstract:
With the success of pre-trained visual-language (VL) models such as CLIP in visual representation tasks, transferring pre-trained models to downstream tasks has become a crucial paradigm. Recently, the prompt tuning paradigm, which draws inspiration from natural language processing (NLP), has made significant progress in VL field. However, preceding methods mainly focus on constructing prompt templates for text and visual inputs, neglecting the gap in class label representations between the VL models and downstream tasks. To address this challenge, we introduce an innovative label alignment me
48

Qu, Fang, Youqiang Sun, Man Zhou, et al. "Vegetation Land Segmentation with Multi-Modal and Multi-Temporal Remote Sensing Images: A Temporal Learning Approach and a New Dataset." Remote Sensing 16, no. 1 (2023): 3. http://dx.doi.org/10.3390/rs16010003.

Abstract:
In recent years, remote sensing analysis has gained significant attention in visual analysis applications, particularly in segmenting and recognizing remote sensing images. However, the existing research has predominantly focused on single-period RGB image analysis, thus overlooking the complexities of remote sensing image capture, especially in highly vegetated land parcels. In this paper, we provide a large-scale vegetation remote sensing (VRS) dataset and introduce the VRS-Seg task for multi-modal and multi-temporal vegetation segmentation. The VRS dataset incorporates diverse modalities an
49

Ortega, Juan Diego, Paola Natalia Cañas, Marcos Nieto, Oihana Otaegui, and Luis Salgado. "Challenges of Large-Scale Multi-Camera Datasets for Driver Monitoring Systems." Sensors 22, no. 7 (2022): 2554. http://dx.doi.org/10.3390/s22072554.

Abstract:
Tremendous advances in advanced driver assistance systems (ADAS) have been possible thanks to the emergence of deep neural networks (DNN) and Big Data (BD) technologies. Huge volumes of data can be managed and consumed as training material to create DNN models which feed functions such as lane keeping systems (LKS), automated emergency braking (AEB), lane change assistance (LCA), etc. In the ADAS/AD domain, these advances are only possible thanks to the creation and publication of large and complex datasets, which can be used by the scientific community to benchmark and leverage research and d
50

Li, Boao. "Multi-modal sentiment analysis based on graph neural network." Applied and Computational Engineering 6, no. 1 (2023): 792–98. http://dx.doi.org/10.54254/2755-2721/6/20230918.

Abstract:
Thanks to the popularity of social media, people are witnessing a rapid proliferation of posts with various modalities. These multi-modal expressions share certain characteristics, including the interdependence of objects in the posted images, which previous research has sometimes overlooked by focusing on single image-text posts and paying little attention to global features. In this paper, a neural network with multiple channels for image-text sentiment detection is proposed. The first step is to encode text and images to capture implicit tendencies.