Journal articles on the topic 'Video based modality'

Consult the top 50 journal articles for your research on the topic 'Video based modality.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles from a wide variety of disciplines and organise your bibliography correctly.

1

Oh, Changhyeon, and Yuseok Ban. "Cross-Modality Interaction-Based Traffic Accident Classification." Applied Sciences 14, no. 5 (2024): 1958. http://dx.doi.org/10.3390/app14051958.

Abstract:
Traffic accidents on the road lead to serious personal and material damage. Furthermore, preventing secondary accidents caused by traffic accidents is crucial. As various technologies for detecting traffic accidents in videos using deep learning are being researched, this paper proposes a method to classify accident videos based on a video highlight detection network. To utilize video highlight detection for traffic accident classification, we generate information using the existing traffic accident videos. Moreover, we introduce the Car Crash Highlights Dataset (CCHD). This dataset contains a variety of weather conditions, such as snow, rain, and clear skies, as well as multiple types of traffic accidents. We compare and analyze the performance of various video highlight detection networks in traffic accident detection, thereby presenting an efficient video feature extraction method according to the accident and the optimal video highlight detection network. For the first time, we have applied video highlight detection networks to the task of traffic accident classification. In the task, the most superior video highlight detection network achieves a classification performance of up to 79.26% when using video, audio, and text as inputs, compared to using video and text alone. Moreover, we elaborated the analysis of our approach in the aspects of cross-modality interaction, self-attention and cross-attention, feature extraction, and negative loss.
2

Wang, Xingrun, Xiushan Nie, Xingbo Liu, Binze Wang, and Yilong Yin. "Modality correlation-based video summarization." Multimedia Tools and Applications 79, no. 45-46 (2020): 33875–90. http://dx.doi.org/10.1007/s11042-020-08690-3.

3

Jang, Jaeyoung, Yuseok Ban, and Kyungjae Lee. "Dual-Modality Cross-Interaction-Based Hybrid Full-Frame Video Stabilization." Applied Sciences 14, no. 10 (2024): 4290. http://dx.doi.org/10.3390/app14104290.

Abstract:
This study aims to generate visually useful imagery by preventing cropping while maintaining resolution and minimizing the degradation of stability and distortion to enhance the stability of a video for Augmented Reality applications. The focus is placed on conducting research that balances maintaining execution speed with performance improvements. By processing Inertial Measurement Unit (IMU) sensor data using the Versatile Quaternion-based Filter algorithm and optical flow, our research first applies motion compensation to frames of input video. To address cropping, PCA-flow-based video stabilization is then performed. Furthermore, to mitigate distortion occurring during the full-frame video creation process, neural rendering is applied, resulting in the output of stabilized frames. The anticipated effect of using an IMU sensor is the production of full-frame videos that maintain visual quality while increasing the stability of a video. Our technique contributes to correcting video shakes and has the advantage of generating visually useful imagery at low cost. Thus, we propose a novel hybrid full-frame video stabilization algorithm that produces full-frame videos after motion compensation with an IMU sensor. Evaluating our method against three metrics, the Stability score, Distortion value, and Cropping ratio, results indicated that stabilization was more effectively achieved with robustness to flow inaccuracy when effectively using an IMU sensor. In particular, among the evaluation outcomes, within the “Turn” category, our method exhibited an 18% enhancement in the Stability score and a 3% improvement in the Distortion value compared to the average results of previously proposed full-frame video stabilization-based methods, including PCA flow, neural rendering, and DIFRINT.
4

Rahmad, Nur Azmina, Muhammad Amir As'ari, Nurul Fathiah Ghazali, Norazman Shahar, and Nur Anis Jasmin Sufri. "A Survey of Video Based Action Recognition in Sports." Indonesian Journal of Electrical Engineering and Computer Science 11, no. 3 (2018): 987–93. https://doi.org/10.11591/ijeecs.v11.i3.pp987-993.

Abstract:
Sport performance analysis, which is crucial in sport practice, is used to improve the performance of athletes during games. Many studies and investigations have been done on detecting different player movements for notational analysis using either sensor-based or video-based modalities. Recently, the vision-based modality has become a research interest due to the rapid development of online video transmission. A tremendous number of experimental studies have been done using the vision-based modality in sport, but only a few review studies have been published previously. Hence, we provide a review of video-based techniques for recognizing sport actions toward establishing an automated notational analysis system. The paper is organized into four parts. Firstly, we provide an overview of the current existing technologies of video-based sports intelligence systems. Secondly, we review the framework of action recognition in all fields before we further discuss the implementation of deep learning in the vision-based modality for sport actions. Finally, the paper summarizes further trends and research directions in action recognition for sports using the video approach. We believe that this review would be very beneficial in providing a complete overview of video-based action recognition in sports.
5

Zhang, Beibei, Tongwei Ren, and Gangshan Wu. "Text-Guided Nonverbal Enhancement Based on Modality-Invariant and -Specific Representations for Video Speaking Style Recognition." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 21 (2025): 22354–62. https://doi.org/10.1609/aaai.v39i21.34391.

Abstract:
Video speaking style recognition (VSSR) aims to classify different types of conversations in videos, contributing significantly to understanding human interactions. A significant challenge in VSSR is the inherent similarity among conversation videos, which makes it difficult to distinguish between different speaking styles. Existing VSSR methods strive to provide available multimodal information to enhance the differentiation of conversation videos. Nevertheless, treating each modality equally leads to suboptimal results because text is inherently more aligned with conversation understanding than nonverbal modalities. To address this issue, we propose a text-guided nonverbal enhancement method, TNvE, which is composed of two core modules: 1) a text-guided nonverbal representation selection module employs cross-modal attention based on modality-invariant representations, picking out critical nonverbal information via textual guidance; and 2) a modality-invariant and -specific representation decoupling module incorporates modality-specific representations and decouples them from modality-invariant representations, enabling a more comprehensive understanding of multimodal data. The former module encourages multimodal representations to move close to each other, while the latter module provides unique characteristics of each modality as a supplement. Extensive experiments are conducted on long-form video understanding datasets to demonstrate that TNvE is highly effective for VSSR, achieving a new state-of-the-art.
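
The text-guided selection step described in this abstract can be pictured as standard cross-modal attention in which textual features act as queries over nonverbal keys and values. The sketch below is a minimal PyTorch illustration of that idea, not the authors' implementation; the dimensions and module names are assumptions.

```python
import torch
import torch.nn as nn

class TextGuidedSelector(nn.Module):
    """Minimal sketch: text tokens attend over nonverbal (audio/visual) tokens."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_feats, nonverbal_feats):
        # text_feats: (B, T_text, dim), nonverbal_feats: (B, T_nv, dim)
        selected, weights = self.attn(query=text_feats,
                                      key=nonverbal_feats,
                                      value=nonverbal_feats)
        return selected, weights  # nonverbal information re-weighted by textual guidance

# toy usage
sel = TextGuidedSelector()
out, w = sel(torch.randn(2, 10, 256), torch.randn(2, 40, 256))
print(out.shape)  # torch.Size([2, 10, 256])
```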
6

Zong, Linlin, Wenmin Lin, Jiahui Zhou, et al. "Text-Guided Fine-grained Counterfactual Inference for Short Video Fake News Detection." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 1 (2025): 1237–45. https://doi.org/10.1609/aaai.v39i1.32112.

Abstract:
Detecting fake news in short videos is crucial for combating misinformation. Existing methods utilize topic modeling and co-attention mechanisms, overlooking modality heterogeneity and resulting in suboptimal performance. To address this issue, we introduce Text-Guided Fine-grained Counterfactual Inference for Short Video Fake News detection (TGFC-SVFN). TGFC-SVFN leverages modality bias removal and teacher-model-enhanced inter-modal knowledge distillation to integrate the heterogeneous modalities in short videos. Specifically, we use text guided by causality-based reasoning prompts as the teacher model, which then transfers knowledge to the video and audio student models. Subsequently, a multi-head attention mechanism is employed to fuse information from different modalities. In each module, we utilize fine-grained counterfactual inference based on a diffusion model to eliminate modality bias. Experimental results on publicly available fake short video news datasets demonstrate that our method outperforms state-of-the-art techniques.
7

Li, Yun, Su Wang, Jiawei Mo, and Xin Wei. "An Underwater Multi-Label Classification Algorithm Based on a Bilayer Graph Convolution Learning Network with Constrained Codec." Electronics 13, no. 16 (2024): 3134. http://dx.doi.org/10.3390/electronics13163134.

Abstract:
Within the domain of multi-label classification for micro-videos, utilizing terrestrial datasets as a foundation, researchers have embarked on profound endeavors yielding extraordinary accomplishments. The research into multi-label classification based on underwater micro-video datasets is still in the preliminary stage. There are some challenges: the severe color distortion and visual blurring in underwater visual imaging due to water molecular scattering and absorption, the difficulty in acquiring underwater short video datasets, the sparsity of underwater short video modality features, and the formidable task of achieving high-precision underwater multi-label classification. To address these issues, a bilayer graph convolution learning network based on constrained codec (BGCLN) is established in this paper. Specifically, modality-common representation is constructed to complete the representation of common information and specific information based on the constrained codec network. Then, the attention-driven double-layer graph convolutional network module is designed to mine the correlation information between labels and enhance the modality representation. Finally, the combined modality representation fusion and multi-label classification module are used to obtain the category classifier prediction. In the underwater video multi-label classification dataset (UVMCD), the effectiveness and high classification accuracy of the proposed BGCLN have been proved by numerous experiments.
8

Rahmad, Nur Azmina, Muhammad Amir As'ari, Nurul Fathiah Ghazali, Norazman Shahar, and Nur Anis Jasmin Sufri. "A Survey of Video Based Action Recognition in Sports." Indonesian Journal of Electrical Engineering and Computer Science 11, no. 3 (2018): 987. http://dx.doi.org/10.11591/ijeecs.v11.i3.pp987-993.

Abstract:
Sport performance analysis, which is crucial in sport practice, is used to improve the performance of athletes during games. Many studies and investigations have been done on detecting different player movements for notational analysis using either sensor-based or video-based modalities. Recently, the vision-based modality has become a research interest due to the rapid development of online video transmission. A tremendous number of experimental studies have been done using the vision-based modality in sport, but only a few review studies have been published previously. Hence, we provide a review of video-based techniques for recognizing sport actions toward establishing an automated notational analysis system. The paper is organized into four parts. Firstly, we provide an overview of the current existing technologies of video-based sports intelligence systems. Secondly, we review the framework of action recognition in all fields before we further discuss the implementation of deep learning in the vision-based modality for sport actions. Finally, the paper summarizes further trends and research directions in action recognition for sports using the video approach. We believe that this review would be very beneficial in providing a complete overview of video-based action recognition in sports.
9

Zawali, Bako, Richard A. Ikuesan, Victor R. Kebande, Steven Furnell, and Arafat A-Dhaqm. "Realising a Push Button Modality for Video-Based Forensics." Infrastructures 6, no. 4 (2021): 54. http://dx.doi.org/10.3390/infrastructures6040054.

Abstract:
Complexity and sophistication among multimedia-based tools have made it easy for perpetrators to conduct digital crimes such as counterfeiting, modification, and alteration without being detected. It may not be easy to verify the integrity of video content that, for example, has been manipulated digitally. To address this perennial investigative challenge, this paper proposes the integration of a forensically sound push button forensic modality (PBFM) model for the investigation of the MP4 video file format as a step towards automated video forensic investigation. An open-source multimedia forensic tool was developed based on the proposed PBFM model. A comprehensive evaluation of the efficiency of the tool against file alteration showed that the tool was capable of identifying falsified files, which satisfied the underlying assertion of the PBFM model. Furthermore, the outcome can be used as a complementary process for enhancing the evidence admissibility of MP4 video for forensic investigation.
10

Waykar, Sanjay B., and C. R. Bharathi. "Multimodal Features and Probability Extended Nearest Neighbor Classification for Content-Based Lecture Video Retrieval." Journal of Intelligent Systems 26, no. 3 (2017): 585–99. http://dx.doi.org/10.1515/jisys-2016-0041.

Abstract:
Due to the ever-increasing number of digital lecture libraries and lecture video portals, the challenge of retrieving lecture videos has become a very significant and demanding task in recent years. Accordingly, the literature presents different techniques for video retrieval by considering video contents as well as signal data. Here, we propose a lecture video retrieval system using multimodal features and probability extended nearest neighbor (PENN) classification. There are two modalities utilized for feature extraction. One is textual information, which is determined from the lecture video using optical character recognition. The second modality utilized to preserve video content is local vector pattern. These two modal features are extracted, and the retrieval of videos is performed using the proposed PENN classifier, which is the extension of the extended nearest neighbor classifier, by considering the different weightages for the first-level and second-level neighbors. The performance of the proposed video retrieval is evaluated using precision, recall, and F-measure, which are computed by matching the retrieved videos and the manually classified videos. From the experimentation, we proved that the average precision of the proposed PENN+VQ is 78.3%, which is higher than that of the existing methods.
11

Zhang, Bo, Xiya Yang, Ge Wang, Ying Wang, and Rui Sun. "M2ER: Multimodal Emotion Recognition Based on Multi-Party Dialogue Scenarios." Applied Sciences 13, no. 20 (2023): 11340. http://dx.doi.org/10.3390/app132011340.

Abstract:
Researchers have recently focused on multimodal emotion recognition, but issues persist in recognizing emotions in multi-party dialogue scenarios. Most studies have only used text and audio modality, ignoring the video modality. To address this, we propose M2ER, a multimodal emotion recognition scheme based on multi-party dialogue scenarios. Addressing the issue of multiple faces appearing in the same frame of the video modality, M2ER introduces a method using multi-face localization for speaker recognition to eliminate the interference of non-speakers. The attention mechanism is used to fuse and classify different modalities. We conducted extensive experiments in unimodal and multimodal fusion using the multi-party dialogue dataset MELD. The results show that M2ER achieves superior emotion recognition in both text and audio modalities compared to the baseline model. The proposed method using speaker recognition in the video modality improves emotion recognition performance by 6.58% compared to the method without speaker recognition. In addition, the multimodal fusion based on the attention mechanism also outperforms the baseline fusion model.
12

Mao, Jianguo, Wenbin Jiang, Hong Liu, Xiangdong Wang, and Yajuan Lyu. "Inferential Knowledge-Enhanced Integrated Reasoning for Video Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 11 (2023): 13380–88. http://dx.doi.org/10.1609/aaai.v37i11.26570.

Abstract:
Recently, video question answering has attracted growing attention. It involves answering a question based on a fine-grained understanding of video multi-modal information. Most existing methods have successfully explored the deep understanding of visual modality. We argue that a deep understanding of linguistic modality is also essential for answer reasoning, especially for videos that contain character dialogues. To this end, we propose an Inferential Knowledge-Enhanced Integrated Reasoning method. Our method consists of two main components: 1) an Inferential Knowledge Reasoner to generate inferential knowledge for linguistic modality inputs that reveals deeper semantics, including the implicit causes, effects, mental states, etc. 2) an Integrated Reasoning Mechanism to enhance video content understanding and answer reasoning by leveraging the generated inferential knowledge. Experimental results show that our method achieves significant improvement on two mainstream datasets. The ablation study further demonstrates the effectiveness of each component of our approach.
13

N, Pavithra, and H. Sharath Kumar Y. "A Computational Meta-Learning Inspired Model for Sketch-based Video Retrieval." Indian Journal of Science and Technology 16, no. 7 (2023): 476–84. https://doi.org/10.17485/IJST/v16i7.2121.

Abstract:
Objectives: To design and develop an efficient computing framework for sketch-based video retrieval using a fine-grained intrinsic computational approach. Methods: The primary method of sketch-based video retrieval adopts a multi-stream, multi-modality joint embedding method for improved P-SBVR from improved fine-grained KTH and TSF related datasets. It considers the potential aspects of the computation of significant visual intrinsic appearance details for sketch objects. The extracted appearance and motion-based features are used to train three different CNN baselines under strong and weak supervision. The system also implements a meta-learning model for different supervised settings to attain better performance of sketch-based video retrieval, along with a relational module to overcome the problem of overfitting. Findings: The study derives specific sketch sequences from its formulated dataset to compute instance-level query processing for video retrieval. Further, it also addresses the limitations arising in the context of coarse-grained video retrieval models and sketch-based still image retrieval. The aggregated dataset for rich annotation assisted in the experimental simulation. The experimental evaluation with respect to the performance metric evaluates the 3D CNN baselines under strong supervision and weak supervision, where CNN BL-Type-2 attains a maximum video retrieval accuracy of 99.96% for the triplet grading feature under the relational schema. CNN BL-Type-1 attains a maximum retrieval accuracy of 97.40% considering the triplet grading features from the improved SBVR. The evaluation metric for the instance-level retrieval process also considers true matching of sketches with the videos; it clearly shows that the appropriate appearance- and motion-based feature selection has enhanced the video retrieval accuracy up to 96.90%, with 99.28% accuracy in action identification considering the motion stream, 98.17% for the appearance module, and 98.45% for the fusion module. Another important aspect of the proposed research context is that it addresses the problem of cross-modality while executing the simultaneous matching paradigm for visual appearances of the object with its movement appearing in particular video scenes. The experimental outcome shows its comparable effectiveness relative to the existing system of CNN. Novelty: Unlike the conventional system of sketch analysis, which is more focused on static objects or scenes, the presented approach can efficiently compute the important visual intrinsic appearance details of the object of interest from the sketch and then activate the operations for video retrieval. The proposed CNN-based learning model with the improved P-SBVR dataset attains better computing times for retrieval, approximately 200, 210, and 214 milliseconds for CNN BL-Type-1, CNN BL-Type-2, and CNN BL-Type-3, respectively, which are comparable with the existing deep-learning-based SBVR models. Keywords: Sketch Based Video Retrieval; Intrinsic Appearance Details; Meta Learning; Sketch Dataset; Cross Modality Problem
14

Zhu, Mengxiao, Liu He, Han Zhao, Ruoxiao Su, Licheng Zhang, and Bo Hu. "Same Vaccine, Different Voices: A Cross-Modality Analysis of HPV Vaccine Discourse on Social Media." Proceedings of the International AAAI Conference on Web and Social Media 19 (June 7, 2025): 2317–33. https://doi.org/10.1609/icwsm.v19i1.35936.

Abstract:
Despite the proven efficacy of HPV vaccines, uptake remains limited in many regions, including China. This study investigates how health beliefs and emotional responses evolve across text-, audio-, and video-based platforms by analyzing data from three representative platforms in China, including 273,357 posts from Weibo (text-based), 1,228 podcasts from Ximalaya (audio-based), and 1,225 videos from Douyin (video-based) from July 2018 to March 2023. The comparisons are conducted under four dimensions as suggested by the Health Belief Model (HBM), including susceptibility, severity, benefits, and barriers. Our findings reveal distinct modality-specific patterns. For instance, a text-based platform tends to amplify barriers and negativity, an audio-based platform enables balanced and sustained discussions, and a video-based platform highlights personal anecdotes and drives rapid sentiment shifts. By highlighting these modality-specific differences and addressing potential cross-modal incongruities at the content level, we provide actionable insights for public health communicators, policymakers, and platform designers to tailor strategies, foster informed decision-making, and ultimately enhance HPV vaccine uptake in complex social media ecosystems.
15

Pang, Nuo, Songlin Guo, Ming Yan, and Chien Aun Chan. "A Short Video Classification Framework Based on Cross-Modal Fusion." Sensors 23, no. 20 (2023): 8425. http://dx.doi.org/10.3390/s23208425.

Abstract:
The explosive growth of online short videos has brought great challenges to the efficient management of video content classification, retrieval, and recommendation. Video features for video management can be extracted from video image frames by various algorithms, and they have been proven to be effective in the video classification of sensor systems. However, frame-by-frame processing of video image frames not only requires huge computing power, but also classification algorithms based on a single modality of video features cannot meet the accuracy requirements in specific scenarios. In response to these concerns, we introduce a short video categorization architecture centered around cross-modal fusion in visual sensor systems which jointly utilizes video features and text features to classify short videos, avoiding processing a large number of image frames during classification. Firstly, the image space is extended to three-dimensional space–time by a self-attention mechanism, and a series of patches are extracted from a single image frame. Each patch is linearly mapped into the embedding layer of the Timesformer network and augmented with positional information to extract video features. Second, the text features of subtitles are extracted through the bidirectional encoder representation from the Transformers (BERT) pre-training model. Finally, cross-modal fusion is performed based on the extracted video and text features, resulting in improved accuracy for short video classification tasks. The outcomes of our experiments showcase a substantial superiority of our introduced classification framework compared to alternative baseline video classification methodologies. This framework can be applied in sensor systems for potential video classification.
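
The patch-embedding step this abstract describes (cutting a frame into patches, linearly mapping each patch into an embedding space, and adding positional information) can be sketched in a few lines of PyTorch. The image size, patch size, and embedding dimension below are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Minimal sketch of frame-to-patch-token embedding with positional information."""
    def __init__(self, img_size=224, patch=16, dim=768):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        # a strided convolution is a common way to cut and linearly project patches
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))

    def forward(self, frame):             # frame: (B, 3, H, W)
        x = self.proj(frame)              # (B, dim, H/patch, W/patch)
        x = x.flatten(2).transpose(1, 2)  # (B, n_patches, dim)
        return x + self.pos               # add positional information

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```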
16

Oliveira, Eva, Teresa Chambel, and Nuno Magalhães Ribeiro. "Sharing Video Emotional Information in the Web." International Journal of Web Portals 5, no. 3 (2013): 19–39. http://dx.doi.org/10.4018/ijwp.2013070102.

Abstract:
Video growth over the Internet has changed the way users search, browse and view video content. Watching movies over the Internet is increasing and becoming a pastime. The possibility of streaming Internet content to TV, together with advances in video compression and video streaming, has made this recent modality of watching movies easy and practical. Web portals, as a worldwide means of multimedia data access, need to have their contents properly classified in order to meet users’ needs and expectations. The authors propose a set of semantic descriptors based on both user physiological signals, captured while watching videos, and video low-level feature extraction. These XML-based descriptors contribute to the creation of automatic affective meta-information that will not only enhance a web-based video recommendation system based on emotional information, but also enhance search and retrieval of videos’ affective content from both users’ personal classifications and content classifications in the context of a web portal.
17

Xiang, Yun Zhu. "Multi-Modality Video Scene Segmentation Algorithm with Shot Force Competition." Applied Mechanics and Materials 513-517 (February 2014): 514–17. http://dx.doi.org/10.4028/www.scientific.net/amm.513-517.514.

Abstract:
In order to quickly and effectively segment video scenes, a multi-modality video scene segmentation algorithm with shot force competition is proposed in this paper. This method takes full account of the temporally associated co-occurrence of multimodal media data, calculating the similarity between video shots by merging low-level video features and then performing video scene segmentation based on the judgment method of shot competition. The authors' experiments show that video scenes can be efficiently separated by the proposed method.
18

Chen, Chun-Ying. "The Influence of Representational Formats and Learner Modality Preferences on Instructional Efficiency Using Interactive Video Tutorials." Journal of Education and Training 7, no. 2 (2020): 77. http://dx.doi.org/10.5296/jet.v7i2.17415.

Abstract:
This study investigated how to create effective interactive video tutorials for learning computer-based tasks. The role of learner modality preferences was also considered. A 4 × 4 between-subjects factorial design was employed to examine the influence of instruction representational formats (noninteractive static, interactive static, interactive visual-only video with onscreen text, interactive video with audio narration) and learner modality preferences (visual, aural, read/write, multimodal) on instructional efficiency. Instructional efficiency was a combined effect of test performance and perceived cognitive load during learning. The results suggested that implementing interactivity into the video tutorials tended to increase transfer performance, and the role of modality preferences was related to learners’ perceived cognitive load. The significant interaction effect on transfer efficiency indicated: (a) the auditory preference tended to exhibit better transfer efficiency with the narrated video, and (b) the read/write preference tended to exhibit better transfer efficiency with both the noninteractive static format and the captioned video. This study highlighted the importance of considering individual differences in modality preferences, particularly that of auditory and read/write learners.
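
The abstract defines instructional efficiency as a combined effect of test performance and perceived cognitive load but does not give the formula. A common way to compute such a score in the cognitive-load literature is the standardized relative-efficiency measure E = (z_performance - z_effort) / sqrt(2); the sketch below illustrates that assumed combination, not necessarily the one used in the study.

```python
import numpy as np

def relative_efficiency(performance, mental_effort):
    """Assumed Paas-style combination: E = (z_perf - z_effort) / sqrt(2).
    A common measure in the literature, not confirmed as the study's formula."""
    zp = (performance - performance.mean()) / performance.std(ddof=1)
    ze = (mental_effort - mental_effort.mean()) / mental_effort.std(ddof=1)
    return (zp - ze) / np.sqrt(2)

scores = np.array([70.0, 85.0, 60.0, 90.0])   # hypothetical transfer-test scores
effort = np.array([6.0, 4.0, 7.0, 3.0])       # hypothetical cognitive-load ratings
print(relative_efficiency(scores, effort))
```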
19

Gerabon, Mariel, Asirah Amil, Ricky Dag-Uman, Rhea Dulla, Sandy Aldanese, and Sendy Suico. "Modality-Based Assessment Practices and Strategies during Pandemic." Journal of Education and Academic Settings 1, no. 1 (2024): 1–18. http://dx.doi.org/10.62596/ja5vhj56.

Abstract:
Assessment, which is an essential component of the entire educational process and determines whether or not learning capabilities are obtained, can be used to evaluate the success of educational activities. Since in-person training was banned in an attempt to curb the virus's transmission, education was one of the industries most affected by COVID-19. To adhere to the Basic Education Learning Continuity Plan, educational settings underwent modifications during the epidemic. This study examines the ways in which teachers coped with the transition from traditional face-to-face education to online assessments of students' learning. A survey questionnaire was sent to 55 instructors from Zamboanga City and Zamboanga Sibugay with the goal of gathering information on assessment practices and techniques based on the modalities of learning utilized to evaluate student learning during the epidemic. The study aims to assess the efficacy of several modular assessment methods, such as activity sheets, quizzes, summative exams, specialized projects, and call-based oral recitations. The research examines the use of blended assessment practices like output compilation, competency learned journals, and process-based assessment through online interviews, in addition to blended assessment strategies like online paperless tests, video-recorded demonstrations, and live video call assessments. This research is significant for educational institutions and stakeholders in evaluating the relevance and efficacy of various assessment procedures and strategies utilized by instructors during the pandemic.
20

Yuan, Haiyue, Janko Ćalić, and Ahmet Kondoz. "Analysis of User Requirements in Interactive 3D Video Systems." Advances in Human-Computer Interaction 2012 (2012): 1–11. http://dx.doi.org/10.1155/2012/343197.

Abstract:
The recent development of three dimensional (3D) display technologies has resulted in a proliferation of 3D video production and broadcasting, attracting a lot of research into capture, compression and delivery of stereoscopic content. However, the predominant design practice of interactions with 3D video content has failed to address its differences and possibilities in comparison to the existing 2D video interactions. This paper presents a study of user requirements related to interaction with the stereoscopic 3D video. The study suggests that the change of view, zoom in/out, dynamic video browsing, and textual information are the most relevant interactions with stereoscopic 3D video. In addition, we identified a strong demand for object selection that resulted in a follow-up study of user preferences in 3D selection using virtual-hand and ray-casting metaphors. These results indicate that interaction modality affects users’ decision of object selection in terms of chosen location in 3D, while user attitudes do not have significant impact. Furthermore, the ray-casting-based interaction modality using Wiimote can outperform the volume-based interaction modality using mouse and keyboard for object positioning accuracy.
21

Griffiths, Noola K., and Jonathon L. Reay. "The Relative Importance of Aural and Visual Information in the Evaluation of Western Canon Music Performance by Musicians and Nonmusicians." Music Perception 35, no. 3 (2018): 364–75. http://dx.doi.org/10.1525/mp.2018.35.3.364.

Abstract:
Aural and visual information have been shown to affect audience evaluations of music performance (Griffiths, 2010; Juslin, 2000); however, it is not fully understood which modality has the greatest relative impact upon judgements of performance or whether the evaluator’s musical expertise mediates this effect. An opportunity sample of thirty-four musicians (8 male, 26 female, Mage = 26.4 years) and 26 nonmusicians (6 male, 20 female, Mage = 44.0 years) rated four video clips for technical proficiency, musicality, and overall performance quality using 7-point Likert scales. Two video performances of Debussy’s Clair de lune (one professional, one amateur) were used to create the four video clips, comprising two clips with congruent modality information and two clips with incongruent modality information. The incongruent clips contained the visual modality of one quality condition with the audio modality of the other. It was possible to determine which modality was most important in participants’ evaluative judgements based on the modality of the professional quality condition in the clip that was rated most highly. The current study confirms that both aural and visual information can affect audience members’ experience of musical performance. We provide evidence that visual information has a greater impact than aural information on evaluations of performance quality, as the incongruent clip with amateur audio + professional video was rated significantly higher than that with professional audio + amateur video. Participants’ level of musical expertise was found to have no effect on their judgements of performance quality.
22

Zhuo, Junbao, Shuhui Wang, Zhenghan Chen, Li Shen, Qingming Huang, and Huimin Ma. "Image-to-video Adaptation with Outlier Modeling and Robust Self-learning." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 21 (2025): 23072–80. https://doi.org/10.1609/aaai.v39i21.34471.

Abstract:
The image-to-video adaptation task seeks to effectively harness both labeled images and unlabeled videos for achieving effective video recognition. The modality gap between images and videos and the domain discrepancy across the two domains are the two essential challenges in this task. Existing methods reduce the domain discrepancy via closed-set domain adaptation techniques, resulting in inaccurate domain alignment as there exist outlier target frames. To tackle this issue, we extend the vanilla classifier with outlier classes, where each outlier class is responsible for capturing outlier frames of a specific class via a batch nuclear norm maximization loss. We further propose a new loss by treating the source images outside class c as instances of the outlier class specific to c. As for the modality gap, existing methods usually utilize the pseudo labels obtained from an image-level adapted model to learn a video-level model. Few efforts are dedicated to handling the noise in pseudo labels. We propose a new metric based on label propagation consistency to select samples for training a better video-level model. Experiments on three benchmarks validate the effectiveness of our method.
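
The batch nuclear-norm maximization loss mentioned above is typically computed on the batch of softmax predictions; maximizing the nuclear norm of the prediction matrix encourages both discriminability and diversity. The sketch below is one reading of that loss for illustration, not the authors' code.

```python
import torch
import torch.nn.functional as F

def batch_nuclear_norm_max_loss(logits):
    """Negative nuclear norm of the (batch_size x num_classes) prediction matrix,
    normalized by batch size; minimizing it maximizes the nuclear norm."""
    probs = F.softmax(logits, dim=1)                      # (B, C)
    nuclear = torch.linalg.matrix_norm(probs, ord='nuc')  # sum of singular values
    return -nuclear / probs.shape[0]

# toy usage with assumed batch size and class count
loss = batch_nuclear_norm_max_loss(torch.randn(32, 12, requires_grad=True))
loss.backward()
```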
23

Zhang, Zhenduo. "Cross-Category Highlight Detection via Feature Decomposition and Modality Alignment." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 3 (2023): 3525–33. http://dx.doi.org/10.1609/aaai.v37i3.25462.

Abstract:
Learning an autonomous highlight video detector with good transferability across video categories, called Cross-Category Video Highlight Detection (CC-VHD), is crucial for practical application on video-based media platforms. To tackle this problem, we first propose a framework that treats CC-VHD as learning category-independent highlight feature representation. Under this framework, we propose a novel module, named Multi-task Feature Decomposition Branch, which jointly conducts label prediction, cyclic feature reconstruction, and adversarial feature reconstruction to decompose the video features into two independent components: a highlight-related component and a category-related component. Besides, we propose to align the visual and audio modalities to one aligned feature space before conducting modality fusion, which has not been considered in previous works. Finally, extensive experimental results on three challenging public benchmarks validate the efficacy of our paradigm and its superiority over the existing state-of-the-art approaches to video highlight detection.
24

Li, Mingchao, Xiaoming Shi, Haitao Leng, Wei Zhou, Hai-Tao Zheng, and Kuncai Zhang. "Learning Semantic Alignment with Global Modality Reconstruction for Video-Language Pre-training towards Retrieval." Proceedings of the AAAI Conference on Artificial Intelligence 37, no. 1 (2023): 1377–85. http://dx.doi.org/10.1609/aaai.v37i1.25222.

Abstract:
Video-language pre-training for text-based video retrieval tasks is vitally important. Previous pre-training methods suffer from semantic misalignments. The reason is that these methods ignore sequence alignments and focus only on critical token alignment. To alleviate the problem, we propose a video-language pre-training framework, termed video-language pre-training For lEarning sEmantic aLignments (FEEL), to learn semantic alignments at the sequence level. Specifically, the global modality reconstruction and cross-modal self-contrasting methods are utilized to learn the alignments at the sequence level better. Extensive experimental results demonstrate the effectiveness of FEEL on text-based video retrieval and text-based video corpus moment retrieval.
25

Ha, Jongwoo, Joonhyuck Ryu, and Joonghoon Ko. "Multi-Modality Tensor Fusion Based Human Fatigue Detection." Electronics 12, no. 15 (2023): 3344. http://dx.doi.org/10.3390/electronics12153344.

Abstract:
Multimodal learning is an expanding research area that aims to pursue a better understanding of given data by considering different modalities. Multimodal approaches for qualitative data are used for the quantitative proofing of ground-truth datasets and for discovering unexpected phenomena. In this paper, we investigate the effect of multimodal learning schemes on quantitative data to assess its qualitative state. We try to interpret human fatigue levels by analyzing video, thermal image, and voice data together. The experiment showed that the multimodal approach using three types of data was more effective than the method of using each dataset individually. As a result, we identified the possibility of predicting human fatigue states.
26

Radfar, Edalat, Won Hyuk Jang, Leila Freidoony, Jihoon Park, Kichul Kwon, and Byungjo Jung. "Single-channel stereoscopic video imaging modality based on transparent rotating deflector." Optics Express 23, no. 21 (2015): 27661. http://dx.doi.org/10.1364/oe.23.027661.

27

He, Ping, Huaying Qi, Shiyi Wang, and Jiayue Cang. "Cross-Modal Sentiment Analysis of Text and Video Based on Bi-GRU Cyclic Network and Correlation Enhancement." Applied Sciences 13, no. 13 (2023): 7489. http://dx.doi.org/10.3390/app13137489.

Abstract:
Cross-modal sentiment analysis is an emerging research area in natural language processing. The core task of cross-modal fusion lies in cross-modal relationship extraction and joint feature learning. The existing research methods of cross-modal sentiment analysis focus on static text, video, audio, and other modality data but ignore the fact that different modality data are often unaligned in practical applications. There is a long-term time dependence among unaligned data sequences, and it is difficult to explore the interaction between different modalities. The paper proposes a sentiment analysis model (UA-BFET) based on feature enhancement technology in unaligned data scenarios, which can perform sentiment analysis on unaligned text and video modality data in social media. Firstly, the model adds a cyclic memory enhancement network across time steps. Then, the obtained cross-modal fusion features with interaction are applied to the unimodal feature extraction process of the next time step in the Bi-directional Gated Recurrent Unit (Bi-GRU) so that the progressively enhanced unimodal features and cross-modal fusion features continuously complement each other. Secondly, the extracted unimodal text and video features taken jointly from the enhanced cross-modal fusion features are subjected to canonical correlation analysis (CCA) and input into the fully connected layer and Softmax function for sentiment analysis. Through experiments executed on unaligned public datasets MOSI and MOSEI, the UA-BFET model has achieved or even exceeded the sentiment analysis effect of text, video, and audio modality fusion and has outstanding advantages in solving cross-modal sentiment analysis in unaligned data scenarios.
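
The canonical correlation analysis step, which projects the enhanced text and video features into a maximally correlated joint space before the classifier, can be illustrated with scikit-learn's CCA. This is a sketch under assumed feature dimensions, not the UA-BFET implementation.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
text_feats = rng.normal(size=(500, 128))   # assumed per-sample text features
video_feats = rng.normal(size=(500, 256))  # assumed per-sample video features

cca = CCA(n_components=8)
text_c, video_c = cca.fit_transform(text_feats, video_feats)

# concatenate the correlated projections as the fused representation
fused = np.concatenate([text_c, video_c], axis=1)
print(fused.shape)  # (500, 16)
```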
28

Wei, Haoran, Roozbeh Jafari, and Nasser Kehtarnavaz. "Fusion of Video and Inertial Sensing for Deep Learning–Based Human Action Recognition." Sensors 19, no. 17 (2019): 3680. http://dx.doi.org/10.3390/s19173680.

Abstract:
This paper presents the simultaneous utilization of video images and inertial signals that are captured at the same time via a video camera and a wearable inertial sensor within a fusion framework in order to achieve a more robust human action recognition compared to the situations when each sensing modality is used individually. The data captured by these sensors are turned into 3D video images and 2D inertial images that are then fed as inputs into a 3D convolutional neural network and a 2D convolutional neural network, respectively, for recognizing actions. Two types of fusion are considered—Decision-level fusion and feature-level fusion. Experiments are conducted using the publicly available dataset UTD-MHAD in which simultaneous video images and inertial signals are captured for a total of 27 actions. The results obtained indicate that both the decision-level and feature-level fusion approaches generate higher recognition accuracies compared to the approaches when each sensing modality is used individually. The highest accuracy of 95.6% is obtained for the decision-level fusion approach.
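
The two fusion strategies compared in this abstract follow a general pattern: combine per-branch class scores (decision-level) or concatenate features before a joint classifier (feature-level). The sketch below only illustrates that pattern with made-up feature sizes; the class count of 27 matches the UTD-MHAD actions mentioned above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

video_feat = torch.randn(8, 512)     # assumed output size of a 3D CNN on video
inertial_feat = torch.randn(8, 128)  # assumed output size of a 2D CNN on inertial images

# decision-level fusion: each branch classifies, class scores are averaged
video_head, inertial_head = nn.Linear(512, 27), nn.Linear(128, 27)
decision = (F.softmax(video_head(video_feat), dim=1)
            + F.softmax(inertial_head(inertial_feat), dim=1)) / 2

# feature-level fusion: features are concatenated, then classified jointly
joint_head = nn.Linear(512 + 128, 27)
feature_level = joint_head(torch.cat([video_feat, inertial_feat], dim=1))

print(decision.shape, feature_level.shape)  # both (8, 27) class scores
```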
29

Beemer, Lexie R., Wendy Tackett, Anna Schwartz, et al. "Use of a Novel Theory-Based Pragmatic Tool to Evaluate the Quality of Instructor-Led Exercise Videos to Promote Youth Physical Activity at Home: Preliminary Findings." International Journal of Environmental Research and Public Health 20, no. 16 (2023): 6561. http://dx.doi.org/10.3390/ijerph20166561.

Abstract:
Background: Exercise videos that work to minimize cognitive load (the amount of information that working memory can hold at one time) are hypothesized to be more engaging, leading to increased PA participation. Purpose: To use a theory-based pragmatic tool to evaluate the cognitive load of instructor-led exercise videos associated with the Interrupting Prolonged Sitting with ACTivity (InPACT) program. Methods: Exercise videos were created by physical education teachers and fitness professionals. An evaluation rubric was created to identify elements each video must contain to reduce cognitive load, which included three domains with four components each [technical (visual quality, audio quality, matching modality, signaling), content (instructional objective, met objective, call-to-action, bias), and instructional (learner engagement, content organization, segmenting, weeding)]. Each category was scored on a 3-point scale from 0 (absent) to 2 (proficient). A video scoring 20–24 points induced low cognitive load, 13–19 points induced moderate cognitive load, and less than 13 points induced high cognitive load. Three reviewers independently evaluated the videos and then agreed on scores and feedback. Results: All 132 videos were evaluated. Mean video total score was 20.1 ± 0.7 points out of 24. Eighty-five percent of videos were rated low cognitive load, 15% were rated moderate cognitive load, and 0% were rated high cognitive load. The following components scored the highest: audio quality and matching modality. The following components scored the lowest: signaling and call-to-action. Conclusions: Understanding the use of a pragmatic tool is a first step in the evaluation of InPACT at Home exercise videos. Our preliminary findings suggest that the InPACT at Home videos had low cognitive load. If future research confirms our findings, using a more rigorous study design, then developing a collection of instructor-led exercise videos that induce low cognitive load may help to enhance youth physical activity participation in the home environment.
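
The scoring logic of the rubric (twelve components rated 0 to 2, with the total-score bands for low, moderate, and high cognitive load given in the abstract) is simple enough to express directly; a small sketch using those thresholds.

```python
def cognitive_load_rating(component_scores):
    """component_scores: 12 values, each 0 (absent), 1, or 2 (proficient)."""
    assert len(component_scores) == 12
    assert all(s in (0, 1, 2) for s in component_scores)
    total = sum(component_scores)          # maximum possible is 24
    if total >= 20:
        level = "low cognitive load"       # 20-24 points
    elif total >= 13:
        level = "moderate cognitive load"  # 13-19 points
    else:
        level = "high cognitive load"      # below 13 points
    return total, level

print(cognitive_load_rating([2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 1, 2]))  # (21, 'low cognitive load')
```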
30

Citak, Erol, and Mine Elif Karsligil. "Multi-Modal Low-Data-Based Learning for Video Classification." Applied Sciences 14, no. 10 (2024): 4272. http://dx.doi.org/10.3390/app14104272.

Abstract:
Video classification is a challenging task in computer vision that requires analyzing the content of a video to assign it to one or more predefined categories. However, due to the vast amount of visual data contained in videos, the classification process is often computationally expensive and requires a significant amount of annotated data. Because of these reasons, the low-data-based video classification area, which consists of few-shot and zero-shot tasks, is proposed as a potential solution to overcome traditional video classification-oriented challenges. However, existing low-data area datasets, which are either not diverse or have no additional modality context, which is a mandatory requirement for the zero-shot task, do not fulfill the requirements for few-shot and zero-shot tasks completely. To address this gap, in this paper, we propose a large-scale, general-purpose dataset for the problem of multi-modal low-data-based video classification. The dataset contains pairs of videos and attributes that capture multiple facets of the video content. Thus, the new proposed dataset will both enable the study of low-data-based video classification tasks and provide consistency in terms of comparing the evaluations of future studies in this field. Furthermore, to evaluate and provide a baseline for future works on our new proposed dataset, we present a variational autoencoder-based model that leverages the inherent correlation among different modalities to learn more informative representations. In addition, we introduce a regularization technique to improve the baseline model’s generalization performance in low-data scenarios. Our experimental results reveal that our proposed baseline model, with the aid of this regularization technique, achieves over 12% improvement in classification accuracy compared to the pure baseline model with only a single labeled sample.
31

Lee, Yong-Hyeok, Dong-Won Jang, Jae-Bin Kim, Rae-Hong Park, and Hyung-Min Park. "Audio–Visual Speech Recognition Based on Dual Cross-Modality Attentions with the Transformer Model." Applied Sciences 10, no. 20 (2020): 7263. http://dx.doi.org/10.3390/app10207263.

Abstract:
Since the attention mechanism was introduced in neural machine translation, attention has been combined with long short-term memory (LSTM) or has replaced the LSTM in a transformer model to overcome the sequence-to-sequence (seq2seq) problems with the LSTM. In contrast to neural machine translation, audio–visual speech recognition (AVSR) may provide improved performance by learning the correlation between audio and visual modalities. Because the audio carries richer information than the lip-related video, it is hard for AVSR to train attentions with balanced modalities. In order to increase the role of the visual modality to the level of the audio modality by fully exploiting input information in learning attentions, we propose a dual cross-modality (DCM) attention scheme that utilizes both an audio context vector using the video query and a video context vector using the audio query. Furthermore, we introduce a connectionist-temporal-classification (CTC) loss in combination with our attention-based model to force the monotonic alignments required in AVSR. Recognition experiments on the LRS2-BBC and LRS3-TED datasets showed that the proposed model with the DCM attention scheme and the hybrid CTC/attention architecture achieved at least a relative improvement of 7.3% on average in the word error rate (WER) compared to competing methods based on the transformer model.
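
The dual cross-modality idea, an audio context vector computed with a video query alongside a video context vector computed with an audio query, maps naturally onto two cross-attention blocks. The sketch below is an illustration with assumed dimensions, not the authors' transformer.

```python
import torch
import torch.nn as nn

class DualCrossModalityAttention(nn.Module):
    """Sketch: audio attended by video queries and video attended by audio queries."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.audio_from_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_from_audio = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, video):  # audio: (B, T_a, dim), video: (B, T_v, dim)
        audio_ctx, _ = self.audio_from_video(query=video, key=audio, value=audio)
        video_ctx, _ = self.video_from_audio(query=audio, key=video, value=video)
        return audio_ctx, video_ctx   # each modality contributes queries for the other

dcm = DualCrossModalityAttention()
a_ctx, v_ctx = dcm(torch.randn(2, 120, 256), torch.randn(2, 75, 256))
print(a_ctx.shape, v_ctx.shape)  # torch.Size([2, 75, 256]) torch.Size([2, 120, 256])
```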
32

Yang, Saelyne, Sunghyun Park, Yunseok Jang, and Moontae Lee. "YTCommentQA: Video Question Answerability in Instructional Videos." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 17 (2024): 19359–67. http://dx.doi.org/10.1609/aaai.v38i17.29906.

Abstract:
Instructional videos provide detailed how-to guides for various tasks, with viewers often posing questions regarding the content. Addressing these questions is vital for comprehending the content, yet receiving immediate answers is difficult. While numerous computational models have been developed for Video Question Answering (Video QA) tasks, they are primarily trained on questions generated based on video content, aiming to produce answers from within the content. However, in real-world situations, users may pose questions that go beyond the video's informational boundaries, highlighting the necessity to determine if a video can provide the answer. Discerning whether a question can be answered by video content is challenging due to the multi-modal nature of videos, where visual and verbal information are intertwined. To bridge this gap, we present the YTCommentQA dataset, which contains naturally-generated questions from YouTube, categorized by their answerability and required modality to answer -- visual, script, or both. Experiments with answerability classification tasks demonstrate the complexity of YTCommentQA and emphasize the need to comprehend the combined role of visual and script information in video reasoning. The dataset is available at https://github.com/lgresearch/YTCommentQA.
33

Guo, Jialong, Ke Liu, Jiangchao Yao, Zhihua Wang, Jiajun Bu, and Haishuai Wang. "MetaNeRV: Meta Neural Representations for Videos with Spatial-Temporal Guidance." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 3 (2025): 3257–65. https://doi.org/10.1609/aaai.v39i3.32336.

Abstract:
Neural Representations for Videos (NeRV) has emerged as a promising implicit neural representation (INR) approach for video analysis, which represents videos as neural networks with frame indexes as inputs. However, NeRV-based methods are time-consuming when adapting to a large number of diverse videos, as each video requires a separate NeRV model to be trained from scratch. In addition, NeRV-based methods spatially require generating a high-dimension signal (i.e., an entire image) from the input of a low-dimension timestamp, and a video typically consists of tens of frames temporally that have a minor change between adjacent frames. To improve the efficiency of video representation, we propose Meta Neural Representations for Videos, named MetaNeRV, a novel framework for fast NeRV representation for unseen videos. MetaNeRV leverages a meta-learning framework to learn an optimal parameter initialization, which serves as a good starting point for adapting to new videos. To address the unique spatial and temporal characteristics of video modality, we further introduce spatial-temporal guidance to improve the representation capabilities of MetaNeRV. Specifically, the spatial guidance with a multi-resolution loss aims to capture the information from different resolution stages, and the temporal guidance with an effective progressive learning strategy could gradually refine the number of fitted frames during the meta-learning process. Extensive experiments conducted on multiple datasets demonstrate the superiority of MetaNeRV for video representations and video compression.
34

NAGEL, Merav. "Exercise Intervention based on the Chinese Modality." Asian Journal of Physical Education & Recreation 13, no. 2 (2007): 13–20. http://dx.doi.org/10.24112/ajper.131829.

Abstract:
LANGUAGE NOTE | Document text in English; abstract also in Chinese.
Children become more and more sedentary, spending long hours watching TV, playing video games, and interacting with computers instead of communicating with friends, parents and teachers. They struggle to keep themselves awake during the first hours of the school day because they go to bed late. This results in unproductive learning. However, physical activity conducted before school starts serves as a mood booster and a way to regenerate energy (recharge batteries). It helps students absorb and process class materials during the early morning hours. The main objective of this study was to implement an intervention of exercise and health education in the school curriculum to improve children's lifestyle, nutrition, and sleep.
[Chinese abstract, translated:] Obesity among modern children is becoming increasingly severe, leading to physical illness. Schools therefore play an ever more important role in addressing childhood obesity, and this paper explores the influence of the early-morning exercise model commonly used in Chinese schools on Western children.
35

Zhang, Qiaoyun, Hsiang-Chuan Chang, Chia-Ling Ho, Huan-Chao Keh, and Diptendu Sinha Roy. "AI-Based Multimodal Anomaly Detection for Industrial Machine Operations." Journal of Internet Technology 26, no. 2 (2025): 255–64. https://doi.org/10.70003/160792642025032602010.

Abstract:
In the manufacturing process involving grinding wheels, challenges in fine-tuning grinding machines are typically addressed by craftsmen through subjective observations of sparks and sounds. However, most current anomaly detection methods mainly aim at a single modality, whereas existing multimodal methods cannot effectively cope with a common issue. To address this, this paper introduces an innovative mechanism, AI-Based Multimodal Anomaly Detection (AMAD), designed to optimize the efficiency and accuracy of grinding wheel production lines. The proposed AMAD includes data preprocessing and multimodal anomaly detection, accurately identifying anomalies in grinding wheel operation videos. In the data preprocessing phase, the proposed AMAD utilizes Mel Frequency Cepstral Coefficients (MFCC) and AutoEncoder for audio processing and segmentation for video processing. In the multimodal anomaly detection phase, the proposed AMAD employs Convolutional Neural Networks (CNN) for audio analysis and Convolutional Long Short-Term Memory (ConvLSTM) for video analysis. By combining both audio and video modalities, the proposed AMAD effectively predicts whether the input video represents normal or abnormal grinding wheel operations. This multimodal approach not only improves the accuracy of anomaly detection but also enhances the robustness of the system. Simulation results demonstrate that the proposed AMAD significantly improves performance in anomaly detection in terms of precision, recall, and F1-Score.
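
The audio preprocessing step named in this abstract (MFCC extraction before the autoencoder and CNN) can be reproduced with librosa; the sketch below uses assumed parameters and a hypothetical file name, not the AMAD pipeline.

```python
import librosa

def mfcc_features(wav_path, sr=22050, n_mfcc=13):
    """Load a mono recording and return a normalized MFCC matrix (n_mfcc x frames)."""
    signal, sr = librosa.load(wav_path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    # per-coefficient normalization is a common, assumed preprocessing choice
    return (mfcc - mfcc.mean(axis=1, keepdims=True)) / (mfcc.std(axis=1, keepdims=True) + 1e-8)

# feats = mfcc_features("grinding_wheel_clip.wav")  # hypothetical file name
# print(feats.shape)
```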
APA, Harvard, Vancouver, ISO, and other styles
36

Kim, Eun Hee, and Ju Hyun Shin. "Multi-Modal Emotion Recognition in Videos Based on Pre-Trained Models." Korean Institute of Smart Media 13, no. 10 (2024): 19–27. http://dx.doi.org/10.30693/smj.2024.13.10.19.

Full text
Abstract:
Recently, as the demand for non-face-to-face counseling has rapidly increased, the need for emotion recognition technology that combines various aspects such as text, voice, and facial expressions is being emphasized. In this paper, we address issues such as the dominance of non-Korean data and the imbalance of emotion labels in existing datasets like FER-2013, CK+, and AFEW by using Korean video data. We propose methods to enhance multimodal emotion recognition performance in videos by integrating the strengths of the image modality with the text modality. Pre-trained models are used to overcome the limitations caused by small training datasets: a GPT-4-based LLM is applied to the text, and a pre-trained model based on the VGG-19 architecture is fine-tuned on facial expression images. Representative emotions are then extracted by combining the per-modality emotion results as follows. Emotion information extracted from the text is combined with facial expression changes in the video; if there is a sentiment mismatch between the text and the image, a threshold is applied that prioritizes the text-based sentiment when it is deemed trustworthy. Additionally, by adjusting the representative emotions using the emotion distribution information for each frame, performance improved by 19% in F1-score compared to the existing method that used average emotion values for each frame.
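The text-priority rule described in this abstract can be written down compactly. The following sketch is a hedged reading of that rule; the threshold value and the confidence scores are assumptions, not the paper's parameters.

```python
# Text-priority fusion sketch: trust the text label on disagreement only if its
# confidence clears a threshold; otherwise fall back to the face-based label.
def fuse_emotions(text_label, text_conf, face_label, face_conf, tau=0.7):
    if text_label == face_label:
        return text_label
    return text_label if text_conf >= tau else face_label
```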
APA, Harvard, Vancouver, ISO, and other styles
37

Zhu, Xiaoguang, Ye Zhu, Haoyu Wang, Honglin Wen, Yan Yan, and Peilin Liu. "Skeleton Sequence and RGB Frame Based Multi-Modality Feature Fusion Network for Action Recognition." ACM Transactions on Multimedia Computing, Communications, and Applications 18, no. 3 (2022): 1–24. http://dx.doi.org/10.1145/3491228.

Full text
Abstract:
Action recognition has been a heated topic in computer vision because of its wide application in vision systems. Previous approaches achieve improvement by fusing the modalities of the skeleton sequence and the RGB video. However, such methods face a dilemma between accuracy and efficiency due to the high complexity of the RGB video network. To solve this problem, we propose a multi-modality feature fusion network that combines the modalities of the skeleton sequence and a single RGB frame instead of the RGB video, as the key information contained in the combination of the skeleton sequence and RGB frame is close to that of the skeleton sequence and RGB video. In this way, complementary information is retained while the complexity is reduced by a large margin. To better explore the correspondence between the two modalities, a two-stage fusion framework is introduced in the network. In the early fusion stage, we introduce a skeleton attention module that projects the skeleton sequence onto the single RGB frame to help the RGB frame focus on the limb movement regions. In the late fusion stage, we propose a cross-attention module that fuses the skeleton feature and the RGB feature by exploiting their correlation. Experiments on two benchmarks, NTU RGB+D and SYSU, show that the proposed model achieves competitive performance compared with the state-of-the-art methods while reducing the complexity of the network.
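A generic cross-attention fusion block of the kind referenced in the late fusion stage can look like the sketch below. The feature dimension, number of heads, and residual/normalization choices are placeholders, not the paper's configuration.

```python
# Cross-attention fusion sketch: queries from the skeleton features, keys/values
# from the RGB-frame features, followed by a residual connection and layer norm.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, skeleton_feat, rgb_feat):
        # skeleton_feat, rgb_feat: (batch, tokens, dim)
        fused, _ = self.attn(query=skeleton_feat, key=rgb_feat, value=rgb_feat)
        return self.norm(skeleton_feat + fused)    # residual fusion of the two modalities
```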
APA, Harvard, Vancouver, ISO, and other styles
38

Jiang, Pin, and Yahong Han. "Reasoning with Heterogeneous Graph Alignment for Video Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (2020): 11109–16. http://dx.doi.org/10.1609/aaai.v34i07.6767.

Full text
Abstract:
The dominant video question answering methods are based on fine-grained representation or model-specific attention mechanisms. They usually process the video and the question separately, then feed the representations of the different modalities into late fusion networks. Although these methods use information from one modality to boost the other, they neglect to integrate inter- and intra-modality correlations in a uniform module. We propose a deep heterogeneous graph alignment network over the video shots and question words. Furthermore, we explore the network architecture through four steps: representation, fusion, alignment, and reasoning. Within our network, the inter- and intra-modality information can be aligned and interacted simultaneously over the heterogeneous graph and used for cross-modal reasoning. We evaluate our method on three benchmark datasets and conduct an extensive ablation study on the effectiveness of the network architecture. Experiments show that the network is superior in quality.
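To make the alignment idea concrete, the sketch below computes a shot-word affinity matrix and uses it to exchange information between the two modalities. This is a generic cross-modal attention pattern assumed for illustration, not the paper's heterogeneous graph construction or reasoning module.

```python
# Cross-modal alignment sketch between video-shot and question-word features.
import torch
import torch.nn.functional as F

def align_modalities(shot_feats, word_feats):
    # shot_feats: (num_shots, dim), word_feats: (num_words, dim)
    affinity = shot_feats @ word_feats.T                       # (num_shots, num_words)
    shots_from_words = F.softmax(affinity, dim=1) @ word_feats  # word-informed shot features
    words_from_shots = F.softmax(affinity.T, dim=1) @ shot_feats  # shot-informed word features
    return shots_from_words, words_from_shots
```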
APA, Harvard, Vancouver, ISO, and other styles
39

Mademlis, Ioannis, Alexandros Iosifidis, Anastasios Tefas, Nikos Nikolaidis, and Ioannis Pitas. "Exploiting stereoscopic disparity for augmenting human activity recognition performance." Multimedia Tools and Applications 75 (October 4, 2016): 11641–60. https://doi.org/10.1007/s11042-015-2719-x.

Full text
Abstract:
This work investigates several ways to exploit scene depth information, implicitly available through the modality of stereoscopic disparity in 3D videos, with the purpose of augmenting performance in the problem of recognizing complex human activities in natural settings. The standard state-of-the-art activity recognition algorithmic pipeline consists of the consecutive stages of video description, video representation, and video classification. Multimodal, depth-aware modifications to standard methods are proposed and studied, both for video description and for video representation, that indirectly incorporate scene geometry information derived from stereo disparity. At the description level, this is made possible by suitably manipulating video interest points based on disparity data. At the representation level, the followed approach represents each video by multiple vectors corresponding to different disparity zones, resulting in multiple activity descriptions defined by disparity characteristics. In both cases, a scene segmentation is thus implicitly implemented, based on the distance of each imaged object from the camera during video acquisition. The investigated approaches are flexible and able to cooperate with any monocular low-level feature descriptor. They are evaluated using a publicly available activity recognition dataset of unconstrained stereoscopic 3D videos, consisting of extracts from Hollywood movies, and compared both against competing depth-aware approaches and against a state-of-the-art monocular algorithm. Quantitative evaluation reveals that some of the examined approaches achieve state-of-the-art performance.
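The disparity-zone representation can be illustrated as follows: each interest-point descriptor is routed to a zone according to its disparity value, and a separate representation can then be built per zone. The zone boundaries below are arbitrary assumptions, not the paper's settings.

```python
# Disparity-zone split sketch: group descriptors by depth zone before pooling.
import numpy as np

def split_by_disparity(descriptors, disparities, bins=(0, 16, 48, np.inf)):
    # descriptors: (N, D) local features; disparities: (N,) disparity per interest point.
    zones = np.digitize(disparities, bins[1:-1])          # zone index per point (0 = farthest)
    return [descriptors[zones == z] for z in range(len(bins) - 1)]
```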
APA, Harvard, Vancouver, ISO, and other styles
40

Hua, Hang, Yunlong Tang, Chenliang Xu, and Jiebo Luo. "V2Xum-LLM: Cross-Modal Video Summarization with Temporal Prompt Instruction Tuning." Proceedings of the AAAI Conference on Artificial Intelligence 39, no. 4 (2025): 3599–607. https://doi.org/10.1609/aaai.v39i4.32374.

Full text
Abstract:
Video summarization aims to create short, accurate, and cohesive summaries of longer videos. Despite the existence of various video summarization datasets, a notable limitation is their limited number of source videos, which hampers the effective training of advanced large vision-language models (VLMs). Additionally, most existing datasets are created for video-to-video summarization, overlooking the contemporary need for multimodal video content summarization. Recent efforts have been made to expand from unimodal to multimodal video summarization, categorizing the task into three sub-tasks based on the summary's modality: video-to-video (V2V), video-to-text (V2T), and a combination of video and text summarization (V2VT). However, the textual summaries in previous multimodal datasets are inadequate. To address these issues, we introduce Instruct-V2Xum, a cross-modal video summarization dataset featuring 30,000 diverse videos sourced from YouTube, with lengths ranging from 40 to 940 seconds and an average summarization ratio of 16.39%. Each video summary in Instruct-V2Xum is paired with a textual summary that references specific frame indexes, facilitating the generation of aligned video and textual summaries. In addition, we propose a new video summarization framework named V2Xum-LLM. V2Xum-LLM, specifically V2Xum-LLaMA in this study, is the first framework that unifies different video summarization tasks into one large language model's (LLM) text decoder and achieves task-controllable video summarization with temporal prompts and task instructions. Experiments show that V2Xum-LLaMA outperforms strong baseline models on multiple video summarization tasks. Furthermore, we propose an enhanced evaluation metric for V2V and V2VT summarization tasks.
APA, Harvard, Vancouver, ISO, and other styles
41

Zhang, Manlin, Jinpeng Wang, and Andy J. Ma. "Suppressing Static Visual Cues via Normalizing Flows for Self-Supervised Video Representation Learning." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 3 (2022): 3300–3308. http://dx.doi.org/10.1609/aaai.v36i3.20239.

Full text
Abstract:
Despite the great progress in video understanding made by deep convolutional neural networks, the feature representations learned by existing methods may be biased toward static visual cues. To address this issue, we propose a novel method to suppress static visual cues (SSVC) based on probabilistic analysis for self-supervised video representation learning. In our method, video frames are first encoded to obtain latent variables under a standard normal distribution via normalizing flows. By modelling the static factors in a video as a random variable, the conditional distribution of each latent variable becomes a shifted and scaled normal. Then, the latent variables that vary least over time are selected as static cues and suppressed to generate motion-preserved videos. Finally, positive pairs are constructed from motion-preserved videos for contrastive learning to alleviate the problem of representation bias toward static cues. The less-biased video representation generalizes better to various downstream tasks. Extensive experiments on publicly available benchmarks demonstrate that the proposed method outperforms the state of the art when only a single RGB modality is used for pre-training.
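The "select the least-varying latents as static cues" step admits a very small sketch, shown below under the assumption that each video yields a (T, D) matrix of latent codes; the normalizing-flow encoder and the motion-preserved video regeneration are omitted, and the keep ratio is a placeholder.

```python
# Sketch: mark low-temporal-variance latent dimensions as static cues to suppress.
import torch

def static_latent_mask(latents, suppress_ratio=0.5):
    # latents: (T, D) latent codes over time; low temporal variance suggests a static cue.
    variance = latents.var(dim=0)                                  # (D,)
    k = int(latents.shape[1] * suppress_ratio)
    static_idx = torch.topk(variance, k, largest=False).indices    # least-varying dims
    mask = torch.ones(latents.shape[1], dtype=torch.bool)
    mask[static_idx] = False                                       # False marks suppressed dims
    return mask
```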
APA, Harvard, Vancouver, ISO, and other styles
42

Leng, Zikang, Amitrajit Bhattacharjee, Hrudhai Rajasekhar, et al. "IMUGPT 2.0: Language-Based Cross Modality Transfer for Sensor-Based Human Activity Recognition." Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, no. 3 (2024): 1–32. http://dx.doi.org/10.1145/3678545.

Full text
Abstract:
One of the primary challenges in the field of human activity recognition (HAR) is the lack of large labeled datasets. This hinders the development of robust and generalizable models. Recently, cross modality transfer approaches have been explored that can alleviate the problem of data scarcity. These approaches convert existing datasets from a source modality, such as video, to a target modality, such as inertial measurement units (IMUs). With the emergence of generative AI models such as large language models (LLMs) and text-driven motion synthesis models, language has become a promising source data modality as well, as shown in proofs of concept such as IMUGPT. In this work, we conduct a large-scale evaluation of language-based cross modality transfer to determine its effectiveness for HAR. Based on this study, we introduce two new extensions for IMUGPT that enhance its use in practical HAR application scenarios: a motion filter capable of filtering out irrelevant motion sequences to ensure the relevance of the generated virtual IMU data, and a set of metrics that measure the diversity of the generated data, facilitating the determination of when to stop generating virtual IMU data for both effective and efficient processing. We demonstrate that our diversity metrics can reduce the effort needed for the generation of virtual IMU data by at least 50%, which opens up IMUGPT for practical use cases beyond a mere proof of concept.
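The abstract does not spell out the diversity metrics themselves, so the sketch below shows only a generic stand-in: mean pairwise distance between feature summaries of generated IMU windows, with generation stopped once the gain in diversity falls below a tolerance. All names and thresholds are hypothetical and are not the paper's metrics.

```python
# Generic diversity-based stopping sketch for synthetic data generation.
import numpy as np

def diversity(feature_vectors):
    # feature_vectors: (N, D) summaries (e.g., mean/std per channel) of generated windows.
    X = np.asarray(feature_vectors)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return dists[np.triu_indices(len(X), k=1)].mean()

def should_stop(previous_diversity, new_diversity, rel_tol=0.01):
    # Stop once the relative gain in diversity drops below the tolerance.
    return (new_diversity - previous_diversity) / max(previous_diversity, 1e-8) < rel_tol
```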
APA, Harvard, Vancouver, ISO, and other styles
43

Jiang, Meng, Liming Zhang, Xiaohua Wang, Shuang Li, and Yijie Jiao. "6D Object Pose Estimation Based on Cross-Modality Feature Fusion." Sensors 23, no. 19 (2023): 8088. http://dx.doi.org/10.3390/s23198088.

Full text
Abstract:
The 6D pose estimation using RGBD images plays a pivotal role in robotics applications. At present, after obtaining the RGB and depth modality information, most methods directly concatenate them without considering information interactions. This leads to the low accuracy of 6D pose estimation in occlusion and illumination changes. To solve this problem, we propose a new method to fuse RGB and depth modality features. Our method effectively uses individual information contained within each RGBD image modality and fully integrates cross-modality interactive information. Specifically, we transform depth images into point clouds, applying the PointNet++ network to extract point cloud features; RGB image features are extracted by CNNs and attention mechanisms are added to obtain context information within the single modality; then, we propose a cross-modality feature fusion module (CFFM) to obtain the cross-modality information, and introduce a feature contribution weight training module (CWTM) to allocate the different contributions of the two modalities to the target task. Finally, the result of 6D object pose estimation is obtained by the final cross-modality fusion feature. By enabling information interactions within and between modalities, the integration of the two modalities is maximized. Furthermore, considering the contribution of each modality enhances the overall robustness of the model. Our experiments indicate that the accuracy rate of our method on the LineMOD dataset can reach 96.9%, on average, using the ADD (-S) metric, while on the YCB-Video dataset, it can reach 94.7% using the ADD-S AUC metric and 96.5% using the ADD-S score (<2 cm) metric.
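A simple learnable contribution weight of the kind the CWTM hints at could be sketched as follows; the real module's design and inputs are not described in the abstract, so this is only an assumed softmax-weighted sum of the two modality features.

```python
# Learnable per-modality contribution weighting sketch (softmax-normalized).
import torch
import torch.nn as nn

class ModalityWeighting(nn.Module):
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(2))     # one logit per modality

    def forward(self, rgb_feat, depth_feat):
        w = torch.softmax(self.logits, dim=0)          # contribution weights sum to 1
        return w[0] * rgb_feat + w[1] * depth_feat
```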
APA, Harvard, Vancouver, ISO, and other styles
44

McLaren, Sean W., Dorota T. Kopycka-Kedzierawski, and Jed Nordfelt. "Accuracy of teledentistry examinations at predicting actual treatment modality in a pediatric dentistry clinic." Journal of Telemedicine and Telecare 23, no. 8 (2016): 710–15. http://dx.doi.org/10.1177/1357633x16661428.

Full text
Abstract:
Objectives The purpose of this study was to assess the accuracy of predicting dental treatment modalities for children seen initially by means of a live-video teledentistry consultation. Methods A retrospective dental record review was completed for 251 rural pediatric patients from the Finger Lakes region of New York State who had an initial teledentistry appointment with a board-certified pediatric dentist located remotely at the Eastman Institute for Oral Health in Rochester, NY. The proportions of children who were referred for specific treatment modalities and who completed treatment, and the proportions of children for whom the treatment recommendation was changed, were calculated. Fisher's exact test was used to assess statistical significance. Results The initial treatment modality was not changed for 221/251 (88%) children initially seen for a teledentistry consultation. Thirty (12%) children had the initial treatment modality changed, most frequently children for whom treatment with nitrous oxide had initially been suggested. Based on the initial treatment modality, changes to a different treatment modality were statistically significant (Fisher's exact test, p < 0.0001). Conclusions Our data suggest that the use of a live-video teledentistry consultation can be an effective way of predicting the best treatment modality for rural children with significant dental disease. A live-video teledentistry consultation can be an effective intervention to facilitate completion of complex treatment plans for children from a rural area who have extensive dental needs.
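For readers unfamiliar with the test cited here, a worked example with SciPy on a 2x2 table is shown below; the counts are invented for illustration and are not the study's data.

```python
# Fisher's exact test on a hypothetical 2x2 table of changed vs. unchanged
# treatment modality across two groups.
from scipy.stats import fisher_exact

table = [[15, 110],   # group A: changed, unchanged
         [15, 111]]   # group B: changed, unchanged
odds_ratio, p_value = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
```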
APA, Harvard, Vancouver, ISO, and other styles
45

Cheng, Yongjian, Dongmei Zhou, Siqi Wang, and Luhan Wen. "Emotion-Recognition Algorithm Based on Weight-Adaptive Thought of Audio and Video." Electronics 12, no. 11 (2023): 2548. http://dx.doi.org/10.3390/electronics12112548.

Full text
Abstract:
Emotion recognition commonly relies on single-modal recognition methods, such as voice and video signals, which demonstrate good practicability and universality in some scenarios. Nevertheless, as emotion-recognition application scenarios continue to expand and data volumes surge, single-modal emotion recognition proves insufficient to meet people’s needs for accuracy and comprehensiveness once the amount of data reaches a certain scale. Thus, this paper proposes applying a multimodal approach to enhance emotion-recognition accuracy and conducts the corresponding data preprocessing on the selected dataset. Appropriate models are constructed for both the audio and video modalities: for the audio-modality emotion-recognition task, this paper adopts the “time-distributed CNNs + LSTMs” model construction scheme; for the video-modality emotion-recognition task, the “DeepID V3 + Xception architecture” model construction scheme is selected. Furthermore, each model construction scheme undergoes experimental verification and comparison with existing emotion-recognition algorithms. Finally, this paper attempts late fusion and proposes and implements a late-fusion method based on the idea of weight adaptation. The experimental results demonstrate the superiority of the multimodal fusion algorithm proposed in this paper: compared to the single-modal emotion-recognition algorithm, the recognition accuracy is increased by almost 4%, reaching 84.33%.
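Confidence-driven late fusion in the spirit of the weight-adaptive idea above can be sketched as below: each modality's class probabilities are weighted by its own prediction confidence before averaging. The concrete weighting rule is an assumption, not the paper's scheme.

```python
# Weight-adaptive late fusion sketch over audio and video softmax outputs.
import numpy as np

def adaptive_late_fusion(audio_probs, video_probs):
    # audio_probs, video_probs: (num_classes,) softmax outputs from each model.
    w_a, w_v = audio_probs.max(), video_probs.max()      # use confidence as the weight
    fused = (w_a * audio_probs + w_v * video_probs) / (w_a + w_v)
    return int(np.argmax(fused))                         # index of the fused emotion class
```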
APA, Harvard, Vancouver, ISO, and other styles
46

Wang, Meng, Xian-Sheng Hua, Tao Mei, et al. "Interactive Video Annotation by Multi-Concept Multi-Modality Active Learning." International Journal of Semantic Computing 01, no. 04 (2007): 459–77. http://dx.doi.org/10.1142/s1793351x0700024x.

Full text
Abstract:
Active learning has been demonstrated to be an effective approach to reducing human labeling effort in multimedia annotation tasks. However, most of the existing active learning methods for video annotation are studied in a relatively simple context where concepts are sequentially annotated with fixed effort and only a single modality is applied. In practice, we usually have to deal with multiple modalities, and sequentially annotating concepts without preference cannot suitably assign annotation effort. To address these two issues, in this paper we propose a multi-concept multi-modality active learning method for video annotation in which multiple concepts and multiple modalities can be taken into consideration simultaneously. In each round of active learning, this method selects the concept that is expected to obtain the highest performance gain and a batch of suitable samples to be annotated for this concept. Then, graph-based semi-supervised learning is conducted on each modality for the selected concept. The proposed method is able to make full use of human effort by considering both the learnabilities of different concepts and the potentials of different modalities. Experimental results on the TRECVID 2005 benchmark have demonstrated its effectiveness and efficiency.
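As a rough illustration of selecting a concept and a batch in each round, the sketch below uses plain uncertainty sampling; this is a generic substitute and not the paper's expected-performance-gain criterion or its graph-based semi-supervised learner.

```python
# Uncertainty-sampling sketch: pick the least-confident concept, then its most
# uncertain unlabeled samples as the annotation batch.
import numpy as np

def select_concept_and_batch(concept_probs, batch_size=10):
    # concept_probs: dict mapping concept -> (num_unlabeled,) positive-class probabilities.
    margins = {c: np.abs(p - 0.5).mean() for c, p in concept_probs.items()}
    concept = min(margins, key=margins.get)                   # concept with lowest average margin
    order = np.argsort(np.abs(concept_probs[concept] - 0.5))  # most uncertain samples first
    return concept, order[:batch_size]
```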
APA, Harvard, Vancouver, ISO, and other styles
47

Dwivedi, Shivangi, John Hayes, Isabella Pedron, et al. "Comparing the efficacy of AR-based training with video-based training." Proceedings of the Human Factors and Ergonomics Society Annual Meeting 66, no. 1 (2022): 1862–66. http://dx.doi.org/10.1177/1071181322661289.

Full text
Abstract:
In recent years, US Emergency Medical Services (EMS) have faced a massive shortage of EMS workers. The sudden outbreak of the pandemic further exacerbated this issue by limiting in-person training. Additionally, current training modalities for first responders are costly and time-consuming, further limiting training opportunities. To overcome these challenges, this paper compares the efficacy of augmented reality (AR), an emerging training modality, with video-based training to address many of these issues without compromising the quality of the training under reduced instructor interaction. We examined performance, subjective, and physiological data to better understand the workload, user engagement, and cognitive load distribution of 51 participants during training. The statistical analysis of the physiological data and subjective responses indicates that performance during the AR- and video-based training and retention phases depended on gender, perception of workload, and cognitive load (intrinsic, germane, extraneous). However, user engagement was higher in AR-based training for both genders during training.
APA, Harvard, Vancouver, ISO, and other styles
48

Chen, Yingju, and Jeongkyu Lee. "A Review of Machine-Vision-Based Analysis of Wireless Capsule Endoscopy Video." Diagnostic and Therapeutic Endoscopy 2012 (November 13, 2012): 1–9. http://dx.doi.org/10.1155/2012/418037.

Full text
Abstract:
Wireless capsule endoscopy (WCE) enables a physician to diagnose a patient's digestive system without surgical procedures. However, it takes 1-2 hours for a gastroenterologist to examine the video. To speed up the review process, a number of analysis techniques based on machine vision have been proposed by computer science researchers. In order to train a machine to understand the semantics of an image, the image contents need to be translated into numerical form first. The numerical form of the image is known as image abstraction. The process of selecting relevant image features is often determined by the modality of medical images and the nature of the diagnoses. For example, there are radiographic projection-based images (e.g., X-rays and PET scans), tomography-based images (e.g., MRT and CT scans), and photography-based images (e.g., endoscopy, dermatology, and microscopic histology). Each modality imposes unique image-dependent restrictions for automatic and medically meaningful image abstraction processes. In this paper, we review the current development of machine-vision-based analysis of WCE video, focusing on the research that identifies specific gastrointestinal (GI) pathology and methods of shot boundary detection.
APA, Harvard, Vancouver, ISO, and other styles
49

Kim, Eun-Hee, Myung-Jin Lim, and Ju-Hyun Shin. "MMER-LMF: Multi-Modal Emotion Recognition in Lightweight Modality Fusion." Electronics 14, no. 11 (2025): 2139. https://doi.org/10.3390/electronics14112139.

Full text
Abstract:
Recently, multimodal approaches that combine various modalities have been attracting attention to recognizing emotions more accurately. Although multimodal fusion delivers strong performance, it is computationally intensive and difficult to handle in real time. In addition, there is a fundamental lack of large-scale emotional datasets for learning. In particular, Korean emotional datasets have fewer resources available than English-speaking datasets, thereby limiting the generalization capability of emotion recognition models. In this study, we propose a more lightweight modality fusion method, MMER-LMF, to overcome the lack of Korean emotional datasets and improve emotional recognition performance while reducing model training complexity. To this end, we suggest three algorithms that fuse emotion scores based on the reliability of each model, including text emotion scores extracted using a pre-trained large-scale language model and video emotion scores extracted based on a 3D CNN model. Each algorithm showed similar classification performance except for slight differences in disgust emotion performance with confidence-based weight adjustment, correlation coefficient utilization, and the Dempster–Shafer Theory-based combination method. The accuracy was 80% and the recall was 79%, which is higher than 58% using text modality and 72% using video modality. This is a superior result in terms of learning complexity and performance compared to previous studies using Korean datasets.
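Among the three score-fusion rules mentioned, the Dempster–Shafer one admits a compact sketch when the mass functions are restricted to singleton emotion classes; that restriction is a simplifying assumption made here and is not necessarily the paper's formulation.

```python
# Dempster's rule of combination for two mass functions over singleton classes.
import numpy as np

def dempster_combine(m1, m2):
    # m1, m2: (num_classes,) masses over singleton emotions, each summing to 1.
    agreement = m1 * m2                       # mass where both sources support the same class
    conflict = 1.0 - agreement.sum()          # K: total conflicting mass
    if conflict >= 1.0:
        raise ValueError("Sources are in total conflict.")
    return agreement / (1.0 - conflict)       # normalized combined masses
```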
APA, Harvard, Vancouver, ISO, and other styles
50

Lie, Wen-Nung, Dao-Quang Le, Chun-Yu Lai, and Yu-Shin Fang. "Heart Rate Estimation from Facial Image Sequences of a Dual-Modality RGB-NIR Camera." Sensors 23, no. 13 (2023): 6079. http://dx.doi.org/10.3390/s23136079.

Full text
Abstract:
This paper presents an RGB-NIR (Near Infrared) dual-modality technique to analyze the remote photoplethysmogram (rPPG) signal and hence estimate the heart rate (in beats per minute) from a facial image sequence. Our main innovative contribution is the introduction of several denoising techniques, such as Modified Amplitude Selective Filtering (MASF), Wavelet Decomposition (WD), and Robust Principal Component Analysis (RPCA), which take advantage of RGB and NIR band characteristics to uncover the rPPG signals effectively through an Independent Component Analysis (ICA)-based algorithm. Two datasets, of which one is the public PURE dataset and the other is the CCUHR dataset built with a popular Intel RealSense D435 RGB-D camera, are adopted in our experiments. Facial video sequences in the two datasets are diverse in nature, covering normal brightness, under-illumination (i.e., dark scenes), and facial motion. Experimental results show that the proposed method reaches competitive accuracy among the state-of-the-art methods even at shorter video lengths. For example, our method achieves MAE = 4.45 bpm (beats per minute) and RMSE = 6.18 bpm for RGB-NIR videos of 10 and 20 s in the CCUHR dataset, and MAE = 3.24 bpm and RMSE = 4.1 bpm for RGB videos of 60 s in the PURE dataset. Our system has the advantages of accessible and affordable hardware, simple and fast computations, and wide realistic applications.
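The ICA step common to this family of rPPG methods can be sketched as below: unmix the mean color traces into source signals and pick the component with the strongest spectral peak inside a plausible heart-rate band (assumed here to be 0.7–4 Hz). The paper's denoising stages (MASF, WD, RPCA) and its exact band limits are omitted.

```python
# ICA-based rPPG sketch: estimate heart rate from mean RGB/NIR traces of a face ROI.
import numpy as np
from sklearn.decomposition import FastICA

def estimate_bpm(color_traces, fps):
    # color_traces: (T, C) mean channel values of the face region per frame.
    sources = FastICA(n_components=color_traces.shape[1], random_state=0).fit_transform(color_traces)
    freqs = np.fft.rfftfreq(len(sources), d=1.0 / fps)
    band = (freqs >= 0.7) & (freqs <= 4.0)                 # plausible heart-rate band in Hz
    best_bpm, best_power = 0.0, -np.inf
    for s in sources.T:                                    # pick the most periodic component
        spectrum = np.abs(np.fft.rfft(s - s.mean())) ** 2
        peak = spectrum[band].argmax()
        if spectrum[band][peak] > best_power:
            best_power = spectrum[band][peak]
            best_bpm = 60.0 * freqs[band][peak]            # convert Hz to beats per minute
    return best_bpm
```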
APA, Harvard, Vancouver, ISO, and other styles