Academic literature on the topic 'Video Vision Transformer'
Consult the lists of relevant articles, books, theses, conference papers, and other scholarly sources on the topic 'Video Vision Transformer.'
Journal articles on the topic "Video Vision Transformer"
Naikwadi, Sanket Shashikant. "Video Summarization Using Vision and Language Transformer Models." International Journal of Research Publication and Reviews 6, no. 6 (January 2025): 5217–21. https://doi.org/10.55248/gengpi.6.0125.0654.
Moutik, Oumaima, Hiba Sekkat, Smail Tigani, Abdellah Chehri, Rachid Saadane, Taha Ait Tchakoucht, and Anand Paul. "Convolutional Neural Networks or Vision Transformers: Who Will Win the Race for Action Recognitions in Visual Data?" Sensors 23, no. 2 (January 9, 2023): 734. http://dx.doi.org/10.3390/s23020734.
Yuan, Hongchun, Zhenyu Cai, Hui Zhou, Yue Wang, and Xiangzhi Chen. "TransAnomaly: Video Anomaly Detection Using Video Vision Transformer." IEEE Access 9 (2021): 123977–86. http://dx.doi.org/10.1109/access.2021.3109102.
Sarraf, Saman, and Milton Kabia. "Optimal Topology of Vision Transformer for Real-Time Video Action Recognition in an End-To-End Cloud Solution." Machine Learning and Knowledge Extraction 5, no. 4 (September 29, 2023): 1320–39. http://dx.doi.org/10.3390/make5040067.
Zhao, Hong, Zhiwen Chen, Lan Guo, and Zeyu Han. "Video captioning based on vision transformer and reinforcement learning." PeerJ Computer Science 8 (March 16, 2022): e916. http://dx.doi.org/10.7717/peerj-cs.916.
Im, Heeju, and Yong Suk Choi. "A Full Transformer Video Captioning Model via Vision Transformer." KIISE Transactions on Computing Practices 29, no. 8 (August 31, 2023): 378–83. http://dx.doi.org/10.5626/ktcp.2023.29.8.378.
Ugile, Tukaram, and Nilesh Uke. "Transformer Architectures for Computer Vision: A Comprehensive Review and Future Research Directions." Journal of Dynamics and Control 9, no. 3 (March 15, 2025): 70–79. https://doi.org/10.71058/jodac.v9i3005.
Wu, Pengfei, Le Wang, Sanping Zhou, Gang Hua, and Changyin Sun. "Temporal Correlation Vision Transformer for Video Person Re-Identification." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 6 (March 24, 2024): 6083–91. http://dx.doi.org/10.1609/aaai.v38i6.28424.
Jin, Yanxiu, and Rulin Ma. "Applications of transformers in computer vision." Applied and Computational Engineering 16, no. 1 (October 23, 2023): 234–41. http://dx.doi.org/10.54254/2755-2721/16/20230898.
Pei, Pengfei, Xianfeng Zhao, Jinchuan Li, Yun Cao, and Xuyuan Lai. "Vision Transformer-Based Video Hashing Retrieval for Tracing the Source of Fake Videos." Security and Communication Networks 2023 (June 28, 2023): 1–16. http://dx.doi.org/10.1155/2023/5349392.
Dissertations / Theses on the topic "Video Vision Transformer"
Zhang, Yujing. "Deep learning-assisted video list decoding in error-prone video transmission systems." Electronic Thesis or Diss., Valenciennes, Université Polytechnique Hauts-de-France, 2024. http://www.theses.fr/2024UPHF0028.
In recent years, video applications have developed rapidly. At the same time, the video quality experience has improved considerably with the advent of HD video and the emergence of 4K content. As a result, video streams tend to represent a larger amount of data, and new video compression solutions such as HEVC have been developed to reduce their size. However, transmission errors that may occur over networks can cause unwanted visual artifacts that significantly degrade the user experience. Various approaches have been proposed in the literature to find efficient, low-complexity solutions for repairing video packets containing binary errors, thus avoiding costly retransmissions that are incompatible with the low-latency constraints of many emerging applications (immersive video, tele-operation). Error correction based on the cyclic redundancy check (CRC) is a promising approach that uses readily available information without throughput overhead, but in practice it can only correct a limited number of errors. Depending on the generator polynomial used, the packet size, and the maximum number of errors considered, this method can yield not a single corrected packet but a list of possibly corrected packets. In this case, list decoding becomes relevant in combination with CRC-based error correction, as well as with methods exploiting information on the reliability of the received bits. However, this raises the problem of selecting among the candidate videos. In state-of-the-art list decoding, once the ranked candidates have been generated, the final selection often takes the first valid candidate in the list as the reconstructed video. This simple selection is arbitrary and not optimal: the candidate video sequence at the top of the list is not necessarily the one with the best visual quality. It is therefore necessary to develop a new method to automatically select the video with the highest quality from the list of candidates. We propose to select the best candidate based on the visual quality determined by a deep learning (DL) system. Since distortions are assessed on each frame, we consider image quality assessment rather than video quality assessment. More specifically, each candidate is processed by a deep-learning-based no-reference image quality assessment (IQA) method to obtain a score, and the system then selects the candidate with the highest IQA score. Our system evaluates the quality of videos subject to transmission errors without removing lost packets or concealing lost regions. Distortions caused by transmission errors differ from those addressed by traditional visual quality measures, which typically deal with global, uniform image distortions; such metrics therefore fail to distinguish the repaired version from the various corrupted versions when local, non-uniform errors occur. Our approach revisits and optimizes the classic list decoding technique by coupling it first with a CNN architecture and then with a Transformer to evaluate visual quality and identify the best candidate. This approach is novel and offers excellent performance: when transmission errors occur within an intra frame, our CNN- and Transformer-based architectures achieve 100% decision accuracy, and for errors in an inter frame the accuracy is 93% and 95%, respectively.
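To make the selection step described in this abstract concrete, here is a minimal sketch (assuming PyTorch) of scoring each decoded candidate with a no-reference IQA model and keeping the highest-scoring one. The scorer below is an untrained stand-in, not the CNN or Transformer architecture developed in the thesis; all names and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class NoReferenceIQA(nn.Module):
    """Stand-in no-reference IQA scorer: maps each frame to a quality score
    and averages over the frames of a candidate video."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 1)

    def forward(self, frames):                     # frames: (T, 3, H, W)
        per_frame = self.head(self.features(frames).flatten(1))
        return per_frame.mean()                    # one score per candidate video

def select_best_candidate(candidates, scorer):
    """Return the index of the candidate with the highest predicted quality."""
    with torch.no_grad():
        scores = [scorer(video).item() for video in candidates]
    return max(range(len(scores)), key=scores.__getitem__)

if __name__ == "__main__":
    scorer = NoReferenceIQA().eval()
    # Three hypothetical decoded candidates: 8 frames each at 64x64 resolution.
    candidates = [torch.rand(8, 3, 64, 64) for _ in range(3)]
    print("Best candidate index:", select_best_candidate(candidates, scorer))
```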
Filali Razzouki, Anas. "Deep learning-based video face-based digital markers for early detection and analysis of Parkinson disease." Electronic Thesis or Diss., Institut polytechnique de Paris, 2025. http://www.theses.fr/2025IPPAS002.
This thesis aims to develop robust digital biomarkers for early detection of Parkinson's disease (PD) by analyzing facial videos to identify changes associated with hypomimia. In this context, we introduce two new contributions to the state of the art: one based on shallow machine learning and the other on deep learning. The first method employs machine learning models that use manually extracted facial features, particularly derivatives of facial action units (AUs). These models incorporate interpretability mechanisms that explain their decision-making process to stakeholders, highlighting the facial features most distinctive of PD. We examine the influence of biological sex on these digital biomarkers, compare them against neuroimaging data and clinical scores, and use them to predict PD severity. The second method leverages deep learning to automatically extract features from raw facial videos and optical flow using foundation models based on Video Vision Transformers. To address the limited training data, we propose advanced adaptive transfer learning techniques that utilize foundation models trained on large-scale video classification datasets. We also integrate interpretability mechanisms to clarify the relationship between the automatically extracted features and the manually extracted facial AUs, enhancing the comprehensibility of the model's decisions. Finally, our facial features are derived from both cross-sectional and longitudinal data, which provides a significant advantage over existing work. We use these recordings to analyze the progression of hypomimia over time with these digital markers, and its correlation with the progression of clinical scores. Combining these two approaches yields a classification AUC (Area Under the Curve) of over 90%, demonstrating the efficacy of machine learning and deep learning models in detecting hypomimia in early-stage PD patients through facial videos. This research could enable continuous monitoring of hypomimia outside hospital settings via telemedicine.
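As an illustration of the shallow branch described above, the sketch below (assuming scikit-learn) trains a simple classifier on per-video feature vectors such as action-unit statistics. The features, labels, and classifier choice are synthetic placeholders, not data, models, or results from the thesis.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical per-video features: e.g. mean intensity and temporal derivative
# of 17 facial action units (34 values per video); labels are synthetic.
X = rng.normal(size=(120, 34))
y = rng.integers(0, 2, size=120)     # 1 = early-stage PD, 0 = control (placeholder)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"Cross-validated AUC on synthetic data: {auc.mean():.2f}")
```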
Cedernaes, Erasmus. "Runway detection in LWIR video : Real time image processing and presentation of sensor data." Thesis, Uppsala universitet, Avdelningen för visuell information och interaktion, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-300690.
Saravi, Sara. "Use of Coherent Point Drift in computer vision applications." Thesis, Loughborough University, 2013. https://dspace.lboro.ac.uk/2134/12548.
Leoputra, Wilson Suryajaya. "Video foreground extraction for mobile camera platforms." Thesis, Curtin University, 2009. http://hdl.handle.net/20.500.11937/1384.
Ali, Abid. "Analyse vidéo à l'aide de réseaux de neurones profonds : une application pour l'autisme" [Video analysis using deep neural networks: an application for autism]. Electronic Thesis or Diss., Université Côte d'Azur, 2024. http://www.theses.fr/2024COAZ4066.
Understanding actions in videos is a crucial element of computer vision with significant implications across various fields. As our dependence on visual data grows, comprehending and interpreting human actions in videos becomes essential for advancing technologies in surveillance, healthcare, autonomous systems, and human-computer interaction. Accurate interpretation of actions in videos is fundamental for creating intelligent systems that can effectively navigate and respond to the complexities of the real world; in this context, advances in action understanding push the boundaries of computer vision and play a crucial role in shaping the cutting-edge applications that impact our daily lives. Computer vision has made significant progress with the rise of deep learning methods such as convolutional neural networks (CNNs), which have enabled advances in many domains, including image segmentation, object detection, scene understanding, and more. However, video processing remains limited compared to static images. In this thesis, we focus on action understanding, dividing it into two main parts, action recognition and action detection, and on their application in the medical domain for autism analysis. We explore the various aspects and challenges of video understanding from both a general and an application-specific perspective, and we present our contributions and solutions to address these challenges. In addition, we introduce the ACTIVIS dataset, designed to diagnose autism in young children. Our work is divided into two main parts: generic modeling and applied models. Initially, we focus on adapting image models for action recognition tasks by incorporating temporal modeling using parameter-efficient fine-tuning (PEFT) techniques. We also address real-time action detection and anticipation by proposing a new joint model for action anticipation and online action detection in real-life scenarios. Furthermore, we introduce a new task called 'loose interaction' in dyadic situations and explore its applications in autism analysis. Finally, we concentrate on the applied side of video understanding by proposing an action recognition model for repetitive behaviors in videos of autistic individuals, and we conclude by proposing a weakly supervised method to estimate the severity score of autistic children in long videos.
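The parameter-efficient fine-tuning (PEFT) idea mentioned in this abstract, adding temporal modeling on top of a frozen image model and training only a small module, can be sketched as follows (assuming PyTorch). The frame encoder here is a placeholder standing in for a pretrained vision transformer, and the adapter design is an illustrative assumption, not the thesis's actual architecture.

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Placeholder frozen image backbone (stands in for a pretrained vision transformer)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, dim, 16, stride=16), nn.AdaptiveAvgPool2d(1))

    def forward(self, x):                          # x: (B*T, 3, H, W)
        return self.net(x).flatten(1)              # (B*T, dim) per-frame features

class TemporalAdapter(nn.Module):
    """Small trainable module adding temporal modeling over frozen per-frame features."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats):                      # feats: (B, T, dim)
        out, _ = self.attn(feats, feats, feats)
        return self.norm(feats + out)              # residual keeps the adapter near identity

class VideoClassifier(nn.Module):
    def __init__(self, num_classes=10, dim=256):
        super().__init__()
        self.backbone = FrameEncoder(dim)
        for p in self.backbone.parameters():       # parameter-efficient: backbone stays frozen
            p.requires_grad = False
        self.adapter = TemporalAdapter(dim)        # only the adapter and head are trained
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video):                      # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.backbone(video.flatten(0, 1)).view(b, t, -1)
        return self.head(self.adapter(feats).mean(dim=1))

if __name__ == "__main__":
    model = VideoClassifier()
    logits = model(torch.rand(2, 8, 3, 64, 64))
    print(logits.shape)                            # torch.Size([2, 10])
```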
Burger, Thomas. "Reconnaissance automatique des gestes de la langue française parlée complétée" [Automatic recognition of French Cued Speech gestures]. PhD thesis, Grenoble INPG, 2007. http://tel.archives-ouvertes.fr/tel-00203360.
Books on the topic "Video Vision Transformer"
Korsgaard, Mathias Bonde. Music Video Transformed. Edited by John Richardson, Claudia Gorbman, and Carol Vernallis. Oxford University Press, 2013. http://dx.doi.org/10.1093/oxfordhb/9780199733866.013.015.
Book chapters on the topic "Video Vision Transformer"
Gabeur, Valentin, Chen Sun, Karteek Alahari, and Cordelia Schmid. "Multi-modal Transformer for Video Retrieval." In Computer Vision – ECCV 2020, 214–29. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-58548-8_13.
Kim, Hannah Halin, Shuzhi Yu, Shuai Yuan, and Carlo Tomasi. "Cross-Attention Transformer for Video Interpolation." In Computer Vision – ACCV 2022 Workshops, 325–42. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-27066-6_23.
Kim, Tae Hyun, Mehdi S. M. Sajjadi, Michael Hirsch, and Bernhard Schölkopf. "Spatio-Temporal Transformer Network for Video Restoration." In Computer Vision – ECCV 2018, 111–27. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-030-01219-9_7.
Xue, Tong, Qianrui Wang, Xinyi Huang, and Dengshi Li. "Self-guided Transformer for Video Super-Resolution." In Pattern Recognition and Computer Vision, 186–98. Singapore: Springer Nature Singapore, 2023. http://dx.doi.org/10.1007/978-981-99-8549-4_16.
Li, Zutong, and Lei Yang. "DCVQE: A Hierarchical Transformer for Video Quality Assessment." In Computer Vision – ACCV 2022, 398–416. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-26316-3_24.
Courant, Robin, Maika Edberg, Nicolas Dufour, and Vicky Kalogeiton. "Transformers and Visual Transformers." In Machine Learning for Brain Disorders, 193–229. New York, NY: Springer US, 2023. http://dx.doi.org/10.1007/978-1-0716-3195-9_6.
Huo, Shuwei, Yuan Zhou, and Haiyang Wang. "YFormer: A New Transformer Architecture for Video-Query Based Video Moment Retrieval." In Pattern Recognition and Computer Vision, 638–50. Cham: Springer Nature Switzerland, 2022. http://dx.doi.org/10.1007/978-3-031-18913-5_49.
Li, Li, Liansheng Zhuang, Shenghua Gao, and Shafei Wang. "HaViT: Hybrid-Attention Based Vision Transformer for Video Classification." In Computer Vision – ACCV 2022, 502–17. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-26316-3_30.
Zhang, Hui, Jiewen Yang, Xingbo Dong, Xingguo Lv, Wei Jia, Zhe Jin, and Xuejun Li. "A Video Face Recognition Leveraging Temporal Information Based on Vision Transformer." In Pattern Recognition and Computer Vision, 29–43. Singapore: Springer Nature Singapore, 2023. http://dx.doi.org/10.1007/978-981-99-8469-5_3.
Wu, Jinlin, Lingxiao He, Wu Liu, Yang Yang, Zhen Lei, Tao Mei, and Stan Z. Li. "CAViT: Contextual Alignment Vision Transformer for Video Object Re-identification." In Lecture Notes in Computer Science, 549–66. Cham: Springer Nature Switzerland, 2022. http://dx.doi.org/10.1007/978-3-031-19781-9_32.
Conference papers on the topic "Video Vision Transformer"
Kobayashi, Takumi, and Masataka Seo. "Efficient Compression Method in Video Reconstruction Using Video Vision Transformer." In 2024 IEEE 13th Global Conference on Consumer Electronics (GCCE), 724–25. IEEE, 2024. https://doi.org/10.1109/gcce62371.2024.10760444.
Yokota, Haruto, Mert Bozkurtlar, Benjamin Yen, Katsutoshi Itoyama, Kenji Nishida, and Kazuhiro Nakadai. "A Video Vision Transformer for Sound Source Localization." In 2024 32nd European Signal Processing Conference (EUSIPCO), 106–10. IEEE, 2024. http://dx.doi.org/10.23919/eusipco63174.2024.10715427.
Ojaswee, R. Sreemathy, Mousami Turuk, Jayashree Jagdale, and Mohammad Anish. "Indian Sign Language Recognition Using Video Vision Transformer." In 2024 3rd International Conference for Advancement in Technology (ICONAT), 1–7. IEEE, 2024. https://doi.org/10.1109/iconat61936.2024.10774678.
Thuan, Pham Minh, Bui Thu Lam, and Pham Duy Trung. "Spatial Vision Transformer: A Novel Approach to Deepfake Video Detection." In 2024 1st International Conference On Cryptography And Information Security (VCRIS), 1–6. IEEE, 2024. https://doi.org/10.1109/vcris63677.2024.10813391.
Kumari, Supriya, Prince Kumar, Pooja Verma, Rajitha B, and Sarsij Tripathi. "Hybrid Vision Transformer and Convolutional Neural Network for Sports Video Classification." In 2024 International Conference on Intelligent Computing and Emerging Communication Technologies (ICEC), 1–5. IEEE, 2024. https://doi.org/10.1109/icec59683.2024.10837289.
Isogawa, Junya, Fumihiko Sakaue, and Jun Sato. "Simultaneous Estimation of Driving Intentions for Multiple Vehicles Using Video Transformer." In 20th International Conference on Computer Vision Theory and Applications, 471–77. SCITEPRESS - Science and Technology Publications, 2025. https://doi.org/10.5220/0013232100003912.
Gupta, Anisha, and Vidit Kumar. "A Hybrid U-Net and Vision Transformer approach for Video Anomaly detection." In 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), 1–6. IEEE, 2024. http://dx.doi.org/10.1109/icccnt61001.2024.10725860.
Ansari, Khustar, and Priyanka Srivastava. "Hybrid Attention Vision Transformer-based Deep Learning Model for Video Caption Generation." In 2025 International Conference on Electronics and Renewable Systems (ICEARS), 1238–45. IEEE, 2025. https://doi.org/10.1109/icears64219.2025.10940922.
Zhou, Xingyu, Leheng Zhang, Xiaorui Zhao, Keze Wang, Leida Li, and Shuhang Gu. "Video Super-Resolution Transformer with Masked Inter&Intra-Frame Attention." In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 25399–408. IEEE, 2024. http://dx.doi.org/10.1109/cvpr52733.2024.02400.
Choi, Joonmyung, Sanghyeok Lee, Jaewon Chu, Minhyuk Choi, and Hyunwoo J. Kim. "vid-TLDR: Training Free Token merging for Light-Weight Video Transformer." In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18771–81. IEEE, 2024. http://dx.doi.org/10.1109/cvpr52733.2024.01776.