Academic literature on the topic 'Video Vision Transformer'

Author: Grafiati

Published: 12 April 2025

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Video Vision Transformer.'

Journal articles on the topic "Video Vision Transformer"

Naikwadi, Sanket Shashikant. "Video Summarization Using Vision and Language Transformer Models." International Journal of Research Publication and Reviews 6, no. 6 (January 2025): 5217–21. https://doi.org/10.55248/gengpi.6.0125.0654.

Moutik, Oumaima, Hiba Sekkat, Smail Tigani, Abdellah Chehri, Rachid Saadane, Taha Ait Tchakoucht, and Anand Paul. "Convolutional Neural Networks or Vision Transformers: Who Will Win the Race for Action Recognitions in Visual Data?" Sensors 23, no. 2 (January 9, 2023): 734. http://dx.doi.org/10.3390/s23020734.

Abstract:

Understanding actions in videos remains a significant challenge in computer vision, which has been the subject of several pieces of research in the last decades. Convolutional neural networks (CNN) are a significant component of this topic and play a crucial role in the renown of Deep Learning. Inspired by the human vision system, CNN has been applied to visual data exploitation and has solved various challenges in various computer vision tasks and video/image analysis, including action recognition (AR). However, not long ago, along with the achievement of the transformer in natural language processing (NLP), it began to set new trends in vision tasks, which has created a discussion around whether the Vision Transformer models (ViT) will replace CNN in action recognition in video clips. This paper conducts this trending topic in detail, the study of CNN and Transformer for Action Recognition separately and a comparative study of the accuracy-complexity trade-off. Finally, based on the performance analysis’s outcome, the question of whether CNN or Vision Transformers will win the race will be discussed.

Yuan, Hongchun, Zhenyu Cai, Hui Zhou, Yue Wang, and Xiangzhi Chen. "TransAnomaly: Video Anomaly Detection Using Video Vision Transformer." IEEE Access 9 (2021): 123977–86. http://dx.doi.org/10.1109/access.2021.3109102.

Sarraf, Saman, and Milton Kabia. "Optimal Topology of Vision Transformer for Real-Time Video Action Recognition in an End-To-End Cloud Solution." Machine Learning and Knowledge Extraction 5, no. 4 (September 29, 2023): 1320–39. http://dx.doi.org/10.3390/make5040067.

Abstract:

This study introduces an optimal topology of vision transformers for real-time video action recognition in a cloud-based solution. Although model performance is a key criterion for real-time video analysis use cases, inference latency plays a more crucial role in adopting such technology in real-world scenarios. Our objective is to reduce the inference latency of the solution while admissibly maintaining the vision transformer’s performance. Thus, we employed the optimal cloud components as the foundation of our machine learning pipeline and optimized the topology of vision transformers. We utilized UCF101, including more than one million action recognition video clips. The modeling pipeline consists of a preprocessing module to extract frames from video clips, training two-dimensional (2D) vision transformer models, and deep learning baselines. The pipeline also includes a postprocessing step to aggregate the frame-level predictions to generate the video-level predictions at inference. The results demonstrate that our optimal vision transformer model with an input dimension of 56 × 56 × 3 with eight attention heads produces an F1 score of 91.497% for the testing set. The optimized vision transformer reduces the inference latency by 40.70%, measured through a batch-processing approach, with a 55.63% faster training time than the baseline. Lastly, we developed an enhanced skip-frame approach to improve the inference latency by finding an optimal ratio of frames for prediction at inference, where we could further reduce the inference latency by 57.15%. This study reveals that the vision transformer model is highly optimizable for inference latency while maintaining the model performance.
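
The postprocessing and skip-frame steps described in this abstract can be illustrated with a small sketch. The abstract does not say how frame-level predictions are aggregated, so majority voting is an assumption here, and the function name `video_level_prediction` and the parameter `keep_every` are hypothetical.

```python
from collections import Counter
from typing import Sequence


def video_level_prediction(frame_labels: Sequence[str], keep_every: int = 1) -> str:
    """Aggregate frame-level labels into one video-level label.

    Assumption: aggregation is a majority vote. `keep_every` is a
    hypothetical skip-frame ratio: only every `keep_every`-th frame
    contributes to the vote, which reduces the frames to classify.
    """
    if keep_every < 1:
        raise ValueError("keep_every must be >= 1")
    sampled = frame_labels[::keep_every]
    if not sampled:
        raise ValueError("no frames to aggregate")
    # most_common(1) returns the label with the highest count;
    # ties resolve to the label encountered first.
    return Counter(sampled).most_common(1)[0][0]


if __name__ == "__main__":
    labels = ["run", "run", "jump", "run", "jump", "run"]
    print(video_level_prediction(labels))                # all frames
    print(video_level_prediction(labels, keep_every=2))  # every second frame
```

With this toy input, skipping frames changes the video-level label, which is one reason a skip ratio would need to be tuned against accuracy rather than chosen for latency alone.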

Zhao, Hong, Zhiwen Chen, Lan Guo, and Zeyu Han. "Video captioning based on vision transformer and reinforcement learning." PeerJ Computer Science 8 (March 16, 2022): e916. http://dx.doi.org/10.7717/peerj-cs.916.

Abstract:

Global encoding of visual features in video captioning is important for improving the description accuracy. In this paper, we propose a video captioning method that combines Vision Transformer (ViT) and reinforcement learning. Firstly, ResNet-152 and ResNeXt-101 are used to extract features from videos. Secondly, the encoding block of the ViT network is applied to encode video features. Thirdly, the encoded features are fed into a Long Short-Term Memory (LSTM) network to generate a video content description. Finally, the accuracy of video content description is further improved by fine-tuning reinforcement learning. We conducted experiments on the benchmark dataset MSR-VTT used for video captioning. The results show that compared with the current mainstream methods, the model in this paper has improved by 2.9%, 1.4%, 0.9% and 4.8% under the four evaluation indicators of BLEU-4, METEOR, ROUGE-L and CIDEr-D, respectively.

Im, Heeju, and Yong Suk Choi. "A Full Transformer Video Captioning Model via Vision Transformer." KIISE Transactions on Computing Practices 29, no. 8 (August 31, 2023): 378–83. http://dx.doi.org/10.5626/ktcp.2023.29.8.378.

Ugile, Tukaram, and Nilesh Uke. "Transformer Architectures for Computer Vision: A Comprehensive Review and Future Research Directions." Journal of Dynamics and Control 9, no. 3 (March 15, 2025): 70–79. https://doi.org/10.71058/jodac.v9i3005.

Abstract:

Transformers have made revolutionary impacts in Natural Language Processing (NLP) area and started making significant contributions in Computer Vision problems. This paper provides a comprehensive review of the Transformer Architectures in Computer Vision, providing a detailed view about their evolution from Vision Transformers (ViTs) to more advanced variants of transformers like Swin Transformer, Transformer-XL, and Hybrid CNN-Transformer models. We have tried to make the study of the advantages of the Transformers over the traditional Convolutional Neural Networks (CNNs), their applications for Object Detection, Image Classification, Video Analysis, and their computational challenges. Finally, we discuss the future research directions, including the self-attention mechanisms, multi-modal learning, and lightweight architectures for Edge Computing.

Wu, Pengfei, Le Wang, Sanping Zhou, Gang Hua, and Changyin Sun. "Temporal Correlation Vision Transformer for Video Person Re-Identification." Proceedings of the AAAI Conference on Artificial Intelligence 38, no. 6 (March 24, 2024): 6083–91. http://dx.doi.org/10.1609/aaai.v38i6.28424.

Abstract:

Video Person Re-Identification (Re-ID) is a task of retrieving persons from multi-camera surveillance systems. Despite the progress made in leveraging spatio-temporal information in videos, occlusion in dense crowds still hinders further progress. To address this issue, we propose a Temporal Correlation Vision Transformer (TCViT) for video person Re-ID. TCViT consists of a Temporal Correlation Attention (TCA) module and a Learnable Temporal Aggregation (LTA) module. The TCA module is designed to reduce the impact of non-target persons by relative state, while the LTA module is used to aggregate frame-level features based on their completeness. Specifically, TCA is a parameter-free module that first aligns frame-level features to restore semantic coherence in videos and then enhances the features of the target person according to temporal correlation. Additionally, unlike previous methods that treat each frame equally with a pooling layer, LTA introduces a lightweight learnable module to weigh and aggregate frame-level features under the guidance of a classification score. Extensive experiments on four prevalent benchmarks demonstrate that our method achieves state-of-the-art performance in video Re-ID.
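
The contrast this abstract draws between equal-weight pooling and score-guided aggregation can be sketched as follows. This is not the LTA module itself: the learnable part is omitted, and turning per-frame scores into weights with a softmax is an assumption. The name `aggregate_frames` and its parameters are hypothetical.

```python
import math
from typing import List, Sequence


def aggregate_frames(features: Sequence[Sequence[float]],
                     scores: Sequence[float]) -> List[float]:
    """Weighted average of frame-level feature vectors.

    Assumption: each frame has a scalar score (for example a
    classification confidence) and weights are a softmax over those
    scores. Equal scores reduce this to plain average pooling.
    """
    if not features or len(features) != len(scores):
        raise ValueError("need one score per frame and at least one frame")
    peak = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


if __name__ == "__main__":
    frames = [[1.0, 0.0], [3.0, 2.0]]
    print(aggregate_frames(frames, [0.0, 0.0]))    # equal weights
    print(aggregate_frames(frames, [10.0, -10.0])) # first frame dominates
```

A frame with a low score, such as one where the target person is occluded, then contributes little to the aggregated feature instead of being averaged in at full weight.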

Jin, Yanxiu, and Rulin Ma. "Applications of transformers in computer vision." Applied and Computational Engineering 16, no. 1 (October 23, 2023): 234–41. http://dx.doi.org/10.54254/2755-2721/16/20230898.

Abstract:

Recently, research based on transformers has become a hot topic. Owing to their ability to capture long-range dependencies, transformers have been rapidly adopted in the field of computer vision for processing image and video data. Despite their widespread adoption, the application of transformer in computer vision such as semantic segmentation, image generation and image repair are still lacking. To address this gap, this paper provides a thorough review and summary of the latest research findings on the applications of transformers in these areas, with a focus on the mechanism of transformers and using ViT (Vision Transformer) as an example. The paper further highlights recent or popular discoveries of transformers in medical scenarios, image generation, and image inpainting. Based on the research, this work also provides insights on future developments and expectations.

Pei, Pengfei, Xianfeng Zhao, Jinchuan Li, Yun Cao, and Xuyuan Lai. "Vision Transformer-Based Video Hashing Retrieval for Tracing the Source of Fake Videos." Security and Communication Networks 2023 (June 28, 2023): 1–16. http://dx.doi.org/10.1155/2023/5349392.

Abstract:

With the increasing negative impact of fake videos on individuals and society, it is crucial to detect different types of forgeries. Existing forgery detection methods often output a probability value, which lacks interpretability and reliability. In this paper, we propose a source-tracing-based solution to find the original real video of a fake video, which can provide more reliable results in practical situations. However, directly applying retrieval methods to traceability tasks is infeasible since traceability tasks require finding the unique source video from a large number of real videos, while retrieval methods are typically used to find similar videos. In addition, training an effective hashing center to distinguish similar real videos is challenging. To address the above issues, we introduce a novel loss function, hash triplet loss, to capture fine-grained features with subtle differences. Extensive experiments show that our method outperforms state-of-the-art methods on multiple datasets of object removal (video inpainting), object addition (video splicing), and object swapping (face swapping), demonstrating excellent robustness and cross-dataset performance. The effectiveness of the hash triplet loss for nondifferentiable optimization problems is validated through experiments in similar video scenes.

Dissertations / Theses on the topic "Video Vision Transformer"

Zhang, Yujing. "Deep learning-assisted video list decoding in error-prone video transmission systems." Electronic Thesis or Diss., Valenciennes, Université Polytechnique Hauts-de-France, 2024. http://www.theses.fr/2024UPHF0028.

Abstract:

In recent years, video applications have developed rapidly. At the same time, the video quality experience has improved considerably with the advent of HD video and the emergence of 4K content. As a result, video streams tend to represent a larger amount of data. To reduce the size of these video streams, new video compression solutions such as HEVC have been developed. However, transmission errors that may occur over networks can cause unwanted visual artifacts that significantly degrade the user experience. Various approaches have been proposed in the literature to find efficient and low-complexity solutions to repair video packets containing binary errors, thus avoiding costly retransmission that is incompatible with the low latency constraints of many emerging applications (immersive video, tele-operation). Error correction based on cyclic redundancy check (CRC) is a promising approach that uses readily available information without throughput overhead. However, in practice it can only correct a limited number of errors. Depending on the generator polynomial used, the size of the packets and the maximum number of errors considered, this method can lead not to a single corrected packet but rather to a list of possibly corrected packets. In this case, list decoding becomes relevant in combination with CRC-based error correction as well as methods exploiting information on the reliability of the received bits. However, this has disadvantages in terms of selection of candidate videos. Following the generation of ranked candidates during the state-of-the-art list decoding process, the final selection often considers the first valid candidate in the final list as the reconstructed video. However, this simple selection is arbitrary and not optimal: the candidate video sequence at the top of the list is not necessarily the one with the best visual quality. It is therefore necessary to develop a new method to automatically select the video with the highest quality from the list of candidates. We propose to select the best candidate based on the visual quality determined by a deep learning (DL) system. Considering that distortions will be assessed on each frame, we consider image quality assessment rather than video quality assessment. More specifically, each candidate is processed by a reference-free image quality assessment (IQA) method based on deep learning to obtain a score. Subsequently, the system selects the candidate with the highest IQA score. To do this, our system evaluates the quality of videos subject to transmission errors without eliminating lost packets or concealing lost regions. Distortions caused by transmission errors differ from those accounted for by traditional visual quality measures, which typically deal with global, uniform image distortions. Thus, these metrics fail to distinguish the repaired version from different corrupted video versions when local, non-uniform errors occur. Our approach revisits and optimizes the classic list decoding technique by associating it with a CNN architecture first, then with a Transformer to evaluate the visual quality and identify the best candidate. It is unprecedented and offers excellent performance. In particular, we show that when transmission errors occur within an intra frame, our CNN and Transformer-based architectures achieve 100% decision accuracy. For errors in an inter frame, the accuracy is 93% and 95%, respectively.
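
The selection rule described in this abstract, picking the candidate with the highest no-reference IQA score rather than the first valid one, reduces to an argmax over the candidate list. The sketch below assumes the scorer is supplied by the caller; `select_best_candidate` and the `iqa_score` parameter are hypothetical names, and the lookup table in the demo stands in for a deep learning model.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def select_best_candidate(candidates: Sequence[T],
                          iqa_score: Callable[[T], float]) -> T:
    """Return the candidate with the highest quality score.

    `iqa_score` is a placeholder for a no-reference image quality
    model; any callable that maps a candidate to a float works here.
    """
    if not candidates:
        raise ValueError("candidate list is empty")
    # max() keeps the earliest candidate when scores tie, so the
    # list-decoding rank still acts as a tie-breaker.
    return max(candidates, key=iqa_score)


if __name__ == "__main__":
    ranked = ["cand_a", "cand_b", "cand_c"]  # order from list decoding
    fake_scores = {"cand_a": 0.41, "cand_b": 0.87, "cand_c": 0.63}
    print(select_best_candidate(ranked, fake_scores.__getitem__))
```

In the demo the top-ranked candidate is not the one with the best score, which is the situation the thesis argues a first-valid-candidate rule handles poorly.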

Filali Razzouki, Anas. "Deep learning-based video face-based digital markers for early detection and analysis of Parkinson disease." Electronic Thesis or Diss., Institut polytechnique de Paris, 2025. http://www.theses.fr/2025IPPAS002.

Abstract:

This thesis aims to develop robust digital biomarkers for early detection of Parkinson's disease (PD) by analyzing facial videos to identify changes associated with hypomimia. In this context, we introduce new contributions to the state of the art: one based on shallow machine learning and the other on deep learning. The first method employs machine learning models that use manually extracted facial features, particularly derivatives of facial action units (AUs). These models incorporate interpretability mechanisms that explain their decision-making process for stakeholders, highlighting the most distinctive facial features for PD. We examine the influence of biological sex on these digital biomarkers, compare them against neuroimaging data and clinical scores, and use them to predict PD severity. The second method leverages deep learning to automatically extract features from raw facial videos and optical flow using foundational models based on Video Vision Transformers. To address the limited training data, we propose advanced adaptive transfer learning techniques, utilizing foundational models trained on large-scale video classification datasets. Additionally, we integrate interpretability mechanisms to clarify the relationship between automatically extracted features and manually extracted facial AUs, enhancing the comprehensibility of the model's decisions. Finally, our generated facial features are derived from both cross-sectional and longitudinal data, which provides a significant advantage over existing work. We use these recordings to analyze the progression of hypomimia over time with these digital markers, and its correlation with the progression of clinical scores. Combining these two approaches allows for a classification AUC (Area Under the Curve) of over 90%, demonstrating the efficacy of machine learning and deep learning models in detecting hypomimia in early-stage PD patients through facial videos. This research could enable continuous monitoring of hypomimia outside hospital settings via telemedicine.

Cedernaes, Erasmus. "Runway detection in LWIR video : Real time image processing and presentation of sensor data." Thesis, Uppsala universitet, Avdelningen för visuell information och interaktion, 2016. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-300690.

Abstract:

Runway detection in long wavelength infrared (LWIR) video could potentially increase the number of successful landings by increasing the situational awareness of pilots and verifying a correct approach. A method for detecting runways in LWIR video was therefore proposed and evaluated for robustness, speed and FPGA acceleration. The proposed algorithm improves the detection probability by making assumptions of the runway appearance during approach, as well as by using a modified Hough line transform and a symmetric search of peaks in the accumulator that is returned by the Hough line transform. A video chain was implemented on a Xilinx ZC702 Development card with input and output via HDMI through an expansion card. The video frames were buffered to RAM, and the detection algorithm ran on the CPU, which however did not meet the real-time requirement. Strategies were proposed that would improve the processing speed by either acceleration in hardware or algorithmic changes.

Saravi, Sara. "Use of Coherent Point Drift in computer vision applications." Thesis, Loughborough University, 2013. https://dspace.lboro.ac.uk/2134/12548.

Abstract:

This thesis presents the novel use of Coherent Point Drift in improving the robustness of a number of computer vision applications. CPD approach includes two methods for registering two images - rigid and non-rigid point set approaches which are based on the transformation model used. The key characteristic of a rigid transformation is that the distance between points is preserved, which means it can be used in the presence of translation, rotation, and scaling. Non-rigid transformations - or affine transforms - provide the opportunity of registering under non-uniform scaling and skew. The idea is to move one point set coherently to align with the second point set. The CPD method finds both the non-rigid transformation and the correspondence distance between two point sets at the same time without having to use a-priori declaration of the transformation model used. The first part of this thesis is focused on speaker identification in video conferencing. A real-time, audio-coupled video based approach is presented, which focuses more on the video analysis side, rather than the audio analysis that is known to be prone to errors. CPD is effectively utilised for lip movement detection and a temporal face detection approach is used to minimise false positives if face detection algorithm fails to perform. The second part of the thesis is focused on multi-exposure and multi-focus image fusion with compensation for camera shake. Scale Invariant Feature Transforms (SIFT) are first used to detect keypoints in images being fused. Subsequently this point set is reduced to remove outliers, using RANSAC (RANdom Sample Consensus) and finally the point sets are registered using CPD with non-rigid transformations. The registered images are then fused with a Contourlet based image fusion algorithm that makes use of a novel alpha blending and filtering technique to minimise artefacts. 
The thesis evaluates the performance of the algorithm in comparison to a number of state-of-the-art approaches, including the key commercial products available in the market at present, showing significantly improved subjective quality in the fused images. The final part of the thesis presents a novel approach to Vehicle Make & Model Recognition in CCTV video footage. CPD is used to effectively remove skew of vehicles detected as CCTV cameras are not specifically configured for the VMMR task and may capture vehicles at different approaching angles. A LESH (Local Energy Shape Histogram) feature based approach is used for vehicle make and model recognition with the novelty that temporal processing is used to improve reliability. A number of further algorithms are used to maximise the reliability of the final outcome. Experimental results are provided to prove that the proposed system demonstrates an accuracy in excess of 95% when tested on real CCTV footage with no prior camera calibration.

Leoputra, Wilson Suryajaya. "Video foreground extraction for mobile camera platforms." Thesis, Curtin University, 2009. http://hdl.handle.net/20.500.11937/1384.

Abstract:

Foreground object detection is a fundamental task in computer vision with many applications in areas such as object tracking, event identification, and behavior analysis. Most conventional foreground object detection methods work only in stable illumination environments using fixed cameras. In real-world applications, however, it is often the case that the algorithm needs to operate under the following challenging conditions: drastic lighting changes, object shape complexity, moving cameras, low frame capture rates, and low resolution images. This thesis presents four novel approaches for foreground object detection on real-world datasets using cameras deployed on moving vehicles. The first problem addresses passenger detection and tracking tasks for public transport buses investigating the problem of changing illumination conditions and low frame capture rates. Our approach integrates a stable SIFT (Scale Invariant Feature Transform) background seat modelling method with a human shape model into a weighted Bayesian framework to detect passengers. To deal with the problem of tracking multiple targets, we employ the Reversible Jump Monte Carlo Markov Chain tracking algorithm. Using the SVM classifier, the appearance transformation models capture changes in the appearance of the foreground objects across two consecutive frames under low frame rate conditions. In the second problem, we present a system for pedestrian detection involving scenes captured by a mobile bus surveillance system. It integrates scene localization, foreground-background separation, and pedestrian detection modules into a unified detection framework. The scene localization module performs a two stage clustering of the video data. In the first stage, SIFT Homography is applied to cluster frames in terms of their structural similarity, and the second stage further clusters these aligned frames according to consistency in illumination. This produces clusters of images that are differential in viewpoint and lighting. A kernel density estimation (KDE) technique for colour and gradient is then used to construct background models for each image cluster, which is further used to detect candidate foreground pixels. Finally, using a hierarchical template matching approach, pedestrians can be detected. In addition to the second problem, we present three direct pedestrian detection methods that extend the HOG (Histogram of Oriented Gradient) techniques (Dalal and Triggs, 2005) and provide a comparative evaluation of these approaches. The three approaches include: a) a new histogram feature, that is formed by the weighted sum of both the gradient magnitude and the filter responses from a set of elongated Gaussian filters (Leung and Malik, 2001) corresponding to the quantised orientation, which we refer to as the Histogram of Oriented Gradient Banks (HOGB) approach; b) the codebook based HOG feature with branch-and-bound (efficient subwindow search) algorithm (Lampert et al., 2008) and; c) the codebook based HOGB approach. In the third problem, a unified framework that combines 3D and 2D background modelling is proposed to detect scene changes using a camera mounted on a moving vehicle. The 3D scene is first reconstructed from a set of videos taken at different times. The 3D background modelling identifies inconsistent scene structures as foreground objects. For the 2D approach, foreground objects are detected using the spatio-temporal MRF algorithm. Finally, the 3D and 2D results are combined using morphological operations. The significance of this research is that it provides basic frameworks for automatic large-scale mobile surveillance applications and facilitates many higher-level applications such as object tracking and behaviour analysis.

Ali, Abid. "Analyse vidéo à l'aide de réseaux de neurones profonds : une application pour l'autisme." Electronic Thesis or Diss., Université Côte d'Azur, 2024. http://www.theses.fr/2024COAZ4066.

Abstract:

Understanding actions in videos is a crucial element of computer vision with significant implications across various fields. As our dependence on visual data grows, comprehending and interpreting human actions in videos becomes essential for advancing technologies in surveillance, healthcare, autonomous systems, and human-computer interaction. The accurate interpretation of actions in videos is fundamental for creating intelligent systems that can effectively navigate and respond to the complexities of the real world. In this context, advances in action understanding push the boundaries of computer vision and play a crucial role in shaping cutting-edge applications that impact our daily lives. Computer vision has made significant progress with the rise of deep learning methods such as convolutional neural networks (CNNs), which have enabled the community to advance in many domains, including image segmentation, object detection, scene understanding, and more. However, video processing remains limited compared to static images. In this thesis, we focus on action understanding, dividing it into two main parts: action recognition and action detection, and their application in the medical domain for autism analysis. We explore the various aspects and challenges of video understanding from both a general and an application-specific perspective. We then present our contributions and solutions to address these challenges. In addition, we introduce the ACTIVIS dataset, designed to diagnose autism in young children. Our work is divided into two main parts: generic modeling and applied models. Initially, we focus on adapting image models for action recognition tasks by incorporating temporal modeling using parameter-efficient fine-tuning (PEFT) techniques.
We also address real-time action detection and anticipation by proposing a new joint model for action anticipation and online action detection in real-life scenarios. Furthermore, we introduce a new task called 'loose-interaction' in dyadic situations and its applications in autism analysis. Finally, we concentrate on the applied aspect of video understanding by proposing an action recognition model for repetitive behaviors in videos of autistic individuals. We conclude by proposing a weakly-supervised method to estimate the severity score of autistic children in long videos.

Burger, Thomas. "Reconnaissance automatique des gestes de la langue française parlée complétée." PhD thesis, Grenoble INPG, 2007. http://tel.archives-ouvertes.fr/tel-00203360.

Abstract:

LPC (langue française parlée complétée, French Cued Speech) is a complement to lip-reading that facilitates communication for hearing-impaired people. In principle, it consists of making gestures with one hand placed beside the face to disambiguate lip movements, which taken alone are insufficient for a perfect understanding of the message. The RNTS TELMA project aims to build a telephone terminal that enables hearing-impaired people to communicate using LPC. Among the many functionalities this implies, it is necessary to recognize the LPC hand gesture and associate a meaning with it. The subject of this work is the video segmentation, analysis, and recognition of the gestures of an LPC coder in a communication situation. This draws on techniques from image segmentation, classification, gesture interpretation, and data fusion. To solve this gesture recognition problem, we proposed several original algorithms, including (1) an algorithm based on retinal persistence that separates target-gesture images from transition-gesture images, (2) an improvement of multi-class classification methods based on SVMs or unary classifiers via evidence theory, together with a method for converting subjective probabilities into belief functions, and (3) a partial decision method based on a generalization of the Pignistic Transform, in order to allow for uncertainty in the interpretation of ambiguous gestures.
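
For readers unfamiliar with the Pignistic Transform named in this abstract, the following is a minimal sketch of the standard transform from evidence theory, not the generalized partial-decision variant the thesis proposes. It assumes the belief masses are given as a dictionary keyed by sets of hypotheses; the function name and the gesture labels are hypothetical.

```python
def pignistic(mass):
    """Standard pignistic transform: spread each focal set's mass equally
    over its elements, renormalising by the mass not on the empty set.

    `mass` maps frozensets of hypotheses to belief masses summing to 1.
    """
    empty = mass.get(frozenset(), 0.0)
    if empty >= 1.0:
        raise ValueError("all mass is on the empty set")
    bet = {}
    for focal, m in mass.items():
        if not focal:
            continue  # the empty set carries conflict, not a hypothesis
        share = m / (len(focal) * (1.0 - empty))
        for hypothesis in focal:
            bet[hypothesis] = bet.get(hypothesis, 0.0) + share
    return bet


# Hypothetical masses over three gesture classes "a", "b", "c".
masses = {
    frozenset({"a"}): 0.5,
    frozenset({"a", "b"}): 0.3,
    frozenset({"a", "b", "c"}): 0.2,
}
for label, p in sorted(pignistic(masses).items()):
    print(label, round(p, 4))
```

The result is an ordinary probability distribution, which forces a single decision. A partial decision method such as the one described in the abstract would instead allow the output to remain on a set of hypotheses when the evidence is ambiguous.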

Books on the topic "Video Vision Transformer"

Korsgaard, Mathias Bonde. Music Video Transformed. Edited by John Richardson, Claudia Gorbman, and Carol Vernallis. Oxford University Press, 2013. http://dx.doi.org/10.1093/oxfordhb/9780199733866.013.015.

Abstract:

This article appears in the Oxford Handbook of New Audiovisual Aesthetics, edited by John Richardson, Claudia Gorbman, and Carol Vernallis. This chapter asks what music video has become today and how its audiovisual aesthetics have changed online. It suggests that music videos generally remediate content more actively than any other media form, performing the dual function of “visualizing music” (by recasting a song visually) and “musicalizing vision” (by structuring images according to musical logic). The discussion identifies and provides an overview of several new music video types that have come into existence online, placing them in five categories. In particular, the chapter focuses on interactive music videos and music video apps through close analyses of both Arcade Fire’s interactive video “We Used to Wait” and Björk’s interactive “app album” Biophilia. Both of these actively challenge what we have come to expect of music videos while still performing some familiar functions, prompting us to consider whether they are even music videos.

Book chapters on the topic "Video Vision Transformer"

Gabeur, Valentin, Chen Sun, Karteek Alahari, and Cordelia Schmid. "Multi-modal Transformer for Video Retrieval." In Computer Vision – ECCV 2020, 214–29. Cham: Springer International Publishing, 2020. http://dx.doi.org/10.1007/978-3-030-58548-8_13.

Kim, Hannah Halin, Shuzhi Yu, Shuai Yuan, and Carlo Tomasi. "Cross-Attention Transformer for Video Interpolation." In Computer Vision – ACCV 2022 Workshops, 325–42. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-27066-6_23.

Kim, Tae Hyun, Mehdi S. M. Sajjadi, Michael Hirsch, and Bernhard Schölkopf. "Spatio-Temporal Transformer Network for Video Restoration." In Computer Vision – ECCV 2018, 111–27. Cham: Springer International Publishing, 2018. http://dx.doi.org/10.1007/978-3-030-01219-9_7.

Xue, Tong, Qianrui Wang, Xinyi Huang, and Dengshi Li. "Self-guided Transformer for Video Super-Resolution." In Pattern Recognition and Computer Vision, 186–98. Singapore: Springer Nature Singapore, 2023. http://dx.doi.org/10.1007/978-981-99-8549-4_16.

Li, Zutong, and Lei Yang. "DCVQE: A Hierarchical Transformer for Video Quality Assessment." In Computer Vision – ACCV 2022, 398–416. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-26316-3_24.

Courant, Robin, Maika Edberg, Nicolas Dufour, and Vicky Kalogeiton. "Transformers and Visual Transformers." In Machine Learning for Brain Disorders, 193–229. New York, NY: Springer US, 2023. http://dx.doi.org/10.1007/978-1-0716-3195-9_6.

Abstract:

Transformers were initially introduced for natural language processing (NLP) tasks, but they were quickly adopted by most deep learning fields, including computer vision. They measure the relationships between pairs of input tokens (words in the case of text strings, parts of images for visual transformers), termed attention. The cost is quadratic in the number of tokens. For image classification, the most common transformer architecture uses only the transformer encoder in order to transform the various input tokens. However, there are also numerous other applications in which the decoder part of the traditional transformer architecture is also used. Here, we first introduce the attention mechanism (Subheading 1) and then the basic transformer block including the vision transformer (Subheading 2). Next, we discuss some improvements of visual transformers to account for small datasets or less computation (Subheading 3). Finally, we introduce visual transformers applied to tasks other than image classification, such as detection, segmentation, generation, and training without labels (Subheading 4) and other domains, such as video or multimodality using text or audio data (Subheading 5).
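
The pairwise attention this abstract describes can be sketched in a few lines. This is a simplified, single-head illustration in plain Python that assumes identity query, key, and value projections (a real transformer learns these); the function names and toy tokens are hypothetical and not taken from the chapter.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections.

    Every token is scored against every other token, so `weights`
    has len(tokens) x len(tokens) entries.
    """
    n, d = len(tokens), len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    scores = [
        [scale * sum(q * k for q, k in zip(tokens[i], tokens[j])) for j in range(n)]
        for i in range(n)
    ]
    weights = [softmax(row) for row in scores]
    outputs = [
        [sum(weights[i][j] * tokens[j][c] for j in range(n)) for c in range(d)]
        for i in range(n)
    ]
    return weights, outputs


# Three toy 2-dimensional tokens (e.g. embedded image patches).
weights, outputs = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(len(weights), "x", len(weights[0]), "attention weights")
```

Because the weight matrix holds one entry per token pair, doubling the number of tokens (for instance by adding more video frames) quadruples the number of scores to compute, which is why the chapter discusses variants that reduce computation.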

Huo, Shuwei, Yuan Zhou, and Haiyang Wang. "YFormer: A New Transformer Architecture for Video-Query Based Video Moment Retrieval." In Pattern Recognition and Computer Vision, 638–50. Cham: Springer Nature Switzerland, 2022. http://dx.doi.org/10.1007/978-3-031-18913-5_49.

Li, Li, Liansheng Zhuang, Shenghua Gao, and Shafei Wang. "HaViT: Hybrid-Attention Based Vision Transformer for Video Classification." In Computer Vision – ACCV 2022, 502–17. Cham: Springer Nature Switzerland, 2023. http://dx.doi.org/10.1007/978-3-031-26316-3_30.

Zhang, Hui, Jiewen Yang, Xingbo Dong, Xingguo Lv, Wei Jia, Zhe Jin, and Xuejun Li. "A Video Face Recognition Leveraging Temporal Information Based on Vision Transformer." In Pattern Recognition and Computer Vision, 29–43. Singapore: Springer Nature Singapore, 2023. http://dx.doi.org/10.1007/978-981-99-8469-5_3.

Wu, Jinlin, Lingxiao He, Wu Liu, Yang Yang, Zhen Lei, Tao Mei, and Stan Z. Li. "CAViT: Contextual Alignment Vision Transformer for Video Object Re-identification." In Lecture Notes in Computer Science, 549–66. Cham: Springer Nature Switzerland, 2022. http://dx.doi.org/10.1007/978-3-031-19781-9_32.

Conference papers on the topic "Video Vision Transformer"

Kobayashi, Takumi, and Masataka Seo. "Efficient Compression Method in Video Reconstruction Using Video Vision Transformer." In 2024 IEEE 13th Global Conference on Consumer Electronics (GCCE), 724–25. IEEE, 2024. https://doi.org/10.1109/gcce62371.2024.10760444.

Yokota, Haruto, Mert Bozkurtlar, Benjamin Yen, Katsutoshi Itoyama, Kenji Nishida, and Kazuhiro Nakadai. "A Video Vision Transformer for Sound Source Localization." In 2024 32nd European Signal Processing Conference (EUSIPCO), 106–10. IEEE, 2024. http://dx.doi.org/10.23919/eusipco63174.2024.10715427.

Ojaswee, R. Sreemathy, Mousami Turuk, Jayashree Jagdale, and Mohammad Anish. "Indian Sign Language Recognition Using Video Vision Transformer." In 2024 3rd International Conference for Advancement in Technology (ICONAT), 1–7. IEEE, 2024. https://doi.org/10.1109/iconat61936.2024.10774678.

Thuan, Pham Minh, Bui Thu Lam, and Pham Duy Trung. "Spatial Vision Transformer: A Novel Approach to Deepfake Video Detection." In 2024 1st International Conference On Cryptography And Information Security (VCRIS), 1–6. IEEE, 2024. https://doi.org/10.1109/vcris63677.2024.10813391.

Kumari, Supriya, Prince Kumar, Pooja Verma, Rajitha B, and Sarsij Tripathi. "Hybrid Vision Transformer and Convolutional Neural Network for Sports Video Classification." In 2024 International Conference on Intelligent Computing and Emerging Communication Technologies (ICEC), 1–5. IEEE, 2024. https://doi.org/10.1109/icec59683.2024.10837289.

Isogawa, Junya, Fumihiko Sakaue, and Jun Sato. "Simultaneous Estimation of Driving Intentions for Multiple Vehicles Using Video Transformer." In 20th International Conference on Computer Vision Theory and Applications, 471–77. SCITEPRESS - Science and Technology Publications, 2025. https://doi.org/10.5220/0013232100003912.

Gupta, Anisha, and Vidit Kumar. "A Hybrid U-Net and Vision Transformer approach for Video Anomaly detection." In 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), 1–6. IEEE, 2024. http://dx.doi.org/10.1109/icccnt61001.2024.10725860.

Ansari, Khustar, and Priyanka Srivastava. "Hybrid Attention Vision Transformer-based Deep Learning Model for Video Caption Generation." In 2025 International Conference on Electronics and Renewable Systems (ICEARS), 1238–45. IEEE, 2025. https://doi.org/10.1109/icears64219.2025.10940922.

Zhou, Xingyu, Leheng Zhang, Xiaorui Zhao, Keze Wang, Leida Li, and Shuhang Gu. "Video Super-Resolution Transformer with Masked Inter&Intra-Frame Attention." In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 25399–408. IEEE, 2024. http://dx.doi.org/10.1109/cvpr52733.2024.02400.

Choi, Joonmyung, Sanghyeok Lee, Jaewon Chu, Minhyuk Choi, and Hyunwoo J. Kim. "vid-TLDR: Training Free Token merging for Light-Weight Video Transformer." In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 18771–81. IEEE, 2024. http://dx.doi.org/10.1109/cvpr52733.2024.01776.
