
Journal articles on the topic 'Large video dataset'

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles


Consult the top 50 journal articles for your research on the topic 'Large video dataset.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles from a wide variety of disciplines and organise your bibliography correctly.

1

Yu, Zhou, Dejing Xu, Jun Yu, et al. "ActivityNet-QA: A Dataset for Understanding Complex Web Videos via Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 9127–34. http://dx.doi.org/10.1609/aaai.v33i01.33019127.

Abstract:
Recent developments in modeling language and vision have been successfully applied to image question answering. It is both crucial and natural to extend this research direction to the video domain for video question answering (VideoQA). Compared to the image domain, where large-scale and fully annotated benchmark datasets exist, VideoQA datasets are limited in scale or automatically generated, which restricts their applicability in practice. Here we introduce ActivityNet-QA, a fully annotated and large-scale VideoQA dataset. The dataset consists of 58,000 QA pairs on 5,800 complex web videos derived from the popular ActivityNet dataset. We present a statistical analysis of our ActivityNet-QA dataset and conduct extensive experiments on it by comparing existing VideoQA baselines. Moreover, we explore various video representation strategies to improve VideoQA performance, especially for long videos.
2

Chen, Hanqing, Chunyan Hu, Feifei Lee, et al. "A Supervised Video Hashing Method Based on a Deep 3D Convolutional Neural Network for Large-Scale Video Retrieval." Sensors 21, no. 9 (2021): 3094. http://dx.doi.org/10.3390/s21093094.

Abstract:
Recently, with the popularization of camera tools such as mobile phones and the rise of various short video platforms, a large number of videos are being uploaded to the Internet at all times, so a video retrieval system with fast retrieval speed and high precision is very necessary. Therefore, content-based video retrieval (CBVR) has aroused the interest of many researchers. A typical CBVR system mainly contains two essential parts: video feature extraction and similarity comparison. Feature extraction from video is very challenging; previous video retrieval methods are mostly based on extracting features from single video frames, resulting in the loss of temporal information in the videos. Hashing methods are extensively used in multimedia information retrieval due to their retrieval efficiency, but most of them are currently only applied to image retrieval. In order to solve these problems in video retrieval, we build an end-to-end framework called deep supervised video hashing (DSVH), which employs a 3D convolutional neural network (CNN) to obtain spatial-temporal features of videos, and then trains a set of hash functions by supervised hashing to transfer the video features into binary space and obtain compact binary codes of the videos. Finally, we use triplet loss for network training. We conduct extensive experiments on three public video datasets, UCF-101, JHMDB and HMDB-51, and the results show that the proposed method has advantages over many state-of-the-art video retrieval methods. Compared with the DVH method, the mAP on the UCF-101 dataset improves by 9.3%, and even the smallest improvement, on the JHMDB dataset, is 0.3%. At the same time, we also demonstrate the stability of the algorithm on the HMDB-51 dataset.
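To make the triplet objective mentioned in this abstract concrete, here is a minimal illustrative sketch (not the DSVH implementation from the paper) of a triplet margin loss computed on relaxed hash codes; the 48-bit code length, the margin value, and the NumPy-only setup are assumptions made purely for illustration.

```python
import numpy as np

def triplet_hash_loss(anchor, positive, negative, margin=2.0):
    """Triplet margin loss on relaxed hash codes: pull codes of same-class
    videos together and push different-class codes apart by at least `margin`."""
    d_pos = np.sum((anchor - positive) ** 2)   # distance to the positive code
    d_neg = np.sum((anchor - negative) ** 2)   # distance to the negative code
    return max(0.0, d_pos - d_neg + margin)

# toy 48-bit relaxed codes (values in [-1, 1]) for three videos
rng = np.random.default_rng(0)
anchor = np.tanh(rng.normal(size=48))
positive = np.tanh(anchor + 0.1 * rng.normal(size=48))
negative = np.tanh(rng.normal(size=48))
print(triplet_hash_loss(anchor, positive, negative))
```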
3

Ghorbani, Saeed, Kimia Mahdaviani, Anne Thaler, et al. "MoVi: A large multi-purpose human motion and video dataset." PLOS ONE 16, no. 6 (2021): e0253157. http://dx.doi.org/10.1371/journal.pone.0253157.

Abstract:
Large high-quality datasets of human body shape and kinematics lay the foundation for modelling and simulation approaches in computer vision, computer graphics, and biomechanics. Creating datasets that combine naturalistic recordings with high-accuracy data about ground truth body shape and pose is challenging because different motion recording systems are either optimized for one or the other. We address this issue in our dataset by using different hardware systems to record partially overlapping information and synchronized data that lend themselves to transfer learning. This multimodal dataset contains 9 hours of optical motion capture data, 17 hours of video data from 4 different points of view recorded by stationary and hand-held cameras, and 6.6 hours of inertial measurement units data recorded from 60 female and 30 male actors performing a collection of 21 everyday actions and sports movements. The processed motion capture data is also available as realistic 3D human meshes. We anticipate use of this dataset for research on human pose estimation, action recognition, motion modelling, gait analysis, and body shape reconstruction.
4

Monfort, Mathew, Bolei Zhou, Sarah Bargal, et al. "A Large Scale Video Dataset for Event Recognition." Journal of Vision 18, no. 10 (2018): 753. http://dx.doi.org/10.1167/18.10.753.

5

Pang, Bo, Kaiwen Zha, Yifan Zhang, and Cewu Lu. "Further Understanding Videos through Adverbs: A New Video Task." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (2020): 11823–30. http://dx.doi.org/10.1609/aaai.v34i07.6855.

Abstract:
Video understanding is a research hotspot of computer vision, and significant progress has been made on video action recognition recently. However, the semantic information contained in actions is not rich enough to build powerful video understanding models. This paper first introduces a new video semantics: the Behavior Adverb (BA), a more expressive and more difficult semantics covering subtle and inherent characteristics of human action behavior. To exhaustively decode this semantics, we construct the Videos with Action and Adverb Dataset (VAAD), which is a large-scale dataset with a semantically complete set of BAs. The dataset will be released to the public with this paper. We benchmark several representative video understanding methods (originally for action recognition) on BA and action recognition. The results show that the BA recognition task is more challenging than conventional action recognition. Accordingly, we propose the BA Understanding Network (BAUN) to solve this problem, and the experiments reveal that our BAUN is more suitable for BA recognition (11% better than I3D). Furthermore, we find that these two semantics (action and BA) can propel each other to better performance, promoting action recognition results by 3.4% on average on three standard action recognition datasets (UCF-101, HMDB-51, Kinetics).
6

Jia, Jinlu, Zhenyi Lai, Yurong Qian, and Ziqiang Yao. "Aerial Video Trackers Review." Entropy 22, no. 12 (2020): 1358. http://dx.doi.org/10.3390/e22121358.

Abstract:
Target tracking technology that is based on aerial videos is widely used in many fields; however, this technology has challenges, such as image jitter, target blur, high data dimensionality, and large changes in the target scale. In this paper, the research status of aerial video tracking and the characteristics, background complexity and tracking diversity of aerial video targets are summarized. Based on the findings, the key technologies that are related to tracking are elaborated according to the target type, number of targets and applicable scene system. The tracking algorithms are classified according to the type of target, and the target tracking algorithms that are based on deep learning are classified according to the network structure. Commonly used aerial photography datasets are described, and the accuracies of commonly used target tracking methods are evaluated in an aerial photography dataset, namely, UAV123, and a long-video dataset, namely, UAV20L. Potential problems are discussed, and possible future research directions and corresponding development trends in this field are analyzed and summarized.
7

Yang, Tao, Jing Li, Jingyi Yu, Sibing Wang, and Yanning Zhang. "Diverse Scene Stitching from a Large-Scale Aerial Video Dataset." Remote Sensing 7, no. 6 (2015): 6932–49. http://dx.doi.org/10.3390/rs70606932.

8

Tiotsop, Lohic Fotio, Antonio Servetti, and Enrico Masala. "Investigating Prediction Accuracy of Full Reference Objective Video Quality Measures through the ITS4S Dataset." Electronic Imaging 2020, no. 11 (2020): 93–1. http://dx.doi.org/10.2352/issn.2470-1173.2020.11.hvei-093.

Abstract:
Large subjectively annotated datasets are crucial to the development and testing of objective video quality measures (VQMs). In this work we focus on the recently released ITS4S dataset. Relying on statistical tools, we show that the content of the dataset is rather heterogeneous from the point of view of quality assessment. Such diversity naturally makes the dataset a worthy asset for validating the accuracy of VQMs. In particular we study the ability of VQMs to model the reduction or the increase of the visibility of distortion due to the spatial activity in the content. The study reveals that VQMs are likely to overestimate the perceived quality of processed video sequences whose source is characterized by few spatial details. We then propose an approach aiming at modeling the impact of spatial activity on distortion visibility when objectively assessing the visual quality of a content. The effectiveness of the proposal is validated on the ITS4S dataset as well as on the Netflix public dataset.
9

Hemalatha, C. Sweetlin, Vignesh Sankaran, Vaidehi V, et al. "Symmetric Uncertainty Based Search Space Reduction for Fast Face Recognition." International Journal of Intelligent Information Technologies 14, no. 4 (2018): 77–97. http://dx.doi.org/10.4018/ijiit.2018100105.

Abstract:
Face recognition from a large video database involves considerable search time. This article proposes a symmetric uncertainty based search space reduction (SUSSR) methodology that facilitates faster face recognition in video, making it viable for real-time surveillance and authentication applications. The proposed methodology employs symmetric uncertainty based feature subset selection to obtain significant features. Further, Fuzzy C-Means clustering is applied to restrict the search to the nearest possible cluster, thus speeding up the recognition process. A Kullback-Leibler divergence based similarity measure is employed to recognize the query face in video by matching the query frame with the stored features in the database. The proposed search space reduction methodology is tested on benchmark video face datasets, namely FJU and YouTube Celebrities, and on synthetic datasets, namely MIT-Dataset-I and MIT-Dataset-II. Experimental results demonstrate the effectiveness of the proposed methodology, with a 10% increase in recognition accuracy and a 35% reduction in recognition time.
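For readers unfamiliar with the selection criterion named above, here is a small illustrative sketch of symmetric uncertainty for discrete features, SU(X, Y) = 2·I(X; Y)/(H(X) + H(Y)); the threshold-based `select_features` helper is a hypothetical wrapper, not code from the article.

```python
import numpy as np
from collections import Counter

def entropy(values):
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), normalized to [0, 1]."""
    h_x, h_y = entropy(x), entropy(y)
    h_xy = entropy(list(zip(x, y)))      # joint entropy of (X, Y)
    mi = h_x + h_y - h_xy                # mutual information I(X; Y)
    return 0.0 if h_x + h_y == 0 else 2.0 * mi / (h_x + h_y)

def select_features(features, labels, threshold=0.1):
    """Keep discrete feature columns whose SU with the class label is high."""
    return [i for i, col in enumerate(features.T)
            if symmetric_uncertainty(col.tolist(), labels) > threshold]
```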
10

Wang, Haiqiang, Ioannis Katsavounidis, Jiantong Zhou, et al. "VideoSet: A large-scale compressed video quality dataset based on JND measurement." Journal of Visual Communication and Image Representation 46 (July 2017): 292–302. http://dx.doi.org/10.1016/j.jvcir.2017.04.009.

11

Chiu, Chih-Yi, Tsung-Han Tsai, Yu-Cyuan Liou, Guei-Wun Han, and Hung-Shuo Chang. "Near-Duplicate Subsequence Matching Between the Continuous Stream and Large Video Dataset." IEEE Transactions on Multimedia 16, no. 7 (2014): 1952–62. http://dx.doi.org/10.1109/tmm.2014.2342668.

12

Safinaz, S., and AV Ravi kumar. "Real-Time Video Scaling Based on Convolution Neural Network Architecture." Indonesian Journal of Electrical Engineering and Computer Science 7, no. 2 (2017): 381. http://dx.doi.org/10.11591/ijeecs.v7.i2.pp381-394.

Abstract:
In recent years, video super-resolution techniques have become a mandatory requirement for obtaining high-resolution videos. Many super-resolution techniques have been researched, but video super-resolution, or scaling, remains a vital challenge. In this paper, we present a real-time video scaling method based on a convolutional neural network architecture to eliminate blurriness in images and video frames and to provide better reconstruction quality when scaling large datasets from low-resolution frames to high-resolution frames. We compare our outcomes with multiple existing algorithms. Extensive results for the proposed technique, RemCNN (Reconstruction error minimization Convolutional Neural Network), show that our model outperforms existing techniques such as bicubic, bilinear and MCResNet and provides better reconstructed images and video frames. The experimental results show that our average PSNR is 47.80474 for upscale-2, 41.70209 for upscale-3 and 36.24503 for upscale-4 on the Myanmar dataset, which is very high in contrast to other existing techniques. These results prove the high efficiency and better performance of the proposed real-time video scaling architecture based on a convolutional neural network.
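As a reference for the figures quoted above, a minimal sketch of the PSNR metric follows; this is the standard definition of the metric, not code from the paper.

```python
import numpy as np

def psnr(reference, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between a reference frame and its
    reconstruction; higher means the upscaled frame is closer to the original."""
    diff = reference.astype(np.float64) - reconstructed.astype(np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_value ** 2 / mse)
```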
13

Pandey, Prashant, Prathosh AP, Manu Kohli, and Josh Pritchard. "Guided Weak Supervision for Action Recognition with Scarce Data to Assess Skills of Children with Autism." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 01 (2020): 463–70. http://dx.doi.org/10.1609/aaai.v34i01.5383.

Abstract:
Diagnostic and intervention methodologies for skill assessment of autism typically require a clinician to repetitively initiate several stimuli and record the child's response. In this paper, we propose to automate the response measurement through video recording of the scene, followed by the use of deep neural models for human action recognition from videos. However, supervised learning of neural networks demands large amounts of annotated data that are hard to come by. This issue is addressed by leveraging the ‘similarities’ between the action categories in publicly available large-scale video action (source) datasets and the dataset of interest. A technique called Guided Weak Supervision is proposed, where every class in the target data is matched to a class in the source data using the principle of posterior likelihood maximization. Subsequently, the classifier on the target data is re-trained by augmenting samples from the matched source classes, along with a new loss encouraging inter-class separability. The proposed method is evaluated on two skill-assessment autism datasets: SSBD (Sundar Rajagopalan, Dhall, and Goecke 2013) and a real-world autism dataset comprising 37 children of different ages and ethnicities who are diagnosed with autism. Our proposed method is found to improve the performance of state-of-the-art multi-class human action recognition models despite supervision with scarce data.
14

Ismail, Aya, Marwa Elpeltagy, Mervat Zaki, and Kamal A. ElDahshan. "Deepfake video detection: YOLO-Face convolution recurrent approach." PeerJ Computer Science 7 (September 21, 2021): e730. http://dx.doi.org/10.7717/peerj-cs.730.

Abstract:
Recently, deepfake techniques for swapping faces have been spreading, allowing easy creation of hyper-realistic fake videos. Detecting the authenticity of a video has become increasingly critical because of the potential negative impact on the world. Here, a new approach, You Only Look Once Convolution Recurrent Neural Networks (YOLO-CRNNs), is introduced to detect deepfake videos. The YOLO-Face detector detects face regions from each frame in the video, whereas a fine-tuned EfficientNet-B5 is used to extract the spatial features of these faces. These features are fed as a batch of input sequences into a Bidirectional Long Short-Term Memory (Bi-LSTM) to extract the temporal features. The new scheme is then evaluated on a new large-scale dataset, CelebDF-FaceForensics++ (c23), based on a combination of two popular datasets: FaceForensics++ (c23) and Celeb-DF. It achieves an Area Under the Receiver Operating Characteristic Curve (AUROC) score of 89.35%, 89.38% accuracy, 83.15% recall, 85.55% precision, and an 84.33% F1-measure for the pasting data approach. The experimental analysis confirms the superiority of the proposed method compared to the state-of-the-art methods.
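To illustrate the frame-sequence stage of the pipeline described above, here is a hedged PyTorch sketch of a bidirectional LSTM classifier over precomputed per-frame face embeddings; the 2048-dimensional feature size, hidden width, and single-logit head are assumptions, and the face detector and EfficientNet backbone are deliberately omitted.

```python
import torch
import torch.nn as nn

class FrameSequenceClassifier(nn.Module):
    """Classifies a clip as real or fake from a sequence of per-frame face
    embeddings (e.g., produced by a CNN backbone) using a bidirectional LSTM."""
    def __init__(self, feat_dim=2048, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, frame_feats):            # (batch, frames, feat_dim)
        seq, _ = self.lstm(frame_feats)
        return self.head(seq[:, -1])           # one real/fake logit per clip

logits = FrameSequenceClassifier()(torch.randn(2, 16, 2048))
print(logits.shape)                            # torch.Size([2, 1])
```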
15

Tran, Thi-Dung, Junghee Kim, Ngoc-Huynh Ho, et al. "Stress Analysis with Dimensions of Valence and Arousal in the Wild." Applied Sciences 11, no. 11 (2021): 5194. http://dx.doi.org/10.3390/app11115194.

Abstract:
In the field of stress recognition, the majority of research has conducted experiments on datasets collected from controlled environments with limited stressors. As these datasets cannot represent real-world scenarios, stress identification and analysis are difficult. There is a dire need for reliable, large datasets that are specifically acquired for stress emotion with varying degrees of expression for this task. In this paper, we introduced a dataset for Stress Analysis with Dimensions of Valence and Arousal of Korean Movie in Wild (SADVAW), which includes video clips with diversity in facial expressions from different Korean movies. The SADVAW dataset contains continuous dimensions of valence and arousal. We presented a detailed statistical analysis of the dataset. We also analyzed the correlation between stress and continuous dimensions. Moreover, using the SADVAW dataset, we trained a deep learning-based model for stress recognition.
16

Robinson, Niall H., Rachel Prudden, and Alberto Arribas. "A New Approach to Streaming Data from the Cloud." Bulletin of the American Meteorological Society 98, no. 11 (2017): 2280–83. http://dx.doi.org/10.1175/bams-d-16-0120.1.

Abstract:
Environmental datasets are becoming so large that they are increasingly being hosted in the compute cloud, where they can be efficiently analyzed and disseminated. However, this necessitates new ways of efficiently delivering environmental information across the Internet to users. We visualised a big atmospheric dataset in a web page by repurposing techniques normally used to stream HD video. You can try the prototype at http://demo.3dvis.informaticslab.co.uk/ng-3d-vis/apps/desktop/ or watch a video demonstration at www.youtube.com/watch?v=pzvk1ZNMvFY.
17

Fu, Zhikang, Jun Li, Guoqing Chen, Tianbao Yu, and Tiansheng Deng. "PornNet: A Unified Deep Architecture for Pornographic Video Recognition." Applied Sciences 11, no. 7 (2021): 3066. http://dx.doi.org/10.3390/app11073066.

Abstract:
In the era of big data, massive harmful multimedia resources publicly available on the Internet greatly threaten children and adolescents. In particular, recognizing pornographic videos is of great importance for protecting the mental and physical health of the underage. In contrast to conventional methods, which are built only on an image classifier without considering audio cues in the video, we propose a unified deep architecture termed PornNet that integrates dual sub-networks for pornographic video recognition. More specifically, image frames and audio cues extracted from the pornographic videos are respectively delivered to two deep networks for pattern discrimination. For discriminating pornographic frames, we propose a local-context aware network that takes the image context into account when capturing the key contents, whilst leveraging an attention network that can capture temporal information for recognizing pornographic audio. We then incorporate the recognition scores generated by the two sub-networks into a unified deep architecture, using a pre-defined aggregation function to produce the whole-video recognition result. The experiments on our newly collected large dataset demonstrate that the proposed method exhibits promising performance, achieving an accuracy of 93.4% on a dataset including 1k pornographic samples along with 1k normal videos and 1k sexy videos.
18

Schofield, Daniel, Arsha Nagrani, Andrew Zisserman, et al. "Chimpanzee face recognition from videos in the wild using deep learning." Science Advances 5, no. 9 (2019): eaaw0736. http://dx.doi.org/10.1126/sciadv.aaw0736.

Abstract:
Video recording is now ubiquitous in the study of animal behavior, but its analysis on a large scale is prohibited by the time and resources needed to manually process large volumes of data. We present a deep convolutional neural network (CNN) approach that provides a fully automated pipeline for face detection, tracking, and recognition of wild chimpanzees from long-term video records. In a 14-year dataset yielding 10 million face images from 23 individuals over 50 hours of footage, we obtained an overall accuracy of 92.5% for identity recognition and 96.2% for sex recognition. Using the identified faces, we generated co-occurrence matrices to trace changes in the social network structure of an aging population. The tools we developed enable easy processing and annotation of video datasets, including those from other species. Such automated analysis unveils the future potential of large-scale longitudinal video archives to address fundamental questions in behavior and conservation.
19

Wang, Jialu, Guowei Teng, and Ping An. "Video Super-Resolution Based on Generative Adversarial Network and Edge Enhancement." Electronics 10, no. 4 (2021): 459. http://dx.doi.org/10.3390/electronics10040459.

Abstract:
With the help of deep neural networks, video super-resolution (VSR) has made a huge breakthrough. However, these deep learning-based methods are rarely used in specific situations. In addition, training sets may not be suitable because many methods only assume that under ideal circumstances, low-resolution (LR) datasets are downgraded from high-resolution (HR) datasets in a fixed manner. In this paper, we proposed a model based on Generative Adversarial Network (GAN) and edge enhancement to perform super-resolution (SR) reconstruction for LR and blur videos, such as closed-circuit television (CCTV). The adversarial loss allows discriminators to be trained to distinguish between SR frames and ground truth (GT) frames, which is helpful to produce realistic and highly detailed results. The edge enhancement function uses the Laplacian edge module to perform edge enhancement on the intermediate result, which helps further improve the final results. In addition, we add the perceptual loss to the loss function to obtain a higher visual experience. At the same time, we also tried training network on different datasets. A large number of experiments show that our method has advantages in the Vid4 dataset and other LR videos.
20

Islam, Md Anwarul, Md Azher Uddin, and Young-Koo Lee. "A Distributed Automatic Video Annotation Platform." Applied Sciences 10, no. 15 (2020): 5319. http://dx.doi.org/10.3390/app10155319.

Abstract:
In the era of digital devices and the Internet, thousands of videos are taken and shared over the Internet. Similarly, CCTV cameras in the digital city produce a large amount of video data that carry essential information. To handle the increased video data and generate knowledge, there is an increasing demand for distributed video annotation. Therefore, in this paper, we propose a novel distributed video annotation platform that explores both spatial and temporal information and then provides higher-level semantic information. The proposed framework is divided into two parts: spatial annotation and spatiotemporal annotation. Accordingly, we propose a spatiotemporal descriptor, namely volume local directional ternary pattern–three orthogonal planes (VLDTP–TOP), implemented in a distributed manner using Spark. Moreover, we developed several state-of-the-art appearance-based and spatiotemporal-based feature descriptors on top of Spark. We also provide distributed video annotation services and development APIs so that end-users can easily use the platform and produce new video annotation algorithms. Due to the lack of a spatiotemporal video annotation dataset that provides ground truth for both spatial and temporal information, we introduce a video annotation dataset, namely STAD, which provides such ground truth. An extensive experimental analysis was performed to validate the performance and scalability of the proposed feature descriptors, confirming the effectiveness of our proposed approach.
21

Monfort, Mathew, SouYoung Jin, David Harwath, Rogerio Feris, James Glass, and Aude Oliva. "Spoken Moments: A Large Scale Dataset of Audio Descriptions of Dynamic Events in Video." Journal of Vision 20, no. 11 (2020): 1447. http://dx.doi.org/10.1167/jov.20.11.1447.

22

Ma, Shuming, Lei Cui, Damai Dai, Furu Wei, and Xu Sun. "LiveBot: Generating Live Video Comments Based on Visual and Textual Contexts." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 6810–17. http://dx.doi.org/10.1609/aaai.v33i01.33016810.

Abstract:
We introduce the task of automatic live commenting. Live commenting, which is also called “video barrage”, is an emerging feature on online video sites that allows real-time comments from viewers to fly across the screen like bullets or roll at the right side of the screen. The live comments are a mixture of opinions for the video and the chit chats with other comments. Automatic live commenting requires AI agents to comprehend the videos and interact with human viewers who also make the comments, so it is a good testbed of an AI agent’s ability to deal with both dynamic vision and language. In this work, we construct a large-scale live comment dataset with 2,361 videos and 895,929 live comments. Then, we introduce two neural models to generate live comments based on the visual and textual contexts, which achieve better performance than previous neural baselines such as the sequence-to-sequence model. Finally, we provide a retrieval-based evaluation protocol for automatic live commenting where the model is asked to sort a set of candidate comments based on the log-likelihood score, and evaluated on metrics such as mean-reciprocal-rank. Putting it all together, we demonstrate the first “LiveBot”. The datasets and the codes can be found at https://github.com/lancopku/livebot.
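The retrieval-based evaluation protocol mentioned above ranks candidate comments by model score and reports mean reciprocal rank; a minimal sketch of MRR follows (generic metric code, not the LiveBot implementation).

```python
def mean_reciprocal_rank(ranked_candidate_lists, ground_truth_sets):
    """ranked_candidate_lists[i]: candidate comments sorted best-first by the
    model's score; ground_truth_sets[i]: the human comments for that clip."""
    reciprocal_ranks = []
    for ranked, truth in zip(ranked_candidate_lists, ground_truth_sets):
        rank = next((k + 1 for k, c in enumerate(ranked) if c in truth), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# first hit at rank 2, second at rank 1 -> (1/2 + 1/1) / 2 = 0.75
print(mean_reciprocal_rank([["a", "b", "c"], ["x", "y"]], [{"b"}, {"x"}]))
```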
23

Salman, Ahmad, Shoaib Ahmad Siddiqui, Faisal Shafait, et al. "Automatic fish detection in underwater videos by a deep neural network-based hybrid motion learning system." ICES Journal of Marine Science 77, no. 4 (2019): 1295–307. http://dx.doi.org/10.1093/icesjms/fsz025.

Abstract:
It is interesting to develop effective fish sampling techniques using underwater videos and image processing to automatically estimate and consequently monitor the fish biomass and assemblage in water bodies. Such approaches should be robust against substantial variations in scenes due to poor luminosity, orientation of fish, seabed structures, movement of aquatic plants in the background and image diversity in the shape and texture among fish of different species. Keeping this challenge in mind, we propose a unified approach to detect freely moving fish in unconstrained underwater environments using a Region-Based Convolutional Neural Network, a state-of-the-art machine learning technique used to solve generic object detection and localization problems. To train the neural network, we employ a novel approach to utilize motion information of fish in videos via background subtraction and optical flow, and subsequently combine the outcomes with the raw image to generate fish-dependent candidate regions. We use two benchmark datasets extracted from a large Fish4Knowledge underwater video repository, Complex Scenes dataset and the LifeCLEF 2015 fish dataset to validate the effectiveness of our hybrid approach. We achieve a detection accuracy (F-Score) of 87.44% and 80.02% respectively on these datasets, which advocate the utilization of our approach for fish detection task.
24

Stojnić, Vladan, Vladimir Risojević, Mario Muštra, et al. "A Method for Detection of Small Moving Objects in UAV Videos." Remote Sensing 13, no. 4 (2021): 653. http://dx.doi.org/10.3390/rs13040653.

Abstract:
Detection of small moving objects is an important research area with applications including monitoring of flying insects, studying their foraging behavior, using insect pollinators to monitor flowering and pollination of crops, surveillance of honeybee colonies, and tracking movement of honeybees. However, due to the lack of distinctive shape and textural details on small objects, direct application of modern object detection methods based on convolutional neural networks (CNNs) shows considerably lower performance. In this paper we propose a method for the detection of small moving objects in videos recorded using unmanned aerial vehicles equipped with standard video cameras. The main steps of the proposed method are video stabilization, background estimation and subtraction, frame segmentation using a CNN, and thresholding the segmented frame. However, for training a CNN it is required that a large labeled dataset is available. Manual labelling of small moving objects in videos is very difficult and time consuming, and such labeled datasets do not exist at the moment. To circumvent this problem, we propose training a CNN using synthetic videos generated by adding small blob-like objects to video sequences with real-world backgrounds. The experimental results on detection of flying honeybees show that by using a combination of classical computer vision techniques and CNNs, as well as synthetic training sets, the proposed approach overcomes the problems associated with direct application of CNNs to the given problem and achieves an average F1-score of 0.86 in tests on real-world videos.
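A minimal sketch of the background-estimation-and-subtraction stage described above, assuming already-stabilized grayscale frames; for brevity the CNN segmentation step is replaced here by a plain threshold, so this is only an illustration of the classical part of the pipeline, not the authors' method.

```python
import numpy as np

def detect_small_movers(frames, threshold=25):
    """frames: (T, H, W) uint8 grayscale frames from a stabilized clip.
    Estimate the background as the per-pixel temporal median, subtract it,
    and threshold the residual to get binary masks of small moving blobs."""
    frames = frames.astype(np.float32)
    background = np.median(frames, axis=0)              # static-scene estimate
    foreground = np.abs(frames - background)            # per-frame residual
    return (foreground > threshold).astype(np.uint8)    # (T, H, W) masks

# toy clip: a bright 2x2 "insect" moving across an otherwise static scene
T, H, W = 8, 64, 64
frames = np.full((T, H, W), 100, dtype=np.uint8)
for t in range(T):
    frames[t, 30:32, 8 * t:8 * t + 2] = 200
print(detect_small_movers(frames).sum(axis=(1, 2)))     # ~4 pixels per frame
```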
25

Lulla, Martins, Aleksejs Rutkovskis, Andreta Slavinska, et al. "Hand-Washing Video Dataset Annotated According to the World Health Organization’s Hand-Washing Guidelines." Data 6, no. 4 (2021): 38. http://dx.doi.org/10.3390/data6040038.

Abstract:
Washing hands is one of the most important ways to prevent infectious diseases, including COVID-19. The World Health Organization (WHO) has published hand-washing guidelines. This paper presents a large real-world dataset with videos recording medical staff washing their hands as part of their normal job duties in the Pauls Stradins Clinical University Hospital. There are 3185 hand-washing episodes in total, each of which is annotated by up to seven different persons. The annotations classify the washing movements according to the WHO guidelines by marking each frame in each video with a certain movement code. The intention of this “in-the-wild” dataset is two-fold: to serve as a basis for training machine-learning classifiers for automated hand-washing movement recognition and quality control, and to allow investigation of the real-world quality of washing performed by working medical staff. We demonstrate how the data can be used to train a machine-learning classifier that achieves a classification accuracy of 0.7511 on a test dataset.
26

Zhang, Rong, Wei Li, Peng Wang, et al. "AutoRemover: Automatic Object Removal for Autonomous Driving Videos." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (2020): 12853–61. http://dx.doi.org/10.1609/aaai.v34i07.6982.

Abstract:
Motivated by the need for photo-realistic simulation in autonomous driving, in this paper we present a video inpainting algorithm, AutoRemover, designed specifically for generating street-view videos without any moving objects. In our setup we face two challenges: the first is shadows, which are usually unlabeled but tightly coupled with the moving objects; the second is the large ego-motion in the videos. To deal with shadows, we build an autonomous driving shadow dataset and design a deep neural network to detect shadows automatically. To deal with large ego-motion, we take advantage of the multi-source data available in autonomous driving, in particular the 3D data. More specifically, the geometric relationship between frames is incorporated into an inpainting deep neural network to produce high-quality, structurally consistent video output. Experiments show that our method outperforms other state-of-the-art (SOTA) object removal algorithms, reducing the RMSE by over 19%.
27

Martínez Carrillo, Fabio, Fabián Castillo, and Lola Bautista. "3D+T dense motion trajectories as kinematics primitives to recognize gestures on depth video sequences." Revista Politécnica 15, no. 29 (2019): 82–94. http://dx.doi.org/10.33571/rpolitec.v15n29a7.

Abstract:
RGB-D sensors have allowed attacking many classical problems in computer vision such as segmentation, scene representations and human interaction, among many others. Regarding motion characterization, typical RGB-D strategies are limited to analyzing global shape changes and capturing scene flow fields to describe local motions in depth sequences. Nevertheless, such strategies only recover motion information between a couple of frames, limiting the analysis of coherent large displacements over time. This work presents a novel strategy to compute 3D+t dense and long motion trajectories as fundamental kinematic primitives to represent video sequences. Each motion trajectory models a kinematic word primitive, and together these primitives can describe complex gestures developed along videos. Such kinematic words were processed within a bag-of-kinematic-words framework to obtain an occurrence video descriptor. The novel video descriptor based on 3D+t motion trajectories achieved an average accuracy of 80% on a dataset of 5 gestures and 100 videos.
28

Kim, Dahun, Donghyeon Cho, and In So Kweon. "Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 8545–52. http://dx.doi.org/10.1609/aaai.v33i01.33018545.

Abstract:
Self-supervised tasks such as colorization, inpainting and jigsaw puzzles have been utilized for visual representation learning for still images when the number of labeled images is limited or labels are absent altogether. Recently, this worthwhile stream of study has extended to the video domain, where the cost of human labeling is even more expensive. However, most existing methods are still based on 2D CNN architectures that cannot directly capture spatio-temporal information for video applications. In this paper, we introduce a new self-supervised task called Space-Time Cubic Puzzles to train 3D CNNs using large-scale video datasets. This task requires a network to arrange permuted 3D spatio-temporal crops. By completing Space-Time Cubic Puzzles, the network learns both the spatial appearance and the temporal relation of video frames, which is our final goal. In experiments, we demonstrate that our learned 3D representation transfers well to action recognition tasks and outperforms state-of-the-art 2D CNN-based competitors on the UCF101 and HMDB51 datasets.
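A rough sketch of how a space-time puzzle sample might be generated for self-supervision is shown below; this is one plausible variant (a 2x2 spatial grid sharing one time window), and the crop sizes, grid layout, and 24-way permutation labels are assumptions rather than the paper's exact scheme.

```python
import numpy as np
from itertools import permutations

PERMS = list(permutations(range(4)))   # 24 orderings = 24 classification labels

def make_cubic_puzzle(clip, crop_t=8, crop_hw=56, rng=np.random.default_rng(0)):
    """clip: (T, H, W, C) video with H, W >= 2 * crop_hw and T >= crop_t.
    Cut four spatio-temporal crops from a 2x2 spatial grid sharing one time
    window, shuffle them, and return (shuffled_crops, permutation_label)."""
    T, H, W, _ = clip.shape
    t0 = rng.integers(0, T - crop_t + 1)
    crops = []
    for gy, gx in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        y0 = gy * (H // 2) + rng.integers(0, H // 2 - crop_hw + 1)
        x0 = gx * (W // 2) + rng.integers(0, W // 2 - crop_hw + 1)
        crops.append(clip[t0:t0 + crop_t, y0:y0 + crop_hw, x0:x0 + crop_hw])
    label = int(rng.integers(0, len(PERMS)))
    shuffled = [crops[i] for i in PERMS[label]]          # network must undo this
    return np.stack(shuffled), label
```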
29

Yang, Xu, Dongjingdian Liu, Jing Liu, Faren Yan, Pengpeng Chen, and Qiang Niu. "Follower: A Novel Self-Deployable Action Recognition Framework." Sensors 21, no. 3 (2021): 950. http://dx.doi.org/10.3390/s21030950.

Abstract:
Deep learning technology has improved the performance of vision-based action recognition algorithms, but such methods require large labeled training datasets, resulting in weak universality. To address this issue, this paper proposes a novel self-deployable ubiquitous action recognition framework, called FOLLOWER, that enables a self-motivated user to bootstrap and deploy action recognition services. Our main idea is to build a “fingerprint” library of actions based on a small number of user-defined sample action data. Then, we use a matching method to perform action recognition. The key step is how to construct a suitable “fingerprint”. Thus, a pose action normalized feature extraction method based on a three-dimensional pose sequence is designed. FOLLOWER is mainly composed of a guide process and a follow process. The guide process extracts the pose action normalized feature and selects the inner-class central feature to build a “fingerprint” library of actions. The follow process extracts the pose action normalized feature from the target video and uses motion detection, action filtering, and an adaptive weight offset template to identify the action in the video sequence. Finally, we collect an action video dataset with human pose annotation to research self-deployable action recognition and action recognition based on pose estimation. Experiments on this dataset show that FOLLOWER can effectively recognize the actions in the video sequence, with recognition accuracy reaching 96.74%.
30

Joolee, Joolekha, Md Uddin, Jawad Khan, Taeyeon Kim, and Young-Koo Lee. "A Novel Lightweight Approach for Video Retrieval on Mobile Augmented Reality Environment." Applied Sciences 8, no. 10 (2018): 1860. http://dx.doi.org/10.3390/app8101860.

Abstract:
Mobile augmented reality merges virtual objects with the real world on mobile devices, while video retrieval brings out similar-looking videos from a large-scale video dataset. Since mobile augmented reality applications demand real-time interaction and operation, we need to process and interact in real time. Furthermore, augmented reality based virtual objects can be poorly textured. In order to resolve the above-mentioned issues, in this research, we propose a novel, fast and robust approach for retrieving videos in the mobile augmented reality environment using image and video queries. First, top-K key-frames are extracted from the videos, which significantly increases efficiency. Secondly, we introduce a novel frame-based feature extraction method, namely the Pyramid Ternary Histogram of Oriented Gradient (PTHOG), to extract shape features from the virtual objects in an effective and efficient manner. Thirdly, we utilize Double-Bit Quantization (DBQ) based hashing to accomplish the nearest neighbor search efficiently, which produces a candidate list of videos. Lastly, a similarity measure is applied to re-rank the videos obtained from the candidate list. An extensive experimental analysis is performed to verify our claims.
31

Yang, Tao, Dongdong Li, Yi Bai, et al. "Multiple-Object-Tracking Algorithm Based on Dense Trajectory Voting in Aerial Videos." Remote Sensing 11, no. 19 (2019): 2278. http://dx.doi.org/10.3390/rs11192278.

Abstract:
In recent years, UAV technology has developed rapidly. Due to the mobility, low cost, and variable monitoring altitude of UAVs, multiple-object detection and tracking in aerial videos has become a research hotspot in the field of computer vision. However, due to camera motion, small target size, target adhesion, and unpredictable target motion, it is still difficult to detect and track targets of interest in aerial videos, especially in the case of a low frame rate where the target position changes too much. In this paper, we propose a multiple-object-tracking algorithm based on dense-trajectory voting in aerial videos. The method models the multiple-target-tracking problem as a voting problem of the dense-optical-flow trajectory to the target ID, which can be applied to aerial-surveillance scenes and is robust to low-frame-rate videos. More specifically, we first built an aerial video dataset for vehicle targets, including a training dataset and a diverse test dataset. Based on this, we trained the neural network model by using a deep-learning method to detect vehicles in aerial videos. Thereafter, we calculated the dense optical flow in adjacent frames, and generated effective dense-optical-flow trajectories in each detection bounding box at the current time. When target IDs of optical-flow trajectories are known, the voting results of the optical-flow trajectories in each detection bounding box are counted. Finally, similarity between detection objects in adjacent frames was measured based on the voting results, and tracking results were obtained by data association. In order to evaluate the performance of this algorithm, we conducted experiments on self-built test datasets. A large number of experimental results showed that the proposed algorithm could obtain good target-tracking results in various complex scenarios, and performance was still robust at a low frame rate by changing the video frame rate. In addition, we carried out qualitative and quantitative comparison experiments between the algorithm and three state-of-the-art tracking algorithms, which further proved that this algorithm could not only obtain good tracking results in aerial videos with a normal frame rate, but also had excellent performance under low-frame-rate conditions.
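The trajectory-voting association step described above can be illustrated with a small, simplified sketch: each detection box takes the majority target ID among the optical-flow trajectories that end inside it. The function, its fallback for unmatched boxes, and its tie-breaking behaviour are illustrative assumptions, not the authors' code.

```python
from collections import Counter

def assign_ids_by_trajectory_votes(detections, trajectories, next_id=0):
    """detections: (x1, y1, x2, y2) boxes in the current frame.
    trajectories: ((x, y), track_id) endpoints of dense optical-flow
    trajectories whose target IDs are already known. Each box takes the
    majority ID among the trajectories ending inside it."""
    assigned = []
    for x1, y1, x2, y2 in detections:
        votes = Counter(tid for (px, py), tid in trajectories
                        if x1 <= px <= x2 and y1 <= py <= y2)
        if votes:
            assigned.append(votes.most_common(1)[0][0])  # majority vote
        else:
            assigned.append(next_id)                     # unmatched: new track
            next_id += 1
    return assigned

print(assign_ids_by_trajectory_votes(
    [(0, 0, 10, 10), (20, 20, 30, 30)],
    [((5, 5), 7), ((6, 4), 7), ((25, 25), 3)]))          # -> [7, 3]
```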
32

Ye, Qing, Haoxin Zhong, Chang Qu, and Yongmei Zhang. "Human Interaction Recognition Based on Whole-Individual Detection." Sensors 20, no. 8 (2020): 2346. http://dx.doi.org/10.3390/s20082346.

Abstract:
Human interaction recognition technology is a hot topic in the field of computer vision, and its application prospects are very extensive. At present, there are many difficulties in human interaction recognition such as the spatial complexity of human interaction, the differences in action characteristics at different time periods, and the complexity of interactive action features. The existence of these problems restricts the improvement of recognition accuracy. To investigate the differences in the action characteristics at different time periods, we propose an improved fusion time-phase feature of the Gaussian model to obtain video keyframes and remove the influence of a large amount of redundant information. Regarding the complexity of interactive action features, we propose a multi-feature fusion network algorithm based on parallel Inception and ResNet. This multi-feature fusion network not only reduces the network parameter quantity, but also improves the network performance; it alleviates the network degradation caused by the increase in network depth and obtains higher classification accuracy. For the spatial complexity of human interaction, we combined the whole video features with the individual video features, making full use of the feature information of the interactive video. A human interaction recognition algorithm based on whole–individual detection is proposed, where the whole video contains the global features of both sides of action, and the individual video contains the individual detail features of a single person. Making full use of the feature information of the whole video and individual videos is the main contribution of this paper to the field of human interaction recognition and the experimental results in the UT dataset (UT–interaction dataset) showed that the accuracy of this method was 91.7%.
33

Tran, Van-Nhan, Suk-Hwan Lee, Hoanh-Su Le, and Ki-Ryong Kwon. "High Performance DeepFake Video Detection on CNN-Based with Attention Target-Specific Regions and Manual Distillation Extraction." Applied Sciences 11, no. 16 (2021): 7678. http://dx.doi.org/10.3390/app11167678.

Abstract:
Rapidly developing deep learning models can produce and synthesize hyper-realistic videos, known as DeepFakes. Moreover, the growth of forgery data has prompted concerns about usage with malevolent intent. Detecting forged videos is a crucial subject in the field of digital media. Nowadays, most models are based on deep neural networks and vision transformers; the SOTA model uses an EfficientNet-B7 backbone. However, due to the usage of excessively large backbones, these models have the intrinsic drawback of being too heavy. In our research, a high-performance DeepFake detection model for manipulated video is proposed, ensuring the accuracy of the model while keeping an appropriate weight. We inherit ideas from previous research related to distillation methodology, but our proposal takes a different approach, with manual distillation extraction, target-specific region extraction, data augmentation, frame and multi-region ensembling, a CNN-based model, and flexible classification with a dynamic threshold. Our proposal can reduce the overfitting problem, a common and particularly important problem affecting the quality of many models. To analyze the quality of our model, we performed tests on two datasets. On the DeepFake Detection Challenge (DFDC) dataset, our model obtains an AUC of 0.958 and an F1-score of 0.9243, compared with the SOTA model, which obtains an AUC of 0.972 and an F1-score of 0.906; on the smaller Celeb-DF v2 dataset, our model obtains an AUC of 0.978 and an F1-score of 0.9628.
34

Wu, Ming-Che, and Mei-Chen Yeh. "Early Detection of Vacant Parking Spaces Using Dashcam Videos." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 9613–18. http://dx.doi.org/10.1609/aaai.v33i01.33019613.

Abstract:
A major problem in metropolitan areas is finding parking spaces. Existing parking guidance systems often adopt fixed sensors or cameras that cannot provide information from the driver’s point of view. Motivated by the advent of dashboard cameras (dashcams), we develop neural-network-based methods for detecting vacant parking spaces in videos recorded by a dashcam. Detecting vacant parking spaces in dashcam videos enables early detection of available spaces. Different from conventional object detection methods, we leverage the monotonicity of the detection confidence with respect to the distance to the approaching target parking space and propose a new loss function, which not only yields improved detection results but also enables early detection. To evaluate our detection method, we create a new large dataset containing 5,800 dashcam videos captured from 22 indoor and outdoor parking lots. To the best of our knowledge, this is the first and largest driver’s-view video dataset that supports parking space detection and provides parking space occupancy annotations.
35

Zhang, Ruohan, Calen Walshe, Zhuode Liu, et al. "Atari-HEAD: Atari Human Eye-Tracking and Demonstration Dataset." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 04 (2020): 6811–20. http://dx.doi.org/10.1609/aaai.v34i04.6161.

Abstract:
Large-scale public datasets have been shown to benefit research in multiple areas of modern artificial intelligence. For decision-making research that requires human data, high-quality datasets serve as important benchmarks to facilitate the development of new methods by providing a common reproducible standard. Many human decision-making tasks require visual attention to obtain high levels of performance. Therefore, measuring eye movements can provide a rich source of information about the strategies that humans use to solve decision-making tasks. Here, we provide a large-scale, high-quality dataset of human actions with simultaneously recorded eye movements while humans play Atari video games. The dataset consists of 117 hours of gameplay data from a diverse set of 20 games, with 8 million action demonstrations and 328 million gaze samples. We introduce a novel form of gameplay, in which the human plays in a semi-frame-by-frame manner. This leads to near-optimal game decisions and game scores that are comparable or better than known human records. We demonstrate the usefulness of the dataset through two simple applications: predicting human gaze and imitating human demonstrated actions. The quality of the data leads to promising results in both tasks. Moreover, using a learned human gaze model to inform imitation learning leads to a 115% increase in game performance. We interpret these results as highlighting the importance of incorporating human visual attention in models of decision making and demonstrating the value of the current dataset to the research community. We hope that the scale and quality of this dataset can provide more opportunities to researchers in the areas of visual attention, imitation learning, and reinforcement learning.
36

Zhang, Tianyi, Abdallah El Ali, Chen Wang, Alan Hanjalic, and Pablo Cesar. "CorrNet: Fine-Grained Emotion Recognition for Video Watching Using Wearable Physiological Sensors." Sensors 21, no. 1 (2020): 52. http://dx.doi.org/10.3390/s21010052.

Abstract:
Recognizing user emotions while they watch short-form videos anytime and anywhere is essential for facilitating video content customization and personalization. However, most works either classify a single emotion per video stimuli, or are restricted to static, desktop environments. To address this, we propose a correlation-based emotion recognition algorithm (CorrNet) to recognize the valence and arousal (V-A) of each instance (fine-grained segment of signals) using only wearable, physiological signals (e.g., electrodermal activity, heart rate). CorrNet takes advantage of features both inside each instance (intra-modality features) and between different instances for the same video stimuli (correlation-based features). We first test our approach on an indoor-desktop affect dataset (CASE), and thereafter on an outdoor-mobile affect dataset (MERCA) which we collected using a smart wristband and wearable eyetracker. Results show that for subject-independent binary classification (high-low), CorrNet yields promising recognition accuracies: 76.37% and 74.03% for V-A on CASE, and 70.29% and 68.15% for V-A on MERCA. Our findings show: (1) instance segment lengths between 1–4 s result in highest recognition accuracies (2) accuracies between laboratory-grade and wearable sensors are comparable, even under low sampling rates (≤64 Hz) (3) large amounts of neutral V-A labels, an artifact of continuous affect annotation, result in varied recognition performance.
37

Florea, George Albert, and Radu-Casian Mihailescu. "Multimodal Deep Learning for Group Activity Recognition in Smart Office Environments." Future Internet 12, no. 8 (2020): 133. http://dx.doi.org/10.3390/fi12080133.

Abstract:
Deep learning (DL) models have emerged in recent years as the state-of-the-art technique across numerous machine learning application domains. In particular, image processing-related tasks have seen a significant improvement in terms of performance due to increased availability of large datasets and extensive growth of computing power. In this paper we investigate the problem of group activity recognition in office environments using a multimodal deep learning approach, by fusing audio and visual data from video. Group activity recognition is a complex classification task, given that it extends beyond identifying the activities of individuals, by focusing on the combinations of activities and the interactions between them. The proposed fusion network was trained based on the audio–visual stream from the AMI Corpus dataset. The procedure consists of two steps. First, we extract a joint audio–visual feature representation for activity recognition, and second, we account for the temporal dependencies in the video in order to complete the classification task. We provide a comprehensive set of experimental results showing that our proposed multimodal deep network architecture outperforms previous approaches, which have been designed for unimodal analysis, on the aforementioned AMI dataset.
38

Fan, Xing, Wei Jiang, Hao Luo, Weijie Mao, and Hongyan Yu. "Instance Hard Triplet Loss for In-video Person Re-identification." Applied Sciences 10, no. 6 (2020): 2198. http://dx.doi.org/10.3390/app10062198.

Abstract:
Traditional Person Re-identification (ReID) methods mainly focus on cross-camera scenarios, while identifying a person in the same video/camera from adjacent subsequent frames is also an important question, for example, in human tracking and pose tracking. We try to address this unexplored in-video ReID problem with a new large-scale video-based ReID dataset called PoseTrack-ReID with full images available and a new network structure called ReID-Head, which can extract multi-person features efficiently in real time and can be integrated with both one-stage and two-stage human or pose detectors. A new loss function is also required to solve this new in-video problem. Hence, a triplet-based loss function with an online hard example mining designed to distinguish persons in the same video/group is proposed, called instance hard triplet loss, which can be applied in both cross-camera ReID and in-video ReID. Compared with the widely-used batch hard triplet loss, our proposed loss achieves competitive performance and saves more than 30% of the training time. We also propose an automatic reciprocal identity association method, so we can train our model in an unsupervised way, which further extends the potential applications of in-video ReID. The PoseTrack-ReID dataset and code will be publicly released.
39

Jiang, Jingchao, Cheng-Zhi Qin, Juan Yu, Changxiu Cheng, Junzhi Liu, and Jingzhou Huang. "Obtaining Urban Waterlogging Depths from Video Images Using Synthetic Image Data." Remote Sensing 12, no. 6 (2020): 1014. http://dx.doi.org/10.3390/rs12061014.

Abstract:
Reference objects in video images can be used to indicate urban waterlogging depths. The detection of reference objects is the key step to obtain waterlogging depths from video images. Object detection models with convolutional neural networks (CNNs) have been utilized to detect reference objects. These models require a large number of labeled images as the training data to ensure the applicability at a city scale. However, it is hard to collect a sufficient number of urban flooding images containing valuable reference objects, and manually labeling images is time-consuming and expensive. To solve the problem, we present a method to synthesize image data as the training data. Firstly, original images containing reference objects and original images with water surfaces are collected from open data sources, and reference objects and water surfaces are cropped from these original images. Secondly, the reference objects and water surfaces are further enriched via data augmentation techniques to ensure the diversity. Finally, the enriched reference objects and water surfaces are combined to generate a synthetic image dataset with annotations. The synthetic image dataset is further used for training an object detection model with CNN. The waterlogging depths are calculated based on the reference objects detected by the trained model. A real video dataset and an artificial image dataset are used to evaluate the effectiveness of the proposed method. The results show that the detection model trained using the synthetic image dataset can effectively detect reference objects from images, and it can achieve acceptable accuracies of waterlogging depths based on the detected reference objects. The proposed method has the potential to monitor waterlogging depths at a city scale.
APA, Harvard, Vancouver, ISO, and other styles
40

Lopez-Vazquez, Vanesa, Jose Manuel Lopez-Guede, Simone Marini, Emanuela Fanelli, Espen Johnsen, and Jacopo Aguzzi. "Video Image Enhancement and Machine Learning Pipeline for Underwater Animal Detection and Classification at Cabled Observatories." Sensors 20, no. 3 (2020): 726. http://dx.doi.org/10.3390/s20030726.

Full text
Abstract:
An understanding of marine ecosystems and their biodiversity is relevant to sustainable use of the goods and services they offer. Since marine areas host complex ecosystems, it is important to develop spatially widespread monitoring networks capable of providing large amounts of multiparametric information, encompassing both biotic and abiotic variables, and describing the ecological dynamics of the observed species. In this context, imaging devices are valuable tools that complement other biological and oceanographic monitoring devices. Nevertheless, large amounts of images or movies cannot all be manually processed, and autonomous routines for recognizing the relevant content, classification, and tagging are urgently needed. In this work, we propose a pipeline for the analysis of visual data that integrates video/image annotation tools for defining, training, and validation of datasets with video/image enhancement and machine and deep learning approaches. Such a pipeline is required to achieve good performance in the recognition and classification tasks of mobile and sessile megafauna, in order to obtain integrated information on spatial distribution and temporal dynamics. A prototype implementation of the analysis pipeline is provided in the context of deep-sea videos taken by one of the fixed cameras at the LoVe Ocean Observatory network of Lofoten Islands (Norway) at 260 m depth, in the Barents Sea, which has shown good classification results on an independent test dataset with an accuracy value of 76.18% and an area under the curve (AUC) value of 87.59%.
APA, Harvard, Vancouver, ISO, and other styles
41

Sarif, Bambang A. B., Mahsa Pourazad, Panos Nasiopoulos, and Victor C. M. Leung. "A Study on the Power Consumption of H.264/AVC-Based Video Sensor Network." International Journal of Distributed Sensor Networks 2015 (2015): 1–10. http://dx.doi.org/10.1155/2015/304787.

Full text
Abstract:
There is an increasing interest in using video sensor networks (VSNs) as an alternative to existing video monitoring/surveillance applications. Due to the limited amount of energy resources available in VSNs, power consumption efficiency is one of the most important design challenges in VSNs. Video encoding contributes to a significant portion of the overall power consumption at the VSN nodes. In this regard, the encoding parameter settings used at each node determine the coding complexity and bitrate of the video. This, in turn, determines the encoding and transmission power consumption of the node and the VSN overall. Therefore, in order to calculate the nodes’ power consumption, we need to be able to estimate the coding complexity and bitrate of the video. In this paper, we modeled the coding complexity and bitrate of the H.264/AVC encoder, based on the encoding parameter settings used. We also propose a method to reduce the model estimation error for videos whose content changes within a specified period of time. We have conducted our experiments using a large video dataset captured from real-life applications in the analysis. Using the proposed model, we show how to estimate the VSN power consumption for a given topology.
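The power estimation described in the abstract can be illustrated with a simple linear node-power model; the coefficients below are placeholders for illustration, not the fitted values from the paper.

```python
def node_power(complexity, bitrate, c_enc=1.0e-9, c_tx=2.0e-7, p_idle=0.0):
    """Hypothetical linear model of a VSN node's power draw: encoding power grows
    with coding complexity and transmission power grows with bitrate."""
    return p_idle + c_enc * complexity + c_tx * bitrate

# Example: total power of a small VSN given per-node complexity/bitrate estimates.
nodes = [(2.0e9, 5.0e5), (1.5e9, 3.0e5)]   # (complexity units, bits per second)
total = sum(node_power(c, b) for c, b in nodes)
```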
APA, Harvard, Vancouver, ISO, and other styles
42

Gao, Lianli, Pengpeng Zeng, Jingkuan Song, et al. "Structured Two-Stream Attention Network for Video Question Answering." Proceedings of the AAAI Conference on Artificial Intelligence 33 (July 17, 2019): 6391–98. http://dx.doi.org/10.1609/aaai.v33i01.33016391.

Full text
Abstract:
To date, visual question answering (VQA) (i.e., image QA and video QA) is still a holy grail in vision and language understanding, especially for video QA. Compared with image QA, which focuses primarily on understanding the associations between image region-level details and corresponding questions, video QA requires a model to jointly reason across both spatial and long-range temporal structures of a video as well as text to provide an accurate answer. In this paper, we specifically tackle the problem of video QA by proposing a Structured Two-stream Attention network, namely STA, to answer a free-form or open-ended natural language question about the content of a given video. First, we infer rich long-range temporal structures in videos using our structured segment component and encode text features. Then, our structured two-stream attention component simultaneously localizes important visual instances, reduces the influence of background video and focuses on the relevant text. Finally, the structured two-stream fusion component incorporates different segments of query- and video-aware context representation and infers the answers. Experiments on the large-scale video QA dataset TGIF-QA show that our proposed method significantly surpasses the best counterpart (i.e., with one representation for the video input) by 13.0%, 13.5%, 11.0% and 0.3 for the Action, Trans., FrameQA and Count tasks. It also outperforms the best competitor (i.e., with two representations) on the Action, Trans. and FrameQA tasks by 4.1%, 4.7%, and 5.1%.
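A generic question-guided attention over video segment features, sketched below in PyTorch, conveys the flavour of the attention step; it is a simplified single-stream sketch with arbitrary dimensions, not the paper's full structured two-stream design.

```python
import torch
import torch.nn as nn

class SegmentAttention(nn.Module):
    """Question-guided soft attention over video segment features."""
    def __init__(self, vid_dim, txt_dim, hid_dim=256):
        super().__init__()
        self.wv = nn.Linear(vid_dim, hid_dim)
        self.wq = nn.Linear(txt_dim, hid_dim)
        self.v = nn.Linear(hid_dim, 1)

    def forward(self, segments, question):
        # segments: (B, S, vid_dim) segment features, question: (B, txt_dim) encoding
        scores = self.v(torch.tanh(self.wv(segments) + self.wq(question).unsqueeze(1)))
        weights = torch.softmax(scores, dim=1)            # (B, S, 1) attention weights
        return (weights * segments).sum(dim=1), weights   # attended video context
```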
APA, Harvard, Vancouver, ISO, and other styles
43

Shahidha Banu, S., and N. Maheswari. "Background Modelling using a Q-Tree Based Foreground Segmentation." Scalable Computing: Practice and Experience 21, no. 1 (2020): 17–31. http://dx.doi.org/10.12694/scpe.v21i1.1603.

Full text
Abstract:
Background modelling is an essential part of the procedure for extracting idle and moving foreground objects. Foreground object detection has become challenging due to intermittent objects, intensity variation, image artefacts and dynamic backgrounds in video analysis and video surveillance applications. In video surveillance applications, a large amount of data is processed on a daily basis, so an efficient background modelling technique is needed that can handle these larger sets of data and thereby support effective foreground detection. In this paper, we present a renewed background modelling method for foreground segmentation. The main objective of the work is to perform foreground extraction only in the intended region of interest using the proposed Q-Tree algorithm. Almost all existing techniques apply their updates to the pixels of the entire frame, which may result in inefficient foreground detection with quick updates to slow-moving objects. The proposed method counters these defects by extracting the foreground object while restricting the region of interest (the region where background subtraction is performed), thereby reducing false positives and false negatives. Extensive experimental results and evaluation parameters of the proposed approach were compared against the most recent state-of-the-art background subtraction approaches. Moreover, we use the change detection challenge dataset, and the efficiency of our method is analysed under different environmental conditions (indoor, outdoor) from the CDnet2014 dataset and additional real-time videos. The experimental results satisfactorily verified the strengths and weaknesses of the proposed method against the existing state-of-the-art background modelling methods.
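The idea of restricting background subtraction to a region of interest can be sketched with OpenCV's stock MOG2 subtractor as a stand-in for the paper's Q-Tree region selection; roi_mask is assumed to be a binary uint8 mask of the same size as the frames.

```python
import cv2

def roi_foreground(video_path, roi_mask):
    """Yield foreground masks computed only inside a region of interest.
    Illustrative stand-in: MOG2 replaces the paper's Q-Tree-driven model."""
    cap = cv2.VideoCapture(video_path)
    subtractor = cv2.createBackgroundSubtractorMOG2(detectShadows=True)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        roi = cv2.bitwise_and(frame, frame, mask=roi_mask)   # update only inside the ROI
        fg = subtractor.apply(roi)
        yield cv2.bitwise_and(fg, fg, mask=roi_mask)          # suppress responses outside the ROI
    cap.release()
```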
APA, Harvard, Vancouver, ISO, and other styles
44

Salido, Jesus, Vanesa Lomas, Jesus Ruiz-Santaquiteria, and Oscar Deniz. "Automatic Handgun Detection with Deep Learning in Video Surveillance Images." Applied Sciences 11, no. 13 (2021): 6085. http://dx.doi.org/10.3390/app11136085.

Full text
Abstract:
There is a great need to implement preventive mechanisms against shootings and terrorist acts in public spaces with a large influx of people. While surveillance cameras have become common, the need for monitoring 24/7 and real-time response requires automatic detection methods. This paper presents a study based on three convolutional neural network (CNN) models applied to the automatic detection of handguns in video surveillance images. It aims to investigate the reduction of false positives by including pose information associated with the way the handguns are held in the images belonging to the training dataset. The results highlighted the best average precision (96.36%) and recall (97.23%) obtained by RetinaNet fine-tuned with the unfrozen ResNet-50 backbone and the best precision (96.23%) and F1 score values (93.36%) obtained by YOLOv3 when it was trained on the dataset including pose information. This last architecture was the only one that showed a consistent improvement—around 2%—when pose information was expressly considered during training.
APA, Harvard, Vancouver, ISO, and other styles
45

You, Renchun, Zhiyao Guo, Lei Cui, Xiang Long, Yingze Bao, and Shilei Wen. "Cross-Modality Attention with Semantic Graph Embedding for Multi-Label Classification." Proceedings of the AAAI Conference on Artificial Intelligence 34, no. 07 (2020): 12709–16. http://dx.doi.org/10.1609/aaai.v34i07.6964.

Full text
Abstract:
Multi-label image and video classification are fundamental yet challenging tasks in computer vision. The main challenges lie in capturing spatial or temporal dependencies between labels and discovering the locations of discriminative features for each class. In order to overcome these challenges, we propose to use cross-modality attention with semantic graph embedding for multi-label classification. Based on the constructed label graph, we propose an adjacency-based similarity graph embedding method to learn semantic label embeddings, which explicitly exploit label relationships. Then our novel cross-modality attention maps are generated with the guidance of learned label embeddings. Experiments on two multi-label image classification datasets (MS-COCO and NUS-WIDE) show our method outperforms other existing state-of-the-arts. In addition, we validate our method on a large multi-label video classification dataset (YouTube-8M Segments) and the evaluation results demonstrate the generalization capability of our method.
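The label graph on which such semantic embeddings are built is typically derived from label co-occurrence statistics; a minimal sketch of that preprocessing step (an assumption for illustration, not code from the paper) is shown below.

```python
import numpy as np

def label_adjacency(label_sets, num_classes):
    """Build a row-normalised co-occurrence adjacency matrix from per-sample
    label index lists, approximating P(label_j | label_i)."""
    A = np.zeros((num_classes, num_classes))
    for labels in label_sets:
        for i in labels:
            for j in labels:
                if i != j:
                    A[i, j] += 1
    return A / (A.sum(axis=1, keepdims=True) + 1e-8)
```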
APA, Harvard, Vancouver, ISO, and other styles
46

Jia, Xi Bin, and Lu Yi Li. "Face Detection Based on Statistical Color Model and Haar Classifier." Advanced Materials Research 532-533 (June 2012): 634–38. http://dx.doi.org/10.4028/www.scientific.net/amr.532-533.634.

Full text
Abstract:
This paper realizes a face detection algorithm based on the combination of a skin color model and the Haar algorithm. Firstly, a platform for sample labeling was constructed, which combines a contour extraction algorithm with manual labeling. By labeling more than 10,000 images obtained randomly from the Internet, a large training dataset was obtained. Then, a skin histogram, a non-skin histogram and a statistical skin model are constructed by analyzing the distribution of skin and non-skin color on the basis of this large training dataset. Based on this statistical color model, the skin area is detected and segmented from video files frame by frame. With the Haar object detection algorithm and morphological operations such as erosion and dilation, background noise and non-face areas are removed from the detected skin area and the facial area is detected, which provides the basis for face recognition and video-based visual speech synthesis. Compared with the Haar-based face detection method, our algorithm greatly improves the rate of correct detection and reduces the rate of false positives.
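A compact OpenCV sketch of this detection chain is given below; it substitutes a fixed YCrCb skin range for the paper's learned skin/non-skin histograms and assumes the stock frontal-face Haar cascade shipped with OpenCV.

```python
import cv2

def detect_faces(frame, cascade_name="haarcascade_frontalface_default.xml"):
    """Skin segmentation followed by a Haar cascade on the skin-masked image.
    The YCrCb range is a common empirical rule, not the paper's statistical model."""
    ycrcb = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)
    skin = cv2.inRange(ycrcb, (0, 133, 77), (255, 173, 127))
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    skin = cv2.morphologyEx(skin, cv2.MORPH_OPEN, kernel)      # erosion + dilation to drop noise
    masked = cv2.bitwise_and(frame, frame, mask=skin)
    gray = cv2.cvtColor(masked, cv2.COLOR_BGR2GRAY)
    cascade = cv2.CascadeClassifier(cv2.data.haarcascades + cascade_name)
    return cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
```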
APA, Harvard, Vancouver, ISO, and other styles
47

Xu, Ming, Xiaosheng Yu, Dongyue Chen, Chengdong Wu, and Yang Jiang. "An Efficient Anomaly Detection System for Crowded Scenes Using Variational Autoencoders." Applied Sciences 9, no. 16 (2019): 3337. http://dx.doi.org/10.3390/app9163337.

Full text
Abstract:
Anomaly detection in crowded scenes is an important and challenging part of intelligent video surveillance systems. As deep neural networks have succeeded in feature representation, the features extracted by a deep neural network represent the appearance and motion patterns of different scenes more specifically, compared with the hand-crafted features typically used in traditional anomaly detection approaches. In this paper, we propose a new baseline framework for anomaly detection in complex surveillance scenes based on a variational autoencoder with convolution kernels to learn feature representations. Firstly, the raw frame series is provided as input to our variational autoencoder without any preprocessing to learn the appearance and motion features of the receptive fields. Then, multiple Gaussian models are used to predict the anomaly scores of the corresponding receptive fields. Our proposed two-stage anomaly detection system is evaluated on a video surveillance dataset for a large scene, the UCSD pedestrian datasets, and yields competitive performance compared with state-of-the-art methods.
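The second stage described above, fitting Gaussian models to learned features and scoring new frames, can be sketched as follows; the per-receptive-field handling and the Mahalanobis-style score are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

class GaussianScorer:
    """Fit one Gaussian on features of normal frames (e.g. per receptive field)
    and score new features by squared Mahalanobis distance; higher = more anomalous."""
    def fit(self, feats):                       # feats: (N, D) features from normal data
        self.mu = feats.mean(axis=0)
        cov = np.cov(feats, rowvar=False) + 1e-6 * np.eye(feats.shape[1])
        self.prec = np.linalg.inv(cov)          # precision matrix
        return self

    def score(self, feats):
        d = feats - self.mu
        return np.einsum("nd,de,ne->n", d, self.prec, d)
```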
APA, Harvard, Vancouver, ISO, and other styles
48

Yan, Xuebo, and Yuemin Fan. "Foreground Extraction and Motion Recognition Technology for Intelligent Video Surveillance." International Journal of Pattern Recognition and Artificial Intelligence 34, no. 10 (2020): 2055021. http://dx.doi.org/10.1142/s0218001420550216.

Full text
Abstract:
With the rapid development of computer and network technology, it has become possible to build large-scale networked video surveillance systems, and video surveillance has become a new type of infrastructure necessary for modern cities. In this paper, the problems of foreground extraction and motion recognition in intelligent video surveillance are studied. Three key sub-problems, namely the extraction of the motion foreground in video, the deblurring of the motion foreground and the recognition of human motion, are studied and corresponding solutions are proposed. A background modelling technique based on video blocks is proposed: the background is modelled at the block level, which greatly reduces the spatial complexity of the algorithm. It addresses the problems of the traditional Gaussian mixture model (GMM), in which a moving target that becomes static is absorbed into the background and a target that starts moving after a long static period leaves ghosts, problems that reduce the processing efficiency of the algorithm. Test results on the Weizmann dataset show that the proposed algorithm can achieve high human motion recognition accuracy with low computational complexity, with a recognition rate of up to 100%. A locality-constrained group sparse representation classification (LGSRC) model is used for classification, and experimental results on the Weizmann, KTH, UCF Sports and other test datasets confirm the validity of the proposed algorithm compared with KNN and SRC voting classification in terms of accuracy.
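To convey the block-level idea, here is a much-simplified running-average background model maintained per block on grayscale frames; it is an illustrative stand-in for the paper's block-based GMM, and the block size, learning rate and threshold are arbitrary.

```python
import numpy as np

def update_block_background(bg, frame, block=16, alpha=0.02, thresh=30.0):
    """Block-level running-average background model. bg holds one mean intensity
    per block, initialised from the first frame's block means; returns the updated
    background and a boolean foreground mask per block."""
    h, w = frame.shape[:2]
    fg_mask = np.zeros((h // block, w // block), dtype=bool)
    for by in range(h // block):
        for bx in range(w // block):
            patch = frame[by*block:(by+1)*block, bx*block:(bx+1)*block].astype(float)
            diff = abs(patch.mean() - bg[by, bx])
            fg_mask[by, bx] = diff > thresh
            if not fg_mask[by, bx]:            # update only blocks classified as background
                bg[by, bx] = (1 - alpha) * bg[by, bx] + alpha * patch.mean()
    return bg, fg_mask
```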
APA, Harvard, Vancouver, ISO, and other styles
49

Gatlin, Patrick N., Merhala Thurai, V. N. Bringi, et al. "Searching for Large Raindrops: A Global Summary of Two-Dimensional Video Disdrometer Observations." Journal of Applied Meteorology and Climatology 54, no. 5 (2015): 1069–89. http://dx.doi.org/10.1175/jamc-d-14-0089.1.

Full text
Abstract:
A dataset containing 9637 h of two-dimensional video disdrometer observations consisting of more than 240 million raindrops measured at diverse climatological locations was compiled to help characterize underlying drop size distribution (DSD) assumptions that are essential to make precise retrievals of rainfall using remote sensing platforms. This study concentrates on the tail of the DSD, which largely impacts rainfall retrieval algorithms that utilize radar reflectivity. The maximum raindrop diameter was a median factor of 1.8 larger than the mass-weighted mean diameter and increased with rainfall rate. Only 0.4% of the 1-min DSD spectra were found to contain large raindrops exceeding 5 mm in diameter. Large raindrops were most abundant at the tropical locations, especially in Puerto Rico, and were largely concentrated during the spring, especially at subtropical locations. Giant raindrops exceeding 8 mm in diameter occurred at tropical, subtropical, and high-latitude continental locations. The greatest numbers of giant raindrops were found in the subtropical locations, with the largest being a 9.7-mm raindrop that occurred in northern Oklahoma during the passage of a hail-producing thunderstorm. These results suggest large raindrops are more likely to fall from clouds that contain hail, especially those raindrops exceeding 8 mm in diameter.
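The mass-weighted mean diameter referred to above follows the standard definition D_m = sum(N_i * D_i^4) / sum(N_i * D_i^3) over the DSD bins; a small helper for binned disdrometer spectra (uniform bin widths assumed) could be:

```python
import numpy as np

def mass_weighted_mean_diameter(diams_mm, counts):
    """Compute D_m from binned drop diameters (mm) and drop concentrations per bin."""
    d = np.asarray(diams_mm, dtype=float)
    n = np.asarray(counts, dtype=float)
    return (n * d**4).sum() / (n * d**3).sum()
```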
APA, Harvard, Vancouver, ISO, and other styles
50

Langenkämper, D., R. van Kevelaer, T. Möller, and T. W. Nattkemper. "GAN-BASED SYNTHESIS OF DEEP LEARNING TRAINING DATA FOR UAV MONITORING." ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLIII-B1-2020 (August 6, 2020): 465–69. http://dx.doi.org/10.5194/isprs-archives-xliii-b1-2020-465-2020.

Full text
Abstract:
Wind energy is a critical part of moving away from fossil and nuclear energy. Price pressure on the renewable energy sector demands cutting the cost of the regular inspections carried out by industrial climbers. Drone-based video inspection reduces costs and increases the safety of inspection personnel. To further increase throughput, automatic or semi-automatic solutions for analysing these videos are needed. However, modern machine learning architectures need a lot of data to work reliably. This is inherently a problem, as structural damage is rather rare in industrial infrastructure. Our proposed approach uses Generative Adversarial Networks to generate synthetic unmanned aerial vehicle imagery. This allows us to create a sufficiently large training dataset (>10^3) from a dataset that is at least an order of magnitude smaller (approx. 10^2). We show that we can increase the classification accuracy by up to 6 percentage points.
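For orientation, a minimal GAN update step in PyTorch is sketched below; the generator and discriminator architectures, latent dimension and optimisers are not specified in the abstract and are therefore left as assumptions passed in by the caller (D is assumed to return a (batch, 1) logit).

```python
import torch
import torch.nn as nn

def gan_step(G, D, real, z_dim, opt_g, opt_d, device="cpu"):
    """One alternating GAN update on a batch of real images."""
    bce = nn.BCEWithLogitsLoss()
    b = real.size(0)
    z = torch.randn(b, z_dim, device=device)
    fake = G(z)

    # Discriminator update: push real towards 1 and generated samples towards 0.
    opt_d.zero_grad()
    d_loss = bce(D(real), torch.ones(b, 1, device=device)) + \
             bce(D(fake.detach()), torch.zeros(b, 1, device=device))
    d_loss.backward()
    opt_d.step()

    # Generator update: try to make the discriminator label fakes as real.
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(b, 1, device=device))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```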
APA, Harvard, Vancouver, ISO, and other styles