Journal articles on the topic 'RGB-D video'

Consult the top 50 journal articles for your research on the topic 'RGB-D video.'

You can also download the full text of each publication as a PDF and read its abstract online whenever it is available in the metadata.


1

Uddin, Md Kamal, Amran Bhuiyan, and Mahmudul Hasan. "Fusion in Dissimilarity Space Between RGB-D and Skeleton for Person Re-Identification." International Journal of Innovative Technology and Exploring Engineering (IJITEE) 10, no. 12 (2021): 69–75. https://doi.org/10.35940/ijitee.L9566.10101221.

Abstract:
Person re-identification (Re-id) is one of the important tools of video surveillance systems; it aims to recognize an individual across the multiple disjoint sensors of a camera network. Despite recent advances in RGB camera-based person re-identification methods under normal lighting conditions, Re-id researchers have failed to take advantage of the additional information provided by modern RGB-D sensors (e.g., depth and skeleton information). When traditional RGB cameras fail to capture video under poor illumination, this RGB-D sensor-based additional information can help tackle these constraints. This work takes depth images and skeleton joint points as additional information alongside RGB appearance cues and proposes a person re-identification method. We combine 4-channel RGB-D image features with skeleton information using a score-level fusion strategy in dissimilarity space to increase re-identification accuracy. Moreover, our proposed method overcomes the illumination problem because we use illumination-invariant depth images and skeleton information. We carried out rigorous experiments on two publicly available RGBD-ID re-identification datasets and showed that the combined features of 4-channel RGB-D images and skeleton information boost the rank-1 recognition accuracy.
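The score-level fusion in dissimilarity space can be illustrated with a short sketch (not the authors' code): assume precomputed dissimilarity scores between a probe and each gallery identity from the 4-channel RGB-D features and from the skeleton features, min-max normalize each set, and rank by a weighted sum. The weights and toy numbers below are purely illustrative.

```python
import numpy as np

def minmax(scores):
    """Normalize dissimilarity scores to [0, 1] so the two modalities are comparable."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + 1e-12)

def fuse_and_rank(rgbd_dist, skel_dist, w_rgbd=0.6, w_skel=0.4):
    """Score-level fusion in dissimilarity space: lower fused score = better match."""
    fused = w_rgbd * minmax(rgbd_dist) + w_skel * minmax(skel_dist)
    ranking = np.argsort(fused)            # gallery indices, best match first
    return fused, ranking

# toy example: one probe against a gallery of 4 identities
rgbd_dist = [0.82, 0.35, 0.90, 0.41]       # distances from 4-channel RGB-D features
skel_dist = [12.0, 7.5, 14.2, 6.9]         # distances from skeleton joint features
fused, ranking = fuse_and_rank(rgbd_dist, skel_dist)
print(ranking[0])                          # rank-1 candidate index
```
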
3

Yue, Ya Jie, Xiao Jing Zhang, and Chen Ming Sha. "The Design of Wireless Video Monitoring System Based on FPGA." Advanced Materials Research 981 (July 2014): 612–15. http://dx.doi.org/10.4028/www.scientific.net/amr.981.612.

Abstract:
The wireless video monitoring system contains a video acquisition device, a video transmission device, a video storage device and a VGA display device. In this paper, we use the video acquisition device to collect video signals in real time. The analog video signal is transmitted using wireless technology and converted to a digital signal by a dedicated A/D chip. At the same time, the YCrCb signals are converted into RGB signals by the format converting module. The digital RGB signals are then converted to analog RGB signals through the D/A and finally displayed on the VGA monitor in real time. The design mainly uses wireless transmission technology to transmit the analog video signals and uses the ADV7181 to decode them. The FPGA control system processes the decoded digital signals, which are transmitted to the D/A, and the data are finally displayed in real time.
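For reference, the YCrCb-to-RGB format conversion performed by the converting module is a fixed linear transform. The sketch below shows the standard ITU-R BT.601 full-range version in software; the FPGA design in the paper implements an equivalent in fixed-point hardware, whose exact coefficients are not given here.

```python
import numpy as np

def ycbcr_to_rgb(y, cb, cr):
    """ITU-R BT.601 full-range YCbCr -> RGB (done in fixed point on the FPGA)."""
    y, cb, cr = (np.asarray(a, dtype=float) for a in (y, cb, cr))
    r = y + 1.402 * (cr - 128.0)
    g = y - 0.344136 * (cb - 128.0) - 0.714136 * (cr - 128.0)
    b = y + 1.772 * (cb - 128.0)
    return np.clip(np.stack([r, g, b], axis=-1), 0, 255).astype(np.uint8)

# a neutral pixel: Y=128, Cb=128, Cr=128 maps to mid gray
print(ycbcr_to_rgb(128, 128, 128))   # [128 128 128]
```
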
4

Sharma, Richa, Manoj Sharma, Ankit Shukla, and Santanu Chaudhury. "Conditional Deep 3D-Convolutional Generative Adversarial Nets for RGB-D Generation." Mathematical Problems in Engineering 2021 (November 11, 2021): 1–8. http://dx.doi.org/10.1155/2021/8358314.

Abstract:
Generation of synthetic data is a challenging task. There are only a few significant works on RGB video generation and no pertinent works on RGB-D data generation. In the present work, we focus our attention on synthesizing RGB-D data which can further be used as a dataset for various applications like object tracking, gesture recognition, and action recognition. This paper proposes a novel architecture that uses conditional deep 3D-convolutional generative adversarial networks to synthesize RGB-D data by exploiting a 3D spatio-temporal convolutional framework. The proposed architecture can be used to generate virtually unlimited data. In this work, we have presented the architecture to generate RGB-D data conditioned on class labels. In the architecture, two parallel paths were used, one to generate RGB data and the second to synthesize the depth map. The output from the two parallel paths is combined to generate RGB-D data. The proposed model is used for video generation at 30 fps (frames per second). Each frame referred to here is an RGB-D frame with a spatial resolution of 512 × 512.
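As a rough illustration of the idea (not the paper's architecture, which uses two parallel RGB and depth paths and 512 × 512 output), a minimal class-conditional 3D transposed-convolution generator in PyTorch might look as follows; all layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class CondRGBDGenerator(nn.Module):
    """Toy class-conditional 3D ConvTranspose generator: noise + label -> (B, 4, T, H, W)."""
    def __init__(self, z_dim=100, n_classes=10):
        super().__init__()
        self.label_embed = nn.Embedding(n_classes, z_dim)
        self.net = nn.Sequential(
            nn.ConvTranspose3d(z_dim, 256, kernel_size=(2, 4, 4)),    # -> 2 x 4 x 4
            nn.BatchNorm3d(256), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(256, 128, 4, stride=2, padding=1),     # -> 4 x 8 x 8
            nn.BatchNorm3d(128), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(128, 64, 4, stride=2, padding=1),      # -> 8 x 16 x 16
            nn.BatchNorm3d(64), nn.ReLU(inplace=True),
            nn.ConvTranspose3d(64, 4, 4, stride=2, padding=1),        # -> 16 x 32 x 32
            nn.Tanh(),                                                # 4 channels: RGB + depth
        )

    def forward(self, z, labels):
        x = z * self.label_embed(labels)              # condition the noise on the class label
        return self.net(x.view(x.size(0), -1, 1, 1, 1))

g = CondRGBDGenerator()
clip = g(torch.randn(2, 100), torch.tensor([3, 7]))   # torch.Size([2, 4, 16, 32, 32])
```
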
5

Martínez Carrillo, Fabio, Fabián Castillo, and Lola Bautista. "3D+T dense motion trajectories as kinematics primitives to recognize gestures on depth video sequences." Revista Politécnica 15, no. 29 (2019): 82–94. http://dx.doi.org/10.33571/rpolitec.v15n29a7.

Abstract:
RGB-D sensors have allowed attacking many classical problems in computer vision such as segmentation, scene representations and human interaction, among many others. Regarding motion characterization, typical RGB-D strategies are limited to analyzing global shape changes and capturing scene flow fields to describe local motions in depth sequences. Nevertheless, such strategies only recover motion information between a couple of frames, limiting the analysis of coherent large displacements along time. This work presents a novel strategy to compute 3D+t dense and long motion trajectories as fundamental kinematic primitives to represent video sequences. Each motion trajectory models kinematic word primitives that together can describe complex gestures developed along videos. Such kinematic words were processed in a bag-of-kinematic-words framework to obtain an occurrence video descriptor. The novel video descriptor based on 3D+t motion trajectories achieved an average accuracy of 80% on a dataset of 5 gestures and 100 videos.
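The bag-of-kinematic-words step can be sketched as follows, under the assumption that per-trajectory descriptors and a codebook of kinematic words (e.g., from k-means, not shown) are already available; names and dimensions are illustrative.

```python
import numpy as np

def bag_of_kinematic_words(descriptors, codebook):
    """descriptors: (N, d) per-trajectory kinematic features; codebook: (K, d) codewords.
    Returns a normalized K-bin occurrence histogram describing the whole video."""
    # squared Euclidean distance from every descriptor to every codeword
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    assignments = d2.argmin(axis=1)                  # nearest kinematic word per trajectory
    hist = np.bincount(assignments, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)

rng = np.random.default_rng(0)
video_descriptor = bag_of_kinematic_words(rng.normal(size=(500, 32)),   # 500 trajectories
                                          rng.normal(size=(64, 32)))    # 64 "kinematic words"
print(video_descriptor.shape)                        # (64,)
```
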
6

Aubry, Sophie, Sohaib Laraba, Joëlle Tilmanne, and Thierry Dutoit. "Action recognition based on 2D skeletons extracted from RGB videos." MATEC Web of Conferences 277 (2019): 02034. http://dx.doi.org/10.1051/matecconf/201927702034.

Abstract:
In this paper a methodology to recognize actions based on RGB videos is proposed which takes advantage of the recent breakthroughs made in deep learning. Following the development of Convolutional Neural Networks (CNNs), research was conducted on the transformation of skeletal motion data into 2D images. In this work, a solution is proposed requiring only the use of RGB videos instead of RGB-D videos. This work is based on multiple works studying the conversion of RGB-D data into 2D images. From a video stream (RGB images), a two-dimensional skeleton of 18 joints for each detected body is extracted with a DNN-based human pose estimator called OpenPose. The skeleton data are encoded into the Red, Green and Blue channels of images. Different ways of encoding motion data into images were studied. We successfully use state-of-the-art deep neural networks designed for image classification to recognize actions. Based on a study of the related works, we chose the image classification models SqueezeNet, AlexNet, DenseNet, ResNet, Inception and VGG, and retrained them to perform action recognition. For all tests the NTU RGB+D database is used. The highest accuracy is obtained with ResNet: 83.317% cross-subject and 88.780% cross-view, which outperforms most state-of-the-art results.
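A minimal sketch of the kind of skeleton-to-image encoding described above, assuming OpenPose outputs of shape (frames, 18 joints, x/y/confidence); the paper studies several encodings, and this is only one plausible variant.

```python
import numpy as np

def skeleton_to_image(seq):
    """seq: (T, 18, 3) array of OpenPose (x, y, confidence) per frame.
    Returns a (T, 18, 3) uint8 'image': R <- x, G <- y, B <- confidence, scaled to 0..255."""
    seq = np.asarray(seq, dtype=float)
    img = np.empty_like(seq)
    for c in range(3):                                   # normalize each channel independently
        ch = seq[..., c]
        lo, hi = ch.min(), ch.max()
        img[..., c] = 255.0 * (ch - lo) / (hi - lo + 1e-12)
    return img.astype(np.uint8)                          # feed to SqueezeNet/ResNet/... as an image

frames = np.random.rand(120, 18, 3)                      # 120 frames, 18 joints
print(skeleton_to_image(frames).shape)                   # (120, 18, 3)
```
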
7

Bertholet, P., A. E. Ichim, and M. Zwicker. "Temporally Consistent Motion Segmentation From RGB-D Video." Computer Graphics Forum 37, no. 6 (2018): 118–34. http://dx.doi.org/10.1111/cgf.13316.

8

Cho, Junsu, Seungwon Kim, Chi-Min Oh, and Jeong-Min Park. "Auxiliary Task Graph Convolution Network: A Skeleton-Based Action Recognition for Practical Use." Applied Sciences 15, no. 1 (2024): 198. https://doi.org/10.3390/app15010198.

Abstract:
Graph convolution networks (GCNs) have been extensively researched for action recognition by estimating human skeletons from video clips. However, their image sampling methods are not practical because they require video-length information for sampling images. In this study, we propose an Auxiliary Task Graph Convolution Network (AT-GCN) with low and high-frame pathways while supporting a new sampling method. AT-GCN learns actions at a defined frame rate in the defined range with three losses: fuse, slow, and fast losses. AT-GCN handles the slow and fast losses in two auxiliary tasks, while the mainstream handles the fuse loss. AT-GCN outperforms the original state-of-the-art model on the NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets while maintaining the same inference time. AT-GCN shows the best top-1 accuracy on the NTU RGB+D dataset at 90.3% on the subject and 95.2% on the view benchmarks, on the NTU RGB+D 120 dataset at 86.5% on the subject and 87.6% on the set benchmarks, and at 93.5% on the NW-UCLA dataset.
9

Zhu, Xiaoguang, Ye Zhu, Haoyu Wang, Honglin Wen, Yan Yan, and Peilin Liu. "Skeleton Sequence and RGB Frame Based Multi-Modality Feature Fusion Network for Action Recognition." ACM Transactions on Multimedia Computing, Communications, and Applications 18, no. 3 (2022): 1–24. http://dx.doi.org/10.1145/3491228.

Abstract:
Action recognition has been a heated topic in computer vision for its wide application in vision systems. Previous approaches achieve improvement by fusing the modalities of the skeleton sequence and RGB video. However, such methods pose a dilemma between the accuracy and efficiency for the high complexity of the RGB video network. To solve the problem, we propose a multi-modality feature fusion network to combine the modalities of the skeleton sequence and RGB frame instead of the RGB video, as the key information contained by the combination of the skeleton sequence and RGB frame is close to that of the skeleton sequence and RGB video. In this way, complementary information is retained while the complexity is reduced by a large margin. To better explore the correspondence of the two modalities, a two-stage fusion framework is introduced in the network. In the early fusion stage, we introduce a skeleton attention module that projects the skeleton sequence on the single RGB frame to help the RGB frame focus on the limb movement regions. In the late fusion stage, we propose a cross-attention module to fuse the skeleton feature and the RGB feature by exploiting the correlation. Experiments on two benchmarks, NTU RGB+D and SYSU, show that the proposed model achieves competitive performance compared with the state-of-the-art methods while reducing the complexity of the network.
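The early-fusion idea of projecting the skeleton onto a single RGB frame can be illustrated by building a soft spatial mask from Gaussian bumps centred at the 2D joint locations and re-weighting the frame with it. This is a generic sketch of the mechanism, not the paper's skeleton attention module; the Gaussian width is an assumption.

```python
import numpy as np

def skeleton_attention_mask(joints_xy, height, width, sigma=12.0):
    """joints_xy: (J, 2) pixel coordinates of skeleton joints in one RGB frame.
    Returns an (H, W) soft mask in [0, 1] that highlights limb/joint regions."""
    ys, xs = np.mgrid[0:height, 0:width]
    mask = np.zeros((height, width), dtype=float)
    for x, y in joints_xy:
        mask = np.maximum(mask, np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2)))
    return mask

joints = np.array([[64, 40], [70, 90], [58, 140]])       # toy joint positions (x, y)
mask = skeleton_attention_mask(joints, 224, 224)
frame = np.random.rand(224, 224, 3)
attended = frame * mask[..., None]                       # RGB frame re-weighted toward the body
```
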
10

Wang, Xiaoqin, Yasar Ahmet Sekercioglu, Tom Drummond, Enrico Natalizio, Isabelle Fantoni, and Vincent Fremont. "Fast Depth Video Compression for Mobile RGB-D Sensors." IEEE Transactions on Circuits and Systems for Video Technology 26, no. 4 (2016): 673–86. http://dx.doi.org/10.1109/tcsvt.2015.2416571.

11

Coppola, Claudio, Serhan Cosar, Diego R. Faria, and Nicola Bellotto. "Social Activity Recognition on Continuous RGB-D Video Sequences." International Journal of Social Robotics 12, no. 1 (2019): 201–15. http://dx.doi.org/10.1007/s12369-019-00541-y.

12

Ahmed, Naveed, and Salam Khalifa. "Time-coherent 3D animation reconstruction from RGB-D video." Signal, Image and Video Processing 10, no. 4 (2015): 783–90. http://dx.doi.org/10.1007/s11760-015-0813-1.

13

Dai, Xinxin, Ran Zhao, Pengpeng Hu, and Adrian Munteanu. "KD-Net: Continuous-Keystroke-Dynamics-Based Human Identification from RGB-D Image Sequences." Sensors 23, no. 20 (2023): 8370. http://dx.doi.org/10.3390/s23208370.

Abstract:
Keystroke dynamics is a soft biometric based on the assumption that humans always type in uniquely characteristic manners. Previous works mainly focused on analyzing the key press or release events. Unlike these methods, we explored a novel visual modality of keystroke dynamics for human identification using a single RGB-D sensor. In order to verify this idea, we created a dataset dubbed KD-MultiModal, which contains 243.2 K frames of RGB images and depth images, obtained by recording a video of hand typing with a single RGB-D sensor. The dataset comprises RGB-D image sequences of 20 subjects (10 males and 10 females) typing sentences, and each subject typed around 20 sentences. In the task, only the hand and keyboard region contributed to the person identification, so we also propose methods of extracting Regions of Interest (RoIs) for each type of data. Unlike the data of the key press or release, our dataset not only captures the velocity of pressing and releasing different keys and the typing style of specific keys or combinations of keys, but also contains rich information on the hand shape and posture. To verify the validity of our proposed data, we adopted deep neural networks to learn distinguishing features from different data representations, including RGB-KD-Net, D-KD-Net, and RGBD-KD-Net. Simultaneously, the sequence of point clouds also can be obtained from depth images given the intrinsic parameters of the RGB-D sensor, so we also studied the performance of human identification based on the point clouds. Extensive experimental results showed that our idea works and the performance of the proposed method based on RGB-D images is the best, which achieved 99.44% accuracy based on the unseen real-world data. To inspire more researchers and facilitate relevant studies, the proposed dataset will be publicly accessible together with the publication of this paper.
14

Lie, Wen-Nung, Dao-Quang Le, Chun-Yu Lai, and Yu-Shin Fang. "Heart Rate Estimation from Facial Image Sequences of a Dual-Modality RGB-NIR Camera." Sensors 23, no. 13 (2023): 6079. http://dx.doi.org/10.3390/s23136079.

Abstract:
This paper presents an RGB-NIR (Near Infrared) dual-modality technique to analyze the remote photoplethysmogram (rPPG) signal and hence estimate the heart rate (in beats per minute), from a facial image sequence. Our main innovative contribution is the introduction of several denoising techniques such as Modified Amplitude Selective Filtering (MASF), Wavelet Decomposition (WD), and Robust Principal Component Analysis (RPCA), which take advantage of RGB and NIR band characteristics to uncover the rPPG signals effectively through this Independent Component Analysis (ICA)-based algorithm. Two datasets, of which one is the public PURE dataset and the other is the CCUHR dataset built with a popular Intel RealSense D435 RGB-D camera, are adopted in our experiments. Facial video sequences in the two datasets are diverse in nature with normal brightness, under-illumination (i.e., dark), and facial motion. Experimental results show that the proposed method has reached competitive accuracies among the state-of-the-art methods even at a shorter video length. For example, our method achieves MAE = 4.45 bpm (beats per minute) and RMSE = 6.18 bpm for RGB-NIR videos of 10 and 20 s in the CCUHR dataset and MAE = 3.24 bpm and RMSE = 4.1 bpm for RGB videos of 60-s in the PURE dataset. Our system has the advantages of accessible and affordable hardware, simple and fast computations, and wide realistic applications.
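After denoising and ICA, the heart rate is typically read off as the dominant spectral peak of the recovered rPPG trace within the plausible cardiac band. The sketch below shows only that final step on a synthetic signal; the MASF/WD/RPCA stages and the reported errors are not reproduced.

```python
import numpy as np

def heart_rate_bpm(rppg, fs, lo_hz=0.7, hi_hz=4.0):
    """Estimate heart rate as the dominant FFT peak of an rPPG trace within 42-240 bpm."""
    x = np.asarray(rppg, dtype=float)
    x = x - x.mean()
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    power = np.abs(np.fft.rfft(x)) ** 2
    band = (freqs >= lo_hz) & (freqs <= hi_hz)           # plausible cardiac frequencies
    return 60.0 * freqs[band][power[band].argmax()]

fs = 30.0                                                # 30 fps facial video
t = np.arange(0, 20, 1.0 / fs)                           # a 20 s clip
rppg = np.sin(2 * np.pi * 1.2 * t) + 0.3 * np.random.randn(len(t))   # ~72 bpm plus noise
print(round(heart_rate_bpm(rppg, fs), 1))                # close to 72.0
```
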
15

Sial, H. A., M. H. Yousaf, and F. Hussain. "Spatio-Temporal RGBD Cuboids Feature for Human Activity Recognition." Nucleus 55, no. 3 (2018): 139–49. https://doi.org/10.71330/thenucleus.2018.303.

Abstract:
Human activity recognition is one of the promising research areas in the domain of computer vision. Color sensor cameras are frequently used in the literature for human activity recognition systems. These cameras map 4D real-world activities to 3D digital space by discarding important depth information. Due to the elimination of depth information, the achieved results exhibit degraded performance. Therefore, this research work presents a robust approach to recognize a human activity by using both the aligned RGB and the depth channels to form a combined RGBD. Furthermore, in order to handle the occlusion and background challenges in the RGB domain, a Spatial-Temporal Interest Point (STIP) based scheme is employed to deal with both RGB and depth channels. Moreover, the proposed scheme only extracts the interest points from the depth video (D-STIP) such that the identical interest points are used to extract the cuboid descriptors from the RGB (RGB-DESC) and depth (D-DESC) channels. Finally, a concatenated feature vector, comprising features from both channels, is passed to a bag-of-visual-words scheme for human activity recognition. The proposed combined RGBD features based approach has been tested on the challenging MSR activity dataset to show the improved capability of the combined approach over a single-channel approach.
16

Sun, Zhen, Junfei Wu, Lu Wang, and Qingdang Li. "SRDT: A Novel Robust RGB-D Tracker Based on Siamese Region Proposal Network and Depth Information." International Journal of Pattern Recognition and Artificial Intelligence 34, no. 09 (2019): 2054023. http://dx.doi.org/10.1142/s0218001420540233.

Abstract:
Visual tracking is still a challenging fundamental task in the field of computer vision, especially in complex scenes such as long-term occlusion, nonrigid deformation and fast movement. In this paper, we presented an RGB-D tracker based on the Siamese Region Proposal Network and Depth Information. First, Siamese Network with shared parameters was constructed to perform feature extraction on the target patch and search area. Second, Region Proposal Network was constructed to estimate the target position in the RGB channels. At the same time, the depth information in the RGB-D video was used to determine the target occlusion state and fine-tune the target position. Finally, the tracker used depth information to achieve occlusion recovery when the target was fully occluded. The experimental result shows that the method has better performance in tracking accuracy and tracking speed on the large-scale Princeton RGB-D Tracking Benchmark (PTB) dataset.
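The role of depth in occlusion handling can be illustrated with a simple heuristic: if the median depth inside the predicted box becomes much nearer to the camera than the tracked target's depth, an occluder has likely entered the box. This is an illustrative rule in the spirit of the abstract, not the tracker's actual logic; the 400 mm threshold is an assumption.

```python
import numpy as np

def is_occluded(depth_map, box, target_depth_mm, jump_mm=400.0):
    """box = (x, y, w, h) in pixels; flag occlusion when the region's median depth
    is substantially nearer the camera than the tracked target's running depth."""
    x, y, w, h = box
    region = depth_map[y:y + h, x:x + w]
    region = region[region > 0]                          # ignore invalid/zero depth pixels
    if region.size == 0:
        return True                                      # no valid depth: treat as occluded
    return (target_depth_mm - np.median(region)) > jump_mm

depth = np.full((480, 640), 2500.0)                      # target standing ~2.5 m away
depth[200:300, 300:400] = 900.0                          # an object passes at ~0.9 m
print(is_occluded(depth, (300, 200, 100, 100), target_depth_mm=2500.0))   # True
```
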
17

Liu, Yun, Ruidi Ma, Hui Li, Chuanxu Wang, and Ye Tao. "RGB-D Human Action Recognition of Deep Feature Enhancement and Fusion Using Two-Stream ConvNet." Journal of Sensors 2021 (January 6, 2021): 1–10. http://dx.doi.org/10.1155/2021/8864870.

Abstract:
Action recognition is an important research direction in computer vision. Recognition performance based on video images is easily affected by factors such as background and lighting, while depth video can reduce such interference and improve recognition accuracy. This paper therefore makes full use of video and depth skeleton data and proposes a two-stream network for RGB-D action recognition (SV-GCN), an architecture that operates on two different kinds of data. The proposed skeleton stream, Nonlocal-stgcn (S-Stream), adds non-local blocks to capture dependencies between a wider range of joints and provide richer skeleton-point features to the model. The proposed video stream, Dilated-slowfastnet (V-Stream), replaces the traditional random sampling layer with dilated convolutional layers to make better use of the depth features. Finally, the information from the two streams is fused to perform action recognition. Experimental results on the NTU-RGB+D dataset show that the proposed method significantly improves recognition accuracy and outperforms st-gcn and Slowfastnet on both the CS and CV benchmarks.
19

Tsuruda, Yoshito, Shingo Akita, Kotomi Yamanaka, et al. "3D Gait Measurement of Mouse Using RGB-D Video from Below." Proceedings of the JSME Annual Conference on Robotics and Mechatronics (Robomec) 2021 (2021): 2P1-M02. http://dx.doi.org/10.1299/jsmermd.2021.2p1-m02.

20

Miyazaki, Ryogo, Kazuya Sasaki, Norimichi Tsumura, and Keita Hirai. "Hand authentication from RGB-D video based on deep neural network." Electronic Imaging 34, no. 17 (2022): 235–1. http://dx.doi.org/10.2352/ei.2022.34.17.3dia-235.

21

Li, Bei, You Yang, and Qiong Liu. "RGB-D video saliency detection via superpixel-level conditional random field." Journal of Image and Graphics 26, no. 4 (2021): 872–82. http://dx.doi.org/10.11834/jig.200122.

22

Min, Xin, Shouqian Sun, Honglie Wang, Xurui Zhang, Chao Li, and Xianfu Zhang. "Motion Capture Research: 3D Human Pose Recovery Based on RGB Video Sequences." Applied Sciences 9, no. 17 (2019): 3613. http://dx.doi.org/10.3390/app9173613.

Abstract:
Using video sequences to restore 3D human poses is of great significance in the field of motion capture. This paper proposes a novel approach to estimate 3D human action via end-to-end learning of deep convolutional neural network to calculate the parameters of the parameterized skinned multi-person linear model. The method is divided into two main stages: (1) 3D human pose estimation based on a single frame image. We use 2D/3D skeleton point constraints, human height constraints, and generative adversarial network constraints to obtain a more accurate human-body model. The model is pre-trained using open-source human pose datasets; (2) Human-body pose generation based on video streams. Combined with the correlation of video sequences, a 3D human pose recovery method based on video streams is proposed, which uses the correlation between videos to generate a smoother 3D pose. In addition, we compared the proposed 3D human pose recovery method with the commercial motion capture platform to prove the effectiveness of the proposed method. To make a contrast, we first built a motion capture platform through two Kinect (V2) devices and iPi Soft series software to obtain depth-camera video sequences and monocular-camera video sequences respectively. Then we defined several different tasks, including the speed of the movements, the position of the subject, the orientation of the subject, and the complexity of the movements. Experimental results show that our low-cost method based on RGB video data can achieve similar results to commercial motion capture platform with RGB-D video data.
23

He, Binghua, Zengzhao Chen, Gaoyang Li, Lang Jiang, Zhao Zhang, and Chunlin Deng. "An expression recognition algorithm based on convolution neural network and RGB-D Images." MATEC Web of Conferences 173 (2018): 03066. http://dx.doi.org/10.1051/matecconf/201817303066.

Abstract:
To address the problem that 2D facial expression recognition is unstable under complex illumination and posture changes, a facial expression recognition algorithm based on RGB-D dynamic sequence analysis is proposed. The algorithm uses LBP features, which are robust to illumination, and adds depth information to study facial expression recognition. The algorithm first extracts 3D texture features from the preprocessed RGB-D facial expression sequence and then uses a CNN to train on the dataset. At the same time, in order to verify the performance of the algorithm, a comprehensive facial expression library including 2D images, video and 3D depth information is constructed with the help of Intel RealSense technology. The experimental results show that the proposed algorithm has advantages over other RGB-D facial expression recognition algorithms in training time and recognition rate, and has certain reference value for future research in facial expression recognition.
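For reference, the LBP texture features mentioned above compare each pixel with its eight neighbours and pack the comparisons into an 8-bit code whose histogram serves as a feature. A minimal NumPy sketch (basic LBP, with no uniform-pattern or rotation-invariant handling) applied to an RGB frame and a depth map:

```python
import numpy as np

def lbp_codes(gray):
    """Basic 8-neighbour Local Binary Pattern code for every interior pixel."""
    g = np.asarray(gray, dtype=float)
    c = g[1:-1, 1:-1]                                    # centre pixels
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),       # neighbours, clockwise
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(c, dtype=np.int32)
    for bit, (dy, dx) in enumerate(offsets):
        nb = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        codes += (nb >= c).astype(np.int32) << bit
    return codes

def lbp_histogram(img):
    """256-bin normalized LBP histogram; works on a grayscale frame or a depth map."""
    hist = np.bincount(lbp_codes(img).ravel(), minlength=256).astype(float)
    return hist / max(hist.sum(), 1.0)

rgb_gray = np.random.randint(0, 256, (64, 64))
depth = np.random.randint(400, 3000, (64, 64))
feature = np.concatenate([lbp_histogram(rgb_gray), lbp_histogram(depth)])   # RGB-D texture feature
```
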
24

Tang, Chao, Anyang Tong, Aihua Zheng, Hua Peng, and Wei Li. "Using a Selective Ensemble Support Vector Machine to Fuse Multimodal Features for Human Action Recognition." Computational Intelligence and Neuroscience 2022 (January 10, 2022): 1–18. http://dx.doi.org/10.1155/2022/1877464.

Abstract:
The traditional human action recognition (HAR) method is based on RGB video. Recently, with the introduction of Microsoft Kinect and other consumer class depth cameras, HAR based on RGB-D (RGB-Depth) has drawn increasing attention from scholars and industry. Compared with the traditional method, the HAR based on RGB-D has high accuracy and strong robustness. In this paper, using a selective ensemble support vector machine to fuse multimodal features for human action recognition is proposed. The algorithm combines the improved HOG feature-based RGB modal data, the depth motion map-based local binary pattern features (DMM-LBP), and the hybrid joint features (HJF)-based joints modal data. Concomitantly, a frame-based selective ensemble support vector machine classification model (SESVM) is proposed, which effectively integrates the selective ensemble strategy with the selection of SVM base classifiers, thus increasing the differences between the base classifiers. The experimental results have demonstrated that the proposed method is simple, fast, and efficient on public datasets in comparison with other action recognition algorithms.
25

Ren, Ziliang, Xiongjiang Xiao, and Huabei Nie. "Empowering Efficient Spatio-Temporal Learning with a 3D CNN for Pose-Based Action Recognition." Sensors 24, no. 23 (2024): 7682. https://doi.org/10.3390/s24237682.

Abstract:
Action recognition based on 3D heatmap volumes has received increasing attention recently because it is suitable for application to 3D CNNs to improve the recognition performance of deep networks. However, it is difficult for models to capture global dependencies due to their restricted receptive field. To effectively capture long-range dependencies and balance computations, a novel model, PoseTransformer3D with Global Cross Blocks (GCBs), is proposed for pose-based action recognition. The proposed model extracts spatio-temporal features from processed 3D heatmap volumes. Moreover, we design a further recognition framework, RGB-PoseTransformer3D with Global Cross Complementary Blocks (GCCBs), for multimodality feature learning from both pose and RGB data. To verify the effectiveness of this model, we conducted extensive experiments on four popular video datasets, namely FineGYM, HMDB51, NTU RGB+D 60, and NTU RGB+D 120. Experimental results show that the proposed recognition framework always achieves state-of-the-art recognition performance, substantially improving multimodality learning through action recognition.
26

Bao, Linchao, Xiangkai Lin, Yajing Chen, et al. "High-Fidelity 3D Digital Human Head Creation from RGB-D Selfies." ACM Transactions on Graphics 41, no. 1 (2022): 1–21. http://dx.doi.org/10.1145/3472954.

Abstract:
We present a fully automatic system that can produce high-fidelity, photo-realistic three-dimensional (3D) digital human heads with a consumer RGB-D selfie camera. The system only needs the user to take a short selfie RGB-D video while rotating his/her head and can produce a high-quality head reconstruction in less than 30 s. Our main contribution is a new facial geometry modeling and reflectance synthesis procedure that significantly improves the state of the art. Specifically, given the input video a two-stage frame selection procedure is first employed to select a few high-quality frames for reconstruction. Then a differentiable renderer-based 3D Morphable Model (3DMM) fitting algorithm is applied to recover facial geometries from multiview RGB-D data, which takes advantages of a powerful 3DMM basis constructed with extensive data generation and perturbation. Our 3DMM has much larger expressive capacities than conventional 3DMM, allowing us to recover more accurate facial geometry using merely linear basis. For reflectance synthesis, we present a hybrid approach that combines parametric fitting and Convolutional Neural Networks (CNNs) to synthesize high-resolution albedo/normal maps with realistic hair/pore/wrinkle details. Results show that our system can produce faithful 3D digital human faces with extremely realistic details. The main code and the newly constructed 3DMM basis is publicly available.
27

Pavel, Mircea Serban, Hannes Schulz, and Sven Behnke. "Object class segmentation of RGB-D video using recurrent convolutional neural networks." Neural Networks 88 (April 2017): 105–13. http://dx.doi.org/10.1016/j.neunet.2017.01.003.

28

Shao, Ling, Ziyun Cai, Li Liu, and Ke Lu. "Performance evaluation of deep feature learning for RGB-D image/video classification." Information Sciences 385-386 (April 2017): 266–83. http://dx.doi.org/10.1016/j.ins.2017.01.013.

29

Stückler, Jörg, and Sven Behnke. "Efficient Dense Rigid-Body Motion Segmentation and Estimation in RGB-D Video." International Journal of Computer Vision 113, no. 3 (2015): 233–45. http://dx.doi.org/10.1007/s11263-014-0796-3.

30

Stückler, Jörg, Benedikt Waldvogel, Hannes Schulz, and Sven Behnke. "Dense real-time mapping of object-class semantics from RGB-D video." Journal of Real-Time Image Processing 10, no. 4 (2013): 599–609. http://dx.doi.org/10.1007/s11554-013-0379-5.

31

Ahmed, Naveed, and Imran Junejo. "Using Multiple RGB-D Cameras for 3D Video Acquisition and Spatio-Temporally Coherent 3D Animation Reconstruction." International Journal of Computer Theory and Engineering 6, no. 6 (2014): 447–50. http://dx.doi.org/10.7763/ijcte.2014.v6.907.

32

Hou, J., M. Goebel, P. Hübner, and D. Iwaszczuk. "OCTREE-BASED APPROACH FOR REAL-TIME 3D INDOOR MAPPING USING RGB-D VIDEO DATA." International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences XLVIII-1/W1-2023 (May 25, 2023): 183–90. http://dx.doi.org/10.5194/isprs-archives-xlviii-1-w1-2023-183-2023.

Abstract:
3D indoor mapping is becoming increasingly critical for a variety of applications such as path planning and navigation for robots. In recent years, there is growing interest in how low-cost sensors, such as monocular or depth cameras, can be used for 3D mapping. In our paper, we present an octree-based approach for real-time 3D indoor mapping using a handheld RGB depth camera. One benefit of the generated octree map is that it requires less storage and computational resources than point cloud models. Moreover, it explicitly represents free space and unmapped areas, which are essential for the robot's navigation tasks. In this work, on the basis of the ORB-SLAM3 system (Campos et al., 2021), we developed an octree mapping system, which directly calls the keyframes and estimated poses provided by the ORB-SLAM3 algorithms. Furthermore, we used the Point Cloud Library (PCL) for the dense point cloud mapping and then OctoMap for the point cloud to octree map conversion. Finally, we implemented efficient probabilistic 3D mapping in the Robot Operating System (ROS) environment. We used the TUM RGB-D dataset to evaluate the estimated trajectories of the camera. The evaluation shows an average translational RMSE of 5.9 cm on the TUM RGB-D dataset. Besides, we also compared the ground truth point clouds and our generated point clouds. The result shows that the mean cloud-to-cloud distance in the corridor scene is about 6 cm. All the evaluation results show our proposed approach is a promising solution for advanced indoor voxel mapping and robotic navigation systems.
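The core of OctoMap-style mapping is a probabilistic log-odds occupancy update per traversed cell. The sketch below swaps the octree for a flat voxel dictionary to keep it short, but applies the same hit/miss update idea; the resolution and clamping constants are illustrative rather than the values used in the paper.

```python
import numpy as np

class VoxelOccupancyMap:
    """Flat-grid stand-in for an octree: log-odds occupancy per voxel."""
    def __init__(self, resolution=0.05, l_hit=0.85, l_miss=-0.4, l_min=-2.0, l_max=3.5):
        self.res, self.l_hit, self.l_miss = resolution, l_hit, l_miss
        self.l_min, self.l_max = l_min, l_max
        self.logodds = {}                                # voxel index -> log-odds value

    def _key(self, p):
        return tuple(np.floor(np.asarray(p) / self.res).astype(int))

    def _update(self, key, delta):
        v = self.logodds.get(key, 0.0) + delta
        self.logodds[key] = min(max(v, self.l_min), self.l_max)

    def integrate_point(self, origin, end, n_steps=50):
        """Mark voxels along the camera ray as free and the endpoint voxel as occupied."""
        origin, end = np.asarray(origin, float), np.asarray(end, float)
        for t in np.linspace(0.0, 1.0, n_steps, endpoint=False):
            self._update(self._key(origin + t * (end - origin)), self.l_miss)
        self._update(self._key(end), self.l_hit)

    def occupied(self, p, threshold=0.0):
        return self.logodds.get(self._key(p), 0.0) > threshold

m = VoxelOccupancyMap()
m.integrate_point(origin=[0, 0, 0], end=[1.0, 0.2, 0.3])          # one back-projected depth pixel
print(m.occupied([1.0, 0.2, 0.3]), m.occupied([0.5, 0.1, 0.15]))  # True False
```
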
33

Shi, Zhenlian, Yanfeng Sun, Linxin Xiong, Yongli Hu, and Baocai Yin. "A Multisource Heterogeneous Data Fusion Method for Pedestrian Tracking." Mathematical Problems in Engineering 2015 (2015): 1–10. http://dx.doi.org/10.1155/2015/150541.

Abstract:
Traditional visual pedestrian tracking methods perform poorly when faced with problems such as occlusion, illumination changes, and complex backgrounds. In principle, collecting more sensing information should resolve these issues. However, it is extremely challenging to properly fuse different sensing information to achieve accurate tracking results. In this study, we develop a pedestrian tracking method for fusing multisource heterogeneous sensing information, including video, RGB-D sequences, and inertial sensor data. In our method, a RGB-D sequence is used to position the target locally by fusing the texture and depth features. The local position is then used to eliminate the cumulative error resulting from the inertial sensor positioning. A camera calibration process is used to map the inertial sensor position onto the video image plane, where the visual tracking position and the mapped position are fused using a similarity feature to obtain accurate tracking results. Experiments using real scenarios show that the developed method outperforms the existing tracking method, which uses only a single sensing dataset, and is robust to target occlusion, illumination changes, and interference from similar textures or complex backgrounds.
34

Chen, Tao, and Dongbing Gu. "CSA6D: Channel-Spatial Attention Networks for 6D Object Pose Estimation." Cognitive Computation 14, no. 2 (2021): 702–13. http://dx.doi.org/10.1007/s12559-021-09966-y.

Abstract:
6D object pose estimation plays a crucial role in robotic manipulation and grasping tasks. The aim of estimating the 6D object pose from RGB or RGB-D images is to detect objects and estimate their orientations and translations relative to the given canonical models. RGB-D cameras provide two sensory modalities: RGB and depth images, which could benefit the estimation accuracy. But the exploitation of two different modality sources remains a challenging issue. In this paper, inspired by recent works on attention networks that can focus on important regions and ignore unnecessary information, we propose a novel network, the Channel-Spatial Attention Network (CSA6D), to estimate the 6D object pose from an RGB-D camera. The proposed CSA6D includes a pre-trained 2D network to segment the objects of interest from the RGB image. Then it uses two separate networks to extract appearance and geometrical features from the RGB and depth images for each segmented object. The two feature vectors for each pixel are stacked together as a fusion vector, which is refined by an attention module to generate an aggregated feature vector. The attention module includes a channel attention block and a spatial attention block which can effectively leverage the concatenated embeddings into accurate 6D pose prediction on known objects. We evaluate the proposed network on two benchmark datasets, the YCB-Video dataset and the LineMod dataset, and the results show it can outperform previous state-of-the-art methods under the ADD and ADD-S metrics. Also, the attention map demonstrates that our proposed network searches for the unique geometry information as the most likely features for pose estimation. From the experiments, we conclude that the proposed network can accurately estimate the object pose by effectively leveraging multi-modality features.
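As a rough sketch of how channel and spatial attention can refine a fused RGB-depth feature map (a generic CBAM-style block, not the CSA6D module itself; layer sizes are assumptions):

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """Illustrative channel-then-spatial attention over fused per-pixel RGB+depth features."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):                                # x: (B, C, H, W) fused features
        b, c, _, _ = x.shape
        ch = torch.sigmoid(self.channel_mlp(x.mean(dim=(2, 3))) +
                           self.channel_mlp(x.amax(dim=(2, 3)))).view(b, c, 1, 1)
        x = x * ch                                       # channel attention
        sp = torch.sigmoid(self.spatial_conv(
            torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)))
        return x * sp                                    # spatial attention

fused = torch.randn(2, 64, 32, 32)                       # stacked appearance + geometry embeddings
refined = ChannelSpatialAttention(64)(fused)             # same shape, attention-weighted
```
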
35

Pan, Baiyu, Liming Zhang, Hanxiong Yin, Jun Lan, and Feilong Cao. "An automatic 2D to 3D video conversion approach based on RGB-D images." Multimedia Tools and Applications 80, no. 13 (2021): 19179–201. http://dx.doi.org/10.1007/s11042-021-10662-0.

36

Miao, Yongwei, Jiahui Chen, Xinjie Zhang, Wenjuan Ma, and Shusen Sun. "Efficient 3D Object Detection of Indoor Scenes Based on RGB-D Video Stream." Journal of Computer-Aided Design & Computer Graphics 33, no. 7 (2021): 1015–25. http://dx.doi.org/10.3724/sp.j.1089.2021.18630.

37

Wang, Yucheng, Jian Zhang, Zicheng Liu, Qiang Wu, Zhengyou Zhang, and Yunde Jia. "Depth Super-Resolution on RGB-D Video Sequences With Large Displacement 3D Motion." IEEE Transactions on Image Processing 27, no. 7 (2018): 3571–85. http://dx.doi.org/10.1109/tip.2018.2820809.

38

Zuo, Xinxin, Sen Wang, Jiangbin Zheng, Zhigeng Pan, and Ruigang Yang. "Detailed Surface Geometry and Albedo Recovery from RGB-D Video under Natural Illumination." IEEE Transactions on Pattern Analysis and Machine Intelligence 42, no. 10 (2020): 2720–34. http://dx.doi.org/10.1109/tpami.2019.2955459.

39

Xie, Qian, Oussama Remil, Yanwen Guo, Meng Wang, Mingqiang Wei, and Jun Wang. "Object Detection and Tracking Under Occlusion for Object-Level RGB-D Video Segmentation." IEEE Transactions on Multimedia 20, no. 3 (2018): 580–92. http://dx.doi.org/10.1109/tmm.2017.2751965.

40

Morell-Gimenez, Vicente, Marcelo Saval-Calvo, Jorge Azorin-Lopez, et al. "A Comparative Study of Registration Methods for RGB-D Video of Static Scenes." Sensors 14, no. 5 (2014): 8547–76. http://dx.doi.org/10.3390/s140508547.

41

Cheng, Shyi-Chyi, Kuei-Fang Hsiao, Chen-Kuei Yang, Po-Fu Hsiao, and Wan-Hsuan Yu. "A novel unsupervised 3D skeleton detection in RGB-D images for video surveillance." Multimedia Tools and Applications 79, no. 23-24 (2018): 15829–57. http://dx.doi.org/10.1007/s11042-018-6292-y.

42

Xie, Dongwei, Xiaodan Zhang, Xiang Gao, Hu Zhao, and Dongyang Du. "MAF-Net: A multimodal data fusion approach for human action recognition." PLOS ONE 20, no. 4 (2025): e0319656. https://doi.org/10.1371/journal.pone.0319656.

Abstract:
3D skeleton-based human activity recognition has gained significant attention due to its robustness against variations in background, lighting, and viewpoints. However, challenges remain in effectively capturing spatiotemporal dynamics and integrating complementary information from multiple data modalities, such as RGB video and skeletal data. To address these challenges, we propose a multimodal fusion framework that leverages optical flow-based key frame extraction, data augmentation techniques, and an innovative fusion of skeletal and RGB streams using self-attention and skeletal attention modules. The model employs a late fusion strategy to combine skeletal and RGB features, allowing for more effective capture of spatial and temporal dependencies. Extensive experiments on benchmark datasets, including NTU RGB+D, SYSU, and UTD-MHAD, demonstrate that our method outperforms existing models. This work not only enhances action recognition accuracy but also provides a robust foundation for future multimodal integration and real-time applications in diverse fields such as surveillance and healthcare.
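The key-frame extraction step can be approximated, for illustration, by scoring each frame with its motion energy and keeping the most dynamic ones. The sketch below uses simple frame differencing in place of true optical flow so that it stays dependency-free; it is not the paper's algorithm.

```python
import numpy as np

def select_key_frames(frames, n_keep=16):
    """frames: (T, H, W) grayscale video. Score frames by motion energy (mean absolute
    difference to the previous frame) and return the indices of the n_keep most dynamic ones."""
    frames = np.asarray(frames, dtype=float)
    motion = np.abs(np.diff(frames, axis=0)).mean(axis=(1, 2))   # length T-1
    scores = np.concatenate([[0.0], motion])                     # frame 0 gets zero score
    return np.sort(np.argsort(scores)[-n_keep:])                 # keep chronological order

video = np.random.rand(100, 120, 160)
print(select_key_frames(video, n_keep=8))                        # 8 key-frame indices
```
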
43

Lai, Zhimao, Yan Zhang, and Xiubo Liang. "A Two-Stream Method for Human Action Recognition Using Facial Action Cues." Sensors 24, no. 21 (2024): 6817. http://dx.doi.org/10.3390/s24216817.

Abstract:
Human action recognition (HAR) is a critical area in computer vision with wide-ranging applications, including video surveillance, healthcare monitoring, and abnormal behavior detection. Current HAR methods predominantly rely on full-body data, which can limit their effectiveness in real-world scenarios where occlusion is common. In such situations, the face often remains visible, providing valuable cues for action recognition. This paper introduces Face in Action (FIA), a novel two-stream method that leverages facial action cues for robust action recognition under conditions of significant occlusion. FIA consists of an RGB stream and a landmark stream. The RGB stream processes facial image sequences using a fine-spatio-multitemporal (FSM) 3D convolution module, which employs smaller spatial receptive fields to capture detailed local facial movements and larger temporal receptive fields to model broader temporal dynamics. The landmark stream processes facial landmark sequences using a normalized temporal attention (NTA) module within an NTA-GCN block, enhancing the detection of key facial frames and improving overall recognition accuracy. We validate the effectiveness of FIA using the NTU RGB+D and NTU RGB+D 120 datasets, focusing on action categories related to medical conditions. Our experiments demonstrate that FIA significantly outperforms existing methods in scenarios with extensive occlusion, highlighting its potential for practical applications in surveillance and healthcare settings.
44

Rafique, Adnan Ahmed, Ahmad Jalal, and Kibum Kim. "Automated Sustainable Multi-Object Segmentation and Recognition via Modified Sampling Consensus and Kernel Sliding Perceptron." Symmetry 12, no. 11 (2020): 1928. http://dx.doi.org/10.3390/sym12111928.

Abstract:
Object recognition in depth images is challenging and persistent task in machine vision, robotics, and automation of sustainability. Object recognition tasks are a challenging part of various multimedia technologies for video surveillance, human–computer interaction, robotic navigation, drone targeting, tourist guidance, and medical diagnostics. However, the symmetry that exists in real-world objects plays a significant role in perception and recognition of objects in both humans and machines. With advances in depth sensor technology, numerous researchers have recently proposed RGB-D object recognition techniques. In this paper, we introduce a sustainable object recognition framework that is consistent despite any change in the environment, and can recognize and analyze RGB-D objects in complex indoor scenarios. Firstly, after acquiring a depth image, the point cloud and the depth maps are extracted to obtain the planes. Then, the plane fitting model and the proposed modified maximum likelihood estimation sampling consensus (MMLESAC) are applied as a segmentation process. Then, depth kernel descriptors (DKDES) over segmented objects are computed for single and multiple object scenarios separately. These DKDES are subsequently carried forward to isometric mapping (IsoMap) for feature space reduction. Finally, the reduced feature vector is forwarded to a kernel sliding perceptron (KSP) for the recognition of objects. Three datasets are used to evaluate four different experiments by employing a cross-validation scheme to validate the proposed model. The experimental results over RGB-D object, RGB-D scene, and NYUDv1 datasets demonstrate overall accuracies of 92.2%, 88.5%, and 90.5% respectively. These results outperform existing state-of-the-art methods and verify the suitability of the method.
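The plane-segmentation stage builds on the same idea as plain RANSAC plane fitting: repeatedly fit a plane to three random points and keep the hypothesis with the most inliers. The sketch below is vanilla RANSAC rather than the modified maximum-likelihood variant (MMLESAC) proposed in the paper; the threshold and iteration count are illustrative.

```python
import numpy as np

def ransac_plane(points, n_iters=200, inlier_thresh=0.02, rng=None):
    """points: (N, 3) point cloud from a depth image. Returns (unit normal, d, inlier mask)
    for the dominant plane n.x + d = 0 found by plain RANSAC."""
    rng = rng or np.random.default_rng(0)
    best = (None, None, np.zeros(len(points), dtype=bool))
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        if np.linalg.norm(n) < 1e-9:                      # degenerate (collinear) sample
            continue
        n = n / np.linalg.norm(n)
        d = -n @ p0
        inliers = np.abs(points @ n + d) < inlier_thresh  # point-to-plane distances
        if inliers.sum() > best[2].sum():
            best = (n, d, inliers)
    return best

# synthetic floor plane z ~ 0 plus clutter
floor = np.c_[np.random.rand(300, 2) * 2, 0.005 * np.random.randn(300)]
clutter = np.random.rand(100, 3)
n, d, mask = ransac_plane(np.vstack([floor, clutter]))
print(mask.sum())                                         # most of the 300 floor points
```
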
45

Angelico, Jovin, and Ken Ratri Retno Wardani. "Convolutional Neural Network Using Kalman Filter for Human Detection and Tracking on RGB-D Video." CommIT (Communication and Information Technology) Journal 12, no. 2 (2018): 105. http://dx.doi.org/10.21512/commit.v12i2.4890.

Abstract:
The ability of computers to detect human beings through computer vision is still being improved, both in accuracy and in computation time. In low-lighting conditions, the detection accuracy is usually low. This research uses additional information besides the RGB channels, namely a depth map that shows objects' distance relative to the camera. The research integrates a Cascade Classifier (CC) to localize potential objects, a Convolutional Neural Network (CNN) to distinguish human from non-human images, and a Kalman filter to track human movement. For training and testing purposes, two RGB-D datasets with different points of view and lighting conditions are used. Both datasets have been filtered to remove images that contain heavy noise and occlusions so that the training process is better directed. Using these integrated techniques, detection and tracking accuracy reaches 77.7%, and using the Kalman filter increases computational efficiency by 41%.
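The tracking component can be sketched as a textbook constant-velocity Kalman filter over the detected person centre: predict between frames, then correct with each new CNN detection. The matrices below are the generic constant-velocity model, not the paper's tuned parameters.

```python
import numpy as np

class ConstantVelocityKF:
    """Track a 2-D point with state [x, y, vx, vy]; dt is measured in frames."""
    def __init__(self, dt=1.0, process_var=1e-2, meas_var=1.0):
        self.F = np.array([[1, 0, dt, 0], [0, 1, 0, dt],
                           [0, 0, 1, 0], [0, 0, 0, 1]], dtype=float)
        self.H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], dtype=float)
        self.Q = process_var * np.eye(4)
        self.R = meas_var * np.eye(2)
        self.x = np.zeros(4)
        self.P = np.eye(4) * 10.0

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                                  # predicted centre for this frame

    def update(self, z):                                   # z = detected (x, y) from the CNN
        y = np.asarray(z, float) - self.H @ self.x
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]

kf = ConstantVelocityKF()
for det in [(10, 12), (12, 13), None, (16, 15)]:           # None = missed detection
    pred = kf.predict()                                    # coast on the prediction when missed
    if det is not None:
        kf.update(det)
```
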
46

Ge, Yanliang, Cong Zhang, Kang Wang, Ziqi Liu, and Hongbo Bi. "WGI-Net: A weighted group integration network for RGB-D salient object detection." Computational Visual Media 7, no. 1 (2021): 115–25. http://dx.doi.org/10.1007/s41095-020-0200-x.

Abstract:
Salient object detection is used as a pre-process in many computer vision tasks (such as salient object segmentation, video salient object detection, etc.). When performing salient object detection, depth information can provide clues to the location of target objects, so effective fusion of RGB and depth feature information is important. In this paper, we propose a new feature information aggregation approach, weighted group integration (WGI), to effectively integrate RGB and depth feature information. We use a dual-branch structure to slice the input RGB image and depth map separately and then merge the results separately by concatenation. As grouped features may lose global information about the target object, we also make use of the idea of residual learning, taking the features captured by the original fusion method as supplementary information to ensure both accuracy and completeness of the fused information. Experiments on five datasets show that our model performs better than typical existing approaches for four evaluation metrics.
47

Khalid, Nida, Munkhjargal Gochoo, Ahmad Jalal, and Kibum Kim. "Modeling Two-Person Segmentation and Locomotion for Stereoscopic Action Identification: A Sustainable Video Surveillance System." Sustainability 13, no. 2 (2021): 970. http://dx.doi.org/10.3390/su13020970.

Abstract:
Due to the constantly increasing demand for automatic tracking and recognition systems, there is a need for more proficient, intelligent and sustainable human activity tracking. The main purpose of this study is to develop an accurate and sustainable human action tracking system that is capable of error-free identification of human movements irrespective of the environment in which those actions are performed. Therefore, in this paper we propose a stereoscopic Human Action Recognition (HAR) system based on the fusion of RGB (red, green, blue) and depth sensors. These sensors give an extra depth of information which enables the three-dimensional (3D) tracking of each and every movement performed by humans. Human actions are tracked according to four features, namely, (1) geodesic distance; (2) 3D Cartesian-plane features; (3) joints Motion Capture (MOCAP) features and (4) way-points trajectory generation. In order to represent these features in an optimized form, Particle Swarm Optimization (PSO) is applied. After optimization, a neuro-fuzzy classifier is used for classification and recognition. Extensive experimentation is performed on three challenging datasets: A Nanyang Technological University (NTU) RGB+D dataset; a UoL (University of Lincoln) 3D social activity dataset and a Collective Activity Dataset (CAD). Evaluation experiments on the proposed system proved that a fusion of vision sensors along with our unique features is an efficient approach towards developing a robust HAR system, having achieved a mean accuracy of 93.5% with the NTU RGB+D dataset, 92.2% with the UoL dataset and 89.6% with the Collective Activity dataset. The developed system can play a significant role in many computer vision-based applications, such as intelligent homes, offices and hospitals, and surveillance systems.
48

Valenzuela, Andrea, Nicolás Sibuet, Gemma Hornero, and Oscar Casas. "Non-Contact Video-Based Assessment of the Respiratory Function Using a RGB-D Camera." Sensors 21, no. 16 (2021): 5605. http://dx.doi.org/10.3390/s21165605.

Abstract:
A fully automatic, non-contact method for the assessment of the respiratory function is proposed using an RGB-D camera-based technology. The proposed algorithm relies on the depth channel of the camera to estimate the movements of the body’s trunk during breathing. It solves in fixed-time complexity, O(1), as the acquisition relies on the mean depth value of the target regions only using the color channels to automatically locate them. This simplicity allows the extraction of real-time values of the respiration, as well as the synchronous assessment on multiple body parts. Two different experiments have been performed: a first one conducted on 10 users in a single region and with a fixed breathing frequency, and a second one conducted on 20 users considering a simultaneous acquisition in two regions. The breath rate has then been computed and compared with a reference measurement. The results show a non-statistically significant bias of 0.11 breaths/min and 96% limits of agreement of −2.21/2.34 breaths/min regarding the breath-by-breath assessment. The overall real-time assessment shows a RMSE of 0.21 breaths/min. We have shown that this method is suitable for applications where respiration needs to be monitored in non-ambulatory and static environments.
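The O(1)-per-frame acquisition described above amounts to averaging the depth values inside the trunk region each frame and reading the breathing frequency from that time series. A minimal sketch, assuming a 30 fps depth stream and a synthetic 15 breaths/min signal; the band limits and ROI handling are assumptions.

```python
import numpy as np

def breaths_per_minute(mean_depth_series, fs, lo_hz=0.1, hi_hz=0.7):
    """Dominant frequency of the per-frame mean trunk depth, reported in breaths/min."""
    x = np.asarray(mean_depth_series, dtype=float)
    x = x - x.mean()
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    power = np.abs(np.fft.rfft(x)) ** 2
    band = (freqs >= lo_hz) & (freqs <= hi_hz)            # roughly 6-42 breaths/min
    return 60.0 * freqs[band][power[band].argmax()]

fs = 30.0                                                 # depth camera frame rate
t = np.arange(0, 60, 1.0 / fs)                            # one minute of acquisition
roi_mean_depth = 1500 + 4.0 * np.sin(2 * np.pi * 0.25 * t)   # chest at ~1.5 m, 15 breaths/min
print(breaths_per_minute(roi_mean_depth, fs))             # ~15.0
```
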
49

Ahmed, Naveed. "Multi-View RGB-D Video Analysis and Fusion for 360 Degrees Unified Motion Reconstruction." Journal of Computer Science 13, no. 12 (2017): 795–804. http://dx.doi.org/10.3844/jcssp.2017.795.804.

50

Okura, Fumio, Saya Ikuma, Yasushi Makihara, Daigo Muramatsu, Ken Nakada, and Yasushi Yagi. "RGB-D video-based individual identification of dairy cows using gait and texture analyses." Computers and Electronics in Agriculture 165 (October 2019): 104944. http://dx.doi.org/10.1016/j.compag.2019.104944.
