Log in

Relevant bibliographies by topics / Active speaker detection

Contents

Journal articles

Academic literature on the topic 'Active speaker detection'

Author: Grafiati

Published: 7 September 2024

Create a spot-on reference in APA, MLA, Chicago, Harvard, and other styles

Select a source type:

Consult the lists of relevant articles, books, theses, conference reports, and other scholarly sources on the topic 'Active speaker detection.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Press on it, and we will generate automatically the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as pdf and read online its abstract whenever available in the metadata.

Journal articles on the topic "Active speaker detection"

1

Assunção, Gustavo, Nuno Gonçalves, and Paulo Menezes. "Bio-Inspired Modality Fusion for Active Speaker Detection." Applied Sciences 11, no. 8 (2021): 3397. http://dx.doi.org/10.3390/app11083397.

Full text

Abstract:

Human beings have developed fantastic abilities to integrate information from various sensory sources exploring their inherent complementarity. Perceptual capabilities are therefore heightened, enabling, for instance, the well-known "cocktail party" and McGurk effects, i.e., speech disambiguation from a panoply of sound signals. This fusion ability is also key in refining the perception of sound source location, as in distinguishing whose voice is being heard in a group conversation. Furthermore, neuroscience has successfully identified the superior colliculus region in the brain as the one responsible for this modality fusion, with a handful of biological models having been proposed to approach its underlying neurophysiological process. Deriving inspiration from one of these models, this paper presents a methodology for effectively fusing correlated auditory and visual information for active speaker detection. Such an ability can have a wide range of applications, from teleconferencing systems to social robotics. The detection approach initially routes auditory and visual information through two specialized neural network structures. The resulting embeddings are fused via a novel layer based on the superior colliculus, whose topological structure emulates spatial neuron cross-mapping of unimodal perceptual fields. The validation process employed two publicly available datasets, with achieved results confirming and greatly surpassing initial expectations.

APA, Harvard, Vancouver, ISO, and other styles

2

Pu, Jie, Yannis Panagakis, and Maja Pantic. "Active Speaker Detection and Localization in Videos Using Low-Rank and Kernelized Sparsity." IEEE Signal Processing Letters 27 (2020): 865–69. http://dx.doi.org/10.1109/lsp.2020.2996412.

Full text

APA, Harvard, Vancouver, ISO, and other styles

3

Lindstrom, Fredric, Keni Ren, Kerstin Persson Waye, and Haibo Li. "A comparison of two active‐speaker‐detection methods suitable for usage in noise dosimeter measurements." Journal of the Acoustical Society of America 123, no. 5 (2008): 3527. http://dx.doi.org/10.1121/1.2934471.

Full text

APA, Harvard, Vancouver, ISO, and other styles

4

Zhu, Ying-Xin, and Hao-Ran Jin. "Speaker Localization Based on Audio-Visual Bimodal Fusion." Journal of Advanced Computational Intelligence and Intelligent Informatics 25, no. 3 (2021): 375–82. http://dx.doi.org/10.20965/jaciii.2021.p0375.

Full text

Abstract:

The demand for fluency in human–computer interaction is on an increase globally; thus, the active localization of the speaker by the machine has become a problem worth exploring. Considering that the stability and accuracy of the single-mode localization method are low, while the multi-mode localization method can utilize the redundancy of information to improve accuracy and anti-interference, a speaker localization method based on voice and image multimodal fusion is proposed. First, the voice localization method based on time differences of arrival (TDOA) in a microphone array and the face detection method based on the AdaBoost algorithm are presented herein. Second, a multimodal fusion method based on spatiotemporal fusion of speech and image is proposed, and it uses a coordinate system converter and frame rate tracker. The proposed method was tested by positioning the speaker stand at 15 different points, and each point was tested 50 times. The experimental results demonstrate that there is a high accuracy when the speaker stands in front of the positioning system within a certain range.

APA, Harvard, Vancouver, ISO, and other styles

5

Stefanov, Kalin, Jonas Beskow, and Giampiero Salvi. "Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially Aware Language Acquisition." IEEE Transactions on Cognitive and Developmental Systems 12, no. 2 (2020): 250–59. http://dx.doi.org/10.1109/tcds.2019.2927941.

Full text

APA, Harvard, Vancouver, ISO, and other styles

6

DAI, Hai, Kean CHEN, Yang WANG, and Haoxin YU. "Fault detection method of secondary sound source in ANC system based on impedance characteristics." Xibei Gongye Daxue Xuebao/Journal of Northwestern Polytechnical University 40, no. 6 (2022): 1242–49. http://dx.doi.org/10.1051/jnwpu/20224061242.

Full text

Abstract:

As an indispensable component in an active noise control system, the working states of the secondary sound sources affect directly noise reduction and the robustness of the system. Therefore, it is very crucial to detect the working states of the secondary sound sources in the process of active control in real time. In this study, a real-time fault detection method for secondary sound sources during the process of active control is presented, and the corresponding detection algorithm is numerically given and experimentally verified. By collecting the input voltage and output current of the speaker unit, the sound quality factor is calculated to comprehensively judge the working state of the secondary sound sources. The simulation and experimental results show that the present method is simple and has low computational cost, can accurately reflect accurately the working states of the secondary sound sources in real time, and provides a basis for judging the operation of the active noise control system.

APA, Harvard, Vancouver, ISO, and other styles

7

Ahmad, Zubair, Alquhayz, and Ditta. "Multimodal Speaker Diarization Using a Pre-Trained Audio-Visual Synchronization Model." Sensors 19, no. 23 (2019): 5163. http://dx.doi.org/10.3390/s19235163.

Full text

Abstract:

Speaker diarization systems aim to find ‘who spoke when?’ in multi-speaker recordings. The dataset usually consists of meetings, TV/talk shows, telephone and multi-party interaction recordings. In this paper, we propose a novel multimodal speaker diarization technique, which finds the active speaker through audio-visual synchronization model for diarization. A pre-trained audio-visual synchronization model is used to find the synchronization between a visible person and the respective audio. For that purpose, short video segments comprised of face-only regions are acquired using a face detection technique and are then fed to the pre-trained model. This model is a two streamed network which matches audio frames with their respective visual input segments. On the basis of high confidence video segments inferred by the model, the respective audio frames are used to train Gaussian mixture model (GMM)-based clusters. This method helps in generating speaker specific clusters with high probability. We tested our approach on a popular subset of AMI meeting corpus consisting of 5.4 h of recordings for audio and 5.8 h of different set of multimodal recordings. A significant improvement is noticed with the proposed method in term of DER when compared to conventional and fully supervised audio based speaker diarization. The results of the proposed technique are very close to the complex state-of-the art multimodal diarization which shows significance of such simple yet effective technique.

APA, Harvard, Vancouver, ISO, and other styles

8

Wang, Shaolei, Zhongyuan Wang, Wanxiang Che, Sendong Zhao, and Ting Liu. "Combining Self-supervised Learning and Active Learning for Disfluency Detection." ACM Transactions on Asian and Low-Resource Language Information Processing 21, no. 3 (2022): 1–25. http://dx.doi.org/10.1145/3487290.

Full text

Abstract:

Spoken language is fundamentally different from the written language in that it contains frequent disfluencies or parts of an utterance that are corrected by the speaker. Disfluency detection (removing these disfluencies) is desirable to clean the input for use in downstream NLP tasks. Most existing approaches to disfluency detection heavily rely on human-annotated data, which is scarce and expensive to obtain in practice. To tackle the training data bottleneck, in this work, we investigate methods for combining self-supervised learning and active learning for disfluency detection. First, we construct large-scale pseudo training data by randomly adding or deleting words from unlabeled data and propose two self-supervised pre-training tasks: (i) a tagging task to detect the added noisy words and (ii) sentence classification to distinguish original sentences from grammatically incorrect sentences. We then combine these two tasks to jointly pre-train a neural network. The pre-trained neural network is then fine-tuned using human-annotated disfluency detection training data. The self-supervised learning method can capture task-special knowledge for disfluency detection and achieve better performance when fine-tuning on a small annotated dataset compared to other supervised methods. However, limited in that the pseudo training data are generated based on simple heuristics and cannot fully cover all the disfluency patterns, there is still a performance gap compared to the supervised models trained on the full training dataset. We further explore how to bridge the performance gap by integrating active learning during the fine-tuning process. Active learning strives to reduce annotation costs by choosing the most critical examples to label and can address the weakness of self-supervised learning with a small annotated dataset. We show that by combining self-supervised learning with active learning, our model is able to match state-of-the-art performance with just about 10% of the original training data on both the commonly used English Switchboard test set and a set of in-house annotated Chinese data.

APA, Harvard, Vancouver, ISO, and other styles

9

Maltezou-Papastylianou, Constantina, Riccardo Russo, Denise Wallace, Chelsea Harmsworth, and Silke Paulmann. "Different stages of emotional prosody processing in healthy ageing–evidence from behavioural responses, ERPs, tDCS, and tRNS." PLOS ONE 17, no. 7 (2022): e0270934. http://dx.doi.org/10.1371/journal.pone.0270934.

Full text

Abstract:

Past research suggests that the ability to recognise the emotional intent of a speaker decreases as a function of age. Yet, few studies have looked at the underlying cause for this effect in a systematic way. This paper builds on the view that emotional prosody perception is a multi-stage process and explores which step of the recognition processing line is impaired in healthy ageing using time-sensitive event-related brain potentials (ERPs). Results suggest that early processes linked to salience detection as reflected in the P200 component and initial build-up of emotional representation as linked to a subsequent negative ERP component are largely unaffected in healthy ageing. The two groups show, however, emotional prosody recognition differences: older participants recognise emotional intentions of speakers less well than younger participants do. These findings were followed up by two neuro-stimulation studies specifically targeting the inferior frontal cortex to test if recognition improves during active stimulation relative to sham. Overall, results suggests that neither tDCS nor high-frequency tRNS stimulation at 2mA for 30 minutes facilitates emotional prosody recognition rates in healthy older adults.

APA, Harvard, Vancouver, ISO, and other styles

10

Lahemer, Elfituri S. F., and Ahmad Rad. "An Audio-Based SLAM for Indoor Environments: A Robotic Mixed Reality Presentation." Sensors 24, no. 9 (2024): 2796. http://dx.doi.org/10.3390/s24092796.

Full text

Abstract:

In this paper, we present a novel approach referred to as the audio-based virtual landmark-based HoloSLAM. This innovative method leverages a single sound source and microphone arrays to estimate the voice-printed speaker’s direction. The system allows an autonomous robot equipped with a single microphone array to navigate within indoor environments, interact with specific sound sources, and simultaneously determine its own location while mapping the environment. The proposed method does not require multiple audio sources in the environment nor sensor fusion to extract pertinent information and make accurate sound source estimations. Furthermore, the approach incorporates Robotic Mixed Reality using Microsoft HoloLens to superimpose landmarks, effectively mitigating the audio landmark-related issues of conventional audio-based landmark SLAM, particularly in situations where audio landmarks cannot be discerned, are limited in number, or are completely missing. The paper also evaluates an active speaker detection method, demonstrating its ability to achieve high accuracy in scenarios where audio data are the sole input. Real-time experiments validate the effectiveness of this method, emphasizing its precision and comprehensive mapping capabilities. The results of these experiments showcase the accuracy and efficiency of the proposed system, surpassing the constraints associated with traditional audio-based SLAM techniques, ultimately leading to a more detailed and precise mapping of the robot’s surroundings.

APA, Harvard, Vancouver, ISO, and other styles

More sources

We offer discounts on all premium plans for authors whose works are included in thematic literature selections. Contact us to get a unique promo code!