Journal articles on the topic 'Speaker embedding'

Consult the top 50 journal articles for your research on the topic 'Speaker embedding.'

Next to every source in the list of references, there is an 'Add to bibliography' button. Click it, and we will automatically generate the bibliographic reference to the chosen work in the citation style you need: APA, MLA, Harvard, Chicago, Vancouver, etc.

You can also download the full text of the academic publication as a PDF and read its abstract online whenever it is available in the metadata.

Browse journal articles across a wide variety of disciplines and organise your bibliography correctly.

1

Mridha, Muhammad Firoz, Abu Quwsar Ohi, Muhammad Mostafa Monowar, Md Abdul Hamid, Md Rashedul Islam, and Yutaka Watanobe. "U-Vectors: Generating Clusterable Speaker Embedding from Unlabeled Data." Applied Sciences 11, no. 21 (2021): 10079. http://dx.doi.org/10.3390/app112110079.

Full text
Abstract:
Speaker recognition deals with recognizing speakers by their speech. Most speaker recognition systems are built upon two stages, the first stage extracts low dimensional correlation embeddings from speech, and the second performs the classification task. The robustness of a speaker recognition system mainly depends on the extraction process of speech embeddings, which are primarily pre-trained on a large-scale dataset. As the embedding systems are pre-trained, the performance of speaker recognition models greatly depends on domain adaptation policy, which may reduce if trained using inadequate
APA, Harvard, Vancouver, ISO, and other styles
2

Kim, Minsoo, and Gil-Jin Jang. "Speaker-Attributed Training for Multi-Speaker Speech Recognition Using Multi-Stage Encoders and Attention-Weighted Speaker Embedding." Applied Sciences 14, no. 18 (2024): 8138. http://dx.doi.org/10.3390/app14188138.

Full text
Abstract:
Automatic speech recognition (ASR) aims at understanding naturally spoken human speech to be used as text inputs to machines. In multi-speaker environments, where multiple speakers are talking simultaneously with a large amount of overlap, a significant performance degradation may occur with conventional ASR systems if they are trained by recordings of single talkers. This paper proposes a multi-speaker ASR method that incorporates speaker embedding information as an additional input. The embedding information for each of the speakers in the training set was extracted as numeric vectors, and a
APA, Harvard, Vancouver, ISO, and other styles
3

Liu, Elaine M., Jih-Wei Yeh, Jen-Hao Lu, and Yi-Wen Liu. "Speaker embedding space cosine similarity comparisons of singing voice conversion models and voice morphing." Journal of the Acoustical Society of America 154, no. 4_supplement (2023): A244. http://dx.doi.org/10.1121/10.0023424.

Full text
Abstract:
We explore the use of cosine similarity between x-vector speaker embeddings as an objective metric to evaluate the effectiveness of singing voice conversion. Our system preprocesses a source singer’s audio to obtain melody features via the F0 contour, loudness curve, and phonetic posteriorgram. These are input to a denoising diffusion probabilistic acoustic model conditioned with another target voice’s speaker embedding to generate a mel spectrogram, which is passed through a HiFi-GAN vocoder to synthesize audio of the source song in the target timbre. We use cosine similarity between the conv
APA, Harvard, Vancouver, ISO, and other styles
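The entry above evaluates singing voice conversion with cosine similarity between x-vector speaker embeddings. As a minimal sketch of that metric (the 4-dimensional vectors below are hypothetical toy values; real x-vectors are typically 512-dimensional):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two speaker embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of a target speaker and a converted utterance.
target = [0.10, 0.90, 0.30, 0.40]
converted = [0.12, 0.85, 0.33, 0.38]
print(round(cosine_similarity(target, converted), 4))
```

Values near 1 indicate the converted audio lands close to the target speaker in embedding space; identical vectors score exactly 1.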
4

Pick, Ron Korenblum, Vladyslav Kozhukhov, Dan Vilenchik, and Oren Tsur. "STEM: Unsupervised STructural EMbedding for Stance Detection." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (2022): 11174–82. http://dx.doi.org/10.1609/aaai.v36i10.21367.

Full text
Abstract:
Stance detection is an important task, supporting many downstream tasks such as discourse parsing and modeling the propagation of fake news, rumors, and science denial. In this paper, we propose a novel framework for stance detection. Our framework is unsupervised and domain-independent. Given a claim and a multi-participant discussion, we construct the interaction network from which we derive a topological embedding for each speaker. These speaker embeddings enjoy the following property: speakers with the same stance tend to be represented by similar vectors, while antipodal vectors represent
APA, Harvard, Vancouver, ISO, and other styles
5

Karamyan, Davit S., and Grigor A. Kirakosyan. "Building a Speaker Diarization System: Lessons from VoxSRC 2023." Mathematical Problems of Computer Science 60 (November 30, 2023): 52–62. http://dx.doi.org/10.51408/1963-0109.

Full text
Abstract:
Speaker diarization is the process of partitioning an audio recording into segments corresponding to individual speakers. In this paper, we present a robust speaker diarization system and describe its architecture. We focus on discussing the key components necessary for building a strong diarization system, such as voice activity detection (VAD), speaker embedding, and clustering. Our system emerged as the winner in the Voxceleb Speaker Recognition Challenge (VoxSRC) 2023, a widely recognized competition for evaluating speaker diarization systems.
APA, Harvard, Vancouver, ISO, and other styles
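The pipeline named above (VAD, speaker embedding, clustering) is the standard diarization recipe. As an illustrative sketch only, assuming per-segment embeddings have already been extracted, a greedy cosine-similarity clustering (the threshold and running-average centroid update are assumptions, standing in for the stronger clustering methods a competition-grade system would use) might look like:

```python
import math

def cluster_segments(embeddings, threshold=0.75):
    """Greedy clustering sketch: assign each segment embedding to the most
    similar existing speaker centroid, or start a new speaker if no centroid
    reaches the similarity threshold. (Illustrative; not the VoxSRC system.)"""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

    centroids, labels = [], []
    for emb in embeddings:
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = cos(emb, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(list(emb))          # no match: start a new speaker
            labels.append(len(centroids) - 1)
        else:
            # Update the matched centroid as a simple running average.
            centroids[best] = [(c + e) / 2 for c, e in zip(centroids[best], emb)]
            labels.append(best)
    return labels

print(cluster_segments([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]))  # [0, 0, 1]
```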
6

Milewski, Krzysztof, Szymon Zaporowski, and Andrzej Czyżewski. "Comparison of the Ability of Neural Network Model and Humans to Detect a Cloned Voice." Electronics 12, no. 21 (2023): 4458. http://dx.doi.org/10.3390/electronics12214458.

Full text
Abstract:
The vulnerability of the speaker identity verification system to attacks using voice cloning was examined. The research project assumed creating a model for verifying the speaker’s identity based on voice biometrics and then testing its resistance to potential attacks using voice cloning. The Deep Speaker Neural Speaker Embedding System was trained, and the Real-Time Voice Cloning system was employed based on the SV2TTS, Tacotron, WaveRNN, and GE2E neural networks. The results of attacks using voice cloning were analyzed and discussed in the context of a subjective assessment of cloned voice f
APA, Harvard, Vancouver, ISO, and other styles
7

Kang, Woo Hyun, Sung Hwan Mun, Min Hyun Han, and Nam Soo Kim. "Disentangled Speaker and Nuisance Attribute Embedding for Robust Speaker Verification." IEEE Access 8 (2020): 141838–49. http://dx.doi.org/10.1109/access.2020.3012893.

Full text
APA, Harvard, Vancouver, ISO, and other styles
8

Poojary, Nigam R., and K. H. Ashish. "Text To Speech with Custom Voice." International Journal for Research in Applied Science and Engineering Technology 11, no. 4 (2023): 4523–30. http://dx.doi.org/10.22214/ijraset.2023.51217.

Full text
Abstract:
The Text to Speech with Custom Voice system described in this work has vast applicability in numerous industries, including entertainment, education, and accessibility. The proposed text-to-speech (TTS) system is capable of generating speech audio in custom voices, even those not included in the training data. The system comprises a speaker encoder, a synthesizer, and a WaveRNN vocoder. Multiple speakers from a dataset of clean speech without transcripts are used to train the speaker encoder for a speaker verification process. The reference speech of the target speaker is used to cre
APA, Harvard, Vancouver, ISO, and other styles
9

Lee, Kong Aik, Qiongqiong Wang, and Takafumi Koshinaka. "Xi-Vector Embedding for Speaker Recognition." IEEE Signal Processing Letters 28 (2021): 1385–89. http://dx.doi.org/10.1109/lsp.2021.3091932.

Full text
APA, Harvard, Vancouver, ISO, and other styles
10

Sečujski, Milan, Darko Pekar, Siniša Suzić, Anton Smirnov, and Tijana Nosek. "Speaker/Style-Dependent Neural Network Speech Synthesis Based on Speaker/Style Embedding." JUCS - Journal of Universal Computer Science 26, no. 4 (2020): 434–53. http://dx.doi.org/10.3897/jucs.2020.023.

Full text
Abstract:
The paper presents a novel architecture and method for training neural networks to produce synthesized speech in a particular voice and speaking style, based on a small quantity of target speaker/style training data. The method is based on neural network embedding, i.e. mapping of discrete variables into continuous vectors in a low-dimensional space, which has been shown to be a very successful universal deep learning technique. In this particular case, different speaker/style combinations are mapped into different points in a low-dimensional space, which enables the network to capture the sim
APA, Harvard, Vancouver, ISO, and other styles
11

Sečujski, Milan, Darko Pekar, Siniša Suzić, Anton Smirnov, and Tijana Nosek. "Speaker/Style-Dependent Neural Network Speech Synthesis Based on Speaker/Style Embedding." JUCS - Journal of Universal Computer Science 26, no. 4 (2020): 434–53. https://doi.org/10.3897/jucs.2020.023.

Full text
Abstract:
The paper presents a novel architecture and method for training neural networks to produce synthesized speech in a particular voice and speaking style, based on a small quantity of target speaker/style training data. The method is based on neural network embedding, i.e. mapping of discrete variables into continuous vectors in a low-dimensional space, which has been shown to be a very successful universal deep learning technique. In this particular case, different speaker/style combinations are mapped into different points in a low-dimensional space, which enables the network to capture the sim
APA, Harvard, Vancouver, ISO, and other styles
12

Chadchankar, Mrs Asharani. "Advancements in Speaker-Independent Speech Separation Using Deep Attractor Networks." International Journal for Research in Applied Science and Engineering Technology 13, no. 5 (2025): 4056–61. https://doi.org/10.22214/ijraset.2025.71160.

Full text
Abstract:
Speaker-independent speech separation, the task of isolating individual voices from a mixture without prior knowledge of the speakers, has gained significant attention due to its importance in various applications. However, challenges such as the arbitrary order of speakers and the unknown number of speakers in a mixture remain significant hurdles. This research paper analyzes Deep Attractor Networks (DANet), a novel deep learning framework designed to address these issues. DANet projects mixed speech signals into a high-dimensional embedding space where reference points, known as attractors,
APA, Harvard, Vancouver, ISO, and other styles
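The attractor idea summarized above can be caricatured in a few lines. Assuming, as a deliberate simplification, that attractors are fixed vectors and that masks come from a softmax over dot-product affinities between each time-frequency embedding and the attractors (the trained DANet learns both the embeddings and the attractors; this sketch does not):

```python
import math

def danet_masks(tf_embeddings, attractors):
    """Softmax assignment of each time-frequency embedding to the attractors;
    each output row holds one soft separation-mask value per speaker.
    (A simplified illustration, not the trained DANet model.)"""
    masks = []
    for emb in tf_embeddings:
        # Dot-product affinity of this embedding to every attractor.
        scores = [sum(e * a for e, a in zip(emb, att)) for att in attractors]
        exps = [math.exp(s) for s in scores]
        total = sum(exps)
        masks.append([x / total for x in exps])
    return masks

# Two hypothetical attractors (one per speaker); this embedding sits near the
# first attractor, so the first speaker's mask weight dominates.
print(danet_masks([[5.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]))
```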
13

Bae, Ara, and Wooil Kim. "Speaker Verification Employing Combinations of Self-Attention Mechanisms." Electronics 9, no. 12 (2020): 2201. http://dx.doi.org/10.3390/electronics9122201.

Full text
Abstract:
One of the most recent speaker recognition methods that demonstrates outstanding performance in noisy environments involves extracting the speaker embedding using attention mechanism instead of average or statistics pooling. In the attention method, the speaker recognition performance is improved by employing multiple heads rather than a single head. In this paper, we propose advanced methods to extract a new embedding by compensating for the disadvantages of the single-head and multi-head attention methods. The combination method comprising single-head and split-based multi-head attentions sh
APA, Harvard, Vancouver, ISO, and other styles
14

Wirdiani, Ayu, Steven Ndung'u Machetho, I. Ketut Gede Darma Putra, Made Sudarma, Rukmi Sari Hartati, and Henrico Aldy Ferdian. "Improvement Model for Speaker Recognition using MFCC-CNN and Online Triplet Mining." International Journal on Advanced Science, Engineering and Information Technology 14, no. 2 (2024): 420–27. http://dx.doi.org/10.18517/ijaseit.14.2.19396.

Full text
Abstract:
Various biometric security systems, such as face recognition, fingerprint, voice, hand geometry, and iris, have been developed. Apart from being a communication medium, the human voice is also a form of biometrics that can be used for identification. Voice has unique characteristics that can be used as a differentiator between one person and another. A sound speaker recognition system must be able to pick up the features that characterize a person's voice. This study aims to develop a human speaker recognition system using the Convolutional Neural Network (CNN) method. This research proposes i
APA, Harvard, Vancouver, ISO, and other styles
15

Li, Xiao, Xiao Chen, Rui Fu, Xiao Hu, Mintong Chen, and Kun Niu. "Learning Deep Embedding with Acoustic and Phoneme Features for Speaker Recognition in FM Broadcasting." IET Biometrics 2024 (March 22, 2024): 1–10. http://dx.doi.org/10.1049/2024/6694481.

Full text
Abstract:
Text-independent speaker verification (TI-SV) is a crucial task in speaker recognition, as it involves verifying an individual’s claimed identity from speech of arbitrary content without any human intervention. The target for TI-SV is to design a discriminative network to learn deep speaker embedding for speaker idiosyncrasy. In this paper, we propose a deep speaker embedding learning approach of a hybrid deep neural network (DNN) for TI-SV in FM broadcasting. Not only acoustic features are utilized, but also phoneme features are introduced as prior knowledge to collectively learn deep speaker
APA, Harvard, Vancouver, ISO, and other styles
16

Pan, Weijun, Shenhao Chen, Yidi Wang, Sheng Chen, and Xuan Wang. "The Speaker Identification Model for Air-Ground Communication Based on a Parallel Branch Architecture." Applied Sciences 15, no. 6 (2025): 2994. https://doi.org/10.3390/app15062994.

Full text
Abstract:
This study addresses the challenges of complex noise and short speech in civil aviation air-ground communication scenarios and proposes a novel speaker identification model, Chrono-ECAPA-TDNN (CET). The aim of the study is to enhance the accuracy and robustness of speaker identification in these environments. The CET model incorporates three key components: the Chrono Block module, the speaker embedding extraction module, and the optimized loss function module. The Chrono Block module utilizes parallel branching architecture, Bi-LSTM, and multi-head attention mechanisms to effectively extract
APA, Harvard, Vancouver, ISO, and other styles
17

Brydinskyi, Vitalii, Yuriy Khoma, Dmytro Sabodashko, et al. "Comparison of Modern Deep Learning Models for Speaker Verification." Applied Sciences 14, no. 4 (2024): 1329. http://dx.doi.org/10.3390/app14041329.

Full text
Abstract:
This research presents an extensive comparative analysis of a selection of popular deep speaker embedding models, namely WavLM, TitaNet, ECAPA, and PyAnnote, applied in speaker verification tasks. The study employs a specially curated dataset, specifically designed to mirror the real-world operating conditions of voice models as accurately as possible. This dataset includes short, non-English statements gathered from interviews on a popular online video platform. The dataset features a wide range of speakers, with 33 males and 17 females, making a total of 50 unique voices. These speakers vary
APA, Harvard, Vancouver, ISO, and other styles
18

Lin, Weiwei, and Man-Wai Mak. "Mixture Representation Learning for Deep Speaker Embedding." IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2022): 968–78. http://dx.doi.org/10.1109/taslp.2022.3153270.

Full text
APA, Harvard, Vancouver, ISO, and other styles
19

Ghorbani, Shahram, and John H. L. Hansen. "Advanced accent/dialect identification and accentedness assessment with multi-embedding models and automatic speech recognition." Journal of the Acoustical Society of America 155, no. 6 (2024): 3848–60. http://dx.doi.org/10.1121/10.0026235.

Full text
Abstract:
The ability to accurately classify accents and assess accentedness in non-native speakers are challenging tasks due primarily to the complexity and diversity of accent and dialect variations. In this study, embeddings from advanced pretrained language identification (LID) and speaker identification (SID) models are leveraged to improve the accuracy of accent classification and non-native accentedness assessment. Findings demonstrate that employing pretrained LID and SID models effectively encodes accent/dialect information in speech. Furthermore, the LID and SID encoded accent information comp
APA, Harvard, Vancouver, ISO, and other styles
20

Khoma, Volodymyr, Yuriy Khoma, Vitalii Brydinskyi, and Alexander Konovalov. "Development of Supervised Speaker Diarization System Based on the PyAnnote Audio Processing Library." Sensors 23, no. 4 (2023): 2082. http://dx.doi.org/10.3390/s23042082.

Full text
Abstract:
Diarization is an important task when working with audio data, as it solves the problem of dividing one analyzed call recording into several speech recordings, each of which belongs to one speaker. Diarization systems segment audio recordings by defining the time boundaries of utterances, and typically use unsupervised methods to group utterances belonging to individual speakers, but do not answer the question “who is speaking?” On the other hand, there are biometric systems that identify individuals on the basis of their voices, but such systems are
APA, Harvard, Vancouver, ISO, and other styles
21

Bahmaninezhad, Fahimeh, Chunlei Zhang, and John H. L. Hansen. "An investigation of domain adaptation in speaker embedding space for speaker recognition." Speech Communication 129 (May 2021): 7–16. http://dx.doi.org/10.1016/j.specom.2021.01.001.

Full text
APA, Harvard, Vancouver, ISO, and other styles
22

Zeng, Bang, and Ming Li. "Universal Speaker Embedding Free Target Speaker Extraction and Personal Voice Activity Detection." Computer Speech & Language 94 (November 2025): 101807. https://doi.org/10.1016/j.csl.2025.101807.

Full text
APA, Harvard, Vancouver, ISO, and other styles
23

Xylogiannis, Paris, Nikolaos Vryzas, Lazaros Vrysis, and Charalampos Dimoulas. "Multisensory Fusion for Unsupervised Spatiotemporal Speaker Diarization." Sensors 24, no. 13 (2024): 4229. http://dx.doi.org/10.3390/s24134229.

Full text
Abstract:
Speaker diarization consists of answering the question of “who spoke when” in audio recordings. In meeting scenarios, the task of labeling audio with the corresponding speaker identities can be further assisted by the exploitation of spatial features. This work proposes a framework designed to assess the effectiveness of combining speaker embeddings with Time Difference of Arrival (TDOA) values from available microphone sensor arrays in meetings. We extract speaker embeddings using two popular and robust pre-trained models, ECAPA-TDNN and X-vectors, and calculate the TDOA values via the Genera
APA, Harvard, Vancouver, ISO, and other styles
24

Shahin Shamsabadi, Ali, Brij Mohan Lal Srivastava, Aurélien Bellet, et al. "Differentially Private Speaker Anonymization." Proceedings on Privacy Enhancing Technologies 2023, no. 1 (2023): 98–114. http://dx.doi.org/10.56553/popets-2023-0007.

Full text
Abstract:
Sharing real-world speech utterances is key to the training and deployment of voice-based services. However, it also raises privacy risks as speech contains a wealth of personal data. Speaker anonymization aims to remove speaker information from a speech utterance while leaving its linguistic and prosodic attributes intact. State-of-the-art techniques operate by disentangling the speaker information (represented via a speaker embedding) from these attributes and re-synthesizing speech based on the speaker embedding of another speaker. Prior research in the privacy community has shown that anon
APA, Harvard, Vancouver, ISO, and other styles
25

Li, Wenjie, Pengyuan Zhang, and Yonghong Yan. "TEnet: target speaker extraction network with accumulated speaker embedding for automatic speech recognition." Electronics Letters 55, no. 14 (2019): 816–19. http://dx.doi.org/10.1049/el.2019.1228.

Full text
APA, Harvard, Vancouver, ISO, and other styles
26

Xie, Fei, Dalong Zhang, and Chengming Liu. "Global–Local Self-Attention Based Transformer for Speaker Verification." Applied Sciences 12, no. 19 (2022): 10154. http://dx.doi.org/10.3390/app121910154.

Full text
Abstract:
Transformer models are now widely used for speech processing tasks due to their powerful sequence modeling capabilities. Previous work determined an efficient way to model speaker embeddings using the Transformer model by combining transformers with convolutional networks. However, traditional global self-attention mechanisms lack the ability to capture local information. To alleviate these problems, we proposed a novel global–local self-attention mechanism. Instead of using local or global multi-head attention alone, this method performs local and global attention in parallel in two parallel
APA, Harvard, Vancouver, ISO, and other styles
27

Shim, Hye-jin, Jee-weon Jung, and Ha-Jin Yu. "Which to select?: Analysis of speaker representation with graph attention networks." Journal of the Acoustical Society of America 156, no. 4 (2024): 2701–8. http://dx.doi.org/10.1121/10.0032393.

Full text
Abstract:
Although the recent state-of-the-art systems show almost perfect performance, analysis of speaker embeddings has been lacking thus far. An in-depth analysis of speaker representation will be performed by looking into which features are selected. To this end, various intermediate representations of the trained model are observed using graph attentive feature aggregation, which includes a graph attention layer and graph pooling layer followed by a readout operation. To do so, the TIMIT dataset, which has comparably restricted conditions (e.g., the region and phoneme) is used after pre-training t
APA, Harvard, Vancouver, ISO, and other styles
28

Guo, Xin, Chengfang Luo, Aiwen Deng, and Feiqi Deng. "DeltaVLAD: An efficient optimization algorithm to discriminate speaker embedding for text-independent speaker verification." AIMS Mathematics 7, no. 4 (2022): 6381–95. http://dx.doi.org/10.3934/math.2022355.

Full text
Abstract:
Text-independent speaker verification aims to determine whether two given utterances in an open-set task originate from the same speaker or not. In this paper, some ways are explored to enhance the discrimination of embeddings in speaker verification. Firstly, difference is used in the coding layer to process speaker features to form the DeltaVLAD layer. The frame-level speaker representation is extracted by the deep neural network with differential operations to calculate the dynamic changes between frames, which is more conducive to capturing insignificant changes in t
APA, Harvard, Vancouver, ISO, and other styles
29

Prabhala, Jagat Chaitanya, Venkatnareshbabu K, and Ragoju Ravi. "Optimizing Similarity Threshold for Abstract Similarity Metric in Speech Diarization Systems: A Mathematical Formulation." Applied Mathematics and Sciences An International Journal (MathSJ) 10, no. 1/2 (2023): 1–10. http://dx.doi.org/10.5121/mathsj.2023.10201.

Full text
Abstract:
Speaker diarization is a critical task in speech processing that aims to identify "who spoke when?" in an audio or video recording that contains unknown amounts of speech from unknown speakers and unknown number of speakers. Diarization has numerous applications in speech recognition, speaker identification, and automatic captioning. Supervised and unsupervised algorithms are used to address speaker diarization problems, but providing exhaustive labeling for the training dataset can become costly in supervised learning, while accuracy can be compromised when using unsupervised approaches. This
APA, Harvard, Vancouver, ISO, and other styles
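The paper above treats similarity-threshold selection mathematically. A minimal empirical version of the same idea, sweeping candidate thresholds over a labeled development set of similarity scores and keeping the one with the fewest pairwise errors (the scores and candidates below are toy assumptions, not data from the paper):

```python
def best_threshold(pairs, candidates):
    """pairs: (similarity, same_speaker) tuples from a labeled development set.
    Returns the candidate threshold with the fewest classification errors
    (false accepts plus false rejects)."""
    def errors(t):
        # A pair is misclassified when (score >= threshold) disagrees with the label.
        return sum((sim >= t) != same for sim, same in pairs)
    return min(candidates, key=errors)

# Toy labeled pairs: same-speaker pairs score high, different-speaker pairs low.
dev_pairs = [(0.9, True), (0.8, True), (0.6, False), (0.4, False), (0.3, False)]
print(best_threshold(dev_pairs, [0.5, 0.6, 0.7]))  # 0.7 separates the toy pairs perfectly
```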
30

Smith, Sierra Rose, Patricia Crist, Rebekah Givens, Taylor Stringer, and Adriana Macdonald. "Interviews Regarding Practice Scholar Engagement: Practitioners’ Descriptions of Their Research Motivations, Characteristics, Resources, & Outcomes." American Journal of Occupational Therapy 78, Supplement_2 (2024): 7811500214p1. http://dx.doi.org/10.5014/ajot.2024.78s2-po214.

Full text
Abstract:
Date Presented 03/22/24. Embedding practice scholarship in daily work is challenging for practitioners despite being emphasized in the American Occupational Therapy Association’s Vision 2025 and mission statements. This presentation defines and provides strategies used by active practice scholars. Primary Author and Speaker: Sierra Rose Smith. Additional Authors and Speakers: Adriana Macdonald. Contributing Authors: Patricia Crist, Rebekah Givens, Taylor Stringer
APA, Harvard, Vancouver, ISO, and other styles
31

Mingote, Victoria, Antonio Miguel, Alfonso Ortega, and Eduardo Lleida. "Supervector Extraction for Encoding Speaker and Phrase Information with Neural Networks for Text-Dependent Speaker Verification." Applied Sciences 9, no. 16 (2019): 3295. http://dx.doi.org/10.3390/app9163295.

Full text
Abstract:
In this paper, we propose a new differentiable neural network with an alignment mechanism for text-dependent speaker verification. Unlike previous works, we do not extract the embedding of an utterance from the global average pooling of the temporal dimension. Our system replaces this reduction mechanism by a phonetic phrase alignment model to keep the temporal structure of each phrase since the phonetic information is relevant in the verification task. Moreover, we can apply a convolutional neural network as front-end, and, thanks to the alignment process being differentiable, we can train th
APA, Harvard, Vancouver, ISO, and other styles
32

Lyu, Ke-Ming, Ren-yuan Lyu, and Hsien-Tsung Chang. "Real-time multilingual speech recognition and speaker diarization system based on Whisper segmentation." PeerJ Computer Science 10 (March 29, 2024): e1973. http://dx.doi.org/10.7717/peerj-cs.1973.

Full text
Abstract:
This research presents the development of a cutting-edge real-time multilingual speech recognition and speaker diarization system that leverages OpenAI’s Whisper model. The system specifically addresses the challenges of automatic speech recognition (ASR) and speaker diarization (SD) in dynamic, multispeaker environments, with a focus on accurately processing Mandarin speech with Taiwanese accents and managing frequent speaker switches. Traditional speech recognition systems often fall short in such complex multilingual and multispeaker contexts, particularly in SD. This study, therefore, inte
APA, Harvard, Vancouver, ISO, and other styles
33

Liang, Chunyan, Lin Yang, Qingwei Zhao, and Yonghong Yan. "Factor Analysis of Neighborhood-Preserving Embedding for Speaker Verification." IEICE Transactions on Information and Systems E95.D, no. 10 (2012): 2572–76. http://dx.doi.org/10.1587/transinf.e95.d.2572.

Full text
APA, Harvard, Vancouver, ISO, and other styles
34

Byun, Jaeuk, and Jong Won Shin. "Monaural Speech Separation Using Speaker Embedding From Preliminary Separation." IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021): 2753–63. http://dx.doi.org/10.1109/taslp.2021.3101617.

Full text
APA, Harvard, Vancouver, ISO, and other styles
35

Lin, Weiwei, Man-Wai Mak, Na Li, Dan Su, and Dong Yu. "A Framework for Adapting DNN Speaker Embedding Across Languages." IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020): 2810–22. http://dx.doi.org/10.1109/taslp.2020.3030499.

Full text
APA, Harvard, Vancouver, ISO, and other styles
36

Misbullah, Alim, Muhammad Saifullah Sani, Husaini, Laina Farsiah, Zahnur, and Kikye Martiwi Sukiakhy. "Sistem Identifikasi Pembicara Berbahasa Indonesia Menggunakan X-Vector Embedding." Jurnal Teknologi Informasi dan Ilmu Komputer 11, no. 2 (2024): 369–76. http://dx.doi.org/10.25126/jtiik.20241127866.

Full text
Abstract:
Speaker embeddings are vectors that have proven effective at representing speaker characteristics, yielding high accuracy in the speaker recognition domain. This research focuses on applying x-vectors as speaker embeddings in an Indonesian-language speaker identification system. The model was built using the VoxCeleb dataset as training data and the INF19 dataset as test data, collected from the voices of students of the Department of Informatics at Universitas Syiah Kuala, class of 2019. To build the model, the features were extracted
APA, Harvard, Vancouver, ISO, and other styles
37

Li, Yanxiong, Qisheng Huang, Xiaofen Xing, and Xiangmin Xu. "Low-complexity speaker embedding module with feature segmentation, transformation and reconstruction for few-shot speaker identification." Expert Systems with Applications 280 (June 2025): 127542. https://doi.org/10.1016/j.eswa.2025.127542.

Full text
APA, Harvard, Vancouver, ISO, and other styles
38

Zhou, Yi, Xiaohai Tian, and Haizhou Li. "Language Agnostic Speaker Embedding for Cross-Lingual Personalized Speech Generation." IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021): 3427–39. http://dx.doi.org/10.1109/taslp.2021.3125142.

Full text
APA, Harvard, Vancouver, ISO, and other styles
39

杨, 益灵. "Multi-Speaker Indonesian Speech Synthesis Based on Global Style Embedding." Computer Science and Application 13, no. 01 (2023): 126–35. http://dx.doi.org/10.12677/csa.2023.131013.

Full text
APA, Harvard, Vancouver, ISO, and other styles
40

Kim, Ju-Ho, Hye-Jin Shim, Jee-Weon Jung, and Ha-Jin Yu. "A Supervised Learning Method for Improving the Generalization of Speaker Verification Systems by Learning Metrics from a Mean Teacher." Applied Sciences 12, no. 1 (2021): 76. http://dx.doi.org/10.3390/app12010076.

Full text
Abstract:
The majority of recent speaker verification tasks are studied under open-set evaluation scenarios considering real-world conditions. The characteristics of these tasks imply that the generalization towards unseen speakers is a critical capability. Thus, this study aims to improve the generalization of the system for the performance enhancement of speaker verification. To achieve this goal, we propose a novel supervised-learning-method-based speaker verification system using the mean teacher framework. The mean teacher network refers to the temporal averaging of deep neural network parameters,
APA, Harvard, Vancouver, ISO, and other styles
41

Seo, Soonshin, and Ji-Hwan Kim. "Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Deep Length Normalization for Text-Independent Speaker Verification System." Electronics 9, no. 10 (2020): 1706. http://dx.doi.org/10.3390/electronics9101706.

Full text
Abstract:
One of the most important parts of a text-independent speaker verification system is speaker embedding generation. Previous studies demonstrated that shortcut connections-based multi-layer aggregation improves the representational power of a speaker embedding system. However, model parameters are relatively large in number, and unspecified variations increase in the multi-layer aggregation. Therefore, in this study, we propose a self-attentive multi-layer aggregation with feature recalibration and deep length normalization for a text-independent speaker verification system. To reduce the numbe
APA, Harvard, Vancouver, ISO, and other styles
42

Byun, Sung-Woo, and Seok-Pil Lee. "Design of a Multi-Condition Emotional Speech Synthesizer." Applied Sciences 11, no. 3 (2021): 1144. http://dx.doi.org/10.3390/app11031144.

Full text
Abstract:
Recently, researchers have developed text-to-speech models based on deep learning, which have produced results superior to those of previous approaches. However, because those systems only mimic the generic speaking style of reference audio, it is difficult to assign user-defined emotional types to synthesized speech. This paper proposes an emotional speech synthesizer constructed by embedding not only speaking styles but also emotional styles. We extend speaker embedding to multi-condition embedding by adding emotional embedding in Tacotron, so that the synthesizer can generate emotional spee
43

Wang, Jiani, Shiran Dudy, Xinlu Hu, Zhiyong Wang, Rosy Southwell, and Jacob Whitehill. "Optimizing Speaker Diarization for the Classroom: Applications in Timing Student Speech and Distinguishing Teachers from Children." Journal of Educational Data Mining 17, no. 1 (2025): 98–125. https://doi.org/10.5281/zenodo.14871875.

Abstract:
An important dimension of classroom group dynamics and collaboration is how much each person contributes to the discussion. With the goal of distinguishing teachers' speech from children's speech and measuring how much each student speaks, we have investigated how automatic speaker diarization can be built to handle real-world classroom group discussions. We examined key design considerations such as the level of granularity of speaker assignment, speech enhancement techniques, voice activity detection, and embedding assignment methods to find an effective configuration. The best speaker diarization …
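One simple embedding-assignment method of the kind such diarization studies compare is nearest-centroid assignment by cosine similarity: each segment embedding is labeled with the speaker whose centroid it is most similar to. A hedged sketch, with function name and toy data invented for illustration (not the cited paper's configuration):

```python
import numpy as np

def assign_segments(segment_embs, centroids):
    """Assign each segment embedding to the speaker centroid with the
    highest cosine similarity (rows: segments / speakers)."""
    S = segment_embs / np.linalg.norm(segment_embs, axis=1, keepdims=True)
    C = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return np.argmax(S @ C.T, axis=1)

segs = np.array([[1.0, 0.1], [0.1, 1.0], [0.9, 0.2]])   # 3 segments
cents = np.array([[1.0, 0.0], [0.0, 1.0]])              # 2 speakers
labels = assign_segments(segs, cents)  # → [0, 1, 0]
```

Real systems typically obtain the centroids by clustering (e.g. agglomerative clustering over the same embeddings) rather than from labeled enrollment data.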
44

Wang, Shuai, Zili Huang, Yanmin Qian, and Kai Yu. "Discriminative Neural Embedding Learning for Short-Duration Text-Independent Speaker Verification." IEEE/ACM Transactions on Audio, Speech, and Language Processing 27, no. 11 (2019): 1686–96. http://dx.doi.org/10.1109/taslp.2019.2928128.

45

Wang, Shuai, Yexin Yang, Zhanghao Wu, Yanmin Qian, and Kai Yu. "Data Augmentation Using Deep Generative Models for Embedding Based Speaker Recognition." IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020): 2598–609. http://dx.doi.org/10.1109/taslp.2020.3016498.

46

YOU, MINGYU, GUO-ZHENG LI, JACK Y. YANG, and MARY QU YANG. "AN ENHANCED LIPSCHITZ EMBEDDING CLASSIFIER FOR MULTI-EMOTION SPEECH ANALYSIS." International Journal of Pattern Recognition and Artificial Intelligence 23, no. 08 (2009): 1685–700. http://dx.doi.org/10.1142/s0218001409007764.

Abstract:
This paper proposes an Enhanced Lipschitz Embedding based Classifier (ELEC) for the classification of multiple emotions from speech signals. ELEC adopts geodesic distance, instead of Euclidean distance, to preserve the intrinsic geometry of the speech corpus at all scales. Based on the minimal geodesic distance to vectors of different emotions, ELEC maps the high-dimensional feature vectors into a lower-dimensional space. By analyzing the class labels of the neighboring training vectors in the compressed low-dimensional space, ELEC classifies the test data into six archetypal emotional states, i.e., neutral, anger, fear, happiness, …
47

CLARIDGE, CLAUDIA, EWA JONSSON, and MERJA KYTÖ. "Entirely innocent: a historical sociopragmatic analysis of maximizers in the Old Bailey Corpus." English Language and Linguistics 24, no. 4 (2019): 855–74. http://dx.doi.org/10.1017/s1360674319000388.

Abstract:
Based on an investigation of the Old Bailey Corpus, this article explores the development and usage patterns of maximizers in Late Modern English (LModE). The maximizers considered for inclusion in the study are based on the lists provided in Quirk et al. (1985) and Huddleston & Pullum (2002). The aims of the study were to (i) document the frequency development of maximizers, (ii) investigate the sociolinguistic embedding of maximizer usage (gender, class) and (iii) analyze the sociopragmatics of maximizers based on the speakers’ roles, such as judge or witness, in the courtroom. Of …
48

Viñals, Ignacio, Alfonso Ortega, Antonio Miguel, and Eduardo Lleida. "An Analysis of the Short Utterance Problem for Speaker Characterization." Applied Sciences 9, no. 18 (2019): 3697. http://dx.doi.org/10.3390/app9183697.

Abstract:
Speaker characterization has always been conditioned by the length of the evaluated utterances. Despite performing well with large amounts of audio, significant degradations in performance are obtained when short utterances are considered. In this work we present an analysis of the short utterance problem, providing an alternative point of view. From our perspective, the performance in the evaluation of short utterances is highly influenced by the phonetic similarity between enrollment and test utterances. Both enrollment and test should contain similar phonemes to properly discriminate, …
49

Kang, Woo Hyun, and Nam Soo Kim. "Unsupervised Learning of Total Variability Embedding for Speaker Verification with Random Digit Strings." Applied Sciences 9, no. 8 (2019): 1597. http://dx.doi.org/10.3390/app9081597.

Abstract:
Recently, the increasing demand for voice-based authentication systems has encouraged researchers to investigate methods for verifying users with short randomized pass-phrases from a constrained vocabulary. The conventional i-vector framework, which has been proven to be a state-of-the-art utterance-level feature extraction technique for speaker verification, is not considered an optimal method for this task, since it is known to suffer from severe performance degradation when dealing with short-duration speech utterances. More recent approaches that implement deep-learning techniques …
50

Qiu, Zeyu, Jun Tang, Yaxin Zhang, Jiaxin Li, and Xishan Bai. "A Voice Cloning Method Based on the Improved HiFi-GAN Model." Computational Intelligence and Neuroscience 2022 (October 11, 2022): 1–12. http://dx.doi.org/10.1155/2022/6707304.

Abstract:
With the aim of adapting a source text-to-speech (TTS) model to synthesize a personal voice from a few speech samples of the target speaker, voice cloning provides a specific TTS service. Although the Tacotron 2-based multi-speaker TTS system can implement voice cloning by introducing a d-vector into the speaker encoder, the speaker characteristics described by the d-vector cannot account for the voice information of the entire utterance. This affects the similarity of voice cloning. As a vocoder, WaveNet sacrifices speech generation speed. To balance the relationship between model parameters …
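A d-vector is commonly obtained by averaging the frame-level outputs of a speaker-encoder network over an utterance and length-normalizing the result. A toy sketch of that pooling step follows; `d_vector` and the toy frame embeddings are illustrative assumptions, not the cited paper's exact recipe:

```python
import numpy as np

def d_vector(frame_embs):
    """Utterance-level d-vector: average frame-level encoder outputs
    (rows), then scale to unit L2 length."""
    v = frame_embs.mean(axis=0)
    return v / np.linalg.norm(v)

frames = np.array([[1.0, 0.0], [0.0, 1.0]])  # toy frame embeddings
dv = d_vector(frames)
```

Because the pooled vector is averaged over whatever frames happen to be present, a d-vector from a short clip can miss speaker traits that only surface elsewhere in the utterance, which is the limitation the abstract alludes to.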