To view the other types of publications on this topic, follow the link: Speaker embedding.

Journal articles on the topic "Speaker embedding"

Consult the top 50 journal articles for your research on the topic "Speaker embedding".

Next to every entry in the bibliography there is an "Add to bibliography" option. Use it, and the bibliographic reference for the chosen work will be formatted automatically in the required citation style (APA, MLA, Harvard, Chicago, Vancouver, etc.).

You can also download the full text of the scholarly publication as a PDF and read an online annotation of the work, provided the relevant parameters are present in the metadata.

Browse the journal articles across a wide range of disciplines and compile your bibliography correctly.

1

Mridha, Muhammad Firoz, Abu Quwsar Ohi, Muhammad Mostafa Monowar, Md Abdul Hamid, Md Rashedul Islam, and Yutaka Watanobe. "U-Vectors: Generating Clusterable Speaker Embedding from Unlabeled Data." Applied Sciences 11, no. 21 (2021): 10079. http://dx.doi.org/10.3390/app112110079.

Annotation:
Speaker recognition deals with recognizing speakers by their speech. Most speaker recognition systems are built upon two stages, the first stage extracts low dimensional correlation embeddings from speech, and the second performs the classification task. The robustness of a speaker recognition system mainly depends on the extraction process of speech embeddings, which are primarily pre-trained on a large-scale dataset. As the embedding systems are pre-trained, the performance of speaker recognition models greatly depends on domain adaptation policy, which may reduce if trained using inadequate
2

Kim, Minsoo, and Gil-Jin Jang. "Speaker-Attributed Training for Multi-Speaker Speech Recognition Using Multi-Stage Encoders and Attention-Weighted Speaker Embedding." Applied Sciences 14, no. 18 (2024): 8138. http://dx.doi.org/10.3390/app14188138.

Annotation:
Automatic speech recognition (ASR) aims at understanding naturally spoken human speech to be used as text inputs to machines. In multi-speaker environments, where multiple speakers are talking simultaneously with a large amount of overlap, a significant performance degradation may occur with conventional ASR systems if they are trained by recordings of single talkers. This paper proposes a multi-speaker ASR method that incorporates speaker embedding information as an additional input. The embedding information for each of the speakers in the training set was extracted as numeric vectors, and a
3

Liu, Elaine M., Jih-Wei Yeh, Jen-Hao Lu, and Yi-Wen Liu. "Speaker embedding space cosine similarity comparisons of singing voice conversion models and voice morphing." Journal of the Acoustical Society of America 154, no. 4_supplement (2023): A244. http://dx.doi.org/10.1121/10.0023424.

Annotation:
We explore the use of cosine similarity between x-vector speaker embeddings as an objective metric to evaluate the effectiveness of singing voice conversion. Our system preprocesses a source singer’s audio to obtain melody features via the F0 contour, loudness curve, and phonetic posteriorgram. These are input to a denoising diffusion probabilistic acoustic model conditioned with another target voice’s speaker embedding to generate a mel spectrogram, which is passed through a HiFi-GAN vocoder to synthesize audio of the source song in the target timbre. We use cosine similarity between the conv
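Several entries in this list use cosine similarity between speaker embeddings as an objective metric. A minimal sketch of that computation in NumPy, with random vectors standing in for real x-vectors (the data here is purely illustrative):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy stand-ins for 512-dimensional x-vectors (hypothetical data).
rng = np.random.default_rng(0)
target = rng.normal(size=512)                    # target speaker's embedding
converted = target + 0.1 * rng.normal(size=512)  # close to the target timbre
other = rng.normal(size=512)                     # unrelated speaker

print(cosine_similarity(converted, target))      # close to 1.0
print(cosine_similarity(other, target))          # near 0 for random vectors
```

A converted utterance whose embedding sits close to the target speaker's embedding scores near 1, which is exactly the property such evaluations rely on.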
4

Pick, Ron Korenblum, Vladyslav Kozhukhov, Dan Vilenchik, and Oren Tsur. "STEM: Unsupervised STructural EMbedding for Stance Detection." Proceedings of the AAAI Conference on Artificial Intelligence 36, no. 10 (2022): 11174–82. http://dx.doi.org/10.1609/aaai.v36i10.21367.

Annotation:
Stance detection is an important task, supporting many downstream tasks such as discourse parsing and modeling the propagation of fake news, rumors, and science denial. In this paper, we propose a novel framework for stance detection. Our framework is unsupervised and domain-independent. Given a claim and a multi-participant discussion, we construct the interaction network from which we derive a topological embedding for each speaker. These speaker embeddings enjoy the following property: speakers with the same stance tend to be represented by similar vectors, while antipodal vectors represent
5

Karamyan, Davit S., and Grigor A. Kirakosyan. "Building a Speaker Diarization System: Lessons from VoxSRC 2023." Mathematical Problems of Computer Science 60 (November 30, 2023): 52–62. http://dx.doi.org/10.51408/1963-0109.

Annotation:
Speaker diarization is the process of partitioning an audio recording into segments corresponding to individual speakers. In this paper, we present a robust speaker diarization system and describe its architecture. We focus on discussing the key components necessary for building a strong diarization system, such as voice activity detection (VAD), speaker embedding, and clustering. Our system emerged as the winner in the Voxceleb Speaker Recognition Challenge (VoxSRC) 2023, a widely recognized competition for evaluating speaker diarization systems.
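The clustering component named above can be illustrated with a simple greedy centroid scheme over segment embeddings. This is a toy stand-in under synthetic data, not the VoxSRC-winning system; the 0.75 threshold is an arbitrary assumption:

```python
import numpy as np

def cluster_segments(embeddings, threshold=0.75):
    """Greedy clustering: assign each segment embedding to the closest
    existing cluster centroid by cosine similarity, else open a new cluster."""
    centroids, counts, labels = [], [], []
    for e in embeddings:
        e = e / np.linalg.norm(e)
        sims = [float(c @ e) / np.linalg.norm(c) for c in centroids]
        if sims and max(sims) >= threshold:
            k = int(np.argmax(sims))
            # Running mean keeps the centroid representative of its members.
            centroids[k] = (centroids[k] * counts[k] + e) / (counts[k] + 1)
            counts[k] += 1
        else:
            k = len(centroids)
            centroids.append(e)
            counts.append(1)
        labels.append(k)
    return labels

# Synthetic segments from two "speakers": noisy copies of two directions.
rng = np.random.default_rng(1)
a, b = rng.normal(size=64), rng.normal(size=64)
segs = [a + 0.05 * rng.normal(size=64) for _ in range(3)] \
     + [b + 0.05 * rng.normal(size=64) for _ in range(3)]
print(cluster_segments(segs))  # two clusters, e.g. [0, 0, 0, 1, 1, 1]
```

Real systems typically use spectral or agglomerative clustering instead, but the principle of grouping segments by embedding similarity is the same.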
6

Milewski, Krzysztof, Szymon Zaporowski, and Andrzej Czyżewski. "Comparison of the Ability of Neural Network Model and Humans to Detect a Cloned Voice." Electronics 12, no. 21 (2023): 4458. http://dx.doi.org/10.3390/electronics12214458.

Annotation:
The vulnerability of the speaker identity verification system to attacks using voice cloning was examined. The research project assumed creating a model for verifying the speaker’s identity based on voice biometrics and then testing its resistance to potential attacks using voice cloning. The Deep Speaker Neural Speaker Embedding System was trained, and the Real-Time Voice Cloning system was employed based on the SV2TTS, Tacotron, WaveRNN, and GE2E neural networks. The results of attacks using voice cloning were analyzed and discussed in the context of a subjective assessment of cloned voice f
7

Kang, Woo Hyun, Sung Hwan Mun, Min Hyun Han, and Nam Soo Kim. "Disentangled Speaker and Nuisance Attribute Embedding for Robust Speaker Verification." IEEE Access 8 (2020): 141838–49. http://dx.doi.org/10.1109/access.2020.3012893.

8

Poojary, Nigam R., and K. H. Ashish. "Text To Speech with Custom Voice." International Journal for Research in Applied Science and Engineering Technology 11, no. 4 (2023): 4523–30. http://dx.doi.org/10.22214/ijraset.2023.51217.

Annotation:
Abstract: The Text to Speech with Custom Voice system described in this work has vast applicability in numerous industries, including entertainment, education, and accessibility. The proposed text-to-speech (TTS) system is capable of generating speech audio in custom voices, even those not included in the training data. The system comprises a speaker encoder, a synthesizer, and a WaveRNN vocoder. Multiple speakers from a dataset of clean speech without transcripts are used to train the speaker encoder for a speaker verification process. The reference speech of the target speaker is used to cre
9

Lee, Kong Aik, Qiongqiong Wang, and Takafumi Koshinaka. "Xi-Vector Embedding for Speaker Recognition." IEEE Signal Processing Letters 28 (2021): 1385–89. http://dx.doi.org/10.1109/lsp.2021.3091932.

10

Sečujski, Milan, Darko Pekar, Siniša Suzić, Anton Smirnov, and Tijana Nosek. "Speaker/Style-Dependent Neural Network Speech Synthesis Based on Speaker/Style Embedding." JUCS - Journal of Universal Computer Science 26, no. 4 (2020): 434–53. http://dx.doi.org/10.3897/jucs.2020.023.

Annotation:
The paper presents a novel architecture and method for training neural networks to produce synthesized speech in a particular voice and speaking style, based on a small quantity of target speaker/style training data. The method is based on neural network embedding, i.e. mapping of discrete variables into continuous vectors in a low-dimensional space, which has been shown to be a very successful universal deep learning technique. In this particular case, different speaker/style combinations are mapped into different points in a low-dimensional space, which enables the network to capture the sim
12

Chadchankar, Mrs Asharani. "Advancements in Speaker-Independent Speech Separation Using Deep Attractor Networks." International Journal for Research in Applied Science and Engineering Technology 13, no. 5 (2025): 4056–61. https://doi.org/10.22214/ijraset.2025.71160.

Annotation:
Speaker-independent speech separation, the task of isolating individual voices from a mixture without prior knowledge of the speakers, has gained significant attention due to its importance in various applications. However, challenges such as the arbitrary order of speakers and the unknown number of speakers in a mixture remain significant hurdles. This research paper analyzes Deep Attractor Networks (DANet), a novel deep learning framework designed to address these issues. DANet projects mixed speech signals into a high-dimensional embedding space where reference points, known as attractors,
13

Bae, Ara, and Wooil Kim. "Speaker Verification Employing Combinations of Self-Attention Mechanisms." Electronics 9, no. 12 (2020): 2201. http://dx.doi.org/10.3390/electronics9122201.

Annotation:
One of the most recent speaker recognition methods that demonstrates outstanding performance in noisy environments involves extracting the speaker embedding using an attention mechanism instead of average or statistics pooling. In the attention method, speaker recognition performance is improved by employing multiple heads rather than a single head. In this paper, we propose advanced methods to extract a new embedding by compensating for the disadvantages of the single-head and multi-head attention methods. The combination method comprising single-head and split-based multi-head attentions sh
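The contrast between average pooling and attention-based pooling described above fits in a few lines of NumPy; the weight vector `w` stands in for a learned parameter and is an assumption of this sketch:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def attentive_pooling(frames, w):
    """Pool frame-level features (T, D) into a single utterance-level
    vector (D,) using one scalar attention score per frame."""
    alpha = softmax(frames @ w)   # (T,) attention weights summing to 1
    return alpha @ frames         # weighted sum over the T frames

rng = np.random.default_rng(2)
frames = rng.normal(size=(100, 32))   # 100 frames of 32-dim features
w = rng.normal(size=32)               # stand-in for a learned parameter

emb_att = attentive_pooling(frames, w)
emb_avg = frames.mean(axis=0)         # plain average pooling baseline
# With w = 0 every frame gets equal weight and the two embeddings coincide.
print(emb_att.shape, emb_avg.shape)   # (32,) (32,)
```

Multi-head variants simply run several such score vectors in parallel and concatenate the pooled results.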
14

Wirdiani, Ayu, Steven Ndung'u Machetho, I. Ketut Gede Darma Putra, Made Sudarma, Rukmi Sari Hartati, and Henrico Aldy Ferdian. "Improvement Model for Speaker Recognition using MFCC-CNN and Online Triplet Mining." International Journal on Advanced Science, Engineering and Information Technology 14, no. 2 (2024): 420–27. http://dx.doi.org/10.18517/ijaseit.14.2.19396.

Annotation:
Various biometric security systems, such as face recognition, fingerprint, voice, hand geometry, and iris, have been developed. Apart from being a communication medium, the human voice is also a form of biometrics that can be used for identification. Voice has unique characteristics that can be used as a differentiator between one person and another. A sound speaker recognition system must be able to pick up the features that characterize a person's voice. This study aims to develop a human speaker recognition system using the Convolutional Neural Network (CNN) method. This research proposes i
15

Li, Xiao, Xiao Chen, Rui Fu, Xiao Hu, Mintong Chen, and Kun Niu. "Learning Deep Embedding with Acoustic and Phoneme Features for Speaker Recognition in FM Broadcasting." IET Biometrics 2024 (March 22, 2024): 1–10. http://dx.doi.org/10.1049/2024/6694481.

Annotation:
Text-independent speaker verification (TI-SV) is a crucial task in speaker recognition, as it involves verifying an individual’s claimed identity from speech of arbitrary content without any human intervention. The target for TI-SV is to design a discriminative network to learn deep speaker embedding for speaker idiosyncrasy. In this paper, we propose a deep speaker embedding learning approach of a hybrid deep neural network (DNN) for TI-SV in FM broadcasting. Not only acoustic features are utilized, but also phoneme features are introduced as prior knowledge to collectively learn deep speaker
16

Pan, Weijun, Shenhao Chen, Yidi Wang, Sheng Chen, and Xuan Wang. "The Speaker Identification Model for Air-Ground Communication Based on a Parallel Branch Architecture." Applied Sciences 15, no. 6 (2025): 2994. https://doi.org/10.3390/app15062994.

Annotation:
This study addresses the challenges of complex noise and short speech in civil aviation air-ground communication scenarios and proposes a novel speaker identification model, Chrono-ECAPA-TDNN (CET). The aim of the study is to enhance the accuracy and robustness of speaker identification in these environments. The CET model incorporates three key components: the Chrono Block module, the speaker embedding extraction module, and the optimized loss function module. The Chrono Block module utilizes parallel branching architecture, Bi-LSTM, and multi-head attention mechanisms to effectively extract
17

Brydinskyi, Vitalii, Yuriy Khoma, Dmytro Sabodashko, et al. "Comparison of Modern Deep Learning Models for Speaker Verification." Applied Sciences 14, no. 4 (2024): 1329. http://dx.doi.org/10.3390/app14041329.

Annotation:
This research presents an extensive comparative analysis of a selection of popular deep speaker embedding models, namely WavLM, TitaNet, ECAPA, and PyAnnote, applied in speaker verification tasks. The study employs a specially curated dataset, specifically designed to mirror the real-world operating conditions of voice models as accurately as possible. This dataset includes short, non-English statements gathered from interviews on a popular online video platform. The dataset features a wide range of speakers, with 33 males and 17 females, making a total of 50 unique voices. These speakers vary
18

Lin, Weiwei, and Man-Wai Mak. "Mixture Representation Learning for Deep Speaker Embedding." IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2022): 968–78. http://dx.doi.org/10.1109/taslp.2022.3153270.

19

Ghorbani, Shahram, and John H. L. Hansen. "Advanced accent/dialect identification and accentedness assessment with multi-embedding models and automatic speech recognition." Journal of the Acoustical Society of America 155, no. 6 (2024): 3848–60. http://dx.doi.org/10.1121/10.0026235.

Annotation:
The ability to accurately classify accents and assess accentedness in non-native speakers are challenging tasks due primarily to the complexity and diversity of accent and dialect variations. In this study, embeddings from advanced pretrained language identification (LID) and speaker identification (SID) models are leveraged to improve the accuracy of accent classification and non-native accentedness assessment. Findings demonstrate that employing pretrained LID and SID models effectively encodes accent/dialect information in speech. Furthermore, the LID and SID encoded accent information comp
20

Khoma, Volodymyr, Yuriy Khoma, Vitalii Brydinskyi, and Alexander Konovalov. "Development of Supervised Speaker Diarization System Based on the PyAnnote Audio Processing Library." Sensors 23, no. 4 (2023): 2082. http://dx.doi.org/10.3390/s23042082.

Annotation:
Diarization is an important task when working with audio data, as it addresses the need to divide one analyzed call recording into several speech recordings, each of which belongs to one speaker. Diarization systems segment audio recordings by defining the time boundaries of utterances, and typically use unsupervised methods to group utterances belonging to individual speakers, but they do not answer the question "who is speaking?" On the other hand, there are biometric systems that identify individuals on the basis of their voices, but such systems are
21

Bahmaninezhad, Fahimeh, Chunlei Zhang, and John H. L. Hansen. "An investigation of domain adaptation in speaker embedding space for speaker recognition." Speech Communication 129 (May 2021): 7–16. http://dx.doi.org/10.1016/j.specom.2021.01.001.

22

Zeng, Bang, and Ming Li. "Universal Speaker Embedding Free Target Speaker Extraction and Personal Voice Activity Detection." Computer Speech & Language 94 (November 2025): 101807. https://doi.org/10.1016/j.csl.2025.101807.

23

Xylogiannis, Paris, Nikolaos Vryzas, Lazaros Vrysis, and Charalampos Dimoulas. "Multisensory Fusion for Unsupervised Spatiotemporal Speaker Diarization." Sensors 24, no. 13 (2024): 4229. http://dx.doi.org/10.3390/s24134229.

Annotation:
Speaker diarization consists of answering the question of “who spoke when” in audio recordings. In meeting scenarios, the task of labeling audio with the corresponding speaker identities can be further assisted by the exploitation of spatial features. This work proposes a framework designed to assess the effectiveness of combining speaker embeddings with Time Difference of Arrival (TDOA) values from available microphone sensor arrays in meetings. We extract speaker embeddings using two popular and robust pre-trained models, ECAPA-TDNN and X-vectors, and calculate the TDOA values via the Genera
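TDOA values of the kind used above are commonly estimated with the generalized cross-correlation with phase transform (GCC-PHAT); here is a minimal NumPy sketch, assuming a clean integer-sample delay between two channels (an illustration, not the paper's implementation):

```python
import numpy as np

def gcc_phat(sig: np.ndarray, ref: np.ndarray) -> int:
    """Estimate the delay (in samples) of `sig` relative to `ref` using
    GCC-PHAT: whiten the cross-power spectrum, then locate the peak of
    its inverse FFT."""
    n = len(sig) + len(ref)
    spec = np.fft.rfft(sig, n) * np.conj(np.fft.rfft(ref, n))
    spec /= np.abs(spec) + 1e-12           # phase transform (whitening)
    cc = np.fft.irfft(spec, n)
    shift = int(np.argmax(np.abs(cc)))
    return shift if shift < n // 2 else shift - n  # map to signed lag

rng = np.random.default_rng(3)
ref = rng.normal(size=2048)                # signal at the first microphone
delayed = np.roll(ref, 5)                  # second mic hears it 5 samples later
print(gcc_phat(delayed, ref))              # estimated delay in samples
```

The phase transform discards magnitude information, which makes the correlation peak sharp and robust to reverberation; the lag in samples divided by the sampling rate gives the TDOA in seconds.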
24

Shahin Shamsabadi, Ali, Brij Mohan Lal Srivastava, Aurélien Bellet, et al. "Differentially Private Speaker Anonymization." Proceedings on Privacy Enhancing Technologies 2023, no. 1 (2023): 98–114. http://dx.doi.org/10.56553/popets-2023-0007.

Annotation:
Sharing real-world speech utterances is key to the training and deployment of voice-based services. However, it also raises privacy risks as speech contains a wealth of personal data. Speaker anonymization aims to remove speaker information from a speech utterance while leaving its linguistic and prosodic attributes intact. State-of-the-art techniques operate by disentangling the speaker information (represented via a speaker embedding) from these attributes and re-synthesizing speech based on the speaker embedding of another speaker. Prior research in the privacy community has shown that anon
25

Li, Wenjie, Pengyuan Zhang, and Yonghong Yan. "TEnet: target speaker extraction network with accumulated speaker embedding for automatic speech recognition." Electronics Letters 55, no. 14 (2019): 816–19. http://dx.doi.org/10.1049/el.2019.1228.

26

Xie, Fei, Dalong Zhang, and Chengming Liu. "Global–Local Self-Attention Based Transformer for Speaker Verification." Applied Sciences 12, no. 19 (2022): 10154. http://dx.doi.org/10.3390/app121910154.

Annotation:
Transformer models are now widely used for speech processing tasks due to their powerful sequence modeling capabilities. Previous work determined an efficient way to model speaker embeddings using the Transformer model by combining transformers with convolutional networks. However, traditional global self-attention mechanisms lack the ability to capture local information. To alleviate these problems, we proposed a novel global–local self-attention mechanism. Instead of using local or global multi-head attention alone, this method performs local and global attention in parallel in two parallel
27

Shim, Hye-jin, Jee-weon Jung, and Ha-Jin Yu. "Which to select?: Analysis of speaker representation with graph attention networks." Journal of the Acoustical Society of America 156, no. 4 (2024): 2701–8. http://dx.doi.org/10.1121/10.0032393.

Annotation:
Although the recent state-of-the-art systems show almost perfect performance, analysis of speaker embeddings has been lacking thus far. An in-depth analysis of speaker representation will be performed by looking into which features are selected. To this end, various intermediate representations of the trained model are observed using graph attentive feature aggregation, which includes a graph attention layer and graph pooling layer followed by a readout operation. To do so, the TIMIT dataset, which has comparably restricted conditions (e.g., the region and phoneme) is used after pre-training t
28

Guo, Xin, Chengfang Luo, Aiwen Deng, and Feiqi Deng. "DeltaVLAD: An efficient optimization algorithm to discriminate speaker embedding for text-independent speaker verification." AIMS Mathematics 7, no. 4 (2022): 6381–95. http://dx.doi.org/10.3934/math.2022355.

Annotation:
Text-independent speaker verification aims to determine whether two given utterances in an open-set task originate from the same speaker or not. In this paper, some ways are explored to enhance the discrimination of embeddings in speaker verification. Firstly, a difference operation is used in the coding layer to process speaker features, forming the DeltaVLAD layer. The frame-level speaker representation is extracted by the deep neural network with differential operations to calculate the dynamic changes between frames, which is more conducive to capturing insignificant changes in t
29

Prabhala, Jagat Chaitanya, Venkatnareshbabu K, and Ragoju Ravi. "OPTIMIZING SIMILARITY THRESHOLD FOR ABSTRACT SIMILARITY METRIC IN SPEECH DIARIZATION SYSTEMS: A MATHEMATICAL FORMULATION." Applied Mathematics and Sciences An International Journal (MathSJ) 10, no. 1/2 (2023): 1–10. http://dx.doi.org/10.5121/mathsj.2023.10201.

Annotation:
Speaker diarization is a critical task in speech processing that aims to identify "who spoke when?" in an audio or video recording that contains unknown amounts of speech from unknown speakers and unknown number of speakers. Diarization has numerous applications in speech recognition, speaker identification, and automatic captioning. Supervised and unsupervised algorithms are used to address speaker diarization problems, but providing exhaustive labeling for the training dataset can become costly in supervised learning, while accuracy can be compromised when using unsupervised approaches. This
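Optimizing a similarity threshold of the kind discussed above can be framed as a sweep that minimizes false accepts plus false rejects over held-out score distributions; a toy illustration with synthetic Gaussian scores (the distributions are assumptions, not the paper's formulation):

```python
import numpy as np

def best_threshold(same: np.ndarray, diff: np.ndarray) -> float:
    """Sweep candidate thresholds over all observed scores and return the
    one minimizing total errors (false rejects + false accepts)."""
    candidates = np.sort(np.concatenate([same, diff]))
    errors = [np.sum(same < t) + np.sum(diff >= t) for t in candidates]
    return float(candidates[int(np.argmin(errors))])

# Synthetic similarity scores: same-speaker pairs score high, different low.
rng = np.random.default_rng(4)
same = rng.normal(0.8, 0.05, 500)   # same-speaker similarity scores
diff = rng.normal(0.2, 0.05, 500)   # different-speaker similarity scores
t = best_threshold(same, diff)
print(round(t, 2))                  # a value between the two score modes
```

With well-separated distributions the sweep lands between the two modes; on real, overlapping score distributions the same sweep yields the error-minimizing operating point instead of a zero-error one.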
30

Smith, Sierra Rose, Patricia Crist, Rebekah Givens, Taylor Stringer, and Adriana Macdonald. "Interviews Regarding Practice Scholar Engagement: Practitioners’ Descriptions of Their Research Motivations, Characteristics, Resources, & Outcomes." American Journal of Occupational Therapy 78, Supplement_2 (2024): 7811500214p1. http://dx.doi.org/10.5014/ajot.2024.78s2-po214.

Annotation:
Abstract. Date presented: 03/22/24. Embedding practice scholarship in daily work is challenging for practitioners despite being emphasized in the American Occupational Therapy Association's Vision 2025 and mission statements. This presentation defines and provides strategies used by active practice scholars. Primary author and speaker: Sierra Rose Smith. Additional authors and speakers: Adriana Macdonald. Contributing authors: Patricia Crist, Rebekah Givens, Taylor Stringer.
31

Mingote, Victoria, Antonio Miguel, Alfonso Ortega, and Eduardo Lleida. "Supervector Extraction for Encoding Speaker and Phrase Information with Neural Networks for Text-Dependent Speaker Verification." Applied Sciences 9, no. 16 (2019): 3295. http://dx.doi.org/10.3390/app9163295.

Annotation:
In this paper, we propose a new differentiable neural network with an alignment mechanism for text-dependent speaker verification. Unlike previous works, we do not extract the embedding of an utterance from the global average pooling of the temporal dimension. Our system replaces this reduction mechanism by a phonetic phrase alignment model to keep the temporal structure of each phrase since the phonetic information is relevant in the verification task. Moreover, we can apply a convolutional neural network as front-end, and, thanks to the alignment process being differentiable, we can train th
32

Lyu, Ke-Ming, Ren-yuan Lyu, and Hsien-Tsung Chang. "Real-time multilingual speech recognition and speaker diarization system based on Whisper segmentation." PeerJ Computer Science 10 (March 29, 2024): e1973. http://dx.doi.org/10.7717/peerj-cs.1973.

Annotation:
This research presents the development of a cutting-edge real-time multilingual speech recognition and speaker diarization system that leverages OpenAI’s Whisper model. The system specifically addresses the challenges of automatic speech recognition (ASR) and speaker diarization (SD) in dynamic, multispeaker environments, with a focus on accurately processing Mandarin speech with Taiwanese accents and managing frequent speaker switches. Traditional speech recognition systems often fall short in such complex multilingual and multispeaker contexts, particularly in SD. This study, therefore, inte
33

LIANG, Chunyan, Lin YANG, Qingwei ZHAO, and Yonghong YAN. "Factor Analysis of Neighborhood-Preserving Embedding for Speaker Verification." IEICE Transactions on Information and Systems E95.D, no. 10 (2012): 2572–76. http://dx.doi.org/10.1587/transinf.e95.d.2572.

34

Byun, Jaeuk, and Jong Won Shin. "Monaural Speech Separation Using Speaker Embedding From Preliminary Separation." IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021): 2753–63. http://dx.doi.org/10.1109/taslp.2021.3101617.

35

Lin, Weiwei, Man-Wai Mak, Na Li, Dan Su, and Dong Yu. "A Framework for Adapting DNN Speaker Embedding Across Languages." IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020): 2810–22. http://dx.doi.org/10.1109/taslp.2020.3030499.

36

Misbullah, Alim, Muhammad Saifullah Sani, Husaini, Laina Farsiah, Zahnur, and Kikye Martiwi Sukiakhy. "Sistem Identifikasi Pembicara Berbahasa Indonesia Menggunakan X-Vector Embedding." Jurnal Teknologi Informasi dan Ilmu Komputer 11, no. 2 (2024): 369–76. http://dx.doi.org/10.25126/jtiik.20241127866.

Annotation:
Speaker embeddings are vectors that have proven effective at representing speaker characteristics, yielding high accuracy in the field of speaker recognition. This research focuses on applying x-vectors as speaker embeddings in an Indonesian-language speaker identification system that uses a speaker identification model. The model was built using the VoxCeleb dataset as training data and the INF19 dataset, collected from the voices of 2019-intake students of the Department of Informatics at Universitas Syiah Kuala, as test data. To build the model, the features were extr
37

Li, Yanxiong, Qisheng Huang, Xiaofen Xing, and Xiangmin Xu. "Low-complexity speaker embedding module with feature segmentation, transformation and reconstruction for few-shot speaker identification." Expert Systems with Applications 280 (June 2025): 127542. https://doi.org/10.1016/j.eswa.2025.127542.

38

Zhou, Yi, Xiaohai Tian, and Haizhou Li. "Language Agnostic Speaker Embedding for Cross-Lingual Personalized Speech Generation." IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 (2021): 3427–39. http://dx.doi.org/10.1109/taslp.2021.3125142.

39

杨, 益灵. "Multi-Speaker Indonesian Speech Synthesis Based on Global Style Embedding." Computer Science and Application 13, no. 01 (2023): 126–35. http://dx.doi.org/10.12677/csa.2023.131013.

40

Kim, Ju-Ho, Hye-Jin Shim, Jee-Weon Jung, and Ha-Jin Yu. "A Supervised Learning Method for Improving the Generalization of Speaker Verification Systems by Learning Metrics from a Mean Teacher." Applied Sciences 12, no. 1 (2021): 76. http://dx.doi.org/10.3390/app12010076.

Annotation:
The majority of recent speaker verification tasks are studied under open-set evaluation scenarios considering real-world conditions. The characteristics of these tasks imply that the generalization towards unseen speakers is a critical capability. Thus, this study aims to improve the generalization of the system for the performance enhancement of speaker verification. To achieve this goal, we propose a novel supervised-learning-method-based speaker verification system using the mean teacher framework. The mean teacher network refers to the temporal averaging of deep neural network parameters,
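The temporal averaging of network parameters mentioned above is an exponential moving average (EMA); the core update can be sketched on plain NumPy arrays (the 0.99 decay is an assumed value, and real implementations apply this to framework weight tensors):

```python
import numpy as np

def ema_update(teacher, student, decay=0.99):
    """Temporal averaging: teacher <- decay * teacher + (1 - decay) * student,
    applied in place to each parameter array."""
    for t, s in zip(teacher, student):
        t *= decay
        t += (1.0 - decay) * s

# Two-layer toy parameter sets; in practice these are network weights.
student = [np.ones((4, 4)), np.ones(4)]
teacher = [np.zeros((4, 4)), np.zeros(4)]

for _ in range(100):            # student held fixed here for illustration
    ema_update(teacher, student)

print(float(teacher[0][0, 0]))  # 1 - 0.99**100, roughly 0.63
```

Because the teacher averages the student over many steps, its predictions are smoother and serve as more stable targets, which is the generalization benefit the paper builds on.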
41

Seo, Soonshin, and Ji-Hwan Kim. "Self-Attentive Multi-Layer Aggregation with Feature Recalibration and Deep Length Normalization for Text-Independent Speaker Verification System." Electronics 9, no. 10 (2020): 1706. http://dx.doi.org/10.3390/electronics9101706.

Annotation:
One of the most important parts of a text-independent speaker verification system is speaker embedding generation. Previous studies demonstrated that shortcut connections-based multi-layer aggregation improves the representational power of a speaker embedding system. However, model parameters are relatively large in number, and unspecified variations increase in the multi-layer aggregation. Therefore, in this study, we propose a self-attentive multi-layer aggregation with feature recalibration and deep length normalization for a text-independent speaker verification system. To reduce the numbe
42

Byun, Sung-Woo, and Seok-Pil Lee. "Design of a Multi-Condition Emotional Speech Synthesizer." Applied Sciences 11, no. 3 (2021): 1144. http://dx.doi.org/10.3390/app11031144.

Annotation:
Recently, researchers have developed text-to-speech models based on deep learning, which have produced results superior to those of previous approaches. However, because those systems only mimic the generic speaking style of reference audio, it is difficult to assign user-defined emotional types to synthesized speech. This paper proposes an emotional speech synthesizer constructed by embedding not only speaking styles but also emotional styles. We extend speaker embedding to multi-condition embedding by adding emotional embedding in Tacotron, so that the synthesizer can generate emotional speech …
43

Wang, Jiani, Shiran Dudy, Xinlu Hu, Zhiyong Wang, Rosy Southwell, and Jacob Whitehill. "Optimizing Speaker Diarization for the Classroom: Applications in Timing Student Speech and Distinguishing Teachers from Children." Journal of Educational Data Mining 17, no. 1 (2025): 98–125. https://doi.org/10.5281/zenodo.14871875.

Annotation:
An important dimension of classroom group dynamics and collaboration is how much each person contributes to the discussion. With the goal of distinguishing teachers' speech from children's speech and measuring how much each student speaks, we have investigated how automatic speaker diarization can be built to handle real-world classroom group discussions. We examined key design considerations such as the level of granularity of speaker assignment, speech enhancement techniques, voice activity detection, and embedding assignment methods to find an effective configuration. The best speaker diarization …
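One of the design considerations listed, embedding assignment, can in its simplest form be nearest-centroid labeling of segment embeddings. A hypothetical sketch (the centroids, labels, and 2-D vectors are invented for illustration and are not the paper's configuration):

```python
# Toy "embedding assignment" step from a diarization pipeline:
# label each segment with the speaker whose centroid is nearest.

def assign(segment, centroids):
    """Return the speaker whose centroid has the smallest squared
    Euclidean distance to the segment embedding."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(centroids, key=lambda s: sqdist(segment, centroids[s]))

centroids = {"teacher": [1.0, 0.0], "student": [0.0, 1.0]}
label = assign([0.9, 0.1], centroids)  # nearest centroid: "teacher"
```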
44

Wang, Shuai, Zili Huang, Yanmin Qian, and Kai Yu. "Discriminative Neural Embedding Learning for Short-Duration Text-Independent Speaker Verification." IEEE/ACM Transactions on Audio, Speech, and Language Processing 27, no. 11 (2019): 1686–96. http://dx.doi.org/10.1109/taslp.2019.2928128.

45

Wang, Shuai, Yexin Yang, Zhanghao Wu, Yanmin Qian, and Kai Yu. "Data Augmentation Using Deep Generative Models for Embedding Based Speaker Recognition." IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020): 2598–609. http://dx.doi.org/10.1109/taslp.2020.3016498.

46

YOU, MINGYU, GUO-ZHENG LI, JACK Y. YANG, and MARY QU YANG. "AN ENHANCED LIPSCHITZ EMBEDDING CLASSIFIER FOR MULTI-EMOTION SPEECH ANALYSIS." International Journal of Pattern Recognition and Artificial Intelligence 23, no. 08 (2009): 1685–700. http://dx.doi.org/10.1142/s0218001409007764.

Annotation:
This paper proposes an Enhanced Lipschitz Embedding based Classifier (ELEC) for the classification of multi-emotions from speech signals. ELEC adopts geodesic distance to preserve the intrinsic geometry at all scales of speech corpus, instead of Euclidean distance. Based on the minimal geodesic distance to vectors of different emotions, ELEC maps the high dimensional feature vectors into a lower space. Through analyzing the class labels of the neighbor training vectors in the compressed low space, ELEC classifies the test data into six archetypal emotional states, i.e., neutral, anger, fear, happiness …
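Geodesic distance of the kind ELEC uses is typically approximated as the shortest-path distance over a neighborhood graph (the Isomap construction), rather than the straight-line Euclidean distance. A small sketch with an invented toy graph, using Floyd-Warshall for the all-pairs shortest paths:

```python
# Geodesic (graph shortest-path) distance sketch: points only
# connect through neighborhood edges, so the distance from a to c
# is forced to go through b even if a straight line were shorter.

INF = float("inf")

def geodesic_distances(nodes, edges):
    """Floyd-Warshall all-pairs shortest paths over an undirected
    weighted graph given as {(u, v): weight}."""
    d = {(a, b): (0.0 if a == b else INF) for a in nodes for b in nodes}
    for (a, b), w in edges.items():
        d[(a, b)] = min(d[(a, b)], w)
        d[(b, a)] = min(d[(b, a)], w)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if d[(i, k)] + d[(k, j)] < d[(i, j)]:
                    d[(i, j)] = d[(i, k)] + d[(k, j)]
    return d

nodes = ["a", "b", "c"]
edges = {("a", "b"): 1.0, ("b", "c"): 1.0}  # a reaches c only via b
d = geodesic_distances(nodes, edges)
# d[("a", "c")] is 2.0: the path a -> b -> c
```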
47

CLARIDGE, CLAUDIA, EWA JONSSON, and MERJA KYTÖ. "Entirely innocent: a historical sociopragmatic analysis of maximizers in the Old Bailey Corpus." English Language and Linguistics 24, no. 4 (2019): 855–74. http://dx.doi.org/10.1017/s1360674319000388.

Annotation:
Based on an investigation of the Old Bailey Corpus, this article explores the development and usage patterns of maximizers in Late Modern English (LModE). The maximizers to be considered for inclusion in the study are based on the lists provided in Quirk et al. (1985) and Huddleston & Pullum (2002). The aims of the study were to (i) document the frequency development of maximizers, (ii) investigate the sociolinguistic embedding of maximizer usage (gender, class) and (iii) analyze the sociopragmatics of maximizers based on the speakers’ roles, such as judge or witness, in the courtroom. Of …
48

Viñals, Ignacio, Alfonso Ortega, Antonio Miguel, and Eduardo Lleida. "An Analysis of the Short Utterance Problem for Speaker Characterization." Applied Sciences 9, no. 18 (2019): 3697. http://dx.doi.org/10.3390/app9183697.

Annotation:
Speaker characterization has always been conditioned by the length of the evaluated utterances. Despite performing well with large amounts of audio, significant degradations in performance are obtained when short utterances are considered. In this work we present an analysis of the short utterance problem providing an alternative point of view. From our perspective the performance in the evaluation of short utterances is highly influenced by the phonetic similarity between enrollment and test utterances. Both enrollment and test should contain similar phonemes to properly discriminate, being d…
49

Kang, Woo Hyun, and Nam Soo Kim. "Unsupervised Learning of Total Variability Embedding for Speaker Verification with Random Digit Strings." Applied Sciences 9, no. 8 (2019): 1597. http://dx.doi.org/10.3390/app9081597.

Annotation:
Recently, the increasing demand for voice-based authentication systems has encouraged researchers to investigate methods for verifying users with short randomized pass-phrases with constrained vocabulary. The conventional i-vector framework, which has been proven to be a state-of-the-art utterance-level feature extraction technique for speaker verification, is not considered to be an optimal method for this task since it is known to suffer from severe performance degradation when dealing with short-duration speech utterances. More recent approaches that implement deep-learning techniques for e…
50

Qiu, Zeyu, Jun Tang, Yaxin Zhang, Jiaxin Li, and Xishan Bai. "A Voice Cloning Method Based on the Improved HiFi-GAN Model." Computational Intelligence and Neuroscience 2022 (October 11, 2022): 1–12. http://dx.doi.org/10.1155/2022/6707304.

Annotation:
With the aim of adapting a source Text to Speech (TTS) model to synthesize a personal voice by using a few speech samples from the target speaker, voice cloning provides a specific TTS service. Although the Tacotron 2-based multi-speaker TTS system can implement voice cloning by introducing a d-vector into the speaker encoder, the speaker characteristics described by the d-vector cannot allow for the voice information of the entire utterance. This affects the similarity of voice cloning. As a vocoder, WaveNet sacrifices speech generation speed. To balance the relationship between model parameters …
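A d-vector of the kind referenced here is conventionally the average of frame-level speaker embeddings, and two d-vectors are compared by cosine similarity. A minimal sketch with toy vectors (not the paper's model or data):

```python
import math

def d_vector(frame_embeddings):
    """Average frame-level embeddings into one utterance-level
    speaker embedding (the common d-vector construction)."""
    n = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(f[i] for f in frame_embeddings) / n for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

dv = d_vector([[1.0, 0.0], [0.0, 1.0]])  # mean of the two frames
sim = cosine(dv, [1.0, 1.0])             # same direction, so close to 1
```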