To view other types of publications on this topic, follow this link: Speech Communication. Engineering, Electronics and Electrical.

Dissertations on the topic "Speech Communication. Engineering, Electronics and Electrical"

Cite a source in APA, MLA, Chicago, Harvard, and other citation styles

Select a type of source:

Consult the top 50 dissertations for your research on the topic "Speech Communication. Engineering, Electronics and Electrical."

Next to every entry in the bibliography there is an "Add to bibliography" option. Use it, and your citation of the selected work is generated automatically in the required citation style (APA, MLA, Harvard, Chicago, Vancouver, etc.).

You can also download the full text of the publication as a PDF and read its online abstract, if the relevant details are available in the metadata.

Browse dissertations from a wide variety of disciplines and compile an accurate bibliography.

1

Othman, Noor Shamsiah. „Wireless speech and audio communications“. Thesis, University of Southampton, 2008. https://eprints.soton.ac.uk/64488/.

Annotation:
The limited applicability of Shannon’s separation theorem in practical speech/audio systems motivates the employment of joint source and channel coding techniques. Thus, considerable efforts have been invested in designing various types of joint source and channel coding schemes. This thesis discusses two different types of Joint Source and Channel Coding (JSCC) schemes, namely Unequal Error Protection (UEP) aided turbo transceivers as well as Iterative Source and Channel Decoding (ISCD) exploiting the residual redundancy inherent in the source encoded parameters. More specifically, in Chapter 2, two different UEP JSCC philosophies were designed for wireless audio and speech transmissions, namely a turbo-detected UEP scheme using twin-class convolutional codes and another turbo detector using more sophisticated Irregular Convolutional Codes (IRCC). In our investigations, the MPEG-4 Advanced Audio Coding (AAC), the MPEG-4 Transform-Domain Weighted Interleaved Vector Quantization (TwinVQ) and the Adaptive MultiRate WideBand (AMR-WB) audio/speech codecs were incorporated in the sophisticated UEP turbo transceiver, which consisted of a three-stage serially concatenated scheme constituted by Space-Time Trellis Coding (STTC), Trellis Coded Modulation (TCM) and two different-rate Non-Systematic Convolutional codes (NSCs) used for UEP. Explicitly, both the MPEG-4 TwinVQ and the AMR-WB audio/speech schemes assisted by the twin-class UEP turbo transceiver outperformed their corresponding single-class audio/speech benchmarkers by approximately 0.5 dB, in terms of the required Eb/N0, when communicating over uncorrelated Rayleigh fading channels. By contrast, when employing the MPEG-4 AAC audio codec and protecting the class-1 audio bits using a 2/3-rate NSC code, a more substantial Eb/N0 gain of about 2 dB was achieved. As a further design alternative, we also proposed a turbo transceiver employing IRCCs for the sake of providing UEP for the AMR-WB speech codec. The resultant UEP schemes exhibited a better performance when compared to the corresponding Equal Error Protection (EEP) benchmark schemes, since the former protected the audio/speech bits according to their sensitivity. The proposed UEP aided system using IRCCs exhibits an Eb/N0 gain of about 0.4 dB over the EEP system employing regular convolutional codes, when communicating over AWGN channels, at the point of tolerating a SegSNR degradation of 1 dB. In Chapter 3, a novel system that invokes jointly optimised ISCD for enhancing the error resilience of the AMR-WB speech codec was proposed and investigated. The resultant AMR-WB coded speech signal is protected by a Recursive Systematic Convolutional (RSC) code and transmitted using a non-coherently detected Multiple-Input Multiple-Output (MIMO) Differential Space-Time Spreading (DSTS) scheme. To further enhance the attainable system performance and to maximise the coding advantage of the proposed transmission scheme, the system is also combined with multi-dimensional Sphere Packing (SP) modulation. The AMR-WB speech decoder was further developed for the sake of accepting the a priori information passed to it from the channel decoder as extrinsic information, where the residual redundancy inherent in the AMR-WB encoded parameters was exploited. Moreover, the convergence behaviour of the proposed scheme was evaluated with the aid of both Three-Dimensional (3D) and Two-Dimensional (2D) EXtrinsic Information Transfer (EXIT) charts.
The proposed scheme benefitted from the exploitation of the residual redundancy inherent in the AMR-WB encoded parameters, where an approximately 0.5 dB Eb/N0 gain was achieved in comparison to its corresponding hard speech decoding based counterpart. At the point of tolerating a SegSNR degradation of 1 dB, the advocated scheme exhibited an Eb/N0 gain of about 1.0 dB in comparison to the benchmark scheme carrying out joint channel decoding and DSTS aided SP-demodulation in conjunction with separate AMR-WB decoding, when communicating over narrowband temporally correlated Rayleigh fading channels. In Chapter 4, two jointly optimized ISCD schemes invoking the soft-output AMR-WB speech codec using DSTS assisted SP modulation were proposed. More specifically, the soft-bit assisted iterative AMR-WB decoder’s convergence characteristics were further enhanced by using Over-Complete source-Mapping (OCM), as well as a recursive precoder. EXIT charts were used to analyse the convergence behaviour of the proposed turbo transceivers using the soft-bit assisted AMR-WB decoder. Explicitly, the OCM aided AMR-WB MIMO transceiver exhibits an Eb/N0 gain of about 3.0 dB in comparison to the benchmark scheme also using ISCD as well as DSTS aided SP-demodulation, but dispensing with the OCM scheme, when communicating over narrowband temporally correlated Rayleigh fading channels. Finally, the precoded soft-bit AMR-WB MIMO transceiver exhibits an Eb/N0 gain of about 1.5 dB in comparison to the benchmark scheme dispensing with the precoder, when communicating over narrowband temporally correlated Rayleigh fading channels.
2

Shen, Donglin. „Emulation study of speech communications over ATM networks“. Thesis, University of Ottawa (Canada), 1996. http://hdl.handle.net/10393/9544.

Annotation:
Speech communication over ATM networks is one of the important issues in broadband ISDN. Although CCITT Study Group XIII has already proposed the Draft Recommendation I.121 on speech communications over broadband ISDN, there are many open issues to be studied further before the effective deployment of speech communications over broadband ISDN can take place. In this thesis, several issues in speech transmission over ATM networks have been studied, such as packetization delay, network queuing delay, digital speech encoding algorithms and the PVR algorithm. A packetized speech emulation device is proposed to provide the capability of subjective speech transmission quality evaluation over an ATM network or other kinds of packetized networks. The boundaries of speech transmission quality degradation that human hearing can tolerate with respect to information loss rate, delay fluctuation and encoding mechanism are found through emulation. The echo effect in ATM networks, which is a primary issue in speech communications, is also discussed in the thesis. In particular, emphasis is given to the specification, design and implementation of the ATM speech emulator, which consists of two subsystems: the ATM Network Simulator (ATMNS) and the Speech Transmission Emulator (STE). The ATMNS has been specified according to the results of ATM network performance analysis, and the STE is based on ATM specifications recommended by the CCITT. A prototype of the emulator has been implemented on a personal computer and a DSP5600 development system with a specially designed audio interface to connect phone sets to the DSP5600 A/D input. The software was written in C and DSP assembly language. Subjective evaluations are conducted in terms of the following factors: cell discarding rate, and different network queuing delays and fluctuations under different PVR algorithms. These factors are basic issues which may affect speech transmission quality in an ATM network. Finally, test results under different network conditions are given.
3

Rouchy, Christophe. „Systematic Design of Space-Time Convolutional Codes“. Thesis, University of California, Santa Cruz, 2014. http://pqdtopen.proquest.com/#viewpdf?dispub=1554232.

Annotation:

Space-time convolutional code (STCC) is a technique that combines transmit diversity and coding to improve reliability in wireless fading channels. In this proposal, we demonstrate a systematic design of multi-level quadrature amplitude modulation (M-QAM) STCCs utilizing quadrature phase shift keying (QPSK) STCCs as component codes for any number of transmit antennas. Moreover, a low-complexity decoding algorithm is introduced, where the decoding complexity increases linearly with the number of transmit antennas. The approach is based on utilizing a group interference cancellation technique, also known as the combined array processing (CAP) technique.

Finally, our research will explore, building on the current approach: a scalable STTC with better performance than space-time block codes (STBC) combined with multiple trellis coded modulation (MTCM), also known as STBC-MTCM; the design of a low-complexity decoder for STTC; and the combination of our approach with multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM).

4

Ho, Wen Tsern 1977. „Clock and data recovery circuitry for high speed communication systems“. Thesis, McGill University, 2004. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=82494.

Annotation:
The maturing of the telecommunications industry has seen the development and implementation of devices that work at high frequencies of the electromagnetic spectrum. With the rapid deployment of optical networks, there is an increasing demand for low-cost and efficient communications circuitry. In order to interface with such high frequency signals at lower cost, there has been a recent push for very high frequency circuits using low-cost fabrication technologies like digital CMOS.
This thesis investigates the usage of legacy architectures and the implementation of different topologies using digital CMOS technology. Various Clock and Data Recovery Phase-Locked Loops have been implemented using a 0.18 µm CMOS technology, and the process from modeling to actual implementation will be presented. The design of the components of the loop, layout issues, and the performance of the various designs will be discussed. New fully-differential CMOS designs that are optimized for high-speed operation, yet providing stable lock with minimal jitter, with a targeted operation range from 1 GHz to 7 GHz, will be described in detail, as well as their operation and optimization.
5

Tian, Xizhen. „Investigation of HBT preamplification for high speed optical communication systems“. Thesis, University of Ottawa (Canada), 2002. http://hdl.handle.net/10393/6273.

Annotation:
A noise analysis for a Common-Collector-Cascode traveling wave HBT preamplifier is developed, resulting in an expression for the preamplifier's equivalent input noise current density. A photoreceiver, consisting of a P-I-N and GaAs HBT MMIC distributed amplifier, was implemented using Nortel's GaAs HBT (f_T = 70 GHz) process. The noise performance of the P-I-N preamplifier was predicted based on the noise analysis equations. The P-I-N preamplifier, having a measured bandwidth of 22 GHz, displayed a measured average equivalent input noise current density of 24 pA/√Hz. Good agreement was obtained between the predicted and measured noise performance. The analysis gives useful insight into the dominant noise contributions of the preamplifier. An 8-stage HBT distributed amplifier was successfully developed. By considering the various issues involved in its design, a design procedure for monolithic distributed amplifiers is presented. The implementation of the HBT preamplifier is described and its measured results are given. From the excellent agreement between the predicted and measured performance, the design method is considered validated. The successful operation of the distributed amplifier, which provides 15 dB gain and 35 GHz 3 dB bandwidth, fulfills the objective of experimental verification. The implemented photoreceiver is the first to have a P-I-N mounted on the MMIC chip.
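As a rough, back-of-the-envelope illustration (not taken from the thesis), the quoted equivalent input noise current density and bandwidth can be combined into an integrated RMS noise current and a crude receiver sensitivity estimate; the Q factor and responsivity used below are assumed typical values, not measured ones.

```python
import math

# Integrate a flat equivalent input noise current density over the receiver
# bandwidth to get the RMS input-referred noise current, then form a rough
# OOK sensitivity estimate P = Q * i_rms / R.
noise_density = 24e-12   # A/sqrt(Hz), value quoted in the abstract
bandwidth = 22e9         # Hz, measured preamplifier bandwidth

i_rms = noise_density * math.sqrt(bandwidth)   # RMS noise current in A

q_factor = 7.0           # assumed: Q of about 7 corresponds to BER ~ 1e-12
responsivity = 0.8       # A/W, assumed typical P-I-N responsivity

sensitivity_w = q_factor * i_rms / responsivity
sensitivity_dbm = 10 * math.log10(sensitivity_w / 1e-3)

print(f"RMS noise current: {i_rms * 1e6:.2f} uA")
print(f"Rough OOK sensitivity: {sensitivity_dbm:.1f} dBm")
```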
6

Fan, Yongquan. „Accelerating jitter and BER qualifications of high speed serial communication interfaces“. Thesis, McGill University, 2010. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=86531.

Annotation:
High-Speed Serial Interface (HSSI) devices have witnessed an increased use in communications. As a measure of how often bit errors happen, Bit Error Rate (BER) performance is of paramount importance in any communication interface. The bit errors in HSSIs are in large part due to jitter. This thesis investigates the topic of accelerating the jitter and BER testing and characterization [1].
The thesis first proposes a new algorithm, suitable for extrapolating the receiver jitter tolerance performance from higher BER regions down to the 10^-12 level or lower [2]. This algorithm enables us to perform the jitter tolerance characterization and production test more than 1000 times faster [3]. Then an under-sampling based transmitter test scheme is presented. The scheme can accurately extract the transmitter jitter and finish the whole transmitter test within 100 ms [4], while such a test usually takes seconds. All the receiver and transmitter testing schemes have been successfully used on Automatic Test Equipment (ATE) to qualify millions of HSSIs at speeds of up to 6 gigabits per second (Gbps).
The thesis also presents an external loopback-based testing scheme, where a novel jitter injection technique is proposed using state-of-the-art phase delay lines. The scheme can be applied to test HSSIs with data rates of up to 12.5 Gbps. It is also suitable for multi-lane HSSI testing at a lower cost than pure ATE solutions. By using high-speed relays, we combine the proposed ATE-based approaches and the loopback approach along with an FPGA-based BER tester to provide a more versatile scheme for HSSI post-silicon validation, testing and debugging [5]. In addition, we further explore the unparalleled advantages of our digital Gaussian noise generator in low-BER evaluation [6].
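To give a flavour of how BER performance measured at easily reachable error rates can be extrapolated down to the 10^-12 level, the sketch below maps each measured BER to an equivalent Gaussian Q value and fits a simple model. This is only a toy illustration of the general idea, not the algorithm proposed in the thesis; the measurement data and the assumed linear Q-versus-jitter relationship are invented for the example.

```python
import numpy as np
from statistics import NormalDist

def ber_to_q(ber):
    """Map a bit error rate to the equivalent Gaussian Q value, assuming
    BER = Q-function(Q) for Gaussian-dominated noise/jitter."""
    return -NormalDist().inv_cdf(ber)

# Hypothetical jitter-tolerance measurements: injected jitter amplitude (UI)
# versus BER measured quickly at relatively high error rates.
jitter_ui = np.array([0.55, 0.60, 0.65, 0.70])
ber_meas  = np.array([1e-9, 3e-8, 1e-6, 1e-5])

q_meas = np.array([ber_to_q(b) for b in ber_meas])

# Assume Q falls roughly linearly with injected jitter near eye closure and
# fit a straight line; this linearity is the modelling assumption here.
slope, intercept = np.polyfit(jitter_ui, q_meas, 1)

# Extrapolate: jitter amplitude at which BER would reach 1e-12.
q_target = ber_to_q(1e-12)
jitter_at_1e_12 = (q_target - intercept) / slope
print(f"Extrapolated jitter tolerance at BER 1e-12: {jitter_at_1e_12:.3f} UI")
```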
7

Elsherif, Mohamed Asaad. „Mapping multiplexing technique (MMT) : a novel intensity modulated transmission format for high-speed optical communication systems“. Thesis, University of Nottingham, 2016. http://eprints.nottingham.ac.uk/33413/.

Annotation:
There has been rapid growth in the deployment of data centers, driven mainly by the increasing demand for internet services such as video streaming, e-commerce, the Internet of Things (IoT), social media, and cloud computing. As a result, data centers have experienced a rapid increase in the amount of network traffic they must sustain, which has to scale with the processing speed of complementary metal-oxide-semiconductor (CMOS) technology. At the same time, as more and more data centers and processing cores are in demand, power consumption is becoming a challenging issue. Unless novel power-efficient methodologies are devised, the information technology industry will be increasingly exposed to a future power crunch. As such, novel low-complexity transmission formats featuring both power efficiency and low cost are considered the key enablers of a large-scale, high-performance data transmission environment for short-haul optical interconnects and metropolitan-range data networks. In this thesis, a novel high-speed Intensity-Modulated Direct-Detection (IM/DD) transmission format named the "Mapping Multiplexing Technique (MMT)" is proposed and presented for high-speed optical fiber networks. Conceptually, the MMT design addresses the high power consumption of high-speed short- and medium-range networks. The proposed scheme provides a low-complexity means of increasing the power efficiency of optical transceivers with an attractive tradeoff between power efficiency, spectral efficiency, and cost. The novel scheme has been registered as a patent (Malaysia PI2012700631) and can be employed for applications related, but not limited, to short-haul optical interconnects in data centers and Metropolitan Area Networks (MAN). A comprehensive mathematical model for the N-channel MMT modulation format has been developed. In addition, a signal space model for the N-channel MMT has been presented to serve as a platform for comparison with other transmission formats under optical channel constraints. In particular, comparison with M-PAM is of practical interest for expanding the capacity of optical interconnect deployments, which have recently been standardized for Ethernet IEEE 802.3bs 100Gb/s and are under ongoing investigation by the IEEE 802.3 400Gb/s Ethernet Task Force. Performance metrics have been considered through the derivation of the average electrical and optical power of N-channel MMT symbols in comparison with the Pulse Amplitude Modulation (M-PAM) format with respect to the information capacity. An asymptotic power efficiency evaluation in multi-dimensional signal space has also been carried out. For information capacities of 2, 3 and 4 bits/symbol, the 2-channel, 3-channel and 4-channel MMT modulation formats can reduce the power penalty by 1.76 dB, 2.2 dB and 4 dB compared with 4-PAM, 8-PAM and 16-PAM, respectively. This enhancement is equivalent to a 53%, 60% and 71% energy-per-bit reduction for the transmission of 2, 3 and 4 bits per symbol employing 2-, 3- and 4-channel MMT compared with the 4-, 8- and 16-PAM formats, respectively. One of the major parameters that affects the immunity of a modulation format to fiber non-linearities is the system baud rate.
The propagation of pulses in fiber at bit rates of the order of 10 Gb/s and above is limited not only by linear fiber impairments but also, strongly, by fiber intra-channel non-linearities (Self-Phase Modulation (SPM), Intra-channel Cross-Phase Modulation (IXPM) and Intra-channel Four-Wave Mixing (IFWM)). Hence, in addition to the potential application of MMT in short-haul networks, the thesis validates the practicality of implementing an N-channel MMT system accompanied by dispersion compensation methodologies to extend the reach of error-free transmission (BER ≤ 10^-12) for metro networks. N-channel MMT has been validated, through simulations of a realistic environment, to outperform M-PAM in tolerating fiber non-linearities. By employing pre- and post-compensation to tolerate both residual chromatic dispersion and non-linearity, performance above the error-free transmission limit at a 40 Gb/s bit rate has been attained for 2-, 3- and 4-channel MMT over span lengths of up to 1200 km, 320 km and 320 km, respectively. At an aggregate bit rate of 100 Gb/s, error-free transmission can be achieved for 2-, 3- and 4-channel MMT over span lengths of up to 480 km, 80 km and 160 km, respectively. At the same spectral efficiency, 4-channel MMT has achieved single-channel maximum error-free transmission over span lengths of up to 320 km and 160 km at 40 Gb/s and 100 Gb/s, respectively, in contrast with 4-PAM attaining 240 km and 80 km at 40 Gb/s and 100 Gb/s, respectively.
8

Leong, Michael. „Representing voiced speech using prototype waveform interpolation for low-rate speech coding“. Thesis, McGill University, 1992. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=56796.

Annotation:
In recent years, research in narrow-band digital speech coding has achieved good quality speech coders at low rates of 4.8 to 8.0 kb/s. This thesis examines the method proposed by W. B. Kleijn called prototype waveform interpolation (PWI) for coding the voiced sections of speech efficiently to achieve a coder below 4.8 kb/s while maintaining, even improving, the perceptual quality of current coders.
In examining the PWI method, it was found that, although the method generally works very well, there are occasional sections of the reconstructed voiced speech where audible distortion can be heard, even when the prototypes are not quantized. The research undertaken in this thesis focuses on the fundamental principles behind modelling voiced speech using PWI instead of focusing on bit allocation for encoding the prototypes. Problems are found in the PWI method that may have been overlooked as encoding error if full encoding had been implemented.
Kleijn uses PWI to represent voiced sections of the excitation signal, which is the residual obtained after the removal of short-term redundancies by a linear predictive filter. The problem with this method is that when the PWI-reconstructed excitation is passed through the inverse filter to synthesize the speech, undesired effects occur due to the time-varying nature of the filter. The reconstructed speech may have undesired envelope variations which result in audible warble.
This thesis proposes an energy fixup to smooth the synthesized speech envelope when the interpolation procedure fails to provide the smooth linear result that is desired. Further investigation, however, leads to the final proposal in this thesis that PWI should be performed on the clean speech signal instead of the excitation to achieve consistently reliable results for all voiced frames.
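The following is a minimal sketch of the basic prototype-waveform-interpolation idea described above: each update interval is represented by a single pitch-cycle prototype, and the signal in between is synthesized by interpolating both the pitch period and the cycle shape. It is a toy illustration under simplifying assumptions, not Kleijn's method or the thesis's proposal; the function names and parameters are invented for the example.

```python
import numpy as np

def pwi_synthesize(proto_a, proto_b, period_a, period_b, n_samples):
    """Toy prototype-waveform interpolation between two pitch-cycle
    prototypes. Both prototypes are resampled onto a common normalized
    phase axis; the output is generated sample by sample, advancing an
    instantaneous phase whose rate follows the linearly interpolated
    pitch period."""
    n_phase = 256
    phase_axis = np.linspace(0.0, 1.0, n_phase, endpoint=False)

    def resample_to_phase(proto):
        src = np.linspace(0.0, 1.0, len(proto), endpoint=False)
        return np.interp(phase_axis, src, proto, period=1.0)

    pa, pb = resample_to_phase(proto_a), resample_to_phase(proto_b)

    out = np.zeros(n_samples)
    phase = 0.0
    for n in range(n_samples):
        t = n / (n_samples - 1)                      # interpolation factor 0..1
        period = (1 - t) * period_a + t * period_b   # instantaneous pitch period
        cycle = (1 - t) * pa + t * pb                # interpolated cycle shape
        out[n] = np.interp(phase % 1.0, phase_axis, cycle, period=1.0)
        phase += 1.0 / period                        # advance by one sample
    return out

# Example: morph a 50-sample sawtooth cycle into an 80-sample sine cycle.
a = np.linspace(-1, 1, 50)
b = np.sin(2 * np.pi * np.arange(80) / 80)
voiced = pwi_synthesize(a, b, 50, 80, 800)
```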
9

Abboud, Karim. „Wideband CELP speech coding“. Thesis, McGill University, 1992. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=56805.

Annotation:
The purpose of this thesis is to study the coding of wideband speech and to improve on previous Code-Excited Linear Prediction (CELP) coders in terms of speech quality and bit rate. To accomplish this task, improved coding techniques are introduced and the operating bit rate is reduced while maintaining and even enhancing the speech quality.
The first approach considers the quantization of Linear Predictive Coding (LPC) parameters and uses a three-way split vector quantization. Both scalar and vector quantization are initially studied; results show that, with adequate codebook training, the second method generates better results while using fewer bits. Nevertheless, the use of vector quantizers remains highly complex in terms of memory and number of computations. A new quantization scheme, split vector quantization (split VQ), is investigated to overcome this complexity problem. Using a new weighted distance measure as a selection criterion for split VQ, the average spectral distortion is significantly reduced to match the results obtained with scalar quantizers.
The second approach introduces a new pitch predictor with an increased temporal resolution for periodicity. This new technique has the advantage of maintaining the same quality obtained with conventional multiple-coefficient predictors at a reduced bit rate. Furthermore, the conventional CELP noise weighting filter is modified to allow more freedom and better accuracy in the modeling of both tilt and formant structures. Throughout this process, different noise weighting schemes are evaluated and the results show that the new filter contributes greatly to solving the problem of high frequency distortion.
The final wideband CELP coder is operational at 11.7 kbits/s and generates a high perceptual quality of the reconstructed speech using the fractional pitch predictor and the new perceptual noise weighting filter.
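For reference, the "average spectral distortion" used to judge LPC/LSF quantizers is conventionally the RMS log-spectral distortion between the original and quantized all-pole spectra; the sketch below shows that standard textbook measure with made-up coefficients, not necessarily the exact configuration or weighting used in this thesis.

```python
import numpy as np

def lpc_power_spectrum(a, n_fft=512):
    """Power spectrum of an all-pole LPC model 1/A(z); `a` holds the
    predictor polynomial coefficients [1, a1, ..., ap]."""
    A = np.fft.rfft(a, n_fft)
    return 1.0 / np.maximum(np.abs(A) ** 2, 1e-12)

def spectral_distortion(a_ref, a_quant, n_fft=512):
    """RMS log-spectral distortion in dB between two LPC models, averaged
    over frequency (quantizers are often called transparent near an
    average SD of about 1 dB)."""
    p_ref = 10 * np.log10(lpc_power_spectrum(a_ref, n_fft))
    p_qnt = 10 * np.log10(lpc_power_spectrum(a_quant, n_fft))
    return np.sqrt(np.mean((p_ref - p_qnt) ** 2))

# Illustrative example with made-up coefficients:
a_ref = np.array([1.0, -1.2, 0.5, -0.1])
a_qnt = np.array([1.0, -1.18, 0.48, -0.09])
print(f"SD = {spectral_distortion(a_ref, a_qnt):.2f} dB")
```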
10

Nour-Eldin, Amr. „Quantifying and exploiting speech memory for the improvement of narrowband speech bandwidth extension“. Thesis, McGill University, 2014. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=121195.

Annotation:
Since its standardization in the 1960s, the bandwidth of traditional telephony speech has been limited to the 0.3–3.4 kHz narrowband range. Wideband speech reconstruction through artificial bandwidth extension (BWE) attempts to regenerate the highband frequency content above 3.4 kHz at the receiving end, thereby providing backward compatibility with existing networks. BWE schemes have primarily relied on memoryless mapping to capture the correlation between narrowband and highband spectra. In this thesis, we investigate exploiting speech memory—in reference to the long-term information in segments longer than the conventional 10–30 ms frames—for the purpose of improving the cross-band correlation central to BWE. With speech durations of up to 600 ms modelled through delta features, we first quantify the correlation between long-term parameterizations of the narrow and high frequency bands using information-theoretic measures in combination with statistical modelling based on Gaussian mixture models (GMMs) and vector quantization. In addition to showing that the inclusion of memory can indeed increase certainty about highband spectral content in joint-band GMMs by over 100%, our information-theoretic investigation also demonstrates that the gains achievable by such acoustic-only memory inclusion saturate at, roughly, the syllabic duration of 200 ms. To translate the highband certainty gains achievable by memory inclusion into tangible BWE performance improvements, we subsequently propose two distinct and novel approaches for memory-inclusive GMM-based BWE where highband spectra are reconstructed given narrowband input by minimum mean-square error estimation. In the first approach, we incorporate delta features into the feature vector representations whose underlying cross-band correlations are to be modelled by joint-band GMMs. Due to their non-invertibility, however, the inclusion of delta features into the parameterization frontend in lieu of some of the conventional static features imposes a time-frequency information tradeoff. Accordingly, we propose an empirical optimization process to determine the optimal allocation of available dimensionalities among static and delta features such that the certainty about static highband content is maximized. Integrating frontend-based memory inclusion optimized as such into our memoryless BWE baseline system results in performance improvements that, while modest, involve no increase in extension-stage computational cost or in training data requirements, thereby providing an easy and convenient means for exploiting speech dynamics to improve BWE performance. In our second approach, we focus on modelling the high-dimensional distributions underlying sequences of joint-band feature vectors. To that end, we extend the GMM framework by presenting a novel training approach where sequences of past frames are progressively used to estimate the parameters of high-dimensional temporally-extended GMMs in a tree-like time-frequency-localized fashion. The proposed approach thus breaks down the infeasible task of modelling high-dimensional distributions into a series of localized modelling operations with considerably lower complexity and fewer degrees of freedom. The proposed temporal-based GMM extension approach is presented in a manner that emphasizes its wide applicability to the general contexts of source-target conversion and high-dimensional modelling.
By integrating temporally-extended GMMs into our memoryless BWE baseline system, we show that our model-based memory-inclusive BWE technique can outperform not only our first frontend-based approach, but also other comparable and oft-cited model-based techniques in the literature. Although this superior BWE performance is achieved at a significant increase in extension-stage computational costs, we nevertheless show these costs to be within the typical capabilities of modern communication devices such as tablets and smart phones.
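The delta features mentioned above are conventionally computed with a regression formula over a window of static frames. The sketch below shows that standard computation; the window length and data are illustrative, and the exact parameterization used in the thesis may differ.

```python
import numpy as np

def delta_features(static, window=2):
    """Standard regression-based delta coefficients:
    d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2),
    computed per frame with edge frames replicated."""
    T, D = static.shape
    padded = np.vstack([static[:1]] * window + [static] + [static[-1:]] * window)
    denom = 2 * sum(n * n for n in range(1, window + 1))
    deltas = np.zeros_like(static)
    for n in range(1, window + 1):
        deltas += n * (padded[window + n: window + n + T]
                       - padded[window - n: window - n + T])
    return deltas / denom

# Example: 100 frames of 10-dimensional (e.g., MFCC-like) static features.
rng = np.random.default_rng(0)
static = rng.standard_normal((100, 10))
dyn = delta_features(static, window=2)
extended = np.hstack([static, dyn])   # joint static + delta feature vectors
```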
11

Hsu, Wei-shou 1981. „Robust bandwidth extension of narrowband speech“. Thesis, McGill University, 2004. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=82497.

Annotation:
Telephone speech often sounds muffled and thin due to its narrowband characteristics. With the increased availability of terminals capable of receiving wideband signals, extending the bandwidth of narrowband telephone speech at the receiver has drawn much research interest. Currently, there exist many methods that can provide good reconstructions of the wideband spectra from narrowband speech; however, they often lack robustness to different channel conditions, and their performance degrades when they operate in unknown environments.
This thesis presents a bandwidth extension algorithm that mitigates the effects of adverse conditions. The proposed system is designed to work with noisy input speech and unknown channel frequency response. To maximize the naturalness of the reconstructed speech, the algorithm estimates the channel and applies equalization to recover the attenuated bands. Artifacts are reduced by employing an adaptive and a fixed postfilter.
Subjective test results suggest that the proposed scheme is not affected by channel conditions and is able to produce speech with enhanced quality in adverse environments.
12

Soong, Michael. „Predictive split vector quantization for speech coding“. Thesis, McGill University, 1994. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=68054.

Annotation:
The purpose of this thesis is to examine techniques for efficiently coding speech Linear Predictive Coding (LPC) coefficients. Vector Quantization (VQ) is an efficient approach to encode speech at low bit rates. However its exponentially growing complexity poses a formidable barrier. Thus a structured vector quantizer is normally used instead.
Summation Product Codes (SPCs) are a family of structured vector quantizers that circumvent the complexity obstacle. The performance of SPC vector quantizers can be traded off against their storage and encoding complexity. Besides the complexity factors, the design algorithm can also affect the performance of the quantizer. The conventional generalized Lloyd's algorithm (GLA) generates sub-optimal codebooks. For particular SPC such as multistage VQ, the GLA is applied to design the stage codebooks stage-by-stage. Joint design algorithms on the other hand update all the stage codebooks simultaneously.
In this thesis, a general formulation and an algorithm solution to the joint codebook design problem is provided for the SPCs. The key to this algorithm is that every PC has a reference product codebook which minimizes the overall distortion. This joint design algorithm is tested with a novel SPC, namely "Predictive Split VQ (PSVQ)".
VQ of speech Line Spectral Frequencies (LSFs) using PSVQ is also presented. A result of this work is that PSVQ, designed using the joint codebook design algorithm, requires only 20 bits/frame (20 ms) for transparent coding of 10th-order LSF parameters.
13

Choy, Eddie L. T. „Waveform interpolation speech coder at 4 kbs“. Thesis, McGill University, 1998. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=20901.

Annotation:
Speech coding at bit rates near 4 kbps is expected to be widely deployed in applications such as visual telephony, mobile and personal communications. This research focuses on developing a speech coder based on the waveform interpolation (WI) scheme, with an attempt to deliver near toll-quality speech at rates around 4 kbps. A WI coder has been simulated in floating-point using the C programming language. The high performance of the WI model has been confirmed by subjective listening tests in which the unquantized coder outperforms the 32 kbps G.726 standard (ADPCM) 98% of the time under clean input speech conditions; the reconstructed speech is perceived to be essentially indistinguishable from the original. When fully quantized, the speech quality of the WI coder at 4.25 kbps has been judged to be equivalent to or better than that of G.729 (the ITU-T toll-quality 8 kbps standard) for 45% of the test sentences. Further refinements of the quantization techniques are warranted to bring the coder closer to the toll-quality benchmark. Yet, the existing implementation has produced good quality coded speech with a high degree of intelligibility and naturalness when compared to the conventional coding schemes operating in the neighbourhood of 4 kbps.
14

De, Aloknath. „Auditory distortion measures for speech coder evaluation“. Thesis, McGill University, 1993. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=41270.

Annotation:
One of the important research problems in the area of speech coding is to determine the sound quality of coded speech signals. This quality can best be evaluated by a subjective assessment which is often difficult to administer and time-consuming. An objective measure which is consistent with subjective assessment could play a vital role in the evaluation as well as in the design of a low bit-rate speech coder. In this dissertation, we introduce two distortion measures for speech coder evaluation. Since the perceptual abilities of a human being determine the precision with which speech data must be processed, we consider the details of cochlear (inner ear) and other auditory processing. Using Lyon's auditory model, the time-domain signal is mapped onto a perceptual domain (PD). Any speech utterance is communicated to the brain through a series of all-or-none electrical spikes (firings), and the PD representation provides information pertaining to the probability of firing in the neural channels. Our first measure, namely the cochlear discrimination information (CDI), evaluates the cross-entropy of the neural firings for the coded speech with respect to those for the original one. With this measure, we also compute the rate-distortion function determining the lowest bit-rate required for a specified amount of distortion. In the second measure, namely the cochlear hidden Markovian (CHM) measure, we attempt to capture the high-level processing in the brain with simple hidden Markov models (HMMs). We characterize the firing events by HMMs where the order of occurrence of PD observations and correlations among adjacent observations are modeled suitably. For computing the coder distortion, the PD observations of the coded speech are matched against the HMMs derived from the PD observations of the original speech. Experimental results show that these measures conform to subjective evaluation results in the majority of cases. Finally, the introduced measures are also app
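The cross-entropy flavour of the CDI measure can be pictured as a relative-entropy comparison between the firing-probability patterns of the original and the coded speech. The sketch below computes a per-channel Bernoulli KL divergence between two probability-of-firing arrays; it is purely illustrative and not the thesis's exact definition of CDI.

```python
import numpy as np

def relative_entropy(p_orig, p_coded, eps=1e-12):
    """Mean Kullback-Leibler divergence (in bits) between per-channel
    firing probabilities of the original and coded speech. Each row is a
    time frame, each column a cochlear channel; entries are probabilities
    of firing in (0, 1). Illustrative only."""
    p = np.clip(p_orig, eps, 1 - eps)
    q = np.clip(p_coded, eps, 1 - eps)
    # Bernoulli KL divergence per (frame, channel), averaged over both.
    kl = p * np.log2(p / q) + (1 - p) * np.log2((1 - p) / (1 - q))
    return kl.mean()

# Example with synthetic firing probabilities for 50 frames x 64 channels.
rng = np.random.default_rng(1)
p_orig = rng.uniform(0.05, 0.95, size=(50, 64))
p_coded = np.clip(p_orig + rng.normal(0, 0.02, size=p_orig.shape), 0, 1)
print(f"Mean per-channel KL divergence: {relative_entropy(p_orig, p_coded):.4f} bits")
```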
15

Grass, John. „Quantization of predictor coefficients in speech coding“. Thesis, McGill University, 1990. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=60067.

Annotation:
This thesis examines techniques of efficiently coding Linear Predictive Coding (LPC) coefficients with 20 to 30 bits per 20 ms speech frame.
Scalar quantization is the first approach evaluated. Results show that Line Spectral Frequencies require significantly fewer bits than reflection coefficients for comparable performance. The second approach investigated is the use of vector-scalar quantization. In the first stage, vector quantization is performed. The second stage consists of a bank of scalar quantizers which code the vector errors between the original LPC coefficients and the components of the vector of the quantized coefficients.
The first innovation is to couple the vector and scalar quantization stages: every codebook vector is compared to the original LPC coefficient vector to produce error vectors. The second innovation in vector-scalar quantization is the incorporation of a small adaptive codebook alongside the large fixed codebook. Frame-to-frame correlation of the LPC coefficients is exploited at no extra cost in bits.
The performance of the vector-scalar quantization using the two new techniques is better than that of the scalar coding techniques currently used in conventional LPC coders.
16

Maroun, Nabih. „Toll-quality speech coding at 8 kbs“. Thesis, McGill University, 1993. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=56802.

Annotation:
There has been an ongoing effort to achieve very high quality speech coding at medium transmission bit rates. Consequently, the TIA has chosen the Vector Sum Excited Linear Prediction (VSELP) implementation of an 8 kb/s coder as the standard for North American cellular digital telephony. However, it was only recently, in view of the increased research focus on developing toll-quality speech coding at such bit rates, that the CCITT imposed a set of specifications for standardizing low-delay coders operating at 8 kb/s. The Low-Delay Code-Excited Linear Prediction (LD-CELP) coder suggested by Chen is presently the only potential candidate for CCITT standardization, achieving a one-way coding delay of 10 ms. However, just like the VSELP coding algorithm, the 8 kb/s LD-CELP version does not quite yield toll-quality reconstructed speech. The purpose of the work in this thesis is to establish the minimum requirements for a coding structure capable of generating toll-quality coded speech at 8 kb/s. In particular, this thesis shows that, by slightly relaxing the coding delay constraint, perceptual enhancement techniques yield toll-quality coding after redesigning and fine-tuning the optimization and quantization procedures of a CELP coder.
17

Batri, Nadim. „Robust spectral parameter coding in speech processing“. Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1998. http://www.collectionscanada.ca/obj/s4/f2/dsk1/tape11/PQDD_0005/MQ43996.pdf.

18

He, Wei. „Adaptive-rate digital speech transmission“. Thesis, University of Warwick, 1993. http://wrap.warwick.ac.uk/104723/.

19

El-Khoury, Roland. „Evaluating a speech interface system for an ICU“. Thesis, McGill University, 1994. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=69793.

Annotation:
This thesis presents the development and evaluation of a speech I/O interface system for a bedside data entry application in an intensive care unit. The speech I/O interface consists of a speech recognition input and a speech generation output. The main objective is to permit users to perform "hands-free" and "eyes-free" data entry. The thesis begins with a literature survey of patient data management systems, speech applications, and the difficulties and key factors of speech interface evaluation. This is followed by an overview of the current system. The design and the implementation of the speech interface system are described. The evaluation scheme and test results are presented and discussed, followed by an outline of future work and recommendations for the system.
20

Loo, James H. Y. (James Hung Yan). „Intraframe and interframe coding of speech spectral parameters“. Thesis, McGill University, 1996. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=24065.

Annotation:
Most low bit rate speech coders employ linear predictive coding (LPC) which models the short-term spectral information within each speech frame as an all-pole filter. In this thesis, we examine various methods that can efficiently encode spectral parameters for every 20 ms frame interval. Line spectral frequencies (LSF) are found to be the most effective parametric representation for spectral coding. Product code vector quantization (VQ) techniques such as split VQ (SVQ) and multi-stage VQ (MSVQ) are employed in intraframe spectral coding, where each frame vector is encoded independently from other frames. Depending on the product code structure, "transparent coding" quality is achieved for SVQ at 26-28 bits/frame and for MSVQ at 25-27 bits/frame.
Because speech is quasi-stationary, interframe coding methods such as predictive SVQ (PSVQ) can exploit the correlation between adjacent LSF vectors. Nonlinear PSVQ (NPSVQ) is introduced in which a nonparametric and nonlinear predictor replaces the linear predictor used in PSVQ. Regardless of predictor type, PSVQ garners a performance gain of 5-7 bits/frame over SVQ. By interleaving intraframe SVQ with PSVQ, error propagation is limited to at most one adjacent frame. At an overall bit rate of about 21 bits/frame, NPSVQ can provide similar coding quality as intraframe SVQ at 24 bits/frame (an average gain of 3 bits/frame). The particular form of nonlinear prediction we use incurs virtually no additional encoding computational complexity. Voicing classification is used in classified NPSVQ (CNPSVQ) to obtain an additional average gain of 1 bit/frame for unvoiced frames. Furthermore, switched-adaptive predictive SVQ (SA-PSVQ) provides an improvement of 1 bit/frame over PSVQ, or 6-8 bits/frame over SVQ, but error propagation increases to 3-7 frames. We have verified our comparative performance results using subjective listening tests.
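As a concrete picture of the split VQ (SVQ) structure discussed above: the LSF vector is divided into sub-vectors, each quantized against its own codebook, and the tuple of indices is transmitted. The sketch below uses an illustrative 3-3-4 split with random stand-in codebooks; the actual splits, bit allocations and trained codebooks in the thesis differ.

```python
import numpy as np

def split_vq_encode(lsf, codebooks, splits):
    """Encode one LSF vector with split VQ: each sub-vector is matched
    against its own codebook by nearest-neighbour search, and the chosen
    indices together form the quantizer output."""
    indices, start = [], 0
    for cb, size in zip(codebooks, splits):
        sub = lsf[start:start + size]
        dist = np.sum((cb - sub) ** 2, axis=1)     # squared-error distance
        indices.append(int(np.argmin(dist)))
        start += size
    return indices

def split_vq_decode(indices, codebooks):
    return np.concatenate([cb[i] for cb, i in zip(codebooks, indices)])

# Illustrative 3-3-4 split of a 10th-order LSF vector with 8-bit codebooks
# per part (24 bits/frame total); the codebooks here are random stand-ins
# for trained ones.
splits = (3, 3, 4)
rng = np.random.default_rng(2)
codebooks = [np.sort(rng.uniform(0, np.pi, size=(256, k)), axis=1) for k in splits]

lsf = np.sort(rng.uniform(0.1, 3.0, size=10))      # a fake, ordered LSF vector
idx = split_vq_encode(lsf, codebooks, splits)
lsf_hat = split_vq_decode(idx, codebooks)
```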
21

Duplessis-Beaulieu, François. „Fast convolutive blind speech separation via subband adaptation“. Thesis, McGill University, 2002. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=29535.

Annotation:
Blind source separation (BSS) attempts to recover a set of statistically independent sources from a set of mixtures knowing only the structure of the mixing network, and the hypothesized probability distribution function of the sources. The case where the sources are immobile persons speaking in a reverberant room is of particular interest, because it represents a first step toward unlocking the so-called "cocktail party problem". Due to the reverberations, BSS in the time domain is usually expensive in terms of computations, but the number of computations can be significantly decreased if separation is carried out in subbands.
An implementation of a subband-based BSS system using DFT filter banks is described, and an adaptive algorithm tailored for subband separation is developed. Aliasing present in the filter bank (due to the non-ideal frequency response of the filters) is reduced by using an oversampled scheme. Experiments, conducted with two-input two-output BSS systems, using both subband and fullband adaptation, indicate that separation and distortion rates are similar for both systems. However, the proposed 32-subband system is approximately 10 times computationally faster than the fullband system.
22

Agarwal, Tarun. „Pre-processing of noisy speech for voice coders“. Thesis, McGill University, 2002. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=33953.

Annotation:
Accurate Linear Prediction Coefficient (LPC) estimation is a central requirement in low bit-rate voice coding. Under harsh acoustic conditions, LPC estimation can become unreliable. This results in poor quality of encoded speech and introduces annoying artifacts.
The purpose of this thesis is to develop and test a two-branch speech enhancement pre-processing system. This system consists of two denoising blocks. One block enhances the degraded speech for accurate LPC estimation. The second block increases the perceptual quality of the speech to be coded. The goals of this research are twofold: to design the second block, and to compare the performance of other denoising schemes in each of the two branches. Test results show that the two-branch system can provide better perceptual quality of coded speech than conventional one-branch (i.e., one denoising block) speech enhancement techniques in many noisy environments.
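A minimal sketch of the two-branch idea, under the assumption that some single-channel denoiser is available: the same noisy frame is cleaned twice with different settings, one output feeding LPC analysis and the other becoming the waveform that is actually encoded. The spectral-subtraction stand-in, its parameters and the function names are assumptions for illustration, not the thesis's actual blocks.

```python
import numpy as np

def spectral_subtract(frame, noise_psd, oversubtraction=1.0, floor=0.02):
    """Very simple magnitude-domain spectral subtraction used here as a
    stand-in denoiser; `oversubtraction` trades residual noise against
    speech distortion."""
    spec = np.fft.rfft(frame)
    mag, phase = np.abs(spec), np.angle(spec)
    clean_mag = np.maximum(mag - oversubtraction * np.sqrt(noise_psd),
                           floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase), len(frame))

def two_branch_preprocess(frame, noise_psd):
    """Two-branch pre-processing: a more aggressive branch feeds LPC
    analysis (favouring accurate envelope estimation), a gentler branch
    provides the waveform to be encoded (favouring perceptual quality).
    The specific settings are illustrative assumptions."""
    for_lpc = spectral_subtract(frame, noise_psd, oversubtraction=2.0)
    for_coding = spectral_subtract(frame, noise_psd, oversubtraction=1.0)
    return for_lpc, for_coding

# Example: one 160-sample frame of noisy "speech" (synthetic here).
rng = np.random.default_rng(3)
frame = np.sin(2 * np.pi * 0.05 * np.arange(160)) + 0.3 * rng.standard_normal(160)
noise_psd = np.full(81, 0.3 ** 2 * 160 / 2)   # crude flat noise PSD guess
lpc_input, coder_input = two_branch_preprocess(frame, noise_psd)
```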
23

Klein, Mark 1977. „Signal subspace speech enhancement with perceptual post-filtering“. Thesis, McGill University, 2002. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=33975.

Annotation:
Speech enhancement blocks form a critical part of voice communications systems. Unfortunately, most enhancement schemes have difficulty eliminating noise from speech without introducing distortion or artefacts. Many of the disturbances originate from poor parameter estimation and interframe fluctuations.
This thesis introduces the Enhanced Signal Subspace (ESS) system to mitigate the above problems. Based on a signal subspace framework, ESS has been designed to attenuate disturbances while minimizing audible distortion.
Artefacts are reduced by employing an auditory post-filter to smooth the enhanced speech spectra. This filter performs averaging in a manner that exploits the properties of the human auditory system. As such, distortion of the underlying speech signal is reduced.
Testing shows that listeners prefer the proposed algorithm to traditional signal subspace speech enhancement.
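For orientation, the core signal-subspace step works roughly as sketched below: eigendecompose the empirical covariance of the noisy frame, keep the components whose energy exceeds an assumed noise floor, and reconstruct. This bare-bones Wiener-style variant omits the perceptual post-filter that is the thesis's contribution; all parameters are illustrative.

```python
import numpy as np

def subspace_enhance(noisy_frame, noise_var, dim=20):
    """Bare-bones signal-subspace enhancement of one speech frame. An
    empirical covariance is built from overlapping length-`dim` snippets,
    eigen-decomposed, and only components whose energy exceeds the
    assumed noise variance are retained (no perceptual post-filter)."""
    n = len(noisy_frame) - dim + 1
    X = np.stack([noisy_frame[i:i + dim] for i in range(n)])
    R = X.T @ X / n                              # empirical covariance
    eigval, eigvec = np.linalg.eigh(R)
    keep = eigval > noise_var                    # signal-subspace selection
    # Wiener-like per-component gains (clamped at zero).
    gains = np.where(keep, np.maximum(eigval - noise_var, 0) / eigval, 0.0)
    H = eigvec @ np.diag(gains) @ eigvec.T       # enhancement filter
    Y = X @ H.T                                  # filter every snippet
    # Overlap-add the filtered snippets back into a frame.
    out = np.zeros_like(noisy_frame)
    counts = np.zeros_like(noisy_frame)
    for i in range(n):
        out[i:i + dim] += Y[i]
        counts[i:i + dim] += 1
    return out / np.maximum(counts, 1)

rng = np.random.default_rng(4)
clean = np.sin(2 * np.pi * 0.03 * np.arange(256))
noisy = clean + 0.2 * rng.standard_normal(256)
enhanced = subspace_enhance(noisy, noise_var=0.2 ** 2)
```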
24

Roy, Guylain. „Low-rate analysis-by-synthesis wideband speech coding“. Thesis, McGill University, 1990. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=59643.

Annotation:
This thesis studies low-rate wideband analysis-by-synthesis speech coders. The wideband speech signals have a bandwidth of up to 8 kHz and are sampled at 16 kHz, while the target operating bit rate is 16 kbits/sec. Applications for such a coder range from high-quality voice-mail services to teleconferencing. In order to achieve a low operating rate, the coding places more emphasis on the lower frequencies (0 to 4 kHz), while the higher frequencies (4 to 8 kHz) are coded less precisely but with little perceived degradation.
The study consists of three stages. First, aspects of wideband spectral envelope modeling using Line Spectral Frequencies (LSFs) are studied. Then, the underlying coder structure is derived from a basic Residual-Excited Linear Predictive (RELP) coder. This structure is enhanced by the addition of a pitch prediction stage, and by the development of full-band and split-band pitch parameter optimization procedures. These procedures are then applied to a Code-Excited Linear Prediction (CELP) model. Finally, the performance of full-band and split-band CELP structures is compared.
25

Bees, Duncan Charles. „Enhancement of acoustically reverberant speech using cepstral methods“. Thesis, McGill University, 1990. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=59819.

Annotation:
Acoustical reverberation has been shown to degrade the intelligibility and naturalness of speech. In this thesis, we discuss the application of cepstral methods to the enhancement of acoustically reverberant speech.
We first study previously described cepstral techniques for removal of simple echoes from signals. Our results show that these techniques are not directly applicable to the enhancement of speech of indefinite extent. We next recast these techniques specifically for speech. We propose new segmentation and windowing strategies, in combination with cepstral averaging, to accurately identify the acoustical impulse response. We then consider inverse filtering based on an estimated acoustical impulse response, and find that finite impulse response filters designed according to the least mean squared error criterion provide satisfactory performance. Finally, we synthesize and test an algorithm for enhancement of reverberant speech. Although significant difficulties remain, we feel that our methods offer a substantial contribution to the solution of the reverberant speech enhancement problem.
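The cepstral-averaging idea that the above builds on can be demonstrated on a synthetic signal: an echo adds a ripple to the log spectrum that appears as a cepstral peak at the echo delay, and averaging cepstra over many segments makes this fixed channel contribution stand out from the varying source contribution. The sketch below is a toy demonstration with white noise as the source, not the thesis's segmentation and windowing strategy.

```python
import numpy as np

def average_cepstrum(signal, frame_len=512, hop=256):
    """Average the real cepstrum over overlapping frames: the varying
    source contribution tends to average out while a fixed echo /
    impulse-response contribution is reinforced."""
    cepstra = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len] * np.hanning(frame_len)
        log_mag = np.log(np.maximum(np.abs(np.fft.rfft(frame)), 1e-12))
        cepstra.append(np.fft.irfft(log_mag, frame_len))
    return np.mean(cepstra, axis=0)

# Synthetic example: a white-noise "source" plus one echo 100 samples later.
rng = np.random.default_rng(5)
dry = rng.standard_normal(16000)
delay, alpha = 100, 0.5
wet = dry.copy()
wet[delay:] += alpha * dry[:-delay]

cep = average_cepstrum(wet)
search = np.abs(cep[20:256])        # skip the low-quefrency envelope region
print("Estimated echo delay:", 20 + int(np.argmax(search)), "samples")
```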
26

Chahine, Gebrael. „Pitch modelling for speech coding at 4.8 kbitss“. Thesis, McGill University, 1993. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=69724.

Annotation:
The purpose of this thesis is to examine techniques of efficiently modelling the Long-Term Predictor (LTP) or the pitch filter in low rate speech coders. The emphasis in this thesis is on a class of coders which are referred to as Linear Prediction (LP) based analysis-by-synthesis coders, and more specifically on the Code-Excited Linear Prediction (CELP) coder which is currently the most commonly used in low rate transmission. The experiments are performed on a CELP based coder developed by the U.S. Department of Defense (DoD) and Bell Labs, with an output bit rate of 4.8 kbits/s.
A multi-tap LTP outperforms a single-tap LTP, but at the expense of a greater number of bits. A single-tap LTP can be improved by increasing the time resolution of the LTP. This results in a fractional delay LTP, which produces a significant increase in prediction gain and perceived periodicity at the cost of more bits, but less than for the multi-tap case.
The first new approach in this work is to use a pseudo-three-tap pitch filter with one or two degrees of freedom in the predictor coefficients, which gives better-quality reconstructed speech and a more desirable frequency response than a one-tap pitch prediction filter. The pseudo-three-tap pitch filter with one degree of freedom is of particular interest, as no extra bits are needed to code the pitch coefficients.
The second new approach is to perform time scaling/shifting on the original speech, further reducing the minimum mean square error and allowing a smoother and more accurate reconstruction of the pitch structure. The time scaling technique allows a saving of 1 bit in coding the pitch parameters while closely maintaining the quality of the reconstructed speech. In addition, no extra bits are needed for the time scaling operation, as no extra side information has to be transmitted to the receiver.
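For readers unfamiliar with long-term prediction, the sketch below shows the basic one-tap pitch predictor on which such refinements build: an exhaustive integer-lag search for the lag and gain that remove the most energy from the residual. It is a generic illustration under assumed lag limits, not the thesis algorithm; fractional delays and multi-tap filters extend this same structure.

```python
import numpy as np

def one_tap_ltp(residual, lag_min=20, lag_max=147):
    """Exhaustive search for the best integer pitch lag and gain.

    For each candidate lag L, the optimal single-tap gain is
    b = <r[n], r[n-L]> / <r[n-L], r[n-L]>; the lag that maximizes the
    energy removed by the predictor wins.
    """
    best = (None, 0.0, -np.inf)   # (lag, gain, score)
    for lag in range(lag_min, lag_max + 1):
        target = residual[lag:]
        past = residual[:-lag]
        energy = np.dot(past, past)
        if energy <= 0.0:
            continue
        corr = np.dot(target, past)
        score = corr * corr / energy   # energy removed by the predictor
        if score > best[2]:
            best = (lag, corr / energy, score)
    lag, gain, _ = best
    return lag, gain
```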
APA, Harvard, Vancouver, ISO und andere Zitierweisen
27

Gagnon, Luc. „A speech enhancement algorithm based upon resonator filterbanks“. Thesis, University of Ottawa (Canada), 1991. http://hdl.handle.net/10393/7767.

Der volle Inhalt der Quelle
APA, Harvard, Vancouver, ISO und andere Zitierweisen
28

Starks, David Ross. „Speech recognition in adverse environments: Improvements to IMELDA“. Thesis, University of Ottawa (Canada), 1995. http://hdl.handle.net/10393/9483.

Der volle Inhalt der Quelle
Annotation:
This thesis deals with speech recognition in adverse environments. The primary problem is the mismatch between training and test conditions. Cases of mismatch include the recording channel, acoustic noise and the speaker. Noise shaping and subband filtering are two noise suppression techniques that work by utilizing properties of speech. Dynamic speech analysis and some feature extraction methods are inherently robust to the influence of noise. Linear discriminant analysis (LDA) can be used to combine disparate sets of speech parameters and obtain the optimal set of features. IMELDA (45) is one such method. In this thesis, we analyse the effectiveness of IMELDA under various training and test scenarios. Theoretical results are first derived and substantiated by simulations. It will be shown that LDA provides a form of noise shaping and that the root-deconvolution technique is inappropriate for IMELDA. A new algorithm for predicting recognition performance is proposed and verified. Optimal cross-condition recognition is obtained by utilizing samples of noisy test speech in the within-class covariance, in the so-called QNT IMELDA transform. In the event that the noise is stationary and can be modelled, we derive an equivalent transform by artificially modifying quiet speech samples. This suffices for the simplest instances. For the extreme helicopter case, we show the best approach to be a combination of band-pass filtering and dynamic analysis of the Mel-scale subbands. Unknown channel noise and additive noise are reduced through the respective subband processing algorithms. Finally, practical issues of applying LDA and integrating subband filtering in a speech recognition system are addressed.
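As a generic illustration of the LDA step that underlies IMELDA-style transforms, the sketch below computes within-class and between-class scatter matrices and returns the leading discriminant directions; the feature matrix and class labels are assumed inputs, and including noisy test-condition samples in the within-class scatter is, in spirit, what the QNT-style training described above does. This is not the thesis implementation.

```python
import numpy as np

def lda_transform(features, labels, n_dims):
    """Generic LDA: eigenvectors of pinv(Sw) @ Sb, largest eigenvalues first.

    features: (n_samples, n_features) array; labels: (n_samples,) class ids.
    Returns a (n_features, n_dims) matrix whose columns are the
    discriminant directions.
    """
    overall_mean = features.mean(axis=0)
    n_feat = features.shape[1]
    Sw = np.zeros((n_feat, n_feat))   # within-class scatter
    Sb = np.zeros((n_feat, n_feat))   # between-class scatter
    for c in np.unique(labels):
        X = features[labels == c]
        mean_c = X.mean(axis=0)
        Sw += (X - mean_c).T @ (X - mean_c)
        diff = (mean_c - overall_mean)[:, None]
        Sb += X.shape[0] * (diff @ diff.T)
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1][:n_dims]
    return eigvecs[:, order].real
```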
APA, Harvard, Vancouver, ISO und andere Zitierweisen
29

Sylvestre, Benoit. „Time-scale modification of speech : a time-frequency approach“. Thesis, McGill University, 1991. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=60496.

Der volle Inhalt der Quelle
Annotation:
Time-scale modification (TSM) is a process whereby signals are compressed or expanded in time in a manner which preserves their original frequency characteristics. This work explores TSM algorithms for sampled speech. A known approach (2), which is based on the short-time Fourier transform (STFT), is first reviewed and then modified to provide high-quality TSM of speech signals at a lower computational cost. The proposed algorithm resembles the sinusoidal speech model (SSM) based approach (9), yet incorporates new phase compensatory measures to prevent excessive structural deterioration of the time-scaled signal. In addition, a novel incremental scheme for modifying polar parameters results in substantial computational savings.
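For orientation, the following bare-bones phase-vocoder sketch shows the STFT-based TSM idea the abstract starts from: keep each frame's magnitude and advance the phase consistently with a different synthesis hop. It is not the thesis algorithm (which adds phase-compensation measures and an incremental polar update); the window, hop sizes and the omission of window-overlap gain normalization are simplifying assumptions.

```python
import numpy as np

def phase_vocoder_tsm(x, alpha, frame_len=1024, hop_a=256):
    """Time-scale x by factor alpha (alpha > 1 lengthens) with a basic
    phase vocoder: keep STFT magnitudes and accumulate phase so that the
    estimated instantaneous frequency is preserved at the new hop size.
    Output gain is not normalized by the window overlap (sketch only)."""
    hop_s = int(round(hop_a * alpha))
    window = np.hanning(frame_len)
    n_bins = frame_len // 2 + 1
    omega = 2.0 * np.pi * np.arange(n_bins) * hop_a / frame_len  # expected phase advance

    n_frames = max(0, (len(x) - frame_len) // hop_a)
    y = np.zeros(n_frames * hop_s + frame_len)
    phase_acc = np.zeros(n_bins)
    prev_phase = np.zeros(n_bins)

    for m in range(n_frames):
        frame = x[m * hop_a: m * hop_a + frame_len] * window
        spec = np.fft.rfft(frame)
        mag, phase = np.abs(spec), np.angle(spec)
        if m == 0:
            phase_acc = phase.copy()
        else:
            delta = phase - prev_phase - omega           # heterodyned increment
            delta = (delta + np.pi) % (2.0 * np.pi) - np.pi
            phase_acc += (omega + delta) * (hop_s / hop_a)
        prev_phase = phase
        out = np.fft.irfft(mag * np.exp(1j * phase_acc), n=frame_len)
        y[m * hop_s: m * hop_s + frame_len] += out * window
    return y
```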
APA, Harvard, Vancouver, ISO und andere Zitierweisen
30

Foodeei, Majid. „Low-delay speech coding at 16 kbs and below“. Thesis, McGill University, 1991. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=60717.

Der volle Inhalt der Quelle
Annotation:
Development of network quality speech coders at 16 kb/s and below is an active research area. This thesis focuses on the study of low-delay Code Excited Linear Predictive (CELP) and tree coders. A 16 kb/s stochastic tree coder based on the (M,L) search algorithm suggested by Iyengar and Kabal and a low-delay CELP coder proposed by AT&T (CCITT 16 kb/s standardization candidate) are examined. The first goal is to compare and study the performance of the two coders. The second objective is to analyze the particular characteristics which make the two coders different from one another. The final goal is the improvement of the performance of the coders, particularly with a view to bringing the bit rate down below 16 kb/s.
When compared under similar conditions, the two coders showed comparable performance at 16 kb/s. Issues in backward adaptive linear prediction analysis for both near- and far-sample redundancy removal, such as analysis methods, windowing, ill-conditioning, quantization noise effects and computational complexity, are studied.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
31

Nguỹên, Bao 1962. „The hidden filter model : applications for automatic speech processing“. Thesis, McGill University, 1991. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=60588.

Der volle Inhalt der Quelle
Annotation:
This thesis examines hidden Markov filter models and their applications in speech segmentation. A method of segmenting the speech waveform is proposed. This method uses the Baum-Welch reestimation algorithm applied to the hidden filter models. Since speech signals are handled at the sample level, the amount of computation needed is very large. We will show how this issue can be dealt with effectively by using a staircase approach in the trellis calculations.
The hidden Markov filters are used to segment speech signals. Test results show very consistent locations of phone boundaries. The hidden filter model fits vocalic segments very well (with normalized prediction errors of less than 0.01), but performs less well on consonants (with normalized prediction errors of up to 0.3).
The speech segmentation by hidden filters is applied to a large vocabulary speaker dependent isolated-word recognizer at the preprocessing stage. The performance of the recognizer with and without the preprocessor is compared. The results show small improvements in recognition accuracy.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
32

Vakil, Sam. „Gaussian mixture model based coding of speech and audio“. Thesis, McGill University, 2004. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=81575.

Der volle Inhalt der Quelle
Annotation:
The transmission of speech and audio over communication channels has always required speech and audio coders with reasonable search and computational complexity and good performance relative to the corresponding distortion measure.
This work introduces a coding scheme which works in a perceptual auditory domain. The input high-dimensional frames of audio and speech are transformed to the power spectral domain using either the DFT or the MDCT. The log-spectral vectors are then transformed to the excitation domain. In the quantizer section the vectors are DCT transformed and decorrelated. This operation makes it possible to use diagonal covariances in modelling the data. Finally, a GMM-based VQ is performed on the vectors.
In the decoder the inverse operations are performed. However, in order to prevent negative power spectrum elements arising from the inverse perceptual transformation in the decoder, a Nonnegative Least Squares algorithm is used instead of direct inversion to switch back to the frequency domain. For the sake of comparison, a reference subband-based "Excitation Distortion coder" was implemented; comparing the resulting coded files showed better performance for the proposed GMM-based coder.
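To make the modelling step concrete, the minimal sketch below DCT-decorrelates a matrix of log-spectral vectors and fits a diagonal-covariance GMM, which is the statistical model a GMM-based vector quantizer is built on. The use of SciPy/scikit-learn and the parameter values are illustrative assumptions, not the thesis implementation.

```python
import numpy as np
from scipy.fftpack import dct
from sklearn.mixture import GaussianMixture

def fit_diagonal_gmm(log_spectra, n_components=16):
    """DCT-decorrelate log-spectral vectors, then fit a diagonal-covariance GMM.

    log_spectra: (n_frames, n_bins) array of log-spectral vectors.
    Decorrelation is what justifies the diagonal covariance assumption
    mentioned in the abstract.
    """
    decorrelated = dct(log_spectra, type=2, norm='ortho', axis=1)
    gmm = GaussianMixture(n_components=n_components, covariance_type='diag')
    gmm.fit(decorrelated)
    return gmm, decorrelated
```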
APA, Harvard, Vancouver, ISO und andere Zitierweisen
33

Pereira, Wesley. „Modifying LPC parameter dynamics to improve speech coder efficiency“. Thesis, McGill University, 2001. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=32970.

Der volle Inhalt der Quelle
Annotation:
Reducing the transmission bandwidth and achieving higher speech quality are primary concerns in developing new speech coding algorithms. The goal of this thesis is to improve the perceptual speech quality of algorithms that employ linear predictive coding (LPC). Most LPC-based speech coders extract parameters representing an all-pole filter. This LPC analysis is performed on each block or frame of speech. To smooth out the evolution of the LPC tracks, each block is divided into subframes for which the LPC parameters are interpolated. This improves the perceptual quality without additional transmission bit rate. A method of modifying the interpolation endpoints to improve the spectral match over all the subframes is introduced. The spectral distortion and the weighted Euclidean LSF (Line Spectral Frequencies) distance are used as objective measures of the performance of this warping method. The algorithm has been integrated into a floating-point C version of the Adaptive Multi-Rate (AMR) speech coder, and these results are presented.
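The subframe interpolation the abstract refers to is usually carried out on LSF vectors, which remain ordered (and hence yield stable filters) under convex combination. A minimal sketch of linear interpolation between the quantized LSF endpoints of consecutive frames follows; the number of subframes is an assumed parameter and the code is not specific to the AMR implementation mentioned above.

```python
import numpy as np

def interpolate_lsf(lsf_prev, lsf_curr, n_subframes=4):
    """Linearly interpolate LSF vectors across subframes.

    lsf_prev, lsf_curr: quantized LSF vectors at the previous and current
    frame endpoints. Returns one LSF vector per subframe; because LSFs
    stay ordered under convex combination, each interpolated set still
    corresponds to a stable synthesis filter.
    """
    lsf_prev = np.asarray(lsf_prev, dtype=float)
    lsf_curr = np.asarray(lsf_curr, dtype=float)
    weights = np.arange(1, n_subframes + 1) / n_subframes
    return [(1.0 - w) * lsf_prev + w * lsf_curr for w in weights]
```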
APA, Harvard, Vancouver, ISO und andere Zitierweisen
34

Konaté, Cheick Mohamed. „Enhancing speech coder quality: improved noise estimation for postfilters“. Thesis, McGill University, 2011. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=104578.

Der volle Inhalt der Quelle
Annotation:
ITU-T G.711.1 is a multirate wideband extension of the well-known ITU-T G.711 pulse code modulation of voice frequencies. The extended system is fully interoperable with the legacy narrowband one. In the case where the legacy G.711 is used to code a speech signal and G.711.1 is used to decode it, quantization noise may be audible. For this situation, the standard proposes an optional postfilter. The application of postfiltering requires an estimate of the quantization noise. The more accurate the estimate of the quantization noise is, the better the performance of the postfilter can be. In this thesis, we propose an improved noise estimator for the postfilter proposed for the G.711.1 codec and assess its performance. The proposed estimator provides a more accurate estimate of the noise with the same computational complexity.
ITU-T G.711.1 est une extension multi-débit pour signaux à large-bande de la très répandue norme de compression audio UIT-T G.711. Cette extension est interopérable avec sa version initiale à bande étroite. Lorsque l'ancienne version G.711 est employée pour coder un signal vocal et que G.711.1 est utilisé pour le décoder, le bruit de quantification peut être entendu. Pour ce cas, la norme propose un post-filtre optionnel. Le post-filtre nécessite l'estimation du bruit de quantification. La précision de l'estimation du bruit de quantification va jouer sur la performance du post-filtre. Dans cette thèse, nous proposons un meilleur estimateur du bruit de quantification pour le post-filtre proposé pour le codec G.711.1 et nous évaluons ses performances. L'estimateur que nous proposons donne une estimation plus précise du bruit de quantification avec la même complexité.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
35

Zabawskyj, Bohdan Konstantyn. „On the use of vector quantization on speech enhancement“. Thesis, McGill University, 1993. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=68060.

Der volle Inhalt der Quelle
Annotation:
This thesis will examine a Vector Quantization-based system for speech enhancement. Key areas in this study will include the optimum size for the vector quantizer library and the distance measures used to index the vector quantizer library. In addition, the robustness of the overall enhancement process as a function of the vector quantizer training sequence (e.g., the number of speakers and the number of dissimilar phrases) will be explored. As speech enhancement is a diverse field, several other contemporary speech enhancement techniques will initially be examined in order to place the results of this study in a comparative light.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
36

Papacostantinou, Costantinos. „Improved pitch modelling for low bit-rate speech coders“. Thesis, National Library of Canada = Bibliothèque nationale du Canada, 1997. http://www.collectionscanada.ca/obj/s4/f2/dsk2/ftp01/MQ37279.pdf.

Der volle Inhalt der Quelle
APA, Harvard, Vancouver, ISO und andere Zitierweisen
37

Jabloun, Firas. „Perceptual and Multi-Microphone Signal Subspace Techniques for Speech Enhancement“. Thesis, McGill University, 2004. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=95577.

Der volle Inhalt der Quelle
Annotation:
The performance of speech communication systems, such as hands-free telephony, is known to degrade seriously under adverse acoustic environments. The presence of noise can lead to a loss of intelligibility as well as to listener fatigue. These problems can make the existing systems unsatisfactory to customers, especially since the offered services usually put no restrictions on where they can actually be used. For this reason, speech enhancement is vital for the overall success of these systems on the market. In this thesis we present new speech enhancement techniques based on the signal subspace approach. In this approach the input speech vectors are projected onto the signal subspace, where they are processed to suppress any remaining noise, and are then reconstructed in the time domain. The projection is obtained via the eigenvalue decomposition of the speech signal covariance matrix. The main problem with signal subspace based methods is the expensive eigenvalue decomposition. In this thesis we present a simple solution to this problem in which the signal subspace filter is updated at a reduced rate, resulting in a significant reduction in the computational load. This technique exploits the stationarity of the input speech signal within a frame of 20-30 msec to use the same eigenvalue decomposition for several input vectors. The original implementation scheme was to update the signal subspace filter for every such input vector. The proposed technique was experimentally found to offer significant computational savings with almost no performance side effects. The second contribution of this thesis is the incorporation of the human hearing properties in the signal subspace approach using a sophisticated masking model. It is known that there is a tradeoff between the amount of noise reduction achieved and the resulting signal distortion. Therefore, it would be beneficial to avoid suppressing any noise components as long as they are not perceived by the
Il est connu que la performance des systèmes de communication par la voix se détériore lorsqu'ils sont utilisés dans des environnements acoustiques peu favorables. En effet, la présence du bruit cause la perte de l'intelligibilité et engendre la fatigue chez les auditeurs. Ces problèmes peuvent rendre les systèmes existant sur le marché inintressants pour les clients surtout que les services offerts par les compagnies de télécommunication ne comportent aucune restriction sur les endroits où ils seront utilisés. Dans ce contexte, les algorithmes qui visent à améliorer la qualité du signal parole sont très importants du fait qu'ils permettent à ces systèmes de satisfaire les attentes du marché. Dans cette thèse, nous présentons des nouvelles techniques, visant à rehausser la qualité de la voix, qui sont basées sur l'approche de sous-espace du signal (SES). Selon cette approche, les vecteurs du signal sont projetés sur le sous-espace du signal où ils sont traités afin d'éliminer le bruit restant. Après ce traitement, les vecteurs seront reconstruits dans le domaine du temps. La projection est obtenue grâce à la décomposition en valeurs propres de la matrice de covariance du signal parole. Le problème avec l'approche SES est que le coût, en terme de temps de calcul, relié à la décomposition en valeurs propres est élevé. Dans cette thèse, nous proposons une technique simple pour résoudre ce problème. Cette technique réduit considérablement le temps de calcul car le filtre en sous-espace est mis à jour moins fréquemment. Initialement, l'implémentation de l'approche SES consistait à recalculer un nouveau filtre pour chaque vecteur. L'originalité de notre technique réside dans l'exploitation de la stationnarité du signal parole dans un intervalle de 20-30 msec afin d'utiliser la même décomposition en valeurs propres pour plusieurs vecteurs. Les expériences menées montrent que notre nouvelle technique réduit consid
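The computational saving described in this abstract comes from reusing one eigenvalue decomposition for several consecutive vectors within a quasi-stationary frame. The sketch below illustrates that idea with a simple eigendomain gain rule under an assumed white-noise variance; the gain rule, block size and update rate are illustrative choices, not the thesis implementation.

```python
import numpy as np

def subspace_enhance(frames, noise_var, update_every=8):
    """Eigen-domain enhancement with a reduced filter-update rate.

    frames: sequence of equal-length noisy speech vectors.
    noise_var: assumed white-noise variance (scalar).
    The eigendecomposition (the expensive step) is recomputed only every
    `update_every` vectors; in between, the same eigenvectors and gains
    are reused, exploiting short-term stationarity.
    """
    enhanced = []
    V, gains = None, None
    for i, x in enumerate(frames):
        if i % update_every == 0:
            block = np.array(frames[i:i + update_every])
            R = block.T @ block / block.shape[0]      # sample covariance
            eigvals, V = np.linalg.eigh(R)
            clean_vals = np.maximum(eigvals - noise_var, 0.0)
            gains = clean_vals / np.maximum(eigvals, 1e-12)
        coeffs = V.T @ x                # project onto the eigenvector basis
        enhanced.append(V @ (gains * coeffs))
    return enhanced
```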
APA, Harvard, Vancouver, ISO und andere Zitierweisen
38

El-Maleh, Khaled Helmi. „Classification-based techniques for digital coding of speech-plus-noise“. Thesis, McGill University, 2004. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=84239.

Der volle Inhalt der Quelle
Annotation:
With the increasing demand for wireless voice services and limited bandwidth resources, it is critical to develop and implement coding techniques which use spectrum efficiently. One approach to increasing system capacity is to lower the bit rate of telephone speech. A typical telephone conversation contains approximately 40% speech and 60% silence or background acoustic noise. A reduction of the average coding rate can be achieved by using a Voice Activity Detection (VAD) unit to distinguish speech from silence or background noise. The VAD decision can be used to select different coding modes for speech and noise or to discontinue transmission during speech pauses.
The quality of a telephone conversation using a VAD-based coding system depends on three major modules: the speech coder, the noise coder, and the VAD. Existing schemes for reduced-rate coding of background noise produce a signal that sounds different from the noise at the transmitting side. The frequent changes of the noise character between that produced during talk spurts (noise coded along with the speech) and that produced during speech pauses (noise coded at a reduced rate) are noticeable and can be annoying to the user.
The objective of this thesis is to develop techniques that enhance the output quality of variable-rate and discontinuous-transmission speech coding systems operating in noisy acoustic environments during the pauses between speech bursts. We propose novel excitation models for natural-quality reduced-rate coding of background acoustic noise in voice communication systems. A better representation of the excitation signal in a noise-synthesis model is achieved by classifying the type of acoustic environment noise. Class-dependent residual substitution is used at the receive side to synthesize a background noise that sounds similar to the background noise at the transmit side. The improvement in the quality of synthesized noise during speech gaps helps in preserving noise continuity between talk spurts and speech pauses, and enhances the overall perceived quality of a conversation.
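The VAD module on which such variable-rate systems hinge can be as simple as a smoothed-energy threshold. The sketch below is a generic illustration of that baseline, not the classification-based noise-coding scheme proposed in the thesis; the frame length, smoothing constant and margin are assumed parameters.

```python
import numpy as np

def energy_vad(signal, frame_len=160, alpha=0.95, margin_db=6.0):
    """Very simple energy-based voice activity detector.

    Tracks a slowly adapting noise-floor estimate and flags a frame as
    speech when its energy exceeds the floor by `margin_db`.
    Returns one boolean per frame (True = speech).
    """
    decisions = []
    noise_floor = None
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energy_db = 10.0 * np.log10(np.mean(frame ** 2) + 1e-12)
        if noise_floor is None:
            noise_floor = energy_db
        is_speech = energy_db > noise_floor + margin_db
        if not is_speech:   # update the noise floor only during pauses
            noise_floor = alpha * noise_floor + (1.0 - alpha) * energy_db
        decisions.append(is_speech)
    return decisions
```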
APA, Harvard, Vancouver, ISO und andere Zitierweisen
39

Khan, Mohammad M. A. „Coding of excitation signals in a waveform interpolation speech coder“. Thesis, McGill University, 2001. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=32961.

Der volle Inhalt der Quelle
Annotation:
The goal of this thesis is to improve the quality of the Waveform Interpolation (WI) coded speech at 4.25 kbps. The quality improvement is focused on the efficient coding scheme of voiced speech segments, while keeping the basic coding format intact. In the WI paradigm voiced speech is modelled as a concatenation of the Slowly Evolving pitch-cycle Waveforms (SEW). Vector quantization is the optimal approach to encode the SEW magnitude at low bit rates, but its complexity imposes a formidable barrier.
Product code vector quantizers (PC-VQ) are a family of structured VQs that circumvent the complexity obstacle. The performance of product code VQs can be traded off against their storage and encoding complexity. This thesis introduces split/shape-gain VQ, a hybrid product code VQ, as an approach to quantizing the SEW magnitude. The amplitude spectrum of the SEW is split into three non-overlapping subbands. The gains of the three subbands form the gain vector, which is quantized using the conventional Generalized Lloyd Algorithm (GLA). Each shape vector, obtained by normalizing each subband by its corresponding coded gain, is quantized using a dimension conversion VQ along with a perceptually based bit allocation strategy and a perceptually weighted distortion measure. At the receiver, the discontinuity of the gain contour at the subband boundaries introduces buzziness in the reconstructed speech. This problem is tackled by smoothing the gain-versus-frequency contour using a piecewise monotonic cubic interpolant. Simulation results indicate that the new method improves speech quality significantly.
The necessity of SEW phase information in the WI coder is also investigated in this thesis. Informal subjective test results demonstrate that transmitting the SEW magnitude encoded by split/shape-gain VQ, together with a fixed phase spectrum drawn from a voiced segment of a high-pitched male speaker, obviates the need to send phase information.
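A shape-gain VQ of the kind used for each SEW subband quantizes a vector's gain and its gain-normalized shape separately, which keeps the codebooks small. The generic sketch below assumes toy codebooks as inputs (scalar gain codewords and unit-norm shape codewords); the thesis additionally splits the spectrum into three subbands and applies perceptual weighting.

```python
import numpy as np

def shape_gain_vq(vector, shape_codebook, gain_codebook):
    """Quantize `vector` as a (gain index, shape index) pair.

    shape_codebook: (K, D) array of unit-norm shape codewords.
    gain_codebook:  (G,) array of scalar gain codewords.
    The gain is the vector norm; the shape is the unit-norm direction.
    Both are matched to their own codebooks by nearest neighbour, which
    is the basic structure of a shape-gain product code VQ.
    """
    gain = np.linalg.norm(vector)
    shape = vector / max(gain, 1e-12)
    g_idx = int(np.argmin(np.abs(gain_codebook - gain)))
    # for unit-norm shapes, maximizing the inner product = minimizing distance
    s_idx = int(np.argmax(shape_codebook @ shape))
    reconstruction = gain_codebook[g_idx] * shape_codebook[s_idx]
    return g_idx, s_idx, reconstruction
```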
APA, Harvard, Vancouver, ISO und andere Zitierweisen
40

Thiemann, Joachim. „Acoustic noise suppression for speech signals using auditory masking effects“. Thesis, McGill University, 2001. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=31073.

Der volle Inhalt der Quelle
Annotation:
The process of suppressing acoustic noise in audio signals, and speech signals in particular, can be improved by exploiting the masking properties of the human hearing system. These masking properties, where strong sounds make weaker sounds inaudible, are calculated using auditory models. This thesis examines both traditional noise suppression algorithms and ones that incorporate an auditory model to achieve better performance. The different auditory models used by these algorithms are examined. A novel approach, based on a method to remove a specific type of noise from audio signals, is presented using a standardized auditory model. The proposed method is evaluated with respect to other noise suppression methods in the problem of speech enhancement. It is shown that this method performs well in suppressing noise in telephone-bandwidth speech, even at low Signal-to-Noise Ratios.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
41

Khan, Abdul Hannan. „Tree encoding in the ITU-T G.711.1 speech coder“. Thesis, McGill University, 2011. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=97215.

Der volle Inhalt der Quelle
Annotation:
This thesis examines further enhancements to the ITU-T G.711.1 speech coder. The original G.711 coder is effectively a lower-band μ-law quantizer. The G.711.1 extension adds noise feedback and a lower-band enhancement layer in addition to the higher band. To further improve the core lower-band coding performance, the use of both vector quantization and a delayed-decision multi-path tree encoder in the lower-band portion of the above coder is studied. The delayed-decision multi-path tree encoding is implemented by the (M,L) algorithm. The new quantizer takes past history into account, and hence the error propagation due to noise feedback, and codes multiple samples under μ-law. The final bitstream is compatible with the G.711.1 decoder and, hence, with the original G.711 decoder. An evaluation method, ITU-T P.862 perceptual evaluation of speech quality (PESQ), is used to evaluate the performance. Both the vector quantizer and the tree encoder perform better than the original core layer encoder in terms of perceptual quality, though they are limited by their increased computational complexity. Future studies are suggested.
Cette thèse étudie en détail les améliorations apportées au codeur de la parole ITU-T G.711.1. Le codeur original G.711 est en fait un quantificateur μ-law. Le prolongement large-bande G.711.1 utilise le façonnage du bruit ainsi qu'une couche d'amélioration de la bande-basse en plus de la bande-haute. Afin d'améliorer le codage de la bande-basse principale, nous étudions l'utilisation de quantification vectorielle et la décision à retardement. Le codeur arboriforme avec décision à retardée est réalisé par l'algorithme(M,L). Le nouveau quantificateur considère l'information passée et par conséquent, il considère également la propagation de l'erreur engendrée par le façonnage du bruit. Il code plusieurs échantillons par μ-law. Le flot binaire final est compatible avec le décodeur du prolongement large-bande G.711.1 et donc naturellement avec le décodeur du G.711 original. Une méthode d'évaluation, ITU-T P.862 (PESQ) est utilisée pour évaluer la performance. Les résultats montrent que la quantification vectorielle et le codeur arboriforme sont perceptuellement plus performants que le codeur original de la bande principale. Nous notons tout de même qu'ils sont numériquement plus complexes à réaliser. Des études supplémentaires sont suggérées.
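The delayed-decision (M,L) search keeps only the M best partial paths through the code tree at each step and commits a symbol after a delay of L samples. The sketch below is a generic beam-search illustration under an assumed squared-error branch cost, not the G.711.1-compatible implementation described above.

```python
def ml_tree_search(targets, branch_values, M=8, L=16):
    """Generic (M,L)-style delayed-decision search (illustrative).

    targets: samples to be coded.
    branch_values: the reconstruction value attached to each tree branch.
    At every sample each surviving path is extended by all branches and
    only the M lowest-cost paths are kept; after a delay of L samples the
    oldest symbol of the current best path is committed and disagreeing
    paths are dropped. Returns the best branch sequence and its cost.
    """
    paths = [([], 0.0)]     # (branch index sequence, accumulated squared error)
    for t, x in enumerate(targets):
        extended = [(seq + [b], cost + (x - value) ** 2)
                    for seq, cost in paths
                    for b, value in enumerate(branch_values)]
        extended.sort(key=lambda p: p[1])
        paths = extended[:M]                 # keep the M best partial paths
        if t >= L:                           # delayed decision after L samples
            committed = paths[0][0][t - L]
            paths = [p for p in paths if p[0][t - L] == committed]
    return paths[0][0], paths[0][1]
```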
APA, Harvard, Vancouver, ISO und andere Zitierweisen
42

Montminy, Christian. „A study of speech compression algorithms for Voice over IP“. Thesis, National Library of Canada = Bibliothèque nationale du Canada, 2000. http://www.collectionscanada.ca/obj/s4/f2/dsk1/tape4/PQDD_0017/MQ57147.pdf.

Der volle Inhalt der Quelle
APA, Harvard, Vancouver, ISO und andere Zitierweisen
43

Szymanski, Lech. „Comb filter decomposition feature extraction for robust automatic speech recognition“. Thesis, University of Ottawa (Canada), 2005. http://hdl.handle.net/10393/27051.

Der volle Inhalt der Quelle
Annotation:
This thesis discusses the issues of Automatic Speech Recognition in the presence of additive white noise. Comb Filter Decomposition (CFD), a new method for approximating the magnitude of the speech spectrum in terms of its harmonics, is proposed. Three feature extraction methods based on CFD coefficients are introduced. The performance of the method and the resulting features is evaluated using simulated recognition systems with Hidden Markov Model classifiers under conditions of additive white noise at varying Signal-to-Noise Ratios. The results are compared with the performance of existing robust feature extraction methods. The results show that the proposed method has good potential for Automatic Speech Recognition under noisy conditions.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
44

Li, Lian. „The design and implementation of a real-time multimedia synchronization control system over high-speed communications networks“. Thesis, University of Ottawa (Canada), 1994. http://hdl.handle.net/10393/6738.

Der volle Inhalt der Quelle
Annotation:
Synchronization is considered a key issue in distributed multimedia systems. In a real-time multimedia presentation, data objects of different media types or coding formats are delivered from distributed media-storing servers to the remote client simultaneously over high-speed networks. The multiple streams need to be synchronized so that the multimedia document can be presented in the way specified by its creator. The synchronization research involves issues such as temporal relationship modeling, extending network protocols, and supporting the implementation of applications where the synchronization control mechanisms integrate with other system functionality, such as ATM network transmission, video coding/decoding, and distributed database management. In this thesis, we investigate a software synchronization control system for a target presentational application, i.e., a Multimedia News-on-demand service. Relying on the Quality of Service (QoS) supported by the ATM-based virtual connections, the system prevents major multi-stream mismatches through a delivery scheduling operation. Moreover, the synchronization errors caused by the inevitable network delay variations are recovered through a Stream Synchronization Protocol (SSP) in order to preserve the presentation quality. We apply the Time Flow Graph (TFG) to model the temporal relationships among the media components so that the scheduling and recovery operations can be efficient. Synchronization QoS parameters are employed in the SSP control. In addition, the differences between the characterization of coded and uncoded data streams are taken into account. We present a priority-based synchronization control for coded data, e.g., the MPEG-2 video stream. For the implementation of such a control system, we elaborate a set of data structure specifications and algorithms. As well, we develop the software modules that implement the synchronization control prototype.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
45

Cardinal, Patrick. „Finite-state transducers and speech recognition“. Thesis, McGill University, 2003. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=78335.

Der volle Inhalt der Quelle
Annotation:
Finite-state automata and finite-state transducers have been extensively studied over the years. Recently, the theory of transducers has been generalized by Mohri for the weighted case. This generalization has allowed the use of finite-state transducers in a large variety of applications such as speech recognition. In this work, most of the algorithms for performing operations on weighted finite-state transducers are described in detail and analyzed. Then, an example of their use is given via a description of a speech recognition system based on them.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
46

Safavi, Saeid. „Speaker characterization using adult and children's speech“. Thesis, University of Birmingham, 2015. http://etheses.bham.ac.uk//id/eprint/6029/.

Der volle Inhalt der Quelle
Annotation:
Speech signals contain important information about a speaker, such as age, gender, language, accent, and emotional/psychological state. Automatic recognition of these types of characteristics has a wide range of commercial, medical and forensic applications, such as interactive voice response systems, service customization, natural human-machine interaction, recognizing the type of pathology of speakers, and directing the forensic investigation process. Many such applications depend on reliable systems using short speech segments without regard to the spoken text (text-independent). All these applications are also applicable using children's speech. This research aims to develop accurate methods and tools to identify different characteristics of speakers. Our experiments cover speaker recognition, gender recognition, age-group classification, and accent identification. However, similar approaches and techniques can be applied to identify other characteristics such as emotional/psychological state. The main focus of this research is on detecting these characteristics from children's speech, which has previously been reported to be more challenging than adult speech. Furthermore, the impact of different frequency bands on the performance of several recognition systems is studied, and the performance obtained using children's speech is compared with the corresponding results from experiments using adults' speech. Speaker characterization is performed by fitting a probability density function to acoustic features extracted from the speech signals. Since the distribution of acoustic features is complex, Gaussian mixture models (GMM) are applied. Due to the lack of data, parametric model adaptation methods have been applied to adapt the universal background model (UBM) to the characteristics of utterances. An effective approach involves adapting the UBM to speech signals using the Maximum-A-Posteriori (MAP) scheme. The Gaussian means of the adapted GMM are then concatenated to form a Gaussian mean supervector for a given utterance. Finally, a classification or regression algorithm is used to identify the speaker characteristics. While effective, Gaussian mean supervectors are of high dimensionality, resulting in high computational cost and difficulty in obtaining a robust model in the context of limited data. In the field of speaker recognition, recent advances using the i-vector framework have increased the classification accuracy. This framework, which provides a compact representation of an utterance in the form of a low-dimensional feature vector, applies a simple factor analysis on GMM means.
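The GMM-UBM step described above adapts only the UBM means to an utterance and stacks them into a supervector. The sketch below shows the standard relevance-MAP mean adaptation rule for diagonal-covariance GMMs; the relevance factor and array shapes are assumed conventions, not details taken from the thesis.

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_weights, ubm_vars, features, relevance=16.0):
    """Relevance-MAP adaptation of GMM means (diagonal covariances).

    ubm_means: (C, D), ubm_vars: (C, D), ubm_weights: (C,)
    features:  (N, D) acoustic feature vectors of one utterance.
    Returns the adapted means; concatenating them gives the Gaussian
    mean supervector mentioned in the abstract.
    """
    C, D = ubm_means.shape
    inv_vars = 1.0 / ubm_vars
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(ubm_vars), axis=1))
    diff = features[:, None, :] - ubm_means[None, :, :]          # (N, C, D)
    log_lik = log_norm + np.log(ubm_weights) \
        - 0.5 * np.sum(diff ** 2 * inv_vars, axis=2)             # (N, C)
    log_lik -= log_lik.max(axis=1, keepdims=True)
    post = np.exp(log_lik)
    post /= post.sum(axis=1, keepdims=True)                      # responsibilities

    n_c = post.sum(axis=0)                                       # soft counts (C,)
    first_moment = post.T @ features                             # (C, D)
    e_x = first_moment / np.maximum(n_c[:, None], 1e-12)
    alpha = (n_c / (n_c + relevance))[:, None]                   # adaptation weight
    return alpha * e_x + (1.0 - alpha) * ubm_means
```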
APA, Harvard, Vancouver, ISO und andere Zitierweisen
47

Plourde, Eric. „Bayesian short-time spectral amplitude estimators for single-channel speech enhancement“. Thesis, McGill University, 2009. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=66864.

Der volle Inhalt der Quelle
Annotation:
Single-channel speech enhancement algorithms are used to remove background noise in speech. They are present in many common devices such as cell phones and hearing aids. In the Bayesian short-time spectral amplitude (STSA) approach to speech enhancement, an estimate of the clean speech STSA is derived by minimizing the statistical expectation of a chosen cost function. Examples of such estimators are the minimum mean square error (MMSE) STSA, the β-order MMSE STSA (β-SA), which includes a power law parameter, and the weighted Euclidean (WE), which includes a weighting parameter. This thesis analyzes single-channel Bayesian STSA estimators for speech enhancement with the aim of, firstly, gaining a better understanding of their properties and, secondly, proposing new cost functions and statistical models to improve their performance. In addition to a novel analysis of the β-SA estimator for parameter β ≤ 0, three new families of estimators are developed in this thesis: the Weighted β-SA (Wβ-SA), the Generalized Weighted family of STSA estimators (GWSA) and a family of multi-dimensional Bayesian STSA estimators. The Wβ-SA combines the power law of the β-SA and the weighting factor of the WE. Its parameters are chosen based on the characteristics of the human auditory system, which is found to have the advantage of improving the noise reduction at high frequencies while limiting the speech distortion at low frequencies. An analytical generalization of a cost function structure found in many existing Bayesian STSA estimators is proposed through the GWSA family of estimators. This allows a unification of Bayesian STSA estimators and, moreover, provides a better understanding of this general class of estimators. Finally, we propose a multi-dimensional family of estimators that accounts for the correlated frequency components in a digitized speech signal. In fact, the spectral components of the clean
Les algorithmes de rehaussement de la parole à voie unique sont utilisés afin de réduire le bruit de fond d'un signal de parole bruité. Ils sont présents dans plusieurs appareils tels que les téléphones sans fil et les prothèses auditives. Dans l'approche bayésienne d'estimation de l'amplitude spectrale locale (Short-Time Spectral Amplitude - STSA) pour le rehaussement de la parole, un estimé de la STSA non bruitée est déterminé en minimisant l'espérance statistique d'une fonction de coût. Ce type d'estimateurs incluent le MMSE STSA, le β-SA, qui intègre un exposant comme paramètre de la fonction de coût, et le WE, qui possède un paramètre de pondération.Cette thèse étudie les estimateurs bayésiens du STSA avec pour objectifs d'approfondir la compréhension de leurs propriétés et de proposer de nouvelles fonctions de coût ainsi que de nouveaux modèles statistiques afin d'améliorer leurs performances. En plus d'une étude approfondie de l'estimateur β-SA pour les valeurs de β ≤ 0, trois nouvelles familles d'estimateur sont dévelopées dans cette thèse: le β-SA pondéré (Weighted β-SA - Wβ-SA), une famille d'estimateur du STSA généralisé et pondéré (Generalized Weighted STSA - GWSA) ainsi qu'une famille d'estimateur du STSA multi-dimensionnel.Le Wβ-SA combine l'exposant présent dans le β-SA et le paramètre de pondération du WE. Ses paramètres sont choisis en considérant certaines caractéristiques du système auditif humain ce qui a pour avantage d'améliorer la réduction du bruit de fond à hautes fréquences tout en limitant les distorsions de la parole à basses fréquences. Une généralisation de la structure commune des fonctions de coût de plusieurs estimateurs bayésiens du STSA est proposée à l'aide de la famille d'estimateur GWSA. Cette dernière permet une unification des estimateurs bayésiens du STSA et apporte une meilleure compréhensio
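For readers unfamiliar with the Bayesian STSA framework, the estimators compared above all minimize the conditional expectation of a cost function on the clean amplitude. The sketch below states the general form together with the cost functions usually associated with the MMSE STSA, β-SA and WE estimators in the literature; these are standard forms and not necessarily the exact parameterization used in the thesis.

```latex
% General Bayesian STSA estimator: choose the amplitude estimate that
% minimizes the expected cost given the noisy spectral observation Y_k.
\hat{A}_k \;=\; \arg\min_{\hat{A}} \; E\!\left[\, C\!\left(A_k,\hat{A}\right) \,\middle|\, Y_k \right]

% Commonly cited cost functions:
C_{\mathrm{MMSE}}(A,\hat{A})    = \bigl(A-\hat{A}\bigr)^{2}                  % MMSE STSA
C_{\beta\text{-SA}}(A,\hat{A})  = \bigl(A^{\beta}-\hat{A}^{\beta}\bigr)^{2}  % power-law parameter \beta
C_{\mathrm{WE}}(A,\hat{A})      = A^{\,p}\,\bigl(A-\hat{A}\bigr)^{2}         % weighting parameter p
```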
APA, Harvard, Vancouver, ISO und andere Zitierweisen
48

Reddy, Aarthi. „Speech based machine aided human translation for a document translation task“. Thesis, McGill University, 2012. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=107725.

Der volle Inhalt der Quelle
Annotation:
Translating documents into multiple languages represents an extremely large expense for businesses, governments, and international agencies. In Canada, for example, it is a requirement that all official documents exist in both official languages, French and English. This has produced a large translation industry employing a large number of skilled professional translators. It is well known that the standards posed on the quality of translations for business and government documents are far too high to apply existing automatic machine translation technology to the document translation task. A large number of tools for increasing the efficiency of human translators at various stages of their workflow have become commercially available to translation bureaus. These human translators may directly enter translated text, dictate their translations so they may be automatically transcribed, or post-edit first draft translations produced by an automatic machine translation system. The work in this thesis is concerned with a machine aided human translation (MAHT) scenario where a human translator dictates translations of a source language document. Automatic techniques are developed for improving the quality of the transcriptions obtained from these dictated translations by simultaneously incorporating knowledge from the source language text and the target language speech. The main contributions of this thesis are as follows. First, we describe novel algorithms that provide efficient and accurate transcriptions of dictations provided by the human translator. We show that by integrating information extracted from the source language document with statistical models used in the automatic speech recognition system, a more accurate transcription of the dictations can be obtained. Second, we use key information from the source language document like named entity tagged words and use acoustic, language and phonetic information to ensure that that information exists in the translated document as well. Third, we describe a system that is specific to document translation. The document translation task domain addressed here can be distinguished from tasks addressed in most previous MAHT research which has been focused on translating isolated sentences or phrases. Fourth, we created a new corpus, specifically for use in this thesis. This corpus was collected at McGill from professional translators dictating their translations and has been essential for characterizing the issues associated with the dictation-based MAHT task domain.
La traduction de documents dans plusieurs langues représente des coûts élevés pour les entreprises, les gouvernements et les firmes internationales. Au Canada par exemple, il est obligatoire que tous les documents officiels soient rédigés en Anglais et en Français. Cette politique a forcé l'industrie de traduction à embaucher un grand nombre de traducteurs professionnels. Il est de notoriété que les normes imposées pour la traduction de documents administratifs rendent la tâche des machines de traduction trop ardue. Un grand nombre d'outils sont commercialement disponibles pour améliorer l'efficacité des traducteurs humains à différents niveaux de leur travail. Les employés des bureaux de traduction peuvent saisir directement le texte traduit, dicter leur traduction afin qu'elle puisse être transcrite de façon authentique, ou bien corriger les premières versions fournies par les machines de traduction automatique. Le travail de cette thèse porte sur la traduction humaine assistée par ordinateur (MAHT), où un traducteur humain dicte une première traduction d'un document. Des algorithmes sont implémentés pour améliorer la qualité de traduction de la version dictée en intégrant simultanément des informations sur la langue source et sur la langue ciblée. Cette thèse contribue aux aspects suivants. Premièrement, elle présente de nouveaux algorithmes qui améliorent les traductions dictées. En intégrant les informations extraites du document de la langue source avec des modèles statistiques utilisés dans la reconnaissance vocale, de meilleures traductions sont obtenues. Deuxièmement, les informations clés, telles que les mots identifiés comme étant des entités nommées, sont recueillies par le document de la langue source grâce aux informations acoustiques, linguistiques, et phonétiques. De cette façon, on s'assure que ces mêmes informations se retrouvent dans le fichier traduit. Troisièmement, le système spécifique à la traduction de document est présenté et il se démarque du travail fait avec MAHT et CAT, où l'objectif est uniquement la traduction de phrases ou expressions. Finalement, nous avons créé un nouveau corpus dédié aux applications de cette thèse. Cet ensemble de documents a été collecté et estampé à l'Université McGill et a permis de mener les expériences à bien. Il met en évidence des obstacles qui n'ont pas encore été rencontrés durant les précédentes recherches dans ce domaine, comme l'utilisation de mots de remplissage, les répétitions, et autres erreurs commises par les traducteurs.
APA, Harvard, Vancouver, ISO und andere Zitierweisen
49

Moreno, Carlos 1965. „Variable frame size for vector quantization and application to speech coding“. Thesis, McGill University, 2005. http://digitool.Library.McGill.CA:80/R/?func=dbin-jump-full&object_id=99001.

Der volle Inhalt der Quelle
Annotation:
Vector Quantization (VQ) is a lossy data compression technique that is often applied in the field of speech communications. In VQ, a group of values or vector is replaced by the closest vector from a list of possible choices, the codebook. Compression is achieved by providing the index corresponding to the closest vector in the codebook, which in general can be represented with less data than the original vector.
In the case of VQ applied to speech signals, the input signal is divided into frames of a given length. Depending on the particular technique being used, the system either extracts a vector representation of the whole frame (usually some form of spectral representation), or applies some processing to the signal and uses the processed frame itself as the vector to be quantized. The two techniques are often combined, and the system uses VQ for the spectral representation of the frame and also for the processed frame.
A typical assumption in this scheme is that the frame size is fixed. This simplifies the scheme and thus reduces the computing-power requirements of a practical implementation.
In this study, we present a modification to this technique that allows for variable size frames, providing an additional degree of freedom for the optimization of the Data Compression process.
The quantization error is minimized by choosing the closest point in the codebook for the given frame. We now minimize this further by choosing the frame size that yields the lowest quantization error. Notice that the quantization error is a function of the given frame and the codebook; by considering different frame sizes, we get different actual frames that yield different quantization errors, allowing us to choose the optimal size and effectively providing a second level of optimization.
This idea has two caveats: we require additional data to represent the frame, since we have to indicate the size that was used. Also, the complexity of the system increases, since we have to try different frame sizes, requiring more computing power for a practical implementation of the scheme.
The results of this study show that this technique effectively improves the quality of the compressed signal at a given compression ratio, even if the improvement is not dramatic. Whether or not the increase in complexity is worth the quality improvement for a given application depends entirely on the design constraints for that particular application.
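The second level of optimization described in this abstract can be expressed as a joint search over frame sizes and codewords: for each candidate size, quantize the corresponding frame and keep the size/codeword pair with the smallest per-sample error. The sketch below is a generic illustration with per-size codebooks as assumed inputs, not the coder studied in the thesis.

```python
import numpy as np

def quantize_variable_frame(signal, start, codebooks):
    """Pick the frame size and codeword that minimize the quantization error.

    codebooks: dict mapping frame size -> (num_codewords, size) array.
    The error is normalized per sample so that different frame sizes can
    be compared; the chosen size is the side information that must also
    be transmitted, as noted in the abstract.
    """
    best = None
    for size, codebook in codebooks.items():
        frame = signal[start:start + size]
        if len(frame) < size:
            continue
        errors = np.sum((codebook - frame) ** 2, axis=1) / size
        idx = int(np.argmin(errors))
        if best is None or errors[idx] < best[0]:
            best = (errors[idx], size, idx)
    if best is None:
        raise ValueError("signal too short for any candidate frame size")
    err, size, idx = best
    return size, idx, err
```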
APA, Harvard, Vancouver, ISO und andere Zitierweisen
50

Agbago, Akakpo. „Investigating speed issues in acoustic-phonetic models for continuous speech recognition“. Thesis, University of Ottawa (Canada), 2004. http://hdl.handle.net/10393/26559.

Der volle Inhalt der Quelle
Annotation:
Automatic Speech Recognition applications face two challenges: accuracy and speed. For good accuracy, Dynamic Programming and Hidden Markov Model algorithms are widely used despite their heavy computational load. To address the speed problem, this thesis uses a Three-Stage Architecture (TSA) in which Stage 1 enhances the input speech signal and extracts features, Stage 2 performs acoustic-phonetic recognition and outputs strings of phonemes to Stage 3, which completes the recognition into valid words using HMMs on strings rather than on the utterances themselves. We designed two algorithms for Stage 2: Fast Two-Level Dynamic Programming (FTLDP), which is 20 times faster than a standard Two-Level DP, and ParrallelRecognizer, which performs 320 times faster than the standard Two-Level DP. Both algorithms are combined with a heuristic Cepstrum Gain Envelop Profile (CGEP) based silence detection to shorten the input speech, and with clustering to reduce the search space in the reference phonetic models.
APA, Harvard, Vancouver, ISO und andere Zitierweisen

Zur Bibliographie