Flexibility and speed are key features for a deep learning framework to allow fast transition from a research idea to prototyping and production code. We outline how to implement a unified framework for sequence processing that covers various kinds of models and applications. We will discuss our toolkit RETURNN as an example of such an implementation that is easy to apply and understand for...
Deep learning based human language technology (HLT), such as automatic speech recognition, intent and slot recognition, or dialog management, has become the mainstream of research in recent years and significantly outperforms conventional methods. However, deep learning models are notorious for being data and computation hungry. These downsides limit the application of such models from...
While smartphone assistants and smart speakers are prevalent and there are high expectations for social communicative robots, spoken language interaction with these kinds of robots is not effectively deployed. This tutorial aims to give an overview of the issues and challenges related to the integration of natural multimodal dialogue processing for social robots. We first outline dialogue...
A conversational information retrieval (CIR) system is an information retrieval (IR) system with a conversational interface which allows users to interact with the system to seek information via multi-turn conversations of natural language (in spoken or written form). This tutorial surveys recent advances in CIR, focusing on neural approaches that have been developed in the last few years. We...
Speaker diarization is an essential component for speech applications in multi-speaker settings. Spoken utterances need to be attributed to speaker-specific classes with or without prior knowledge of the speakers' identity or profile. Initially, speaker diarization technologies were developed as standalone processes without requiring much context of other components in a given speech...
This tutorial will provide an in-depth survey of the state of the art in spoken language processing in language learning and assessment from a practitioner’s perspective. The first part of the tutorial will discuss in detail the acoustic, speech, and language processing challenges in recognizing and dealing with native and non-native speech from both adults and children from different language...
In this paper, a dynamic three-dimensional (3D) head model is introduced, built on knowledge of human anatomy and the theory of distinctive features. The model is used to help Chinese learners understand intuitively the exact place and manner of phoneme articulation. Users can access the phonetic learning system, choose the target sound they want to learn and then watch...
In this paper, an app with mispronunciation detection and feedback for Mandarin L2 learners is presented. The app detects mispronunciations in words and highlights them in red at the phone level; a score is also shown to evaluate the overall pronunciation. When the highlighted phone is touched, the learner's pronunciation and the standard pronunciation are played. Then a flash animation that...
We present Catotron, a neural network-based open-source speech synthesis system in Catalan. Catotron consists of a sequence-to-sequence model trained with two small open-source datasets based on semi-spontaneous and read speech. We demonstrate how a neural TTS can be built for languages with limited resources using found-data optimization and cross-lingual transfer learning. We make the...
We present a computer-assisted language learning system that automatically evaluates the pronunciation and fluency of spoken Malay and Tamil. Our system consists of a server and a user-facing Android application, where the server is responsible for speech-to-text alignment as well as pronunciation and fluency scoring. We describe our system architecture and discuss the technical challenges...
This work is the first attempt to apply an end-to-end, deep neural network-based automatic speech recognition (ASR) pipeline to the Silent Speech Challenge dataset (SSC), which contains synchronized ultrasound images and lip images captured when a single speaker read the TIMIT corpus without uttering audible sounds. In silent speech research using SSC dataset, established methods in ASR have...
ICE-Talk is an open source web-based GUI that allows the use of a TTS system with controllable parameters via a text field and a clickable 2D plot. It enables the study of latent spaces for controllable TTS. Moreover it is implemented as a module that can be used as part of a Human-Agent interaction.
Speech provides an intuitive interface to communicate with machines. Today, developers willing to implement such an interface must either rely on third-party proprietary software or become experts in speech recognition. Conversely, researchers in speech recognition wishing to demonstrate their results need to be familiar with technologies that are not relevant to their research (e.g.,...
Recently, there has been a strong push to transition from hybrid models to end-to-end (E2E) models for automatic speech recognition. Currently, there are three promising E2E methods: recurrent neural network transducer (RNN-T), RNN attention-based encoder-decoder (AED), and Transformer-AED. In this study, we conduct an empirical comparison of RNN-T, RNN-AED, and Transformer-AED models, in...
Attention-based models with convolutional encoders enable faster training and inference than recurrent neural network-based ones. However, convolutional models often require a very large receptive field to achieve high recognition accuracy, which increases not only the parameter size but also the computational cost and run-time memory footprint. A convolutional encoder with a short receptive...
End-to-end speech recognition has become popular in recent years, since it can integrate the acoustic, pronunciation and language models into a single neural network. Among end-to-end approaches, attention-based methods have emerged as superior, for example the Transformer, which adopts an encoder-decoder architecture. The key improvement introduced by the Transformer is the utilization of...
End-to-end (E2E) systems for automatic speech recognition (ASR), such as RNN Transducer (RNN-T) and Listen-Attend-Spell (LAS) blend the individual components of a traditional hybrid ASR system – acoustic model, language model, pronunciation model – into a single neural network. While this has some nice advantages, it limits the system to be trained using only paired audio and text. Because of...
In this paper we present state-of-the-art (SOTA) performance on the LibriSpeech corpus with two novel neural network architectures, a multistream CNN for acoustic modeling and a self-attentive simple recurrent unit (SRU) for language modeling. In the hybrid ASR framework, the multistream CNN acoustic model processes an input of speech frames in multiple parallel pipelines where each stream has...
The long short-term memory (LSTM) network is one of the most widely used recurrent neural networks (RNNs) for automatic speech recognition (ASR), but it has millions of parameters. This makes it prohibitive for memory-constrained hardware accelerators, as the storage demand causes higher dependence on off-chip memory, which becomes a bottleneck for latency and power. In this paper, we propose...
Optimal fusion of streams for ASR is a nontrivial problem. Recently, so-called posterior-in-posterior-out (PIPO-)BLSTMs have been proposed that serve as state sequence enhancers and have highly attractive training properties. In this work, we adopt the PIPO-BLSTMs and employ them in the context of stream fusion for ASR. Our contributions are the following: First, we show the positive effect of...
Transformer models are a powerful sequence-to-sequence architecture capable of directly mapping speech inputs to transcriptions or translations. However, the mechanism for modeling positions in this model was tailored for text and is thus less ideal for acoustic inputs. In this work, we adapted the relative position encoding scheme to the Speech Transformer, in which the key is...
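The abstract is truncated before the exact formulation; as a rough illustration only, one widely used relative-position scheme (in the style of Shaw et al., not necessarily the variant adapted here) adds a learned embedding for the offset j - i to the attention logits. A minimal NumPy sketch, with all names hypothetical:

```python
import numpy as np

def rel_attention_scores(q, k, rel_emb):
    """Attention logits with additive relative-position keys.

    q, k:    (T, d) query/key matrices for one head.
    rel_emb: (2*T-1, d) embeddings for relative offsets -(T-1)..(T-1).
    Returns a (T, T) matrix of scaled logits.
    """
    T, d = q.shape
    content = q @ k.T                      # content-content term
    pos = np.zeros((T, T))                 # content-position term: q_i . r_{j-i}
    for i in range(T):
        for j in range(T):
            pos[i, j] = q[i] @ rel_emb[(j - i) + (T - 1)]
    return (content + pos) / np.sqrt(d)

# toy usage
rng = np.random.default_rng(0)
T, d = 5, 8
scores = rel_attention_scores(rng.normal(size=(T, d)),
                              rng.normal(size=(T, d)),
                              rng.normal(size=(2 * T - 1, d)))
print(scores.shape)  # (5, 5)
```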
We propose an end-to-end speaker-attributed automatic speech recognition model that unifies speaker counting, speech recognition, and speaker identification on monaural overlapped speech. Our model is built on serialized output training (SOT) with attention-based encoder-decoder, a recently proposed method for recognizing overlapped speech comprising an arbitrary number of speakers. We extend...
This paper proposes a novel generalized knowledge distillation framework, with an implicit transfer of privileged information. In our proposed framework, teacher networks are trained with two input branches on pairs of time-synchronous lossless and lossy acoustic features. While one branch of the teacher network processes a privileged view of the data using lossless features, the second branch...
Automatic Speech Recognition (ASR) technology has developed greatly in recent years, which has expedited many applications in other fields. For ASR research, a speech corpus is always an essential foundation, especially for vertical industries such as Air Traffic Control (ATC). There are some speech corpora for common applications, public or paid. However, for the ATC domain, it is...
This paper introduces FT Speech, a new speech corpus created from the recorded meetings of the Danish Parliament, otherwise known as the Folketing (FT). The corpus contains over 1,800 hours of transcribed speech by a total of 434 speakers. It is significantly larger in duration, vocabulary, and amount of spontaneous speech than the existing public speech corpora for Danish, which are largely...
This paper introduces an open-source speech dataset for Yoruba - one of the largest low-resource West African languages, spoken by at least 22 million people. Yoruba is one of the official languages of Nigeria, Benin and Togo, and is spoken in other neighboring African countries and beyond. The corpus consists of over four hours of 48 kHz recordings from 36 male and female volunteers and the...
Automatic speech recognition (ASR) via call is essential for various applications, including AI for contact center (AICC) services. Despite the advancement of ASR, however, most publicly available call-based speech corpora such as Switchboard are old-fashioned. Also, most existing call corpora are in English and mainly focus on open-domain dialog or general scenarios such as audiobooks. Here...
This paper introduces a corpus of Chinese Learner English containing 82 hours of L2 English speech by Chinese learners from all major dialect regions, collected through mobile apps developed by LAIX Inc. The LAIX corpus was created to serve as a benchmark dataset for evaluating Automatic Speech Recognition (ASR) performance on L2 English, the first of its kind as far as we know. The paper...
This paper presents a carefully designed corpus of scored spoken conversations between English language learners and a dialog system to facilitate research and development of both human and machine scoring of dialog interactions. We collected speech, demographic and user experience data from non-native speakers of English who interacted with a virtual boss as part of a workplace pragmatics...
This paper describes the design and development of CUCHILD, a large-scale Cantonese corpus of child speech. The corpus contains spoken words collected from 1,986 child speakers aged from 3 to 6 years old. The speech materials include 130 words of 1 to 4 syllables in length. The speakers cover both typically developing (TD) children and children with speech disorder. The intended use of the...
Creating open-domain chatbots requires large amounts of conversational data and related benchmark tasks to evaluate them. Standardized evaluation tasks are crucial for creating automatic evaluation metrics for model development, as otherwise comparing the models would require resource-expensive human evaluation. While chatbot challenges have recently managed to provide a plethora of such...
We present a speech data corpus that simulates a "dinner party" scenario taking place in an everyday home environment. The corpus was created by recording multiple groups of four Amazon employee volunteers having a natural conversation in English around a dining table. The participants were recorded by a single-channel close-talk microphone and by five far-field 7-microphone array devices...
Bipolar disorder (BD) and borderline personality disorder (BPD) are both chronic psychiatric disorders. However, their overlapping symptoms and common comorbidity make it challenging for the clinicians to distinguish the two conditions on the basis of a clinical interview. In this work, we first present a new multi-modal dataset containing interviews involving individuals with BD or BPD being...
State-of-the-art language recognition systems are based on discriminative embeddings called x-vectors. Channel and gender distortions produce a mismatch in the x-vector space, where embeddings corresponding to the same language are not grouped into a unique cluster. To control this mismatch, we propose to train the x-vector DNN with metric learning objective functions. Combining a classification...
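The specific metric-learning objectives are cut off in this excerpt; as a hedged illustration, one common choice is a triplet loss over x-vectors. A toy NumPy sketch (hypothetical, not necessarily the authors' exact objective):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss pulling same-language x-vectors together and pushing
    different-language x-vectors apart by at least `margin` (cosine space)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(0.0, margin + cos(anchor, negative) - cos(anchor, positive))

# toy usage with 512-dimensional random "x-vectors"
rng = np.random.default_rng(0)
a, p, n = rng.normal(size=(3, 512))
print(triplet_loss(a, p, n))
```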
In this paper, we present our XMUSPEECH system for the oriental language recognition (OLR) challenge, AP19-OLR. The challenge this year contained three tasks: (1) short-utterance LID, (2) cross-channel LID, and (3) zero-resource LID. We leveraged the system pipeline from three aspects, including front-end training, back-end processing, and fusion strategy. We implemented many encoder networks...
In this paper, we study the technology of multiple acoustic feature integration for the applications of Automatic Speaker Verification (ASV) and Language Identification (LID). In contrast to score level fusion, a common method for integrating subsystems built upon various acoustic features, we explore a new integration strategy, which integrates multiple acoustic features based on the x-vector...
An end-to-end dialect identification system generates the likelihood of each dialect, given a speech utterance. The performance relies on its capabilities to discriminate the acoustic properties between the different dialects, even though the input signal contains non-dialectal information such as speaker and channel. In this work, we study how non-dialectal information is encoded inside the...
In this paper, we propose a software toolkit for easier end-to-end training of deep learning based spoken language identification models across several speech datasets. We apply our toolkit to implement three baseline models, one speaker recognition model, and three x-vector architecture variations, which are trained on three datasets previously used in spoken language identification...
This article presents a full end-to-end pipeline for Arabic Dialect Identification (ADI) using intonation patterns and acoustic representations. Recent approaches to language and dialect identification use linguistically aware deep architectures that are able to capture phonetic differences amongst languages and dialects. Specifically, in ADI tasks, different combinations of linguistic features...
State-of-the-art spoken language identification (LID) systems, which are based on end-to-end deep neural networks, have shown remarkable success not only in discriminating between distant languages but also between closely-related languages or even different spoken varieties of the same language. However, it is still unclear to what extent neural LID models generalize to speech samples with...
The elastic spatial filter (ESF) proposed in recent years is a popular multi-channel speech enhancement front end based on deep neural network (DNN). It is suitable for real-time processing and has shown promising automatic speech recognition (ASR) results. However, the ESF only utilizes the knowledge of fixed beamforming, resulting in limited noise reduction capabilities. In this paper, we...
We propose a space-and-speaker-aware iterative mask estimation (SSA-IME) approach to improving complex angular central Gaussian mixture model (cACGMM) based beamforming in an iterative manner by leveraging upon the complementary information obtained from SSA-based regression. First, a mask calculated by beamformed speech features is proposed to enhance the estimation accuracy of the ideal...
Purely neural network (NN) based speech separation and enhancement methods, although they can achieve good objective scores, inevitably cause nonlinear speech distortions that are harmful for automatic speech recognition (ASR). On the other hand, the minimum variance distortionless response (MVDR) beamformer with NN-predicted masks, although it can significantly reduce speech distortions, has...
This paper proposes an online dual-microphone system for directional speech enhancement, which employs geometrically constrained independent vector analysis (IVA) based on the auxiliary function approach and vectorwise coordinate descent. Its offline version has recently been proposed and shown to outperform the conventional auxiliary function approach-based IVA (AuxIVA) thanks to the properly...
The performance of keyword spotting (KWS), measured in false alarms and false rejects, degrades significantly under the far field and noisy conditions. In this paper, we propose a multi-look neural network modeling for speech enhancement which simultaneously steers to listen to multiple sampled look directions. The multi-look enhancement is then jointly trained with KWS to form an end-to-end...
Use of omni-directional microphones is commonly assumed in differential beamforming with uniform circular arrays. Conventional differential beamforming with omni-directional elements tends to suffer from low white-noise-gain (WNG) at low frequencies and a decreased directivity factor (DF) at high frequencies. WNG measures the robustness of the beamformer and DF evaluates the array...
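For reference only (standard textbook definitions, not taken from this abstract), the two measures are commonly written as:

```latex
\mathrm{WNG}(\omega) = \frac{\lvert \mathbf{w}^{H}(\omega)\,\mathbf{d}(\omega,\theta_0)\rvert^{2}}{\mathbf{w}^{H}(\omega)\,\mathbf{w}(\omega)},
\qquad
\mathrm{DF}(\omega) = \frac{\lvert \mathbf{w}^{H}(\omega)\,\mathbf{d}(\omega,\theta_0)\rvert^{2}}{\mathbf{w}^{H}(\omega)\,\boldsymbol{\Gamma}_{\mathrm{diff}}(\omega)\,\mathbf{w}(\omega)}
```

where w is the beamformer weight vector, d the steering vector toward the look direction θ0, and Γ_diff the spatial coherence matrix of a diffuse noise field.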
This paper investigates different trade-offs between the number of model parameters and enhanced speech qualities by employing several deep tensor-to-vector regression models for speech enhancement. We find that a hybrid architecture, namely CNN-TT, can maintain a good quality performance with a reduced model parameter size. CNN-TT is composed of several convolutional layers at the bottom for...
Multi-speaker speech recognition has been one of the key challenges in conversation transcription as it breaks the single active speaker assumption employed by most state-of-the-art speech recognition systems. Speech separation is considered as a remedy to this problem. Previously, we introduced a system, called unmixing, fixed-beamformer and extraction (UFE), that was shown to be effective in...
Mentoring-reverse mentoring, which is a novel knowledge transfer framework for unsupervised learning, is introduced in multi-channel speech source separation. This framework aims to improve two different systems, which are referred to as a senior and a junior system, by mentoring each other. The senior system, which is composed of a neural separator and a statistical blind source separation...
This paper proposes new blind signal processing techniques for optimizing a multi-input multi-output (MIMO) convolutional beamformer (CBF) in a computationally efficient way to perform dereverberation and source separation simultaneously. For effective optimization of a CBF, a conventional technique factorizes it into a multiple-target weighted prediction error (WPE) based dereverberation...
Characterizing precisely neurophysiological activity involved in natural conversations remains a major challenge. We explore in this paper the relationship between multimodal conversational behavior and brain activity during natural conversations. This is challenging due to fMRI time resolution and to the diversity of the recorded multimodal signals. We use a unique corpus including localized...
Reconstruction of the speech envelope from neural signals is a general way to study neural entrainment, which helps to understand the neural mechanisms underlying speech processing. Previous neural entrainment studies were mainly based on single-trial neural activities, and the reconstruction accuracy of the speech envelope is not high enough, probably due to the interference from diverse noises...
Alterations in speech and language are typical signs of mild cognitive impairment (MCI), considered to be the prodromal stage of Alzheimer’s disease (AD). Yet, very few studies have pointed out at what stage their speech production is disrupted. To bridge this knowledge gap, the present study focused on lexical retrieval, a specific process during speech production, and investigated how it is...
Listeners usually have the ability to selectively attend to the target speech while ignoring competing sounds. The mechanism that top-down attention modulates the cortical envelope tracking to speech was proposed to account for this ability. Additional visual input, such as lipreading was considered beneficial for speech perception, especially in noise. However, the effect of audiovisual (AV)...
Human listeners can recognize target speech streams in complex auditory scenes. The cortical activities can robustly track the amplitude fluctuations of target speech with auditory attentional modulation under a range of signal-to-masker ratios (SMRs). The root-mean-square (RMS) level of the speech signal is a crucial acoustic cue for target speech perception. However, in most studies, the...
Human speech processing, either for listening or oral reading, requires dynamic cortical activities that are not only driven by sensory stimuli externally but also influenced by semantic knowledge and speech planning goals internally. Each of these functions has been known to accompany specific rhythmic oscillations and be localized in distributed networks. The question is how the brain...
In processing behavioral data from auditory lexical decision, reaction times (RT) can be defined relative to stimulus onset or relative to stimulus offset. Using stimulus onset as the reference invokes models that assume that relevant processing starts immediately, while stimulus offset invokes models that assume that relevant processing can only start when the acoustic input is complete. It...
Between 15% to 40% of mild traumatic brain injury (mTBI) patients experience incomplete recoveries or provide subjective reports of decreased motor abilities, despite a clinically-determined complete recovery. This demonstrates a need for objective measures capable of detecting subclinical residual mTBI, particularly in return-to-duty decisions for warfighters and return-to-play decisions for...
The ultimate goal of transfer learning is to reduce labeled data requirements by exploiting a preexisting embedding model trained for different datasets or tasks. The visual and language communities have established benchmarks to compare embeddings, but the speech community has yet to do so. This paper proposes a benchmark for comparing speech representations on non-semantic tasks, and...
Convolutional neural networks have been successfully applied to a variety of audio signal processing tasks including sound source separation, speech recognition and acoustic scene understanding. Since many pitched sounds have a harmonic structure, an operation, called harmonic convolution, has been proposed to take advantage of the structure appearing in the audio signals. However, the...
In this paper, a deep neural network (DNN)-based poetic meter classification scheme is proposed using a fusion of musical texture features (MTF) and i-vectors. The experiment is performed in two phases. Initially, the mel-frequency cepstral coefficient (MFCC) features are fused with MTF and classification is done using DNN. MTF include timbral, rhythmic, and melodic features. Later, in the...
Formant tracking is one of the most fundamental problems in speech processing. Traditionally, formants are estimated using signal processing methods. Recent studies showed that generic convolutional architectures can outperform recurrent networks on temporal tasks such as speech synthesis and machine translation. In this paper, we explored the use of Temporal Convolutional Network (TCN) for...
In this paper we present a publicly available tool for automatic analysis of speech prosody (AASP) in Dutch. Incorporating the state-of-the-art analytical frameworks, AASP enables users to analyze prosody at two levels from different theoretical perspectives. Holistically, by means of the Functional Principal Component Analysis (FPCA) it generates mathematical functions that capture changes in...
The search for professional voice actors for audiovisual productions is a sensitive task, performed by artistic directors (ADs). The ADs have a strong appetite for new talents/voices but cannot perform large-scale auditions. Automatic tools able to suggest the most suited voices are of great interest for the audiovisual industry. In previous works, we showed the existence of acoustic...
Formants are resonances of the time-varying vocal tract system, and their characteristics are reflected in the response of the system to the sequence of impulse-like excitations originating at the glottis. This paper presents a method to enhance the formant information in the display of the spectrogram of the speech signal, especially for high-pitched voices. It is well known that in the...
Disentanglement is a desired property in representation learning and a significant body of research has tried to show that it is a useful representational prior. Evaluating disentanglement is challenging, particularly for real world data like speech, where ground truth generative factors are typically not available. Previous work on disentangled representation learning in speech has used...
Accurate voiced/unvoiced information is crucial in estimating the pitch of a target speech signal in severe nonstationary noise environments. Nevertheless, state-of-the-art pitch estimators based on deep neural networks (DNN) lack a dedicated mechanism for robustly detecting voiced and unvoiced segments in the target speech in noisy conditions. In this work, we propose an end-to-end deep...
This paper extends recent work on nonlinear Independent Component Analysis (ICA) by introducing a theoretical framework for nonlinear Independent Subspace Analysis (ISA) in the presence of auxiliary variables. Observed high dimensional acoustic features like log Mel spectrograms can be considered as surface level manifestations of nonlinear transformations over individual multivariate sources...
In our previous work, we have proposed a neural vocoder called HiNet which recovers speech waveforms by predicting amplitude and phase spectra hierarchically from input acoustic features. In HiNet, the amplitude spectrum predictor (ASP) predicts log amplitude spectra (LAS) from input acoustic features. This paper proposes a novel knowledge-and-data-driven ASP (KDD-ASP) to improve the...
Recent advancements in deep learning led to human-level performance in single-speaker speech synthesis. However, there are still limitations in terms of speech quality when generalizing those systems into multiple-speaker models especially for unseen speakers and unseen recording qualities. For instance, conventional neural vocoders are adjusted to the training speaker and have poor...
In this paper, we propose the neural homomorphic vocoder (NHV), a source-filter model based neural vocoder framework. NHV synthesizes speech by filtering impulse trains and noise with linear time-varying (LTV) filters. A neural network controls the LTV filters by estimating complex cepstrums of time-varying impulse responses given acoustic features. The proposed framework can be trained with a...
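The paper's exact parameterization is truncated above; as a rough sketch of the general mechanism only — per-frame impulse responses obtained from complex cepstra and applied to an excitation signal by linear time-varying filtering with overlap-add — one might write (toy NumPy code, all details assumed, not the authors' implementation):

```python
import numpy as np

def ltv_filter(excitation, cepstra, frame_len):
    """Toy linear time-varying filter: each excitation frame is convolved
    with the impulse response derived from that frame's complex cepstrum,
    and the convolution tails are overlap-added."""
    n_frames = len(excitation) // frame_len
    fft_len = cepstra.shape[1]
    out = np.zeros(len(excitation) + fft_len)
    for t in range(n_frames):
        # impulse response = inverse FFT of exp(FFT(cepstrum))
        h = np.fft.ifft(np.exp(np.fft.fft(cepstra[t], fft_len))).real
        seg = excitation[t * frame_len:(t + 1) * frame_len]
        y = np.convolve(seg, h)
        out[t * frame_len:t * frame_len + len(y)] += y
    return out[:len(excitation)]

# toy usage: a 100 Hz impulse train at 16 kHz, random per-frame cepstra
sr, f0, frame_len = 16000, 100, 160
excitation = np.zeros(sr); excitation[::sr // f0] = 1.0
cepstra = 0.01 * np.random.default_rng(0).normal(size=(sr // frame_len, 64))
speech_like = ltv_filter(excitation, cepstra, frame_len)
```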
In this paper, we propose the FeatherWave, yet another variant of the WaveRNN vocoder, combining multi-band signal processing and linear predictive coding. LPCNet, a recently proposed neural vocoder which exploits the linear predictive characteristics of the speech signal in the WaveRNN architecture, can generate high quality speech with a speed faster than real-time on a single CPU core....
We present a novel high-fidelity real-time neural vocoder called VocGAN. A recently developed GAN-based vocoder, MelGAN, produces speech waveforms in real-time. However, it often produces a waveform that is insufficient in quality or inconsistent with acoustic characteristics of the input mel spectrogram. VocGAN is nearly as fast as MelGAN, but it significantly improves the quality and...
This paper proposes a lightweight neural vocoder based on LPCNet. The recently proposed LPCNet exploits linear predictive coding to represent vocal tract characteristics, and can rapidly synthesize high-quality waveforms with fewer parameters than WaveRNN. For even greater speed, it is necessary to reduce the two time-heavy GRUs and the DualFC. Although the original work only pruned the first...
In this paper, we propose WG-WaveNet, a fast, lightweight, and high-quality waveform generation model. WG-WaveNet is composed of a compact flow-based model and a post-filter. The two components are jointly trained by maximizing the likelihood of the training data and optimizing loss functions on the frequency domains. As we design a flow-based model that is heavily compressed, the proposed...
In incremental text to speech synthesis (iTTS), the synthesizer produces an audio output before it has access to the entire input sentence. In this paper, we study the behavior of a neural sequence-to-sequence TTS system when used in an incremental mode, i.e. when generating speech output for token n, the system has access to n + k tokens from the text sequence. We first analyze the impact of...
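As a schematic of the lookahead policy described above (purely illustrative; `synthesize` is a hypothetical placeholder, not the paper's system):

```python
def incremental_tts(tokens, k, synthesize):
    """Generate audio for token n once tokens up to n + k are available.

    `synthesize(context, n)` stands in for a seq2seq TTS call that returns
    the audio chunk for token index n given the currently visible context.
    """
    audio = []
    for n in range(len(tokens)):
        visible = tokens[:min(n + k + 1, len(tokens))]  # n + k lookahead
        audio.append(synthesize(visible, n))
    return audio

# toy usage with a dummy synthesizer
chunks = incremental_tts("this is a test".split(), k=2,
                         synthesize=lambda ctx, n: f"<audio for {ctx[n]}>")
print(chunks)
```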
We present a fast and lightweight on-device text-to-speech system based on state-of-the-art methods of feature and speech generation, i.e. Tacotron2 and LPCNet. We show that modification of the basic pipeline combined with hardware-specific optimizations and extensive use of parallelization enables running a TTS service even on low-end devices with faster-than-real-time waveform generation....
Neural vocoder, such as WaveGlow, has become an important component in recent high-quality text-to-speech (TTS) systems. In this paper, we propose Efficient WaveGlow (EWG), a flow-based generative model serving as an efficient neural vocoder. Similar to WaveGlow, EWG has a normalizing flow backbone where each flow step consists of an affine coupling layer and an invertible 1x1 convolution. To...
Nowadays, synthetic speech is almost indistinguishable from human speech. The remarkable quality is mainly due to the displacement of signal-processing-based vocoders in favour of neural vocoders and, in particular, the WaveNet architecture. At the same time, speech synthesis evaluation is still facing difficulties in adjusting to these improvements. These difficulties are even more prevalent...
End-to-end speaker diarization for an unknown number of speakers is addressed in this paper. Recently proposed end-to-end speaker diarization outperformed conventional clustering-based speaker diarization, but it has one drawback: it is less flexible in terms of the number of speakers. This paper proposes a method for encoder-decoder based attractor calculation (EDA), which first generates a...
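The abstract is truncated before the details; schematically (not the authors' code), the attractor idea reduces to estimating a set of candidate attractor vectors, keeping those whose existence probability passes a threshold, and scoring per-frame speaker activities against the kept attractors. A toy NumPy sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def diarize_with_attractors(frame_emb, attractors, w, b, thresh=0.5):
    """frame_emb: (T, d) frame embeddings; attractors: (S_max, d) candidate
    attractors (in EDA these come from an encoder-decoder over the frames).
    Keep attractors with existence probability sigmoid(w.a + b) > thresh,
    then score speaker activity as sigmoid(frame_emb @ attractor)."""
    exist_p = sigmoid(attractors @ w + b)          # (S_max,)
    active = attractors[exist_p > thresh]          # estimated speakers
    return sigmoid(frame_emb @ active.T)           # (T, n_active) activities

# toy usage with random embeddings and attractors
rng = np.random.default_rng(0)
acts = diarize_with_attractors(rng.normal(size=(200, 256)),
                               rng.normal(size=(4, 256)),
                               rng.normal(size=256), 0.0)
print(acts.shape)
```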
Speaker diarization for real-life scenarios is an extremely challenging problem. Widely used clustering-based diarization approaches perform rather poorly in such conditions, mainly due to the limited ability to handle overlapping speech. We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which...
Recently, speaker diarization based on speaker embeddings has shown excellent results in many works. In this paper we propose several enhancements throughout the diarization pipeline. This work addresses two clustering frameworks: agglomerative hierarchical clustering (AHC) and spectral clustering (SC). First, we use multiple speaker embeddings. We show that fusion of x-vectors and d-vectors...
Speaker diarization can be described as the process of extracting sequential speaker embeddings from an audio stream and clustering them according to speaker identities. Nowadays, deep neural network based approaches like x-vector have been widely adopted for speaker embedding extraction. However, in the clustering back-end, probabilistic linear discriminant analysis (PLDA) is still the...
Speaker attribution is required in many real-world applications, such as meeting transcription, where speaker identity is assigned to each utterance according to speaker voice profiles. In this paper, we propose to solve the speaker attribution problem by using graph-based semi-supervised learning methods. A graph of speech segments is built for each session, on which segments from voice...
The state-of-the-art speaker diarization systems use agglomerative hierarchical clustering (AHC) which performs the clustering of previously learned neural embeddings. While the clustering approach attempts to identify speaker clusters, the AHC algorithm does not involve any further learning. In this paper, we propose a novel algorithm for hierarchical clustering which combines the speaker...
The goal of this paper is speaker diarisation of videos collected 'in the wild'. We make three key contributions. First, we propose an automatic audio-visual diarisation method for YouTube videos. Our method consists of active speaker detection using audio-visual methods and speaker verification using self-enrolled speaker models. Second, we integrate our method into a semi-automatic...
End-to-end multi-speaker speech recognition has been a popular topic in recent years, as more and more research focuses on speech processing in more realistic scenarios. Inspired by the hearing mechanism of human beings, which enables us to concentrate on the interested speaker from the multi-speaker mixed speech by utilizing both audio and context knowledge, this paper explores the contextual...
Simulated data plays a crucial role in the development and evaluation of novel distant microphone ASR techniques. However, the commonly used simulated datasets adopt uninformed and potentially unrealistic speaker location distributions. We wish to generate more realistic simulations driven by recorded human behaviour. By using devices with a paired microphone array and camera, we analyse...
To improve the noise robustness of automatic speech recognition (ASR), the generative adversarial network (GAN) based enhancement methods are employed as the front-end processing, which comprise a single adversarial process of an enhancement model and a discriminator. In this single adversarial process, the discriminator is encouraged to find differences between the enhanced and clean...
Shift-invariance is a desirable property of many machine learning models. It means that delaying the input of a model in time should only result in delaying its prediction in time. A model that is shift-invariant also eliminates undesirable side effects like frequency aliasing. When building sequence models, not only should the shift-invariance property be preserved when sampling input...
While end-to-end ASR systems have proven competitive with the conventional hybrid approach, they are prone to accuracy degradation when it comes to noisy and low-resource conditions. In this paper, we argue that, even in such difficult cases, some end-to-end approaches show performance close to the hybrid baseline. To demonstrate this, we use the CHiME-6 Challenge data as an example of...
Despite successful applications of end-to-end approaches in multi-channel speech recognition, the performance still degrades severely when the speech is corrupted by reverberation. In this paper, we integrate the dereverberation module into the end-to-end multi-channel speech recognition system and explore two different frontend architectures. First, a multi-source mask-based weighted...
Despite the significant progress in automatic speech recognition (ASR), distant ASR remains challenging due to noise and reverberation. A common approach to mitigate this issue consists of equipping the recording devices with multiple microphones that capture the acoustic scene from different perspectives. These multi-channel audio recordings contain specific internal relations between each...
The CHiME-6 dataset presents a difficult task with extreme speech overlap, severe noise and a natural speaking style. There is a distinct word error rate (WER) gap between the audio recorded by the distant microphone arrays and by the individual headset microphones. The official baseline exhibits a WER gap of approximately 10% even though guided source separation (GSS) has achieved...
This paper proposes a neural network based speech separation method using spatially distributed microphones. Unlike with traditional microphone array settings, neither the number of microphones nor their spatial arrangement is known in advance, which hinders the use of conventional multi-channel speech separation neural networks based on fixed size input. To overcome this, a novel network...
A novel framework for meeting transcription using asynchronous microphones is proposed in this paper. It consists of audio synchronization, speaker diarization, utterance-wise speech enhancement using guided source separation, automatic speech recognition, and duplication reduction. Doing speaker diarization before speech enhancement enables the system to deal with overlapped speech without...
Electromyographic (EMG) signals recorded during speech production encode information on articulatory muscle activity and also on the facial expression of emotion, thus representing a speech-related biosignal with strong potential for paralinguistic applications. In this work, we estimate the electrical activity of the muscles responsible for speech articulation directly from the speech signal....
Deception detection in conversational dialogue has attracted much attention in recent years. Yet existing methods for this rely heavily on human-labeled annotations that are costly and potentially inaccurate. In this work, we present an automated system that utilizes multimodal features for conversational deception detection, without the use of human annotations. We study the predictive power...
Emotion represents an essential aspect of human speech that is manifested in speech prosody. Speech, visual, and textual cues are complementary in human communication. In this paper, we study a hybrid fusion method, referred to as multi-modal attention network (MMAN), to make use of visual and textual cues in speech emotion recognition. We propose a novel multi-modal attention mechanism,...
While having numerous real-world applications, speech emotion recognition is still a technically challenging problem. How to effectively leverage the inherent multiple modalities in speech data (e.g., audio and text) is key to accurate classification. Existing studies normally choose to fuse multimodal features at the utterance level and largely neglect the dynamic interplay of features from...
Speech emotion recognition (SER) is a challenging task that requires to learn suitable features for achieving good performance. The development of deep learning techniques makes it possible to automatically extract features rather than construct hand-crafted features. In this paper, a multi-scale fusion framework named STSER is proposed for bimodal SER by using speech and text information. A...
Emotion recognition is a challenging and actively-studied research area that plays a critical role in emotion-aware human-computer interaction systems. In a multimodal setting, temporal alignment between different modalities has not been well investigated yet. This paper presents a new model named Gated Bidirectional Alignment Network (GBAN), which consists of an attention-based...
General embeddings like word2vec, GloVe and ELMo have shown a lot of success in natural language tasks. The embeddings are typically extracted from models that are built on general tasks such as skip-gram models and natural language generation. In this paper, we extend the work from natural language understanding to multi-modal architectures that use audio, visual and textual information for...
Integrating multimodal emotion sensing modules in realizing human-centered technologies is rapidly growing. Despite recent advancement of deep architectures in improving recognition performances, inability to handle individual differences in the expressive cues creates a major hurdle for real world applications. In this work, we propose a Speaker-aligned Graph Memory Network (SaGMN) that...
Emotion recognition remains a complex task due to speaker variations and low-resource training samples. To address these difficulties, we focus on the domain adversarial neural networks (DANN) for emotion recognition. The primary task is to predict emotion labels. The secondary task is to learn a common representation where speaker identities can not be distinguished. By using this approach,...
We present an overview of the ASR challenge for non-native children's speech organized for a special session at Interspeech 2020. The data for the challenge was obtained in the context of a spoken language proficiency assessment administered at Italian schools for students between the ages of 9 and 16 who were studying English and German as a foreign language. The corpus distributed for the...
This paper describes the NTNU ASR system participating in the Interspeech 2020 Non-Native Children’s Speech ASR Challenge supported by the SIG-CHILD group of ISCA. This ASR shared task is made much more challenging due to the coexisting diversity of non-native and children speaking characteristics. In the setting of closed-track evaluation, all participants were restricted to develop their...
Automatic spoken language assessment (SLA) is a challenging problem due to the large variations in learner speech combined with limited resources. These issues are even more problematic when considering children learning a language, with higher levels of acoustic and lexical variability, and of code-switching compared to adult data. This paper describes the ALTA system for the INTERSPEECH 2020...
This paper describes AaltoASR’s speech recognition system for the INTERSPEECH 2020 shared task on Automatic Speech Recognition (ASR) for non-native children’s speech. The task is to recognize non-native speech from children of various age groups given a limited amount of speech. Moreover, the speech, being spontaneous, has false starts transcribed as partial words, which in the test...
In this paper we describe our children’s Automatic Speech Recognition (ASR) system for the first shared task on ASR for English non-native children’s speech. The acoustic model comprises 6 Convolutional Neural Network (CNN) layers and 12 Factored Time-Delay Neural Network (TDNN-F) layers, trained by data from 5 different children’s speech corpora. Speed perturbation, Room Impulse Response...
In a generation where industries are going through a paradigm shift because of the rampant growth of deep learning, structured data plays a crucial role in the automation of various tasks. Textual structured data is one such kind which is extensively used in systems like chat bots and automatic speech recognition. Unfortunately, a majority of these textual data available is unstructured in...
We present a real-time, full-band, online voice conversion (VC) system that uses a single CPU. For practical applications, VC must be high quality and able to perform real-time, online conversion with fewer computational resources. Our system achieves this by combining non-linear conversion with a deep neural network and short-tap, sub-band filtering. We evaluate our system and demonstrate...
Tube phonation, or straw phonation, is a frequently used vocal training technique to improve the efficiency of the vocal mechanism by repeatedly producing a speech sound into a tube or straw. Use of the straw results in a semi-occluded vocal tract in order to maximize the interaction between the vocal fold vibration and the vocal tract. This method requires a voice trainer or therapist to...
The SoapBox Labs Fluency API service allows the automatic assessment of a child’s reading fluency. The system uses automatic speech recognition (ASR) to transcribe the child’s speech as they read a passage. The ASR output is then compared to the text of the reading passage, and the fluency algorithm returns information about the accuracy of the child’s reading attempt. In this show and tell...
SoapBox Labs’ child speech verification platform is a service designed specifically for identifying keywords and phrases in children’s speech. Given an audio file containing children’s speech and one or more target keywords or phrases, the system will return the confidence score of recognition for the word(s) or phrase(s) within the audio file. The confidence scores are provided at...
We demonstrate a multimodal conversational platform for remote patient diagnosis and monitoring. The platform engages patients in an interactive dialog session and automatically computes metrics relevant to speech acoustics and articulation, oro-motor and oro-facial movement, cognitive function and respiratory function. The dialog session includes a selection of exercises that have been...
We introduce an open-source Python library, VCTUBE, which can automatically generate <audio, text> pairs of speech data from a given YouTube URL. We believe VCTUBE is useful for collecting, processing, and annotating speech data easily toward developing speech synthesis systems.
We propose a novel AI framework to conduct real-time multi-speaker recognition without any prior registration or pre-training by learning the speaker identification on the fly. We considered the practical problem of online learning with episodically revealed rewards and introduced a solution based on semi-supervised and self-supervised learning methods in a web-based application at...
Well-designed adversarial examples can easily fool deep speech emotion recognition models into misclassifications. The transferability of adversarial attacks is a crucial evaluation indicator when generating adversarial examples to fool a new target model or multiple models. Herein, we propose a method to improve the transferability of black-box adversarial attacks using lifelong learning....
In this paper, we propose speech emotion recognition (SER) combined with an acoustic-to-word automatic speech recognition (ASR) model. While acoustic prosodic features are primarily used for SER, textual features are also useful but are error-prone, especially in emotional speech. To solve this problem, we integrate ASR model and SER model in an end-to-end manner. This is done by using an...
The manner in which humans encode emotion information within an utterance is often complex and could result in a diverse salient acoustic profile that is conditioned on emotion type. In this work, we propose a framework in imposing a graph attention mechanism on gated recurrent unit network (GA-GRU) to improve utterance-based speech emotion recognition (SER). Our proposed GA-GRU combines both...
One of the keys for supervised learning techniques to succeed resides in the access to vast amounts of labelled training data. The process of data collection, however, is expensive, time-consuming, and application dependent. In the current digital era, data can be collected continuously. This continuity renders data annotation into an endless task, which potentially, in problems such as...
Reliable and generalizable speech emotion recognition (SER) systems have wide applications in various fields including healthcare, customer service, and security and defense. Towards this goal, this study presents a novel teacher-student (T-S) framework for SER, relying on an ensemble of probabilistic predictions of teacher embeddings to train an ensemble of students. We use uncertainty...
Generative adversarial networks (GANs) have shown potential in learning emotional attributes and generating new data samples. However, their performance is usually hindered by the unavailability of larger speech emotion recognition (SER) data. In this work, we propose a framework that utilises the mixup data augmentation scheme to augment the GAN in feature learning and generation. To show...
Speech Emotion Recognition (SER) has been a challenging task on which researchers have been working for decades. Recently, Deep Learning (DL) based approaches have been shown to perform well in SER tasks; however, it has been noticed that their superior performance is limited to the distribution of the data used to train the model. In this paper, we present an analysis of using autoencoders to...
Human emotions are inherently ambiguous and impure. When designing systems to anticipate human emotions based on speech, the lack of emotional purity must be considered. However, most of the current methods for speech emotion classification rest on the consensus, e.g., one single hard label for an utterance. This labeling principle imposes challenges for system performance considering...
Developing robust speech emotion recognition (SER) systems is challenging due to the small scale of existing emotional speech datasets. However, previous works have mostly relied on handcrafted acoustic features to build SER models, which have difficulty handling a wide range of acoustic variations. One way to alleviate this problem is by using speech representations learned from deep end-to-end...
Speaker recognition systems based on Convolutional Neural Networks (CNNs) are often built with off-the-shelf backbones such as VGG-Net or ResNet. However, these backbones were originally proposed for image classification, and therefore may not be a natural fit for speaker recognition. Due to the prohibitive complexity of manually exploring the design space, we propose the first...
State-of-the-art speaker verification models are based on deep learning techniques, which heavily depend on hand-designed neural architectures from experts or engineers. We borrow the idea of neural architecture search (NAS) for the text-independent speaker verification task. As NAS can learn deep network structures automatically, we introduce the NAS conception into the well-known x-vector...
Time delay neural network (TDNN) has been widely used in speaker verification tasks. Recently, two TDNN-based models, including extended TDNN (E-TDNN) and factorized TDNN (F-TDNN), are proposed to improve the accuracy of vanilla TDNN. But E-TDNN and F-TDNN increase the number of parameters due to deeper networks, compared with vanilla TDNN. In this paper, we propose a novel TDNN-based model,...
In this paper we propose an end-to-end phonetically-aware coupled network for short duration speaker verification tasks. Phonetic information is shown to be beneficial for identifying short utterances. A coupled network structure is proposed to exploit phonetic information. The coupled convolutional layers allow the network to provide frame-level supervision based on phonetic representations...
Keyword spotting (KWS) and speaker verification (SV) have been studied independently although it is known that acoustic and speaker domains are complementary. In this paper, we propose a multi-task network that performs KWS and SV simultaneously to fully utilize the interrelated domain information. The multi-task network tightly combines sub-networks aiming at performance improvement in...
The pooling mechanism plays an important role in deep neural network based systems for text-independent speaker verification, which aggregates the variable-length frame-level vector sequence across all frames into a fixed-dimensional utterance-level representation. Previous attentive pooling methods employ scalar attention weights for each frame-level vector, resulting in insufficient...
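For context, the scalar attentive pooling that this abstract contrasts against can be sketched in a few lines of NumPy (a generic baseline illustration, not the proposed method):

```python
import numpy as np

def scalar_attentive_pooling(h, v):
    """h: (T, d) frame-level vectors, v: (d,) learnable attention parameter.
    Each frame gets one scalar weight; the utterance-level embedding is the
    weighted mean over frames."""
    scores = h @ v                          # (T,) one scalar per frame
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                    # softmax over frames
    return alpha @ h                        # (d,) utterance-level vector

# toy usage
rng = np.random.default_rng(0)
emb = scalar_attentive_pooling(rng.normal(size=(300, 512)), rng.normal(size=512))
print(emb.shape)  # (512,)
```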
The computing power of mobile devices limits the end-user applications in terms of storage size, processing, memory and energy consumption. These limitations motivate researchers for the design of more efficient deep models. On the other hand, self-attention networks based on Transformer architecture have attracted remarkable interests due to their high parallelization capabilities and strong...
The time-delay neural network (TDNN) is widely used in speaker verification to extract long-term temporal features of speakers. Although common TDNN approaches well capture time-sequential information, they lack the delicate transformations needed for deep representation. To solve this problem, we propose two TDNN architectures. RET integrates short-cut connections into conventional time-delay...
Deep neural networks (DNN) have achieved great success in speaker recognition systems. However, it is observed that DNN based systems are easily deceived by adversarial examples leading to wrong predictions. Adversarial examples, which are generated by adding purposeful perturbations on natural examples, pose a serious security threat. In this study, we propose the adversarial separation...
This paper presents a novel design of attention model for text-independent speaker verification. The model takes a pair of input utterances and generates an utterance-level embedding to represent speaker-specific characteristics in each utterance. The input utterances are expected to have highly similar embeddings if they are from the same speaker. The proposed attention model consists of a...
In this work, we propose minimum Bayes risk (MBR) training of RNN-Transducer (RNN-T) for end-to-end speech recognition. Specifically, initialized with a RNN-T trained model, MBR training is conducted via minimizing the expected edit distance between the reference label sequence and on-the-fly generated N-best hypothesis. We also introduce a heuristic to incorporate an external neural...
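Concretely, the expected edit distance over an N-best list follows the standard MBR form: posteriors renormalized over the hypotheses weight each hypothesis's edit distance to the reference. A small illustrative Python sketch (not the paper's implementation):

```python
import numpy as np

def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)
    d[0, :] = np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                          d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1]))
    return d[-1, -1]

def mbr_risk(ref, nbest, log_probs):
    """Expected edit distance under posteriors renormalized over the N-best."""
    post = np.exp(log_probs - np.max(log_probs))
    post /= post.sum()
    return sum(p * edit_distance(ref, h) for p, h in zip(post, nbest))

# toy usage
ref = "the cat sat".split()
nbest = ["the cat sat".split(), "a cat sat".split(), "the cat".split()]
print(mbr_risk(ref, nbest, np.array([-1.0, -2.0, -2.5])))
```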
Attention-based encoder-decoder model has achieved impressive results for both automatic speech recognition (ASR) and text-to-speech (TTS) tasks. This approach takes advantage of the memorization capacity of neural networks to learn the mapping from the input sequence to the...
In this work, we first show that on the widely used LibriSpeech benchmark, our transformer-based context-dependent connectionist temporal classification (CTC) system produces state-of-the-art results. We then show that using wordpieces as modeling units combined with CTC training, we can greatly simplify the engineering pipeline compared to conventional frame-based cross-entropy training by...
In this paper, a novel platform for Acoustic Model training based on Federated Learning (FL) is described. This is the first attempt to introduce Federated Learning techniques in Speech Recognition (SR) tasks. Besides the novelty of the task, the paper describes an easily generalizable FL platform and presents the design decisions used for this task. Amongst the novel algorithms introduced is...
This work investigates semi-supervised training of acoustic models (AM) with the lattice-free maximum mutual information (LF-MMI) objective in practically relevant scenarios with a limited amount of labeled in-domain data. An error detection driven semi-supervised AM training approach is proposed, in which an error detector controls the hypothesized transcriptions or lattices used as LF-MMI...
Wake word (WW) spotting is challenging in far-field due to the complexities and variations in acoustic conditions and the environmental interference in signal transmission. A suite of carefully designed and optimized audio front-end (AFE) algorithms help mitigate these challenges and provide better quality audio signals to the downstream modules such as WW spotter. Since the WW model is...
In this paper, we propose two novel regularization-based speaker adaptive training approaches for connectionist temporal classification (CTC) based speech recognition. The first method is center loss (CL) regularization, which is used to penalize the distances between the embeddings of different speakers and a single shared center. The second method is speaker variance loss (SVL) regularization...
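A minimal sketch of a center-loss style regularizer with one shared center, as the abstract describes, could look like the following in PyTorch; the embedding dimension and the weighting against the CTC loss are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SingleCenterLoss(nn.Module):
    """Center-loss style regularizer with one shared, learnable center:
    embeddings of all speakers are pulled toward the same point, which
    encourages speaker-invariant representations."""
    def __init__(self, embed_dim):
        super().__init__()
        self.center = nn.Parameter(torch.zeros(embed_dim))

    def forward(self, embeddings):                 # (batch, embed_dim)
        return ((embeddings - self.center) ** 2).sum(dim=1).mean()

reg = SingleCenterLoss(256)
penalty = reg(torch.randn(8, 256))
# total loss = CTC loss + lambda * penalty, where lambda is a tunable weight
```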
We investigate the robustness and training dynamics of raw waveform acoustic models for automatic speech recognition (ASR). It is known that the first layer of such models learns a set of filters, performing a form of time-frequency analysis. This layer is liable to be under-trained owing to gradient vanishing, which can negatively affect the network performance. Through a set of experiments on...
Pseudo-labeling has recently shown promise in end-to-end automatic speech recognition (ASR). We study Iterative Pseudo-Labeling (IPL), a semi-supervised algorithm which efficiently performs multiple iterations of pseudo-labeling on unlabeled data as the acoustic model evolves. In particular, IPL fine-tunes an existing model at each iteration using both labeled data and a subset of unlabeled...
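The overall pseudo-labeling loop can be illustrated schematically as below; the stand-in functions (decode, fine_tune, confidence) are hypothetical placeholders for the real acoustic model, beam-search decoder, and training routine, and the confidence filtering is only one plausible selection strategy, not necessarily the one used in the paper.

```python
import random

# Toy stand-ins so the loop runs end to end; in practice these would be an
# acoustic model, a decoder (possibly with an external LM), and fine-tuning.
def decode(model, audio):       return f"pseudo transcript for {audio}"
def confidence(model, audio):   return random.random()
def fine_tune(model, data):     return model + 1     # pretend the model improves

labeled   = [(f"utt_l{i}", "gold transcript") for i in range(100)]
unlabeled = [f"utt_u{i}" for i in range(1000)]
model = 0

for iteration in range(5):
    # 1) pseudo-label a (confidence-filtered) subset of the unlabeled pool
    subset = random.sample(unlabeled, k=200)
    pseudo = [(u, decode(model, u)) for u in subset if confidence(model, u) > 0.5]
    # 2) fine-tune the existing model on labeled plus pseudo-labeled data
    model = fine_tune(model, labeled + pseudo)
```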
In this work we lay out a Fast & Slow (F&S) acoustic model (AM) in an encoder-decoder architecture for streaming automatic speech recognition (ASR). The Slow model represents our baseline ASR model; it is significantly larger than the Fast model and provides stronger accuracy. The Fast model is generally developed for related speech applications. It has weaker ASR accuracy but is faster to evaluate...
We present a novel training approach for encoder-decoder-based sequence-to-sequence (S2S) models. The important key factor of S2S is the attention mechanism as it captures the relationships between input and output sequences. The attention weights inform which time frames should be attended to for predicting the output labels. In previous work, we proposed distilling S2S knowledge into...
It is generally believed that direct sequence-to-sequence (seq2seq) speech recognition models are competitive with hybrid models only when a large amount of data, at least a thousand hours, is available for training. In this paper, we show that state-of-the-art recognition performance can be achieved on the Switchboard-300 database using a single headed attention, LSTM based model. Using a...
Text-to-Speech synthesis (TTS) based data augmentation is a relatively new mechanism for utilizing text-only data to improve automatic speech recognition (ASR) training without parameter or inference architecture changes. However, efforts to train speech recognition systems on synthesized utterances suffer from the limited acoustic diversity of TTS outputs. Additionally, the text-only corpus is...
We present PyChain, a fully parallelized PyTorch implementation of end-to-end lattice-free maximum mutual information (LF-MMI) training for the so-called chain models in the Kaldi automatic speech recognition (ASR) toolkit. Unlike other PyTorch and Kaldi based ASR toolkits, PyChain is designed to be as flexible and light-weight as possible so that it can be easily plugged into a new ASR...
In this paper, we present a new open source toolkit for speech recognition, named CAT (CTC-CRF based ASR Toolkit). CAT inherits the data-efficiency of the hybrid approach and the simplicity of the E2E approach, providing a full-fledged implementation of CTC-CRFs and complete training and testing scripts for a number of English and Chinese benchmarks....
Monotonic chunkwise attention (MoChA) has been studied for online streaming automatic speech recognition (ASR) based on a sequence-to-sequence framework. In contrast to connectionist temporal classification (CTC), backward probabilities cannot be leveraged in the alignment marginalization process during training due to the left-to-right dependency in the decoder. This results in the error...
Using data from multiple dialects has shown promise in improving neural network acoustic models. While such training can improve the performance of an acoustic model on a single dialect, it can also produce a model capable of good performance on multiple dialects. However, training an acoustic model on pooled data from multiple dialects takes a significant amount of time and computing...
Recently, End-to-End (E2E) models have achieved state-of-the-art performance for automatic speech recognition (ASR). Within these large and deep models, overfitting remains an important problem that heavily influences the model performance. One solution to deal with the overfitting problem is to increase the quantity and variety of the training data with the help of data augmentation. In this...
Deep learning enables the development of efficient end-to-end speech processing applications while bypassing the need for expert linguistic and signal processing features. Yet, recent studies show that good quality speech resources and phonetic transcription of the training data can enhance the results of these applications. In this paper, the RECOApy tool is introduced. RECOApy streamlines the...
The demand for fast and accurate incremental speech recognition increases as the applications of automatic speech recognition (ASR) proliferate. Incremental speech recognizers output chunks of partially recognized words while the user is still talking. Partial results can be revised before the ASR finalizes its hypothesis, causing instability issues. We analyze the quality and stability of...
A common question being raised in automatic speech recognition (ASR) evaluations is how reliable is an observed word error rate (WER) improvement comparing two ASR systems, where statistical hypothesis testing and confidence interval (CI) can be utilized to tell whether this improvement is real or only due to random chance. The bootstrap resampling method has been popular for such significance...
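A plain bootstrap resampling estimate of a confidence interval for a WER difference, along the lines this abstract discusses, might look like the sketch below; the per-utterance error counts and reference lengths are assumed inputs, and the exact procedure in the paper may differ.

```python
import random

def wer_diff_ci_bootstrap(errors_a, errors_b, ref_lengths, n_boot=10000, seed=0):
    """Bootstrap 95% confidence interval for the WER difference (A minus B)
    between two ASR systems scored on the same utterances.

    errors_a[i], errors_b[i]: word errors (S+D+I) of systems A and B on utterance i
    ref_lengths[i]:           number of reference words in utterance i
    Utterances are resampled with replacement and the WER difference recomputed.
    """
    rng = random.Random(seed)
    n = len(ref_lengths)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        words = sum(ref_lengths[i] for i in idx)
        diffs.append((sum(errors_a[i] for i in idx) - sum(errors_b[i] for i in idx)) / words)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]

# If the whole interval lies below zero, system A's lower WER is unlikely
# to be due to chance at roughly the 5% level.
```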
Psycholinguistic normatives represent various affective and mental constructs using numeric scores and are used in a variety of applications in natural language processing. They are commonly used at the sentence level, the scores of which are estimated by extrapolating word-level scores using simple aggregation strategies, which may not always be optimal. In this work, we present a novel...
The performance of automatic speech recognition (ASR) systems is usually evaluated by the word error rate (WER) metric when manually transcribed data are provided, which are, however, expensive to obtain in real scenarios. In addition, the empirical distribution of WER for most ASR systems usually tends to put a significant mass near zero, making it difficult to simulate with a...
Recent improvements in Automatic Speech Recognition (ASR) systems have enabled the growth of myriad applications such as voice assistants, intent detection, keyword extraction and sentiment analysis. These applications, which are now widely used in the industry, are very sensitive to the errors generated by ASR systems. This could be overcome by having a reliable confidence measurement...
Measuring the performance of automatic speech recognition (ASR) systems requires manually transcribed data in order to compute the word error rate (WER), which is often time-consuming and expensive. In this paper, we continue our effort in estimating WER using acoustic, lexical and phonotactic features. Our novel approach to estimate the WER uses a multistream end-to-end architecture. We...
With laughter research seeing a development in recent years, there is also an increased need in materials having laughter annotations. We examine in this study how one can leverage existing spontaneous speech resources to this goal. We first analyze the process of manual laughter annotation in corpora, by establishing two important parameters of the process: the amount of time required and its...
Recent research has highlighted that state-of-the-art automatic speech recognition (ASR) systems exhibit a bias against African American speakers. In this research, we investigate the underlying causes of this racially based disparity in performance, focusing on a unique morpho-syntactic feature of African American English (AAE), namely habitual "be", an invariant form of "be" that encodes the...
A production study explored the acoustic characteristics of /æ/ in CVC and CVN words spoken by California speakers who raise /æ/ in pre-nasal contexts. Results reveal that the phonetic realization of the /æ/-/ɛ/ contrast in these contexts is multidimensional. Raised pre-nasal /æ/ is close in formant space to /ɛ/, particularly over the second half of the vowel. Yet, systematic differences in...
Languages tend to license segmental contrasts where they are maximally perceptible, i.e. where more perceptual cues to the contrast are available. For strident fricatives, the most salient cues to the presence of voicing are low-frequency energy concentrations and fricative duration, as voiced fricatives are systematically shorter than voiceless ones. Cross-linguistically, the voicing contrast...
It is well known that in Mandarin Chinese (MC) nasal rhymes, non-high vowels /a/ and /e/ undergo Vowel Nasalization and Backness Feature Specification processes to harmonize with the nasal coda in both manner and place of articulation. Specifically, the vowel is specified with the [+front] feature when followed by the /n/ coda and the [+back] feature when followed by /ŋ/. On the other hand,...
This paper gives an acoustic phonetic description of the obstruents in the Hangzhou Wu Chinese dialect. Based on the data from 8 speakers (4 male and 4 female), obstruents were examined in terms of VOT, silent closure duration, segment duration, and spectral properties such as H1-H2, H1-F1 and H1-F3. Results suggest that VOT cannot differentiate the voiced obstruents from their voiceless...
In the present study, we re-analyze the vowel system in Kaifeng Mandarin, adopting a phoneme-based approach. Our analysis deviates from the previous syllable-based analyses in a number of ways. First, we treat the apical vowels [ɿ ʅ] as syllabic approximants and analyze them as allophones of the retroflex approximant /ɻ/. Second, the vowel inventory consists of three sets: monophthongs, diphthongs and...
Fundamental frequency (F0) contours may show slight, microprosodic variations in the vicinity of plosive segments, which may have distinctive patterns relative to the place of articulation and voicing. Similarly, plosive bursts have distinctive characteristics associated with these articulatory features. The current study investigates the degree to which such microprosodic variations arise in...
This paper is an articulatory study of the er-suffixation (a.k.a. erhua) in Southwestern Mandarin (SWM), using co-registered EMA and ultrasound. Data from two female speakers in their twenties were analyzed and discussed. Our recording materials contain unsuffixed stems, er-suffixed forms and the rhotic schwa /ɚ/, a phonemic vowel in its own right. Results suggest that the er-suffixation in...
This paper examined the phonatory features induced by the tripartite plosives in Yanbian Korean, broadly considered a Hamkyungbukdo Korean dialect. Electroglottographic (EGG) and acoustic analysis was applied to five elderly Korean speakers. The results show that fortis-induced phonation is characterized by a more constricted glottis, a slower spectral tilt, and a higher sub-harmonic-harmonic...
In this paper we consider the problem of computationally representing American Sign Language (ASL) phonetics. We specifically present a computational model inspired by the sequential phonological ASL representation, known as the Movement-Hold (MH) Model. Our computational model is capable of not only capturing ASL phonetics, but also has generative abilities. We present a Probabilistic...
In a variety of conversation contexts, accurately predicting the time point at which a conversational participant is about to speak can help improve computer-mediated human-human communications. Although it is not difficult for a human to perceive turn-taking intent in conversations, it has been a challenging task for computers to date. In this study, we employed eye activity acquired from...
Many approaches have been proposed to predict punctuation marks. Previous results demonstrate that these methods are effective. However, a class imbalance problem still exists during training: most of the classes in the training set for punctuation prediction are non-punctuation marks, which affects the performance of punctuation prediction tasks. Therefore, this paper uses a focal loss...
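A standard multi-class focal loss, which is presumably close to what is applied here to punctuation prediction, can be sketched as below; the gamma value and the class weighting (omitted here) are tunable choices.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Multi-class focal loss: down-weights easy, well-classified examples
    (here, the dominant 'no punctuation' class) so that training focuses
    on the rare punctuation classes.

    logits:  (batch, num_classes)
    targets: (batch,) integer class ids
    """
    ce = F.cross_entropy(logits, targets, reduction="none")  # -log p_t per token
    p_t = torch.exp(-ce)                                      # probability of the true class
    return ((1.0 - p_t) ** gamma * ce).mean()

# toy usage: 4 tokens, 4 classes (none, comma, period, question mark)
loss = focal_loss(torch.randn(4, 4), torch.tensor([0, 0, 1, 2]))
```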
Voice assistants such as Siri, Alexa, etc. usually adopt a pipeline to process users’ utterances, which generally include transcribing the audio into text, understanding the text, and finally responding back to users. One potential issue is that some utterances could be devoid of any interesting speech, and are thus not worth being processed through the entire pipeline. Examples of...
End-to-end (E2E) mixed-case automatic speech recognition (ASR) systems that directly predict words in the written domain are attractive due to being simple to build, not requiring explicit capitalization models, allowing streaming capitalization without additional effort beyond that required for streaming ASR, and their small size. However, the fact that these systems produce various versions...
Speech recognition (ASR) and speaker diarization (SD) models have traditionally been trained separately to produce rich conversation transcripts with speaker labels. Recent advances have shown that joint ASR and SD models can learn to leverage audio-lexical inter-dependencies to improve word diarization performance. We introduce a new benchmark of hour-long podcasts collected from the weekly...
Disordered speech recognition is a highly challenging task. The underlying neuro-motor conditions of people with speech disorders, often compounded with co-occurring physical disabilities, lead to the difficulty in collecting large quantities of speech required for system development. This paper investigates a set of data augmentation techniques for disordered speech recognition, including...
In this paper, we propose a real-time robot-based auxiliary system for risk evaluation of COVID-19 infection. It combines real-time speech recognition, intent recognition, keyword detection, cough detection and other functions in order to convert live audio into actionable structured data to achieve the COVID-19 infection risk assessment function. In order to better...
Anomia (word finding difficulties) is the hallmark of aphasia, an acquired language disorder most commonly caused by stroke. Assessment of speech performance using picture naming tasks is therefore a key method for identification of the disorder and for monitoring a patient's response to treatment interventions. Currently, this assessment is conducted manually by speech and language therapists...
Audio-visual speech recognition (AVSR) technologies have been successfully applied to a wide range of tasks. When developing AVSR systems for disordered speech characterized by severe degradation of voice quality and large mismatch against normal, it is difficult to record large amounts of high quality audio-visual data. In order to address this issue, a cross-domain visual feature generation...
Spoken language transcripts generated from automatic speech recognition (ASR) often contain a large portion of disfluencies and lack punctuation symbols. Punctuation restoration and disfluency removal of the transcripts can facilitate downstream tasks such as machine translation, information extraction and syntactic analysis [1]. Various studies have shown the influence between these two tasks...
This paper presents an adversarial learning method for recognition-synthesis based non-parallel voice conversion. A recognizer is used to transform acoustic features into linguistic representations while a synthesizer recovers output features from the recognizer outputs together with the speaker identity. By separating the speaker characteristics from the linguistic representations, voice...
Phonetic Posteriorgrams (PPGs) have received much attention for non-parallel many-to-many Voice Conversion (VC), and have been shown to achieve state-of-the-art performance. These methods implicitly assume that PPGs are speaker-independent and contain only linguistic information in an utterance. In practice, however, PPGs carry speaker individuality cues, such as accent, intonation, and...
Voice Conversion (VC) aims at modifying source speaker's speech to sound like that of target speaker while preserving linguistic information of given speech. StarGAN-VC was recently proposed, which utilizes a variant of Generative Adversarial Networks (GAN) to perform non-parallel many-to-many VC. However, the quality of generated speech is not satisfactory enough. An improved method named...
We present a fully convolutional wav-to-wav network for converting between speakers' voices, without relying on text. Our network is based on an encoder-decoder architecture, where the encoder is pre-trained for the task of Automatic Speech Recognition, and a multi-speaker waveform decoder is trained to reconstruct the original signal in an autoregressive manner. We train the network on...
Non-parallel many-to-many voice conversion has recently been attracting huge research efforts in the speech processing community. A voice conversion system transforms an utterance of a source speaker into another utterance of a target speaker by keeping the content of the original utterance and replacing the vocal features with those of the target speaker. Existing solutions, e.g., StarGAN-VC2, present...
The low similarity and naturalness of synthesized speech remain a challenging problem for speaker adaptation with few resources. Since the acoustic model is too complex to interpret, overfitting will occur when training with little data. To prevent the model from overfitting, this paper proposes a novel speaker adaptation framework that decomposes the parameter space of the end-to-end acoustic...
We present a wav-to-wav generative model for the task of singing voice conversion from any identity. Our method utilizes both an acoustic model, trained for the task of automatic speech recognition, together with melody extracted features to drive a waveform-based generator. The proposed generative architecture is invariant to the speaker's identity and can be trained to generate target...
This paper proposes a novel approach to embed speaker information to feature vectors at frame level using an attention mechanism, and its application to one-shot voice conversion. A one-shot voice conversion system is a type of voice conversion system where only one utterance from a target speaker is available for conversion. In many one-shot voice conversion systems, a speaker encoder...
Data efficient voice cloning aims at synthesizing target speaker's voice with only a few enrollment samples at hand. To this end, speaker adaptation and speaker encoding are two typical methods based on base model trained from multiple speakers. The former uses a small set of target speaker data to transfer the multi-speaker model to target speaker's voice through direct model update, while in...
Multiple instance learning (MIL) has recently been used for weakly labelled audio tagging, where the spectrogram of an audio signal is divided into segments to form instances in a bag, and then the low-dimensional features of these segments are pooled for tagging. The choice of a pooling scheme is the key to exploiting the weakly labelled data. However, the traditional pooling schemes are...
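The pooling schemes under discussion (aggregating per-segment predictions into a clip-level prediction) can be contrasted in a small sketch; the uniform fallback weights for the attention case are an illustrative assumption, not a claim about the paper's method.

```python
import torch

def pool_segment_scores(scores, scheme="max", weights=None):
    """Pool per-segment (instance) tag probabilities into a clip-level (bag)
    prediction, as in multiple instance learning with weak labels.

    scores: (segments, tags) per-segment probabilities in [0, 1]
    """
    if scheme == "max":            # a tag is present if any segment fires strongly
        return scores.max(dim=0).values
    if scheme == "mean":           # average over all segments
        return scores.mean(dim=0)
    if scheme == "attention":      # learned weights emphasise informative segments
        if weights is None:        # fall back to uniform weights (equals mean pooling)
            weights = torch.zeros_like(scores)
        w = torch.softmax(weights, dim=0)
        return (w * scores).sum(dim=0)
    raise ValueError(scheme)

clip_probs = pool_segment_scores(torch.rand(10, 5), scheme="max")
```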
This paper presents SpeechMix, a regularization and data augmentation technique for deep sound recognition. Our strategy is to create virtual training samples by interpolating speech samples in hidden space. SpeechMix has the potential to generate an infinite number of new augmented speech samples since the combination of speech samples is continuous. Thus, it allows downstream models to avoid...
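Interpolating two training examples in hidden space, in the spirit of mixup applied to intermediate representations, might look like the sketch below; SpeechMix's exact choice of layer, label handling, and mixing distribution are not reproduced here.

```python
import numpy as np
import torch

def hidden_mixup(hidden_a, hidden_b, label_a, label_b, alpha=0.4):
    """Interpolate two examples in hidden (feature) space and mix their
    one-hot labels with the same coefficient.

    hidden_a, hidden_b: (time, dim) hidden activations of two samples
    label_a, label_b:   one-hot label vectors
    """
    lam = float(np.random.beta(alpha, alpha))
    mixed_hidden = lam * hidden_a + (1.0 - lam) * hidden_b
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_hidden, mixed_label

h, y = hidden_mixup(torch.randn(100, 64), torch.randn(100, 64),
                    torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0]))
```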
Convolutional neural networks (CNN) are one of the best-performing neural network architectures for environmental sound classification (ESC). Recently, temporal attention mechanisms have been used in CNN to capture the useful information from the relevant time frames for audio classification, especially for weakly labelled data where the onset and offset times of the sound events are not...
With the vast amount of audio data available, powerful sound representations can be learned with self-supervised methods even in the absence of explicit annotations. In this work we investigate learning general audio representations directly from raw signals using the Contrastive Predictive Coding objective. We further extend it by leveraging ideas from adversarial machine learning to produce...
In this paper we investigate the importance of the extent of memory in sequential self attention for sound recognition. We propose to use a memory controlled sequential self attention mechanism on top of a convolutional recurrent neural network (CRNN) model for polyphonic sound event detection (SED). Experiments on the URBAN-SED dataset demonstrate the impact of the extent of memory on sound...
Audio based event recognition becomes quite challenging in real world noisy environments. To alleviate the noise issue, time-frequency mask based feature enhancement methods have been proposed. While these methods with fixed filter settings have been shown to be effective in familiar noise backgrounds, they become brittle when exposed to unexpected noise. To address the unknown noise problem,...
Mean teacher based methods are increasingly achieving state-of-the-art performance for large-scale weakly labeled and unlabeled sound event detection (SED) tasks in recent DCASE challenges. By penalizing inconsistent predictions under different perturbations, mean teacher methods can exploit large-scale unlabeled data in a self-ensembling manner. In this paper, an effective perturbation...
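A bare-bones mean teacher setup, with an EMA-updated teacher and a consistency loss between differently perturbed views, could look like the following; the supervised loss on labeled data is omitted, the model is a placeholder linear layer, and simple additive noise stands in for the perturbation studied in the paper.

```python
import copy
import torch
import torch.nn as nn

def update_teacher(student, teacher, ema_decay=0.999):
    """Exponential moving average of student weights; the teacher is never
    updated by back-propagation."""
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(ema_decay).add_(p_s, alpha=1.0 - ema_decay)

student = nn.Linear(64, 10)               # stand-in for the SED network
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

x = torch.randn(8, 64)                    # unlabeled batch
noise = 0.1 * torch.randn_like(x)         # perturbation applied to the student input
consistency = ((student(x + noise) - teacher(x)) ** 2).mean()
consistency.backward()                    # trains the student only
update_teacher(student, teacher)          # teacher tracks the student via EMA
```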
This paper proposes a network architecture mainly designed for audio tagging, which can also be used for weakly supervised acoustic event detection (AED). The proposed network consists of a modified DenseNet as the feature extractor, and a global average pooling (GAP) layer to predict frame-level labels at inference time. This architecture is inspired by the work proposed by Zhou et al., a...
Knowledge Distillation (KD) is a popular area of research for reducing the size of large models while still maintaining good performance. The outputs of larger teacher models are used to guide the training of smaller student models. Given the repetitive nature of acoustic events, we propose to leverage this information to regulate the KD training for Audio Tagging. This novel KD method,...
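The soft-label distillation term that KD methods of this kind build on can be sketched as follows; the repetition-aware weighting proposed in the paper is not reproduced, and the temperature is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Standard knowledge distillation term: KL divergence between
    temperature-softened teacher and student distributions."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # the t*t factor keeps gradient magnitudes comparable across temperatures
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

# toy usage: 8 clips, 20 audio tags
loss = distillation_loss(torch.randn(8, 20), torch.randn(8, 20))
```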
We propose a two-stage sound event detection (SED) model to deal with sound events overlapped in time-frequency. In the first stage which consists of a faster R-CNN and an attention-LSTM, each log-mel spectrogram segment is divided into one or more proposed regions (PRs) according to the coordinates of a region proposal network. To efficiently train polyphonic sound, we take only one PR for...
Spoken language understanding (SLU) refers to the process of inferring semantic information from audio signals. While neural transformers consistently deliver the best performance among the state-of-the-art neural architectures in the field of natural language processing (NLP), their merits in the closely related field of spoken language understanding have not been...
Large end-to-end neural open-domain chatbots are becoming increasingly popular. However, research on building such chatbots has typically assumed that the user input is textual in nature, and it is not clear whether these chatbots would seamlessly integrate with automatic speech recognition (ASR) models to serve the speech modality. We aim to bring attention to this important question by...
Spoken Language Understanding (SLU) converts hypotheses from automatic speech recognizer (ASR) into structured semantic representations. ASR recognition errors can severely degenerate the performance of the subsequent SLU module. To address this issue, word confusion networks (WCNs) have been used as the input for SLU, which contain richer information than 1-best or n-best hypotheses list. To...
We consider the problem of spoken language understanding (SLU) of extracting natural language intents and associated slot arguments or named entities from speech that is primarily directed at voice assistants. Such a system subsumes both automatic speech recognition (ASR) as well as natural language understanding (NLU). An end-to-end joint SLU model can be built to a required specification...
Spoken language understanding is typically based on pipeline architectures including speech recognition and natural language understanding steps. These components are optimized independently to allow usage of available data, but the overall system suffers from error propagation. In this paper, we propose a novel training method that enables pretrained contextual embeddings to process acoustic...
Conversational speech, while being unstructured at an utterance level, typically has a macro topic which provides larger context spanning multiple utterances. The current language models in speech recognition systems using recurrent neural networks (RNNLM) rely mainly on the local context and exclude the larger context. In order to model the long term dependencies of words across multiple...
End-to-end spoken language understanding (SLU) systems have many advantages over conventional pipeline systems, but collecting in-domain speech data to train an end-to-end system is costly and time consuming. One question arises from this: how to train an end-to-end SLU with limited amounts of data? Many researchers have explored approaches that make use of other related data resources,...
Speech is one of the most effective means of communication and is full of information that helps the transmission of utterer's thoughts. However, mainly due to the cumbersome processing of acoustic features, phoneme or word posterior probability has frequently been discarded in understanding the natural language. Thus, some recent spoken language understanding (SLU) modules have utilized...
A modern Spoken Language Understanding (SLU) system usually contains two sub-systems, Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU), where ASR transforms voice signal to text form and NLU provides intent classification and slot filling from the text. In practice, such decoupled ASR/NLU design facilitates fast model iteration for both components. However, this...
An essential component of spoken language understanding (SLU) is slot filling: representing the meaning of a spoken utterance using semantic entity labels. In this paper, we develop end-to-end spoken (E2E) language understanding systems that directly convert speech input to semantic entities and investigate if these E2E SLU models can be trained solely on semantic entity annotations without...
Recently, the pipeline consisting of an x-vector speaker embedding front-end and a Probabilistic Linear Discriminant Analysis (PLDA) back-end has achieved state-of-the-art results in text-independent speaker verification. In this paper, we further improve the performance of x-vector and PLDA based system for text-dependent speaker verification by exploring the choice of layer to produce...
Modern approaches to speaker verification represent speech utterances as fixed-length embeddings. With these approaches, we implicitly assume that speaker characteristics are independent of the spoken content. Such an assumption generally holds when sufficiently long utterances are given. In this context, speaker embeddings, like i-vector and x-vector, have shown to be extremely effective. For...
In this paper, we present our XMUSPEECH system for Task 1 in the Short-Duration Speaker Verification (SdSV) Challenge. In this challenge, Task 1 is a Text-Dependent (TD) mode where speaker verification systems are required to automatically determine whether a test segment with specific phrase belongs to the target speaker. We leveraged the system pipeline from three aspects, including the data...
This paper describes our submission to Task 1 of the Short-duration Speaker Verification (SdSV) challenge 2020. Task 1 is a text-dependent speaker verification task, where both the speaker and phrase are required to be verified. The submitted systems were composed of TDNN-based and ResNet-based front-end architectures, in which the frame-level features were aggregated with various pooling...
This paper presents the Tallinn University of Technology systems submitted to the Short-duration Speaker Verification Challenge 2020. The challenge consists of two tasks, focusing on text-dependent and text-independent speaker verification with some cross-lingual aspects. We used speaker embedding models that consist of squeeze-and-attention based residual layers, multi-head attention...
In this paper, we describe the NICT speaker verification system for the text-independent task of the short-duration speaker verification (SdSV) challenge 2020. We first present the details of the training data and feature preparation. Then, x-vector-based front-ends with different network configurations, back-ends of probabilistic linear discriminant analysis (PLDA), simplified...
In this paper we describe the top-scoring IDLab submission for the text-independent task of the Short-duration Speaker Verification (SdSV) Challenge 2020. The main difficulty of the challenge exists in the large degree of varying phonetic overlap between the potentially cross-lingual trials, along with the limited availability of in-domain DeepMine Farsi training data. We introduce...
In this paper, we present the winning BUT submission for the text-dependent task of the SdSV challenge 2020. Given the large amount of training data available in this challenge, we explore successful techniques from text-independent systems in the text-dependent scenario. In particular, we trained x-vector extractors on both in-domain and out-domain datasets and combine them with i-vectors...
In this paper, we propose a novel way of addressing text-dependent automatic speaker verification (TD-ASV) by using a shared-encoder with task-specific decoders. An autoregressive predictive coding (APC) encoder is pre-trained in an unsupervised manner using both out-of-domain (LibriSpeech, VoxCeleb) and in-domain (DeepMine) unlabeled datasets to learn generic, high-level feature...
Phones, the segmental units of the International Phonetic Alphabet (IPA), are used for lexical distinctions in most human languages; tones, the suprasegmental units of the IPA, are used in perhaps 70% of them. Many previous studies have explored cross-lingual adaptation of automatic speech recognition (ASR) phone models, but few have explored the multilingual and cross-lingual transfer of...
Automatic speech recognition (ASR) tasks are usually solved using lexicon-based hybrid systems or character-based acoustic models to automatically translate speech data into written text. While hybrid systems require a manually designed lexicon, end-to-end models can process character-based speech data. This resolves the need to define a lexicon for non-English languages for which a standard...
In this paper, we present the cross-lingual and multilingual experiments we have conducted using existing resources of other languages for the development of speech recognition system for less-resourced languages. In our experiments, we used the Globalphone corpus as source and considered four Ethiopian languages namely Amharic, Oromo, Tigrigna and Wolaytta as targets. We have developed...
In this paper, we report a large-scale end-to-end language-independent multilingual model for joint automatic speech recognition (ASR) and language identification (LID). This model adopts hybrid CTC/attention architecture and achieves Word Error Rate (WER) of 52.8 and LID accuracy of 93.5 on 42 languages with around 5000 hours of training data. We also compare the effects of using...
Code-switching (CS) occurs when a speaker alternates words of two or more languages within a single sentence or across sentences. Automatic speech recognition (ASR) of CS speech has to deal with two or more languages at the same time. In this study, we propose a Transformer-based architecture with two symmetric language-specific encoders to capture the individual language attributes, that...
Development of Multilingual Automatic Speech Recognition (ASR) systems enables to share existing speech and text corpora among languages. We have conducted experiments on the development of multilingual Acoustic Models (AM) and Language Models (LM) for Tigrigna. Using Amharic Deep Neural Network (DNN) AM, Tigrigna pronunciation dictionary and trigram LM, we achieved a Word Error Rate (WER) of...
Acoustic word embeddings (AWEs) are vector representations of spoken word segments. AWEs can be learned jointly with embeddings of character sequences, to generate phonetically meaningful embeddings of written words, or acoustically grounded word embeddings (AGWEs). Such embeddings have been used to improve speech retrieval, recognition, and spoken term discovery. In this work, we extend...
This paper presents our latest effort on improving Code-switching language models that suffer from data scarcity. We investigate methods to augment Code-switching training text data by artificially generating them. Concretely, we propose a cycle-consistent adversarial networks based framework to transfer monolingual text into Code-switching text, considering Code-switching as a speaking style....
To deal with the problem of data scarcity in training a language model (LM) for code-switching (CS) speech recognition, we propose an approach to obtain augmentation texts from three different viewpoints. The first is to enhance a monolingual LM by selecting corresponding sentences for existing conversational corpora; the second is based on replacements using syntactic constraints for a...
Punctuation prediction is a critical component for speech recognition readability and speech translation segmentation. When considering multiple language support, traditional monolingual neural network models used for punctuation prediction can be costly to manage and may not produce the best accuracy. In this paper, we investigate multilingual Long Short-Term Memory (LSTM) modeling using...
In 2017 and 2018, two types of vocal-tract models with physical materials were developed that resemble anatomical models and can physically produce human-like speech sounds. The 2017 model is a static-model, and its vocal-tract configuration is set to produce the vowel /a/. The 2018 model is a dynamic-model, and portions of the articulators including the top surface of the tongue are made of a...
For decades, average Root Mean Square Error (RMSE) over all the articulatory channels is one of the most prevalent cost functions for training statistical models for the task of acoustic-to-articulatory inversion (AAI). One of the underlying assumptions is that the samples of all the articulatory channels used for training are balanced and play the same role in AAI. However, this is not true...
Speech production involves the movement of various articulators, including tongue, jaw, and lips. Estimating the movement of the articulators from the acoustics of speech is known as acoustic-to-articulatory inversion (AAI). Recently, it has been shown that instead of training AAI in a speaker specific manner, pooling the acoustic-articulatory data from multiple speakers is beneficial....
In this study we tested the hypothesis that consonant and vowel articulations start at the same time at syllable onset [1]. Articulatory data was collected for Mandarin Chinese using Electromagnetic Articulography (EMA), which tracks flesh-point movements in time and space. Unlike the traditional velocity threshold method [2], we used a triplet method based on the minimal pair paradigm [3]...
A new model for vocal folds with a polyp is proposed, based on a mass-spring-damper system and body-cover structure. The model was used to synthesize a wide variety of sustained vowels samples, with and without vocal polyps. Analytical conjectures regarding the effect of a polyp on synthesized voice signals corresponding to sustained vowels were performed. These conjectures are then used to...
This study attempts to describe a plausible causal mechanism of generating individual vocal characteristics in higher spectra. The lower vocal tract has been suggested to be such a causal region, but a question remains as to how this region modulates vowels’ higher spectra. Based on existing data, this study predicts that resonance of the lower vocal tract modulates higher vowel spectra into a...
The real-time Magnetic Resonance Imaging (rtMRI) is often used for speech production research as it captures the complete view of the vocal tract during speech. Air-tissue boundaries (ATBs) are the contours that trace the transition between high-intensity tissue region and low-intensity airway cavity region in the rtMRI video. The ATBs are used in several speech related applications. However,...
We propose a technique to estimate virtual upper lip (VUL) and virtual lower lip (VLL) trajectories during the production of bilabial stop consonants (/p/, /b/) and the nasal (/m/). A VUL (VLL) is a hypothetical trajectory below (above) the measured UL (LL) trajectory which could have been achieved by the UL (LL) if the UL and LL were not in contact with each other during bilabial stops and the nasal. The...
Speaker extraction aims to extract the target speech signal from a multi-talker environment given a target speaker's reference speech. We recently proposed a time-domain solution, SpEx, that avoids the phase estimation in frequency-domain approaches. Unfortunately, SpEx is not fully a time-domain solution since it performs time-domain speech encoding for speaker extraction, while taking...
Predicting and applying Time-Frequency (T-F) masks on mixture signals has been successfully utilized for speech separation. However, existing studies have not fully utilized the identity context of a speaker for the inference of masks. In this paper, we propose a novel speaker-aware monaural speech separation model. We firstly devise an encoder to disentangle speaker identity information with...
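The general recipe of predicting a time-frequency mask and applying it to the mixture spectrogram, which this model builds on, can be sketched as below; the STFT settings are illustrative, and a random mask stands in for the output of the speaker-conditioned network.

```python
import torch

def apply_tf_mask(mixture, mask, n_fft=512, hop=128):
    """Recover one speaker from a single-channel mixture by masking its STFT.

    mixture: (samples,) waveform
    mask:    (freq_bins, frames) values in [0, 1], e.g. predicted by a
             speaker-conditioned separation network
    """
    window = torch.hann_window(n_fft)
    spec = torch.stft(mixture, n_fft=n_fft, hop_length=hop,
                      window=window, return_complex=True)   # (freq, frames)
    masked = mask * spec                                     # element-wise masking
    return torch.istft(masked, n_fft=n_fft, hop_length=hop,
                       window=window, length=mixture.shape[-1])

mix = torch.randn(16000)                 # 1 second of audio at 16 kHz
n_frames = 1 + 16000 // 128              # frame count with default center padding
estimate = apply_tf_mask(mix, torch.rand(257, n_frames))
```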
Recently, Convolutional Neural Network (CNN) and Long short-term memory (LSTM) based models have been introduced to deep learning-based target speaker separation. In this paper, we propose an Attention-based neural network (Atss-Net) in the spectrogram domain for the task. It allows the network to compute the correlation between each feature parallelly, and using shallower layers to extract...
Target speech separation refers to isolating target speech from a multi-speaker mixture signal by conditioning on auxiliary information about the target speaker. Different from the mainstream audio-visual approaches which usually require simultaneous visual streams as additional input, e.g. the corresponding lip movement sequences, in our approach we propose the novel use of a single face...
Extracting the speech of a target speaker from mixed audios, based on a reference speech from the target speaker, is a challenging yet powerful technology in speech processing. Recent studies of speaker-independent speech separation, such as TasNet, have shown promising results by applying deep neural networks over the time-domain waveform. Such separation neural network does not directly...
Solving the cocktail party problem with a multi-modal approach has become popular in recent years. Humans can focus on the speech they are interested in within multi-talker mixed speech by hearing the mixed speech, watching the speaker, and understanding the context of what the speaker is talking about. In this paper, we try to solve the speaker-independent speech separation problem with...
Speech recognition technology in single-talker scenes has matured in recent years. However, in noisy environments, especially in multi-talker scenes, speech recognition performance is significantly reduced. Towards the cocktail party problem, we propose a unified time-domain target speaker extraction framework. In this framework, we obtain a voiceprint from a clean speech of the target...
Target-speaker speech separation, given its importance in industrial applications, has long been heavily researched. The key metric for qualifying a good separation algorithm still lies in the separation performance, i.e., the quality of the separated voice. In this paper, we present a novel high-performance time-domain waveform based target-speaker speech separation...
Keisuke Kinoshita (NTT) and Shoko Araki (NTT Communication Science Laboratories)
Being able to control the acoustic events (AEs) to which we want to listen would allow the development of more controllable hearable devices. This paper addresses the AE sound selection (or removal) problems, that we define as the extraction (or suppression) of all the sounds that...
Recent advancements in representation learning enable crossmodal retrieval by modeling an audio-visual co-occurrence in a single aspect, such as physical and linguistic. Unfortunately, in real-world media data, since co-occurrences in various aspects are complexly mixed, it is difficult to distinguish a specific target co-occurrence from many other non-target co-occurrences, resulting in...
Automatic speaker verification systems are vulnerable to audio replay attacks which bypass security by replaying recordings of authorized speakers. Replay attack detection (RA) systems built upon Residual Neural Networks (ResNets) have yielded astonishing results on the public benchmark ASVspoof 2019 Physical Access challenge. With most teams using fine-tuned feature extraction...
We present a new database of voice recordings with the goal of promoting research on protection of automatic speaker verification systems from voice spoofing, such as replay attacks. Specifically, we focus on the liveness feature of live speech, i.e., pop noise, and the corresponding voice recordings without this feature, for the purpose of combating spoofing via liveness detection. Our...
Despite tremendous progress in speaker verification recently, replay spoofing attacks are still a major threat to these systems. Focusing on dataset-specific scenarios, anti-spoofing systems have achieved promising in-domain performance at the cost of poor generalization towards unseen out-of-domain datasets. This is treated as a domain mismatch problem with a domain adversarial training (DAT)...
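Domain adversarial training of this kind typically relies on a gradient reversal layer between the feature extractor and the domain classifier; a minimal PyTorch sketch with illustrative dimensions is shown below (the anti-spoofing classifier branch is omitted).

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward
    pass, so the feature extractor learns to fool the domain classifier."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# usage: features -> spoofing classifier, and
#        features -> grad_reverse -> domain classifier
feats = torch.randn(8, 128, requires_grad=True)          # stand-in features
domain_logits = torch.nn.Linear(128, 2)(grad_reverse(feats))
domain_logits.sum().backward()                            # gradients reach feats with flipped sign
```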
Constructing a dataset for replay spoofing detection requires a physical process of playing an utterance and re-recording it, presenting a challenge to the collection of large-scale datasets. In this study, we propose a self-supervised framework for pretraining acoustic configurations using datasets published for other tasks, such as speaker verification. Here, acoustic configurations refer to the...
The fusion of i-vector with prosodic features is used to identify the most competent voice imitator through a deep neural network framework (DNN) in this paper. This experiment is conducted by analyzing the spectral and prosodic characteristics during voice imitation. Spectral features include mel frequency cepstral features (MFCC) and modified group delay features (MODGDF). Prosodic features,...
Modern text-to-speech (TTS) and voice conversion (VC) systems produce natural sounding speech that questions the security of automatic speaker verification (ASV). This makes detection of such synthetic speech very important to safeguard ASV systems from unauthorized access. Most of the existing spoofing countermeasures perform well when the nature of the attacks is made known to the system...
The threat of spoofing can pose a risk to the reliability of automatic speaker verification. Results from the bi-annual ASVspoof evaluations show that effective countermeasures demand front-ends designed specifically for the detection of spoofing artefacts. Given the diversity in spoofing attacks, ensemble methods are particularly effective. The work in this paper shows that a bank of very...
Current approaches to Voice Presentation Attack (VPA) detection have largely focused on spoofing detection within a single database and/or attack type. However, for practical Presentation Attack Detection (PAD) systems to be adopted by industry, they must be able to generalise to detect diverse and previously unseen VPAs. Inspired by successful aspects of deep learning systems for image...
The security and reliability of automatic speaker verification systems can be threatened by different types of spoofing attacks using speech synthesis, voice conversion, or replay. The 2-class Gaussian Mixture Model classifier for genuine and spoofed speech is usually used as the baseline in the ASVspoof challenge, which is designed to develop generalized countermeasures with the potential to...
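The 2-class GMM baseline mentioned here amounts to scoring each utterance with a log-likelihood ratio between a genuine-speech GMM and a spoofed-speech GMM; a toy sketch with synthetic features is given below (the component counts, feature dimensions, and random features are illustrative assumptions).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy stand-ins for frame-level features (e.g. LFCC/CQCC) of the two classes.
rng = np.random.default_rng(0)
genuine_feats = rng.normal(0.0, 1.0, size=(2000, 20))
spoofed_feats = rng.normal(0.5, 1.2, size=(2000, 20))

# One GMM per class, as in the common 2-class baseline.
gmm_genuine = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(genuine_feats)
gmm_spoofed = GaussianMixture(n_components=8, covariance_type="diag", random_state=0).fit(spoofed_feats)

def score_utterance(frames):
    """Average per-frame log-likelihood ratio; higher means more likely genuine."""
    return gmm_genuine.score(frames) - gmm_spoofed.score(frames)

print(score_utterance(rng.normal(0.0, 1.0, size=(300, 20))))
```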
Deep-learning based noise reduction algorithms have proven their success, especially for non-stationary noises, which makes it desirable to also use them for embedded devices like hearing aids (HAs). This, however, is currently not possible with state-of-the-art methods: they either require a lot of parameters and computational power, and thus are only feasible on modern CPUs, or they are...
This paper proposes a metric that we call the structured saliency benchmark (SSBM) to evaluate importance maps computed for automatic speech recognizers on individual utterances. These maps indicate the time-frequency points of the utterance that are most important for correct recognition of a target word. Our evaluation technique is not only suitable for standard classification tasks, but is...
We explore the possibility of leveraging accelerometer data to perform speech enhancement in very noisy conditions. Although it is possible to only partially reconstruct the user's speech from the accelerometer, the latter provides a strong conditioning signal that is not influenced by noise sources in the environment. Based on this observation, we feed a multi-modal input to SEANet (Sound...
Previous studies have confirmed the effectiveness of incorporating visual information into speech enhancement (SE) systems. Despite improved denoising performance, two problems may be encountered when implementing an audio-visual SE (AVSE) system: (1) additional processing costs are incurred to incorporate visual input and (2) the use of face or lip images may cause privacy problems. In this...
In bioacoustics, passive acoustic monitoring of animals living in the wild, both on land and underwater, leads to large data archives characterized by a strong imbalance between recorded animal sounds and ambient noises. Bioacoustic datasets suffer extremely from such large noise-variety, caused by a multitude of external influences and changing environmental conditions over years. This leads...
We formulate active noise control (ANC) as a supervised learning problem and propose a deep learning approach, called deep ANC, to address the nonlinear ANC problem. A convolutional recurrent network (CRN) is trained to estimate the real and imaginary spectrograms of the canceling signal from the reference signal so that the corresponding anti-noise can eliminate or attenuate the primary noise...
Increasing speech intelligibility for hearing-impaired listeners and normal-hearing listeners in noisy environments remains a challenging problem. Spectral style conversion from habitual to clear speech is a promising approach to address the problem. Motivated by the success of generative adversarial networks (GANs) in various applications of image and speech processing, we explore the...
Data-driven speech intelligibility prediction has been slow to take off. Datasets of measured speech intelligibility are scarce, and so current models are relatively small and rely on hand-picked features. Classical predictors based on psychoacoustic models and heuristics are still the state-of-the-art. This work proposes a U-Net inspired fully convolutional neural network architecture, NSIP,...
The measurement of speech intelligibility (SI) still mainly relies on time-consuming and expensive subjective experiments because no versatile objective measure can predict SI. One promising candidate of an SI prediction method is an approach with a deep neural network (DNN)-based automatic speech recognition (ASR) system, due to its recent great advance. In this paper, we propose and evaluate...
In this article, we provide a model to estimate a real-valued measure of the intelligibility of individual speech segments. We trained regression models based on Convolutional Neural Networks (CNN) for the stop consonants /p, t, k, b, d, g/ associated with the vowel /ɑ/, to estimate the corresponding Signal to Noise Ratio (SNR) at which the Consonant-Vowel (CV) sound becomes...
Convolutional neural networks are widely adopted in Acoustic Scene Classification (ASC) tasks, but they generally carry a heavy computational burden. In this work, we propose a high-performance yet lightweight baseline network inspired by MobileNetV2, which replaces square convolutional kernels with unidirectional ones to extract features alternately in temporal and frequency dimensions....
In this work, we compare the performance of three selected techniques in open set acoustic scene classification (ASC). We test thresholding of the softmax output of a deep network classifier, which is the most popular technique employed in ASC nowadays. Further, we compare the results with the Openmax classifier, which is derived from the computer vision field. As the third model, we use the...
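The softmax-thresholding baseline referred to here reduces to a few lines: if the most confident known-class probability falls below a threshold, the recording is rejected as an unknown scene. The threshold value below is an illustrative assumption.

```python
import numpy as np

def open_set_predict(probs, threshold=0.5):
    """Softmax-threshold baseline for open-set classification: return the
    arg-max known class, or -1 (unknown) when the top probability is low."""
    probs = np.asarray(probs)
    best = probs.argmax(axis=-1)
    return np.where(probs.max(axis=-1) >= threshold, best, -1)

print(open_set_predict([[0.7, 0.2, 0.1], [0.4, 0.35, 0.25]]))  # [0, -1]
```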
Acoustic scene classification systems using deep neural networks classify given recordings into pre-defined classes. In this study, we propose a novel scheme for acoustic scene classification which adopts an audio tagging system inspired by the human perception mechanism. When humans identify an acoustic scene, the existence of different sound events provides discriminative information which...
Convolutional Neural Networks (CNNs) have been widely investigated for Acoustic Scene Classification (ASC), where the convolutional operation extracts useful semantic content from a local receptive field in the input spectrogram within a certain Manhattan distance, i.e., the kernel size. Although stacking multiple convolution layers can increase the range of the receptive field, without...
In this paper, we propose a model for the Environment Sound Classification (ESC) task that consists of multiple feature channels given as input to a Deep Convolutional Neural Network (CNN) with an attention mechanism. The novelty of the paper lies in using multiple feature channels consisting of Mel-Frequency Cepstral Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GFCC), the...
Acoustic Scene Classification (ASC) is a challenging task, as a single scene may involve multiple events that contain complex sound patterns. For example, a cooking scene may contain several sound sources including silverware clinking, chopping, frying, etc. What complicates ASC more is that classes of different activities could have overlapping sound patterns (e.g. both cooking and...
In this paper, we propose a domain adaptation framework to address the device mismatch issue in acoustic scene classification leveraging upon neural label embedding (NLE) and relational teacher student learning (RTSL). Taking into account the structural relationships between acoustic scene classes, our proposed framework captures such relationships which are intrinsically device-independent....
In this paper, we propose a sub-utterance unit selection framework to remove acoustic segments in audio recordings that carry little information for acoustic scene classification (ASC). Our approach is built upon a universal set of acoustic segment units covering the overall acoustic scene space. First, those units are modeled with acoustic segment models (ASMs) used to tokenize acoustic scene...
Acoustic soundscapes can be made up of background sound events and foreground sound events. Many times, either the background (or the foreground) may provide useful cues in discriminating one soundscape from another. A part of the background or a part of the foreground can be suppressed by using subspace projections. These projections can be learnt by utilising the framework of robust...
Auditory data is used by ecologists for a variety of purposes, including identifying species ranges, estimating population sizes, and studying behaviour. Autonomous recording units (ARUs) enable auditory data collection over a wider area, and can provide improved consistency over traditional sampling methods. The result is an abundance of audio data -- much more than can be analysed by...
We present UTACO, a singing synthesis model based on an attention-based sequence-to-sequence mechanism and a vocoder based on dilated causal convolutions. These two classes of models have significantly affected the field of text-to-speech, but have never been thoroughly applied to the task of singing synthesis. UTACO demonstrates that attention can be successfully applied to the singing...
Peking Opera has been the most dominant form of Chinese performing art since around 200 years ago. A Peking Opera singer usually exhibits a very strong personal style via introducing improvisation and expressiveness on stage which leads the actual rhythm and pitch contour to deviate significantly from the original music score. This inconsistency poses a great challenge in Peking Opera singing...
Singing voice conversion is converting the timbre in the source singing to the target speaker's voice while keeping the singing content the same. However, singing data for the target speaker is much more difficult to collect compared with normal speech data. In this paper, we introduce a singing voice conversion algorithm that is capable of generating high quality target speaker's singing using only...
Detecting singing voice in polyphonic instrumental music is critical to music information retrieval. To train a robust vocal detector, a large dataset marked with vocal or non-vocal labels at the frame level is essential. However, frame-level labeling is time-consuming and labor expensive, so there are few well-labeled datasets available for singing-voice detection (S-VD). Hence, we propose...
This paper presents a new input format, channel-wise subband input (CWS), for convolutional neural networks (CNN) based music source separation (MSS) models in the frequency domain. We aim to address the major issues in CNN-based high resolution MSS model: high computational cost and weight sharing between distinctly different bands. Specifically, in this paper, we decompose the input mixture...
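Channel-wise subband input essentially splits the frequency axis into bands and stacks them along the channel dimension before the CNN; a small sketch of that reshaping is given below, with the number of subbands as an assumed parameter.

```python
import torch

def channelwise_subbands(spec, n_subbands=4):
    """Split the frequency axis of a spectrogram into equal subbands and stack
    them along the channel dimension, so the CNN processes each band with its
    own (non-shared) input channels.

    spec: (batch, channels, freq, time) magnitude spectrogram
    """
    b, c, f, t = spec.shape
    assert f % n_subbands == 0, "frequency bins must divide evenly into subbands"
    spec = spec.reshape(b, c, n_subbands, f // n_subbands, t)
    return spec.reshape(b, c * n_subbands, f // n_subbands, t)

x = torch.randn(2, 1, 512, 100)              # toy mixture spectrogram
print(channelwise_subbands(x).shape)         # torch.Size([2, 4, 128, 100])
```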
We emulate continual learning as observed in real life, where new training data representing a new application domain are used for gradual improvement of an automatic speech recognizer (ASR) trained on old domains. The data on which the original classifier was trained are no longer required, and we observe no loss of performance on the original domain. Further, on a previously unseen domain, our...
We present a new frame-wise online unsupervised adaptation method for an acoustic model based on a deep neural network (DNN). This is in contrast to many existing methods that assume offline and supervised processing. We use a likelihood cost function conditioned on past observations, which mathematically integrates the unsupervised adaptation and decoding processes for automatic speech...
In our previous work, we introduced a speaker adaptive training method for speech recognition based on a frame-level attention mechanism, which proved to be an effective way to perform speaker adaptive training. In this paper, we present an improved method by introducing an attention-over-attention mechanism. This attention module is used to further measure the contribution of each frame to the...
Rapid unsupervised speaker adaptation in an E2E system poses new challenges due to its unified end-to-end structure, in addition to the intrinsic difficulties of data sparsity and imperfect labels [1]. Previously, we proposed utilizing content-relevant personalized speech synthesis for rapid speaker adaptation and achieved a significant performance breakthrough in a hybrid system [2]. In this...
End-to-end models have been successfully introduced into automatic speech recognition (ASR) and achieve superior performance compared with conventional hybrid systems, especially with the newly proposed transformer model. However, speaker mismatch between training and test data remains a problem, and speaker adaptation for the transformer model can be further improved. In this paper, we propose...
In this paper, we propose a new speaker normalization technique for acoustic model adaptation in connectionist temporal classification (CTC)-based automatic speech recognition. In the proposed method, for the inputs of a hidden layer, the mean and variance of each activation are first estimated at the speaker level. Then, we normalize each speaker representation independently by making them...
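The abstract is cut off, but the step it describes (estimate the per-speaker mean and variance of each hidden activation, then normalise) can be sketched roughly as below; shapes and names are assumptions, not the paper's implementation.

```python
import torch

def speaker_level_normalize(h: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Normalise hidden-layer activations using statistics pooled over all
    frames of one speaker. h: [n_frames_of_speaker, hidden_dim].
    Illustrative sketch of speaker-level mean/variance normalisation."""
    mean = h.mean(dim=0, keepdim=True)                      # per-unit mean over the speaker
    var = h.var(dim=0, unbiased=False, keepdim=True)        # per-unit variance over the speaker
    return (h - mean) / torch.sqrt(var + eps)
```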
Unsupervised domain adaptation (UDA) using adversarial learning has shown promise in adapting speech models from a labeled source domain to an unlabeled target domain. However, prior works make a strong assumption that the label spaces of source and target domains are identical, which can be easily violated in real-world conditions. We present AMLS, an end-to-end architecture that performs...
Local dialects influence people to pronounce words of the same language differently from each other. The great variability and complex characteristics of accents create a major challenge for training a robust and accent-agnostic automatic speech recognition (ASR) system. In this paper, we introduce a cross-accented English speech recognition task as a benchmark for measuring the ability of...
We introduce the problem of adapting a black-box, cloud-based system to speech from a target accent. While leading online ASR services obtain impressive performance on mainstream accents, they perform poorly on sub-populations: we observed that the word error rate (WER) achieved by Google’s ASR API on Indian accents is almost twice the WER on US accents. Existing adaptation methods either...
Current automatic speech recognition (ASR) systems trained on native speech often perform poorly when applied to non-native or accented speech. In this work, we propose to compute x-vector-like accent embeddings and use them as auxiliary inputs to an acoustic model trained on native data only in order to improve the recognition of multi-accent data comprising native, non-native, and accented...
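As a rough sketch of the auxiliary-input idea (names and dimensions are assumptions, not the paper's code), the utterance-level accent embedding is simply tiled over time and concatenated with the frame-level acoustic features before they enter the acoustic model.

```python
import torch

def append_accent_embedding(feats: torch.Tensor, accent_emb: torch.Tensor) -> torch.Tensor:
    """feats: [T, feat_dim] frame-level acoustic features of one utterance.
    accent_emb: [emb_dim] utterance-level x-vector-like accent embedding.
    Returns [T, feat_dim + emb_dim] augmented acoustic-model input.
    Illustrative sketch only."""
    tiled = accent_emb.unsqueeze(0).expand(feats.size(0), -1)  # repeat over all frames
    return torch.cat([feats, tiled], dim=-1)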
This paper presents a high-quality singing synthesizer that is able to model a voice with limited available recordings. Based on the sequence-to-sequence singing model, we design a multi-singer framework to leverage all the existing singing data of different singers. To attenuate the issue of musical score imbalance among singers, we incorporate an adversarial task of singer classification to...
This study investigates the direct use of speech waveforms to predict head motion for speech-driven head-motion synthesis, whereas the use of spectral features such as MFCC as basic input features together with additional features such as energy and F0 is common in the literature. We show that, rather than combining different features that originate from waveforms, it is more effective to use...
This paper presents XiaoiceSing, a high-quality singing voice synthesis system which employs an integrated network for spectrum, F0 and duration modeling. We follow the main architecture of FastSpeech while proposing some singing-specific designs: 1) Besides phoneme ID and position encoding, features from the musical score (e.g., note pitch and length) are also added. 2) To attenuate off-key issues,...
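As an illustration of point 1) only (module names, vocabulary sizes, and embedding dimension are assumptions), musical-score information can be embedded and summed with the phoneme embeddings before positional encoding and the FastSpeech-style encoder.

```python
import torch
import torch.nn as nn

class ScoreAwareInput(nn.Module):
    """Hypothetical sketch: add note-pitch and note-duration embeddings to
    phoneme embeddings, in the spirit of a score-aware FastSpeech input."""
    def __init__(self, n_phones=80, n_pitches=128, n_dur_bins=64, d_model=256):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, d_model)
        self.pitch_emb = nn.Embedding(n_pitches, d_model)   # MIDI-style note pitch
        self.dur_emb = nn.Embedding(n_dur_bins, d_model)    # quantised note length

    def forward(self, phone_ids, pitch_ids, dur_ids):
        # All inputs: [B, L] integer IDs; positional encoding would be added
        # afterwards, as in FastSpeech.
        return self.phone_emb(phone_ids) + self.pitch_emb(pitch_ids) + self.dur_emb(dur_ids)
```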
The ability to envisage the visual appearance of a talking face just from hearing a voice is a unique human capability. A number of recent works have attempted to replicate this ability. We differ from these approaches by enabling a variety of talking-face generations from a single audio input. Indeed, just having the ability to generate a single talking face would make a system almost...
This paper investigates the use of generative adversarial network (GAN)-based models for converting the spectrogram of a speech signal into that of a singing one, without reference to the phoneme sequence underlying the speech. This is achieved by viewing the speech-to-singing conversion task as a style transfer problem. Specifically, given a speech input, and optionally the F0 contour of the...
We are quite able to imagine the voice characteristics of a speaker from their appearance, especially the face. In this paper, we propose Face2Speech, which generates speech with characteristics predicted from a face image. This framework consists of three separately trained modules: a speech encoder, a multi-speaker text-to-speech (TTS) model, and a face encoder. The speech encoder outputs an...
Previous talking-head generation methods mostly focus on frontal face synthesis while neglecting natural head motion. In this paper, a generative adversarial network (GAN)-based method is proposed to generate talking-head video with not only high-quality facial appearance and accurate lip movement, but also natural head motion. To this end, the facial landmarks are detected and used to...
This contribution describes the "IISPA" submission to the Hurricane Challenge 2.0. The challenge organizers called for submissions of speech signals processed with the aim of improving their intelligibility in adverse listening conditions. They evaluated the submissions with matrix sentence tests in an international listening experiment. An intelligibility-improving signal processing approach...