# INTERSPEECH 2020

Asia/Shanghai
Shanghai International Convention Center

• Sunday, 25 October
• 16:30 18:00
Tutorial 1: Efficient and flexible implementation of machine learning for ASR and MT room1

### room1

https://zoom.com.cn/j/62835797158

• 16:30
Efficient and flexible implementation of machine learning for ASR and MT 1h 30m

Flexibility and speed are key features for a deep learning framework to allow fast transition from a research idea to prototyping and production code. We outline how to implement a unified framework for sequence processing that covers various kinds of models and applications. We will discuss our toolkit RETURNN as an example of such an implementation that is easy to apply and understand for the user, flexible to allow for any kind of architecture or method, and at the same time also very efficient. In addition, a comparison of the properties of different machine learning toolkits for sequence classification is provided. The flexibility of using such specific implementations will be demonstrated by describing the setup of recent state-of-the-art models for automatic speech recognition and machine translation, among others.

Speakers: Albert Zeyer, André Merboldt, Nick Rossenbach, Parnia Bahar, Ralf Schlüter
• 16:30 18:00
Tutorial 2: Spoken dialogue for social robots room2

### room2

https://zoom.com.cn/j/68267360623

• 16:30
Spoken dialogue for social robots 1h 30m

While smartphone assistants and smart speakers are prevalent and there are high expectations for social communicative robots, spoken language interaction with these kinds of robots is not effectively deployed. This tutorial aims to give an overview of the issues and challenges related to the integration of natural multimodal dialogue processing for social robots. We first outline dialogue tasks and interfaces suitable for robots in comparison with the conventional dialogue systems and virtual agents. Then, challenges and approaches in the component technologies including ASR, TTS, SLU and dialogue management are reviewed with the focus on human-robot interaction. Issues related to multimodal processing are also addressed. In particular, we review non-verbal processing, including gaze and gesturing, for facilitating turn-taking, timing of backchannels, and indicating troubles in interaction. Finally, we will also briefly discuss open questions concerning architectures for integrating spoken dialogue systems and human-robot interaction.

Speakers: Kristiina Jokinen, Tatsuya Kawahara
• 16:30 18:00
Tutorial 3: Meta learning and its applications to human language processing room3

### room3

https://zoom.com.cn/j/67068089598

• 16:30
Meta learning and its applications to human language processing 1h 30m

Deep learning based human language technology (HLT), such as automatic speech recognition, intent and slot recognition, or dialog management, has become the mainstream of research in recent years and significantly outperforms conventional methods. However, deep learning models are notorious for being data and computation hungry. These downsides limit the deployment of such models to different languages, domains, or styles, since collecting in-genre data and training models from scratch are costly, and the long-tail nature of human language makes the challenges even greater.
A typical machine learning algorithm, e.g., deep learning, can be considered as a sophisticated function. The function takes training data as input and produces a trained model as output. Today the learning algorithms are mostly human-designed. Usually, these algorithms are designed for one specific task and need a large amount of labeled training data to learn. One possible method which could potentially overcome these challenges is Meta Learning, also known as ‘Learning to Learn’, which aims at learning the learning algorithm, including better parameter initialization, optimization strategy, network architecture, distance metrics and beyond. Recently, in several HLT areas, Meta Learning has shown high potential to allow faster fine-tuning, converge to better performance, and achieve few-shot learning. The goal of this tutorial is to introduce Meta Learning approaches and review the work applying this technology to HLT.
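
As a rough illustration of the "learning algorithm as a function" view above, the following sketch treats a few steps of gradient descent as a function from training data and an initialization to a trained model, and then meta-learns the initialization across tasks. The Reptile-style update, the toy one-parameter regression tasks, and the names `learn` and `meta_learn_init` are illustrative assumptions, not material from the tutorial.

```python
def learn(data, init, lr=0.1, steps=5):
    """Learning algorithm as a function: (training data, init) -> trained model.

    The 'model' is a single slope w for y = w * x, fitted by a few steps of
    gradient descent on the mean squared error.
    """
    w = init
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def meta_learn_init(tasks, init=0.0, meta_lr=0.1, rounds=300):
    """Reptile-style meta learning (an assumption of this sketch): nudge the
    shared initialization toward the model that `learn` reaches on each task."""
    for i in range(rounds):
        data = tasks[i % len(tasks)]
        adapted = learn(data, init)
        init += meta_lr * (adapted - init)
    return init


# Hypothetical toy tasks: y = a * x with slopes a = 1, 2, 3.
xs = [-1.0, 0.0, 1.0]
tasks = [[(x, a * x) for x in xs] for a in (1.0, 2.0, 3.0)]
init = meta_learn_init(tasks)
```

The meta-learned `init` settles near the middle of the task slopes, so the same five gradient steps land much closer to the slope-3 task's solution than they do when starting from zero.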

Speakers: Hung-yi Lee, Ngoc Thang Vu, Shang-Wen Li
• 16:30 18:00
Tutorial 4: Intelligibility evaluation and speech enhancement based on deep learning room4

### room4

https://zoom.com.cn/j/63526611552

• 16:30
Intelligibility evaluation and speech enhancement based on deep learning 1h 30m

Although recent success has demonstrated the effectiveness of adopting deep-learning-based models in the speech enhancement (SE) task, several directions are worth exploring to further improve SE performance. One direction is to derive a better objective function to replace the conventional mean-squared-error-based one for training the deep-learning-based models. In this tutorial, we first present several well-known intelligibility evaluation metrics and then present the theory and implementation details of SE systems trained with metric-based objective functions. The effectiveness of these objective functions is confirmed by better standardized objective metric and subjective listening test scores, as well as higher automatic speech recognition accuracy.

Speakers: Fei Chen, Yu Tsao
• 18:00 18:15
Coffee Break
• 18:15 19:45
Tutorial 1: contd. room1

### room1

https://zoom.com.cn/j/62835797158

• 18:15
Efficient and flexible implementation of machine learning for ASR and MT 1h 30m

Flexibility and speed are key features for a deep learning framework to allow fast transition from a research idea to prototyping and production code. We outline how to implement a unified framework for sequence processing that covers various kinds of models and applications. We will discuss our toolkit RETURNN as an example of such an implementation that is easy to apply and understand for the user, flexible to allow for any kind of architecture or method, and at the same time also very efficient. In addition, a comparison of the properties of different machine learning toolkits for sequence classification is provided. The flexibility of using such specific implementations will be demonstrated by describing the setup of recent state-of-the-art models for automatic speech recognition and machine translation, among others.

Speakers: Albert Zeyer, André Merboldt, Nick Rossenbach, Parnia Bahar, Ralf Schlüter
• 18:15 19:45
Tutorial 2: contd. room2

### room2

https://zoom.com.cn/j/68267360623

• 18:15
Spoken dialogue for social robots 1h 30m

While smartphone assistants and smart speakers are prevalent and there are high expectations for social communicative robots, spoken language interaction with these kinds of robots is not effectively deployed. This tutorial aims to give an overview of the issues and challenges related to the integration of natural multimodal dialogue processing for social robots. We first outline dialogue tasks and interfaces suitable for robots in comparison with the conventional dialogue systems and virtual agents. Then, challenges and approaches in the component technologies including ASR, TTS, SLU and dialogue management are reviewed with the focus on human-robot interaction. Issues related to multimodal processing are also addressed. In particular, we review non-verbal processing, including gaze and gesturing, for facilitating turn-taking, timing of backchannels, and indicating troubles in interaction. Finally, we will also briefly discuss open questions concerning architectures for integrating spoken dialogue systems and human-robot interaction.

Speakers: Kristiina Jokinen, Tatsuya Kawahara
• 18:15 19:45
Tutorial 3: contd. room3

### room3

https://zoom.com.cn/j/67068089598

• 18:15
Meta learning and its applications to human language processing 1h 30m

Deep learning based human language technology (HLT), such as automatic speech recognition, intent and slot recognition, or dialog management, has become the mainstream of research in recent years and significantly outperforms conventional methods. However, deep learning models are notorious for being data and computation hungry. These downsides limit the deployment of such models to different languages, domains, or styles, since collecting in-genre data and training models from scratch are costly, and the long-tail nature of human language makes the challenges even greater.
A typical machine learning algorithm, e.g., deep learning, can be considered as a sophisticated function. The function takes training data as input and produces a trained model as output. Today the learning algorithms are mostly human-designed. Usually, these algorithms are designed for one specific task and need a large amount of labeled training data to learn. One possible method which could potentially overcome these challenges is Meta Learning, also known as ‘Learning to Learn’, which aims at learning the learning algorithm, including better parameter initialization, optimization strategy, network architecture, distance metrics and beyond. Recently, in several HLT areas, Meta Learning has shown high potential to allow faster fine-tuning, converge to better performance, and achieve few-shot learning. The goal of this tutorial is to introduce Meta Learning approaches and review the work applying this technology to HLT.

Speakers: Hung-yi Lee, Ngoc Thang Vu, Shang-Wen Li
• 18:15 19:45
Tutorial 4: contd. room4

### room4

https://zoom.com.cn/j/63526611552

• 18:15
Intelligibility evaluation and speech enhancement based on deep learning 1h 30m

Although recent success has demonstrated the effectiveness of adopting deep-learning-based models in the speech enhancement (SE) task, several directions are worth exploring to further improve SE performance. One direction is to derive a better objective function to replace the conventional mean-squared-error-based one for training the deep-learning-based models. In this tutorial, we first present several well-known intelligibility evaluation metrics and then present the theory and implementation details of SE systems trained with metric-based objective functions. The effectiveness of these objective functions is confirmed by better standardized objective metric and subjective listening test scores, as well as higher automatic speech recognition accuracy.

Speakers: Fei Chen, Yu Tsao
• 19:45 20:00
Coffee Break
• 20:00 21:30
Tutorial 5: 'Speech 101' - What everyone working on spoken language processing needs to know about spoken language room1

### room1

https://zoom.com.cn/j/62835797158

• 20:00
'Speech 101' - What everyone working on spoken language processing needs to know about spoken language 1h 30m

In recent years, the field of spoken language processing has been moving at a very fast pace. The impact of deep learning coupled with access to vast data resources has given rise to unprecedented improvements in the performance of speech processing algorithms and systems. However, the availability of such pre-recorded datasets and open-source machine-learning toolkits means that practitioners – especially students – are in real danger of becoming detached from the nature and behaviour of actual speech signals. This tutorial is aimed at providing an appreciation of the fundamental properties of spoken language, from low-level phonetic detail to high-level communicative behaviour, with a special emphasis on aspects that may have significance for current and future research.

Speaker: Roger K. Moore
• 20:00 21:30
Tutorial 6: Neural approaches to conversational information retrieval room2

### room2

https://zoom.com.cn/j/68267360623

• 20:00
Neural approaches to conversational information retrieval 1h 30m

A conversational information retrieval (CIR) system is an information retrieval (IR) system with a conversational interface which allows users to interact with the system to seek information via multi-turn conversations of natural language (in spoken or written form). This tutorial surveys recent advances in CIR, focusing on neural approaches that have been developed in the last few years. We present (1) a typical architecture of a CIR system, (2) new tasks and applications which arise from the needs of developing such a system, in comparison with traditional keyword-based IR systems, (3) new methods of conversational question answering, and (4) case studies of several CIR systems developed in research communities and industry.

Speakers: Chenyan Xiong, Jianfeng Gao, Paul Bennett
• 20:00 21:30
Tutorial 7: Neural models for speaker diarization in the context of speech recognition room3

### room3

https://zoom.com.cn/j/67068089598

• 20:00
Neural models for speaker diarization in the context of speech recognition 1h 30m

Speaker diarization is an essential component for speech applications in multi-speaker settings. Spoken utterances need to be attributed to speaker-specific classes with or without prior knowledge of the speakers' identity or profile. Initially, speaker diarization technologies were developed as standalone processes without requiring much context of other components in a given speech application. As speech recognition technology has become more accessible, there is an emerging trend of considering speaker diarization as an integral part of an overall speech recognition application, while benefiting from the speech recognition output to improve speaker diarization accuracy. Lately, joint model training for speaker diarization and speech recognition has been investigated in an attempt to consolidate the training objectives, enhancing the overall performance. In this tutorial, we will give an overview of the development of speaker diarization in the era of deep learning, present the recent approaches to speaker diarization in the context of speech recognition, and share the industry perspectives on speaker diarization and its challenges. Finally, we will provide insights about future directions of speaker diarization as a part of context-aware interactive systems.

Speakers: Dimitrios Dimitriadis, Kyu J. Han, Tae Jin Park
• 20:00 21:30
Tutorial 8: Spoken language processing for language learning and assessment room4

### room4

https://zoom.com.cn/j/63526611552

• 20:00
Spoken language processing for language learning and assessment 1h 30m

This tutorial will provide an in-depth survey of the state of the art in spoken language processing in language learning and assessment from a practitioner’s perspective. The first part of the tutorial will discuss in detail the acoustic, speech, and language processing challenges in recognizing and dealing with native and non-native speech from both adults and children from different language backgrounds at scale. The second part of the tutorial will examine the current state of the art in both knowledge-driven and data-driven approaches to automated scoring of such data along various dimensions of spoken language proficiency, be it monologic or dialogic in nature. The final part of the tutorial will look at a hot topic and key challenge facing the field at the moment – that of automatically generating targeted feedback for language learners that can help them improve their overall spoken language proficiency.

The presenters, based at Educational Testing Service R&D in Princeton and San Francisco, USA, have more than 40 years of combined R&D experience in spoken language processing for education, speech recognition, spoken dialog systems and automated speech scoring.

Speakers: Keelan Evanini, Klaus Zechner, Vikram Ramanarayanan
• 21:30 21:45
Coffee Break
• 21:45 23:15
Tutorial 5: contd. room1

### room1

https://zoom.com.cn/j/62835797158

• 21:45
'Speech 101' - What everyone working on spoken language processing needs to know about spoken language 1h 30m

In recent years, the field of spoken language processing has been moving at a very fast pace. The impact of deep learning coupled with access to vast data resources has given rise to unprecedented improvements in the performance of speech processing algorithms and systems. However, the availability of such pre-recorded datasets and open-source machine-learning toolkits means that practitioners – especially students – are in real danger of becoming detached from the nature and behaviour of actual speech signals. This tutorial is aimed at providing an appreciation of the fundamental properties of spoken language, from low-level phonetic detail to high-level communicative behaviour, with a special emphasis on aspects that may have significance for current and future research.

Speaker: Roger K. Moore
• 21:45 23:15
Tutorial 6: contd. room2

### room2

https://zoom.com.cn/j/68267360623

• 21:45
Neural approaches to conversational information retrieval 1h 30m

A conversational information retrieval (CIR) system is an information retrieval (IR) system with a conversational interface which allows users to interact with the system to seek information via multi-turn conversations of natural language (in spoken or written form). This tutorial surveys recent advances in CIR, focusing on neural approaches that have been developed in the last few years. We present (1) a typical architecture of a CIR system, (2) new tasks and applications which arise from the needs of developing such a system, in comparison with traditional keyword-based IR systems, (3) new methods of conversational question answering, and (4) case studies of several CIR systems developed in research communities and industry.

Speakers: Chenyan Xiong, Jianfeng Gao, Paul Bennett
• 21:45 23:15
Tutorial 7: contd. room3

### room3

https://zoom.com.cn/j/67068089598

• 21:45
Neural models for speaker diarization in the context of speech recognition 1h 30m

Speaker diarization is an essential component for speech applications in multi-speaker settings. Spoken utterances need to be attributed to speaker-specific classes with or without prior knowledge of the speakers' identity or profile. Initially, speaker diarization technologies were developed as standalone processes without requiring much context of other components in a given speech application. As speech recognition technology has become more accessible, there is an emerging trend of considering speaker diarization as an integral part of an overall speech recognition application, while benefiting from the speech recognition output to improve speaker diarization accuracy. Lately, joint model training for speaker diarization and speech recognition has been investigated in an attempt to consolidate the training objectives, enhancing the overall performance. In this tutorial, we will give an overview of the development of speaker diarization in the era of deep learning, present the recent approaches to speaker diarization in the context of speech recognition, and share the industry perspectives on speaker diarization and its challenges. Finally, we will provide insights about future directions of speaker diarization as a part of context-aware interactive systems.

Speakers: Dimitrios Dimitriadis, Kyu J. Han, Tae Jin Park
• 21:45 23:15
Tutorial 8: contd. room4

### room4

https://zoom.com.cn/j/63526611552

• 21:45
Spoken language processing for language learning and assessment 1h 30m

This tutorial will provide an in-depth survey of the state of the art in spoken language processing in language learning and assessment from a practitioner’s perspective. The first part of the tutorial will discuss in detail the acoustic, speech, and language processing challenges in recognizing and dealing with native and non-native speech from both adults and children from different language backgrounds at scale. The second part of the tutorial will examine the current state of the art in both knowledge-driven and data-driven approaches to automated scoring of such data along various dimensions of spoken language proficiency, be it monologic or dialogic in nature. The final part of the tutorial will look at a hot topic and key challenge facing the field at the moment – that of automatically generating targeted feedback for language learners that can help them improve their overall spoken language proficiency.

The presenters, based at Educational Testing Service R&D in Princeton and San Francisco, USA, have more than 40 years of combined R&D experience in spoken language processing for education, speech recognition, spoken dialog systems and automated speech scoring.

Speakers: Keelan Evanini, Klaus Zechner, Vikram Ramanarayanan
• Monday, 26 October
• 17:00 19:00
Opening session Keynote 1: Janet B. Pierrehumbert, The cognitive status of simple and complex models room1

### room1

Janet B. Pierrehumbert, University of Oxford

https://zoom.com.cn/j/68015160461

• 18:00
The cognitive status of simple and complex models 1h

Human languages are extraordinarily rich systems. They have extremely large lexical inventories, and the elements in these inventories can be combined to generate a potentially unbounded set of distinct messages. Regularities at many different levels of representation — from the phonetic level through the syntax and semantics — support people's ability to process mappings between the physical reality of speech, and the objects, events, and relationships that speech refers to. However, human languages also simplify reality. The phonological system establishes equivalence classes amongst articulatory-acoustic events that have considerable variation at the parametric level. The semantic system similarly establishes equivalence classes amongst real-world phenomena having considerable variation.

The tension between simplicity and complexity is a recurring theme of research on language modelling. In this talk, I will present three case studies in which a pioneering simple model omitted important complexities that were either included in later models, or that remain as challenges to this day. The first is the acoustic theory of speech production, as developed by Gunnar Fant, the inaugural Medal recipient in 1989. By approximating the vocal tract as a half-open tube, it showed that the first three formants of vowels (which are the most important for the perception of vowel quality) can be computed as a linear systems problem. The second is the autosegmental-metrical theory of intonation, to which I contributed early in my career. It made the simplifying assumption that the correct model of phonological representation will support the limited set of observed non-local patterns, while excluding non-local patterns that do not naturally occur. The third case concerns how word-formation patterns are generalised in forming new words, whether through inflectional morphology (as in “one wug; two wugs”) or derivational morphology (as in “nickname, unnicknameable”). Several early models of word-formation assume that the morphemes are conceptual categories, sharing formal properties of other categories in the cognitive system.

For all three case studies, I will suggest that — contrary to what one might imagine — the simple models enjoyed good success precisely because they were cognitively realistic. The most successful early models effectively incorporated ways in which the cognitive system simplifies reality. These simplifications are key to the learnability and adaptability of human languages. The simplified core of the system provides the scaffolding for more complex or irregular aspects of language. In progressing from simple models to fully complex models, we should make sure we continue to profit from insights into how humans learn, encode, remember, and produce speech patterns.
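
The half-open tube approximation in the first case study can be made concrete with a short calculation. A uniform tube closed at one end (the glottis) and open at the other (the lips) resonates at odd multiples of a quarter wavelength, F_n = (2n - 1) c / (4 L). The tube length of 17.5 cm and the sound speed of 350 m/s used below are typical textbook values chosen for illustration; they are assumptions, not figures from the talk.

```python
def tube_formants(length_m, sound_speed=350.0, count=3):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other."""
    return [(2 * n - 1) * sound_speed / (4.0 * length_m)
            for n in range(1, count + 1)]


# Assumed values: 17.5 cm vocal tract, 350 m/s speed of sound.
formants = tube_formants(0.175)
print([round(f) for f in formants])  # → [500, 1500, 2500]
```

These evenly spaced resonances correspond to a neutral, schwa-like vowel; real vowels depart from them because the vocal tract is not a uniform tube.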

• 19:00 19:15
Coffee Break
• 19:15 20:15
Mon-1-1 ASR neural network architectures I room1

### room1

Chairs: Ralf Schlüter, Yanhua Long

https://zoom.com.cn/j/68015160461

Convener: Ralf Schlüter
• 19:15
Mon-1-1-1 On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition 1h

Recently, there has been a strong push to transition from hybrid models to end-to-end (E2E) models for automatic speech recognition. Currently, there are three promising E2E methods: recurrent neural network transducer (RNN-T), RNN attention-based encoder-decoder (AED), and Transformer-AED.
In this study, we conduct an empirical comparison of RNN-T, RNN-AED, and Transformer-AED models, in both non-streaming and streaming modes. We use 65 thousand hours of Microsoft anonymized training data to train these models. As E2E models are more data hungry, it is better to compare their effectiveness with a large amount of training data. To the best of our knowledge, no such comprehensive study has been conducted yet. We show that although AED models are stronger than RNN-T in the non-streaming mode, RNN-T is very competitive in streaming mode if its encoder can be properly initialized. Among all three E2E models, Transformer-AED achieved the best accuracy in both streaming and non-streaming modes. We show that both streaming RNN-T and Transformer-AED models can obtain better accuracy than a highly-optimized hybrid model.

Speakers: Chengyi Wang (Microsoft Research Asia) , Jinyu Li (Microsoft) , Shujie Liu (Microsoft Research Asia) , Yashesh Gaur (Microsoft) , Yu Wu (Microsoft Research Asia) , Rui Zhao (Microsoft)
• 19:15
Mon-1-1-10 Effect of Adding Positional Information on Convolutional Neural Networks for End-to-End Speech Recognition 1h

Attention-based models with convolutional encoders enable faster training and inference than recurrent neural network-based ones. However, convolutional models often require a very large receptive field to achieve high recognition accuracy, which not only increases the parameter size but also the computational cost and run-time memory footprint. A convolutional encoder with a short receptive field length can suffer from looping or skipping problems when the input utterance contains the same words as nearby sentences. We believe that this is due to the insufficient receptive field length, and try to remedy this problem by adding positional information to the convolution-based encoder. It is shown that the word error rate (WER) of a convolutional encoder with a short receptive field size can be reduced significantly by augmenting it with positional information. Visualization results are presented to demonstrate the effectiveness of adding positional information. The proposed method improves the accuracy of attention models with a convolutional encoder and achieves a WER of 10.60% on TED-LIUMv2 for an end-to-end speech recognition task.
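
To make the receptive field trade-off concrete, the sketch below computes the receptive field of a stack of stride-1 one-dimensional convolutions, which grows by (kernel size - 1) × dilation per layer. The two layer configurations are hypothetical and are not the encoders evaluated in the paper.

```python
def receptive_field(layers):
    """Receptive field (in input frames) of stacked stride-1 1-D convolutions.

    `layers` is a list of (kernel_size, dilation) pairs.
    """
    field = 1
    for kernel_size, dilation in layers:
        field += (kernel_size - 1) * dilation
    return field


# Hypothetical encoders: six layers with 3-tap vs. 21-tap kernels.
short_encoder = [(3, 1)] * 6
wide_encoder = [(21, 1)] * 6
print(receptive_field(short_encoder))  # → 13
print(receptive_field(wide_encoder))   # → 121
```

Widening the kernels from 3 to 21 taps raises the receptive field from 13 to 121 frames but multiplies the convolution weights per layer by seven, which is the kind of parameter and compute cost the abstract refers to.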

Speakers: Jinhwan Park (Seoul National University) , Wonyong Sung (Seoul National University)
• 19:15
Mon-1-1-2 SAN-M: Memory Equipped Self-Attention for End-to-End Speech Recognition 1h

End-to-end speech recognition has become popular in recent years, since it can integrate the acoustic, pronunciation and language models into a single neural network. Among end-to-end approaches, attention-based methods have emerged as being superior. One example is the Transformer, which adopts an encoder-decoder architecture. The key improvement introduced by Transformer is the utilization of self-attention instead of recurrent mechanisms, enabling both encoder and decoder to capture long-range dependencies with lower computational complexity. In this work, we propose boosting the self-attention ability with a DFSMN memory block, forming the proposed memory equipped self-attention (SAN-M) mechanism. Theoretical and empirical comparisons have been made to demonstrate the relevancy and complementarity between self-attention and the DFSMN memory block. Furthermore, the proposed SAN-M provides an efficient mechanism to integrate these two modules. We have evaluated our approach on the public AISHELL-1 benchmark and an industrial-level 20,000-hour Mandarin speech recognition task. On both tasks, SAN-M systems achieved much better performance than the self-attention based Transformer baseline system. Specifically, it can achieve a CER of 6.46% on the AISHELL-1 task even without using any external LM, comfortably outperforming other state-of-the-art systems.

Speakers: Ian McLoughlin (ICT Cluster, Singapore Institute of Technology) , Ming Lei (Alibaba Group) , ShiLiang Zhang (Alibaba Group) , Zhifu Gao (Alibaba Group)
• 19:15
Mon-1-1-3 Contextual RNN-T for Open Domain ASR 1h

End-to-end (E2E) systems for automatic speech recognition (ASR), such as RNN Transducer (RNN-T) and Listen-Attend-Spell (LAS), blend the individual components of a traditional hybrid ASR system – acoustic model, language model, pronunciation model – into a single neural network. While this has some nice advantages, it limits the system to be trained using only paired audio and text. Because of this, E2E models tend to have difficulties with correctly recognizing rare words that are not frequently seen during training, such as entity names. In this paper, we propose modifications to the RNN-T model that allow the model to utilize additional metadata text with the objective of improving performance on these named entity words. We evaluate our approach on an in-house dataset sampled from de-identified public social media videos, which represent an open domain ASR task. By using an attention model to leverage the contextual metadata that accompanies a video, we observe a relative improvement of about 16% in Word Error Rate on Named Entities (WER-NE) for videos with related metadata.

Speakers: Florian Metze (Facebook) , Geoffrey Zweig (Facebook) , Gil Keren (Facebook) , Jay Mahadeokar (Facebook) , Yatharth Saraf (Facebook) , Mahaveer Jain (Facebook)
• 19:15
Mon-1-1-4 ASAPP-ASR: Multistream CNN and Self-Attentive SRU for SOTA Speech Recognition 1h

In this paper we present state-of-the-art (SOTA) performance on the LibriSpeech corpus with two novel neural network architectures, a multistream CNN for acoustic modeling and a self-attentive simple recurrent unit (SRU) for language modeling. In the hybrid ASR framework, the multistream CNN acoustic model processes an input of speech frames in multiple parallel pipelines where each stream has a unique dilation rate for diversity. Trained with the SpecAugment data augmentation method, it achieves relative word error rate (WER) improvements of 4% on test-clean and 14% on test-other. We further improve the performance via N-best rescoring using a 24-layer self-attentive SRU language model, achieving WERs of 1.75% on test-clean and 4.46% on test-other.

Speakers: Jeremy Wohlwend (ASAPP) , Jing Pan (ASAPP) , Joshua Shapiro (ASAPP) , Kyu Han (ASAPP) , Tao Lei (ASAPP) , Tao Ma (ASAPP)
• 19:15
Mon-1-1-5 Compressing LSTM Networks with Hierarchical Coarse-Grain Sparsity 1h

The long short-term memory (LSTM) network is one of the most widely used recurrent neural networks (RNNs) for automatic speech recognition (ASR), but has millions of parameters. This makes it prohibitive for memory-constrained hardware accelerators, as the storage demand causes higher dependence on off-chip memory, which becomes a bottleneck for latency and power. In this paper, we propose a new LSTM training technique based on hierarchical coarse-grain sparsity (HCGS), which enforces hierarchical structured sparsity by randomly dropping static block-wise connections between layers. HCGS maintains the same hierarchical structured sparsity throughout training and inference; this can aid acceleration and storage reduction for both training and inference hardware systems. We also jointly optimize in-training low-precision quantization with HCGS-based structured sparsity on 2-/3-layer LSTM networks for TIMIT and TED-LIUM corpora. With 16X structured compression and 6-bit weight precision, we achieved a phoneme error rate (PER) of 16.9% for TIMIT and a word error rate (WER) of 18.9% for TED-LIUM corpora, showing the best trade-off between error rate and LSTM memory compression compared to prior works.
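
The idea of a static, hierarchical block-wise mask can be sketched in a few lines. The version below is a minimal illustration under stated assumptions: two levels, each keeping one block in four per block row (so 4 × 4 = 16X overall), with hypothetical block sizes and function names. It is not the authors' implementation.

```python
import random


def block_grid(n_rows, n_cols, keep_ratio, rng):
    """0/1 grid that keeps n_cols // keep_ratio randomly chosen blocks per block row."""
    keep = n_cols // keep_ratio
    grid = []
    for _ in range(n_rows):
        kept = set(rng.sample(range(n_cols), keep))
        grid.append([1 if c in kept else 0 for c in range(n_cols)])
    return grid


def hierarchical_mask(size, big, small, ratio1, ratio2, seed=0):
    """Two-level block mask: drop large blocks, then small blocks inside the survivors.

    The mask is generated once and then held fixed (static) for training
    and inference.
    """
    rng = random.Random(seed)
    mask = [[0] * size for _ in range(size)]
    level1 = block_grid(size // big, size // big, ratio1, rng)
    for br, big_row in enumerate(level1):
        for bc, big_kept in enumerate(big_row):
            if not big_kept:
                continue
            level2 = block_grid(big // small, big // small, ratio2, rng)
            for sr, small_row in enumerate(level2):
                for sc, small_kept in enumerate(small_row):
                    if not small_kept:
                        continue
                    r0 = br * big + sr * small
                    c0 = bc * big + sc * small
                    for r in range(r0, r0 + small):
                        for c in range(c0, c0 + small):
                            mask[r][c] = 1
    return mask


# Hypothetical sizes: 64x64 weight matrix, 16x16 large blocks, 4x4 small blocks.
mask = hierarchical_mask(size=64, big=16, small=4, ratio1=4, ratio2=4)
density = sum(map(sum, mask)) / (64 * 64)
print(density)  # → 0.0625, i.e. 16X fewer stored weights
```

Because the surviving weights sit in regular blocks, only the block indices and the dense block contents need to be stored, which is what makes this kind of sparsity friendly to hardware.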

Speakers: Chaitali Chakrabarti (Arizona State University) , Deepak Kadetotad (Arizona State University / Starkey Hearing Technologies) , Jae-sun Seo (Arizona State University) , Jian Meng (Arizona State University) , Visar Berisha (Arizona State University)
• 19:15
Mon-1-1-6 BLSTM-Driven Stream Fusion for Automatic Speech Recognition: Novel Methods and a Multi-Size Window Fusion Example 1h

Optimal fusion of streams for ASR is a nontrivial problem. Recently, so-called posterior-in-posterior-out (PIPO-)BLSTMs have been proposed that serve as state sequence enhancers and have highly attractive training properties. In this work, we adopt the PIPO-BLSTMs and employ them in the context of stream fusion for ASR. Our contributions are the following: First, we show the positive effect of a PIPO-BLSTM as state sequence enhancer for various stream fusion approaches. Second, we confirm the advantageous context-free (CF) training property of the PIPO-BLSTM for all investigated fusion approaches. Third, we show with a fusion example of two streams, stemming from different short-time Fourier transform window lengths, that all investigated fusion approaches benefit. Finally, the turbo fusion approach turns out to be best, employing a CF-type PIPO-BLSTM with a novel iterative augmentation in training.

Speakers: Tim Fingscheidt (Technische Universität Braunschweig) , Timo Lohrenz (Technische Universität Braunschweig)
• 19:15
Mon-1-1-7 Relative Positional Encoding for Speech Recognition and Direct Translation 1h

Transformer models are powerful sequence-to-sequence architectures capable of directly mapping speech inputs to transcriptions or translations. However, the mechanism of modeling positions in this model was tailored for text modeling and thus is less ideal for acoustic inputs. In this work, we adapted the relative position encoding scheme to the Speech Transformer, in which the key is to add relative distance between input states to the self-attention network. As a result, the network can adapt better to the large variation of the pattern distribution in speech data. Our experiments showed that the resulting model achieved the best recognition result on the Switchboard benchmark in the non-augmentation condition, and the best published result in the MuST-C speech translation benchmark. We also showed that this model is able to make better use of simulated data than the Transformer, and also adapts better to the segmentation quality in speech translation.
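
The idea of adding relative distance to self-attention can be sketched as follows. This is a deliberately simplified variant that uses one learned scalar bias per clipped relative distance; it is not the formulation used in the paper, and `bias_table`, `max_dist`, and both function names are illustrative assumptions.

```python
import numpy as np


def relative_distances(length, max_dist):
    """Index matrix where entry (i, j) encodes the clipped distance j - i."""
    pos = np.arange(length)
    rel = pos[None, :] - pos[:, None]
    # Shift from [-max_dist, max_dist] to [0, 2 * max_dist] for table lookup.
    return np.clip(rel, -max_dist, max_dist) + max_dist


def attention_with_relative_bias(q, k, v, bias_table, max_dist):
    """Single-head self-attention with a bias that depends only on j - i."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = scores + bias_table[relative_distances(len(q), max_dist)]
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
T, d, max_dist = 5, 4, 2
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
bias_table = rng.standard_normal(2 * max_dist + 1)  # would be learned in practice
out, weights = attention_with_relative_bias(q, k, v, bias_table, max_dist)
```

Since the bias depends on the offset between frames rather than on their absolute index, the same pattern is treated alike wherever it occurs in an utterance.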

Speakers: Alexander Waibel (Carnegie Mellon) , Elizabeth Salesky (Johns Hopkins University) , Jan Niehues (Maastricht University) , Ngoc-Quan Pham (Karlsruhe Institute of Technology) , Sebastian Stüker (Karlsruhe Institute of Technology) , Thai Son Nguyen (Karlsruhe Institute of Technology) , Thanh-Le Ha (Karlsruhe Institute of Technology) , Tuan Nam Nguyen (Karlsruhe Institute of Technology)
• 19:15
Mon-1-1-8 Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers 1h

We propose an end-to-end speaker-attributed automatic speech recognition model that unifies speaker counting, speech recognition, and speaker identification on monaural overlapped speech. Our model is built on serialized output training (SOT) with attention-based encoder-decoder, a recently proposed method for recognizing overlapped speech comprising an arbitrary number of speakers. We extend SOT by introducing a speaker inventory as an auxiliary input to produce speaker labels as well as multi-speaker transcriptions. All model parameters are optimized by speaker-attributed maximum mutual information criterion, which represents a joint probability for overlapped speech recognition and speaker identification. Experiments on LibriSpeech corpus show that our proposed method achieves significantly better speaker-attributed word error rate than the baseline that separately performs overlapped speech recognition and speaker identification.

Speakers: Naoyuki Kanda (Microsoft) , Takuya Yoshioka (Microsoft) , Tianyan Zhou (Microsoft) , Xiaofei Wang (Microsoft) , Yashesh Gaur (Microsoft) , Zhong Meng (Microsoft) , Zhuo Chen (Microsoft)
• 19:15
Mon-1-1-9 Implicit Transfer of Privileged Acoustic Information in a Generalized Knowledge Distillation Framework 1h

This paper proposes a novel generalized knowledge distillation framework, with an implicit transfer of privileged information. In our proposed framework, teacher networks are trained with two input branches on pairs of time-synchronous lossless and lossy acoustic features. While one branch of the teacher network processes a privileged view of the data using lossless features, the second branch models a student view, by processing lossy features corresponding to the same data. During the training step, weights of this teacher network are updated using a composite two-part cross entropy loss. The first part of this loss is computed between the predicted output labels of the lossless data and the actual ground truth. The second part of the loss is computed between the predicted output labels of the lossy data and lossless data. In the next step of generating soft labels, only the student view branch of the teacher is used with lossy data. The benefit of this proposed technique is shown on speech signals with long-term time-frequency bandwidth loss due to recording devices and network conditions. Compared to conventional generalized knowledge distillation with privileged information, the proposed method has a relative improvement of 9.5% on both lossless and lossy test sets.
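
The composite two-part loss can be written down in a few lines. The sketch below assumes frame-level logits from the two teacher branches and a mixing weight `alpha`; the abstract does not say how the two parts are weighted, so `alpha` and the function names are assumptions.

```python
import numpy as np


def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """Mean cross entropy H(target, pred) over frames."""
    return float(-(target_probs * np.log(pred_probs + eps)).sum(axis=-1).mean())


def composite_loss(logits_lossless, logits_lossy, labels, alpha=0.5):
    """Part 1: lossless branch vs. ground truth. Part 2: lossy branch vs. lossless branch."""
    p_lossless = softmax(logits_lossless)
    p_lossy = softmax(logits_lossy)
    one_hot = np.eye(logits_lossless.shape[-1])[labels]
    part1 = cross_entropy(one_hot, p_lossless)
    part2 = cross_entropy(p_lossless, p_lossy)
    return alpha * part1 + (1.0 - alpha) * part2


# Two frames, three classes; the lossless branch is confident and correct.
logits = np.array([[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])
labels = np.array([0, 1])
loss_agree = composite_loss(logits, logits, labels)           # branches agree
loss_disagree = composite_loss(logits, logits[::-1], labels)  # lossy branch disagrees
```

The second term pulls the lossy-feature branch toward the outputs of the lossless branch, which is how the privileged information reaches the student view without being passed to it directly.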

Speakers: Samuel Thomas (IBM Research AI) , Takashi Fukuda (IBM Research)
• 19:15 20:15
Mon-1-10 Speech, Language, and Multimodal Resources room10

### room10

Chairs: Shuai Nie , Qiang Fang

https://zoom.com.cn/j/61218542656

• 19:15
Mon-1-10-1 ATCSpeech: a Multilingual pilot-controller Speech Corpus from Real Air Traffic Control Environment 1h

Automatic Speech Recognition (ASR) techniques have developed greatly in recent years, which expedites many applications in other fields. For ASR research, a speech corpus is always an essential foundation, especially for vertical industries such as Air Traffic Control (ATC). There are some speech corpora for common applications, public or paid. However, for the ATC domain, it is difficult to collect raw speech from real systems due to safety issues. More importantly, annotating the transcription is laborious work for the supervised learning ASR task, which hugely restricts the prospect of ASR application. In this paper, a multilingual speech corpus (ATCSpeech) from real ATC systems, including accented Mandarin Chinese and English speech, is built and released to encourage non-commercial ASR research in the ATC domain. The corpus is introduced in detail from the perspective of data amount, speaker gender and role, speech quality and other attributes. In addition, the performance of our baseline ASR models is also reported. A community edition of our speech database can be applied for and used under a special contract. To the best of our knowledge, this is the first work that aims at building a real and multilingual ASR corpus for ATC-related research.

Speakers: Bing Wang (Southwest Air Traffic Management Bureau, Civil Aviation Administration of China) , Bo Yang (Sichuan University) , Dan Li (Southwest Air Traffic Management Bureau, Civil Aviation Administration of China) , Min Ruan (Southwest Air Traffic Management Bureau, Civil Aviation Administration of China) , Xianlong Tan (Southwest Air Traffic Management Bureau, Civil Aviation Administration of China) , Xiping Wu (Sichuan University) , Yi LIN (Sichuan University) , Zhengmao Chen (Sichuan University) , Zhongping Yang (Wisesoft Co. Ltd.)
• 19:15
Mon-1-10-10 FT Speech: Danish Parliament Speech Corpus 1h

This paper introduces FT Speech, a new speech corpus created from the recorded meetings of the Danish Parliament, otherwise known as the Folketing (FT). The corpus contains over 1,800 hours of transcribed speech by a total of 434 speakers. It is significantly larger in duration, vocabulary, and amount of spontaneous speech than the existing public speech corpora for Danish, which are largely limited to read-aloud and dictation data. We outline design considerations, including the preprocessing methods and the alignment procedure. To evaluate the quality of the corpus, we train automatic speech recognition systems (ASR) on the new resource and compare them to the systems trained on the Danish part of Språkbanken, the largest public ASR corpus for Danish to date. Our baseline results show that we achieve a 14.01% WER on the new corpus. A combination of FT Speech with in-domain language data provides comparable results to models trained specifically on Språkbanken, showing that FT Speech transfers well to this data set. Interestingly, our results demonstrate that the opposite is not the case. This shows that FT Speech provides a valuable resource for promoting research on Danish ASR with more spontaneous speech.

Speakers: Andreas Søeborg Kirkedal (Interactions) , Barbara Plank (IT University of Copenhagen) , Marija Stepanović (IT University of Copenhagen)
• 19:15
Mon-1-10-2 Developing an Open-Source Corpus of Yoruba Speech 1h

This paper introduces an open-source speech dataset for Yoruba - one of the largest low-resource West African languages spoken by at least 22 million people. Yoruba is one of the official languages of Nigeria, Benin and Togo, and is spoken in other neighboring African countries and beyond. The corpus consists of over four hours of 48 kHz recordings from 36 male and female volunteers and the corresponding transcriptions that include disfluency annotation. The transcriptions have full diacritization, which is vital for pronunciation and lexical disambiguation. The annotated speech dataset described in this paper is primarily intended for use in text-to-speech systems, to serve as adaptation data in automatic speech recognition and speech-to-speech translation, and to provide insights in West African corpus linguistics. We demonstrate the use of this corpus in a simple statistical parametric speech synthesis (SPSS) scenario evaluating it against the related languages from the CMU Wilderness dataset and the Yoruba Lagos-NWU corpus.

Speakers: Alexander Gutkin (Google) , Clara Rivera (Google Research) , Isin Demirsahin (Google Research) , Kọ́lá Túbọ̀sún (Chevening Research Fellow at British Library) , Oddur Kjartansson (Google Research)
• 19:15
Mon-1-10-3 ClovaCall: Korean Goal-Oriented Dialog Speech Corpus for Automatic Speech Recognition of Contact Centers 1h

Automatic speech recognition (ASR) via call is essential for various applications, including AI for contact center (AICC) services. Despite the advancement of ASR, however, most publicly available call-based speech corpora such as Switchboard are old-fashioned. Also, most existing call corpora are in English and mainly focus on open-domain dialog or general scenarios such as audiobooks. Here we introduce a new large-scale Korean call-based speech corpus under a goal-oriented dialog scenario from more than 11,000 people, i.e., ClovaCall corpus. ClovaCall includes approximately 60,000 pairs of a short sentence and its corresponding spoken utterance in a restaurant reservation domain. We validate the effectiveness of our dataset with intensive experiments using two standard ASR models. Furthermore, we release our ClovaCall dataset and baseline source codes to be available via https://github.com/ClovaAI/ClovaCall.

Speakers: Chan Kyu Lee (Clova AI, NAVER Corp.) , Eunmi Kim (Clova AI, NAVER Corp.) , Hyeji Kim (Clova AI, NAVER Corp.) , Hyun Ah Kim, Hyunhoon Jung (Clova AI, NAVER Corp.) , Jin Gu Kang (Clova AI, NAVER Corp.) , Jung-Woo Ha (Clova AI, NAVER Corp.) , Kihyun Nam (Hankuk University of Foreign Studies) , Kyoungtae Doh (Clova AI, NAVER Corp.) , Nako Sung (Clova AI, NAVER Corp.) , Sang-Woo Lee (Clova AI, NAVER Corp.) , Sohee Yang (Clova AI, NAVER Corp.) , Soojin Kim (Clova AI, NAVER Corp.) , Sunghun Kim (Clova AI, NAVER Corp.;The Hong Kong University of Science and Technology)
• 19:15
Mon-1-10-4 LAIX Corpus of Chinese Learner English Towards A Benchmark for L2 English ASR 1h

This paper introduces a corpus of Chinese Learner English containing 82 hours of L2 English speech by Chinese learners from all major dialect regions, collected through mobile apps developed by LAIX Inc. The LAIX corpus was created to serve as a benchmark dataset for evaluating Automatic Speech Recognition (ASR) performance on L2 English, the first of this kind as far as we know. The paper describes our effort to build the corpus, including corpus design, data selection and transcription. Multiple rounds of quality check were conducted in the transcription process. Transcription errors were analyzed in terms of error types, rounds of reviewing, and learners' proficiency levels. Word error rates of state-of-the-art ASR systems on the benchmark corpus were also reported.

Speakers: Huan Luan (LAIX) , Hui Lin (LAIX) , Jiahong Yuan (LAIX) , Yanhong Wang (LAIX)
• 19:15
Mon-1-10-5 Design and Development of a Human-Machine Dialog Corpus for the Automated Assessment of Conversational English Proficiency 1h

This paper presents a carefully designed corpus of scored spoken conversations between English language learners and a dialog system to facilitate research and development of both human and machine scoring of dialog interactions. We collected speech, demographic and user experience data from non-native speakers of English who interacted with a virtual boss as part of a workplace pragmatics skill building application. Expert raters then scored the dialogs on a custom rubric encompassing 12 aspects of conversational proficiency as well as an overall holistic performance score. We analyze key corpus statistics and discuss the advantages of such a corpus for both human and machine scoring.

Speaker: Vikram Ramanarayanan (Educational Testing Service R&D)
• 19:15
Mon-1-10-6 CUCHILD: A Large-Scale Cantonese Corpus of Child Speech for Phonology and Articulation Assessment 1h

This paper describes the design and development of CUCHILD, a large-scale Cantonese corpus of child speech. The corpus contains spoken words collected from 1,986 child speakers aged from 3 to 6 years old. The speech materials include 130 words of 1 to 4 syllables in length. The speakers cover both typically developing (TD) children and children with speech disorders. The intended use of the corpus is to support scientific and clinical research, as well as technology development related to child speech assessment. The design of the corpus, including selection of words, participant recruitment, data acquisition process, and data pre-processing, is described in detail. The results of acoustical analysis are presented to illustrate the properties of child speech. Potential applications of the corpus in automatic speech recognition, phonological error detection and speaker diarization are also discussed.

Speakers: Cymie Wing-Yee Ng (The Chinese University of Hong Kong) , Jiarui Wang (The Chinese University of Hong Kong) , Kathy Yuet-Sheung Lee (The Chinese University of Hong Kong) , Michael Chi-Fai Tong (The Chinese University of Hong Kong) , Si-Ioi Ng (The Chinese University of Hong Kong) , Tan Lee (The Chinese University of Hong Kong)
• 19:15
Mon-1-10-7 FinChat: Corpus and evaluation setup for Finnish chat conversations on everyday topics 1h

Creating open-domain chatbots requires large amounts of conversational data and related benchmark tasks to evaluate them. Standardized evaluation tasks are crucial for creating automatic evaluation metrics for model development, as otherwise comparing the models would require resource-expensive human evaluation. While chatbot challenges have recently managed to provide a plethora of such resources for English, resources in other languages are not yet available. In this work, we provide a starting point for Finnish open-domain chatbot research. We describe our collection efforts to create the Finnish chat conversation corpus FinChat, which is made available publicly. FinChat includes unscripted conversations on seven topics from people of different ages. Using this corpus, we also construct a retrieval-based evaluation task for Finnish chatbot development. We observe that off-the-shelf chatbot models trained on conversational corpora do not perform better than by chance at choosing the right answer based on automatic metrics, while humans are able to do the same task almost perfectly. Similarly, in a human evaluation, responses to questions from evaluation set generated by the chatbots are predominantly marked as incoherent. Thus, FinChat provides a challenging evaluation set, meant to encourage chatbot development in Finnish.

Speakers: Juho Leinonen (Aalto University) , Katri Leino (Aalto University) , Mikko Kurimo (Aalto University) , Mittul Singh (Aalto University) , Sami Virpioja (University of Helsinki)
• 19:15
Mon-1-10-8 DiPCo - Dinner Party Corpus 1h

We present a speech data corpus that simulates a "dinner party" scenario taking place in an everyday home environment. The corpus was created by recording multiple groups of four Amazon employee volunteers having a natural conversation in English around a dining table. The participants were recorded by a single-channel close-talk microphone and by five far-field 7-microphone array devices positioned at different locations in the recording room. The dataset contains the audio recordings and human labeled transcripts of a total of 10 sessions with a duration between 15 and 45 minutes. The corpus was created to advance in the field of noise robust and distant speech processing and is intended to serve as a public research and benchmarking data set.

Speakers: Ahmed Zaid (Apple) , Bjorn Hoffmeister (Apple) , Cirenia Huerta (Amazon) , Jan Trmal (Johns Hopkins University) , Ksenia Kutsenko (Amazon) , Maarten Van Segbroeck (Amazon) , Maurizio Omologo (Fondazione Bruno Kessler - irst) , Roland Maas (Amazon.com) , Tinh Nguyen (Amazon) , Xuewen Luo (Amazon)
• 19:15
Mon-1-10-9 Learning to Detect Bipolar Disorder and Borderline Personality Disorder with Language and Speech in Non-Clinical Interviews 1h

Bipolar disorder (BD) and borderline personality disorder (BPD) are both chronic psychiatric disorders. However, their overlapping symptoms and common comorbidity make it challenging for clinicians to distinguish the two conditions on the basis of a clinical interview. In this work, we first present a new multi-modal dataset containing interviews involving individuals with BD or BPD being interviewed about a non-clinical topic. We investigate the automatic detection of the two conditions, and demonstrate a good linear classifier that can be learnt using a down-selected set of features from the different aspects of the interviews and a novel approach of summarising these features. Finally, we find that different sets of features characterise BD and BPD, thus providing insights into the difference between the automatic screening of the two conditions.

Speakers: Alejo J Nevado-Holgado (University of Oxford) , Bo Wang (University of Oxford) , Kate Saunders (University of Oxford) , Maria Liakata (The Alan Turing Institute) , Niall Taylor (University of Oxford) , Terry Lyons (University of Oxford) , Yue Wu (University of Oxford)
• 19:15 20:15
Mon-1-11 Language Recognition room11

### room11

Chairs: Sriram Ganapathy , Dong Wang

https://zoom.com.cn/j/66725122123

• 19:15
Mon-1-11-1 Metric learning loss functions to reduce domain mismatch in the x-vector space for language recognition 1h

State-of-the-art language recognition systems are based on discriminative embeddings called x-vectors. Channel and gender distortions produce mismatch in such x-vector space where embeddings corresponding to the same language are not grouped in a unique cluster. To control this mismatch, we propose to train the x-vector DNN with metric learning objective functions. Combining a classification loss with the metric learning n-pair loss improves the language recognition performance. Such a system achieves a robustness comparable to a system trained with a domain adaptation loss function but without using the domain information. We also analyze the mismatch due to channel and gender, in comparison to language proximity, in the x-vector space. This is achieved using the Maximum Mean Discrepancy divergence measure between groups of x-vectors. Our analysis shows that using the metric learning loss function reduces gender and channel mismatch in the x-vector space, even for languages only observed on one channel in the train set.
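
The Maximum Mean Discrepancy between two groups of embeddings can be estimated as in the sketch below. The RBF kernel, its bandwidth `gamma`, and the random vectors standing in for x-vectors are assumptions; the abstract does not state which kernel or estimator was used.

```python
import numpy as np


def rbf_kernel(x, y, gamma):
    """Pairwise RBF kernel values exp(-gamma * ||x_i - y_j||^2)."""
    sq = (x ** 2).sum(axis=1)[:, None] + (y ** 2).sum(axis=1)[None, :] - 2.0 * x @ y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


def mmd2(x, y, gamma=1.0):
    """Biased estimate of the squared MMD between samples x and y."""
    k_xx = rbf_kernel(x, x, gamma).mean()
    k_yy = rbf_kernel(y, y, gamma).mean()
    k_xy = rbf_kernel(x, y, gamma).mean()
    return float(k_xx + k_yy - 2.0 * k_xy)


# Toy stand-ins for x-vectors of one language recorded on two channels.
rng = np.random.default_rng(0)
dim = 8
channel_a = rng.normal(0.0, 1.0, size=(200, dim))
channel_a_again = rng.normal(0.0, 1.0, size=(200, dim))  # same distribution
channel_b = rng.normal(1.0, 1.0, size=(200, dim))        # shifted distribution

same = mmd2(channel_a, channel_a_again, gamma=1.0 / dim)
shifted = mmd2(channel_a, channel_b, gamma=1.0 / dim)
print(same, shifted)
```

A larger value indicates that the two groups occupy different regions of the embedding space, which is the sense in which the measure quantifies channel or gender mismatch.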

Speakers: Denis Jouvet (LORIA - INRIA) , Irina Illina (LORIA/INRIA) , Raphaël Duroselle (Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy)
• 19:15
Mon-1-11-2 The XMUSPEECH System for AP19-OLR Challenge 1h

In this paper, we present our XMUSPEECH system for the oriental language recognition (OLR) challenge, AP19-OLR. The challenge this year contained three tasks: (1) short-utterance LID, (2) cross-channel LID, and (3) zero-resource LID. We leveraged the system pipeline from three aspects, including front-end training, back-end processing, and fusion strategy. We implemented many encoder networks for Tasks 1 and 3, such as extended x-vector, multi-task learning x-vector with phonetic information, and our previously presented multi-feature integration structure. Furthermore, our previously proposed length expansion method was used in the test set for Task 1. I-vector systems based on different acoustic features were built for the cross-channel task. For all three tasks, the same back-end procedure was used for the sake of stability, but with different settings for each task. Finally, the greedy fusion strategy helped to choose the subsystems to compose the final fusion systems (submitted systems). Cavg values of 0.0263, 0.2813, and 0.1697 from the development set for Tasks 1, 2, and 3 were obtained from our submitted systems, and we ranked 3rd, 3rd, and 1st in the three tasks in this challenge, respectively.

Speakers: Jing Li (Xiamen University) , Lin Li (Xiamen University) , Miao Zhao (Xiamen University) , Qingyang Hong (Xiamen University) , Yiming Zhi (Xiamen University) , Zheng Li (Xiamen University)
• 19:15
Mon-1-11-3 On the Usage of Multi-feature Integration for Speaker Verification and Language Identification 1h

In this paper, we study the technology of multiple acoustic feature integration for the applications of Automatic Speaker Verification (ASV) and Language Identification (LID). In contrast to score level fusion, a common method for integrating subsystems built upon various acoustic features, we explore a new integration strategy, which integrates multiple acoustic features based on the x-vector framework. The frame level, statistics pooling level, segment level, and embedding level integrations are investigated in this study. Our results indicate that frame level integration of multiple acoustic features achieves the best performance in both speaker and language recognition tasks, and the multi-feature integration strategy can be generalized in both classification tasks. Furthermore, we introduce a time-restricted attention mechanism into the frame level integration structure to further improve the performance of multi-feature integration. The experiments are conducted on VoxCeleb 1 for ASV and AP-OLR-17 for LID, and we achieve 28% and 19% relative improvement in terms of Equal Error Rate (EER) in ASV and LID tasks, respectively.

Speakers: Jing Li (Xiamen University) , Lin Li (Xiamen University) , Miao Zhao (Xiamen University) , Qingyang Hong (Xiamen University) , Zheng Li (Xiamen University)
• 19:15
Mon-1-11-4 What does an End-to-End Dialect Identification Model Learn about Non-dialectal Information? 1h

An end-to-end dialect identification system generates the likelihood of each dialect, given a speech utterance. The performance relies on its capabilities to discriminate the acoustic properties between the different dialects, even though the input signal contains non-dialectal information such as speaker and channel. In this work, we study how non-dialectal information is encoded inside the end-to-end dialect identification model. We design several proxy tasks to understand the model's ability to represent speech input for differentiating non-dialectal information -- such as (a) gender and voice identity of speakers, (b) languages, (c) channel (recording and transmission) quality -- and compare with dialectal information (i.e., predicting geographic region of the dialects). By analyzing non-dialectal representations from layers of an end-to-end Arabic dialect identification (ADI) model, we observe that the model retains gender and channel information throughout the network while learning a speaker-invariant representation. Our findings also suggest that the CNN layers of the end-to-end model mirror feature extractors capturing voice-specific information, while the fully-connected layers encode more dialectal information.

Speakers: Ahmed Ali (Qatar Computing Research Institute) , James Glass (Massachusetts Institute of Technology) , Shammur Absar Chowdhury (University of Trento) , Suwon Shon (Massachusetts Institute of Technology)
• 19:15
Mon-1-11-5 Releasing a toolkit and comparing the performance of language embeddings across various spoken language identification datasets 1h

In this paper, we propose a software toolkit for easier end-to-end training of deep learning based spoken language identification models across several speech datasets. We apply our toolkit to implement three baseline models, one speaker recognition model, and three x-vector architecture variations, which are trained on three datasets previously used in spoken language identification experiments. All models are trained separately on each dataset (closed task) and on a combination of all datasets (open task), after which we compare if the open task training yields better language embeddings. We begin by training all models end-to-end as discriminative classifiers of spectral features, labeled by language. Then, we extract language embedding vectors from the trained end-to-end models, train separate Gaussian Naive Bayes classifiers on the vectors, and compare which model provides best language embeddings for the back-end classifier. Our experiments show that the open task condition leads to improved language identification performance on only one of the datasets. In addition, we discovered that increasing x-vector model robustness with random frequency channel dropout significantly reduces its end-to-end classification performance on the test set, while not affecting back-end classification performance of its embeddings. Finally, we note that two baseline models consistently outperformed all other models.

Speakers: Matias Lindgren (Aalto University) , Mikko Kurimo (Aalto University) , Tommi Jauhiainen (University of Helsinki)
• 19:15
Mon-1-11-6 Learning Intonation Pattern Embeddings for Arabic Dialect Identification 1h

This article presents a full end-to-end pipeline for Arabic Dialect Identification (ADI) using intonation patterns and acoustic representations. Recent approaches to language and dialect identification use linguistic aware deep architectures that are able to capture phonetic differences amongst languages and dialects. Specifically, in ADI tasks, different combinations of linguistic features and acoustic representations have been successful with deep learning models. The approach presented in this article uses intonation patterns and hybrid residual and bidirectional LSTM networks to learn acoustic embeddings with no additional linguistic information. Results of the experiments show that intonation patterns for Arabic dialects provide sufficient information to achieve state-of-the-art results on the VarDial 17 ADI dataset, outperforming single-feature systems. The pipeline presented is robust to data sparsity, in contrast to other deep learning approaches that require large quantities of data. We conjecture on the importance of sufficient information as a criterion for optimality in a deep learning ADI task, and more generally, its application to acoustic modeling problems. Small intonation patterns, when sufficient in an information-theoretic sense, allow deep learning architectures to learn more accurate speech representations.

Speakers: Aitor Arronte Alvarez (Center for Language and Technology, University of Hawaii. Technical University of Madrid) , Elsayed Issa (University of Arizona)
• 19:15
Mon-1-11-7 Cross-Domain Adaptation of Spoken Language Identification for Related Languages: The Curious Case of Slavic Languages 1h

State-of-the-art spoken language identification (LID) systems, which are based on end-to-end deep neural networks, have shown remarkable success not only in discriminating between distant languages but also between closely-related languages or even different spoken varieties of the same language. However, it is still unclear to what extent neural LID models generalize to speech samples with different acoustic conditions due to domain shift. In this paper, we present a set of experiments to investigate the impact of domain mismatch on the performance of neural LID systems for a subset of six Slavic languages across two domains (read speech and radio broadcast) and examine two low-level signal descriptors (spectral and cepstral features) for this task. Our experiments show that (1) out-of-domain speech samples severely hinder the performance of neural LID models, and (2) while both spectral and cepstral features show comparable performance within-domain, spectral features show more robustness under domain mismatch. Moreover, we apply unsupervised domain adaptation to minimize the discrepancy between the two domains in our study. We achieve relative accuracy improvements that range from 9% to 77% depending on the diversity of acoustic conditions in the source domain.

Speakers: Badr Abdullah (Saarland University) , Bernd Möbius (Saarland University) , Dietrich Klakow (dietrich.klakow@lsv.uni-saarland.de) , Tania Avgustinova (Saarland University)
• 19:15 20:15
Mon-1-2 Multi-channel speech enhancement room2

### room2

Chairs: Xiaolei Zhang , Ying-Hui Lai

https://zoom.com.cn/j/68442490755

• 19:15
Mon-1-2-1 Deep Neural Network-Based Generalized Sidelobe Canceller for Robust Multi-channel Speech Recognition 1h

The elastic spatial filter (ESF) proposed in recent years is a popular multi-channel speech enhancement front end based on deep neural network (DNN). It is suitable for real-time processing and has shown promising automatic speech recognition (ASR) results. However, the ESF only utilizes the knowledge of fixed beamforming, resulting in limited noise reduction capabilities. In this paper, we propose a DNN-based generalized sidelobe canceller (GSC) that can automatically track the target speaker's direction in real time and use the blocking technique to generate reference noise signals to further reduce noise from the fixed beam pointing to the target direction. The coefficients in the proposed GSC are fully learnable and an ASR criterion is used to optimize the entire network. The 4-channel experiments show that the proposed GSC achieves a relative word error rate improvement of 27.0% compared to the raw observation, 20.6% compared to the oracle direction-based traditional GSC, 10.5% compared to the ESF and 7.9% compared to the oracle mask-based generalized eigenvalue (GEV) beamformer.

Speakers: Guanjun Li (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Longshuai Xiao (NLPR, Institute of Automation, Chinese Academy of Sciences) , Shan Liang (NLPR, Institute of Automation, Chinese Academy of Sciences) , Shuai Nie (NLPR, Institute of Automation, Chinese Academy of Sciences) , Wenju Liu (NLPR, Institute of Automation, Chinese Academy of Sciences) , Zhanlei Yang (Huawei Technologies)
• 19:15
Mon-1-2-10 A Space-and-Speaker-Aware Iterative Mask Estimation Approach to Multi-channel Speech Recognition in the CHiME-6 Challenge 1h

We propose a space-and-speaker-aware iterative mask estimation (SSA-IME) approach to improving complex angular central Gaussian mixture model (cACGMM) based beamforming in an iterative manner by leveraging the complementary information obtained from SSA-based regression. First, a mask calculated from beamformed speech features is proposed to enhance the estimation accuracy of the ideal ratio mask from noisy speech. Second, the outputs of cACGMM-beamformed speech, with the given time annotations as initial values, are used to extract the log-power spectral and inter-phase difference features of different speakers, which serve as inputs to estimate the regression-based SSA model. Finally, in decoding, the mask estimated by the SSA model is also used to iteratively refine the cACGMM-based masks, yielding enhanced multi-array speech. Tested on the recent CHiME-6 Challenge Track 1 tasks, the proposed SSA-IME framework significantly and consistently outperforms state-of-the-art approaches, and achieves the lowest word error rates for both Track 1 speech recognition tasks.

Speakers: Chin-Hui Lee (Georgia Institute of Technology) , Feng Ma (University of Science and Technology of China) , Jia Pan (University of Science and Technology of China) , Jun Du (University of Science and Technology of China) , Lei Sun (University of Science and Technology of China) , Yan-Hui Tu (University of Science and Technology of China)
• 19:15
Mon-1-2-2 Neural Spatio-Temporal Beamformer for Target Speech Separation 1h

Purely neural network (NN) based speech separation and enhancement methods, although they can achieve good objective scores, inevitably cause nonlinear speech distortions that are harmful for automatic speech recognition (ASR). On the other hand, the minimum variance distortionless response (MVDR) beamformer with NN-predicted masks, although it can significantly reduce speech distortions, has limited noise reduction capability. In this paper, we propose a multi-tap MVDR beamformer with complex-valued masks for speech separation and enhancement. Compared to the state-of-the-art NN-mask based MVDR beamformer, the multi-tap MVDR beamformer exploits the inter-frame correlation in addition to the inter-microphone correlation that is already utilized in prior art. Further improvements include the replacement of the real-valued masks with complex-valued masks and the joint training of the complex-mask NN. The evaluation on our multi-modal multi-channel target speech separation and enhancement platform demonstrates that our proposed multi-tap MVDR beamformer improves both the ASR accuracy and the perceptual speech quality over prior art.

Speakers: Chao Weng (Tencent AI lab) , Dong Yu (Tencent AI lab) , Jianming Liu (Tencent AI lab) , Lianwu Chen (Tencent AI lab) , Meng Yu (Tencent AI lab) , Shi-Xiong Zhang (Tencent AI lab) , Yong Xu (Tencent AI lab)
• 19:15
Mon-1-2-3 Online directional speech enhancement using geometrically constrained independent vector analysis 1h

This paper proposes an online dual-microphone system for directional speech enhancement, which employs geometrically constrained independent vector analysis (IVA) based on the auxiliary function approach and vectorwise coordinate descent. Its offline version has recently been proposed and shown to outperform the conventional auxiliary function approach-based IVA (AuxIVA) thanks to the properly designed spatial constraints. We extend the offline algorithm to an online one by incorporating the autoregressive approximation of an auxiliary variable. Experimental evaluations revealed that the proposed online algorithm could work in real time and achieved better speech enhancement performance than online AuxIVA both when a fixed target was disturbed by a spatially stationary interference and when it was disturbed by a dynamic one.

Speakers: Kazuhito Koishida (Microsoft Corporation) , Li Li (University of Tsukuba) , Shoji Makino (University of Tsukuba)
• 19:15
Mon-1-2-4 End-to-End Multi-Look Keyword Spotting 1h

The performance of keyword spotting (KWS), measured in false alarms and false rejects, degrades significantly under far-field and noisy conditions. In this paper, we propose a multi-look neural network model for speech enhancement which simultaneously steers to listen to multiple sampled look directions. The multi-look enhancement is then jointly trained with KWS to form an end-to-end KWS model which integrates the enhanced signals from multiple look directions and leverages an attention mechanism to dynamically tune the model's attention to the reliable sources. We demonstrate, on our large noisy and far-field evaluation sets, that the proposed approach significantly improves the KWS performance over the baseline KWS system and a recent beamformer-based multi-beam KWS system.

Speakers: Bo Wu (Tencent AI Lab) , Dan Su (Tencent AI Lab) , Dong Yu (Tencent AI Lab) , Meng Yu (Tencent AI Lab) , Xuan Ji (Tencent AI Lab)
• 19:15
Mon-1-2-5 Differential Beamforming for Uniform Circular Array with Directional Microphones 1h

Use of omni-directional microphones is commonly assumed in differential beamforming with uniform circular arrays. Conventional differential beamforming with omni-directional elements tends to suffer from low white-noise gain (WNG) at low frequencies and a decrease in directivity factor (DF) at high frequencies. WNG measures the robustness of the beamformer, and DF evaluates the array performance in the presence of reverberation. The major contributions of this paper are as follows. First, we extend the existing work by presenting a new approach that uses directional microphone elements, and show clearly the connection between conventional beamforming and the proposed beamforming. Second, a comparative study shows that the proposed approach brings a noticeable improvement in WNG at low frequencies and some improvement in DF at high frequencies by exploiting an additional degree of freedom in the differential beamforming design. In addition, the beam pattern appears more frequency invariant than that of the conventional method. Third, we study how the proposed beamformer performs as the number of microphone elements and the radius of the array vary.

Speakers: Jinwei Feng (Alibaba group) , Weilong Huang (Alibaba group)
• 19:15
Mon-1-2-6 Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement 1h

This paper investigates different trade-offs between the number of model parameters and enhanced speech quality by employing several deep tensor-to-vector regression models for speech enhancement. We find that a hybrid architecture, namely CNN-TT, can maintain good quality with a reduced model parameter size. CNN-TT is composed of several convolutional layers at the bottom for feature extraction to improve speech quality and a tensor-train (TT) output layer on the top to reduce model parameters. We first derive a new upper bound on the generalization power of convolutional neural network (CNN) based vector-to-vector regression models. Then, we provide experimental evidence on the Edinburgh noisy speech corpus to demonstrate that, in single-channel speech enhancement, CNN outperforms DNN at the expense of a small increase in model size. In addition, CNN-TT slightly outperforms CNN while using only 32% of the CNN model parameters, and further performance improvement can be attained if the number of CNN-TT parameters is increased to 44% of the CNN model size. Finally, our experiments on multi-channel speech enhancement with a simulated noisy WSJ0 corpus demonstrate that our proposed hybrid CNN-TT architecture achieves better results than both DNN and CNN models in terms of enhanced speech quality and parameter size.

Speakers: Chao-Han Huck Yang (Georgia Institute of Technology) , Chin-Hui Lee (Georgia Institute of Technology) , Hu Hu (Georgia Institute of Technology) , Jun Qi (Georgia Institute of Technology) , Sabato Marco Siniscalchi (University of Enna) , Yannan Wang (Tencent Corporation)
• 19:15
Mon-1-2-7 An End-to-end Architecture of Online Multi-channel Speech Separation 1h

Multi-speaker speech recognition has been one of the key challenges in conversation transcription as it breaks the single active speaker assumption employed by most state-of-the-art speech recognition systems. Speech separation is considered a remedy to this problem. Previously, we introduced a system, called unmixing, fixed-beamformer and extraction (UFE), that was shown to be effective in addressing the speech overlap problem in conversation transcription. With UFE, an input mixed signal is processed by fixed beamformers, followed by neural network post-filtering. Although promising results were obtained, the system contains multiple individually developed modules, potentially leading to sub-optimal performance. In this work, we introduce an end-to-end modeling version of UFE. To enable gradient propagation all the way, an attentional selection module is proposed, where an attentional weight is learnt for each beamformer and spatial feature sampled over space. Experimental results show that the proposed system achieves comparable performance in an offline evaluation with the original separate processing-based pipeline, while producing remarkable improvements in an online evaluation.

Speakers: Ed Lin (Microsoft, STCA) , Jian Wu (Northwestern Polytechnical University) , Jinyu Li (Microsoft, One Microsoft Way, Redmond, WA, USA) , Lei Xie (School of Computer Science, Northwestern Polytechnical University) , Takuya Yoshioka (Microsoft, One Microsoft Way, Redmond, WA, USA) , Yi Luo (Microsoft, One Microsoft Way, Redmond, WA, USA) , Zhili Tan (Microsoft, STCA, Beijing) , Zhuo Chen (Microsoft, One Microsoft Way, Redmond, WA, USA)
• 19:15
Mon-1-2-8 Mentoring-Reverse Mentoring for Unsupervised Multi-channel Speech Source Separation 1h

Mentoring-reverse mentoring, a novel knowledge transfer framework for unsupervised learning, is introduced in multi-channel speech source separation. This framework aims to improve two different systems, referred to as a senior and a junior system, by having them mentor each other. The senior system, which is composed of a neural separator and a statistical blind source separation (BSS) model, generates a pseudo-target signal. The junior system, which is composed of a neural separator and a post-filter, is constructed using teacher-student learning with the pseudo-target signal generated by the senior system, i.e., by imitating the output of the senior system (mentoring step). Then, the senior system can be improved by propagating the shared neural separator of the grown-up junior system to the senior system (reverse mentoring step). Since the improved neural separator can give better initial parameters for the statistical BSS model, the senior system can yield more accurate pseudo-target signals, leading to iterative improvement of the pseudo-target signal generator and the neural separator. Experimental comparisons conducted under the condition where mixture-clean parallel data are not available demonstrated that the proposed mentoring-reverse mentoring framework yielded improvements in speech source separation over the existing unsupervised source separation methods.

Speakers: Masahito Togami (Line Corporation) , Tetsuji Ogawa (Waseda University) , Tetsunori Kobayashi (Waseda University) , Yu Nakagome (Waseda Univ.)
• 19:15
Mon-1-2-9 Computationally efficient and versatile framework for joint optimization of blind speech separation and dereverberation 1h

This paper proposes new blind signal processing techniques for optimizing a multi-input multi-output (MIMO) convolutional beamformer (CBF) in a computationally efficient way to perform dereverberation and source separation simultaneously. For effective optimization of a CBF, a conventional technique factorizes it into a multiple-target weighted prediction error (WPE) based dereverberation filter and a separation matrix. However, this technique requires calculation of a huge matrix that represents spatio-temporal covariances over different sources, which makes the computational cost very high. To realize computationally efficient optimization, this paper introduces two techniques: one decomposing the huge covariance matrix into ones for individual sources, and the other decomposing the CBF into ones for estimating individual sources. It is shown that both techniques effectively reduce the size of the covariance matrices to be calculated substantively, and allow us to greatly reduce the computational cost without loss of optimality.

Speakers: Hiroshi Sawada (NTT Corporation) , Keisuke Kinoshita (NTT) , Rintaro Ikeshita (NTT Corporation) , Shoko Araki (NTT Communication Science Laboratories) , Tomohiro Nakatani (NTT Corporation)
• 19:15 20:15
Mon-1-3 Speech processing in the brain room3

### room3

Chairs: Haifeng Li, Hans Rutger Bosker

https://zoom.com.cn/j/61951480857

• 19:15
Mon-1-3-1 Identifying Causal Relationships Between Behavior and Local Brain Activity During Natural Conversation 1h

Precisely characterizing the neurophysiological activity involved in natural conversations remains a major challenge. In this paper, we explore the relationship between multimodal conversational behavior and brain activity during natural conversations. This is challenging due to the time resolution of fMRI and to the diversity of the recorded multimodal signals. We use a unique corpus including localized brain activity and behavior recorded during a functional magnetic resonance imaging (fMRI) experiment in which several participants had natural conversations alternately with a human and a conversational robot. The corpus includes fMRI responses as well as conversational signals that consist of synchronized raw audio and their transcripts, video and eye-tracking recordings. The proposed approach includes a first step to extract discrete neurophysiological time series from functionally well-defined brain areas, as well as behavioral time series describing specific behaviors. Then, machine learning models are applied to predict the neurophysiological time series based on the extracted behavioral features. The results show promising prediction scores, and specific causal relationships are found between behaviors and the activity in functional brain areas for both conditions, i.e., human-human and human-robot conversations.

Speakers: Laurent Prévot (Aix Marseille Université & CNRS) , Magalie Ochs (LIS) , Thierry Chaminade (INT, Aix Marseille Université) , Youssef Hmamouche (Aix Marseille University)
• 19:15
Mon-1-3-2 Neural Entrainment to Natural Speech Envelope Based on Subject Aligned EEG Signals 1h

Reconstruction of the speech envelope from neural signals is a general way to study neural entrainment, which helps to understand the neural mechanism underlying speech processing. Previous neural entrainment studies were mainly based on single-trial neural activities, and the reconstruction accuracy of the speech envelope is not high enough, probably due to interference from diverse noises such as breath and heartbeat. Considering that such noises emerge independently in the consistent neural processing of subjects responding to the same speech stimulus, we propose a method to align and average the electroencephalography (EEG) signals of the subjects for the same stimuli to reduce the noise in the neural signals. The Pearson correlation of the reconstructed speech envelopes with the original ones showed a great improvement compared to the single-trial based method. Our study improved the correlation coefficient in the delta band from around 0.25 to 0.5, where 0.25 was obtained in previous leading single-trial studies. The speech tracking phenomenon occurred not only in the commonly reported delta and theta bands, but also in the gamma band of EEG. Moreover, the reconstruction accuracy for regular speech was higher than that for time-reversed speech, suggesting that neural entrainment to the natural speech envelope reflects speech semantics.
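
The noise-reduction argument in this abstract (averaging responses from several subjects to the same stimulus before correlating with the envelope) can be illustrated with a toy simulation. This is a minimal sketch and not the authors' pipeline: it assumes the responses are already time-aligned, models each subject's response as the envelope plus independent Gaussian noise, and skips the actual envelope reconstruction step. The names `pearson` and `simulate`, and all parameter values, are hypothetical.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient between two 1-D signals."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

def simulate(n_subjects=20, n_samples=2000, noise_std=3.0, seed=0):
    """Toy data: one shared envelope plus independent noise per subject."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_samples) / 100.0  # assumed 100 Hz sampling rate
    # Stand-in for a slowly varying speech envelope (assumption).
    envelope = np.sin(2 * np.pi * 2.0 * t) + 0.5 * np.sin(2 * np.pi * 0.7 * t)
    noise = rng.normal(0.0, noise_std, size=(n_subjects, n_samples))
    return envelope, envelope + noise

envelope, responses = simulate()

# Mean correlation when each subject's response is scored on its own.
single = float(np.mean([pearson(r, envelope) for r in responses]))

# Correlation after averaging the aligned responses across subjects.
averaged = pearson(responses.mean(axis=0), envelope)

print(f"single-response r = {single:.2f}, subject-averaged r = {averaged:.2f}")
```

Because the simulated noise is independent across subjects, averaging shrinks its variance roughly in proportion to the number of subjects, which raises the correlation with the shared envelope.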

Speakers: Di Zhou (Japan Advanced Institute of Science and Technology) , Gaoyan Zhang (Tianjin University) , Jianwu Dang (JAIST) , Shuang Wu (Tianjin University) , Zhuo Zhang (Tianjin University)
• 19:15
Mon-1-3-3 Does Lexical Retrieval Deteriorate in Patients with Mild Cognitive Impairment? Analysis of Brain Functional Network Will Tell 1h

Alterations in speech and language are typical signs of mild cognitive impairment (MCI), considered to be the prodromal stage of Alzheimer’s disease (AD). Yet, very few studies have pointed out at what stage speech production is disrupted in these patients. To bridge this knowledge gap, the present study focused on lexical retrieval, a specific process during speech production, and investigated how it is affected in cognitively impaired patients using state-of-the-art analysis of the brain functional network. 17 patients with MCI and 20 age-matched controls were invited to complete a primed picture naming task, in which the prime was either semantically related or unrelated to the target. Using electroencephalography (EEG) signals collected during task performance, event-related potentials (ERPs) were analyzed, together with the construction of the brain functional network. Results showed that whereas MCI patients did not exhibit significant differences in reaction time and ERP responses, their brain functional network was altered, in association with a significant main effect in accuracy. The observation of increased cluster coefficients and characteristic path length indicated deteriorations in global information processing, which provided evidence that deficits in lexical retrieval might have occurred even at the preclinical stage of AD.

Speakers: Chongyuan Lian (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences) , Feiqi Zhu (Shenzhen Luohu People’s Hospital) , Lan Wang (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences) , Manwa Lawrence Ng (The University of Hong Kong) , Mingxiao Gu (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences) , Nan Yan (Shenzhen Institutes of Advanced Technology) , Tianqi Wang (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences)
• 19:15
Mon-1-3-4 Congruent Audiovisual Speech Enhances Cortical Envelope Tracking during Auditory Selective Attention 1h

Listeners usually have the ability to selectively attend to the target speech while ignoring competing sounds. Top-down attentional modulation of the cortical envelope tracking of speech has been proposed to account for this ability. Additional visual input, such as lipreading, has been considered beneficial for speech perception, especially in noise. However, the effect of audiovisual (AV) congruency on the dynamic properties of cortical envelope tracking activities has not been discussed explicitly, and the involvement of cortical regions in processing AV speech has been unclear. To address these issues, electroencephalography (EEG) was recorded while participants attended to one talker from a mixture under several AV conditions (audio-only, congruent and incongruent). Temporal response function (TRF) and inter-trial phase coherence (ITPC) analyses were used to index the cortical envelope tracking for each condition. Compared with the audio-only condition, both indices were enhanced only for the congruent AV condition, and the enhancement was prominent over both the auditory and visual cortex. In addition, the timings of different cortical regions involved in cortical envelope tracking activities were subject to stimulus modality. The present work provides new insight into the neural mechanisms of auditory selective attention when visual input is available.

Speakers: Jing Chen (Peking University) , Zhen Fu (Peking University)
• 19:15
Mon-1-3-5 Contribution of RMS-level-based speech segments to target speech decoding under noisy conditions 1h

Human listeners can recognize target speech streams in complex auditory scenes. The cortical activities can robustly track the amplitude fluctuations of target speech with auditory attentional modulation under a range of signal-to-masker ratios (SMRs). The root-mean-square (RMS) level of the speech signal is a crucial acoustic cue for target speech perception. However, in most studies, the neural-tracking activities were analyzed with the intact speech temporal envelopes, ignoring the characteristic decoding features in different RMS-level-specific speech segments. This study aimed to explore the contributions of high- and middle-RMS-level segments to target speech decoding in noisy conditions based on electroencephalogram (EEG) signals. The target stimulus was mixed with a competing speaker at five SMRs (i.e., 6, 3, 0, -3, and -6 dB), and then the temporal response function (TRF) was used to analyze the relationship between neural responses and high/middle-RMS-level segments. Experimental results showed that target and ignored speech streams had significantly different TRF responses under conditions with the high- or middle-RMS-level segments. Besides, the high- and middle-RMS-level segments elicited different TRF responses in morphological distributions. These results suggested that distinct models could be used in different RMS-level-specific speech segments to better decode target speech with corresponding EEG signals.

Speakers: Ed X. Wu (The University of Hong Kong) , Fei Chen (Southern University of Science and Technology) , Lei Wang (Southern University of Science and Technology)
• 19:15
Mon-1-3-6 Cortical Oscillatory Hierarchy for Natural Sentence Processing 1h

Human speech processing, either for listening or oral reading, requires dynamic cortical activities that are not only driven by sensory stimuli externally but also influenced by semantic knowledge and speech planning goals internally. Each of these functions has been known to accompany specific rhythmic oscillations and be localized in distributed networks. The question is how the brain organizes these spatially and spectrally distinct functional networks in such a temporal precision that endows us with incredible speech abilities. For clarification, this study conducted an oral reading task with natural sentences and collected simultaneously the involved brain waves, eye movements, and speech signals with high-density EEG and eye movement equipment. By examining the regional oscillatory spectral perturbation and modeling the frequency-specific interregional connections, our results revealed a hierarchical oscillatory mechanism, in which gamma oscillation entrains with the fine-structured sensory input while beta oscillation modulated the sensory output. Alpha oscillation mediated between sensory perception and cognitive function via selective suppression. Theta oscillation synchronized local networks for large-scale coordination. Differing from a single function-frequency correspondence, the coexistence of multi-frequency oscillations was found to be critical for local regions to communicate remotely and diversely in a larger network.

Speakers: Bin Zhao (Tianjin University) , Gaoyan Zhang (Tianjin University) , Jianwu Dang (JAIST) , Masashi Unoki (JAIST)
• 19:15
Mon-1-3-7 Comparing EEG analyses with different epoch alignments in an auditory lexical decision experiment 1h

In processing behavioral data from auditory lexical decision, reaction times (RT) can be defined relative to stimulus onset or relative to stimulus offset. Using stimulus onset as the reference invokes models that assume that relevant processing starts immediately, while stimulus offset invokes models that assume that relevant processing can only start when the acoustic input is complete. It is suggested that EEG recordings can be used to tease apart putative processes. EEG analysis requires some kind of time-locking of epochs, so that averaging of multiple signals does not mix up effects of different processes. However, in many lexical decision experiments the duration of the speech stimuli varies substantially. Consequently, processes tied to stimulus offset are not appropriately aligned and might get lost in the averaging process. In this paper we investigate whether the time course of putative processes such as phonetic encoding, lexical access and decision making can be derived from ERPs and from instantaneous power representations in several frequency bands when epochs are time-locked at stimulus onset or stimulus offset. In addition, we investigate whether time-locking at the moment when the response is given can shed light on the decision process per se.

Speakers: Kimberley Mulder (Center for Language Studies, Radboud University, Nijmegen) , Lou Boves (Centre for Language and Speech Technology, Radboud University Nijmegen) , Louis ten Bosch (Radboud University Nijmegen)
• 19:15
Mon-1-3-8 Detection of Subclinical Mild Traumatic Brain Injury (mTBI) Through Speech and Gait 1h

Between 15% and 40% of mild traumatic brain injury (mTBI) patients experience incomplete recoveries or provide subjective reports of decreased motor abilities, despite a clinically determined complete recovery. This demonstrates a need for objective measures capable of detecting subclinical residual mTBI, particularly in return-to-duty decisions for warfighters and return-to-play decisions for athletes. In this paper, we utilize features from recordings of directed speech and gait tasks completed by ten healthy controls and eleven subjects with lingering subclinical impairments from an mTBI. We hypothesize that decreased coordination and precision during fine motor movements governing speech production (articulation, phonation, and respiration), as well as during gross motor movements governing gait, can be effective indicators of subclinical mTBI. Decreases in coordination are measured from correlations of vocal acoustic feature time series and torso acceleration time series. We apply eigenspectra derived from these correlations to machine learning models to discriminate between the two subject groups. The fusion of correlation features derived from acoustic and gait time series achieves an AUC of 0.98. This highlights the potential of using the combination of vocal acoustic features from speech tasks and torso acceleration during a simple gait task as a rapid screening tool for subclinical mTBI.
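
The idea of deriving an eigenspectrum from correlations of feature time series can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the features, delays, or matrix construction, so the time-delay embedding, the delay values, and the helper names `delay_embed` and `correlation_eigenspectrum` are hypothetical choices, and the input is synthetic noise rather than acoustic or accelerometer data.

```python
import numpy as np

def delay_embed(x, delays):
    """Stack delayed copies of every channel; x has shape (channels, samples)."""
    max_d = max(delays)
    n = x.shape[1] - max_d
    rows = [x[c, d:d + n] for c in range(x.shape[0]) for d in delays]
    return np.vstack(rows)

def correlation_eigenspectrum(x, delays=(0, 1, 3, 7)):
    """Eigenvalues (descending) of the channel-delay correlation matrix."""
    emb = delay_embed(x, delays)
    corr = np.corrcoef(emb)             # symmetric correlation matrix
    eigvals = np.linalg.eigvalsh(corr)  # returned in ascending order
    return eigvals[::-1]

rng = np.random.default_rng(0)

# Three channels that share a common component (tightly coordinated).
shared = rng.normal(size=500)
coupled = np.vstack([shared + 0.3 * rng.normal(size=500) for _ in range(3)])

# Three channels with no shared component (uncoordinated).
independent = rng.normal(size=(3, 500))

spec_coupled = correlation_eigenspectrum(coupled)
spec_independent = correlation_eigenspectrum(independent)

print("largest eigenvalue, coupled:    ", round(float(spec_coupled[0]), 2))
print("largest eigenvalue, independent:", round(float(spec_independent[0]), 2))
```

In this toy setting the coupled channels concentrate more of the total variance in the leading eigenvalues, so the shape of the eigenspectrum carries information about how strongly the signals co-vary. A vector like `spec_coupled` could then serve as a feature vector for a classifier.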

Speakers: Adam Lammert (Worcester Polytechnic Institute) , Anne O'Brien (Spaulding Rehabilitation Hospital) , Daniel Hannon (MIT Lincoln Laboratory) , Douglas Sturim (MIT) , Gloria Vergara-Diaz (Spaulding Rehabilitation Hospital) , Gregory Ciccarelli (MIT Lincoln Laboratory) , Hrishikesh Rao (MIT Lincoln Laboratory) , James Williamson (MIT Lincoln Laboratory) , Jeffrey Palmer (MIT Lincoln Laboratory) , Paolo Bonato (Spaulding Rehabilitation Hospital) , Richard DeLaura (MIT Lincoln Laboratory) , Ross Zafonte (Spaulding Rehabilitation Hospital) , Sophia Yuditskaya (MIT Lincoln Laboratory) , Tanya Talkar (Harvard University) , Thomas Quatieri (MIT Lincoln Laboratory)
• 19:15 20:15
Mon-1-4 Speech Signal Representation room4

### room4

Chairs: Ken-Ichi Sakakibara , Reinhold Haeb-Umbach

https://zoom.com.cn/j/69279928709

• 19:15
Mon-1-4-1 Towards Learning a Universal Non-Semantic Representation of Speech 1h

The ultimate goal of transfer learning is to reduce labeled data requirements by exploiting a preexisting embedding model trained for different datasets or tasks. The visual and language communities have established benchmarks to compare embeddings, but the speech community has yet to do so. This paper proposes a benchmark for comparing speech representations on non-semantic tasks, and proposes a representation based on an unsupervised triplet-loss objective. The proposed representation outperforms other representations on the benchmark, and even exceeds state-of-the-art performance on a number of transfer learning tasks. The embedding is trained on a publicly available dataset, and it is tested on a variety of low-resource downstream tasks, including personalization tasks and the medical domain. The benchmark, models, and evaluation code are publicly released.

Speakers: Aren Jansen (Google) , Dotan Emanuel (Google) , Félix de Chaumont Quitry (Google) , Ira Shavitt (Google) , Joel Shor (Google) , Marco Tagliasacchi (Google) , Omry Tuval (Google) , Oran Lang (Google) , Ronnie Maor (Google) , Yinnon Haviv (Google)
• 19:15
Mon-1-4-10 Harmonic Lowering for Accelerating Harmonic Convolution for Audio Signals 1h

Convolutional neural networks have been successfully applied to a variety of audio signal processing tasks including sound source separation, speech recognition and acoustic scene understanding. Since many pitched sounds have a harmonic structure, an operation called harmonic convolution has been proposed to take advantage of the structure appearing in the audio signals. However, the computational cost involved is higher than that of normal convolution. This paper proposes a faster calculation method for harmonic convolution called Harmonic Lowering. The method unrolls the input data to a redundant layout so that the normal convolution operation can be applied. The analysis of the runtimes and the number of multiplication operations shows that the proposed method accelerates harmonic convolution by a factor of 2 to 7 compared with the conventional method under realistic parameter settings, while no approximation is introduced.

Speakers: Hiroshi Saruwatari (The University of Tokyo) , Hirotoshi Takeuchi (University of Tokyo) , Kunio Kashino (NTT Corporation) , Yasunori Ohishi (NTT Corporation)
• 19:15
Mon-1-4-2 Poetic Meter Classification Using i-vector-MTF Fusion 1h

In this paper, a deep neural network (DNN)-based poetic meter classification scheme is proposed using a fusion of musical texture features (MTF) and i-vectors. The experiment is performed in two phases. Initially, the mel-frequency cepstral coefficient (MFCC) features are fused with MTF and classification is done using a DNN. MTF include timbral, rhythmic, and melodic features. Later, in the second phase, the MTF are fused with i-vectors and classification is performed. The performance is evaluated using a newly created poetic corpus in Malayalam, one of the prominent languages in India. While the MFCC-MTF/DNN system reports an overall accuracy of 80.83%, the i-vector/MTF fusion reports an overall accuracy of 86.66%. The performance is also compared with a baseline support vector machine (SVM)-based classifier. The results show that the architectural choice of i-vector fusion with MTF on a DNN has merit in recognizing meters from recited poems.

Speakers: Aiswarya Vinod (College of Engineering, Trivandrum) , Ben P. Babu (RIT Kottayam) , Rajeev Rajan (College of Engineering, Trivandrum)
• 19:15
Mon-1-4-3 Formant Tracking Using Dilated Convolutional Networks Through Dense Connection with Gating Mechanism 1h

Formant tracking is one of the most fundamental problems in speech processing. Traditionally, formants are estimated using signal processing methods. Recent studies showed that generic convolutional architectures can outperform recurrent networks on temporal tasks such as speech synthesis and machine translation. In this paper, we explored the use of the Temporal Convolutional Network (TCN) for formant tracking. In addition to the conventional implementation, we modified the architecture in three respects. First, we turned off the “causal” mode of dilated convolution, so that the dilated convolution sees future speech frames. Second, each hidden layer reused the output information from all previous layers through dense connections. Third, we adopted a gating mechanism to alleviate the vanishing gradient problem by selectively forgetting unimportant information. The model was validated on the open-access formant database VTR. Experiments showed that our model converged easily and achieved an overall mean absolute percent error (MAPE) of 8.2% on speech-labeled frames, compared to three competitive baselines at 9.4% (LSTM), 9.1% (Bi-LSTM) and 8.9% (TCN).
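
For readers unfamiliar with the metric reported above, the mean absolute percent error can be sketched as below. This is a minimal illustration: the abstract does not state how errors are aggregated across formants and frames, so the plain average over all reference values is an assumption, and the function name `mape` and the example frequencies are hypothetical.

```python
import numpy as np

def mape(reference_hz, estimate_hz):
    """Mean absolute percent error between reference and estimated values.

    Assumes every reference value is strictly positive (true for formant
    frequencies in Hz) and that both inputs have the same shape.
    """
    reference = np.asarray(reference_hz, dtype=float)
    estimate = np.asarray(estimate_hz, dtype=float)
    return float(100.0 * np.mean(np.abs(estimate - reference) / reference))

# Hypothetical reference and estimated formants (Hz) for a few frames.
reference = [500.0, 1500.0, 2500.0]
estimate = [550.0, 1350.0, 2500.0]
error_percent = mape(reference, estimate)
print(f"MAPE = {error_percent:.2f}%")
```

Because each error is divided by its reference value, a 50 Hz miss on a 500 Hz formant counts the same as a 150 Hz miss on a 1500 Hz formant.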

Speakers: Binghuai Lin (MIG, Tencent Science and Technology Ltd., Beijing) , Dengfeng Ke (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Jinsong Zhang (Beijing Language and Culture University) , Wang Dai (Beijing Language and Culture University) , Wei Wei (Beijing Language and Culture University) , Yanlu Xie (Beijing Language and Culture University) , Yingming Gao (Institute of Acoustics and Speech Communication, Technische Universität Dresden)
• 19:15
Mon-1-4-4 Automatic Analysis of Speech Prosody in Dutch 1h

In this paper we present a publicly available tool for automatic analysis of speech prosody (AASP) in Dutch. Incorporating the state-of-the-art analytical frameworks, AASP enables users to analyze prosody at two levels from different theoretical perspectives. Holistically, by means of the Functional Principal Component Analysis (FPCA) it generates mathematical functions that capture changes in the shape of a pitch contour. The tool outputs the weights of principal components in a table for users to process in further statistical analysis. Structurally, AASP analyzes prosody in terms of prosodic events within the auto-segmental metrical framework, hypothesizing prosodic labels in accordance with Transcription of Dutch Intonation (ToDI) with accuracy comparable to similar tools for other languages. Published as a Docker container, the tool can be set up on various operating systems in only two steps. Moreover, the tool is accessed through a graphic user interface, making it accessible to users with limited programming skills.

Speakers: Aoju Chen (Utrecht University) , Berit Janssen (Utrecht University) , Carlos Gussenhoven (Radboud University) , Judith Hanssen (Avans University of Applied Sciences) , Na Hu (Utrecht University)
• 19:15
Mon-1-4-5 Learning Voice Representation Using Knowledge Distillation For Automatic Voice Casting 1h

The search for professional voice-actors for audiovisual productions is a sensitive task, performed by the artistic directors (ADs). The ADs have a strong appetite for new talents/voices but cannot perform large scale auditions. Automatic tools able to suggest the most suited voices are of great interest to the audiovisual industry.

In previous works, we showed the existence of acoustic information that makes it possible to mimic the ADs' choices. However, the only available information is the ADs' choices from the already dubbed multimedia productions. In this paper, we propose a representation-learning based strategy to build a character/role representation, called p-vector. In addition, the large variability between audiovisual productions makes it difficult to have homogeneous training datasets. We overcome this difficulty by using knowledge distillation methods to take advantage of external datasets.

Experiments are conducted on video-game voice excerpts. Results show a significant improvement using the p-vector, compared to the speaker-based $x$-vector representation.

Speakers: Adrien Gresse (LIA - Avignon University) , Jean-Francois Bonastre (Avignon University, LIA) , Mathias Quillot (LIA - Avignon University) , Richard Dufour (LIA - Avignon University)
• 19:15
Mon-1-4-6 Enhancing formant information in spectrographic display of speech 1h

Formants are resonances of the time varying vocal tract system, and their characteristics are reflected in the response of the system to a sequence of impulse-like excitations originating at the glottis. This paper presents a method to enhance the formant information in the spectrogram display of the speech signal, especially for high pitched voices. It is well known that in the narrowband spectrogram, the presence of pitch harmonics masks the formant information, whereas in the wideband spectrogram, the formant regions are smeared. Using single frequency filtering (SFF) analysis, we show that the wideband equivalent SFF spectrogram can be modified to enhance the formant information in the display by improving the frequency resolution. For this, we obtain two SFF spectrograms by using single frequency filtering of the speech signal at two closely spaced roots on the real axis in the z-plane. The ratio or difference of the two SFF spectrograms is processed to enhance the formant information in the spectrographic display. This will help in tracking rapidly changing formants and in resolving closely spaced formants. The effect is more pronounced in the case of high-pitched voices, such as female and children's speech.

Speakers: Anand Medabalimi (IIIT Hyderabad) , Bayya Yegnanarayana (International Institute of Information Technology at Hyderabad) , Vishala Pannala (International Institute of Information Technology Hyderabad)
• 19:15
Mon-1-4-7 Unsupervised Methods for Evaluating Speech Representations 1h

Disentanglement is a desired property in representation learning and a significant body of research has tried to show that it is a useful representational prior. Evaluating disentanglement is challenging, particularly for real world data like speech, where ground truth generative factors are typically not available. Previous work on disentangled representation learning in speech has used categorical supervision like phoneme or speaker identity in order to disentangle grouped feature spaces. However, this work differs from the typical dimension-wise view of disentanglement in other domains. This paper proposes to use low-level acoustic features to provide the structure required to evaluate dimension-wise disentanglement. By choosing well-studied acoustic features, grounded and descriptive evaluation is made possible for unsupervised representation learning. This work produces a toolkit for evaluating disentanglement in unsupervised representations of speech and evaluates its efficacy on previous research.

Speakers: James Glass (Massachusetts Institute of Technology) , Michael Gump (MIT) , Wei-Ning Hsu (Massachusetts Institute of Technology)
• 19:15
Mon-1-4-8 Robust pitch regression with voiced/unvoiced classification in nonstationary noise environments 1h

Accurate voiced/unvoiced information is crucial in estimating the pitch of a target speech signal in severe nonstationary noise environments. Nevertheless, state-of-the-art pitch estimators based on deep neural networks (DNN) lack a dedicated mechanism for robustly detecting voiced and unvoiced segments in the target speech in noisy conditions. In this work, we proposed an end-to-end deep learning-based pitch estimation framework which jointly detects voiced/unvoiced segments and predicts pitch values for the voiced regions of the ground-truth speech. We empirically showed that our proposed framework is significantly more robust than state-of-the-art DNN based pitch detectors in nonstationary noise settings. Our results suggest that joint training of voiced/unvoiced detection and voiced pitch prediction can significantly improve pitch estimation performance.

Speakers: Dung Tran (Microsoft) , Kazuhito Koishida (Microsoft) , Uros Batricevic (Microsoft)
• 19:15
Mon-1-4-9 Nonlinear ISA with Auxiliary Variables for Learning Speech Representations 1h

This paper extends recent work on nonlinear Independent Component Analysis (ICA) by introducing a theoretical framework for nonlinear Independent Subspace Analysis (ISA) in the presence of auxiliary variables. Observed high dimensional acoustic features like log Mel spectrograms can be considered as surface level manifestations of nonlinear transformations over individual multivariate sources of information like speaker characteristics, phonological content etc. Under assumptions of energy based models we use the theory of nonlinear ISA to propose an algorithm that learns unsupervised speech representations whose subspaces are independent and potentially highly correlated with the original non-stationary multivariate sources. We show how nonlinear ICA with auxiliary variables can be extended to a generic identifiable model for subspaces as well while also providing sufficient conditions for the identifiability of these high dimensional subspaces. Our proposed methodology is generic and can be integrated with standard unsupervised approaches to learn speech representations with subspaces that can theoretically capture independent higher order speech signals. We evaluate the gains of our algorithm when integrated with the Autoregressive Predictive Coding (APC) model by showing empirical results on the speaker verification and phoneme recognition tasks.

Speakers: Alan W Black (Carnegie Mellon University) , Amrith Setlur (CMU) , Barnabas Poczos (Carnegie Mellon University)
• 19:15 20:15
Mon-1-5 Speech Synthesis: Neural Waveform Generation I room5

### room5

Chairs: Sunayana Sitaram, Paavo Alku

https://zoom.com.cn/j/67438690809

• 19:15
Mon-1-5-1 Knowledge-and-Data-Driven Amplitude Spectrum Prediction for Hierarchical Neural Vocoders 1h

In our previous work, we have proposed a neural vocoder called HiNet which recovers speech waveforms by predicting amplitude and phase spectra hierarchically from input acoustic features. In HiNet, the amplitude spectrum predictor (ASP) predicts log amplitude spectra (LAS) from input acoustic features. This paper proposes a novel knowledge-and-data-driven ASP (KDD-ASP) to improve the conventional one. First, acoustic features (i.e., F0 and mel-cepstra) pass through a knowledge-driven LAS recovery module to obtain approximate LAS (ALAS). This module is designed based on the combination of STFT and source-filter theory, in which the source part and the filter part are designed based on input F0 and mel-cepstra, respectively. Then, the recovered ALAS are processed by a data-driven LAS refinement module which consists of multiple trainable convolutional layers to get the final LAS. Experimental results show that the HiNet vocoder using KDD-ASP can achieve higher quality of synthetic speech than that using conventional ASP and the WaveRNN vocoder on a text-to-speech (TTS) task.

Speakers: Yang Ai (University of Science and Technology of China) , Zhenhua Ling (University of Science and Technology of China)
• 19:15
Mon-1-5-10 Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions 1h

Recent advancements in deep learning led to human-level performance in single-speaker speech synthesis. However, there are still limitations in terms of speech quality when generalizing those systems into multiple-speaker models, especially for unseen speakers and unseen recording qualities. For instance, conventional neural vocoders are adjusted to the training speaker and have poor generalization capabilities to unseen speakers. In this work, we propose a variant of WaveRNN, referred to as speaker conditional WaveRNN (SC-WaveRNN). We aim to develop an efficient universal vocoder even for unseen speakers and recording conditions. In contrast to standard WaveRNN, SC-WaveRNN exploits additional information given in the form of speaker embeddings. Using publicly-available data for training, SC-WaveRNN achieves significantly better performance over baseline WaveRNN on both subjective and objective metrics. In MOS, SC-WaveRNN achieves an improvement of about 23% for seen speaker and seen recording condition and up to 95% for unseen speaker and unseen condition. Finally, we extend our work by implementing a multi-speaker text-to-speech (TTS) synthesis similar to zero-shot speaker adaptation.
In terms of performance, our system has been preferred over the baseline TTS system by 60% over 15.5% and by 60.9% over 32.6%, for seen and unseen speakers, respectively.

Speakers: Dipjyoti Paul (Computer Science Department, University of Crete, Greece) , Yannis Pantazis (Institute of Applied and Computational Mathematics, FORTH) , Yannis Stylianou (Univ of Crete)
• 19:15
Mon-1-5-11 Neural Homomorphic Vocoder 1h

In this paper, we propose the neural homomorphic vocoder (NHV), a source-filter model based neural vocoder framework. NHV synthesizes speech by filtering impulse trains and noise with linear time-varying (LTV) filters. A neural network controls the LTV filters by estimating complex cepstrums of time-varying impulse responses given acoustic features. The proposed framework can be trained with a combination of multi-resolution STFT loss and adversarial loss functions. Due to the use of DSP-based synthesis methods, NHV is highly efficient, fully controllable and interpretable. A vocoder was built under the framework to synthesize speech given log-Mel spectrograms and fundamental frequencies. While the model cost only 15 kFLOPs per sample, the synthesis quality remained comparable to baseline neural vocoders in both copy-synthesis and text-to-speech.

Speakers: Kai Yu (Shanghai Jiao Tong University) , Kuan Chen (Shanghai Jiao Tong University) , Zhijun Liu (Shanghai Jiao Tong University)
• 19:15
Mon-1-5-2 FeatherWave: An efficient high-fidelity neural vocoder with multi-band linear prediction 1h

In this paper, we propose the FeatherWave, yet another variant of WaveRNN vocoder combining the multi-band signal processing and the linear predictive coding.
The LPCNet, a recently proposed neural vocoder which utilized the linear predictive characteristic of speech signal in the WaveRNN architecture, can generate high quality speech with a speed faster than real-time on a single CPU core.
However, LPCNet is still not efficient enough for online speech generation tasks.
To address this issue, we adopt the multi-band linear predictive coding for WaveRNN vocoder.
The multi-band method enables the model to generate several speech samples in parallel at one step.
Therefore, it can significantly improve the efficiency of speech synthesis.
The proposed model with 4 sub-bands needs less than 1.6 GFLOPS for speech generation.
In our experiments, it can generate 24 kHz high-fidelity audio 9x faster than real-time on a single CPU, which is much faster than the LPCNet vocoder.
Furthermore, our subjective listening test shows that the FeatherWave can generate speech with better quality than LPCNet.

Speakers: Heng Lu (Tencent) , Ling-Hui Chen (Tencent) , Qiao Tian (Tencent) , Shan Liu (Tencent) , Zewang Zhang (Tencent)
• 19:15
Mon-1-5-3 VocGAN: A High-Fidelity Real-time Vocoder with a Hierarchically-nested Adversarial Network 1h

We present a novel high-fidelity real-time neural vocoder called VocGAN. A recently developed GAN-based vocoder, MelGAN, produces speech waveforms in real-time. However, it often produces a waveform that is insufficient in quality or inconsistent with acoustic characteristics of the input mel spectrogram. VocGAN is nearly as fast as MelGAN, but it significantly improves the quality and consistency of the output waveform. VocGAN applies a multi-scale waveform generator and a hierarchically-nested discriminator to learn multiple levels of acoustic properties in a balanced way. It also applies the joint conditional and unconditional objective, which has shown successful results in high-resolution image synthesis. In experiments, VocGAN synthesizes speech waveforms 416.7x faster on a GTX 1080Ti GPU and 3.24x faster on a CPU than real-time. Compared with MelGAN, it also exhibits significantly improved quality in multiple evaluation metrics including mean opinion score (MOS) with minimal additional overhead. Additionally, compared with Parallel WaveGAN, another recently developed high-fidelity vocoder, VocGAN is 6.98x faster on a CPU and exhibits higher MOS.

Speakers: HOON-YOUNG CHO (NCSOFT, AI Center, Speech Lab) , Injung Kim (Handong Global University) , Jinhyeok Yang (NCSOFT) , Junmo Lee (NCSOFT) , Young-Ik Kim (Researcher)
• 19:15
Mon-1-5-4 Lightweight LPCNet-based Neural Vocoder with Tensor Decomposition 1h

This paper proposes a lightweight neural vocoder based on LPCNet. The recently proposed LPCNet exploits linear predictive coding to represent vocal tract characteristics, and can rapidly synthesize high-quality waveforms with fewer parameters than WaveRNN. For even greater speeds, it is necessary to reduce the time-heavy two GRUs and the DualFC. Although the original work only pruned the first GRU weight, there is room for improvements in the other GRU and DualFC. Accordingly, we use tensor decomposition to reduce these remaining parameters by more than 80%. For the proposed method we demonstrate that 1) it is 1.26 times faster on a CPU, and 2) it matched naturalness of the original LPCNet for acoustic features extracted from natural speech and for those predicted by TTS.

Speakers: Hiroki Kanagawa (NTT Corporation) , Yusuke Ijima (NTT corporation)
• 19:15
Mon-1-5-5 WG-WaveNet: Real-Time High-Fidelity Speech Synthesis without GPU 1h

In this paper, we propose WG-WaveNet, a fast, lightweight, and high-quality waveform generation model. WG-WaveNet is composed of a compact flow-based model and a post-filter. The two components are jointly trained by maximizing the likelihood of the training data and optimizing loss functions on the frequency domains. As we design a flow-based model that is heavily compressed, the proposed model requires much less computational resources compared to other waveform generation models during both training and inference time; even though the model is highly compressed, the post-filter maintains the quality of generated waveform. Our PyTorch implementation can be trained using less than 8 GB GPU memory and generates audio samples at a rate of more than 960 kHz on an NVIDIA 1080Ti GPU. Furthermore, even if synthesizing on a CPU, we show that the proposed method is capable of generating 44.1 kHz speech waveform 1.2 times faster than real-time. Experiments also show that the quality of generated audio is comparable to those of other methods. Audio samples are publicly available online.

Speakers: Hung-yi Lee (National Taiwan University (NTU)) , Po-chun Hsu (College of Electrical Engineering and Computer Science, National Taiwan University)
• 19:15
Mon-1-5-6 What the future brings: investigating the impact of lookahead for incremental neural TTS 1h

In incremental text to speech synthesis (iTTS), the synthesizer produces an audio output before it has access to the entire input sentence. In this paper, we study the behavior of a neural sequence-to-sequence TTS system when used in an incremental mode, i.e. when generating speech output for token n, the system has access to n + k tokens from the text sequence. We first analyze the impact of this incremental policy on the evolution of the encoder representations of token n for different values of k (the lookahead parameter). The results show that, on average, tokens travel 88% of the way to their full context representation with a one-word lookahead and 94% after 2 words. We then investigate which text features are the most influential on the evolution towards the final representation using a random forest analysis. The results show that the most salient factors are related to token length. We finally evaluate the effects of lookahead k at the decoder level, using a MUSHRA listening test. This test shows results that contrast with the above high figures: speech synthesis quality obtained with 2 word-lookahead is significantly lower than the one obtained with the full sentence.
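
The incremental policy described above can be illustrated with a small sketch of which tokens are visible at each step. This is only an assumed reading of the abstract (token n is 1-indexed and the synthesizer sees the first n + k tokens); the function name and the example sentence are hypothetical, and no TTS model is involved.

```python
def visible_prefix(tokens, n, k):
    """Return the tokens available when synthesizing token n (1-indexed) with lookahead k."""
    if n < 1 or n > len(tokens) or k < 0:
        raise ValueError("n must index a token and k must be non-negative")
    # The prefix cannot extend past the end of the sentence.
    return tokens[: min(n + k, len(tokens))]


sentence = "the cat sat on the mat".split()
context_k1 = visible_prefix(sentence, n=2, k=1)               # one-word lookahead
context_full = visible_prefix(sentence, n=2, k=len(sentence))  # full-sentence context
print(context_k1)
print(context_full)
```

With k at least as large as the number of remaining tokens, the prefix equals the full sentence, which corresponds to the non-incremental reference condition.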

Speakers: Brooke Stephenson (Université Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, 38000 Grenoble and LIG, UGA, G-INP, CNRS, INRIA, Grenoble, France) , Laurent Besacier (LIG) , Laurent Girin (GIPSA-lab / University of Grenoble) , Thomas Hueber (CNRS / GIPSA-lab)
• 19:15
Mon-1-5-7 Fast and lightweight on-device TTS with Tacotron2 and LPCNet 1h

We present a fast and lightweight on-device text-to-speech system based on state-of-the-art methods of feature and speech generation, i.e. Tacotron2 and LPCNet. We show that modification of the basic pipeline combined with hardware-specific optimizations and extensive usage of parallelization enables running a TTS service even on low-end devices with faster than real-time waveform generation. Moreover, the system preserves high quality of speech without noticeable degradation of Mean Opinion Score compared to the non-optimized baseline. While the system is mostly oriented toward low-to-mid range hardware, we believe that it can also be used in any CPU-based environment.

Speakers: Denis Parkhomenko (Huawei Technologies Co. Ltd.) , Mikhail Kudinov (Huawei Technologies Co. Ltd.) , Sergey Repyevsky (Huawei Technologies Co. Ltd.) , Stanislav Kamenev (Huawei Technologies Co. Ltd.) , Tasnima Sadekova (Huawei Technologies Co. Ltd.) , Vadim Popov (Huawei Technologies Co. Ltd.) , Vitalii Bushaev (Huawei Technologies Co. Ltd.) , Vladimir Kryzhanovskiy (Huawei Technologies Co. Ltd.)
• 19:15
Mon-1-5-8 Efficient WaveGlow: An Improved WaveGlow Vocoder with Enhanced Speed 1h

Neural vocoders, such as WaveGlow, have become an important component in recent high-quality text-to-speech (TTS) systems. In this paper, we propose Efficient WaveGlow (EWG), a flow-based generative model serving as an efficient neural vocoder. Similar to WaveGlow, EWG has a normalizing flow backbone where each flow step consists of an affine coupling layer and an invertible 1x1 convolution. To reduce the number of model parameters and enhance the speed without sacrificing the quality of the synthesized speech, EWG improves WaveGlow in three aspects. First, the WaveNet-style transform network in WaveGlow is replaced with an FFTNet-style dilated convolution network. Next, to reduce the computation cost, group convolution is applied to both audio and local condition features. Finally, the local condition is shared among the transform network layers in each coupling layer. As a result, EWG can reduce the number of floating-point operations (FLOPs) required to generate one-second audio and the number of model parameters each by a factor of more than 12. Experimental results show that EWG can reduce real-world inference time cost by a factor of more than two, without any obvious reduction in the speech quality.
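
The affine coupling layer mentioned above is invertible by construction, which is what lets a flow-based vocoder train by likelihood and sample by running the flow backwards. The sketch below shows a generic affine coupling step on plain Python lists; `toy_transform` is a hypothetical stand-in for the transform network (the WaveNet-style or FFTNet-style network in the abstract), and conditioning on acoustic features is omitted.

```python
import math


def toy_transform(x_a):
    """Hypothetical transform network: maps one half to (log_scale, shift) for the other half."""
    log_scale = [0.5 * math.tanh(v) for v in x_a]
    shift = [2.0 * v for v in x_a]
    return log_scale, shift


def coupling_forward(x):
    """Apply the coupling step; expects an even-length list."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_scale, shift = toy_transform(x_a)
    # The first half passes through unchanged; the second half is scaled and shifted.
    y_b = [b * math.exp(s) + t for b, s, t in zip(x_b, log_scale, shift)]
    return x_a + y_b


def coupling_inverse(y):
    """Undo coupling_forward exactly, without inverting the transform network."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_scale, shift = toy_transform(y_a)
    x_b = [(b - t) * math.exp(-s) for b, s, t in zip(y_b, log_scale, shift)]
    return y_a + x_b


x = [0.3, -1.2, 0.7, 2.0]
y = coupling_forward(x)
x_rec = coupling_inverse(y)
print(y)
print(x_rec)
```

Since the unchanged half is all the inverse needs to recompute the scale and shift, the transform network itself can be arbitrarily complex, so swapping in a cheaper network does not affect invertibility.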

Speakers: Bowen Zhou (JD AI Research) , Chao Zhang (University of Cambridge) , Guanghui Xu (JD AI Research) , Wei Song (JD AI Research) , Xiaodong He (JD AI Research) , Zhengchen Zhang (JD.com)
• 19:15
Mon-1-5-9 Can Auditory Nerve models tell us what’s different about WaveNet vocoded speech? 1h

Nowadays, synthetic speech is almost indistinguishable from human speech.
The remarkable quality is mainly due to the displacing of signal processing based vocoders in favour of neural vocoders and, in particular, the WaveNet architecture.
At the same time, speech synthesis evaluation is still facing difficulties in adjusting to these improvements.
These difficulties are even more prevalent in the case of objective evaluation methodologies which do not correlate well with human perception.
Yet, an often forgotten use of objective evaluation is to uncover prominent differences between speech signals.
Such differences are crucial to decipher the improvement introduced by the use of WaveNet.
Therefore, abandoning objective evaluation could be a serious mistake.
In this paper, we analyze vocoded synthetic speech re-rendered using WaveNet, comparing it to standard vocoded speech.
To do so, we objectively compare spectrograms and neurograms, the latter being the output of Auditory Nerve (AN) models.
The spectrograms allow us to look at the speech production side, and the neurograms relate to the speech perception path.
While we were not yet able to pinpoint how WaveNet and WORLD differ, our results suggest that the Mean Rate (MR) neurograms in particular warrant further investigation.

Speakers: Naomi Harte (Trinity College Dublin) , Sébastien Le Maguer (Adapt Centre / Trinity College Dublin)
• 19:15 20:15
Mon-1-7 Speaker Diarization room7

### room7

Chairs: Hagai Aronowitz, Yu Wang

https://zoom.com.cn/j/69983075794

• 19:15
Mon-1-7-1 End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors 1h

End-to-end speaker diarization for an unknown number of speakers is addressed in this paper. Recently proposed end-to-end speaker diarization outperformed conventional clustering-based speaker diarization, but it has one drawback: it is less flexible in terms of the number of speakers. This paper proposes a method for encoder-decoder based attractor calculation (EDA), which first generates a flexible number of attractors from a speech embedding sequence. Then, the generated multiple attractors are multiplied by the speech embedding sequence to produce the same number of speaker activities. The speech embedding sequence is extracted using the conventional self-attentive end-to-end neural speaker diarization (SA-EEND) network. In a two-speaker condition, our method achieved a 2.69 % diarization error rate (DER) on simulated mixtures and an 8.07 % DER on the two-speaker subset of CALLHOME, while vanilla SA-EEND attained 4.56 % and 9.54 %, respectively. With an unknown number of speakers, our method attained a 15.29 % DER on CALLHOME, while the x-vector-based clustering method achieved a 19.43 % DER.
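
The step in which attractors are multiplied by the embedding sequence can be sketched as a matrix of dot products, one per frame and attractor. The example below assumes a sigmoid is applied to turn each product into an activity probability; the attractors are fixed by hand here rather than produced by an encoder-decoder, and all names and values are illustrative.

```python
import math


def speaker_activities(embeddings, attractors):
    """Return a T x S matrix of per-frame, per-speaker activity probabilities."""
    def sigmoid(value):
        return 1.0 / (1.0 + math.exp(-value))

    # One dot product per (frame, attractor) pair, squashed to (0, 1).
    return [
        [sigmoid(sum(e * a for e, a in zip(frame, attractor))) for attractor in attractors]
        for frame in embeddings
    ]


# Three hypothetical 2-dimensional frame embeddings and two hand-set attractors.
frames = [[2.0, -2.0], [-2.0, 2.0], [2.0, 2.0]]
attractors = [[1.0, 0.0], [0.0, 1.0]]
activities = speaker_activities(frames, attractors)
for row in activities:
    print([round(p, 3) for p in row])
```

Each column is thresholded independently, so a frame such as the third one can be assigned to both speakers at once, which is how this formulation represents overlapping speech. The number of columns simply follows the number of attractors.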

Speakers: Kenji Nagamatsu (Hitachi, Ltd.) , Shinji Watanabe (Johns Hopkins University) , Shota Horiguchi (Hitachi, Ltd.) , Yawen Xue (Hitachi, Ltd.) , Yusuke Fujita (Hitachi, Ltd.)
• 19:15
Mon-1-7-2 Target-Speaker Voice Activity Detection: a Novel Approach for Multi-Speaker Diarization in a Dinner Party Scenario 1h

Speaker diarization for real-life scenarios is an extremely challenging problem. Widely used clustering-based diarization approaches perform rather poorly in such conditions, mainly due to the limited ability to handle overlapping speech. We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which directly predicts an activity of each speaker on each time frame. TS-VAD model takes conventional speech features (e.g., MFCC) along with i-vectors for each speaker as inputs. A set of binary classification output layers produces activities of each speaker. I-vectors can be estimated iteratively, starting with a strong clustering-based diarization.

We also extend the TS-VAD approach to the multi-microphone case using a simple attention mechanism on top of hidden representations extracted from the single-channel TS-VAD model. Moreover, post-processing strategies for the predicted speaker activity probabilities are investigated. Experiments on the CHiME-6 unsegmented data show that TS-VAD achieves state-of-the-art results, outperforming the baseline x-vector-based system by more than 30% absolute Diarization Error Rate (DER).

Speakers: Aleksandr Laptev (ITMO University) , Aleksei Romanenko (ITMO University) , Andrei Andrusenko (ITMO University) , Anton Mitrofanov (STC-innovations Ltd) , Ivan Medennikov (STC-innovations Ltd) , Ivan Podluzhny (STC-innovations Ltd) , Ivan Sorokin (STC) , Mariya Korenevskaya (STC-innovations Ltd) , Maxim Korenevsky (Speech Technology Center) , Tatiana Prisyach (STC-innovations Ltd) , Tatiana Timofeeva (STC-innovations Ltd) , Yuri Khokhlov (STC-innovations Ltd)
• 19:15
Mon-1-7-4 New advances in speaker diarization 1h

Recently, speaker diarization based on speaker embeddings has shown excellent results in many works. In this paper we propose several enhancements throughout the diarization pipeline. This work addresses two clustering frameworks: agglomerative hierarchical clustering (AHC) and spectral clustering (SC).
First, we use multiple speaker embeddings. We show that fusion of x-vectors and d-vectors boosts accuracy significantly.
Second, we train neural networks to leverage both acoustic and duration information for scoring similarity of segments or clusters. Third, we introduce a novel method to guide the AHC clustering mechanism using a neural network. Fourth, we handle short duration segments in SC by deemphasizing their effect on setting the number of speakers.
Finally, we propose a novel method for estimating the number of clusters in the SC framework. The method takes each eigenvalue and analyzes the projections of the SC similarity matrix on the corresponding eigenvector.
We evaluated our system on NIST SRE 2000 CALLHOME and, using cross-validation, we achieved an error rate of 5.1%, going beyond state-of-the-art speaker diarization.

Speakers: Gakuto Kurata (IBM Research) , Hagai Aronowitz (IBM Research - Haifa) , Masayuki Suzuki (IBM Research) , Ron Hoory (IBM Haifa Research Lab) , Weizhong Zhu (IBM T.J. Watson Research Center)
• 19:15
Mon-1-7-5 Self-Attentive Similarity Measurement Strategies in Speaker Diarization 1h

Speaker diarization can be described as the process of extracting sequential speaker embeddings from an audio stream and clustering them according to speaker identities. Nowadays, deep neural network based approaches like x-vector have been widely adopted for speaker embedding extraction. However, in the clustering back-end, probabilistic linear discriminant analysis (PLDA) is still the dominant algorithm for similarity measurement. PLDA works in a pair-wise and independent manner, which may ignore the positional correlation of adjacent speaker embeddings. To address this issue, our previous work proposed the long short-term memory (LSTM) based scoring model, followed by the spectral clustering algorithm. In this paper, we further propose two enhanced methods based on the self-attention mechanism, which no longer focuses on the local correlation but searches for similar speaker embeddings in the whole sequence. The first approach achieves state-of-the-art performance on the DIHARD II Eval Set (18.44% DER after resegmentation), while the second one operates with higher efficiency.

Speakers: Ming Li (Duke Kunshan University) , Qingjian Lin (SEIT, Sun Yat-sen University) , Yu Hou (Duke Kunshan University)
• 19:15
Mon-1-7-6 Speaker attribution with voice profiles by graph-based semi-supervised learning 1h

Speaker attribution is required in many real-world applications, such as meeting transcription, where speaker identity is assigned to each utterance according to speaker voice profiles. In this paper, we propose to solve the speaker attribution problem by using graph-based semi-supervised learning methods. A graph of speech segments is built for each session, on which segments from voice profiles are represented by labeled nodes while segments from test utterances are unlabeled nodes. The weight of edges between nodes is evaluated by the similarities between the pretrained speaker embeddings of speech segments. Speaker attribution then becomes a semi-supervised learning problem on graphs, on which two graph-based methods are applied: label propagation (LP) and graph neural networks (GNNs). The proposed approaches are able to utilize the structural information of the graph to improve speaker attribution performance. Experimental results on real meeting data show that the graph based approaches reduce speaker attribution error by up to 68% compared to a baseline speaker identification approach that processes each utterance independently.
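
The label propagation (LP) variant described above can be sketched on a tiny graph. This is a generic clamped label propagation in plain Python, not the authors' implementation: the edge weights stand in for embedding similarities and are made up, and the function name and iteration count are assumptions.

```python
def label_propagation(weights, seed_labels, num_classes, iterations=50):
    """Propagate speaker labels over a weighted graph; seed_labels maps node -> class."""
    n = len(weights)
    # Labeled (voice profile) nodes start one-hot; unlabeled nodes start at zero.
    scores = [[0.0] * num_classes for _ in range(n)]
    for node, cls in seed_labels.items():
        scores[node][cls] = 1.0
    for _ in range(iterations):
        updated = []
        for i in range(n):
            degree = sum(weights[i]) or 1.0
            # Each node takes the similarity-weighted average of its neighbours' scores.
            updated.append([
                sum(weights[i][j] * scores[j][c] for j in range(n)) / degree
                for c in range(num_classes)
            ])
        # Clamp profile nodes back to their known speaker after every step.
        for node, cls in seed_labels.items():
            updated[node] = [1.0 if c == cls else 0.0 for c in range(num_classes)]
        scores = updated
    return [max(range(num_classes), key=lambda c: row[c]) for row in scores]


# Nodes 0 and 1 are voice profiles for speakers 0 and 1; nodes 2 and 3 are test utterances.
similarity = [
    [0.0, 0.1, 0.9, 0.2],
    [0.1, 0.0, 0.2, 0.8],
    [0.9, 0.2, 0.0, 0.1],
    [0.2, 0.8, 0.1, 0.0],
]
labels = label_propagation(similarity, seed_labels={0: 0, 1: 1}, num_classes=2)
print(labels)
```

Unlike scoring each utterance against the profiles in isolation, the update also passes information along edges between test utterances, which is the structural information the abstract refers to.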

Speakers: Frank Rudzicz (University of Toronto) , Jian Wu (Microsoft) , Jixuan Wang (University of Toronto) , Michael Brudno (University of Toronto) , Ranjani Ramamurthy (Microsoft) , Xiong Xiao (Microsoft)
• 19:15
Mon-1-7-7 Deep Self-Supervised Hierarchical Clustering for Speaker Diarization 1h

The state-of-the-art speaker diarization systems use agglomerative hierarchical clustering (AHC) which performs the clustering of previously learned neural embeddings. While the clustering approach attempts to identify speaker clusters, the AHC algorithm does not involve any further learning. In this paper, we propose a novel algorithm for hierarchical clustering which combines the speaker clustering along with a representation learning framework. The proposed approach is based on principles of self-supervised learning where the self-supervision is derived from the clustering algorithm. The representation learning network is trained with a regularized triplet loss using the clustering solution at the current step while the clustering algorithm uses the deep embeddings from the representation learning step. By combining the self-supervision based representation learning along with the clustering algorithm, we show that the proposed algorithm improves significantly (29% relative improvement) over the AHC algorithm with cosine similarity for a speaker diarization task on CALLHOME dataset. In addition, the proposed approach also improves over the state-of-the-art system with PLDA affinity matrix with 10% relative improvement in DER.

Speakers: Prachi Singh (Indian Institute of Science, Bangalore) , Sriram Ganapathy (Indian Institute of Science, Bangalore, India, 560012)
• 19:15
Mon-1-7-8 Spot the conversation: speaker diarisation in the wild 1h

The goal of this paper is speaker diarisation of videos collected 'in the wild'. We make three key contributions. First, we propose an automatic audio-visual diarisation method for YouTube videos. Our method consists of active speaker detection using audio-visual methods and speaker verification using self-enrolled speaker models. Second, we integrate our method into a semi-automatic dataset creation pipeline which significantly reduces the number of hours required to annotate videos with diarisation labels. Finally, we use this pipeline to create a large-scale diarisation dataset called VoxConverse, collected from 'in the wild' videos, which we will release publicly to the research community. Our dataset consists of overlapping speech, a large and diverse speaker pool, and challenging background conditions.

Speakers: Andrew Zisserman (University of Oxford) , Arsha Nagrani (University of Oxford) , Jaesung Huh (Naver Corporation) , Joon Son Chung (University of Oxford) , Triantafyllos Afouras (University of Oxford)
• 19:15 20:15
Mon-1-8 Noise robust and distant speech recognition room8

### room8

Chairs: Yanmin Qian , Ozlem Kalinli (Apple)

https://zoom.com.cn/j/63352125526

• 19:15
Mon-1-8-1 Learning Contextual Language Embeddings for Monaural Multi-talker Speech Recognition 1h

End-to-end multi-speaker speech recognition has been a popular topic in recent years, as more and more research focuses on speech processing in more realistic scenarios. Inspired by the human hearing mechanism, which enables us to concentrate on the speaker of interest in multi-speaker mixed speech by utilizing both audio and context knowledge, this paper explores contextual information to improve multi-talker speech recognition. In the proposed architecture, a novel embedding learning model is designed to accurately extract the contextual embedding directly from the multi-talker mixed speech. Two advanced training strategies are then proposed to further improve the new model. Experimental results show that our proposed method achieves a very large improvement on multi-speaker speech recognition, with ∼25% relative WER reduction against the baseline end-to-end multi-talker ASR model.

Speakers: Wangyou Zhang (Shanghai Jiao Tong University) , Yanmin Qian (Shanghai Jiao Tong University)
• 19:15
Mon-1-8-10 Simulating realistically-spatialised simultaneous speech using video-driven speaker detection and the CHiME-5 dataset 1h

Simulated data plays a crucial role in the development and evaluation of novel distant microphone ASR techniques. However, the commonly used simulated datasets adopt uninformed and potentially unrealistic speaker location distributions. We wish to generate more realistic simulations driven by recorded human behaviour. By using devices with a paired microphone array and camera, we analyse unscripted dinner party scenarios (CHiME-5) to estimate the distribution of speaker separation in a realistic setting. We deploy face-detection, and pose-detection techniques on 114 cameras to automatically locate speakers in 20 dinner party sessions. Our analysis found that on average, the separation between speakers was only 17 degrees. We use this analysis to create datasets with realistic distributions and compare it with commonly used datasets of simulated signals. By changing the position of speakers, we show that the word error rate can increase by over 73.5% relative when using a strong speech enhancement and ASR system.

Speakers: Jack Deadman (University of Sheffield) , Jon Barker (University of Sheffield)
• 19:15
Mon-1-8-2 Double Adversarial Network based Monaural Speech Enhancement for Robust Speech Recognition 1h

To improve the noise robustness of automatic speech recognition (ASR), the generative adversarial network (GAN) based enhancement methods are employed as the front-end processing, which comprise a single adversarial process of an enhancement model and a discriminator. In this single adversarial process, the discriminator is encouraged to find differences between the enhanced and clean speeches, but the distribution of clean speeches is ignored. In this paper, we propose a double adversarial network (DAN) by adding another adversarial generation process (AGP), which forces the discriminator not only to find the differences but also to model the distribution. Furthermore, a functional mean square error (f-MSE) is proposed to utilize the representations learned by the discriminator. Experimental results reveal that AGP and f-MSE are crucial for the enhancement performance on ASR task, which are missed in previous GAN-based methods. Specifically, our DAN achieves 13.00% relative word error rate improvements over the noisy speeches on the test set of CHiME-2, which outperforms several recent GAN-based enhancement methods significantly.

Speakers: Jiqing Han (Harbin Institute of Technology) , Xueliang Zhang (Inner Mongolia University) , Zhihao Du (Harbin Institute of Technology)
• 19:15
Mon-1-8-3 Anti-aliasing regularization in stacking layers 1h

Shift-invariance is a desirable property of many machine learning models. It means that delaying the input of a model in time should only result in delaying its prediction in time. A model that is shift-invariant, also eliminates undesirable side effects like frequency aliasing. When building sequence models, not only should the shift-invariance property be preserved when sampling input features, it must also be respected inside the model itself. Here, we study the impact of the commonly used stacking layer in LSTM-based ASR models and show that aliasing is likely to occur. Experimentally, by adding merely 7 parameters to an existing speech recognition model that has 120 million parameters, we are able to reduce the impact of aliasing. This acts as a regularizer that discards frequencies the model shouldn't be relying on for predictions. Our results show that under conditions unseen at training, we are able to reduce the relative word error rate by up to 5%.

Speakers: Ananya Misra (Google) , Antoine Bruguier (Google) , Arun Narayanan (Google Inc.) , Rohit Prabhavalkar (Google)
• 19:15
Mon-1-8-4 Towards a Competitive End-to-End Speech Recognition for CHiME-6 Dinner Party Transcription 1h

While end-to-end ASR systems have proven competitive with the conventional hybrid approach, they are prone to accuracy degradation when it comes to noisy and low-resource conditions. In this paper, we argue that, even in such difficult cases, some end-to-end approaches show performance close to the hybrid baseline. To demonstrate this, we use the CHiME-6 Challenge data as an example of challenging environments and noisy conditions of everyday speech. We experimentally compare and analyze CTC-Attention versus RNN-Transducer approaches along with RNN versus Transformer architectures. We also provide a comparison of acoustic features and speech enhancements. Besides, we evaluate the effectiveness of neural network language models for hypothesis re-scoring in low-resource conditions. Our best end-to-end model based on RNN-Transducer, together with improved beam search, is only 3.8% absolute WER worse than the LF-MMI TDNN-F CHiME-6 Challenge baseline. With the Guided Source Separation based training data augmentation, this approach outperforms the hybrid baseline system by 2.7% absolute WER and the best previously known end-to-end system by 25.7% absolute WER.

Speakers: Aleksandr Laptev (ITMO University) , Andrei Andrusenko (ITMO University) , Ivan Medennikov (STC-innovations Ltd)
• 19:15
Mon-1-8-5 End-to-End Far-Field Speech Recognition with Unified Dereverberation and Beamforming 1h

Despite successful applications of end-to-end approaches in multi-channel speech recognition, the performance still degrades severely when the speech is corrupted by reverberation. In this paper, we integrate the dereverberation module into the end-to-end multi-channel speech recognition system and explore two different frontend architectures. First, a multi-source mask-based weighted prediction error (WPE) module is incorporated in the frontend for dereverberation. Second, another novel frontend architecture is proposed, which extends the weighted power minimization distortionless response (WPD) convolutional beamformer to perform simultaneous separation and dereverberation. We derive a new formulation from the original WPD, which can handle multi-source input, and replace eigenvalue decomposition with the matrix inverse operation to make the back-propagation algorithm more stable. The above two architectures are optimized in a fully end-to-end manner, only using the speech recognition criterion. Experiments on both spatialized wsj1-2mix corpus and REVERB show that our proposed model outperformed the conventional methods in reverberant scenarios.

Speakers: Aswin Shanmugam Subramanian (Johns Hopkins University) , Shinji Watanabe (Johns Hopkins University) , Wangyou Zhang (Shanghai Jiao Tong University) , Xuankai Chang (Johns Hopkins University) , Yanmin Qian (Shanghai Jiao Tong University)
• 19:15
Mon-1-8-6 Quaternion Neural Networks for Multi-channel Distant Speech Recognition 1h

Despite the significant progress in automatic speech recognition (ASR), distant ASR remains challenging due to noise and reverberation. A common approach to mitigate this issue consists of equipping the recording devices with multiple microphones that capture the acoustic scene from different perspectives. These multi-channel audio recordings contain specific internal relations between each signal. In this paper, we propose to capture these inter- and intra- structural dependencies with quaternion neural networks, which can jointly process multiple signals as whole quaternion entities. The quaternion algebra replaces the standard dot product with the Hamilton one, thus offering a simple and elegant way to model dependencies between elements. The quaternion layers are then coupled with a recurrent neural network, which can learn long-term dependencies in the time domain. We show that a quaternion long-short term memory neural network (QLSTM), trained on the concatenated multi-channel speech signals, outperforms equivalent real-valued LSTM on two different tasks of multi-channel distant speech recognition.

Speakers: Mirco Ravanelli (Université de Montréal) , Mohamed Morchid (University of Avignon) , Nicholas Lane (University of Oxford) , Titouan Parcollet (University of Oxford) , Xinchi Qiu (University of Oxford)
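
The Hamilton product mentioned in the abstract above is the standard quaternion multiplication. A minimal sketch is given below; the function name `hamilton_product` and the `(r, i, j, k)` component ordering are choices made for this example, and how the paper packs microphone signals into quaternion components is not shown here.

```python
import numpy as np

def hamilton_product(p, q):
    """Hamilton product of two quaternions given as (r, i, j, k) components."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return np.array([
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    ])
```

Every output component mixes all four input components with fixed signs, which is why a quaternion layer ties its four inputs together rather than weighting them independently as a real-valued dot product would.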
• 19:15
Mon-1-8-7 Improved Guided Source Separation Integrated with a Strong Back-end for the CHiME-6 Dinner Party Scenario 1h

The CHiME-6 dataset presents a difficult task with extreme speech overlap, severe noise and a natural speaking style. The gap of the word error rate (WER) is distinct between the audios recorded by the distant microphone arrays and the individual headset microphones. The official baseline exhibits a WER gap of approximately 10% even though the guided source separation (GSS) has achieved considerable WER reduction. In the paper, we make an effort to integrate an improved GSS with a strong automatic speech recognition (ASR) back-end, which bridges the WER gap and achieves substantial ASR performance improvement. Specifically, the proposed GSS is initialized by masks from data-driven deep-learning models, utilizes the spectral information and conducts a selection of the input channels. Meanwhile, we propose a data augmentation technique via random channel selection and deep convolutional neural network-based multi-channel acoustic models for back-end modeling. In the experiments, our framework largely reduced the WER to 34.78%/36.85% on the CHiME-6 development/evaluation set. Moreover, a narrower gap of 0.89%/4.67% was observed between the distant and headset audios. This framework is also the foundation of the IOA's submission to the CHiME-6 competition, which is ranked among the top systems.

Speakers: Hangting Chen (Institute of Acoustics,Chinese Academy of Sciences) , Pengyuan Zhang (Institute of Acoustics,Chinese Academy of Sciences) , Qian Shi (Institute of Acoustics,Chinese Academy of Sciences) , Zuozhen Liu (Institute of Acoustics,Chinese Academy of Sciences)
• 19:15
Mon-1-8-8 Neural Speech Separation Using Spatially Distributed Microphones 1h

This paper proposes a neural network based speech separation method using spatially distributed microphones. Unlike with traditional microphone array settings, neither the number of microphones nor their spatial arrangement is known in advance, which hinders the use of conventional multi-channel speech separation neural networks based on fixed size input. To overcome this, a novel network architecture is proposed that interleaves inter-channel processing layers and temporal processing layers. The inter-channel processing layers apply a self-attention mechanism along the channel dimension to exploit the information obtained with a varying number of microphones. The temporal processing layers are based on a bidirectional long short term memory (BLSTM) model and applied to each channel independently. The proposed network leverages information across time and space by stacking these two kinds of layers alternately. Our network estimates time-frequency (TF) masks for each speaker, which are then used to generate enhanced speech signals either with TF masking or beamforming. Speech recognition experimental results show that the proposed method significantly outperforms baseline multi-channel speech separation systems.

Speakers: Dongmei Wang (Microsoft) , Takuya Yoshioka (Microsoft) , Zhuo Chen (Microsoft)
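
To illustrate why self-attention along the channel dimension copes with an unknown number of microphones, here is a minimal single-head sketch. It is not the architecture from the paper: the function name `channel_self_attention`, the projection matrices `wq`, `wk`, `wv`, and the `(channels, frames, features)` layout are assumptions made for the example.

```python
import numpy as np

def channel_self_attention(x, wq, wk, wv):
    """Single-head self-attention across channels, applied per time frame.

    x has shape (channels, frames, features); the projections have shape
    (features, dim). Nothing in the parameters depends on the channel count.
    """
    q, k, v = x @ wq, x @ wk, x @ wv                      # each (C, T, D)
    # scores[t, c, e]: how much channel c attends to channel e at frame t.
    scores = np.einsum('ctd,etd->tce', q, k) / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over channels
    return np.einsum('tce,etd->ctd', weights, v)          # (C, T, D)
```

The same weights can be applied to two microphones or to five, and reordering the input channels only reorders the output in the same way, so no fixed spatial arrangement is assumed.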
• 19:15
Mon-1-8-9 Utterance-Wise Meeting Transcription System Using Asynchronous Distributed Microphones 1h

A novel framework for meeting transcription using asynchronous microphones is proposed in this paper. It consists of audio synchronization, speaker diarization, utterance-wise speech enhancement using guided source separation, automatic speech recognition, and duplication reduction. Doing speaker diarization before speech enhancement enables the system to deal with overlapped speech without considering sampling frequency mismatch between microphones. Evaluation on our real meeting datasets showed that our framework achieved a character error rate (CER) of 28.7 % by using 11 distributed microphones, while a monaural microphone placed on the center of the table had a CER of 38.2 %. We also showed that our framework achieved CER of 21.8 %, which is only 2.1 percentage points higher than the CER in headset microphone-based transcription.

Speakers: Kenji Nagamatsu (Hitachi, Ltd.) , Shota Horiguchi (Hitachi, Ltd.) , Yusuke Fujita (Hitachi, Ltd.)
• 19:15 20:15
Mon-1-9 Speech in Multimodality (MULTIMODAL) room9

### room9

Chairs: Dongyan Huang (A-STAR, Singapore), Zixing Zhang (Huawei)

https://zoom.com.cn/j/64287533785

• 19:15
Mon-1-9-1 Toward Silent Paralinguistics: Speech-to-EMG – Retrieving Articulatory Muscle Activity from Speech 1h

Electromyographic (EMG) signals recorded during speech production encode information on articulatory muscle activity and also on the facial expression of emotion, thus representing a speech-related biosignal with strong potential for paralinguistic applications. In this work, we estimate the electrical activity of the muscles responsible for speech articulation directly from the speech signal. To this end, we first perform a neural conversion of speech features into electromyographic time domain features, and then attempt to retrieve the original EMG signal from the time domain features. We propose a feed forward neural network to address the first step of the problem (speech features to EMG features) and a neural network composed of a convolutional block and a bidirectional long short-term memory block to address the second problem (true EMG features to EMG signal). We observe that four out of the five originally proposed time domain features can be estimated reasonably well from the speech signal. Further, the five time domain features are able to predict the original speech-related EMG signal with a concordance correlation coefficient of 0.663. We further compare our results with the ones achieved on the inverse problem of generating acoustic speech features from EMG features.

Speakers: Alberto Abad (INESC-ID/IST) , Björn Schuller (University of Augsburg / Imperial College London) , Catarina Botelho (INESC-ID/Instituto Superior Técnico, University of Lisbon, Portugal) , Dennis Küster (Cognitive Systems Lab (CSL), University of Bremen) , Isabel Trancoso (INESC-ID / IST Univ. Lisbon) , Kevin Scheck (Cognitive Systems Lab (CSL), University of Bremen) , Lorenz Diener (University of Bremen) , Shahin Amiriparian (University of Augsburg) , Tanja Schultz (Universität Bremen)
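
The concordance correlation coefficient reported in the abstract above is commonly computed as 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). A minimal sketch using population statistics follows; the function name `concordance_cc` is hypothetical, and the paper's exact evaluation protocol is not reproduced here.

```python
import numpy as np

def concordance_cc(x, y):
    """Concordance correlation coefficient between two 1-D signals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    # Unlike Pearson correlation, differences in mean and scale are penalised.
    return 2.0 * cov / (x.var() + y.var() + (mx - my) ** 2)
```

A prediction that follows the shape of the target but is shifted by a constant offset therefore scores below 1, even though its Pearson correlation would be exactly 1.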
• 19:15
Mon-1-9-2 Multimodal Deception Detection using Automatically Extracted Acoustic, Visual, and Lexical Features 1h

Deception detection in conversational dialogue has attracted much attention in recent years. Yet existing methods for this rely heavily on human-labeled annotations that are costly and potentially inaccurate. In this work, we present an automated system that utilizes multimodal features for conversational deception detection, without the use of human annotations. We study the predictive power of different modalities and combine them for better performance. We use openSMILE to extract acoustic features after applying noise reduction techniques to the original audio. Facial landmark features are extracted from the visual modality. We experiment with training facial expression detectors and applying Fisher Vectors to encode sequences of facial landmarks with varying length. Linguistic features are extracted from automatic transcriptions of the data. We examine the performance of these methods on the Box of Lies dataset of deception game videos, achieving 73% accuracy using features from all modalities. This result is significantly better than previous results on this corpus which relied on manual annotations, and also better than human performance.

Speakers: Jiaxuan Zhang (Columbia University) , Julia Hirschberg (Columbia University) , Sarah Ita Levitan (Columbia University)
• 19:15
Mon-1-9-3 Multi-modal Attention for Speech Emotion Recognition 1h

Emotion represents an essential aspect of human speech that is manifested in speech prosody. Speech, visual, and textual cues are complementary in human communication. In this paper, we study a hybrid fusion method, referred to as multi-modal attention network (MMAN), to make use of visual and textual cues in speech emotion recognition. We propose a novel multi-modal attention mechanism, cLSTM-MMA, which facilitates attention across three modalities and selectively fuses the information. cLSTM-MMA is fused with other uni-modal sub-networks in the late fusion. The experiments show that speech emotion recognition benefits significantly from visual and textual cues, and the proposed cLSTM-MMA alone is as competitive as other fusion methods in terms of accuracy, but with a much more compact network structure. The proposed hybrid network MMAN achieves state-of-the-art performance on the IEMOCAP database for emotion recognition.

Speakers: Haizhou Li (National University of Singapore) , Jichen Yang (National University of Singapore) , Zexu Pan (National University of Singapore) , Zhaojie Luo (Osaka University)
• 19:15
Mon-1-9-4 WISE: Word-Level Interaction-Based Multimodal Fusion for Speech Emotion Recognition 1h

While having numerous real-world applications, speech emotion recognition is still a technically challenging problem. How to effectively leverage the inherent multiple modalities in speech data (e.g., audio and text) is key to accurate classification. Existing studies normally choose to fuse multimodal features at the utterance level and largely neglect the dynamic interplay of features from different modalities at a fine-granular level over time. In this paper, we explicitly model dynamic interactions between audio and text at the word level via interaction units between two long short-term memory networks representing audio and text. We also devise a hierarchical representation of audio information from the frame, phoneme and word levels, which largely improves the expressiveness of resulting audio features. We finally propose WISE, a novel word-level interaction-based multimodal fusion framework for speech emotion recognition, to accommodate the aforementioned components. We evaluate WISE on the public benchmark IEMOCAP corpus and demonstrate that it outperforms state-of-the-art methods.

Speakers: Guang Shen (Harbin Engineering University) , Hongtao Song (Harbin Engineering University) , Kejia Zhang (Harbin Engineering University) , Qilong Han (Harbin Engineering University) , Riwei Lai (Harbin Engineering University) , Rui Chen (Harbin Engineering University) , Yu Zhang (Southern University of Science and Technology)
• 19:15
Mon-1-9-5 A Multi-scale Fusion Framework for Bimodal Speech Emotion Recognition 1h

Speech emotion recognition (SER) is a challenging task that requires learning suitable features to achieve good performance. The development of deep learning techniques makes it possible to automatically extract features rather than construct hand-crafted features. In this paper, a multi-scale fusion framework named STSER is proposed for bimodal SER by using speech and text information. A smodel, which takes advantage of convolutional neural network (CNN), bi-directional long short-term memory (Bi-LSTM) and the attention mechanism, is proposed to learn speech representation from the log-mel spectrogram extracted from speech data. Specifically, the CNN layers are utilized to learn local correlations. Then the Bi-LSTM layer is applied to learn long-term dependencies and contextual information. Finally, the multi-head self-attention layer makes the model focus on the features that are most related to the emotions. A tmodel using a pre-trained ALBERT model is applied for learning text representation from text data. Finally, a multi-scale fusion strategy, including feature fusion and ensemble learning, is applied to improve the overall performance. Experiments conducted on the public emotion dataset IEMOCAP have shown that the proposed STSER can achieve comparable recognition accuracy with fewer feature inputs.

Speakers: Ming Chen (Zhejiang University) , Xudong Zhao (Hithink RoyalFlush Information Network Co., Ltd.)
• 19:15
Mon-1-9-6 Group Gated Fusion on Attention-based Bidirectional Alignment for Multimodal Emotion Recognition 1h

Emotion recognition is a challenging and actively studied research area that plays a critical role in emotion-aware human-computer interaction systems. In a multimodal setting, temporal alignment between different modalities has not yet been well investigated. This paper presents a new model named Gated Bidirectional Alignment Network (GBAN), which consists of an attention-based bidirectional alignment network over LSTM hidden states to explicitly capture the alignment relationship between speech and text, and a novel group gated fusion (GGF) layer to integrate the representations of different modalities. We empirically show that the attention-aligned representations significantly outperform the last hidden states of the LSTM, and the proposed GBAN model outperforms existing state-of-the-art multimodal approaches on the IEMOCAP dataset.

Speakers: Helen Meng (The Chinese University of Hong Kong) , Kun Li (SpeechX Limited) , Pengfei Liu (SpeechX Limited)
• 19:15
Mon-1-9-7 Multi-modal embeddings using multi-task learning for emotion recognition 1h

General embeddings like word2vec, GloVe and ELMo have shown a lot of success in natural language tasks. The embeddings are typically extracted from models that are built on general tasks such as skip-gram models and natural language generation. In this paper, we extend the work from natural language understanding to multi-modal architectures that use audio, visual and textual information for machine learning tasks. The embeddings in our network are extracted using the encoder of a transformer model trained using multi-task training. We use person identification and automatic speech recognition as the tasks in our embedding generation framework. We tune and evaluate the embeddings on the downstream task of emotion recognition and demonstrate that on the CMU-MOSEI dataset, the embeddings can be used to improve over previous state of the art results.

Speakers: Aparna Khare (Amazon.com) , Shiva Sundaram (Amazon) , Srinivas Parthasarathy (Amazon)
• 19:15
Mon-1-9-8 Using Speaker-Aligned Graph Memory Block in Multimodally Attentive Emotion Recognition Network 1h

The integration of multimodal emotion sensing modules into human-centered technologies is growing rapidly. Despite recent advances in deep architectures that improve recognition performance, the inability to handle individual differences in expressive cues creates a major hurdle for real-world applications. In this work, we propose a Speaker-aligned Graph Memory Network (SaGMN) that leverages speaker embeddings learned from a large speaker verification network to characterize such individual differences across speakers. Specifically, the learning of the gated memory block is jointly optimized with a speaker graph encoder which aligns samples with similar vocal characteristics while effectively enlarging the discrimination across emotion classes. We evaluate our multimodal emotion recognition network on the CMU-MOSEI database and achieve a state-of-the-art accuracy of 65.1% UAR and 74.7% F1 score. Further visualization experiments demonstrate the effect of speaker space alignment with the use of graph memory blocks.

Speakers: Chi-Chun Lee (Department of Electrical Engineering, National Tsing Hua University) , Jeng-Lin Li (Department of Electrical Engineering, National Tsing Hua University)
• 19:15
Mon-1-9-9 Context-Dependent Domain Adversarial Neural Network for Multimodal Emotion Recognition 1h

Emotion recognition remains a complex task due to speaker variations and low-resource training samples. To address these difficulties, we focus on the domain adversarial neural networks (DANN) for emotion recognition. The primary task is to predict emotion labels. The secondary task is to learn a common representation where speaker identities cannot be distinguished. By using this approach, we bring the representations of different speakers closer. Meanwhile, through using the unlabeled data in the training process, we alleviate the impact of low-resource training samples. In the meantime, prior work found that contextual information and multimodal features are important for emotion recognition. However, previous DANN based approaches ignore this information, thus limiting their performance. In this paper, we propose the context-dependent domain adversarial neural network for multimodal emotion recognition. To verify the effectiveness of our proposed method, we conduct experiments on the benchmark dataset IEMOCAP. Experimental results demonstrate that the proposed method shows an absolute improvement of 3.48% over state-of-the-art strategies.

Speakers: Bin Liu (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Jian Huang (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Jianhua Tao (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Rongjun Li (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Zhanlei Yang (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Zheng Lian (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences)
• 19:15 21:30
Mon-S&T 1 Speech processing and analysis Mon-S&T 2 Speech annotation and speech assessment room12

### room12

https://zoom.com.cn/j/63445767313

• 19:15
A Dynamic 3D Pronunciation Teaching Model based on Pronunciation Attributes and Anatomy 2h 15m

In this paper, a dynamic three-dimensional (3D) head model is introduced which is built based on knowledge of human anatomy and the theory of distinctive features. The model is used to help learners of Chinese intuitively understand the exact location and method of phoneme articulation. You can access the phonetic learning system, choose the target sound you want to learn and then watch the 3D dynamic animations of the phonemes. You can look at the lips, tongue, soft palate, uvula, and other dynamic vocal organs as well as teeth, gums, hard jaw, and other passive vocal organs from different angles. In this process, you can make the skin and some of the muscles semi-transparent, or zoom the model in or out to see the dynamic changes of the articulators clearly. By looking at the 3D model, learners can find the exact location of each sound and imitate the pronunciation actions.

Speakers: Boxue Li (Yunfan Hailiang (Beijing) technology co., LTD) , Xiaoli Feng (Beijing Language and Culture University,Yunfan Hailiang (Beijing) technology co., LTD) , Yanlu Xie (Beijing Language and Culture University) , Yayue Deng (Beijing Language and Culture University)
• 19:15
A Mandarin L2 Learning APP with Mispronunciation Detection and Feedback 2h 15m

In this paper, an APP with Mispronunciation Detection and Feedback for Mandarin L2 Learners is shown. The APP can detect mispronunciations in words and highlight them in red at the phone level. A score is also shown to evaluate the overall pronunciation. When the highlight is touched, the learner's pronunciation and the standard pronunciation are played. Then a flash animation that describes the movement of the tongue, mouth, and other articulators is shown to the learner. The learner can repeat the process to improve and exercise the pronunciation. The App, called ‘SAIT 汉语’, can be downloaded from the App Store.

Speakers: Boxue Li (Yunfan Hailiang (Beijing) Technology co., LTD) , Jinsong Zhang (Advanced Innovation Center for Language Resource and Intelligence, Beijing Language and Culture University) , Xiaoli Feng (Advanced Innovation Center for Language Resource and Intelligence, Beijing Language and Culture University) , Yanlu Xie (Advanced Innovation Center for Language Resource and Intelligence, Beijing Language and Culture University) , Yujia Jin (Advanced Innovation Center for Language Resource and Intelligence, Beijing Language and Culture University)
• 19:15
CATOTRON–A Neural Text-to-Speech System in Catalan 2h 15m

We present Catotron, a neural network-based open-source speech synthesis system in Catalan. Catotron consists of a sequence-to-sequence model trained with two small open-source datasets based on semi-spontaneous and read speech. We demonstrate how a neural TTS can be built for languages with limited resources using found-data optimization and cross-lingual transfer learning. We make the datasets, initial models and source code publicly available for both commercial and research purposes.

Speakers: Alex Peiró-Lilja (Universitat Pompeu Fabra) , Alp Öktem (Col·lectivaT) , Baybars Külebi (Col·lectivaT) , Mireia Farrús (Universitat Pompeu Fabra) , Santiago Pascual (Universitat Politècnica de Catalunya)
• 19:15
Computer-Assisted Language Learning System: Automatic Speech Evaluation for Singaporean Children Learning Malay and Tamil 2h 15m

We present a computer-assisted language learning system that automatically evaluates the pronunciation and fluency of spoken Malay and Tamil. Our system consists of a server and a user-facing Android application, where the server is responsible for speech-to-text alignment as well as pronunciation and fluency scoring. We describe our system architecture and discuss the technical challenges associated with low resource languages. To the best of our knowledge, this work is the first pronunciation and fluency scoring system for Malay and Tamil.

Speakers: Siti Umairah Md Salleh (Institute for Infocomm Research, A*STAR, Singapore) , Ke Shi (Institute for Infocomm Research, A*STAR, Singapore) , Kye Min Tan (Institute for Infocomm Research, A*STAR, Singapore) , Nancy F. Chen (Institute for Infocomm Research, A*STAR, Singapore) , Nur Farah Ain Binte Suhaimi (Institute for Infocomm Research, A*STAR, Singapore) , Rajan s/o Vellu (Institute for Infocomm Research, A*STAR, Singapore) , Richeng Duan (Institute for Infocomm Research, A*STAR, Singapore) , Thai Ngoc Thuy Huong Helen (Institute for Infocomm Research, A*STAR, Singapore)
• 19:15
End-to-End Deep Learning Speech Recognition Model for Silent Speech Challenge 2h 15m

This work is the first attempt to apply an end-to-end, deep neural network-based automatic speech recognition (ASR) pipeline to the Silent Speech Challenge (SSC) dataset, which contains synchronized ultrasound images and lip images captured when a single speaker read the TIMIT corpus without uttering audible sounds. In silent speech research using the SSC dataset, established ASR methods have been adapted, with some modifications, for visual speech recognition. In this work, we tested a state-of-the-art ASR method on the SSC dataset using the End-to-End Speech Processing Toolkit, ESPnet. The experimental results show that this end-to-end method achieved a character error rate (CER) of 10.1% and a word error rate (WER) of 20.5% by incorporating SpecAugment, demonstrating the possibility of further improving the performance with additional data collection.

Speakers: Naoki Kimura (The University of Tokyo) , Takaaki Saeki (The University of Tokyo) , Zixiong Su (The University of Tokyo)
• 19:15
ICE-Talk: an Interface for a Controllable Expressive Talking Machine 2h 15m

ICE-Talk is an open-source web-based GUI that allows the use of a TTS system with controllable parameters via a text field and a clickable 2D plot. It enables the study of latent spaces for controllable TTS. Moreover, it is implemented as a module that can be used as part of a human-agent interaction system.

Speakers: Kevin El Haddad (Numediart Institute, University of Mons) , Noé Tits (Numediart Institute, University of Mons) , Thierry Dutoit (Numediart Institute, University of Mons)
• 19:15
Kaldi-web: An installation-free, on-device speech recognition system 2h 15m

Speech provides an intuitive interface to communicate with machines. Today, developers willing to implement such an interface must either rely on third-party proprietary software or become experts in speech recognition. Conversely, researchers in speech recognition wishing to demonstrate their results need to be familiar with technologies that are not relevant to their research (e.g., graphical user interface libraries). In this demo, we introduce Kaldi-web: an open-source, cross-platform tool which bridges this gap by providing a user interface built around the online decoder of the Kaldi toolkit. Additionally, because we compile Kaldi to WebAssembly, speech recognition is performed directly in web browsers. This addresses privacy issues as no data is transmitted to the network for speech recognition.

Speakers: Denis Jouvet (Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France) , Emmanuel Vincent (Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France) , Laurent Pierron (Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France) , Mathieu Hu (Université de Lorraine, CNRS, Inria, LORIA, F-54000 Nancy, France)
• 19:15
Rapid Enhancement of NLP systems by Acquisition of Data in Correlated Domains 2h 15m

In a generation where industries are going through a paradigm shift because of the rampant growth of deep learning, structured data plays a crucial role in the automation of various tasks. Textual structured data is one such kind, which is extensively used in systems like chatbots and automatic speech recognition. Unfortunately, the majority of the available textual data is unstructured, in the form of user reviews and feedback, social media posts, etc. Automating the task of categorizing or clustering these data into meaningful domains will reduce the time and effort needed in building sophisticated human-interactive systems. In this paper, we present a web tool that builds domain-specific data based on a search phrase from a database of highly unstructured user utterances. We also show the usage of an Elasticsearch database with custom indexes for full correlated text search. This tool uses the open-source GloVe model combined with cosine similarity and performs a graph-based search to provide semantically and syntactically meaningful corpora. In the end, we discuss its applications with respect to natural language processing.
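
As a rough illustration of the cosine-similarity step mentioned in the abstract, the sketch below ranks a few candidate words against a query vector. It is not the authors' tool: the function names and the three-dimensional toy vectors are hypothetical stand-ins for real GloVe embeddings, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch (all-zero vector)
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, candidates):
    """Sort (label, vector) pairs by decreasing similarity to the query."""
    scored = [(label, cosine_similarity(query_vec, vec)) for label, vec in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical toy vectors standing in for pretrained word embeddings.
toy_embeddings = [
    ("refund", [0.9, 0.1, 0.0]),
    ("weather", [0.0, 0.2, 0.9]),
    ("payment", [0.8, 0.3, 0.1]),
]
ranking = rank_by_similarity([1.0, 0.2, 0.0], toy_embeddings)
print([label for label, _ in ranking])
```

In a real setting the vectors would come from a pretrained embedding file, and the ranking would feed whatever graph-based search the tool performs.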

Speakers: Ajit Ashok Saunshikar (Samsung Research and Development Institute) , Kinnera Saranu (Samsung Research and Development Institute) , Mayuresh Sanjay Oak (Samsung Research and Development Institute) , Sandip Shriram Bapat (Samsung Research and Development Institute) , Tejas Udayakumar (Samsung Research and Development Institute)
• 19:15
Real-time, full-band, online DNN-based voice conversion system using a single CPU 2h 15m

We present a real-time, full-band, online voice conversion (VC) system that uses a single CPU. For practical applications, VC must be high quality and able to perform real-time, online conversion with fewer computational resources. Our system achieves this by combining non-linear conversion with a deep neural network and short-tap, sub-band filtering. We evaluate our system and demonstrate that it 1) has an estimated complexity of around 2.5 GFLOPS and a measured real-time factor (RTF) of around 0.5 with a single CPU and 2) can attain converted speech with a naturalness mean opinion score (MOS) of 3.4 out of 5.0.

Speakers: Hiroshi Saruwatari (Graduate School of Information Science and Technology, The University of Tokyo, Japan) , Shinnosuke Takamichi (Graduate School of Information Science and Technology, The University of Tokyo, Japan) , Takaaki Saeki (Graduate School of Information Science and Technology, The University of Tokyo, Japan) , Yuki Saito (Graduate School of Information Science and Technology, The University of Tokyo, Japan)
• 19:15
Smart Tube: A Biofeedback System for Vocal Training and Therapy through Tube Phonation 2h 15m

Tube phonation, or straw phonation, is a frequently used vocal training technique to improve the efficiency of the vocal mechanism by repeatedly producing a speech sound into a tube or straw. Use of the straw results in a semi-occluded vocal tract in order to maximize the interaction between the vocal fold vibration and the vocal tract. This method requires a voice trainer or therapist to raise the trainee or patient’s awareness of the vibrations around his or her mouth, guiding him/her to maximize the vibrations, which results in efficient phonation. A major problem with this process is that the trainer cannot monitor the trainee/patient’s vibratory state in a quantitative manner. This study proposes the use of Smart Tube, a straw with an attached acceleration sensor and LED strip that can measure vibrations and provide corresponding feedback through LED lights in real time. The biofeedback system was implemented using a microcontroller board, Arduino Uno, to minimize cost. Possible system function enhancements include Bluetooth compatibility with personal computers and/or smartphones. Smart Tube can facilitate improved phonation for trainees/patients by providing quantitative visual feedback.

Speakers: Kenta Hamada (Konan University) , Naoko Kawamura (Himeji Dokkyo University) , Tatsuya Kitamura (Konan University)
• 19:15
SoapBox Labs Fluency Assessment Platform for child speech 2h 15m

The SoapBox Labs Fluency API service allows the automatic assessment of a child’s reading fluency. The system uses automatic speech recognition (ASR) to transcribe the child’s speech as they read a passage. The ASR output is then compared to the text of the reading passage, and the fluency algorithm returns information about the accuracy of the child’s reading attempt. In this show and tell paper we describe how the fluency cloud API is accessed and demonstrate how the fluency demo system processes an audio file, as shown in the accompanying video.

Speakers: Adrian Hempel (SoapBox Labs, Dublin, Ireland) , Agape Deng (SoapBox Labs, Dublin, Ireland) , Amelia C. Kelly (SoapBox Labs, Dublin, Ireland) , Armin Saeb (SoapBox Labs, Dublin, Ireland) , Arnaud Letondor (SoapBox Labs, Dublin, Ireland) , Eleni Karamichali (SoapBox Labs, Dublin, Ireland) , Gloria Montoya Gomez (SoapBox Labs, Dublin, Ireland) , Karel Veselý (SoapBox Labs, Dublin, Ireland) , Niall Mullally (SoapBox Labs, Dublin, Ireland) , Nicholas Parslow (SoapBox Labs, Dublin, Ireland) , Qiru Zhou (SoapBox Labs, Dublin, Ireland) , Robert O’Regan (SoapBox Labs, Dublin, Ireland)
• 19:15
Soapbox Labs Verification Platform for child speech 2h 15m

SoapBox Labs’ child speech verification platform is a service designed specifically for identifying keywords and phrases in children’s speech. Given an audio file containing children’s speech and one or more target keywords or phrases, the system will return the confidence score of recognition for the word(s) or phrase(s) within the audio file. The confidence scores are provided at utterance level, word level and phoneme level. The service is available online through a cloud API service, or offline on Android and iOS. The platform is accurate for child speech from children as young as 3, and is robust to noisy environments. In this demonstration we show how to access the online API and give some examples of common use cases in literacy and language learning, gaming and robotics.

Speakers: Agape Deng (SoapBox Labs, Dublin, Ireland) , Amelia C. Kelly (SoapBox Labs, Dublin, Ireland) , Armin Saeb (SoapBox Labs, Dublin, Ireland) , Arnaud Letondor (SoapBox Labs, Dublin, Ireland) , Eleni Karamichali, Karel Veselý (SoapBox Labs, Dublin, Ireland) , Nicholas Parslow (SoapBox Labs, Dublin, Ireland) , Qiru Zhou (SoapBox Labs, Dublin, Ireland) , Robert O’Regan (SoapBox Labs, Dublin, Ireland)
• 19:15
Toward Remote Patient Monitoring of Speech, Video, Cognitive and Respiratory Biomarkers Using Multimodal Dialog Technology 2h 15m

We demonstrate a multimodal conversational platform for remote patient diagnosis and monitoring. The platform engages patients in an interactive dialog session and automatically computes metrics relevant to speech acoustics and articulation, oro-motor and oro-facial movement, cognitive function and respiratory function. The dialog session includes a selection of exercises that have been widely used in both speech language pathology research as well as clinical practice – an oral motor exam, sustained phonation, diadochokinesis, read speech, spontaneous speech, spirometry, picture description, emotion elicitation and other cognitive tasks. Finally, the system automatically computes speech, video, cognitive and respiratory biomarkers that have been shown to be useful in capturing various aspects of speech motor function and neurological health and visualizes them in a user-friendly dashboard.

Speakers: Andrew Cornish (Modality.ai, Inc.) , David Pautler (Modality.ai, Inc.) , David Suendermann-Oeft (Modality.ai, Inc.) , Dirk Schnelle-Walka (Modality.ai, Inc.) , Doug Habberstad (Modality.ai, Inc.) , Hardik Kothare (Modality.ai, Inc.; University of California, San Francisco) , Jackson Liscombe (Modality.ai, Inc.) , Michael Neumann (Modality.ai, Inc.) , Oliver Roesler (Modality.ai, Inc.) , Patrick Lange (Modality.ai, Inc.) , Vignesh Murali (Modality.ai, Inc.) , Vikram Ramanarayanan (Modality.ai, Inc.; University of California, San Francisco)
• 19:15
VCTUBE: A Library for Automatic Speech Data Annotation 2h 15m

We introduce an open-source Python library, VCTUBE, which can automatically generate <audio, text> pairs of speech data from a given YouTube URL. We believe VCTUBE is useful for collecting, processing, and annotating speech data easily toward developing speech synthesis systems.

Speakers: Eunil Park (Sungkyunkwan University) , Jeewoo Yoon (Sungkyunkwan University) , Jinyoung Han (Sungkyunkwan University) , Migyeong Yang (Sungkyunkwan University) , Minsam Ko (Hanyang University) , Munyoung Lee (Electronics and Telecommunications Research Institute) , Seong Choi (Sungkyunkwan University) , Seonghee Lee (Electronics and Telecommunications Research Institute) , Seunghoon Jeong (Hanyang University)
• 19:15
VoiceID on the fly: A Speaker Recognition System that Learns from Scratch 2h 15m

We proposed a novel AI framework to conduct real-time multi-speaker recognition without any prior registration or pre-training by learning the speaker identification on the fly. We considered the practical problem of online learning with episodically revealed rewards and introduced a solution based on semi-supervised and self-supervised learning methods in a web-based application at https://www.baihan.nyc/viz/VoiceID/.

Speakers: Baihan Lin (Department of Applied Mathematics, University of Washington, Seattle, USA) , Xinxin Zhang (Department of Electrical & Computer Engineering, University of Washington, Seattle, USA)
• 19:15 20:15
Mon-SS-1-6 Automatic Speech Recognition for Non-Native Children's Speech room6

### room6

Chairs: Kate Knill, Daniele Falavigna

https://zoom.com.cn/j/67261969599

• 19:15
Mon-SS-1-6-1 Overview of the Interspeech TLT2020 Shared Task on ASR for Non-Native Children’s Speech 1h

We present an overview of the ASR challenge for non-native children's speech organized for a special session at Interspeech 2020. The data for the challenge was obtained in the context of a spoken language proficiency assessment administered at Italian schools for students between the ages of 9 and 16 who were studying English and German as a foreign language. The corpus distributed for the challenge was a subset of the English recordings. Participating teams competed either in a closed track, in which they could use only the training data released by the organizers of the challenge, or in an open track, in which they were allowed to use additional training data. The closed track received 9 entries and the open track received 7 entries, with the best scoring systems achieving substantial improvements over a state-of-the-art baseline system. This paper describes the corpus of non-native children's speech that was used for the challenge, analyzes the results, and discusses some points that should be considered for subsequent challenges in this domain in the future.

Speakers: Chee Wee (Ben) Leong (Educational Testing Service) , Daniele Falavigna (Fondazione Bruno Kessler) , Keelan Evanini (Educational Testing Service) , Marco Matassoni (Fondazione Bruno Kessler) , Roberto Gretter (FBK)
• 19:15
Mon-SS-1-6-2 The NTNU System at the Interspeech 2020 Non-Native Children’s Speech ASR Challenge 1h

This paper describes the NTNU ASR system participating in the Interspeech 2020 Non-Native Children’s Speech ASR Challenge supported by the SIG-CHILD group of ISCA. This ASR shared task is made much more challenging by the coexisting diversity of non-native and children's speaking characteristics. In the closed-track evaluation, all participants were restricted to developing their systems solely on the speech and text corpora provided by the organizer. To work around this low-resource issue, we built our ASR system on top of CNN-TDNNF-based acoustic models, while harnessing the synergistic power of various data augmentation strategies, including both utterance- and word-level speed perturbation and spectrogram augmentation, alongside a simple yet effective data-cleansing approach. All variants of our ASR system employed an RNN-based language model to rescore the first-pass recognition hypotheses, which was trained solely on the text dataset released by the organizer. Our system with the best configuration came out in second place, with a word error rate (WER) of 17.59%, while those of the top-performing, second runner-up and official baseline systems are 15.67%, 18.71%, and 35.09%, respectively.
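
The systems in this session are ranked by word error rate. As a minimal sketch of the usual definition (substitutions, deletions, and insertions from a word-level edit distance, divided by the number of reference words), the function below can be used. The name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the challenge's official scoring procedure may apply text normalization that is not reproduced here.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {wer:.2%}")
```

Because insertions count as errors, a WER computed this way can exceed 100% when the hypothesis is much longer than the reference.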

Speakers: Berlin Chen (National Taiwan Normal University) , Fu-An Chao (National Taiwan Normal University) , Shi-Yan Weng (National Taiwan Normal University) , Tien-Hong Lo (National Taiwan Normal University)
• 19:15
Mon-SS-1-6-3 Non-Native Children's Automatic Speech Recognition: the INTERSPEECH 2020 Shared Task ALTA Systems 1h

Automatic spoken language assessment (SLA) is a challenging problem due to the large variations in learner speech combined with limited resources. These issues are even more problematic when considering children learning a language, with higher levels of acoustic and lexical variability, and of code-switching compared to adult data. This paper describes the ALTA system for the INTERSPEECH 2020 Shared Task on Automatic Speech Recognition for Non-Native Children’s Speech. The data for this task consists of examination recordings of Italian school children aged 9-16, ranging in ability from minimal, to basic, to limited but effective command of spoken English. A variety of systems were developed using the limited training data available, 49 hours. State-of-the-art acoustic models and language models were evaluated, including a diversity of lexical representations, handling code-switching and learner pronunciation errors, and grade specific models. The best single system achieved a word error rate (WER) of 16.9% on the evaluation data. By combining multiple diverse systems, including both grade independent and grade specific models, the error rate was reduced to 15.7%. This combined system was the best performing submission for both the closed and open tasks.

Speakers: Kate Knill (University of Cambridge) , Linlin Wang (Cambridge University Engineering Department) , Mark Gales (Cambridge University) , Xixin Wu (University of Cambridge) , Yu Wang (University of Cambridge)
• 19:15
Mon-SS-1-6-4 Data augmentation using prosody and false starts to recognize non-native children's speech 1h

This paper describes AaltoASR’s speech recognition system for the INTERSPEECH 2020 shared task on Automatic Speech Recognition (ASR) for non-native children’s speech. The task is to recognize non-native speech from children of various age groups given a limited amount of speech. Moreover, the speech being spontaneous has false starts transcribed as partial words, which in the test transcriptions leads to unseen partial words. To cope with these two challenges, we investigate a data augmentation-based approach. Firstly, we apply the prosody-based data augmentation to supplement the audio data. Secondly, we simulate false starts by introducing partial-word noise in the language modeling corpora creating new words. Acoustic models trained on prosody-based augmented data outperform the models using the baseline recipe or the SpecAugment-based augmentation. The partial-word noise also helps to improve the baseline language model. Our ASR system, a combination of these schemes, is placed third in the evaluation period and achieves the word error rate of 18.71%. Post-evaluation period, we observe that increasing the amounts of prosody-based augmented data leads to better performance. Furthermore, removing low-confidence-score words from hypotheses can lead to further gains. These two improvements lower the ASR error rate to 17.99%.

Speakers: Hemant Kathania (Aalto University) , Mikko Kurimo (Aalto University) , Mittul Singh (Aalto University) , Tamás Grósz (Department of Signal Processing and Acoustics, Aalto University)
• 19:15
Mon-SS-1-6-5 UNSW System Description for the Shared Task on Automatic Speech Recognition for Non-Native Children’s Speech 1h

In this paper we describe our children’s Automatic Speech Recognition (ASR) system for the first shared task on ASR for English non-native children’s speech. The acoustic model comprises 6 Convolutional Neural Network (CNN) layers and 12 Factored Time-Delay Neural Network (TDNN-F) layers, trained on data from 5 different children’s speech corpora. Speed perturbation, Room Impulse Response (RIR), babble noise and non-speech noise data augmentation methods were utilized to enhance the model robustness. Three Language Models (LMs) were employed: an in-domain LM trained on written data and speech transcriptions of non-native children, an LM trained on non-native written data and transcriptions of both native and non-native children’s speech, and a TEDLIUM LM trained on transcriptions of adult TED talks. Lattices produced from the different ASR systems were combined and decoded using the Minimum Bayes-Risk (MBR) decoding algorithm to get the final output. Our system achieved a final Word Error Rate (WER) of 17.55% and 16.59% on the development and test sets, respectively, and ranked second among the 10 teams participating in the task.

Speakers: Beena Ahmed (University of New South Wales) , Julien Epps (University of New South Wales) , Mostafa Shahin (University of New South Wales) , Renée Lu (University of New South Wales)
• 20:15 20:30
Coffee Break
• 20:30 21:30
Mon-2-1 Speech Emotion Recognition I (SER I) room1

### room1

Chairs: Gábor Gosztolya (U Szeged, Hungary) , Yongwei Li

https://zoom.com.cn/j/68015160461

• 20:30
Mon-2-1-1 Enhancing Transferability of Black-box Adversarial Attacks via Lifelong Learning for Speech Emotion Recognition Models 1h

Well-designed adversarial examples can easily fool deep speech emotion recognition models into misclassifications. The transferability of adversarial attacks is a crucial evaluation indicator when generating adversarial examples to fool a new target model or multiple models. Herein, we propose a method to improve the transferability of black-box adversarial attacks using lifelong learning. First, black-box adversarial examples are generated by an atrous Convolutional Neural Network (CNN) model. This initial model is trained to attack a CNN target model. Then, we adapt the trained atrous CNN attacker to a new CNN target model using lifelong learning. We use this paradigm, as it enables multi-task sequential learning, which saves more memory space than conventional multi-task learning. We verify this property on an emotional speech database, by demonstrating that the updated atrous CNN model can attack all target models which have been learnt, and can better attack a new target model than an attack model trained on one target model only.

Speakers: Björn Schuller (University of Augsburg / Imperial College London) , Jing Han (University of Augsburg) , Nicholas Cummins (University of Augsburg) , Zhao Ren (University of Augsburg)
• 20:30
Mon-2-1-2 End-to-End Speech Emotion Recognition Combined with Acoustic-to-Word ASR Model 1h

In this paper, we propose speech emotion recognition (SER) combined with an acoustic-to-word automatic speech recognition (ASR) model. While acoustic prosodic features are primarily used for SER, textual features are also useful but are error-prone, especially in emotional speech. To solve this problem, we integrate ASR model and SER model in an end-to-end manner. This is done by using an acoustic-to-word model. Specifically, we utilize the states of the decoder in the ASR model with the acoustic features and input them into the SER model. On top of a recurrent network to learn features from this input, we adopt a self-attention mechanism to focus on important feature frames. Finally, we finetune the ASR model on the new dataset using a multi-task learning method to jointly optimize ASR with the SER task. Our model has achieved a 68.63% weighted accuracy (WA) and 69.67% unweighted accuracy (UA) on the IEMOCAP database, which is state-of-the-art performance.

Speakers: Han Feng (Graduate School of Informatics, Kyoto University, Sakyo-ku, Kyoto, Japan) , Sei Ueno (Graduate School of Informatics, Kyoto University, Sakyo-ku, Kyoto, Japan) , Tatsuya Kawahara (Kyoto University)
• 20:30
Mon-2-1-3 Improving Speech Emotion Recognition Using Graph Attentive Bi-directional Gated Recurrent Unit Network 1h

The manner in which humans encode emotion information within an utterance is often complex and can result in diverse salient acoustic profiles conditioned on emotion type. In this work, we propose a framework that imposes a graph attention mechanism on a gated recurrent unit network (GA-GRU) to improve utterance-based speech emotion recognition (SER). Our proposed GA-GRU combines long-range time-series modeling of speech with complex saliency integrated through a graph structure. We evaluate our proposed GA-GRU on the IEMOCAP and MSP-IMPROV databases and achieve 63.8% UAR and 57.47% UAR, respectively, in a four-class emotion recognition task. The GA-GRU obtains consistently better performance than recent state-of-the-art per-utterance emotion classification models, and we further observe that different emotion categories require distinct flexible structures for modeling emotion information in the acoustic data, beyond the conventional left-to-right or right-to-left ordering.

Speakers: Bo-Hao Su (Department of Electrical Engineering, National Tsing Hua University) , Chi-Chun Lee (Department of Electrical Engineering, National Tsing Hua University) , Chun-Min Chang (Department of Electrical Engineering, National Tsing Hua University) , Yun-Shao Lin (Department of Electrical Engineering, National Tsing Hua University)
• 20:30
Mon-2-1-4 An Investigation of Cross-Cultural Semi-Supervised Learning for Continuous Affect Recognition 1h

One of the keys for supervised learning techniques to succeed resides in the access to vast amounts of labelled training data. The process of data collection, however, is expensive, time-consuming, and application dependent. In the current digital era, data can be collected continuously. This continuity renders data annotation into an endless task, which potentially, in problems such as emotion recognition, requires annotators with different cultural backgrounds. Herein, we study the impact of utilising data from different cultures in a semi-supervised learning approach to label training material for the automatic recognition of arousal and valence. Specifically, we compare the performance of culture-specific affect recognition models trained with manual or cross-cultural automatic annotations. The experiments performed in this work use the dataset released for the Cross-cultural Emotion Sub-challenge of the Audio/Visual Emotion Challenge (AVEC) 2019. The results obtained convey that the cultures used for training impact on the system performance. Furthermore, in most of the scenarios assessed, affect recognition models trained with hybrid solutions, combining manual and automatic annotations, surpass the baseline model, which was exclusively trained with manual annotations.

Speakers: Adria Mallol-Ragolta (University of Augsburg) , Björn Schuller (University of Augsburg / Imperial College London) , Nicholas Cummins (University of Augsburg)
• 20:30
Mon-2-1-5 Ensemble of Students Taught by Probabilistic Teachers to Improve Speech Emotion Recognition 1h

Reliable and generalizable speech emotion recognition (SER) systems have wide applications in various fields including healthcare, customer service, and security and defense. Towards this goal, this study presents a novel teacher-student (T-S) framework for SER, relying on an ensemble of probabilistic predictions of teacher embeddings to train an ensemble of students. We use uncertainty modeling with Monte-Carlo (MC) dropout to create a distribution for the embeddings of an intermediate dense layer of the teacher. The embeddings guiding the student models are derived by sampling from this distribution. The final prediction combines the results obtained by the student ensemble. The proposed model not only increases the prediction performance over the teacher model, but also generates more consistent predictions. As a T-S formulation, the approach allows the use of unlabeled data to improve the performance of the students in a semi-supervised manner. An ablation analysis shows the importance of the MC-based ensemble and the use of unlabeled data. The results show relative improvements in concordance correlation coefficient (CCC) up to 4.25% for arousal, 2.67% for valence and 4.98% for dominance from their baseline results. The results also show that the student ensemble decreases the uncertainty in the predictions, leading to more consistent results.
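
For readers unfamiliar with the metric reported above, the following is a minimal sketch of the concordance correlation coefficient in its standard form, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²), using population statistics. The function name `concordance_cc` is hypothetical, and this is not the authors' evaluation code.

```python
def concordance_cc(x, y):
    """Concordance correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denominator = var_x + var_y + (mean_x - mean_y) ** 2
    if denominator == 0.0:
        raise ValueError("CCC is undefined for two identical constant sequences")
    return 2.0 * cov_xy / denominator


labels = [1.0, 2.0, 3.0]
# A constant offset keeps the Pearson correlation at 1 but lowers the CCC.
shifted_predictions = [2.0, 3.0, 4.0]
print(concordance_cc(labels, labels))
print(concordance_cc(labels, shifted_predictions))
```

Unlike the Pearson correlation, the CCC penalizes differences in mean and scale between predictions and labels, which is why it is a common choice for continuous attributes such as arousal, valence and dominance.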

Speakers: Carlos Busso (The University of Texas at Dallas) , Kusha Sridhar (The University of Texas at Dallas)
• 20:30
Mon-2-1-6 Augmenting Generative Adversarial Networks for Speech Emotion Recognition 1h

Generative adversarial networks (GANs) have shown potential in learning emotional attributes and generating new data samples. However, their performance is usually hindered by the unavailability of larger speech emotion recognition (SER) data. In this work, we propose a framework that utilises the mixup data augmentation scheme to augment the GAN in feature learning and generation. To show the effectiveness of the proposed framework, we present results for SER on (i) synthetic feature vectors, (ii) augmentation of the training data with synthetic features, and (iii) encoded features in compressed representation. Our results show that the proposed framework can effectively learn compressed emotional representations and generate synthetic samples that help improve performance in within-corpus and cross-corpus evaluation.
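
The mixup scheme named in the abstract is, in its generic form, a convex combination of two training examples and their labels with a weight drawn from a Beta(α, α) distribution. The sketch below shows only that generic operation on plain Python lists; how the authors integrate it into GAN training is not described here, and the function name, the default `alpha`, and the toy inputs are assumptions for illustration.

```python
import random


def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Return a convex combination of two (features, one-hot label) examples."""
    if len(x_i) != len(x_j) or len(y_i) != len(y_j):
        raise ValueError("both examples must have matching shapes")
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix, lam


# Hypothetical two-dimensional features with one-hot labels for two classes.
rng = random.Random(0)  # seeded so the sketch is repeatable
x_mix, y_mix, lam = mixup_pair([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], rng=rng)
print(lam, x_mix, y_mix)
```

With a small `alpha`, most sampled weights fall near 0 or 1, so the mixed examples stay close to one of the two originals; larger values produce more evenly blended samples.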

Speakers: Björn Schuller (University of Augsburg / Imperial College London) , Muhammad Asim (Information Technology University, Lahore) , Raja Jurdak (Queensland University of Technology (QUT)) , Rajib Rana (University of Southern Queensland) , Sara Khalifa (Distributed Sensing Systems Group, Data61, CSIRO Australia) , Siddique Latif (University of Southern Queensland Australia/Distributed Sensing Systems Group, Data61, CSIRO Australia)
• 20:30
Mon-2-1-7 Speech Emotion Recognition ‘in the wild’ Using an Autoencoder 1h

Speech Emotion Recognition (SER) has been a challenging task on which researchers have been working for decades. Recently, Deep Learning (DL) based approaches have been shown to perform well in SER tasks; however, it has been noticed that their superior performance is limited to the distribution of the data used to train the model. In this paper, we present an analysis of using autoencoders to improve the generalisability of DL based SER solutions. We train a sparse autoencoder using a large speech corpus extracted from social media. Later, the trained encoder part of the autoencoder is reused as the input to a long short-term memory (LSTM) network, and the encoder-LSTM model is re-trained on an aggregation of five commonly used speech emotion corpora. Our evaluation uses a corpus unseen during the training and validation stages to simulate ‘in the wild’ conditions and analyse the generalisability of our solution. A performance comparison is carried out between the encoder based model and a model trained without an encoder. Our results show that the autoencoder based model improves the unweighted accuracy on the unseen corpus by 8%, indicating that autoencoder based pre-training can improve the generalisability of DL based SER solutions.

Speakers: Haimo Zhang (University of Auckland) , Mark Billinghurst (University of Auckland) , Suranga Nanayakkara (University of Auckland) , Vipula Dissanayake (University of Auckland)
• 20:30
Mon-2-1-8 Emotion Profile Refinery for Speech Emotion Classification 1h

Human emotions are inherently ambiguous and impure. When designing systems to anticipate human emotions based on speech, the lack of emotional purity must be considered. However, most of the current methods for speech emotion classification rest on the consensus, e.g., one single hard label for an utterance. This labeling principle imposes challenges for system performance considering emotional impurity. In this paper, we recommend the use of emotional profiles (EPs), which provide a time series of segment-level soft labels to capture the subtle blends of emotional cues present across a specific speech utterance. We further propose the emotion profile refinery (EPR), an iterative procedure to update EPs. The EPR method produces soft, dynamically-generated, multiple probabilistic class labels during successive stages of refinement, which results in significant improvements in the model accuracy. Experiments on three well-known emotion corpora show noticeable gain using the proposed method.

Speakers: P. C. Ching (The Chinese University of Hong Kong) , Shuiyang Mao (The Chinese University of Hong Kong) , Tan Lee (The Chinese University of Hong Kong)
• 20:30
Mon-2-1-9 Speech Representation Learning for Emotion Recognition Using End-to-End ASR with Factorized Adaptation 1h

Developing robust speech emotion recognition (SER) systems is challenging due to the small scale of existing emotional speech datasets. However, previous works have mostly relied on handcrafted acoustic features to build SER models, which have difficulty handling a wide range of acoustic variations. One way to alleviate this problem is by using speech representations learned from deep end-to-end models trained on large-scale speech databases. Specifically, in this paper, we leverage an end-to-end ASR to extract ASR-based representations for speech emotion recognition. We further devise a factorized domain adaptation approach on the pre-trained ASR model to improve both the speech recognition rate and the emotion recognition accuracy on the target emotion corpus, and we also provide an analysis of the effectiveness of representations extracted from different ASR layers. Our experiments demonstrate the importance of ASR adaptation and layer depth for emotion recognition.

Speakers: Chi-Chun Lee (Department of Electrical Engineering, National Tsing Hua University) , Sung-Lin Yeh (Department of Electrical Engineering, National Tsing Hua University) , Yun-Shao Lin (Department of Electrical Engineering)
• 20:30 21:30
Mon-2-10 DNN architectures for Speaker Recognition room10

### room10

Chairs: Zhijian Ou, Rosa González Hautamäki

https://zoom.com.cn/j/61218542656

• 20:30
Mon-2-10-1 AutoSpeech: Neural Architecture Search for Speaker Recognition 1h

Speaker recognition systems based on Convolutional Neural Networks (CNNs) are often built with off-the-shelf backbones such as VGG-Net or ResNet. However, these backbones were originally proposed for image classification, and therefore may not be a natural fit for speaker recognition. Due to the prohibitive complexity of manually exploring the design space, we propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech. Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times. The final speaker recognition model can be obtained by training the derived CNN model through the standard scheme. To evaluate the proposed approach, we conduct experiments on both speaker identification and speaker verification tasks using the VoxCeleb1 dataset. Results demonstrate that the derived CNN architectures from the proposed approach significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.

Speakers: Shaojin Ding (Texas A&M University) , Tianlong Chen (Texas A&M University) , Weiwei Zha (University of Science and Technology of China) , Xinyu Gong (Texas A&M University) , Zhangyang Wang (Texas A&M University)
• 20:30
Mon-2-10-10 Evolutionary Algorithm Enhanced Neural Architecture Search for Text-Independent Speaker Verification 1h

State-of-the-art speaker verification models are based on deep learning techniques, which heavily depend on neural architectures hand-designed by experts or engineers. We borrow the idea of neural architecture search (NAS) for the text-independent speaker verification task. As NAS can learn deep network structures automatically, we introduce the NAS concept into the well-known x-vector network. Furthermore, this paper proposes an evolutionary algorithm enhanced neural architecture search method called Auto-Vector to automatically discover promising networks for the speaker verification task. The experimental results demonstrate that our NAS-based model outperforms state-of-the-art speaker verification models.

Speakers: Jianzong Wang (Ping An Technology (Shenzhen) Co., Ltd.) , Jing Xiao (Ping An Technology (Shenzhen) Co., Ltd.) , Xiaoyang Qu (Ping An Technology (Shenzhen) Co., Ltd.)
• 20:30
Mon-2-10-2 Densely Connected Time Delay Neural Network for Speaker Verification 1h

Time delay neural network (TDNN) has been widely used in speaker verification tasks. Recently, two TDNN-based models, including extended TDNN (E-TDNN) and factorized TDNN (F-TDNN), are proposed to improve the accuracy of vanilla TDNN. But E-TDNN and F-TDNN increase the number of parameters due to deeper networks, compared with vanilla TDNN. In this paper, we propose a novel TDNN-based model, called densely connected TDNN (D-TDNN), by adopting bottleneck layers and dense connectivity. D-TDNN has fewer parameters than existing TDNN-based models. Furthermore, we propose an improved variant of D-TDNN, called D-TDNN-SS, to employ multiple TDNN branches with short-term and long-term contexts. D-TDNN-SS can integrate the information from multiple TDNN branches with a newly designed channel-wise selection mechanism called statistics-and-selection (SS). Experiments on VoxCeleb datasets show that both D-TDNN and D-TDNN-SS can outperform existing models to achieve state-of-the-art accuracy with fewer parameters, and D-TDNN-SS can achieve better accuracy than D-TDNN.

Speakers: Wu-Jun Li (Nanjing University) , Ya-Qi Yu (Nanjing University)
• 20:30
Mon-2-10-3 Phonetically-Aware Coupled Network For Short Duration Text-independent Speaker Verification 1h

In this paper we propose an end-to-end phonetically-aware coupled network for short duration speaker verification tasks. Phonetic information is shown to be beneficial for identifying short utterances. A coupled network structure is proposed to exploit phonetic information. The coupled convolutional layers allow the network to provide frame-level supervision based on phonetic representations of the corresponding frames. The end-to-end training scheme using triplet loss function provides direct comparison of speech contents between two utterances and hence enabling phonetic-based normalization. Our systems are compared against the current mainstream speaker verification systems on both NIST SRE and VoxCeleb evaluation datasets. Relative reductions of up to 34% in equal error rate are reported.
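
The triplet loss mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the similarity measure or the margin, so the cosine similarity, the margin of 0.2 and the function names below are assumptions chosen for illustration only.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss (margin is an assumed value).

    The loss is zero once the anchor is more similar to the positive
    (same speaker) than to the negative (different speaker) by at
    least `margin`.
    """
    s_ap = cosine_similarity(anchor, positive)
    s_an = cosine_similarity(anchor, negative)
    return max(0.0, s_an - s_ap + margin)
```

A well-separated triplet such as `triplet_loss([1, 0], [1, 0], [0, 1])` gives zero loss, while swapping the positive and negative embeddings gives a positive loss that training would push down.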

Speakers: Hongbin Suo (Alibaba Group) , Siqi Zheng (Alibaba) , Yun Lei (Alibaba Group)
• 20:30
Mon-2-10-4 Multi-Task Network for Noise-Robust Keyword Spotting and Speaker Verification using CTC-based Soft VAD and Global Query Attention 1h

Keyword spotting (KWS) and speaker verification (SV) have been studied independently although it is known that acoustic and speaker domains are complementary. In this paper, we propose a multi-task network that performs KWS and SV simultaneously to fully utilize the interrelated domain information. The multi-task network tightly combines sub-networks aiming at performance improvement in challenging conditions such as noisy environments, open-vocabulary KWS, and short-duration SV, by introducing novel techniques of connectionist temporal classification (CTC)-based soft voice activity detection (VAD) and global query attention. Frame-level acoustic and speaker information is integrated with phonetically originated weights to form a word-level global representation, which is then used for the aggregation of feature vectors to generate discriminative embeddings. Our proposed approach shows 4.06% and 26.71% relative improvements in equal error rate (EER) compared to the baselines for both tasks. We also present a visualization example and results of ablation experiments.

Speakers: Hoi Rin Kim (KAIST) , Jahyun Goo (KAIST) , Myunghun Jung (KAIST) , Youngmoon Jung (KAIST)
• 20:30
Mon-2-10-5 Vector-based attentive pooling for text-independent speaker verification 1h

The pooling mechanism plays an important role in deep neural network based systems for text-independent speaker verification, which aggregates the variable-length frame-level vector sequence across all frames into a fixed-dimensional utterance-level representation. Previous attentive pooling methods employ scalar attention weights for each frame-level vector, resulting in insufficient collection of discriminative information. To address this issue, this paper proposes a vector-based attentive pooling method, which adopts vectorial attention instead of scalar attention. The vectorial attention can extract fine-grained features for discriminating different speakers. Besides, the vector-based attentive pooling is extended in a multi-head way for better speaker embeddings from multiple aspects. The proposed pooling method is evaluated with the x-vector baseline system. Experiments are conducted on two public datasets, Voxceleb and Speaker in the Wild (SITW). The results show that the vector-based attentive pooling method achieves superior performance compared with statistics pooling and three state-of-the-art attentive pooling methods, with the best equal error rate (EER) of 2.734 and 3.062 in SITW as well as the best EER of 2.466 in Voxceleb.
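
The difference between scalar and vectorial attention described above can be sketched as follows. This is only an illustration: in a real system the attention scores come from a small trainable network, whereas here they are passed in directly, and all function names are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def scalar_attentive_pooling(frames, scores):
    """frames: T x D list of frame vectors; scores: T scalars.

    Every dimension of a frame shares the same attention weight.
    """
    weights = softmax(scores)
    dim = len(frames[0])
    return [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]

def vector_attentive_pooling(frames, scores):
    """frames: T x D; scores: T x D.

    Each feature dimension gets its own weights, normalised over time,
    so different dimensions can attend to different frames.
    """
    dim = len(frames[0])
    pooled = []
    for d in range(dim):
        weights = softmax([s[d] for s in scores])
        pooled.append(sum(w * f[d] for w, f in zip(weights, frames)))
    return pooled
```

With two frames `[[1.0, 10.0], [3.0, 20.0]]`, uniform scalar scores average both dimensions equally, while vector scores can keep the first dimension averaged and let the second dimension focus almost entirely on the first frame.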

Speakers: Chenkai Guo (Nankai University) , Hongcan Gao (Nankai University) , Jing Xu (Nankai University) , Xiaolei Hou (Nankai University) , Yanfeng Wu (Nankai University)
• 20:30
Mon-2-10-6 Self-Attention Encoding and Pooling for Speaker Recognition 1h

The computing power of mobile devices limits end-user applications in terms of storage size, processing, memory and energy consumption. These limitations motivate researchers to design more efficient deep models. On the other hand, self-attention networks based on the Transformer architecture have attracted remarkable interest due to their high parallelization capabilities and strong performance on a variety of Natural Language Processing (NLP) applications. Inspired by the Transformer, we propose a tandem Self-Attention Encoding and Pooling (SAEP) mechanism to obtain a discriminative speaker embedding given non-fixed length speech utterances. SAEP is a stack of identical blocks relying solely on self-attention and position-wise feed-forward networks to create vector representations of speakers. This approach encodes short-term speaker spectral features into speaker embeddings to be used in text-independent speaker verification. We have evaluated this approach on both VoxCeleb1 & 2 datasets. The proposed architecture is able to outperform the baseline x-vector, and shows competitive performance to some other benchmarks based on convolutions, with a significant reduction in model size. It employs 94%, 95%, and 73% fewer parameters compared to ResNet-34, ResNet-50, and x-vector, respectively. This indicates that the proposed fully attention based architecture is more efficient in extracting time-invariant features from speaker utterances.

Speakers: Javier Hernando (Universitat Politecnica de Catalunya) , Miquel India (Universitat Politecnica de Catalunya) , Pooyan Safari (TALP research center, BarcelonaTech)
• 20:30
Mon-2-10-7 ARET: Aggregated Residual Extended Time-delay Neural Networks for Speaker Verification 1h

The time-delay neural network (TDNN) is widely used in speaker verification to extract long-term temporal features of speakers. Although common TDNN approaches well capture time-sequential information, they lack the delicate transformations needed for deep representation. To solve this problem, we propose two TDNN architectures. RET integrates short-cut connections into conventional time-delay blocks, and ARET adopts a split-transform-merge strategy to extract more discriminative representation. Experiments on VoxCeleb datasets without augmentation indicate that ARET realizes satisfactory performance on the VoxCeleb1 test set, VoxCeleb1-E, and VoxCeleb1-H, with 1.389%, 1.520%, and 2.614% equal error rate (EER), respectively. Compared to state-of-the-art results on these test sets, RET achieves a 23% ∼ 43% relative reduction in EER, and ARET reaches 32% ∼ 45%.

Speakers: Jianguo Wei (Tianjin University) , Jiayu Jin (Tianjin University) , Junhai Xu (Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University) , Lin Zhang (Tianjin University) , Longbiao Wang (Tianjin University) , Meng Liu (Tianjin University) , Ruiteng Zhang (Tianjin University) , Wenhuan Lu (Tianjin University)
• 20:30
Mon-2-10-8 Adversarial Separation Network for Speaker Recognition 1h

Deep neural networks (DNN) have achieved great success in speaker recognition systems. However, it is observed that DNN based systems are easily deceived by adversarial examples leading to wrong predictions. Adversarial examples, which are generated by adding purposeful perturbations on natural examples, pose a serious security threat. In this study, we propose the adversarial separation network (AS-Net) to protect the speaker recognition system against adversarial attacks. Our proposed AS-Net is featured by its ability to separate adversarial perturbation from the test speech to restore the natural clean speech. As a standalone component, each input speech is pre-processed by AS-Net first. Furthermore, we incorporate the compression structure and the speaker quality loss to enhance the capacity of the AS-Net. Experimental results on the VCTK dataset demonstrated that the AS-Net effectively enhanced the robustness of speaker recognition systems against adversarial examples. It also significantly outperformed other state-of-the-art adversarial-detection mechanisms, including adversarial perturbation elimination network (APE-GAN), feature squeezing, and adversarial training.

Speakers: Hanyi Zhang (Yunnan University) , Jianguo Wei (Tianjin University) , Kong Aik Lee (Biometrics Research Laboratories, NEC Corporation) , Longbiao Wang (Tianjin University) , Meng Liu (Tianjin University) , Yunchun Zhang (Yunnan University)
• 20:30
Mon-2-10-9 Text-Independent Speaker Verification with Dual Attention Network 1h

This paper presents a novel design of attention model for text-independent speaker verification. The model takes a pair of input utterances and generates an utterance-level embedding to represent speaker-specific characteristics in each utterance. The input utterances are expected to have highly similar embeddings if they are from the same speaker. The proposed attention model consists of a self-attention module and a mutual attention module, which jointly contribute to the generation of the utterance-level embedding. The self-attention weights are computed from the utterance itself while the mutual-attention weights are computed with the involvement of the other utterance in the input pair. As a result, each utterance is represented by a self-attention weighted embedding and a mutual-attention weighted embedding. The similarity between the embeddings is measured by a cosine distance score and a binary classifier output score. The whole model, named Dual Attention Network, is trained end-to-end on the Voxceleb database. The evaluation results on the Voxceleb 1 test set show that the Dual Attention Network significantly outperforms the baseline systems. The best result yields an equal error rate of 1.6%.

Speakers: Jingyu Li (The Chinese University of Hong Kong) , Tan Lee (The Chinese University of Hong Kong)
• 20:30 21:30
Mon-2-11 ASR model training and strategies room11

### room11

Chairs: Michael Seltzer, Dan Povey

https://zoom.com.cn/j/66725122123

• 20:30
Mon-2-11-1 Minimum Bayes Risk Training of RNN-Transducer for End-to-End Speech Recognition 1h

In this work, we propose minimum Bayes risk (MBR) training of RNN-Transducer (RNN-T) for end-to-end speech recognition. Specifically, initialized with a RNN-T trained model, MBR training is conducted via minimizing the expected edit distance between the reference label sequence and on-the-fly generated N-best hypothesis. We also introduce a heuristic to incorporate an external neural network language model (NNLM) in RNN-T beam search decoding and explore MBR training with the external NNLM. Experimental results demonstrate an MBR trained model outperforms a RNN-T trained model substantially and further improvements can be achieved if trained with an external NNLM. Our best MBR trained system achieves absolute character error rate (CER) reductions of 1.2% and 0.5% on read and spontaneous Mandarin speech respectively over a strong convolution and transformer based RNN-T baseline trained on ∼21,000 hours of speech.
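
The quantity minimised in MBR training, the expected edit distance over an N-best list, can be sketched as below. This shows only the loss value, not the gradient computation, and renormalising the hypothesis probabilities over the N-best list is a common choice assumed here rather than a detail taken from the abstract. Function names are hypothetical.

```python
import math

def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def expected_edit_distance(reference, nbest):
    """nbest: list of (hypothesis, log_prob) pairs from beam search.

    Log-probabilities are renormalised over the list (an assumption),
    then used to weight each hypothesis's edit distance to the reference.
    """
    max_lp = max(lp for _, lp in nbest)
    weights = [math.exp(lp - max_lp) for _, lp in nbest]
    total = sum(weights)
    return sum(w / total * edit_distance(reference, hyp)
               for w, (hyp, _) in zip(weights, nbest))
```

Lowering this value moves probability mass from hypotheses with many errors toward hypotheses close to the reference, which is the intuition behind the training criterion.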

Speakers: Chao Weng (Tencent AI Lab) , Chengzhu Yu (Tencent) , Chunlei Zhang (Tencent AI Lab) , Dong Yu (Tencent AI Lab) , Jia Cui (Tencent)
• 20:30
Mon-2-11-2 Semantic Mask for Transformer based End-to-End Speech Recognition 1h

Attention-based encoder-decoder model has achieved impressive results for both automatic speech recognition (ASR) and text-to-speech (TTS) tasks. This approach takes advantage of the memorization capacity of neural networks to learn the mapping from the input sequence to the output sequence from scratch, without the assumption of prior knowledge such as the alignments. However, this model is prone to overfitting, especially when the amount of training data is limited. Inspired by SpecAugment and BERT, in this paper, we propose a semantic mask based regularization for training such kind of end-to-end (E2E) model. The idea is to mask the input features corresponding to a particular output token, e.g., a word or a word-piece, in order to encourage the model to fill the token based on the contextual information. While this approach is applicable to the encoder-decoder framework with any type of neural network architecture, we study the transformer-based model for ASR in this work. We perform experiments on Librispeech 960h and TedLium2 data sets, and achieve the state-of-the-art performance in the scope of attention based E2E models.
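
The masking idea can be sketched as follows, assuming a token-to-frame alignment is already available (for example from a forced aligner). The masking probability, the choice of the utterance mean as fill value, and the function name are assumptions for illustration; the abstract does not specify them.

```python
import random

def semantic_mask(features, token_spans, mask_prob=0.15, rng=None):
    """Mask all frames belonging to randomly chosen output tokens.

    features:    T x D list of acoustic feature frames.
    token_spans: list of (start, end) frame indices per token, end exclusive.
    mask_prob:   probability of masking each token (assumed value).
    Returns a new T x D list; the input is left unchanged.
    """
    rng = rng or random.Random()
    num_frames = len(features)
    dim = len(features[0])
    # Fill value: per-dimension mean over the utterance (an assumption).
    mean = [sum(frame[d] for frame in features) / num_frames for d in range(dim)]
    masked = [list(frame) for frame in features]
    for start, end in token_spans:
        if rng.random() < mask_prob:
            for t in range(start, end):
                masked[t] = list(mean)
    return masked
```

Unlike masking arbitrary time blocks, each masked region here covers exactly one word or word-piece, so the model has to reconstruct that token from its context.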

Speakers: Chengyi Wang (Nankai University) , Guoli Ye (Microsoft) , Jinyu Li (Microsoft) , Liang Lu (Microsoft) , Ming Zhou (Microsoft Research Asia) , Sheng Zhao (Microsoft) , Shujie Liu (Microsoft Research Asia) , Shuo Ren (Beihang University) , Yu Wu (Microsoft Research Asia) , Yujiao Du (Alibaba Corporation)
• 20:30
Mon-2-11-3 Fast, Simpler and More Accurate Hybrid ASR Systems Using Wordpieces 1h

In this work, we first show that on the widely used LibriSpeech benchmark, our transformer-based context-dependent connectionist temporal classification (CTC) system produces state-of-the-art results. We then show that using wordpieces as modeling units combined with CTC training, we can greatly simplify the engineering pipeline compared to conventional frame-based cross-entropy training by excluding all the GMM bootstrapping, decision tree building and force alignment steps, while still achieving very competitive word-error-rate. Additionally, using wordpieces as modeling units can significantly improve runtime efficiency since we can use larger stride without losing accuracy. We further confirm these findings on two internal VideoASR datasets: German, which is similar to English as a fusional language, and Turkish, which is an agglutinative language.

Speakers: Chunxi Liu (Facebook AI, USA) , Frank Zhang (Facebook AI, USA) , Geoffrey Zweig (Facebook AI, USA) , Xiaohui Zhang (Facebook AI, USA) , Yatharth Saraf (Facebook AI, USA) , Yongqiang Wang (Facebook AI, USA)
• 20:30
Mon-2-11-4 A Federated Approach in Training Acoustic Models 1h

In this paper, a novel platform for Acoustic Model training based on Federated Learning (FL) is described. This is the first attempt to introduce Federated Learning techniques in Speech Recognition (SR) tasks. Besides the novelty of the task, the paper describes an easily generalizable FL platform and presents the design decisions used for this task. Amongst the novel algorithms introduced is a hierarchical optimization scheme employing pairs of optimizers and an algorithm for gradient selection, leading to improvements in training time and SR performance. The gradient selection algorithm is based on weighting the gradients during the aggregation step. It effectively acts as a regularization process right before the gradient propagation. This process may address one of the FL challenges, i.e. training on vastly heterogeneous data. The experimental validation of the proposed system is based on the LibriSpeech task, presenting a speed-up of 1.5x and 6% WERR. The proposed Federated Learning system appears to outperform the gold standard of distributed training in both convergence speed and overall model performance. Further improvements have been observed in internal tasks.

Speakers: Dimitrios Dimitriadis (Microsoft) , Kenichi Kumatani (Amazon Inc.) , Robert Gmyr (Microsoft) , Sefik Emre Eskimez (Microsoft) , Yashesh Gaur (Microsoft)
• 20:30
Mon-2-11-5 On Semi-Supervised LF-MMI Training of Acoustic Models with Limited Data 1h

This work investigates semi-supervised training of acoustic models (AM) with the lattice-free maximum mutual information (LF-MMI) objective in practically relevant scenarios with a limited amount of labeled in-domain data. An error detection driven semi-supervised AM training approach is proposed, in which an error detector controls the hypothesized transcriptions or lattices used as LF-MMI training targets on additional unlabeled data. Under this approach, our first method uses a single error-tagged hypothesis whereas our second method uses a modified supervision lattice. These methods are evaluated and compared with existing semi-supervised AM training methods in three different matched or mismatched, limited data setups. Word error recovery rates of 28 to 89% are reported.

Speakers: Emmanuel Vincent (Inria) , Imran Sheikh (Inria) , Irina Illina (LORIA/INRIA)
• 20:30
Mon-2-11-6 On Front-end Gain Invariant Modeling for Wake Word Spotting 1h

Wake word (WW) spotting is challenging in far-field due to the complexities and variations in acoustic conditions and the environmental interference in signal transmission. A suite of carefully designed and optimized audio front-end (AFE) algorithms helps mitigate these challenges and provides better quality audio signals to downstream modules such as the WW spotter. Since the WW model is trained with the AFE-processed audio data, its performance is sensitive to AFE variations, such as gain changes. In addition, when deploying to new devices, the WW performance is not guaranteed because the AFE is unknown to the WW model. To address these issues, we propose a novel approach that uses a new feature called ΔLFBE to decouple the AFE gain variations from the WW model. We modified the neural network architectures to accommodate the delta computation, with the feature extraction module unchanged. We evaluate our WW models using data collected from real household settings and show that the models with ΔLFBE are robust to AFE gain changes. Specifically, when the AFE gain changes by up to ±12 dB, the baseline CNN model lost up to 19.0% in false alarm rate or 34.3% in false reject rate, while the model with ΔLFBE demonstrates no performance loss.
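
Why a delta feature can be insensitive to front-end gain is easy to check numerically. The sketch below assumes that ΔLFBE is the frame-to-frame difference of log filter-bank energies (LFBE), which the abstract does not spell out, and it uses toy band energies instead of a real filter bank. A constant gain adds the same offset to every log energy, so the offset cancels in the difference.

```python
import math

def log_energies(frames, gain=1.0):
    """Log filter-bank energies for toy per-band power values.

    A gain g applied to the waveform scales amplitude by g and
    therefore power by g**2, i.e. adds 2*log(g) to every log energy.
    """
    return [[math.log(e * gain ** 2) for e in frame] for frame in frames]

def delta(features):
    """Per-band difference between consecutive frames (assumed delta)."""
    return [[b - a for a, b in zip(prev, curr)]
            for prev, curr in zip(features, features[1:])]

# Toy example: three frames, two bands, with and without a +12 dB gain.
frames = [[1.0, 2.0], [4.0, 8.0], [2.0, 1.0]]
gain_12db = 10 ** (12 / 20)
base = delta(log_energies(frames))
boosted = delta(log_energies(frames, gain_12db))
```

Here `base` and `boosted` agree up to floating-point error, whereas the undifferenced log energies are shifted by `2 * math.log(gain_12db)` in every band.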

Speakers: Noah D. Stein (Amazon) , Chieh-Chi Kao (Amazon) , Ming Sun (Amazon) , Shiv Vitaladevuni (Amazon) , Tao Zhang (Amazon) , Yixin Gao (Amazon) , Yunliang Cai (Amazon)
• 20:30
Mon-2-11-7 Unsupervised Regularization-Based Adaptive Training for Speech Recognition 1h

In this paper, we propose two novel regularization-based speaker adaptive training approaches for connectionist temporal classification (CTC) based speech recognition. The first method is center loss (CL) regularization, which is used to penalize the distances between the embeddings of different speakers and the only center. The second method is speaker variance loss (SVL) regularization in which we directly minimize the speaker interclass variance during model training. Both methods achieve the purpose of training an adaptive model on the fly by adding regularization terms to the training loss function. Our experiment on the AISHELL-1 Mandarin recognition task shows that both methods are effective at adapting the CTC model without requiring any specific fine-tuning or additional complexity, achieving character error rate improvements of up to 8.1% and 8.6% over the speaker independent (SI) model, respectively.

Speakers: Bin Gu (University of Science and Technology of China) , Fenglin Ding (University of Science and Technology of China) , Jun Du (University of Science and Technology of China) , Wu Guo (University of Science and Technology of China) , Zhenhua Ling (University of Science and Technology of China)
• 20:30
Mon-2-11-8 On the Robustness and Training Dynamics of Raw Waveform Models 1h

We investigate the robustness and training dynamics of raw waveform acoustic models for automatic speech recognition (ASR). It is known that the first layer of such models learns a set of filters, performing a form of time-frequency analysis. This layer is liable to be under-trained owing to gradient vanishing, which can negatively affect the network performance. Through a set of experiments on TIMIT, Aurora-4 and WSJ datasets, we investigate the training dynamics of the first layer by measuring the evolution of its average frequency response over different epochs. We demonstrate that the network efficiently learns an optimal set of filters with a high spectral resolution and that the dynamics of the first layer correlate highly with the dynamics of the cross entropy (CE) loss and word error rate (WER). In addition, we study the robustness of raw waveform models in both matched and mismatched conditions. The accuracy of these models is found to be comparable to, or better than, their MFCC-based counterparts in matched conditions and notably improved by using a better alignment. The role of raw waveform normalisation was also examined and up to 4.3% absolute WER reduction in mismatched conditions was achieved.

Speakers: Erfan Loweimi (The University of Edinburgh) , Peter Bell (University of Edinburgh) , Steve Renals (University of Edinburgh)
• 20:30
Mon-2-11-9 Iterative Pseudo-Labeling for Speech Recognition 1h

Pseudo-labeling has recently shown promise in end-to-end automatic speech recognition (ASR). We study Iterative Pseudo-Labeling (IPL), a semi-supervised algorithm which efficiently performs multiple iterations of pseudo-labeling on unlabeled data as the acoustic model evolves. In particular, IPL fine-tunes an existing model at each iteration using both labeled data and a subset of unlabeled data. We study the main components of IPL: decoding with a language model and data augmentation. We then demonstrate the effectiveness of IPL by achieving state-of-the-art word-error rate on the Librispeech test sets in both standard and low-resource setting. We also study the effect of language models trained on different corpora to show IPL can effectively utilize additional text. Finally, we release a new large in-domain text corpus which does not overlap with the Librispeech training transcriptions to foster research in low-resource, semi-supervised ASR.

Speakers: Awni Hannun (Facebook AI Research) , Gabriel Synnaeve (Facebook AI Research) , Jacob Kahn (Facebook AI Research) , Qiantong Xu (Facebook) , Ronan Collobert (Facebook AI Research) , Tatiana Likhomanenko (Facebook AI Research)
• 20:30 21:30
Mon-2-2 ASR neural network architectures and training I room2

### room2

Chairs: Chiori Hori , Yu Zhang

https://zoom.com.cn/j/68442490755

• 20:30
Mon-2-2-1 Fast and Slow Acoustic Model 1h

In this work we lay out a Fast & Slow (F&S) acoustic model (AM) in an encoder-decoder architecture for streaming automatic speech recognition (ASR). The Slow model represents our baseline ASR model; it is significantly larger than the Fast model and provides stronger accuracy. The Fast model is generally developed for related speech applications. It has weaker ASR accuracy but is faster to evaluate and consequently leads to better user-perceived latency. We propose a joint F&S model that encodes output state information from the Fast model and feeds it to the Slow model to improve the overall accuracy of the F&S AM. We demonstrate scenarios where individual Fast and Slow models are already available to build the joint F&S model. We apply our work on a large vocabulary ASR task. Compared to the Slow AM, our Fast AM is 3-4x smaller and 11.5% relatively weaker in ASR accuracy. The proposed F&S AM achieves 4.7% relative gain over the Slow AM. We also report a progression of techniques and improve the relative gain to 8.1% by encoding additional Fast AM outputs. Our proposed framework has generic attributes - we demonstrate a specific extension by encoding two Slow models to achieve 12.2% relative gain.

Speakers: Emilian Stoimenov (Microsoft Corp) , Hosam Khalil (Microsoft Corp) , Jian Wu (Microsoft Corp) , Kshitiz Kumar (Microsoft Corp)
• 20:30
Mon-2-2-2 Self-Distillation for Improving CTC-Transformer-based ASR Systems 1h

We present a novel training approach for encoder-decoder-based sequence-to-sequence (S2S) models. The important key factor of S2S is the attention mechanism as it captures the relationships between input and output sequences. The attention weights inform which time frames should be attended to for predicting the output labels. In previous work, we proposed distilling S2S knowledge into connectionist temporal classification (CTC) models by using the attention characteristics to create pseudo-targets for an auxiliary cross entropy loss term. This approach can significantly improve CTC models. However, it remained unclear whether our proposal could be used to improve S2S models. In this paper, we extend our previous work to create a strong S2S model, i.e. Transformer with CTC (CTC-Transformer). We utilize Transformer outputs and the source attention weights for making pseudo-targets that contain both the posterior and the timing information of each Transformer output. These pseudo-targets are used to train the shared encoder of the CTC-Transformer so as to consider the direct feedback from the Transformer-decoder and obtain more informative representations. Experiments on various tasks demonstrate that our proposal is also effective for enhancing S2S model training. In particular, our best system on Japanese ASR task outperforms the previous state-of-the-art alternative.

Speakers: Hiroshi Sato (NTT media intelligent laboratory) , Marc Delcroix (NTT Communication Science Laboratories) , Ryo Masumura (NTT Corporation) , Shigeki Karita (NTT Communication Science Laboratories) , Takafumi Moriya (NTT Corporation) , Takanori Ashihara (NTT Corporation) , Tomohiro Tanaka (NTT Corporation) , Tsubasa Ochiai (NTT Communication Science Laboratories) , Yusuke Shinohara (NTT Corporation)
• 20:30
Mon-2-2-3 Single headed attention based sequence-to-sequence model for state-of-the-art results on Switchboard 1h

It is generally believed that direct sequence-to-sequence (seq2seq) speech recognition models are competitive with hybrid models only when a large amount of data, at least a thousand hours, is available for training. In this paper, we show that state-of-the-art recognition performance can be achieved on the Switchboard-300 database using a single headed attention, LSTM based model. Using a cross-utterance language model, our single-pass speaker independent system reaches 6.4% and 12.5% word error rate (WER) on the Switchboard and CallHome subsets of Hub5'00, without a pronunciation lexicon. While careful regularization and data augmentation are crucial in achieving this level of performance, experiments on Switchboard-2000 show that nothing is more useful than more data. Overall, the combination of various regularizations and a simple but fairly large model results in a new state of the art, 4.8% and 8.3% WER on the Switchboard and CallHome sets, using SWB-2000 without any external data resources.

Speakers: Brian Kingsbury (IBM Research) , George Saon (IBM) , Kartik Audhkhasi (IBM Research) , Zoltán Tüske (IBM Research)
• 20:30
Mon-2-2-4 Improving Speech Recognition using GAN-based Speech Synthesis and Contrastive Unspoken Text Selection 1h

Text-to-Speech synthesis (TTS) based data augmentation is a relatively new mechanism for utilizing text-only data to improve automatic speech recognition (ASR) training without parameter or inference architecture changes.
However, efforts to train speech recognition systems on synthesized utterances suffer from limited acoustic diversity of TTS outputs.
Additionally, the text-only corpus is always much larger than the transcribed speech corpus by several orders of magnitude, which makes speech synthesis of all the text data impractical. In this work, we propose to combine generative adversarial network (GAN) and multi-style training (MTR) to increase acoustic diversity in the synthesized data.
We also present a contrastive language model-based data selection technique to improve the efficiency of learning from unspoken text.
We demonstrate that our proposed method allows ASR models to learn from synthesis of large-scale unspoken text sources and achieves a 35% relative WER reduction on a voice-search task.

Speakers: Andrew Rosenberg (Google LLC) , Bhuvana Ramabhadran (Google) , Gary Wang (Simon Fraser University) , Pedro Moreno (google inc.) , Yu Zhang (Google) , Zhehuai Chen (Google)
• 20:30
Mon-2-2-5 PyChain: A Fully Parallelized PyTorch Implementation of LF-MMI for End-to-End ASR 1h

We present PyChain, a fully parallelized PyTorch implementation of end-to-end lattice-free maximum mutual information (LF-MMI) training for the so-called chain models in the Kaldi automatic speech recognition (ASR) toolkit. Unlike other PyTorch and Kaldi based ASR toolkits, PyChain is designed to be as flexible and light-weight as possible so that it can be easily plugged into a new ASR project, or other existing PyTorch-based ASR tools, as exemplified respectively by a new project PyChain-example, and Espresso, an existing end-to-end ASR toolkit. PyChain’s efficiency and flexibility are demonstrated through such novel features as full GPU training on numerator/denominator graphs, and support for unequal length sequences. Experiments on the WSJ dataset show that with simple neural networks and commonly used machine learning techniques, PyChain can achieve competitive results that are comparable to Kaldi and better than other end-to-end ASR systems.

Speakers: Dan Povey (Xiaomi, Inc.) , Sanjeev Khudanpur (Johns Hopkins University) , Yiming Wang (Johns Hopkins University) , Yiwen Shao (Center for Language and Speech Processing,Johns Hopkins University)
• 20:30
Mon-2-2-6 CAT: A CTC-CRF based ASR Toolkit Bridging the Hybrid and the End-to-end Approaches towards Data Efficiency and Low Latency 1h

In this paper, we present a new open source toolkit for speech recognition, named CAT (CTC-CRF based ASR Toolkit). CAT inherits the data-efficiency of the hybrid approach and the simplicity of the E2E approach, providing a full-fledged implementation of CTC-CRFs and complete training and testing scripts for a number of English and Chinese benchmarks.
Experiments show CAT obtains state-of-the-art results, which are comparable to the fine-tuned hybrid models in Kaldi but with a much simpler training pipeline. Compared to existing non-modularized E2E models, CAT performs better on limited-scale datasets, demonstrating its data efficiency.
Furthermore, we propose a new method called contextualized soft forgetting, which enables CAT to do streaming ASR without accuracy degradation.
We hope CAT, especially the CTC-CRF based framework and software, will be of broad interest to the community, and can be further explored and improved.

Speakers: Hongyu Xiang (Tsinghua University) , Zhijian Ou (Department of Electronic Engineering, Tsinghua University) , keyu An (Tsinghua University)
• 20:30
Mon-2-2-7 CTC-synchronous Training for Monotonic Attention Model 1h

Monotonic chunkwise attention (MoChA) has been studied for the online streaming automatic speech recognition (ASR) based on a sequence-to-sequence framework.
In contrast to connectionist temporal classification (CTC), backward probabilities cannot be leveraged in the alignment marginalization process during training due to left-to-right dependency in the decoder.
This results in the error propagation of alignments to subsequent token generation.
To address this problem, we propose CTC-synchronous training (CTC-ST), in which MoChA uses CTC alignments to learn optimal monotonic alignments.
Reference CTC alignments are extracted from a CTC branch sharing the same encoder with the decoder.
The entire model is jointly optimized so that the expected boundaries from MoChA are synchronized with the alignments.
Experimental evaluations of the TEDLIUM release-2 and Librispeech corpora show that the proposed method significantly improves recognition, especially for long utterances.
We also show that CTC-ST can bring out the full potential of SpecAugment for MoChA.

Speakers: Hirofumi Inaguma (Kyoto University) , Masato Mimura (Kyoto University) , Tatsuya Kawahara (Kyoto University)
• 20:30
Mon-2-2-8 Continual Learning for Multi-Dialect Acoustic Models 1h

Using data from multiple dialects has shown promise in improving neural network acoustic models. While such training can improve the performance of an acoustic model on a single dialect, it can also produce a model capable of good performance on multiple dialects. However, training an acoustic model on pooled data from multiple dialects takes a significant amount of time and computing resources, and it needs to be retrained every time a new dialect is added to the model. In contrast, sequential transfer learning (fine-tuning) does not require retraining using all data, but may result in catastrophic forgetting of previously-seen dialects. Using data from four English dialects, we demonstrate that by using loss functions that mitigate catastrophic forgetting, sequential transfer learning can be used to train multi-dialect acoustic models that narrow the WER gap between the best (combined training) and worst (fine-tuning) case by up to 65%. Continual learning shows great promise in minimizing training time while approaching the performance of models that require much more training time.

Speakers: Brady Houston (Amazon) , Katrin Kirchhoff (Amazon)
• 20:30
Mon-2-2-9 SpecSwap: A Simple Data Augmentation Method for End-to-End Speech Recognition 1h

Recently, End-to-End (E2E) models have achieved state-of-the-art performance for automatic speech recognition (ASR). Within these large and deep models, overfitting remains an important problem that heavily influences the model performance. One solution to deal with the overfitting problem is to increase the quantity and variety of the training data with the help of data augmentation. In this paper, we present SpecSwap, a simple data augmentation scheme for automatic speech recognition that acts directly on the spectrogram of input utterances. The augmentation policy consists of swapping blocks of frequency channels and swapping blocks of time steps. We apply SpecSwap to Transformer-based networks for the end-to-end speech recognition task. Our experiments on Aishell-1 show state-of-the-art performance for E2E models that are trained solely on the speech training data. Further, by increasing the depth of the model, the Transformers trained with augmentations can outperform certain hybrid systems, even without the aid of a language model.
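
The augmentation policy described above can be illustrated with a minimal sketch. It assumes the spectrogram is a plain list of time frames, each a list of frequency-bin values; the function names, the fixed block widths and the uniform choice of block positions are hypothetical and are not taken from the paper.

```python
import random

def swap_two_blocks(items, width, rng):
    """Return a copy of `items` with two non-overlapping blocks of `width` swapped."""
    n = len(items)
    if width <= 0 or 2 * width > n:
        return list(items)  # too short to hold two blocks: leave unchanged
    # Pick the first block, then a second block that starts after it ends.
    first = rng.randint(0, n - 2 * width)
    second = rng.randint(first + width, n - width)
    out = list(items)
    out[first:first + width] = items[second:second + width]
    out[second:second + width] = items[first:first + width]
    return out

def spec_swap(spec, freq_width, time_width, seed=None):
    """Swap two blocks of time steps and two blocks of frequency channels.

    `spec` is a list of frames; each frame is a list of frequency-bin values.
    Block widths and uniform sampling are assumptions made for illustration.
    """
    rng = random.Random(seed)
    # Time swap: exchange two blocks of whole frames.
    frames = swap_two_blocks(spec, time_width, rng)
    # Frequency swap: apply the same channel permutation to every frame.
    order = swap_two_blocks(list(range(len(frames[0]))), freq_width, rng)
    return [[frame[f] for f in order] for frame in frames]
```

The input is not modified, and the output holds the same values in a different arrangement, so the transform can be applied on the fly during training.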

Speakers: Dan Su (Tencent AILab Shenzhen) , Helen Meng (The Chinese University of Hong Kong) , Xingchen Song (Tsinghua University) , Yiheng Huang (Tencent AI Lab) , Zhiyong Wu (Tsinghua University)
• 20:30 21:30
Mon-2-3 Evaluation of Speech Technology Systems and Methods for Resource Construction and Annotation room3

### room3

Chairs: Petra Wagner , Steve Renals

https://zoom.com.cn/j/61951480857

• 20:30
Mon-2-3-1 RECOApy: Data recording, pre-processing and phonetic transcription for end-to-end speech-based applications 1h

Deep learning enables the development of efficient end-to-end speech processing applications while bypassing the need for expert linguistic and signal processing features. Yet, recent studies show that good quality speech resources and phonetic transcription of the training data can enhance the results of these applications. In this paper, the RECOApy tool is introduced. RECOApy streamlines the steps of data recording and pre-processing required in end-to-end speech-based applications. The tool implements an easy-to-use interface for prompted speech recording, spectrogram and waveform analysis, utterance-level normalisation and silence trimming, as well as grapheme-to-phoneme conversion of the prompts in eight languages: Czech, English, French, German, Italian, Polish, Romanian and Spanish.

The grapheme-to-phoneme (G2P) converters are deep neural network (DNN) based architectures trained on lexicons extracted from the Wiktionary online collaborative resource. Because the languages differ in their degree of orthographic transparency and in the number of phonetic entries available, the DNNs' hyperparameters are optimised with an evolution strategy. The phoneme and word error rates of the resulting G2P converters are presented and discussed. The tool, the processed phonetic lexicons and trained G2P models are made freely available.

Speaker: Adriana Stan (Communications Department, Technical University of Cluj-Napoca)
• 20:30
Mon-2-3-2 Analyzing the Quality and Stability of a Streaming End-to-End On-Device Speech Recognizer 1h

The demand for fast and accurate incremental speech recognition increases as the applications of automatic speech recognition (ASR) proliferate. Incremental speech recognizers output chunks of partially recognized words while the user is still talking. Partial results can be revised before the ASR finalizes its hypothesis, causing instability issues. We analyze the quality and stability of on-device streaming end-to-end (E2E) ASR models. We first introduce a novel set of metrics that quantify the instability at word and segment levels. We study the impact of several model training techniques that improve E2E model qualities but degrade model stability. We categorize the causes of instability and explore various solutions to mitigate them in a streaming E2E ASR system.

Speakers: Francoise Beaufays (Google) , Ian McGraw (Google) , Katie Knister (Google) , Yanzhang He (Google) , Yuan Shangguan (facebook)
• 20:30
Mon-2-3-3 Statistical Testing on ASR Performance via Blockwise Bootstrap 1h

A common question in automatic speech recognition (ASR) evaluations is how reliable an observed word error rate (WER) improvement between two ASR systems is; statistical hypothesis testing and confidence intervals (CI) can be used to tell whether the improvement is real or only due to random chance. The bootstrap resampling method has been popular for such significance analysis because it is intuitive and easy to use. However, this method fails in dealing with dependent data, which is prevalent in the speech world: for example, ASR performance on utterances from the same speaker could be correlated. In this paper we present a blockwise bootstrap approach: by dividing evaluation utterances into nonoverlapping blocks, this method resamples these blocks instead of the original data. We show that the resulting variance estimator of the absolute WER difference between two ASR systems is consistent under mild conditions. We also demonstrate the validity of the blockwise bootstrap method on both synthetic and real-world speech data.
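
The resampling idea can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' implementation: blocks are taken to be speakers, each utterance carries its reference word count and the error counts of the two systems, and a percentile interval is used; the function name and input layout are hypothetical.

```python
import random
from collections import defaultdict

def blockwise_bootstrap_ci(utts, n_boot=1000, alpha=0.05, seed=0):
    """Percentile CI for WER(A) - WER(B), resampling whole blocks.

    `utts` is an iterable of (block_id, n_ref_words, errors_a, errors_b).
    Using the speaker as block_id is an assumption for this sketch.
    """
    # Pool utterances into blocks: [total words, errors of A, errors of B].
    blocks = defaultdict(lambda: [0, 0, 0])
    for block_id, n_words, err_a, err_b in utts:
        totals = blocks[block_id]
        totals[0] += n_words
        totals[1] += err_a
        totals[2] += err_b
    block_list = list(blocks.values())

    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Draw as many blocks as there are, with replacement.
        sample = [rng.choice(block_list) for _ in block_list]
        words = sum(b[0] for b in sample)
        err_diff = sum(b[1] for b in sample) - sum(b[2] for b in sample)
        diffs.append(err_diff / words)

    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

If the returned interval excludes zero, the observed WER difference is unlikely to be due to chance at the chosen level; resampling utterances individually instead of blocks would ignore the within-speaker correlation mentioned above.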

Speakers: Fuchun Peng (Facebook) , Zhe Liu (Facebook, Inc)
• 20:30
Mon-2-3-4 Sentence Level Estimation of Psycholinguistic Norms Using Joint Multidimensional Annotations 1h

Psycholinguistic normatives represent various affective and mental constructs using numeric scores and are used in a variety of applications in natural language processing. They are commonly used at the sentence level, the scores of which are estimated by extrapolating word level scores using simple aggregation strategies, which may not always be optimal. In this work, we present a novel approach to estimate the psycholinguistic norms at sentence level. We apply a multidimensional annotation fusion model on annotations at the word level to estimate a parameter which captures relationships between different norms. We then use this parameter at sentence level to estimate the norms. We evaluate our approach by predicting valence, arousal and dominance on sentences from an annotated dataset and show improved performance compared to word aggregation schemes.

Speakers: Anil Ramakrishna (Amazon) , Shrikanth Narayanan (University of Southern California)
• 20:30
Mon-2-3-5 Neural Zero-Inflated Quality Estimation Model For Automatic Speech Recognition System 1h

The performance of automatic speech recognition (ASR) systems is usually evaluated by the word error rate (WER) metric when manually transcribed data are provided, which are, however, expensive to obtain in real scenarios. In addition, the empirical distribution of WER for most ASR systems usually tends to put a significant mass near zero, making it difficult to simulate with a single continuous distribution. In order to address these two issues of ASR quality estimation (QE), we propose a novel neural zero-inflated model to predict the WER of the ASR result without transcripts. We design a neural zero-inflated beta regression on top of a bidirectional transformer language model conditional on speech features (speech-BERT). We adopt the pre-training strategy of token level masked language modeling for speech-BERT as well, and further fine-tune with our zero-inflated layer for the mixture of discrete and continuous outputs. The experimental results show that our approach achieves better performance on WER prediction compared with strong baselines.
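
To make the "mixture of discrete and continuous outputs" concrete, the following sketch computes the negative log-likelihood of a textbook zero-inflated beta distribution: a point mass at zero with probability `pi_zero`, and a Beta(a, b) density otherwise. The parameterisation and the function name are assumptions for illustration; how the paper's network produces these parameters is not specified in the abstract.

```python
import math

def zero_inflated_beta_nll(y, pi_zero, a, b):
    """Negative log-likelihood of one WER value y in [0, 1).

    pi_zero: probability that the WER is exactly zero (0 < pi_zero < 1).
    a, b:    shape parameters of the Beta density used when y > 0.
    All three would be predicted by a model; here they are plain inputs.
    """
    if y == 0.0:
        # Discrete part: the utterance was recognised without errors.
        return -math.log(pi_zero)
    # Continuous part: log Beta(a, b) density evaluated at y.
    log_beta_fn = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    log_pdf = (a - 1) * math.log(y) + (b - 1) * math.log(1 - y) - log_beta_fn
    return -(math.log(1 - pi_zero) + log_pdf)
```

A single continuous density cannot place finite mass exactly at zero, which is why the zero case gets its own term in the loss.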

Speakers: Bo Li (Alibaba Group) , Boxing Chen (Alibaba) , Jiayi Wang (Alibaba Group) , Kai Fan (Alibaba Group) , Niyu Ge (IBM Research) , Shiliang Zhang (Alibaba Group) , Zhi-Jie Yan (Microsoft Research Asia)
• 20:30
Mon-2-3-6 Confidence measures in encoder-decoder models for speech recognition 1h

Recent improvements in Automatic Speech Recognition (ASR) systems have enabled the growth of myriad applications such as voice assistants, intent detection, keyword extraction and sentiment analysis. These applications, which are now widely used in the industry, are very sensitive to the errors generated by ASR systems. This could be overcome by having a reliable confidence measurement associated with the predicted output. This work presents a novel method which uses internal neural features of a frozen ASR model to train an independent neural network to predict a softmax temperature value. This value is computed in each decoder time step and multiplied by the logits in order to redistribute the output probabilities. The resulting softmax values corresponding to predicted tokens constitute a more reliable confidence measure. Moreover, this work also studies the effect of teacher forcing on the training of the proposed temperature prediction module. The output confidence estimation shows an improvement of -25.78% in EER and +7.59% in AUC-ROC with respect to the unaltered softmax values of the predicted tokens, evaluated on a proprietary dataset consisting of News and Entertainment videos.
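
The rescaling step can be sketched as below. In the paper the scaling value is predicted per decoder step by a separate network; in this illustrative example it is simply passed in as `scale`, and both function names are hypothetical.

```python
import math

def scaled_softmax(logits, scale):
    """Softmax over logits multiplied by a per-step scaling value."""
    scaled = [scale * z for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def token_confidence(logits, scale):
    """Return (predicted token index, rescaled probability of that token)."""
    probs = scaled_softmax(logits, scale)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs[best]
```

For a positive `scale` the predicted token is unchanged; only the probability attached to it moves, which is what allows the rescaled value to serve as a confidence score without altering the recognition output.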

Speakers: Alejandro Woodward (Universitat Politècnica de Catalunya) , Clara Bonnín (Vilynx) , Daivid Varas (Vilynx) , Elisenda Bou-Balust (Vilynx) , Issey Masuda (Vilynx) , Juan Carlos Riveiro (Vilynx)
• 20:30
Mon-2-3-7 Word Error Rate Estimation Without ASR Output: e-WER2 1h

Measuring the performance of automatic speech recognition (ASR) systems requires manually transcribed data in order to compute the word error rate (WER), which is often time-consuming and expensive. In this paper, we continue our effort in estimating WER using acoustic, lexical and phonotactic features. Our novel approach to estimate the WER uses a multistream end-to-end architecture. We report results for systems using internal speech decoder features (glass-box), systems without speech decoder features (black-box), and for systems without having access to the ASR system (no-box). The no-box system learns a joint acoustic-lexical representation from phoneme recognition results along with MFCC acoustic features to estimate WER. Considering WER per sentence, our no-box system achieves 0.56 Pearson correlation with the reference evaluation and 0.24 root mean square error (RMSE) across 1,400 sentences. The overall WER estimated by e-WER2 is 30.9% for a three-hour test set, while the WER computed using the reference transcriptions was 28.5%.

Speakers: Ahmed Ali (Qatar Computing Research Institute) , Steve Renals (University of Edinburgh)
• 20:30
Mon-2-3-8 An evaluation of manual and semi-automatic laughter annotation 1h

With laughter research developing in recent years, there is also an increased need for materials with laughter annotations. We examine in this study how one can leverage existing spontaneous speech resources toward this goal. We first analyze the process of manual laughter annotation in corpora, by establishing two important parameters of the process: the amount of time required and its inter-rater reliability. Next, we propose a novel semi-automatic tool for laughter annotation, based on a signal-based representation of speech rhythm. We test both annotation approaches on the same recordings, containing German dyadic spontaneous interactions, and employing a larger pool of annotators than previously done. We then compare and discuss the obtained results based on the two aforementioned parameters, highlighting the benefits and costs associated with each approach.

Speakers: Bogdan Ludusan (Bielefeld University) , Petra Wagner (Universität Bielefeld)
• 20:30
Mon-2-3-9 Understanding Racial Disparities in Automatic Speech Recognition: the case of habitual "be" 1h

Recent research has highlighted that state-of-the-art automatic speech recognition (ASR) systems exhibit a bias against African American speakers. In this research, we investigate the underlying causes of this racially based disparity in performance, focusing on a unique morpho-syntactic feature of African American English (AAE), namely habitual "be", an invariant form of "be" that encodes the habitual aspect. By looking at over 100 hours of spoken AAE, we evaluated two ASR systems -- DeepSpeech and Google Cloud Speech -- to examine how well habitual "be" and its surrounding contexts are inferred. While controlling for local language and acoustic factors such as the amount of context, noise, and speech rate, we found that habitual "be" and its surrounding words were more error prone than non-habitual "be" and its surrounding words. These findings hold both when the utterance containing "be" is processed in isolation and in conjunction with surrounding utterances within speaker turn. Our research highlights the need for equitable ASR systems to take into account dialectal differences beyond acoustic modeling.

Speakers: Joshua Martin (University of Florida) , Kevin Tang (University of Florida)
• 20:30 21:30
Mon-2-4 Phonetics and Phonology room4

### room4

Chairs: Philippe MARTIN , Zhiqiang Li

https://zoom.com.cn/j/69279928709

• 20:30
Mon-2-4-1 Secondary phonetic cues in the production of the nasal short-a system in California English 1h

A production study explored the acoustic characteristics of /æ/ in CVC and CVN words spoken by California speakers who raise /æ/ in pre-nasal contexts. Results reveal that the phonetic realization of the /æ/-/ɛ/ contrast in these contexts is multidimensional. Raised pre-nasal /æ/ is close in formant space to /ɛ/, particularly over the second half of the vowel. Yet, systematic differences in the realization of the secondary acoustic features of duration, formant movement, and degree of coarticulatory vowel nasalization keep these vowels phonetically distinct. These findings have implications for systems of vowel contrast and the use of secondary phonetic properties to maintain lexical distinctions.

Speakers: Georgia Zellou (UC Davis) , Rebecca Scarborough (University of Colorado) , Renee Kemp (UC Davis)
• 20:30
Mon-2-4-2 Acoustic properties of strident fricatives at the edges: implications for consonant discrimination 1h

Languages tend to license segmental contrasts where they are maximally perceptible, i.e. where more perceptual cues to the contrast are available. For strident fricatives, the most salient cues to the presence of voicing are low-frequency energy concentrations and fricative duration, as voiced fricatives are systematically shorter than voiceless ones. Cross-linguistically, the voicing contrast is more frequently realized word-initially than word-finally, as for obstruents. We investigate the phonetic underpinnings of this asymmetric behavior at the word edges, focusing on the availability of durational cues to the contrast in the two positions. To assess segmental duration, listeners rely on temporal markers, i.e. jumps in acoustic energy which demarcate segmental boundaries, thereby facilitating duration discrimination. We conducted an acoustic analysis of word-initial and word-final strident fricatives in American English. We found that temporal markers are sharper at the left edge of word-initial fricatives than at the right edge of word-final fricatives, in terms of absolute value of the intensity slope, in the high-frequency region. These findings allow us to make predictions about the availability of durational cues to the voicing contrast in the two positions.

Speakers: Leo Varnet (ENS) , Maria Giavazzi (Ecole Normale Supérieure) , lorenzo maselli (Scuola normale Superiore di Pisa)
• 20:30
Mon-2-4-3 Processes and Consequences of Co-articulation in Mandarin V1N.(C2)V2 Context: Phonology and Phonetics 1h

It is well known that in Mandarin Chinese (MC) nasal rhymes, non-high vowels /a/ and /e/ undergo Vowel Nasalization and Backness Feature Specification processes to harmonize with the nasal coda in both manner and place of articulation. Specifically, the vowel is specified with the [+front] feature when followed by the /n/ coda and the [+back] feature when followed by /ŋ/. On the other hand, phonetic experiments in recent research have shown that in MC disyllabic words, the nasal coda tends to undergo place assimilation in the V1N.C2V2 context and complete deletion in the V1N.V2 context.
These processes raise two questions: firstly, will V1 in V1N.C2V2 contexts also change in its backness feature to harmonize with the assimilated nasal coda? Secondly, will the duration of V1N reduce significantly after nasal coda deletion in the V1N.(G)V context?
A production experiment and a perception experiment were designed to answer these two questions. Results show that the vowel backness feature of V1 is not re-specified despite the appropriate environment, and the duration of V1N is not reduced after nasal deletion. The phonological consequences of these findings will be discussed.

Speaker: Mingqiong Luo (Shanghai International Studies University)
• 20:30
Mon-2-4-4 Voicing Distinction of Obstruents in the Hangzhou Wu Chinese Dialect 1h

This paper gives an acoustic phonetic description of the obstruents in the Hangzhou Wu Chinese dialect. Based on the data from 8 speakers (4 male and 4 female), obstruents were examined in terms of VOT, silent closure duration, segment duration, and spectral properties such as H1-H2, H1-F1 and H1-F3. Results suggest that VOT cannot differentiate the voiced obstruents from their voiceless counterparts, but the silent closure duration can. There is no voiced aspiration, and breathiness was detected on the vowel following the voiced category of obstruents. An acoustic consequence is that there is no segment for the voiced glottal fricative [ɦ], since it was realized as the breathiness on the following vowel. Interestingly, however, syllables with [ɦ] are observed to be longer than their onset-less counterparts.

Speakers: Fang Hu (Institute of Linguistics, Chinese Academy of Social Sciences) , Yang Yue (University of Chinese Academy of Social Sciences)
• 20:30
Mon-2-4-5 The phonology and phonetics of Kaifeng Mandarin vowels 1h

In the present study, we re-analyze the vowel system in Kaifeng Mandarin, adopting a phoneme-based approach. Our analysis deviates from the previous syllable-based analyses in a number of ways. First, we treat apical vowels [ɿ ʅ] as syllabic approximants and analyze them as allophones of the retroflex approximant /ɻ/. Second, the vowel inventory consists of three sets: monophthongs, diphthongs and retroflex vowels. The classification of monophthongs and diphthongs is based on the phonological distribution of the coda nasal. That is, monophthongs can be followed by a nasal coda, while diphthongs cannot. This argument has introduced two new opening diphthongs /eɛ ɤʌ/ in the inventory, which have traditionally been described as monophthongs. Our phonological characterization of the vowels in Kaifeng Mandarin is further backed up by acoustic data. It is argued that the present study has gone some way towards enhancing our understanding of Mandarin segmental phonology in general.

Speaker: Lei Wang (East China University of Science and Technology)
• 20:30
Mon-2-4-6 Microprosodic variability in plosives in German and Austrian German 1h

Fundamental frequency (F0) contours may show slight, microprosodic variations in the vicinity of plosive segments, which may have distinctive patterns relative to the place of articulation and voicing. Similarly, plosive bursts have distinctive characteristics associated with these articulatory features. The current study investigates the degree to which such microprosodic variations arise in two varieties of German, and how the two varieties differ. We find that microprosodic effects indeed arise in F0 as well as burst intensity and Center of Gravity, but that the extent of the variability is different in the two varieties under investigation, with northern German tending towards more variability in the microprosody of plosives than Austrian German. Coarticulatory effects on the burst with the following segment also arise, but also have different features in the two varieties. This evidence is consistent with the possibility that the fortis-lenis contrast is not equally stable in Austrian German and northern German.

Speakers: Barbara Schuppler (SPSC Laboratory, Graz University of Technology) , Margaret Zellers (University of Kiel)
• 20:30
Mon-2-4-7 Er-suffixation in Southwestern Mandarin: An EMA and ultrasound study 1h

This paper is an articulatory study of the er-suffixation (a.k.a. erhua) in Southwestern Mandarin (SWM), using co-registered EMA and ultrasound. Data from two female speakers in their twenties were analyzed and discussed. Our recording materials contain unsuffixed stems, er-suffixed forms and the rhotic schwa /ɚ/, a phonemic vowel in its own right. Results suggest that the er-suffixation in SWM involves suffixing a rhotic schwa [ɚ] to the stem, unlike its counterpart in Beijing and Northeastern Mandarin [5]. Specifically, an entire rime will be replaced with the er-suffix if the nucleus vowel is non-high; only high vocoids will be preserved after the er-suffixation. The “rhoticity” is primarily realized as a bunched tongue shape configuration (i.e. a domed tongue body), while the Tongue Tip gesture plays a more limited role in SWM. A phonological analysis is accordingly proposed for the er-suffixation in SWM.

Speakers: Feng-fan Hsieh (National Tsing Hua University) , Jing Huang (National Tsing Hua University) , Yueh-chin Chang (National Tsing Hua University)
• 20:30
Mon-2-4-8 Electroglottographic-Phonetic Study on Korean Phonation Induced by Tripartite Plosives in Yanbian Korean 1h

This paper examined the phonatory features induced by the tripartite plosives in Yanbian Korean, broadly considered as Hamkyungbukdo Korean dialect. Electroglottographic (EGG) and acoustic analyses were applied to five elderly Korean speakers. The results show that fortis-induced phonation is characterized by a more constricted glottis, slower spectral tilt, and higher sub-harmonic-harmonic ratio. Lenis-induced phonation is shown to be breathier, with smaller Contact Quotient and faster spectral tilt. Most articulatory and acoustic measures for the aspirated plosives pattern with the lenis; however, sporadic differences between the two indicate that the lenis induces breathier phonation. Diplophonic phonation is argued to be a salient feature of the fortis-head syllables in Yanbian Korean. The vocal fold medial compression and adductive tension mechanisms are tentatively argued to be responsible for the production of the fortis. Finally, gender difference is shown to be salient in the fortis-induced phonation.
Index Terms: phonation, electroglottography (EGG), Yanbian Korean, tripartite plosives

Speakers: Jinghua Zhang (Yanbian University) , Yinghao Li (Yanbian University)
• 20:30
Mon-2-4-9 Modeling Global Body Configurations in American Sign Language 1h

In this paper we consider the problem of computationally representing American Sign Language (ASL) phonetics. We specifically present a computational model inspired by the sequential phonological ASL representation, known as the Movement-Hold (MH) Model. Our computational model not only captures ASL phonetics but also has generative abilities. We present a Probabilistic Graphical Model (PGM) which explicitly models holds and implicitly models movement in the MH model. For evaluation, we introduce a novel data corpus, ASLing, and compare our PGM to other models (GMM, LDA, and VAE) and show its superior performance. Finally, we demonstrate our model's interpretability by computing various phonetic properties of ASL through the inspection of our learned model.

Speakers: Beck Cordes Galbraith (Sign-Speak) , Ifeoma Nwogu (Rochester Institute of Technology) , Nicholas Wilkins (Rochester Institute of Technology)
• 20:30 21:30
Mon-2-5 Topics in ASR I room5

### room5

Chairs: Ganna Raboshchuk , Sheng Li

https://zoom.com.cn/j/67438690809

• 20:30
Mon-2-5-1 Augmenting Turn-taking Prediction with Wearable Eye Activity During Conversation 1h

In a variety of conversation contexts, accurately predicting the time point at which a conversational participant is about to speak can help improve computer-mediated human-human communications. Although it is not difficult for a human to perceive turn-taking intent in conversations, it has been a challenging task for computers to date. In this study, we employed eye activity acquired from low-cost wearable hardware during natural conversation and studied how pupil diameter, blink and gaze direction could assist speech in voice activity and turn-taking prediction. Experiments on a new 2-hour corpus of natural conversational speech between six pairs of speakers wearing near-field eye video glasses revealed that the F1 score for predicting the voicing activity up to 1s ahead of the current instant can be above 80%, for speech and non-speech detection with fused eye and speech features. Further, extracting features synchronously from both interlocutors provides a relative reduction in error rate of 8.5% compared with a system based on just a single speaker. The performance of four turn-taking states based on the predicted voice activity also achieved F1 scores significantly higher than chance level. These findings suggest that wearable eye activity can play a role in future speech communication systems.

Speakers: Hang Li (UNSW) , Julien Epps (School of Electrical Engineering and Telecommunications, UNSW Australia) , Siyuan Chen (University of New South Wales)
• 20:30
Mon-2-5-10 Focal Loss for Punctuation Prediction 1h

Many approaches have been proposed to predict punctuation marks. Previous results demonstrate that these methods are effective. However, a class imbalance problem still exists during training: most of the tokens in the training set for punctuation prediction carry the non-punctuation label. This affects the performance of punctuation prediction. Therefore, this paper uses a focal loss to alleviate the issue. The focal loss can down-weight easy examples and focus training on a sparse set of hard examples. Experiments are conducted on the IWSLT2011 datasets. The results show that punctuation prediction models trained with a focal loss improve over those trained with a cross entropy loss by up to 2.7% absolute overall F1-score on the test set. The proposed model also outperforms previous state-of-the-art models.
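The down-weighting described in this abstract can be illustrated with a minimal sketch of the standard focal loss, -(1 - p_t)^gamma * log(p_t). The label set, the example logits, and the value gamma = 2 are assumptions chosen for illustration; the abstract does not state the settings used in the paper.

```python
import math

# Hypothetical label set for illustration; the abstract does not list its classes.
LABELS = ["O", "COMMA", "PERIOD", "QUESTION"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Focal loss -(1 - p_t)^gamma * log(p_t) for a single token."""
    p_t = max(softmax(logits)[target], 1e-12)  # clamp to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def cross_entropy(logits, target):
    """Cross entropy is the gamma = 0 special case of the focal loss."""
    return focal_loss(logits, target, gamma=0.0)


# An easy majority-class token ("O") and a hard minority-class token ("COMMA").
easy_logits, easy_target = [5.0, 0.0, 0.0, 0.0], LABELS.index("O")
hard_logits, hard_target = [2.0, 1.0, 0.0, 0.0], LABELS.index("COMMA")

for name, logits, target in [("easy", easy_logits, easy_target),
                             ("hard", hard_logits, hard_target)]:
    ce = cross_entropy(logits, target)
    fl = focal_loss(logits, target)
    print(f"{name}: cross entropy={ce:.4f}, focal={fl:.6f}, ratio={fl / ce:.4f}")
```

In this sketch the confidently classified token keeps only a tiny fraction of its cross entropy loss, while the misclassified token keeps most of it, which is the mechanism the abstract relies on to counter class imbalance.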

Speakers: Cunhang Fan (Institute of Automation, Chinese Academy of Sciences) , Jiangyan Yi (Institute of Automation Chinese Academy of Sciences) , Jianhua Tao (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Ye Bai (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Zhengkun Tian (Institute of Automation, Chinese Academy of Sciences)
• 20:30
Mon-2-5-2 CAM: Uninteresting Speech Detector 1h

Voice assistants such as Siri, Alexa, etc. usually adopt a pipeline to process users’ utterances, which generally includes transcribing the audio into text, understanding the text, and finally responding back to users. One potential issue is that some utterances could be devoid of any interesting speech, and are thus not worth being processed through the entire pipeline. Examples of uninteresting utterances include those that have too much noise, are devoid of intelligible speech, etc. It is therefore desirable to have a model to filter out such useless utterances before they are ingested for downstream processing, thus saving system resources. Towards this end, we propose the Combination of Audio and Metadata (CAM) detector to identify utterances that contain only uninteresting speech. Our experimental results show that the CAM detector considerably outperforms using either an audio model or a metadata model alone, which demonstrates the effectiveness of the proposed system.

Speakers: Belinda Zeng (Amazon) , Peng Yang (Amazon) , Weiyi Lu (Amazon) , Yi Xu (Amazon)
• 20:30
Mon-2-5-3 Mixed Case Contextual ASR Using Capitalization Masks 1h

End-to-end (E2E) mixed-case automatic speech recognition (ASR) systems that directly predict words in the written domain are attractive due to being simple to build, not requiring explicit capitalization models, allowing streaming capitalization without additional effort beyond that required for streaming ASR, and their small size. However, the fact that these systems produce various versions of the same word with different capitalizations, and even different word segmentations for different case variants when wordpieces (WP) are predicted, leads to multiple problems with contextual ASR. In particular, the size of and time to build contextual models grows considerably with the number of variants per word. In this paper, we propose separating orthographic recognition from capitalization, so that the ASR system first predicts a word, then predicts its capitalization in the form of a capitalization mask. We show that the use of capitalization masks achieves the same low error rate as traditional mixed case ASR, while reducing the size and compilation time of contextual models. Furthermore, we observe significant improvements in capitalization quality.
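The idea of separating a word from its capitalization can be sketched as follows. This is only an illustration under the assumption that a capitalization mask is a per-character list of booleans; the abstract does not specify the mask format used in the paper, and the function names are hypothetical.

```python
def to_mask(word):
    """Split a written-form word into (lowercase word, capitalization mask).

    Assumes lowercasing keeps the character count unchanged, which holds
    for ASCII text but not for every Unicode character.
    """
    return word.lower(), [ch.isupper() for ch in word]


def apply_mask(lower, mask):
    """Rebuild the written form from a lowercase word and its mask."""
    return "".join(ch.upper() if up else ch for ch, up in zip(lower, mask))


# Three case variants collapse to one lowercase entry, so a contextual
# model built over lowercase words needs a single entry instead of three.
variants = ["nasa", "NASA", "Nasa"]
lowercase_entries = {to_mask(v)[0] for v in variants}
print(lowercase_entries)

# The mask carries exactly the information needed to restore the casing.
lower, mask = to_mask("iPhone")
print(lower, mask, apply_mask(lower, mask))
```

Collapsing case variants in this way is what would reduce the number of entries a contextual model has to hold, in line with the size and compilation-time savings the abstract reports.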

Speakers: Diamantino Caseiro (Google Inc.) , Pat Rondon (Google Inc.) , Petar Aleksic (Google Inc.) , Quoc-Nam Le The (Google Inc.)
• 20:30
Mon-2-5-4 Speech Recognition and Multi-Speaker Diarization of Long Conversations 1h

Speech recognition (ASR) and speaker diarization (SD) models have traditionally been trained separately to produce rich conversation transcripts with speaker labels. Recent advances have shown that joint ASR and SD models can learn to leverage audio-lexical inter-dependencies to improve word diarization performance. We introduce a new benchmark of hour-long podcasts collected from the weekly This American Life radio program to better compare these approaches when applied to extended multi-speaker conversations. We find that training separate ASR and SD models performs better when utterance boundaries are known, but otherwise joint models can perform better. To handle long conversations with unknown utterance boundaries, we introduce a striding attention decoding algorithm and data augmentation techniques which, combined with model pre-training, improve ASR and SD.

Speakers: Garrison Cottrell (University of California, San Diego) , Henry Mao (University of California, San Diego) , Julian McAuley (University of California, San Diego) , Shuyang Li (University of California, San Diego)
• 20:30
Mon-2-5-5 Investigation of Data Augmentation Techniques for Disordered Speech Recognition 1h

Disordered speech recognition is a highly challenging task. The underlying neuro-motor conditions of people with speech disorders, often compounded with co-occurring physical disabilities, make it difficult to collect the large quantities of speech required for system development. This paper investigates a set of data augmentation techniques for disordered speech recognition, including vocal tract length perturbation (VTLP), tempo perturbation and speed perturbation. Both normal and disordered speech were exploited in the augmentation process. Variability among impaired speakers in both the original and augmented data was modeled using learning hidden unit contributions (LHUC) based speaker adaptive training. The final speaker adapted system constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation produced up to 2.92% absolute (9.3% relative) word error rate (WER) reduction over the baseline system without data augmentation, and gave an overall WER of 26.37% on the test set containing 16 dysarthric speakers.

Speakers: Helen Meng (Chinese University of Hong Kong) , Jianwei Yu (Chinese University of Hong Kong) , Mengzhe Geng (Chinese University of Hong Kong) , SHANSONG LIU (Chinese University of Hong Kong) , Xunying Liu (Chinese University of Hong Kong) , Xurong Xie (Chinese University of Hong Kong) , shoukang hu (Chinese University of Hong Kong)
• 20:30
Mon-2-5-6 A Real-time Robot-based Auxiliary System for Risk Evaluation of COVID-19 Infection 1h

In this paper, we propose a real-time robot-based auxiliary system for risk evaluation of COVID-19 infection. It combines real-time speech recognition, intent recognition, keyword detection, cough detection and other functions in order to convert live audio into actionable structured data to achieve the COVID-19 infection risk assessment function. In order to better evaluate the COVID-19 infection, we propose an end-to-end method for cough detection and classification for our proposed system. It is based on real human-robot conversation data and processes speech signals to detect coughs and classify them if detected. The structure of our model is kept concise so that it can be implemented for real-time applications. We further embed this entire auxiliary diagnostic system in the robot, which is placed in the community, hospital or supermarket to facilitate people’s detection. The system can be further leveraged within a business rules engine, thus serving as a foundation for real-time supervision and assistance applications. Our model comes with a pretrained, robust training environment that allows for efficient creation and customization of customer-specific health states.

Speakers: Jianzong Wang (Ping An Technology (Shenzhen) Co., Ltd.) , Jing Xiao (Ping An Technology (Shenzhen) Co., Ltd.) , Jiteng Ma (Ping An Technology (Shenzhen) Co., Ltd.) , Ning Cheng (Ping An Technology (Shenzhen) Co., Ltd.) , Wenqi Wei (Ping An Technology (Shenzhen) Co., Ltd.)
• 20:30
Mon-2-5-7 An Utterance Verification System for Word Naming Therapy in Aphasia 1h

Anomia (word finding difficulties) is the hallmark of aphasia, an acquired language disorder most commonly caused by stroke. Assessment of speech performance using picture naming tasks is therefore a key method for identifying the disorder and monitoring patients’ response to treatment interventions. Currently, this assessment is conducted manually by speech and language therapists (SLT). Surprisingly, despite advancements in ASR and artificial intelligence with technologies like deep learning, research on developing automated systems for this task has been scarce. Here we present an utterance verification system incorporating a deep learning element that classifies ‘correct’/‘incorrect’ naming attempts from aphasic stroke patients. When tested on 8 native British-English speaking aphasics, the system’s accuracy ranged from 83.6% to 93.6%, with a 10-fold cross-validation mean of 89.5%. This performance was not only significantly better than one of the leading commercially available ASRs (Google speech-to-text service) but also comparable in some instances with two independent SLT ratings for the same dataset.

Speakers: Alexander Paul Leff (Institute of Cognitive Neuroscience, University College London) , David Barbera (University College London) , Emily Upton (Institute of Cognitive Neuroscience, University College London) , Henry Coley-Fisher (Institute of Cognitive Neuroscience, University College London) , Ian Shaw (Technical Consultant at SoftV) , Jenny Crinion (Institute of Cognitive Neuroscience, University College London) , Mark Huckvale (Speech, Hearing and Phonetic Sciences, University College London) , Victoria Fleming (Speech, Hearing and Phonetic Sciences, University College London) , William Latham (Goldsmiths College University of London)
• 20:30
Mon-2-5-8 Exploiting Cross Domain Visual Feature Generation for Disordered Speech Recognition 1h

Audio-visual speech recognition (AVSR) technologies have been successfully applied to a wide range of tasks. When developing AVSR systems for disordered speech, which is characterized by severe degradation of voice quality and a large mismatch against normal speech, it is difficult to record large amounts of high quality audio-visual data. In order to address this issue, a cross-domain visual feature generation approach is proposed in this paper. An audio-visual inversion DNN system constructed using widely available out-of-domain audio-visual data was used to generate visual features for disordered speakers for whom video data is either very limited or unavailable. Experiments conducted on the UASpeech corpus suggest that the proposed cross-domain visual feature generation based AVSR system consistently outperformed the baseline ASR system and the AVSR system using original visual features. An overall word error rate reduction of 3.6% absolute (14% relative) was obtained over the previously published best system on the 8 UASpeech dysarthric speakers with audio-visual data of the same task.

Speakers: Helen Meng (The Chinese University of Hong Kong) , Jianwei Yu (The Chinese University of Hong Kong) , Mengzhe Geng (The Chinese University of Hong Kong) , Rongfeng Su (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences) , SHANSONG LIU (The Chinese University of Hong Kong) , Shi-Xiong ZHANG (Tencent AI Lab) , Xunying Liu (Chinese University of Hong Kong) , Xurong Xie (Chinese University of Hong Kong) , shoukang hu (Chinese University of Hong Kong)
• 20:30
Mon-2-5-9 Joint prediction of punctuation and disfluency in speech transcripts 1h

Spoken language transcripts generated from automatic speech recognition (ASR) often contain a large portion of disfluency and lack punctuation symbols. Punctuation restoration and disfluency removal of the transcripts can facilitate downstream tasks such as machine translation, information extraction and syntactic analysis [1]. Various studies have shown the influence between these two tasks and thus performed modeling based on a multi-task learning (MTL) framework [2, 3], which learns general representations in the shared layers and separate representations in the task-specific layers. However, task dependencies are normally ignored in the task-specific layers. To model the dependencies of tasks, we propose an attention based structure in the task-specific layers of the MTL framework incorporating the pretrained BERT (a state-of-the-art NLP-related model) [4]. Experimental results based on the English IWSLT dataset and the Switchboard dataset show the proposed architecture outperforms the separate modeling methods as well as the traditional MTL methods.

Speakers: Binghuai Lin (Tencent Technology Co., Ltd) , Liyuan Wang (Tencent Technology Co., Ltd)
• 20:30 21:30
Mon-2-7 Voice Conversion and Adaptation I room7

### room7

Chairs: Heiga Zen , Zhenhua Ling

https://zoom.com.cn/j/69983075794

• 20:30
Mon-2-7-1 Recognition-Synthesis Based Non-Parallel Voice Conversion with Adversarial Learning 1h

This paper presents an adversarial learning method for recognition-synthesis based non-parallel voice conversion. A recognizer is used to transform acoustic features into linguistic representations while a synthesizer recovers output features from the recognizer outputs together with the speaker identity. By separating the speaker characteristics from the linguistic representations, voice conversion can be achieved by replacing the speaker identity with the target one. In our proposed method, a speaker adversarial loss is adopted in order to obtain speaker-independent linguistic representations using the recognizer. Furthermore, discriminators are introduced and a generative adversarial network (GAN) loss is used to prevent the predicted features from being over-smoothed. For training model parameters, a strategy of pre-training on a multi-speaker dataset and then fine-tuning on the source-target speaker pair is designed. Our method achieved higher similarity than the baseline model that obtained the best performance in Voice Conversion Challenge 2018.

Speakers: Li-Rong Dai (University of Science and Technology of China) , Jing-Xuan Zhang (University of Science and Technology of China) , Zhen-Hua Ling (University of Science and Technology of China)
• 20:30
Mon-2-7-2 Improving the Speaker Identity of Non-Parallel Many-to-Many Voice Conversion with Adversarial Speaker Recognition 1h

Phonetic Posteriorgrams (PPGs) have received much attention for non-parallel many-to-many Voice Conversion (VC), and have been shown to achieve state-of-the-art performance. These methods implicitly assume that PPGs are speaker-independent and contain only linguistic information in an utterance. In practice, however, PPGs carry speaker individuality cues, such as accent, intonation, and speaking rate. As a result, these cues can leak into the voice conversion, making it sound similar to the source speaker. To address this issue, we propose an adversarial learning approach that can remove speaker-dependent information in VC models based on a PPG2speech synthesizer. During training, the encoder output of a PPG2speech synthesizer is fed to a classifier trained to identify the corresponding speaker, while the encoder is trained to fool the classifier. As a result, a more speaker-independent representation is learned. The proposed method is advantageous as it does not require pre-training the speaker classifier, and the adversarial speaker classifier is jointly trained with the PPG2speech synthesizer end-to-end. We conduct objective and subjective experiments on the CSTR VCTK Corpus under standard and one-shot VC conditions. Results show that the proposed method significantly improves the speaker identity of VC syntheses when compared with a baseline system trained without adversarial learning.

Speakers: Guanlong Zhao (Texas A&M University) , Ricardo Gutierrez-Osuna (Texas A&M University) , Shaojin Ding (Texas A&M University)
• 20:30
Mon-2-7-3 Non-parallel Many-to-many Voice Conversion with PSR-StarGAN 1h

Voice Conversion (VC) aims at modifying a source speaker's speech to sound like that of a target speaker while preserving the linguistic information of the given speech. StarGAN-VC was recently proposed, which utilizes a variant of Generative Adversarial Networks (GAN) to perform non-parallel many-to-many VC. However, the quality of the generated speech is not satisfactory enough. An improved method named "PSR-StarGAN-VC" is proposed in this paper by incorporating three improvements. Firstly, perceptual loss functions are introduced to optimize the generator in StarGAN-VC, aiming to learn high-level spectral features. Secondly, considering that Switchable Normalization (SN) could learn different operations in different normalization layers of the model, it is introduced to replace Batch Normalization (BN) in StarGAN-VC. Lastly, a Residual Network (ResNet) is applied to establish the mapping of different layers between the encoder and decoder of the generator, aiming to retain more semantic features when converting speech and to reduce the difficulty of training. Experimental results on the VCC 2018 datasets demonstrate the superiority of the proposed method in terms of naturalness and speaker similarity.

Speakers: Binbin Chen (vivo AI Lab) , Dongxiang Xu (Nanjing University of Posts and Telecommunications) , Yan Zhang (JIT) , Yang Wang (vivo AI Lab) , Yanping Li (Nanjing University of Posts and Telecommunications)
• 20:30
Mon-2-7-4 TTS Skins: Speaker Conversion via ASR 1h

We present a fully convolutional wav-to-wav network for converting between speakers' voices, without relying on text. Our network is based on an encoder-decoder architecture, where the encoder is pre-trained for the task of Automatic Speech Recognition, and a multi-speaker waveform decoder is trained to reconstruct the original signal in an autoregressive manner. We train the network on narrated audiobooks, and demonstrate multi-voice TTS in those voices, by converting the voice of a TTS robot.

Speakers: Adam Polyak (Facebook) , Lior Wolf (Tel Aviv University) , Yaniv Taigman (Facebook)
• 20:30
Mon-2-7-5 GAZEV: GAN-Based Zero Shot Voice Conversion over Non-parallel Speech Corpus 1h

Non-parallel many-to-many voice conversion has recently attracted substantial research effort in the speech processing community. A voice conversion system transforms an utterance of a source speaker into another utterance of a target speaker by keeping the content of the original utterance and replacing its vocal features with those of the target speaker. Existing solutions, e.g., StarGAN-VC2, present promising results only when a speech corpus of the engaged speakers is available during model training. AUTOVC is able to perform voice conversion on unseen speakers, but it needs an external pretrained speaker verification model. In this paper, we present our new GAN-based zero-shot voice conversion solution, called GAZEV, which aims to support unseen speakers on both source and target utterances. Our key technical contribution is the adoption of a speaker embedding loss on top of the GAN framework, as well as an adaptive instance normalization strategy, in order to address the limitations of speaker identity transfer in existing solutions. Our empirical evaluations demonstrate significant performance improvement on output speech quality, and comparable speaker similarity to AUTOVC.

Speakers: Bingsheng He (National University of Singapore) , Zhenjie Zhang (Yitu) , zining zhang (National University of Singapore)
• 20:30
Mon-2-7-6 Spoken Content and Voice Factorization for Few-shot Speaker Adaptation 1h

The low similarity and naturalness of synthesized speech remain a challenging problem for speaker adaptation with few resources. Since the acoustic model is too complex to interpret, overfitting will occur when training with little data. To prevent the model from overfitting, this paper proposes a novel speaker adaptation framework that decomposes the parameter space of the end-to-end acoustic model into two parts, one for predicting the spoken content and the other for modeling the speaker's voice. The spoken content is represented by a phone posteriorgram (PPG), which is speaker independent. By adapting the two sub-modules separately, overfitting can be alleviated effectively. Moreover, we propose two different adaptation strategies based on whether the data has text annotation. In this way, speaker adaptation can also be performed without text annotations. Experimental results confirm the adaptability of our proposed method of factorizing spoken content and voice. Listening tests demonstrate that our proposed method can achieve better performance with just 10 sentences than speaker adaptation conducted on Tacotron in terms of naturalness and speaker similarity.

Speakers: Jiangyan Yi (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Jianhua Tao (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Rongxiu Zhong (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Ruibo Fu (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Tao Wang (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Zhengqi Wen (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences)
• 20:30
Mon-2-7-7 Unsupervised Cross-Domain Singing Voice Conversion 1h

We present a wav-to-wav generative model for the task of singing voice conversion from any identity. Our method utilizes both an acoustic model, trained for the task of automatic speech recognition, together with melody extracted features to drive a waveform-based generator. The proposed generative architecture is invariant to the speaker's identity and can be trained to generate target singers from unlabeled training data, using either speech or singing sources. The model is optimized in an end-to-end fashion without any manual supervision, such as lyrics, musical notes or parallel samples. The proposed approach is fully-convolutional and can generate audio in real-time. Experiments show that our method significantly outperforms the baseline methods while generating convincingly better audio samples than alternative attempts.

Speakers: Adam Polyak (Facebook) , Lior Wolf (Tel Aviv University) , Yaniv Taigman (Facebook) , Yossi Adi (Facebook AI Research)
• 20:30
Mon-2-7-8 Attention-Based Speaker Embeddings for One-Shot Voice Conversion 1h

This paper proposes a novel approach to embed speaker information into feature vectors at frame level using an attention mechanism, and its application to one-shot voice conversion. A one-shot voice conversion system is a type of voice conversion system where only one utterance from a target speaker is available for conversion. In many one-shot voice conversion systems, a speaker encoder mechanism compresses an utterance of the target speaker into a fixed-size vector for propagating speaker information. However, the obtained representation has lost temporal information related to speaker identities, which could degrade conversion quality. To alleviate this problem, we propose a novel way to embed speaker information using an attention mechanism. Instead of compressing into a fixed-size vector, our proposed speaker encoder outputs a sequence of speaker embedding vectors. The obtained sequence is selectively combined with input frames of a source speaker by an attention mechanism. Finally, the obtained time-varying speaker information is utilized by a decoder to generate the converted features. Objective evaluation showed that our method reduced the averaged mel-cepstrum distortion to 5.23 dB from 5.34 dB compared with the baseline system. The subjective preference test showed that our proposed system outperformed the baseline one.
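The step of selectively combining a sequence of speaker embeddings with source frames can be sketched with scaled dot-product attention. The choice of scaled dot-product scoring, the function name `attend_speaker_sequence`, and the toy vectors are assumptions for illustration; the abstract does not specify the attention variant used in the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend_speaker_sequence(source_frames, speaker_seq):
    """Return one speaker vector per source frame.

    Each source frame acts as a query over the target speaker's embedding
    sequence, so the speaker information varies over time instead of
    being a single fixed-size vector.
    """
    dim = len(speaker_seq[0])
    outputs = []
    for query in source_frames:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in speaker_seq]
        weights = softmax(scores)
        outputs.append([sum(w * key[d] for w, key in zip(weights, speaker_seq))
                        for d in range(dim)])
    return outputs


# Toy data: two target-speaker embedding vectors, three source frames.
speaker_seq = [[1.0, 0.0], [0.0, 1.0]]
source_frames = [[10.0, 0.0], [0.0, 10.0], [0.0, 0.0]]
for frame, vec in zip(source_frames, attend_speaker_sequence(source_frames, speaker_seq)):
    print(frame, "->", [round(v, 3) for v in vec])
```

In this toy setup each source frame pulls in the speaker embedding it matches best, and a frame with no preference receives the average of the sequence.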

Speakers: Daisuke Saito (The University of Tokyo) , Tatsuma Ishihara (GREE Inc.)
• 20:30
Mon-2-7-9 Data Efficient Voice Cloning from Noisy Samples with Domain Adversarial Training 1h

Data efficient voice cloning aims at synthesizing a target speaker's voice with only a few enrollment samples at hand. To this end, speaker adaptation and speaker encoding are two typical methods based on a base model trained from multiple speakers. The former uses a small set of target speaker data to transfer the multi-speaker model to the target speaker's voice through direct model update, while in the latter, only a few seconds of the target speaker's audio goes directly through an extra speaker encoding model along with the multi-speaker model to synthesize the target speaker's voice without model update. Nevertheless, the two methods need clean target speaker data. However, the samples provided by users may inevitably contain acoustic noise in real applications. It is still challenging to generate the target voice with noisy data. In this paper, we study the data efficient voice cloning problem from noisy samples under the sequence-to-sequence based TTS paradigm. Specifically, we introduce domain adversarial training (DAT) to speaker adaptation and speaker encoding, which aims to disentangle noise from the speech-noise mixture. Experiments show that for both speaker adaptation and encoding, the proposed approaches can consistently synthesize clean speech from noisy speaker samples, clearly outperforming the method adopting a state-of-the-art speech enhancement module.

Speakers: Guanglu Wan (Meituan-Dianping Group) , Guoqiao Yu (Meituan-Dianping Group) , Jian Cong (Northwestern Polytechnical University) , Lei Xie (Northwestern Polytechnical University) , Shan Yang (Northwestern Polytechnical University)
• 20:30 21:30
Mon-2-8 Acoustic Event Detection room8

### room8

Chairs: Akinori Ito , Kunio Kashino

https://zoom.com.cn/j/63352125526

• 20:30
Mon-2-8-1 Gated Multi-head Attention Pooling for Weakly Labelled Audio Tagging 1h

Multiple instance learning (MIL) has recently been used for weakly labelled audio tagging, where the spectrogram of an audio signal is divided into segments to form instances in a bag, and then the low-dimensional features of these segments are pooled for tagging. The choice of a pooling scheme is the key to exploiting the weakly labelled data. However, the traditional pooling schemes are usually fixed and unable to distinguish the contributions, making it difficult to adapt to the characteristics of the sound events. In this paper, a novel pooling algorithm is proposed for MIL, named gated multi-head attention pooling (GMAP), which is able to attend to the information of events from different heads at different positions. Each head allows the model to learn information from different representation subspaces. Furthermore, in order to avoid the redundancy of multi-head information, a gating mechanism is used to fuse individual head features. The proposed GMAP increases the modeling power of the single-head attention with no computational overhead. Experiments are carried out on AudioSet, which is a large-scale weakly labelled dataset, and show superior results to the non-adaptive pooling and the vanilla attention pooling schemes.

Speakers: Sixin Hong (Peking University) , Wenwu Wang (Center for Vision, Speech and Signal Processing, University of Surrey, UK) , Yuexian Zou (ADSPLAB, School of ECE, Peking University, Shenzhen)
• 20:30
Mon-2-8-10 SpeechMix - Augmenting Deep Sound Recognition using Hidden Space Interpolations 1h

This paper presents SpeechMix, a regularization and data augmentation technique for deep sound recognition. Our strategy is to create virtual training samples by interpolating speech samples in hidden space. SpeechMix has the potential to generate an infinite number of new augmented speech samples since the combination of speech samples is continuous. Thus, it allows downstream models to avoid overfitting drastically. Unlike other mixing strategies that only work on the input space, we apply our method on the intermediate layers to capture a broader representation of the feature space. Through an extensive quantitative evaluation, we demonstrate the effectiveness of SpeechMix in comparison to standard learning regimes and previously applied mixing strategies. Furthermore, we highlight how different hidden layers contribute to the improvements in classification using an ablation study.
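The hidden-space interpolation described in this abstract can be sketched on a toy network. The single ReLU layer, the weights, the one-hot labels, and the function names are all hypothetical; mixing the labels with the same coefficient as the hidden vectors follows the usual mixup recipe and is an assumption here, since the abstract does not give those details.

```python
def relu_layer(x, weights):
    """One fully connected ReLU layer (no bias), standing in for the
    intermediate layer at which the mixing is applied."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def interpolate(a, b, lam):
    """Convex combination lam * a + (1 - lam) * b, element by element."""
    return [lam * ai + (1.0 - lam) * bi for ai, bi in zip(a, b)]


def mix_in_hidden_space(x1, y1, x2, y2, weights, lam):
    """Create one virtual training sample by mixing two samples after the
    hidden layer instead of at the input, along with their labels."""
    h1 = relu_layer(x1, weights)
    h2 = relu_layer(x2, weights)
    return interpolate(h1, h2, lam), interpolate(y1, y2, lam)


# Toy inputs with one-hot labels; the weights are arbitrary illustrative values.
weights = [[1.0, -1.0], [0.5, 0.5]]
x1, y1 = [1.0, 0.0], [1.0, 0.0]
x2, y2 = [0.0, 1.0], [0.0, 1.0]

hidden, label = mix_in_hidden_space(x1, y1, x2, y2, weights, lam=0.25)
print("mixed hidden:", hidden)
print("mixed label:", label)
```

Because `lam` can take any value in [0, 1], every pair of samples yields a continuum of virtual samples, which is the sense in which the abstract speaks of an unlimited supply of augmented data. In practice the coefficient would typically be drawn at random for each pair, for example with `random.betavariate`, though that sampling choice is also an assumption.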

Speakers: Amit Jindal (Manipal Institute of Technology) , Aniket Didolkar (Manipal Institute of Technology) , Arijit Ghosh Chowdhury (Manipal Institute of Technology) , Di Jin (MIT) , Narayanan Elavathur Ranganatha (Manipal Academy of Higher Education) , Rajiv Ratn Shah (IIIT Delhi) , Ramit Sawhney (Netaji Subhas Institute of Technology)
• 20:30
Mon-2-8-2 Environmental Sound Classification with Parallel Temporal-spectral Attention 1h

Convolutional neural networks (CNN) are one of the best-performing neural network architectures for environmental sound classification (ESC). Recently, temporal attention mechanisms have been used in CNN to capture the useful information from the relevant time frames for audio classification, especially for weakly labelled data where the onset and offset times of the sound events are not provided. In these methods, however, the inherent spectral characteristics and variations are not explicitly exploited when obtaining the deep features. In this paper, we propose a novel parallel temporal-spectral attention mechanism for CNN to learn discriminative sound representations, which enhances the temporal and spectral features by capturing the importance of different time frames and frequency bands. Parallel branches are constructed to allow temporal attention and spectral attention to be applied respectively in order to mitigate interference from the segments without the presence of sound events. The experiments on three environmental sound classification (ESC) datasets and two acoustic scene classification (ASC) datasets show that our method improves the classification performance and also exhibits robustness to noise.

Speakers: Helin Wang (Peking University) , Wenwu Wang (University of Surrey) , Yuexian Zou (Peking University Shenzhen Graduate School) , dading chong (Peking University ShenZhen Graduate School)
• 20:30
Mon-2-8-3 Contrastive Predictive Coding of Audio with an Adversary 1h

With the vast amount of audio data available, powerful sound representations can be learned with self-supervised methods even in the absence of explicit annotations. In this work we investigate learning general audio representations directly from raw signals using the Contrastive Predictive Coding objective. We further extend it by leveraging ideas from adversarial machine learning to produce additive perturbations that effectively make the learning harder, such that the predictive tasks will not be distracted by trivial details. We also look at the effects of different design choices for the objective, including the nonlinear similarity measure and the way the negatives are drawn. Combining these contributions, our models are able to considerably outperform previous spectrogram-based unsupervised methods. On AudioSet we observe a relative improvement of 14% in mean average precision over the state of the art with half the size of the training data.

Speakers: Aaron van den Oord (DeepMind) , Kazuya Kawakami (DeepMind) , Luyu Wang (DeepMind)
• 20:30
Mon-2-8-4 Memory Controlled Sequential Self Attention for Sound Recognition 1h

In this paper we investigate the importance of the extent of memory in sequential self attention for sound recognition. We propose to use a memory controlled sequential self attention mechanism on top of a convolutional recurrent neural network (CRNN) model for polyphonic sound event detection (SED). Experiments on the URBAN-SED dataset demonstrate the impact of the extent of memory on sound recognition performance with the self attention induced SED model. We extend the proposed idea with a multi-head self attention mechanism where each attention head processes the audio embedding with explicit attention width values. The proposed use of memory controlled sequential self attention offers a way to induce relations among frames of sound event tokens. We show that our memory controlled self attention model achieves an event based F-score of 33.92% on the URBAN-SED dataset, outperforming the F-score of 20.10% reported by the model without self attention.

Speakers: Arjun Pankajakshan (Queen Mary University of London) , Emmanouil Benetos (Queen Mary University of London) , Helen L. Bear (Queen Mary University of London) , Vinod Subramanian (Queen Mary University of London)
• 20:30
Mon-2-8-5 Dual Stage Learning based Dynamic Time-Frequency Mask Generation for Audio Event Classification 1h

Audio based event recognition becomes quite challenging in real world noisy environments. To alleviate the noise issue, time-frequency mask based feature enhancement methods have been proposed. While these methods with fixed filter settings have been shown to be effective in familiar noise backgrounds, they become brittle when exposed to unexpected noise. To address the unknown noise problem, we develop an approach based on dynamic filter generation learning. In particular, we propose a dual-stage dynamic filter generator network that can be trained to generate a time-frequency mask specifically created for each input audio. Two alternative approaches of training the mask generator network are developed for feature enhancement in high noise environments. Our proposed method shows improved performance and robustness in both clean and unseen noise environments.

Speakers: David Han (US Army Research Laboratory) , Donghyeon Kim (Korea university) , Hanseok Ko (Korea University) , Jaihyun Park (Korea University)
• 20:30
Mon-2-8-6 An Effective Perturbation based Semi-Supervised Learning Method for Sound Event Detection 1h

Mean teacher based methods are increasingly achieving state-of-the-art performance for large-scale weakly labeled and unlabeled sound event detection (SED) tasks in recent DCASE challenges.
By penalizing inconsistent predictions under different perturbations, mean teacher methods can exploit large-scale unlabeled data in a self-ensembling manner.
In this paper, an effective perturbation based semi-supervised learning (SSL) method is proposed based on the mean teacher method.
Specifically, a new independent component (IC) module is proposed to introduce perturbations for different convolutional layers, designed as a combination of batch normalization and dropblock operations.
The proposed IC module can reduce correlation between neurons to improve performance.
A global statistics pooling based attention module is further proposed to explicitly model inter-dependencies between the time-frequency domain and channels, using statistics information (e.g. mean, standard deviation, max) along different dimensions.
This can provide an effective attention mechanism to adaptively re-calibrate the output feature map.
Experimental results on Task 4 of the DCASE2018 challenge demonstrate the superiority of the proposed method, achieving about 39.8% F1-score, outperforming the previous winning system's 32.4% by a significant margin.
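
The two ingredients of a generic mean teacher setup, as described at the start of this abstract, can be sketched as follows: the teacher's weights track an exponential moving average of the student's, and a consistency term penalizes disagreement between their predictions on the same (differently perturbed) input. This is an assumption-laden sketch, not the proposed IC or attention modules; the function names, the decay value and the squared-error consistency term are illustrative choices.

```python
def ema_update(teacher_weights, student_weights, decay=0.999):
    """Move each teacher weight towards the matching student weight."""
    return [
        decay * t + (1.0 - decay) * s
        for t, s in zip(teacher_weights, student_weights)
    ]


def consistency_loss(student_probs, teacher_probs):
    """Mean squared difference between student and teacher predictions."""
    diffs = [(s - t) ** 2 for s, t in zip(student_probs, teacher_probs)]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    teacher = ema_update([1.0, 2.0], [0.0, 0.0], decay=0.9)
    print(teacher)
    # Unlabeled clips contribute only through this consistency term.
    print(consistency_loss([0.2, 0.8], [0.3, 0.7]))
```

Because the consistency term needs no labels, it is the part of the objective that lets unlabeled clips influence training.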

Speakers: Ian McLoughlin (ICT Cluster, Singapore Institute of Technology) , Jie Yan (National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China) , Li-Rong Dai (National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China) , Lin Liu (iFLYTEK Research, iFLYTEK CO., LTD, Hefei) , Xu Zheng (National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China) , Yan Song (National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China)
• 20:30
Mon-2-8-7 A Joint Framework for Audio Tagging and Weakly Supervised Acoustic Event Detection Using DenseNet with Global Average Pooling 1h

This paper proposes a network architecture mainly designed for audio tagging, which can also be used for weakly supervised acoustic event detection (AED). The proposed network consists of a modified DenseNet as the feature extractor, and a global average pooling (GAP) layer to predict frame-level labels at inference time. This architecture is inspired by the work proposed by Zhou et al., a well-known framework using GAP to localize visual objects given image-level labels. While most of the previous works on weakly supervised AED used recurrent layers with attention-based mechanisms to localize acoustic events, the proposed network directly localizes events using the feature map extracted by DenseNet without any recurrent layers. In the audio tagging task of DCASE 2017, our method significantly outperforms the state-of-the-art method in F1 score by 5.3% on the dev set, and 6.0% on the eval set in terms of absolute values. For the weakly supervised AED task in DCASE 2018, our model outperforms the state-of-the-art method in event-based F1 by 8.1% on the dev set, and 0.5% on the eval set in terms of absolute values, by using data augmentation and tri-training to leverage unlabeled data.

Speakers: Bowen Shi (Toyota Technological Institute at Chicago) , Chao Wang (Amazon.com) , Chieh-Chi Kao (Amazon.com) , Ming Sun (Amazon.com)
• 20:30
Mon-2-8-8 Intra-Utterance Similarity Preserving Knowledge Distillation for Audio Tagging 1h

Knowledge Distillation (KD) is a popular area of research for reducing the size of large models while still maintaining good performance. The outputs of larger teacher models are used to guide the training of smaller student models. Given the repetitive nature of acoustic events, we propose to leverage this information to regulate the KD training for Audio Tagging. This novel KD method, “Intra-Utterance Similarity Preserving KD” (IUSP), shows promising results for the audio tagging task. It is motivated by the previously published KD method: “Similarity Preserving KD” (SP). However, instead of preserving the pairwise similarities between inputs within a mini-batch, our method preserves the pairwise similarities between the frames of a single input utterance. Our proposed KD method, IUSP, shows consistent improvements over SP across student models of different sizes on the DCASE 2019 Task 5 dataset for audio tagging. IUSP’s improvement in micro AUPRC over the baseline is 27.1% to 122.4% larger than SP’s improvement over the baseline.
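
The idea of preserving pairwise similarities between the frames of one utterance can be sketched as below. The sketch follows the general recipe of similarity-preserving distillation (row-normalized Gram matrices compared with a squared Frobenius norm) applied over frames; the exact normalization and weighting used in the paper are not given in the abstract, so these details and the function names are assumptions.

```python
import math


def normalized_gram(frames):
    """Frame-by-frame dot-product similarities, with each row scaled to unit L2 norm."""
    gram = [[sum(a * b for a, b in zip(u, v)) for v in frames] for u in frames]
    normalized = []
    for row in gram:
        norm = math.sqrt(sum(x * x for x in row)) or 1.0  # guard against all-zero rows
        normalized.append([x / norm for x in row])
    return normalized


def similarity_preserving_loss(teacher_frames, student_frames):
    """Mean squared difference between teacher and student frame-similarity matrices.

    Teacher and student may use different embedding sizes; only the number
    of frames has to match.
    """
    g_teacher = normalized_gram(teacher_frames)
    g_student = normalized_gram(student_frames)
    n = len(g_teacher)
    squared = sum(
        (a - b) ** 2
        for row_t, row_s in zip(g_teacher, g_student)
        for a, b in zip(row_t, row_s)
    )
    return squared / (n * n)


if __name__ == "__main__":
    teacher = [[1.0, 0.0], [0.0, 1.0]]  # two dissimilar frames
    student = [[1.0], [1.0]]            # student treats them as identical
    print(similarity_preserving_loss(teacher, student))
```

The loss is zero whenever the student reproduces the teacher's pattern of frame similarities, regardless of the scale or dimensionality of its embeddings.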

Speakers: Chao Wang (Amazon.com) , Chieh-Chi Kao (Amazon.com) , Chun-Chieh Chang (Johns Hopkins University) , Ming Sun (Amazon.com)
• 20:30
Mon-2-8-9 Two-stage Polyphonic Sound Event Detection Based on Faster R-CNN-LSTM with Multi-token Connectionist Temporal Classification 1h

We propose a two-stage sound event detection (SED) model to deal with sound events overlapped in time-frequency. In the first stage, which consists of a Faster R-CNN and an attention-LSTM, each log-mel spectrogram segment is divided into one or more proposed regions (PRs) according to the coordinates of a region proposal network. To efficiently train polyphonic sound, we take only one PR for each sound event from a bounding box regressor associated with the attention-LSTM. In the second stage, the original input image and the difference image between adjacent segments are separately pooled according to the coordinate of each PR predicted in the first stage. Then, two feature maps using CNNs are concatenated and processed further by LSTM. Finally, CTC-based n-best SED is conducted using the softmax output from the CNN-LSTM, where CTC has two tokens for each event so that the start and ending time frames are accurately detected. Experiments on SED using DCASE 2019 Task 3 show that the proposed two-stage model with multi-token CTC achieves an F1-score of 97.5%, while the first stage alone and the two-stage model with a conventional CTC yield F1-scores of 91.9% and 95.6%, respectively.

Speakers: Hong Kook Kim (Professor) , Inyoung Park (Ph. D. Student)
• 20:30 21:30
Mon-2-9 Spoken Language Understanding I room9

### room9

Chairs: Yannik Estève, Yuanzhe Zhang

https://zoom.com.cn/j/64287533785

• 20:30
Mon-2-9-1 End-to-End Neural Transformer Based Spoken Language Understanding 1h

Spoken language understanding (SLU) refers to the process of inferring the semantic information from audio signals. While the neural transformers consistently deliver the best performance among the state-of-the-art neural architectures in the field of natural language processing (NLP), their merits in a closely related field, i.e., spoken language understanding (SLU), have not been investigated. In this paper, we introduce an end-to-end neural transformer-based SLU model that can predict the variable-length domain, intent, and slots vectors embedded in an audio signal with no intermediate token prediction architecture. This new architecture leverages the self-attention mechanism by which the audio signal is transformed to various sub-subspaces, allowing it to extract the semantic context implied by an utterance. Our end-to-end transformer SLU predicts the domains, intents and slots in the Fluent Speech Commands dataset with accuracy equal to 98.1%, 99.6%, and 99.6%, respectively, and outperforms the SLU models that leverage a combination of recurrent and convolutional neural networks by 1.4%, while the size of our model is 25% smaller than that of these architectures. Additionally, due to independent sub-space projections in the self-attention layer, the model is highly parallelizable, which makes it a good candidate for on-device SLU.

Speakers: Athanasios Mouchtaris (Amazon Inc) , Jimmy Kunnzmann (Amazon Inc) , Martin Radfar (Amazon Inc)
• 20:30
Mon-2-9-10 Are Neural Open-Domain Dialog Systems Robust to Speech Recognition Errors in the Dialog History? An Empirical Study 1h

Large end-to-end neural open-domain chatbots are becoming increasingly popular. However, research on building such chatbots has typically assumed that the user input is written in nature and it is not clear whether these chatbots would seamlessly integrate with automatic speech recognition (ASR) models to serve the speech modality. We aim to bring attention to this important question by empirically studying the effects of various types of synthetic and actual ASR hypotheses in the dialog history on TransferTransfo, a state-of-the-art Generative Pre-trained Transformer (GPT) based neural open-domain dialog system from the NeurIPS ConvAI2 challenge. We observe that TransferTransfo trained on written data is very sensitive to such hypotheses introduced to the dialog history during inference time. As a baseline mitigation strategy, we introduce synthetic ASR hypotheses to the dialog history during training and observe marginal improvements, demonstrating the need for further research into techniques to make end-to-end open-domain chatbots fully speech-robust. To the best of our knowledge, this is the first study to evaluate the effects of synthetic and actual ASR hypotheses on a state-of-the-art neural open-domain dialog system and we hope it promotes speech-robustness as an evaluation criterion in open-domain dialog.

Speakers: Behnam Hedayatnia (Amazon) , Dilek Hakkani-Tur (Amazon Alexa AI) , Karthik Gopalakrishnan (Amazon Alexa AI) , Longshaokan Wang (Amazon) , Yang Liu (Amazon)
• 20:30
Mon-2-9-2 Jointly Encoding Word Confusion Network and Dialogue Context with BERT for Spoken Language Understanding 1h

Spoken Language Understanding (SLU) converts hypotheses from an automatic speech recognizer (ASR) into structured semantic representations. ASR recognition errors can severely degrade the performance of the subsequent SLU module. To address this issue, word confusion networks (WCNs) have been used as the input for SLU, which contain richer information than 1-best or n-best hypothesis lists. To further eliminate ambiguity, the last system act of dialogue context is also utilized as additional input. In this paper, a novel BERT based SLU model (WCN-BERT SLU) is proposed to encode WCNs and the dialogue context jointly. It can integrate both structural information and ASR posterior probabilities of WCNs in the BERT architecture. Experiments on DSTC2, a benchmark of SLU, show that the proposed method is effective and can outperform previous state-of-the-art models significantly.

Speakers: Chen Liu (Shanghai Jiao Tong University) , Kai Yu (Shanghai Jiao Tong University) , Lu Chen (Shanghai Jiao Tong University) , Ruisheng Cao (Shanghai Jiao Tong University) , Su Zhu (Shanghai Jiao Tong University) , Zijian Zhao (Shanghai Jiao Tong University)
• 20:30
Mon-2-9-3 Speech To Semantics: Improve ASR and NLU Jointly via All-Neural Interfaces 1h

We consider the problem of spoken language understanding (SLU) of extracting natural language intents and associated slot arguments or named entities from speech that is primarily directed at voice assistants. Such a system subsumes both automatic speech recognition (ASR) as well as natural language understanding (NLU). An end-to-end joint SLU model can be built to a required specification opening up the opportunity to deploy on hardware constrained scenarios like devices enabling voice assistants to work offline, in a privacy preserving manner, whilst also reducing server costs.

We first present models that extract utterance intent directly from speech without intermediate text output. We then present a compositional model, which generates the transcript using the Listen Attend Spell ASR system and then extracts interpretation using a neural NLU model. Finally, we contrast these methods to a jointly trained end-to-end SLU model, consisting of ASR and NLU subsystems which are connected by a neural network based interface instead of text, that produces transcripts as well as NLU interpretation. We show that the jointly trained model improves ASR by incorporating semantic information from NLU and also improves NLU by exposing it to ASR confusion encoded in the hidden layer.

Speakers: Anirudh Raju (Amazon) , Ariya Rastrow (Amazon.com) , Bach Bui (Amazon Alexa) , Milind Rao (Applied Scientist) , Pranav Dheram (Amazon Alexa)
• 20:30
Mon-2-9-4 Pretrained Semantic Speech Embeddings for End-to-End Spoken Language Understanding via Cross-Modal Teacher-Student Learning 1h

Spoken language understanding is typically based on pipeline architectures including speech recognition and natural language understanding steps. These components are optimized independently to allow usage of available data, but the overall system suffers from error propagation. In this paper, we propose a novel training method that enables pretrained contextual embeddings to process acoustic features. In particular, we extend it with an encoder of pretrained speech recognition systems in order to construct end-to-end spoken language understanding systems. Our proposed method is based on the teacher-student framework across speech and text modalities that aligns the acoustic and the semantic latent spaces. Experimental results on three benchmarks show that our system reaches performance comparable to the pipeline architecture without using any training data and outperforms it after fine-tuning with ten examples per class on two out of three benchmarks.

Speakers: Ngoc Thang Vu (University of Stuttgart) , Pavel Denisov (University of Stuttgart)
• 20:30
Mon-2-9-5 Context Dependent RNNLM for Automatic Transcription of Conversations 1h

Conversational speech, while being unstructured at an utterance level, typically has a macro topic which provides larger context spanning multiple utterances. The current language models in speech recognition systems using recurrent neural networks (RNNLM) rely mainly on the local context and exclude the larger context. In order to model the long term dependencies of words across multiple sentences, we propose a novel architecture where the words from prior utterances are converted to an embedding. The relevance of these embeddings for the prediction of the next word in the current sentence is found using a gating network. The relevance weighted context embedding vector is augmented in the language model to improve the next word prediction, and the entire model including the context embedding and the relevance weighting layers is jointly learned for a conversational language modeling task. Experiments are performed on two conversational datasets, the AMI corpus and the Switchboard corpus. In these tasks, we illustrate that the proposed approach yields significant improvements in language model perplexity over the RNNLM baseline. In addition, the use of the proposed conversational LM for ASR rescoring results in an absolute WER reduction of 1.2% on the Switchboard dataset and 1.0% on the AMI dataset over the RNNLM based ASR baseline.

Speakers: Srikanth Raj Chetupalli (Indian Institute of Science, Bangalore) , Sriram Ganapathy (Indian Institute of Science, Bangalore, India)
• 20:30
Mon-2-9-6 Improving End-to-End Speech-to-Intent Classification with Reptile 1h

End-to-end spoken language understanding (SLU) systems have many advantages over conventional pipeline systems, but collecting in-domain speech data to train an end-to-end system is costly and time consuming. One question arises from this: how to train an end-to-end SLU with limited amounts of data? Many researchers have explored approaches that make use of other related data resources, typically by pre-training parts of the model on high-resource speech recognition. In this paper, we suggest improving the generalization performance of SLU models with a non-standard learning algorithm, Reptile. Though Reptile was originally proposed for model-agnostic meta learning, we argue that it can also be used to directly learn a target task and result in better generalization than conventional gradient descent. In this work, we apply Reptile to the task of end-to-end spoken intent classification. Experiments on four datasets of different languages and domains show improvement of intent prediction accuracy, both when Reptile is used alone and used in addition to pre-training.
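
For readers unfamiliar with Reptile, its update rule can be sketched in a few lines: run several ordinary SGD steps from the current parameters, then move the parameters only part of the way towards the result. The sketch below applies it to a toy quadratic loss; the helper names, learning rates and the toy gradient are illustrative assumptions and say nothing about the SLU models or hyperparameters used in the paper.

```python
def sgd_steps(theta, batches, grad_fn, lr):
    """Run one plain SGD step per batch, starting from a copy of theta."""
    phi = list(theta)
    for batch in batches:
        grads = grad_fn(phi, batch)
        phi = [p - lr * g for p, g in zip(phi, grads)]
    return phi


def reptile_step(theta, batches, grad_fn, inner_lr=0.1, outer_lr=0.5):
    """One Reptile update: interpolate from theta towards the SGD-adapted weights."""
    phi = sgd_steps(theta, batches, grad_fn, inner_lr)
    return [t + outer_lr * (p - t) for t, p in zip(theta, phi)]


def quadratic_grad(params, target):
    """Gradient of the toy loss 0.5 * sum((p - target)^2) (hypothetical stand-in)."""
    return [p - target for p in params]


if __name__ == "__main__":
    theta = [0.0]
    for _ in range(20):
        # Three inner steps per outer update, all pulling towards 1.0.
        theta = reptile_step(theta, [1.0, 1.0, 1.0], quadratic_grad)
    print(theta)
```

Setting `outer_lr=1.0` recovers plain SGD over the inner batches, so the interpolation factor is what distinguishes Reptile from conventional training in this sketch.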

Speakers: Philip John Gorinski (Huawei Noah's Ark Lab) , Yusheng Tian (Huawei Noah’s Ark Lab, London)
• 20:30
Mon-2-9-7 Speech to Text Adaptation: Towards an Efficient Cross-Modal Distillation 1h

Speech is one of the most effective means of communication and is full of information that helps the transmission of the utterer's thoughts. However, mainly due to the cumbersome processing of acoustic features, phoneme or word posterior probability has frequently been discarded in understanding the natural language. Thus, some recent spoken language understanding (SLU) modules have utilized end-to-end structures that preserve the uncertainty information. This further reduces the propagation of speech recognition error and guarantees computational efficiency. We claim that in this process, the speech comprehension can benefit from the inference of massive pre-trained language models (LMs). We transfer the knowledge from a concrete Transformer-based text LM to an SLU module which can face a data shortage, based on recent cross-modal distillation methodologies. We demonstrate the validity of our proposal through the performance on Fluent Speech Commands, an English SLU benchmark. Thereby, we experimentally verify our hypothesis that the knowledge could be shared from the top layer of the LM to a fully speech-based module, in which the abstracted speech is expected to meet the semantic representation.

Speakers: Donghyun Kwak (Search Solution Inc.) , Jiwon Yoon (Department of Electrical and Computer Engineering and INMC, Seoul National University) , Nam Soo Kim (Seoul National University) , Won Ik Cho (Department of Electrical and Computer Engineering and INMC, Seoul National University)
• 20:30
Mon-2-9-8 Towards an ASR error robust Spoken Language Understanding System 1h

A modern Spoken Language Understanding (SLU) system usually contains two sub-systems, Automatic Speech Recognition (ASR) and Natural Language Understanding (NLU), where ASR transforms voice signal to text form and NLU provides intent classification and slot filling from the text. In practice, such decoupled ASR/NLU design facilitates fast model iteration for both components. However, this makes downstream NLU susceptible to errors from the upstream ASR, causing significant performance degradation. Therefore, dealing with such errors is a major opportunity to improve overall SLU model performance. In this work, we first propose a general evaluation criterion that requires an ASR error robust model to perform well on both transcription and ASR hypothesis. Then robustness training techniques for both classification task and NER task are introduced. Experimental results on two datasets show that our proposed approaches improve model robustness to ASR errors for both tasks.

Speakers: Chengwei Su (Amazon Alexa) , Imre Kiss (Amazon Alexa) , Luoxin Chen (Amazon Alexa) , Weitong Ruan (Amazon Alexa) , Yaroslav Nechaev (Amazon Alexa)
• 20:30
Mon-2-9-9 End-to-End Spoken Language Understanding Without Full Transcripts 1h

An essential component of spoken language understanding (SLU) is slot filling: representing the meaning of a spoken utterance using semantic entity labels. In this paper, we develop end-to-end (E2E) spoken language understanding systems that directly convert speech input to semantic entities and investigate if these E2E SLU models can be trained solely on semantic entity annotations without word-for-word transcripts. Training such models is very useful as they can drastically reduce the cost of data collection. We created two types of such speech-to-entities models, a CTC model and an attention-based encoder-decoder model, by adapting models trained originally for speech recognition. Given that our experiments involve speech input, these systems need to recognize both the entity label and words representing the entity value correctly. For our speech-to-entities experiments on the ATIS corpus, both the CTC and attention models showed impressive ability to skip non-entity words: there was little degradation when trained on just entities versus full transcripts. We also explored the scenario where the entities are in an order not necessarily related to spoken order in the utterance. With its ability to do re-ordering, the attention model did remarkably well, achieving only about 2% degradation in speech-to-bag-of-entities F1 score.

Speakers: Brian Kingsbury (IBM Research) , Gakuto Kurata (IBM Research) , Hong-Kwang Kuo (IBM T. J. Watson Research Center) , Kartik Audhkhasi (IBM Research) , Luis Lastras (IBM Research AI) , Ron Hoory (IBM Haifa Research Lab) , Samuel Thomas (IBM Research AI) , Yinghui Huang (IBM) , Zoltán Tüske (IBM Research) , Zvi Kons (IBM Haifa research lab)
• 20:30 21:30
Mon-SS-2-6 Large-Scale Evaluation of Short-Duration Speaker Verification (SdSV) room6

### room6

Chairs: Hossein Zeinali, Kong Aik Lee

https://zoom.com.cn/j/67261969599

• 20:30
Mon-SS-2-6-1 Improving X-vector and PLDA for Text-dependent Speaker Verification 1h

Recently, the pipeline consisting of an x-vector speaker embedding front-end and a Probabilistic Linear Discriminant Analysis (PLDA) back-end has achieved state-of-the-art results in text-independent speaker verification. In this paper, we further improve the performance of the x-vector and PLDA based system for text-dependent speaker verification by exploring the choice of layer to produce the embedding and modifying the back-end training strategies. In particular, we find by probing that x-vector based embeddings, specifically the standard deviation statistics in the pooling layer, contain information related to both speaker characteristics and spoken content. Accordingly, we modify the back-end training labels by utilizing both the speaker-id and the phrase-id. A correlation-alignment-based PLDA adaptation is also adopted to make use of the text-independent labeled data during back-end training. Experimental results on the SDSVC 2020 dataset show that our proposed methods achieve significant performance improvement compared with the x-vector and HMM based i-vector baselines.

Speakers: Yue Lin (NetEase Games AI Lab) , Zhuxin Chen (NetEase Games AI Lab)
• 20:30
Mon-SS-2-6-2 SdSV Challenge 2020: Large-Scale Evaluation of Short‐Duration Speaker Verification 1h

Modern approaches to speaker verification represent speech utterances as fixed-length embeddings. With these approaches, we implicitly assume that speaker characteristics are independent of the spoken content. Such an assumption generally holds when sufficiently long utterances are given. In this context, speaker embeddings, like i-vector and x-vector, have been shown to be extremely effective. For speech utterances of short duration (on the order of a few seconds), speaker embeddings have shown significant dependency on the phonetic content. In this regard, the SdSV Challenge 2020 was organized with a broad focus on systematic benchmarking and analysis of varying degrees of phonetic variability in short-duration speaker verification (SdSV). In addition to text-dependent and text-independent tasks, the challenge features an unusual and difficult task of cross-lingual speaker verification (English vs. Persian). This paper describes the dataset and tasks, the evaluation rules and protocols, the performance metric, baseline systems, and challenge results. We also present insights gained from the evaluation and future research directions.

Speakers: Hossein Zeinali (Amirkabir University of Technology) , Kong Aik Lee (Biometrics Research Laboratories, NEC Corporation) , Lukas Burget (Brno University of Technology) , Md Jahangir Alam (Computer Research Institute of Montreal (CRIM))
• 20:30
Mon-SS-2-6-3 The XMUSPEECH System for Short-Duration Speaker Verification Challenge 2020 1h

In this paper, we present our XMUSPEECH system for Task 1 in the Short-Duration Speaker Verification (SdSV) Challenge. In this challenge, Task 1 is a Text-Dependent (TD) mode where speaker verification systems are required to automatically determine whether a test segment with a specific phrase belongs to the target speaker. We improved the system pipeline in three respects: data processing, front-end training and back-end processing. In addition, we have explored some training strategies such as spectrogram augmentation and transfer learning. The experimental results show that the strategies we tried are effective and that our best single system, a transferred model with spectrogram augmentation and attentive statistics pooling, significantly outperforms the official baseline on both the progress subset and the evaluation subset. Finally, a fusion of seven subsystems was chosen as our primary system, which yielded 0.0856 and 0.0862 in terms of minDCF for the progress subset and evaluation subset, respectively.

Speakers: Lin Li (Xiamen University) , Miao Zhao (School of Informatics, Xiamen University) , Qingyang Hong (Xiamen University) , Tao Jiang (School of Informatics, Xiamen University)
• 20:30
Mon-SS-2-6-4 Robust Text-Dependent Speaker Verification via Character-Level Information Preservation for the SdSV Challenge 2020 1h

This paper describes our submission to Task 1 of the Short-duration Speaker Verification (SdSV) challenge 2020. Task 1 is a text-dependent speaker verification task, where both the speaker and phrase are required to be verified. The submitted systems were composed of TDNN-based and ResNet-based front-end architectures, in which the frame-level features were aggregated with various pooling methods (e.g., statistical, self-attentive, ghostVLAD pooling). Although the conventional pooling methods provide embeddings with a sufficient amount of speaker-dependent information, our experiments show that these embeddings often lack phrase-dependent information. To mitigate this problem, we propose new pooling and score compensation methods that leverage a CTC-based automatic speech recognition (ASR) model for taking the lexical content into account. Both methods showed improvement over the conventional techniques, and the best performance was achieved by fusing all the experimented systems, which showed 0.0785% MinDCF and 2.23% EER on the challenge's evaluation subset.

Speakers: Min Hyun Han (Seoul National University) , Nam Soo Kim (Seoul National University) , Sung Hwan Mun (Seoul National University) , Woo Hyun Kang (Department of Electrical and Computer Engineering and INMC, Seoul National University)
• 20:30
Mon-SS-2-6-5 The TalTech Systems for the Short-duration Speaker Verification Challenge 2020 1h

This paper presents the Tallinn University of Technology systems submitted to the Short-duration Speaker Verification Challenge 2020. The challenge consists of two tasks, focusing on text-dependent and text-independent speaker verification with some cross-lingual aspects. We used speaker embedding models that consist of squeeze-and-attention based residual layers, multi-head attention and either a cross-entropy-based or an additive angular margin based objective function. In order to encourage the model to produce language-independent embeddings, we trained the models in a multi-task manner, using dataset-specific output layers. In the text-dependent task we employed a phrase classifier to reject trials with non-matching phrases. In the text-independent task we used a language classifier to boost the scores of trials where the language of the test and enrollment utterances does not match. Our final primary metric score was 0.075 in Task 1 (ranked 6th) and 0.118 in Task 2 (ranked 8th).

Speakers: Jörgen Valk (Tallinn University of Technology) , Tanel Alumäe (Tallinn University of Technology)
• 20:30
Mon-SS-2-6-6 Investigation of NICT submission for short-duration speaker verification challenge 2020 1h

In this paper, we describe the NICT speaker verification system for the text-independent task of the short-duration speaker verification (SdSV) challenge 2020.
We first present the details of the training data and feature preparation. Then, we investigate x-vector-based front-ends with different network configurations, together with back-ends based on probabilistic linear discriminant analysis (PLDA), simplified PLDA, cosine similarity, and neural network-based PLDA.
Finally, we apply a greedy fusion and calibration approach to select and combine the subsystems.
To improve the performance of the speaker verification system on short-duration evaluation data, we introduce our investigations on how to reduce the duration mismatch between training and test datasets.
Experimental results showed that our primary fusion yielded minDCF of 0.074 and EER of 1.50 on the evaluation subset, which was the 2nd best result in the text-independent speaker verification task.

Speakers: Hisashi Kawai (NICT) , Peng Shen (NICT) , Xugang Lu (NICT)
• 20:30
Mon-SS-2-6-7 Cross-Lingual Speaker Verification with Domain-Balanced Hard Prototype Mining and Language-Dependent Score Normalization 1h

In this paper we describe the top-scoring IDLab submission for the text-independent task of the Short-duration Speaker Verification (SdSV) Challenge 2020. The main difficulty of the challenge lies in the large degree of varying phonetic overlap between the potentially cross-lingual trials, along with the limited availability of in-domain DeepMine Farsi training data. We introduce domain-balanced hard prototype mining to fine-tune the state-of-the-art ECAPA-TDNN x-vector based speaker embedding extractor. The sample mining technique efficiently exploits speaker distances between the speaker prototypes of the popular AAM-softmax loss function to construct challenging training batches that are balanced on the domain level. To enhance the scoring of cross-lingual trials, we propose a language-dependent s-norm score normalization. The imposter cohort only contains data from the Farsi target domain, which simulates the enrollment data always being Farsi. In case a Gaussian-Backend language model detects the test speaker embedding to contain English, a cross-language compensation offset determined on the AAM-softmax speaker prototypes is subtracted from the maximum expected imposter mean score. A fusion of five systems with minor topological tweaks resulted in a final MinDCF and EER of 0.065 and 1.45% respectively on the SdSVC evaluation set.
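
For context, standard symmetric score normalization (s-norm) rescales a raw trial score using the statistics of the enrollment and test embeddings scored against an imposter cohort, then averages the two normalized values. The sketch below shows only this generic form; the language-dependent compensation offset described in the abstract is not implemented, and the function name `s_norm` and the use of population standard deviation are assumptions.

```python
from statistics import mean, pstdev


def s_norm(raw_score, enroll_cohort_scores, test_cohort_scores):
    """Symmetric score normalization of one verification trial.

    enroll_cohort_scores: scores of the enrollment embedding against the cohort.
    test_cohort_scores:   scores of the test embedding against the cohort.
    """
    z_enroll = (raw_score - mean(enroll_cohort_scores)) / pstdev(enroll_cohort_scores)
    z_test = (raw_score - mean(test_cohort_scores)) / pstdev(test_cohort_scores)
    return 0.5 * (z_enroll + z_test)


if __name__ == "__main__":
    # Hypothetical cohort scores for a single trial.
    print(s_norm(2.0, [0.0, 2.0], [1.0, 3.0]))
```

A language-dependent variant would adjust the cohort statistics (for example the imposter mean) depending on the detected language of the test utterance before this normalization is applied.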

Speakers: Brecht Desplanques (Ghent University - imec, IDLab, Department of Electronics and Information Systems), Jenthe Thienpondt (IDLab, Department of Electronics and Information Systems, Ghent University - imec, Belgium), Kris Demuynck (Ghent University)
• 20:30
Mon-SS-2-6-8 BUT Text-Dependent Speaker Verification System for SdSV Challenge 2020 1h

In this paper, we present the winning BUT submission for the text-dependent task of the SdSV challenge 2020. Given the large amount of training data available in this challenge, we explore successful techniques from text-independent systems in the text-dependent scenario. In particular, we train x-vector extractors on both in-domain and out-of-domain datasets and combine them with i-vectors trained on concatenated MFCCs and bottleneck features, which have proven effective for the text-dependent scenario. Moreover, we propose the use of a phrase-dependent PLDA backend for scoring and its combination with a simple phrase recognizer, which brings up to 63% relative improvement on our development set with respect to using standard PLDA. Finally, we combine our different i-vector and x-vector based systems using a simple linear logistic regression score-level fusion, which provides a 28% relative improvement on the evaluation set with respect to our best single system.
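As an illustration of what linear logistic regression score-level fusion means in general, here is a minimal pure-Python sketch. It is not the BUT system: the function names, the gradient-descent settings, and the toy scores are all hypothetical, and real systems typically rely on a dedicated calibration toolkit rather than hand-written training loops.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse(weights, bias, trial_scores):
    """Fused score: weighted sum of the subsystem scores plus a bias."""
    return sum(w * s for w, s in zip(weights, trial_scores)) + bias


def train_fusion(scores, labels, lr=0.1, epochs=2000):
    """Fit fusion weights by batch gradient descent on the logistic loss.

    scores: list of trials, each a list with one score per subsystem.
    labels: 1 for target trials, 0 for non-target trials.
    """
    n_trials, n_systems = len(scores), len(scores[0])
    weights, bias = [0.0] * n_systems, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_systems, 0.0
        for trial, label in zip(scores, labels):
            err = sigmoid(fuse(weights, bias, trial)) - label
            for k in range(n_systems):
                grad_w[k] += err * trial[k]
            grad_b += err
        weights = [w - lr * g / n_trials for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_trials
    return weights, bias


# Synthetic toy scores from two hypothetical subsystems (not real data).
toy_scores = [[2.0, 1.5], [1.0, 2.5], [1.8, 0.9],
              [-1.0, -0.5], [-2.0, 0.2], [-0.5, -1.5]]
toy_labels = [1, 1, 1, 0, 0, 0]
fusion_w, fusion_b = train_fusion(toy_scores, toy_labels)
fused = [fuse(fusion_w, fusion_b, t) for t in toy_scores]
```

The fused score is a single number per trial that can be thresholded or evaluated like any individual subsystem score.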

Speakers: Alicia Lozano-Diez (Brno University of Technology), Anna Silnova (Brno University of Technology), Bhargav Pulugundla (Brno University of Technology), Johan Rohdin (Brno University of Technology), Karel Vesely (Brno University of Technology), Lukas Burget (Brno University of Technology), Oldrich Plchot (Brno University of Technology), Ondrej Glembek (Brno University of Technology), Ondrej Novotny (Brno University of Technology), Pavel Matejka (Brno University of Technology)
• 20:30
Mon-SS-2-6-9 Exploring the Use of an Unsupervised Autoregressive Model as a Shared Encoder for Text-Dependent Speaker Verification 1h

In this paper, we propose a novel way of addressing text-dependent automatic speaker verification (TD-ASV) by using a shared encoder with task-specific decoders. An autoregressive predictive coding (APC) encoder is pre-trained in an unsupervised manner using both out-of-domain (LibriSpeech, VoxCeleb) and in-domain (DeepMine) unlabeled datasets to learn generic, high-level feature representations that encapsulate speaker and phonetic content. Two task-specific decoders were trained using labeled datasets to classify speakers (SID) and phrases (PID). Speaker embeddings extracted from the SID decoder were scored using a PLDA. SID and PID systems were fused at the score level. There is a 51.9% relative improvement in minDCF for our system compared to the fully supervised x-vector baseline on the cross-lingual DeepMine dataset. However, the i-vector/HMM method outperformed the proposed APC encoder-decoder system. A fusion of the x-vector/PLDA baseline and the SID/PLDA scores prior to PID fusion further improved performance by 15%, indicating the complementarity of the proposed approach to the x-vector system. We show that the proposed approach can leverage large, unlabeled, data-rich domains, and learn speech patterns independent of downstream tasks. Such a system can provide competitive performance in domain-mismatched scenarios where test data is from data-scarce domains.

Speakers: Abeer Alwan (UCLA), Amber Afshan (University of California, Los Angeles), Huanhua Lu (UCLA), Ruchao Fan (University of California, Los Angeles), Vijay Ravi (Ph.D. Student, UCLA)
• 21:30 21:45
Coffee Break
• 21:45 22:45
Diversity Meeting room12

### room12

https://zoom.com.cn/j/63445767313

• 21:45 22:45
ISCA-SAC "2nd Mentoring" room6

### room6

Mentors: TBA

https://zoom.com.cn/j/67261969599

• 21:45
ISCA-SAC "2nd Mentoring" 1h

The Student Advisory Committee of the International Speech Communication Association (ISCA-SAC) is pleased to announce that we are planning to hold the 2nd Mentoring Event in 2020! After a successful first edition at Interspeech 2019, we would like to establish this event in upcoming years.

The event gives PhD students the opportunity to engage in a discussion with early-career and senior researchers from academia and industry. ISCA-SAC aims at providing a warm environment for discussing questions concerning a variety of topics, such as research in academia and industry, equal opportunities, publishing, and professional development.

This year, the event will take place virtually around the same time as Interspeech 2020 in Shanghai.

The event is planned in a round table format with two mentors and 6-8 PhD students per table. Each table will have an assigned topic and the mentors will be chosen and invited accordingly.

• 21:45 22:45
Mon-3-1 Cross/multi-lingual and code-switched speech recognition room1

### room1

Chairs: Preethi Jyothi, Sunayana Sitaram

https://zoom.com.cn/j/68015160461

• 21:45
Mon-3-1-1 Autosegmental Neural Nets: Should Phones and Tones be Synchronous or Asynchronous? 1h

Phones, the segmental units of the International Phonetic Alphabet (IPA), are used for lexical distinctions in most human languages; tones, the suprasegmental units of the IPA, are used in perhaps 70%. Many previous studies have explored cross-lingual adaptation of automatic speech recognition (ASR) phone models, but few have explored the multilingual and cross-lingual transfer of synchronization between phones and tones. In this paper, we test four Connectionist Temporal Classification (CTC)-based acoustic models, differing in the degree of synchrony they impose between phones and tones. Models are trained and tested multilingually in three languages, then adapted and tested cross-lingually in a fourth. Both synchronous and asynchronous models are effective in both multilingual and cross-lingual settings. Synchronous models achieve a lower error rate in the joint phone+tone tier, but asynchronous training results in a lower tone error rate.

Speakers: Jialu Li<