25-29 October 2020
Shanghai International Convention Center
Asia/Shanghai timezone

Contribution List

Boxue Li (Yunfan Hailiang (Beijing) Technology Co., Ltd.) , Jinsong Zhang (Advanced Innovation Center for Language Resource and Intelligence, Beijing Language and Culture University) , Xiaoli Feng (Advanced Innovation Center for Language Resource and Intelligence, Beijing Language and Culture University) , Yanlu Xie (Advanced Innovation Center for Language Resource and Intelligence, Beijing Language and Culture University) , Yujia Jin (Advanced Innovation Center for Language Resource and Intelligence, Beijing Language and Culture University)
26/10/2020, 19:15

In this paper, an app with mispronunciation detection and feedback for Mandarin L2 learners is shown. The app detects mispronunciations in words and highlights them in red at the phone level. A score is also shown to evaluate the overall pronunciation. When the learner touches a highlight, the learner's pronunciation and the standard pronunciation are played. Then the flash animation that...

Siti Umairah Md Salleh (Institute for Infocomm Research, A*STAR, Singapore) , Ke Shi (Institute for Infocomm Research, A*STAR, Singapore) , Kye Min Tan (Institute for Infocomm Research, A*STAR, Singapore) , Nancy F. Chen (Institute for Infocomm Research, A*STAR, Singapore) , Nur Farah Ain Binte Suhaimi (Institute for Infocomm Research, A*STAR, Singapore) , Rajan s/o Vellu (Institute for Infocomm Research, A*STAR, Singapore) , Richeng Duan (Institute for Infocomm Research, A*STAR, Singapore) , Thai Ngoc Thuy Huong Helen (Institute for Infocomm Research, A*STAR, Singapore)
26/10/2020, 19:15

We present a computer-assisted language learning system that automatically evaluates the pronunciation and fluency of spoken Malay and Tamil. Our system consists of a server and a user-facing Android application, where the server is responsible for speech-to-text alignment as well as pronunciation and fluency scoring. We describe our system architecture and discuss the technical challenges...

Alexander Waibel (Carnegie Mellon) , Elizabeth Salesky (Johns Hopkins University) , Jan Niehues (Maastricht University) , Ngoc-Quan Pham (Karlsruhe Institute of Technology) , Sebastian Stüker (Karlsruhe Institute of Technology) , Thai Son Nguyen (Karlsruhe Institute of Technology) , Thanh-Le Ha (Karlsruhe Institute of Technology) , Tuan Nam Nguyen (Karlsruhe Institute of Technology)
26/10/2020, 19:15

Transformer models are a powerful sequence-to-sequence architecture capable of directly mapping speech inputs to transcriptions or translations. However, the mechanism for modeling positions in this model was tailored for text and is thus less than ideal for acoustic inputs. In this work, we adapted the relative position encoding scheme to the Speech Transformer, in which the key is...
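For readers unfamiliar with the idea, below is a minimal numpy sketch of self-attention with relative position embeddings in the spirit of Shaw et al. (2018); it is not necessarily the authors' exact formulation, and all dimensions and names are illustrative.

    # Self-attention with learned embeddings indexed by the (clipped)
    # query-key offset instead of absolute positions. Illustrative only.
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def rel_attention(Q, K, V, rel_emb, max_dist):
        """Q, K, V: (T, d). rel_emb: (2*max_dist+1, d), one embedding per
        offset j - i clipped to [-max_dist, max_dist]."""
        T, d = Q.shape
        logits = Q @ K.T                                  # content-content
        offsets = np.clip(np.arange(T)[None, :] - np.arange(T)[:, None],
                          -max_dist, max_dist) + max_dist  # (T, T) indices
        logits += np.einsum('id,ijd->ij', Q, rel_emb[offsets])  # content-position
        return softmax(logits / np.sqrt(d)) @ V

    T, d, max_dist = 6, 8, 3
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
    rel_emb = 0.1 * rng.standard_normal((2 * max_dist + 1, d))
    print(rel_attention(Q, K, V, rel_emb, max_dist).shape)  # (6, 8)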

Bing Wang (Southwest Air Traffic Management Bureau, Civil Aviation Administration of China) , Bo Yang (Sichuan University) , Dan Li (Southwest Air Traffic Management Bureau, Civil Aviation Administration of China) , Min Ruan (Southwest Air Traffic Management Bureau, Civil Aviation Administration of China) , Xianlong Tan (Southwest Air Traffic Management Bureau, Civil Aviation Administration of China) , Xiping Wu (Sichuan University) , Yi Lin (Sichuan University) , Zhengmao Chen (Sichuan University) , Zhongping Yang (Wisesoft Co. Ltd.)
26/10/2020, 19:15

Automatic Speech Recognition (ASR) techniques have developed greatly in recent years, expediting many applications in other fields. A speech corpus is always an essential foundation for ASR research, especially for vertical industries such as Air Traffic Control (ATC). Some speech corpora, public or paid, exist for common applications. However, for the ATC domain, it is...

Chan Kyu Lee (Clova AI, NAVER Corp.) , Eunmi Kim (Clova AI, NAVER Corp.) , Hyeji Kim (Clova AI, NAVER Corp.) , Hyun Ah Kim, Hyunhoon Jung (Clova AI, NAVER Corp.) , Jin Gu Kang (Clova AI, NAVER Corp.) , Jung-Woo Ha (Clova AI, NAVER Corp.) , Kihyun Nam (Hankuk University of Foreign Studies) , Kyoungtae Doh (Clova AI, NAVER Corp.) , Nako Sung (Clova AI, NAVER Corp.) , Sang-Woo Lee (Clova AI, NAVER Corp.) , Sohee Yang (Clova AI, NAVER Corp.) , Soojin Kim (Clova AI, NAVER Corp.) , Sunghun Kim (Clova AI, NAVER Corp.; The Hong Kong University of Science and Technology)
26/10/2020, 19:15

Automatic speech recognition (ASR) for calls is essential for various applications, including AI for contact center (AICC) services. Despite the advancement of ASR, however, most publicly available call-based speech corpora, such as Switchboard, are dated. Also, most existing call corpora are in English and mainly focus on open-domain dialog or general scenarios such as audiobooks. Here...

Guanjun Li (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Longshuai Xiao (NLPR, Institute of Automation, Chinese Academy of Sciences) , Shan Liang (NLPR, Institute of Automation, Chinese Academy of Sciences) , Shuai Nie (NLPR, Institute of Automation, Chinese Academy of Sciences) , Wenju Liu (NLPR, Institute of Automation, Chinese Academy of Sciences) , Zhanlei Yang (Huawei Technologies)
26/10/2020, 19:15

The elastic spatial filter (ESF) proposed in recent years is a popular multi-channel speech enhancement front end based on deep neural networks (DNNs). It is suitable for real-time processing and has shown promising automatic speech recognition (ASR) results. However, the ESF only utilizes the knowledge of fixed beamforming, resulting in limited noise reduction capabilities. In this paper, we...
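As background for the fixed-beamforming knowledge the ESF encodes, a plain delay-and-sum fixed beamformer can be sketched as below. This is generic textbook processing, not the paper's DNN front end, and the two-microphone setup is made up for illustration.

    # Delay-and-sum fixed beamformer: steer the M-mic array by applying
    # per-channel delays as FFT phase shifts, then average the channels.
    import numpy as np

    def delay_and_sum(x, delays, fs):
        """x: (M, N) mic signals; delays: (M,) steering delays in seconds."""
        M, N = x.shape
        X = np.fft.rfft(x, axis=1)                    # (M, N//2+1)
        f = np.fft.rfftfreq(N, d=1.0 / fs)            # bin frequencies in Hz
        phase = np.exp(-2j * np.pi * f[None, :] * delays[:, None])
        return np.fft.irfft((X * phase).mean(axis=0), n=N)

    fs = 16000
    t = np.arange(fs) / fs
    src = np.sin(2 * np.pi * 440 * t)
    # two mics; the second hears the source 0.25 ms later
    x = np.stack([src, np.roll(src, int(0.00025 * fs))])
    y = delay_and_sum(x, delays=np.array([0.0, -0.00025]), fs=fs)
    print(y.shape)  # (16000,)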

Ed Lin (Microsoft, STCA) , Jian Wu (Northwestern Polytechnical University) , Jinyu Li (Microsoft, One Microsoft Way, Redmond, WA, USA) , Lei Xie (School of Computer Science, Northwestern Polytechnical University) , Takuya Yoshioka (Microsoft, One Microsoft Way, Redmond, WA, USA) , Yi Luo (Microsoft, One Microsoft Way, Redmond, WA, USA) , Zhili Tan (Microsoft, STCA, Beijing) , Zhuo Chen (Microsoft, One Microsoft Way, Redmond, WA, USA)
26/10/2020, 19:15

Multi-speaker speech recognition has been one of the key challenges in conversation transcription, as it breaks the single-active-speaker assumption employed by most state-of-the-art speech recognition systems. Speech separation is considered a remedy to this problem. Previously, we introduced a system, called unmixing, fixed-beamformer and extraction (UFE), that was shown to be effective in...

Chongyuan Lian (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences) , Feiqi Zhu (Shenzhen Luohu People’s Hospital) , Lan Wang (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences) , Manwa Lawrence Ng (The University of Hong Kong) , Mingxiao Gu (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences) , Nan Yan (Shenzhen Institutes of Advanced Technology) , Tianqi Wang (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences)
26/10/2020, 19:15

Alterations in speech and language are typical signs of mild cognitive impairment (MCI), considered to be the prodromal stage of Alzheimer’s disease (AD). Yet very few studies have pinpointed at what stage speech production is disrupted. To bridge this knowledge gap, the present study focused on lexical retrieval, a specific process during speech production, and investigated how it is...

Adam Lammert (Worcester Polytechnic Institute) , Anne O'Brien (Spaulding Rehabilitation Hospital) , Daniel Hannon (MIT Lincoln Laboratory) , Douglas Sturim (MIT) , Gloria Vergara-Diaz (Spaulding Rehabilitation Hospital) , Gregory Ciccarelli (MIT Lincoln Laboratory) , Hrishikesh Rao (MIT Lincoln Laboratory) , James Williamson (MIT Lincoln Laboratory) , Jeffrey Palmer (MIT Lincoln Laboratory) , Paolo Bonato (Spaulding Rehabilitation Hospital) , Richard DeLaura (MIT Lincoln Laboratory) , Ross Zafonte (Spaulding Rehabilitation Hospital) , Sophia Yuditskaya (MIT Lincoln Laboratory) , Tanya Talkar (Harvard University) , Thomas Quatieri (MIT Lincoln Laboratory)
26/10/2020, 19:15

Between 15% and 40% of mild traumatic brain injury (mTBI) patients experience incomplete recoveries or report subjectively decreased motor abilities despite a clinically determined complete recovery. This demonstrates a need for objective measures capable of detecting subclinical residual mTBI, particularly in return-to-duty decisions for warfighters and return-to-play decisions for...

Binghuai Lin (MIG, Tencent Science and Technology Ltd., Beijing) , Dengfeng Ke (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Jinsong Zhang (Beijing Language and Culture University) , Wang Dai (Beijing Language and Culture University) , Wei Wei (Beijing Language and Culture University) , Yanlu Xie (Beijing Language and Culture University) , Yingming Gao (Institute of Acoustics and Speech Communication, Technische Universität Dresden)
26/10/2020, 19:15

Formant tracking is one of the most fundamental problems in speech processing. Traditionally, formants are estimated using signal processing methods. Recent studies showed that generic convolutional architectures can outperform recurrent networks on temporal tasks such as speech synthesis and machine translation. In this paper, we explored the use of Temporal Convolutional Network (TCN) for...
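To make the approach concrete, here is a minimal sketch of what a TCN-style frame-wise formant regressor could look like: stacked dilated 1-D convolutions mapping spectral frames to formant frequencies. Layer sizes, input features, and targets are illustrative guesses, not the paper's configuration.

    # TCN-style frame-wise regressor: dilations 1, 2, 4, 8 grow the
    # receptive field while preserving the frame count.
    import torch
    import torch.nn as nn

    class TCNFormantTracker(nn.Module):
        def __init__(self, in_dim=257, hidden=128, n_formants=3, levels=4):
            super().__init__()
            layers, ch = [], in_dim
            for i in range(levels):
                d = 2 ** i
                layers += [nn.Conv1d(ch, hidden, kernel_size=3,
                                     padding=d, dilation=d),
                           nn.ReLU()]
                ch = hidden
            self.tcn = nn.Sequential(*layers)
            self.head = nn.Conv1d(hidden, n_formants, kernel_size=1)

        def forward(self, x):               # x: (batch, in_dim, frames)
            return self.head(self.tcn(x))   # (batch, n_formants, frames)

    model = TCNFormantTracker()
    spec = torch.randn(2, 257, 100)          # e.g. magnitude spectrogram
    print(model(spec).shape)                 # torch.Size([2, 3, 100])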

Denis Parkhomenko (Huawei Technologies Co. Ltd.) , Mikhail Kudinov (Huawei Technologies Co. Ltd.) , Sergey Repyevsky (Huawei Technologies Co. Ltd.) , Stanislav Kamenev (Huawei Technologies Co. Ltd.) , Tasnima Sadekova (Huawei Technologies Co. Ltd.) , Vadim Popov (Huawei Technologies Co. Ltd.) , Vitalii Bushaev (Huawei Technologies Co. Ltd.) , Vladimir Kryzhanovskiy (Huawei Technologies Co. Ltd.)
26/10/2020, 19:15

We present a fast and lightweight on-device text-to-speech system based on state-of-the-art methods of feature and speech generation, i.e., Tacotron2 and LPCNet. We show that modification of the basic pipeline, combined with hardware-specific optimizations and extensive use of parallelization, enables running a TTS service even on low-end devices with faster-than-real-time waveform generation....

Aleksandr Laptev (ITMO University) , Aleksei Romanenko (ITMO University) , Andrei Andrusenko (ITMO University) , Anton Mitrofanov (STC-innovations Ltd) , Ivan Medennikov (STC-innovations Ltd) , Ivan Podluzhny (STC-innovations Ltd) , Ivan Sorokin (STC) , Mariya Korenevskaya (STC-innovations Ltd) , Maxim Korenevsky (Speech Technology Center) , Tatiana Prisyach (STC-innovations Ltd) , Tatiana Timofeeva (STC-innovations Ltd) , Yuri Khokhlov (STC-innovations Ltd)
26/10/2020, 19:15

Speaker diarization for real-life scenarios is an extremely challenging problem. Widely used clustering-based diarization approaches perform rather poorly in such conditions, mainly due to the limited ability to handle overlapping speech. We propose a novel Target-Speaker Voice Activity Detection (TS-VAD) approach, which...

Alberto Abad (INESC-ID/IST) , Björn Schuller (University of Augsburg / Imperial College London) , Catarina Botelho (INESC-ID/Instituto Superior Técnico, University of Lisbon, Portugal) , Dennis Küster (Cognitive Systems Lab (CSL), University of Bremen) , Isabel Trancoso (INESC-ID / IST Univ. Lisbon) , Kevin Scheck (Cognitive Systems Lab (CSL), University of Bremen) , Lorenz Diener (University of Bremen) , Shahin Amiriparian (University of Augsburg) , Tanja Schultz (Universität Bremen)
26/10/2020, 19:15

Electromyographic (EMG) signals recorded during speech production encode information on articulatory muscle activity and also on the facial expression of emotion, thus representing a speech-related biosignal with strong potential for paralinguistic applications. In this work, we estimate the electrical activity of the muscles responsible for speech articulation directly from the speech signal....

Bin Liu (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Jian Huang (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Jianhua Tao (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Rongjun Li (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Zhanlei Yang (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Zheng Lian (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences)
26/10/2020, 19:15

Emotion recognition remains a complex task due to speaker variability and limited training samples. To address these difficulties, we focus on domain adversarial neural networks (DANN) for emotion recognition. The primary task is to predict emotion labels. The secondary task is to learn a common representation in which speaker identities cannot be distinguished. By using this approach,...
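The standard DANN ingredient is a gradient reversal layer between the shared encoder and the speaker (domain) classifier: features pass through unchanged in the forward pass, while the classifier's gradients are negated, pushing the encoder toward speaker-invariant representations. A compact PyTorch sketch follows; the architecture details are placeholders, not the paper's model.

    # Gradient reversal layer (GRL) at the heart of DANN.
    import torch

    class GradReverse(torch.autograd.Function):
        @staticmethod
        def forward(ctx, x, lambd):
            ctx.lambd = lambd
            return x.view_as(x)              # identity in the forward pass

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lambd * grad_output, None   # flipped gradient

    feat = torch.randn(8, 64, requires_grad=True)    # shared encoder output
    emotion_logits = torch.nn.Linear(64, 4)(feat)    # primary task head
    speaker_logits = torch.nn.Linear(64, 10)(GradReverse.apply(feat, 1.0))
    loss = (torch.nn.functional.cross_entropy(emotion_logits,
                                              torch.randint(4, (8,)))
            + torch.nn.functional.cross_entropy(speaker_logits,
                                                torch.randint(10, (8,))))
    loss.backward()   # speaker gradients reach the encoder with flipped sign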

Hiroshi Saruwatari (Graduate School of Information Science and Technology, The University of Tokyo, Japan) , Shinnosuke Takamichi (Graduate School of Information Science and Technology, The University of Tokyo, Japan) , Takaaki Saeki (Graduate School of Information Science and Technology, The University of Tokyo, Japan) , Yuki Saito (Graduate School of Information Science and Technology, The University of Tokyo, Japan)
26/10/2020, 19:15

We present a real-time, full-band, online voice conversion (VC) system that uses a single CPU. For practical applications, VC must be high quality and able to perform real-time, online conversion with fewer computational resources. Our system achieves this by combining non-linear conversion with a deep neural network and short-tap, sub-band filtering. We evaluate our system and demonstrate...

Adrian Hempel (SoapBox Labs, Dublin, Ireland) , Agape Deng (SoapBox Labs, Dublin, Ireland) , Amelia C. Kelly (SoapBox Labs, Dublin, Ireland) , Armin Saeb (SoapBox Labs, Dublin, Ireland) , Arnaud Letondor (SoapBox Labs, Dublin, Ireland) , Eleni Karamichali (SoapBox Labs, Dublin, Ireland) , Gloria Montoya Gomez (SoapBox Labs, Dublin, Ireland) , Karel Veselý (SoapBox Labs, Dublin, Ireland) , Niall Mullally (SoapBox Labs, Dublin, Ireland) , Nicholas Parslow (SoapBox Labs, Dublin, Ireland) , Qiru Zhou (SoapBox Labs, Dublin, Ireland) , Robert O’Regan (SoapBox Labs, Dublin, Ireland)
26/10/2020, 19:15

The SoapBox Labs Fluency API service allows the automatic assessment of a child’s reading fluency. The system uses automatic speech recognition (ASR) to transcribe the child’s speech as they read a passage. The ASR output is then compared to the text of the reading passage, and the fluency algorithm returns information about the accuracy of the child’s reading attempt. In this show and tell...
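The comparison step can be pictured with a stdlib sketch that aligns the ASR transcript against the passage and counts correctly read words. SoapBox's actual fluency algorithm is presumably richer than this, so treat the helper below as a hypothetical illustration.

    # Align transcript to passage word-by-word and report word accuracy.
    import difflib

    def reading_accuracy(passage, transcript):
        ref, hyp = passage.lower().split(), transcript.lower().split()
        matcher = difflib.SequenceMatcher(a=ref, b=hyp)
        correct = sum(b.size for b in matcher.get_matching_blocks())
        return correct / len(ref)

    passage = "the quick brown fox jumps over the lazy dog"
    transcript = "the quick brown fox jump over the dog"
    print(f"{reading_accuracy(passage, transcript):.0%}")  # 78%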

Agape Deng (SoapBox Labs, Dublin, Ireland) , Amelia C. Kelly (SoapBox Labs, Dublin, Ireland) , Armin Saeb (SoapBox Labs, Dublin, Ireland) , Arnaud Letondor (SoapBox Labs, Dublin, Ireland) , Eleni Karamichali, Karel Veselý (SoapBox Labs, Dublin, Ireland) , Nicholas Parslow (SoapBox Labs, Dublin, Ireland) , Qiru Zhou (SoapBox Labs, Dublin, Ireland) , Robert O’Regan (SoapBox Labs, Dublin, Ireland)
26/10/2020, 19:15

SoapBox Labs’ child speech verification platform is a service designed specifically for identifying keywords and phrases in children’s speech. Given an audio file containing children’s speech and one or more target keywords or phrases, the system will return the confidence score of recognition for the word(s) or phrase(s) within the audio file. The confidence scores are provided at...

Andrew Cornish (Modality.ai, Inc.) , David Pautler (Modality.ai, Inc.) , David Suendermann-Oeft (Modality.ai, Inc.) , Dirk Schnelle-Walka (Modality.ai, Inc.) , Doug Habberstad (Modality.ai, Inc.) , Hardik Kothare (Modality.ai, Inc.; University of California, San Francisco) , Jackson Liscombe (Modality.ai, Inc.) , Michael Neumann (Modality.ai, Inc.) , Oliver Roesler (Modality.ai, Inc.) , Patrick Lange (Modality.ai, Inc.) , Vignesh Murali (Modality.ai, Inc.) , Vikram Ramanarayanan (Modality.ai, Inc.; University of California, San Francisco)
26/10/2020, 19:15

We demonstrate a multimodal conversational platform for remote patient diagnosis and monitoring. The platform engages patients in an interactive dialog session and automatically computes metrics relevant to speech acoustics and articulation, oro-motor and oro-facial movement, cognitive function and respiratory function. The dialog session includes a selection of exercises that have been...

Eunil Park (Sungkyunkwan University) , Jeewoo Yoon (Sungkyunkwan University) , Jinyoung Han (Sungkyunkwan University) , Migyeong Yang (Sungkyunkwan University) , Minsam Ko (Hanyang University) , Munyoung Lee (Electronics and Telecommunications Research Institute) , Seong Choi (Sungkyunkwan University) , Seonghee Lee (Electronics and Telecommunications Research Institute) , Seunghoon Jeong (Hanyang University)
26/10/2020, 19:15

We introduce an open-source Python library, VCTUBE, which can automatically generate <audio, text> pairs of speech data from a given YouTube URL. We believe VCTUBE is useful for easily collecting, processing, and annotating speech data toward developing speech synthesis systems.

Björn Schuller (University of Augsburg / Imperial College London) , Muhammad Asim (Information Technology University, Lahore) , Raja Jurdak (Queensland University of Technology (QUT)) , Rajib Rana (University of Southern Queensland) , Sara Khalifa (Distributed Sensing Systems Group, Data61, CSIRO Australia) , Siddique Latif (University of Southern Queensland Australia/Distributed Sensing Systems Group, Data61, CSIRO Australia)
26/10/2020, 20:30

Generative adversarial networks (GANs) have shown potential in learning emotional attributes and generating new data samples. However, their performance is usually hindered by the limited availability of large speech emotion recognition (SER) datasets. In this work, we propose a framework that utilises the mixup data augmentation scheme to augment the GAN in feature learning and generation. To show...
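Mixup itself (Zhang et al., 2018) is simple to state: train on convex combinations of feature/label pairs. The sketch below shows the augmentation in isolation; how the paper wires it into the GAN is more involved, and all shapes here are illustrative.

    # Mixup: blend two examples and their labels with a Beta-sampled weight.
    import numpy as np

    rng = np.random.default_rng(0)

    def mixup(x1, y1, x2, y2, alpha=0.2):
        lam = rng.beta(alpha, alpha)
        return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

    x1, x2 = rng.standard_normal(40), rng.standard_normal(40)  # SER features
    y1 = np.array([1., 0., 0., 0.])      # one-hot emotion labels
    y2 = np.array([0., 0., 1., 0.])
    x_mix, y_mix = mixup(x1, y1, x2, y2)
    print(y_mix)                          # soft label between y1 and y2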

Jianguo Wei (Tianjin University) , Jiayu Jin (Tianjin University) , Junhai Xu (Tianjin Key Laboratory of Cognitive Computing and Application, College of Intelligence and Computing, Tianjin University) , Lin Zhang (Tianjin University) , Longbiao Wang (Tianjin University) , Meng Liu (Tianjin University) , Ruiteng Zhang (Tianjin University) , Wenhuan Lu (Tianjin University)
26/10/2020, 20:30

The time-delay neural network (TDNN) is widely used in speaker verification to extract long-term temporal features of speakers. Although common TDNN approaches capture time-sequential information well, they lack the delicate transformations needed for deep representation. To solve this problem, we propose two TDNN architectures. RET integrates short-cut connections into conventional time-delay...
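To picture the kind of modification described, here is a hedged sketch of a TDNN layer (a dilated 1-D convolution over frames) wrapped with a short-cut connection. Dimensions are placeholders, and this is a generic residual TDNN block, not the proposed RET architecture itself.

    # A TDNN layer with a residual (short-cut) add around it.
    import torch
    import torch.nn as nn

    class ResTDNNBlock(nn.Module):
        def __init__(self, dim=512, context=3, dilation=2):
            super().__init__()
            pad = (context - 1) // 2 * dilation   # keep the frame count
            self.tdnn = nn.Conv1d(dim, dim, context,
                                  padding=pad, dilation=dilation)
            self.bn = nn.BatchNorm1d(dim)

        def forward(self, x):                     # x: (batch, dim, frames)
            return torch.relu(self.bn(self.tdnn(x)) + x)   # short-cut add

    x = torch.randn(4, 512, 200)
    print(ResTDNNBlock()(x).shape)                # torch.Size([4, 512, 200])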

Hiroshi Sato (NTT Media Intelligence Laboratories) , Marc Delcroix (NTT Communication Science Laboratories) , Ryo Masumura (NTT Corporation) , Shigeki Karita (NTT Communication Science Laboratories) , Takafumi Moriya (NTT Corporation) , Takanori Ashihara (NTT Corporation) , Tomohiro Tanaka (NTT Corporation) , Tsubasa Ochiai (NTT Communication Science Laboratories) , Yusuke Shinohara (NTT Corporation)
26/10/2020, 20:30

We present a novel training approach for encoder-decoder-based sequence-to-sequence (S2S) models. A key factor of S2S models is the attention mechanism, as it captures the relationships between input and output sequences. The attention weights inform which time frames should be attended to for predicting the output labels. In previous work, we proposed distilling S2S knowledge into...

Cunhang Fan (Institute of Automation, Chinese Academy of Sciences) , Jiangyan Yi (Institute of Automation, Chinese Academy of Sciences) , Jianhua Tao (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Ye Bai (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Zhengkun Tian (Institute of Automation, Chinese Academy of Sciences)
26/10/2020, 20:30

Many approaches have been proposed to predict punctuation marks, and previous results demonstrate that these methods are effective. However, a class imbalance problem remains during training: most of the labels in a punctuation prediction training set are non-punctuation marks, which hurts the performance of punctuation prediction. Therefore, this paper uses a focal loss...
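Focal loss (Lin et al., 2017) addresses exactly this kind of imbalance by down-weighting easy, dominant classes with a (1 - p_t)^gamma factor, so the rare punctuation classes drive training. A short PyTorch sketch, with the punctuation class set assumed for illustration:

    # Focal loss: cross-entropy scaled by (1 - p_t)^gamma per example.
    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, gamma=2.0):
        log_pt = F.log_softmax(logits, dim=-1) \
                  .gather(1, targets[:, None]).squeeze(1)
        pt = log_pt.exp()
        return (-(1 - pt) ** gamma * log_pt).mean()

    logits = torch.randn(16, 4)     # classes: none, comma, period, question
    targets = torch.randint(0, 4, (16,))
    print(focal_loss(logits, targets))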

Helen Meng (Chinese University of Hong Kong) , Jianwei Yu (Chinese University of Hong Kong) , Mengzhe Geng (Chinese University of Hong Kong) , Shansong Liu (Chinese University of Hong Kong) , Xunying Liu (Chinese University of Hong Kong) , Xurong Xie (Chinese University of Hong Kong) , Shoukang Hu (Chinese University of Hong Kong)
26/10/2020, 20:30

Disordered speech recognition is a highly challenging task. The underlying neuro-motor conditions of people with speech disorders, often compounded with co-occurring physical disabilities, make it difficult to collect the large quantities of speech required for system development. This paper investigates a set of data augmentation techniques for disordered speech recognition, including...
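The excerpt truncates the list of techniques, so as a representative augmentation in this space (not a claim about the paper's exact method), here is a generic numpy sketch of speed perturbation, which resamples the waveform to change both duration and pitch.

    # Speed perturbation by linear-interpolation resampling.
    import numpy as np

    def speed_perturb(wav, factor):
        """factor > 1 speeds up (shorter output), < 1 slows down."""
        n_out = int(round(len(wav) / factor))
        src_pos = np.arange(n_out) * factor          # fractional sample indices
        return np.interp(src_pos, np.arange(len(wav)), wav)

    wav = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)
    for f in (0.9, 1.0, 1.1):
        print(f, speed_perturb(wav, f).shape)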

Alexander Paul Leff (Institute of Cognitive Neuroscience, University College London) , David Barbera (University College London) , Emily Upton (Institute of Cognitive Neuroscience, University College London) , Henry Coley-Fisher (Institute of Cognitive Neuroscience, University College London) , Ian Shaw (Technical Consultant at SoftV) , Jenny Crinion (Institute of Cognitive Neuroscience, University College London) , Mark Huckvale (Speech, Hearing and Phonetic Sciences, University College London) , Victoria Fleming (Speech, Hearing and Phonetic Sciences, University College London) , William Latham (Goldsmiths College, University of London)
26/10/2020, 20:30

Anomia (word-finding difficulty) is the hallmark of aphasia, an acquired language disorder most commonly caused by stroke. Assessment of speech performance using picture naming tasks is therefore a key method for identifying the disorder and monitoring patients’ response to treatment interventions. Currently, this assessment is conducted manually by speech and language therapists...

Helen Meng (The Chinese University of Hong Kong) , Jianwei Yu (The Chinese University of Hong Kong) , Mengzhe Geng (The Chinese University of Hong Kong) , Rongfeng Su (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences) , Shansong Liu (The Chinese University of Hong Kong) , Shi-Xiong Zhang (Tencent AI Lab) , Xunying Liu (Chinese University of Hong Kong) , Xurong Xie (Chinese University of Hong Kong) , Shoukang Hu (Chinese University of Hong Kong)
26/10/2020, 20:30

Audio-visual speech recognition (AVSR) technologies have been successfully applied to a wide range of tasks. When developing AVSR systems for disordered speech, which is characterized by severe degradation of voice quality and a large mismatch against normal speech, it is difficult to record large amounts of high-quality audio-visual data. In order to address this issue, a cross-domain visual feature generation...

Jiangyan Yi (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Jianhua Tao (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Rongxiu Zhong (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Ruibo Fu (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Tao Wang (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences) , Zhengqi Wen (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences)
26/10/2020, 20:30

The low similarity and naturalness of synthesized speech remain a challenging problem for speaker adaptation with few resources. Since the acoustic model is too complex to interpret, overfitting occurs when training with limited data. To prevent the model from overfitting, this paper proposes a novel speaker adaptation framework that decomposes the parameter space of the end-to-end acoustic...

Ian McLoughlin (ICT Cluster, Singapore Institute of Technology) , Jie Yan (National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China) , Li-Rong Dai (National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China) , Lin Liu (iFLYTEK Research, iFLYTEK CO., LTD, Hefei) , Xu Zheng (National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China) , Yan Song (National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China)
26/10/2020, 20:30

Mean teacher based methods are increasingly achieving state-of-the-art performance for large-scale weakly labeled and unlabeled sound event detection (SED) tasks in recent DCASE challenges. By penalizing inconsistent predictions under different perturbations, mean teacher methods can exploit large-scale unlabeled data in a self-ensembling manner. In this paper, an effective perturbation...
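The core of the mean teacher scheme is an exponential-moving-average (EMA) teacher plus a consistency loss between student and teacher predictions under perturbation. A minimal PyTorch sketch follows, with the SED network abstracted to a placeholder linear layer and the perturbation reduced to additive noise.

    # Mean teacher: teacher weights are an EMA of student weights; a
    # consistency loss ties student and teacher outputs on unlabeled data.
    import copy
    import torch

    student = torch.nn.Linear(64, 10)          # stand-in for the SED network
    teacher = copy.deepcopy(student)
    for p in teacher.parameters():
        p.requires_grad_(False)                # teacher is never backpropped

    def ema_update(student, teacher, decay=0.999):
        with torch.no_grad():
            for ps, pt in zip(student.parameters(), teacher.parameters()):
                pt.mul_(decay).add_(ps, alpha=1 - decay)

    x = torch.randn(8, 64)                     # unlabeled batch
    noisy = x + 0.1 * torch.randn_like(x)      # perturbed view
    consistency = torch.nn.functional.mse_loss(
        torch.sigmoid(student(noisy)), torch.sigmoid(teacher(x)))
    consistency.backward()                     # update student, then:
    ema_update(student, teacher)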

Brian Kingsbury (IBM Research) , Gakuto Kurata (IBM Research) , Hong-Kwang Kuo (IBM T. J. Watson Research Center) , Kartik Audhkhasi (IBM Research) , Luis Lastras (IBM Research AI) , Ron Hoory (IBM Haifa Research Lab) , Samuel Thomas (IBM Research AI) , Yinghui Huang (IBM) , Zoltán Tüske (IBM Research) , Zvi Kons (IBM Haifa research lab)
26/10/2020, 20:30

An essential component of spoken language understanding (SLU) is slot filling: representing the meaning of a spoken utterance using semantic entity labels. In this paper, we develop end-to-end (E2E) spoken language understanding systems that directly convert speech input to semantic entities and investigate whether these E2E SLU models can be trained solely on semantic entity annotations without...

Alicia Lozano-Diez (Brno University of Technology) , Anna Silnova (Brno University of Technology) , Bhargav Pulugundla (Brno University of Technology) , Johan Rohdin (Brno University of Technology) , Karel Vesely (Brno University of Technology) , Lukas Burget (Brno University of Technology) , Oldrich Plchot (Brno University of Technology) , Ondrej Glembek (Brno University of Technology) , Ondrej Novotny (Brno University of Technology) , Pavel Matejka (Brno University of Technology)
26/10/2020, 20:30

In this paper, we present the winning BUT submission for the text-dependent task of the SdSV challenge 2020. Given the large amount of training data available in this challenge, we explore successful techniques from text-independent systems in the text-dependent scenario. In particular, we trained x-vector extractors on both in-domain and out-of-domain datasets and combined them with i-vectors...

Bo Xu (Institute of Automation, Chinese Academy of Sciences) , Jing Shi (Institute of Automation, Chinese Academy of Sciences) , Lei Qin (Huawei Consumer Business Group) , Peng Zhang (Institute of Automation, Chinese Academy of Sciences) , Yunzhe Hao (Institute of Automation, Chinese Academy of Sciences) , Jiaming Xu (Institute of Automation, Chinese Academy of Sciences)
26/10/2020, 21:45

Speech recognition technology in single-talker scenes has matured in recent years. However, in noisy environments, and especially in multi-talker scenes, speech recognition performance is significantly reduced. Towards the cocktail party problem, we propose a unified time-domain target speaker extraction framework. In this framework, we obtain a voiceprint from a clean speech of the target...

Changhong Liu (School of Computer and Information Engineering, Jiangxi Normal University) , Jihua Ye (School of Computer and Information Engineering, Jiangxi Normal University) , Yingen Yang (School of Computer and Information Engineering, Jiangxi Normal University, Nanchang) , Zhenchun Lei (School of Computer and Information Engineering, Jiangxi Normal University, Nanchang)
26/10/2020, 21:45

The security and reliability of automatic speaker verification systems can be threatened by different types of spoofing attacks using speech synthesis, voice conversion, or replay. The 2-class Gaussian Mixture Model classifier for genuine and spoofed speech is usually used as the baseline in the ASVspoof challenge, which is designed to develop generalized countermeasures with the potential to...
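That 2-class GMM baseline reduces to fitting one mixture per class and scoring trials by a log-likelihood ratio. A sketch with scikit-learn on toy features follows; real ASVspoof systems extract CQCC or LFCC features first, which is omitted here.

    # 2-class GMM baseline: genuine vs. spoofed log-likelihood ratio.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    genuine_feats = rng.normal(0.0, 1.0, size=(2000, 20))   # toy stand-ins
    spoofed_feats = rng.normal(0.5, 1.2, size=(2000, 20))

    gmm_gen = GaussianMixture(n_components=8, covariance_type='diag',
                              random_state=0).fit(genuine_feats)
    gmm_spf = GaussianMixture(n_components=8, covariance_type='diag',
                              random_state=0).fit(spoofed_feats)

    trial = rng.normal(0.0, 1.0, size=(150, 20))   # frames of one trial
    score = gmm_gen.score(trial) - gmm_spf.score(trial)  # mean per-frame LLR
    print('genuine' if score > 0 else 'spoofed', score)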

Andreas Maier (University Erlangen-Nuremberg) , Christian Bergler (Friedrich-Alexander-University Erlangen-Nuremberg, Department of Computer Science, Pattern Recognition Lab) , Elmar Nöth (Friedrich-Alexander-University Erlangen-Nuremberg) , Manuel Schmitt (Friedrich-Alexander-University Erlangen-Nuremberg, Department of Computer Science, Pattern Recognition Lab) , Simeon Smeele (Max Planck Institute of Animal Behavior, Cognitive and Cultural Ecology Lab and Max Planck Institute for Evolutionary Anthropology, Department for Human Behavior, Ecology and Culture) , Volker Barth (Anthro-Media)
26/10/2020, 21:45

In bioacoustics, passive acoustic monitoring of animals living in the wild, both on land and underwater, leads to large data archives characterized by a strong imbalance between recorded animal sounds and ambient noise. Bioacoustic datasets suffer severely from such noise variety, caused by a multitude of external influences and environmental conditions changing over the years. This leads...

Andrea Madotto (The Hong Kong University Of Science and Technology) , Genta Indra Winata (The Hong Kong University Of Science and Technology) , Pascale Fung (The Hong Kong University Of Science and Technology) , Peng Xu (The Hong Kong University Of Science and Technology) , Samuel Cahyawijaya (HKUST) , Zhaojiang Lin (The Hong Kong University Of Science and Technology) , Zihan Liu (The Hong Kong University Of Science and Technology)
26/10/2020, 21:45

Local dialects lead people to pronounce words of the same language differently from each other. The great variability and complex characteristics of accents create a major challenge for training a robust, accent-agnostic automatic speech recognition (ASR) system. In this paper, we introduce a cross-accented English speech recognition task as a benchmark for measuring the ability of...