Automatic Speech Recognition and Understanding Workshop
December 8-12, 2013 | Olomouc, Czech Republic
The decomposition and re-integration of complex sounds, such as speech, are at the core of auditory cortical processing. We will discuss how this processing is transformed between subcortical and cortical stations and how it relates to the structural organization of auditory cortex in animal models. Processing of speech in noise is a particular challenge, and recent physiological findings point to some solutions to this task. Finally, we will compare recent recordings of responses to speech sounds from the human superior temporal gyrus with findings obtained from animal models and discuss potential implications for speech recognition approaches.
Christoph E. Schreiner is a Professor in the Departments of Otolaryngology and Bioengineering at the University of California, San Francisco, USA. He holds a Ph.D. in Physics and an MD from the University of Göttingen, Germany. His main research interests include the processing of complex sounds from the auditory midbrain to the auditory cortex in various animal models with special consideration of hearing impairments. He is a member of the Tinnitus Research Consortium and has edited—jointly with Dr. Jeffery Winer—two books on central auditory processing.
Artificial neural networks have been applied to speech tasks for over 50 years. In particular, multilayer perceptrons (MLPs) have been used as components in HMM-based systems for 25 years. This presentation will describe the long journey from early speech classification experiments with MLPs in the 1960s to present-day implementations. There will be an emphasis on the hybrid HMM/MLP approaches that have dominated the use of artificial neural networks for speech recognition since the late 1980s, but which have only recently gained mainstream adoption.
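The core idea of the hybrid HMM/MLP approach can be sketched briefly: an MLP trained on labeled frames outputs state posteriors P(q | x), which are divided by the state priors P(q) to obtain scaled likelihoods usable as HMM emission scores. The sketch below illustrates only this conversion step; the function name and all numbers are illustrative, not from any particular system.

```python
import numpy as np

def posteriors_to_scaled_likelihoods(posteriors, state_priors):
    """Convert frame-level state posteriors P(q|x) into scaled
    likelihoods P(q|x) / P(q), i.e. p(x|q) up to the per-frame
    constant p(x), by Bayes' rule."""
    return posteriors / state_priors

# Illustrative numbers: 3 HMM states, one acoustic frame.
posteriors = np.array([0.7, 0.2, 0.1])    # MLP softmax output P(q|x)
state_priors = np.array([0.5, 0.3, 0.2])  # relative state frequencies in training data

scaled = posteriors_to_scaled_likelihoods(posteriors, state_priors)
print(scaled)  # [1.4  0.66666667  0.5]
```

Dividing out the priors matters because the HMM's transition structure already encodes state frequencies; feeding raw posteriors into Viterbi decoding would count the priors twice.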
Nelson Morgan has been working on problems in signal processing and pattern recognition since 1974, with a primary emphasis on speech processing. He may have been the first to use neural networks for speech classification in a commercial application. He is a former Editor-in-Chief of Speech Communication, and is a Fellow of both the IEEE and ISCA. In 1997 he received the Signal Processing Magazine best paper award (together with co-author Herve Bourlard) for an article that described the basic hybrid HMM/MLP approach. He also co-authored, with Ben Gold, a text on speech and audio signal processing, whose second edition (2011) was revised in collaboration with Dan Ellis of Columbia University. He is the deputy director (and former director) of the International Computer Science Institute (ICSI), and is a Professor-in-Residence in the EECS Department at the University of California at Berkeley.
In 2010, it was shown that combining hybrid ANN-HMMs with both traditional senone modeling and deep learning yields a powerful new acoustic model for ASR. Dubbed the Context-Dependent Deep-Neural-Network HMM, or CD-DNN-HMM, it has so far led to over 40% relative error reduction for speaker-independent recognition on the Switchboard benchmark, compared to the conventional GMM baseline. This is arguably the largest gain obtained through a single technology in ASR. This talk will describe how this discovery has been further developed towards use in practical systems. We will focus specifically on the remarkable benefits from the DNN's ability to learn better feature representations and how they can help in real-life applications, as well as the no less remarkable difficulties arising from the computational cost in training and at runtime, and approaches to address them.
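The forward pass of a CD-DNN-HMM acoustic model can be sketched in a few lines: a deep network maps a window of stacked acoustic feature frames to a softmax distribution over context-dependent senones, whose posteriors are then converted to emission scores for HMM decoding. This is a minimal illustration with random weights standing in for trained parameters; the layer sizes are assumptions chosen small for readability (real systems of that era used several hidden layers of a few thousand units and on the order of 10,000 senones).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative sizes (assumed): 11 stacked 40-dim frames in, 3000 senones out.
feat_dim, hidden, n_senones = 440, 512, 3000

# Random weights stand in for parameters learned by backpropagation.
W = [rng.standard_normal((feat_dim, hidden)) * 0.01,
     rng.standard_normal((hidden, hidden)) * 0.01,
     rng.standard_normal((hidden, n_senones)) * 0.01]
b = [np.zeros(hidden), np.zeros(hidden), np.zeros(n_senones)]

def senone_posteriors(frame_window):
    """Map one window of acoustic features to a softmax over senones."""
    h = frame_window
    for Wi, bi in zip(W[:-1], b[:-1]):
        h = sigmoid(h @ Wi + bi)          # hidden layers
    return softmax(h @ W[-1] + b[-1])     # senone posterior distribution

x = rng.standard_normal(feat_dim)         # one window of stacked frames
p = senone_posteriors(x)
print(p.shape, round(p.sum(), 6))         # (3000,) 1.0
```

As in the classic hybrid approach, these senone posteriors would be divided by senone priors to obtain scaled likelihoods before Viterbi decoding; the "context-dependent" part refers to the softmax targets being tied triphone states rather than monophones.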
Frank Seide, a native of Hamburg, Germany, is a Senior Researcher/Research Manager at Microsoft Research. His current research focus is on deep neural networks for conversational speech recognition; together with co-author Dong Yu, he was the first to show the effectiveness of CD-DNN-HMMs for recognition of conversational speech. Since graduating in 1993, Frank has worked on various speech topics, including spoken-dialogue systems, Mandarin speech recognition, audio search, and speech-to-speech translation, first at Philips Research in Aachen and Taipei and now at Microsoft Research Asia (Beijing).