John McDonough


Visiting Scientist
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
5000 Forbes Avenue,
Pittsburgh, PA 15213

Office: 5411 Gates Hillman Complex
Phone: 412 952 2881

e-mail: johnmcd at cs dot cmu dot edu

Affiliations: Language Technologies Institute
              Machine Learning for Signal Processing Group

Research Interests

My research concerns automatic speech recognition (ASR) in general and distant speech recognition (DSR) in particular. I think of the latter as any situation in which the microphone is not located in the immediate vicinity of the speaker's mouth. This implies that, in addition to all of the usual problems associated with ASR, there are also problems stemming from noise, room reverberation, the speech of other speakers, and so on. Such effects make DSR a far more difficult problem than conventional ASR, and moreover a problem that is by no means solved.

Using a microphone array to capture the far-field speech and then applying acoustic array processing is an extremely effective way to reduce the total system error rate. When the signals from the channels of a microphone array are appropriately combined, the array acts as a spatial filter, emphasizing signals arriving from a desired direction and suppressing others. This selective spatial sensitivity is summarized in a beampattern, a plot (usually in decibels) of the sensitivity of the array as a function of the angle of incidence (usually expressed as a direction cosine). Here is the beampattern of a minimum variance distortionless response (MVDR) beamformer in the presence of a source of coherent interference.

As is apparent from the plot, the beamformer maintains unity gain (0 dB) in the look direction (i.e., direction of the desired source), which is straight ahead in this case, while placing a deep null in the direction of the coherent interference.
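
For the curious, here is a minimal numerical sketch of how such a beampattern can be computed, assuming a hypothetical eight-element uniform linear array with half-wavelength spacing, a desired source at broadside, and a single coherent interferer at 30 degrees; the geometry, diagonal loading, and angles are illustrative choices rather than those behind the plot described above.

    import numpy as np

    n_mics = 8
    d_over_lambda = 0.5                 # element spacing in wavelengths
    diag_load = 1e-3                    # diagonal loading for numerical stability

    def steering(u):
        """Array manifold vector for direction cosine u = sin(theta)."""
        n = np.arange(n_mics)
        return np.exp(-2j * np.pi * d_over_lambda * n * u)

    u_look = 0.0                              # desired source at broadside
    u_interf = np.sin(np.radians(30.0))       # coherent interferer at 30 degrees
    d = steering(u_look)

    # Spatial covariance matrix: coherent interferer plus a little sensor noise.
    v = steering(u_interf)
    R = np.outer(v, v.conj()) + diag_load * np.eye(n_mics)

    # MVDR weights w = R^{-1} d / (d^H R^{-1} d): unity gain in the look direction,
    # minimum output power (hence a deep null toward the interferer) elsewhere.
    Rinv_d = np.linalg.solve(R, d)
    w = Rinv_d / (d.conj() @ Rinv_d)

    # Beampattern in dB as a function of direction cosine.
    u_grid = np.linspace(-1.0, 1.0, 501)
    pattern = np.array([w.conj() @ steering(u) for u in u_grid])
    pattern_db = 20.0 * np.log10(np.abs(pattern) + 1e-12)
    print("gain in look direction (dB): %.2f" % pattern_db[np.argmin(np.abs(u_grid - u_look))])
    print("gain toward interferer (dB): %.2f" % pattern_db[np.argmin(np.abs(u_grid - u_interf))])

Evaluating pattern_db over u_grid reproduces the qualitative behavior described above: 0 dB in the look direction and a deep null toward the coherent interferer.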

Human speech, whether considered in the time or the frequency domain, has very non-Gaussian statistical characteristics: while Gaussian random variables are characterized by frequent but small deviations from their means, super-Gaussian random variables such as speech are characterized by large but infrequent deviations. These super-Gaussian characteristics approach the Gaussian, however, whenever clean speech is corrupted with noise, reverberation, or the speech of another speaker; this follows directly from the central limit theorem. Kurtosis and negentropy are statistical measures of this deviation from Gaussianity.
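
As a quick numerical illustration of this effect, the sketch below uses a Laplacian signal as a stand-in for clean speech (purely an assumption for illustration) and shows how its excess kurtosis falls toward the Gaussian value of zero when Gaussian noise is added or when many independent sources are summed.

    import numpy as np
    from scipy.stats import kurtosis

    rng = np.random.default_rng(1)
    n = 100_000

    clean = rng.laplace(size=n)                  # super-Gaussian stand-in for clean speech
    noisy = clean + 2.0 * rng.normal(size=n)     # the same signal corrupted with Gaussian noise
    mixture = sum(rng.laplace(size=n) for _ in range(20))   # many independent sources summed

    for name, sig in [("clean", clean), ("noisy", noisy), ("sum of 20 sources", mixture)]:
        print(f"{name:>18s}: excess kurtosis = {kurtosis(sig):5.2f}")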

By designing beamforming algorithms that restore the statistical characteristics of the original clean speech, the detrimental effects of noise and reverberation can be removed. This is the principle of operation of the maximum negentropy and maximum kurtosis beamformers, both of which were originally proposed by my students and me.
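
For intuition only, here is a toy time-domain sketch of the underlying idea on synthetic data: adjust the array weights numerically so that the kurtosis of the beamformer output is as large as possible. The assumed Laplacian source, Gaussian noise, and mixing vector are illustrative; refinements of the actual beamformers, such as subband-domain processing and a distortionless constraint toward the desired source, are omitted here.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import kurtosis

    rng = np.random.default_rng(0)
    n_channels, n_samples = 4, 8000

    # Hypothetical mixing: one super-Gaussian (Laplacian) source plus Gaussian sensor noise.
    source = rng.laplace(size=n_samples)
    mixing = rng.normal(size=(n_channels, 1))
    x = mixing @ source[None, :] + 0.5 * rng.normal(size=(n_channels, n_samples))

    def neg_kurtosis(w):
        y = w @ x                      # beamformer output y[t] = w^T x[t]
        return -kurtosis(y)            # maximize excess kurtosis of the output

    # Unit-norm constraint fixes the scale of the weights (kurtosis is scale invariant).
    constraint = {"type": "eq", "fun": lambda w: w @ w - 1.0}
    w0 = np.ones(n_channels) / np.sqrt(n_channels)
    result = minimize(neg_kurtosis, w0, constraints=[constraint])

    print("output kurtosis, uniform weights:   %.2f" % -neg_kurtosis(w0))
    print("output kurtosis after optimization: %.2f" % -neg_kurtosis(result.x))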

Here is a table of word error rate results for a simple DSR task in which a young child, aged 4 to 6, repeats a simple phrase spoken by an adult experimenter. The best beamforming result is obtained with our maximum kurtosis beamforming algorithm. Click on the loudspeaker to hear the audio; for the audio samples the beamformer is pointed at the child, so her voice is much clearer than that of the adult experimenter. In generating the DSR results, the beamformer was pointed at either the experimenter or the child, depending on whose speech was to be recognized.
Word Error Rate (%)

Beamforming Algorithm       Experimenter   Children
One Channel of Array            3.4          14.2
Delay-and-Sum                   2.2           7.6
Superdirective                  2.1           6.5
Maximum Kurtosis                0.6           5.3
Close-Talking Microphone        1.9           4.2
The error rate is always much higher for the children than for the adult experimenter due to the children's relatively poor pronunciation and high-pitched voices. Beamforming techniques can also be used for speech separation when two or more speakers are simultaneously active.

An interesting new field of research is the application of spherical microphone arrays to soundfield analysis. Here is a 32-channel spherical array made by mh acoustics.
[Image: Eigenmike spherical array]
Spherical arrays have several attractive properties compared to conventional microphone arrays, including an invariance to the direction in which the beam is pointed. Typically, the complete sound field is decomposed into a set of spherical harmonics. The first four radially symmetric spherical harmonics look like this.
[Image: The first four radially symmetric spherical harmonics]
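
As a small sketch, these first four harmonics, the degree-zero terms Y_n^0 for n = 0 through 3, can be evaluated with SciPy; the sample angles below are arbitrary and chosen only for the printout.

    import numpy as np
    from scipy.special import sph_harm

    polar = np.linspace(0.0, np.pi, 5)   # colatitude samples from pole to pole
    azimuth = 0.0                        # the m = 0 harmonics do not depend on azimuth

    # Note SciPy's convention: sph_harm(m, n, azimuthal angle, polar angle).
    for n in range(4):
        values = sph_harm(0, n, azimuth, polar).real
        print(f"Y_{n}^0 at colatitudes {np.degrees(polar).astype(int)} deg: {np.round(values, 3)}")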
My colleagues and I are currently comparing the effectiveness of conventional versus spherical microphone arrays. At the moment, the latter are still relatively expensive; hence we would like to know if this expense is justified in terms of DSR performance. I'll post our initial results as soon as we have them.

Bio-Sketch

I have been involved in automatic speech recognition (ASR) research since January of 1993, when I joined the speech department at BBN in Cambridge, MA. At BBN I worked for Herb Gish on the Wordspotting and later the Switchboard Projects, both of which were sponsored by the National Security Agency. I was an important contributor to the development of the systems that posted the lowest overall word error rates on the 1995 and 1996 Switchboard evaluations. In the Fall of 1997, I entered graduate school at the Johns Hopkins University (JHU) Center for Language and Speech Processing (CLSP); my advisor was Prof. Fred Jelinek. In May of 2000, I was awarded a Ph.D. for my work in speaker adaptation.

From January 2000 until December 2006, I worked at the University of Karlsruhe (UKA) in Karlsruhe, Germany as a researcher and lecturer. While at UKA, I became fluent in German solely by reading Harry Potter novels in translation and was eventually able to deliver the lecture Man-Machine Speech Communication entirely in German. I also developed and taught the lecture Microphone Arrays: Gateway to Hands-Free Speech Recognition. At Karlsruhe, I supervised research in all speech and audio technologies for the EU project CHIL, Computers in the Human Interaction Loop; this included audio-visual speaker tracking, beamforming, hidden Markov model (HMM) parameter estimation, and speech recognition. I led UKA's participation in the audio segments of the CHIL technology evaluations, which in later years included coordinating the evaluation task definitions with the US National Institute of Standards and Technology (NIST) as well as with top academic and industrial ASR research sites.

In February 2007 I began work at Saarland University (SU) in Saarbruecken, Germany, where I continued to conduct research and lecture on topics related to speech recognition. While at SU, I developed and taught the new course offerings Distant Speech Recognition and Weighted Finite-State Transducers in Speech and Natural Language Processing. At SU, I was principal author of the successful EU project proposal SCALE, Speech Communication with Adaptive Learning, which eventually supported 12 Ph.D. candidates and two postdocs at six academic sites throughout Europe. My team posted the best results in the PASCAL Speech Separation Challenge II. Along with a colleague from UKA, I also completed a book entitled Distant Speech Recognition, which was published by Wiley in April 2009. In the course of my career, I have implemented nearly all of the algorithms described in that book, including algorithms for speaker adaptation, speaker tracking, HMM training, manipulation and optimization of weighted finite-state transducers (WFSTs), Bayesian filtering, beamforming, and ASR word hypothesis search.

Beginning in February 2010, I spent a year at Disney Research, Pittsburgh (DRP), founding a research effort in distant speech recognition. In March 2010, one of my graduate students from SU and I presented a tutorial on distant speech recognition at ICASSP. I have also served as an invited speaker at several international conferences. Since January of 2011, I have been a visiting scientist at the Language Technologies Institute (LTI) at Carnegie Mellon University, where I continue to conduct research in distant speech recognition while collaborating with colleagues both at the LTI and at DRP. I served as publications chair of the 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics.

Publications (by topic)

  1. Books and Tutorials    Links and Slides
    These publications describe recent work on the general topic of distant speech recognition. They include a book I co-authored with Matthias Woelfel as well as a book chapter that I am currently preparing with Kenichi Kumatani, both with Wiley. There are also the slides from a tutorial I presented together with Matthias Woelfel and Friedrich Faubel at ICASSP 2010, as well as other tutorials from the first SCALE winter school.

  2. Distant Speech Recognition    Papers
    These papers describe my work together with colleagues in distant speech recognition.

  3. Acoustic Array Processing    Papers
    These papers describe our work in acoustic array processing; particularly well represented is the rather prodigious body of work we have produced using non-Gaussian assumptions, and techniques based thereon, for beamforming prior to distant speech recognition.

  4. Weighted Finite-State Transducers    Papers
    These papers describe the work of my colleagues and me in applying the formalism of weighted finite-state transducers to speech recognition.

  5. Speaker Adaptation    Papers
    These papers describe work, dating back to my doctoral thesis and earlier, on speaker adaptation for automatic speech recognition.

  6. Discriminative HMM Training    Papers
    These papers describe my work on discriminative HMM training, and the combination thereof with speaker adaptation techniques.

  7. Robust Feature Extraction    Papers
    These papers describe the work I published with my former students on robust feature extraction.

  8. Person and Speaker Tracking    Papers
    These publications describe my work on person and speaker tracking, which is a necessary first step prior to acoustic beamforming.


Web Sites