- Structure Discovery In this work we attempt to automatically determine the minimal structural units in structured sounds such as human speech or music. Such units include phonemes for speech, and notes and chords for music. Our aim is to identify these units automatically, through analysis of data. We treat the problem as one of entropy minimization: the minimum-entropy estimation principle assumes a structured universe, and dictates estimating the most predictable model that the observations used to train it will allow. For the current problem, this amounts to identifying a set of units such that all examples of a sound class can be represented by a minimum-perplexity network over them.
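The entropy-minimization criterion can be made concrete with a toy sketch. Assuming (hypothetically) that candidate unit inventories are compared by the entropy rate of a bigram model estimated over the resulting unit sequences, a more predictable segmentation scores lower; the data and inventories below are illustrative, not from the actual system.

```python
import math
from collections import Counter

def bigram_entropy(units):
    """Entropy rate (bits per unit) of a bigram model estimated from a unit sequence."""
    pairs = list(zip(units, units[1:]))
    pair_counts = Counter(pairs)
    ctx_counts = Counter(units[:-1])
    h = 0.0
    n = len(pairs)
    for (a, b), c in pair_counts.items():
        p_pair = c / n               # joint probability of the bigram (a, b)
        p_cond = c / ctx_counts[a]   # conditional probability P(b | a)
        h -= p_pair * math.log2(p_cond)
    return h

# Two hypothetical segmentations of the same sound into units:
coarse = list("abababab")       # highly predictable unit sequence
fine   = list("abcdabcdbadc")   # less predictable unit sequence

# The minimum-entropy principle would prefer the inventory with lower entropy rate.
print(bigram_entropy(coarse) < bigram_entropy(fine))
```

Perplexity is simply 2 raised to this entropy rate, so a minimum-perplexity network over the units corresponds to the minimum-entropy inventory under this criterion.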
- Linear-in-N Language Modeling The linguistic foundation of all large-vocabulary speech recognition systems is an N-gram language model, which conditions the probability of each word on the preceding N-1 words. This requires learning probability distributions whose number of parameters grows exponentially with N, requiring computational devices such as backoff and smoothing to keep the model manageable. We are developing an alternative mechanism for representing N-gram models in which the number of parameters grows only linearly with N.
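The contrast in parameter growth can be illustrated numerically. A full N-gram table needs one entry per (history, word) combination, i.e. V^N entries for vocabulary size V. As a stand-in for the linear-in-N idea, the sketch below counts parameters for a factorization with one pairwise table per history distance; this factorization is an assumption for illustration, not the authors' actual mechanism.

```python
def full_ngram_params(V, N):
    # One entry per (history, word) combination: V^(N-1) histories x V next words.
    return V ** N

def linear_params(V, N):
    # Illustrative linear-in-N scheme: one V x V pairwise table per history
    # distance d = 1 .. N-1 (an assumed factorization, not the actual method).
    return (N - 1) * V * V

V = 10_000  # hypothetical vocabulary size
for N in (2, 3, 4):
    print(f"N={N}: full={full_ngram_params(V, N):.3e}  linear={linear_params(V, N):.3e}")
```

Already at N=4 the full table would need 10^16 parameters while the linear scheme needs 3x10^8, which is why backoff and smoothing are unavoidable for conventional N-gram models.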
- Recognition with multiple examples In the standard pattern classification literature, given multiple observations known to be from the same class, classification is performed by maximizing the product of the class-conditional probabilities of the individual observations, weighted appropriately by the class priors. However, when the classifier is a speech recognizer and the multiple observations are repetitions of a sentence, this simple solution becomes infeasible: the classes are now word sequences, the set of all classes is non-enumerably infinite, and the class-conditional probabilities of each repetition cannot be computed and multiplied. Standard solutions to this problem are approximations that recognize each recording independently and then vote. We have devised a dynamic programming solution that handles this problem through the introduction of a latent alignment variable. A further generalization of this framework would allow speakers to repeat only parts of a sentence, a phenomenon that is very common in conversational speech.
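The standard product rule for an enumerable class set can be sketched as follows; the Gaussian likelihoods, class names, and priors below are a toy assumption, chosen only to show the scoring scheme that becomes infeasible when the classes are word sequences.

```python
import math

def classify(observations, classes, log_likelihood, log_prior):
    """Pick the class maximizing log prior + sum of per-observation log-likelihoods,
    assuming the observations are conditionally independent given the class."""
    best, best_score = None, -math.inf
    for c in classes:
        score = log_prior(c) + sum(log_likelihood(x, c) for x in observations)
        if score > best_score:
            best, best_score = c, score
    return best

# Toy example: two classes modeled as unit-variance Gaussians (hypothetical).
means = {"A": 0.0, "B": 5.0}
log_lik = lambda x, c: -0.5 * (x - means[c]) ** 2   # Gaussian log-likelihood up to a constant
log_prior = lambda c: math.log(0.5)                  # uniform prior
obs = [4.2, 5.1, 4.8]                                # three repetitions from the same class

print(classify(obs, ["A", "B"], log_lik, log_prior))  # prints "B"
```

The enumeration over `classes` is exactly what breaks down for a recognizer, where the class set is all word sequences; the dynamic programming solution with a latent alignment variable avoids that enumeration.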