SenseBank

The major research activity for this work was to investigate and produce a “semantic coherence” dataset, a corpus of English-like utterances that range from grammatical, meaningful sentences to ungrammatical and nonsensical ones. The purpose of this dataset is twofold: to provide “negative evidence” for training statistical language models (which can be used in applications like automatic speech recognition and machine translation), and to facilitate the linguistic study of grammaticality and sensicality.

Statistical language models are part of the foundation of technologies like automatic speech recognition and machine translation. Nearly all such models are trained exclusively from positive examples of language. We believe that these models could be improved by also providing negative examples of language (e.g. ungrammatical and nonsensical sentences). That is, the model is trained not only to recognize well-formed English, but also to be averse to ill-formed or nonsensical language. Some linguists and psychologists have argued that children use negative feedback when learning language (e.g. Marcus, 1993). Whether or not children learn from negative examples has long been debated among linguists, but to our knowledge this question has not been well explored in computational language learning.

We hope that this dataset may also be valuable to linguists and psychologists. Linguists often discretize grammaticality, using grammaticality tests, into grammatical, semi-grammatical, and ungrammatical (the latter two designated by * and # in linguistic data). Our findings thus far suggest that these categories are not so well-defined. Rather, there seems to be a continuum of sensicality and grammaticality. We hope that this dataset may shed some light on the characteristics and factors that come into play when ordinary people (as opposed to linguists) attempt to judge linguistic plausibility.


Publications

Benjamin Lambert, Rita Singh, and Bhiksha Raj. "Creating a semantic coherence dataset with non-expert annotators." Interspeech, 2010.

Benjamin Lambert, Rita Singh, and Bhiksha Raj. "Sensebank: A Corpus of Sensible and Nonsense English." In submission to ICASSP, 2011.


Data

Pilot study dataset (Interspeech, 2010): (TAR.GZ)

Initial 1k sample data (ICASSP, 2011): (TAR.GZ) (ZIP)