Unsupervised and semi-supervised learning: the Socratic approach

Our approach is to completely separate knowledge and pattern recognition in the regular data space from knowledge about knowledge.

Let **P** be a pattern classifier that has a very low error rate on some kinds of data but has a much higher error rate on other kinds of data.

Changing the design of **P** or giving it more training data are examples of trying to improve the performance of **P** as a classification problem in the regular data space.

Trying to decide whether **P** is reliable on a given new data item is a problem in knowledge about knowledge. We could build a separate classifier **S** trained to make this decision. To train **S**, we don't tell it the correct answer on a given data item, just whether or not **P** was correct.
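As a minimal sketch of this training setup, the Python below builds the targets for the second classifier from nothing but whether **P** was right. Everything in it is an assumption made for illustration: the one-feature threshold classifier `p_classify`, the toy data, and the choice of a per-bin accuracy table as the second classifier. The text does not prescribe any particular model.

```python
from typing import List, Sequence


def p_classify(x: float) -> str:
    """Hypothetical base classifier P: a fixed threshold on one feature."""
    return "a" if x < 0.5 else "b"


def correctness_targets(items: Sequence[float], labels: Sequence[str]) -> List[int]:
    """Training targets for S: 1 if P was correct, 0 if not.

    The true label is used only to score P; it is not passed on to S.
    """
    return [int(p_classify(x) == y) for x, y in zip(items, labels)]


def train_s(items: Sequence[float], correct: Sequence[int], n_bins: int = 4) -> List[float]:
    """Hypothetical S: fraction of items in each bin of [0, 1) on which P was correct."""
    hits = [0] * n_bins
    counts = [0] * n_bins
    for x, c in zip(items, correct):
        b = min(int(x * n_bins), n_bins - 1)
        hits[b] += c
        counts[b] += 1
    # Bins with no data fall back to 0.5 (no evidence either way).
    return [h / n if n else 0.5 for h, n in zip(hits, counts)]


def s_says_reliable(model: Sequence[float], x: float, threshold: float = 0.5) -> bool:
    """S's decision: is P likely to be correct on a new item x?"""
    b = min(int(x * len(model)), len(model) - 1)
    return model[b] >= threshold


# Toy data: P is always right below 0.5 and mostly wrong above it.
items = [i / 20 for i in range(20)]
labels = [
    "a" if x < 0.5 else ("b" if i % 4 == 0 else "c")
    for i, x in enumerate(items)
]

targets = correctness_targets(items, labels)
model = train_s(items, targets)
```

Here `s_says_reliable(model, 0.1)` reports that **P** can be trusted and `s_says_reliable(model, 0.9)` reports that it cannot, although the table in `model` stores no labels at all.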

Suppose that a given region of space has many different labels somewhat mixed together. Because there are no significant gaps in the data with different labels, the cluster assumption is false. If the probability density functions are nearly constant, then this is just a region with an inevitably high error rate. Suppose, however, that the probability density functions vary substantially. Then there will be some subregions in which one label dominates and other subregions in which a different label dominates. So an optimum classifier might be able to do very well on this data. The cluster assumption being false merely means that many learning algorithms will have difficulty finding the optimum classifier. It doesn't mean that the optimum classifier will have a high error rate.

Unfortunately, rapidly varying probability density functions violate the **smoothness assumption**, which is even more widely used in semi-supervised training than the cluster assumption.

On the other hand, we can imagine that the performance of the classifier **P** is consistent across this region (whether it be good, poor or mediocre). Its answer will vary between the subregions, but its average error rate could be consistent. In particular, such a situation could arise if **P** uses a feature that has good discrimination power in this region. For example, **P** might be particularly good at recognizing fricatives but not so good at recognizing vowels.

Now comes the benefit of being able to consider the Socratic space instead of the regular data space. The probability of **P** making an error varies slowly. That is, the smoothness assumption is true in the Socratic space even though it is not true in the regular data space. Therefore a second classifier, a Socratic agent, could learn to recognize that **P** does well in the fricative region of space without needing to know which fricative is the correct label for a given data item.
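The contrast between the two spaces can be made concrete with a small synthetic example. The sketch below is an illustration under invented assumptions, not data from any real recognizer: a one-dimensional region is cut into ten subregions, the dominant label alternates between two hypothetical fricatives `"s"` and `"f"` from one subregion to the next, and **P** makes exactly one error per subregion. The helper names `bin_means` and `max_adjacent_jump` are likewise hypothetical.

```python
from typing import List, Sequence

BIN_WIDTH = 10  # each subregion covers 10 integer positions


def bin_means(positions: Sequence[int], values: Sequence[int], n_bins: int) -> List[float]:
    """Average of `values` within each subregion."""
    sums = [0] * n_bins
    counts = [0] * n_bins
    for pos, v in zip(positions, values):
        b = pos // BIN_WIDTH
        sums[b] += v
        counts[b] += 1
    return [s / c for s, c in zip(sums, counts)]


def max_adjacent_jump(means: Sequence[float]) -> float:
    """Largest change between neighbouring subregions: a crude roughness measure."""
    return max(abs(a - b) for a, b in zip(means, means[1:]))


positions = list(range(100))

# Regular data space: the true label flips in every subregion.
true_labels = ["s" if (p // BIN_WIDTH) % 2 == 0 else "f" for p in positions]

# P follows the flips but errs on one item per subregion (synthetic assumption).
p_answers = [
    ("f" if y == "s" else "s") if p % BIN_WIDTH == 7 else y
    for p, y in zip(positions, true_labels)
]

is_s = [int(y == "s") for y in true_labels]                        # regular data space
p_correct = [int(a == y) for a, y in zip(p_answers, true_labels)]  # Socratic space

label_jump = max_adjacent_jump(bin_means(positions, is_s, 10))
socratic_jump = max_adjacent_jump(bin_means(positions, p_correct, 10))
```

In this toy setting `label_jump` is 1.0, because the fraction of `"s"` labels swings from one extreme to the other between neighbouring subregions, while `socratic_jump` is 0.0, because the accuracy of **P** is 0.9 everywhere. A learner that assumes smoothness would fail on the first quantity and succeed on the second.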

As another example of this kind of thinking, consider the dilemma that you pose: *How would a classifier know that it doesn't know?* Stated that way, it is essentially just the standard question of conventional confidence estimation. Other than the classifier's own estimate of the a posteriori probability, improved confidence estimation comes from using extra features. The dilemma can be re-posed: *If the extra features give a better confidence estimate, why can't we use them to get better recognition in the first place?*

Conventional confidence estimation does not have the concepts and language to talk comfortably about this dilemma. Some of the new features might, in fact, improve the original recognition; they just haven't been implemented or they don't fit the existing training paradigm. Other confidence estimation features are fundamentally different, but the conventional point of view makes them confusing to talk about.

The Socratic knowledge viewpoint makes it much easier to understand this situation and makes it easy to think of much better confidence estimation features and learning procedures. Measurements of the behavior of the classifier are in the Socratic space. Pattern recognition in the Socratic space can be used to estimate the average error rate of the original classifier in a region without being able to help correct an error on a given item.

When there are two or more classifiers, there is even more opportunity to take advantage of the point of view from the Socratic space. Delayed-decision testing, or statistical validation, compares the performance of two classifiers, so it is asking a question in the Socratic space rather than in the original space. That is part of the reason that it can do something that would seem to be impossible from the conventional point of view: it can replace conventional development testing, which always requires set-aside labeled training data, with a statistical test on unlabeled data.
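The text does not spell out the delayed-decision test itself, so the sketch below covers only the part that clearly needs no labels: finding the items on which two classifiers disagree. On each such item at least one of the two must be wrong, so the disagreement rate cannot exceed the sum of their error rates. The two classifiers and the data are hypothetical stand-ins.

```python
from typing import Callable, List, Sequence, Tuple


def disagreements(
    classify_a: Callable[[int], str],
    classify_b: Callable[[int], str],
    unlabeled: Sequence[int],
) -> List[Tuple[int, str, str]]:
    """Return (item, answer_a, answer_b) wherever the two answers differ.

    No true labels are consulted at any point.
    """
    found = []
    for x in unlabeled:
        a, b = classify_a(x), classify_b(x)
        if a != b:
            found.append((x, a, b))
    return found


def classifier_a(x: int) -> str:
    """Hypothetical classifier A."""
    return "even" if x % 2 == 0 else "odd"


def classifier_b(x: int) -> str:
    """Hypothetical classifier B: like A, but confused by multiples of 5."""
    return "even" if x % 2 == 0 or x % 5 == 0 else "odd"


unlabeled = list(range(20))
diff = disagreements(classifier_a, classifier_b, unlabeled)
disagreement_rate = len(diff) / len(unlabeled)
```

The disagreement items are the ones that carry information about the relative performance of the two classifiers; how to decide which classifier was right on them, without a labeled development set, is the question the statistical test has to answer.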

Regularization improves recognition performance, but it is not asking a question in the Socratic space. It is addressing a problem in the regular data space: overfitting a limited amount of training data.

A wide variety of methodologies have been used or proposed for unsupervised or semi-supervised learning. They change the definition of what it means to be a *cluster* and therefore change the meaning of the *cluster assumption.* Generally they will work better on data that satisfies the modified assumption and less well on data that does not. However, all of these are variants in the regular data space. They represent modifications of the assumptions about the data. None of them tries to measure or acquire *knowledge about knowledge.*

## People

James Karl Baker

Rita Singh