Wednesday, July 9, 2008

Learning paradox

Consider the standard PAC learning model. We have a concept class C over an instance space X, as well as a probability distrubution P on X. After observing a labeled sample of size m, and finding an f in C consistent with this sample, we will know that with high probability the generalization error of f is bounded by something like

(*) sqrt(d / m) + confidence term,

where d is the VC-dimension of the concept class C [the confidence term isn't the interesting part here -- it appears in all generalization bounds of this type, and goes to zero as 1/sqrt(m)]. Intuitively, this bound says that if we managed to explain the observations by picking a model from a "simple" class, we can be confident that our predictions will be good. Explanations by more complex models are less informative. In the trivial case, C= 2^X and VC-dim(C)=|X|
-- so achieving zero sample error tells us nothing about generalization error.

However, one might reason as follows. Suppose I receive a labeled sample and find a classifier f in C that achieves zero sample error. Now what's stopping me from re-defining my concept class? I'll define a new concept class C' = {f} that contains a single concept -- f! Its VC-dimension is trivially zero. As far as my prediction is concerned, it doesn't matter whether I learned using C or C' -- in either case, I'll pick the classifier f. But the VC-dimension of C might be large, giving me a poor generalization guarantee (*), while VC-dim(C') = 0, which is as good as things can be!

What's wrong with this reasoning?

3 comments:

Anonymous said...

For extra credit, consider the PAC-Bayesian bound of McAllester (1999), which says that if you start with a prior distribution P over concepts, and find a class U which fits your training data perfectly, the prior-weighted average over future instances will with high probability have an error rate bounded by -log(P(U))/m + [confidence term]. In other words, it looks like the PAC bound, only with log(1/P(U))/m instead of sqrt(d/m). Why not first find your set of consistent concepts and then retrospectively make up a prior which assigns it unit mass?

Anonymous said...

prior-weighted average

I meant the prior-weighted average performance of the class U.

Anonymous said...

The bound on the generalization error is a probabilistic one. Fix a function f in C and over multiple draws of the samples of size m, most of them will be good samples, i.e. a function achieving zero error on them will have a low generalization error.

learning a function from a sample and then changing the concept class/sample will not give you any bound.