Wednesday, July 9, 2008

Learning paradox

Consider the standard PAC learning model. We have a concept class C over an instance space X, as well as a probability distribution P on X. After observing a labeled sample of size m and finding an f in C consistent with this sample, we know that with high probability the generalization error of f is bounded by something like

(*) sqrt(d / m) + confidence term,

where d is the VC-dimension of the concept class C [the confidence term isn't the interesting part here -- it appears in all generalization bounds of this type, and goes to zero as 1/sqrt(m)]. Intuitively, this bound says that if we managed to explain the observations by picking a model from a "simple" class, we can be confident that our predictions will be good. Explanations by more complex models are less informative. In the trivial case, C = 2^X and VC-dim(C) = |X| -- so achieving zero sample error tells us nothing about generalization error.
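Here is a tiny numerical look at how a bound of the form (*) behaves (the sample size, the confidence parameter, and the use of sqrt(ln(1/delta)/m) as a stand-in for the confidence term are illustrative choices of mine, not the exact constants of any particular theorem):

```python
# For a fixed sample size, the bound loosens as the VC-dimension grows,
# and becomes vacuous once d is comparable to m.
import math

m, delta = 1000, 0.05                          # sample size and confidence parameter (assumed)
confidence_term = math.sqrt(math.log(1 / delta) / m)

for d in (1, 10, 100, 1000):                   # VC-dimensions of increasingly rich classes
    bound = math.sqrt(d / m) + confidence_term
    print(f"d = {d:>4}:  bound ~ {bound:.3f}")

# d =    1:  bound ~ 0.086
# d =   10:  bound ~ 0.155
# d =  100:  bound ~ 0.371
# d = 1000:  bound ~ 1.055
```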

However, one might reason as follows. Suppose I receive a labeled sample and find a classifier f in C that achieves zero sample error. Now what's stopping me from re-defining my concept class? I'll define a new concept class C' = {f} that contains a single concept -- f! Its VC-dimension is trivially zero. As far as my prediction is concerned, it doesn't matter whether I learned using C or C' -- in either case, I'll pick the classifier f. But the VC-dimension of C might be large, giving me a poor generalization guarantee (*), while VC-dim(C') = 0, which is as good as things can be!
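The tension can be made concrete with a small simulation (the numbers, the uniform distribution over a finite X, and the "memorize and default to 0" learner are all my own illustrative assumptions): take the target concept to be a uniformly random labeling, which certainly lies in C = 2^X, fit it perfectly on the sample, and compare the two bounds with the error on unseen points.

```python
# Sketch of the scenario: a learner that memorizes the sample achieves zero
# sample error, so the post-hoc class C' = {f} has VC-dimension 0.
import math
import random

random.seed(0)

N, m, delta = 10_000, 200, 0.05                       # |X|, sample size, confidence (assumed)
truth = {x: random.randint(0, 1) for x in range(N)}   # target concept in C = 2^X

sample = random.sample(range(N), m)                   # observed instances
f = {x: truth[x] for x in sample}                     # f: memorize the sample

def predict(x):
    return f.get(x, 0)                                # predict 0 on unseen points

sample_error = sum(predict(x) != truth[x] for x in sample) / m    # 0 by construction
true_error   = sum(predict(x) != truth[x] for x in range(N)) / N  # ~0.49

conf = math.sqrt(math.log(1 / delta) / m)             # stand-in confidence term, ~0.12
bound_C      = math.sqrt(N / m) + conf                # d = VC-dim(C)  = |X|: ~7.2, vacuous
bound_Cprime = math.sqrt(0 / m) + conf                # d = VC-dim(C') = 0:   ~0.12

print(sample_error, true_error, bound_C, bound_Cprime)
```

With these numbers, the bound computed from C' promises error around 0.12, while the error on unseen points is close to one half.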

What's wrong with this reasoning?