Sunday, December 24, 2006

Large deviations vs. measure concentration

Lately I've been seeing two distinct phenomena conflated: Large Deviation Principles (LDP) and Concentration of Measure (COM). Good references on the LDP include Large Deviations by F. Den Hollander and Large Deviations Techniques and Applications by Dembo and Zeitouni. I also heartily recommend Cosma Shalizi's Lecture notes (lectures 30-35). To read up on COM, a good place to start is Gábor Lugosi's "Concentration-of-measure inequalities", or Ledoux's book for a more abstract, in-depth treatment.

Given the abundance of rigorous treatments of LDP and COM referenced above, I'll focus instead on the "intuitive flavor" of the two approaches. First, the similarities. Both LDP and COM allow one to prove Laws of Large Numbers (LLN). For our handwavy purposes, a LLN is any claim asserting the convergence (weak, almost sure, in probability) of a sequence of random variables to their common expected value.

Quoting Dembo and Zeitouni (p. 4):

The large deviation principle [...] characterizes the limiting
behavior, as $\epsilon \to 0$, of a family of probability measures
$\{\mu_\epsilon\}$ on $(\mathcal{X}, \mathcal{B})$, in terms of a rate function.

Letting $\hat{S}_n = \frac{1}{n}\sum_{i=1}^n X_i$ be the empirical mean of $n$ independent, one-dimensional standard Gaussian random variables, Dembo and Zeitouni give a simple calculation (1.1.3) showing that

$$\lim_{n\to\infty} \frac{1}{n} \log P\big(|\hat{S}_n| \ge \delta\big) = -\frac{\delta^2}{2},$$
which they summarize as

The "typical value" of is [...] of the order ,
but with small probability (of the order of ),
takes relatively large values.
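The calculation behind this limit is short; here is a sketch (using the standard Gaussian tail estimates, which sandwich the integral up to polynomial factors):

```latex
% \hat{S}_n is Gaussian with mean 0 and variance 1/n, so with the
% substitution t = \delta\sqrt{n} and the standard tail bounds
%   \frac{t}{1+t^2}\, e^{-t^2/2} \;\le\; \int_t^\infty e^{-x^2/2}\,dx
%     \;\le\; \frac{1}{t}\, e^{-t^2/2},
% we get
\[
  P\big(\hat{S}_n \ge \delta\big)
    = \frac{1}{\sqrt{2\pi}} \int_{\delta\sqrt{n}}^{\infty} e^{-x^2/2}\,dx
    = e^{-n\delta^2/2 \,+\, o(n)},
\]
% and therefore
\[
  \lim_{n\to\infty} \frac{1}{n} \log P\big(\hat{S}_n \ge \delta\big)
    = -\frac{\delta^2}{2}.
\]
```

The polynomial prefactors disappear in the limit because we take $\frac{1}{n}\log(\cdot)$; only the exponential rate survives, which is exactly what the rate function records.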

Note that this result, on its own, does not tell you how large a sample you need (i.e., how big to make $n$) to ensure, with probability
$1-\epsilon$, that $|\hat{S}_n| \le \delta$.

To get such a finite-sample guarantee (as opposed to an asymptotic limit), we need a COM result. As a slight technicality, let us truncate those Gaussians to the interval $[-B, B]$ and renormalize (anyone who suspects foul play may be referred to Samuel Kutin's extension of McDiarmid's inequality). Then a simple application of Hoeffding's bound gives

$$P\big(|\hat{S}_n| \ge \delta\big) \le 2\exp\!\left(-\frac{n\delta^2}{2B^2}\right).$$
Note that the asymptotics match those of the LDP, but we can guarantee that for any

$$n \ge \frac{2B^2}{\delta^2} \log\frac{2}{\epsilon},$$

with probability at least $1-\epsilon$, the empirical mean will not deviate from the expected value by more than $\delta$. Note, however, that we do not have a rate of convergence (in probability) of a random variable to its mean; all we have is an upper bound on the deviation probability.

I hope these two examples illustrate the different flavors of the two approaches. COM tends to give sharper upper bounds on deviation probabilities. These have the advantage of holding for all (except at most finitely many) values of $n$, and the disadvantage of providing next to no information on lower bounds. LDP gives the asymptotic rate of convergence (both upper and lower bounds), but does not (at least without some additional work) yield finite-sample bounds.

It is little wonder, then, that COM has found many more applications than LDP in computer science (e.g., randomized algorithms) and statistical machine learning. A user wants to know how long to run an algorithm to achieve a specified precision at a given confidence level; order-of-magnitude information about "typical" values is usually insufficient.

Having given an informal "definition" of LDP, let me close with an equally informal discussion of COM, this time, quoting myself:

A common way of summarizing the phenomenon is to say that in a high-dimensional space, almost all of the probability is concentrated around any set whose measure is at least $1/2$. Another way is to say that any "sufficiently continuous" function is tightly concentrated about its mean.

I realize this isn't terribly informative (or at all original), but the linked paper does attempt a crude survey (with an emphasis on passing from independent sequences to those with weak dependence). A word of caution for those attempting to visualize "measure concentration": imagining a narrowly peaked, one-dimensional distribution is tempting but wrong. This is a genuinely high-dimensional phenomenon. The only "visual" intuition I can offer is that the (Lebesgue) volume of the $n$-dimensional Euclidean unit ball goes to zero as $n \to \infty$, while the volume of the cube (in which the ball is tightly inscribed) blows up exponentially. This means that in high dimensions, most of a set's content is concentrated near its boundary -- something that's certainly not true in low dimensions. Anyone who's intrigued by this (that's everyone, right?) should read Milman's "Surprising geometric phenomena in high dimensional convexity theory".
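The ball-versus-cube picture is easy to check numerically; a small sketch, using the closed-form volume $\pi^{n/2}/\Gamma(n/2+1)$ of the unit ball in $\mathbb{R}^n$:

```python
import math

def unit_ball_volume(n):
    """Lebesgue volume of the n-dimensional Euclidean unit ball:
    pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

def cube_volume(n):
    """Volume of the cube [-1, 1]^n in which the unit ball is inscribed."""
    return 2.0 ** n

# The ball's volume itself tends to zero, while the cube's blows up,
# so the fraction of the cube occupied by the ball vanishes rapidly.
for n in (1, 2, 5, 10, 20, 50):
    ratio = unit_ball_volume(n) / cube_volume(n)
    print(f"n={n:2d}  ball={unit_ball_volume(n):.6g}  "
          f"cube={cube_volume(n):.6g}  ratio={ratio:.3g}")
```

Already by $n = 10$ the ball occupies less than one percent of the circumscribing cube; almost all of the cube's content sits in its "corners", near the boundary.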


Aryeh said...

By an eerie coincidence, I stumbled upon
shortly after making that post. Indeed, when high dimensionality refers to the ambient space of the data, it is a curse for curve-fitting (regression). In my post, "dimensionality" refers to sample size, and is not a curse but a boon.

Anonymous said...

My browser (Chrome) does not show your math formulas. Do I need to install a plug-in or something?