Saturday, December 30, 2006
Kontorovich-Shalizi love-fest
On a tad more serious note, the statement "Leo and I regard each other's politics as unsound" is inaccurate -- at least in one direction. I actually find Cosma's politics to be rather reasonable -- refreshingly so, given some of the hair-raising monstrosities I've encountered in the pristine halls of academia. (No, you'll get nothing concrete out of me here -- I'll have to start a different blog, under a different name, to be able to repeat what I've heard come out of otherwise sane people's mouths.) The difference between politics and faith is that reasonable people can persuade each other through intellectual arguments, and recognize that there are issues where disagreement is perfectly acceptable. (Eating babies? No way. Gay marriage? Let's discuss.) I have no patience for people who confuse politics with religious faith, considering their opponents not simply mistaken, but truly evil. I've encountered my share of such folks (if your office door sports a poster equating Bush with Hitler, chances are, you're one of them). But I'm happy to report that Cosma is a well-read, erudite, easy-going fellow who it's been my pleasure to shoot the breeze with over beer and coffee. We may not agree on everything, but I wouldn't call his politics unsound.
What is math good for?
I consider myself an applied mathematician, and I take great pride and satisfaction in my work. My motivation to work on a problem is directly proportional to how useful people will find the result. Recently, Anthony Brockwell has been applying some of my rather theoretical results to the very practical and obviously important problem of decoding neural signals. Now, I never had this particular application in mind when proving the inequalities, but it was obvious to me that many people will find this stuff quite useful, and that certainly added to my motivation; it was very satisfying to see my work finding a use.
It's a common false dichotomy to distinguish "useless" math from "useful" engineering. I hate to repeat myself, so I refer the readers to the relevant discussion on Olivier's blog. If I may be forgiven the cliche: "There is nothing so practical as a good theory" -- so trite yet so true. I am opposed to snobbery between the different fields of human endeavor. I readily recognize the importance of bread-bakers, bridge-builders and code-crankers. Can we recognize, once and for all, that theoretical math is an important thing to study in and of itself? Need I remind you that the most basic tools in every engineer's chest (calculus, Fourier analysis) were once nothing but mathematical esoterica? Even that genteel queen of mathematics, number theory, has recently found herself rolling up her sleeves and taking her turn in the sordidly applied field of cryptography. Did Higman have any idea in 1952 that his result on partial orders would be used in 2005 to prove that a certain class of languages is linearly separable (and therefore learnable)?
Any way you approach math -- an intricate art, an indispensible tool, some combination of the two -- it unquestionably merits (nay, demands!) intense study.
Friday, December 29, 2006
FLAC project ideas
Below I am going to suggest some project ideas, based on problems that came up in my work. These range from "mathematical curiosities" to solid research problems to truly fundamental questions, probing the very foundations of computer science. I'll be adding to this list as time goes by, but here is a start.
- In 2003, I proved that the so-called n-gram uniquely decodable languages are regular. In Sec. 4.4 I conjectured that an efficient automaton construction is possible. Recently, such a construction was indeed given. Project: Read and understand the two papers. Implement the automaton described by Li and Xie (in matlab, for example). Attempt to answer other questions I raise in Sec. 4 (I'd be particularly happy to resolve 4.1).
- In a recent paper with Mehryar Mohri and Corinna Cortes, we present a novel approach to learning formal languages from labeled samples. Our technique consists of embedding the strings in a high-dimensional space and learning the language as a maximum-margin hyperplane. We prove that a rich family of languages (the piecewise testable ones) are linearly separable -- and therefore efficiently learnable -- under a certain efficiently computable embedding (kernel). One of the gaps we've been trying to close is a lower bound on the separating margin, in terms of the automaton complexity. All we can currently show is that it's strictly positive. Do not be daunted by the machine-learning concepts in this paper; I'll be happy to sit down with you and explain them all. A result of the type "margin is at least 1/poly(#states)" would be a very good thing, and would certainly make a solid, publishable paper.
- A natural extension of the work with Corinna and Mehryar is the question of what other language families can be linearly separated. It turns out there is a universal kernel that separates all the regular languages! This is one of those results that sounds deep (indeed, it eluded us for something like a year) but becomes trivial once you see the construction. The key quantity is the following auxiliary kernel. Fix a finite alphabet Sigma of size m, and let DFA(m,n) be the set of all deterministic finite-state automata on n states over this m-letter alphabet. There is no loss of generality in assuming that every DFA always starts in state 1. Then size(DFA(m,n)) = n^(mn)*2^n -- do you see why? Suppose I give you two strings, x and y over Sigma, and ask: how many A in DFA(m,n) accept both x and y? Call this quantity K_n(x,y). It may not be immediately obvious, but K_n(x,y) is a kernel of fundamental importance, for it can be easily adapted to render all the regular languages linearly separable. Unfortunately, we do not have an efficient algorithm for computing K_n(x,y), and in fact suspect that it's a computationally hard problem (a likely candidate for #P complete). Proving that this is indeed the case would be a big deal; providing an efficient algorithm for computing K_n(x,y) would be an even bigger deal. (We do have an efficient epsilon-approximation to K_n(x,y).) Again, I'm leaving out many details and a writeup/talk slides are forthcoming. I can't overstate the importance of this problem (according to a really famous computer scientist who'll go unnamed, it's even more important than I realize!). I'd be thrilled if a brave soul would attempt it.
Update: Jan. 26 2007:
More project ideas:
- Algebraic automata theory. A couple of students have expressed an interest in the deep and fascinating connection between automata, languages, and semigroups. I am going to mention a few places to start looking; there's plenty of material here for several projects (heck, for many PhD theses). (1) Check out the expository papers by J-E Pin and/or Mateescu & Salomaa (2) Take a look at the book Semirings, automata, languages by Kuich and Salomaa.
- Algorithms on strings. Computations on strings are at the core of natural language processing and computational biology. Take a look at an influential paper on string kernels, check out Dan Gusfield's book, or the excellent book by Sankoff and Kruskal.
The way I envisioned for this to work is that you do some independent reading, come to me with specific questions/ideas, and we together decide on what a reasonable project might be. I have a strong preference for independent research over passive summarization, but exceptions can be made for particularly challenging material.
UPDATE: Feb. 1, 2007: If we spoke about meeting, please email me with the time! Also, try to have a concrete idea by the time we meet...
Tuesday, December 26, 2006
Blogging from the road
I thought I'd explain my solution to getting tex to appear on the web page. I know that many perfectly sound solutions exist -- just google latex2html -- but this old dog would rather stick to the tricks he knows (matlab and latex) rather than learn new ones. I wrote a matlab script, which takes a .tex file, turns commands of type \link{url}{description} into the href business and {\it bla} into bla (and so on). It also extracts every math formula in the file, creates a temporary .tex file containing just that formula and calls the perl script textogif. It outputs the html code with the links to the formula gifs.
Is this the prettiest or most efficient way of doing things? Hell, no. But a quirk in my character makes it much easier for me to invent a solution from scratch rather than use an existing one. It's been a boon and a curse in my research. Here's that matlab script in case anyone is curious; I'd be happy to hear comments, suggestions, etc.
Sunday, December 24, 2006
A minor breakthrough
Now, how do I get rid of that annoying border?
Update: border issue solved (thanks to Jeff & Peter Nelson). I can now blog with the big boys...
Update II: My list of benefactors now includes Adam Schloss -- thanks to everyone for helping me figure out how to do this. The LDP vs COM post has been redone, and is now readable!
Deep mathematics (vs. the other kind)
If the answers to these were obvious to you before taking Real, then it's a real shame you didn't pursue analysis as a specialty. The year I'd graduated with a math degree from Princeton, I still made the shameful blunder of convincing myself that I'd constructed a continuous, monotonic, nowhere differentiable function (fortunately, I recalled the Lebesgue-Radon-Nikodym theorem before I had a chance to really embarrass myself). The latter, btw, is an example of a Truly Deep (tm) theorem from undergrad analysis.
Of course, it's silly to debate the relative "depth" of different mathematical disciplines. I'll grant you that P?=NP occupies a central place because of its philosophical implications (and it's a real pleasure to read your expositions on these). But I tend to avoid mine-is-deeper-than-yours contests as they produce nothing useful and occasional animosity. A "deep" result is any claim that's surprising, nontrivial, and has rich implications. It need not be paticularly difficult (the Gelfand-Mazur theorem is plenty deeep, but has a 1-line proof).
The "two cultures" paper (linked by Elad Verbin) is certainly relevant here (on top of being an educational and enjoyable read). I've always found discrete math more difficult to reason about than continuous. What makes combinatorics so difficult is that "continuity of intuition" fails. Things often go wrong not because you mis-formulated some technical condition but because the "structure" of the problem is so difficult to grab hold of. The hardest things I've proved haven't been terribly technical -- indeed, the techniques I used were elementary. The key part has always been was understanding the elusive structure of the objects I was working with. Alexandre Borovik refers to these two types of thinking by "switches" and "flows" in his book. I'd like to read the cognitive explanation more closely sometime (and of course would be thrilled if someone would summarize it).
PS. Don't mind Conway's off-handed dismissal of The Central Question. When you have a group named after you, you can pretty much say anything at all.
Large deviations vs. measure concentration
Given the abundance of rigorous treatments of LDP and COM referenced above, I'll focus instead on the "intuitive flavor" of the two approaches. First, the similarities. Both LDP and COM allow one to prove Laws of Large Numbers (LLN). For our handwavy purposes, a LLN is any claim asserting the convergence (weak, almost sure, in probability) of a sequence of random variables to their common expected value.
Quoting Dembo and Zeitouni (p. 4):
Letting be the empirical mean of independent, one-dimensional standard Gaussian random variables, Dembo and Zeitouni give a simple calculation (1.1.3) showing thatThe large deviation principle [...] characterizes the limiting
behavior, as , of a family of probability measures
on , in terms of a rate function.
which they summarize as
Note that this result, on its own, does not tell you how large of a sample you need (i.e., how big to make ) to ensure, with probabilityThe "typical value" of is [...] of the order ,
but with small probability (of the order of ),
takes relatively large values.
that .
To get such a finite sample guarantee (as opposed to an asymptotic limit), we need a COM result. As a slight technicality, let us truncate those Gaussians to the interval and renormalize (anyone who suspects foul play may be referred to Samuel Kutin's extension of McDiarmid's inequality). Then a simple application of Hoeffding's bound gives
Note that the asymptotics match those of the LDP, but we can guarantee that for any
with probability at least , the empirical mean will not deviate from the expected value by more than . Note, however, that we do not have a rate of convergence (in probability) of a random variable to its mean; all we have is an upper bound.
I hope these two examples illustrate the different flavor of the two apporaches. COM tends to give sharper upper bounds on deviation probabilities. These have the advantage of being true for all (excepting at most finitely many) values of , and the disadvantage of providing next to no information on the lower bounds. LDP gives the asymptotic rate of convergence (both upper and lower bounds), but does not (at least without some additional work) yield finite-sample bounds.
It is little wonder, then, that COM has found many more applications in computer science (e.g., randomized algorithms) and statistical machine learning, than LDP. The user wants to know how long to run your algorithm to achieve a specified precision at a given confidence level; order-of-magnitude information for "typical" values is usually insufficient.
Having given an informal "definition" of LDP, let me close with an equally informal discussion of COM, this time, quoting myself:
I realize this isn't terribly informative (or at all original), but the linked paper does attempt a crude survey (with an emphasis of passing from indepenent sequences to those with weak dependence). A word of caution for those attempting to visualize "measure concentration": imagining a narrowly peaked, one-dimensional distribution is tempting but wrong. This is a genuinely high-dimensional phenomenon. The only "visual" intuition I can offer is that the (Lesbegue) volume of the -dimensional Euclidean unit ball goes to zero with , while the volume of the cube (in which the ball is tightly inscribed) blows up exponentially. This means that in high dimensions, most a set's content is concentrated near its boundary -- something that's certainly not true in low dimensions. Anyone who's intrigued by this (that's everyone, right?) should read Milman's "Surprising geometric phenomena in high dimensional convexity theory".A common way of summarizing the phenomenon is to say that in a high-dimensional space, almost all of the probability is concentrated around any set whose measure is at least . Another way is to say that any "sufficiently continuous" function is tightly concentrated about its mean.
Friday, December 22, 2006
The inaugural post
An explanation is in order. The name is an amalgamation of two of my main research interests: stochastic processes and automata theory. "Absolutely regular" is a type of strong mixing condition, and I've done a bit of work on that. "Regular" is also the name of a fundamental class of formal languages (in a well-defined sense, it is the simplest nontrivial class of languages with natural closure properties). I have some results regarding these as well.
"Regularity" is one of those abused terms in math (cf. "space" and "kernel") that means everything and nothing. Roughly speaking, regularity is whatever conditions we need to impose to exclude the pathologies that prevent intuitive/desirable statements from being true. Mathematical analysis is full of instances where the naive claim is inaccurate or plain false (integration and differentiation are inverse operations) but becomes true once the right (usually, mild) conditions are imposed. In some sense, analysis is the study of such conditions. Regularity also crops up in more applied areas, such as statistical machine learning. No one can learn arbitrary functions -- there are simply too many of them. However, once a regularity condition is imposed (VC-dimension, Sobolev norm, Tsybakov noise condition, etc) -- learning becomes possible. In some sense, machine learning theory is the study of the regularity conditions that enable learning.
And finally, "absolutely regular" might just conjure up an image of a relaxed, easy-going everyday joe-shmoe blogging away in his pajamas.
I'll close with a mission statement. This is mainly a research blog. I'll post about problems I'm working on, current and past, and invite readers to join in. Recently, my work has been generating a slew of open problems, so if you're looking for a good problem to work on, you've come to the right place.
I'll resist the temptation to post about personal stuff because (1) my friends shouldn't have to read my blog to get an update (2) everyone else really has no business even being curious. I'll try to keep the politics near zero, but I see myself occasionally lapsing into a mini-rant (what blogger worth his salt doesn't? At least I'm honest and upfront about it).
Finally, I'm a novice at this "web-logging" on "the inter-net" so please do bear with me. I don't even know how to make formulas, such as $f:\mathcal{X}^n\to\mathbb{R}$ come out right. I'm glad that русский and עברית seem to work.