Tuesday, February 27, 2007

A universal Turing kernel?

At the end of my talk, someone asked why I focus exclusively on DFAs. After all, my result that any countable concept class over a countable sample space is linearly separable (in this sense) works equally well for push-down automata or even Turing machines. My first (mildly petulant) answer is that DFAs are the only automata I'm comfortable with and I won't be ready to move on to more powerful automata until someone resolves the star-height problem (at which point I'll grudgingly grant humanity the right to graduate on to higher machines).

Someone else was quick to suggested that the analogous kernel for Turing machines, T_n(x,y), would have to count the number of Turing machines on n states that accept both x and y -- and surely this is undecidable. But is that obvious? Might there not be some clever way of computing the kernel without solving the halting problem?

Turns out, there is not; this was shown by Jeremiah Blocki, a student in my class -- who, incidentally, has taken up the challenge of the notorious problem #3.

Monday, February 26, 2007

Proofs by induction

When we teach proofs by induction in high school or early undergrad, they tend to be rather simple. Or rather, the induction tends to be "trivial" in the sense that it's obvious
1. what structure to do the induction over
2. how to decompose the claim for n+1 so as to invoke the inductive hypothesis

Examples of such simple uses of induction include proving the sum-of-squares-(cubes, etc.) formula, and proving that if a language L is context-free then so is its reversal L^R.

We the instructors somewhat lament this state of affairs, as it gives a misleading impression of induction. A student might dismiss induction as trivial and vacuous, or on the contrary, come away believing that induction is a magical, all-powerful tool that does proofs for you, relieving you of the burden of thought.

So we looked hard for examples of clever use of induction. Here's a good problem to start with:
Let f_p(n) be the sum of the first n powers of p:
f_p(n) = 1^p + 2^p + 3^p + ... + n^p.
For any given p (say, p=3), it is straightforward to verify (by trivial induction!) the f_3(n) is a 4th degree polynomial in n. It's a lot trickier to prove that for all p, f_p(n) is a polynomial in n of degree (p+1). Can you do it? No need to invoke Bernoulli numbers here -- we're not asking for the coefficients of that polynomial. There are a couple of elementary, self-contained proofs -- using clever induction! -- can you find one?

Actually, all of my difficult proofs have been by induction. Various concentration of measure results (Talagrand, Marton, yours truly) have induction at their core. It's a shame we can't present these in a typical undergraduate class (too much overhead), since their use of induction is rather non-trivial.

Do people have more examples of nontrivial proofs by induction?

Thursday, February 22, 2007

Quantum disinformation

Check out the discussion over at Scott's blog. It covers a wide range of topics, including: the (typically) shoddy mainstream journalistic coverage of technical material, the powers/limitations of quantum computing, and finally, something near and dear to my heart -- the nature and role of mathematical proofs.

Tuesday, February 20, 2007

Minimal consistent DFA revisited

I'm glad we ended up putting this problem on the midterm (as extra credit). A handful of folks nailed it dead-on, but it's still causing confusion for some -- which means it's a good one to work out! Now that it's been assigned and graded, the readers are invited to say anything and everything they want about it in the comments.

BTW, the problem of finding the minimal consistent DFA is NP-hard; this fact more or less motivates the line of research I'll be presenting at this talk tomorrow (today).

Monday, February 19, 2007

Linear/convex programming in infinite dimensions?

I'm sure I'll find the answer in Rudin's Functional Analysis, but perhaps a helpful reader will educate me and the rest; otherwise, what's the point of blogging?

Say we have a Hilbert space H -- so it's locally convex and the Krein-Milman theorem applies. Let K be a compact convex set in H. Is it still true that linear functionals achieve their maxima on extreme points of K? What about convex functionals? [I should really look up the latter in the last chapter of Borwein and Lewis.]

Seems like these should be true, but I've learned to mistrust my intuition in high (and a forteriori infinite) dimensions. Anyone have an answer handy?

Sunday, February 18, 2007

Arbitrarily large vs. infinite

Infinity is a subtle mathematical notion, which is logically related to -- but not synonymous with! -- the notion of "arbitrarily large". Actually, when dealing with limiting values of magnitudes, the two more or less coincide. When we say that the series
1 + 1/2 + 1/3 + 1/4 + ... = Infinity
what we mean is that no matter now big of an N you pick, I can always find enough terms in that series whose sum will exceed N. In this case, infinite really is shorthand for "increasing without bound" or becoming "arbitrarily large".

When dealing with cardinalities -- as opposed to magnitudes -- all bets are off. That was the crux of the closure under star question. [BTW, I think I've found the bug in Ivan's proof that closure under concatenation implies closure under star. The set U need not be a complete lattice. Otherwise, your argument could be used to show that the function f:Z->Z defined on the integers by f(x)=x+1 has a fixed point.] Set theory is rife with examples where some property P holds for arbitrary finite collections but not infinite ones:
  • in point-set topology, a finite (but not necessarily countable) intersection of open sets is open
  • in analysis, there are finitely (but not countably) additive measures
  • etc

There are techniques, most (all?) of them based on transfinite induction, for proving that P holds for infinite collections given that it holds for arbitrary finite ones. Knaster-Tarski is one; Hausdorff maximality principle and Zorn's lemma are other favorites. All are equivalent to the axiom of choice.

Saturday, February 17, 2007

"Progressive" math

Alexandre Borovik's blog is my usual source of examples of semi-literate math-is-being-used-for-evil hysteria. Staying true to this blog's mission statement, I sincerely endeavor not to excessively dilute the mathematical content with politics. I also make it a general principle to rebuke outrageous claims only if they are championed by a reasonably reputable source; life is too short to engage in reasoned intellectual discourse with every obscenity-shouting hobo in the street.

For this reason, though I had seen the unhinged, lunatic rants quoted here about a week ago, I didn't see fit to address them, basically relegating this to the hobo-in-the-street category. Brief synopsis: math has cryptographic and defense applications, and we all know these are tools of imperialistic oppression to keep down the working class, man. (I am hoping that the quote by Rabelais about the evil use of frontal lobes is meant to parody Ken Burch and Kevin Laddle, but one can't know for sure.)

Thursday, February 15, 2007

Writer appreciation day: John Derbyshire

This past December I was giving a talk at Princeton and was pleasantly surprised to find John Derbyshire's book lying on a coffee table in the math lounge in Fine Hall. I was even more pleased to see that this work had won the Euler Book Prize.

I am a big fan of Derbyshire, and encourage everyone to read his columns every now and again. If you've never heard of him, this interview will serve as a decent crash-course intro. Derbyshire is a writer of a rare breed, possessing all the erudition of a crusty academic and all the reckless, unapologetic calling-them-as-he-sees-them brutally refreshing honesty of a man who has nothing to fear from the PC commissars. He is eminently quotable; here are my personal favorites:

  • Let's face it, in the great 20th-century struggle between the state and the individual, the state has won, game, set, and match.
  • The fact is that political stupidity is a special kind of stupidity, not well correlated with intelligence, or with other varieties of stupidity.
  • Wherever there is a jackboot stomping on a human face there will be a well-heeled Western liberal to explain that the face does, after all, enjoy free health care and 100 percent literacy.
(since these came from comments on a thread, I'm going to ask Mr. Derbyshire to authenticate them).

Sure enough, he has his share of detractors -- what hard-hitting pundit does not? They roll out the usual accusations: racist, sexist, bigot, hater. To all the people losing sleep over John Derbyshire's alleged hatred: relax, he doesn't hate you. Knowing him, he probably doesn't give a damn about you.

Conservation of dimension and randomness

Fix two integers 0 < n < m and consider the following seemingly unrelated questions:

(a) is there a function f:{0,1}^n -> {0,1}^m such that if its input X is uniformly random over the domain then f(X) is uniformly random over the range?

(b) is there a continuous bijection g:R^n -> R^m ?

The answer to both is negative. I'll give hand-wavy proof sketches and invite readers to fill in gaps (or point out logical lapses), or give pointers to clear, self-contained proofs. If (a) were true, we could arbitrarily amplify the randomness of some fixed-length input, obtaining lossless compression of arbitrarily long strings. If (b) were true, such a g could be uniformly approximated by differentiable functions, whose local Jacobian would have to be an invertible linear map from R^n to R^m.

Now, what do the two have in common? The first one violates "conservation of randomness" -- we get more random bits than we started out with, for free. The second one violates "conservation of dimension" -- we locally span a space of a higher dimension than what we started out with.

Update: If you take Cosma's class, you'll see an apparent violation of conservation of randomness. Define the logistic map F:[0,1] -> [0,1] by F(x) = 4x(1-x). Let N:[0,1]->{0,1} be the nearest-integer function. Draw x_0 uniformly at random from [0,1] and consider the sequence x_0, x_1, x_2,..., where x_{i+1} = F(x_i). It's easy to show that the sequence y_i = N(x_i) consists of iid Bernoulli variables. So it seems that with a single random draw of x_0, we got an infinite sequence of independent random variables for free. What's going on here?

Monday, February 12, 2007

More feminist math

This started out as a comment to Alexandre Borovik's post but as I was writing it I decided it merits a post of its own.

I'd craft a strongly worded reaction to this garbage (the author compares sexual abuse to "math abuse"), but I hate to repeat myself.

I can only add that John Kellermeier's suggestion that mathematical education abandon its "right" answers and methods would be laughable if it weren't so frighteningly realistic.

This isn't to say that every problem has a unique answer or way of arriving at it. Some problems even have no known answers! But for pedagogical reasons, we tend to give youngsters well-posed, solvable problems that illustrate some specific correct technique.

Is Kellermeier really so humorless and out of touch as not to realize he's imitating Jack Handey?
Instead of having "answers" on a math test, they should just call them "impressions," and if you got a different "impression," so what, can't we all be brothers?

Except I guess you can't say "brothers" anymore. Is "siblings" sufficiently inclusive and diverse?

Concatenation implies star?

Fix a finite alphabet -- let's take it to be {0,1} for simplicity. A language L is a collection of finite strings over {0,1}, i.e., L is a subset of {0,1}^*.

Let U be a family of languages. We say that U is closed under an operation if applying the operation to members of U always yields members of U. Suppose U is closed under concatenation -- that is, if L_1 and L_2 are in U then so is L_1 L_2 (the set of all strings with prefix in L_1 and suffix in L_2).

A student claimed in a homework problem that if U is closed under concatenation then it is closed under the star operation (L^* is the set of all strings obtained by concatenating n members of L, for n=0,1,2,...).

Is this true? Answers/discussion welcome in comments!

Wednesday, February 7, 2007

Sampling Lipschitz functions

Fix a positive integer n and let F_n be the set of all functions
f:{0,1}^n -> [0,n]
that are 1-Lipschitz with respect to the Hamming metric. As a reminder, for two sequences x and y in {0,1}^n, their Hamming distance d(x,y) simply counts the number of indices in which they differ. The Lipschitz condition just means that
|f(x) - f(y)| <= d(x,y)
for all x and y in {0,1}^n. It's obvious that F_n is a compact, convex polytope in a finite dimensional space (that space being R^(2^n) ). How would you sample a function uniformly at random from F_n?

I have a method that I'm pretty sure is correct, but I'd love to generate some discussion...

Tuesday, February 6, 2007

Proofs & Math Pedagogy

Alexandre Borovik's post reminded me of the good old days when I was a math undergrad and would pay an annual visit to my 6th grade Gifted&Talented math teacher. She let me teach a lesson, which proved popular enough that the following year I got two periods in a row. The class consisted of that rare breed of American 12-year-olds who are genuinely hungry for mathematical knowledge and can appreciate beauty and aesthetics when they see it. These kids, I remind you, have never seen calculus or trigonometry, and don't have a solid grasp of the real number field.

I would start out by writing three expressions on the board:
(a) 1 + 1/2 + 1/4 + 1/8 + ...
(b) 1 + 1/2 + 1/3 + 1/4 + 1/5 + ...
(c) 1 - 1 + 1 - 1 + ...

and ask them what each series sums to. Most had no difficulty getting (a) right, and many could even give a pictorial argument to support their answer. (b) gave them more trouble, and who can blame them? It's far from obvious, at first glance, how the harmonic series behaves; even using a computer and making plots, one would become frustrated by the slow divergence. I let the cat out of the bag by telling them that it does in fact blow up, but they'll have to wait for college and Baby Rudin to become really convinced. [Incidentally, as all intelligent 12-year-olds, they are fascinated and confused by the concept of infinity. I glossed over it at this point, saying that "infinity" in this case is shorthand for "increasing without bound" or becoming "as large as you want if you add up enough terms". We'll come back to it shortly!] But what to do about (c)? One hardly needs a computer simulation or a formal proof to see that something is not kosher about that expression. If (b) can be called a "number" in some extended sense, there is no sense in which (c) makes sense a number. So, lesson number one: just because we can write it down, in numbers and symbols, does not necessarily mean it's well-defined or makes sense.

I would go on to ask if anyone had heard of irrational numbers and could name an example of one. A few hands would go up -- "pi" would be the standard answer. Whether anyone knew why, or could formally define irrational numbers, is another matter. Well, we didn't prove pi's irrationality, but we did define rationals and irrationals, and more importantly, we'd see a rigorous proof of the existence of irrational numbers! For most of the kids, this was the first time seeing a formal proof, and by the looks on the faces, I could see that a paradigm shift was taking place. What a novel concept -- you don't just accept facts handed down from authority, but become convinced of them through a waterproof argument!

That would take about 45 minutes. In the next 45, I'd prove -- again, in full rigor -- Cantor's diagonalization theorem of the uncountability of the reals. There is absolutely nothing in the definition of cardinality or in Cantor's proof that is beyond the grasp of a 12-year-old. They were getting it; I can vouch for that. I have no idea why we make students wait until college to see this.

Sunday, February 4, 2007

Concentration of Boolean functions?

In the course of giving an invited talk at Mehryar Mohri's NYU machine learning class this fall, I gave the example of the parity function f:{-1,1}^n -> {-1,1} to illustrate how the concentration phenomenon can break down without a Lipschitz condition. The natural metric here is normalized Hamming and it's immediate that the parity function is not Lipschitz with respect to this metric (or rather its Lipschitz constant is 2n). On the other hand, under the uniform distribution on the discrete cube {-1,1}^n, we have
P(f=-1)=P(f=1)=1/2, while E[f]=0, which means that f is not concentrated about its mean (or any other constant).

Someone in the audience (unfortunately, I don't remember who) asked about the majority function. Totally in improvisation mode, I blurted out that perhaps the majority function is "almost Lipschitz" and one might be able to apply Kutin's generalization of McDiarmid's inequality. [Regarding the latter, I'm surprised this fine paper isn't getting more attention or seeing good applications. Or is it?]

Fast forward to this spring, when I'm sitting in on Ryan O'Donnell's very enjoyable class. We're seeing all sorts of characterizations of properties of Boolean functions in terms of their Fourier coefficients and related concepts. Given my one-track mind, I keep wondering if one can say something about the concentration of Boolean functions with low energy or degree.

Having thought about it for a bit, I see my hopes were naive. Take the simplest nontrivial function -- a dictator (that is, f:{-1,1}^n -> {-1,1} whose values depend only on a fixed single bit). As shown in homework 1, dictators have a rather simple Fourier decomposition. However, for a dictator, we have P(f=-1)=P(f=1)=1/2 and E[f]=0, so we can forget about concentration.

What about majority? The majority function has a more complex Fourier structure (could this be a future homework problem?). Some quick numerical trials indicate that when f:{-1,1}^n -> {-1,1} is the majority function, we have E[f] tending to 0 and P(f=-1), P(f=1) tending to 1/2. I haven't proved this (another possible exercise?), but if it's true then I hope whoever asked me that question reads this post!

A couple of afterthoughts. (1) A recent inequality of Kim and Vu allows one to prove concentration of polynomials with nonnegative coefficients. Since the Fourier expansion is a multivariate polynomial, one might be able to apply Kim-Vu to certain classes of Boolean functions. Is there a characterization of the Boolean functions having nonnegative Fourier coefficients?

(2) I was at first excited to see the Poincaré inequalities, as they are one method for proving concentration. But one must be careful, as Poincaré and log-Sobolev inequalities are really asserting a property of the probability measure, not any specific function. In particular, the Poincaré inequality on the discrete cube is telling us that the product measure on {-1,1}^n has exponential concentration with respect to the Hamming metric. But we already knew a stronger fact (subgaussian tails) via Chernoff bounds!

Update Feb 6: I feel a bit silly running numerics on the majority function and suggesting the observation that E[f] tends to 0 and P(f=-1), P(f=1) tend to 1/2 as an exercise. It's obvious.
Update II: An obvious point that I forgot to make is that it's a bit vacuous to talk about concentration of {-1,1} valued functions; this only makes sense if f_n:{-1,1}^n -> {-1,1} is approaching a constant. The original concentration question was actually about real-valued functions on the cube, but the simple reasoning above show that small Fourier coefficients do not imply concentration.

Thursday, February 1, 2007

Characterizing total variation

Let be a finite set and consider two probability measures (distributions) on , and . One can define many notions of "distance" between and , but a particularly important one (at least in my work) is the variational distance. By definition,

-- that is, we maximize over all subsets of the difference of the measures assigned by the two distributions. It is a well-known fact, routinely left as an exercize for the reader, that

where is just the norm of the difference , viewed as a vector in .

It is also well-known that
where the infimum is taken over all the distributions on , having marginals and , respectively, and the random variables are and are distributed and .

Let us define the (unnormalized) measure as the pointwise minimum of and . A somewhat lesser-known fact is that

(I have not seen this mentioned anywhere, but can't imagine that I'm the first one observing this). It easily follows that if and are minorized by some probability measure -- i.e., there is an for which holds pointwise, we have

Finally, here is a relation that has a chance of being novel. Consider two finite sets , with probability measures on and on . We will write for the product measure on (and similarly for ). Then

To the best of my knowledge, this "tensorization" result was first proved here, but I'd be very grateful if anyone would bring an earlier reference to my attention.

None of these are difficult to prove, but if pressed, how would you do it? I have a simple technique, based on a linear programming principle, that yields all of these pretty much effortlessly (see Lemma 2.6 in the linked paper for the idea). Any other simple proofs out there?

Friday meetings

If we're supposed to meet on Fri, Feb 2, and you don't see your name below, please email me!

  • 1:00 Salahuddin C.
  • 1:30 Vinay C.
  • 2:30 Jeremiah B.
  • 3:00 David W.
  • 3:30 Jonah S.

Reminder: my suggestions for projects can be found here. Try to have a clear idea of what you want to do before the meeting.