Consider the binary hypercube B_d = {0,1}^d naturally embedded in R^d. I take some hyperplane H and slice B_d into two pieces, labeling the corners on one side of H positive and on the other side negative. Then I forget about H and find the maximum-margin hyperplane W that separates the positive corners from the negative ones. Define rho(d) to be the worst-case (i.e., smallest) margin obtainable in this way for dimension d. Formally,
rho(d) =
min_{all hyperplanes H}
max_{all hyperplanes W that induce the same dichotomy as H}
(the margin attained by W).
How does rho(d) behave? I recall numerical simulations from a couple of months ago suggesting that it can be exponentially small, on the order of 2^{-d}. Is anybody out there aware of any known results? This would be really good to know!
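For anyone who wants to experiment: for a separable labeling, the maximum margin equals half the distance between the convex hulls of the positive and negative corners, so a dependency-free sketch like the following (function names are mine; Frank-Wolfe with exact line search, which can be slow in general but terminates exactly on tiny instances) lets you probe small d:

```python
import math

def hull_distance(P, N, iters=2000):
    """Distance between conv(P) and conv(N), computed as the norm of the
    min-norm point of the Minkowski difference D = {p - n : p in P, n in N},
    found by Frank-Wolfe with exact line search."""
    V = [tuple(a - b for a, b in zip(p, n)) for p in P for n in N]
    z = V[0]
    for _ in range(iters):
        # Linear minimization oracle: vertex of conv(V) minimizing <z, s>.
        s = min(V, key=lambda v: sum(a * b for a, b in zip(z, v)))
        d = tuple(a - b for a, b in zip(s, z))
        dd = sum(a * a for a in d)
        if dd == 0:
            break
        gamma = max(0.0, min(1.0, -sum(a * b for a, b in zip(z, d)) / dd))
        if gamma == 0.0:   # Frank-Wolfe gap is zero: z is optimal
            break
        z = tuple(a + gamma * b for a, b in zip(z, d))
    return math.sqrt(sum(a * a for a in z))

# d = 2: cut the corner (1,1) off the unit square.
pos = [(1, 1)]
neg = [(0, 0), (1, 0), (0, 1)]
margin = hull_distance(pos, neg) / 2   # max margin = half the hull distance
print(margin)                          # sqrt(2)/4 ~ 0.3536
```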
21 comments:
This isn't a solution, but I think it's a bit of a simplification: if you are mainly concerned about the asymptotics with respect to d, you need only consider H and W that pass through the cube center. For, if unrestricted H and W have margin rho, form a (d+1)-dimensional hyperplane H' that passes through H and the cube center; by symmetry the corresponding optimal W' also passes through the center, and it can have no better margin than W.
If you instead take a hypercube {+/- 1/sqrt(d)}^d, all the points would be on a unit sphere centered at the origin. If you think of these points as unit vectors, and dualize by replacing each of the 2^d unit vectors by a perpendicular hyperplane and replacing W by a perpendicular unit vector, the arrangement of hyperplanes cuts the sphere into cells and the margin of W becomes the inradius of the cell containing the dual vector to W.
I would like to argue that the number of cells is so large that one of them must be small. If the dual hyperplane arrangement were simple (at most d-1 hyperplanes have a common intersection) then the number of cells would be something like 2^{d^2}, so one of them would cover only a 2^{-d^2} fraction of the sphere's surface area and have inradius roughly (2^{-d^2})^{1/d} ~ 2^{-d}.
But it's not simple. E.g. in six dimensions, there is a plane through the center of the hypercube and 20 vertices (the ones with three positive and three negative coordinates) that dualizes to a point in the intersection of 10 hyperplanes.
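That six-dimensional example is easy to verify mechanically in the {-1,+1}^6 coordinates:

```python
from itertools import product

# In the {-1,+1}^6 cube, the vertices with exactly three positive and three
# negative coordinates satisfy x1 + ... + x6 = 0, so they all lie on the
# central hyperplane with normal (1,1,1,1,1,1).
verts = [v for v in product((-1, 1), repeat=6) if sum(v) == 0]
print(len(verts))  # C(6,3) = 20
```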
So I'm not sure how to push this argument. Maybe show that there are a lot of simple intersections? E.g. if you pick a random (d-1)-tuple of hypercube vertices, and find the intersection of the dual hyperplanes, is that intersection disjoint from all the other dual hyperplanes with constant probability? That would be enough to make the argument work, I think.
Very interesting approach, David. It certainly wouldn't have occurred to me, and I'm glad it caught a geometer's attention (I'm also glad there isn't a trivial 3-line solution -- which would've made me feel rather stupid).
I'll try to digest your duality approach... Meanwhile, a simple brute-force way to upper-bound the margin would be to construct some linearly separable labeling of the vertices whose (bad) margin is easy to compute/bound. Finding such adversarial labelings is easy in low dimensions, so maybe some sort of pattern would emerge. If it turns out that the upper-bound is exponentially small, this would nix any argument along the lines of "this concept class is linearly separable in a low-dimensional cube, therefore it must have a decent margin".
By the way, I found a couple references to the problem of counting the number of linearly separable subsets of a hypercube (essentially the number of cells in the dual arrangement):
Aichholzer and Aurenhammer, Classifying Hyperplanes in Hypercubes, 1996
and
Budinich, On linear separability of random subsets of hypercube vertices, 1991
Neither really answers the question, which leads me to believe that good bounds on the number of linear separations may not be known.
I *think* I follow the duality argument, but could you elaborate a bit? You rescale and translate the hypercube so its corners all lie on a unit sphere about the origin. Each corner is a vector and determines its perpendicular hyperplane through the origin.
I am less clear about what happens with W, how the surface of the sphere gets cut into cells, and what relations hold between cell size and "primal" margin...
I should stress again that I care less about the asymptotics of rho(d) than about (tight) upper bounds. So you agree that exponential decay is reasonable as a working conjecture?
The rescaling and translation is not especially important. The more important points are that you view the points as lying on a sphere, and that you replace each polar pair of points (forming a line through the center of the sphere) by an equatorial great circle (or its higher-dimensional equivalent) formed by cutting the sphere on a hyperplane through the center that is perpendicular to that line.
Here's another way of thinking about it that may help. You want to find a hemisphere that has some set of hypercube vertices inside it, and the complementary set outside it. A hemisphere is the set of points on a sphere within a 90 degree angle of its pole. Each great circle you're forming for an opposite pair of vertices (u,v) splits the sphere in two. If you place a pole in one of those two parts, the hemisphere surrounding that pole will contain u, while if you place the pole in the other of the two parts, the hemisphere containing that pole will contain v.
So the arrangement of all of these great circles formed from the hypercube vertices partitions the sphere into cells. If you place a pole at any point within one of these cells, the hemisphere surrounding that pole will contain the same vertices as you would get if you placed the pole anywhere else within the same cell.
The angular margin of a hyperplane is then the distance from the dual pole to the boundary of its cell: if you rotate the hyperplane through the center of the sphere so that the pole crosses that boundary, you get a different partition, because the vertex dual to that boundary crosses the rotating hyperplane.
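Here's a quick numerical illustration of that correspondence (a sketch with a pole I picked by hand, d = 3): rotating the pole by less than the angular margin never changes the induced dichotomy.

```python
import math
from itertools import product

d = 3
s = 1 / math.sqrt(d)                           # rescale so corners are unit vectors
verts = [tuple(s * e for e in v) for v in product((-1, 1), repeat=d)]

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

w = unit((0.2, 0.5, 0.9))                      # a pole, i.e. a hyperplane normal
dots = [sum(a * b for a, b in zip(w, v)) for v in verts]
margin = min(abs(t) for t in dots)             # = sin(angular margin of the cut)
dichotomy = [t > 0 for t in dots]

# Rotate the pole by a bit less than the angular margin, in the plane spanned
# by w and a direction u orthogonal to w: the induced dichotomy cannot change.
theta = 0.9 * math.asin(margin)
u = unit(tuple(e - w[0] * wi for e, wi in zip((1.0, 0.0, 0.0), w)))
w2 = tuple(math.cos(theta) * a + math.sin(theta) * b for a, b in zip(w, u))
dots2 = [sum(a * b for a, b in zip(w2, v)) for v in verts]
print(dichotomy == [t > 0 for t in dots2])  # True
```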
I'm not sure, did that help at all or was it just more confusing?
The A+A '96 paper has something similar in it, I think, so maybe they explain it more clearly.
Some more progress. Still not a proof but getting closer. So from my previous comments we need merely to show that there are many cells in the arrangement. Here's an argument that at least makes it seem likely that this is true.
First, the number of cells in the arrangement is at least equal to half the number of arrangement vertices. For, in the southern hemisphere of the sphere, each vertex is the southernmost point for at least one cell, and each cell has one southernmost point. (The same argument would apply to northernmost points in the northern hemisphere, but then we'd have to worry about counting equatorial cells twice. As for the southernmost points of cells in the northern hemisphere, the problem is that a cell may have multiple locally-southernmost points).
So anyway, instead of counting cells, let's count vertices. As in my first comment, I'd like to count simple vertices, formed by the intersection of only d-1 hyperplanes. I'd like to show that a random (d-1)-tuple of hyperplanes has a reasonable chance of forming a simple vertex (where reasonable means far less than constant, but still large enough to make the overall approach work).
d dual hyperplanes intersect in a common point exactly when the corresponding d pairs of opposite hypercube vertices lie on a common hyperplane. If we view the hypercube as {0,1}^d, pick from each pair of opposite hypercube vertices the one that has a 1 in its last coordinate, and use the coordinates of each vertex as a row of a matrix, then a d-tuple of vertices corresponds to a 0-1 matrix whose last column is all ones, and the points lie on a common hyperplane iff this matrix has zero determinant. So now suppose we have chosen randomly the first d-1 rows of a matrix of this type. We want there to be no other vertex coplanar with the chosen d-1 vertices. If the coordinates of the last row are x_i, then the determinant of the d×d matrix is sum(x_i D_i), where each D_i is the determinant of a smaller (d-1)×(d-1) matrix (a signed cofactor).
Now here's the non-rigorous part: each D_i is positive with probability close to 1/2. If they are all positive, then no not-all-zero assignment of 0's and 1's to the x_i's can make the overall d×d determinant zero. So it seems likely that with probability ~ 2^{-d}, a random (d-1)-tuple of vertices forms a simple arrangement vertex. Thus the number of simple vertices would be something like 2^{(d-1)(d-2)}, and the rest of the argument -- some cell has so small a volume that its inradius must be small -- would go through.
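For anyone wanting to play with this, here's a small sketch (function names mine) of the cofactor bookkeeping; the d = 3 example shows a (d-1)-tuple whose D_i's do not all share a sign:

```python
def det(M):
    # Laplace expansion along the first row; fine for the small matrices here.
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] * det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

def cofactors_last_row(rows):
    """Given d-1 rows of length d, return signed cofactors D_1..D_d so that
    appending a row (x_1,...,x_d) gives a matrix with determinant sum x_i*D_i."""
    d = len(rows) + 1
    return [(-1) ** (d + j + 1) * det([r[:j] + r[j + 1:] for r in rows])
            for j in range(d)]

# Example, d = 3: rows (1,0,1) and (0,1,1) (last coordinate always 1).
D = cofactors_last_row([[1, 0, 1], [0, 1, 1]])
print(D)  # [-1, -1, 1]: not all of the same sign, so this pair is not "good"
```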
Looks like you're going to kill this one off before I get a chance to fully understand your approach. Tempting though it is to work on this now, I must finish writing up the PhD thesis for the defense in May (on a different topic entirely).
It's kind of disappointing that the worst-case margin seems to be decaying exponentially. If it were only polynomial, that would significantly facilitate some learning generalization bounds...
Oh, and even if you do end up dispensing with this question via your approach, I'm still going to take some time to try and find explicit constructions of adversarial, low-margin labelings of the vertices.
Almost there:
In my last comment I had transformed the problem into the following: take a random n x n matrix M of 0's and 1's, and add a column of all ones to it to form a rectangular matrix M'. Form the sequence of values D_i = (-1)^i times the determinant of the submatrix formed by dropping the i'th column of M'. If all D_i's are nonzero and have the same sign, then the rows of M give a set of n vertices of the n-hypercube such that the hyperplane through them avoids all other vertices. If this happens with high enough probability, then there are many such cutting planes, and if there are many planes then we can prove that some linearly separable subset of the hypercube vertices has exponentially low margin.
Today's insight: one can control the signs of the D_i's by replacing any column of M' with its complement. The complement of a column c is 1-c (where 1 is the all-ones column, also in the matrix), from which it follows that this replacement flips the signs of all the D_i's except the one in which column c is dropped. Therefore, the probability of getting all D_i's nonzero and of the same sign is 2^{-n} times the probability that they're all nonzero, exactly what we want it to be.
The remaining subproblem is to bound the probability that all D_i are nonzero.
Two steps back: the sign-changing approach doesn't work, because the change to the D_i that corresponds to dropping the all-ones column is not just a sign change. And in general the approach of trying to form M' such that it is impossible to add a 0-1 row (with a 1 in the last column) to get a singular matrix is doomed to failure: it always works to repeat one of the existing rows.
That's a shame, as the M' formulation was finally something I could readily understand and begin to work with!.. I was about to try to apply Tao-Vu to it:
http://front.math.ucdavis.edu/math.CO/0411095
Back to geometry, I guess...
Have you thought about using an algebraic equation to define an arbitrary hyperplane H [like a1*x1+a2*x2=C for d=2]? If you can formally represent its "margin" [a term I don't understand] through the coefficients ai, then the first derivative of this must be equal to 0 for a max/min.
The margin of a hyperplane H is the shortest distance from any corner of the hypercube to H. I don't know that there's a closed-form expression for this quantity in terms of the vector and offset defining H. The matter is further complicated by the fact that we want the maximum-margin hyperplane for a given linearly separable vertex partition.
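To be concrete, the margin of a *given* hyperplane is trivial to compute by brute force over the 2^d corners -- it's only the inner maximization over W that lacks a closed form. A throwaway sketch (names mine):

```python
import math
from itertools import product

def margin_of(w, b, d):
    """Margin of the hyperplane w.x + b = 0 with respect to the cube {0,1}^d:
    the distance from the nearest corner to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(abs(sum(wi * xi for wi, xi in zip(w, x)) + b)
               for x in product((0, 1), repeat=d)) / norm

print(margin_of((1, 1), -1.5, 2))  # sqrt(2)/4: the cut slicing off corner (1,1)
```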
Well, let's see for d=2:
Any point on the hyperplane a1x1+a2x2=C has coordinates (x1, (C-a1x1)/a2).
The point's distance from the hypercube corner (k1,k2) is
sqrt((x1-k1)^2+((C-a1x1)/a2-k2)^2). [*]
Taking the first derivative and setting it to zero:
2(x1-k1) + 2((C-a1x1)/a2-k2)(-a1/a2) = 0
or
x1 = (a2^2*k1 - a1*a2*k2 + a1*C)/(a1^2+a2^2)
If we substitute this value back into [*], we get a function to optimize over k1, k2 [easy, since each ki is 0 or 1] and also over a1, a2 -- that's what you are after!
In the general case you would first have to solve the d equations that result from setting the derivative of [*] with respect to each of x1, x2, ..., xd to zero.
Only then can you consider the additional condition of separability.
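As a sanity check on the calculus (with some hypothetical numbers and a brute-force grid): the minimized distance in [*] agrees with the standard point-to-line formula |a1*k1+a2*k2-C|/sqrt(a1^2+a2^2).

```python
import math

a1, a2, C = 1.0, 2.0, 1.0     # the line a1*x1 + a2*x2 = C
k1, k2 = 1.0, 1.0             # a corner

# Brute-force minimum of the distance from (k1, k2) to points on the line.
brute = min(math.hypot(x1 - k1, (C - a1 * x1) / a2 - k2)
            for x1 in (i / 10000.0 - 5.0 for i in range(100001)))

# Closed-form point-to-line distance.
closed = abs(a1 * k1 + a2 * k2 - C) / math.hypot(a1, a2)
print(round(brute, 4), round(closed, 4))  # both ~ 0.8944
```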
When you optimize [*] you need to take into account the constraint that your optimal hyperplane W must partition the corners into the same two classes as the original one H.
The original hyperplane h1x1+h2x2=H [I am still assuming d=2, this time with no loss of generality] slices the hypercube into two pieces: one where h1x1+h2x2 > H and another where h1x1+h2x2 < H. Let's call the corners "positive" if h1x1+h2x2-H>0 and "negative" otherwise. Your hyperplane W must preserve each corner's sign, so:
sign(a1x1+a2x2-C)=sign(h1x1+h2x2-H)
or, more conveniently:
(a1x1+a2x2-C)*(h1x1+h2x2-H)>0 [**]
The above needs to hold only at the corners, so we have the 2^d linear restrictions [**] on the ai [one per corner], in addition to the goal function [*].
I vaguely recall that there were ways to bring such restrictions directly into the goal function.
Sure, computing the maximum-margin hyperplane (the technique also goes by SVM -- support vector machine) is a quadratic program, readily solvable by a number of computational packages. But I'm pretty sure there's not a closed-form solution.
I haven't forgotten about this problem but progress has been slow.
I did find more about the number of linearly separable subsets of a hypercube (also the number of subsets separable by a central linear cut in one dimension higher): http://www.research.att.com/~njas/sequences/A000609
I should look at some of those references to see whether more is known about the asymptotics of these numbers.
These are growing rather quickly...
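For tiny d one can recover these numbers by brute force; e.g. for d = 2 a small grid of integer weights and half-integer offsets suffices (the grid bounds are my own ad hoc choice, enough for d = 2 but not in general):

```python
from itertools import product

# Count linearly separable dichotomies of the square {0,1}^2 by sweeping a
# small grid of weights/offsets and collecting distinct sign patterns on the
# four corners. Half-integer offsets guarantee no corner lies on a hyperplane.
corners = list(product((0, 1), repeat=2))
patterns = set()
for w1, w2 in product((-1, 0, 1), repeat=2):
    for b in (-1.5, -0.5, 0.5, 1.5):
        patterns.add(tuple(w1 * x1 + w2 * x2 + b > 0 for x1, x2 in corners))
print(len(patterns))  # 14, matching A000609 (only XOR and XNOR are missing)
```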
Do you have any intuition about simple slices that give bad margins? In two dimensions, you clearly want to cut a corner off the square. In three dimensions, it's a trivial matter to enumerate them all (but I must prepare a lecture for tomorrow and not get distracted :)
This keeps going back to my pig-headed approach, but I was wondering about your geometer's intuition...
It was too long for a comment here, so I posted some visual intuition over on my own blog.
Many thanks for the very helpful illustrations. How do you draw those figures?
Three illustrations are new: the cube with decorated sides, and the two stella octangula ones. I drew them in Adobe Illustrator, using the grid to align them (and its transparency features to make them partially see-through) but otherwise freehand. The other two illustrations were cribbed from old papers of mine and touched up in Illustrator. The spherical drawing was originally done in Cinderella, while if I remember correctly the exploded hypercube was done in Mathematica. Or rather, I did an unexploded hypercube wireframe in Mathematica and then exploded it and added the transparency features in Illustrator.