## Friday, December 29, 2006

### FLAC project ideas

This spring I'm TAing Formal Languages, Automata and Computation (FLAC), with Lenore Blum as instructor and Katrina Ligett as fellow TA. Since we like to get the undergraduates off to an early start in independent research, we are suggesting (well, requiring) that the students do a course project.

Below I am going to suggest some project ideas, based on problems that came up in my work. These range from "mathematical curiosities" to solid research problems to truly fundamental questions, probing the very foundations of computer science. I'll be adding to this list as time goes by, but here is a start.

1. In 2003, I proved that the so-called n-gram uniquely decodable languages are regular. In Sec. 4.4 I conjectured that an efficient automaton construction is possible. Recently, such a construction was indeed given. Project: Read and understand the two papers. Implement the automaton described by Li and Xie (in matlab, for example). Attempt to answer other questions I raise in Sec. 4 (I'd be particularly happy to resolve 4.1).
2. In a recent paper with Mehryar Mohri and Corinna Cortes, we present a novel approach to learning formal languages from labeled samples. Our technique consists of embedding the strings in a high-dimensional space and learning the language as a maximum-margin hyperplane. We prove that a rich family of languages (the piecewise testable ones) are linearly separable -- and therefore efficiently learnable -- under a certain efficiently computable embedding (kernel). One of the gaps we've been trying to close is a lower bound on the separating margin, in terms of the automaton complexity. All we can currently show is that it's strictly positive. Do not be daunted by the machine-learning concepts in this paper; I'll be happy to sit down with you and explain them all. A result of the type "margin is at least 1/poly(#states)" would be a very good thing, and would certainly make a solid, publishable paper.
3. A natural extension of the work with Corinna and Mehryar is the question of what other language families can be linearly separated. It turns out there is a universal kernel that separates all the regular languages! This is one of those results that sounds deep (indeed, it eluded us for something like a year) but becomes trivial once you see the construction. The key quantity is the following auxiliary kernel. Fix a finite alphabet Sigma of size m, and let DFA(m,n) be the set of all deterministic finite-state automata on n states over this m-letter alphabet. There is no loss of generality in assuming that every DFA always starts in state 1. Then size(DFA(m,n)) = n^(mn)*2^n -- do you see why? Suppose I give you two strings, x and y over Sigma, and ask: how many A in DFA(m,n) accept both x and y? Call this quantity K_n(x,y). It may not be immediately obvious, but K_n(x,y) is a kernel of fundamental importance, for it can be easily adapted to render all the regular languages linearly separable. Unfortunately, we do not have an efficient algorithm for computing K_n(x,y), and in fact suspect that it's a computationally hard problem (a likely candidate for #P complete). Proving that this is indeed the case would be a big deal; providing an efficient algorithm for computing K_n(x,y) would be an even bigger deal. (We do have an efficient epsilon-approximation to K_n(x,y).) Again, I'm leaving out many details and a writeup/talk slides are forthcoming. I can't overstate the importance of this problem (according to a really famous computer scientist who'll go unnamed, it's even more important than I realize!). I'd be thrilled if a brave soul would attempt it.

Update: Jan. 26 2007:

More project ideas:

• Algebraic automata theory. A couple of students have expressed an interest in the deep and fascinating connection between automata, languages, and semigroups. I am going to mention a few places to start looking; there's plenty of material here for several projects (heck, for many PhD theses). (1) Check out the expository papers by J-E Pin and/or Mateescu & Salomaa (2) Take a look at the book Semirings, automata, languages by Kuich and Salomaa.
• Algorithms on strings. Computations on strings are at the core of natural language processing and computational biology. Take a look at an influential paper on string kernels, check out Dan Gusfield's book, or the excellent book by Sankoff and Kruskal.

The way I envisioned for this to work is that you do some independent reading, come to me with specific questions/ideas, and we together decide on what a reasonable project might be. I have a strong preference for independent research over passive summarization, but exceptions can be made for particularly challenging material.

UPDATE: Feb. 1, 2007: If we spoke about meeting, please email me with the time! Also, try to have a concrete idea by the time we meet...