Monday, January 1, 2007

On Dictionaries and Language

This is prompted by the Weekly Standard piece on Samuel Johnson. The author, Jack Lynch, credits Johnson (among other things) with taking the trouble to define such "obvious words" as take and get -- which lexicographers before him had neglected. The segue into philosophy of language is inevitable: do words have "primitive", "irreducible" definitions? Or are dictionary entries necessarily circular, since we can only define words in terms of other words? Since much ink has already been spilled on these matters (cf. Wittgenstein, Chomsky and David Lewis), let us frame the question more operationally. Can a computer program, in principle, learn to reason about the world (on par with humans) given only text as input and output? Or must it necessarily have sensory perception in order to become sentient?

My training at Bell Labs, where I was most heavily influenced by Daniel Lee, leads me to conjecture that a large enough corpus of text, with sufficiently clever statistical processing, is enough to train a machine that would pass the Turing test. Supporting evidence: congenitally blind people are able to reason about colors perfectly well, without having any qualitative experience of the phenomenon. The linguist J.R. Firth would seem to agree with the possibility of "stand-alone" semantics: "You shall know a word by the company it keeps".

Taking five minutes to ponder the mysteries of language reminds me why I left Natural Language Processing, after a few brief forays. The problem is just too hard, and our current tools too primitive. I realized that until we develop more powerful mathematical tools, NLP research will be plagued by the sad fact that ad-hoc heuristic hacks tend to outperform elegant, clean, principled models. Of course, I am quite out of it as far as recent developments go -- and would be happy if someone would set me straight. I know that elegant, principled models exist for document classification and text translation. I also know that if the state of the art is to be judged by Google's automatic translator, then there is, ahem, much room for improvement.

I prefer to be a producer of formal theorems and a consumer of NLP products (and judging by my list of rejected NLP paper submissions, my preference is in line with that of the community). Of course, no one can stop me from dabbling in language as a hobby, which I regularly do. Want the etymology of an obscure (but known!) Indo-European or Semitic root? Want a tip on picking out a good dictionary? You've come to the right place. Regarding the latter: I can size up a dictionary in a matter of minutes, and my intuition has yet to mislead me. Always look up slang and, yes, vulgarities -- any lexicographer who pretends that certain words don't exist isn't worthy of the title. (Russian joke: "Мама, что такое жопа?" -- "Такого слова нет, сынок." -- "Странно -- жопа есть, а слова нет?..") Our final conclusion is that Russians have a joke for most occasions.


"Q" the Enchanter said...

"Supporting evidence: congenitally blind people are able to reason about colors perfectly well, without having any qualitative experience of the phenomenon."

That's interesting, but I'm wondering how far this sort of reasoning really goes. I can, e.g., "reason" about strings of propositions invoking p & q, but I'd argue that that's not sufficient for a coherence theory of semantics with regard to p & q. Well, maybe I'd argue that.

Anyway, got a citation by any chance?

Aryeh said...

Unfortunately, I don't have a citation at my fingertips -- but I believe that "blind people" example came from a book of Pinker's, possibly The Language Instinct.

Regarding your p,q example -- that's my point exactly! It's very difficult to define semantics in an internally consistent and non-circular way. Here's an interesting variation on a Turing test: you're chatting to an educated adult via a text terminal, and your job is to determine whether s/he is blind or not. Can blind people give convincing "sighted" responses? I have no idea, but I would guess that in principle, yes. Any queries you ask via a text terminal are, in principle, answerable from a sufficiently large text corpus. This would seem to support my original strong-AI-learnable-from-text claim.

Anonymous said...

(Russian joke: "Мама, что такое жопа?" -- "Такого слова нет, сынок." -- "Странно -- жопа есть, а слова нет?..")

for us undereducated types, can you paraphrase your Russian?

"Mom, what does a thingy do?"

"Thingies, don't do that, they do something else"

"sure they're a thingy, but they aren't some other thingy?.."

Ha ha ha?

Aryeh said...

Very well: "Mommy, what's an ass?" "There's no such word, dear." "That's odd -- I got one, but there's no word for it?.."

That's the best I can do, and I realize it didn't come out all that well. Russian humor has the property of often not translating well.

The only point I was trying to make is that just as a physician has no notion of "naughty" body parts, a linguist has no notion of "bad" words. Words can be used properly or misused; some character/phoneme strings are "unattested" (have not been observed in the vernacular) or ill-formed (violate the phono/morpho-tactics of the language). But any widely attested word -- no matter how vulgar or obscene -- is a part of the language, merits documenting, and even has proper usage.

Osame Kinouchi said...

Hi Leo,
I am a physicist working on dictionaries as complex networks.
Have you a good reference for definitions of metrics or distance between synonymous words? Have you a preferred distance definition?

Aryeh said...


One way to induce semantic metrics is to convert words to "semantic vectors", similar to what Schone and Jurafsky do here; I'd try browsing the other publications on Dan Jurafsky's page. Of course, dictionary entries give you natural and clean co-occurrence data, so you should be able to extract highly accurate semantics...
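
To make the "semantic vectors" idea concrete, here is a minimal sketch in Python. It is not Schone and Jurafsky's method, just the simplest bag-of-words version of the idea: each headword is represented by the counts of the words in its definition, and two headwords are compared by cosine distance. The toy dictionary, the stopword list and the function names are all made up for illustration.

```python
import math
from collections import Counter

# Hypothetical toy "dictionary" (headword -> definition), invented for illustration.
TOY_DICTIONARY = {
    "cat": "small domestic animal that purrs and hunts mice",
    "dog": "domestic animal that barks and guards the house",
    "kitten": "young cat small animal that purrs",
    "theorem": "mathematical statement proved from axioms",
}

# Assumed stopword list; a real one would be much longer.
STOPWORDS = {"that", "and", "the", "from", "a", "of"}


def semantic_vector(definition):
    """Count vector over the non-stopword tokens of a definition."""
    return Counter(w for w in definition.lower().split() if w not in STOPWORDS)


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse count vectors (1.0 if either is empty)."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


# One vector per headword, built from its definition.
VECTORS = {word: semantic_vector(text) for word, text in TOY_DICTIONARY.items()}


def distance(word1, word2):
    """Cosine distance between the definition vectors of two headwords."""
    return cosine_distance(VECTORS[word1], VECTORS[word2])


if __name__ == "__main__":
    for other in ("kitten", "dog", "theorem"):
        print("cat", other, round(distance("cat", other), 3))
```

Note that cosine distance is a dissimilarity rather than a metric in the strict sense, since it can violate the triangle inequality; if a true metric is needed, the angle between the vectors is one candidate.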

I do have a reservation regarding the use of LSA in inducing semantics; the singular value decomposition has next to no theoretical justification in this framework (see my paper). Latent Dirichlet Allocation is a much more elegant and principled semantic dimensionality reduction technique.

Finally, there is a corpus-linguistics mailing list; try writing them at to subscribe -- I'm sure they'll have lots of suggestions for you.

Alexandre Borovik said...

I have to admit that every time Samuel Johnson is mentioned I cannot help but recall the Ink and Incapability episode of Blackadder and the clever trick of inventing new words just to tease the composer of a dictionary. Another modification of the Turing test problem: put yourself in Johnson's shoes. How do you determine that the word was invented and did not exist before?

Anonymous said...

"that a large enough corpus of text, with sufficiently clever statistical processing, is enough to train a machine that would pass the Turing test."

Well, people pass the Turing Test. We have no way to directly access the internal mental operations of other people, and so they may be faking it when they interact with us -- in other words, not really understanding what we are talking about, but undertaking syntactical language manipulations sufficient to persuade us, their interlocutors, that we all share a common semantics. That being the case, I don't see why machines could not do this also.

As the multi-agent systems community knows well, a sufficiently clever software program can always simulate any desired internal state, in order to fool an external observer.

Aryeh said...

I agree entirely that there is no inherent scientific obstacle to having a machine pass the "strong" Turing test (as opposed to the "weak" kind, already passed by ELIZA). The conjunction of "humans are machines" and "humans are intelligent" is as compelling an existence argument as any.

Of course, there's the nativist/empiricist debate (how much of our knowledge is hard-wired vs. experientially acquired). But this debate is largely irrelevant to my learnable-strong-AI claim. It should be possible to build a "universal" learner from some minimal set of assumptions about the "universe".