Thursday, February 03, 2005

Computers Learn from Google Search


Google's search for meaning

Computers can learn the meaning of words simply by plugging into Google. The finding could bring forward the day that true artificial intelligence is developed.

Getting a computer to work out what words mean - to distinguish between "rider" and "horse", say, and to work out how they relate to each other - is a long-standing problem in artificial intelligence research.

One of the difficulties has been working out how to represent knowledge in ways that allow computers to use it. But suddenly that is not a problem any more, thanks to the massive body of text that is available, ready indexed, on search engines like Google (which has more than 8 billion pages indexed).

The meaning of a word can usually be gleaned from the words used around it. Take the word "rider". Its meaning can be deduced from the fact that it is often found close to words like "horse" and "saddle". Rival attempts to deduce meaning by relating hundreds of thousands of words to each other require the creation of vast, elaborate databases that are taking an enormous amount of work to construct.

The "Google distance"
But Paul Vitanyi and Rudi Cilibrasi of the National Institute for Mathematics and Computer Science in Amsterdam, the Netherlands, realised that a Google search can be used to measure how closely two words relate to each other. For instance, imagine a computer needs to understand what a hat is.

To do this, it needs to build a word tree - a database of how words relate to each other. It might start with any two words to see how they relate to each other. For example, if it googles "hat" and "head" together it gets nearly 9 million hits, compared to, say, fewer than half a million hits for "hat" and "banana". Clearly "hat" and "head" are more closely related than "hat" and "banana".

To gauge just how closely, Vitanyi and Cilibrasi have developed a statistical indicator based on these hit counts that gives a measure of a logical distance separating a pair of words. They call this the normalised Google distance, or NGD. The lower the NGD, the more closely the words are related.
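
The article does not spell out the indicator itself. As a rough sketch, the Python below uses the NGD formula as published in the researchers' preprint: the larger of the two log hit counts minus the log of the joint hit count, divided by the log of the index size minus the smaller log hit count. The joint counts (9 million and half a million) and the index size (8 billion) are the rounded figures quoted above; the single-word counts for "hat", "head" and "banana" are hypothetical placeholders, since the article does not give them.

```python
import math

def ngd(fx, fy, fxy, n):
    """Normalised Google distance from raw hit counts.

    fx, fy -- hits for each word searched alone
    fxy    -- hits for both words searched together
    n      -- total number of indexed pages
    """
    if min(fx, fy, fxy) <= 0:
        # No hits at all: treat the words as unrelated.
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

N = 8_000_000_000          # index size quoted in the article

# Hypothetical single-word counts (not from the article).
HAT, HEAD, BANANA = 50_000_000, 200_000_000, 20_000_000

# Joint counts, rounded from the article's figures.
HAT_HEAD = 9_000_000
HAT_BANANA = 500_000

d_head = ngd(HAT, HEAD, HAT_HEAD, N)
d_banana = ngd(HAT, BANANA, HAT_BANANA, N)
print(f"NGD(hat, head)   = {d_head:.3f}")
print(f"NGD(hat, banana) = {d_banana:.3f}")
```

With these placeholder counts, "hat" comes out closer to "head" than to "banana", matching the intuition in the text. Real values would depend on the hit counts a search engine actually returns.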

Automatic meaning extraction
By repeating this process for lots of pairs of words, it is possible to build a map of their distances, indicating how closely related the meanings of the words are. From this a computer can infer meaning, says Vitanyi. "This is automatic meaning extraction. It could well be the way to make a computer understand things and act semi-intelligently," he says.
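
To illustrate the idea of a distance map, here is a minimal sketch that computes the NGD for every pair in a small vocabulary and then looks up each word's nearest neighbour. All hit counts are invented for illustration, and the helper names (`distance_map`, `nearest`) are hypothetical. This is not the researchers' actual procedure, only one simple way to use pairwise distances.

```python
import math
from itertools import combinations

def ngd(fx, fy, fxy, n):
    """Normalised Google distance from raw hit counts."""
    if min(fx, fy, fxy) <= 0:
        return math.inf
    lx, ly, lxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(lx, ly) - lxy) / (math.log(n) - min(lx, ly))

def distance_map(single, pair, n):
    """Return a symmetric dict {(a, b): NGD} over all word pairs."""
    dist = {}
    for a, b in combinations(sorted(single), 2):
        d = ngd(single[a], single[b], pair[frozenset((a, b))], n)
        dist[(a, b)] = dist[(b, a)] = d
    return dist

def nearest(word, dist):
    """Return the word with the smallest NGD to `word`."""
    candidates = {b: d for (a, b), d in dist.items() if a == word}
    return min(candidates, key=candidates.get)

N = 8_000_000_000  # index size quoted in the article

# Invented hit counts, for illustration only.
single = {"red": 100_000_000, "blue": 90_000_000,
          "seven": 60_000_000, "nine": 50_000_000}
pair = {
    frozenset(("red", "blue")): 40_000_000,
    frozenset(("seven", "nine")): 20_000_000,
    frozenset(("red", "seven")): 5_000_000,
    frozenset(("red", "nine")): 4_000_000,
    frozenset(("blue", "seven")): 4_000_000,
    frozenset(("blue", "nine")): 3_000_000,
}

dist = distance_map(single, pair, N)
for word in single:
    print(word, "->", nearest(word, dist))
```

With these made-up counts the colours pair up with each other and so do the numbers, which is the kind of grouping the map is meant to expose.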

The technique has managed to distinguish between colours, numbers, different religions and Dutch painters based on the number of hits they return, the researchers report in an online preprint.

The pair's results do not surprise Michael Witbrock of the Cyc project in Austin, Texas, a 20-year effort to create an encyclopaedic knowledge base for use by a future artificial intelligence. Cyc represents a vast quantity of fundamental human knowledge, including word meanings, facts and rules of thumb.

Witbrock believes the web will ultimately make it possible for computers to acquire a very detailed knowledge base. Indeed, Cyc has already started to draw upon the web for its knowledge. "The web might make all the difference in whether we make an artificial intelligence or not," he says.
