Artificial Intelligence - Language Resources
LangBase is a large spreadsheet, similar to a thesaurus, in which the entries are grouped by
word type. Nouns, Verbs, Adjectives, and Adverbs are in LANGBSE1. The second file, LANGBSE2,
contains Phrases, Names and Common Words. This kind of data is very useful to people who are researching
or interested in Artificial Intelligence, especially Natural Language Processing (NLP): for example, people
who like to take a sentence as input from a user and have the computer say something intelligent
in response. Since every row of data is listed under a headword, it is very easy to search for a word and
return its subject.
The following workbooks are provided as-is, free of charge. They represent hundreds of hours of collecting and
formatting data, so all I ask is that if you find them useful or interesting, please
e-mail me. The files are compressed using WinZip 6.3, and are
in Microsoft Excel 5.0/97 format. You may want to combine the two spreadsheets into one larger file (I had
to split them up to upload them).
Language Base 1
Language Base 2
You can write Excel macros against the data, or copy it into a Microsoft Access database. Alternatively, you can save
the data as comma-delimited text, and load it into the variables of a C or Basic program.
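As a rough sketch of the comma-delimited route, here is one way the exported text could be loaded into memory. It is written in Python rather than C or Basic only to keep it short. The sample rows, the function names load_rows and build_index, and the layout (headword in column B, related words in the columns after it) are assumptions made for illustration; check them against the real workbooks before relying on them.

```python
import csv
import io

# Hypothetical sample standing in for a comma-delimited export of one sheet.
# Assumed layout: column A is a row number, column B the headword,
# and every later column a word filed under that headword.
SAMPLE_CSV = """\
1,ANIMAL,fox,dog,elephant,horse
2,CARRIER,elephant,horse,porter
3,INACTIVITY,lazy,idle
"""

def load_rows(text, headword_col=1):
    """Return {headword: set of words} parsed from comma-delimited text."""
    rows = {}
    for record in csv.reader(io.StringIO(text)):
        if len(record) <= headword_col:
            continue  # skip blank or short lines
        headword = record[headword_col].strip().upper()
        if not headword:
            continue
        words = {cell.strip().lower()
                 for cell in record[headword_col + 1:] if cell.strip()}
        rows.setdefault(headword, set()).update(words)
    return rows

def build_index(rows):
    """Invert the rows into {word: set of headwords it appears under}."""
    index = {}
    for headword, words in rows.items():
        for word in words:
            index.setdefault(word, set()).add(headword)
    return index

rows = load_rows(SAMPLE_CSV)
index = build_index(rows)
print(sorted(index["horse"]))
```

With a real export you would read the file's contents and pass them to load_rows in place of SAMPLE_CSV. The inverted index is what makes "search for a word, return the subject" a single lookup.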
Application Example 1
The following use of this data can produce some interesting results. Have the user
type in a sentence, such as "The quick brown fox jumped over the lazy dog." Then, for each word in the sentence
that does NOT appear in the list of Common Words (like "the"), replace the word with the headword of the row
it appears on (the headwords are usually in column B of the database). In my example, you get the result:
"Transientness animal neglect inactivity animal". With some formatting, you might be able to get:
"Transient animals neglect inactive animals", which is a thought-provoking way to look at the
sentence entered by the user.
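The substitution described above might be sketched like this in Python. The two small tables are hypothetical stand-ins for the Common Words list and the headword lookup; in particular, which word maps to which headword is my guess, worked backwards from the result quoted above. Words that are found in neither table (here, "brown") are simply dropped, which is also an assumption.

```python
import string

# Hypothetical stand-ins for the real data. The pairings are guessed
# from the example result, not read from the workbooks.
COMMON_WORDS = {"the", "over", "a", "an"}
WORD_TO_HEADWORD = {
    "quick": "TRANSIENTNESS",
    "fox": "ANIMAL",
    "jumped": "NEGLECT",
    "lazy": "INACTIVITY",
    "dog": "ANIMAL",
}

def to_headwords(sentence):
    """Replace each non-common word with its headword; drop unknown words."""
    result = []
    for token in sentence.split():
        word = token.strip(string.punctuation).lower()
        if not word or word in COMMON_WORDS:
            continue  # common words carry no subject
        headword = WORD_TO_HEADWORD.get(word)
        if headword is not None:
            result.append(headword.lower())
    return " ".join(result).capitalize()

print(to_headwords("The quick brown fox jumped over the lazy dog."))
# prints: Transientness animal neglect inactivity animal
```

Turning this into "Transient animals neglect inactive animals" needs some extra handling of word endings, which this sketch does not attempt.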
Application Example 2
This is a more advanced application, but produces much nicer results. The basic idea is to calculate a "closeness"
between words, where words with a lower closeness are less related in subject. Two words that share one
row in the database have, say, a closeness of 10. So "fox" and "dog" have a closeness of 10, because they are both
on the row with headword "ANIMAL". However, "elephant" and "horse" are much more related, because they
share two headword rows - "ANIMAL" and "CARRIER". So they might be assigned a weighted closeness of 100.
Therefore, if the user enters any sentence talking about horses and elephants, the computer would probably seem
justified responding about "carriers" or "beasts of burden".
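Here is a minimal Python sketch of that scoring, using only the two rows mentioned in the text. The rule "10 to the power of the number of shared rows" is just one assumption that reproduces the 10 and 100 in the example; the ROWS table and the function names are made up for illustration.

```python
# Hypothetical miniature of the database: headword -> words on that row.
ROWS = {
    "ANIMAL": {"fox", "dog", "elephant", "horse"},
    "CARRIER": {"elephant", "horse"},
}

def shared_headwords(a, b):
    """Headwords of every row that contains both words."""
    return {h for h, words in ROWS.items() if a in words and b in words}

def closeness(a, b):
    """Assumed rule: 10 per shared row, multiplied together; 0 if none."""
    n = len(shared_headwords(a, b))
    return 10 ** n if n else 0

print(closeness("fox", "dog"))                       # prints: 10
print(closeness("elephant", "horse"))                # prints: 100
print(sorted(shared_headwords("elephant", "horse")))
```

The shared headwords themselves are the useful by-product: for "elephant" and "horse" they include "CARRIER", which is the subject the computer could respond about.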
There are many ways to fine-tune this weighting. For example, words that appear together on a short
row are probably more related than ones that share a long row. Also, some of the headwords are
related. Perhaps two words never appear on the same row (this happens a lot) but their headwords
are related. You could build a list which stores the headwords, and all the headwords that are
related to them. This would allow you to get a subject "closeness" weighting for any pair of words
that appear in the database!
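Both refinements can be sketched together in Python. Everything here is an assumption chosen for illustration: the sample rows, the RELATED list, weighting a shared row by 1 divided by its length, and the related_weight of 0.5. The point is only to show where each idea would plug in, not to suggest these particular numbers.

```python
# Hypothetical sample data: headword -> words on that row.
ROWS = {
    "ANIMAL": {"fox", "dog", "elephant", "horse", "cat", "cow"},
    "CARRIER": {"elephant", "horse"},
    "VEHICLE": {"cart", "wagon"},
}
# Hypothetical list of related headwords, stored in both directions.
RELATED = {"CARRIER": {"VEHICLE"}, "VEHICLE": {"CARRIER"}}

def headwords_of(word):
    """All headwords whose row contains the word."""
    return {h for h, words in ROWS.items() if word in words}

def closeness(a, b, related_weight=0.5):
    """Score shared rows (short rows count more) plus related headwords."""
    ha, hb = headwords_of(a), headwords_of(b)
    score = 0.0
    # Shared rows: a short row says more about its words than a long one.
    for h in ha & hb:
        score += 1.0 / len(ROWS[h])
    # No shared row needed: credit headwords that are listed as related.
    for h1 in ha:
        for h2 in RELATED.get(h1, ()):
            if h2 in hb:
                score += related_weight / (len(ROWS[h1]) + len(ROWS[h2]))
    return score

for pair in [("fox", "dog"), ("elephant", "horse"), ("horse", "cart")]:
    print(pair, round(closeness(*pair), 3))
```

In this sketch "horse" and "cart" never share a row, yet they still get a small score because "CARRIER" and "VEHICLE" are listed as related.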
Panther - [email protected]
Last Update: September 30, 1999