Artificial Intelligence - Language Resources


LangBase is a large spreadsheet, similar to a thesaurus but organized by word type. Nouns, Verbs, Adjectives, and Adverbs are in LANGBSE1; the second file, LANGBSE2, contains Phrases, Names, and Common Words. This kind of data is useful to anyone researching or interested in Artificial Intelligence, especially Natural Language Processing (NLP): for example, people who want to take a sentence as input from a user and have the computer say something intelligent in response. Because all the data is listed under headwords, it is easy to search for a word and return its subject.

The following workbooks are provided as-is, free of charge. They represent hundreds of hours of collecting and formatting data, so all I ask is that if you find them useful or interesting, you e-mail me. The files are compressed with WinZip 6.3 and are in Microsoft Excel 5.0/97 format. You may want to combine the two spreadsheets into one larger file (I had to split them up to upload them).

Language Base 1
Language Base 2

To work with the data, you can write Excel macros or copy the data into a Microsoft Access database. Alternatively, you can save the data as comma-delimited text and load it into the variables of a C or Basic program.
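As a rough illustration of the comma-delimited route, here is a minimal sketch in Python (rather than C or Basic) that builds a lookup from each word to the headwords of the rows it appears on. The layout is an assumption: the headword is taken to be in the second column (column B) with the related words in the columns after it. The function name load_headword_index and the sample rows are made up for the example and are not taken from the workbooks.

```python
import csv
import io
from collections import defaultdict


def load_headword_index(csv_file, headword_col=1):
    """Map each lowercase word to the set of headwords whose row contains it.

    Assumes the headword sits in column `headword_col` (0-based, so 1 means
    column B) and that the remaining cells on the row are related words.
    """
    index = defaultdict(set)
    for row in csv.reader(csv_file):
        if len(row) <= headword_col or not row[headword_col].strip():
            continue  # skip blank or short rows
        headword = row[headword_col].strip().upper()
        for cell in row[headword_col + 1:]:
            word = cell.strip().lower()
            if word:
                index[word].add(headword)
    return index


# Made-up sample rows standing in for the exported spreadsheet.
sample = io.StringIO(
    "1,ANIMAL,fox,dog,elephant,horse\n"
    "2,CARRIER,elephant,horse,truck\n"
)
index = load_headword_index(sample)
print(sorted(index["horse"]))  # ['ANIMAL', 'CARRIER']
```

With a real export you would pass an open file instead of the in-memory sample, and adjust headword_col if the headwords are in a different column.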

Application Example 1

The following use of this data can produce some interesting results. Have the user type in a sentence, such as "The quick brown fox jumped over the lazy dog." Then, for each word in the sentence that does NOT appear in the list of Common Words (like "the"), replace the word with the headword of its row in the database (headwords are usually in column B). In my example, you get the result: "Transientness animal neglect inactivity animal". With some formatting, you might be able to get: "Transient animals neglect inactive animals", which is a thought-provoking way to restate the sentence entered by the user.
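The substitution described above can be sketched in a few lines of Python. Everything in the two tables is hypothetical: the word-to-headword pairings are guesses chosen to reproduce the sample output, not values copied from the workbook. The sketch also assumes that common words, and words with no headword (here "brown"), are simply dropped.

```python
import re

# Hypothetical stand-ins for the Common Words sheet and the headword column.
COMMON_WORDS = {"the", "over"}
HEADWORD_OF = {
    "quick": "transientness",
    "fox": "animal",
    "jumped": "neglect",
    "lazy": "inactivity",
    "dog": "animal",
}


def to_headwords(sentence):
    """Replace each non-common word with its headword; drop everything else."""
    words = re.findall(r"[a-z']+", sentence.lower())
    result = []
    for word in words:
        if word in COMMON_WORDS:
            continue  # common words carry no subject
        headword = HEADWORD_OF.get(word)
        if headword is not None:
            result.append(headword)
    return " ".join(result).capitalize()


print(to_headwords("The quick brown fox jumped over the lazy dog."))
# Transientness animal neglect inactivity animal
```

Turning this into "Transient animals neglect inactive animals" would need an extra step that adjusts word forms, which is not attempted here.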

Application Example 2

This is a more advanced application, but it produces much nicer results. The basic idea is to calculate a "closeness" between words, where words with a lower closeness are less related in subject. Two words that are on the same row in the database have, say, a closeness of 10. So "fox" and "dog" have a closeness of 10, because they are both on the row with headword "ANIMAL". However, "elephant" and "horse" are much more related, because they share two headword rows: "ANIMAL" and "CARRIER". So they might be assigned a weighted closeness of 100. Therefore, if the user enters any sentence talking about horses and elephants, the computer would seem justified in responding about "carriers" or "beasts of burden".
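One way to read the numbers above is that each shared row multiplies the score by 10, so one shared row gives 10 and two give 100. The sketch below uses that reading, which is an assumption rather than a fixed rule, and the ROWS table is a made-up fragment, not the real data.

```python
# Made-up fragment of the database: headword -> words on that row.
ROWS = {
    "ANIMAL": {"fox", "dog", "elephant", "horse"},
    "CARRIER": {"elephant", "horse"},
}


def shared_headwords(a, b):
    """Return the headwords of every row that contains both words."""
    return {hw for hw, words in ROWS.items() if a in words and b in words}


def closeness(a, b):
    """Score 10 per shared row, compounding: 10, 100, 1000, ... (0 if none)."""
    shared = len(shared_headwords(a, b))
    return 10 ** shared if shared else 0


print(closeness("fox", "dog"))                        # 10
print(closeness("elephant", "horse"))                 # 100
print(sorted(shared_headwords("elephant", "horse")))  # ['ANIMAL', 'CARRIER']
```

The shared headwords themselves are the candidate subjects for a reply, which is why the second function returns them rather than just a count.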

There are many ways to fine-tune this weighting. For example, words that appear together on a short row are probably more related than words that share a long row. Also, some of the headwords are themselves related. Perhaps two words never appear on the same row (this happens a lot), but their headwords are related. You could build a list that stores each headword along with all the headwords related to it. This would allow you to get a subject "closeness" weighting for almost any pair of English words that appear in the database!
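Both refinements can be combined in a small sketch. Here a shared row contributes 1 divided by its length, so short rows count for more, and two words on different rows get a small fixed bonus when their headwords are listed as related. The formula, the related_bonus value, and the ROWS and RELATED tables are all assumptions made up for illustration.

```python
# Made-up data: headword -> words on that row.
ROWS = {
    "ANIMAL": {"fox", "dog", "elephant", "horse", "cat", "cow"},
    "CARRIER": {"elephant", "horse"},
    "VEHICLE": {"truck", "wagon"},
}
# Hypothetical list of related headwords, stored in both directions.
RELATED = {"CARRIER": {"VEHICLE"}, "VEHICLE": {"CARRIER"}}


def headwords_of(word):
    """Return the headwords of every row the word appears on."""
    return {hw for hw, words in ROWS.items() if word in words}


def closeness(a, b, related_bonus=0.05):
    """Higher means more related; 0.0 means no connection was found."""
    heads_a, heads_b = headwords_of(a), headwords_of(b)
    score = 0.0
    # Direct link: a shared row counts for more when the row is short.
    for hw in heads_a & heads_b:
        score += 1.0 / len(ROWS[hw])
    # Indirect link: different rows whose headwords are listed as related.
    for hw_a in heads_a:
        for hw_b in heads_b:
            if hw_b in RELATED.get(hw_a, set()):
                score += related_bonus
    return score


for pair in [("elephant", "horse"), ("fox", "dog"), ("horse", "truck"), ("fox", "truck")]:
    print(pair, round(closeness(*pair), 3))
```

With this toy data, "elephant" and "horse" score highest, "fox" and "dog" come next, "horse" and "truck" get only the related-headword bonus, and "fox" and "truck" score zero.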


Panther - [email protected]
Last Update: September 30, 1999