FilterGus 0.2

(Documentation version 0.2.0)

Welcome to FilterGus 0.2, a simple, nice and free approach to Information Filtering!

Attention: version 0.2 is a complete rework of the previous one and is the first non-experimental release of this project! Take a look at the brief, user-oriented description of it!

FilterGus is a simple Java program that filters any kind of textual document, given Profiles of the user's current interests. This project is at its very beginning, and for now it is just a little more than a keyword parser.

What it is

Theoretically..

It is an Information Filtering program that takes a somewhat innovative approach to the task. It is written as a Java applet and, by the way, it should be the very first fully working applet for this task publicly released over the Internet.
I am currently working on my Thesis, on the development of an innovative Information Filtering system, and this little applet was motivated by curiosity: I wanted to see whether some ideas I had were a practical way to cover some aspects of the IF task, such as personal filtering. In any case, this program and its algorithms have little or nothing in common with my Thesis project; they are developed purely out of personal interest. I will try to keep out all the technical jargon, at least in this first release of the docs..
The key idea was to explore the use of RL grammars (and their fast analyzers) for the problem of IF, particularly personal IF. What is better than a little program embedded in a browser for this task? The bet is to use this framework (revisited ad hoc, anyway) to cope with the morphological variations of natural languages (don't think of English!), extending it to increase the effectiveness of the filtering process. The scanning algorithm devised for this kind of application was fancifully named the Matching Tree Algorithm, so we can boast that FilterGus uses an MTA technique. I designed it especially for this task, aiming for a working mechanism at the lowest possible level. For now, in short, it has been implemented only halfway.
This was just a bit of theory, quite far from the actual implemented reality..

Practically..

This simple program starts up when you load (locally) the starting HTML page. Give it a URL and it will filter that document and return a score based on a profile you have previously loaded.
It is useful to people who need to search through a great number of documents. It can save a lot of time! For example, if you need to run a big search over many documents, perhaps the results from a Search Engine, you might find it worth the effort to write a profile, i.e. a file that describes to FilterGus what you are looking for. In the current release a profile is just a list of words, each with some attributes such as the score it earns when matched or the suffixes it allows. But this tiny program lets you do a lot more: with the right profiles it can filter every kind of textual document, not only HTML, although for this format (and other hypertext formats) it has special searching abilities. Read on if you want to know how it works and how to write your own profiles.

How it Works

Profile

The Profile is made up of XML 1.0 files with the .gus suffix, so you can edit them with your text editor as long as the right syntax is maintained. These files define a kind of language that can teach 'Gus many things about the documents he is going to parse for you. You can point out particularly important areas of a document, for example the title, and emphasize the words found there using the dw="" attribute. You can also express the importance of a word by giving it a score, using the cw="" attribute. (The following description is not required reading, because FilterGus can do all the work for you without your knowing anything about what goes on behind the scenes.)
So, an element in a .gus file looks like this: <E WORD="word" DW="x" CW="y" >

DW= specifies the weight in the document for that word
CW= is the content weight for that word; for now you have the following choices:

CW="-" adds a negative weight: used when you don't want that word!
CW="0" adds nothing; useful for stoplist words, for example. This is the default
CW="1" a good word for this profile
CW="2" a very good word
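
As a rough illustration of the attributes listed above, the following Java sketch reads one element and turns its CW value into a number. The class name GusEntrySketch, its method names, the value -1 chosen for CW="-" and the sample DW="1" are assumptions made for this example only; none of them is taken from the FilterGus sources.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;

// Hypothetical sketch, not FilterGus code: reads one E element and
// turns its CW attribute into a number.
final class GusEntrySketch {

    /** Maps a CW value to a number. The -1 used for "-" is an assumption. */
    static int contentWeight(String cw) {
        if (cw == null || cw.isEmpty()) {
            return 0; // a missing CW falls back to the documented default, "0"
        }
        switch (cw) {
            case "-": return -1; // unwanted word (assumed magnitude)
            case "0": return 0;  // stoplist word, adds nothing
            case "1": return 1;  // good word for this profile
            case "2": return 2;  // very good word
            default:
                throw new IllegalArgumentException("Unknown CW value: " + cw);
        }
    }

    /** Parses a single self-closed E element given as a string. */
    static Element parseEntry(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not a well-formed element: " + xml, e);
        }
    }

    public static void main(String[] args) {
        // DW="1" is a made-up value; the text only shows the placeholder "x".
        Element e = parseEntry("<E WORD=\"computer\" DW=\"1\" CW=\"2\"/>");
        System.out.println(e.getAttribute("WORD") + " scores "
                + contentWeight(e.getAttribute("CW")));
    }
}
```

The element is written self-closed here only so that it can be parsed on its own; how elements are nested inside a real .gus file is not covered by this sketch.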

The final score is normalised, so that it can be negative but is never bigger than 100. It is the sum of the partial scores of all the matching words whose cw is different from zero.
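
The exact normalisation is not spelled out here, so the following Java sketch only shows one possible reading of the paragraph above. It assumes that each match contributes dw times cw, that CW="-" counts as -1, and that the sum is divided by the best value the same matches could have reached (every word at CW="2"). The class GusScoreSketch and everything in it is hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a normalised score; the formula is an assumption.
final class GusScoreSketch {

    static final int MAX_CW = 2; // highest content weight described in the text

    /** One matched word with its document weight (DW) and content weight (CW). */
    static final class Hit {
        final double dw;
        final int cw;

        Hit(double dw, int cw) {
            this.dw = dw;
            this.cw = cw;
        }
    }

    /** Returns a score that can be negative and never exceeds 100 (for dw > 0). */
    static double score(List<Hit> hits) {
        double raw = 0.0;   // sum of the partial scores
        double best = 0.0;  // what the same hits would score at CW = MAX_CW
        for (Hit h : hits) {
            if (h.cw == 0) {
                continue;   // stoplist words do not take part in the score
            }
            raw += h.dw * h.cw;
            best += h.dw * MAX_CW;
        }
        return best == 0.0 ? 0.0 : 100.0 * raw / best;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit(2.0, 2),    // very good word found in an emphasised area
                new Hit(1.0, -1),   // unwanted word
                new Hit(1.0, 0));   // stoplist word, ignored
        System.out.println(score(hits));
    }
}
```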
You can also put many word variants into one expression, to obtain more or less the same result a stemmer would. (This point was interesting for testing whether this different approach could allow both good Precision and good Recall at the same time while dealing with morphological variations; but the real test should not be performed only on English, which is too coarse a language in this respect.)
In this early version, profiles are extracted from a document when you push the "feedback" button. So far this is just a sketched functionality: it reports all the non-matching words, with a standard score and without caring about suffixes. For now it is your task to edit the result manually if you want very high quality.
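
The feedback step described above could look roughly like the following Java sketch. It is not the applet's code: the class GusFeedbackSketch, the way the text is split into words, and the values DW="1" and CW="1" used as the "standard score" are all assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the "feedback" button: collect the words of a
// document that the loaded profile does not contain.
final class GusFeedbackSketch {

    static final String STANDARD_DW = "1"; // assumed value
    static final String STANDARD_CW = "1"; // assumed "standard score"

    /** Returns the distinct non-matching words, in order of first appearance. */
    static Set<String> unmatchedWords(String text, Set<String> profileWords) {
        Set<String> result = new LinkedHashSet<>();
        for (String token : text.split("[^\\p{L}]+")) { // split on non-letters
            String word = token.toLowerCase(Locale.ROOT);
            if (!word.isEmpty() && !profileWords.contains(word)) {
                result.add(word);
            }
        }
        return result;
    }

    /** Formats one word in the element shape shown earlier; no suffixes are added. */
    static String toElement(String word) {
        return "<E WORD=\"" + word + "\" DW=\"" + STANDARD_DW
                + "\" CW=\"" + STANDARD_CW + "\" >";
    }

    public static void main(String[] args) {
        Set<String> known = new HashSet<>(Arrays.asList("the"));
        for (String w : unmatchedWords("The new printer driver, the PRINTER!", known)) {
            System.out.println(toElement(w));
        }
    }
}
```

The resulting list is only a starting point: as the paragraph says, scores and suffixes still have to be adjusted by hand.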
Talking about sophisticated profile handling, remember that profiles can be layered, so you can treat them like Java classes, for example, reusing them as much as possible. If you are going to make a profile about some particular hardware component, for example, try to write two or three profiles: one for Computers, another for Hardware and the last one for your actual need. Of course, you had better look here first for those general-purpose profiles, and if you write your own, please send them here! A central Profile database would allow 'Gus to enter the world of Social (or collaborative) Filtering, in the clumsiest way, he being an Octopus..
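
To make the layering idea above concrete, here is a minimal Java sketch in which each profile is reduced to a map from word to content weight. How FilterGus really combines layers is not described here; the rule that a more specific profile overrides a more general one, like the class GusLayerSketch itself, is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of layered profiles: general ones first, specific ones last.
final class GusLayerSketch {

    /** Merges profiles in order; a later layer overrides an earlier one (assumed rule). */
    @SafeVarargs
    static Map<String, Integer> layer(Map<String, Integer>... profiles) {
        Map<String, Integer> merged = new LinkedHashMap<>();
        for (Map<String, Integer> profile : profiles) {
            merged.putAll(profile);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Integer> computers = new LinkedHashMap<>();
        computers.put("computer", 1);
        computers.put("printer", 1);

        Map<String, Integer> hardware = new LinkedHashMap<>();
        hardware.put("motherboard", 1);

        Map<String, Integer> myNeed = new LinkedHashMap<>();
        myNeed.put("printer", 2); // more important for this particular search

        System.out.println(layer(computers, hardware, myNeed));
    }
}
```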
You may have noticed that FilterGus always loads some default profiles. They are listed in the startup.gus file (where you will also find all the System's variables): a standard stoplist (go take a look) and a short HTML profile, where you will see an example of dw attributes set to discriminate between the parts of a structured text.
Suffixes can also be set, up to nine in this version, so you can express the ways in which a word can match. Two suffixes are special: "+" handles regular English plurals (casualty-casualties, etc.) and "*" works as in regular expressions. This aspect is still to be refined, because the goal is to keep the program unspecific about languages and to rely entirely on the loaded profiles. For example, suppose you want to match computer, computers, computing, computation, computations, .. all in one. With this simple mechanism you can, without resorting to odd stemmed words such as comput-, which could come from the word "compulsory" as well.. And without any performance loss.
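
The suffix mechanism described above can be sketched in Java as follows. This is only an approximation written for this document: the class GusSuffixSketch, the crude plural rule behind "+", the reading of "*" as "any ending", and the choice to compare case-sensitively are assumptions, and the code has nothing to do with the Matching Tree Algorithm.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of suffix matching; not the applet's algorithm.
final class GusSuffixSketch {

    static final int MAX_SUFFIXES = 9; // limit stated for this version

    /** True if the token is the stem followed by one of the allowed suffixes. */
    static boolean matches(String stem, List<String> suffixes, String token) {
        if (suffixes.size() > MAX_SUFFIXES) {
            throw new IllegalArgumentException("At most " + MAX_SUFFIXES + " suffixes");
        }
        for (String suffix : suffixes) {
            if (suffix.equals("*")) {
                if (token.startsWith(stem)) {
                    return true; // "*" is read as "any ending" (assumption)
                }
            } else if (suffix.equals("+")) {
                if (isSingularOrPlural(stem, token)) {
                    return true;
                }
            } else if (token.equals(stem + suffix)) {
                return true;
            }
        }
        return false;
    }

    /** Crude rule for regular English plurals: word, word+s, word+es, y -> ies. */
    static boolean isSingularOrPlural(String word, String token) {
        if (token.equals(word) || token.equals(word + "s") || token.equals(word + "es")) {
            return true;
        }
        return word.endsWith("y")
                && token.equals(word.substring(0, word.length() - 1) + "ies");
    }

    public static void main(String[] args) {
        List<String> endings = Arrays.asList("er", "ers", "ing", "ation", "ations");
        System.out.println(matches("comput", endings, "computing"));
        System.out.println(matches("comput", endings, "compulsory"));
        System.out.println(matches("casualty", Arrays.asList("+"), "casualties"));
    }
}
```

Because only the listed endings are accepted, a word that merely shares the first letters with the stem is rejected, which is the point made in the paragraph above.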

Known Problems

The applet 0.2 release will come soon; use the 0.1 instead.
Efficiency has been left for future releases. (This is a very common habit, isn't it?)
So far, manipulating "suspended" threads ("zz..") causes deadlock: don't hit the stop or filter buttons while links are in (suspended) mode. I preferred to solve this later with a clean design rather than keep going with patches (a rare habit, this one..)
"Delikatessen" such as icons and other refinements, as well as user-friendly settings facilities, etc., will come later on. So for now, amuse yourself with the strange character sequences beside each link that express its current state.

Further issues

    Firstly, a strategic decision: whether to let this project evolve into a fully developed IF system, with plenty of sophisticated features but also a big size, or to keep it as simple as possible, in line with the Applet viewpoint. The answer depends both on technical issues (the embedded Java environment in next-generation Browsers) and on user demand.
    Then, multilanguage support is a very important issue, and further work is needed to study general mechanisms that would allow any language to be handled by Gus. That may be too ambitious a target, but for most languages (western ones, cyrillic) it should be fairly feasible. Note that the particular architecture of FilterGus allows him to filter documents in two or more languages at the same time!
    Another important issue is the parsing algorithm, the real engine of this piece of software. The home-made approach to the task, the so-called Matching Tree Algorithm, has proved expensive to develop (from scratch) and still needs a lot of enhancement. So the question arises of whether to switch altogether to a more proven and efficient LR-parsing tool; this would be a big change from the first idea, where the matching algorithm (especially designed for this task) was a major part of the whole architecture.
    Also, support for multiple document formats and a really complete set of actions to cover all the possible aspects of IF (.. to IR) are parts of the FilterGus architecture. All of them are already developed but come further down the schedule, being complementary issues. Designing them was great fun, but the coding details are not so interesting to me.

Scheduling

No new features will be added until January '99, when I'll get out of that never-ending University course.
In the meantime there will be only maintenance releases. Efficiency is bound to the parsing algorithm, the core of the system but also the part that demands the most work.

Finally, it would be nice to hear from interested people, also about ways to develop open mechanisms for different languages.

Thank You for Your attention, hoping that Gus will be useful to You!


FilterGus Copyright By Mauro Marinilli (August, 12 1998)