FilterGus Copyright by Mauro Marinilli (August 12, 1998)
FilterGus
0.2
(Documentation version 0.2.0)
Welcome to FilterGus 0.2, a simple, nice and free approach
to Information Filtering!
Version 0.2 is a complete overhaul of the previous one and is the first
non-experimental release of this project! Take a look at a brief, user-oriented
description of it!
FilterGus is a simple Java program that filters any kind of textual
document, given profiles of the user's current interests. This
project is at its very beginning, and for now it is only a little more than
a keyword parser.
What it is
Theoretically..
It is an Information Filtering (IF) program that takes a somewhat innovative
approach to the task. It is written as a Java applet and, incidentally,
it should be the very first fully working applet for this task to be
publicly released over the Internet.
I am currently working on my thesis on the development of an innovative
Information Filtering system, and this little applet was motivated by
curiosity to see whether some ideas I had were a practical way to cover some
aspects of the IF task, such as personal filtering. That said, this
program and its algorithms have little or nothing in common with my thesis
project, and they are developed purely out of personal interest. I will try
to keep out all the technical jargon, at least in this first release of the docs.
The key idea was to explore the use of RL grammars (and their fast
analyzers) for the problem of IF, particularly personal IF.
What is better than a little program embedded in a browser for this
task? The bet is to use this framework (revisited ad hoc, anyway)
to cope with the morphological variations of natural languages (don't think
of English!), extending it to increase the effectiveness of the filtering
process. The scanning algorithm devised for this kind of application was
fancifully called the Matching Tree Algorithm, so we can boast that
FilterGus uses an MTA technique. I designed it especially for
this task, aiming at a mechanism that works at the lowest possible level.
For now, in short, it has been implemented only halfway.
That was just a bit of theory, quite far from what is actually
implemented..
Practically..
This simple program starts up when you load (locally) the starting
HTML page.
Give it a URL and it will filter that document and return a score based
on a profile you have previously loaded.
It is useful to people who need to search through a great number of
documents. It can save a lot of time! For example, if you need to run a
big search over a lot of documents, maybe the results from a search engine,
you could consider making the effort to write a profile, i.e. a file that
describes to FilterGus what you're looking for. In the current release a profile
is just a list of words with some attributes, such as the score given on a match
or the suffixes allowed. But this tiny program lets you do a lot more:
with the right profiles it can filter every kind of textual document,
not only HTML, although for this format (and other hypertext formats)
it has special searching abilities. Read on if you're interested
in how it works and how to write your profiles.
How it Works
Profile
A Profile is made up of XML 1.0 files with the .gus suffix, so
you can edit them with your text editor as long as the right syntax is
maintained. These files define a kind of language that can teach 'Gus
many things about the documents he's going to parse for you. You can point
out particularly important areas of a document, for example the title,
and emphasize the words found there using the dw="" attribute.
You can also express the importance of a word by giving it a score, using
the cw="" attribute. (The following description is not required
reading, because FilterGus can do all the work for you without any knowledge
of what goes on behind the scenes.)
So, an element in a .gus file looks like this:
<E WORD="word" DW="x" CW="y" >
- DW= specifies the weight in the document for that word
- CW= is the content weight for that word; for now you have the following
  choices:
  - CW="-" adds a negative weight: used when you don't want that
    word!
  - CW="0" adds nothing; useful for stoplist words, for example.
    This is the default.
  - CW="1" a good word for this profile
  - CW="2" a very good word
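Since .gus files are XML, an element like the one above can be read with the XML parser that ships with Java. The following is only a sketch: the `<PROFILE>` root element, the self-closing form of `<E/>` and the class and method names are assumptions made for illustration, not part of the documented .gus format.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class GusElementSketch {

    // Reads the WORD, DW and CW attributes of the first <E> element found.
    public static String[] readFirstEntry(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Element e = (Element) doc.getElementsByTagName("E").item(0);
            return new String[] {
                e.getAttribute("WORD"), e.getAttribute("DW"), e.getAttribute("CW")
            };
        } catch (Exception ex) {
            // Malformed XML or no <E> element at all
            throw new IllegalStateException("not a readable profile fragment", ex);
        }
    }

    public static void main(String[] args) {
        // <PROFILE> is a hypothetical root element; the self-closing <E/> is also assumed.
        String xml = "<PROFILE><E WORD=\"filter\" DW=\"1\" CW=\"2\"/></PROFILE>";
        String[] entry = readFirstEntry(xml);
        System.out.println(entry[0] + " dw=" + entry[1] + " cw=" + entry[2]);
    }
}
```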
The final score is normalised so that it can be negative but is never greater
than 100. It is the sum of the partial scores of all the matching words
whose cw is different from zero.
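To make the idea concrete, here is a rough Java sketch of such a score. The exact formula is not documented here, so everything specific in it is an assumption: each match is taken to contribute dw * cw, CW="-" is modelled as -1, and the normalisation is reduced to a simple cap at 100. The class and method names are hypothetical.

```java
import java.util.List;

public class ScoreSketch {

    /** One matched profile word: dw = document weight, cw = content weight. */
    public static final class Match {
        final String word;
        final int dw;
        final int cw;

        public Match(String word, int dw, int cw) {
            this.word = word;
            this.dw = dw;
            this.cw = cw;
        }
    }

    // Assumption: a match contributes dw * cw; CW="-" is modelled as cw = -1.
    public static int score(List<Match> matches) {
        int sum = 0;
        for (Match m : matches) {
            if (m.cw != 0) {            // words with cw = 0 (e.g. stoplist) add nothing
                sum += m.dw * m.cw;
            }
        }
        return Math.min(sum, 100);      // assumed normalisation: capped at 100, may be negative
    }

    public static void main(String[] args) {
        List<Match> matches = List.of(
                new Match("filtering", 2, 2),   // very good word in an emphasised area
                new Match("the", 1, 0),         // stoplist word
                new Match("spam", 1, -1));      // unwanted word
        System.out.println(score(matches));     // 2*2 + 0 + 1*(-1) = 3
    }
}
```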
You can also put many variants of a word in one expression, obtaining more
or less the same result a stemmer would. (This point was interesting
for testing whether this different approach could allow both good Precision and
Recall at the same time while dealing with morphological variations; but
the real test shouldn't be performed only on the English language, which
is too coarse in this respect.)
In this early version, profiles are extracted from a document when you
push the "feedback" button. So far this is just a sketched functionality:
it reports all the non-matching words with a standard score
and without caring about suffixes. For now it's your task to edit the result
manually if you want very high quality.
Speaking of sophisticated profile handling, remember that profiles
can be layered, so you can treat them like Java classes, for example, reusing
them as much as possible. If you're going to make a profile about some
particular hardware component, for example, try to write two or three
profiles: one for Computers, another for Hardware and the last one for your
actual need. Of course, you'd better look here first for those general-purpose
profiles, and if you write your own, please send them here! A central
Profile database would allow 'Gus to enter the world of Social (or collaborative)
Filtering, in the clumsiest way, him being an Octopus..
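One way to picture layering is as a merge of word lists, from the most general profile to the most specific one. The sketch below is only an illustration: how FilterGus actually resolves a word that appears in several layers is not described here, so the rule that a more specific layer overrides a more general one is an assumption, as are the class name and the sample words.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LayeredProfileSketch {

    // Merges layers given from most general to most specific (word -> content weight).
    public static Map<String, Integer> layer(List<Map<String, Integer>> layers) {
        Map<String, Integer> merged = new LinkedHashMap<>();
        for (Map<String, Integer> current : layers) {
            merged.putAll(current); // assumption: later, more specific layers override earlier ones
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Integer> computers = Map.of("computer", 1, "software", 1);
        Map<String, Integer> hardware = Map.of("motherboard", 1, "software", 0);
        Map<String, Integer> myNeed = Map.of("scsi", 2);

        Map<String, Integer> profile = layer(List.of(computers, hardware, myNeed));
        System.out.println(profile.get("software")); // 0, overridden by the Hardware layer
        System.out.println(profile.get("scsi"));     // 2
    }
}
```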
You may have noticed that FilterGus always loads some default profiles; they are
written in the startup.gus file (where you'll also find all the system
variables). They are a standard stoplist (go take a look) and a short HTML
profile, where you'll see an example of dw attributes set
to discriminate between different parts of structured text.
Suffixes can also be set, up to nine in this version, so you can express
the ways a word can match. Two suffixes are special: "+", which handles
regular English plurals (casualty-casualties, etc.), and "*", which works
as in regular expressions. This aspect needs refining, because the goal
is to keep the program language-independent and rely entirely on the loaded
profiles. For example, say you want to match
computer , computers , computing , computation , computations ,
..
all in one; with this simple mechanism you can, without resorting to
weird stemmed words such as comput-, which could just as well come from the
word "compulsory", and without any performance loss.
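The suffix mechanism described above can be sketched in a few lines of Java. This is not the Matching Tree Algorithm, just a plain illustration under stated assumptions: the bare word is assumed to match on its own, "+" is reduced to three simplified plural rules, and "*" is taken to accept any trailing characters. The class and method names are hypothetical, and the way suffixes are written inside a .gus file is not shown.

```java
import java.util.List;

public class SuffixMatchSketch {

    // Simplified regular English plural check for the "+" suffix (assumed rules).
    static boolean isRegularPlural(String word, String token) {
        if (token.equals(word + "s") || token.equals(word + "es")) {
            return true;
        }
        // casualty -> casualties
        return word.endsWith("y")
                && token.equals(word.substring(0, word.length() - 1) + "ies");
    }

    // True if token is the word itself or the word followed by an allowed suffix.
    public static boolean matches(String word, List<String> suffixes, String token) {
        if (suffixes.size() > 9) {
            throw new IllegalArgumentException("this version allows at most nine suffixes");
        }
        if (token.equals(word)) {
            return true; // assumption: the bare word always matches
        }
        for (String suffix : suffixes) {
            if (suffix.equals("+")) {
                if (isRegularPlural(word, token)) {
                    return true;
                }
            } else if (suffix.equals("*")) {
                if (token.startsWith(word)) {
                    return true; // assumption: "*" accepts any trailing characters
                }
            } else if (token.equals(word + suffix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> suffixes = List.of("er", "ers", "ing", "ation", "ations");
        System.out.println(matches("comput", suffixes, "computing"));        // true
        System.out.println(matches("comput", suffixes, "compulsory"));       // false
        System.out.println(matches("casualty", List.of("+"), "casualties")); // true
    }
}
```

Unlike a stemmer, only the endings listed explicitly can match, so unrelated words are not folded into the same entry.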
Known Problems
- The applet 0.2 release will come soon; use 0.1 in the meantime.
- Efficiency has been left for future releases. (This is a very common habit,
  isn't it?)
- So far, manipulating "suspended" threads ("zz..") causes deadlock:
  don't hit the stop or filter buttons while links are in (suspended)
  mode. I preferred to solve this later with a clean design rather than keep
  going with patches (a rare habit, this one..)
- "Delikatessen" such as icons and other refinements, as well as user-friendly
  settings facilities, etc., will come later on. So for now, amuse yourself
  with the strange character sequences beside each link that express its current
  state.
Further issues
First, a strategic decision: whether to let this
project evolve into a fully developed IF system, with plenty of sophisticated
features but also a big size, or to keep it as simple as possible, in line
with the applet viewpoint. The answer depends both on technical issues (the
embedded Java environment in next-generation browsers) and on user demand.
Then, multilanguage support is a very important
issue and needs further work on general mechanisms that would allow
any language to be handled by Gus. That could be too ambitious a target,
but for most languages (Western ones, Cyrillic) it should be quite feasible.
Note that the particular architecture of FilterGus allows him to filter
documents in two or more languages at the same time!
Another important issue is the parsing algorithm,
the real engine of this piece of software. The home-made approach to the
task, the so-called Matching Tree Algorithm, has proved
expensive to develop (from scratch) and still needs a lot of improvement. So
the question arises of whether to switch altogether to a more proven and
efficient LR-parsing tool; this would be a big change from the original idea,
where the matching algorithm (designed especially for this task) was a major
part of the whole architecture.
Also, multi-format document support and a truly complete set
of actions covering all the possible aspects of IF (.. to IR ) are parts
of the FilterGus architecture. All of them are already developed, but they sit
further down the schedule, being complementary issues. Designing them was
great fun, but the coding details are not so interesting to me.
Scheduling
No new features will be added until January '99, when I'll get out
of that never-ending University course.
In the meantime there will be only maintenance releases. Efficiency is bound to
the parsing algorithm, the core of the system but also its most work-demanding
part.
Finally, it would be great to hear from interested people, also about ways
to develop open mechanisms for different languages.
Thank You for Your attention,
hoping that Gus will be useful to You!
Copyright (c) Mauro Marinilli August 1998