Welcome to the home page for the LIS429 Indexing project.
This project aims to take a number of HTML files that contain genus and species information on plants of North America, index them, and make the index searchable via the Web. This data is part of the Flora of North America (FNA) Project.
The first phase of this project was to write an indexing program for the collection of HTML files. This phase has three stages:
- 1. Text processing of the FNA files.
The work involves breaking each file into multiple files and then stripping the HTML out of each of these files. Each species file should end up formatted to highlight key terms inside HTML bold tags. Key terms include the genus-species name and descriptor terms such as leaves.
Anton and Alex's converter.pl handles the FNA data file manipulation and separation.
Our output files are named 10002.html through 10090.html and are created from the original Txxxxxx.html files. Both sets of files can be found on oak at:
/usr2/people/lis429/public_html/aa
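The stripping and highlighting described in stage 1 can be sketched roughly as follows. This is a minimal illustration in Python, not the project's converter.pl (which is written in Perl and is not reproduced here). The descriptor list, the function names, and the regex-based tag removal are all assumptions made for the example.

```python
import re

# Hypothetical descriptor terms; the project's real list is Eric's master list.
DESCRIPTORS = ["leaves", "stems", "flowers"]

def strip_html(text):
    """Remove anything that looks like an HTML tag and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", no_tags).strip()

def mark_key_terms(text, terms):
    """Wrap each whole-word occurrence of a key term in <b>...</b>."""
    alternatives = "|".join(re.escape(t) for t in terms)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: "<b>" + m.group(1) + "</b>", text)

def process_species(raw_html, species_name, terms=DESCRIPTORS):
    """Strip the markup, then bold the species name and descriptor terms."""
    plain = strip_html(raw_html)
    return mark_key_terms(plain, [species_name] + list(terms))
```

For example, process_species("<p>Quercus alba: <i>Leaves</i> oval.</p>", "Quercus alba") yields text in which only the species name and the word "Leaves" carry bold tags.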
- 2. Conversion of the processed text file to an index.
This stage involves two Perl programs. The program fnaprocess.pl converts the HTML files that stage 1 produces (i.e. the 10000.html series) to a form that can be readily indexed. The program fnaindex.pl is the indexing program. It incorporates the indexing steps and the gdbm database file creation as written by Geoff (see stage 3).
This work involves taking the bold-tagged species files from step one and producing indexes for each of the descriptor terms. Eric produced a list of common descriptor terms, and this list will be used as a master. Each descriptor in the master list will result in one index.
As of 4/11/98, fnaindex.pl is called from the command line as follows:
fnaindex.pl -i /usr2/people/lis429/public_html/fnaprocess/ -o /usr2/people/lis429/public_html/fnadatabases/ -b bold.wrd -s stop.wrd
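The idea of one index per descriptor can be sketched as below. This is a Python illustration only, not fnaindex.pl itself. It assumes that the text between one bold descriptor and the next bold tag is that descriptor's description, and that each index maps a word to the species files containing it; the page states neither detail. The bold_words and stop_words parameters stand in for the contents of bold.wrd and stop.wrd, whose formats are likewise assumed.

```python
import re
from collections import defaultdict

def build_indexes(docs, bold_words, stop_words):
    """Return {descriptor: {word: sorted list of doc ids}}.

    Assumption: the text between one bold descriptor and the next bold
    tag is treated as that descriptor's description.
    """
    bold = {w.lower() for w in bold_words}
    stop = {w.lower() for w in stop_words}
    indexes = {w: defaultdict(set) for w in bold}
    # Splitting on a capturing group gives [text, term, text, term, text, ...].
    splitter = re.compile(r"<b>(.*?)</b>", re.IGNORECASE)
    for doc_id, text in docs.items():
        parts = splitter.split(text)
        # parts[1::2] are the bold terms, parts[2::2] the text after each one.
        for term, following in zip(parts[1::2], parts[2::2]):
            key = term.lower()
            if key not in bold:
                continue  # e.g. the genus-species name, which is not a descriptor
            for word in re.findall(r"[a-z]+", following.lower()):
                if word not in stop:
                    indexes[key][word].add(doc_id)
    return {d: {w: sorted(ids) for w, ids in idx.items()}
            for d, idx in indexes.items()}
```

Given a single document "10002" containing "<b>leaves</b> oval and lobed", with "leaves" as a bold word and "and" as a stop word, the "leaves" index maps both "oval" and "lobed" to ["10002"].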
- 3. Incorporating the index of step 2 into a gdbm database.
Geoff's files include work from the whole semester. Particular to this project is the folder on GDBM, the hashing/database package used to hold the index produced in step two. A Perl version and a C version are available: gdbm.pl and gdbmbld.c.
The latest and greatest program for creating gdbm databases is component.pl.
This program interfaces with Eric's fnaindex.pl. It creates a separate database for each controlled-vocabulary term (i.e. bold word), such as leaves and stems. Note that component.pl (version as of 4/4/98) has been concatenated with fnaindex.pl.
Also available is flatfiler.pl, a small program that prints out the contents of a gdbm database file.
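Writing one database per descriptor and then dumping its contents can be sketched as follows. This is a Python illustration, not component.pl or flatfiler.pl. It uses the standard library's dbm.dumb as a stand-in for gdbm because it is always available; where GNU dbm is installed, Python's dbm.gnu offers a similar open() call. The file naming and the space-separated value format are assumptions.

```python
import dbm.dumb
import os

def write_databases(indexes, out_dir):
    """Write one key-value database per descriptor; return the paths."""
    os.makedirs(out_dir, exist_ok=True)
    paths = {}
    for descriptor, index in indexes.items():
        path = os.path.join(out_dir, descriptor)
        db = dbm.dumb.open(path, "n")  # "n": always start with a new, empty database
        try:
            for word, doc_ids in index.items():
                db[word] = " ".join(doc_ids)  # assumed format: space-separated ids
        finally:
            db.close()
        paths[descriptor] = path
    return paths

def dump_database(path):
    """Return (key, value) pairs sorted by key, in the spirit of flatfiler.pl."""
    db = dbm.dumb.open(path, "r")
    try:
        return sorted((k.decode(), db[k].decode()) for k in db.keys())
    finally:
        db.close()
```

Keeping each descriptor in its own database file means a search on leaves never has to read the stems data.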
- Location of files is as follows:
/usr2/people/lis429/public_html/ #Project root - Contains the perl processing files and CGI files
/usr2/people/lis429/public_html/source #Original data files (i.e. TXXXXX.html)
/usr2/people/lis429/public_html/species #HTML files for final display. Contain one species per file (e.g. 10001.HTML)
/usr2/people/lis429/public_html/fnaprocess #Intermediary files for use in indexing
/usr2/people/lis429/public_html/fnadatabases #The gdbm database files that were created by fnaindex.pl
The second phase of this project was to write a CGI program to allow searching of the previously created indexes. The CGI program for this activity is available at fnasearch.cgi.
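The lookup step behind such a search can be sketched as below. This is a Python illustration and not the logic of fnasearch.cgi, which is not shown on this page. It assumes the indexes have been loaded into nested dictionaries of the form {descriptor: {word: [species ids]}} and that several criteria are combined with AND; both are assumptions made for the example.

```python
def search(indexes, criteria):
    """Return ids of species that match every (descriptor, word) pair."""
    result = None
    for descriptor, word in criteria:
        ids = set(indexes.get(descriptor, {}).get(word.lower(), []))
        # Intersect with earlier matches so that all criteria must hold.
        result = ids if result is None else result & ids
    return sorted(result or [])
```

A search for oval leaves and erect stems would be written as search(indexes, [("leaves", "oval"), ("stems", "erect")]).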
Other Issues
There is a modified version of mgquery at:
/usr2/local/mg-1.2/src/text/mgquery
An in-class demo was run with the command:
/usr2/local/mg-1.2/src/text/mgquery -i junk2.gdbm calflora
Followed with the query:
>quercus @junk2.gdbm:oval
This query instructs mg to search the calflora database for the word 'quercus' and the junk2.gdbm database/hash for the word 'oval'.
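The structure of that query can be illustrated with a small parser. This is a Python sketch based only on the single example above; the full grammar accepted by the modified mgquery is not documented on this page, so the @database:word form and the handling of the leading prompt character are assumptions.

```python
def parse_query(query):
    """Split a query into plain terms and (database, word) lookups."""
    plain, lookups = [], []
    # A leading ">" is treated as the prompt character and dropped.
    for token in query.lstrip(">").split():
        if token.startswith("@") and ":" in token:
            database, word = token[1:].split(":", 1)
            lookups.append((database, word))
        else:
            plain.append(token)
    return plain, lookups
```

Applied to the demo query, this separates the full-text term quercus from the lookup of oval in junk2.gdbm.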
There is also a Web page hosted by the makers of mg. This page includes a manual and descriptions of the workings of mg.