Note: The links on this page should all point to source code files. The CGI and Perl scripts linked here are displayed as text and should not run.

Purpose of the System

The group designed the system outlined below to provide some form of searchability to the enormous quantities of HTML produced by the Flora of North America project. After considering several possible retrieval models, the class elected to pursue a retrieval form that closely resembled relational data access.

In the FNA data, most species descriptions consisted of one or more standard heading labels (presented in boldface) and associated data values, followed by any of several semi-standard heading labels (also presented in boldface), followed in turn by a free-text description. These species descriptions were collected into meaningful groupings and presented in a single large HTML file. An example of such a file can be seen here.

The group elected to consider these standard and semi-standard heading labels as data fields, with the associated data values as the value of that field for that record. Retrieval would allow for searching in all fields or in a single specified field.
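As a rough illustration of treating boldface heading labels as fields, the sketch below splits one species description into label/value pairs. The exact FNA markup is not shown here, so the `<b>Label:</b> value` layout, the sample text, and the helper name `extract_fields` are all assumptions made for the example, and the trailing free-text description is not handled.

```python
import re

# Assumption: each heading label is wrapped in <b>...</b> and is followed
# by its value, which runs until the next boldface label.
BOLD_LABEL = re.compile(r"<b>(.*?)</b>", re.IGNORECASE | re.DOTALL)

def extract_fields(html):
    """Return {label: value} for one species description (hypothetical helper)."""
    # Splitting on a pattern with one capture group yields:
    # [text before first label, label1, value1, label2, value2, ...]
    parts = BOLD_LABEL.split(html)
    fields = {}
    for label, value in zip(parts[1::2], parts[2::2]):
        clean_label = label.strip().rstrip(":").strip()
        # Replace any remaining tags in the value with spaces, then trim.
        fields[clean_label] = re.sub(r"<[^>]+>", " ", value).strip()
    return fields

# Hypothetical sample, not taken from the FNA data.
sample = "<b>Habitat:</b> Moist woods. <b>Flowering:</b> Spring."
fields = extract_fields(sample)
print(fields)
```

Each label then acts as a field name and the text that follows it as that field's value for the record.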

Indexing the Collection

The process of indexing the collection in order to search for a species based on the contents of a given field required several distinct operations. Each of these operations was handled by a separate component of the final system.

  1. converter.pl: This program separates the HTML source files into individual files for each species. It also strips extraneous HTML from the documents.
  2. fnaprocess.pl: This program processes the species files still further, identifying the field name with which each word in the file should be associated.
  3. fnaindex.pl: This program reads the output of fnaprocess.pl and stores the information in one of several databases.

The end result of this indexing process is a small collection of GDBM database files. A single database exists for each of the standard and semi-standard heading labels. The keys into these databases are the words found in the files, while the values in the database are lists of filenames.
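The index layout described above (one database per heading label, words as keys, lists of filenames as values) can be sketched with plain in-memory dictionaries. This is only an illustration of the data structure, not the code in fnaindex.pl; the function name `build_field_indexes`, the record format, and the sample filenames are assumptions.

```python
from collections import defaultdict

def build_field_indexes(records):
    """Build one word -> filenames index per field.

    records maps a filename to {field label: text}. The return value maps
    each field label to {word: sorted list of filenames}, mirroring the
    one-database-per-field layout with plain dicts standing in for GDBM.
    """
    indexes = defaultdict(lambda: defaultdict(set))
    for filename, fields in records.items():
        for label, text in fields.items():
            for word in text.lower().split():
                indexes[label][word].add(filename)
    return {
        label: {word: sorted(files) for word, files in words.items()}
        for label, words in indexes.items()
    }

# Hypothetical sample records, not taken from the FNA data.
records = {
    "species_001.html": {"Habitat": "moist woods", "Flowering": "spring"},
    "species_002.html": {"Habitat": "dry woods", "Flowering": "summer"},
}
indexes = build_field_indexes(records)
print(indexes["Habitat"]["woods"])
```

In an on-disk DBM file the values must be strings, so a list of filenames would typically be stored as a single delimited string.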

My contributions to this part of the project were primarily in the database code. The databasing routines found in fnaindex.pl are a modification of database routines initially created as standalone applications in gdbm.pl and gdbm.c. These routines were later adapted again for use in the retrieval portion of the system. I also contributed some minor interface code to fnaindex.pl, such as the usage information message and the command-line option handling.

Retrieving the Data

The second portion of the project involved using the indexes we built in the first part of the project to provide retrieval from a search interface. We elected to create the search interface as a CGI script, preferring that over an X-Windows, Windows, Macintosh, or even Java application. The rationale behind this choice was simply that HTML and CGI would be accessible to a larger number of people than any of the alternatives.

Retrieval based on a single field presented little difficulty. The GDBM storage code was revised to read from the databases rather than write to them. The script would open the appropriate GDBM file for the field in question, check the database for any entries matching the search term, then return a list of the document names that contained the requested term in the requested field.
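A single-field lookup of this kind might look like the following minimal sketch. It uses Python's standard `dbm.dumb` module as a stand-in for GDBM, and it assumes one database file per field, named after the field, with space-separated filenames as values. The helper names `store_postings` and `search_field` and the sample data are hypothetical.

```python
import dbm.dumb
import os
import tempfile

def store_postings(db_path, postings):
    """Write {word: [filenames]} into a DBM file (stand-in for the indexer)."""
    with dbm.dumb.open(db_path, "c") as db:
        for word, filenames in postings.items():
            # DBM values are byte strings, so the list is stored space-separated.
            db[word.encode("utf-8")] = " ".join(filenames).encode("utf-8")

def search_field(index_dir, field, term):
    """Open the database for one field and return the filenames for a term."""
    db_path = os.path.join(index_dir, field)
    with dbm.dumb.open(db_path, "r") as db:
        raw = db.get(term.lower().encode("utf-8"))
    return raw.decode("utf-8").split() if raw else []

# Hypothetical index contents in a temporary directory.
index_dir = tempfile.mkdtemp()
store_postings(
    os.path.join(index_dir, "habitat"),
    {"woods": ["species_001.html", "species_002.html"]},
)
print(search_field(index_dir, "habitat", "Woods"))
print(search_field(index_dir, "habitat", "desert"))
```

A term that is absent from the field's database simply yields an empty list.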

Searching all fields simultaneously proved to be a more substantial challenge. The group elected to provide two methods for doing this. The first method simply searched all available GDBM files in sequence and collated the results. The second method used the MG indexing and retrieval engine to provide full-document searching.
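The first method, searching every field index in sequence and collating the results, can be sketched as below. Plain dictionaries stand in for the GDBM files, and the function name `search_all_fields`, the choice to record which fields matched, and the sample data are assumptions for illustration only. The MG-based method is not shown.

```python
def search_all_fields(indexes, term):
    """Look a term up in every field index in turn and merge the hits.

    indexes maps a field name to {word: [filenames]}. The result maps each
    matching filename to the list of fields in which the term was found.
    """
    hits = {}
    for field in sorted(indexes):
        for filename in indexes[field].get(term.lower(), []):
            hits.setdefault(filename, []).append(field)
    return hits

# Hypothetical per-field indexes.
indexes = {
    "habitat": {"spring": ["a.html"]},
    "flowering": {"spring": ["a.html", "b.html"]},
}
results = search_all_fields(indexes, "Spring")
print(results)
```

Keeping track of the matching fields is one possible way to collate; a simple de-duplicated list of filenames would serve equally well.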

This decision led to a four-component design. These components were:

In this part of the project, I wrote mg_server.pl and mgquery.pl almost in their entirety. I was also responsible for the GDBM-handling routines in fnasearch.cgi and for the code used to invoke mgquery.pl from fnasearch.cgi.