Flora of North America Project

Note: The links from this page should all point to files of code. The CGI and Perl scripts linked here should not run.

Purpose of the System

The group designed the system outlined below to make the enormous quantity of HTML produced by the Flora of North America project searchable. After considering several possible retrieval models, the class elected to pursue a form of retrieval that closely resembled relational data access.

In the FNA data, most species descriptions consisted of one or more standard heading labels (presented in boldface) and associated data values, followed by any of several semi-standard heading labels (also presented in boldface), followed in turn by a free-text description. These species descriptions were collected into meaningful groupings and presented in a single large HTML file. An example of such a file can be seen here.

The group elected to consider these standard and semi-standard heading labels as data fields, with the associated data values as the value of that field for that record. Retrieval would allow for searching in all fields or in a single specified field.

Indexing the Collection

The process of indexing the collection in order to search for a species based on the contents of a given field required several distinct operations. Each of these operations was handled by a separate component of the final system.

converter.pl: This program separates the HTML source files into individual files for each species. It also strips extraneous HTML from the documents.
fnaprocess.pl: This program processes the species files still further, identifying the field name with which each word in the file should be associated.
fnaindex.pl: This program reads the output of fnaprocess.pl and stores the information in one of several databases.
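
The field-tagging step performed by fnaprocess.pl can be illustrated with a rough sketch. This is not the project's Perl code: it is written in Python, and it assumes a hypothetical input format in which each bold heading label opens a field and the text up to the next bold label is that field's value. The function name tag_words and the sample markup are invented for illustration.

```python
import re

# Hypothetical input format: each bold heading label starts a field, and the
# text up to the next bold label is that field's value.
BOLD_LABEL = re.compile(r"<b>(.*?)</b>", re.IGNORECASE | re.DOTALL)
WORD = re.compile(r"[A-Za-z0-9]+")


def tag_words(html):
    """Return (field_name, word) pairs for every word that follows a bold label."""
    pairs = []
    # Splitting on a pattern with one capturing group yields
    # [text_before, label1, text1, label2, text2, ...].
    parts = BOLD_LABEL.split(html)
    for label, text in zip(parts[1::2], parts[2::2]):
        field = label.strip().rstrip(":").lower()
        plain = re.sub(r"<[^>]+>", " ", text)  # drop any remaining tags
        for word in WORD.findall(plain):
            pairs.append((field, word.lower()))
    return pairs


sample = "<b>Family:</b> Pinaceae <b>Habitat:</b> dry <i>rocky</i> slopes"
for field, word in tag_words(sample):
    print(field, word)
```

Text that appears before the first bold label is ignored in this sketch; the real program may well have treated it differently.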

The end result of this indexing process is a small collection of GDBM database files. A single database exists for each of the standard and semi-standard heading labels. The keys into these databases are the words found in the files, while the values in the database are lists of filenames.
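
The index layout described above (one database per field, words as keys, lists of filenames as values) might look something like the following. This is a sketch in Python rather than the project's Perl, and it uses the standard library's dbm.dumb module as a stand-in for GDBM. Because a key/value store of this kind holds only flat strings, the sketch stores each filename list as a space-separated string; that encoding, the function names, and the sample filenames are all assumptions, and the encoding would break on filenames containing spaces.

```python
import dbm.dumb
import os
import tempfile


def add_posting(db, word, filename):
    """Append filename to the list stored under word, skipping duplicates."""
    key = word.encode("utf-8")
    names = db[key].decode("utf-8").split() if key in db else []
    if filename not in names:
        names.append(filename)
        db[key] = " ".join(names).encode("utf-8")


def build_indexes(tagged, directory):
    """tagged: iterable of (filename, field, word). Writes one database per field."""
    by_field = {}
    for filename, field, word in tagged:
        by_field.setdefault(field, []).append((word, filename))
    for field, postings in by_field.items():
        db = dbm.dumb.open(os.path.join(directory, field), "c")
        try:
            for word, filename in postings:
                add_posting(db, word, filename)
        finally:
            db.close()


def lookup(directory, field, word):
    """Return the filenames recorded for word in the given field's database."""
    db = dbm.dumb.open(os.path.join(directory, field), "r")
    try:
        key = word.encode("utf-8")
        return db[key].decode("utf-8").split() if key in db else []
    finally:
        db.close()


tagged = [
    ("sp001.html", "habitat", "rocky"),
    ("sp002.html", "habitat", "rocky"),
    ("sp002.html", "family", "pinaceae"),
]
with tempfile.TemporaryDirectory() as directory:
    build_indexes(tagged, directory)
    print(lookup(directory, "habitat", "rocky"))
```

Keeping a separate database per field means a single-field search only has to open one file, at the cost of opening every file when all fields are searched.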

My contributions to this part of the project were primarily in the database code. The databasing routines found in fnaindex.pl are a modification of database routines initially created as standalone applications in gdbm.pl and gdbm.c. These routines were later adapted again for use in the retrieval portion of the system. I also contributed some minor interface code to fnaindex.pl, such as the usage information message and the command-line option handling.

Retrieving the Data

The second portion of the project involved using the indexes we built in the first part of the project to provide retrieval from a search interface. We elected to create the search interface as a CGI script, preferring that over an X-Windows, Windows, Macintosh, or even Java application. The rationale behind this choice was simply that HTML and CGI would be accessible to a larger number of people than any of the alternatives.

Retrieval based on a single field presented little difficulty. The GDBM storage code was revised to read from the databases rather than write to them. The script would open the appropriate GDBM file for the field in question, check the database for any entries matching the search term, then return a list of the names of documents that contained the requested term in the requested field.

Searching all fields simultaneously proved to be a more substantial challenge. The group elected to provide two methods for doing this. The first method simply searched all available GDBM files in sequence and collated the results. The second method used the MG indexing and retrieval engine to provide full-document searching.
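
The first method described above, checking each field's index in turn and collating the matches, can be sketched as follows. This is an illustration in Python, not the project's Perl code: plain dictionaries stand in for the GDBM files, and the function names and sample data are invented. How the real script collated its results is not documented here, so recording the matching fields per file is an assumption.

```python
def search_field(indexes, field, term):
    """Return the set of filenames whose given field contains term."""
    return set(indexes.get(field, {}).get(term.lower(), []))


def search_all_fields(indexes, term):
    """Check every field index in turn and collate the matches.

    Returns {filename: list of fields in which the term was found}.
    """
    hits = {}
    for field in sorted(indexes):
        for filename in search_field(indexes, field, term):
            hits.setdefault(filename, []).append(field)
    return hits


# Dictionaries standing in for the per-field database files.
indexes = {
    "family": {"pinaceae": ["sp002.html"]},
    "habitat": {"rocky": ["sp001.html", "sp002.html"], "pinaceae": ["sp003.html"]},
}
print(search_all_fields(indexes, "pinaceae"))
```

The cost of this approach grows with the number of field databases, since every one of them is consulted for every query.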

This decision led to a four-component design. These components were:

fnasearch.cgi: This program contains the search form, the code for managing the search or searches, the code for handling the boolean set logic, and the code for presenting the results. This version of fnasearch.cgi actually runs so that you can see the system in action. The MG query support may not be active at present, since it depends on the mg_server.pl bootstrap having been started.
mgquery.pl: This program connects to a Unix domain socket and sends a request to the process on the other end of that socket.
mg_server.pl: This program acts as a bootstrap to the actual MG system. It opens a Unix domain socket and processes queries one at a time. When each query completes, this program returns the results to the requesting program, then processes the next request.
mg: This is a "third-party" program. MG is a complete indexing and retrieval system for free-text searching, and provides substantially better response times than searching all our GDBM databases sequentially.
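
The relationship between mgquery.pl and mg_server.pl can be sketched as a small client and server talking over a Unix domain socket. This is a Python illustration for a Unix-like system, not the project's Perl code. The one-line request and one-line reply protocol, the socket filename, and the run_query placeholder are all assumptions; the real server handed each query to MG.

```python
import os
import socket
import tempfile
import threading


def run_query(query):
    """Placeholder for the real search engine call (hypothetical)."""
    return "results for: " + query


def serve(server_sock, max_requests):
    """Accept connections one at a time, answering each before taking the next."""
    for _ in range(max_requests):
        conn, _addr = server_sock.accept()
        with conn, conn.makefile("r", encoding="utf-8") as reader:
            query = reader.readline().strip()
            conn.sendall((run_query(query) + "\n").encode("utf-8"))


def send_query(path, query):
    """Client side: connect, send one line, read one line back."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(path)
        sock.sendall((query + "\n").encode("utf-8"))
        with sock.makefile("r", encoding="utf-8") as reader:
            return reader.readline().strip()


with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "mg.sock")
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as server_sock:
        server_sock.bind(path)
        server_sock.listen(5)
        # The server runs in a thread here only so the demo fits in one process.
        worker = threading.Thread(target=serve, args=(server_sock, 2))
        worker.start()
        print(send_query(path, "pinus"))
        print(send_query(path, "rocky slopes"))
        worker.join()
```

A long-running server of this kind lets a short-lived CGI script avoid starting the search engine on every request, which is presumably why the bootstrap was split out from the query client.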

In this part of the project, I was responsible for almost all of mg_server.pl and mgquery.pl, for the GDBM-handling routines in fnasearch.cgi, and for the code used to invoke mgquery.pl from fnasearch.cgi.