Intent to Develop Federated Metadata Search Capabilities Between PICES Member Country Data Centers

Introduction

The North Pacific Ecosystem Metadatabase (NPEM), a project co-sponsored by PICES, wishes to extend the metadata searched by its users to include metadata from PICES member country Ocean Data Centers. To implement this feature, called a “federated search”, staff of the NPEM will work closely with international partners to construct and coordinate the required data translation dictionaries as well as to install the necessary technical infrastructure to allow remote computers to communicate.

The PICES Technical Committee on Data Exchange (TCODE) has offered to help NPEM find partners.  At the Twelfth Annual PICES meeting in Seoul (October 2004), NPEM established informal communications with the director of the Korea Oceanographic Data Center.  In time, NPEM desires to expand communications to all PICES member country Ocean Data Centers.

The purpose of this document is to inform national TCODE representatives of this opportunity and to provide potential partners with basic knowledge about federating.  The technology that NPEM suggests for federated searches is called Z39.50.  It is proven in wide and varied applications.  It is simple to acquire, install and configure.

What is Z39.50?

Z39.50 is a standard protocol (ANSI/NISO Z39.50, also published as ISO 23950) that specifies data structures and interchange rules. The protocol permits a client computer to search databases on a server computer and retrieve the records that the search identifies. Implementing the protocol requires installing Z39.50 software, which is freely available as open source, on the client and server computers. Users who log on to the client computer then have transparent access to its data and to data on any server computers with which it exchanges information.

What does it do?

Z39.50 enables communication between databases on different computer systems, for example between Ocean Data Centers (ODCs). A PC user in China could access the Japan Oceanographic Data Center (JODC) through the WWW, submit an on-line search for ‘chum salmon’, and specify that the search examine not only the holdings of the JODC, but also those of any other ODCs that share the protocol. Search results are returned to the user in China irrespective of the location of the information discovered.

Real examples of this type of distributed search include NOAAServer (http://www.esdim.noaa.gov/NOAAServer/), and Alaska's Cooperatively Implemented Information Management System (CIIMMS; http://info.dec.state.ak.us/ciimms/).  Both the NOAAServer and CIIMMS make use of Z39.50 in order to create a single, virtual resource. In both cases, the data themselves remain distributed and under the control of those with the knowledge and expertise to most effectively maintain, update, and add to the resources already present. By using Z39.50, these obvious benefits of distributed data management are combined with the equally valuable benefits of unified data access, which allows the user to submit a single search across multiple resources, regardless of their physical proximity to one another or to the user. The technology is effectively hidden from the user; despite what the informational screens on the web sites might say, as far as users are concerned, they're simply searching one great big database.
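The idea of one query fanned out to several data centers can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the DataCenter class, the federated_search function, and the record titles are all hypothetical, the "servers" are in-memory lists, and no Z39.50 software or network traffic is involved.

```python
from dataclasses import dataclass, field


@dataclass
class DataCenter:
    """Hypothetical stand-in for one remote server (in-memory, no network)."""
    name: str
    records: list = field(default_factory=list)

    def search(self, term):
        # Case-insensitive substring match against each record title.
        return [r for r in self.records if term.lower() in r.lower()]


def federated_search(term, centers):
    """Send the same query to every center and tag each hit with its source."""
    hits = []
    for center in centers:
        for record in center.search(term):
            hits.append((center.name, record))
    return hits


# Invented holdings, for illustration only.
centers = [
    DataCenter("JODC", ["Chum salmon catch statistics", "Sea surface temperature"]),
    DataCenter("KODC", ["Chum salmon returns survey"]),
    DataCenter("NPEM", ["Zooplankton biomass"]),
]

results = federated_search("chum salmon", centers)
for source, title in results:
    print(f"{source}: {title}")
```

Each center keeps its own records; the user sees one merged list that still shows where each record came from.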

How does it do it?

The protocol specifies Facilities, Services, Attributes, Syntaxes and Profiles. In practice, a session proceeds in three steps: initialization, search, and retrieval of records. Simplifying hugely, initialization might be a greeting from the client computer ("Hello, do you speak English?") and a related response from the server ("Hello. Yes, I do. Let's talk"). Without this positive two-way dialogue, the session cannot proceed.

A search request is then transmitted from the client ("OK, can I have everything you have on ‘chum salmon’?"), and the server responds ("I've got 25 records matching your request, and here are the first five. As you didn't specify anything else, I've sent them to you in MARC format, so I hope that's OK.").

Finally, the client asks for the records it wants ("25, eh? Can I have the first ten, please? Oh, and I don't really like MARC, can you send me unstructured text?"), and the server transmits the records themselves.
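The three-step dialogue described above can be mimicked with a toy, in-memory server. This is a sketch for orientation only, not the real protocol: the ToyServer class and its method names are hypothetical, nothing is encoded or sent over a network, and the search step here reports only the hit count rather than also returning the first few records.

```python
class ToyServer:
    """Hypothetical in-memory stand-in for a Z39.50 server."""

    def __init__(self, records):
        self.records = records
        self.session_open = False
        self.result_set = []

    def init(self):
        # Initialization: both sides agree to talk before anything else.
        self.session_open = True
        return True

    def search(self, term):
        # Search: build a result set on the server and report its size.
        if not self.session_open:
            raise RuntimeError("search attempted before initialization")
        self.result_set = [r for r in self.records if term.lower() in r.lower()]
        return len(self.result_set)

    def present(self, start, count):
        # Retrieval: return `count` records from 1-based position `start`.
        if not self.session_open:
            raise RuntimeError("retrieval attempted before initialization")
        return self.result_set[start - 1:start - 1 + count]


# Invented records: 25 that match the query and one that does not.
records = [f"chum salmon record {i}" for i in range(1, 26)] + ["herring record"]

server = ToyServer(records)
server.init()
hit_count = server.search("chum salmon")
first_ten = server.present(1, 10)
print(hit_count, len(first_ten))  # prints: 25 10
```

The point of the separation is that the server holds the result set, so the client can learn how many records matched before deciding how many to fetch.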

An example

To see how a federated search works, try the CIIMMS website, which implements the Z39.50 protocol. You will search several different databases for the keyword “salmon”. The spatial domain of your search is all of Alaska.

Start the procedure by connecting to the CIIMMS website (http://info.dec.state.ak.us/ciimms/). In the left frame, click on “Search” to go to the search page, or click “Advanced Search” on the main page. In the “Search for” box, type salmon. Next to the “Search for” box is the “In:” box. Pull down its menu and select “Subject/Keywords”. Select the geographic limits of your search by pulling down the menu in the box on the right side of the page under the compass whose default content reads “Select Co-ords using Pre-defined Areas”. Select “Alaska Statewide”. So far, you have built a search that will look for the keyword salmon occurring in records from all over Alaska.

Now it is time to declare what databases you will search. This is the federated feature. At the bottom of the page is a table containing two columns. The column on the left is labeled “Databases searched”, the column on the right is labeled “Database select buttons”. There are nine databases from which to search: two CIIMMS databases, two public libraries, three geospatial data clearinghouses, and two web-page collections. Boxes may be checked manually or via buttons provided in the right column for easy selection. For this search, you will select the “CIIMMS Databases” button in the right column.

Finally, click the “Search” button near the middle of the page to initiate the search.

During the search, a status page provides a table listing the databases searched, the status of each search (successful or failed), and the result count. When the search is completed, the database name can be clicked to view any records matching the search criteria. Information and format will vary with database type. For instance, the geospatial data clearinghouse databases and the in-house databases will provide metadata in the FGDC format, while the library databases will provide publication information in the usual library reference format.
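A status table of this kind could be assembled roughly as follows. This is a sketch under assumptions, not a description of how CIIMMS is built: the run_search function, the database names, and the canned results are all hypothetical, and each "database" is just a Python function.

```python
def run_search(term, databases):
    """Query each database and collect (name, status, count) rows."""
    rows = []
    for name, search_fn in databases.items():
        try:
            count = len(search_fn(term))
            rows.append((name, "successful", count))
        except Exception:
            # One unreachable database should not abort the whole search.
            rows.append((name, "failed", 0))
    return rows


def unreachable(term):
    # Simulates a server that cannot be contacted.
    raise ConnectionError("server did not respond")


# Invented databases that return canned results regardless of the term.
databases = {
    "Library A": lambda term: ["Salmon of Alaska", "Salmon fisheries report"],
    "Clearinghouse B": lambda term: [],
    "Web harvest C": unreachable,
}

rows = run_search("salmon", databases)
for name, status, count in rows:
    print(f"{name:<16} {status:<11} {count}")
```

Recording a failure as a row, rather than stopping, is what lets the user still see results from the databases that did respond.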

In your search for salmon in the CIIMMS databases, you probably found no matching records.  Now repeat the search, but this time, select the “Clear/Select All” button to enable searching against all the databases.  The results of this search should show over 3000 matches in the ARLIS and Web harvest databases.

Regardless of the type of information and the format given, the CIIMMS website demonstrates the benefits that the Z39.50 protocol provides. No data or databases need to be relocated or merged. Compatibility between database formats is not an issue. The organization that is responsible for each database continues to manage the database and its data format the way they were designed. The main cost comes from the resources needed to implement the Z39.50 protocol. Once that is in place, ongoing costs should be small.

Conclusion

Z39.50 has been used extensively for long enough to demonstrate its robustness.  As new technologies such as XML and RDF begin to fulfill aspects of the information discovery and retrieval process, work is underway to capitalize upon them, and to tie such technologies more closely to Z39.50. It appears for the moment that Z39.50 is the one effective means of enabling simultaneous queries upon distributed heterogeneous databases, and this remains something that the broader user community wants to be able to do.

NPEM hopes that you will consider the benefits of joining a North Pacific marine data federation. Please pass this information to the appropriate official at your national Ocean Data Center, and let NPEM know who that official is.