
HerbalEGram: Volume 6, Number 3, March 2009

The Digital Herbarium: Texas Research Centers Receive Large Grant to Make Botanical Specimen Label Data More Useful


Researchers at the Herbarium of the Botanical Research Institute of Texas (BRIT) face a bottleneck common to natural history museums: they have to physically locate and retrieve specimens for study, and often must ship them halfway around the world to other scientists. BRIT would prefer to provide its data in a searchable database via the Internet, but most of this information has never been transferred to a computer. Why? Because entering all the data into a database by hand takes a long time, and no form of automated optical character recognition (OCR) technology can convert scanned images of text into machine-readable text without major errors, especially when some of the text looks like John Hancock’s signature.

Herbarium sheets (dried, pressed scientific plant specimens) carry labels with essential information about a plant, such as the species name, collection date, collector, location, and other important ecological data, so there is little room for error in any attempt to transcribe this information. More recent specimen labels, generated with computers, yield complete, organized, and accessible information when digitized, but some specimens are almost 250 years old. One BRIT specimen label from 1858 consists of faded calligraphy, scribbles, and curls that can just barely be deciphered by the human eye, and certainly not by computer software.

There is a real need to preserve and catalog the information on these labels. In some instances, the labels are attached to specimens of plants that are endangered or extinct. “Older plant specimens may represent a final record of existence from habitats that are no longer intact, and may be the most valuable to researchers of global climate change, since dates of a plant’s flowering or fruiting events are recorded on the specimen,” said Amanda Neill, Director of the BRIT Herbarium, in a BRIT press release.1 “Data from these labels can also provide the most information about changes in the earth’s vegetation during the last 250 years, including the movement of invasive species and the loss of endangered species over time.”

Another reason to digitize these labels is to open the collection to a wider audience. BRIT wants to put all the images and information digitized from its specimens into an online searchable database accessible to the public, to promote use and appreciation of the resources provided by herbaria and wider interest in the plants all around us. Though the herbarium’s current users include everyone from professional botanists to college students and 4-H Clubs, the wider (and instant) availability of a “virtual” herbarium could add millions of users. Digitizing labels could also reduce handling of the original specimens, according to Jason Best, information technology (IT) manager for BRIT (e-mail, February 15, 2009).

“A high-quality digital facsimile is just as effective for research as the actual specimen in many cases, and less physical handling means less damage to these fragile and scientifically valuable items,” said Best.1

The specimen data bottleneck is a serious issue for all herbaria, and BRIT has a great deal of data to share: it is approximately the 10th largest herbarium in the United States, containing more than a million specimens. A time- and cost-effective solution had to be found. Using computers alone to transcribe the indispensable information on the labels produces errors more than half of the time. According to BRIT, a random survey of its herbarium specimens found that only 41% of labels could be transcribed using OCR technology without errors, while 59% were filled with errors when digitized with computers alone.1 Recognizing the need for synergy between people and computers to transcribe this data, BRIT, an independent, nonprofit Fort Worth research institution, formed a partnership with the Texas Center for Digital Knowledge (TxCDK), a digital information research center established in 2001 by the University of North Texas (UNT) in Denton.

“Digital information is truly the currency of the 21st Century University (as well as 21st Century Herbaria), and TxCDK emerged from recognizing that this form of information needs to be harnessed in support of knowledge creation,” said William Moen, TxCDK director (e-mail, February 15, 2009).

Together, Moen, Neill, and Best wrote a proposal outlining their ideas and submitted it to the Institute of Museum and Library Services (IMLS) National Leadership Grant Program. (The IMLS supports the nation’s 122,000 libraries and 17,500 museums.) The proposal, entitled “High-Throughput Workflow for Computer-Assisted Human Parsing of Biological Specimen Label Data,” outlined their plan to do the following: find and test appropriate processes to transfer specimen label data, identify what steps people and computers can take to correct the digitally transferred data, develop processes to increase the effectiveness of both approaches, and determine the quality of the metadata created by the human and machine cooperation. The project was awarded a 2-year grant of $738,075 from IMLS and got under way in December 2008.
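
The proposal does not spell out how work would be divided between people and computers. As a purely illustrative sketch, and not a description of the project's actual system, one simple way to picture the idea is a triage step that accepts OCR output the software is confident about and queues the rest for a human transcriber. The record fields, the confidence score, the 0.9 threshold, and the sample labels below are all assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class OcrLabel:
    """One OCR result for a specimen label (hypothetical structure)."""
    specimen_id: str   # made-up identifier, not a real BRIT accession number
    text: str          # text produced by the OCR step
    confidence: float  # assumed OCR confidence score between 0 and 1


def triage(labels, threshold=0.9):
    """Split OCR results into auto-accepted labels and labels needing human review.

    The threshold is an arbitrary illustrative value, not a project setting.
    """
    accepted, needs_review = [], []
    for label in labels:
        if label.confidence >= threshold:
            accepted.append(label)
        else:
            needs_review.append(label)
    return accepted, needs_review


# Invented sample data: a clean printed label and a hard-to-read 1858 label.
labels = [
    OcrLabel("A-001", "Quercus alba, 12 May 1998", 0.97),
    OcrLabel("A-002", "Qu3rc?s st3ll@ta, 1858", 0.42),
]

accepted, needs_review = triage(labels)
print("Accepted automatically:", [l.specimen_id for l in accepted])
print("Sent to a human transcriber:", [l.specimen_id for l in needs_review])
```

In a scheme like this, raising the threshold sends more labels to people and lowers the risk of accepting bad transcriptions, while lowering it saves human effort at the cost of more errors.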

BRIT and TxCDK’s research project, now called Apiary in a nod to the way many bees work together to build a hive, will yield a new workflow model for effective and efficient label data transformation, correction, and enhancement that can be replicated, adapted, and transferred to other herbaria and natural history collections. The project includes support for 4 UNT graduate research assistants who will work with the principal investigators and programmers on the problem. The progress and results of the Apiary project will be made available on its website: http://www.apiaryproject.org/.

IMLS is known for funding projects that have a strong likelihood of developing models, solutions, and approaches adoptable by others. BRIT and TxCDK will share the solutions they develop: “When we complete the research project, we will make our final, optimized system available for others to use or improve upon,” said Best.

More information about BRIT and TxCDK is available at their websites: www.brit.org and www.txcdk.org.

—Kelly Saxton Lindner


References

1. Botanical Research Institute of Texas and University of North Texas receive $738,075 National Leadership Grant to investigate digitization of herbarium collections [press release]. Fort Worth, TX: Botanical Research Institute of Texas. December 3, 2008.