Kurator Package

June 22nd, 2012

Kurator: the Kepler data curation package

With the digitization of scientific data and the development of data standards in areas such as biodiversity, ecology, and the life sciences, large amounts of scientific data have become accessible electronically. How to curate and integrate data from various sources, each with its own quality problems, through automatic or human-involved curation has therefore become a critical issue.

The Kepler/Curation (abbreviated Kurator) package aims to help collection managers, researchers, and others build curation workflows as executable pipelines. The package consists of a number of actors and sample workflows. Diverse services and tools can be conveniently integrated into a workflow through the actors, supporting data curation along several dimensions: visualization services (e.g., Google Maps) help spot quality problems in input datasets; domain-specific services (e.g., GeoLocate, IPNI, GNI) can be used to identify and correct georeferencing errors, scientific name errors, and similar problems; common curation operations such as duplicate identification and consensus are available; and utility services such as authentication/authorization (e.g., OAuth), data sharing (e.g., Google Spreadsheet), and communication (e.g., e-mail and SMS text messaging) are provided as well. The data dependency information of a curation workflow developed with this package is fully recorded. After execution, users can traverse and query the resulting data provenance graph in the Kepler Provenance Browser, so that data sources can be tracked and the credibility of the assembled curated data can be assessed. Figure 1 shows a demo curation workflow, built from actors in the Kurator package, that performs data quality control on an input dataset.


Figure 1. Kepler Curation Workflow

Currently, the Kurator package is developed mainly around use cases from biodiversity research. As our understanding of data quality control deepens and we engage with data curation use cases from more areas, the package will continue to be enriched with additional actors and sample workflows.

The Kurator package is built on the Comad and Kepler/Google (abbreviated Koogle) suites of the Kepler workflow system. Please refer to these two suites, and to the Kepler workflow system and the Ptolemy simulation system, for more information.

The Kurator package can be checked out from the Kepler repository by following the instructions on the Kepler website. After checkout, the “kuration” directory, which contains all the source code, example workflows, and the user manual, is located under the directory where you checked out Kepler. In addition, the first version of the package has been released: users who have installed Kepler can get the kuration-1.0.0 package through the Kepler module manager. In that case, the kuration-1.0.0 directory, which contains all the sample workflows and the user manual, is located under the $User_Home/KeplerData/kepler.modules directory. The demo workflows are also copied to $User_Home/KeplerData/workflows/module/kuration-1.0.0.
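
To check whether the module manager installation described above put the files where expected, a small script along the following lines may help. This is only a sketch: it assumes that $User_Home refers to the current user's home directory, and the function name kuration_paths is hypothetical and not part of Kepler or the Kurator package.

```python
from pathlib import Path


def kuration_paths(home=None, version="1.0.0"):
    """Build the two installation paths named in the text above.

    Assumption: $User_Home is the user's home directory. Pass `home`
    explicitly if KeplerData lives somewhere else on your system.
    """
    base = Path(home) if home is not None else Path.home()
    data = base / "KeplerData"
    name = f"kuration-{version}"
    return {
        # Sample workflows and user manual
        "module": data / "kepler.modules" / name,
        # Copies of the demo workflows
        "workflows": data / "workflows" / "module" / name,
    }


if __name__ == "__main__":
    for label, path in kuration_paths().items():
        status = "found" if path.is_dir() else "missing"
        print(f"{label}: {path} ({status})")
```

If either directory is reported as missing, the package may not have been installed through the module manager yet, or KeplerData may be located elsewhere.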

Videos of Demo Workflows Assembled from Actors in Kepler-G Pack