COMAD system framework

June 23rd, 2012

Mission

Develop and maintain the Collection-Oriented Modeling and Design (COMAD) framework for building scientific workflows, making it easy to manage and operate on collections of data efficiently. By flattening structured data into a stream and supporting data location and binding through XPath-like expressions, the COMAD framework provides more advanced workflow management functions.

Introduction

Scientific data usually has a hierarchical structure, consisting of collections or nested collections of data. For example, meteorological data collected from multiple stations is organized into multiple levels of collections based on geographical location, including station, country, or state. Different kinds of analysis are performed by summarizing or comparing data at different levels to study the way in which multiple environmental factors, including climate variability, affect major ecosystems.

In the workflow of such a scientific system, besides the necessary data analysis logic, many shim units must be included to assemble and disassemble the data collections, convert the data into the format expected by each analysis step, and keep the association between input and generated data. Such non-functional yet necessary data processing makes the workflow hard both to model and to understand, especially for scientists, since the main analysis pipeline is buried inside a large number of non-functional processing units. Moreover, such a workflow is also difficult to maintain and evolve.

COMAD (Collection-Oriented Modeling and Design) is proposed to support the design, modeling, and execution of scientific applications with collection-oriented data. In COMAD, such data is flattened from a tree structure into a stream without any loss of information. The stream consists of multiple tokens: besides the tokens representing concrete data values, there are special delimiter tokens denoting the start and end of each collection. Nested collections are allowed, so the original tree structure of the data is preserved. Each processing unit in the workflow is called an actor, and the data stream flows through the whole workflow from one actor to another. Like workers beside an assembly line, the actors in a COMAD workflow pick up the data they are interested in from the stream, process it, and put the output back into the stream. A COMAD path expression, with syntax similar to XPath, is used to declare where data is picked up from and written back to the stream.
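
To make the flattening concrete, here is a minimal sketch in Java (hypothetical types, not the actual COMAD/Kepler API) of how a small nested collection becomes a linear token stream whose delimiter tokens preserve the original nesting:

    import java.util.List;

    interface Token {}
    record OpenDelimiter(String collection) implements Token {}       // start of a collection
    record CloseDelimiter(String collection) implements Token {}      // end of a collection
    record DataToken(String label, Object value) implements Token {}  // a concrete data value

    public class FlattenExample {
        public static void main(String[] args) {
            // The tree  stations / station [ humidity=0.62, humidity=0.58 ]
            // becomes this linear stream; the delimiters keep the nesting.
            List<Token> stream = List.of(
                new OpenDelimiter("stations"),
                new OpenDelimiter("station"),
                new DataToken("humidity", 0.62),
                new DataToken("humidity", 0.58),
                new CloseDelimiter("station"),
                new CloseDelimiter("stations"));
            stream.forEach(System.out::println);
        }
    }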

Features

COMAD brings the following benefits to workflow modeling and execution.

A clear view of the original scientific data analysis process:

An actor in COMAD no longer needs to deal with the assembly and disassembly of data collections, since all collections are already disassembled when the data is flattened into the COMAD token stream. Once the actor declares where to pick up and output data in the stream, the system automatically grabs the data from the stream, feeds it into the actor after the necessary type checking and conversion, and finally writes the data output by the actor to the target location in the stream. Such easy handling of structured data and automatic data massaging removes a large number of shim units from the workflow and makes it very clean. A COMAD workflow is usually linear, presenting a very clear view of the original scientific data analysis process, and is easy to model and understand, especially for scientists.
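
As an illustration of how such declarative pickup can work, here is a minimal sketch (hypothetical code, not the actual COMAD implementation) that tracks the currently open collections and matches a data token against an XPath-like binding:

    import java.util.ArrayDeque;
    import java.util.Deque;

    public class PathMatcher {
        private final Deque<String> openCollections = new ArrayDeque<>();

        void enter(String collection) { openCollections.addLast(collection); }  // on an opening delimiter
        void leave()                  { openCollections.removeLast(); }         // on a closing delimiter

        // Does a data token labelled 'label' at the current stream position
        // match a binding such as "station/humidity"?
        boolean matches(String binding, String label) {
            String current = String.join("/", openCollections) + "/" + label;
            return current.endsWith("/" + binding) || current.equals(binding);
        }

        public static void main(String[] args) {
            PathMatcher m = new PathMatcher();
            m.enter("stations");
            m.enter("station");
            System.out.println(m.matches("station/humidity", "humidity"));  // true
        }
    }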

Easy actor development:

A COMAD actor developer can focus on the data processing logic while leaving all the other work, such as input data grabbing, output data writing, buffering, validation, and conversion, to the system. All the data processing logic is implemented in one method, which the system invokes with the prepared input data. The result of the data processing is returned, and the system handles writing it back to the stream. This simplicity also benefits the reliability of actor development.
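
A minimal sketch (hypothetical API, not the actual COMAD interface) of what actor development reduces to: the developer writes a single processing method, and the framework supplies the prepared inputs and writes the returned result back into the stream:

    public class AverageActor {
        // The single method the framework would invoke once the bound inputs
        // are ready; the return value is written back at the declared location.
        public double process(double[] values) {
            double sum = 0.0;
            for (double v : values) sum += v;
            return values.length == 0 ? 0.0 : sum / values.length;
        }

        public static void main(String[] args) {
            System.out.println(new AverageActor().process(new double[] {55, 60, 58}));
        }
    }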

High reusability and adaptability of actors:

A COMAD actor is very easy to reuse when assembling different workflows. During the actor development phase, a signature is defined to declare what type is expected for each input and output data port. Then, during the workflow assembly phase, the concrete data for each port is specified through data binding using COMAD path expressions. The signature makes it easy to know how to use the actor correctly, and the data processing logic inside the actor deals only with structure-independent data. By changing the data binding to adapt to a data stream with a different structure, the actor can easily be reused in a different workflow. The high reusability and adaptability of COMAD actors meet the evolving requirements of exploratory scientific workflows; for scientific applications, change is the norm.
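
For example (hypothetical binding syntax, reusing the AverageActor sketch above), the same actor could be bound to hourly readings in one workflow and to per-window data in another, without touching its code:

    import java.util.Map;

    public class BindingExample {
        public static void main(String[] args) {
            // Workflow A: average all humidity readings of a station
            Map<String, String> workflowA = Map.of("values", "station/humidity");
            // Workflow B: the same port re-bound to per-window readings
            Map<String, String> workflowB = Map.of("values", "station/window/humidity");
            System.out.println("A: " + workflowA + "  B: " + workflowB);
        }
    }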

Improved performance due to the streaming mode:

In a COMAD workflow, data is restructured into a stream flowing through each actor. Workflow performance is therefore improved, since multiple actors can work on different parts of the data stream at the same time.
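
A minimal sketch (plain Java threads, not COMAD code) of why streaming helps: two pipeline stages connected by a bounded queue run concurrently, so the downstream stage starts consuming tokens while the upstream one is still producing:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class PipelineExample {
        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<Double> queue = new ArrayBlockingQueue<>(16);
            final double END = Double.NaN;  // sentinel marking end of stream

            Thread producer = new Thread(() -> {
                try {
                    for (int i = 0; i < 100; i++) queue.put((double) i);  // stage 1: emit readings
                    queue.put(END);
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });

            Thread consumer = new Thread(() -> {
                try {
                    double sum = 0;
                    for (double v = queue.take(); !Double.isNaN(v); v = queue.take())
                        sum += v;  // stage 2: consume while stage 1 still runs
                    System.out.println("sum = " + sum);
                } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
            });

            producer.start();
            consumer.start();
            producer.join();
            consumer.join();
        }
    }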

Great support for provenance:

To trace where each data item comes from, provenance information, including each insertion and deletion operation, is recorded in the data stream. (In COMAD, once a data item is created, it cannot be modified, so there are only insertion and deletion operations, never modification.) Using the Provenance Browser tool, the provenance information can easily be browsed and queried.
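
A minimal sketch (hypothetical types, not the actual provenance store) of how such an immutable event log supports lineage queries:

    import java.util.ArrayList;
    import java.util.List;

    public class ProvenanceLog {
        public enum Op { INSERT, DELETE }
        public record Event(Op op, String tokenId, String actor) {}

        private final List<Event> events = new ArrayList<>();

        public void record(Op op, String tokenId, String actor) {
            events.add(new Event(op, tokenId, actor));
        }

        // Walk the log backwards to find which actor inserted a given token.
        public String producerOf(String tokenId) {
            for (int i = events.size() - 1; i >= 0; i--) {
                Event e = events.get(i);
                if (e.op() == Op.INSERT && e.tokenId().equals(tokenId)) return e.actor();
            }
            return null;
        }

        public static void main(String[] args) {
            ProvenanceLog log = new ProvenanceLog();
            log.record(Op.INSERT, "min@window1", "Chunker");
            log.record(Op.INSERT, "gdd@window1", "GddCalculator");
            System.out.println(log.producerOf("gdd@window1"));  // GddCalculator
        }
    }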

Potentially powerful workflow management abilities:

By analyzing the input data structure and each actor's COMAD path expressions, which declare how the input and output data bind to the data stream, it is possible to know how the data stream is changed by each actor, how the actors depend on each other, and when an actor's input data is ready so that the actor can fire. Such information is very valuable for workflow system management. It can be used to check the correctness of the workflow configuration before execution; for example, it is easy to find out whether an actor will never fire due to a wrong data binding expression. It is also possible to change the configuration of the workflow during execution for important management purposes such as fault tolerance or load balancing.
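
A minimal sketch (hypothetical model and paths) of the kind of static check described above: an actor can never fire if neither the source data nor any upstream actor writes to the path its input is bound to:

    import java.util.HashSet;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Set;

    public class BindingChecker {
        public static void main(String[] args) {
            // Each actor's input binding, in pipeline order.
            Map<String, String> reads = new LinkedHashMap<>();
            reads.put("Chunker", "station/humidity");
            reads.put("GddCalculator", "window/min");
            reads.put("RPlotter", "window/avg");  // misbound: nothing produces this

            Map<String, String> writes = Map.of(
                "Chunker", "window/min",
                "GddCalculator", "window/gdd",
                "RPlotter", "window/plot");

            // Paths available in the stream, starting from the source data.
            Set<String> produced = new HashSet<>(Set.of("station/humidity"));
            for (Map.Entry<String, String> e : reads.entrySet()) {
                if (!produced.contains(e.getValue()))
                    System.out.println("warning: " + e.getKey() + " will never fire");
                produced.add(writes.get(e.getKey()));  // its output becomes available
            }
        }
    }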

Example

The Comet workflow analyzes meteorological data from the Comet project. The main analysis steps are:

  • Collect ten days of hourly humidity data from the weather stations
  • Aggregate the hourly data over a group of time windows and calculate basic statistics, such as min and max, for the data aggregated in each window (see the sketch after this list)
  • Calculate growing degree day statistics for the data in each window
  • Draw a time-based trend graph for analysis and comparison
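
A minimal sketch (assumed window semantics and made-up sample values, not the actual Comet code) of the windowed aggregation in the second step, with possibly overlapping windows and per-window min/max:

    public class WindowStats {
        public static void main(String[] args) {
            double[] hourly = {55, 60, 58, 62, 57, 63, 61, 59};  // sample humidity readings

            int windowLength = 4;  // hours covered by each window
            int slide = 2;         // hours between the starts of adjacent windows

            for (int start = 0; start + windowLength <= hourly.length; start += slide) {
                double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
                for (int i = start; i < start + windowLength; i++) {
                    min = Math.min(min, hourly[i]);
                    max = Math.max(max, hourly[i]);
                }
                System.out.printf("window at hour %d: min=%.1f max=%.1f%n", start, min, max);
            }
        }
    }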

Basically, the workflow is composed of six actors:

  • CollectionComposer: converts external input data from one station into a COMAD data stream.
  • WindowsGenerator: generates a group of time windows used to aggregate data, according to specific parameters: the number of generated windows, the start time of the first window, the interval of each window, and the slide time between adjacent windows (see the sketch after this list).
  • Chunker: aggregates the data based on the time windows and computes basic statistics for the group of data in each window.
  • GddCalculator: computes gdd (growing degree days) from the min and max statistics of each window.
  • RPlotter: draws the time-based trend graphs of the average and gdd for each window using RExpression.
  • TraceWriter: writes the whole data collection and all related provenance information into a file for future provenance analysis.
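
A minimal sketch (hypothetical code, inferred from the parameters listed above) of what WindowsGenerator computes: the i-th window starts at firstStart + i * slide and covers an interval of the given length:

    public class WindowsGeneratorSketch {
        record Window(long start, long end) {}  // covers [start, end)

        static Window[] generate(int count, long firstStart, long interval, long slide) {
            Window[] windows = new Window[count];
            for (int i = 0; i < count; i++) {
                long s = firstStart + i * slide;
                windows[i] = new Window(s, s + interval);
            }
            return windows;
        }

        public static void main(String[] args) {
            // 3 windows, each 24 hours long, sliding by 12 hours
            for (Window w : generate(3, 0, 24, 12))
                System.out.println("[" + w.start() + ", " + w.end() + ")");
        }
    }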

The Comet workflow built with the COMAD framework, together with the actor configuration, is shown in Figure 1. In each actor, the signature parameter declares what is expected for the input and output, while the read scope and data binding parameters determine how to fetch input data from the data stream and how to write output back into it. For example, the GddCalculator actor reads min, max, and TbaseValue as input and outputs gdd; all of these values have the type DoubleToken, and both the input and the output are bound to the "window" collection. Compared with the linear structure of the COMAD workflow, which clearly demonstrates the data analysis process, the Comet workflow built with plain Kepler and the PN director has a more complex structure, as shown in Figure 2.
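
As a concrete illustration of what GddCalculator computes from those bound inputs, here is a sketch using the standard growing degree day formula (the actual actor code may differ):

    public class GddFormula {
        static double gdd(double min, double max, double tBase) {
            // Average of the window's min and max minus the base temperature,
            // clipped at zero: periods below the base contribute no growth.
            return Math.max(0.0, (min + max) / 2.0 - tBase);
        }

        public static void main(String[] args) {
            System.out.println(gdd(10.0, 24.0, 10.0));  // prints 7.0
        }
    }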

Figure 1. Comet workflow built from COMAD framework

 

Figure 2. Comet workflow built with PN director

 

Project artifacts