Record Fuse Use-Case

June 23rd, 2012

Use Case. A set of records (e.g., BibTeX entries or specimen records), possibly harvested from different sites, is to be “cleaned” using both automatic methods and expert curation. Specifically, duplicates should be detected automatically whenever possible. For these duplicates, the user would like to see a “fused” record that combines all the information obtained from the different “duplicates”. All or some of the automatically fused records shall be reviewed by suitable experts, who can accept or override (edit) the automatic record fusion. Two things then happen: (i) the resulting records (automatically or semi-automatically fused) are archived as the new, “cleaned” version of the collection, and (ii) the updates found (either automatically or semi-automatically) are propagated back to the sources.

Possible Implementation (Overview). The following dataflow pipeline can be used to implement the use case:

-> [DuDe]
-> [Fuse]
-> [Assign]
-> [Dispatch]
-> [PushUnique,PullCurated]
-> [Archive]
-> [PropagateUpdates]

1. The input data collection, i.e., a set of records, is collected and streamed out.
[output type = record*, i.e., a stream of records]

2. A Duplicate-Detection (DuDe) step creates a new key for each “equivalence class” of records, i.e., for each group of records that have been found to be “duplicates” of one another (referring to the same real-world entity).

3. In the Fuse step, the duplicates (identified by their DuDe key) are fused, i.e., the record fields are unioned. For single-valued attributes that end up with multiple values, a flag is raised. The new fused record leaves a provenance (dependency) trail behind, showing how the fused record came about. After this step, all records are tagged as either unique or duplicate.

4. For each fused record, together with the duplicates it was derived from, a “work item” is assigned to an expert curator (e.g., by joining the record type with the expertise fields of curators).

5. Work items are dispatched to curators by sending them an email with a URL to their work item. A curator, upon clicking on the URL in the email, sees a form prefilled with the fused record and can decide to accept the proposed fusion or edit it. For each email sent, a handle (future, promise) is passed through immediately.

6. The PushUnique/PullCurated (PUPC) actor pushes all unique records through immediately. Those that were tagged upstream as curated/fused enter a “WaitFor” pool, from which these curated records are pulled once the curator has filled in the form.

7. The archival step receives unique and curated records and stores them in an archive.

8. The updates made by the system and/or curators are propagated back to the sources, increasing not just the quality of the workflow result stream, but also the input streams themselves.
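
The handle passing in steps 5 and 6 can be sketched with standard futures. This is a minimal sketch under stated assumptions, not the pipeline's actual implementation: the record layout (a dict with `id` and `tag` fields), the function names `dispatch` and `pupc`, and the simulated curator are all hypothetical. A real dispatch step would send an email with a URL instead of only storing the handle.

```python
from concurrent.futures import Future

def dispatch(record, pending):
    """Hypothetical dispatch step: return a handle (future) immediately.

    A real implementation would email the curator a URL here; this
    sketch only remembers the future so it can be resolved later.
    """
    handle = Future()
    pending[record["id"]] = handle
    return handle

def pupc(stream):
    """Push unique records through at once; pull curated ones when ready."""
    wait_for = []                      # the "WaitFor" pool of handles
    for record, handle in stream:
        if handle is None:             # unique record: no curation needed
            yield record
        else:
            wait_for.append(handle)
    for handle in wait_for:
        yield handle.result()          # blocks until the curator is done

# Illustrative data: one fused record (needs curation) and one unique record.
pending = {}
unique = {"id": "u1", "tag": "unique", "title": "A"}
fused = {"id": "f1", "tag": "fused", "title": "B"}
stream = [(fused, dispatch(fused, pending)), (unique, None)]

# Simulated curator: accepts the fusion but edits the title.
pending["f1"].set_result({**fused, "tag": "curated", "title": "B (edited)"})

archive = list(pupc(stream))
```

In this sketch the unique record reaches the archive first even though the fused record entered the stream earlier, which is the behavior step 6 describes. Note that `handle.result()` blocks, so a long-running deployment would likely need a timeout or a non-blocking poll.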

Early Prototype. We have implemented an early prototype that covers roughly steps (1), (2), (3), and (7). The following figure shows the workflow prototype, data collection, and dependency history.
1. The CollectionComposer actor imports the original input data collection (here a fixed number, seven, of BibTeX records).
2. The BibTexRecChunker actor identifies the duplicates. It creates a partition of the collection, i.e., it inserts groups (subcollections) of “equivalent” records (in the sense that they have been identified as referring to the same real-world object). In the prototype, we simply assume that records have a key indicating that they belong to the same class. After the BibTexRecChunker actor has run, four subcollections have been created: two of them are singletons, i.e., they already contain unique records; the other two subcollections have two and three members, respectively. The latter two indicate that their two (or three) records are to be fused subsequently.
3. The BibTexRecFuse actor creates a fused record from the duplicates and inserts the fused record in the result. The records being fused are tagged as deleted at the same time. The data dependency shows that the BibTexRecFuse actor works on two duplicate sets, and that each newly created record is fused from, and therefore depends on, the corresponding duplicate records. Note how the final structure has two fused records (right-most) and two records that were already unique, while the five non-unique records (which had been fused 2->1 and 3->1, respectively) are deleted.
4. The CollectionDisplay actor displays the workflow trace, including the complete data structure of the result.
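
The chunk-then-fuse behavior of the prototype can be sketched as follows. This is an illustrative sketch, not the actors' actual code: the function names `chunk` and `fuse`, the field names `dude_key`, `derived_from`, and `conflicts`, and the seven sample records are assumptions chosen to mirror the 1 + 1 + 2 + 3 partition described above.

```python
from collections import defaultdict

def chunk(records):
    """Partition records into subcollections by their duplicate key."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["dude_key"]].append(rec)
    return list(groups.values())

def fuse(group):
    """Union the fields of a duplicate group into one fused record.

    Returns the fused record and the set of fields on which the
    duplicates disagree (the "flag" for multi-valued fields).
    """
    values = defaultdict(list)
    for rec in group:
        for field, value in rec["fields"].items():
            if value not in values[field]:
                values[field].append(value)
    conflicts = {f for f, vs in values.items() if len(vs) > 1}
    fused = {
        "id": "fused(" + "+".join(r["id"] for r in group) + ")",
        "dude_key": group[0]["dude_key"],
        # Keep a single value as-is; keep all values when they disagree.
        "fields": {f: vs[0] if len(vs) == 1 else vs for f, vs in values.items()},
        "derived_from": [r["id"] for r in group],   # provenance trail
    }
    return fused, conflicts

# Seven hypothetical records: two unique, one pair, one triple.
records = [
    {"id": "r1", "dude_key": "k1", "fields": {"title": "T1"}},
    {"id": "r2", "dude_key": "k2", "fields": {"title": "T2"}},
    {"id": "r3", "dude_key": "k3", "fields": {"title": "T3", "year": "2010"}},
    {"id": "r4", "dude_key": "k3", "fields": {"title": "T3", "pages": "1-10"}},
    {"id": "r5", "dude_key": "k4", "fields": {"title": "T4", "year": "2011"}},
    {"id": "r6", "dude_key": "k4", "fields": {"title": "T4", "year": "2012"}},
    {"id": "r7", "dude_key": "k4", "fields": {"title": "T4"}},
]

result, deleted = [], []
for group in chunk(records):
    if len(group) == 1:
        result.append(group[0])                 # already unique
    else:
        fused, conflicts = fuse(group)
        fused["conflicts"] = sorted(conflicts)
        result.append(fused)
        deleted.extend(r["id"] for r in group)  # sources tagged as deleted
```

With this sample data, `result` holds two unique and two fused records, `deleted` lists the five source records, and the triple's fused record is flagged for a conflicting `year`.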
Note: There are a number of variations for modeling the collection structure, invocations, etc.
For example, we currently capture dependencies at the level of keys, not individual slots (although slot-level dependencies would be easy to add).
Similarly, just as the last stage has two invocations of BibTexRecFuse, we could model BibTexRecChunker as being invoked once for each partition (subcollection), i.e., four times here.
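
To illustrate the first variation, slot-level dependencies could be recorded by mapping each field of a fused record to the source records that supplied a value for it. This is a hypothetical sketch: the function name `slot_dependencies` and the record layout are assumptions, and the prototype itself tracks dependencies only per key.

```python
def slot_dependencies(group):
    """Map each field (slot) to the ids of the records that supplied it."""
    deps = {}
    for rec in group:
        for field in rec["fields"]:
            deps.setdefault(field, []).append(rec["id"])
    return deps

# Two hypothetical duplicates that are to be fused.
group = [
    {"id": "r3", "fields": {"title": "T3", "year": "2010"}},
    {"id": "r4", "fields": {"title": "T3", "pages": "1-10"}},
]
deps = slot_dependencies(group)
```

Here `deps` records that `title` depends on both sources, while `year` and `pages` each depend on a single one.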