QUALITY, RETRIEVAL AND ANALYSIS OF PROVENANCE IN LARGE-SCALE DATA
Date
2014-02
Publisher
[Bloomington, Ind.] : Indiana University
Abstract
Provenance is metadata that describes the lineage of a data product. Lineage is invaluable in advancing the reuse and reproducibility of scientific results in e-Science. Through the availability of provenance, future researchers can make valid assessments of data quality or consider the trustworthiness of the data. The shift towards 'Big Data' has presented challenges in provenance driven by data volume and variety, and by the need to make data more valuable and veracious. This dissertation examines provenance quality, capture, and representation, particularly for the highly voluminous provenance that occurs with growing frequency in large-scale science.
This work has at its core a framework and methodology that identify three dimensions of provenance quality: correctness, completeness, and relevance. Based on the proposed quality dimensions, the framework supports provenance quality analysis at the node/edge, graph, and multi-graph levels, which includes analysis of annotations, timestamps and the structure of provenance traces.
A supporting contribution is the design and generation of a pseudo-realistic provenance workload that consists of 48,000 provenance traces, forming a provenance database 10 Gigabytes in size. This workload is composed of provenance from six varied realistic workflows and includes a failure model that introduces several types of failures into provenance data, including workflow executions that experienced failures and workflow executions that experienced faults in message-passing communication between the application and the provenance system, the latter resulting in dropped provenance.
Provenance in High Performance Computing is directly addressed with the design of a cache storage solution that supports multi-level provenance capture with minimum collection overhead. A distributed NoSQL database stores the collected provenance. Evaluation is carried out through experiments performed on two production systems at the National Energy Research Scientific Computing Center.
The final contribution is the experimental evaluation of two storage approaches for provenance, graph and relational databases, and their impact on retrieval for realistic, provenance-specific queries. Results from experiments carried out at scale using real-world provenance traces show that, in the evaluated scenarios, graph databases are better suited to retrieving large provenance graphs by ID, while relational databases are the better option for provenance graphs of great depth.
Description
Thesis (Ph.D.) - Indiana University, Computer Sciences, 2014
Keywords
Large-scale Provenance, Provenance Analysis, Provenance Quality, Provenance Query
Type
Doctoral Dissertation