Big provenance stream processing for data-intensive computations
Date
2018-11
Publisher
[Bloomington, Ind.] : Indiana University
Abstract
Industry, academia, and research alike are grappling with the opportunities that Big Data brings in the ability to analyze data from numerous sources for insight, decision making, and predictive forecasts. The analysis workflows for dealing with such volumes of data are known as large-scale data-intensive computations (DICs). Data-intensive computation frameworks, also known as Big Data processing frameworks, carry out both online and offline processing. Big Data analysis workflows frequently consist of multiple steps: cleaning data, joining data from different sources, and applying processing algorithms. Critically, today the steps of a given workflow may be performed by different processing frameworks simultaneously, complicating the lifecycle of the data products that pass through the workflow. This is particularly the case in emerging Big Data management solutions such as Data Lakes, in which data from multiple sources are stored in shared storage and analyzed for different purposes at different points in time. In such an environment, accessibility and traceability of data products are known to be hard to achieve.

Data provenance, or data lineage, offers a good solution to this problem: it provides the derivation history of a data product and helps in monitoring, debugging, and reproducing computations. Our initial research produced a provenance-based reference architecture and a prototype implementation to achieve better traceability and management. Experiments show that the fine-grained provenance collected from data-intensive computations can be several times larger than the original data itself, creating a Big Data problem referred to in the literature as "Big Provenance". Storing and managing Big Provenance for later analysis may not be feasible for some data-intensive applications due to high resource consumption. Moreover, not all provenance is equally valuable, and a provenance stream can often be summarized without loss of critical information.

In this thesis, I apply stream processing techniques to analyze streams of provenance captured from data-intensive computations. The contributions are several. First, a provenance model that gives formal definitions of a provenance stream, forward provenance, and backward provenance in the context of data-intensive computations. Second, a stateful, one-pass, parallel stream processing algorithm that summarizes a full provenance stream on the fly while preserving both backward and forward provenance; the algorithm is resilient to provenance events arriving out of order. Third, multiple strategies for partitioning provenance streams emerging from data-intensive computations: horizontal, vertical, and random. Fourth, a provenance stream processing architecture that applies the proposed parallel streaming algorithm to a stream of provenance arriving through a distributed log store. The solution is evaluated using the Apache Kafka log store, the Apache Flink stream processing system, and the Komadu provenance capture service. Finally, a persistent identifier (PID)-based approach addresses provenance identity, archival, and reproducibility.
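To illustrate the kind of stateful, one-pass summarization the abstract describes, the following minimal Apache Flink sketch keys a stream of provenance edges by derived product and accumulates each product's direct sources in keyed state. It is not the thesis's implementation: the Edge event format, the class names (ProvenanceSummarySketch, Summarizer), the in-memory input in place of a Kafka topic, and the single-hop (non-transitive) backward-provenance summary are all illustrative assumptions.

// A minimal sketch, not the thesis's actual algorithm. Assumes a simplified
// provenance event: data product `derived` was generated from `source`.
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class ProvenanceSummarySketch {

    // Hypothetical provenance edge (a valid Flink POJO).
    public static class Edge {
        public String derived;
        public String source;
        public Edge() {}
        public Edge(String derived, String source) { this.derived = derived; this.source = source; }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // In the thesis architecture the edges would arrive via a Kafka topic;
        // a fixed collection keeps this sketch self-contained and runnable.
        env.fromElements(
                new Edge("b", "a"),   // b derived from a
                new Edge("c", "b"),   // c derived from b
                new Edge("d", "c"))
           .keyBy(e -> e.derived)     // one possible partitioning of the stream
           .process(new Summarizer())
           .print();

        env.execute("provenance-summary-sketch");
    }

    // Stateful one-pass summarizer: for each derived product, accumulate the set
    // of direct sources seen so far and emit a compact backward-provenance record.
    public static class Summarizer extends KeyedProcessFunction<String, Edge, String> {
        private transient MapState<String, Boolean> sources;

        @Override
        public void open(Configuration parameters) {
            sources = getRuntimeContext().getMapState(
                    new MapStateDescriptor<>("sources", String.class, Boolean.class));
        }

        @Override
        public void processElement(Edge e, Context ctx, Collector<String> out) throws Exception {
            sources.put(e.source, true);  // idempotent, so out-of-order arrivals are safe here
            StringBuilder sb = new StringBuilder(e.derived).append(" <- {");
            for (String s : sources.keys()) sb.append(s).append(' ');
            out.collect(sb.append('}').toString());
        }
    }
}

A full implementation along the lines of the thesis would instead consume provenance notifications from a distributed log store such as Kafka, maintain transitive (multi-hop) backward and forward provenance rather than single-hop source sets, and choose among the horizontal, vertical, or random partitioning strategies it proposes.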
Description
Thesis (Ph.D.) - Indiana University, School of Informatics, Computing and Engineering, 2018
Keywords
Big Data, Big Provenance, Stream Processing
Type
Doctoral Dissertation