Big provenance stream processing for data-intensive computations

Date

2018-11

Publisher

[Bloomington, Ind.] : Indiana University

Abstract

Industry, academia, and research alike are grappling with the opportunities that Big Data brings in the ability to analyze data from numerous sources for insight, decision making, and predictive forecasts. The analysis workflows for dealing with such volumes of data are known as large-scale data-intensive computations (DICs). Data-intensive computation frameworks, also known as Big Data processing frameworks, carry out both online and offline processing. Big Data analysis workflows frequently consist of multiple steps: cleaning the data, joining data from different sources, and applying processing algorithms. Critically, the steps of a given workflow may today be performed with different processing frameworks simultaneously, complicating the lifecycle of the data products that pass through the workflow. This is particularly the case in emerging Big Data management solutions such as Data Lakes, in which data from multiple sources are stored in shared storage and analyzed for different purposes at different points in time. In such an environment, accessibility and traceability of data products are known to be hard to achieve.

Data provenance, or data lineage, offers a sound solution to this problem, as it provides the derivation history of a data product and helps in monitoring, debugging, and reproducing computations. Our initial research produced a provenance-based reference architecture and a prototype implementation to achieve better traceability and management. Experiments show that the size of fine-grained provenance collected from data-intensive computations can be several times larger than the original data itself, creating a Big Data problem referred to in the literature as “Big Provenance”. Storing and managing Big Provenance for later analysis is not feasible for some data-intensive applications due to its high resource consumption. In addition, not all provenance is equally valuable, so a provenance stream can be summarized without loss of critical information.

In this thesis, I apply stream processing techniques to analyze streams of provenance captured from data-intensive computations. The specific contributions are as follows. First, a provenance model that includes formal definitions of a provenance stream, forward provenance, and backward provenance in the context of data-intensive computations. Second, a stateful, one-pass, parallel stream processing algorithm that summarizes a full provenance stream on the fly while preserving both backward and forward provenance; the algorithm is resilient to provenance events arriving out of order. Third, multiple strategies for partitioning a provenance stream emerging from data-intensive computations: horizontal, vertical, and random. Fourth, a provenance stream processing architecture that applies the proposed parallel streaming algorithm to a stream of provenance arriving through a distributed log store. The solution is evaluated using the Apache Kafka log store, the Apache Flink stream processing system, and the Komadu provenance capture service. Provenance identity, archival, and reproducibility are addressed through a persistent ID (PID)-based approach.
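As context for the first contribution, the sketch below shows one generic way such notions are often formalized in the provenance literature: a stream of timestamped derivation events inducing a derivation graph, with backward and forward provenance as reachability over that graph. This is a hedged illustration, not the thesis's exact definitions.

```latex
% A provenance stream: a (possibly out-of-order) sequence of timestamped
% derivation events, each recording that entity e_out was derived from e_in.
P = \langle (e^{1}_{\mathrm{in}}, e^{1}_{\mathrm{out}}, t_1),\;
            (e^{2}_{\mathrm{in}}, e^{2}_{\mathrm{out}}, t_2),\; \ldots \rangle

% Backward provenance of a data product d: every entity from which d was
% transitively derived in the graph induced by the events of P.
\mathrm{back}(d) = \{\, e \mid e \rightarrow^{*} d \,\}

% Forward provenance of d: every entity transitively derived from d.
\mathrm{fwd}(d) = \{\, e \mid d \rightarrow^{*} e \,\}
```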
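To make the evaluated architecture concrete, here is a minimal, hedged sketch of a Kafka-plus-Flink pipeline of the kind the abstract describes: a keyed, stateful, one-pass operator that accumulates each data product's direct ancestors, with bounded out-of-orderness to tolerate late provenance events. The broker address, topic name, `Edge` wire format, and `AncestorAccumulator` operator are all illustrative assumptions; the thesis's actual algorithm summarizes full backward and forward provenance.

```java
import java.time.Duration;
import java.util.HashSet;
import java.util.Set;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class ProvenanceSummaryJob {

    /** A minimal derivation event: `child` was derived from `parent`. */
    public static class Edge {
        public String child;
        public String parent;
        public long ts;
        public Edge() {}
        // Assumed wire format: "child,parent,epochMillis" (illustrative only).
        static Edge parse(String line) {
            String[] p = line.split(",");
            Edge e = new Edge();
            e.child = p[0];
            e.parent = p[1];
            e.ts = Long.parseLong(p[2]);
            return e;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Provenance notifications arrive through a distributed log store (Kafka).
        // Broker address and topic name are placeholders.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setTopics("provenance-stream")
                .setGroupId("provenance-summarizer")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "provenance")
                .map(Edge::parse)
                // Bounded out-of-orderness tolerates provenance events arriving late.
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<Edge>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                                .withTimestampAssigner((e, ts) -> e.ts))
                // Keying by the derived product routes all of a product's edges
                // to one parallel task, enabling stateful one-pass summarization.
                .keyBy(e -> e.child)
                .process(new AncestorAccumulator())
                .print();

        env.execute("provenance-stream-summarization");
    }

    /** Accumulates each product's direct ancestors in keyed state, one pass. */
    public static class AncestorAccumulator
            extends KeyedProcessFunction<String, Edge, String> {

        private transient ValueState<Set<String>> ancestors;

        @Override
        public void open(Configuration cfg) {
            ancestors = getRuntimeContext().getState(new ValueStateDescriptor<>(
                    "ancestors", TypeInformation.of(new TypeHint<Set<String>>() {})));
        }

        @Override
        public void processElement(Edge e, Context ctx, Collector<String> out)
                throws Exception {
            Set<String> seen = ancestors.value();
            if (seen == null) {
                seen = new HashSet<>();
            }
            if (seen.add(e.parent)) {          // duplicate edges are ignored
                ancestors.update(seen);
                out.collect(e.child + " <- " + seen);
            }
        }
    }
}
```

Keying by the derived product's identifier corresponds to one plausible reading of the vertical partitioning strategy; the horizontal and random strategies would change only the key selector, not the rest of the pipeline.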

Description

Thesis (Ph.D.) - Indiana University, School of Informatics, Computing and Engineering, 2018

Keywords

Big Data, Big Provenance, Stream Processing

Type

Doctoral Dissertation