Show simple item record

dc.contributor.advisor Plale, Beth en Luo, Yuan en 2015-11-24T18:02:27Z en 2015-11-24T18:02:27Z en 2015-08 en
dc.identifier.uri en
dc.description Thesis (Ph.D.) - Indiana University, Informatics and Computing, 2015 en
dc.description.abstract Scenarios exist in the era of Big Data where computational analysis needs to utilize widely distributed and remote compute clusters, especially when the data sources are sensitive or extremely large, and thus unable to move. A large dataset in Malaysia could be ecologically sensitive, for instance, and unable to be moved outside the country boundaries. Controlling an analysis experiment in this virtual cluster setting can be difficult on multiple levels: with setup and control, with managing behavior of the virtual cluster, and with interoperability issues across the compute clusters. Further, datasets can be distributed among clusters, or even across data centers, so that it becomes critical to utilize data locality information to optimize the performance of data-intensive jobs. Finally, datasets are increasingly sensitive and tied to certain administrative boundaries, though once the data has been processed, the aggregated or statistical result can be shared across the boundaries. This dissertation addresses management and control of a widely distributed virtual cluster having sensitive or otherwise immovable data sets through a controller. The Virtual Cluster Controller (VCC) gives control back to the researcher. It creates virtual clusters across multiple cloud platforms. In recognition of sensitive data, it can establish a single network overlay over widely distributed clusters. We define a novel class of data, notably immovable data that we call "pinned data", where the data is treated as a first-class citizen instead of being moved to where needed. We draw from our earlier work with a hierarchical data processing model, Hierarchical MapReduce (HMR), to process geographically distributed data, some of which are pinned data. The applications implemented in HMR use extended MapReduce model where computations are expressed as three functions: Map, Reduce, and GlobalReduce. Further, by facilitating information sharing among resources, applications, and data, the overall performance is improved. Experimental results show that the overhead of VCC is minimum. The HMR outperforms traditional MapReduce model while processing a particular class of applications. The evaluations also show that information sharing between resources and application through the VCC shortens the hierarchical data processing time, as well satisfying the constraints on the pinned data. en
dc.language.iso en_US en
dc.publisher [Bloomington, Ind.] : Indiana University en
dc.subject Cloud computing en
dc.subject Data processing en
dc.subject Distributed and immovable data en
dc.subject Hierarchical mapreduce en
dc.subject Pinned data en
dc.subject Virtual cluster management en
dc.title Virtual Cluster Management for Analysis of Geographically Distributed and Immovable Data en
dc.type Doctoral Dissertation en
dc.altmetrics.display false en

Files in this item

This item appears in the following Collection(s)

Show simple item record

Search IUScholarWorks

Advanced Search


My Account