HathiTrust Research Center Data Capsule v1.0: An Overview of Functionality

Date

2014-09-10

Publisher

Indiana University Digital Collections Services

Abstract

The first mode of access for the community of digital humanities and informatics researchers and educators to the copyrighted content of the HathiTrust digital repository will be through extracted statistical and aggregated information about the copyrighted texts. But can the HathiTrust Research Center (HTRC) support scientific research that allows researchers to carry out their own analysis and extract their own information? This question is the focus of a 3-year, $606,000 grant from the Alfred P. Sloan Foundation (Plale, Prakash 2011-2014), which has resulted in a novel experimental framework that permits analytical investigation of a corpus but prohibits data from leaving the capsule. The HTRC Data Capsule is both a system architecture and a set of policies that enable computational investigation over the protected content of the HathiTrust (HT) digital repository, carried out and controlled directly by a researcher. It builds on the foundational security principles of the Data Capsules of A. Prakash of the University of Michigan, which allow privileged access to sensitive data while restricting the channels through which that data can be released. Ongoing work extends the HTRC Data Capsule to give researchers more compute power at their fingertips. The new thrust, HT-DC Cloud, extends the existing security guarantees and features so that researchers can carry out compute-heavy tasks, such as LDA topic modeling, on large-scale compute resources. The HTRC Data Capsule works by giving each researcher their own virtual machine (VM) that runs within the HTRC domain. The researcher can configure the VM as they would their own desktop, with their own tools. Once configuration is done, the VM switches into a "secure" mode, in which network and other data channels are restricted in exchange for access to the protected data. Results are emailed to the user. In this talk we discuss the motivations for the HTRC Data Capsule and its successes and challenges. The HTRC Data Capsule runs at Indiana University.
See more at http://d2i.indiana.edu/non-consumptive-research
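
The two-phase workflow described in the abstract can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the HTRC implementation: the class `Capsule`, the mode name `MAINTENANCE` for the configuration phase, and all method names are hypothetical and chosen for illustration. It shows the trade the abstract describes: before the switch the VM has network access but cannot read the protected corpus, and in "secure" mode the reverse holds, with emailed results as the one modeled release channel.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    MAINTENANCE = "maintenance"  # hypothetical name for the configuration phase
    SECURE = "secure"            # the "secure" mode named in the abstract


@dataclass
class Capsule:
    """Toy model of a capsule VM (illustrative only, not HTRC code)."""
    mode: Mode = Mode.MAINTENANCE
    outbox: list = field(default_factory=list)  # stands in for emailed results

    def enter_secure_mode(self) -> None:
        # Network is given up in exchange for access to the protected data.
        self.mode = Mode.SECURE

    def open_network(self, host: str) -> None:
        # Allowed only while the researcher is still configuring the VM.
        if self.mode is not Mode.MAINTENANCE:
            raise PermissionError("network is restricted in secure mode")

    def read_corpus(self, volume_id: str) -> str:
        # Protected content is readable only once channels are restricted.
        if self.mode is not Mode.SECURE:
            raise PermissionError("corpus is readable only in secure mode")
        return f"<text of {volume_id}>"

    def release_results(self, results: str, email: str) -> None:
        # The single release channel in this sketch: results go to the user.
        self.outbox.append((email, results))
```

A researcher session in this model would install tools while `open_network` succeeds, call `enter_secure_mode()`, run the analysis through `read_corpus`, and hand the output to `release_results`.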

Keywords

Text analysis, Databases

Type

Presentation