Evaluation of Data Storage in HathiTrust Research Center Using Cassandra
Date
2014-07-02
Abstract
As digital data sources grow in number and size, they present an opportunity for computational investigation through text mining, natural language processing (NLP), and other text analysis techniques. The HathiTrust Research Center (HTRC) was recently established to support automated analysis of the more than 11 million digitized volumes (books) in the HathiTrust digital repository. The HTRC data store that hosts and provides access to HathiTrust volumes needs to be efficient, fault-tolerant, and able to operate at large scale. In this paper, we propose three schema designs for the Cassandra NoSQL store to represent the HathiTrust corpus and evaluate their performance extensively using simulated workloads. The experimental results show that encapsulating a whole volume within a single row with regular columns delivers the best overall performance.
Keywords
Cassandra, schema design, performance evaluation
Type
Technical Report