HathiTrust Research Center: Challenges and Opportunities in Big Text Data
Loading...
Can’t use the file because of accessibility barriers? Contact us with the title of the item, permanent link, and specifics of your accommodation need.
Date
2014-03-05
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Indiana University Digital Collections Services
Permanent Link
Abstract
HathiTrust Research Center (HTRC) is the public research arm of the HathiTrust digital library where millions of volumes, such as books, journals, and government documents, are digitized and preserved. By Nov 2013, the HathiTrust collection has 10.8M total volumes of which 3.5M are in the public domain [1] and the rest are in-copyrighted content.
The public domain volumes of the HathiTrust collection by themselves are more than 2TB in storage. Each volume comes with a MARC metadata record for the original physical copy and a METS metadata file for provenance of digital object. Therefore the large-scale text raises challenges on the computational access to the collection, subsets of the collection, and the metadata. The large volume also poses a challenge on text mining, which is, how HTRC provides algorithms to exploit knowledge in the collections and accommodate various mining need.
In this workshop, we will introduce the HTRC infrastructure, portal and work set builder interface, and programmatic data retrieve API (Data API), the challenges and opportunities in HTRC big text data, and finish with a short demo to the HTRC tools.
More about HTRC
The HTRC is a collaborative research center launched jointly by Indiana University and the University of Illinois, along with the HathiTrust Digital Library, to help meet the technical challenges of dealing with massive amounts of digital text that researchers face by developing cutting-edge software tools and cyberinfrastructure to enable advanced computational access to the growing digital record of human knowledge. See http://www.hathitrust.org/htrc for details.
[1] http://www.hathitrust.org/statistics_visualizations
Description
Keywords
Text mining, HathiTrust
Citation
Journal
DOI
Link(s) to data and video for this item
Click on the link below in the "External Files" section to play this video.
Rights
Type
Presentation