The HathiTrust Research Center (HTRC): Mining the 17 Million Volumes of the HathiTrust Digital Library
Date
2020-11-04
Publisher
Indiana University Digital Collections Services
Abstract
The HathiTrust Digital Library (HTDL) was founded in 2008 with just over 2 million volumes. Today it holds over 17 million volumes, ranging from 6th-century psalters to 21st-century academic texts. Its contents include government documents, academic journal articles, and monographs from all the disciplines represented in a typical academic research library. While the majority of materials are in English, there are many volumes in German, French, Spanish, Italian, Arabic, Chinese, Russian, and Latin. Researchers can analyze the contents of the HTDL using the text analysis tools and data sets provided by the HathiTrust Research Center (HTRC).
The HTRC, based at IU Bloomington, develops infrastructure, tools, and services to support text data mining of the HTDL corpus. These include off-the-shelf web-based text analysis tools, a secure data capsule computing environment for analysis of rights-restricted content, and the HTRC Extracted Features Data Set, which provides volume-level and page-level word counts and other metadata for the entire corpus.
This presentation will discuss the current contents of the HTDL collection and its benefits as a data source, and will provide examples of existing research facilitated by HTDL collections and HTRC resources. It will also give an overview of the HTRC text analysis tools and the options for analyzing public domain and copyrighted material.
Keywords
Text data mining, Computational linguistics, Digital libraries
Type
Presentation