The HathiTrust Research Center (HTRC): Mining the 17 Million Volumes of the HathiTrust Digital Library

Date

2020-11-04

Publisher

Indiana University Digital Collections Services

Abstract

The HathiTrust Digital Library (HTDL) was founded in 2008 with just over 2 million volumes in its collection. Today it holds over 17 million volumes, ranging from 6th-century psalters to 21st-century academic texts. Its diverse contents include government documents, academic journal articles, and monographs from all the disciplines represented in a typical academic research library. While the majority of materials are in English, there are many volumes in German, French, Spanish, Italian, Arabic, Chinese, Russian, and Latin. Researchers can perform text analysis on the contents of the HTDL using the tools and data sets provided by the HathiTrust Research Center (HTRC). The HTRC, based at IU Bloomington, develops infrastructure, tools, and services to support text data mining of the HTDL corpus. These include off-the-shelf web-based text analysis tools, a secure data capsule computing environment for analysis of rights-restricted content, and the HTRC Extracted Features Data Set, which provides volume-level and page-level word counts and other metadata for the entire corpus. This presentation will discuss the current contents of the HTDL collection and its benefits as a data source, and will provide examples of existing research facilitated by HTDL collections and HTRC resources. It will also give an overview of the various HTRC text analysis tools and the different options for analyzing public domain and copyrighted material.

Keywords

Text data mining, Computational linguistics, Digital libraries

Type

Presentation