Data to Insight Center

The Data To Insight Center is a collaboration between the School of Informatics, the Indiana University Libraries, and University Information Technology Services (UITS) at Indiana University.

The Center engages in interdisciplinary research and education in the preservation of scientific data, digital humanities, large-scale data management, data analytics, and visualization. The Center's current projects engage researchers in the humanities, geography, sustainability science, atmospheric science, informatics, computer science, and digital libraries. Because of the Data to Insight Center's close working relationship with UITS, the Center is well positioned to engage in projects that can be strengthened by IU's substantial investment in cyberinfrastructure compute and storage resources, and that can in turn further strengthen these investments. The Center engages in outreach and education in service to the university and its students, the community, the State of Indiana, and the nation.

The Data To Insight Center is led by Beth Plale, a Professor in the School of Informatics and Computing at Indiana University Bloomington. Professor Plale has a deep and long engagement in interdisciplinary research, particularly with the environmental and atmospheric sciences, and has substantive experience in developing stable and usable scientific cyberinfrastructure. The IU Libraries are represented in the Center by Associate Director Robert McDonald, Associate Dean for Library Technologies at IU. UITS is represented by Associate Directors Eric Wernert, Senior Manager and Scientist for Visualization Technologies and Futures for IU Research Technologies, and Matt Link, Director of Systems for IU Research Technologies. Associate Director Polly Baker brings strong ties to IUPUI and its New Media program.


Recent Submissions

  • Item
    Achieving low barriers to entry in the FAIR Digital Objects (FDO) data space: a Use Case in Biodiversity Extended Specimen Networks
    Plale, Beth
    For a network of FAIR digital objects (a “data space”) to be fully realized at a global scale, its architecture must present low barriers to entry for newcomer data providers. Barriers to entry are a measure of the up-front resource demands (costs) required to enter a line of business or participate in a multi-organizational endeavor. The biodiversity community’s notion of the Extended Specimen is a good match for a FAIR Digital Objects (FDO) data space. The Extended Specimen is the interconnection of physical specimens with all manner of derived and/or related data, reflecting new sources of data and information related to collected specimens. We look at two possible manifestations of a FAIR digital object data space for the global biodiversity community: the DiSSCo project in Europe and an early evaluation being undertaken in the US. Applying the lens of barriers to entry in this context strongly suggests that the FAIR Digital Object data space should adopt a policy of flexibility with respect to the requirements it imposes on newcomers.
  • Item
    A Role for the Research Data Alliance (RDA) in Adoption of RDA Products: a Whitepaper
    Plale, Beth
    In mid-2014, early in the life of the international Research Data Alliance (RDA), a consortium of volunteers, initial consensus products that promote data sharing are beginning to emerge. The RDA community is grappling with adoption: specifically, what is RDA’s role in advancing the adoption of the products emerging from its working groups? This whitepaper posits that RDA has an active role to play in promoting the adoption of its products (“RDA Recommendations”). This role includes reaching potential adopters in the early stages of the technology adoption process. This whitepaper provides a contextual framework for adoption, products, and adopters. It then examines current RDA activities (circa 2014) and highlights potential gaps.
  • Item
    Persistent IDs: Application to Workflow and Sensor Applications
    (2018-05-09) Luo, Yu; Ratharanjan, Kunalan; Zhou, Quan; Plale, Beth
    A poster presenting investigations of the PRAGMA Airbox Sensor Data and PRAGMA Rice Genomics projects. In the poster sections, we demonstrate the PID assignment strategy for streaming data under the Airbox use case, and the viability of provenance as part of the PID KI record under the Rice Genomics use case.
  • Item
    SEADTrain Data Analysis
    (2017-07-23) Plale, Beth; Kouper, Inna
    Hands-on tutorial on using Azure VMs to give data science students practical experience. Students analyze PM2.5 data in real time. The tutorial is a partnership of the Pacific Rim Applications and Grid Middleware Assembly (PRAGMA) with a team in Taiwan deploying the Airbox sensor network. Presented at the ESIP Summer 2017 meeting in Bloomington, IN.
  • Item
    A Hybrid Approach to Population Construction For Agricultural Agent-Based Simulation
    (2016) Chen, Peng; Evans, Tom; Frisby, Michael; Izquierdo, Eduardo; Plale, Beth
    An Agent Based Model (ABM) is a powerful tool because it can represent heterogeneous agents which, through their interactions, can reveal emergent phenomena. For this to occur, though, the set of agents in an ABM has to accurately model a real world population to reflect its heterogeneity. But when studying human behavior in less well developed settings, the availability of real population data can be limited, making it impossible to create agents directly from the real population. In this paper, we propose a hybrid method to deal with this data scarcity: we first use the available real population data as the baseline to preserve the true heterogeneity, and fill in the missing characteristics based on survey and remote sensing datasets; then, for the remaining undetermined agent characteristics, we use the Microbial Genetic Algorithm to search for a set of values that optimizes the replicative validity of the model to match data observed from the real world. We apply our method to the creation of a synthetic population of household agents for the simulation of agricultural decision making processes in rural Zambia. The results show that the synthetic population created from the farmer register correctly reflects the marginal distributions and the randomness of the survey data, and minimizes the difference between the distribution of simulated yield and that of the observed yield in the Post Harvest Survey (PHS).
  • Item
    Analysis of Memory Constrained Live Provenance
    Chen, Peng; Evans, Tom; Plale, Beth
    We conjecture that meaningful analysis of large-scale provenance can be preserved by analyzing provenance data in limited memory while the data is still in motion; that is, the provenance need not be fully resident before analysis can occur. As a proof of concept, this paper defines a stream model for reasoning about provenance data in motion for Big Data provenance. We propose a novel streaming algorithm for the backward provenance query, and apply it to the live provenance captured from agent-based simulations. The performance test demonstrates high throughput, low latency, and good scalability in a distributed stream processing framework built on Apache Kafka and Spark Streaming.
  • Item
    Grand Challenge of Indiana Water: Estimate of Compute and Data Storage Needs
    Plale, Beth
    This study is undertaken to assess the computational and storage needs for a large-scale research activity to study water in the State of Indiana. It draws its data and compute numbers from the Vortex II Forecast Data study of 2010 carried out by the Data To Insight Center at Indiana University. Details of the study can be found in each of the archived data products (each of which contains the results of a single weather forecast plus 42 visualizations created for that forecast). See for example archived data product.
  • Item
    TextRWeb: Large-Scale Text Analytics with R on the Web
    (2014-07-13) Ruan, Guangchen; Zhang, Hui; Wernert, Eric; Plale, Beth
    As digital data sources grow in number and size, they pose an opportunity for computational investigation by means of text mining, NLP, and other text analysis techniques. R is a popular and powerful text analytics tool; however, it needs to run in parallel and requires special handling to protect copyrighted content against full access (consumption). The HathiTrust Research Center (HTRC) currently has 11 million volumes (books), of which 7 million are copyrighted. In this paper we propose HTRC TextRWeb, an interactive R software environment which employs complexity hiding interfaces and automatic code generation to allow large-scale text analytics by non-consumptive means. For our principal test case of copyrighted data in the HathiTrust Digital Library, TextRWeb permits us to code, edit, and submit text analytics methods empowered by a family of interactive web user interfaces. All these methods combine to reveal a new interactive paradigm for large-scale text analytics on the web.
  • Item
    Software in Science: a Report of Outcomes of the 2014 National Science Foundation Software Infrastructure for Sustained Innovation (SI2) Meeting
    (2015-03-31) Plale, Beth; Jones, Matt; Thain, Douglas
    The second annual NSF Software Infrastructure for Sustained Innovation (SI2) PI meeting took place in Arlington, VA, February 24-25, 2014. It was hosted by Beth Plale, Indiana University; Douglas Thain, University of Notre Dame; and Matt Jones, National Center for Ecological Analysis and Synthesis. This report captures the challenges and outcomes emerging from the meeting over the four topic areas discussed: i) Attribution and Citation, ii) Reproducibility, Reusability, and Preservation, iii) Project/Software Sustainability, and iv) Career Paths. The report is an academic synthesis with credit to all the participants and to the notetakers who took prodigious notes and synthesized the results from which the conclusions of this report are derived.
  • Item
    The Data Capsule for Non-Consumptive Research: Final Report
    Plale, Beth; Prakash, Atul; McDonald, Robert
    Digital texts with access and use protections form a unique and fast growing collection of materials. Growing equally quickly is the development of text and data mining algorithms that process large text-based collections for purposes of exploring the content computationally. There is a strong need for research to establish the foundations for secure computational and data technologies that can ensure a non-consumptive environment for use-protected texts such as the copyrighted works in the HathiTrust Digital Library. Developing a secure computation and data environment for non-consumptive research for the HathiTrust Research Center is funded through a grant from the Alfred P. Sloan Foundation. In this research, researchers at HTRC and the University of Michigan are developing a “data capsule framework” that is founded on a principle of “trust but verify”. The project has resulted in a novel experimental framework that permits analytical investigation of a corpus but prohibits data from leaving the capsule. The HTRC Data Capsule is both a system architecture and set of policies that enable computational investigation over the protected content of the HT digital repository that is carried out and controlled directly by a researcher.
  • Item
    HTRC Data API Performance Study
    Sun, Yiming; Plale, Beth; Zeng, Jiaan
    The HathiTrust Research Center (HTRC) allows users to access more than 3 million volumes through a service called the Data API. The Data API plays an important role in the HTRC infrastructure. It hides internal complexity from users, protects against malicious or inadvertent damage to data, and separates the underlying storage solution from the interface so that the underlying storage may be replaced with better solutions without affecting client code. We carried out extensive evaluations of the HTRC Data API performance during Spring 2013. Specifically, we evaluated the rate at which data can be retrieved from the Cassandra cluster under different conditions, the impact of different compression levels, and HTTP/HTTPS data transfer. The evaluation presents performance aspects of the different software pieces in the Data API and guides us toward optimal settings for the Data API.
  • Item
    Big Data and HPC: Exploring Role of Research Data Alliance (RDA), a Report On Supercomputing 2013 Birds of a Feather
    (2013-11-13) Plale, Beth
    The ubiquity of today's data is not just transforming what is, it is transforming what will be, laying the groundwork to drive new innovation. Today, research questions are addressed by complex models, by large data analysis tasks, and by sophisticated data visualization techniques, all requiring data. To address the growing global need for data infrastructure, the Research Data Alliance (RDA) was launched in FY13 as an international community-driven organization. We propose to bring together members of RDA with the HPC community to create a shared conversation around the utility of RDA for data-driven challenges in HPC.
  • Item
    Evaluation of Data Storage in HathiTrust Research Center Using Cassandra
    (2014-07-02) Ruan, Guangchen; Plale, Beth
    As digital data sources grow in number and size, they pose an opportunity for computational investigation by means of text mining, NLP, and other text analysis techniques. The HathiTrust Research Center (HTRC) was recently established to provision for automated analytical techniques on the over 11 million digitized volumes (books) of the HathiTrust digital repository. The HTRC data store that hosts and provisions access to HathiTrust volumes needs to be efficient, fault-tolerant, and large-scale. In this paper, we propose three schema designs for the Cassandra NoSQL store to represent the HathiTrust corpus and perform an extensive performance evaluation using simulated workloads. The experimental results demonstrate that encapsulating the whole volume within a single row with regular columns delivers the best overall performance.
  • Item
    Dependency Provenance in Agent Based Modeling
    (2013-08) Chen, Peng; Plale, Beth; Evans, Tom
    Researchers who use agent-based models (ABM) to model social patterns often focus on the model's aggregate phenomena. However, aggregation of individuals complicates the understanding of agent interactions and the uniqueness of individuals. We develop a method for tracing and capturing the provenance of individuals and their interactions in the NetLogo ABM, and from this create a "dependency provenance slice", which combines a data slice and a program slice to yield insights into the cause-effect relations among system behaviors. To cope with the large volume of fine-grained provenance traces, we propose use-inspired filters to reduce the amount of provenance, and a provenance slicing technique called "non-preprocessing provenance slicing" that directly queries over provenance traces without recovering all provenance entities and dependencies beforehand. We evaluate performance and utility using a well known ecological NetLogo model called "wolf-sheep-predation".
  • Item
    Repository of NSF-funded Publications and Related Datasets: “Back of Envelope” Cost Estimate for 15 years
    (2013) Plale, Beth; Kouper, Inna; Seiffert, Kurt; Konkiel, Stacy R
    In this back of envelope study we calculate the 15-year fixed and variable costs of setting up and running a data repository (or database) to store and serve the publications and datasets derived from research funded by the National Science Foundation (NSF). Costs are computed on a yearly basis using a fixed estimate of the number of papers published each year that list NSF as their funding agency. We assume each paper has one dataset and estimate the size of that dataset based on experience. By our estimates, the number of papers generated each year is 64,340. The average dataset size over all seven directorates of NSF is 32 gigabytes (GB). The total amount of data added to the repository is two petabytes (PB) per year, or 30 PB over 15 years. The architecture of the data/paper repository is based on a hierarchical storage model that uses a combination of fast disk for rapid access and tape for high reliability and cost efficient long-term storage. Data are ingested through workflows that are used in university institutional repositories, which add metadata and ensure data integrity. The average fixed cost is approximately 0.90 cents per GB over a 15-year span. Variable costs are estimated on a sliding scale of 150-100 dollars per new dataset for up-front curation, or 4.87-3.22 dollars per GB. Variable costs reflect a 3% annual decrease in curation costs, as efficiency and automated metadata and provenance capture are anticipated to help reduce what are now largely manual curation efforts. The total projected cost of the data and paper repository is estimated at 167,000,000 dollars over 15 years of operation, curating close to one million datasets and one million papers. After 15 years and 30 PB of data accumulated and curated, we estimate the cost per gigabyte at 5.56 dollars. This $167 million cost is a direct cost in that it does not include federally allowable indirect costs return (ICR).
    After 15 years, it is reasonable to assume that some datasets will be compressed and rarely accessed. Others may be deemed no longer valuable, e.g., because they are replaced by more accurate results. Therefore, at some point the data growth in the repository will need to be adjusted by use of strategic preservation.
  • Item
    SEAD: Preserving Data for Environmental Sciences in Areas of Climate, Land-Use, and Environmental Management
    (2013-01-14) Hedstrom, Margaret; Plale, Beth; McDonald, Robert H.; Chandrasekar, Kavitha; Kouper, Inna; Konkiel, Stacy; Kumar, Praveen; Myers, James
  • Item
    SEAD Virtual Archive: Building a Federation of Institutional Repositories for Long-Term Data Preservation in Sustainability Science
    (2013-01-16) Plale, Beth; McDonald, Robert H.; Chandrasekar, Kavitha; Kouper, Inna; Konkiel, Stacy; Hedstrom, Margaret L.; Myers, Jim; Kumar, Praveen
    Major research universities are grappling with their response to the deluge of scientific data emerging through research by their faculty. Many are looking to their libraries and the institutional repository as a solution. Scientific data introduces substantial challenges that the document-based institutional repository may not be suited to deal with. The Sustainable Environment - Actionable Data (SEAD) Virtual Archive specifically addresses the challenges of “long tail” scientific data. In this paper, we propose requirements, policy and architecture to support not only the preservation of scientific data today using institutional repositories, but also its rich access and use into the future.
  • Item
    Provenance Analysis: Towards Quality Provenance
    (2012) Cheah, You-Wei; Plale, Beth
    Data provenance, a key piece of metadata that describes the lifecycle of a data product, is crucial in aiding scientists to better understand and facilitate reproducibility and reuse of scientific results. Provenance collection systems often capture provenance on the fly and the protocol between application and provenance tool may not be reliable. As a result, data provenance can become ambiguous or simply inaccurate. In this paper, we identify likely quality issues in data provenance. We also establish crucial quality dimensions that are especially critical for the evaluation of provenance quality. We analyze synthetic and real-world provenance based on these quality dimensions and summarize our contributions to provenance quality.
  • Item
    Visualization of Network Data Provenance
    (2012-09) Chen, Peng; Plale, Beth; Cheah, You-Wei; Ghoshal, Devarshi; Jensen, Scott; Luo, Yuan
    Visualization facilitates the understanding of scientific data both through exploration and explanation of the visualized data. Provenance also contributes to the understanding of data by containing the contributing factors behind a result. The visualization of provenance, although supported in existing workflow management systems, generally focuses on small- to medium-sized provenance data, lacking techniques to deal with big data with high complexity. This paper discusses visualization techniques developed for exploration and explanation of provenance, including layout algorithm, visual style, graph abstraction techniques, and graph matching algorithm, to deal with the high complexity. We demonstrate these techniques through application to two extensively analyzed case studies that involved provenance capture and use over three-year projects, the first involving provenance of a satellite imagery ingest processing pipeline and the other provenance in a large-scale computer network testbed.
  • Item
    Temporal Representation for Scientific Data Provenance
    (2012-09) Chen, Peng; Plale, Beth; Aktas, Mehmet S.
    Provenance of digital scientific data is an important piece of the metadata of a data object. It can however grow voluminous quickly because the granularity level of capture can be high. It can also be quite feature rich. We propose a representation of the provenance data based on logical time that reduces the feature space. Creating time and frequency domain representations of the provenance, we apply clustering, classification and association rule mining to the abstract representations to determine the usefulness of the temporal representation. We evaluate the temporal representation using an existing 10 GB database of provenance captured from a range of scientific workflows.
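
The headline figures in the abstract of "Repository of NSF-funded Publications and Related Datasets: “Back of Envelope” Cost Estimate for 15 years", listed above, can be roughly cross-checked with a few lines of arithmetic. The following is a minimal sketch, not the authors' cost model: the constant and function names are hypothetical, decimal units (1 PB = 1,000,000 GB) are an assumption, and the 3% decrease is assumed to compound annually from the 150-dollar starting point.

```python
# Back-of-envelope cross-check of figures quoted in the abstract.
# All names here are illustrative; only the numbers come from the abstract.

PAPERS_PER_YEAR = 64_340        # papers per year listing NSF as funder
AVG_DATASET_GB = 32             # average dataset size, one dataset per paper
YEARS = 15                      # planning horizon
TOTAL_COST_USD = 167_000_000    # projected direct cost over 15 years
ROUNDED_TOTAL_PB = 30           # total volume as rounded in the abstract
GB_PER_PB = 1_000_000           # assumption: decimal (SI) units


def annual_volume_pb(papers: int = PAPERS_PER_YEAR,
                     dataset_gb: float = AVG_DATASET_GB) -> float:
    """Data added to the repository each year, in petabytes."""
    return papers * dataset_gb / GB_PER_PB


def curation_cost_per_dataset(year: int,
                              start_usd: float = 150.0,
                              annual_decrease: float = 0.03) -> float:
    """Up-front curation cost per dataset in a given year (year 0 = first year).

    Assumes the 3% decrease compounds annually, which is one plausible
    reading of the sliding scale described in the abstract.
    """
    return start_usd * (1.0 - annual_decrease) ** year


annual_pb = annual_volume_pb()                       # about 2 PB per year
total_datasets = PAPERS_PER_YEAR * YEARS             # close to one million
cost_per_gb = TOTAL_COST_USD / (ROUNDED_TOTAL_PB * GB_PER_PB)
final_year_curation = curation_cost_per_dataset(YEARS - 1)

if __name__ == "__main__":
    print(f"Annual volume: {annual_pb:.2f} PB")
    print(f"Datasets curated over {YEARS} years: {total_datasets:,}")
    print(f"Cost per GB after {YEARS} years: {cost_per_gb:.2f} USD")
    print(f"Curation cost per dataset in final year: {final_year_curation:.2f} USD")
```

Under these assumptions the annual volume comes to roughly 2 PB and the 15-year dataset count to just under one million, consistent with the abstract. Dividing 167 million dollars by the rounded 30 PB gives about 5.57 dollars per GB, close to the quoted 5.56; the small gap presumably reflects unrounded totals in the original study.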