Browsing by Author "Plale, Beth"
Now showing 1 - 20 of 229
Item: 2011 annual report on training, education, and outreach activities of the Indiana University Pervasive Technology Institute and affiliated organizations (2012) Miller, Therese; Plale, Beth; Stewart, Craig A.
This report summarizes training, education, and outreach activities for calendar year 2011 of PTI and affiliated organizations, including the School of Informatics and Computing, the Office of the Vice President for Information Technology, and the Maurer School of Law. Reported activities include those led by PTI Research Centers (Center for Applied Cybersecurity Research, Center for Research in Extreme Scale Technologies, Data to Insight Center, Digital Science Center) and Service and Cyberinfrastructure Centers (Research Technologies Division of University Information Technology Services, National Center for Genome Assembly Support).

Item: 2012 annual report on training, education, and outreach activities of the Indiana University Pervasive Technology Institute and affiliated organizations (2014-05-08) Miller, Therese; Ping, Robert J.; Plale, Beth; Stewart, Craig A.
This report summarizes training, education, and outreach activities for calendar year 2012 of PTI and affiliated organizations, including the School of Informatics and Computing, the Office of the Vice President for Information Technology, and the Maurer School of Law. Reported activities include those led by PTI Research Centers (Center for Applied Cybersecurity Research, Center for Research in Extreme Scale Technologies, Data to Insight Center, Digital Science Center) and Service and Cyberinfrastructure Centers (Research Technologies Division of University Information Technology Services, National Center for Genome Assembly Support).

Item: 2013 annual report on training, education, and outreach activities of the Indiana University Pervasive Technology Institute and affiliated organizations (2014-05-08) Ping, Robert J.; Miller, Therese; Plale, Beth; Stewart, Craig A.
This report summarizes training, education, and outreach activities for calendar year 2013 of PTI and affiliated organizations, including the School of Informatics and Computing, the Office of the Vice President for Information Technology, and the Maurer School of Law. Reported activities include those led by PTI Research Centers (Center for Applied Cybersecurity Research, Center for Research in Extreme Scale Technologies, Data to Insight Center, Digital Science Center) and Service and Cyberinfrastructure Centers (Research Technologies Division of University Information Technology Services, National Center for Genome Assembly Support).

Item: Achieving low barriers to entry in the FAIR Digital Objects (FDO) data space: a Use Case in Biodiversity Extended Specimen Networks. Plale, Beth
For a network of FAIR digital objects (a “data space”) to be fully realized at a global scale, its architecture must present low barriers to entry for newcomer data providers. Barriers to entry measure the up-front resource demands (costs) required to enter a line of business or participate in a multi-organizational endeavor. The biodiversity community’s notion of the Extended Specimen is a good match for a FAIR Digital Objects (FDO) data space: an Extended Specimen interconnects a physical specimen with all manner of derived and related data, reflecting new sources of data and information related to collected specimens. We look at two possible manifestations of a FAIR digital object data space for the global biodiversity community: the DiSSCo project in Europe and an early evaluation being undertaken in the US.
Applying the lens of barriers to entry in this context strongly suggests that the FAIR Digital Object data space adopt a policy of flexibility with respect to the requirements it imposes on newcomers.

Item: Agenda to the Midwest Research Computing and Data Consortium Annual Meeting held 4/30/2024-5/1/2024 (2024-04-15) Snapp-Childs, Winona; Plale, Beth; Tomko, Karen; Hampton, Scott; Palen, Brock; Combs, Jane; Shechter, Todd; Ferguson, Jim; Liming, Lee; Smith, Preston; Djohari, Hadrian

Item: Archiving a social-ecological database: challenges, solutions and lessons learned (Indiana University Digital Collections Services, 2014-02-12) Plale, Beth; Kouper, Inna
Social-ecological research studies complex human-natural environments and the uses and sharing of ecological resources. Elinor Ostrom, a Nobel prize laureate from IU, pioneered the idea that social-ecological data can be collected and stored in a centralized database, which can capture complex relationships between various components of the data and facilitate their collective, collaborative use. While useful in their active stage, databases present a challenge for archiving and preservation, especially if they are stored in a proprietary format and changes are often applied retrospectively to both new and existing data. In this talk we present an approach to archiving a social-ecological research database, the International Forestry Resources and Institutions (IFRI) database, and discuss the challenges we encountered as well as lessons learned. The talk aims to stimulate a discussion about the preservation of complex data objects and possible solutions that can be generalized beyond one case.

Item: Big Data Analytics in Static and Streaming Provenance ([Bloomington, Ind.] : Indiana University, 2016-04) Chen, Peng; Plale, Beth
With recent technological and computational advances, scientists increasingly integrate sensors and model simulations to understand spatial, temporal, social, and ecological relationships at unprecedented scale. Data provenance traces the relationships of entities over time, thus providing a unique view of the over-time behavior under study. However, provenance can be overwhelming in both volume and complexity, and the forecasting potential of provenance creates additional demands. This dissertation focuses on Big Data analytics of static and streaming provenance. It develops filters and a non-preprocessing slicing technique for in-situ querying of static provenance. It presents a stream processing framework for online processing of provenance data at a high receiving rate. While the former is sufficient for answering queries that are given prior to the application start (forward queries), the latter deals with queries whose targets are unknown beforehand (backward queries). Finally, it explores data mining on large collections of provenance and proposes a temporal representation of provenance that reduces the high dimensionality while effectively supporting mining tasks such as clustering, classification, and association rule mining; the temporal representation can be further applied to streaming provenance as well. The proposed techniques are verified through software prototypes applied to Big Data provenance captured from computer network data, weather models, ocean models, remote (satellite) imagery data, and agent-based simulations of agricultural decision making.
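The non-preprocessing slicing named in this dissertation answers a backward query directly over the recorded trace rather than first recovering all provenance entities and dependencies. Below is a minimal Python sketch of that idea, assuming provenance is available as (entity, relation, antecedent) triples; the record format, names, and single-pass index are illustrative assumptions, not the dissertation's implementation.

```python
from collections import defaultdict, deque

def backward_slice(triples, target):
    """Answer a backward query: everything `target` transitively
    depends on. Only a reverse-dependency index is built from the
    raw trace; no full provenance graph is recovered beforehand."""
    deps = defaultdict(set)                  # entity -> direct antecedents
    for entity, _relation, antecedent in triples:
        deps[entity].add(antecedent)

    slice_, frontier = set(), deque([target])
    while frontier:
        node = frontier.popleft()
        for parent in deps[node] - slice_:   # new set, safe to extend deps
            slice_.add(parent)
            frontier.append(parent)
    return slice_

# Toy trace: raw sensor data -> cleaned data -> forecast output.
trace = [("clean.dat", "wasDerivedFrom", "sensor.dat"),
         ("forecast.out", "wasDerivedFrom", "clean.dat"),
         ("forecast.out", "wasDerivedFrom", "config.xml")]
print(backward_slice(trace, "forecast.out"))
# -> {'sensor.dat', 'clean.dat', 'config.xml'} (set order varies)
```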
Item: Big Data and HPC: Exploring Role of Research Data Alliance (RDA), a Report On Supercomputing 2013 Birds of a Feather (2013-11-13) Plale, Beth
The ubiquity of today's data is not just transforming what is; it is transforming what will be, laying the groundwork to drive new innovation. Today, research questions are addressed by complex models, by large data analysis tasks, and by sophisticated data visualization techniques, all requiring data. To address the growing global need for data infrastructure, the Research Data Alliance (RDA) was launched in FY13 as an international community-driven organization. We propose to bring together members of RDA with the HPC community to create a shared conversation around the utility of RDA for data-driven challenges in HPC.

Item: Big provenance stream processing for data-intensive computations ([Bloomington, Ind.] : Indiana University, 2018-11) Suriarachchi, Isuru; Plale, Beth
Industry, academia, and research alike are grappling with the opportunities that Big Data brings in the ability to analyze data from numerous sources for insight, decision making, and predictive forecasts. The analysis workflows for dealing with such volumes of data are said to be large-scale data-intensive computations (DICs). Data-intensive computation frameworks, also known as Big Data processing frameworks, carry out both online and offline processing. Big Data analysis workflows frequently consist of multiple steps: cleaning the data, joining data from different sources, and applying processing algorithms. Critically, today the steps of a given workflow may be performed with different processing frameworks simultaneously, complicating the lifecycle of the data products that go through the workflow. This is particularly the case in emerging Big Data management solutions like Data Lakes, in which data from multiple sources are stored in a shared storage solution and analyzed for different purposes at different points in time. In such an environment, accessibility and traceability of data products are known to be hard to achieve. Data provenance, or data lineage, offers a good solution to this problem, as it provides the derivation history of a data product and helps in monitoring, debugging, and reproducing computations. Our initial research produced a provenance-based reference architecture and a prototype implementation to achieve better traceability and management. Experiments show that the size of fine-grained provenance collected from data-intensive computations can be several times larger than the original data itself, creating a Big Data problem referred to in the literature as “Big Provenance”. Storing and managing Big Provenance for later analysis is not feasible for some data-intensive applications due to high resource consumption. Moreover, not all provenance is equally valuable, and much of it can be summarized without loss of critical information. In this thesis, I apply stream processing techniques to analyze streams of provenance captured from data-intensive computations. The specific contributions are several. First, a provenance model that includes formal definitions for provenance stream, forward provenance, and backward provenance in the context of data-intensive computations. Second, a stateful, one-pass, parallel stream processing algorithm to summarize a full provenance stream on the fly while preserving backward and forward provenance. The algorithm is resilient to provenance events arriving out of order. Multiple provenance stream partitioning strategies (horizontal, vertical, and random) for provenance emerging from data-intensive computations are also presented. A provenance stream processing architecture is developed to apply the proposed parallel streaming algorithm to a stream of provenance arriving through a distributed log store. The solution is evaluated using the Apache Kafka log store, the Apache Flink stream processing system, and the Komadu provenance capture service. Provenance identity, archival, and reproducibility use a persistent ID (PID)-based approach.
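A minimal sketch of the one-pass summarization idea in Python, assuming provenance events arrive as (output, input) derivation records; the plain in-memory loop, event format, and names stand in for the Kafka/Flink pipeline described above and are assumptions for illustration.

```python
from collections import defaultdict

class ProvenanceSummarizer:
    """Stateful one-pass summarizer: folds a stream of fine-grained
    derivation events into compact backward/forward indexes without
    retaining the raw stream. Arrival order does not matter, because
    each event only adds edges to the two indexes."""

    def __init__(self):
        self.backward = defaultdict(set)  # output -> direct inputs
        self.forward = defaultdict(set)   # input  -> direct outputs

    def consume(self, event):
        out, inp = event["output"], event["input"]
        self.backward[out].add(inp)
        self.forward[inp].add(out)

    def backward_provenance(self, entity, seen=None):
        """Transitive inputs of `entity`, read from the summary state."""
        seen = set() if seen is None else seen
        for inp in self.backward[entity] - seen:
            seen.add(inp)
            self.backward_provenance(inp, seen)
        return seen

# Events may arrive out of order, e.g. from parallel workflow stages.
s = ProvenanceSummarizer()
for e in [{"output": "report", "input": "joined"},
          {"output": "joined", "input": "raw_a"},
          {"output": "joined", "input": "raw_b"}]:
    s.consume(e)
print(s.backward_provenance("report"))   # {'joined', 'raw_a', 'raw_b'}
```

Because set insertion is commutative, the same summary state results no matter how the stream is partitioned or reordered, which is the property the thesis's out-of-order resilience relies on.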
Item: The Data Capsule for Non-Consumptive Research: Final Report. Plale, Beth; Prakash, Atul; McDonald, Robert
Digital texts with access and use protections form a unique and fast-growing collection of materials. Growing equally quickly is the development of text and data mining algorithms that process large text-based collections for purposes of exploring the content computationally. There is a strong need for research to establish the foundations for secure computational and data technologies that can ensure a non-consumptive environment for use-protected texts such as the copyrighted works in the HathiTrust Digital Library. Development of a secure computation and data environment for non-consumptive research for the HathiTrust Research Center is funded through a grant from the Alfred P. Sloan Foundation. In this research, researchers at HTRC and the University of Michigan are developing a “data capsule framework” founded on a principle of “trust but verify”. The project has resulted in a novel experimental framework that permits analytical investigation of a corpus but prohibits data from leaving the capsule. The HTRC Data Capsule is both a system architecture and a set of policies that enable computational investigation over the protected content of the HT digital repository, carried out and controlled directly by a researcher.

Item: Datasets Published by the IU Pervasive Technology Institute 1999-2019 (2020-08-26) Stewart, Craig A.; Plale, Beth; Fischer, Jeremy

Item: Dependency Provenance in Agent Based Modeling (2013-08) Chen, Peng; Plale, Beth; Evans, Tom
Researchers who use agent-based models (ABM) to model social patterns often focus on the model's aggregate phenomena. However, aggregation of individuals complicates the understanding of agent interactions and the uniqueness of individuals. We develop a method for tracing and capturing the provenance of individuals and their interactions in the NetLogo ABM, and from this create a "dependency provenance slice", which combines a data slice and a program slice to yield insights into the cause-effect relations among system behaviors. To cope with the large volume of fine-grained provenance traces, we propose use-inspired filters to reduce the amount of provenance, and a provenance slicing technique called "non-preprocessing provenance slicing" that directly queries over provenance traces without recovering all provenance entities and dependencies beforehand. We evaluate performance and utility using the well-known ecological NetLogo model "wolf-sheep-predation".
Item: Evaluation of Data Storage in HathiTrust Research Center Using Cassandra (2014-07-02) Ruan, Guangchen; Plale, Beth
As digital data sources grow in number and size, they pose an opportunity for computational investigation by means of text mining, NLP, and other text analysis techniques. The HathiTrust Research Center (HTRC) was recently established to provision automated analytical techniques over the more than 11 million digitized volumes (books) of the HathiTrust digital repository. The HTRC data store that hosts and provisions access to HathiTrust volumes needs to be efficient, fault-tolerant, and large-scale. In this paper, we propose three schema designs for a Cassandra NoSQL store to represent the HathiTrust corpus and perform extensive performance evaluation using simulated workloads. The experimental results demonstrate that encapsulating the whole volume within a single row with regular columns delivers the best overall performance.
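The winning layout can be sketched as CQL issued through the Python cassandra-driver: all pages of a volume live in one partition (the old "single row with regular columns" layout), so reading a volume is a single-partition query. The keyspace, table, column names, and example volume ID below are illustrative assumptions, not the schema from the paper.

```python
from cassandra.cluster import Cluster

# Connect to a local Cassandra node (address is an assumption).
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS htrc_demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# All pages of a volume share the volume_id partition key, so a whole
# volume is stored together and fetched with one partition read.
session.execute("""
    CREATE TABLE IF NOT EXISTS htrc_demo.volumes (
        volume_id text,
        page_id   text,
        page_text text,
        PRIMARY KEY (volume_id, page_id)
    )
""")

session.execute(
    "INSERT INTO htrc_demo.volumes (volume_id, page_id, page_text) "
    "VALUES (%s, %s, %s)",
    ("mdp.39015012345678", "00000001", "First page of OCR text..."),
)

# Reading a whole volume is a single-partition query.
rows = session.execute(
    "SELECT page_id, page_text FROM htrc_demo.volumes WHERE volume_id = %s",
    ("mdp.39015012345678",),
)
for row in rows:
    print(row.page_id, len(row.page_text))
```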
Item: Grand Challenge of Indiana Water: Estimate of Compute and Data Storage Needs. Plale, Beth
This study was undertaken to assess the computational and storage needs of a large-scale research activity to study water in the State of Indiana. It draws its data and compute numbers from the Vortex II Forecast Data study of 2010 carried out by the Data To Insight Center at Indiana University. Details of the study can be found in each of the archived data products, each of which contains the results of a single weather forecast plus 42 visualizations created for that forecast. See https://scholarworks.iu.edu/dspace/handle/2022/15153 for an example archived data product.

Item: HathiTrust Research Center Data Capsule v1.0: An Overview of Functionality (Indiana University Digital Collections Services, 2014-09-10) Plale, Beth; Zeng, Jiaan; McDonald, Robert; Chen, Miao
The first mode of access by the community of digital humanities and informatics researchers and educators to the copyrighted content of the HathiTrust digital repository will be access to extracted statistical and aggregated information about the copyrighted texts. But can the HathiTrust Research Center support scientific research that allows a researcher to carry out their own analysis and extract their own information? This question is the focus of a 3-year, $606,000 grant from the Alfred P. Sloan Foundation (Plale, Prakash 2011-2014), which has resulted in a novel experimental framework that permits analytical investigation of a corpus but prohibits data from leaving the capsule. The HTRC Data Capsule is both a system architecture and a set of policies that enable computational investigation over the protected content of the HT digital repository, carried out and controlled directly by a researcher. It leverages the foundational security principles of the Data Capsules of A. Prakash of the University of Michigan, which allow privileged access to sensitive data while restricting the channels through which that data can be released. Ongoing work extends the HTRC Data Capsule to give researchers more compute power at their fingertips. The new thrust, HT-DC Cloud, extends existing security guarantees and features to allow researchers to carry out compute-heavy tasks, like LDA topic modeling, on large-scale compute resources. The HTRC Data Capsule works by giving a researcher their own virtual machine that runs within the HTRC domain. The researcher can configure the VM as they would their own desktop with their own tools. After they are done, the VM switches into a "secure" mode, where network and other data channels are restricted in exchange for access to the data being protected. Results are emailed to the user. In this talk we discuss the motivations for the HTRC Data Capsule, and its successes and challenges. The HTRC Data Capsule runs at Indiana University. See more at http://d2i.indiana.edu/non-consumptive-research
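A toy state-machine sketch of the two-mode capsule policy described above; the mode names, methods, and checks are assumptions made for illustration, not HTRC code.

```python
class DataCapsuleVM:
    """Toy model of the two-mode Data Capsule policy: in maintenance
    mode the researcher has the network but not the protected corpus;
    in secure mode the corpus is readable but outbound channels close,
    so results can leave only through the controlled release path."""

    def __init__(self):
        self.mode = "maintenance"

    def switch_to_secure(self):
        self.mode = "secure"

    def open_network(self):
        if self.mode == "secure":
            raise PermissionError("network is closed in secure mode")
        return "network channel open"

    def read_protected_volume(self, volume_id):
        if self.mode != "secure":
            raise PermissionError("corpus is readable only in secure mode")
        return f"contents of {volume_id}"

    def release_results(self, results):
        # Stand-in for the real release path (results emailed to the
        # user, per the abstract), where policy checks would apply.
        return f"queued for release: {results!r}"

vm = DataCapsuleVM()
vm.open_network()                       # allowed: install and configure tools
vm.switch_to_secure()
text = vm.read_protected_volume("mdp.39015012345678")
print(vm.release_results(text[:20]))
```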
Item: HathiTrust Research Center: Challenges and Opportunities in Big Text Data (Indiana University Digital Collections Services, 2014-03-05) Chen, Miao; Plale, Beth
The HathiTrust Research Center (HTRC) is the public research arm of the HathiTrust digital library, where millions of volumes, such as books, journals, and government documents, are digitized and preserved. As of November 2013, the HathiTrust collection held 10.8M total volumes [1], of which 3.5M are in the public domain and the rest are in-copyright content. The public domain volumes alone take more than 2 TB of storage. Each volume comes with a MARC metadata record for the original physical copy and a METS metadata file for the provenance of the digital object. Text at this scale raises challenges for computational access to the collection, subsets of the collection, and the metadata. The large volume also poses a challenge for text mining: how can HTRC provide algorithms that exploit knowledge in the collections and accommodate various mining needs? In this workshop, we introduce the HTRC infrastructure, the portal and workset builder interface, and the programmatic data retrieval API (Data API); discuss the challenges and opportunities in HTRC big text data; and finish with a short demo of the HTRC tools. More about HTRC: the HTRC is a collaborative research center launched jointly by Indiana University and the University of Illinois, along with the HathiTrust Digital Library, to help meet the technical challenges researchers face in dealing with massive amounts of digital text, by developing cutting-edge software tools and cyberinfrastructure to enable advanced computational access to the growing digital record of human knowledge. See http://www.hathitrust.org/htrc for details. [1] http://www.hathitrust.org/statistics_visualizations

Item: HTRC Data API Performance Study. Sun, Yiming; Plale, Beth; Zeng, Jiaan
The HathiTrust Research Center (HTRC) allows users to access more than 3 million volumes through a service called the Data API. The Data API plays an important role in the HTRC infrastructure. It hides internal complexity from users, protects against malicious or inadvertent damage to data, and separates the underlying storage solution from the interface, so that the underlying storage may be replaced with better solutions without affecting client code. We carried out extensive evaluations of HTRC Data API performance over spring 2013. Specifically, we evaluated the rate at which data can be retrieved from the Cassandra cluster under different conditions, the impact of different compression levels, and HTTP versus HTTPS data transfer. The evaluation presents the performance characteristics of the different software components of the Data API and guides us toward optimal settings for it.

Item: A Hybrid Approach to Population Construction For Agricultural Agent-Based Simulation (2016) Chen, Peng; Evans, Tom; Frisby, Michael; Izquierdo, Eduardo; Plale, Beth
An Agent Based Model (ABM) is a powerful tool for its ability to represent heterogeneous agents whose interactions can reveal emergent phenomena. For this to occur, though, the set of agents in an ABM has to accurately model a real-world population to reflect its heterogeneity. But when studying human behavior in less well developed settings, the availability of real population data can be limited, making it impossible to create agents directly from the real population. In this paper, we propose a hybrid method to deal with this data scarcity: we first use the available real population data as the baseline to preserve the true heterogeneity, filling in missing characteristics from survey and remote sensing datasets; then, for the remaining undetermined agent characteristics, we use the Microbial Genetic Algorithm to search for a set of values that optimizes the replicative validity of the model against data observed in the real world (a minimal sketch of this search step appears at the end of this listing). We apply our method to the creation of a synthetic population of household agents for the simulation of agricultural decision-making processes in rural Zambia. The results show that the synthetic population created from the farmer register correctly reflects the marginal distributions and the randomness of the survey data, and minimizes the difference between the distribution of simulated yield and that of the observed yield in the Post Harvest Survey (PHS).

Item: Indiana University Digitization Master Plan (2014-11-20) Lewis, David; Plale, Beth
In his State of the University address on October 1, 2013, Indiana University President Michael McRobbie emphasized that universities have a critical role to play in the preservation of knowledge. In keeping with this goal, President McRobbie announced a charter for an Indiana University Digitization Master Plan (DMP). The DMP is to look beyond time-based media and formulate a university-wide roadmap to digitize and store in some form all of our existing collections judged by experts and scholars to be of lasting importance to research and scholarship, and to ensure the preservation of all new research and scholarship at IU that is born digital.

Item: Indiana University Pervasive Technology Institute (2017-09-01) Stewart, Craig A.; Welch, Von; Plale, Beth; Fox, Geoffrey; Pierce, Marlon; Sterling, Thomas
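For the population-construction item above (A Hybrid Approach to Population Construction For Agricultural Agent-Based Simulation), here is a minimal Python sketch of a Microbial Genetic Algorithm search. The genome encoding, the fitness measure (matching a single observed yield statistic), and all names are illustrative assumptions, not the paper's implementation.

```python
import random

def microbial_ga(fitness, genome_len, pop_size=30, steps=2000,
                 cross=0.5, mutate=0.1):
    """Microbial GA: pick two genomes at random; the loser of the
    pairwise tournament copies genes from the winner ("infection")
    and then mutates. `fitness` is higher-is-better."""
    pop = [[random.random() for _ in range(genome_len)]
           for _ in range(pop_size)]
    for _ in range(steps):
        a, b = random.sample(range(pop_size), 2)
        win, lose = (a, b) if fitness(pop[a]) >= fitness(pop[b]) else (b, a)
        for i in range(genome_len):
            if random.random() < cross:
                pop[lose][i] = pop[win][i]           # copy gene from winner
            if random.random() < mutate:
                pop[lose][i] = min(1.0, max(0.0,     # small clipped mutation
                    pop[lose][i] + random.gauss(0, 0.05)))
    return max(pop, key=fitness)

# Toy replicative-validity objective: the genome encodes one undetermined
# characteristic per household agent, and the simulated "yield" should
# match an observed mean of 0.7 (a stand-in for Post Harvest Survey data).
observed_mean = 0.7
def fitness(genome):
    simulated_mean = sum(genome) / len(genome)
    return -abs(simulated_mean - observed_mean)

best = microbial_ga(fitness, genome_len=20)
print(round(sum(best) / len(best), 3))   # converges toward 0.7
```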