
# Repository of NSF-funded Publications and Related Datasets: “Back of Envelope” Cost Estimate for 15 years


| Field | Value | Language |
| --- | --- | --- |
| dc.contributor.author | Plale, Beth | |
| dc.contributor.author | Kouper, Inna | |
| dc.contributor.author | Seiffert, Kurt | |
| dc.contributor.author | Konkiel, Stacy R | |
| dc.date.accessioned | 2013-05-29T14:28:12Z | |
| dc.date.available | 2013-05-29T14:28:12Z | |
| dc.date.issued | 2013 | |
| dc.identifier.uri | http://hdl.handle.net/2022/16599 | |
| dc.description.abstract | In this back of envelope study we calculate the 15-year fixed and variable costs of setting up and running a data repository (or database) to store and serve the publications and datasets derived from research funded by the National Science Foundation (NSF). Costs are computed on a yearly basis using a fixed estimate of the number of papers published each year that list NSF as their funding agency. We assume each paper has one dataset and estimate the size of that dataset based on experience. By our estimates, the number of papers generated each year is 64,340. The average dataset size over all seven directorates of NSF is 32 gigabytes (GB). The total amount of data added to the repository is two petabytes (PB) per year, or 30 PB over 15 years. The architecture of the data/paper repository is based on a hierarchical storage model that uses a combination of fast disk for rapid access and tape for high reliability and cost-efficient long-term storage. Data are ingested through workflows that are used in university institutional repositories, which add metadata and ensure data integrity. Average fixed costs are approximately $0.90/GB over the 15-year span. Variable costs are estimated on a sliding scale of $150 to $100 per new dataset for up-front curation, or $4.87 to $3.22 per GB. Variable costs reflect a 3% annual decrease in curation costs, as efficiency gains and automated metadata and provenance capture are anticipated to reduce what are now largely manual curation efforts. The total projected cost of the data and paper repository is estimated at $167,000,000 over 15 years of operation, curating close to one million datasets and one million papers. After 15 years and 30 PB of data accumulated and curated, we estimate the cost per gigabyte at $5.56. This $167 million cost is a direct cost in that it does not include federally allowable indirect costs return (ICR). After 15 years, it is reasonable to assume that some datasets will be compressed and rarely accessed. Others may be deemed no longer valuable, e.g., because they are replaced by more accurate results. Therefore, at some point the data growth in the repository will need to be adjusted by use of strategic preservation. | en_US |
| dc.language.iso | en_US | en_US |
| dc.subject | Research Subject Categories | en_US |
| dc.subject | data repository | en_US |
| dc.subject | National Science Foundation (NSF) | en_US |
| dc.subject | cost model | en_US |
| dc.subject | digital preservation | en_US |
| dc.subject | data curation | en_US |
| dc.title | Repository of NSF-funded Publications and Related Datasets: “Back of Envelope” Cost Estimate for 15 years | en_US |
| dc.type | Technical Report | en_US |
| dc.type | Working Paper | en_US |
| dc.altmetrics.display | TRUE | |
| dc.altmetrics.display | true | en_US |
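
The volume and per-dataset curation figures quoted in the abstract can be roughly checked with a few lines of arithmetic. The sketch below is only an illustration under stated assumptions, not the report's actual cost model: it assumes decimal units (1 PB = 1,000,000 GB) and a curation cost that falls by a compounding 3% each year from $150. All names in the code are hypothetical.

```python
# Back-of-envelope check of figures quoted in the abstract.
# Assumptions (not stated in the abstract): decimal units, and a
# compounding 3% yearly decrease in per-dataset curation cost.

PAPERS_PER_YEAR = 64_340        # papers per year, one dataset each (from abstract)
AVG_DATASET_GB = 32             # average dataset size in GB (from abstract)
YEARS = 15                      # repository lifetime (from abstract)
TOTAL_COST_USD = 167_000_000    # total projected direct cost (from abstract)
CURATION_START_USD = 150.0      # first-year curation cost per dataset (from abstract)
ANNUAL_DECREASE = 0.03          # yearly decrease in curation cost (from abstract)
GB_PER_PB = 1_000_000           # assumption: decimal petabytes


def curation_cost_per_dataset(year_index: int) -> float:
    """Curation cost per dataset in a given year (0 = first year),
    assuming the 3% decrease compounds annually."""
    return CURATION_START_USD * (1 - ANNUAL_DECREASE) ** year_index


gb_per_year = PAPERS_PER_YEAR * AVG_DATASET_GB    # data added each year
total_gb = gb_per_year * YEARS                    # data after 15 years
total_datasets = PAPERS_PER_YEAR * YEARS          # datasets after 15 years

# Cost per GB using the abstract's rounded 30 PB total.
cost_per_gb_rounded = TOTAL_COST_USD / (30 * GB_PER_PB)

if __name__ == "__main__":
    print(f"Data added per year: {gb_per_year / GB_PER_PB:.2f} PB")
    print(f"Data after {YEARS} years: {total_gb / GB_PER_PB:.2f} PB")
    print(f"Datasets after {YEARS} years: {total_datasets:,}")
    print(f"Final-year curation cost: ${curation_cost_per_dataset(YEARS - 1):.2f}")
    print(f"Cost per GB at 30 PB: ${cost_per_gb_rounded:.2f}")
```

Under these assumptions the yearly volume comes out slightly above two petabytes, the dataset count just under one million, and the final-year curation cost close to the $100 end of the sliding scale. The sketch does not attempt to reproduce the $167 million total, since the abstract does not give the full fixed-cost breakdown.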