dc.contributor.author Plale, Beth
dc.contributor.author Kouper, Inna
dc.contributor.author Seiffert, Kurt
dc.contributor.author Konkiel, Stacy R
dc.date.accessioned 2013-05-29T14:28:12Z
dc.date.available 2013-05-29T14:28:12Z
dc.date.issued 2013
dc.identifier.uri http://hdl.handle.net/2022/16599
dc.description.abstract In this back-of-envelope study we calculate the 15-year fixed and variable costs of setting up and running a data repository (or database) to store and serve the publications and datasets derived from research funded by the National Science Foundation (NSF). Costs are computed on a yearly basis using a fixed estimate of the number of papers published each year that list NSF as their funding agency. We assume each paper has one dataset and estimate the size of that dataset based on experience. By our estimates, 64,340 papers are generated each year. The average dataset size across all seven NSF directorates is 32 gigabytes (GB). The total amount of data added to the repository is about two petabytes (PB) per year, or 30 PB over 15 years. The architecture of the data/paper repository is based on a hierarchical storage model that combines fast disk for rapid access with tape for highly reliable, cost-efficient long-term storage. Data are ingested through the workflows used in university institutional repositories, which add metadata and ensure data integrity. The average fixed cost is approximately $0.90/GB over the 15-year span. Variable costs are estimated on a sliding scale of $150 to $100 per new dataset for up-front curation, or $4.87 to $3.22 per GB. Variable costs reflect a 3% annual decrease in curation costs, as efficiency gains and automated metadata and provenance capture are anticipated to reduce what are now largely manual curation efforts. The total projected cost of the data and paper repository is estimated at $167,000,000 over 15 years of operation, curating close to one million datasets and one million papers. After 15 years and 30 PB of data accumulated and curated, we estimate the cost per gigabyte at $5.56. This $167 million is a direct cost: it does not include federally allowable indirect costs return (ICR).
After 15 years, it is reasonable to assume that some datasets will be compressed and rarely accessed. Others may be deemed no longer valuable, e.g., because they have been replaced by more accurate results. Therefore, at some point data growth in the repository will need to be managed through strategic preservation. en_US
dc.language.iso en_US en_US
dc.subject Research Subject Categories en_US
dc.subject data repository en_US
dc.subject National Science Foundation (NSF) en_US
dc.subject cost model en_US
dc.subject digital preservation en_US
dc.subject data curation en_US
dc.title Repository of NSF-funded Publications and Related Datasets: “Back of Envelope” Cost Estimate for 15 years en_US
dc.type Technical Report en_US
dc.type Working Paper en_US
dc.altmetrics.display TRUE
dc.altmetrics.display true en_US
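
The headline figures in the abstract can be roughly re-derived with a few lines of arithmetic. The Python sketch below is illustrative only and is not the report's cost model. The input numbers come from the abstract; the variable and function names, the use of decimal petabytes (1 PB = 1,000,000 GB), and the assumption that the 3% decrease compounds once per year starting from $150 are ours. The results match the abstract only approximately, which suggests the abstract's figures are rounded.

```python
# Inputs quoted in the abstract.
PAPERS_PER_YEAR = 64_340        # one dataset assumed per paper
AVG_DATASET_GB = 32
YEARS = 15
TOTAL_COST_USD = 167_000_000

# Assumption (not stated in the abstract): decimal petabytes.
GB_PER_PB = 1_000_000

annual_gb = PAPERS_PER_YEAR * AVG_DATASET_GB   # data added per year, in GB
annual_pb = annual_gb / GB_PER_PB              # about 2 PB per year
total_pb = annual_pb * YEARS                   # about 30 PB over 15 years
total_datasets = PAPERS_PER_YEAR * YEARS       # close to one million


def curation_cost_per_dataset(year, start=150.0, annual_decrease=0.03):
    """Per-dataset curation cost in a given year (year 0 is the first year).

    Assumes the 3% decrease compounds annually from the starting cost.
    """
    return start * (1 - annual_decrease) ** year


# Cost per GB using the abstract's rounded total of 30 PB.
cost_per_gb_rounded = TOTAL_COST_USD / (30 * GB_PER_PB)

if __name__ == "__main__":
    print(f"Data added per year: {annual_pb:.2f} PB")
    print(f"Data after {YEARS} years: {total_pb:.1f} PB")
    print(f"Datasets curated: {total_datasets:,}")
    print(f"Curation cost, final year: ${curation_cost_per_dataset(YEARS - 1):.2f}")
    print(f"Cost per GB (30 PB basis): ${cost_per_gb_rounded:.2f}")
```

Under these assumptions the per-dataset curation cost in the fifteenth year comes out slightly below $100, and the cost per gigabyte on a 30 PB basis comes out within a cent of the $5.56 quoted in the abstract.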