Show simple item record Plale, Beth en Kouper, Inna en Seiffert, Kurt en Konkiel, Stacy R en 2013-05-29T14:28:12Z en 2013-05-29T14:28:12Z en 2013 en
dc.identifier.uri en
dc.description.abstract In this back of envelope study we calculate the 15 year fixed and variable costs of setting up and running a data repository (or database) to store and serve the publications and datasets derived from research funded by the National Science Foundation (NSF). Costs are computed on a yearly basis using a fixed estimate of the number of papers that are published each year that list NSF as their funding agency. We assume each paper has one dataset and estimate the size of that dataset based on experience. By our estimates, the number of papers generated each year is 64,340. The average dataset size over all seven directorates of NSF is 32 gigabytes (GB). A total amount of data added to the repository is two petabytes (PB) per year, or 30 PB over 15 years. The architecture of the data/paper repository is based on a hierarchical storage model that uses a combination of fast disk for rapid access and tape for high reliability and cost efficient long-term storage. Data are ingested through workflows that are used in university institutional repositories, which add metadata and ensure data integrity. Average fixed costs is approximately $.0.90/GB over 15-year span. Variable costs are estimated at a sliding scale of $150 - $100 per new dataset for up-front curation, or $4.87 – $3.22 per GB. Variable costs reflect a 3% annual decrease in curation costs as efficiency and automated metadata and provenance capture are anticipated to help reduce what are now largely manual curation efforts. The total projected cost of the data and paper repository is estimated at $167,000,000 over 15 years of operation, curating close to one million of datasets and one million papers. After 15 years and 30 PB of data accumulated and curated, we estimate the cost per gigabyte at $5.56. This $167 million cost is a direct cost in that it does not include federally allowable indirect costs return (ICR). After 15 years, it is reasonable to assume that some datasets will be compressed and rarely accessed. Others may be deemed no longer valuable, e.g., because they are replaced by more accurate results. Therefore, at some point the data growth in the repository will need to be adjusted by use of strategic preservation. en
dc.language.iso en_US en
dc.subject Research Subject Categories en
dc.subject data repository en
dc.subject National Science Foundation (NSF) en
dc.subject cost model en
dc.subject digital preservation en
dc.subject data curation en
dc.title Repository of NSF-funded Publications and Related Datasets: “Back of Envelope” Cost Estimate for 15 years en
dc.type Technical Report en
dc.type Working Paper en
dc.altmetrics.display TRUE en
dc.altmetrics.display true en

Files in this item

This item appears in the following Collection(s)

Show simple item record

Search IUScholarWorks

Advanced Search


My Account