Presentations

Permanent link for this collection: https://hdl.handle.net/2022/14534

Recent Submissions

  • Item
    Harvesting Field Station Data: Automating Raspberry Pi Sensors to Collaborative Websites, and Update
    (2019-09-09) Sanders, Sheri; Foran, Eliza; Guido, Emmanuel; Anderson, Jazzly; Slayton, Thomas; Doak, Thomas
    Field stations increasingly leverage remote sensors for large-scale environmental data collection. Here we demonstrate a proof-of-concept workflow that runs from data collection by remote sensors to presentation of summary results on a remote, and therefore fast and stable, cloud server. Environmental data is collected via Raspberry Pis in several locations and streamed to the server on XSEDE's Jetstream, housed at Indiana University, through low-bandwidth messaging. The Jetstream server does all the heavy lifting: exporting the data into a database, running automatically updating summary scripts to produce graphs, and hosting a Drupal-based website to present the data to collaborators or the public. While we use compact data in our demo, larger databases can be backed up on XSEDE's Wrangler, also housed at Indiana University. The end product is automatic aggregation and backup of sensor data onto a stable website that does not require an in-house server or large on-site bandwidth. This setup is packaged into a ready-to-use, publicly available Jetstream image, meaning researchers could use their own sensors and R code for custom graphs with very little individual setup. Alternatively, the setup can be used to house and display larger-scale databases of other data types, such as audio recordings or photography. Future work will develop the ability to "pick up" data via drone fly-over and to aggregate citizen science data from multiple sites.
  • Item
    Genome and transcriptome analysis of fish tapeworm Nippotaenia percotti through scientific collaboration between research labs and national cyberinfrastructure.
    (2020-01-13) Papudeshi, Bhavya; Chafin, Tyler K; Sanders, Sheri A; Ganote, Carrie; Reshetnikov, Andrey N; Sokolov, Sergey G; Doak, Thomas; Pummill, Jeff F; Douglas, Marlis R.; Douglas, Michael E.
    Tapeworms (Cestoda) are endoparasites infecting all vertebrates. Most (>50%) of their described diversity is within the clade Cyclophyllidea, a common, chronic source of anthropogenic infection. We report the first sequenced genome for the tapeworm Nippotaenia percotti (Nippotaeniidea - the putative sister group to Cyclophyllidea), which is host-specific to a fish in the Amur River. The genome was derived for comparative purposes, to explore evolutionary change in functional gene loci of immunological import. Pooled individuals were sequenced on Illumina (HiSeq 2000) and PacBio (RSII), with additional RNAseq on the HiSeq 2500. Hybrid assemblies were completed in SPAdes with long-read scaffolding in LINKS. The assembly was further improved using Redundans and Pilon, generating 3,410 contigs at an N50 of 209,561 bp. Transcriptomes were assembled using a combined de novo approach (CDTA) with multiple assemblers and k-mers. Assembled transcripts were combined using EvidentialGene, producing 28,226 assembled transcripts at an N50 of 2,290 bp, then annotated using Trinotate. The assembled genome was annotated using MAKER, identifying 30,671 genes, using our assembled transcriptome and genomes of closely related cestodes. Gene evolution was examined using 15 cestode genomes from the WormBase Parasite database, with the MCL algorithm identifying 16,099 orthologous gene clusters. Gene loss/gain was assessed by contrasting gene clusters with the cestode phylogenetic tree constructed with core genes identified by BUSCO, using IQ-Tree. Nippotaenia percotti’s genome provides a baseline for future investigations into candidate-gene families potentially involved with anthropogenic infection and could also support improvements in tapeworm treatment and control.
  • Item
    National Center for Genome Analysis Support (NCGAS): Genomics and other Science in the NSF-Funded Jetstream Cloud
    (2020-01-13) Doak, Thomas; Sanders, Sheri; Ganote, Carrie; Papudeshi, Bhavya; Fischer, Jeremy; Hancock, David Y.
    The National Center for Genome Analysis Support (NCGAS) is an NSF-funded (NSF-1445604) center that helps all NSF-funded researchers doing genomics research. Genomics includes transcriptomics, metagenomics, genome annotation, etc. Our support includes providing access to large-memory computing, maintaining curated sets of genomics applications, providing one-on-one consultation, and creating educational opportunities. A resource that we have come to rely on for providing these services is the NSF-funded Jetstream Cloud, maintained by Indiana University (led by the Indiana University Pervasive Technology Institute (PTI)) and the University of Texas at Austin's Texas Advanced Computing Center (TACC). Additionally, we leverage Globus data transfer tools. Globus at the University of Chicago is responsible for integrating Jetstream with the NSF-funded Extreme Science and Engineering Discovery Environment (XSEDE), and for integrating Globus data movement and management tools, as well as Globus-based secure user authentication. With a focus on ease of use and broad accessibility, Jetstream is designed for those who have not previously used high performance computing and software resources: researchers who need more than desktop-strength computing but less than full-scale High Performance Computing (HPC). Jetstream features a web-based user interface based on the popular Atmosphere cloud computing environment, developed by CyVerse and extended to support science and engineering research generally. The system is particularly geared toward 21st-century workforce development at small colleges and universities, especially historically black colleges and universities, minority serving institutions, tribal colleges, and higher education institutions in EPSCoR States.
    Jetstream provides a library of virtual machines designed to do discipline-specific scientific analysis, but researchers can also develop their own VMs, with their own software sets, or sets specialized to a particular task. These VMs can be both saved and shared with collaborators. Currently there are 19 genomics VMs, including RStudio instances with Bioconductor, ready-made genome browsers with JBrowse/Tripal, and metagenomic tools like QIIME2 and Anvi’o. Biology and molecular biology researchers are the largest users of Jetstream. NCGAS has found VMs extremely useful in education and workshops: we develop class-specific VMs with all the applications needed, then clone them so that each student has their own VM to work on (making courses easy to scale). In addition to on-demand VMs, persistent science gateways can be established using template VMs NCGAS has built. These can be used to provide services to collaborators or to the world. Users can easily create Galaxy servers on Jetstream: each server comes preconfigured with hundreds of tools and commonly used reference datasets, and once it is running, researchers can use it as is or customize it. Many NCGAS users establish genome browsers, specific to their organism, that are shared with small sets of collaborating researchers but can also be shared with the world. Jetstream is accessed via an allocation process at XSEDE; a startup allocation is typically approved within a day.
  • Item
    Mining Microbial Genomes from Datasets on the Sequence Read Archive
    (2020-01-15) Papudeshi, Bhavya; Leffler, Haley; Ganapaneni, Sruthi; Sanders, Sheri; Ganote, Carrie; Doak, Thomas
    The declining costs of genome sequencing and growing amounts of genetic data have allowed the field of genomics to become more integrated with computational analysis. The use of high performance clusters (HPC) is necessary to process the large amounts of data in genomic projects. However, many biologists lack background experience in working with HPC systems, which limits their ability to best address their research questions. The National Center for Genome Analysis Support (NCGAS) is an NSF-funded center that focuses on filling this need by providing training through workshops, bioinformatics support on projects, and access to compute resources. As a byproduct of helping research projects, we develop open source workflows and make them available to the community. Here we present a workflow that will assist researchers in mining the Sequence Read Archive (SRA) to identify environments/datasets potentially containing genomes of interest, and to identify their closely related genomes. As a proof of concept, we used two genomes to test the workflow, selected to ensure that the workflow is flexible enough to generate results in formats amenable to further downstream analysis, based on the research question. The pipeline is available through GitHub (https://github.com/NCGAS/CEWiT-REU-Identifying-datasets-in-SRA-using-Jetstream) and as a pre-installed workflow on the XSEDE Jetstream cloud computing infrastructure.
  • Item
    Navigating the Sequence Read Archive to identify crAssphage, an ubiquitous inhabitant of the human microbiome
    (Jim Holland Summer Science Research Program Poster Session, 2019-07-14) Cai, Jasmine X.; Weathers, Jania G.; Leffler, Haley; Ganapaneni, Sruthi; Papudeshi, Bhavya; Sanders, Sheri; Doak, Thomas G.
    The declining costs of genome sequencing and growing amounts of genetic data are driving the field of genomics to become more integrated with computational analysis. The use of high performance clusters (HPC) is necessary to process the large amounts of data in genomic projects. However, many biologists lack the background experience in working with HPC systems, which limits their ability to best address their research questions. The National Center for Genome Analysis Support (NCGAS) is an NSF-funded center that focuses on filling this gap by providing training through workshops, bioinformatics support on projects, and access to compute resources. As a byproduct of helping on research projects, we develop open source workflows and make them available to the community. Here we present a workflow that will assist researchers in mining the Sequence Read Archive (SRA) to identify other environments/datasets that potentially contain a genome of interest, and to identify their closely related genomes. As a proof of concept, we used two genomes to test the workflow. We selected these two different genomes to ensure that the workflow is flexible enough to generate results in formats that aid further downstream analysis, based on the research question. The pipeline will be made available to the research community, with documentation, through Jetstream, an NSF cloud computing platform.
  • Item
    A workflow to identify genomes in the Sequence Read Archive for phylogenomic analysis
    (American Society for Microbiology 2019, 2019-06-23) Leffler, Haley; Ganapaneni, Sruthi; Papudeshi, Bhavya; Ganote, Carrie; Sanders, Sheri; Doak, Thomas G.
    The declining costs of genome sequencing and growing amounts of genetic data are driving the field of genomics to become more integrated with computational analysis. The use of high performance clusters (HPC) is necessary to process the large amounts of data in genomic projects. However, many biologists lack the background experience in working with HPC systems, which limits their ability to best address their research questions. The National Center for Genome Analysis Support (NCGAS) is an NSF-funded center that focuses on filling this gap by providing training through workshops, bioinformatics support on projects, and access to compute resources. As a byproduct of helping on research projects, we develop open source workflows and make them available to the community. Here we present a workflow that will assist researchers in mining the Sequence Read Archive (SRA) to identify other environments/datasets that potentially contain a genome of interest, and to identify their closely related genomes. As a proof of concept, we used two genomes to test the workflow. We selected these two different genomes to ensure that the workflow is flexible enough to generate results in formats that aid further downstream analysis, based on the research question. The pipeline will be made available to the research community, with documentation, through Jetstream, an NSF cloud computing platform.
  • Item
    Harvesting Field Station Data: Automating Data Flow from Raspberry Pi Sensors to Collaborative Websites
    (Annual Meeting of the Organization of Biological Field Stations, 2018-09-22) Sanders, Sheri; Guido, Emmanuel; Anderson, Jazzly; Slayton, Thomas; Doak, Thomas G.
    Field stations increasingly leverage remote sensors for large-scale environmental data collection. Here we demonstrate a proof-of-concept workflow that runs from data collection by remote sensors to presentation of summary results on a remote, and therefore fast and stable, cloud server. Environmental data is collected via Raspberry Pis in several locations and streamed to the server on XSEDE's Jetstream, housed in part at Indiana University, through low-bandwidth messaging. The Jetstream cloud server does all the heavy lifting: exporting the data into a database, running automatically updating summary scripts to produce graphs, and hosting a Drupal-based website to present the data to collaborators or the public. While we use compact data in our demo, larger databases can be backed up on XSEDE's Wrangler, a large-scale storage server also housed in part at Indiana University. The end product is automatic aggregation and backup of sensor data onto a stable website that does not require an in-house server or large on-site bandwidth. This workflow is packaged into a ready-to-use, publicly available Jetstream image, meaning researchers could use their own sensors and R code for custom graphs with very little setup. Alternatively, the image can be used to house and display larger-scale databases of other data types, such as audio recordings or photography. Future work will develop the ability to "pick up" data via drone fly-over and to aggregate citizen science data from multiple sites.
  • Item
    Developing a workflow for bioacoustic recording devices and frog call analysis within Jetstream
    (Center of Excellence for Women & Technology, 2019-04-12) Foran, Eliza; Anderson, Jazzly; Slayton, Thomas; Guido, Emmanuel; Doak, Thomas; Sanders, Sheri
  • Item
    Mining the Sequence Read Archive to identify crAssphage, a ubiquitous inhabitant of the human microbiome, in dog and pig samples
    (Center of Excellence for Women & Technology, 2019-04-12) Leffler, Haley; Ganapaneni, Sruthi; Papudeshi, Bhavya; Sanders, Sheri; Doak, Thomas
  • Item
    Coupling metagenomics with high-performance computing to mine the Sequence Read Archive (SRA) to analyze Pseudomonas phage PAK-P1
    (Center of Excellence for Women & Technology, 2019-04-12) Ganapaneni, Sruthi; Leffler, Haley; Papudeshi, Bhavya; Sanders, Sheri; Doak, Thomas
  • Item
    Population Genetics of Tree Swallows, in Collaboration with NCGAS
    (Plant and Animal Genome XXVII, 2019-01-14) Sanders, Sheri; Papudeshi, Bhavya; Ganote, Carrie; Doak, Tom; Mansfield, Charles; Tseng, Chi Yen; Custer, Thomas; Custer, Christine; Matson, Cole
    The National Center for Genome Analysis Support (NCGAS) provides training and computational resources in an effort to train biologists to approach historically difficult, non-model problems with large biological data sets. For example, our collaborators at Baylor University work with the Tree Swallow (Tachycineta bicolor), using RNAseq data in population genetics and toxicology. Working with NCGAS, they built a de novo transcriptome assembly for the Tree Swallow, for which there is no genome. Variant calling using the transcriptome identified 66,169 single nucleotide polymorphisms (SNPs) across 144 samples. They were then able to identify phylogeographic structuring across the Great Lakes Region, including accurate grouping of populations distributed across smaller geographic scales (e.g. along the Maumee River). SNPs were also used to assess population heterozygosity and genetic diversity. This project required large-scale data handling, large-memory machines to assemble the transcriptome, and advanced Linux skills to manage the data and analyses. NCGAS provided the computational resources and training on the Linux environment and data management. Further assistance was provided through consultation and problem solving, leading to a high level of independence and competency for the graduate student researcher.
  • Item
    The Genome of Fish Tapeworm Nippotaenia percotti as a Potential Bookmark for Gene Loci that Facilitates Anthropogenic Infection.
    (Plant and Animal Genome XXVII, 2019-01-14) Sanders, Sheri; Papudeshi, Bhavya; Ganote, Carrie; Doak, Tom; Chafin, Tyler; Reshetnikov, Andrey; Sokolov, Sergey; Pummill, Jeff; Douglas, Marlis; Douglas, Michael
    Tapeworms (Cestoda) are endoparasites infecting all vertebrates. Most (>50%) of their described diversity is within the clade Cyclophyllidea, a common, chronic source of anthropogenic infection. We report the first sequenced genome for the tapeworm Nippotaenia percotti (Nippotaeniidea - the putative sister group to Cyclophyllidea), which is host-specific to a fish in the Amur River. The genome was derived for comparative purposes, to explore evolutionary change in functional gene loci of immunological import. Pooled individuals were sequenced on Illumina (HiSeq 2000) and PacBio (RSII), with additional RNAseq on the HiSeq 2500. Hybrid assemblies were completed in SPAdes with long-read scaffolding in LINKS. The assembly was further improved using Redundans and Pilon, generating 3,410 contigs at an N50 of 209,561 bp. Transcriptomes were assembled using a combined de novo approach (CDTA) with multiple assemblers and k-mers. Assembled transcripts were combined using EvidentialGene, producing 28,226 assembled transcripts at an N50 of 2,290 bp, then annotated using Trinotate. The assembled genome was annotated using MAKER, identifying 30,671 genes, using our assembled transcriptome and genomes of closely related cestodes. Gene evolution was examined using 15 cestode genomes from the WormBase Parasite database, with the MCL algorithm identifying 16,099 orthologous gene clusters. Gene loss/gain was assessed by contrasting gene clusters with the cestode phylogenetic tree constructed with core genes identified by BUSCO, using IQ-Tree. Nippotaenia percotti’s genome provides a baseline for future investigations into candidate-gene families potentially involved with anthropogenic infection and could also support improvements in tapeworm treatment and control.
  • Item
    Compute resources available to the research community for microbiome analysis
    (Plant and Animal Genome XXVII, 2019-01-16) Papudeshi, Bhavya; Sanders, Sheri; Ganote, Carrie; Doak, Tom
    The National Center for Genome Analysis Support (NCGAS) is an NSF-funded center tasked with helping biologists get access to the computational resources they need to analyze genomic data. To support microbiome analysis, NCGAS provides preconfigured virtual machines (VMs) to identify taxa in 16S amplicon sequencing, and to identify both taxa and functions from whole genome metagenomes. Additionally, a pipeline to reconstruct genomes from metagenomes, to examine the role of specific microbes in a community, is available as a preconfigured VM hosting Anvi’o (https://ncgas.org/Blog_Posts/Running%20Anvio%20on%20Jetstream.php). Jetstream, a cloud computing resource, is easy to use and flattens the learning curve for using the Linux operating system and for installing bioinformatics software. Jetstream provides an environment for both prototyping and publishing tailored workflows. Through an NCGAS allocation, a researcher can get access to Jetstream, and to other national compute clusters with more memory and support for parallel processing. These compute resources have Globus Connect subscriptions, which assist in transferring terabytes of data quickly. In this workshop, we will demonstrate how to get an NCGAS allocation, set up a Jetstream account, spin up a preconfigured virtual machine for microbiome analysis (https://ncgas.org/Blog_Posts/Getting%20Started%20on%20Jetstream.php), and transfer data between compute clusters using Globus (https://ncgas.org/Blog_Posts/Getting%20Started%20with%20Globus.php).
  • Item
    NCGAS Makes Robust Transcriptome Assembly Even Easier with Added Features to an Accessible de novo Transcriptome Assembly Workflow
    (Plant and Animal Genome XXVII, 2019-01-12) Sanders, Sheri; Papudeshi, Bhavya; Ganote, Carrie; Doak, Tom
    The National Center for Genome Analysis Support (NCGAS) assists research groups with de novo transcriptome assembly. Following best practice for combined de novo transcriptome assemblies can put a technical burden on genomic researchers who may not be fully trained in efficient use of HPC clusters or in the variety of available software packages. NCGAS has created a workflow template to move RNAseq data through 19 parallelized assemblies using four software packages (Trinity, SOAP-denovo, transABySS, and Velvet Oases) and multiple k-mers. The transcripts are then combined and filtered using EvidentialGene to output putative transcripts and alternative forms in a replicable manner. The process is semi-automated but flexible enough to allow researchers to adjust parameters if they desire. This workflow provides a low bar for entry into robust transcriptome assembly that follows best practices, while also providing a replicable means of filtering large numbers of transcripts into a draft version of a transcriptome. We will highlight the main workflow in this demo but will concentrate on the features added to the workflow in the last year, including annotation via Trinotate, differential expression handling, and the automated creation of a table of assembly metrics via BUSCO and Quast for each sub-assembly. As this workflow has now been adopted by several groups, we will also discuss available training and current implementations of the tool.
  • Item
    Consequences of whole genome duplication in Paramecium sps.
    (2013-07) McGrath, C.L.; Gout, J.F.; Johri, P.; Doak, T.G.; Lynch, M.
  • Item
    Whole-genome duplication, the Paramecium aurelia radiation, and the evolution of gene expression.
    (2014-04-12) Doak, T.G.; Gout, J.F.; Bright, L.; Johri, P.; Sung, W.; Lynch, M.; McGrath, C.L.
  • Item
    Consequences of whole genome duplication in Paramecium sps.
    (2014-01) McGrath, C.L.; Gout, J.F.; Johri, P.; Doak, T.G.; Lynch, M.
  • Item
    Consequences of whole genome duplication in Paramecium sps.
    (2013-05) McGrath, C.L.; Gout, J.F.; Johri, P.; Doak, T.G.; Lynch, M.
  • Item
    RNA-Seq Demo on Galaxy
    (2018-10-25) Ganote, Carrie; Papudeshi, Bhavya; Sanders, Sheri; Doak, Tom
  • Item
    Introduction to Metagenomics
    (2018-10-23) Papudeshi, Bhavya; Sanders, Sheri; Ganote, Carrie; Doak, Tom
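
Several of the assembly abstracts above report an N50 value (for example, 3,410 contigs at an N50 of 209,561 bp). As a reading aid, the following is a minimal sketch of how N50 is conventionally computed from a list of contig lengths: sort the lengths from longest to shortest and return the length at which the running total first reaches half of the total assembly size. The function name `n50` and the toy input are illustrative assumptions; this is not the code used by the authors or by any of the tools named in the abstracts.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig (or transcript) lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50() requires at least one length")

    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length

    # Unreachable for non-empty input: the running sum always reaches the total.
    raise AssertionError("running sum never reached half of the total")


if __name__ == "__main__":
    # Hypothetical toy lengths, not data from any abstract above.
    toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
    print(n50(toy_lengths))  # total is 54; 10 + 9 + 8 = 27 reaches half, so N50 is 8
```

Because N50 depends only on the length distribution, a higher value indicates a more contiguous assembly but says nothing about correctness, which is why the abstracts pair it with tools such as BUSCO.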