Articles

Permanent link for this collectionhttps://hdl.handle.net/2022/19602

Browse

Recent Submissions

Now showing 1 - 20 of 21
  • Item
    Copredication in homotopy type theory
    (2017) Bahramian, Hamidreza; Nematollahi, Narges; Sabry, Amr
    This paper applies homotopy type theory to formal semantics of natural languages and proposes a new model for the linguistic phenomenon of copredication. Copredication refers to sentences where two predicates which assume different requirements for their arguments are asserted for one single entity, e.g., “the lunch was delicious but took forever”. This paper is particularly concerned with copredication sentences with quantification, i.e., cases where the two predicates impose distinct criteria of quantification and individuation, e.g., “Fred picked up and mastered three books.” In our solution developed in homotopy type theory and using the rule of existential closure following Heim analysis of indefinites, common nouns are modeled as identifications of their aspects using HoTT identity types, e.g., the common noun book is modeled as identifications of its physical and informational aspects. The previous treatments of copredication in systems of semantics which are based on simple type theory and dependent type theories make the correct predictions but at the expense of ad hoc extensions (e.g., partial functions, dot types and coercive subtyping). The model proposed here, also predicts the correct results but using a conceptually simpler foundation and no ad hoc extensions. The proofs in the proposal have been formalized using Agda.
  • Item
    Fast rule-based bioactivity prediction using associative classification mining
    (Chemistry Central Ltd., 2012) Yu, P.; Wild, D.J.
    Relating chemical features to bioactivities is critical in molecular design and is used extensively in the lead discovery and optimization process. A variety of techniques from statistics, data mining and machine learning have been applied to this process. In this study, we utilize a collection of methods, called associative classification mining (ACM), which are popular in the data mining community, but so far have not been applied widely in cheminformatics. More specifically, classification based on predictive association rules (CPAR), classification based on multiple association rules (CMAR) and classification based on association rules (CBA) are employed on three datasets using various descriptor sets. Experimental evaluations on anti-tuberculosis (antiTB), mutagenicity and hERG (the human Ether-a-go-go-Related Gene) blocker datasets show that these three methods are computationally scalable and appropriate for high speed mining. Additionally, they provide comparable accuracy and efficiency to the commonly used Bayesian and support vector machines (SVM) methods, and produce highly interpretable models.
  • Item
    Diverse CRISPRs evolving in human microbiomes
    (Public Library of Science, 2012) Rho, M.; Wu, Y.-W.; Tang, H.; Doak, T.G.; Ye, Y.
    CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) loci, together with cas (CRISPR-associated) genes, form the CRISPR/Cas adaptive immune system, a primary defense strategy that eubacteria and archaea mobilize against foreign nucleic acids, including phages and conjugative plasmids. Short spacer sequences separated by the repeats are derived from foreign DNA and direct interference to future infections. The availability of hundreds of shotgun metagenomic datasets from the Human Microbiome Project (HMP) enables us to explore the distribution and diversity of known CRISPRs in human-associated microbial communities and to discover new CRISPRs. We propose a targeted assembly strategy to reconstruct CRISPR arrays, which whole-metagenome assemblies fail to identify. For each known CRISPR type (identified from reference genomes), we use its direct repeat consensus sequence to recruit reads from each HMP dataset and then assemble the recruited reads into CRISPR loci; the unique spacer sequences can then be extracted for analysis. We also identified novel CRISPRs or new CRISPR variants in contigs from whole-metagenome assemblies and used targeted assembly to more comprehensively identify these CRISPRs across samples. We observed that the distributions of CRISPRs (including 64 known and 86 novel ones) are largely body-site specific. We provide detailed analysis of several CRISPR loci, including novel CRISPRs. For example, known streptococcal CRISPRs were identified in most oral microbiomes, totaling ~8,000 unique spacers: samples resampled from the same individual and oral site shared the most spacers; different oral sites from the same individual shared significantly fewer, while different individuals had almost no common spacers, indicating the impact of subtle niche differences on the evolution of CRISPR defenses. We further demonstrate potential applications of CRISPRs to the tracing of rare species and the virus exposure of individuals. This work indicates the importance of effective identification and characterization of CRISPR loci to the study of the dynamic ecology of microbiomes.
  • Item
    Oral spirochetes implicated in dental diseases are widespread in normal human subjects and carry extremely diverse integron gene cassettes
    (American Society for Microbiology, 2012) Wu, Y.-W.; Rho, M.; Doak, T.G.; Ye, Y.
    The NIH Human Microbiome Project (HMP) has produced several hundred metagenomic data sets, allowing studies of the many functional elements in human-associated microbial communities. Here, we survey the distribution of oral spirochetes implicated in dental diseases in normal human individuals, using recombination sites associated with the chromosomal integron in Treponema genomes, taking advantage of the multiple copies of the integron recombination sites (repeats) in the genomes, and using a targeted assembly approach thatwe have developed. We find that integron-containing Treponema species are present in 80%of the normal human subjects included in the HMP. Further, we are able to de novo assemble the integron gene cassettes using our constrained assembly approach, which employs a unique application of the de Bruijn graph assembly information;most of these cassette genes were not assembled in whole-metagenome assemblies and could not be identified by mapping sequencing reads onto the known reference Treponema genomes due to the dynamic nature of integron gene cassettes. Our studysignificantly enriches the gene pool known to be carried by Treponema chromosomal integrons, totaling 826 (598 97% nonredundant) enes. We characterize the functions of these gene cassettes: many of these genes have unknown functions. The integron gene cassette arrays found in the human microbiome are extraordinarily dynamic, with different microbial communities sharing only a small number of common genes.
  • Item
    The "most wanted" taxa from the human microbiome for whole genome sequencing
    (Public Library of Science, 2012) Fodor, A.A; DeSantis, T.Z; Wylie, K.M; Badger, J.H; Ye, Y.; Hepburn, T.; Hu, P.; Sodergren, E.; Liolios, K.; Huot-Creasy, H.; Birren, B.W.; Earl, A.M.
    The goal of the Human Microbiome Project (HMP) is to generate a comprehensive catalog of human-associated microorganisms including reference genomes representing the most common species. Toward this goal, the HMP has characterized the microbial communities at 18 body habitats in a cohort of over 200 healthy volunteers using 16S rRNA gene (16S) sequencing and has generated nearly 1,000 reference genomes from human-associated microorganisms. To determine how well current reference genome collections capture the diversity observed among the healthy microbiome and to guide isolation and future sequencing of microbiome members, we compared the HMP's 16S data sets to several reference 16S collections to create a 'most wanted' list of taxa for sequencing. Our analysis revealed that the diversity of commonly occurring taxa within the HMP cohort microbiome is relatively modest, few novel taxa are represented by these OTUs and many common taxa among HMP volunteers recur across different populations of healthy humans. Taken together, these results suggest that it should be possible to perform whole-genome sequencing on a large fraction of the human microbiome, including the 'most wanted', and that these sequences should serve to support microbiome studies across multiple cohorts. Also, in stark contrast to other taxa, the 'most wanted' organisms are poorly represented among culture collections suggesting that novel culture- and single-cell-based methods will be required to isolate these organisms for sequencing.
  • Item
    Fast rule-based bioactivity prediction using associative classification mining
    (Chemistry Central, 2012) Yu, P.; Wild, D.J.
    Relating chemical features to bioactivities is critical in molecular design and is used extensively in the lead discovery and optimization process. A variety of techniques from statistics, data mining and machine learning have been applied to this process. In this study, we utilize a collection of methods, called associative classification mining (ACM), which are popular in the data mining community, but so far have not been applied widely in cheminformatics. More specifically, classification based on predictive association rules (CPAR), classification based on multiple association rules (CMAR) and classification based on association rules (CBA) are employed on three datasets using various descriptor sets. Experimental evaluations on anti-tuberculosis (antiTB), mutagenicity and hERG (the human Ether-a-go-go-Related Gene) blocker datasets show that these three methods are computationally scalable and appropriate for high speed mining. Additionally, they provide comparable accuracy and efficiency to the commonly used Bayesian and support vector machines (SVM) methods, and produce highly interpretable models. © 2012 Yu and Wild; licensee Chemistry Central Ltd.
  • Item
    How the Scientific Community Reacts to Newly Submitted Preprints: Article Downloads, Twitter Mentions, and Citations
    (Public Library of Science, 2012) Shuai, X.; Pepe, A.; Bollen, J.
    We analyze the online response to the preprint publication of a cohort of 4,606 scientific articles submitted to the preprint database arXiv.org between October 2010 and May 2011. We study three forms of responses to these preprints: downloads on the arXiv.org site, mentions on the social media site Twitter, and early citations in the scholarly record. We perform two analyses. First, we analyze the delay and time span of article downloads and Twitter mentions following submission, to understand the temporal configuration of these reactions and whether one precedes or follows the other. Second, we run regression and correlation tests to investigate the relationship between Twitter mentions, arXiv downloads, and article citations. We find that Twitter mentions and arXiv downloads of scholarly articles follow two distinct temporal patterns of activity, with Twitter mentions having shorter delays and narrower time spans than arXiv downloads. We also find that the volume of Twitter mentions is statistically correlated with arXiv downloads and early citations just months after the publication of a preprint, with a possible bias that favors highly mentioned articles
  • Item
    In-silico predictive mutagenicity model generation using supervised learning approaches
    (Chemistry Central, 2012) Seal, A.; Passi, A.; Jaleel, U.C.A.; Consortium, O.S.D.D.; Wild, D.J.
    Experimental screening of chemical compounds for biological activity is a time consuming and expensive practice. In silico predictive models permit inexpensive, rapid "virtual screening" to prioritize selection of compounds for experimental testing. Both experimental and in silico screening can be used to test compounds for desirable or undesirable properties. Prior work on prediction of mutagenicity has primarily involved identification of toxicophores rather than whole-molecule predictive models. In this work, we examined a range of in silico predictive classification models for prediction of mutagenic properties of compounds, including methods such as J48 and SMO which have not previously been widely applied in cheminformatics. Results: The Bursi mutagenicity data set containing 4337 compounds (Set 1) and a Benchmark data set of 6512 compounds (Set 2) were taken as input data seta in this work. A third data set (Set 3) was prepared by joining up the previous two sets. Classification algorithms including Naïve Bayes, Random Forest, J48 and SMO with 10 fold cross-validation and default parameters were used for model generation on these data sets. Models built using the combined performed better than those developed from the Benchmark data set. Significantly, Random Forest outperformed other classifiers for all the data sets, especially for Set 3 with 89.27% accuracy, 89% precision and ROC of 95.3%. To validate the developed models two external data sets, AID1189 and AID1194, with mutagenicity data were tested showing 62% accuracy with 67% precision and 65% ROC area and 91% accuracy, 91% precision with 96.3% ROC area respectively. A Random Forest model was used the approved drugs from DrugBank and metabolites from the Zinc Database with True Positives rate almost 85% showing the robustness of the model. Conclusion: We have created a new mutagenicity benchmark data set with around 8,000 compounds. Our work shows that highly accurate predictive mutagenicity models can be built using machine learning methods based on chemical descriptors and trained using this set, and these models provide a complement to toxicophores based methods. Further, our work supports other recent literature in showing that Random Forest models generally outperform other comparable machine learning methods for this kind of application.
  • Item
    A core human microbiome as viewed through 16S rRNA sequence clusters
    (Public Library of Science, 2012) Huse, S.M.; Ye, Y.; Zhou, Y.; Fodor, A.A.
    We explore the microbiota of 18 body sites in over 200 individuals using sequences amplified V1-V3 and the V3-V5 small subunit ribosomal RNA (16S) hypervariable regions as part of the NIH Common Fund Human Microbiome Project. The body sites with the greatest number of core OTUs, defined as OTUs shared amongst 95% or more of the individuals, were the oral sites (saliva, tongue, cheek, gums, and throat) followed by the nose, stool, and skin, while the vaginal sites had the fewest number of OTUs shared across subjects. We found that commonalities between samples based on taxonomy could sometimes belie variability at the sub-genus OTU level. This was particularly apparent in the mouth where a given genus can be present in many different oral sites, but the sub-genus OTUs show very distinct site selection, and in the vaginal sites, which are consistently dominated by the Lactobacillus genus but have distinctly different sub-genus V1-V3 OTU populations across subjects. Different body sites show approximately a ten-fold difference in estimated microbial richness, with stool samples having the highest estimated richness, followed by the mouth, throat and gums, then by the skin, nasal and vaginal sites. Richness as measured by the V1-V3 primers was consistently higher than richness measured by V3-V5. We also show that when such a large cohort is analyzed at the genus level, most subjects fit the stool "enterotype" profile, but other subjects are intermediate, blurring the distinction between the enterotypes. When analyzed at the finer-scale, OTU level, there was little or no segregation into stool enterotypes, but in the vagina distinct biotypes were apparent. Finally, we note that even OTUs present in nearly every subject, or that dominate in some samples, showed orders of magnitude variation in relative abundance emphasizing the highly variable nature across individuals.
  • Item
    A de Bruijn graph approach to the quantification of closely-related genomes in a microbial community
    (Mary Ann Liebert, Inc. publishers, 2012) Wang, M.; Ye, Y.; Tang, H.
    The wide applications of next-generation sequencing (NGS) technologies in metagenomics have raised many computational challenges. One of the essential problems in metagenomics is to estimate the taxonomic composition of a microbial community, which can be approached by mapping shotgun reads acquired from the community to previously characterized microbial genomes followed by quantity profiling of these species based on the number of mapped reads. This procedure, however, is not as trivial as it appears at first glance. A shotgun metagenomic dataset often contains DNA sequences from many closely-related microbial species (e.g., within the same genus) or strains (e.g., within the same species), thus it is often difficult to determine which species/strain a specific read is sampled from when it can be mapped to a common region shared by multiple genomes at high similarity. Furthermore, high genomic variations are observed among individual genomes within the same species, which are difficult to be differentiated from the inter-species variations during reads mapping. To address these issues, a commonly used approach is to quantify taxonomic distribution only at the genus level, based on the reads mapped to all species belonging to the same genus; alternatively, reads are mapped to a set of representative genomes, each selected to represent a different genus. Here, we introduce a novel approach to the quantity estimation of closely-related species within the same genus by mapping the reads to their genomes represented by a de Bruijn graph, in which the common genomic regions among them are collapsed. Using simulated and real metagenomic datasets, we show the de Bruijn graph approach has several advantages over existing methods, including (1) it avoids redundant mapping of shotgun reads to multiple copies of the common regions in different genomes, and (2) it leads to more accurate quantification for the closely-related species (and even for strains within the same species).
  • Item
    Scholarometer: A Social Framework for Analyzing Impact across Disciplines
    (Public Library of Science, 2012) Kaur, J.; Hoang, D.T.; Sun, X.; Patil, S.; Menczer, F.
    The use of quantitative metrics to gauge the impact of scholarly publications, authors, and disciplines is predicated on the availability of reliable usage and annotation data. Citation and download counts are widely available from digital libraries. However, current annotation systems rely on proprietary labels, refer to journals but not articles or authors, and are manually curated. To address these limitations, we propose a social framework based on crowdsourced annotations of scholars, designed to keep up with the rapidly evolving disciplinary and interdisciplinary landscape. We describe a system called Scholarometer, which provides a service to scholars by computing citation-based impact measures. This creates an incentive for users to provide disciplinary annotations of authors, which in turn can be used to compute disciplinary metrics. We first present the system architecture and several heuristics to deal with noisy bibliographic and annotation data. We report on data sharing and interactive visualization services enabled by Scholarometer. Usage statistics, illustrating the data collected and shared through the framework, suggest that the proposed crowdsourcing approach can be successful. Secondly, we illustrate how the disciplinary bibliometric indicators elicited by Scholarometer allow us to implement for the first time a universal impact measure proposed in the literature. Our evaluation suggests that this metric provides an effective means for comparing scholarly impact across disciplinary boundaries. © 2012 Kaur et al.
  • Item
    Competition among memes in a world with limited attention
    (Nature Publishing Group, 2012) Weng, L.; Flammini, A.; Vespignani, A.; Menczer, F.
    The wide adoption of social media has increased the competition among ideas for our finite attention. We employ a parsimonious agent-based model to study whether such a competition may affect the popularity of different memes, the diversity of information we are exposed to, and the fading of our collective interests for specific topics. Agents share messages on a social network but can only pay attention to a portion of the information they receive. In the emerging dynamics of information diffusion, a few memes go viral while most do not. The predictions of our model are consistent with empirical data from Twitter, a popular microblogging platform. Surprisingly, we can explain the massive heterogeneity in the popularity and persistence of memes as deriving from a combination of the competition for our limited attention and the structure of the social network, without the need to assume different intrinsic values among ideas.
  • Item
    Assessing Drug Target Association Using Semantic Linked Data
    (PLoS Computational Biology, 2012-07) Chen, Bin; Ding, Ying; Wild, David J.
    The rapidly increasing amount of public data in chemistry and biology provides new opportunities for large-scale data mining for drug discovery. Systematic integration of these heterogeneous sets and provision of algorithms to data mine the integrated sets would permit investigation of complex mechanisms of action of drugs. In this work we integrated and annotated data from public datasets relating to drugs, chemical compounds, protein targets, diseases, side effects and pathways, building a semantic linked network consisting of over 290,000 nodes and 720,000 edges. We developed a statistical model to assess the association of drug target pairs based on their relation with other linked objects. Validation experiments demonstrate the model can correctly identify known direct drug target pairs with high precision. Indirect drug target pairs (for example drugs which change gene expression level) are also identified but not as strongly as direct pairs. We further calculated the association scores for 157 drugs from 10 disease areas against 1683 human targets, and measured their similarity using a 157|1683 score matrix. The similarity network indicates that drugs from the same disease area tend to cluster together in ways that are not captured by structural similarity, with several potential new drug pairings being identified. This work thus provides a novel, validated alternative to existing drug target prediction algorithms. The web service is freely available at: http://chem2bio2rdf.org/slap.
  • Item
    Mining relational paths in integrated biomedical data
    (Public Library of Science, 2011-12-06) Wild, David J.; Desai, Pankaj; Qiu, Judy; Moorthy, Ganesh; Chen, Bin; Shin, Jae Hong; Sun, Yuyin; Wang, Huijun; Ding, Ying; Tang, Jie; He, Bing
    Much life science and biology research requires an understanding of complex relationships between biological entities (genes, compounds, pathways, diseases, and so on). There is a wealth of data on such relationships in publicly available datasets and publications, but these sources are overlapped and distributed so that finding pertinent relational data is increasingly difficult. Whilst most public datasets have associated tools for searching, there is a lack of searching methods that can cross data sources and that in particular search not only based on the biological entities themselves but also on the relationships between them. In this paper, we demonstrate how graph-theoretic algorithms for mining relational paths can be used together with a previous integrative data resource we developed called Chem2Bio2RDF to extract new biological insights about the relationships between such entities. In particular, we use these methods to investigate the genetic basis of side-effects of thiazolinedione drugs, and in particular make a hypothesis for the recently discovered cardiac side-effects of Rosiglitazone (Avandia) and a prediction for Pioglitazone which is backed up by recent clinical studies.
  • Item
    Finding Complex Biological Relationships in Recent PubMed Articles Using Bio-LDA
    (PLoS, 2011-03-23) Wild, David J.; Qiu, Judy; He, Bing; Dong, Xiao; Tang, Jie; Ding, Ying; Wang, Huijun
    The overwhelming amount of available scholarly literature in the life sciences poses significant challenges to scientists wishing to keep up with important developments related to their research, but also provides a useful resource for the discovery of recent information concerning genes, diseases, compounds and the interactions between them. In this paper, we describe an algorithm called Bio-LDA that uses extracted biological terminology to automatically identify latent topics, and provides a variety of measures to uncover putative relations among topics and bio-terms. Relationships identified using those approaches are combined with existing data in life science datasets to provide additional insight. Three case studies demonstrate the utility of the Bio-LDA model, including association predication, association search and connectivity map generation. This combined approach offers new opportunities for knowledge discovery in many areas of biology including target identification, lead hopping and drug repurposing.
  • Item
    Improving integrative searching of systems chemical biology data using semantic annotation
    (Chemistry Central Ltd., 2012-03-08) Wild, David J.; Ding, Ying; Chen, Bin
    Background: Systems chemical biology and chemogenomics are considered critical, integrative disciplines in modern biomedical research, but require data mining of large, integrated, heterogeneous datasets from chemistry and biology. We previously developed an RDF-based resource called Chem2Bio2RDF that enabled querying of such data using the SPARQL query language. Whilst this work has proved useful in its own right as one of the first major resources in these disciplines, its utility could be greatly improved by the application of an ontology for annotation of the nodes and edges in the RDF graph, enabling a much richer range of semantic queries to be issued. Results: We developed a generalized chemogenomics and systems chemical biology OWL ontology called Chem2Bio2OWL that describes the semantics of chemical compounds, drugs, protein targets, pathways, genes, diseases and side-effects, and the relationships between them. The ontology also includes data provenance. We used it to annotate our Chem2Bio2RDF dataset, making it a rich semantic resource. Through a series of scientific case studies we demonstrate how this (i) simplifies the process of building SPARQL queries, (ii) enables useful new kinds of queries on the data and (iii) makes possible intelligent reasoning and semantic graph mining in chemogenomics and systems chemical biology. Availability: Chem2Bio2OWL is available at http://chem2bio2rdf.org/owl. The document is available at http:// chem2bio2owl.wikispaces.com.
  • Item
    Semantic inference using chemogenomics data for drug discovery
    (BioMed Central Ltd., 2011-06-23) Wild, David J.; Lajiness, Michael S.; Ding, Ying; Challa, Sashikiran; Sun, Yuyin; Zhu, Qian
    Background: Semantic Web Technology (SWT) makes it possible to integrate and search the large volume of life science datasets in the public domain, as demonstrated by well-known linked data projects such as LODD, Bio2RDF, and Chem2Bio2RDF. Integration of these sets creates large networks of information. We have previously described a tool called WENDI for aggregating information pertaining to new chemical compounds, effectively creating evidence paths relating the compounds to genes, diseases and so on. In this paper we examine the utility of automatically inferring new compound-disease associations (and thus new links in the network) based on semantically marked-up versions of these evidence paths, rule-sets and inference engines. Results: Through the implementation of a semantic inference algorithm, rule set, Semantic Web methods (RDF, OWL and SPARQL) and new interfaces, we have created a new tool called Chemogenomic Explorer that uses networks of ontologically annotated RDF statements along with deductive reasoning tools to infer new associations between the query structure and genes and diseases from WENDI results. The tool then permits interactive clustering and filtering of these evidence paths. Conclusions: We present a new aggregate approach to inferring links between chemical compounds and diseases using semantic inference. This approach allows multiple evidence paths between compounds and diseases to be identified using a rule-set and semantically annotated data, and for these evidence paths to be clustered to show overall evidence linking the compound to a disease. We believe this is a powerful approach, because it allows compound-disease relationships to be ranked by the amount of evidence supporting them.
  • Item
    Topological Cluster Analysis Reveals the Systemic Organization of the Caenorhabditis elegans Connectome
    (PLoS, 2011-05) Jeong, Jaesung; Lee, Junho; Ahn, Yong-Yeol; Choi, Myung-Kyu; Sohn, Yunkyu
    The modular organization of networks of individual neurons interwoven through synapses has not been fully explored due to the incredible complexity of the connectivity architecture. Here we use the modularity-based community detection method for directed, weighted networks to examine hierarchically organized modules in the complete wiring diagram (connectome) of Caenorhabditis elegans (C. elegans) and to investigate their topological properties. Incorporating bilateral symmetry of the network as an important cue for proper cluster assignment, we identified anatomical clusters in the C. elegans connectome, including a body-spanning cluster, which correspond to experimentally identified functional circuits. Moreover, the hierarchical organization of the five clusters explains the systemic cooperation (e.g., mechanosensation, chemosensation, and navigation) that occurs among the structurally segregated biological circuits to produce higher-order complex behaviors.
  • Item
    LTR retroelements in the genome of Daphnia pulex
    (BioMed Central, 2010-07-09) Tang, Haixu; Lynch, Michael; Kim, Sun; Gao, Xiang; Schaack, Sarah; Rho, Mina
    Background: Long terminal repeat (LTR) retroelements represent a successful group of transposable elements (TEs)that have played an important role in shaping the structure of many eukaryotic genomes. Here, we present a genomewide analysis of LTR retroelements in Daphnia pulex, a cyclical parthenogen and the first crustacean for which the whole genomic sequence is available. In addition, we analyze transcriptional data and perform transposon display assays of lab-reared lineages and natural isolates to identify potential influences on TE mobility and differences in LTR retroelements loads among individuals reproducing with and without sex. Results: We conducted a comprehensive de novo search for LTR retroelements and identified 333 intact LTR retroelements representing 142 families in the D. pulex genome. While nearly half of the identified LTR retroelements belong to the gypsy group, we also found copia (95), BEL/Pao (66) and DIRS (19) retroelements. Phylogenetic analysis of reverse transcriptase sequences showed that LTR retroelements in the D. pulex genome form many lineages distinct from known families, suggesting that the majority are novel. Our investigation of transcriptional activity of LTR retroelements using tiling array data obtained from three different experimental conditions found that 71 LTR retroelements are actively transcribed. Transposon display assays of mutation-accumulation lines showed evidence for putative somatic insertions for two DIRS retroelement families. Losses of presumably heterozygous insertions were observed in lineages in which selfing occurred, but never in asexuals, highlighting the potential impact of reproductive mode on TE abundance and distribution over time. The same two families were also assayed across natural isolates (both cyclical parthenogens and obligate asexuals) and there were more retroelements in populations capable of reproducing sexually for one of the two families assayed. Conclusions: Given the importance of LTR retroelements activity in the evolution of other genomes, this comprehensive survey provides insight into the potential impact of LTR retroelements on the genome of D. pulex, a cyclically parthenogenetic microcrustacean that has served as an ecological model for over a century.
  • Item
    Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data
    (BioMed Central, 2010) Ding, Ying; Zhu, Qian; Wang, Huijun; Jiao, Dazhi; Dong, Xiao; Chen, Bin; Wild, David J.
    Background: Recently there has been an explosion of new data sources about genes, proteins, genetic variations, chemical compounds, diseases and drugs. Integration of these data sources and the identification of patterns that go across them is of critical interest. Initiatives such as Bio2RDF and LODD have tackled the problem of linking biological data and drug data respectively using RDF. Thus far, the inclusion of chemogenomic and systems chemical biology information that crosses the domains of chemistry and biology has been very limited Results: We have created a single repository called Chem2Bio2RDF by aggregating data from multiple chemogenomics repositories that is cross-linked into Bio2RDF and LODD. We have also created a linked-path generation tool to facilitate SPARQL query generation, and have created extended SPARQL functions to address specific chemical/biological search needs. We demonstrate the utility of Chem2Bio2RDF in investigating polypharmacology, identification of potential multiple pathway inhibitors, and the association of pathways with adverse drug reactions. Conclusions: We have created a new semantic systems chemical biology resource, and have demonstrated its potential usefulness in specific examples of polypharmacology, multiple pathway inhibition and adverse drug reaction - pathway mapping. We have also demonstrated the usefulness of extending SPARQL with cheminformatics and bioinformatics functionality.