Dataset Information

Digitising legacy zoological taxonomic literature: Processes, products and using the output.

ABSTRACT: By digitising legacy taxonomic literature using XML mark-up the contents become accessible to other taxonomic and nomenclatural information systems. Appropriate schemas need to be interoperable with other sectorial schemas, atomise to appropriate content elements and carry appropriate metadata to, for example, enable algorithmic assessment of availability of a name under the Code. Legacy (and new) literature delivered in this fashion will become part of a global taxonomic resource from which users can extract tailored content to meet their particular needs, be they nomenclatural, taxonomic, faunistic or other. To date, most digitisation of taxonomic literature has led to a more or less simple digital copy of a paper original - the output of the many efforts has effectively been an electronic copy of a traditional library. While this has increased accessibility of publications through internet access, the means by which many scientific papers are indexed and located is much the same as with traditional libraries. OCR and born-digital papers allow use of web search engines to locate instances of taxon names and other terms, but OCR efficiency in recognising taxonomic names is still relatively poor, people's ability to use search engines effectively is mixed, and many papers cannot be searched directly. Instead of building digital analogues of traditional publications, we should consider what properties we require of future taxonomic information access. Ideally the content of each new digital publication should be accessible in the context of all previous published data, and the user able to retrieve nomenclatural, taxonomic and other data / information in the form required without having to scan all of the original papers and extract target content manually. 
This opens the door to dynamic linking of new content with extant systems: automatic population and updating of taxonomic catalogues, ZooBank and faunal lists, all descriptions of a taxon and its children instantly accessible with a single search, comparison of classifications used in different publications, and so on. A means to do this is through marking up content into XML, and the more atomised the mark-up the greater the possibilities for data retrieval and integration. Mark-up requires XML that accommodates the required content elements and is interoperable with other XML schemas, and there are now several written to do this, particularly TaxPub, taxonX and taXMLit, the last of these being the most atomised. We now need to automate this process as far as possible. Manual and automatic data and information retrieval is demonstrated by projects such as INOTAXA and Plazi. As we move to creating and using taxonomic products through the power of the internet, we need to ensure the output, while satisfying in its production the requirements of the Code, is fit for purpose in the future.
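As a minimal sketch of why atomised mark-up helps retrieval, the example below parses a small, invented treatment and pulls out nomenclatural and faunistic content separately. The element names (`treatment`, `taxon_name`, `distribution` and so on), the sample taxon and the function `extract_record` are assumptions made for illustration only; they are not taken from TaxPub, taxonX or taXMLit.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified mark-up: these element names are illustrative
# and are NOT taken from TaxPub, taxonX or taXMLit.
SAMPLE = """
<treatment>
  <taxon_name rank="species">
    <genus>Aus</genus>
    <species>bus</species>
    <authority>Smith</authority>
    <year>1900</year>
  </taxon_name>
  <description>Body length 5 mm.</description>
  <distribution>
    <country>Mexico</country>
    <country>Belize</country>
  </distribution>
</treatment>
"""


def extract_record(treatment_xml):
    """Return the atomised name parts and distribution of one treatment."""
    root = ET.fromstring(treatment_xml.strip())
    name = root.find("taxon_name")
    return {
        # Nomenclatural content, e.g. for a catalogue entry
        "rank": name.get("rank"),
        "binomen": f'{name.findtext("genus")} {name.findtext("species")}',
        "authority": name.findtext("authority"),
        "year": int(name.findtext("year")),
        # Faunistic content, e.g. for a country checklist
        "countries": [c.text for c in root.findall("./distribution/country")],
    }


if __name__ == "__main__":
    print(extract_record(SAMPLE))
```

Because each name part and each country sits in its own element, a catalogue could take only the name, authority and year while a faunal list takes only the countries; with coarser mark-up the same content would have to be re-parsed from free text.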

SUBMITTER: Lyal CH

PROVIDER: S-EPMC4741221 | biostudies-other | 2016

REPOSITORIES: biostudies-other


Similar Datasets

Project description:Specimen data in taxonomic literature are among the highest quality primary biodiversity data. Innovative cybertaxonomic journals are using workflows that maintain data structure and disseminate electronic content to aggregators and other users; such structure is lost in traditional taxonomic publishing. Legacy taxonomic literature is a vast repository of knowledge about biodiversity. Currently, access to that resource is cumbersome, especially for non-specialist data consumers. Markup is a mechanism that makes this content more accessible, and is especially suited to machine analysis. Fine-grained XML (Extensible Markup Language) markup was applied to all (37) open-access articles published in the journal Zootaxa containing treatments on spiders (Order: Araneae). The markup approach was optimized to extract primary specimen data from legacy publications. These data were combined with data from articles containing treatments on spiders published in Biodiversity Data Journal where XML structure is part of the routine publication process. A series of charts was developed to visualize the content of specimen data in XML-tagged taxonomic treatments, either singly or in aggregate. The data can be filtered by several fields (including journal, taxon, institutional collection, collecting country, collector, author, article and treatment) to query particular aspects of the data. We demonstrate here that XML markup using GoldenGATE can address the challenge presented by unstructured legacy data, can extract structured primary biodiversity data which can be aggregated with and jointly queried with data from other Darwin Core-compatible sources, and show how visualization of these data can communicate key information contained in biodiversity literature. 
We complement recent studies on aspects of biodiversity knowledge using XML structured data to explore 1) the time lag between species discovery and description, and 2) the prevalence of rarity in species descriptions.

Project description:Taxonomic literature contains information about virtually every known species on Earth. In many cases, all that is known about a taxon is contained in this kind of literature, particularly for the most diverse and understudied groups. Taxonomic publications in the aggregate have documented a vast amount of specimen data. Among other things, these data constitute evidence of the existence of a particular taxon within a spatial and temporal context. When knowledge about a particular taxonomic group is rudimentary, investigators motivated to contribute new knowledge can use legacy records to guide them in their search for new specimens in the field. However, these legacy data are in the form of unstructured text, making it difficult to extract and analyze without a human interpreter. Here, we used a combination of semi-automatic tools to extract and categorize specimen data from taxonomic literature of one family of ground spiders (Liocranidae). We tested the application of these data on fieldwork optimization, using the relative abundance of adult specimens reported in literature as a proxy to find the best times and places for collecting the species (Teutamus politus) and its relatives (Teutamus group, TG) within Southeast Asia. Based on these analyses we decided to collect in three provinces in Thailand during the months of June and August. With our approach, we were able to collect more specimens of T. politus (188 specimens, 95 adults) than all the previous records in literature combined (102 specimens). Our approach was also effective for sampling other representatives of the TG, yielding at least one representative of every TG genus previously reported for Thailand. In total, our samples contributed 231 specimens (134 adults) to the 351 specimens previously reported in the literature for this country. 
Our results exemplify one application of mined literature data that allows investigators to more efficiently allocate effort and resources for the study of neglected, endangered, or interesting taxa and geographic areas. Furthermore, the integrative workflow demonstrated here shares specimen data with global online resources like Plazi and GBIF, meaning that others can freely reuse these data and contribute to them in the future. The contributions of the present study represent an increase of more than 35% on the taxonomic coverage of the TG in GBIF based on the number of species. Also, our extracted data represents 72% of the occurrences now available through GBIF for the TG and more than 85% of occurrences of T. politus. Taxonomic literature is a key source of undigitized biodiversity data for taxonomic groups that are underrepresented in the current biodiversity data sphere. Mobilizing these data is key to understanding and protecting some of the less well-known domains of biodiversity.

Project description:Understanding the influences of dispersal limitation and environmental filtering on the structure of ecological communities is a major challenge in ecology. Insight may be gained by combining phylogenetic, functional and taxonomic data to characterize spatial turnover in community structure (β-diversity). We develop a framework that allows rigorous inference of the strengths of dispersal limitation and environmental filtering by combining these three types of β-diversity. Our framework provides model-generated expectations for patterns of taxonomic, phylogenetic and functional β-diversity across biologically relevant combinations of dispersal limitation and environmental filtering. After developing the framework we compared the model-generated expectations to the commonly used "intuitive" expectation that the variance explained by the environment or by space will, respectively, increase monotonically with the strength of environmental filtering or dispersal limitation. The model-generated expectations strongly departed from these intuitive expectations: the variance explained by the environment or by space was often a unimodal function of the strength of environmental filtering or dispersal limitation, respectively. Therefore, although it is commonly done in the literature, one cannot assume that the strength of an underlying process is a monotonic function of explained variance. To infer the strength of underlying processes, one must instead compare explained variances to model-generated expectations. Our framework provides these expectations. We show that by combining the three types of β-diversity with model-generated expectations our framework is able to provide rigorous inferences of the relative and absolute strengths of dispersal limitation and environmental filtering. 
Phylogenetic, functional and taxonomic β-diversity can therefore be used simultaneously to infer processes by comparing their empirical patterns to the expectations generated by frameworks similar to the one developed here.

Project description:Current science evaluation still relies on citation performance, despite criticisms of purely bibliometric research assessments. Biological taxonomy suffers from a drain of knowledge and manpower, with poor citation performance commonly held as one reason for this impediment. But is there really such a citation impediment in taxonomy? We compared the citation numbers of 306 taxonomic and 2291 non-taxonomic research articles (2009-2012) on mosses, orchids, ciliates, ants, and snakes, using Web of Science (WoS) and correcting for journal visibility. For three of the five taxa, significant differences were absent in citation numbers between taxonomic and non-taxonomic papers. This was also true for all taxa combined, although taxonomic papers received more citations than non-taxonomic ones. Our results show that, contrary to common belief, taxonomic contributions do not generally reduce a journal's citation performance and might even increase it. The scope of many journals rarely featuring taxonomy would allow editors to encourage a larger number of taxonomic submissions. Moreover, between 1993 and 2012, taxonomic publications accumulated faster than those from all biological fields. However, less than half of the taxonomic studies were published in journals in WoS. Thus, editors of highly visible journals inviting taxonomic contributions could benefit from taxonomy's strong momentum. The taxonomic output could increase even more than at its current growth rate if: (i) taxonomists currently publishing on other topics returned to taxonomy and (ii) non-taxonomists identifying the need for taxonomic acts started publishing these, possibly in collaboration with taxonomists. Finally, considering the high number of taxonomic papers attracted by the journal Zootaxa, we expect that the taxonomic community would indeed use increased chances of publishing in WoS indexed journals. 
We conclude that taxonomy's standing in the present citation-focused scientific landscape could easily improve if the community becomes aware that there is no citation impediment in taxonomy.

Project description:The statistics of drug development output and declining yield of approved medicines has been the subject of many recent reviews. However, assessing research productivity that feeds development is more difficult. Here we utilise an extensive database of structure-activity relationships extracted from papers and patents. We have used this database to analyse published compounds cumulatively linked to nearly 4000 protein target identifiers from multiple species over the last 20 years. The compound output increases up to 2005 followed by a decline that parallels a fall in pharmaceutical patenting. Counts of protein targets have plateaued but not fallen. We extended these results by exploring compounds and targets for one large pharmaceutical company. In addition, we examined collective time course data for six individual protease targets, including average molecular weight of the compounds. We also tracked the PubMed profile of these targets to detect signals related to changes in compound output. Our results show that research compound output had decreased 35% by 2012. The major causative factor is likely to be a contraction in the global research base due to mergers and acquisitions across the pharmaceutical industry. However, this does not rule out an increasing stringency of compound quality filtration and/or patenting cost control. The number of proteins mapped to compounds on a yearly basis shows less decline, indicating the cumulative published target capacity of global research is being sustained in the region of 300 proteins for large companies. The tracking of six individual targets shows uniquely detailed patterns not discernible from cumulative snapshots. These are interpretable in terms of events related to validation and de-risking of targets that produce detectable follow-on surges in patenting. 
Further analysis of the type we present here can provide unique insights into the process of drug discovery based on the data it actually generates.

