Dataset Information

A dataset to facilitate automated workflow analysis.

ABSTRACT: Datasets that provide a ground truth to quantify the efficacy of automated algorithms are rare due to the time-consuming and expensive, although highly valuable, task of manually annotating observations. These datasets exist for niche problems in developed fields such as Natural Language Processing (NLP) and Business Process Mining (BPM); however, it is difficult to find a suitable dataset for use cases that span multiple fields, such as the one described in this study. The lack of established ground truth maps between cyberspace and the human-interpretable, persona-driven tasks that occur therein is one of the principal barriers preventing reliable, automated situation awareness of dynamically evolving events and the consequences of loss due to cybersecurity breaches. Automated workflow analysis, the machine-learning-assisted identification of templates of repeated tasks, is the likely missing link between semantic descriptions of mission goals and observable events in cyberspace. We summarize our efforts to establish a ground truth for an email dataset pertaining to the operation of an open source software project. The ground truth defines semantic labels for each email and the arrangement of emails within a sequence that describe actions observed in the dataset. Identified sequences are then used to define template workflows that describe the possible tasks undertaken for a project and their business process model. We present the overall purpose of the dataset, the methodology for establishing a ground truth, and lessons learned from the effort. Finally, we report on the proposed use of the dataset for the workflow discovery problem, and its effect on system accuracy.

SUBMITTER: Allard T

PROVIDER: S-EPMC6366754 | biostudies-literature | 2019

REPOSITORIES: biostudies-literature

Publications

A dataset to facilitate automated workflow analysis.

Allard T, Alvino P, Shing L, Wollaber A, Yuen J

PLoS One, 2019-02-07, issue 2

PMID: 30730921

Similar Datasets

Project description: BACKGROUND: In the past decade, transcriptome data have become an important component of many phylogenetic studies. They are a cost-effective source of protein-coding gene sequences, and have helped projects grow from a few genes to hundreds or thousands of genes. Phylogenetic studies now regularly include genes from newly sequenced transcriptomes, as well as publicly available transcriptomes and genomes. Implementing such a phylogenomic study, however, is computationally intensive, requires the coordinated use of many complex software tools, and includes multiple steps for which no published tools exist. Phylogenomic studies have therefore been manual or semiautomated. In addition to taking considerable user time, this makes phylogenomic analyses difficult to reproduce, compare, and extend. In addition, methodological improvements made in the context of one study often cannot be easily applied and evaluated in the context of other studies. RESULTS: We present Agalma, an automated tool that constructs matrices for phylogenomic analyses. The user provides raw Illumina transcriptome data, and Agalma produces annotated assemblies, aligned gene sequence matrices, a preliminary phylogeny, and detailed diagnostics that allow the investigator to make extensive assessments of intermediate analysis steps and the final results. Sequences from other sources, such as externally assembled genomes and transcriptomes, can also be incorporated in the analyses. Agalma is built on the BioLite bioinformatics framework, which tracks provenance, profiles processor and memory use, records diagnostics, manages metadata, installs dependencies, logs version numbers and calls to external programs, and enables rich HTML reports for all stages of the analysis. Agalma includes a small test data set and a built-in test analysis of these data. In addition to describing Agalma, we here present a sample analysis of a larger seven-taxon data set. Agalma is available for download at https://bitbucket.org/caseywdunn/agalma. CONCLUSIONS: Agalma allows complex phylogenomic analyses to be implemented and described unambiguously as a series of high-level commands. This will enable phylogenomic studies to be readily reproduced, modified, and extended. Agalma also facilitates methods development by providing a complete modular workflow, bundled with test data, that will allow further optimization of each step in the context of a full phylogenomic analysis.

Project description: Introduction: We define a designated data analytics workflow for the evaluation of stability experiments, which takes all data situations into account. This complements the evaluation described by the CLSI EP25 [1] guideline by including a targeted exception handling algorithm and thus allows one to automatically evaluate stability data based on linear regression analysis. Description: The evaluation of stability experiments based on regression analysis requires the calculation of the confidence interval of the regression line. The stability time is estimated at the intersection of the confidence interval with the acceptance criterion. This approach results in solving a quadratic equation, with factors that depend on the estimated intercept, slope, the measurement variability and the chosen timepoints. When defining an automated data analytics workflow for this problem, the different cases for the solutions of the quadratic equation must be considered. For some data situations there might be no solution at all, other data situations result in a negative and a positive solution, and finally there might even be two positive solutions. All these cases have to be considered when choosing the right solution as the estimated stability time. The CLSI EP25 [1] guideline on stability evaluation of in vitro diagnostic reagents addresses this problem only superficially and might even lead to incorrect results for some specific data scenarios. Results: We evaluate all possible data scenarios and provide examples for each. Based on the theoretical insights gained, we define a designated data analytics workflow and visualize it with a flowchart. By following this flowchart one can implement an automated analysis workflow, targeting all data scenarios with the appropriate exception handling. Discussion: We deduce that the description for obtaining stability times according to CLSI EP25 is not fully adequate, as it addresses only best-case scenarios. However, for automated data analytics workflows all possible data situations have to be considered. With the workflow presented here, one can program automated data analytics pipelines that ensure the right stability time is obtained, if it exists. In addition, all exceptions where no stability time exists are handled appropriately, and the workflow provides hints as to the reason for the failure.

Project description: Background: Whole genome duplication (WGD) events are common in the evolutionary history of many living organisms. For decades, researchers have been trying to understand the genetic and epigenetic impact of WGD and its underlying molecular mechanisms. Particular attention was given to allopolyploid study systems, species resulting from a hybridization event accompanied by WGD. Investigating the mechanisms behind the survival of a newly formed allopolyploid highlighted the key role of DNA methylation. With the improvement of high-throughput methods, such as whole genome bisulfite sequencing (WGBS), an opportunity opened to further understand the role of DNA methylation at a larger scale and higher resolution. However, only a few studies have applied WGBS to allopolyploids, which might be due to a lack of genomic resources combined with a burdensome data analysis process. To overcome these problems, we developed the Automated Reproducible Polyploid EpiGenetic GuIdance workflOw (ARPEGGIO): the first workflow for the analysis of epigenetic data in polyploids. This workflow analyzes WGBS data from allopolyploid species via the genome assemblies of the allopolyploid's parent species. ARPEGGIO utilizes an updated read classification algorithm (EAGLE-RC) to tackle the challenge of sequence similarity amongst parental genomes. ARPEGGIO offers automation, but more importantly, a complete set of analyses including spot checks starting from raw WGBS data: quality checks, trimming, alignment, methylation extraction, statistical analyses and downstream analyses. A full run of ARPEGGIO outputs a list of genes showing differential methylation. ARPEGGIO was made simple to set up, run and interpret, and its implementation ensures reproducibility by including both package management and containerization. Results: We evaluated ARPEGGIO in two ways. First, we tested EAGLE-RC's performance with publicly available datasets given a ground truth, and we show that EAGLE-RC decreases the error rate by 3 to 4 times compared to standard approaches. Second, using the same initial dataset, we show agreement between ARPEGGIO's output and published results. Compared to other similar workflows, ARPEGGIO is the only one supporting polyploid data. Conclusions: The goal of ARPEGGIO is to promote, support and improve polyploid research with a reproducible and automated set of analyses in a convenient implementation. ARPEGGIO is available at https://github.com/supermaxiste/ARPEGGIO .
