Project description:Imputation is a powerful in silico approach for filling in missing values in large datasets. The process requires a reference panel, a collection of existing data from which the missing information can be extracted and imputed. Haplotype imputation requires ethnicity-matched references; a mismatched reference panel significantly reduces the quality of imputation. However, currently available big datasets cover only a small number of ethnicities, so ethnicity-matched references are lacking for many ethnic populations in the world, which has hampered haplotype imputation and its downstream applications. To address this issue, several approaches have been proposed and explored, including the mixed reference panel, the internal reference panel and the genotype-converted reference panel. This review article describes and compares these approaches. Increasing evidence shows that gene activity and function are dictated not by one or two genetic elements but by cis-interactions among multiple elements. Cis-interactions require the interacting elements to be on the same chromosome molecule; haplotype analysis is therefore essential for investigating cis-interactions among multiple genetic variants at different loci, and appears to be especially important for studying common diseases. It will be valuable in a wide spectrum of applications, from academic research to clinical diagnosis, prevention, treatment, and the pharmaceutical industry.
Project description:With the increasing availability of large amounts of data in the livestock domain, we face the challenge of storing, combining and analysing these data efficiently. In this study, we explored the use of a data lake for storing and analysing data to improve scalability and interoperability. Data originated from a 2-day animal experiment in which the gait score of approximately 200 turkeys was determined through visual inspection by an expert. Additionally, inertial measurement units (IMUs), a 3D-video camera and a force plate (FP) were installed to explore the effectiveness of these sensors in automating the visual gait scoring. We deployed a data lake using the IMU and FP data of a single day of that animal experiment. This encompassed data from 84 turkeys, which we preprocessed with an 'extract, transform and load' (ETL) procedure. To test the scalability of the ETL procedure, we simulated increasing volumes of the available data from this animal experiment and computed the 'wall time' (elapsed real time) for converting FP data into comma-separated files and storing these files. With a simulated dataset of 30 000 turkeys, the wall time was reduced from 1 h to less than 15 min when 12 cores were used instead of 1. This demonstrated that the ETL procedure is scalable. Subsequently, a machine learning (ML) pipeline was developed to test the potential of a data lake to automatically distinguish between two classes, that is, very bad gait scores v. other scores. In conclusion, we set up a dedicated, customized data lake, loaded data and developed a prediction model via the creation of an ML pipeline. A data lake appears to be a useful tool for facing the challenge of storing, combining and analysing increasing volumes of data of varying nature in an effective manner.
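The scalability result rests on running the ETL conversion in parallel across cores. A minimal sketch of such a parallel conversion step in Python is given below; the directory layout, file extension and column names of the raw force-plate recordings are assumptions for illustration, not the actual pipeline used in the study.

```python
import glob
import os
import time
from multiprocessing import Pool

import pandas as pd

RAW_DIR = "fp_raw"  # hypothetical directory of raw force-plate recordings
OUT_DIR = "fp_csv"  # output directory for the comma-separated files

def convert_one(path):
    """Extract-transform-load for a single raw force-plate file (assumed layout)."""
    # Extract: read the whitespace-delimited raw recording.
    df = pd.read_csv(path, sep=r"\s+", names=["time_s", "fx", "fy", "fz"])
    # Transform: keep the vertical force as a float column.
    df["fz_newton"] = df["fz"].astype(float)
    # Load: write a comma-separated file alongside the raw data.
    out_path = os.path.join(OUT_DIR, os.path.basename(path) + ".csv")
    df[["time_s", "fz_newton"]].to_csv(out_path, index=False)
    return out_path

if __name__ == "__main__":
    os.makedirs(OUT_DIR, exist_ok=True)
    files = sorted(glob.glob(os.path.join(RAW_DIR, "*.dat")))
    # Compare wall time with 1 core versus 12 cores, as in the scalability test.
    for n_cores in (1, 12):
        start = time.perf_counter()
        with Pool(processes=n_cores) as pool:
            pool.map(convert_one, files)
        print(f"{n_cores} core(s): wall time {time.perf_counter() - start:.1f} s")
```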
Project description:Antibodies have become the Swiss Army tool of molecular biology and nanotechnology. Their outstanding ability to specifically recognise molecular antigens allows their use in many different applications, from medicine to industry. Moreover, the improvement of conventional structural biology techniques (e.g., X-ray, NMR), as well as the emergence of new ones (e.g., cryo-EM), has led in recent years to a notable increase in the number of resolved antibody-antigen structures. This offers a unique opportunity to perform an exhaustive structural analysis of antibody-antigen interfaces using the large amount of data now available. To that end, different geometric and chemical descriptors were evaluated to perform a comprehensive characterization.
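One of the simplest geometric descriptors of an antibody-antigen interface is the set of residue-residue contacts within a distance cutoff. The sketch below shows how such contacts could be counted with Biopython; the input file name, chain identifiers and the 5 Å cutoff are assumptions for illustration and are not taken from the study.

```python
from Bio.PDB import PDBParser, NeighborSearch

CUTOFF = 5.0  # contact distance cutoff in angstroms (illustrative choice)

parser = PDBParser(QUIET=True)
structure = parser.get_structure("complex", "complex.pdb")  # hypothetical input file
model = structure[0]
antibody_atoms = [atom for residue in model["H"] for atom in residue]  # assumed antibody chain ID
antigen_atoms = [atom for residue in model["A"] for atom in residue]   # assumed antigen chain ID

# Index antigen atoms for fast neighbour lookup, then collect residue pairs
# with at least one atom-atom distance below the cutoff.
search = NeighborSearch(antigen_atoms)
contacts = set()
for atom in antibody_atoms:
    for neighbour in search.search(atom.coord, CUTOFF):
        contacts.add((atom.get_parent().get_id()[1],
                      neighbour.get_parent().get_id()[1]))

print(f"{len(contacts)} residue-residue contacts at {CUTOFF} angstroms")
```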
Project description:Felsenstein's application of the bootstrap method to evolutionary trees is one of the most cited scientific papers of all time. The bootstrap method, which is based on resampling and replications, is used extensively to assess the robustness of phylogenetic inferences. However, increasing numbers of sequences are now available for a wide variety of species, and phylogenies based on hundreds or thousands of taxa are becoming routine. With phylogenies of this size Felsenstein's bootstrap tends to yield very low supports, especially on deep branches. Here we propose a new version of the phylogenetic bootstrap in which the presence of inferred branches in replications is measured using a gradual 'transfer' distance rather than the binary presence or absence index used in Felsenstein's original version. The resulting supports are higher and do not induce falsely supported branches. The application of our method to large mammal, HIV and simulated datasets reveals their phylogenetic signals, whereas Felsenstein's bootstrap fails to do so.
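The central quantity in this 'transfer' bootstrap is, for each reference branch, the minimum number of taxa that would have to be transferred for a replicate tree to contain that branch; the support is then one minus the average of this distance normalised by its maximum possible value. The sketch below illustrates that computation over precomputed bipartitions (each branch represented by the set of taxa on one side); tree parsing and the refinements of the published method are deliberately omitted.

```python
def transfer_distance(ref_side, rep_side, taxa):
    """Minimum number of taxa to move so that rep_side matches ref_side."""
    comp = taxa - rep_side
    return min(len(ref_side ^ rep_side), len(ref_side ^ comp))

def transfer_support(ref_side, replicate_bipartitions, taxa):
    """Support of one reference branch: 1 - mean(min transfer distance) / (p - 1)."""
    p = min(len(ref_side), len(taxa) - len(ref_side))  # size of the lighter side
    if p <= 1:
        return 1.0  # trivial branches are always recovered
    dists = [min(transfer_distance(ref_side, b, taxa) for b in bips)
             for bips in replicate_bipartitions]
    return 1.0 - (sum(dists) / len(dists)) / (p - 1)

# Toy usage: 6 taxa, one reference branch, two bootstrap replicates.
taxa = set("ABCDEF")
ref_branch = {"A", "B", "C"}
replicates = [
    [{"A", "B", "C"}, {"E", "F"}],        # branch recovered exactly (distance 0)
    [{"A", "B"}, {"A", "B", "C", "D"}],   # closest branch is one transfer away
]
print(round(transfer_support(ref_branch, replicates, taxa), 3))  # 0.75
```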
Project description:Big Data, together with individuals' growing digital footprints on the Internet and advancing analytical techniques, represents a major opportunity to improve public health surveillance and the delivery of interventions. However, early adoption of Big Data in other fields revealed ethical challenges that could undermine the privacy and autonomy of individuals and cause stigmatization. This chapter aims to identify the benefits and risks associated with the public health application of Big Data through an ethical lens. In doing so, it highlights the need for ethical discussion and a framework for the effective utilization of these technologies. We then discuss key strategies to mitigate potentially harmful aspects of Big Data and facilitate its safe and effective implementation.
Project description:Engineering efforts are currently attempting to build devices capable of collecting neural activity from one million neurons in the brain. Part of this effort focuses on developing dense multiple-electrode arrays, which require post-processing via 'spike sorting' to extract neural spike trains from the raw signal. Gathering information at this scale will facilitate fascinating science, but these dreams are only realizable if the spike sorting procedure and data pipeline are computationally scalable, at or superior to hand processing, and scientifically reproducible. These challenges are all being amplified as the data scale continues to increase. This review discusses recent efforts to attack these challenges, which have primarily focused on increasing accuracy and reliability while remaining computationally scalable. These goals are addressed by adding stages to the data processing pipeline and by using divide-and-conquer algorithmic approaches. These recent developments should prove useful to most research groups regardless of data scale, not just for cutting-edge devices.
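A pipeline of this kind typically detects candidate spikes, reduces the extracted waveforms to a few features and clusters them, with the divide-and-conquer step applied independently per channel or time chunk before the pieces are merged. The sketch below illustrates the idea on a single simulated channel with NumPy and scikit-learn; the threshold, window length and number of clusters are illustrative assumptions rather than a production spike sorter.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
fs = 30_000                        # sampling rate (Hz), illustrative
trace = rng.normal(0, 1, fs * 10)  # 10 s of simulated noise standing in for raw data

# 1. Detection: negative threshold crossings at 4x a robust noise estimate (MAD).
mad = np.median(np.abs(trace)) / 0.6745
peaks = np.flatnonzero(trace < -4 * mad)
if peaks.size:                     # keep only crossings separated by > 1 ms
    peaks = peaks[np.insert(np.diff(peaks) > 30, 0, True)]

# 2. Waveform extraction: a 1 ms window around each detected peak.
half = 15
peaks = peaks[(peaks > half) & (peaks < len(trace) - half)]
waveforms = (np.stack([trace[p - half:p + half] for p in peaks])
             if len(peaks) else np.empty((0, 2 * half)))

# 3. Divide and conquer: feature reduction plus clustering on this chunk only;
#    a full pipeline runs the same steps per channel/time block and merges results.
if len(waveforms) >= 3:
    features = PCA(n_components=3).fit_transform(waveforms)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(features)
    print(f"{len(waveforms)} candidate spikes sorted into {labels.max() + 1} clusters")
else:
    print("too few threshold crossings in this chunk")
```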
Project description:Background: Latin America harbors some of the most biodiverse countries in the world, including Colombia. Despite the increasing use of cutting-edge genomics and bioinformatics technologies in many fields of the biological sciences around the world, the region has fallen behind in applying these approaches to biodiversity studies. In this study, we used data mining methods to search four main public databases of genetic sequences: NCBI Nucleotide and BioProject, the Pathosystems Resource Integration Center (PATRIC), and the Barcode of Life Data Systems (BOLD). We aimed to determine how much of the Colombian biodiversity is represented in the genetic data stored in these public databases and how much of this information has been generated by national institutions. Additionally, we compared these data for Colombia with those for other highly biodiverse countries in Latin America: Brazil, Argentina, Costa Rica, Mexico, and Peru. Results: In Nucleotide, we found that 66.84% of the total records for Colombia were published at the national level, and these data represent less than 5% of the total number of species reported for the country. In BioProject, 70.46% of records were generated by national institutions, and the great majority of them correspond to microorganisms. In BOLD Systems, 26% of records were submitted by national institutions, representing 258 species for Colombia; this spans approximately 0.46% of the total biodiversity reported for the country (56,343 species). Finally, in the PATRIC database, 13.25% of the reported sequences were contributed by national institutions. Colombia has better biodiversity representation in public databases than some other Latin American countries, such as Costa Rica and Peru. Mexico and Argentina have the highest representation of species at the national level, even though Brazil and Colombia hold first and second place in biodiversity worldwide. Conclusions: Our findings show gaps in the representation of Colombian biodiversity at the molecular and genetic levels in widely consulted public databases. National funding for high-throughput molecular research, the costs of NGS technologies, and access to genetic resources are limiting factors. This should be taken as an opportunity to foster the development of collaborative projects among research groups in the Latin American region to study the vast biodiversity of these countries using 'omics' technologies.
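As an illustration of the kind of database mining involved, the sketch below counts NCBI Nucleotide records associated with Colombia through the Entrez E-utilities exposed by Biopython; the search term shown is a simplified assumption and does not reproduce the queries used in the study.

```python
from Bio import Entrez

Entrez.email = "researcher@example.org"  # required by NCBI; placeholder address

# Simplified query: nucleotide records whose metadata mention Colombia.
term = "Colombia[All Fields]"
handle = Entrez.esearch(db="nucleotide", term=term, retmax=0)
result = Entrez.read(handle)
handle.close()

print(f"Nucleotide records matching '{term}': {result['Count']}")
```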
Project description:Clinical studies, especially randomized, controlled trials, are essential for generating evidence for clinical practice. However, generalizability is a long-standing concern when applying trial results to real-world patients. Generalizability assessment is therefore important but not consistently practiced. We performed a systematic review to understand the practice of generalizability assessment. We identified 187 relevant articles and systematically organized these studies in a taxonomy with three dimensions: (i) data availability (i.e., before or after the trial (a priori vs. a posteriori generalizability)); (ii) result outputs (i.e., score vs. nonscore); and (iii) populations of interest. We further reported disease areas, underrepresented subgroups, and the types of data used to profile target populations. We observed an increasing trend of generalizability assessments, but < 30% of studies reported positive generalizability results. Because a priori generalizability can be assessed using only study design information (primarily eligibility criteria), it gives investigators a golden opportunity to adjust the study design before the trial starts. Nevertheless, < 40% of the studies in our review assessed a priori generalizability. With the wide adoption of electronic health record systems, rich real-world patient databases are increasingly available for generalizability assessment; however, informatics tools are lacking to support the adoption of generalizability assessment in practice.
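In its simplest form, a priori generalizability assessment asks what fraction of a target real-world population would satisfy the trial's eligibility criteria. The sketch below illustrates this with pandas on a hypothetical patient table; the column names, criteria and cutoffs are invented for illustration and do not correspond to any specific published metric.

```python
import pandas as pd

# Hypothetical real-world cohort profiled from EHR data.
cohort = pd.DataFrame({
    "age": [67, 45, 82, 73, 59],
    "egfr": [55, 90, 28, 61, 75],           # kidney function (mL/min/1.73 m2)
    "prior_mi": [True, False, True, False, False],
})

# Hypothetical trial criteria: inclusion age 50-75; exclusion eGFR < 30.
eligible = cohort["age"].between(50, 75) & (cohort["egfr"] >= 30)

coverage = eligible.mean()
print(f"{coverage:.0%} of the real-world cohort would meet the trial criteria")
```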
Project description:Big Data is the buzzword of the modern era. With the spread of pervasive computing, we live in a data-centric environment in which we constantly leave trails of data related to our day-to-day activities. Whether it is a visit to a shopping mall or a hospital, or surfing the Internet, we create voluminous data related to credit card transactions, user details, location information, and so on. These trails of data define an individual and form the backbone of user profiling. With mobile phones giving easy access to online social networks on the go, sensor data such as geotags, together with the events and sentiments around them, add to the already overwhelming data stores. With reductions in the cost of storage and computational devices and the increasing proliferation of the Cloud, there are few constraints on storing or processing such data. We eventually end up with several exabytes of data, and analysing them for their usefulness has opened new frontiers of research. Effective distillation of these data is needed to improve the veracity of Big Data. This research targets the use of a fuzzy Bayesian process model to improve the quality of information in Big Data.
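The abstract does not specify the model, but the general idea of combining fuzzy membership with Bayesian updating can be illustrated as follows: each incoming value receives a fuzzy plausibility score, which is then used as a soft likelihood to update the probability that the record (or its source) is trustworthy. The sketch below is a heavily simplified illustration under these assumptions and is not the model proposed in this research.

```python
def plausibility(value, low, peak, high):
    """Triangular fuzzy membership: how plausible a value is for this field."""
    if value <= low or value >= high:
        return 0.0
    if value <= peak:
        return (value - low) / (peak - low)
    return (high - value) / (high - peak)

def update_trust(prior, score, strength=0.9):
    """Bayesian-style update of trustworthiness from a fuzzy score in [0, 1]."""
    like_true = strength * score + (1 - strength) * (1 - score)
    like_false = strength * (1 - score) + (1 - strength) * score
    return like_true * prior / (like_true * prior + like_false * (1 - prior))

# Toy usage: transaction amounts, with plausibility centred around 120 units.
trust = 0.5
for amount in (110, 130, 900):
    trust = update_trust(trust, plausibility(amount, 0, 120, 500))
    print(f"after amount {amount}: trust = {trust:.2f}")
```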
Project description:The Taiwan Biobank (TWB) is a biomedical research database of biopsy data from 200 000 participants. Access to this database has been granted to research communities taking part in the development of precision medicines; however, this has raised issues surrounding TWB's access to electronic medical records (EMRs). The Personal Data Protection Act of Taiwan restricts access to EMRs for purposes not covered by patients' original consent. This commentary explores possible legal solutions to help ensure that TWB's access to EMRs complies with legal obligations and with governance frameworks addressing ethical, legal, and social implications. We suggest utilizing "hash function" algorithms to create nonretrospective, anonymized data for the purpose of cross-transmission and/or linkage with EMRs.
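A minimal sketch of the kind of one-way pseudonymization this could involve is shown below: a keyed hash (HMAC-SHA-256) maps the same identifier to the same token in both datasets without the identifier being recoverable, allowing linkage without exchanging raw IDs. Key management and the surrounding governance requirements are outside the scope of this illustration, and the identifier shown is a placeholder.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-managed-secret"  # placeholder; would be held by a trusted party

def pseudonym(national_id: str) -> str:
    """One-way, keyed hash of an identifier for cross-dataset linkage."""
    return hmac.new(SECRET_KEY, national_id.encode("utf-8"), hashlib.sha256).hexdigest()

# The same ID yields the same token in the biobank and the EMR extract,
# so records can be linked without exchanging the raw identifier.
token_biobank = pseudonym("A123456789")
token_emr = pseudonym("A123456789")
assert token_biobank == token_emr
print(token_biobank[:16], "...")
```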