Project description:Coronavirus RNA genomes can be as long as 32 kilobases, and the length of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genome, the virus behind the coronavirus disease 2019 (COVID-19) pandemic, makes sequence analysis difficult. Over 20,000 sequences have been submitted to GISAID, and the number grows quickly each day, which further complicates data analysis; however, genome sequence analysis is critical for understanding COVID-19 and preventing the spread of the disease. In this study, principal component analysis (PCA) was applied to the aligned full-length genome sequences after the letters were converted to numerical values using a published method designed for protein sequence cluster analysis. The study began with a test on a shortlist of sequences; the PCA score plot showed high tolerance to low-quality data, and the major virus sequences from humans were separated from the pangolin and bat samples. We also built a model for more than 20,000 sequences that indicates potential mutation directions for COVID-19 and can serve as a pretreatment step for detailed studies such as decision tree-based methods. In summary, our study provides a fast tool for analyzing high-volume genome sequence data such as that of COVID-19, and its application to more than 20,000 sequences may provide mutation direction information for COVID-19 studies.
Project description:The spatial Principal Component Analysis (sPCA; Jombart, Heredity 101:92-103, 2008) is designed to investigate non-random spatial distributions of genetic variation. Unfortunately, the associated tests used for assessing the existence of spatial patterns (global and local tests; Heredity 101:92-103, 2008) lack statistical power and may fail to reveal existing spatial patterns. Here, we present a non-parametric test for the significance of specific patterns recovered by sPCA. We compared the performance of this new test to the original global and local tests using datasets simulated under classical population genetic models. Results show that our test outperforms the original global and local tests, exhibiting improved statistical power while retaining similar, and reliable, type I errors. Moreover, by allowing various sets of axes to be tested, it can be used to guide the selection of retained sPCA components. As such, our test represents a valuable complement to the original analysis, and should prove useful for the investigation of spatial genetic patterns.
Project description:COVID-19 is one of the worst pandemics in modern history. We applied principal component analysis (PCA) to the daily time series of COVID-19 deaths and confirmed cases for the top 25 countries from April 2020 to February 2021. We calculated the eigenvalues and eigenvectors of the cross-correlation matrix of the changes in daily accumulated data over monthly time windows. The largest eigenvalue describes the overall evolution dynamics of COVID-19 and indicates that evolution was faster in April 2020 than in any other period. Using the first two PC coefficients, we identified the group dynamics of the COVID-19 evolution. We observed groups under critical states in the loading plot and found that American and European countries form strong clusters there. The first PC plays an important role, and the correlations (C1) between the normalized logarithmic changes in deaths or confirmed cases and the first PCs may be used as indicators of different phases of COVID-19. By tracking how C1 varies over time, we identified different phases of COVID-19 in the analyzed countries over the target time period.
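The eigen-decomposition described in this abstract can be sketched as follows. This is a minimal illustration on synthetic counts, not the authors' code: the data, the single 30-day window, and all variable names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic cumulative counts: 30 days x 5 hypothetical countries (not real data).
n_days, n_countries = 30, 5
daily_new = rng.integers(50, 500, size=(n_days, n_countries))
cumulative = 1000 + np.cumsum(daily_new, axis=0)

# Logarithmic changes of the accumulated series, normalized per country.
log_changes = np.diff(np.log(cumulative), axis=0)
z = (log_changes - log_changes.mean(axis=0)) / log_changes.std(axis=0)

# Cross-correlation matrix between countries and its eigen-decomposition.
corr = np.corrcoef(z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# First PC time series, and C1: correlation of each country with the first PC.
pc1 = z @ eigenvectors[:, 0]
c1 = np.array([np.corrcoef(z[:, j], pc1)[0, 1] for j in range(n_countries)])
```

In this setup C1 for each country equals the square root of the largest eigenvalue times that country's coefficient in the first eigenvector, so repeating the calculation per monthly window would give a C1 series of the kind the abstract tracks over time.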
Project description:Background: COVID-19 has been spreading quickly, making it a serious public health threat. It is important to identify phenotypes in order to predict disease severity and design individualized treatment. Methods: We collected data from 213 COVID-19 patients in Wuhan Pulmonary Hospital from January 1 to March 30, 2020. Principal component analysis (PCA) and cluster analysis were used to classify patients. Results: We identified three distinct subgroups of COVID-19. Cluster 1 was the largest group (52.6%) and was characterized by the oldest age and the lowest cellular immune function and albumin levels. Cluster 2 contained 38.5% of subjects, and most of its lab results fell between those of Clusters 1 and 3. Cluster 3 was the smallest cluster (8.9%), characterized by the youngest age and the highest cellular immune function. The incidence of respiratory failure, acute respiratory distress syndrome (ARDS), and heart failure, and the use of non-invasive mechanical ventilation, were significantly higher in Cluster 1 than in the other clusters (P < 0.05). Cluster 1 had the highest death rate, 30.4% (P = 0.005). Although age differed significantly between Clusters 2 and 3 (P < 0.001), we found no difference in demand for medical resources. Conclusions: We identified three distinct clusters of COVID-19 patients. The results show that age alone cannot be used to assess a patient's condition. Specifically, management of albumin and immune function is important in reducing the severity of disease.
Project description:Background: Accurate inference of genetic ancestry is of fundamental interest to many biomedical, forensic, and anthropological research areas. Genetic ancestry memberships may relate to genetic disease risks. In a genome association study, failing to account for differences in genetic ancestry between cases and controls may also lead to false-positive results. Although a number of strategies for inferring and taking into account the confounding effects of genetic ancestry are available, applying them to large studies (tens of thousands of samples) is challenging. The goal of this study is to develop an approach for inferring the genetic ancestry of samples with unknown ancestry among closely related populations and to provide accurate estimates of ancestry for application to large-scale studies. Methods: In this study we developed a novel distance-based approach, Ancestry Inference using Principal component analysis and Spatial analysis (AIPS), that incorporates an Inverse Distance Weighted (IDW) interpolation method from spatial analysis to assign individuals to population memberships. Results: We demonstrate the benefits of AIPS in analyzing population substructure, specifically relative to the four most commonly used tools, EIGENSTRAT, STRUCTURE, fastSTRUCTURE, and ADMIXTURE, using genotype data from various intra-European panels and European-Americans. While these commonly used tools performed poorly in inferring ancestry from a large number of subpopulations, AIPS accurately distinguished variations between and within subpopulations. Conclusions: Our results show that AIPS can be applied to large-scale data sets to discriminate the modest variability among intra-continental populations as well as to characterize inter-continental variation. The method we developed will protect against spurious associations when mapping the genetic basis of a disease. Our approach is a more accurate and computationally efficient method for inferring genetic ancestry in large-scale genetic studies.
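As a rough illustration of how inverse distance weighting can turn PC coordinates into population memberships, the sketch below scores a query sample against labelled reference samples. It is not the AIPS implementation: the function `idw_membership`, its `power` and `eps` parameters, and the toy coordinates are all hypothetical.

```python
import numpy as np

def idw_membership(query, ref_coords, ref_labels, power=2.0, eps=1e-12):
    """Return inverse-distance-weighted membership scores per population."""
    ref_coords = np.asarray(ref_coords, dtype=float)
    ref_labels = np.asarray(ref_labels)
    # Euclidean distance from the query to every reference sample in PC space.
    distances = np.linalg.norm(ref_coords - np.asarray(query, dtype=float), axis=1)
    # Closer reference samples get larger weights; eps avoids division by zero.
    weights = 1.0 / (distances + eps) ** power
    populations = np.unique(ref_labels)
    scores = np.array([weights[ref_labels == p].sum() for p in populations])
    return dict(zip(populations.tolist(), (scores / scores.sum()).tolist()))

# Hypothetical reference samples in the space of the first two PCs.
ref_coords = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 0.9]]
ref_labels = ["A", "A", "B", "B"]
membership = idw_membership([0.2, 0.1], ref_coords, ref_labels)
assigned = max(membership, key=membership.get)
```

The normalized scores sum to one, so they can be read as soft memberships; taking the largest score gives a hard assignment.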
Project description:Although principal component analysis is frequently used in multivariate analysis, it has disadvantages when applied to experimental or diagnostic data. First, the identified principal components have poor generality; since the size and directions of the components depend on the particular data set, the components are valid only within that set. Second, the method is sensitive to experimental noise and bias between sample groups, since it cannot reflect the design of experiments; rather, it assumes equal weight and independence for all the samples in the matrix. Third, the resulting components are often difficult to interpret. To address these issues, several options were introduced to the methodology. The resulting components were scaled to unify their size unit. Also, the principal axes were identified using training data sets and shared among experiments. This training data reflects the design of experiments, and its preparation allows noise to be reduced and group bias to be removed. The effects of these options were observed in microarray experiments and showed an improvement in the separation of groups and in robustness to noise. Additionally, unknown samples were appropriately classified using pre-arranged axes, and the principal axes reflected the characteristics of groups in the experiments well. This SuperSeries is composed of the SubSeries listed below.
Project description:Motivation: Population stratification (PS) is a major confounder in genome-wide association studies (GWAS) and can lead to false-positive associations. To adjust for PS, principal component analysis (PCA)-based ancestry prediction has been widely used. Simple projection (SP) based on principal component loadings and the recently developed data augmentation, decomposition and Procrustes (ADP) transformation, such as LASER and TRACE, are popular methods for predicting PC scores. However, the predicted PC scores from SP can be biased toward NULL. On the other hand, ADP has a high computation cost because it requires running PCA separately for each study sample on the augmented dataset. Results: We develop and propose two alternative approaches: bias-adjusted projection (AP) and online ADP (OADP). Using random matrix theory, AP asymptotically estimates and adjusts for the bias of SP. OADP uses a computationally efficient online singular value decomposition algorithm, which can greatly reduce the computation cost of ADP. We carried out extensive simulation studies to show that these alternative approaches are unbiased and the computation speed can be 16 to 16 000 times faster than ADP. We applied our approaches to the UK Biobank data of 488 366 study samples with 2492 samples from the 1000 Genomes data as the reference. AP and OADP required 0.82 and 21 CPU hours, respectively, while the projected computation time of ADP was 1628 CPU hours. Furthermore, when inferring sub-European ancestry, SP clearly showed bias, unlike the proposed approaches. Availability and implementation: The OADP and AP methods, as well as SP and ADP, have been implemented in the open-source Python software FRAPOSA, available at github.com/daviddaiweizhang/fraposa. Contact: leeshawn@umich.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
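Simple projection, the baseline method in this abstract, can be sketched in a few lines: compute PC loadings from a reference panel and apply them to study samples. This is an illustrative sketch on random data, not FRAPOSA code; the matrix sizes, the choice of four PCs, and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical standardized genotype matrices: rows are samples, columns are variants.
reference = rng.normal(size=(40, 200))
study = rng.normal(size=(3, 200))

# Center with the reference means, then take the SVD of the reference panel.
ref_mean = reference.mean(axis=0)
ref_centered = reference - ref_mean
u, s, vt = np.linalg.svd(ref_centered, full_matrices=False)

n_pcs = 4
loadings = vt[:n_pcs].T                 # variants x PCs
ref_scores = u[:, :n_pcs] * s[:n_pcs]   # reference PC scores

# Simple projection: apply the reference loadings to the study samples.
study_scores = (study - ref_mean) @ loadings
```

With far more variants than reference samples, as here, the projected study scores tend to be smaller in magnitude than the reference scores even though both sets come from the same distribution. This is the shrinkage that the abstract describes as bias toward NULL.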
Project description:We consider spatially dependent functional data collected under a geostatistics setting, where locations are sampled from a spatial point process. The functional response is the sum of a spatially dependent functional effect and a spatially independent functional nugget effect. Observations on each function are made on discrete time points and contaminated with measurement errors. Under the assumption of spatial stationarity and isotropy, we propose a tensor product spline estimator for the spatio-temporal covariance function. When a coregionalization covariance structure is further assumed, we propose a new functional principal component analysis method that borrows information from neighboring functions. The proposed method also generates nonparametric estimators for the spatial covariance functions, which can be used for functional kriging. Under a unified framework for sparse and dense functional data, infill and increasing domain asymptotic paradigms, we develop the asymptotic convergence rates for the proposed estimators. Advantages of the proposed approach are demonstrated through simulation studies and two real data applications representing sparse and dense functional data, respectively.
Project description:Quality-related traits are some of the most important traits in rice, and screening and breeding rice lines with excellent quality are common ways for breeders to improve the quality of rice. In this study, we used 151 recombinant inbred lines (RILs) obtained by crossing the northern cultivated japonica rice variety ShenNong265 (SN265) with the southern indica rice variety LuHui99 (LH99) and simplified 18 common rice quality-related traits into 8 independent principal components (PCs) by principal component analysis (PCA). These PCs included peak and hot paste viscosity, chalky grain percentage and chalkiness degree, brown and milled rice recovery, width length rate, cooked taste score, head rice recovery, milled rice width, and cooked comprehensive score factors. Based on the weight ratio of each PC score, the RILs were classified into five types from excellent to poor, and five excellent lines were identified. Compared with SN265, these 5 lines showed better performance regarding the chalky grain percentage and chalkiness degree factor. Moreover, we performed QTL localization on the RIL population and identified 94 QTLs for quality-related traits that formed 6 QTL clusters. In future research, by combining these QTL mapping results, we will be using backcrossing to aggregate excellent traits and achieve quality improvement of SN265.
Project description:Nearly 30 years ago, Cavalli-Sforza et al. pioneered the use of principal component analysis (PCA) in population genetics and used PCA to produce maps summarizing human genetic variation across continental regions. They interpreted gradient and wave patterns in these maps as signatures of specific migration events. These interpretations have been controversial, but influential, and the use of PCA has become widespread in analysis of population genetics data. However, the behavior of PCA for genetic data showing continuous spatial variation, such as might exist within human continental groups, has been less well characterized. Here, we find that gradients and waves observed in Cavalli-Sforza et al.'s maps resemble sinusoidal mathematical artifacts that arise generally when PCA is applied to spatial data, implying that the patterns do not necessarily reflect specific migration events. Our findings aid interpretation of PCA results and suggest how PCA can help correct for continuous population structure in association studies.
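The wave-like artifacts can be reproduced in a toy setting. The sketch below assumes a simple model that is not taken from the paper: populations evenly spaced on a line, with covariance decaying geometrically with distance (`rho` and `n_sites` are illustrative values). PCA of that covariance yields components that oscillate like sinusoids of increasing frequency, although no migration event was simulated.

```python
import numpy as np

# Assumed model: 50 populations on a line, with covariance decaying with distance.
n_sites, rho = 50, 0.9
positions = np.arange(n_sites)
covariance = rho ** np.abs(positions[:, None] - positions[None, :])

# PCA of this spatial covariance: eigenvectors sorted by decreasing eigenvalue.
eigenvalues, eigenvectors = np.linalg.eigh(covariance)
order = np.argsort(eigenvalues)[::-1]
pcs = eigenvectors[:, order]

def sign_changes(v):
    """Count how many times consecutive entries of v change sign."""
    signs = np.sign(v)
    return int(np.sum(signs[:-1] != signs[1:]))

# Number of sign changes along the line for the first four components.
changes = [sign_changes(pcs[:, k]) for k in range(4)]
```

Here the first four components change sign 0, 1, 2 and 3 times along the line, which is the gradient-then-wave progression that appears when the components are plotted against position.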