
Dataset Information


Principal Component Analysis applied directly to Sequence Matrix.


ABSTRACT: Sequence data are now widely used to observe relationships among organisms. However, understanding the structure of such qualitative data is challenging. Conventionally, the relationships are analysed using a dendrogram, which estimates a tree shape. With this approach it is difficult to verify that a tree shape is appropriate; horizontal gene transfer and mating can instead give the relationships the shape of a network. As a connection-free approach, principal component analysis (PCA) is used to summarize the distance matrix, which records the distance between each pair of samples. However, this approach is limited in how it treats sequence motif information: distances caused by different motifs are mixed together, which hides clues as to how the samples differ. Because any base may change independently, a sequence is essentially multivariate data. Hence, the differences among samples and the bases that contribute to those differences should be observed simultaneously. To achieve this, the sequence matrix is converted to Boolean vectors and analysed directly using PCA. The effects are confirmed on the diversity of Asiatic lions and humans as well as on environmental DNA. The resolution of samples and the robustness of the calculation are improved, and the relationship between a direction of difference and the causative nucleotides becomes obvious at a glance.
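
The core idea in the abstract (encode each aligned base as Boolean indicators, then run PCA on that matrix instead of on a distance matrix) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `one_hot_encode` and `pca`, the four toy sequences, and the choice to centre on the column means with no further scaling are all assumptions made for illustration.

```python
import numpy as np

BASES = "ACGT"  # assumed alphabet; gaps and ambiguous codes are left all-False


def one_hot_encode(seqs):
    """Convert aligned sequences into a samples x (positions * 4) Boolean matrix."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to equal length")
    mat = np.zeros((len(seqs), length * len(BASES)), dtype=bool)
    for i, seq in enumerate(seqs):
        for j, base in enumerate(seq.upper()):
            k = BASES.find(base)
            if k >= 0:
                mat[i, j * len(BASES) + k] = True
    return mat


def pca(bool_matrix):
    """PCA via SVD of the column-centred matrix (centring choice is an assumption)."""
    x = bool_matrix.astype(float)
    x -= x.mean(axis=0)
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    scores = u * s      # one row per sample: position along each principal axis
    loadings = vt.T     # one row per (position, base) indicator column
    return scores, loadings, s


# Hypothetical toy alignment, for illustration only.
seqs = ["ACGTAC", "ACGTAC", "ACCTAC", "TCCTAC"]
bools = one_hot_encode(seqs)
scores, loadings, singular_values = pca(bools)
```

Because each row of `loadings` corresponds to one base at one alignment position, the columns that drive a given axis can be read off directly, which is how sample differences and the nucleotides behind them can be inspected together.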

SUBMITTER: Konishi T 

PROVIDER: S-EPMC6917774 | biostudies-literature | 2019 Dec

REPOSITORIES: biostudies-literature


Publications

Principal Component Analysis applied directly to Sequence Matrix.

Konishi Tomokazu, Matsukuma Shiori, Fuji Hayami, Nakamura Daiki, Satou Nozomi, Okano Kunihiro

Scientific Reports, 2019-12-17, issue 1



Similar Datasets

| S-EPMC7999099 | biostudies-literature
| S-EPMC7804214 | biostudies-literature
2011-08-15 | GSE31375 | GEO
| S-EPMC4928327 | biostudies-literature
| S-EPMC9579216 | biostudies-literature
| S-EPMC4383722 | biostudies-literature
| S-EPMC4721272 | biostudies-literature
| S-EPMC3131008 | biostudies-literature
| S-EPMC2835171 | biostudies-literature
| S-EPMC4274615 | biostudies-literature