Phyloclustering
Section: Overview
Phylogenetic clustering (phyloclustering)
is an evolutionary Continuous Time Markov Chain (CTMC) model-based
approach that identifies population
structure from molecular data without assuming linkage equilibrium.
The goal is to use a statistical approach to find the population
structure in large numbers of sequences, which can be
SNPs, DNA sequences, codons, and so on, to cluster individuals into
subpopulations, and to identify molecular sequences representative
of those subpopulations.
It offers an approximate solution to the
NP-complete problem of estimating phylogenetic trees.
It also benefits a range of research fields:
- Virology -- identifying key sequences for disease diagnostics and
  vaccine design,
- Ecology -- detecting structure and gene flow in endangered populations
  or invasive species, and
- Human Genetics -- searching for genes associated with complex diseases,
  including potential environmental interactions.
Details and references can be found in
Method and Document.
Purpose
The major goals of phyloclustering are:
- to identify the ancestors from which sequences evolved,
- to determine population structure based on classifications,
- to reduce the impact of possible sequencing or alignment discrepancies, and
- to aggregate trustworthy sequence information.
In phyloclustering,
the similarity of sequences in a group is characterized by
mutation processes rather than nucleotide frequency.
A simple example in the table below
illustrates phyloclustering.
- The first column contains the ids of the six sequences shown
  in the second column. The third column shows the potential
  ancestors of the two groups.
  The fourth column indicates the classifications.
- The sequences in the first group have a higher chance of having mutated
  from the first ancestor than from the second ancestor.
- The two row blocks show the difference between the two possible populations
  behind the data.
- The first site of the fourth sequence is T,
  which may be a sequencing error, but the sequence can still be "rounded" to
  the first ancestor.
- Building a phylogenetic tree from the two ancestors is easier than
  building one from the six sequences. The final tree may reveal the structural
  phylogeny of the population.
| Id | Sequence    | Ancestor    | Group |
| 1  | ACGTACCATCC | AAGTCCGATGC | 1     |
| 2  | AAGTCCGATGC | AAGTCCGATGC | 1     |
| 3  | AAGTCCGATGC | AAGTCCGATGC | 1     |
| 4  | TAGTCCGATGC | AAGTCCGATGC | 1     |
| 5  | CCGGAACTACG | CCGGAACTACA | 2     |
| 6  | CCGGAACTGCA | CCGGAACTACA | 2     |
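The classification step in the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method's actual implementation: it uses the Jukes-Cantor (JC69) model as a stand-in CTMC, which the text above does not name, with an arbitrary evolutionary time t = 0.1, and it treats the two ancestors as known, whereas phyloclustering has to estimate them. The names `jc69_log_likelihood` and `classify` are hypothetical.

```python
import math

# Sequences and candidate ancestors copied from the table above.
sequences = {
    1: "ACGTACCATCC",
    2: "AAGTCCGATGC",
    3: "AAGTCCGATGC",
    4: "TAGTCCGATGC",
    5: "CCGGAACTACG",
    6: "CCGGAACTGCA",
}
ancestors = {1: "AAGTCCGATGC", 2: "CCGGAACTACA"}


def jc69_log_likelihood(ancestor, sequence, t=0.1):
    """Log-probability that `sequence` arose from `ancestor` after time t.

    Assumption: sites evolve independently under JC69, where
    P(same base) = 1/4 + 3/4 * exp(-4t/3) and
    P(one specific different base) = 1/4 - 1/4 * exp(-4t/3).
    """
    decay = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    return sum(
        math.log(p_same if a == s else p_diff)
        for a, s in zip(ancestor, sequence)
    )


def classify(sequence, t=0.1):
    """Return the id of the ancestor with the highest likelihood."""
    return max(
        ancestors,
        key=lambda g: jc69_log_likelihood(ancestors[g], sequence, t),
    )


groups = {seq_id: classify(seq) for seq_id, seq in sequences.items()}
print(groups)
```

Under these assumptions the sketch reproduces the Group column of the table: sequences 1 to 4 are assigned to the first ancestor and sequences 5 and 6 to the second. Sequence 4, despite the T at its first site, is still far more likely to have come from the first ancestor, which is the "rounding" described in the list above.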