Hierarchical Clustering Program
for Orthologous Protein Domain Classification
DomClust is an effective tool for orthologous grouping in multiple
genomes, which is a crucial first step in large-scale comparative
genomics. The method takes as input all-against-all similarity
data and classifies genes based on the traditional hierarchical
clustering algorithm UPGMA. In the course of clustering, the method
detects domain fusion or fission events, and splits clusters into
domains if required. The subsequent procedure splits the resulting
trees such that intra-species paralogous genes are divided into
different groups so as to create plausible orthologous groups. As a
result, the procedure can split genes into the domains minimally
required for ortholog grouping.
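The clustering core described above can be illustrated with a minimal sketch of plain UPGMA (average linkage). This is not the DomClust implementation: it omits the domain fusion/fission detection and the paralog-splitting step, and it assumes the all-against-all similarities have already been converted to a complete table of pairwise distances. The function name `upgma`, the input layout, and the gene names are hypothetical.

```python
from itertools import combinations


def upgma(dist):
    """Cluster leaves by average linkage (UPGMA).

    dist maps frozenset({a, b}) -> distance for every pair of leaf names
    (assumed complete). Returns (tree, height), where tree is a nested
    tuple of leaf names and height is the height of the root node.
    """
    d = dict(dist)
    size = {name: 1 for pair in d for name in pair}
    height = 0.0
    while len(size) > 1:
        # Pick the closest pair among the current clusters.
        a, b = min(combinations(sorted(size, key=repr), 2),
                   key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2
        merged = (a, b)
        # Distance from the merged cluster to every other cluster is the
        # size-weighted average of the two old distances.
        for c in size:
            if c in (a, b):
                continue
            d[frozenset((merged, c))] = (
                size[a] * d[frozenset((a, c))]
                + size[b] * d[frozenset((b, c))]
            ) / (size[a] + size[b])
        size[merged] = size.pop(a) + size.pop(b)
    (tree,) = size
    return tree, height


# Hypothetical distances between three genes.
example = {
    frozenset(("geneA", "geneB")): 2.0,
    frozenset(("geneA", "geneC")): 8.0,
    frozenset(("geneB", "geneC")): 6.0,
}
tree, height = upgma(example)
```

In this toy input, geneA and geneB are merged first, and geneC joins them at the average of its two distances.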
DomClust outputs a set of hierarchical clustering trees,
but these trees may overlap with each other. The overlapping
trees, which are represented in the above logo,
actually result from domain fusion/fission events,
and are the salient feature of the DomClust program.
In a comparison with several clustering algorithms combined with
the conventional bidirectional best-hit (BBH) criterion,
DomClust generally showed better agreement with the COG classification.
By comparing the clustering results generated from datasets of
different releases, we also found that DomClust showed relatively
good stability in comparison to the BBH-based methods.
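For readers unfamiliar with the BBH criterion mentioned above, the following is a minimal sketch of how bidirectional best hits between two genomes can be found. It assumes a hypothetical input layout, a dictionary of similarity scores keyed by (gene in genome A, gene in genome B), and it is not taken from DomClust or from the methods it was compared with.

```python
def bidirectional_best_hits(scores):
    """Return gene pairs that are each other's best hit.

    scores maps (gene_a, gene_b) -> similarity score, where gene_a is
    from genome A and gene_b is from genome B (higher is more similar).
    """
    best_ab = {}  # best hit in genome B for each gene in genome A
    best_ba = {}  # best hit in genome A for each gene in genome B
    for (a, b), s in scores.items():
        if a not in best_ab or s > scores[(a, best_ab[a])]:
            best_ab[a] = b
        if b not in best_ba or s > scores[(best_ba[b], b)]:
            best_ba[b] = a
    # Keep only the pairs where the best hit is reciprocal.
    return sorted((a, b) for a, b in best_ab.items() if best_ba[b] == a)


# Hypothetical scores: a2's best hit is b1, but b1's best hit is a1.
example_scores = {
    ("a1", "b1"): 90,
    ("a1", "b2"): 40,
    ("a2", "b1"): 70,
    ("a2", "b2"): 30,
}
bbh_pairs = bidirectional_best_hits(example_scores)
```

Here only the pair a1/b1 is reciprocal, so a2 and b2 remain unpaired under this criterion.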
DomClust has been used for classifying hundreds of microbial
genomes in
MBGD (Microbial genome database for
comparative analysis), which currently provides the most
user-friendly interface for DomClust.
Download program
Download data
README | The readme file for the dataset
cog02.tgz | The COG02 dataset used in the DomClust paper (including all-all similarities, 65 MB).
cog03.tgz | The COG03 dataset used in the DomClust paper (including all-all similarities, 190 MB).
You can also download homology data from MBGD.
Reference
Uchiyama, I.
Hierarchical clustering algorithm for comprehensive
orthologous-domain classification in multiple genomes.
Nucleic Acids Res. 34, 647-658 (2006).
[ Download PDF ]
Please send questions and comments to: Ikuo Uchiyama
(uchiyama@nibb.ac.jp)