Help - Phylogenetic reconstruction pipeline

Phylogenetic reconstruction pipeline

 The phylome pipeline mimics the steps one would take to reconstruct a phylogenetic tree: homology search, multiple sequence alignment and tree reconstruction. Small variations can be found across phylomes, depending on the year they were reconstructed or the particular needs of each phylome. A detailed description of the methodology used in each phylome can be found in the main phylome page (e.g. http://phylomedb.org/phylome_1). Here we provide a general overview of the steps taken in phylome reconstruction, which apply to all the phylomes constructed to date.

 

 

Data collection:

Each phylome is reconstructed from a fixed set of proteomes. These proteomes have been downloaded from different databases or come from sequencing projects. The source of each proteome is detailed in the main phylome page. Note that a single species can be represented by different proteome versions in different phylomes.

Phylomes are reconstructed starting from one of the proteomes, called the seed proteome. This is the proteome that is fully represented in the phylome, while the other proteomes appear in trees only when they contain sequences homologous to a seed protein. In some instances meta-phylomes are reconstructed: groups of phylomes that use the same set of proteomes but start from different seed proteomes.

 

Homology search:

For each protein encoded in the seed proteome, a Smith-Waterman search is performed against the database comprising the selected proteomes to retrieve a set of proteins with significant similarity. Results are filtered by e-value and by the percentage of overlap between the query sequence and the hit. The number of accepted hits is also capped.
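
The filtering step can be illustrated with a minimal Python sketch. It is not the actual PhylomeDB code: the `Hit` record, the `filter_hits` function, the definition of overlap as the fraction of the query covered by the alignment, and all threshold values are assumptions made for illustration. The real cutoffs for each phylome are listed in its main phylome page.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One search hit for a seed protein (hypothetical record layout)."""
    target_id: str
    evalue: float
    query_length: int    # length of the seed (query) protein
    aligned_length: int  # query residues covered by the local alignment


def filter_hits(hits, max_evalue, min_overlap, max_hits):
    """Keep significant, well-overlapping hits, best e-value first."""
    accepted = [
        h for h in hits
        if h.evalue <= max_evalue
        and h.aligned_length / h.query_length >= min_overlap
    ]
    accepted.sort(key=lambda h: h.evalue)
    # Cap the number of accepted hits.
    return accepted[:max_hits]


# Illustrative values only, not taken from any real phylome.
hits = [
    Hit("protA", 1e-50, 200, 190),
    Hit("protB", 1e-3, 200, 180),   # fails the e-value cutoff
    Hit("protC", 1e-20, 200, 60),   # fails the overlap cutoff
    Hit("protD", 1e-30, 200, 150),
]
selected = filter_hits(hits, max_evalue=1e-5, min_overlap=0.5, max_hits=150)
print([h.target_id for h in selected])
```

Sorting by e-value before applying the cap means that, when the limit is reached, the most significant hits are the ones retained.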

 

Multiple Sequence Alignments (MSA):

In older phylome versions, MSAs were built with MUSCLE, and regions with a high proportion of gaps were then removed using trimAl. In recent years this step has been modified to adopt a more robust pipeline. Sets of homologous protein sequences are now aligned using three different alignment programs chosen among MUSCLE, MAFFT, DIALIGN-TX and KALIGN. Alignments are performed in forward and reverse direction, and the six resulting alignments are combined using M-COFFEE. The combined alignment is then trimmed with trimAl, using a consistency cutoff of 0.1667 and a gap score cutoff of 0.1.
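
Two ideas from this step can be sketched in a few lines of Python: reversing the sequences for the reverse-direction runs, and removing gap-rich columns. This is an illustration only and does not call any of the alignment programs or trimAl. The function names are hypothetical, and the gap score is assumed here to be the fraction of sequences that have a residue (not a gap) in a column, so that columns scoring below the cutoff are removed. The toy example uses a cutoff of 0.5 so that the effect is visible with only four sequences.

```python
def reverse_sequences(seqs):
    """Reverse every sequence; applying it twice restores the input."""
    return {name: seq[::-1] for name, seq in seqs.items()}


def gap_scores(alignment):
    """Fraction of sequences with a residue (not '-') in each column."""
    rows = list(alignment.values())
    n_cols = len(rows[0])
    return [
        sum(row[i] != "-" for row in rows) / len(rows)
        for i in range(n_cols)
    ]


def trim_by_gap_score(alignment, cutoff):
    """Keep only the columns whose gap score is at least `cutoff`."""
    keep = [i for i, s in enumerate(gap_scores(alignment)) if s >= cutoff]
    return {
        name: "".join(row[i] for i in keep)
        for name, row in alignment.items()
    }


# Toy alignment (hypothetical sequences).
aln = {"s1": "MK-LV", "s2": "MK-L-", "s3": "M--LV", "s4": "MKAL-"}
print(gap_scores(aln))
print(trim_by_gap_score(aln, cutoff=0.5))
```

An alignment obtained from reversed sequences can be brought back to the original orientation by passing its rows through `reverse_sequences` again, since reversing a gapped row also reverses its columns.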

 

Phylogenetic reconstruction:

Maximum likelihood (ML) trees are reconstructed for each seed protein that has at least two homologs. An important step in ML reconstruction is choosing an appropriate evolutionary model. In earlier phylomes we reconstructed trees with different models (usually JTT, WAG, Blosum62 and VT) and then chose the one with the best likelihood according to the Akaike Information Criterion (AIC). Currently, for each alignment a Neighbor Joining tree is first reconstructed using scoredist distances as implemented in BioNJ. The likelihood of this topology is then computed, allowing branch-length optimisation, using seven different models (JTT, LG, WAG, Blosum62, MtREV, VT and Dayhoff), as implemented in PhyML. The evolutionary models best fitting the data are determined by comparing the likelihoods of these models according to the AIC. Maximum likelihood trees are then derived using the selected models. In all cases ML trees are reconstructed using a discrete gamma-distribution model with four rate categories plus invariant positions; the gamma parameter and the fraction of invariant positions are estimated from the data.
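
The model comparison can be sketched as follows, using the standard definition AIC = 2k - 2 ln L, where k is the number of free parameters and L the likelihood; the model with the lowest AIC fits best. The log-likelihood values, the number of sequences and the function names below are hypothetical. The parameter count assumes a tree with 10 sequences (2 x 10 - 3 = 17 branch lengths) plus the gamma parameter and the fraction of invariant positions.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: AIC = 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def rank_models(log_likelihoods, n_params):
    """Return model names sorted from best (lowest AIC) to worst."""
    return sorted(
        log_likelihoods,
        key=lambda model: aic(log_likelihoods[model], n_params[model]),
    )


# Hypothetical log-likelihoods of one NJ topology under four models.
log_likelihoods = {"JTT": -5230.4, "LG": -5198.7, "WAG": -5215.2, "VT": -5241.9}

# Assumed free parameters: 17 branch lengths + gamma shape + invariant fraction.
n_params = {model: 19 for model in log_likelihoods}

ranked = rank_models(log_likelihoods, n_params)
print(ranked)          # best-fitting model first
print(ranked[:2])      # e.g. the two models kept for the final ML trees
```

Under the assumption made here that every model has the same number of free parameters, ranking by AIC gives the same order as ranking by likelihood. The penalty term only matters when the compared models differ in their parameter count.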

 

Phylome details:

While all phylomes follow a similar pipeline, each has unique features, such as the proteomes involved in the reconstruction or small changes in the pipeline. All the information on how a phylome was reconstructed can be found in the main phylome page. To access it, follow the "All phylomes" link at the top of the page, which leads to the complete list of currently available phylomes. Clicking on a phylome name takes you to its main phylome page.

The phylome page is currently composed of three sections:

a) Pipeline description. This section gives the details of how the phylome was reconstructed, including which programs and versions were used, the relevant parameters, and how many evolutionary models were used to reconstruct the trees.

b) Species tree. One or several images of the species tree enclosing all the proteomes used in the phylome are provided in the phylome page. This information will give a visual evolutionary context for the phylome.

c) List of proteomes. This section contains information about the proteomes used in the phylome. The table includes the taxonomy ID of the species, the species name (linked to its page in NCBI Taxonomy), the proteome version, the source, and the date on which the proteome was included. The proteome version consists of two fields. The first is the code assigned to the species, which normally has 5 characters and should match the UniProt mnemonic. When UniProt has assigned no mnemonic code, the taxID is used instead. In extreme cases where a species does not have a taxID, a temporary taxID is given. The second field is the version of the proteome used in this phylome, so different human phylomes can use different human proteomes.
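
The rule for choosing the species code can be written as a short fallback chain. The sketch below is illustrative only: the `ProteomeVersion` record, the `species_code` function and the version numbers are hypothetical, and no assumption is made about how the two fields are written together in the table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProteomeVersion:
    """The two fields of a proteome version (hypothetical record)."""
    code: str      # species code: mnemonic, taxID or temporary taxID
    version: int   # version of the proteome used in a given phylome


def species_code(mnemonic: Optional[str],
                 taxid: Optional[int],
                 temporary_taxid: Optional[int] = None) -> str:
    """Pick the species code: mnemonic first, then taxID, then temporary taxID."""
    if mnemonic:
        return mnemonic
    if taxid is not None:
        return str(taxid)
    if temporary_taxid is not None:
        return str(temporary_taxid)
    raise ValueError("a mnemonic, a taxID or a temporary taxID is required")


# HUMAN is the UniProt mnemonic for Homo sapiens (taxID 9606).
# The version numbers are made up for illustration.
human_a = ProteomeVersion(species_code("HUMAN", 9606), version=1)
human_b = ProteomeVersion(species_code("HUMAN", 9606), version=2)
no_mnemonic = species_code(None, 9606)
print(human_a, human_b, no_mnemonic)
```

The two records share the same species code but differ in version, which is how two phylomes can include the same species with different proteomes.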