Graph Splitting

INTRODUCTION OF GRAPH SPLITTING (GS) METHOD

A protein superfamily contains distantly related proteins that have acquired diverse biological functions through long evolutionary history. Phylogenetic analysis of early evolution of such proteins is a key challenge in evolutionary biology, where existing phylogenetic methods show poor performance when protein sequences are too diverged to construct an informative multiple sequence alignment.

Graph Splitting (GS) method reliably and efficiently reconstructs a protein superfamily-scale phylogenetic tree that contains remote homologs (upper panel). In brief, given a set of protein sequences, GS method first calculates symmetrical relative sequence similarities by all-to-all pairwise alignments using BLAST and obtains a weighted and undirected sequence similarity graph (SSG), whose nodes and edges represent the proteins and detected similarities, respectively. Then, the node set in SSG is recursively split in a top-down manner by the spectral clustering algorithm until a tree (phylogram) is obtained.

Not only reconstructing trees accurately, any practical phylogenetic method needs to provide reliability measures on estimated branches, such as bootstrap values and posterior probabilities in the existing methods. These two measures cannot be easily obtained for a method that does not rely on MSA, because their calculation requires informative MSA whose columns are independent each other. To overcome this obstacle, we developed an edge perturbation (EP) method for statistically evaluating the branch reliability (lower panel). EP method repeatedly makes random perturbations to them. Subsequently, GS method is run using the perturbed SSGs and the proportion that each phylogenetic branch is recovered is calculated.

More detailed information can be found in our paper; Matsui M and Iwasaki W, Graph Splitting: A Graph-Based Approach for Superfamily-scale Phylogenetic Tree Reconstruction, Systematic Biology, Published online, 2019.

Schematic diagrams of Graph Splitting (GS) and Edge Perturbation (EP) methods.

GS method builds a sequence similarity graph (SSG) in which nodes and edges represent proteins and sequence similarities, respectively. The SSG is then recursively split by the spectral clustering, and its clustering hierarchy gives a phylogenetic tree (cladogram). EP method evaluates reliability of each phylogenetic branch by three steps. First, sequence similarity scores of an SSG are randomly perturbed to produce a large number of SSGs. Second, GS method is run on each tree to reconstruct a phylogenetic tree. Third, a recovery rate of each branch in the original tree is obtained.