Why do sequence alignment?

In the center panel, each cell holds the number of matches between the two sequences within a block of 3 amino acids: 3 for three consecutive matches, 2 for two out of three, and 1 for one out of three. The second matrix contains more 1's, but ignoring all values below 2 (right panel) removes these random matches.
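The windowed scoring and thresholding described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the figure; the function names `window_scores` and `apply_threshold` and the default values are assumptions chosen to mirror the text (window of 3, cutoff of 2).

```python
def window_scores(seq_a, seq_b, window=3):
    """Score each window start (i, j) by the number of matching positions."""
    rows = max(len(seq_a) - window + 1, 0)
    cols = max(len(seq_b) - window + 1, 0)
    scores = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # Count identical residues at the same offset inside the window.
            scores[i][j] = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
    return scores


def apply_threshold(scores, minimum=2):
    """Set every cell scoring below `minimum` to 0 (the 'right panel')."""
    return [[s if s >= minimum else 0 for s in row] for row in scores]


if __name__ == "__main__":
    raw = window_scores("ACDEFG", "ACDKFG")
    for row in apply_threshold(raw):
        print(" ".join(str(s) if s else "." for s in row))
```

Raising `window` and `minimum` together is one way to apply the higher stringency that DNA comparisons need.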

One can readily see a diagonal stretching across the matrix, indicating sequence similarity throughout the length of both sequences. This also works well for DNA sequences, but you must use higher stringencies; longer lengths help too.

Protein sequence of porcine submaxillary mucin compared to itself, looking for exact matches of 39 residues. You can see a diagonal of unique sequence at the termini. In between are repeats. Repeats are often detectable when comparing related sequences to each other, and dot matrix analysis makes them obvious. Virtually all the background matches have been eliminated.

Even better results can be obtained using weighted scoring matrices. Comparison of related proteins revealed that substitution of chemically similar amino acid residues was fairly common in many positions of the protein sequences. Substitution matrices quantitate these differences in substitution frequencies for use in evaluation of alignments.

PAM matrices: sequences were organized based on their presumed phylogeny, and changes from one amino acid to another by mutation were then identified. Table values are log-odds scores, so the score for adding successive characters to an alignment is a sum of logs, which corresponds to the product of the probabilities of each independent event.
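The log-odds point can be checked numerically: summing per-column log-odds scores gives the log of the product of the per-column odds. The sketch below uses made-up odds ratios purely for illustration; `log_odds` and `alignment_score` are hypothetical helper names, and base 2 is an arbitrary choice.

```python
import math


def log_odds(p_observed, p_expected, base=2.0):
    """Log-odds score of one aligned pair: log(observed / expected)."""
    return math.log(p_observed / p_expected, base)


def alignment_score(column_scores):
    """Score of an ungapped alignment: the sum of its column log-odds."""
    return sum(column_scores)


if __name__ == "__main__":
    # Illustrative odds ratios only, not values from a real PAM table.
    odds = [4.0, 0.5, 2.0]
    scores = [log_odds(o, 1.0) for o in odds]
    product = math.prod(odds)
    print(alignment_score(scores), math.log(product, 2.0))
```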

Since PAM is extrapolated from closely related sequences, it may not be the best way to look at more distant relatives. Allows alignment of short regions of sequence from very divergent proteins. These values can be weighted to account for higher rates of transition mutations than transversions. Gap costs: introducing gaps can aid in aligning sequences, but with unlimited gaps you can align any 2 sequences. Also, gaps in a protein structure mean insertion or deletion of structural elements.

Such insertions or deletions can be very deleterious to protein structure compared to amino acid substitution. In this procedure, the main caveat lies in the assumption that the accuracy measured on the seed sequences reflects the global data set well. This assumption is, however, only correct if the sequences with known structures are evenly distributed within the considered data set.

Structure-based benchmarking does not necessarily depend on a reference alignment, and alternative methods have also been designed that rely on structural superposition rather than structural superposition-induced alignments. These developments were mostly the consequence of work by Lackner [ ], who reported on situations where the structure-based superposition is ambiguous enough to support equally well several alternative sequence alignments.

When this occurs, the reference alignment becomes the arbitrary prioritization of one reference over another, thus biasing the benchmark process. Most reference benchmarks deal with this problem by specifying core regions in which the reference alignment is expected to be less ambiguous, but this procedure remains dependent on the way in which core regions are defined. A more general alternative exists that involves comparing intra-molecular distances between pairs of aligned residue pairs.

This measure, named iRMSD [ ], makes it possible to quantify the structural fit implied by an alignment without having to rely on a reference. Structural benchmarks have also been developed for RNA alignment evaluation (Figure 2). Three such benchmarks exist. It makes it possible to evaluate the accuracy of a multiple aligner on RNA sequences by considering the modeling capacity of the evaluated aligner with respect to some reference secondary structure.
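The idea of scoring an alignment by comparing intra-molecular distances, without any reference alignment, can be illustrated with a generic distance RMSD. The sketch below is not the published iRMSD definition: it is a simplified stand-in that assumes one coordinate per residue and compares all pairs of aligned residue pairs, and `distance_rmsd` is a hypothetical name.

```python
import math
from itertools import combinations


def distance_rmsd(coords_a, coords_b, aligned_pairs):
    """Root mean square difference of intra-molecular distances.

    coords_a, coords_b: lists of (x, y, z) tuples, one per residue.
    aligned_pairs: list of (i, j), residue i of A aligned to residue j of B.
    """
    squared = []
    for (i1, j1), (i2, j2) in combinations(aligned_pairs, 2):
        d_a = math.dist(coords_a[i1], coords_a[i2])  # distance within A
        d_b = math.dist(coords_b[j1], coords_b[j2])  # distance within B
        squared.append((d_a - d_b) ** 2)
    if not squared:
        raise ValueError("need at least two aligned residue pairs")
    return math.sqrt(sum(squared) / len(squared))
```

Because only internal distances are compared, the value does not change when either structure is rotated or translated, so no superposition step is needed.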

This dependence on the sequences on which the secondary structure estimation is based slightly limits its scope, as it implies common dependencies between the reference compilation and the evaluation procedure. BraliDart [ 76 ], a newer data set that is based only on structural information and contains sets of homologous RNA families with known experimental structures, has recently been reported.

This data set is limited by the relative scarcity of experimental RNA 3D structures. Another specificity of BraliDart is that it relies not on a reference structural alignment but on the structural fit implied by the sequence alignment, using a distance RMSD measure as defined by the iRMSD method. They have not been assembled for benchmarking purposes, but rather as a consequence of the importance of accurate ribosomal RNA (rRNA) alignments when estimating the tree of life.

These alignments have been done manually while taking into account highly conserved rRNA secondary structures that play critical roles in the ribosome functional capacities. At the time we write this review, no reference data set has yet been published to validate the MSAs of long non-coding RNAs, a recently described population of transcripts.

Although empirical data benchmarks are the most commonly used strategies to evaluate alignment methods, they remain limited by their dependence on structural data and the lack of such data for the evaluation of certain kinds of alignments, such as non-transcribed DNA. Furthermore, it remains to be established to what extent structure-based alignments can be expected to be evolutionarily correct. This question is especially critical considering that phylogenetic modeling is one of the main applications of MSA modeling.

A major issue with the most popular alignment methods is their systematic reliance, and possible tuning, on structurally correct sequence alignments. These methods are, however, often used to carry out phylogenetic reconstruction.

This inconsistency has long been pointed out by the evolutionary community, which routinely relies on simulated data sets rather than empirical ones [ ]. Simulated data sets rely on models mimicking evolution to generate sequences whose diversity is expected to represent a true evolutionary process. The main strength of this approach is to provide a perfectly traceable model, in which the relationship between nucleotides or amino acids is explicitly known.

Their most obvious drawback is to rely on evolutionary models assumed to be correct, while the true extent to which they represent biologically realistic scenarios remains unknown. In any case, these approaches are useful when estimating the impact of extreme conditions on modeling capacity, for instance accelerated evolution, long-branch attraction and similar effects that can confound standard analysis.

It is worth noting that whenever simulated and structure-based reference data sets have been used to validate similar algorithms for alignment accuracy, the rankings were found to differ significantly between these two groups of benchmarks, a clear indication that different alignment characteristics are being evaluated [ 4 , ].

All phylogeny-aware aligners are currently evaluated using these simulated data sets. When doing so, the evaluation is often done on tree modeling capacity rather than on the MSA itself.

Such algorithms include [ , — ]. Resolving the apparent discrepancies between structure-based and simulated reference data sets will probably require a better understanding of the complex relation between alignment accuracy and trustworthy phylogenetic reconstruction. Moving one step in this direction, Dessimoz and Gil recently introduced tree-based tests of alignment accuracy, which not only use large and representative samples of real biological data, but also enable the evaluation of the effect of gap placement on phylogenetic inference [ ].

In an unrelated work [ 35 ], Chang and coauthors proposed the use of empirical data sets obtained by enriching collections of orthologous genes in families likely to support the Tree of Life. When using such data sets, the discrepancy between phylogenetic and structural correctness appears to be less marked.

Table: MSA quality indexes and their features. Features with zero are not used by the specific quality index.

With increasingly available structural data, the systematic use of 3D information for the monitoring of MSA accuracy is slowly becoming a realistic prospect.

The first such methods [ 41 , ] were designed using the structural accuracy measured on all possible pairs of sequences with a known 3D structure as a proxy for global accuracy. Recent efforts were therefore focused toward the use of single structures to estimate MSA accuracy.

The CAO contact substitution matrix [ ] is one of the earliest works in this direction. The principle is to embed a sequence with a known structure in the MSA.

Unfortunately, the estimation of this matrix is limited by the lack of available data. This problem was addressed by the STRIKE algorithm [ ], in which the contact substitution matrix is replaced with a contact potential metric that considers the score of all potential contacts, as obtained from structural data.

When using this matrix to evaluate an MSA, column contacts (as implied by at least one embedded structure) are evaluated by summing the contact scores found in the contact log-odds matrix. This approach was shown to be significantly superior to CAO as a means to discriminate between alternative alignments. Sequence conservation is one of the most straightforward ways of estimating MSA accuracy. A large number of tools have been developed for this purpose that roughly fall in two main categories: structural i.

The evolutionary indexes aim at identifying within an MSA all positions likely to hamper phylogenetic reconstruction. These indexes are usually focused on the removal of diverse columns or indel-enriched regions.

The most commonly used tools are Gblocks [ , ] and trimAl [ ], a re-implementation of Gblocks using an automated parameterization procedure to adjust the filtering level. While these tools are extremely popular and form part of many large-scale phylogenetic pipelines, the actual value of column filtering remains a point of discussion. Two recent reports suggest that filtering could decrease MSA phylogenetic modeling potential [ 28 , 35 ].

Similar tools have been developed to estimate the structural correctness of protein MSAs. The simplest ones, like AL2CO [ ], merely measure conservation according to various physicochemical criteria.

Columns and residues eventually get assigned an index value that can be used when doing modeling. The most widely used MSA packages rely on a combination of the progressive algorithm and more or less sophisticated dynamic programming implementations, allowing pairwise alignments of sequences or profiles.

These dependencies make these algorithms inherently unstable. Over the past few years, the development of methods able to quantify this instability to estimate local reliability has become a fast growing trend.

The idea of using robustness as an indicator of biological accuracy is not new and had already been used as early as [ ] in a procedure that involved removing in turn every pair of amino acids in a pair of sequences before realigning them, so as to assess local alignment stability. Later on, the T-Coffee objective function [ ] was used to show the predictive power of consistency. In general, any procedure that may be used to perturb an alignment lends itself to the definition of a robustness index.

Such indexes can then be evaluated for their correlation with structural or phylogenetic modeling potential. The Head or Tail (HoT) procedure [ ] is a good example of a simple method (sequences are simply inverted) that yields useful information at the cost of a moderate computational overhead.
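The inversion idea can be illustrated on a toy pairwise case: align two sequences, align their reversed copies, map the second alignment back to the original coordinates, and measure how many aligned residue pairs the two results share. This is only a sketch under stated assumptions: it uses a plain Needleman-Wunsch aligner with arbitrary match, mismatch and gap values rather than a real MSA package, and `align_pairs` and `hot_agreement` are hypothetical names.

```python
def align_pairs(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns the aligned index pairs."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Traceback, preferring the diagonal move when several are optimal.
    pairs, i, j = set(), n, m
    while i > 0 and j > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            pairs.add((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return pairs


def hot_agreement(a, b):
    """Fraction of aligned pairs shared by the 'head' and 'tail' alignments."""
    head = align_pairs(a, b)
    tail_reversed = align_pairs(a[::-1], b[::-1])
    # Map reversed coordinates back to positions in the original sequences.
    tail = {(len(a) - 1 - i, len(b) - 1 - j) for i, j in tail_reversed}
    union = head | tail
    return len(head & tail) / len(union) if union else 1.0


if __name__ == "__main__":
    print(hot_agreement("ACDEFG", "ACDEFG"))
    print(hot_agreement("AAT", "AT"))
```

In the second call the gap can be placed in two equally good positions, and the tie-breaking resolves differently in each direction, which is the kind of instability a robustness index is meant to expose.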

Other similar, albeit more costly, procedures have been described. PSAR is one of them [ 97 ]. It is a method that involves generating several alternative MSAs while removing each sequence in turn. The main issue with these two approaches is their relatively high computational cost.

These methods are, however, much more informative than their sequence conservation alternatives. This review is an attempt to put in context and cover the developments that have taken place in the field of MSAs over the past decade or so. The unprecedented pace of development makes it difficult to be truly exhaustive. We have nonetheless tried to provide the reader with an overview of the main aspects, and how they connect to one another. As shown in Figure 1, the progressive alignment framework (aligning the sequences following a tree order) is the main algorithmic heuristic that has been adopted by almost all existing alignment methods.

It is also worth noting that the current inflation in the number of available methods merely reflects the growing pace of data accumulation. MSA modeling is one of the most powerful ways to make sense of biological sequences. MSAMs, by their approximate nature, are doomed to follow a red-queen evolutionary strategy and will need to keep evolving, faster and faster, to keep up with the processing of standard biological data. This review provides an overview of the development of Multiple Sequence Alignment (MSA) methods and their main applications.

MSA is one of the most powerful and widely used modeling methods in biology, and a series of algorithmic solutions has been proposed over the years for the alignment of evolutionarily related sequences, while taking into account evolutionary events such as mutations, insertions, deletions and rearrangements under certain conditions.

The main challenges for multiple sequence aligners will be to keep up with growing data set sizes and effectively deal with nucleic acid alignments. This work was supported by the Spanish Ministry of Economy and Competitiveness grant no. Currently, she is working at the Centre for Genomic Regulation, in Barcelona, Spain, conducting her doctoral studies in the field of Comparative Bioinformatics, with Dr Cedric Notredame as her supervisor.

Her main research is about designing and deploying tools and methods that will facilitate the analysis of Big Biomedical Data, allow for biological discoveries and promote personalized medicine. His research activities focus on developing and evaluating bioinformatics tools for sequencing data and comparative genomics.

Ionas Erb has a PhD in mathematics and a background in statistical physics. His work in the Center for Genomic Regulation (CRG) in Barcelona, Spain, focuses on multivariate statistical methods and their applications to the analysis of biological sequences, gene expression and behavioral data.

Nature ; : — 3. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res ; 22 : — A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives. PLoS One ; 6 : e Kemena C Notredame C. Upcoming challenges for multiple sequence alignment methods in the high-throughput era. Bioinformatics ; 25 : — Edgar RC Batzoglou S.

Multiple sequence alignment. Curr Opin Struct Biol ; 16 : — Notredame C Higgins DG. SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res ; 24 : — Hogeweg P Hesper B.

The alignment of sets of sequences and the construction of phylogenetic trees: an integrated method. J Mol Evol ; 20 : — A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol ; 48 : — T-Coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol ; : — ProbCons: probabilistic consistency-based multiple sequence alignment.

Genome Res ; 15 : — Evaluation of iterative alignment algorithms for multiple alignment. Bioinformatics ; 21 : — Edgar RC. Nucleic Acids Res ; 32 : — 7. Nucleic Acids Res ; 30 : — Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega.

Mol Syst Biol ; 7 : Saitou N Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol ; 4 : — Murtagh F. Complexities of hierarchic clustering algorithms: state of the art. Comput Stat Q ; 1 : — Multiple alignment by aligning alignments. Bioinformatics ; 23 : i — Bioinformatics ; 14 : — 4. Kececioglu JD. The maximum weight trace problem in multiple sequence alignment.

Lect Notes Comput Sci ; : — A polyhedral approach to sequence alignment problems. Discret Appl Math ; : — Sequence progressive alignment, a framework for practical large-scale probabilistic consistency alignment.

Probabilistic models of proteins and nucleic acids. Biol Seq Anal ; 14 : — MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities.

Bioinformatics ; 26 : — Cloud-Coffee: implementation of a parallel consistency-based multiple alignment algorithm in the T-Coffee package and its benchmarking on the Amazon Elastic-Cloud. Bioinformatics ; 26 : — 4. Segment-based multiple sequence alignment. Bioinformatics ; 24 : i — Epistasis as the primary factor in molecular evolution.

Nature ; : — 8. J Comput Biol ; 22 : — Rapid and accurate large-scale coestimation of sequence alignments and phylogenetic trees. Science ; : — 4. BMC Bioinformatics ; 5 : Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithms Mol Biol ; 5 : An algorithm for progressive multiple alignment of sequences with insertions.

Morrison DA. Why would phylogeneticists ignore computerized sequence alignment? Syst Biol ; 58 : — 8. Blackburne BP Whelan S. Class of multiple sequence alignment algorithm affects genomic analysis.

Mol Biol Evol ; 30 : — Markova-Raina P Petrov D. High sensitivity to aligner and high rate of false positives in the estimates of positive selection in the 12 Drosophila genomes. Genome Res ; 21 : — TCS: a new multiple sequence alignment reliability measure to estimate alignment accuracy and improve phylogenetic tree reconstruction. Mol Biol Evol ; 31 : — Rost B. Twilight zone of protein sequence alignments. Protein Eng ; 12 : 85 — Accurate multiple sequence alignment of transmembrane proteins with PSI-Coffee.

Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. Nucleic Acids Res ; 36 : — Expresso: automatic incorporation of structural information in multiple sequence alignments using 3D-Coffee. Nucleic Acids Res ; 34 : W — 8. Quantifying the relationship between sequence and three-dimensional structure conservation in RNA. BMC Bioinformatics ; 11 : Sankoff D.

Simultaneous solution of the RNA folding, alignment and protosequence problems. Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints. BMC Bioinformatics ; 7 : Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. Mathews DH. Predicting a set of minimal free energy RNA secondary structures common to two sequences. Holmes I. Accelerated probabilistic inference of RNA structure evolution.

BMC Bioinformatics ; 6 : Finding the most significant common sequence and structure motifs in a set of RNA sequences. Nucleic Acids Res ; 25 : — A fast structural multiple alignment method for long RNA sequences.

BMC Bioinformatics ; 9 : Bioinformatics ; 22 : — 9. McCaskill JS. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers ; 29 : — Murlet: a practical multiple alignment tool for structural RNA sequences. Bioinformatics ; 23 : — Siebert S Backofen R. Bioinformatics ; 21 : — 9.

Alignment of RNA base pairing probability matrices. Bioinformatics ; 20 : — 7. Inferring noncoding RNA families and classes by means of genome-scale structure-based clustering. PLoS Comput Biol ; 3 : e Multiple structural alignment and clustering of RNA sequences. RNA ; 18 : — Nucleic Acids Res ; 40 : W49 — A max-margin model for efficient simultaneous alignment and folding of RNA sequences.

Bioinformatics ; 24 : i68 — Efficient pairwise RNA structure prediction using probabilistic alignment constraints in Dynalign.

BMC Bioinformatics ; 8 : Bioinformatics ; 31 : —

Thus it is valuable to test whether such conditions also exist at the protein level by systematically comparing MSA and PSA methods. SP increases with the number of correctly aligned sequences and is used to determine the extent to which MSA methods succeed in an alignment.

CS is a binary score that shows the ability of MSA methods to align all the input sequences correctly. However, SP and CS only consider the correctly aligned residues. To overcome this limitation, an alternative approach is the Position Shift Error (PSE) score, which measures the average magnitude of error.
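For readers unfamiliar with SP and CS, the sketch below computes both against a reference alignment, using common textbook definitions that are assumed here rather than taken from this text: SP is the fraction of reference residue pairs that the test alignment also aligns, and CS is the fraction of reference columns reproduced exactly (each column either counts or does not). Alignments are assumed to be lists of equal-length gapped strings with rows in the same order; all function names are hypothetical.

```python
from itertools import combinations


def residue_columns(alignment, gap="-"):
    """For each row, list the residue index found in each column (None = gap)."""
    result = []
    for row in alignment:
        counter, mapping = 0, []
        for ch in row:
            if ch == gap:
                mapping.append(None)
            else:
                mapping.append(counter)
                counter += 1
        result.append(mapping)
    return result


def aligned_pairs(alignment):
    """All (seq s, residue, seq t, residue) pairs placed in the same column."""
    rows = residue_columns(alignment)
    pairs = set()
    for col in range(len(alignment[0])):
        for s, t in combinations(range(len(rows)), 2):
            if rows[s][col] is not None and rows[t][col] is not None:
                pairs.add((s, rows[s][col], t, rows[t][col]))
    return pairs


def column_signatures(alignment):
    """Each column described by the residue index it holds in every row."""
    rows = residue_columns(alignment)
    return {tuple(r[col] for r in rows) for col in range(len(alignment[0]))}


def sp_score(test, reference):
    ref = aligned_pairs(reference)
    if not ref:
        raise ValueError("reference alignment contains no aligned pairs")
    return len(ref & aligned_pairs(test)) / len(ref)


def cs_score(test, reference):
    ref = column_signatures(reference)
    return len(ref & column_signatures(test)) / len(ref)
```

Both scores reward only what was aligned correctly; neither says how far off a misplaced residue is, which is the gap a shift-based score is meant to fill.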

This score ensures that misalignments causing a small shift between two sequences are penalized less than those causing large shifts. Other metrics such as fD and fM have been developed to distinguish homologous regions from unrelated regions. These metrics may reflect the ability of MSA methods from a computational perspective; however, the underlying assumption is that all the input protein sequences are globally alignable, which means that only substitutions, small insertions, and deletions are considered to be the mutational events separating those protein sequences.

However, most protein benchmark datasets are grouped into different sub-datasets which contain several protein families. Each benchmark dataset contains several protein families, which can be considered as classes, and the proteins in them can be considered as samples with known class labels. Cluster validity criteria, which are quantitative measures, are suitable here [ 71 ] to evaluate the fit between the results generated by MSA or PSA methods and the correct results (the real protein family divisions).

A higher cluster validity value means the corresponding alignment method shows better performance. The Dunn index is time-consuming to compute and very sensitive to noise, since the score is closely related to the maximum and minimum distances between samples. In this paper we propose a new benchmark framework for protein sequence alignment methods based on cluster validity. In contrast to former studies, we calculated the cluster validity scores based on sequence distances directly instead of clustering results, which avoids the influence of different clustering methods and makes the comparison fairer for both MSA and PSA methods.

Results showed that PSA methods have higher cluster validity scores than MSA methods on most of the benchmark datasets. Figure 1 depicts the pipeline of the benchmark procedure carried out in this paper, which comprises four main steps as follows: (1) Data Generation; (2) Alignment Analyses; (3) Evaluation Calculation; (4) Significance Analyses. First, sequences with different class labels were combined to generate each benchmark dataset.

Then, the alignment analyses were performed using 6 MSA methods and 1 PSA method, and the aligned sequence matrices with gaps inserted were generated as outputs. Based on these, the evaluation was performed by calculating cluster validity using SW and RS scores, based on the distance calculation results. Finally, significance analyses were performed at the biological and statistical levels to determine whether the performance differences between algorithms produce essential differences in application scope.

Figure 1. Framework of this benchmark study. This benchmark study is performed following four main steps: data generation, alignments, evaluation calculation, and significance analyses.

One major difficulty in comparing alignment methods against biological backgrounds is that de-novo sequence binning relies heavily on the choice of clustering methods, which is independent of the alignment itself but greatly impacts the outcome.

To avoid this influence, we adopt a clustering-free approach in the evaluation step. Instead of creating clusters and matching them with real taxonomy, we directly evaluate how the taxa are separated by the alignment results. An optimal alignment is expected to maximally separate sequences of different families while grouping sequences of the same family together.

In this manner, the alignment quality of different algorithms is evaluated. We downloaded eight datasets from BAliBase v3. Two steps of analyses were performed on these datasets to generate eight groups of benchmark datasets as follows: (1) For each dataset, we combined protein sequences of different protein families into one file.

All the protein families and sequences were included in the analyses, except for sequences that belonged to more than one protein family, since they would cause confusion in the following steps.

The original family labels of the sequences are considered as the ground truth of the clustering results. This procedure was repeated 10 times for each dataset.

Thus eight groups of benchmark datasets were generated, and each group contained 11 datasets: one benchmark dataset downloaded from BAliBase and 10 re-sampled datasets. The details of these eight benchmark dataset groups are listed below and in Table 1.

Sequences having large internal insertions or extensions were excluded. The average number of sequences and the average sequence length The numbers of sequences in the dataset was with average sequence length It was designed to demonstrate the ability of the programs to correctly align equidistant divergent families into a single alignment.

For References 1—3, the percent identity was calculated over the homologous region only, and no sequences contain large internal insertions. The numbers of sequences in the two datasets were and with average sequence length Reference 9 including RV and RV contained full-length sequences with linear motif alignment. The average number of sequences was The details of these methods were listed as follows:.

The first stage calculated the similarity of each pair of input sequences, using k-mer counting or by constructing a global alignment of the pair, to get a triangular distance matrix, and constructed a tree based on it. After this, a progressive alignment was built. The second stage attempted to improve the tree constructed in the first step and built a new progressive alignment according to this tree. The third stage performed iterative refinement using a variant of tree-dependent restricted partitioning.
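The k-mer counting option in the first stage can be sketched as follows. This is an illustrative distance, not the exact measure used by any particular aligner: it assumes the distance is one minus the fraction of k-mers shared between the two sequences, normalized by the k-mer count of the shorter one, and the names `kmer_counts`, `kmer_distance` and `distance_matrix` are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k=3):
    """Multiset of all overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_distance(a, b, k=3):
    """1 - (shared k-mers / k-mers in the shorter sequence)."""
    shortest = min(len(a), len(b)) - k + 1
    if shortest <= 0:
        raise ValueError("sequences must be at least k residues long")
    # Counter intersection keeps the minimum count of each shared k-mer.
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return 1.0 - shared / shortest


def distance_matrix(seqs, k=3):
    """Lower-triangular matrix: row i holds distances to sequences 0..i-1."""
    return [[kmer_distance(seqs[i], seqs[j], k) for j in range(i)]
            for i in range(len(seqs))]
```

No alignment is computed at this point, which is what makes the k-mer route fast enough to build a first guide tree for large inputs.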

At the completion of each stage, a multiple alignment was available and the algorithm could be terminated. Strategies in type1 ran the fastest and strategies in type3 were the most accurate. Compared with previous versions, Clustal Omega offered a significant increase in scalability, allowing virtually any number of protein sequences to be aligned quickly with accuracy similar to that of other MSA methods.

KAlign was a global, progressive alignment method which employed an approximate string-matching algorithm to calculate sequence distances and incorporated local matches into the global alignment. It was designed to deal with large-scale sequence data quickly and accurately. To evaluate the performance of the different methods analyzed in this study, we performed the evaluation using three procedures: similarity calculation, distance calculation, and cluster validity calculation.

The input of this step was the aligned sequence matrices generated by each alignment method and the output was a cluster validity value. The details of the three procedures were as follows. Based on the similarity ID score, the distance between two protein sequences was calculated as follows:
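The exact similarity and distance formulas are not reproduced in this text, so the sketch below only illustrates the general shape of such a calculation. It assumes, purely for illustration, that the ID score is the fraction of identical positions over columns where at least one sequence has a residue, and that distance is one minus that identity; both choices and the function names are assumptions, not the authors' definitions.

```python
def identity_score(row_a, row_b, gap="-"):
    """Fraction of identical positions between two aligned rows.

    Assumption: columns where both rows have a gap are skipped, and every
    other column counts toward the denominator.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = compared = 0
    for x, y in zip(row_a, row_b):
        if x == gap and y == gap:
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def identity_distance(row_a, row_b, gap="-"):
    """Assumed conversion from similarity to distance: 1 - identity."""
    return 1.0 - identity_score(row_a, row_b, gap)
```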

Since cluster validity indices were designed to evaluate the degree of fit between the MSA or PSA alignment results and the real protein family divisions, the index should not be too sensitive to noise (as Dunn and Dunn-like indices are) and should not add to the computational burden (for example by requiring a representative point for each cluster, as many indices do).

Silhouette was used to find the partitioning that best fitted the underlying data and was not easily affected by noisy data. If a sequence alignment method produced well-clustered results, the value would be near 1. A higher silhouette value meant that intra-distances (distances within the same class) were much smaller than inter-distances (distances between different classes), which indicated that the partitioning was a good one. The Silhouette Width SW score for a partition was calculated as. RS score was used to measure the dissimilarity of clusters.
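Since the SW formula itself is missing here, the sketch below falls back on the standard silhouette definition, which is an assumption about what the authors used: for each sequence, a is its mean distance to members of its own class, b is its smallest mean distance to any other class, and the silhouette is (b - a) / max(a, b), averaged over all sequences. It works directly on a precomputed distance matrix and the known family labels, in line with the clustering-free evaluation described above; `silhouette_width` is a hypothetical name.

```python
def silhouette_width(dist, labels):
    """Mean silhouette over all samples, from a full symmetric distance matrix."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two classes")
    total = 0.0
    for i, lab_i in enumerate(labels):
        own = [j for j in clusters[lab_i] if j != i]
        if not own:
            continue  # singleton class: silhouette taken as 0 by convention
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for lab, members in clusters.items() if lab != lab_i
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)
```

With tight families that lie far apart, a is small relative to b and the value approaches 1.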

The values of RS ranged from 0 to 1. A higher RS value meant better clustering. It was calculated as RS = (SS_t - SS_w) / SS_t, where SS_t referred to the total sum of squares of the whole dataset and SS_w referred to the sum of squares within clusters.
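A possible way to compute RS from pairwise distances alone is sketched below. It uses the usual R-squared form, (SS_t - SS_w) / SS_t, and assumes that a sum of squares can be obtained from pairwise distances as the sum of squared distances divided by the group size, an identity that holds exactly for Euclidean distances and is only an approximation otherwise. Whether the authors computed it this way is not stated; `sum_of_squares` and `rs_score` are hypothetical names.

```python
def sum_of_squares(dist, members):
    """Sum of squared deviations from the centroid, via pairwise distances."""
    n = len(members)
    if n == 0:
        return 0.0
    total = 0.0
    for pos, i in enumerate(members):
        for j in members[pos + 1:]:
            total += dist[i][j] ** 2
    return total / n


def rs_score(dist, labels):
    """R-squared: share of total dispersion not explained within classes."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    ss_total = sum_of_squares(dist, list(range(len(labels))))
    if ss_total == 0:
        raise ValueError("all samples are identical; RS is undefined")
    ss_within = sum(sum_of_squares(dist, m) for m in clusters.values())
    return (ss_total - ss_within) / ss_total
```

As a quick check, four points on a line at 0, 1, 10 and 11, split into the two obvious pairs, give SS_t = 101 and SS_w = 1, so RS = 100/101.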

It should be noted that, although PSA methods are likely to produce smaller distances for a sequence pair compared with MSA methods, the above criterion is essentially fair for both types of methods.

The reason is that both the SW score and the RS score are not measured by the sole sequence distances, but by the contrasts between intra-cluster and inter-cluster distances. Although PSA achieves smaller pair-wise distances, this applies to both within-cluster and between-cluster comparisons. For each benchmark group, the cluster validity results of different alignment methods calculated on the 10 re-sampled datasets were compared using t test.

A higher p-value meant the performance of the two alignment methods was of no difference while a smaller p-value meant there were significant differences between the two alignment methods. Six hundred sixteen alignment analyses were performed in total. Esprit got the highest SW scores on all the benchmark datasets See Fig. Statistical analyses showed that all the differences between Esprit and other MSA methods were significant with small p values.

The detailed results of each benchmark group were as follows:. Cluster validation results based on SW score. Benchmark dataset reference1 was composed of two sub-datasets named RV11 and RV12 in this study which were used to reflect the abilities of alignment methods on dealing with short length sequences.

The average SW scores on the re-sampled benchmark datasets showed similar results: Esprit got the highest SW scores compared with other alignment methods in RV11 with 0. Both RV 20 and RV30 datasets contained over a thousand of sequences which made the alignment procedures time-consuming. Esprit was the best alignment method in the two datasets with SW scores 0. The average SW scores on the re-sampled benchmark datasets showed similar results: Esprit got the highest SW scores compared with other alignment methods in RV20 with 0.

Esprit got the highest SW score 0. The difference is sequences in RV cover linear motif alignment. The highest SW score is achieved by Esprit with 0.

Results based on RS scores were similar to those calculated using the SW score. Esprit got the highest RS scores in these four re-sampled benchmark datasets. This confirmed that Esprit performed the best compared with other methods, whether evaluated using SW or RS scores. The results based on RS scores showed that the performances of all methods were similar on RV11, in agreement with earlier studies showing that the resulting alignments were poor, no matter which alignment method was used, when dealing with diverse sets of sequences.

There was no statistical difference between MUSCLE default and Esprit on the re-sampled datasets of the RV11 group, thus both of them could be considered the best-performing alignment methods in this dataset group.

Considering the p value for the re-sampled datasets of this benchmark group between the two methods was not significant with p value 0.

These indicated that PSA methods may have better performance when dealing with family containing highly similar sequences and could align equidistant divergent families into a single alignment compared with MSA methods. However, this result was not significant on statistical analyses since the p value was 0. However, the big p value 0. Similar as the results of RV40 and RV50, the big p value 0. Cluster validation results based on RS score.

Considering the above results, cluster validity calculation using SW and RS scores on the datasets indicated that PSA methods perform better than MSA methods under most biological conditions.


