Wednesday, October 29, 2014

Uncertainty in multiple sequence alignments

It is well known that reticulations in phylogenetic networks can reflect variation in data sets from many sources, not only gene flow during evolutionary history. These other sources are presumably unwanted in the analysis when they are due to estimation errors. Such errors include incorrect data, inappropriate sampling, and model mis-specification.

For molecular data, one of the more obvious sources of model mis-specification is an incorrect multiple sequence alignment. This reflects wrong assessments of primary homology among the characters, so that the wrong residues are aligned in the columns. This particular issue seems not to have been addressed in the network literature in any systematic way.

However, it is obviously rather important. After all, who needs a phylogenetic network that reflects mis-alignment rather than evolutionary history? One approach to this issue would be to have some sort of measurement of our confidence in the alignment columns, which could be taken into account when the network is constructed.
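As a rough illustration of how column confidence could be taken into account, the sketch below down-weights poorly supported columns when computing pairwise distances, which could then be passed to a distance-based network method. It is a minimal sketch under stated assumptions, not a published method: the function names, the toy alignment, and the confidence values are all hypothetical.

```python
from itertools import combinations


def weighted_p_distance(seq_a, seq_b, weights, gap="-"):
    """Confidence-weighted proportion of differing sites between two aligned sequences."""
    differing = 0.0
    compared = 0.0
    for a, b, w in zip(seq_a, seq_b, weights):
        if a == gap or b == gap:
            continue  # ignore sites where either sequence has a gap
        compared += w
        if a != b:
            differing += w
    return differing / compared if compared > 0 else float("nan")


def distance_matrix(alignment, weights):
    """Build a symmetric dict of weighted distances for all pairs of taxa."""
    names = list(alignment)
    dist = {(n, n): 0.0 for n in names}
    for x, y in combinations(names, 2):
        d = weighted_p_distance(alignment[x], alignment[y], weights)
        dist[(x, y)] = dist[(y, x)] = d
    return dist


# Hypothetical toy alignment and per-column confidence scores in [0, 1].
alignment = {
    "taxon1": "ACGTAC",
    "taxon2": "ACGTTC",
    "taxon3": "ACCTA-",
}
column_confidence = [1.0, 1.0, 0.9, 1.0, 0.2, 0.5]

dist = distance_matrix(alignment, column_confidence)
```

Here the only difference between taxon1 and taxon2 falls in a low-confidence column, so it contributes much less to their distance than it would in an unweighted comparison.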

One practical problem with this approach is that there has been a veritable cottage industry developing such measurements, which would need to be assessed for their suitability. So, I thought that I might list some of them here, along with a brief description of what they measure. The list is comprehensive but not necessarily exhaustive — it consists of ones for which there was at some stage a computer program (there are others that have never been named). Most of the methods are designed specifically for amino-acid sequences, so that not all of them can be used for nucleotides.

There are basically two types of measurement: (1) quantitative scoring schemes, which provide a reliability score for each aligned position, and (2) selection schemes, which select a subset of the aligned positions as being reliably aligned. So, I have divided the list roughly into these two groups.
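The difference between the two types can be sketched in a few lines. The example below is a generic illustration, not an implementation of any method in the list: it assumes a hypothetical per-column score (one minus normalized Shannon entropy, scaled by the fraction of non-gap residues), and then derives a selection scheme from it by applying an arbitrary threshold.

```python
import math
from collections import Counter


def column_scores(seqs, alphabet_size=4, gap="-"):
    """Scoring scheme: one score in [0, 1] per aligned column."""
    scores = []
    for column in zip(*seqs):
        residues = [c for c in column if c != gap]
        if not residues:
            scores.append(0.0)  # all-gap column gets the lowest score
            continue
        n = len(residues)
        counts = Counter(residues)
        entropy = -sum((k / n) * math.log(k / n) for k in counts.values())
        conservation = 1.0 - entropy / math.log(alphabet_size)
        occupancy = n / len(column)  # fraction of sequences without a gap
        scores.append(conservation * occupancy)
    return scores


def select_columns(scores, threshold=0.6):
    """Selection scheme: keep the indices of columns scoring at or above the threshold."""
    return [i for i, s in enumerate(scores) if s >= threshold]


# Hypothetical toy nucleotide alignment (four sequences, six columns).
seqs = ["ACGTAC", "ACGTTC", "ACCTA-", "ACGTG-"]

scores = column_scores(seqs)
kept = select_columns(scores)
```

With these toy sequences the three invariant, gap-free columns are kept, while the variable columns and the half-gapped final column fall below the threshold. Any scoring scheme can be turned into a selection scheme this way, but the choice of threshold is then an extra assumption.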


Dopazo J (1997) A new index to find regions showing an unexpected variability or conservation in sequence alignments. Computer Applications in the Biosciences 13: 313-317.
— evolutionary index is based on conservativeness of amino acid differences as predicted from nucleotide differences

Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG (1997) The CLUSTAL-X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research 25: 4876-4882.
— quality is based on conservativeness of amino acid differences

Notredame C, Holm L, Higgins DG (1998) COFFEE: an objective function for multiple sequence alignments. Bioinformatics 14: 407-422.
— score represents consistency among global and local alignments

Pei J, Grishin NV (2001) AL2CO: calculation of positional conservation in a protein sequence alignment. Bioinformatics 17: 700-712.
— conservation is based on weighted entropy

Redelings BD, Suchard MA (2005) Joint Bayesian estimation of alignment and phylogeny. Systematic Biology 54: 401-418.
— approximate probability that the letter is homologous to the ancestral residue in its column

Lassmann T, Sonnhammer EL (2005) Automatic assessment of alignment quality. Nucleic Acids Research 33: 7120-7128.
— consistency based on overlap of alignments from several programs

HoT score
Landan G, Graur D (2007) Heads or tails: a simple reliability check for multiple sequence alignments. Molecular Biology and Evolution 24: 1380-1383.
— measures uncertainty due to co-optimal alignments

Bradley RK, Roberts A, Smoot M, Juvekar S, Do J, Dewey C, Holmes I, Pachter L (2009) Fast Statistical Alignment. PLoS Computational Biology 5: e1000392.
— several scores based on HMM consistency, certainty, expected accuracy, expected sensitivity, expected specificity

Penn O, Privman E, Landan G, Graur D, Pupko T (2010) An alignment confidence score capturing robustness to guide tree uncertainty. Molecular Biology and Evolution 27: 1759-1767.
— robustness to guide tree uncertainty

Kim J, Ma J (2011) PSAR: measuring multiple sequence alignment reliability by probabilistic sampling. Nucleic Acids Research 39: 6359-6368.
— agreement with probabilistic sampling of suboptimal alignments

Wu M, Chatterji S, Eisen JA (2012) Accounting for alignment uncertainty in phylogenomics. PLoS One 7: e30288.
— uses a pair hidden Markov model of sequence evolution to calculate the posterior probabilities that the residues of a column are correctly aligned

Chang J-M, Di Tommaso P, Notredame C (2014) TCS: a new multiple sequence alignment reliability measure to estimate alignment accuracy and improve phylogenetic tree reconstruction. Molecular Biology and Evolution 31: 1625-1637.
— transitive consistency score is an extended version of the Coffee scoring scheme


Martin MJ, González-Candelas F, Sobrino F, Dopazo J (1995) A method for determining the position and size of optimal sequence regions for phylogenetic analysis. Journal of Molecular Evolution 41: 1128-1138.
— locates the smallest blocks with similar pairwise genetic distances to the whole alignment

Castresana J (2000) Selection of conserved blocks from multiple alignments for their use in phylogenetic analysis. Molecular Biology and Evolution 17: 540-552.
— selected blocks are based on conservation of identity

Löytynoja A, Milinkovitch MC (2001) SOAP, cleaning multiple alignments from unstable blocks. Bioinformatics 17: 573-574.
— stability is measured with respect to variation in the Clustal gap-opening and gap-extension penalties

Thompson JD, Plewniak F, Ripp R, Thierry J-C, Poch O (2001) Towards a reliable objective function for multiple sequence alignments. Journal of Molecular Biology 314: 937-951.
— normalized mean distance is based on pairwise distances

Shift score
Cline M, Hughey R, Karplus K (2002) Predicting reliable regions in protein sequence alignments. Bioinformatics 18: 306-314.
— uses information from near-optimal alignments

Lawrence CJ, Zmasek CM, Dawe RK, Malmberg RL (2004) LumberJack: a heuristic tool for sequence alignment exploration and phylogenetic inference. Bioinformatics 20: 1977-1979.
— identifies blocks whose phylogenetic tree is most similar to that of the whole alignment

Dress AW, Flamm C, Fritzsch G, Grünewald S, Kruspe M, Prohaska SJ, Stadler PF (2008) Noisy: identification of problematic columns in multiple sequence alignments. Algorithms in Molecular Biology 3: 7.
— identification of phylogenetically uninformative homoplastic sites from compatibilities in a circular split system

Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T (2009) trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics 25: 1972-1973.
— proportion of sequences with a gap, level of amino acid similarity, level of consistency across different (user-provided) alignments

Blouin C, Perry S, Lavell A, Susko E, Roger AJ (2009) Reproducing the manual annotation of multiple sequence alignments using a SVM classifier. Bioinformatics 25: 3093-3098.
— support vector machine reproduces manual annotations from other alignments

Criscuolo A, Gribaldo S (2010) BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evolutionary Biology 10: 210.
— calculates entropy-like scores weighted by similarity matrices

Kück P, Meusemann K, Dambach J, Thormann B, von Reumont BM, Wägele JW, Misof B (2010) Parametric and non-parametric masking of randomness in sequence alignments can be improved and leads to better resolved trees. Frontiers in Zoology 7: 10.
— consensus profiles identify dominating patterns of nonrandom similarity

Rajan V (2013) A method of alignment masking for refining the phylogenetic signal of multiple sequence alignments. Molecular Biology and Evolution 30: 689-712.
— compatible subsplits define clusters of sites which are then removed based on evolutionary rate

Monday, October 27, 2014

Predecessors of Charles Darwin

Charles Darwin and Alfred Russel Wallace are usually credited with independently developing the idea that natural selection could be the important process by which new species arise, although history has apportioned most of the fame to Darwin alone.

In the first edition of his most famous book Darwin (1859) cited no sources, and credited no-one except Thomas Malthus as a source of ideas. He was criticized for this, and from the third edition onwards he provided a historical essay mentioning a few more names.

The basic issue is that the idea of natural selection had been "in the air" for more than half a century, but only with respect to within-species variation. It was Darwin and Wallace who took the leap to consider between-species variation, on the basis that there is no historical boundary defining species: all individuals trace their ancestry back through a whole series of ancestors, including those who existed before the origin of their current species. That is, phylogenies trace back to the origin of life, not just to the origin of each species.

So, who were the people who published, however briefly, a comment noting the idea of within-species natural selection? Joachim Dagg, of the Natural History Apostils blog, has recently been writing a series of posts discussing many of those publications that contain a clear description of selection. Here I have provided a convenient overview, in time order, with links to Joachim's blog for those of you who want more information.

Joseph Townsend
  • (1786, republished in 1817) A Dissertation on the Poor Laws, by a Well-wisher to Mankind. London: Ridgways.
— a brief mention of selection in relation to the Poor Laws, not organic evolution, but he seems to have inspired Thomas Malthus (1798) Essay on the Principle of Population, the critical work cited by Darwin and Wallace
Link 1 - Link 2

James Hutton
  • (1794) Investigation of the Principles of Knowledge and of the Progress of Reason, from Sense to Science and Philosophy. Volume 2. Edinburgh: Strahan & Cadell. [section 13, chapter 3]
— advocated the idea of what we now call microevolution, especially in relation to agriculture, and suggested natural selection as the mechanism
Link 1

William Charles Wells
  • (1813) An Account of a White Female, Part of Whose Skin Resembles that of a Negro. [talk]
  • (1818) Two Essays: One Upon Single Vision with Two Eyes; the other on Dew. [plus] An Account of a Female of the White Race of Mankind, Part of Whose Skin Resembles that of a Negro. Edinburgh: Archibald Constable.
— a talk read before the Royal Society of London in 1813, and apparently referenced by Adams, but not put into print until 1818 — discusses selection in relation to human skin color
Link 1 - Link 2

Joseph Adams
  • (1814) A Treatise on the Supposed Hereditary Properties of Diseases. London: J. Callow.
— does not actually use the expression "selection" but briefly describes the process in relation to climate-related human variation, tucked away in the notes
Link 1 - Link 2 - Link 3

Patrick Matthew
  • (1831) On Naval Timber and Arboriculture; with Critical Notes on Authors who have Recently Treated the Subject of Planting. Edinburgh: Adam Black.
— explicitly used the phrase "natural process of selection" in relation to the origin of timber varieties, with a discussion tucked away in an appendix — as noted by Joachim Dagg, Matthew explicitly included the possible origin of new species via selection, thus being a literal predecessor of Darwin and Wallace, although they appear to have been unaware of his work
Link 1 - Link 2 - Link 3

John C. Loudon
  • (1832) [Book review of] Matthew, Patrick: On Naval Timber and Arboriculture; with Critical Notes on Authors who have recently treated the Subject of Planting. The Gardener's Magazine 8: 702-703.
— a book review mentioning Matthew's idea of natural selection (he was the only contemporary commenter to do so) and explicitly noting it as being concerned with "the origin of species and varieties"
Link 1 - Link 2

Edward Blyth
  • (1835) An attempt to classify the "varieties" of animals, with observations on the marked seasonal and other changes which naturally take place in various British species, and which do not constitute varieties. The Magazine of Natural History 8: 40-53.*
  • (1836) Observations on the various seasonal and other external changes which regularly take place in birds, more particularly in those which occur in Britain; with remarks on their great importance in indicating the true affinities of species; and upon the natural system of arrangement. The Magazine of Natural History 9: 393-409.*
  • (1837) On the psychological distinctions between man and all other animals; and the consequent diversity of human influence over the inferior ranks of creation, from any mutual or reciprocal influence exercised among the latter. The Magazine of Natural History, new series, 1: 1-9.*
— discusses the effects of artificial selection, but describes the process in nature as restoring organisms in the wild to their archetype (rather than forming new species)
Link 1

Herbert Spencer
  • (1852) A theory of population, deduced from the general law of animal fertility. Westminster Review 57: 468-501.
— published his article in order to show that the adaptedness or fitness of organisms results from the principle discussed by Malthus — Spencer later coined the expression "survival of the fittest" as a synonym of natural selection (in 1862)
Link 1

* Full title: The Magazine of Natural History and Journal of Zoology, Botany, Mineralogy, Geology, and Meteorology