Monday, February 20, 2017

Producing admixture graphs

I have written before about admixture graphs, which are phylogenetic networks that represent reticulations due to introgression.
To date, these graphs have not really been incorporated into the mainstream network literature. Part of the problem has been the rather disparate nature of the admixture literature itself. A paper has recently appeared as a preprint in Bioinformatics that provides a brief introduction to this situation:
  • Kalle Leppälä, Svend Vendelbo Nielsen, Thomas Mailund (2017) admixturegraph: an R package for admixture graph manipulation and fitting. Bioinformatics

There are currently several quite different programs for producing admixture graphs:
  • qpgraph (Castelo and Roverato 2006)
  • TreeMix (Pickrell and Pritchard 2012)
  • AdmixTools (Patterson et al. 2012)
  • MixMapper (Lipson et al. 2013)
  • admixturegraph (see above)
These programs summarize the genetic data in different ways based on genetic drift (e.g. the covariance matrix versus so-called f statistics), and construct the graphs in different ways (e.g. sequential heuristic building versus a user-specified graph). There are also different ways to evaluate the graphs (such as fitting the graph parameters using likelihood) and to compare them (such as the bootstrap, the jackknife, and MCMC).
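For readers who have not met the f statistics, here is a minimal Python sketch of the three usual ones, computed directly from population allele frequencies. The function names and the frequencies are my own made-up illustration, not taken from any of the programs listed above, and the sketch ignores the finite-sample bias corrections that real software applies.

```python
def _mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def f2(a, b):
    """f2(A, B): mean squared allele-frequency difference between A and B."""
    return _mean((x - y) ** 2 for x, y in zip(a, b))


def f3(c, a, b):
    """f3(C; A, B): mean of (c - a)(c - b) across loci.

    A clearly negative value is usually read as evidence that C is admixed.
    """
    return _mean((z - x) * (z - y) for z, x, y in zip(c, a, b))


def f4(a, b, c, d):
    """f4(A, B; C, D): mean of (a - b)(c - d) across loci."""
    return _mean((w - x) * (y - z) for w, x, y, z in zip(a, b, c, d))


# Hypothetical allele frequencies at four SNPs (illustration only).
pop_a = [0.10, 0.80, 0.30, 0.60]
pop_b = [0.90, 0.20, 0.70, 0.40]
pop_c = [0.50, 0.50, 0.50, 0.50]  # exactly midway between A and B

print("f2(A, B)    =", f2(pop_a, pop_b))
print("f3(C; A, B) =", f3(pop_c, pop_a, pop_b))
print("f4(A, B; A, B) =", f4(pop_a, pop_b, pop_a, pop_b))
```

In this toy example population C sits midway between A and B at every locus, so f3(C; A, B) comes out negative. Note also that f4(A, B; A, B) is just f2(A, B), which shows how the statistics relate to one another.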

None of this is ideal. Another problem has been that the graphs are often constructed by hand, and may be needed as input to the programs. However, the biggest limitation is that there are currently no algorithms for inferring the optimal graph topology. This is, of course, the basic problem that needs to be solved for all network construction. To quote the authors with regard to their own R package:
The set of all possible graphs, even when limited to one or two admixture events, grows super-exponentially in the number of leaves, and it is generally not computationally feasible to explore this set exhaustively. Still, we give graph libraries for searching through all possible topologies with not too many leaves and admixture events.
For larger graphs we provide functions for exploring all possible graphs that can be reached from a given graph by adding one extra admixture event or by adding one additional leaf. However, the best fitting admixture graphs are not necessarily extensions of best fitting smaller graphs, so we recommend that users not only expand the best smaller graph but a selected few best of them.
The world of graph-edge rearrangements (NNI, SPR) does not yet seem to have encountered the world of admixture graphs.
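To get a feel for the growth that the authors describe, consider only the graphs with no admixture events at all, which are just rooted binary trees. The number of these on n labelled leaves is the well-known double factorial (2n - 3)!!, which is therefore a lower bound on the number of admixture graphs. The sketch below is my own illustration of that count, and is not how the R package enumerates its graph libraries.

```python
def rooted_binary_trees(n_leaves: int) -> int:
    """Number of rooted binary tree topologies on n labelled leaves.

    This is the double factorial (2n - 3)!! = 3 * 5 * ... * (2n - 3),
    taken to be 1 when n is 1 or 2.
    """
    if n_leaves < 1:
        raise ValueError("need at least one leaf")
    count = 1
    for odd in range(3, 2 * n_leaves - 2, 2):
        count *= odd
    return count


for n in (3, 4, 5, 10, 20):
    print(n, rooted_binary_trees(n))
```

Adding even one admixture event multiplies these numbers further, which is why exhaustive search is feasible only for small numbers of leaves.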

Tuesday, February 14, 2017

The evolution of women's clothing sizes

Several years ago I presented a piece about the Evolutionary history of Mazda motor cars, in which I pointed out that what is known in biology as Cope's Rule of phyletic size increase applies to manufactured objects as well as to biological organisms. This "rule" suggests that the size of the organisms within a species generally increases through evolutionary time. Human beings, for example, are on average larger now than they were a few thousand years ago. Furthermore, through time, new species arise to occupy the niches that have been vacated (because the previous organisms are now too big to fit).

This situation is easy to demonstrate for cars, because all successful car models get bigger through time — the customers indicate that the car is not quite big enough, and the manufacturer responds. Another simple example is women's clothing, which I will discuss here.

Women's clothing changes through time in response to two factors in the modern world: changes in the "desired" image of women (as discussed in the post on Changes in Playboy's women through 60 years), and increasing obesity in western society (see the post on Fast food and diet). Illustrating Cope's Rule in this case is thus easy.

There have been five voluntary "standards" developed over the past century for standardized clothing sizes in the USA, as discussed in Wikipedia. These standards describe, for example, which body dimensions a woman should have in order to fit into a Size 12. There is nothing mandatory about these standards, and they simply reflect societal recommendations at any given time. So, a Size 12 in 1958 is not the same as a Size 12 in 2008.

These three graphs illustrate the time course of the changes in each of the defined clothing sizes (Size 0 to Size 20), in terms of three female girth measurements.

This is blatantly Cope's Rule in all three cases. All of the sizes get bigger through time, at approximately the same rate. Furthermore, as the dimensions increase through time, new sizes appear to fit the smaller women: Size 8 did not exist in 1931, Size 6 did not exist in 1958, Sizes 2 and 4 did not exist in 1971, and Sizes 0 and 00 did not exist in 1995.

To put it another way, a Size 12 woman today is much larger than her Size 12 mother was, who in turn was bigger than the Size 12 grandmother. I believe that this is referred to in the clothing business as "vanity sizing", which it may well be, but it is also a natural example of Cope's Rule of phyletic size increase.

Finally, there is no reason to expect that this phyletic size increase will stop any time soon. Do cars or clothes have an upper limit on their size? Biological organisms do, mainly because of the effect of gravity, and so the phyletic size increase either ceases or the species becomes extinct. Manufactured objects are different.

Data sources
  • DuBarry / Woolworth (1931-1955) - see Wikipedia
  • National Institute of Standards and Technology (1958) Commercial Standard CS215-58: Body Measurements for the Sizing of Women's Patterns and Apparel Table 4
  • National Institute of Standards and Technology (1971) Commercial Standard PS42-70: Body Measurements for the Sizing of Women's Patterns and Apparel Table 4
  • ASTM International (1995, revised 2001) Standard D5585-95 (R2001)
  • ASTM International (2011) Active Standard D5585-11e1: Standard Tables of Body Measurements for Adult Female Misses Figure Type, Size Range 00–20

Tuesday, February 7, 2017

Networks, trees and sequence polymorphisms

One of the more obvious bits of evidence that an organismal history may not be entirely tree-like is the presence of sequence polymorphisms. For example, intra-individual site polymorphisms in ITS sequences create considerable conflict in a dataset, if we try to construct a tree-like phylogeny.

This means that people have adopted a range of strategies to try to get a nice neat tree out of their data. This topic is briefly reviewed in this recent paper:
Agnes Scheunert and Günther Heubl (2017) Against all odds: reconstructing the evolutionary history of Scrophularia (Scrophulariaceae) despite high levels of incongruence and reticulate evolution. Organisms Diversity and Evolution in press.
The authors discuss the following strategies, for which they also provide a few literature references.

1. Delete the offending taxa

Pruning the offending taxa is among the most-used tactics. This deletes part of the phylogeny, of course.

2. Delete the polymorphisms

Excluding the polymorphic alignment positions is probably the most common tactic. Similar strategies include the replacement of the polymorphisms with either a missing data code or the most common nucleotide at that position. All of these ideas resolve the polymorphisms in favor of the strongest phylogenetic signal, and thus sweep the conflicting signals under the carpet.
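As a minimal sketch of what this strategy does to the data, the following Python functions either mask every polymorphic position with a missing-data code or drop every alignment column that contains a polymorphism. The function names, the use of "N" as the missing-data code, and the toy alignment are my own assumptions for illustration. I take "polymorphism" here to mean any IUPAC ambiguity code for two or three bases.

```python
# IUPAC ambiguity codes that stand for two or three nucleotides.
AMBIGUITY_CODES = set("RYSWKMBDHV")


def mask_polymorphisms(seq, missing="N"):
    """Replace each ambiguity code in a sequence with a missing-data code."""
    return "".join(
        missing if base.upper() in AMBIGUITY_CODES else base for base in seq
    )


def drop_polymorphic_columns(alignment):
    """Remove every column in which at least one accession is polymorphic.

    `alignment` maps accession names to aligned sequences of equal length.
    """
    names = list(alignment)
    columns = zip(*(alignment[name] for name in names))
    kept = [
        col for col in columns
        if not any(base.upper() in AMBIGUITY_CODES for base in col)
    ]
    return {
        name: "".join(col[i] for col in kept) for i, name in enumerate(names)
    }


# Hypothetical toy alignment (illustration only).
toy = {"acc1": "ACGTY", "acc2": "ACRTC", "acc3": "ACGTC"}

print({name: mask_polymorphisms(seq) for name, seq in toy.items()})
print(drop_polymorphic_columns(toy))
```

Either way, the information carried by the R and the Y in the toy alignment is simply gone from the analysis.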

3. Select single gene copies

The polymorphisms become apparent because there are multiple copies of the gene(s) concerned, and therefore selecting a single copy removes the polymorphisms. This can be done by cloning the gene (at the time of data collection), or by statistical haplotype phasing methods (during the data analysis). This also sweeps the conflicting signals under the carpet.

4. Code the polymorphisms

As a preferred alternative, rather than discarding or substituting the polymorphisms, we could include them as phylogenetically informative characters. This would allow the construction of a phylogenetic network, as well as a tree-like history.

One possibility, suggested by Fuertes Aguilar and Nieto Feliner (2003), concentrates on Additive Polymorphic Sites (APS). A sequence site is an APS when each of the nucleotides involved in the polymorphism can also be found separately at the same site in at least one other accession. Other intra-individual polymorphisms are ignored. This approach has been used to detect hybrids, for example.
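The APS definition is simple enough to sketch in a few lines of Python. This is only my reading of the definition as stated above, not the authors' procedure. The function name and the toy alignment are hypothetical, and the sequences are assumed to be aligned and of equal length.

```python
# Nucleotides represented by each IUPAC ambiguity code.
IUPAC = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
}


def additive_polymorphic_sites(alignment):
    """Return (accession, column) pairs that meet the APS definition.

    A polymorphic site counts as additive when every nucleotide involved
    occurs on its own at the same column in at least one other accession.
    """
    hits = []
    for name, seq in alignment.items():
        for col, code in enumerate(seq):
            bases = IUPAC.get(code.upper())
            if bases is None:
                continue  # not a polymorphism
            others = {
                other_seq[col].upper()
                for other, other_seq in alignment.items()
                if other != name
            }
            if all(base in others for base in bases):
                hits.append((name, col))
    return hits


# Hypothetical toy alignment (illustration only).
toy_aps = {
    "hybrid": "ACRT",
    "parent1": "ACAT",
    "parent2": "ACGT",
    "other": "AYGT",
}

print(additive_polymorphic_sites(toy_aps))
```

In the toy data, the R in "hybrid" is additive because A and G each occur alone in another accession at that column. The Y in "other" is not, because no other accession has a T there.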

An alternative, as used by Scheunert and Heubl to study reticulate evolution in their paper, uses 2ISPs (Intra-Individual Site Polymorphisms). All IUPAC codes, including those for polymorphic sites, are treated as unique characters by recoding the complete alignment as a standard matrix, which is then analyzed using a multistate analysis option for categorical data. The authors actually use the ad hoc maximum-likelihood implementation from Potts et al. (2014), with additional adaptation of a method for Bayesian inference based on Grimm et al. (2007).
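The recoding step itself can be sketched as follows. This shows only the general idea of turning each IUPAC code into its own character state, and it is not the implementation of Potts et al. (2014). The state symbols, the treatment of N as missing data, and the function name are all my own assumptions.

```python
# Four nucleotides plus the ten IUPAC codes for two or three bases,
# each given its own state symbol (the symbols are an arbitrary choice).
STATE_ORDER = "ACGTRYSWKMBDHV"
STATE_SYMBOLS = "0123456789abcd"
STATE_SYMBOL = dict(zip(STATE_ORDER, STATE_SYMBOLS))


def recode_as_standard(seq):
    """Recode a nucleotide sequence as a string of multistate symbols.

    Gaps are kept as "-"; N and anything unrecognized becomes "?"
    (treating N as missing data is an assumption of this sketch).
    """
    recoded = []
    for char in seq.upper():
        if char == "-":
            recoded.append("-")
        else:
            recoded.append(STATE_SYMBOL.get(char, "?"))
    return "".join(recoded)


# Hypothetical toy alignment (illustration only).
toy_2isp = {"acc1": "ACGTRYN-", "acc2": "ACGTAYVT"}

for name, seq in toy_2isp.items():
    print(name, recode_as_standard(seq))
```

The recoded matrix can then be given to any program that accepts unordered multistate characters. The choice of substitution model for such data is a separate matter, which the papers discuss.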

You can check out these papers for details.


Fuertes Aguilar J., Nieto Feliner G. (2003) Additive polymorphisms and reticulation in an ITS phylogeny of thrifts (Armeria, Plumbaginaceae). Molecular Phylogenetics and Evolution 28: 430-447.

Grimm G.W., Denk T., Hemleben V. (2007) Coding of intraspecific nucleotide polymorphisms: a tool to resolve reticulate evolutionary relationships in the ITS of beech trees (Fagus L., Fagaceae). Systematics and Biodiversity 5: 291-309.

Potts A.J., Hedderson T.A., Grimm G.W. (2014) Constructing phylogenies in the presence of intra-individual site polymorphisms (2ISPs) with a focus on the nuclear ribosomal cistron. Systematic Biology 63: 1-16.