Monday, October 20, 2014

Beer family trees

Some time ago I wrote a blog post about The bourbon family forest, which contained a collection of trees that, rather than being genealogical, showed the corporate ownership of American whiskey.

Here is a similar arrangement for "the six companies that make 50% of the world's beer", produced by David Yanofsky at the Quartz blog. As before, the vertical axis is actually a time scale, but the trees are only marginally family trees in the genealogical sense. Note that there is a reticulation between two of the trees for the "Scottish & Newcastle" entry, although this was apparently followed immediately by a divergence.

Nevertheless, roughly the same sort of information could actually be presented as proper genealogies. Here is an example from Philip Howard's blog, restricted to American beer. Note that the genealogies refer to the joining of branches through time, rather than their splitting. There are two reticulation events, one of which also refers to the "Scottish & Newcastle" entry.

It is also worth noting the use of other types of network by Philip Howard, to look at:

Wednesday, October 15, 2014

Open problems in phylogenetics

Periodically, mathematicians and other computationalists produce lists of what they refer to as "Open Problems" in their particular field. Phylogenetics is no exception. We have had a few on this blog before today (e.g. An open question about computational complexity; Phylogenetic network Millennium problems).

I thought that I should draw your attention to the fact that last year, Barbara Holland produced a few of her own (2013. The rise of statistical phylogenetics. Australian and New Zealand Journal of Statistics 55: 205-220). These are:

Open problem 1: What is the natural analogue of a confidence interval for a phylogenetic tree?

Open problem 2: What are useful residual diagnostics for phylogenetic models?

Open problem 3: What makes a good phylogenetic model?

Open problem 4: Should DAGs be acceptable objects for inference or should network methods be restricted to exploratory data analysis?

It is obviously the last of these problems that is of most interest to us here:
DAGs [directed acyclic graphs] can be constructed by beginning with a good tree and then progressively adding edges until the fit between the model and the data is deemed good enough or there is no sufficient improvement in fit by continuing to add edges. The trouble with using DAGs to define mixture models is that this approach doesn’t actually capture the biological processes of interest within the model. The sorts of things we’d like the data to tell us are what is the relative rate of recombination events or hybridisation events to mutation events or speciation events. The danger with using phylogenetic networks in an "add an extra edge until the fit is good enough" approach is that by giving ourselves the capacity to explain everything we risk explaining nothing. At some point have we stopped doing inference and got back to just summarising our data? 
In phylogenetics we rely on our models for their explanatory power — in the context of network evolution we need to make careful decisions about what biological processes should be included within the model such that inferences about reticulate (non-treelike) processes of evolution can be brought within the realm of stochastic uncertainty rather than being left as a source of inductive uncertainty. This is not a straightforward task, and will require the collaboration of evolutionary biologists and statisticians.
One of the principal issues here is that it is almost impossible to consistently distinguish one reticulation process from another based on the structure of the resulting network. These processes all produce gene flow in the biological world, and they all appear as reticulations in the graphical representation of a network. In practice, phylogenetic analysis may boil down to only two biological processes in the model (vertical gene inheritance and horizontal gene flow), followed by biologists trying to sort out the details with post hoc analyses. Deep coalescence and gene duplication are part of the vertical inheritance, while hybridization, introgression, horizontal gene transfer and recombination are part of gene flow. It would be nice to think that this model would simplify network analyses.
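
To make the "add an extra edge until the fit is good enough" procedure from the quotation concrete, here is a minimal Python sketch. It is a toy illustration only, not Holland's method or any published algorithm. The candidate edges, the additive log-likelihood gains in GAINS, the starting value BASE_LOG_LIK and the per-edge penalty are all hypothetical values made up for the example; a real analysis would compute the likelihood from sequence data under an explicit model.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

Edge = Tuple[str, str]


def greedy_add_edges(
    candidates: Iterable[Edge],
    log_lik: Callable[[FrozenSet[Edge]], float],
    penalty_per_edge: float = 2.0,
) -> List[Edge]:
    """Add reticulation edges one at a time while a penalised fit improves."""
    chosen: List[Edge] = []
    remaining = set(candidates)
    best_score = log_lik(frozenset())  # fit of the starting tree, no extra edges
    while remaining:
        # Score the network obtained by adding each remaining candidate edge.
        scored = [
            (log_lik(frozenset(chosen + [e])) - penalty_per_edge * (len(chosen) + 1), e)
            for e in sorted(remaining)
        ]
        score, edge = max(scored)
        if score <= best_score:
            break  # no candidate improves the penalised fit, so stop
        chosen.append(edge)
        remaining.remove(edge)
        best_score = score
    return chosen


# Hypothetical numbers: each extra edge raises the log-likelihood a little.
BASE_LOG_LIK = -100.0
GAINS = {("A", "B"): 10.0, ("C", "D"): 3.0, ("E", "F"): 0.5}


def toy_log_lik(edges: FrozenSet[Edge]) -> float:
    """Toy fit function: the tree's log-likelihood plus an additive gain per edge."""
    return BASE_LOG_LIK + sum(GAINS[e] for e in edges)


with_penalty = greedy_add_edges(GAINS, toy_log_lik, penalty_per_edge=2.0)
without_penalty = greedy_add_edges(GAINS, toy_log_lik, penalty_per_edge=0.0)
print(with_penalty)
print(without_penalty)
```

With the penalty set to zero, every candidate edge is accepted, because any extra edge improves the raw fit at least slightly. That is the "explain everything" danger described in the quotation. With a penalty, the procedure stops early, but the stopping point reflects the arbitrary penalty rather than any modelled biological process, which is the substance of the open problem.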