Wednesday, July 30, 2014


This post is just to let everyone know that Dan Gusfield's long-awaited book on the interface between phylogenetics and population genetics is now available.

The book is targeted at mathematically inclined readers. It has a few contributions from Charles H. Langley, Yun S. Song and Yufeng Wu. The title, ReCombinatorics, is described as "a portmanteau word derived from the single-crossover recombination of the words 'recombination' and 'combinatorics'."

Hardcover 448 pp; ISBN: 9780262027526; $60.00 £30.95
More information is available from The MIT Press.

This new book joins these previous contributions to the genre:

Image from Celine Scornavacca.

Monday, July 28, 2014

The bourbon family forest?

On pages 72-73 of the book Guide to Urban Moonshining: How to Make and Drink Whiskey (written by Colin Spoelman and edited by David Haskell, 2013, published by Harry N. Abrams), there is an illustration of something called the "American whiskeys family tree". This is reproduced in the article The Bourbon Family Tree for GQ magazine, from where I sourced the copy here.

The author describes it as follows:
This chart shows the major distilleries operating in Kentucky, Tennessee, and Indiana, grouped horizontally by corporate owner, then subdivided by distillery. Each tree shows the type of whiskey made, and the various expressions of each style of whiskey or mash bill, in the case of bourbons. For instance, Basil Hayden's is a longer-aged version of Old Grand-Dad, and both are made at the Jim Beam Distillery.
So, while the vertical axis is indeed a time scale, the trees are only marginally family trees in the genealogical sense. This is much more an attempt to illustrate the corporate ownership of American whiskey, which is made principally from corn (and thus is generically called bourbon, although in Tennessee they seem to rarely use this word). The main distinctions among the brands are (i) whether the non-corn part is made from rye, a little bit more rye, or wheat, and (ii) the length of time it is aged between distillation and sale.

The reticulations among the trees apparently refer to blends. The ghost lineages at the right are described thus:
Willett, formerly only a bottler as Kentucky Bourbon Distillers, has been distilling its own product for about a year; I include the brands that it bottles from other sources for reference.

Wednesday, July 23, 2014

Evolutionary fitness and incest

I have written before about the expected genetic problems associated with inbreeding, including consanguinity and incest (relationships between people who are first cousins or closer). Conventionally, the evolutionary advantage of sexual over non-sexual reproduction is considered to be the creation of genetic diversity through heterozygosity. Inbreeding, by reducing heterozygosity, then seems to negate the advantages of sexual reproduction — it leads to the propagation of deleterious recessive alleles and thus inbreeding depression. So, there is a clear evolutionary dimension to the fact that incest avoidance is nearly universal in humans.

The best known exceptions to this situation are among royalty, including the family "trees" of the ancient Egyptian 18th Dynasty (see Tutankhamun and extreme consanguinity) and the Egyptian Ptolemaic dynasty (see Cleopatra, ambition and family networks), which were hybridization networks rather than conventional trees. The presence of consanguinity and incest among royal families then requires a biological explanation. As noted by van den Berghe & Mesher (1980):
Royal incest is best explained in terms of the general sociobiological paradigm of inclusive fitness ... Royal incest (mostly brother-sister; less commonly father-daughter) represents the logical extreme of hypergyny. Women in stratified societies maximize fitness by marrying up; the higher the status of a woman, the narrower her range of prospective husbands. This leads to a direct association between high status and inbreeding.
The benefits of inclusive fitness refer to the increased number of offspring in future generations that result from increasing the reproductive success of close relatives. This is achieved via choice of mate. In other words, close relatives share genes, and the success of any relative in leaving offspring is a success for all relatives. Therefore, evolutionary fitness is a combination of individual fitness plus the fitness of close relatives. Inbreeding may reduce individual fitness but can increase inclusive fitness, as noted by Puurtinen (2011):
Theoretical work has shown that inclusive fitness benefits can favor close inbreeding even when this results in substantial reduction in offspring fitness. These models have identified the boundary level of inbreeding depression limiting the evolution of inbreeding among first-order relatives, that is, between full siblings, or between parents and offspring.
So, there is a stable level of inbreeding in those populations that practice mate choice for optimal inbreeding. For example, the genetic risks of close inbreeding can be more than accounted for by the production of a highly related heir who has access to a wide choice of mates. Nevertheless:
For a wide range of realistic inbreeding depression strengths, mating with intermediately related individuals maximizes inclusive fitness.
In other words, mating with very close relatives is unlikely to evolve via natural selection because it is not an optimal strategy; and we must thus look to a sociological component to incest (such as retaining wealth within the family), as well as a biological one.

In this context, it is interesting to note exceptions to the usual restriction of incest to the aristocracy. The society of Graeco-Roman Egypt (from c. 300 BCE to 300 CE) provides the best-documented case (eg. see Hopkins 1980; Shaw 1992; Parker 1996; Scheidel 1997; Huebner 2007; Remijsen & Clarysse 2008). [This era starts with the Ptolemaic dynasty, which marks the collapse of Egyptian rule of Egypt.] During this time a significant proportion of all marriages noted in official Roman census declarations were between full brothers and sisters. That is, the Roman-era Egyptians did not limit this type of inbreeding to any small group, but spread it across several social classes (mainly Greek settlers rather than native Egyptians).

As noted by Scheidel (1997):
According to official census returns from Roman Egypt (first to third centuries CE) preserved on papyrus, 23·5% of all documented marriages in the Arsinoites district in the Fayum (n=102) were between brothers and sisters. In the second century CE, the rates were 37% in the city of Arsinoe and 18·9% in the surrounding villages. Documented pedigrees suggest a minimum mean level of inbreeding equivalent to a coefficient of inbreeding of 0·0975 in second century CE Arsinoe. Undocumented sources of inbreeding and an estimate based on the frequency of close-kin unions indicate a mean coefficient of inbreeding of F=0·15-0·20 in Arsinoe and of F=0·10-0·15 in the villages at the end of the second century CE. These values are several times as high as any other documented levels of inbreeding.
For comparison, the inbreeding F values for these family relationships are:
parent-offspring = siblings: F = 0.25
uncle-niece = double first cousins: F = 0.125
first cousins: F = 0.0625
first cousins once removed: F = 0.03125
second cousins: F = 0.015625
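These textbook values can be checked with Wright's path-counting formula, in which each path through a common ancestor contributes (1/2)^(n1 + n2 + 1) to the F value of the offspring, where n1 and n2 are the numbers of generations from each mate back to that ancestor. The following is only a minimal sketch: it assumes that the common ancestors are not themselves inbred, and the function name and the path lists are my own illustrative choices, not taken from any of the cited papers.

```python
from fractions import Fraction

def inbreeding_coefficient(paths):
    """Sum Wright's path contributions (1/2)^(n1 + n2 + 1).

    paths holds one (n1, n2) pair per common ancestor of the two mates,
    where n1 and n2 are the generations from each mate to that ancestor.
    Assumption: the common ancestors are themselves not inbred.
    """
    return sum(Fraction(1, 2) ** (n1 + n2 + 1) for n1, n2 in paths)

# Hypothetical encoding of the relationships listed above.
RELATIONSHIPS = {
    "parent-offspring": [(0, 1)],            # the parent is the common ancestor
    "full siblings": [(1, 1)] * 2,           # both parents are shared
    "uncle-niece": [(1, 2)] * 2,             # two shared (grand)parents
    "double first cousins": [(2, 2)] * 4,    # all four grandparents are shared
    "first cousins": [(2, 2)] * 2,           # two shared grandparents
    "first cousins once removed": [(2, 3)] * 2,
    "second cousins": [(3, 3)] * 2,          # two shared great-grandparents
}

for name, paths in RELATIONSHIPS.items():
    f = inbreeding_coefficient(paths)
    print(f"{name}: F = {f} = {float(f)}")
```

On this scale, the estimates quoted above for Arsinoe (F = 0.15-0.20) lie between the values for uncle-niece and full-sibling unions.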

However, inbreeding depression seems not to have been a notable problem during this historical time. As noted by John Hawkes:
There is not a single mention in the evidence that links sibling marriage to negative genetic effects or unhappy marriages.
This does not mean that there were no problems, but merely that any problems were not documented, as noted by Scheidel (1997):
Even in the absence of explicit references to inbreeding depression from Roman Egypt, there is no compelling reason to assume that brother–sister marriage could have remained entirely without negative consequences for the Arsinoites. It is however possible that, due to a low incidence of lethal recessives, such effects were considerably weaker than in some western samples. The census returns do not suggest lower levels of fertility or smaller numbers of children among sibling couples ...
The practice seems to have stopped solely because it was contrary to Roman Law:
Before a.d. 212 the Romans had accepted discrepancies between their own legal practice and prevailing local customs and traditions in the Eastern provinces. Papyri from Roman Egypt, the Talmud, and the Romano-Syrian law book indeed reveal legal procedures which differed significantly from Roman law in matters such as marriage, guardianship, paternal authority, sales, and debts. The Constitutio Antoniana, however, made all free men and women of the Roman Empire into Roman citizens, and so Roman law became applicable to all inhabitants of Egypt. Brother-sister marriages cease to be documented in our Roman census returns from the early third century on. Our last [incest] testimony dates to a.d. 229.


Hopkins K (1980) Brother-sister marriage in Roman Egypt. Comparative Studies in Society and History 22: 303-354.

Huebner SR (2007) "Brother-sister" marriage in Roman Egypt: a curiosity of humankind or a widespread family strategy? Journal of Roman Studies 97: 21-49.

Parker S (1996) Full brother-sister marriage in Roman Egypt: Another look. Cultural Anthropology 11: 362-376.

Puurtinen M (2011) Mate choice for optimal (k)inbreeding. Evolution 65: 1501-1505.

Remijsen S, Clarysse W (2008) Incest or adoption? Brother-sister marriage in Roman Egypt revisited. Journal of Roman Studies 98: 53-61.

Scheidel W (1997) Brother-sister marriage in Roman Egypt. Journal of Biosocial Science 29: 361-371.

Shaw BD (1992) Explaining incest: brother-sister marriage in Graeco-Roman Egypt. Man 27: 267-299.

Monday, July 21, 2014

The evolutionary March of Progress in popular culture

I have commented before on the fact that the general public associates an inappropriate "March of Progress" image with the concept of "evolution" (see Haeckel and the March of Progress, and especially Tattoo Monday VIII - the March of Progress). It therefore seems worthwhile to gather a few examples together in the one place. Most of these are abbreviated versions of the image in the book Early Man by Francis C. Howell (1965. Time-Life International, New York). There were originally 14 images (see the version here), but the modern versions have half as many images or fewer.

Wednesday, July 16, 2014

Touching the Data, photos

We all worked hard during the workshop. Here is our fearless leader, in deep thought:

While some of the younger participants enjoyed drawing on the walls:

Professor Whitfield has come up with a great new model of evolution: phylogenetic windmills:

There was not only work, but also time to relax and enjoy the beautiful Dutch summer weather:

And not to forget the delicious Dutch food:

But really, most of the time we were busy touching the data, which you can find on this website:

For more photos, see the Touching the Data website.

Friday, July 11, 2014

Touching the Data, report 2

We have now completed the workshop.

Since the first report, we have had three more talks. First, Mukul Bansal outlined the relationship between phylogenetic networks and reconciliation analysis, and the way in which the latter can be used to construct the former. Starting from an estimated species tree, the tree for each locus is optimized for fit to the species tree, which helps locate any areas of extensive gene flow (ie. reticulation). This can be done using a large number of loci and an even larger number of taxa.

Celine Scornavacca provided details of some of the fundamental limitations of network analysis. The most important of these is unidentifiability of network topologies -- there are classes of network topologies that cannot be distinguished based on the information that is currently used, so that we cannot guarantee that a unique optimal network will be found during an analysis. Branch lengths may help with this situation, but cannot be guaranteed to resolve it.

Jim Whitfield covered the advantages and potential problems of using genomic-scale data for phylogenetic analysis. The basic problem is the increased scope for error in moving to the genome data (genome assembly problems, gene homology issues, alignment difficulties), although the potential advantages are extensive.

Most importantly, we spent two days "touching" some data. The participants broke into smaller groups of continuously varying size, each of which focussed on a particular dataset (as supplied by some of the participants). These data were evaluated in many different ways, to assess the characteristics of the data as well as to evaluate the data-analysis methods. This not only allowed us to identify the current state of the art with respect to phylogenetic networks, but it also allowed computationalists to improve their understanding of biological data and how biologists proceed to analyze it, as well as allowing biologists to obtain immediate feedback with respect to their data-analysis issues.

Production of phylogenetic networks seems to have come a long way in the past few years, although there is still no single "one-stop shopping" software tool to use. Practical issues with getting programs to run on all computer types were identified, along with data-format issues. Nevertheless, all of the participants seemed to find that this was a very valuable exercise, as a means of focussing interactions among themselves.

Finally, we considered both European and U.S. funding for network research, in the latter case assisted by David Mindell (from the N.S.F.). In particular, we identified sources of funding for future workshops (either in the south of France or the north-eastern U.S.A.).

The canal-boat cruise turned out well, in spite of the somewhat uncooperative weather. The football, of course, has turned out to be rather disappointing for the hosts, although they have one more game to play.

Tuesday, July 8, 2014

Touching the Data, report 1

We have now completed two days of the workshop. We have had a relaxed approach to progress, and are thus currently running behind the nominal schedule. Nevertheless, we are progressing splendidly.

We had three talks on the first day and one today. I tried to kick things off by asking a series of questions that I consider to be unanswered, based on observing practitioners and computationalists in action, although apparently several members of the audience already had their own answers to some of these. The bottom line is that phylogenetic analysis focuses on data patterns while interpretation focuses on processes / mechanisms, and this constitutes a large part of the apparent separation of practitioners and computationalists.

Steven Kelk and Luay Nakhleh introduced the diversity of computational approaches that we already have. These presentations neatly complemented each other, providing a valuable summary of the field as well as an overview of current limitations and future prospects. This topic was taken up later by various members of the audience, as one of the inherent problems for practitioners is how to navigate through the methods to choose a suitable one -- there are methods based on parsimony, likelihood and Bayesian analysis, and methods that tackle de novo network construction, gene tree / species tree reconciliation, gene tree scoring, and network presentation.

This topic was followed up today by presentations introducing some of the currently available software. Some of these have progressed significantly in recent years, notably PhyloNet and Dendroscope, and there are some relatively new ones, as well as even newer ones in the pipeline. Based on the literature, these programs are being dramatically under-used compared to their actual usefulness.

This morning Scot Kelchner introduced us to the application of Zen Buddhism to science in general and phylogenetics in particular. This went down much better than he seemed to be expecting -- there were apparently a lot of "Zen" people in the room. The basic idea is not to get trapped by preconceived expectations, especially arbitrary categorical notions, when interpreting the output of a phylogenetic analysis. You can consult The Nine-Headed Dragon River, by Peter Matthiessen, if you would like further information.

Finally, we got to the topic implied by the workshop's title: Touching the Data. We had a brief run-through of the pre-existing datasets stored with this blog (see the upper right-hand corner), which cover some of the diversity of what practitioners have provided to date in the way of usable datasets with "known" phylogenetic patterns.

By far the most interesting, however, was the presentation of some recent datasets made available by members of the workshop, notably Axel Janke (bear species), Scot Kelchner (bamboo species) and Mattis List (Indo-European languages); Jim Whitfield will present his datasets tomorrow morning. These datasets generated much interest, as they provide a diversity of different possible applications for phylogenetic networks. The idea from here on in the workshop is to address what can currently be done with these datasets and what we might like to do with them if the tools were available. This will help focus the participants on specific practical issues, which should lead to the progress that we hope to achieve.

It has rained most of the day, which is actually unusual -- intermittent rain is more common in this climate. We are currently waiting for the football to start: Germany versus Brazil. Tomorrow will be the Netherlands versus Argentina. It is risky being in this country this week! The current local betting is for an all-European final, an assessment that involves no cultural bias whatsoever.

Monday, July 7, 2014

Workshop: Touching the Data

This week we have returned to Leiden (in the Netherlands), for another workshop sponsored by the Lorentz Center. The previous workshop, in October 2012, is discussed in this prior blog post: Workshop: The Future of Phylogenetic Networks.

The full title of the new workshop is: Utilizing Genealogical Phylogenetic Networks in Evolutionary Biology: Touching the Data. As before, it has been organized by Steven Kelk, Leo van Iersel, Leen Stougie and myself. The program and abstracts can be found here. It runs for the whole week, 7 July – 11 July 2014.

The workshop differs significantly from the previous workshop in two ways: it is intended to be a much smaller and more focused workshop, and it is intended to be practical rather than theoretical. The basic aim is to get biologists and computational people to sit down in small groups and actually talk about real phylogenetic data, so that each side of the phylogenetics "coin" gets to understand a bit better what is going on on the other side. To this end, we have gathered together some of the experts in the field specifically of evolutionary / genealogical networks (rather than data-display networks), as this is the area that needs the greatest future development. We have also gathered together some real-world datasets involving apparent reticulating evolution, which will be the focus of discussion. These datasets are available here and also here.

The weather is predicted to be changeable during the workshop, which is to be expected in northern Europe even in summer — that is why everyone else has gone to southern Europe.

I am hoping to add some blog posts based on what happens at the workshop, as it proceeds.

Thursday, July 3, 2014

Are genotype or phenotype data more tree-like?

I recently wrote a manuscript comparing the tree-likeness of phylogenetic data in biology and anthropology (see Are phylogenetic patterns the same in anthropology and biology?). While doing so, I also made a comparison of genotype and phenotype data within biology.

The comparison is based on maximum-parsimony analyses of the data, using the (ensemble) Retention Index (RI) as the measure of tree-likeness. If RI = 1 then all of the characters are compatible with the same tree, whereas if RI = 0 then none of them are pairwise compatible. As the graph shows, the genotype data are considerably less tree-like than are the phenotype data (mean RI ≈ 0.5 versus 0.7, respectively).
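For readers unfamiliar with the measure, the ensemble RI is conventionally defined as (G - S) / (G - M), where S is the observed number of steps on the tree, M is the minimum possible number of steps, and G is the maximum possible number, each summed over all characters. The following is only a rough sketch of that calculation, not what PAUP* does internally: it assumes unordered characters and a rooted binary tree written as nested tuples, and the function names and data layout are my own illustrative choices.

```python
from collections import Counter

def fitch_steps(tree, states):
    """Count state changes for one unordered character (Fitch parsimony).

    tree is a rooted binary tree as nested 2-tuples of leaf names;
    states maps each leaf name to its character state.
    """
    steps = 0

    def visit(node):
        nonlocal steps
        if not isinstance(node, tuple):
            return {states[node]}          # leaf: its observed state
        left, right = visit(node[0]), visit(node[1])
        common = left & right
        if common:
            return common                  # no change needed at this node
        steps += 1                         # empty intersection costs one step
        return left | right

    visit(tree)
    return steps

def ensemble_ri(tree, characters):
    """Ensemble Retention Index, (G - S) / (G - M), summed over characters."""
    total_m = total_s = total_g = 0
    for states in characters:
        counts = Counter(states.values())
        total_m += len(counts) - 1                      # minimum steps
        total_g += len(states) - max(counts.values())   # maximum steps
        total_s += fitch_steps(tree, states)            # observed steps
    if total_g == total_m:
        raise ValueError("no parsimony-informative characters")
    return (total_g - total_s) / (total_g - total_m)

# Toy example: one character fits the tree perfectly, the other conflicts.
tree = (("A", "B"), ("C", "D"))
fits = {"A": 0, "B": 0, "C": 1, "D": 1}
conflicts = {"A": 0, "B": 1, "C": 0, "D": 1}
print(ensemble_ri(tree, [fits]))             # 1.0
print(ensemble_ri(tree, [conflicts]))        # 0.0
print(ensemble_ri(tree, [fits, conflicts]))  # 0.5
```

In the toy example, a dataset made up of equal parts congruent and conflicting characters gives RI = 0.5, which is roughly where the genotype datasets sit on average.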

It would be interesting to know whether other people have observed this pattern. If it is general, then what causes it? Are the phenotype characters being chosen (subconsciously or not) because they show nested grouping patterns (which lend themselves automatically to a tree representation)? Or do the genotype data inherently have more stochastic variation? Does this mean that we should always be using phylogenetic networks for the representation of genotype data?

You can read the manuscript if you want the details of the analyses. Briefly, the initial collections of datasets were taken from Collard et al. (Evolution and Human Behavior 27: 169-184; 2006) — the graphed data are taken from the paper as I never managed to get the original datasets from the authors. I then supplemented this information with phenotype datasets from TreeBase (total of n=31) and miscellaneous genotype datasets from the literature (n=15). All of the datasets refer to vertebrates and insects (with one phenotype dataset from spiders). My parsimony analyses used the parsimony ratchet and PAUP*.