Wednesday, October 28, 2015

Arguments against the use of networks?

The usual argument in favour of using phylogenetic networks is the obvious one that they can account for gene flow during phylogenetic history, as well as vertical inheritance. The usual argument against their use, if there is one, is that vertical inheritance is of primary importance, and thus a tree is "adequate" under many circumstances; or the use of a tree is simply an unquestioned assumption (ie. phylogenetics = trees).

However, Walter Salzburger, Greg B. Ewing and Arndt von Haeseler (2011. The performance of phylogenetic algorithms in estimating haplotype genealogies with migration. Molecular Ecology 20: 1952-1963) have presented a different argument. They point out that a collection of trees can contain more information than can a single network that combines them. This occurs when reticulations represent ambiguity rather than gene flow, as they will in a population or haplotype network (see How do we interpret a rooted haplotype network?).

Their argument is this:
We note that out of a set of different haplotype genealogies, no single genealogy offers a better description of the ‘truth’ than any other one does without considering external data such as the underlying DNA sequences (this is the same when dealing with a set of different MP trees with the same score). The question raised is how are we better off with a group of haplotype genealogies vs. a network that may not be tree-like. The existence of many haplotype genealogies is simply another way of representing ambiguity in the data.
However, the important difference between a network and a set of trees is the lack of independence of Fitch length labellings [ie. the Hadamard distance between nodes]. We illustrate this in Fig. 2. We have the same initial tree with the same tip sequences, but the Fitch branch lengths and internal sequences are different. In the top figure we see that haplotype E connects to D, while haplotype A and B form a cherry also connecting to D. But an alternative is that haplotype E connects to C. This has the effect of changing the topology throughout the tree. So by making some choice in one part of the Fitch tree, it can have topological consequences elsewhere in the tree. In the network case, each ambiguity is represented independently of each other.
It is difficult to represent the same information in a [single] graph compared to a set of trees.

Using this argument, the authors focussed entirely on trees in their simulation study comparing phylogenetic methods: "Here, we are considering the case where the true signal is tree-like and that reticulations represent reconstruction ambiguity." They then confirmed the consequent, by demonstrating that under these circumstances network methods produce false-positive reticulations. Tree-based methods cannot produce reticulations, and so there can be no false positives.

Apart from the impracticality of dealing with potentially large numbers of trees, the main downside of a collection of trees is that we cannot easily compare those trees, which we can instantly do when they are represented by a single network (ie. the trees differ where there are reticulations in the network). Salzburger et al. indirectly refer to this when they note that a "problem is the evaluation of the reliability of connections in haplotype genealogies." They suggest mapping the consistency index for each mutation responsible for each connection in each tree, which seems to be a rather cumbersome alternative to the use of network reticulations to represent unreliability. (NB. A consensus tree is the third way to represent a set of trees, and this seems rarely to be used for haplotypes.)

Interestingly, the authors' results showed that the Phylip program DNAPARS consistently did better than the program PAUP* at recovering the simulated trees. The main difference between these two programs is that PAUP* does a better job of finding the set of maximum-parsimony (MP) trees. The results therefore suggest that the authors' trees were usually not MP trees, so that PAUP* was simply wasting its time looking harder for them.

Monday, October 26, 2015

Recent patterns of credit card fraud

The U.S.A. is finally in the process of introducing credit cards containing microchips, in addition to the use of magnetic stripes. These have been widely available elsewhere for a number of years, particularly in the European Union. These chips are used to verify the PIN code entered during transactions, and thus provide an extra level of security against fraudulent use of the cards (in preference to the use of magnetic stripes plus signatures). Unfortunately, the U.S. is introducing a watered-down level of security, in which a signature can be used instead of the PIN code; this achieves very little in the way of extra security (see That big security fix for credit cards won’t stop fraud).

Even in the face of chip-and-PIN security, card fraud cannot be completely stopped, of course. For example, a recent pre-print by Houda Ferradi, Rémi Géraud, David Naccache and Assia Tria (When organized crime applies academic results: a forensic analysis of an in-card listening device) describes a 2011 case where the in-card chip was by-passed by an extra chip, which approved any entered PIN (for a non-technical explanation, see X-ray scans expose an ingenious chip-and-pin card hack). Nevertheless, such cases currently seem to be the exception rather than the rule.

We can investigate recent patterns of credit and debit card fraud using the U.K. as an example. There are regularly updated data in the annual "Fraud The Facts" booklets produced by the UK Cards Association and Financial Fraud Action UK. I have compiled the data for the years 1999-2014 inclusive, so that we can look at the past 16 years using a phylogenetic network. The data include five types of fraud (listed in order of decreasing average frequency):
  • Remote purchase (card not present) = phone, internet and mail-order purchases
  • Counterfeit card
  • Lost or stolen card
  • Card ID theft
  • Card non-receipt = card stolen in the mail

As usual, the network is being used as a form of exploratory data analysis. I first used the Manhattan distance to calculate the similarity of the different years, based on the frequencies of the five fraud types. This was followed by a neighbor-net analysis to display the between-year similarities as a phylogenetic network. So, years that are closely connected in the network are similar to each other based on their fraud frequencies, and those that are further apart are progressively more different from each other.
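The distance step can be sketched in a few lines of Python. This is only a minimal illustration: the year labels and frequencies below are made-up placeholders, not the values from the booklets, and the function names are my own. The neighbor-net step itself is not shown, as it would be run in a dedicated network program.

```python
from itertools import combinations

# Hypothetical frequencies for three placeholder years (NOT the real data).
# Column order: remote purchase, counterfeit, lost/stolen, card ID theft, non-receipt
fraud_by_year = {
    "Y1": [10, 8, 7, 2, 4],
    "Y2": [20, 9, 3, 2, 1],
    "Y3": [14, 2, 3, 1, 1],
}


def manhattan(a, b):
    """Sum of absolute differences between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def distance_matrix(table):
    """Return {(label_i, label_j): distance} for every unordered pair of rows."""
    return {
        (i, j): manhattan(table[i], table[j])
        for i, j in combinations(sorted(table), 2)
    }


if __name__ == "__main__":
    for (i, j), d in distance_matrix(fraud_by_year).items():
        print(f"{i} vs {j}: {d}")
```

The resulting pairwise distances are the kind of input that a neighbor-net analysis takes; a small distance means two years have similar fraud profiles.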

The pattern is basically an increasing incidence of fraud through time from top to bottom in the network, due almost entirely to a rapid increase in Remote-purchase fraud. However, this trend was reversed after 2008, followed by a return from 2011 onwards.

However, the time trends are not the same for each fraud type. The incidence of fraud involving Card-ID theft remained relatively steady through time. On the other hand, the incidence of both Card non-receipt fraud and Lost / stolen card fraud dropped after 2004 and they have stayed low since then. Counterfeit-card fraud dropped after 2008, and has stayed low since then. Finally, Remote-purchase fraud also dropped after 2008, but rose again in 2012 and has continued to increase. The latter has been almost entirely due to e-commerce fraud (rather than phone or mail-order).

The drop in certain types of fraud in 2004 seems to have been due to increasing use of sophisticated fraud-screening detection tools by retailers and banks, such as the integrated chip and PIN technology. These help deal with counterfeiting and loss / theft. From 2008, there was growth in the use of the American Express SafeKey, MasterCard SecureCode and Verified by Visa systems, by both online retailers and cardholders. This helps deal with e-commerce security. Finally, a "Be Card Smart Online" campaign was launched in the U.K. at the end of 2008, which provides consumers with straightforward practical tips to help them shop safely on the internet.

The recent dramatic increase in e-commerce fraud is attributed to criminals changing their strategies to target this opportunity. For example, they now need to obtain both card numbers and PINs, and are applying methods to do so. Hardware modifications can also be used, such as the one mentioned above.

Not unexpectedly, the greatest amount of both overseas fraudulent use of U.K. cards, and fraudulent use of foreign cards in the U.K., involves the U.S.A. This is because Americans have only belatedly started using chipped cards, as noted above.

Good, practical advice about minimizing fraudulent use of your cards is given in the current (2015) booklet, irrespective of which country you live in.

Wednesday, October 21, 2015

Studying gene flow using genomes

Continuing the recent blog theme of researchers analyzing potentially reticulate relationships without explicitly using networks (Are networks actually used to explore reticulate histories? ; Problems with manually constructing networks), there is this just-published paper:
Nater A, Burri R, Kawakami T, Smeds L, Ellegren H (2015) Resolving evolutionary relationships in closely related species with whole-genome sequencing data. Systematic Biology 64: 1000-1017.
The authors note:
Using genetic data to resolve the evolutionary relationships of species is of major interest in evolutionary and systematic biology. However, reconstructing the sequence of speciation events, the so-called species tree, in closely related and potentially hybridizing species is very challenging. Processes such as incomplete lineage sorting and interspecific gene flow result in local gene genealogies that differ in their topology from the species tree, and analyses of few loci with a single sequence per species are likely to produce conflicting or even misleading results ... Although gene tree incongruences caused by ILS are still fully compatible with a strictly bifurcating species tree, gene flow among species requires a more complex representation of evolutionary histories, resembling reticulate networks rather than trees.
Unfortunately, this is the sole mention of the word "network" in the text.

The authors addressed the issues of incomplete lineage sorting and interspecific gene flow using whole-genome sequence data from 198 individuals of four flycatcher species, plus two outgroup genomes. They found that, for most genomic regions, none of the 15 possible rooted gene tree topologies appeared consistently at high frequencies — the most frequent gene tree occurred 17.7% of the time, with the second at 14.3% and the third at 10.5%.
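The figure of 15 possible rooted topologies for four species follows from the standard count of rooted, bifurcating, leaf-labelled trees, (2n-3)!! for n taxa. A short sketch of that arithmetic (the function name is my own):

```python
def rooted_binary_topologies(n_taxa):
    """Count rooted, bifurcating, leaf-labelled tree topologies: (2n-3)!!."""
    if n_taxa < 1:
        raise ValueError("need at least one taxon")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n-3 (empty product for n <= 2).
    for k in range(3, 2 * n_taxa - 2, 2):
        count *= k
    return count


if __name__ == "__main__":
    n = 4  # four flycatcher species
    total = rooted_binary_topologies(n)
    print(total)               # number of possible rooted topologies
    print(f"{100 / total:.1f}%")  # share of each topology if all were equally frequent
```

With 15 topologies, each would appear about 6.7% of the time if all were equally frequent, which gives some context for the observed 17.7% for the commonest gene tree.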

They investigated this gene-tree diversity using four programs that attempt to resolve a species tree in the context of incomplete lineage sorting and the coalescent: MP-EST, SNAPP, Fastsimcoal2, and ABC. The latter two approaches also allow for post-divergence gene flow. All four methods have limited applicability when applied to 200 genomes, and so in each case only a subset of the data was analyzed or a subset of the possible species trees was tested. All four methods produced the same species tree, which was also the same as the most commonly encountered gene tree.

Unfortunately, the authors found almost no evidence of gene flow using these methods, although their detailed gene-tree analyses do suggest its existence. This indicates that there are problems with these methods. Perhaps the main problem is that the authors approached their analyses almost exclusively in the context of a species tree rather than a network. There are other methods that one could try, including the one used by researchers studying introgression in archaic hominoids (as discussed in Are networks actually used to explore reticulate histories?).

In addition, the authors seem to be unclear about their concept of what is a species. For example, they note that "gene flow among lineages in the species tree can confound the true order of speciation events", which seems to preclude use of the biological species concept. Furthermore, they note that "lack of species monophyly is common in this study system", which seems to preclude the phylogenetic species concept. What then constitutes speciation?

Finally, the authors seem to have a common misconception of ancestral character states. Their approach includes this statement: "If both outgroup individuals were monomorphic for the same allele, this allele was considered ancestral." This argument has been repeatedly rejected in the literature. See, for example, Crisp MD, Cook LG. (2005) Do early branching lineages signify ancestral traits? Trends in Ecology and Evolution 20: 122-128.

Monday, October 19, 2015

An unusual typographical error

Writing scientific papers is hard enough, but, just to make it more challenging for us, publishers insist that we are also the ones to deal with error checking, such as spelling, grammar, etc. This means that we have all published papers with typographical errors. I could not resist mentioning this one, which is in the current issue of a journal. (I am not listing it, because I see no reason to single out the authors, when we all have this problem.)

The references list contains this publication:

If you look up the original paper, you will find that this unusually named "author" is allegedly: Points O.F. View (which is actually the title of the page, not the author).

I suspect that the real author of the paper is being slowly eliminated from recorded history. A few months ago I blogged about a successful attempt to replace me as the author of a set of blog posts (One of the "joys" of blogging). It's enough to make you stop trying to be an author.

Wednesday, October 14, 2015

Problems with manually constructing networks

I wrote recently about whether explicit network methods are currently used in practice to construct evolutionary networks (Are networks actually used to explore reticulate histories?), and noted that they usually are not. Here I explore in a bit more detail another example, and point out a couple of limitations of constructing such networks manually.

Earlier this year a paper was published exploring the Anopheles gambiae species complex, this group of mosquitoes being the principal vector of the malaria parasite:
Fontaine MC, et al. (2015) Extensive introgression in a malaria vector species complex revealed by phylogenomics. Science 347: 1258524.
There are about 450 known species of anopheline mosquitoes, which transmit five species of malaria to humans, and many other malaria species to most other vertebrates. The genomes of the six Anopheles species were included as part of a genome study published simultaneously, which also included other Anopheles species:
Neafsey DE, et al. (2015) Highly evolvable malaria vectors: the genomes of 16 Anopheles mosquitoes. Science 347: 1258522.

Both groups of researchers constructed a phylogenetic tree of their organisms, but Fontaine et al. then added reticulations to their tree (thus manually forming an evolutionary network). The reticulations represent putative introgression among members of the An. gambiae species complex, many of which have overlapping distributions within sub-Saharan Africa.

Fontaine et al. constructed their network by trying to take into account incomplete lineage sorting (which Neafsey et al. apparently did not — they left the An. gambiae species complex as an unresolved polychotomy). This is all well and good, and it matches the current paradigm in the literature where hybridization / introgression (a process involving horizontal gene flow that creates gene-tree discordance) is studied in association with ILS (a process involving vertical inheritance but which also creates gene-tree discordance). The alternative paradigm is that lateral gene transfer (a process involving horizontal gene flow that creates gene-tree discordance) is studied in association with gene duplication–loss (a process involving vertical inheritance but which also creates gene-tree discordance).

However, this might not be the best strategy in this particular case. In the companion paper by Neafsey et al., they note that for their 16 genomes:
Copy-number variation in homologous gene families also reveals striking evolutionary dynamism. Analysis of 11,636 gene families ... indicates a rate of gene gain / loss higher by a factor of at least 5 than that observed for 12 Drosophila genomes.
Under these circumstances, why ignore the possibility that gene duplication and selective loss has created gene-tree discordance? This possibility is not even mentioned by Fontaine et al. Also not mentioned are other possible sources of gene-tree discordance that are associated with vertical inheritance (eg. balancing selection), but they do at one stage concern themselves with the possibility of unequal rates of evolution among the chromosomes.

Their data-analysis strategy was this:
To infer the correct species branching order in the face of anticipated ILS and introgression, maximum-likelihood (ML) phylogenies were constructed from 50-kilobase (kb) non-overlapping windows across the alignments (referred to here as "gene trees" regardless of their protein-coding content), considering six in-group species rooted alternatively with An. christyi or An. epiroticus (n = 4063 windows).
They found a total of 85 different gene-tree topologies, some of them occurring much more frequently than others. They plotted these onto the four autosomal chromosomes plus the X chromosome, and found that the X chromosome favoured very different gene trees than did the autosomes.

From this analysis, the authors constructed a phylogenetic network (shown in the next figure) based on a species tree (black lines) with reticulations added (green arrows) to indicate introgression. I have added two labels ("Vertical" and "Horizontal") to emphasize the authors' interpretation of the evolutionary flow of genetic information, separated into vertical inheritance and horizontal gene flow (introgression).

The authors interpret the horizontal gene flow as being introgression because:
Autosomal introgression between An. arabiensis and the ancestor of An. gambiae [gam] + An. coluzzii [col] has long been postulated and could explain the strong discordance between the dominant tree topologies of the X and autosomes.
The idea of the introgression being autosomal seems to be based on the idea that the "true species tree" is the one shown by the genes that mediate male and female fertility (ie. the sex chromosomes).

The authors note that, for a "definitive interpretation of these conflicting signals" between the gene trees, they need to have "the correct species branching order". I have raised a number of times in this blog the difficulty of constructing a "species tree" in the face of reticulation. If there is evidence for horizontal gene flow in the data then how do we first extract just the vertical inheritance? The authors attempted to address this question in a section entitled "Tree height reveals the true species branching order in the face of introgression". Their argument is this:
To infer the correct historical branching order, we applied a strategy based on sequence divergence ... Because introgression will reduce sequence divergence between the species exchanging genes, we expect that the correct species branching order revealed by gene trees constructed from non-introgressed sequences will show deeper divergences than those constructed from introgressed sequences. If the hypothesis of autosomal introgression is correct, this implies that the topologies supported by the X chromosome should show significantly higher divergence times ... than topologies supported by the autosomes.
This, indeed, was what they found; and so they concluded that the X chromosome topology represents the species tree, and the autosomes are showing introgression. However, this seems to be a somewhat specious argument. Maybe introgression does lower tree height, but I don't think that we should conclude from this that lowered tree height indicates introgression. We cannot simply invert this argument (ie. A causes B, and therefore B implies A), because there may be other differences between the autosomes and the X chromosome that also affect relative tree height, such as unequal gene duplication-loss, convergence, unequal evolutionary rates, balancing selection, and so on.

Therefore, we should not be surprised if the authors have got it wrong about whether the X chromosome or the autosomes are showing the "true species tree" (if there is one). That is, the edge labelled "Vertical" in the above network may actually represent the horizontal gene flow, while the edge labelled "Horizontal" may actually represent the vertical inheritance.

Finally, there is a published commentary on the two Anopheles papers:
Clark AG, Messer PW (2015) Conundrum of jumbled mosquito genomes. Science 347: 27-28.
These authors appropriately note that:
Fontaine et al. adhere to a classical view that there is a "true species tree" ... But given that the bulk of the genome has a network of relationships that is different from this true species tree, perhaps we should dispense with the tree and acknowledge that these genomes are best described by a network, and that they undergo rampant reticulate evolution.
This alternate philosophy requires an integrated method for constructing the network, rather than manually constructing a species tree and then adding reticulations. Such a method would construct a network from first principles, and then reveal whether the species phylogeny is tree-like or not, rather than assuming that it is a tree a priori. There are a number of methods being developed for doing this.

Monday, October 12, 2015

Buffon and the origin of the tree and network metaphors

I have written before about Georges-Louis Leclerc, Comte de Buffon (1707-1788). (Actually, he was called Georges-Louis Leclerc from 1707-1725, and Georges-Louis Leclerc De Buffon from 1725–1773, before becoming a count.) His role in the development of the theory of organic evolution was such that he is worth considering again here, especially given his important role in introducing the tree and network metaphors in phylogenetics.


Buffon is usually credited with being in the top triumvirate of influential people in the development of modern biology, along with Aristotle and Darwin. Buffon followed the lead of the physicist Isaac Newton, by trying to explain natural phenomena solely in terms of other observable natural phenomena, rather than resorting to super-natural explanations. (Indeed, Buffon translated one of Newton's books from Latin to French.)

This was Newton's main contribution to science, his insistence on empirical explanations. He did not invent this idea, but he was the one who effectively created modern science by consistently applying it. Hence the importance of the apple — the explanation for the small-scale phenomenon of a falling apple, which we can see and study experimentally, is the same as for the large-scale orbits of the planets, which we can see but not experiment upon. Consistency of natural explanations, rather than invoking super-natural forces, creates a coherent scientific whole that is amenable to description, explanation and prediction.

Buffon adopted this same scientific approach and applied it to biology. Once again, he did not invent this idea, but he was the one who applied it consistently across all of biology. He did this principally in his Histoire naturelle, générale et particulière, an ambitious work planned to cover all of nature in 50 volumes (it included geology, anthropology and cosmogony, as well as biology). The work was begun in 1749; he and a few collaborators completed 36 volumes before his death in 1788, and 8 more were compiled by others shortly afterwards.

In the process of trying to find natural explanations for all empirically observable biological phenomena, Buffon not unexpectedly encountered the idea of mutation of species, as part of his thoughts about an irreversible history of nature. He thus grappled both with species concepts and with temporal change within and between species. He is therefore credited as the first modern evolutionist, because he introduced the time element in comparative biology, so that common structure is explained in terms of common ancestry. However, his ideas, published over many decades, were often inconsistent: sometimes he was an evolutionist and sometimes not. This seems to be, at least in part, due to increasing religious pressure, as he was an important person in the ancien régime of France, and not in a position to easily reject the teachings of the Catholic church.

By modern standards, Buffon was wrong on most things (see Buffon's genealogical ideas), as was Aristotle — being first means that you are also the first to get it wrong, to one extent or another. This does not in any way reduce the impressive nature of his work as a pioneer. He was not a cataloguer of information like his great Swedish rival von Linné — he wanted to explain things, not organize them, as he was interested principally in causes. He also moved away from trying to explain biology in terms of physics (eg. the concept of universal essences), and tried to explain it in terms of itself.


Of principal interest for this blog is Buffon's role in the development of metaphors for biological relationships. Given his role as an early adopter of evolutionary ideas, he was also an early adopter of metaphors to depict those ideas about historical relationships.

Buffon argued for temporal continuity rather than eternal types, modification of both natural and domesticated species through time (but only up to a certain point), and an underlying unity of organismal types. The latter idea suggested common ancestry for all animals, but Buffon considered and rejected this hypothesis. Indeed, he also rejected the idea that species descend from each other, thus accepting only within-species evolution. He did, however, have a broad concept of species, based on inter-breeding, so that some of his species correspond to modern taxonomic families.

In a previous blog post (The first phylogenetic network 1755) I noted that Buffon put his thoughts into action when he considered the within-species evolution of dog breeds in volume V of his Histoire naturelle. In doing so, he published what is usually considered to be the first avowedly evolutionary diagram. It shows the origin and diversification of dog domestication as known at the time. It includes both temporal and spatial variation among dogs, since Buffon believed that morphological variation was related to different climates, so that climatic differences were the ultimate cause of biological variation.

Although Buffon labeled the diagram as a "Table", in his text he noted that it is [translated] "a table or, if one prefers, a kind of genealogical tree where one may grasp at a glance all the varieties". In modern terms it is actually a hybridization network, since it shows repeatedly that some dog breeds arose as a result of hybridization between other breeds. It is also, of course, a map, since it shows spatial variation, although the geographical content is not strictly respected. The diagram is thus a hybrid of a network and a map.

Note that Buffon used the idea of a tree long before Simon Pallas (1766), who is usually credited with introducing the tree metaphor. However, Buffon was writing solely about within-species relationships, whereas Pallas discussed a much broader scale (specifically, both plants and animals).

Indeed, Buffon's genealogical ideas had first appeared in volume IV of the Histoire naturelle, in 1753 (the same year as Linné's Species Plantarum). In this volume there is a presentation of his ideas on species in "Discours sur la nature des animaux" [Discourse on the nature of animals] and his ideas about animal genealogy in "L'asne" [The ass]. The latter contains this text:
que l'homme et le singe ont eu une origine commune comme le cheval et l'âne; que chaque famille, tant dans les animaux que dans les végétaux, n'a eu qu'une seule souche, et même que tous les animaux sont venus d'un seul animal qui, dans la succession des temps, a produit, en se perfectionnant et en dégénérant, toutes les races des autres animaux. [that man and ape have had a common origin like the horse and the donkey; every family, both in animals and in plants, had only a single stem [stock], and even all the animals came from a single animal which, in the succession of time has produced by perfection and degeneration, all the races of the other animals.]
Buffon was, however, not consistent in his uses of metaphors. This topic is discussed in detail by Giulio Barsanti (1992), and he has provided a convenient chart of Buffon's metaphors — the following version is taken from Ruse and Travis (2009).

Note that Buffon used the traditional chain analogy most often, since this can be used for ancestor–descendant relationships. However, he simultaneously used the tree and map in 1755 (as discussed above), and he effectively replaced the tree with the map after 1780. The map had previously been introduced by von Linné in 1751 ("All plants show affinities on either side, like territories in a geographical map").

It is interesting to see the rapid rise and fall of the family-tree metaphor in the mid 1700s, before its resurgence a century later. The cluster of tree references in 1766 is from "De la dégénération", in volume XIV of Histoire naturelle. "Dégénération" was Buffon's term for evolution.


Barsanti G (1992) Buffon et l'image de la nature: de l'échelle des êtres à la carte géographique et à l'arbre généalogique [Buffon and the image of nature: the scale of being to the map and to the family tree]. In: Gayon J (ed.) Buffon 88: Actes du Colloque International [pour le bicentenaire de la mort de Buffon] (Paris-Montbard-Dijon, 14-22 juin 1988), pp. 255-296. Paris: Librairie Philosophique J. Vrin.

Ruse M, Travis J (2009) Evolution: The First Four Billion Years. Belknap Press, Cambridge MA, p 458.

Wednesday, October 7, 2015

The Wave Theory: the predecessor of network thinking in historical linguistics


It has been mentioned in a couple of previous blogposts that tree-thinking started rather early in historical linguistics (Morrison 07/2013 and Morrison 11/2012).

Although he was not the first to draw language trees, it was August Schleicher (1821-1866) who made tree-thinking quite popular in linguistics with his two papers published in 1853 (1853a and 1853b). Note that there was no notable influence by Darwin here. It is more likely that Schleicher was influenced by stemmatics (manuscript comparison, Hoenigswald 1963: 8); and even today, historical linguistics has certain features that resemble manuscript comparison much more closely than evolutionary biology. It seems that Schleicher's enthusiasm for the drawing of language trees had quite an impact on Ernst Haeckel (1834-1919), since – as Schleicher pointed out himself (Schleicher 1863) – linguistic trees by then were concrete and not abstract like the one Darwin showed in his Origin (Darwin 1859).


Schleicher's tree-thinking, however, did not last very long in the world of historical linguistics. By the beginning of the 1870s Hugo Schuchardt (1842-1927) and Johannes Schmidt (1843-1901) had published critical views, claiming that vertical descent was not all that language evolution is about (Schmidt 1872, Schuchardt 1870). Schuchardt was (at least in my opinion) really concrete and observant in his criticisms, especially pointing to the problem of borrowing between very closely related languages, which might deeply confuse the phylogenetic signal:
We connect the branches and twigs of the family tree with countless horizontal lines and it ceases to be a tree. (Schuchardt 1870: 11, my translation)
While Schuchardt's observations were based on his deep knowledge of the Romance languages, Schmidt drew his conclusions from a thorough investigation of shared homologous words in the major branches of Indo-European. What he found here were word patterns with a strongly patchy distribution, with many gaps in certain languages and only a few patterns (if any) that could be found in all languages. One seemingly surprising fact was, for example, that Greek and Sanskrit shared about 39% of homologs (according to Schmidt's count, see Geisler and List 2013), Greek and Latin shared 53%, but Latin and Sanskrit only 8%. Assuming that Greek and Latin had a common ancestor, Schmidt found it very difficult to explain how the similarities of these two languages to Sanskrit could be so different (Schmidt 1872: 24). Furthermore, this pattern of patchy distributions seemed to be repeated in all branches of Indo-European that Schmidt compared in his investigation. Schmidt thus concluded:
No matter how we look at it, as long as we stick to the assumption that today's languages originated from their common proto-language via multiple furcation, we will never be able to explain all facts in a scientifically adequate way. (Schmidt 1872: 17, my translation).
Unfortunately, Schmidt did not stop with this conclusion but proposed another model of language divergence instead of the family tree model:
I want to replace [the tree] by the image of a wave that spreads out from the center in concentric circles becoming weaker and weaker the farther they get away from the center. (Schmidt 1872: 27, my translation)
Ever since then, this new model, the so-called wave theory (Wellentheorie in German), has lurked in textbooks on historical linguistics, confusing especially those who are not primarily trained in the field. What is the wave theory in the end? How could it replace the tree? While Schmidt did not give a visualization in his book from 1872, he gave one three years later (Schmidt 1875: 199):

What we can see from this figure is that we can't see anything: It displays languages in a pie-chart diagram in a quasi-geographic space. No information regarding ancestral states of the languages is given, and no temporal dynamics are shown. I find Schmidt's descriptions of the wave theory hard to understand in their core. He doesn't seem to ignore that evolution has a time dimension, but he seems to deliberately neglect it when drawing his waves.

Other scholars, like Hirt (1905), Bloomfield (1933), Meillet (1908), or Bonfante (1931), proposed similar and alternative ways to visualize Schmidt's wave, as shown in the image below. In contrast to the language trees which – after Schleicher's initial rather "realistic" tree drawings – quickly began to be schematized in historical linguistics, the correct way to draw a wave has remained a mystery up to today.

Problems with Waves and Trees

When reading Schmidt's book from 1872 and also inspecting his data, certain fallacies in his argumentation become obvious. Firstly, he claims that the low number of shared homologs between Sanskrit and Latin would be a problem for a family tree theory; however, this is no problem as long as we do not assume that the loss of words follows an evolutionary clock. Furthermore, Schmidt underestimated how much such counts depend on the state of our etymological knowledge. When comparing the three languages in alternative counts of more recent etymological databases (see Geisler and List 2013 for details), the scores change considerably, with Latin and Greek sharing 40%, Greek and Sanskrit sharing 39% and Latin and Sanskrit sharing (already) 21%. Although no complete account of Schmidt's data is available in digital form, I think we can assume that the data that forced Schmidt to assume that there is no tree behind the Indo-European languages would not scare off an evolutionary dendrophilist. Whether the tree that the different phylogenetic frameworks would give us from Schmidt's data corresponds to any reality of Indo-European language formation is another question, but the data may well be quite tree-like, despite what Schmidt saw in it.

A further problem of the wave theory is that people contrast it with the family tree model. This does not seem to be justified, since – as we can see from the visualizations shown above – the wave theory ignores the temporal dimension of divergence and convergence. In this sense, it is a pure data display model, similar to a data-display network (Morrison 2011: 5-9) to which some geographical information has been added. As long as the wave theory shows only similarities between taxonomic units based on some kind of underlying data, it is neither a "theory" nor a hypothesis. It is no opponent of the family tree, since it serves a completely different purpose.

What Schuchardt already mentioned, and what Schmidt might have been looking for, was the idea of phylogenetic networks: if we cannot ignore the fact that languages exchange material laterally as well as inheriting it vertically, we "connect the branches and twigs of the family tree with countless horizontal lines and it ceases to be a tree" (Schuchardt 1870: 11).

  • Bloomfield, L. (1933 [1973]). Language. London: Allen & Unwin. 
  • Bonfante, G. (1931). “I dialetti indoeuropei”. Annali del R. Istituto Orientale di Napoli 4, 69–185.
  • Darwin, C. (1859). On the origin of species by means of natural selection, or, the preservation of favoured races in the struggle for life. London: John Murray.
  • Geisler, H. and J.-M. List (2013). “Do languages grow on trees? The tree metaphor in the history of linguistics”. In: Classification and evolution in biology, linguistics and the history of science. Concepts – methods – visualization. Ed. by H. Fangerau, H. Geisler, T. Halling and W. Martin. Stuttgart: Franz Steiner Verlag, 111–124.
  • Hirt, H. (1905). Die Indogermanen. Ihre Verbreitung, ihre Urheimat und ihre Kultur. Vol. 1. Strassburg: Trübner. Internet Archive: dieindogermaneni01hirtuoft.
  • Hoenigswald, H. M. (1963). “On the history of the comparative method”. Anthropological Linguistics 5.1, 1–11.
  • Meillet, A. (1922 [1908]). Les dialectes Indo-Européens. Paris: Librairie Ancienne Honoré Champion. Internet Archive: lesdialectesindo00meil.
  • Morrison, D. A. (2011). An introduction to phylogenetic networks. Uppsala: RJR Productions.
  • Schleicher, A. (1853a). “Die ersten Spaltungen des indogermanischen Urvolkes”. Allgemeine Monatsschrift für Wissenschaft und Literatur, 786–787.
  • Schleicher, A. (1853b). “O jazyku litevském, zvlástě na slovanský. Čteno v posezení sekcí filologické král. České Společnosti Nauk dne 6. června 1853”. Časopis Českého Museum 27, 320–334.
  • Schleicher, A. (1863). Die Darwinsche Theorie und die Sprachwissenschaft. Offenes Sendschreiben an Herrn Dr. Ernst Haeckel. Weimar: Hermann Böhlau. ZVDD: urn:nbn:de:bvb:12-bsb10588615-5.
  • Schmidt, J. (1872). Die Verwantschaftsverhältnisse der indogermanischen Sprachen. Weimar: Hermann Böhlau.
  • Schmidt, J. (1875). Zur Geschichte des Indogermanischen Vokalismus. Weimar: Hermann Böhlau.
  • Schuchardt, H. (1870 [1900]). Über die Klassifikation der romanischen Mundarten. Probe-Vorlesung, gehalten zu Leipzig am 30. April 1870. Graz.

Monday, October 5, 2015

A network of blood types

The relationship between phenotypes and allele frequencies is often introduced in textbooks using the example of human blood type. There are three alleles for the blood-type gene (IA, IB, IO), and these produce four phenotypes (A, B, AB, and O) since IA and IB are co-dominant and IO is recessive.
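The textbook relationship can be written out as a small calculation. The sketch below assumes random mating (Hardy-Weinberg proportions); the function name and the example allele frequencies are purely illustrative and are not taken from any dataset.

```python
def abo_phenotype_frequencies(p_a, p_b, p_o):
    """Expected ABO phenotype frequencies under random mating.

    p_a, p_b, p_o are the frequencies of the IA, IB and IO alleles.
    IA and IB are co-dominant; IO is recessive to both.
    """
    if abs(p_a + p_b + p_o - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return {
        "A": p_a ** 2 + 2 * p_a * p_o,   # genotypes IA/IA and IA/IO
        "B": p_b ** 2 + 2 * p_b * p_o,   # genotypes IB/IB and IB/IO
        "AB": 2 * p_a * p_b,             # genotype IA/IB
        "O": p_o ** 2,                   # genotype IO/IO
    }


# Hypothetical allele frequencies, chosen only to make the arithmetic easy to follow.
example = abo_phenotype_frequencies(0.3, 0.1, 0.6)
for phenotype, freq in example.items():
    print(phenotype, round(freq, 4))
```

The four expected frequencies always sum to one, because together they are the expansion of (p_a + p_b + p_o) squared.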

The proportions of these phenotypes vary among human ethnic groups, and this variation provides one of the simplest demonstrations that human inter-breeding does not occur at random — that is, Hardy-Weinberg equilibrium is not maintained at a global scale. This can be pictured using a phylogenetic network.

The data come from Racial and Ethnic Distribution of ABO Blood Types. As usual, the phylogenetic network is being used as a form of exploratory data analysis. I first used the Manhattan distance to calculate the dissimilarity of the ethnic groups, based on the frequencies of the four blood phenotypes. This was followed by a Neighbor-net analysis to display the between-group similarities as a phylogenetic network. So, ethnic groups that are closely connected in the network are similar to each other based on the relative frequencies of their blood types, and those that are further apart are progressively more different from each other.
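The distance step can be sketched as follows. This is a minimal illustration, not the analysis actually run for the post: the group names and percentages are hypothetical placeholders rather than values from the dataset, and the Neighbor-net step itself is not reproduced here (the resulting distance matrix would be passed to a program that implements it, such as SplitsTree).

```python
from itertools import combinations

# Hypothetical phenotype percentages in the order (O, A, B, AB).
# These are placeholder values, NOT taken from the blood-type dataset.
frequencies = {
    "group_1": (50.0, 30.0, 15.0, 5.0),
    "group_2": (45.0, 40.0, 10.0, 5.0),
    "group_3": (100.0, 0.0, 0.0, 0.0),
}


def manhattan(u, v):
    """Manhattan distance: the sum of absolute differences per phenotype."""
    return sum(abs(a - b) for a, b in zip(u, v))


def pairwise_distances(table):
    """Return the distance for every unordered pair of groups."""
    return {
        (g, h): manhattan(table[g], table[h])
        for g, h in combinations(sorted(table), 2)
    }


distances = pairwise_distances(frequencies)
for pair, d in distances.items():
    print(pair, d)
```

Because each row sums to 100%, the largest possible distance between two groups is 200, reached only when they share no phenotype at all.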

You will note that very few of the ethnic peoples that are either geographically or historically closely related to each other have similar distributions of blood types. Indeed, only the Irish and the Scots are closely related both in history and in the network. So, at a global scale, breeding occurs almost entirely within ethnic groups and not between them. Widespread modern migration has not yet obscured this pattern.

There is, however, a broad range of phenotypic variation in blood type. For example, the bottom right-hand part of the network shows those ethnic groups that are dominated by the O phenotype, the top-right is dominated by type A, and the bottom-left by type B.

Of particular interest are those groups for which the B allele has not been recorded in the dataset (ie. the B and AB phenotypes are absent), which include the Australian Aborigines, the Bororo and Peruvian Indians from South America, the Shompen Nicobars from the Indian Ocean, and the Blackfoot and Navajo peoples from North America. The Maoris and Mayans also have a very low frequency of this allele. The Bororo, Peruvian Indian and Shompen peoples also seem to lack the A allele; and it is extremely rare in the Mayans. No group lacks the O allele, but it is lowest in the people from the Grand Andaman islands in the Indian Ocean.