Wednesday, October 7, 2015

The Wave Theory: the predecessor of network thinking in historical linguistics


It has been mentioned in a couple of previous blog posts that tree-thinking started rather early in historical linguistics (Morrison 07/2013 and Morrison 11/2012).

Although he was not the first to draw language trees, it was August Schleicher (1821-1866) who made tree-thinking popular in linguistics with two papers published in 1853 (1853a and 1853b). There was no notable influence from Darwin here. It is more likely that Schleicher was influenced by stemmatics (manuscript comparison, Hoenigswald 1963: 8); and even today, historical linguistics has certain features that resemble manuscript comparison much more closely than evolutionary biology. It seems that Schleicher's enthusiasm for drawing language trees had quite an impact on Ernst Haeckel (1834-1919), since – as Schleicher pointed out himself (Schleicher 1863) – linguistic trees by then were concrete and not abstract like the one Darwin showed in his Origin (Darwin 1859).


Schleicher's tree-thinking, however, did not last very long in the world of historical linguistics. At the beginning of the 1870s Hugo Schuchardt (1842-1927) and Johannes Schmidt (1843-1901) published critical views, claiming that vertical descent is not all that language evolution is about (Schmidt 1872, Schuchardt 1870). Schuchardt was (at least in my opinion) really concrete and observant in his criticisms, especially in pointing to the problem of borrowing between very closely related languages, which might deeply confuse the phylogenetic signal:
We connect the branches and twigs of the family tree with countless horizontal lines and it ceases to be a tree. (Schuchardt 1870: 11, my translation)
While Schuchardt's observations were based on his deep knowledge of the Romance languages, Schmidt drew his conclusions from a thorough investigation of shared homologous words in the major branches of Indo-European. What he found were patterns of words with a strongly patchy distribution, with many gaps in certain languages and only a few patterns (if any) that could be found in all languages. One seemingly surprising fact was, for example, that Greek and Sanskrit shared about 39% of homologs (according to Schmidt's count, see Geisler and List 2013), Greek and Latin shared 53%, but Latin and Sanskrit only 8%. Assuming that Greek and Latin had a common ancestor, Schmidt found it very difficult to explain how the similarities of these two languages with Sanskrit could be so different (Schmidt 1872: 24). Furthermore, this pattern of patchy distributions seemed to be repeated in all branches of Indo-European that Schmidt compared in his investigation. Schmidt thus concluded:
No matter how we look at it, as long as we stick to the assumption that today's languages originated from their common proto-language via multiple furcation, we will never be able to explain all facts in a scientifically adequate way. (Schmidt 1872: 17, my translation)
Unfortunately, Schmidt did not stop with this conclusion but proposed another model of language divergence instead of the family tree model:
I want to replace [the tree] by the image of a wave that spreads out from the center in concentric circles becoming weaker and weaker the farther they get away from the center. (Schmidt 1872: 27, my translation)
Ever since then, this new model, the so-called wave theory (Wellentheorie in German), has lurked in textbooks on historical linguistics, confusing especially those who are not primarily trained in the field. What is the wave theory in the end? How could it replace the tree? While Schmidt did not give a visualization in his book from 1872, he gave one three years later (Schmidt 1875: 199):

What we can see from this figure is that we can't see anything: It displays languages in a pie-chart diagram in a quasi-geographic space. No information regarding ancestral states of the languages is given, and no temporal dynamics are shown. I find Schmidt's descriptions of the wave theory hard to understand in their core. He doesn't seem to ignore that evolution has a time dimension, but he seems to deliberately neglect it when drawing his waves.

Other scholars, like Hirt (1905), Bloomfield (1933), Meillet (1908), or Bonfante (1931), proposed similar and alternative ways to visualize Schmidt's wave, as shown in the image below. In contrast to the language trees which – after Schleicher's initial rather "realistic" tree drawings – quickly began to be schematized in historical linguistics, the correct way to draw a wave has remained a mystery up to today.

Problems with Waves and Trees

When reading Schmidt's book from 1872 and also inspecting his data, certain fallacies in his argumentation become obvious. Firstly, he claims that the low number of shared homologs between Sanskrit and Latin would be a problem for a family tree theory. This is of course no problem, as long as we do not assume that the loss of words follows an evolutionary clock. Furthermore, Schmidt underestimated how much such counts depend on the state of our knowledge. When comparing the three languages in alternative counts from more recent etymological databases (see Geisler and List 2013 for details), the scores change considerably, with Latin and Greek sharing 40%, Greek and Sanskrit sharing 39% and Latin and Sanskrit sharing (already) 21%. Although no complete account of Schmidt's data is available in digital form, I think we can assume that the data that forced Schmidt to assume that there is no tree behind the Indo-European languages would not scare off an evolutionary dendrophilist. Whether the tree that the different phylogenetic frameworks would give us from Schmidt's data corresponds to any reality of Indo-European language formation is another question, but the data may well be quite tree-like, despite what Schmidt saw in it.
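The point about the evolutionary clock can be made concrete with Schmidt's three percentages quoted earlier. The following is only a minimal sketch: turning each shared proportion into a distance as 1 minus the proportion is an assumption made for illustration here, not something Schmidt did, and the function names are hypothetical. Under that assumption, the three distances fit an unrooted three-taxon tree exactly, and they only look problematic if we additionally demand clock-like (ultrametric) distances.

```python
# Illustration only: treating (1 - shared proportion) as a distance is an
# assumption of this sketch, not part of Schmidt's (1872) own analysis.
shared = {
    ("Greek", "Latin"): 0.53,
    ("Greek", "Sanskrit"): 0.39,
    ("Latin", "Sanskrit"): 0.08,
}


def three_taxon_branches(shared):
    """Terminal branch lengths of the unrooted tree joining three taxa."""
    dist = {frozenset(pair): 1.0 - s for pair, s in shared.items()}
    taxa = sorted({taxon for pair in shared for taxon in pair})
    branches = {}
    for t in taxa:
        a, b = [u for u in taxa if u != t]
        # branch(t) = (d(t, a) + d(t, b) - d(a, b)) / 2
        branches[t] = (
            dist[frozenset((t, a))]
            + dist[frozenset((t, b))]
            - dist[frozenset((a, b))]
        ) / 2
    return branches


def is_clocklike(shared, tol=1e-9):
    """Three-point condition: the two largest distances must be equal."""
    d = sorted(1.0 - s for s in shared.values())
    return abs(d[-1] - d[-2]) <= tol


branches = three_taxon_branches(shared)
for taxon, length in branches.items():
    print(f"{taxon}: {length:.2f}")
print("clock-like:", is_clocklike(shared))
```

All three branch lengths come out positive but very unequal, so the counts are compatible with a tree in which the lineages lost words at different rates; they are incompatible only with a tree that also assumes a constant rate of loss.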

A further problem of the wave theory is that people contrast it with the family tree model. This does not seem to be justified, since – as we can see from the visualizations shown above – the wave theory ignores the temporal dimension of divergence and convergence. In this sense, it is a pure data display model, similar to a data-display network (Morrison 2011: 5-9) to which some geographical information has been added. As long as the wave theory shows only similarities between taxonomic units based on some kind of underlying data, it is neither a "theory" nor a hypothesis. It is no opponent of the family tree, since it serves a completely different purpose.

What Schuchardt already mentioned, and what Schmidt might have been looking for, was the idea of phylogenetic networks: if we cannot ignore the fact that languages exchange material laterally as well as inheriting it vertically, we "connect the branches and twigs of the family tree with countless horizontal lines and it ceases to be a tree" (Schuchardt 1870: 11).

  • Bloomfield, L. (1933 [1973]). Language. London: Allen & Unwin. 
  • Bonfante, G. (1931). “I dialetti indoeuropei”. Annali del R. Istituto Orientale di Napoli 4, 69–185.
  • Darwin, C. (1859). On the origin of species by means of natural selection, or, the preservation of favoured races in the struggle for life. London: John Murray.
  • Geisler, H. and J.-M. List (2013). “Do languages grow on trees? The tree metaphor in the history of linguistics”. In: Classification and evolution in biology, linguistics and the history of science. Concepts – methods – visualization. Ed. by H. Fangerau, H. Geisler, T. Halling, and W. Martin. Stuttgart: Franz Steiner Verlag, 111–124.
  • Hirt, H. (1905). Die Indogermanen. Ihre Verbreitung, ihre Urheimat und ihre Kultur. Vol. 1. Strassburg: Trübner. Internet Archive: dieindogermaneni01hirtuoft.
  • Hoenigswald, H. M. (1963). “On the history of the comparative method”. Anthropological Linguistics 5.1, 1–11.
  • Meillet, A. (1922 [1908]). Les dialectes Indo-Européens. Paris: Librairie Ancienne Honoré Champion. Internet Archive: lesdialectesindo00meil.
  • Morrison, D. A. (2011). An introduction to phylogenetic networks. Uppsala: RJR Productions.
  • Schleicher, A. (1853a). “Die ersten Spaltungen des indogermanischen Urvolkes”. Allgemeine Monatsschrift für Wissenschaft und Literatur, 786–787.
  • Schleicher, A. (1853b). “O jazyku litevském, zvlástě na slovanský. Čteno v posezení sekcí filologické král. České Společnosti Nauk dne 6. června 1853”. Časopis Českého Museum 27, 320–334.
  • Schleicher, A. (1863). Die Darwinsche Theorie und die Sprachwissenschaft. Offenes Sendschreiben an Herrn Dr. Ernst Haeckel. Weimar: Hermann Böhlau. ZVDD: urn:nbn:de:bvb:12-bsb10588615-5.
  • Schmidt, J. (1872). Die Verwantschaftsverhältnisse der indogermanischen Sprachen. Weimar: Hermann Böhlau.
  • Schmidt, J. (1875). Zur Geschichte des Indogermanischen Vokalismus. Weimar: Hermann Böhlau.
  • Schuchardt, H. (1870 [1900]). Über die Klassifikation der romanischen Mundarten. Probe-Vorlesung, gehalten zu Leipzig am 30. April 1870. Graz.

Monday, October 5, 2015

A network of blood types

The relationship between phenotypes and allele frequencies is often introduced in textbooks using the example of human blood type. There are three alleles for the blood-type gene (IA, IB, IO), and these produce four phenotypes (A, B, AB, and O) since IA and IB are co-dominant and IO is recessive.
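As a minimal sketch of that textbook relationship, the following computes the expected phenotype frequencies from the three allele frequencies, assuming Hardy-Weinberg proportions within a single randomly mating population. The function name and the example allele frequencies are hypothetical and chosen purely for illustration.

```python
import math


def abo_phenotype_frequencies(p, q, r):
    """Expected ABO phenotype frequencies under Hardy-Weinberg proportions.

    p, q and r are the frequencies of the IA, IB and IO alleles.
    """
    if not math.isclose(p + q + r, 1.0, abs_tol=1e-9):
        raise ValueError("allele frequencies must sum to 1")
    return {
        "A": p * p + 2 * p * r,   # genotypes IA IA and IA IO
        "B": q * q + 2 * q * r,   # genotypes IB IB and IB IO
        "AB": 2 * p * q,          # co-dominant heterozygote IA IB
        "O": r * r,               # recessive homozygote IO IO
    }


# Hypothetical allele frequencies, not taken from any real population.
example = abo_phenotype_frequencies(0.3, 0.1, 0.6)
for phenotype, freq in example.items():
    print(f"{phenotype}: {freq:.2f}")
```

The four phenotype frequencies always sum to 1, because they are just the terms of (p + q + r) squared grouped by phenotype.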

The proportions of these phenotypes vary among human ethnic groups, and this variation provides one of the simplest demonstrations that human inter-breeding does not occur at random — that is, Hardy-Weinberg equilibrium is not maintained at a global scale. This can be pictured using a phylogenetic network.

The data come from Racial and Ethnic Distribution of ABO Blood Types. As usual, the phylogenetic network is being used as a form of exploratory data analysis. I first used the Manhattan distance to calculate the dissimilarity of the ethnic groups, based on the frequencies of the four blood phenotypes. This was followed by a Neighbor-net analysis to display the between-group similarities as a phylogenetic network. So, ethnic groups that are closely connected in the network are similar to each other based on the relative frequencies of their blood types, and those that are further apart are progressively more different from each other.
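The distance step can be sketched as follows. This is only an illustration of the Manhattan distance on phenotype-frequency vectors: the group names and frequencies below are hypothetical placeholders, not values from the dataset, and the Neighbor-net step itself is not shown.

```python
from itertools import combinations

# Hypothetical phenotype frequencies in the order (A, B, AB, O).
# These are placeholders for illustration, NOT values from the dataset.
groups = {
    "group_1": (0.40, 0.10, 0.05, 0.45),
    "group_2": (0.35, 0.15, 0.05, 0.45),
    "group_3": (0.00, 0.00, 0.00, 1.00),
}


def manhattan(u, v):
    """Sum of absolute differences between two frequency vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def pairwise_distances(groups):
    """Manhattan distance for every unordered pair of groups."""
    return {
        (g, h): manhattan(groups[g], groups[h])
        for g, h in combinations(sorted(groups), 2)
    }


distances = pairwise_distances(groups)
for (g, h), d in distances.items():
    print(f"{g} vs {h}: {d:.2f}")
```

Because each vector of phenotype frequencies sums to 1, the distance ranges from 0 (identical distributions) to 2 (no phenotype in common). The resulting pairwise matrix is the kind of input a Neighbor-net analysis takes.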

You will note that very few of the ethnic peoples that are either geographically or historically closely related to each other have similar distributions of blood types. Indeed, only the Irish and the Scots are closely related both in history and in the network. So, at a global scale, breeding occurs almost entirely within ethnic groups and not between them. Widespread modern migration has not yet obscured this pattern.

There is, however, a broad range of phenotypic variation in blood type. For example, the bottom right-hand part of the network shows those ethnic groups that are dominated by the O phenotype, the top-right is dominated by type A, and the bottom-left by type B.

Of particular interest are those groups for which the B allele has not been recorded in the dataset (i.e. the B and AB phenotypes are absent), which include the Australian Aborigines, the Bororo and Peruvian Indians from South America, the Shompen Nicobars from the Indian Ocean, and the Blackfoot and Navajo peoples from North America. The Maoris and Mayans also have a very low frequency of this allele. The Bororo, Peruvian Indian and Shompen peoples also seem to lack the A allele; and it is extremely rare in the Mayans. No group lacks the O allele, but it is lowest in the people from the Grand Andaman islands in the Indian Ocean.