Tuesday, January 31, 2017

Similarities and language relationship

There is a long-standing debate in linguistics regarding the best proof deep relationships between languages. Scholars often break it down to the question of words vs. rules, or lexicon vs. grammar. However, this is essentially misleading, since it suggests that only one type of evidence could ever be used, whereas most of the time it is the accumulation of multiple pieces of evidence that helps to convince scholars. Even if this debate is misleading, it is interesting, since it reflects a general problem of historical linguistics: the problem of similarities between languages, and how to interpret them.

Unlike (or like?) biology, linguistics has a serious problem with similarities. Languages can be strikingly similar in various ways. They can share similar words, but also similar structures, similar ways of expressing things.

In Chinese, for example, new words can be easily created by compounding existing ones, and the word for 'train' is expressed by combining huǒ 火 'fire' and chē 車 'wagon'. The same can be done in languages like German and English, where the words Feuerwagen and fire wagon will be slightly differently interpreted by the speakers, but the constructions are nevertheless valid candidates for words in both languages. In Russian, on the other hand, it is not possible to just put two nouns together to form a new word, but one needs to say something as огненная машина (ognyonnaya mašína), which literally could be translated as 'firy wagon'.

Neither German nor English are historically closely related to Chinese, but German, English, and Russian go back to the same relatively recent ancestral language. We can see that whether a language allows compounding of two words to form a new one or not, is not really indicative of its history, as is the question of whether a language has an article, or whether it has a case system.

The problem with similarities between languages is that the apparent similarities may have different sources, and not all of them are due to historical development. Similarities can be:
  1. coincidental (simply due to chance),
  2. natural (being grounded in human cognition),
  3. genealogical (due to common inheritance), and
  4. contact-induced (due to lateral transfer).
As an example for the first type of similarity, consider the Modern Greek word θεός [θɛɔs] ‘god’ and the Spanish dios [diɔs] ‘god’. Both words look similar and sound similar, but this is a sheer coincidence. This becomes clear when comparing the oldest ancestor forms of the words that are reflected in written sources, namely Old Latin deivos, and Mycenaean Greek thehós (Meier-Brügger 2002: 57f).

As an example of the second type of similarity, consider the Chinese word māmā 媽媽 'mother' vs. the German Mama 'mother'. Both words are strikingly similar, not because they are related, but because they reflect the process of language acquisition by children, which usually starts with vowels like [a] and the nasal consonant [m] (Jakobson 1960).

An example of genealogical similarity is the German Zahn and the English tooth, both going back to a Proto-Germanic form *tanθ-. Contact-induced similarity (the fourth type) is reflected in the English mountain and the French montagne, since the former was borrowed from the latter.

We can display these similarities in the following decision tree, along with examples from the lexicon of different languages (see List 2014: 56):

Four basic types of similarity in linguistics

In this figure, I have highlighted the last two types of similarity (in a box) in order to indicate that they are historical similarities. They reflect individual language development, and allow us to investigate the evolutionary history of languages. Natural and coincidental similarities, on the other hand, are not indicative of history.

When trying to infer the evolutionary history of languages, it is thus crucial to first rule out the non-historical similarities, and then the contact-induced similarities. The non-historical similarities will only add noise to the historical signal, and the contact-induced similarities need to be separated from the genealogical similarities, in order to find out which languages share a common origin and which languages have merely influenced each other some time during their history.

Unfortunately, it is not trivial to disentangle these similarities. Coincidence, for example, seems to be easy to handle, but it is notoriously difficult to calculate the likelihood of chance similarities. Scholars have tried to model the probability of chance similarities mathematically, but their models are far too simple to provide us with good estimations, as they usually only consider the first consonant of a word in no more than 200 words of each language (Ringe 1992, Baxter and Manaster Ramer 2000, Kessler 2001).

The problem here is that everything that goes beyond word-initial consonants would have to take the probability of word structures into account. However, since languages differ greatly regarding their so-called phonotactic structure (that is, the sound combinations they allow to occur inside a syllable or a word), an account on chance similarities would need to include a probabilistic model of possible and language-specific word structures. So far, I am not aware of anybody who has tried to tackle this problem.

Even more problematic is the second type of similarity. At first sight, it seems that one could capture natural similarities by searching for similarities that recur in very diverse locations of the world. If we compare, for example, which languages have tones, and we find that tones occur almost all over the world, we could argue that the existence of tone languages is not a good indicator of relatedness, since tonal systems can easily develop independently.

The problem with independent development, however, is again tricky, as we need to distinguish different aspects of independence. Independent development could be due to: human cognition (the fact that many languages all over the world denote the bark of a tree with a compound tree-skin is obviously grounded in our perception); or due to language acquisition (like the case of words for 'mother'); but potentially also due to environmental factors, such as the size of the population of speakers (Lupyan et al. 2010), or the location where the languages are spoken (see Everett et al. 2015, but also compare the critical assessment in Hammarström 2016).

Convergence (in linguistics, the term is used to denote similar development due to contact) is a very frequent phenomenon in language evolution, and can happen in all domains of language. Often we simply do not know enough to make a qualified assessment as to whether certain features that are similar among languages are inherited/borrowed or have developed independently.

Interestingly, this was first emphasized by Karl Brugmann (1849-1919), who is often credited as the "father of cladistic thinking" in linguistics. Linguists usually quote his paper from 1884, in order to emphasize the crucial role that Brugmann attributed to shared innovations (synapomorphies in the cladistic terminology) for the purpose of subgrouping. When reading this paper thoroughly, however, it is obvious that Brugmann himself was much less obsessed with the obscure and circular notion of shared innovations (which also holds for cladistics in biology; see De Laet 2005), but with the fact that it is often impossible to actually find them, due to our incapacity to disentangle independent development, inheritance and borrowing.

So far, most linguistic research has concentrated on the problem of distinguishing borrowed from inherited traits, and it is here that the fight over lexicon or grammar as primary evidence for relatedness primarily developed. Since certain aspects of grammar, like case inflection, are rarely transferred from one language to another, while words are easily borrowed, some linguists claim that only grammatical similarities are sufficient evidence of language relationship. This argument is not necessarily productive, since many languages simply lack grammatical structures like inflection, and will therefore not be amenable to any investigation, if we only accept inflectional morphology (grammar) as rigorous proof (for a full discussion, see Dybo and Starostin 2008). Luckily, we do not need to go that far. Aikhenvald (2007: 5) proposes the following borrowability scale:
Aikhenvald's (2007) scale of borrowability

As we can see from this scale, core lexicon (basic vocabulary) ranks second, right behind inflectional morphology. Pragmatically, we can thus say: if we have nothing but the words, it is better to compare words than anything else. Even more important is that, even if we compare what people label "grammar", we compare concrete form-meaning pairs (e.g., concrete plural-endings), and we never compare abstract features (e.g., whether languages have an article). We do so in order to avoid the "homoplasy problem" that causes so many headaches in our research. No biologist would group insects, birds, and bats based on their wings; and no linguist would group Chinese and English due to their lack of complex morphology and their preference for compound words.

Why do I mention all this in this blog post? For three main reasons. First, the problem of similarity is still creating a lot of confusion in the interdisciplinary dialogues involving linguistics and biology. David is right: similarity between linguistic traits is more like similarity in morphological traits in biology (phenotype), but too often, scholars draw the analogy with genes (genotype) (Morrison 2014).

Second, the problem of disentangling different kinds of similarities is not unique to linguistics, but is also present in biology (Gordon and Notar 2015), and comparing the problems that both disciplines face is interesting and may even be inspiring.

Third, the problem of similarities has direct implications for our null hypothesis when considering certain types of data. David asked in a recent blog post: "What is the null hypothesis for a phylogeny?" When dealing with observed similarity patterns across different languages, and recalling that we do not have the luxury to assume monogenesis in language evolution, we might want to know what the null hypothesis for these data should be. I have to admit, however, that I really don't know the answer.

  • Aikhenvald, A. (2007): Grammars in contact. A cross-linguistic perspective. In: Aikhenvald, A. and R. Dixon (eds.): Grammars in Contact. Oxford University Press: Oxford. 1-66.
  • Baxter, W. and A. Manaster Ramer (2000): Beyond lumping and splitting: Probabilistic issues in historical linguistics. In: Renfrew, C., A. McMahon, and L. Trask (eds.): Time depth in historical linguistics. McDonald Institute for Archaeological Research: Cambridge. 167-188.
  • Brugmann, K. (1884): Zur Frage nach den Verwandtschaftsverhältnissen der indogermanischen Sprachen [Questions regarding the closer relationship of the Indo-European languages]. Internationale Zeischrift für allgemeine Sprachewissenschaft 1. 228-256.
  • De Laet, J. (2005): Parsimony and the problem of inapplicables in sequence data. In: Albert, V. (ed.): Parsimony, phylogeny, and genomics. Oxford University Press: Oxford. 81-116.
  • Dybo, A. and G. Starostin (2008): In defense of the comparative method, or the end of the Vovin controversy. In: Smirnov, I. (ed.): Aspekty komparativistiki.3. RGGU: Moscow. 119-258.
  • Everett, C., D. Blasi, and S. Roberts (2015): Climate, vocal folds, and tonal languages: Connecting the physiological and geographic dots. Proceedings of the National Academy of Sciences 112.5. 1322-1327.
  • Gordon, M. and J. Notar (2015): Can systems biology help to separate evolutionary analogies (convergent homoplasies) from homologies?. Progress in Biophysics and Molecular Biology 117. 19-29.
  • Hammarström, H. (2016): There is no demonstrable effect of desiccation. Journal of Language Evolution 1.1. 65–69.
  • Jakobson, R. (1960): Why ‘Mama’ and ‘Papa’?. In: Perspectives in psychological theory: Essays in honor of Heinz Werner. 124-134.
  • Kessler, B. (2001): The significance of word lists. Statistical tests for investigating historical connections between languages. CSLI Publications: Stanford.
  • List, J.-M. (2014): Sequence comparison in historical linguistics. Düsseldorf University Press: Düsseldorf.
  • Lupyan, G. and R. Dale (2010): Language structure is partly determined by social structure. PLoS ONE 5.1. e8559.
  • Meier-Brügger, M. (2002): Indogermanische Sprachwissenschaft. de Gruyter: Berlin and New York.
  • Morrison, D. (2014): Is the Tree of Life the best metaphor, model, or heuristic for phylogenetics?. Systematic Biology 63.4. 628-638.
  • Ringe, D. (1992): On calculating the factor of chance in language comparison. Transactions of the American Philosophical Society 82.1. 1-110.

Wednesday, January 25, 2017

Irony and duality in phylogenetics

Ruben E. Valas and Philip E Bourne (2010. Save the tree of life or get lost in the woods. Biology Direct 5: 44) have an interesting discussion of the relationship between the Tree of Life and the Web of Life. They argue that:
Function follows more of a tree-like structure than genetic material, even in the presence of horizontal transfer ... We propose a duality where we must consider variation of genetic material in terms of networks and selection of cellular function in terms of trees. Otherwise one gets lost in the woods of neutral evolution.

As an aside, they also note:
We must keep in mind the humor of calling the central metaphor for evolution "the tree of life". The phrase first appears in Genesis 2:9 ... There is irony in using the name of a tree central to the creation story to argue against that very myth.

There is clearly a duality in Darwin's theory of descent with modification: the history of variation is well described by a network and the history of selection is well described by a tree.

Tuesday, January 17, 2017

What is the null hypothesis for a phylogeny?

As noted in the previous blog post (Why do we need Bayesian phylogenetic information content?), phylogeneticists rarely consider whether their data actually contain much phylogenetic information. Nevertheless, the existence of information content in a dataset implies the existence of null hypothesis of "no information", relative to the objective of the data analysis.

In this regard, Alexander Suh (2016), in a paper on the phylogenetics of birds, makes two important general points:
  • Every phylogenetic tree hypothesis should be accompanied by a phylogenetic network for visualization of conflicts.
  • Hard polytomies exist in nature and should be treated as the null hypothesis in the absence of reproducible tree topologies.
It is difficult to argue with the first point, of course. However, the second point is also an interesting one, and deserves some consideration. Suh notes that: "In contrast to ‘soft polytomies’ that result from insufficient data, ‘hard polytomies’ reflect the biological limit of phylogenetic resolution because of near-simultaneous speciation". That is, the distinction is whether polytomies result from simultaneous branching events (hard) or from insufficient sequence information (soft).

The matter of a suitable null hypothesis in phylogenetics has been considered before, for example by Hoelzer and Meinick (1994) and Walsh et al. (1999), who come to essentially the same conclusion as Suh (2016). Clearly, a network cannot be the null hypothesis for a phylogeny, and nor can a resolved tree (even partially resolved); the only logical possibility is a polytomy.

However, it seems to me that the current null hypothesis is effectively a soft polytomy, although no hypothesis is ever explicitly stated by most workers. Nevertheless, any evidence to resolve polytomies seems to be accepted, with evidence taken in descending order of strength in order to resolve any conflicting evidence. This inevitably produces a tree that is at least partly resolved, which is the alternative hypothesis.

On the other hand, resolving a hard polytomy requires unambiguous evidence for each branch in the phylogeny. If there is substantial conflict then it can only be resolved as a reticulation, or it must remain a polytomy. The existence of a reticulation, of course, results in a network, not a tree, so that the alternative hypothesis is a network, which may in practice be very tree-like.

So, in phylogenetics we have: null hypothesis = hard polytomy, alternative = network, rather than null hypothesis = soft polytomy, alternative = partially resolved tree.

As a final point, Suh claims that: "Neoaves comprise, to my knowledge, the first empirical example for a hard polytomy in animals." This is incorrect. There is also a hard polytomy at the root of the Placental Mammals, as discussed in this blog post: Why are there conflicting placental roots?


Hoelzer G.A., Meinick D.J. (1994) Patterns of speciation and limits to phylogenetic resolution. Trends in Ecology & Evolution 9: 104-107.

Suh A. (2016) The phylogenomic forest of bird trees contains a hard polytomy at the root of Neoaves. Zoologica Scripta 45: 50-62.

Walsh H.E., Kidd M.G., Moum T., Friesen, V.L. (1999) Polytomies and the power of phylogenetic inference. Evolution 53: 932-937.

Tuesday, January 10, 2017

Why do we need Bayesian phylogenetic information content?

There are many ways to construct a phylogenetic tree, and after we have done so we are usually expected to indicate something about "branch support", such as bootstrap values or bayesian posterior probabilities. Rarely, however, do people indicate whether there is much tree-like phylogenetic information in their dataset in the first place — it is simply assumed that there must be (fingers crossed, touch wood).

Recently, this latter issue has been addressed for bayesian analysis by:
Paul O. Lewis, Ming-Hui Chen, Lynn Kuo, Louise A. Lewis, Karolina Fučíková, Suman Neupane, Yu-Bo Wang, Daoyuan Shi. (2016) Estimating Bayesian phylogenetic information content. Systematic Biology 65: 1009-1023.
They develop a methodology for "measuring information about tree topology using marginal posterior distributions of tree topologies", and apply it to two small empirical datasets. That is, we can now work out something about "[substitution] saturation and detecting conflict among data partitions that can negatively affect analyses of concatenated data."

However, we have long been able to do this with data-display phylogenetic networks. More to the point, we can do it in a second or two, without ever constructing a tree. More pedantically, if the network construction produces a tree, then we know there is tree-like phylogenetic information in the dataset; if we get a network then there is little such information. Equally importantly, the network might tell us something about the patterns of non-tree-likeness, which a single-number measurement cannot.

Let's take the first empirical dataset, as described by the authors:
The five sequences of rpsll composing the data set BLOODROOT [three taxa from the angiosperm family Papaveraceae and two monocots] ... were chosen because they represent a case in which horizontal transfer of half of the gene results in different true tree topologies for the 5′ (219 nucleotide sites) and 3′ (237 nucleotide sites) subsets, which allows investigation of information content estimation in the presence of true conflicting phylogenetic signal. We analyzed each half of the data separately and measured phylogenetic dissonance, which is expected to be high in this case.
Here is the NeighborNet based on uncorrected distances. The idea that there is something non-tree-like about Sanguinaria seems hard to avoid. Indeed, the network pattern makes recombination an obvious first choice, with part of the sequence matching the Papaveraceae (on the left) and part matching the monocots (on the right). This recombination may be due to HGT.

Now for the second dataset:
The data set ALGAE comprises chloroplast psaB sequences from 33 taxa of green algae (phylum Chlorophyta, class Chlorophyceae, order Sphaeropleales) ... The alignments of just the psaB gene ... were chosen because of their deep divergence, which invites hasty judgements of saturation, especially of third codon position sites. We analyzed second and third codon position sites separately ... to assess which subset has more phylogenetic information.
Here are the two NeighborNets based on uncorrected distances. Once again, it is immediately obvious that the third-codon positions have almost no information at all, even for a network, let alone a tree — the terminal branches do not connect in any coherent way. The second-codon positions do have some information, but it is so contradictory that one could not construct a reliable tree. Saturation of nucleotide substitutions is a likely candidate for this situation; and some correction for this saturation would be needed even to construct a reasonable network from these data.

2nd positions:

3rd positions:

Tuesday, January 3, 2017

Phylogenetics versus historical linguistics

Google Trends looks at recent trends in web searches, and it has been used to study patterns in web activity for many concepts. This is similar to The Ngram Viewer in Google Books (see the post Ngrams and phylogenetics). Google Trends aggregates the number of web searches that have been performed for any given search term (or terms), and it can display the results as a time graph, for any given geographical region. The Trends searches are somewhat restrictive, but they may show us something about the period 2004-2016 (inclusive).

So, I thought that it might be interesting to look at a few expressions of relevance to readers of this blog. The Trends graphs show changes in the relative proportion of searches for the given term (vertically) through time (horizontally). The vertical axis is scaled so that 100 is simply the time with the most popularity as a fraction of the total number of searches (ie. the scale shows the proportion of searches, with the maximum always shown as 100, no matter how many searches there were).

As you can see, the term "phylogenetics" has maintained its popularity over "historical linguistics". However, it has decreased in popularity through time much more than has "historical linguistics". Nevertheless, both decreases are very small compared to that for the term "bioinformatics", as discussed in the blog post on Bioinformaticians look at bioinformatics.

It is not necessarily clear to me why many technical terms have decreased in Google searches through time, although there are several possibilities. First, it could be Google itself. The Trends numbers represent search volume for a keyword relative to the total search volume on Google. So, actual search numbers for the technical terms could be increasing while as a fraction of total search volume of the internet they are decreasing, if total Google search volume is increasing. 

Alternatively, Business Insider has noted that "search is facing a huge challenge ... consumers are increasingly shifting [from desktop] to mobile. On mobile, consumers say they just don't search as much as they used to because they have apps that cater to their specific needs. They might still perform searches within those apps, but they're not doing as many searches on traditional search engines". Furthermore, "people are discovering content through social media. The top eight social networks drove more than 30% of traffic to sites in 2014".

The extra raggedness in search popularity in the first couple of years of the graph probably reflects inadequacies in the Google Trends dataset in the early years (as discussed by Wikipedia). The same is true for the next graph, as well.

The "phylogenetic tree" searches have been more popular than "evolutionary tree", just as was true for the Google Books usage discussed in the post Ngrams and phylogenetics. However, the "phylogenetic tree" searches show a distinctly bimodal pattern every year. This presumably reflects teaching semesters — few people search for technical terms out of term time!

Unfortunately, it is not possible to look at the term "phylogenenetic network", because Google Trends tells me that there is "Not enough search volume to show results". How rude!