Tuesday, March 28, 2017

Why we need alignments in historical linguistics

Alignments have been discussed quite a few times in this blog. They are so extremely common in molecular biology that I doubt that there are any debates about their usefulness, apart from certain attempts to improve the modelling, especially in cases of non-colinear patterns (Kehr et al. 2014), or to speed up computation (Mathura and Adlakha 2016). In linguistics, on the other hand, alignments are rarely used, although initial attempts to arrange homologous words in a matrix go back to the early 20th century, as you can see from this example taken from Dixon and Koerber (1919: 61):

Early alignment from Dixon and Kroeber (1919)

This example is rather difficult to read for those not familiar with the annotation. The authors group homologous words across different indigenous languages from California. The group labels of the languages under investigation are given in abbreviated form at the very left of the matrix, and the actual varieties are listed in the next column. What follows is the actual alignment, along with comments in the last column. Regarding the alignments, the authors note on page 55:
A number of sets of cognates have been taken from their numbered place in this list and put at the end to allow of their being printed in columnar form, with a view to bringing out parallelisms that otherwise might fail to impress without detailed analysis and discussion. (Dixon and Kroeber 1919: 55)
In my opinion, this expresses nicely why alignments should be used more often in linguistics — due to the problem that our "alphabets" (the sound systems of languages) are undergoing constant change (see this earlier post for details regarding this claim), we need to infer both the scoring function between different sounds across different languages, and the alignment at the same time. If we look at the similarities the authors spotted, it should become obvious what I mean.

I am not yet sure how to interpret the data exactly, but if I am not mistaken, the authors claim that each of the column contains homologous material. So, they find a similarity between kaha in the first row (the language is Northern Wintun, according to the key to abbreviations in the book), and tu in the last row (Monterey Costanoan). The last column shows suffixes, which I think the authors exclude from their analysis, but I could not find additional information confirming this in their book.

The comment column illustrates another problem of representation, namely that the authors do not know how to handle cases of metathesis (or transpositions) consistently. The transposition of the parts of words is a process that is quite frequent in language evolution. It is very frequent in compounds consisting of modifier and modified, such as milk coffee in English, where milk modifies the coffee, while French, for example, puts the modifier after the main noun, expressing this as café au lait.

Nowadays, we can handle these cases consistently in linguistics, both in our data annotation and in the alignments, and we can even search for the structures automatically (see List et al. 2016). One hundred years ago, when Dixon and Kroeber worked out their comparison of the languages in California, they were pioneers who tried to increase the transparency of our discipline, and it is clear that their solutions are not completely satisfying from today's perspective.

It is extremely surprising for me that, despite these early attempts to make our homology judgments in linguistics more transparent, the practice of phonetic alignments is still rarely used by historical linguists. Indeed, the majority of them even think that it is a waste of time, or only useful for the purpose of teaching.

I was reminded of this when I looked at a recent proposal by Bengtson (2017, see also this blog for details) for deep genetic connections between Basque and North Caucasian languages. Note that the Basque language is traditionally considered as an isolate, i.e. a language whose nearest relatives we cannot find among the languages in the world. Many linguists have attempted to solve this puzzle by proposing various hypotheses (see Forni 2013 for an example of attempting to link Basque with Indo-European). Bengtson proposes various types of evidence, which I cannot really judge, as I do not know the languages under comparison, but finally, he also shows a list with potential homologs between Basque and North Caucasian varieties, which you find below.

Potential homologs between Basque and North Caucasian languages (Bengtson 2017)

If you are not a trained historical linguistic, and thus do not know what to do with this table, be assured that many historical linguists will feel similarly. As a rough explanation: the concepts are supposed to be very, very stable, being drawn from Sergey Yakhontov's list of 35 ultra-stable concepts, and I think that all words in one row are supposed to be etymologically related — that is, they should be potential homologs across all of the languages. If word forms are preceded by the asterisk symbol (*), this means that they are reconstructed, i.e. not reflected in written sources. But that is all I can tell you for the moment. Where I should start the comparison between the words remains a mystery for me, as I do not know which parts are supposed to be similar. Alignments would help us to see immediately where the author thinks that the historical similarities can be found — that is, we would see, which parts of the words are supposed to be homologous.

At this point in the post, I originally planned to provide you with an alignment of Bengtson's table, in order to illustrate the benefits of alignment in linguistics. Unfortunately, I had to admit to myself that I cannot do this, as I simply do not know where to align the words (apart from some rare trivial cases in the table).

I really hope that this will change in the future. Too often, our hypotheses in linguistics suffer from insufficient transparency with regards to the "proofs" and the evidence. I agree that it is very difficult to come up with good alignments in linguistics, especially if one regards cases of metathesis, unrelated parts, and general uncertainty. However, instead of giving in to the problem, we should follow the pioneering work of Dixon and Kroeber, and try to improve the way we present our data to both our colleagues and a broader public.

Theories such as the link between Basque and the North Caucasian languages are usually highly disputed in historical linguistics, and I do not know of any long range proposal that has gained broad acceptance during the last 50 years. Yet, maybe this is not because the proposals are not valid, but simply because those who are proposing these theories have failed to present their findings in a transparent and testable way.

  • Bengtson, J. (2017) The Euskaro-Caucasian Hypothesis. Current model. PDF.
  • Dixon, R. and A. Kroeber (1919) Linguistic families of California. University of California Press: Berkeley.
  • Forni, G. (2013) Evidence for Basque as an Indo-European language. The Journal of Indo-European Studies 41.1 & 2: 1-142.
  • Kehr, B., K. Trappe, M. Holtgrewe, and K. Reinert (2014) Genome alignment with graph data structures: a comparison. BMC Bioinformatics 15.1: 99.
  • List, J.-M., P. Lopez, and E. Bapteste (2016) Using sequence similarity networks to identify partial cognates in multilingual wordlists. In: Proceedings of the Association of Computational Linguistics 2016 (Volume 2: Short Papers). Association of Computational Linguistics, pp. 599-605.
  • Mathur, R. and N. Adlakha (2016) A graph theoretic model for prediction of reticulation events and phylogenetic networks for DNA sequences. Egyptian Journal of Basic and Applied Sciences 3.3: 263-271.

Tuesday, March 21, 2017

Computer viruses and phylogenetic networks

I have written before about the Phylogenetics of computer viruses. This is an example of the use of phylogenetics as a metaphor for the history of non-biological objects. By analogy, computer viruses and other malware can be seen to be phylogenetically related, because new viruses are usually generated using existing malicious computer code — that is, one virus "begets" another virus due to changes in its intrinsic attributes. In this sense the analogy is helpful, although there is no actual copying of anything resembling a genome — this is phenotype evolution not genotype evolution.

Furthermore, the model of historical change in computer viruses is often the same as that for biological viruses — recombination rather than substitution. That is, like real viruses, new computer viruses are often created by recombining chunks of functional information from pre-existing viruses, rather than by an accumulation of small changes. Coherent subsets of the current computer code are combined to form the new programs.

From this perspective, it is unexpected that the principal phylogenetic model in the study of computer viruses has been a tree rather than a network — a recombinational history requires a network representation, not a tree, and thus malware evolution is not tree-like. As noted by Liu et al. (2016): "Although tree-based models are the mainstream direction, they are not suited to represent the reticulation events which have happened in malware generation."

In my previous (2014) post, I noted only two known papers that used a network rather than a tree to represent malware evolution:
  • Goldberg et al. (1996) analyzed their data using what they call a phyloDAG, which is a directed network that can have multiple roots (it appears to be a type of minimum-spanning network; described in more detail in Phylogenetics of computer viruses);
  • Khoo & Lió (2011) used splits graphs rather than unrooted trees to display their data, although they did not specify the algorithm for producing their networks.
Unfortunately, malware researchers have continued to pursue the idea that a phylogeny is simply a form of classification, and have therefore stuck to the idea of producing a tree-like phylogeny using some form of hierarchical agglomerative clustering algorithm (eg. Bernardi et al. 2016).

More positively, however, some papers have appeared that have instead pursued the idea of using a network model rather than a tree:
  • Liu et al. (2016) provided median-joining networks, which are unrooted splits graphs, to display relationships within each of three different virus groups;
  • Jang et al. (2013) infered a directed acyclic graph using a minimum spanning tree algorithm, with a post-processing step to allow nodes to have multiple parents;
  • Anderson et al. (2014) presented a novel algorithm based on a graphical lasso, which builds the phylogeny as an undirected graph, to which directionality is then added using a post-hoc heuristic;
  • Oyen et al. (2016) "present a novel Bayesian network discovery algorithm for learning a DAG [directed acyclic graph] via statistical inference of conditional dependencies from observed data with an informative prior on the partial ordering of variables. Our approach leverages the information on edge direction that a human can provide and the edge presence inference which data can provide."
It is important to note that only the works producing a directed graphs can represent a phylogeny — the other works produce unrooted graphs that may or may not reflect phylogenetic history. The bayesian work of Oyen et al. (2016) is particularly interesting:
Directionality is inferred by the learning process, but in many cases it is difficult to infer, therefore prior information is included about the edge directions, either from human experts or a simple heuristic. This paper introduces a novel approach to combining human knowledge about the ordering of variables into a statistical learning algorithm for Bayesian structure discovery. The learning algorithm with our prior combines the complementary benefits of using statistical data to infer dependencies while leveraging human knowledge about the direction of dependencies.


Anderson B, Lane T, Hash C (2014) Malware phylogenetics based on the multiview graphical lasso. Lecture Notes in Computer Science 8819: 1-12.

Bernardi ML, Cimitile M, Mercaldo F (2016) Process mining meets malware evolution : a study of the behavior of malicious code. Proceedings of the 2016 Fourth International Symposium on Computing and Networking, pp 616-622. IEEE Computer Society Washington, DC.

Goldberg LA, Goldberg PW, Phillips CA, Sorkin GB (1996) Constructing computer virus phylogenies. Lecture Notes in Computer Science 1075: 253-270. [also Journal of Algorithms (1998) 26: 188-208]

Jang J, Woo M, Brumley D (2013) Towards automatic software lineage inference. Proceedings of the Twenty-Second USENIX Conference on Security, pp 81-96. USENIX Association, Berkeley, CA.

Khoo WM, Lió P (2011) Unity in diversity: phylogenetic-inspired techniques for reverse engineering and detection of malware families. Proceedings of the 2011 First Systems Security Workshop (SysSec'11), pp 3-10. IEEE Computer Society Washington, DC.

Liu J, Wang Y, Wang Y (2016) Inferring phylogenetic networks of malware families from API sequences. Proceedings of the 2016 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, pp 14-17. IEEE Computer Society Washington, DC.

Oyen D, Anderson B, Anderson-Cook C (2016) Bayesian networks with prior knowledge for malware phylogenetics. The Workshops of the Thirtieth AAAI Conference on Artificial Intelligence Artificial Intelligence for Cyber Security: Technical Report WS-16-03, pp 185-192. Association for the Advancement of Artificial Intelligence, Palo Alto, CA.

Tuesday, March 14, 2017

Detecting introgression versus hybridization

There has been considerable interest in recent years in developing methods that will detect hybridization in the presence of incomplete lineage sorting (ILS), which will allow the construction of a realistic hybridization network. Clearly, both ILS and hybridization create conflicting gene trees, which will lead to a very complex data-display network. However, if the ILS signals in the data can be used to construct a small collection of gene-tree groups, in which the gene trees within each group are congruent with a single species tree (under the ILS model), then the incongruence between groups can be used to construct a hybridization network. This network will then be an hypothesis for a realistic evolutionary network.

Recently, a paper has appeared that uses simulations to evaluate several of these methods:
Olga K. Kamneva and Noah A. Rosenberg (2017) Simulation-based evaluation of hybridization network reconstruction methods in the presence of incomplete lineage sorting. Evolutionary Bioinformatics 2017:13.
I am not a great fan of simulations, because they exist under very restricted and usually unrealistic mathematical conditions. They are, however, useful for exploring the mathematical properties of various methods, even if they are hard to connect to the biological properties.

My interpretations of the results from the particular scenarios explored by Kamneva and Rosenberg are:
  1. Most of the methods improve as the internal network edges increase in length.
  2. Most of the methods improve as the number of gene trees increases.
  3. Under good conditions the maximum-likelihood methods do better than the parsimony and consensus methods.
  4. The maximum-likelihood methods are more affected by gene-tree error than are the other methods.
  5. There are conditions under which none of the methods work well.
I doubt that any of this is controversial, in the sense that model-based methods usually work well when their models apply, but not necessarily otherwise. Reality is more complex than the models, and so the methods are likely to fail for real data.

For me, the most interesting part of the paper is the examination of balanced versus skewed parental contributions to the hybrid taxon. A balanced genetic contribution in the simulations is analogous to homoploid or polyploid hybridization, whereas a skewed contribution is analogous to introgression or horizontal gene transfer (HGT). The simulations seem to show that the methods examined do not deal very well with skewed contributions.

So, these methods may literally be hybridization-network methods only, with separate network methods needed for detecting introgression or HGT — for example, the admixture methods used for genomes (see the recent post on Producing admixture graphs).

This would mean that we cannot first produce networks with reticulations, and then afterwards explore what is causing the reticulations. Instead, we will need to decide on the possible biological mechanisms of reticulation before the analysis, and then mathematically explore possible networks that reflect those mechanisms.

This is not an issue for constructing trees, of course, since the only recognized mechanisms are speciation and extinction, both of which are explored post hoc rather than a priori. This is an important difference of networks versus trees.

Tuesday, March 7, 2017

Roundels and family trees

I have written before about the slow development of what has come to be known as the "family tree", including reducing human network relationships to a tree-like form (Reducing networks to trees), and presenting it as an actual tree (Drawing family trees as trees), rooted at the base (Does it matter which way up a tree is drawn?).

Most of the early representations of pedigrees had the people's names enclosed in a circle, called a "roundel", and it was these roundels that were connected to show the family relationships. One of the steps on the way to a tree was thus dropping this idea, so that the names could be connected directly.

Some of the diagrams with roundels that I have covered include:

c. 400 CE — The genealogy of Jesus Christ, Part I, Part II, Part III
c. 1000 CE — Genealogy of Cunigunde of Luxembourg
c. 1140 CE — Genealogy of the Carolingians
c. 1185 CE — Genealogy of the Welf dynasty
c. 1237 CE — Genealogy of the Ottonian dynasty

Interestingly, the earliest pedigrees that do not have roundels also date from this early period. As noted by Nathaniel Lane Taylor, the importance of this development is that: "the scribe relies on the power of the names themselves to anchor a diagram on the page, with lines simply taking the place of any syntax needed to describe the filiation." That is, no abstract iconography is needed.

I have already illustrated the earliest known example:
c. 1121 CE — The genealogy of Lambert of Saint-Omer

Taylor provides links to illustrations of the next known example:
   c. 1128, John of Worcester, Chronicle of World and English History (Corpus Christi College MS 157).
This book contains eight genealogies of Anglo-Saxon and Norman kings (pp. 47-54), one of which is shown above.

Taylor also refers to "one of the Arabic stemmata" illustrated in:
   Arthur Watson (1934) The Early Iconography of the Tree of Jesse. Oxford University Press,
I have not seen this book, but the illustrations are apparently confined to those from the 12th century, making the diagram contemporaneous with the two listed above. The Tree of Jesse normally appears in Medieval Christian art as a richly illustrated genealogy of Jesus in illuminated manuscripts, but apparently this one was an exception.