Tuesday, October 24, 2017

Let's distinguish between Hennig and Cladistics

There are theoretically an infinite number of ways to mathematically analyze any set of data, and yet it is unlikely that all (or even most) of these will have any relevance to a study of biology. In this sense, the philosophy of phylogenetic analysis needs to show that there is a strong basis for treating any particular mathematical analysis as having biological relevance. This is a point that I have discussed before: Is there a philosophy of phylogenetic networks?

Willi Hennig clearly has some role to play here. However, his ideas are often treated as being solely related to one particular form of phylogenetic analysis — cladistics. In this post I will point out that his work has a much greater relevance than that — he provides a crucial logical step that applies to all phylogenetic inference.

The steps of phylogenetic inference are shown in the first figure, which is taken from my earlier post. The first step is a mathematical inference from character data to tree/network; the second step is a logical inference that the mathematical summary resulting from the first step has some biological relevance; and the third step is a practical inference that the biological summary applies to whole organisms as well as to their characters.

The logic of phylogeny reconstruction


Hennig's concept of "shared innovations" (which he called synapomorphies) is the only thing that allows us to use mathematical phylogenetics in the pursuit of genealogical history. Without this concept, the mathematics could just produce something like the arithmetic mean, a mathematical concept with no necessary connection to real objects (unlike the mode, which will always be a real observation, or the median, which usually is). The idea of shared innovations is what leads us to believe that the mathematical summary (whether tree or network) might actually also be a close approximation to the real thing. This is a separate concept from cladistics, which is simply a mathematical algorithm based on a particular optimality criterion (parsimony), just like maximum likelihood or Bayesian approaches. So, shared innovations underlie the use of parsimony, likelihood and distance methods alike; Willi Hennig (and, before him, Karl Brugmann in linguistics) is relevant no matter what algorithm we use.

Mathematical analyses

If they are to represent genealogical history, then all trees and networks in phylogenetics will be directed acyclic graphs (DAGs), mathematically. There are many ways to produce a DAG, some of which have had varying degrees of popularity in phylogenetics, and some of which have not been used at all.

To produce an acyclic line graph (in which nodes are connected by edges), we can start with character data or distance data. We can then use various optimality criteria to choose among the many graphs that could apply to the data, such as parsimony (usually associated with cladistics) and likelihood (either as maximum likelihood or integrated likelihood). We can also ensure that the graph is directed (i.e. the edges have arrows), by choosing a root location, either directly as part of the analysis or a posteriori by specifying an outgroup.

All of these approaches are mathematically valid, as are a number of others. They all provide a mathematical summary of the data. This is step one of the phylogenetic inference, as illustrated above.
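As a minimal sketch of what "choosing among graphs by an optimality criterion" means, the following counts the minimum number of character-state changes on two candidate rooted trees, using Fitch's counting procedure for unordered characters. The taxa, the character matrix and both trees are invented for illustration; real analyses search over very many trees with dedicated software.

```python
def fitch_score(tree, states):
    """Return (state_set, changes) for one character on a rooted binary tree.

    `tree` is either a leaf name (str) or a 2-tuple of subtrees.
    `states` maps each leaf name to its observed character state.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_score(tree[0], states)
    right_set, right_cost = fitch_score(tree[1], states)
    common = left_set & right_set
    if common:
        # The two subtrees can share a state: no extra change is needed here
        return common, left_cost + right_cost
    # No shared state: one change must have happened below this node
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, matrix):
    """Sum the Fitch change counts over all characters (columns) of `matrix`."""
    n_chars = len(next(iter(matrix.values())))
    total = 0
    for i in range(n_chars):
        column = {taxon: row[i] for taxon, row in matrix.items()}
        total += fitch_score(tree, column)[1]
    return total


# Hypothetical character matrix: four taxa, three binary characters
matrix = {
    "A": "110",
    "B": "110",
    "C": "001",
    "D": "001",
}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))

print("length of tree 1:", tree_length(tree_1, matrix))
print("length of tree 2:", tree_length(tree_2, matrix))
```

Under the parsimony criterion the shorter tree is preferred. Note that this is purely step one: the calculation says nothing, by itself, about whether the preferred tree is a genealogy.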

But what of step two? Biologists need a summary of the data that has biological relevance, as well, not just mathematical relevance. This has long been a thorn in the side of biologists — just because they can perform a particular mathematical calculation does not automatically mean that the calculation is relevant to their biological goal.

Consider the simplest mathematics of all — calculating the central location of a set of data. There are many ways to do this, mathematically — indeed, there are technically an infinite number of ways. These include the mode, the median, the arithmetic mean, the geometric mean, and the harmonic mean. All of these are mathematically valid, but do any of them produce a central location that describes biology?

The mode does, because it is the most common observation in the dataset. The median usually does, because it is the "middle" observation in the dataset. But what of the various means? There is no necessary reason for them to describe biology, although they are perfectly valid mathematics.

For instance, the modal number of children in modern families is 2, meaning that more families have this number than any other number of children. The median number is also 2, meaning that half of the families have 2 or fewer children and half of the families have 2 or more. So, these mathematical summaries are also descriptions of real families. But the means are not. For example, the arithmetic mean number of children is 2.2, which does not describe any real family. If you ever find a family with 2.2 children, then you should probably call the police, to investigate!
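The point can be checked with a few lines of Python. The ten family sizes below are invented, chosen only so that they reproduce the summary values quoted above; the check simply asks whether each summary value is itself one of the observations.

```python
from statistics import mean, median, mode

# Hypothetical numbers of children in ten families (illustrative values only)
children = [0, 1, 2, 2, 2, 2, 3, 3, 3, 4]

summaries = {
    "mode": mode(children),
    "median": median(children),
    "arithmetic mean": mean(children),
}

for name, value in summaries.items():
    # Does this summary value correspond to an actual observed family?
    observed = value in children
    print(f"{name}: {value} (matches a real family: {observed})")
```

For these data the mode and median coincide with observed family sizes, while the arithmetic mean does not.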

Mathematically valid data summaries have a lot of relevance, but they do not necessarily describe biological concepts. I can use the mean number of children per local family to estimate the number of schools that I might need in that area, but I cannot use it to describe the families themselves. This is a classic case of "horses for courses".


So, in phylogenetics we need some piece of logic that says that we can expect our DAG (a mathematical concept) to be a representation of a genealogy (a biological concept). Our genealogical estimate may still be wrong (and indeed it probably will be, in some way!), but that is a separate issue. The DAG needs to be a reasonable representation, not a correct one. Correctness needs to be a result of our data, not our mathematics.

This is where Willi Hennig comes in. Hennig's ideas, and the ideas derived from them, are illustrated in the second figure.

Hennig explicitly noted that characters have a genealogical polarity, with ancestral states being modified into derived states through evolutionary time. Furthermore, he noted that it is only the derived states that are of relevance to studying evolutionary history: the sharing of derived character states reveals evolutionary history, but shared ancestral states tell us nothing.

We have done two things with these Hennigian ideas. Some people have been interested in classification, for which the concept of monophyly is relevant, and others have been interested in reconstructing the genealogies, rather than simply interpreting them.


Reconstructing a tree-like phylogenetic history is conceptually straightforward, although it took a long time for someone (Hennig 1966) to explain the most appropriate approach. Interestingly, the study of historical linguistics has developed the same methodology (Platnick and Cameron 1977; Atkinson and Gray 2005), thus independently arriving at exactly the same solution to what is, in effect, exactly the same problem. From this point of view, the logical inference itself is uncontroversial; and its generic nature means that it can be used for any objects with characteristics that can be identified and measured, and that follow a history of descent with modification. I will, however, discuss this in terms of biology — you can make the leap to other objects yourself.

The objective is to infer the ancestors of the contemporary organisms, and the ancestors of those ancestors, etc., all the way back to the most recent common ancestor of the group of organisms being studied. Ancestors can be inferred because the organisms share unique characteristics (shared innovations, or shared derived character states). That is, they have features that they hold in common and that are not possessed by any other organisms. The simplest explanation for this observation is that the features are shared because they were inherited from an ancestor. The ancestor acquired a set of heritable (i.e. genetically controlled) characteristics, and passed those characteristics on to its offspring. We observe the offspring, note their shared characteristics, and thus infer the existence of the unobserved ancestor(s). If we collect a number of such observations, what we often find is that they form a set of nested groupings of the organisms.
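The nesting of groups defined by shared derived states can be illustrated with a small sketch. The matrix below is hypothetical, and it assumes that character polarity is already known ("0" ancestral, "1" derived); the code only collects the groups and tests whether they form a hierarchy.

```python
from itertools import combinations

# Hypothetical matrix: rows are taxa, columns are characters.
# "0" = ancestral state, "1" = derived state (polarity assumed known).
matrix = {
    "A": "1100",
    "B": "1100",
    "C": "1010",
    "D": "0001",
    "E": "0001",
}


def derived_groups(matrix):
    """Return, per character, the set of taxa sharing the derived state."""
    n_chars = len(next(iter(matrix.values())))
    groups = []
    for i in range(n_chars):
        group = frozenset(t for t, row in matrix.items() if row[i] == "1")
        if len(group) > 1:  # a state unique to one taxon groups nothing
            groups.append(group)
    return groups


def is_nested(groups):
    """True if every pair of groups is either disjoint or one contains the other."""
    return all(
        a.isdisjoint(b) or a <= b or b <= a
        for a, b in combinations(groups, 2)
    )


groups = derived_groups(matrix)
for g in groups:
    print(sorted(g))
print("nested hierarchy:", is_nested(groups))
```

When the groups overlap without nesting (for example {A, B} and {B, C}), the characters conflict, and no single tree can display all of them as shared innovations.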

Hennig, in particular, was interested in the interpretation of phylogenetic trees, rather than their reconstruction. He did this interpretation in terms of monophyletic groups (also called clades), each of which consists of an ancestor and all of its descendants. These are natural groups in terms of their evolutionary history, whereas other types of groups (eg. paraphyletic, polyphyletic) are not. So, a phylogenetic tree consists of a set of nested clades, which are the groups that are represented and given names in formal taxonomic schemes.
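As a sketch of this definition, a rooted tree can be written as nested tuples, and a group is then monophyletic exactly when it equals the full set of tips below some node. The tree and taxon names here are invented.

```python
def clades(tree):
    """Return (leaf_set, list_of_clades) for a rooted tree given as nested tuples."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
        return leaves, [leaves]
    leaves = frozenset()
    found = []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)  # the node itself defines a clade: all tips below it
    return leaves, found


def is_monophyletic(group, tree):
    """A group is monophyletic on `tree` if it equals the tip set of some node."""
    return frozenset(group) in clades(tree)[1]


# Hypothetical rooted tree: ((A,B),C) is sister to (D,E)
tree = ((("A", "B"), "C"), ("D", "E"))

print(is_monophyletic({"A", "B"}, tree))
print(is_monophyletic({"A", "B", "C"}, tree))
print(is_monophyletic({"B", "C"}, tree))  # leaves out A, a descendant of the same ancestor
print(is_monophyletic({"C", "D"}, tree))
```

The clades returned for any tree are necessarily nested or disjoint, which is why a tree maps so directly onto a hierarchical classification.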

For phylogenetic trees, there is thus a rationale for treating a tree diagram as a representation of evolutionary history. For example, in a study of a set of gene sequences, first we produce a mathematical summary of the data based on a quantitative model. We then infer that this summary represents the gene history, based on the Hennigian logic that the patterns are formed from a nested series of shared innovations (this is a logical inference about the biology being represented by the mathematical summary). We then infer that this gene history represents the organismal history, based on the practical observation that gene changes usually track changes in the organisms in which they occur (ie. a pragmatic inference).

Mis-interpretations of Hennig

What I have said above has led to various mis-interpretations of Hennig's role in phylogenetics.

First, he did not propose any specific method for producing a phylogenetic tree (or network). He was concerned about the logic of the diagram, not how to get it in the first place. He distinguished shared derived character states, or shared innovations (he called them synapomorphies), from shared ancestral states (symplesiomorphies), and noted that only the former are relevant for phylogenies. So, distance methods will also work in phylogenetics provided the distances are based on homologous apomorphic features. If they are not so based, then they are simply mathematical constructions, which may or may not represent anything to do with phylogeny. Distances estimated from plesiomorphic features can be used to construct a tree, obviously, but there is no reason to expect that tree to represent a phylogeny.

Second, parsimony analysis was developed independently of Hennig, by people such as Farris, Nelson and Platnick. This came to be called cladistics, intended by Ernst Mayr to be a derogatory term for the new form of analysis. The fact that the Willi Hennig Society is associated exclusively with cladistics has nothing to do with Hennig himself, or with the logic of his approach to phylogenetics. You need to clearly distinguish between Hennig and Cladistics!

Third, Hennig was more interested in classification than he was in phylogeny reconstruction. This seems to cause confusion for gene jockeys and linguists, in particular, who often associate phylogenetics solely with classification (see, for example, Felsenstein 2004, chapter 10). Sure, Hennig was primarily interested in the interpretation of phylogenies, rather than their construction. However, that was simply a personal point of view. The logic of his work transcends his own personal interests. Without him, no genealogical reconstruction makes logical sense, in genetics or linguistics. Mathematical methods for summarizing data were developed independently in genetics and linguistics, just as they were in other areas of biology and also in stemmatology. However, without the concept of shared innovations, these methods remain mathematical summaries, not estimates of genealogies.

Finally, Hennig's work was not original, being naturally a synthesis of much previous work. In biology, the work of Walter Zimmermann is frequently noted (eg. Donoghue & Kadereit 1992), and in linguistics the work of Karl Brugmann is obviously important (see Mattis' post Arguments from authority, and the Cladistic Ghost, in historical linguistics). Sometimes, wheels have to be re-invented many times before the general populace comes to realize just how important they are.


Atkinson QD, Gray RD (2005) Curious parallels and curious connections — phylogenetic thinking in biology and historical linguistics. Systematic Biology 54: 513-526.

Donoghue MJ, Kadereit W (1992) Walter Zimmermann and the growth of phylogenetic theory. Systematic Biology 41: 74-85.

Felsenstein J (2004) Inferring Phylogenies. Sinauer Associates, Sunderland MA.

Hennig W (1966) Phylogenetic Systematics. University of Illinois Press, Urbana IL. [Translated by DD Davis and R Zangerl from W. Hennig 1950. Grundzüge einer Theorie der Phylogenetischen Systematik. Deutscher Zentralverlag, Berlin.]

Platnick NI, Cameron HD (1977) Cladistic methods in textual, linguistic, and phylogenetic analysis. Systematic Zoology 26: 380-385.

Tuesday, October 17, 2017

Networks, not trees, identify "weak spots" in phylogenetic trees

A major application of networks in exploratory data analysis is to identify signal oddities and visualise ambiguity. Thus, they would be the natural choice when it comes to pinpointing weaknesses in phylogenetic trees. This is particularly so when the aim is to propose a relatively stable (and intuitive) ‘phylogenetic’ (identifying likely monophyla sensu Hennig) or ‘cladistic’ (clade-based) systematic framework for a group of organisms. In other words, whenever we try to translate branching patterns into monophyletic groups.

‘Weak spots’ in phylogenetic trees are relationships with either little or ambiguous support, or branching patterns strongly affected by sampling (taxa and characters). These are topological phenomena that are the rule rather than the exception when studying extinct groups of organisms (e.g. spermatophytes or ‘long-necks’).

One example involves what is probably one of the fiercest groups of marine predators: the mosasaurs (mosasauroid squamates; Madzia & Cau 2017). I will discuss this example in this post.

Fig. 1. The tree-based systematic groups of mosasaurs (Mosasauroidae plus ancient relatives) when applying Madzia & Cau's nomenclature to their Bayesian-inferred majority-rule consensus tree. Most higher taxa (above genus) are "branch-based", except for the "node-based" Mosasauridae, Russellosaurina (wrong suffix, kept as rank-less taxon by the authors), Tethysaurinae, and Yaguarasaurinae. Genera represented by a single OTU in blue, 'non-monophyletic' genera in red. Thick branches received near unambiguous support (PP ≥ 0.95)

Madzia & Cau “re-examined a data set that results from modifications assembled in the course of the last 20 years and performed multiple parsimony analyses and Bayesian tip-dating analysis” in order to identify the ‘weak spots’ and take them into account when providing a revised cladistic nomenclature of “the ‘traditionally’ recognized mosasauroid clades” (Fig. 1). They define possibly monophyletic groups via recurring branching patterns in their various trees, along with the position of key taxa in those trees (see their chapter Phylogenetic [in fact: cladistic] nomenclature). This allows the groups to “self-destruct” when not forming a clade, and to be replaced.

Although the combination of unweighted and differentially weighted parsimony and Bayesian tip-dating analyses could be methodologically interesting (when examined in detail), it is hardly necessary in order to identify weaknesses and strengths of the data matrix used – going back to Bell 1997, and being emended since (see Introduction of Madzia & Cau) – to define possible monophyletic (or other) groups. A quick and simple neighbour-net splits graph would have done the trick, too.

The situation regarding tree inference, e.g. parsimony

The mosasaurid data matrix suffers from the typical problems: ambiguous, highly homoplasious signals, paired with a few missing data issues (typically lack of data overlap). Adding to this is the miscellaneous signal from taxa regarded as outgroups (here: ancient potential members of the mosasaurs): Adriosaurus suessi (which the authors used to root their trees), Dolichosaurus longicollis, and Pontosaurus kornhuberi. Accordingly, standard parsimony analysis fails to provide a useful result for about half of the taxa, when documented in the traditional fashion (see my last post); a strict consensus cladogram of all most parsimonious trees (MPTs) is shown in Fig. 2A.

Fig. 2 Strict consensus graphs based on 152 equally (most) parsimonious trees inferred from the matrix (all characters treated as unweighted and unordered) using PAUP*. Green, unambiguous placement/grouping; turquoise, weakly 'rogue-ish'; red, rogue taxa

But even the Adams consensus tree (Fig. 2B) is more informative, and the (near) strict consensus network (only showing splits that occur in more than a single MPT) highlights where the equally parsimonious solutions agree and disagree, and which taxa act more ‘rogueish’ than others (Fig. 2C). Weighting and Bayesian inference naturally produce more resolved trees; but the question remains whether the overall higher to unambiguous branch support sufficiently reflects the signal in the character matrix.
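The difference between a strict consensus and a (near) strict consensus network can be sketched by treating each tree as a set of splits (taxon bipartitions). The four trees below are invented, not taken from the mosasaur MPT sample, and each split is stored as the side that excludes a fixed reference taxon so that identical splits compare equal.

```python
from collections import Counter


def split(side, taxa):
    """Normalise a bipartition as the side that excludes a fixed reference taxon."""
    side = frozenset(side)
    reference = min(taxa)
    return frozenset(taxa) - side if reference in side else side


def show(s, taxa):
    """Format a normalised split as 'AB|CDEF'."""
    other = frozenset(taxa) - s
    return "".join(sorted(other)) + "|" + "".join(sorted(s))


taxa = {"A", "B", "C", "D", "E", "F"}

# Hypothetical sample of four equally parsimonious trees,
# each reduced to its three non-trivial splits
mpts = [
    {split("AB", taxa), split("ABC", taxa), split("EF", taxa)},
    {split("AB", taxa), split("ABC", taxa), split("DE", taxa)},
    {split("AB", taxa), split("ABD", taxa), split("EF", taxa)},
    {split("AB", taxa), split("ABD", taxa), split("CE", taxa)},
]

counts = Counter(s for tree in mpts for s in tree)

# Strict consensus: only splits found in every tree
strict = {s for s, n in counts.items() if n == len(mpts)}
# (Near) strict consensus network: splits found in more than a single tree
near_strict_network = {s for s, n in counts.items() if n > 1}

print("strict consensus:", sorted(show(s, taxa) for s in strict))
print("consensus network:", sorted(show(s, taxa) for s in near_strict_network))
```

In this toy sample the strict consensus keeps a single split, whereas the network also retains the two mutually incompatible splits ABC|DEF and ABD|CEF, which is exactly the kind of disagreement a cladogram hides and a box in the network displays.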

Data sets of extinct organisms need neighbour-nets, to start with

The consensus network of the most (equally) parsimonious trees (MPT; Fig. 2C) informs us about equally valid topological alternatives and ‘rogueness’. Using the branch-length averaging option, we can visualize character support to some degree for the alternatives. But there is a quicker and more comprehensive alternative, when it comes to (tree-)incompatible signal.

The neighbour-net (Fig. 3) directly identifies potentially strong signals and ‘weak spots’. First, we can see that the outgroup taxa are not clustered, which is never good. Obviously, they are not too useful to infer an ingroup root (Madzia & Cau discuss the outgroup sampling bias). Only one of the outgroups, Pontosaurus, is placed close to the Aigialosauridae, which collects the earliest diverging Mosasauroidea lineage (see Fig. 1). Their signals are likely to mess up any tree inference (Fig. 2).

Fig. 3 The neighbour-net based on simple (Hamming) mean distances inferred from Madzia & Cau's matrix. Colouring as in Fig. 1

Trivial (data-wise) lineages are e.g. the Tylosaurinae, supported by a very long narrow branch: this lineage is characterised by high group coherence and distinctness to any other taxon/taxon group, and will inevitably have high support and be placed close (phylogenetically and absolute) to the Plioplatecarpinae (Figs 2, 3). The Mosasaurinae are equally well circumscribed, with only one putative member, Dallasaurus, being substantially apart from the rest, and bridging Mosasaurinae and Halisaurinae, their putative sisters. Hence, trees will favour splits rejecting the "Natantia" group unless Dallasaurus is excluded from the inference.

Species of the same genera are conspicuously grouped; this differs from Madzia & Cau’s trees, where Mosasaurus or Prognathodon species are collected in the same subtrees, but are “non-monophyletic”, i.e. do not form an exclusive clade. Based on the neighbour-net, the main reason may be terminal noise and resulting flat likelihood surfaces (hence, low posterior probabilities). The placement of the older members of the mosasaurs (classified as Tethysaurinae and Yaguarasaurinae) to each other, and the slightly older outgroup taxa, is clearly difficult with this matrix, even though there is no ambiguity, e.g. in the MPT sample (Fig. 2). Hence, the branch-lengths do not reflect synapomorphies or rarely shared apomorphies in this subtree, but instead shared convergences — a perfect phylogeny always generates a perfectly tree-like distance matrix.

Oddly placed taxa in the neighbour-net? Probably unrepresentative distances; and the quick fix

In contrast to trees, the network in Fig. 3 fails to resolve a likely position for one Prognathodon species, P. currii, and the large associated box indicates a data issue. The pairwise distances of the oddly placed P. currii and the probably misplaced Dolichosaurus are poorly defined: both have zero-distances to non-similar taxa, but also to each other. But whereas Dolichosaurus differs from members of Prognathodon by mean morphological distances (MD) of 0.5–1.0 (1.0 means it differs in all defined characters!), P. currii is much more similar to its congeners (MD = 0.17–0.27 and 0.46). Their other affinities also lie with strongly different taxon sets.

Their position in the neighbour-net is the result of a missing data artefact. Because the neighbour-net is just a 2-dimensional graph, it cannot resolve such severe signal ambiguity. Unrepresentative distances are the major (only) obstacle for neighbour-nets in the context of extinct groups. Trees are more decisive in such cases, when the few covered characters fit the preferred tree's topology well. By removing the outgroup taxa and P. currii, we can generate a neighbour-net (Fig. 4) that is in line with, and goes beyond, the Bayesian-tree-based groups suggested by Madzia & Cau (Fig. 1).

Fig. 4 Same data and method as shown in Fig. 3; four OTUs were excluded, the non-Mosasauroidea (outgroup) and the misplaced Prognathodon currii
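How missing data produce unrepresentative distances can be shown with a minimal sketch of a mean (Hamming) distance that skips characters not scored in both taxa. The three rows are invented and "?" is assumed to mark missing data; reporting the number of compared characters alongside each distance makes such artefacts easy to spot.

```python
def mean_distance(a, b, missing="?"):
    """Mean Hamming distance over characters scored in both taxa.

    Returns (distance, n_compared); distance is None if no character overlaps.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != missing and y != missing]
    if not pairs:
        return None, 0
    differences = sum(x != y for x, y in pairs)
    return differences / len(pairs), len(pairs)


# Hypothetical rows: two well-scored taxa and one fragmentary taxon
matrix = {
    "complete_1": "0011010110",
    "complete_2": "1110100001",
    "fragment":   "??1???0???",
}

names = list(matrix)
for i, s in enumerate(names):
    for t in names[i + 1:]:
        d, n = mean_distance(matrix[s], matrix[t])
        print(f"{s} vs {t}: distance = {d}, characters compared = {n}")
```

Here the fragmentary taxon has a zero distance to both complete taxa, although those two differ from each other in most characters. A distance based on two characters is simply not comparable with one based on ten.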

Using networks to define taxonomic groups

Just based on the neighbour-nets (Figs 3, 4), circumscription of genera and higher taxa can be discussed (assuming that morphology mirrors phylogeny). For instance, Mosasaurus can be kept as-is or can include Plotosaurus; whereas the Clidastes form a clearly distinct taxon (whether paraphyletic/monophyletic or clade/grade may be impossible to decide, see Fig. 1). Including (all) Prognathodon in the Globidensini remains an option; Eremiasaurus may be included, too, or included in the likely sister clade, the Mosasaurini.

Dallasaurus is not only the oldest possible but clearly the most unique (primitive?) member of the Mosasaurinae, and the Halisaurinae likely represent their early diverged sister lineage. Treating Tylosaurinae and Plioplatecarpinae as reciprocally monophyletic sister lineages makes sense with respect to the older taxa and the co-eval Mosasaurinae-Halisaurinae lineage. The ancient forms are generally more similar to Plioplatecarpinae (+ Tylosaurinae) than to the Mosasaurinae and Halisaurinae lineages; but whether they should be included in the same systematic group ("Russellosaurina") cannot be judged based on the data matrix or the inferred trees (see also Figs 1, 2). Their topological attraction may be due to more shared primitive features (Hennig's ‘symplesiomorphies’), and the "Russellosaurina" could be a paraphyletic clade.

An interesting pronounced central edge bundle in the network in Fig. 4, which agrees well with Madzia & Cau's Bayesian consensus tree (Fig. 1), is the one separating all oldest, potentially more primitive taxa/lineages (> 90 Ma) from the later more diversified lineages (Mosasaurinae, Halisaurinae, Plioplatecarpinae, and Tylosaurinae). Regarding primitiveness vs. derivedness, an option to map characters on networks and extract alternative trees directly from the network would be handy (see also David’s 500th post).

Fig. 5 Bootstrap (BS) support network based on 10,000 BS (pseudo)replicates optimised under parsimony. Only splits that occurred in at least 20% of the BS replicates are shown; trivial splits are collapsed. Some taxa have low, but unchallenged support, in other cases no preference at all is found (e.g. for the highest level bracketing taxa) or two alternatives compete with each other.

Also in the case of the mosasaurs: when we want to use phylogenetic trees as the sole (or main) basis for classification, rather than neighbour-nets (see my last post) and common sense backed up by EDA (e.g. Fig. 4; Bomfleur et al. 2017), the method of choice would be the support consensus networks based on parsimony (example provided in Fig. 5), least-squares, and/or likelihood bootstrapping pseudoreplicate samples, in addition to or instead of the Bayesian-inferred topologies sample. The posterior probabilities in Madzia & Cau’s tip-dated tree and Bayesian majority-rule consensus tree include values << 1.0, which already can be an indication of very strong signal conflict or just lack of discriminating signal (flat likelihood surfaces).

We should not be over-confident in PP, when the underlying data are not tree-like at all, as they too easily tilt towards one alternative (see also Zander 2004). The same holds for post-analysis character weighting, designed to eliminate (down-weigh) conflicting signals. While parsimony and distance methods are more easily affected by branching artefacts, probabilistic methods may struggle with flat likelihood surfaces. Thus, bootstrap support networks should be the first choice for ‘phylogenetic’ (by identifying Hennigian monophyla) or ‘cladistic’ (clade-based) classification as they show the robustness of the signal for the preferred and other topological alternatives, and can be generated under different optimality criteria. Having a certain support for a clade is nice, but one should always consider the support for alternatives, and consider how many characters support or oppose an alternative.
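What a bootstrap support network summarises can be sketched as follows: count how often each split occurs among the replicate trees, keep the splits above a cut-off, and flag pairs of retained splits that cannot occur on the same tree. The replicate sample below is invented; the 20% cut-off follows the one used for Fig. 5, and each split is written consistently as one of its two sides.

```python
from collections import Counter
from itertools import combinations


def compatible(s1, s2, taxa):
    """Two splits fit on one tree if at least one of the four side intersections is empty."""
    c1, c2 = taxa - s1, taxa - s2
    return not (s1 & s2 and s1 & c2 and c1 & s2 and c1 & c2)


taxa = frozenset("ABCDE")

# Hypothetical bootstrap sample: 100 replicate trees, each reduced to its
# non-trivial splits (a split is always written as the same one of its sides)
replicates = (
    [{frozenset("AB"), frozenset("DE")}] * 55
    + [{frozenset("BC"), frozenset("DE")}] * 35
    + [{frozenset("AC"), frozenset("BD")}] * 10
)

counts = Counter(s for tree in replicates for s in tree)
support = {s: n / len(replicates) for s, n in counts.items()}

# Splits that would be drawn in the support network
shown = {s: p for s, p in support.items() if p >= 0.20}

# Competing alternatives: retained splits that exclude each other
conflicts = [
    (a, b) for a, b in combinations(shown, 2) if not compatible(a, b, taxa)
]

for s, p in sorted(shown.items(), key=lambda item: -item[1]):
    print("".join(sorted(s)), p)
for a, b in conflicts:
    print("competing:", "".join(sorted(a)), "vs", "".join(sorted(b)))
```

A tree would report only the better supported of the two competing groupings; the network reports both, together with how far apart their support values are.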

Morphological matrices need to be analysed using network approaches

Madzia & Cau’s study is methodologically interesting by providing a tip-dated Bayesian tree for an extinct group of organisms. A one-to-one comparison of their parsimony-BS support using different character and weighting schemes vs. Bayesian PP may be interesting, too — note the difference between the tip-dated tree and the majority rule consensus trees for several critical branches. However, following the current standard practice, no BS pseudoreplicate and Bayesian saved topologies samples were provided. Regarding the main objective, the identification of ‘weak spots’ to propose enhanced systematic groups, networks (Figs 2–5) would have been more informative and straightforward.

No matter what classification philosophy is applied, when we deal with morphological matrices of extinct groups of organisms, the first step should always be to explore the primary signal in the data before we infer trees using (highly) sophisticated methods, and interpret them — the latter may actually obscure ‘weak spots’ rather than identifying them. The quickest analyses are neighbour-nets, but watch out for odd pairwise distance patterns (easily visualised using heat maps)!

The second step is producing support consensus networks, for the fine-tuning and to decide on the most probable trees to explain the data. Regarding classification, we should ask ourselves whether we really want inevitably unstable clade-based classification systems (when dealing with extinct organisms), or robust ones that reflect the general data situation and include potentially or likely paraphyletic taxa (see e.g. Clidastes in Figs 2–5 and Madzia & Cau's trees, and their elaborate discussion of higher level taxa, which – to a good degree – could become superfluous when allowing paraphyletic taxa).


All graphics, and some primary data files, are publicly available from figshare. An archive including all re-analysis files can be downloaded at www.palaeogrimm.org.


Bell GL (1997) A phylogenetic revision of North American and Adriatic Mosasauroidea. In: Callaway JM, and Nicholls EL, eds. Ancient Marine Reptiles. San Diego: Academic Press, pp. 293–332 [cited from Madzia & Cau 2017]

Bomfleur B, Grimm GW, McLoughlin S. 2017. The fossil Osmundales (Royal Ferns)—a phylogenetic network analysis, revised taxonomy, and evolutionary classification of anatomically preserved trunks and rhizomes. PeerJ 5:e3433. https://peerj.com/articles/3433/.

Madzia D, Cau A (2017) Inferring 'weak spots' in phylogenetic trees: application to mosasauroid nomenclature. PeerJ 5: e3782. https://peerj.com/articles/3782/.

Zander RH (2004) Minimal values of reliability of Bootstrap and Jackknife proportions, Decay index, and Bayesian posterior probability. PhyloInformatics 2: 1–13.

Tuesday, October 10, 2017

Where to retire in the USA

Some weeks ago I published a post on recommended countries for Where to retire. Not everyone wants to leave their homeland, however, and so for many of our readers it may therefore be relevant to consider which states in the USA might be recommended as most desirable for retirees.

In this regard, the Bankrate web site has recently considered Where are the best and worst states to retire? They collated data (from various sources) for each of the 50 states for the following eight characteristics:
  • Cost of living
  • Healthcare quality
  • Crime rate
  • Cultural and social vitality
  • Weather
  • Taxes (income and sales taxes)
  • Senior citizens' overall well-being
  • The prevalence of other seniors
For 2017, the states were then ranked from 1–50 for each of these characteristics separately. These rankings were then weighted according to a survey of the reported relative importance of each of these characteristics — they are listed above in the order of decreasing importance. From the weighted data, Bankrate produced an overall ranking of the states for their desirability to retirees, which you can check out on their web site.
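A weighted ranking of this kind can be sketched in a few lines. The states, ranks and weights below are all invented (they are not Bankrate's figures, and only three characteristics are used); the sketch only shows how the weighted sum of ranks collapses several characteristics into one ordering.

```python
# Hypothetical ranks (1 = best) for three made-up states on three characteristics
ranks = {
    "State X": {"cost": 1, "healthcare": 3, "crime": 2},
    "State Y": {"cost": 3, "healthcare": 1, "crime": 1},
    "State Z": {"cost": 2, "healthcare": 2, "crime": 3},
}

# Hypothetical weights (not Bankrate's), in decreasing order of importance
weights = {"cost": 0.5, "healthcare": 0.3, "crime": 0.2}
equal_weights = {name: 1.0 for name in weights}


def overall_ranking(ranks, weights):
    """Order the states by their weighted sum of ranks (lower is better)."""
    scores = {
        state: sum(weights[k] * state_ranks[k] for k in weights)
        for state, state_ranks in ranks.items()
    }
    return sorted(scores, key=scores.get)


overall = overall_ranking(ranks, weights)
overall_equal = overall_ranking(ranks, equal_weights)

print("weighted:", overall)
print("equal weights:", overall_equal)
```

In this toy case the first and second places swap depending on the weights, which is a reminder that a single ranking reflects the weighting scheme as much as the data.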

However, this ranking is overly simplistic, because it suggests that there is only one main dimension to retirement desirability, from best to worst. Clearly, retirement is multi-dimensional — there is no reason to expect the eight characteristics to be highly correlated. Therefore a network analysis would be handy to explore which characteristics differ between the states.

As for my previous analysis, I have calculated the Manhattan distance pairwise between the states; and I am displaying this in the figure using a NeighborNet network. States that have similar retirement characteristics are near each other in the network; and the further apart they are in the network then the more different are their characteristics.
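For readers unfamiliar with it, the Manhattan distance is simply the sum of the absolute differences across the characteristics. A minimal sketch, using invented rank profiles for three made-up states (four characteristics instead of eight), is given below; the NeighborNet itself is then computed from the resulting distance matrix by dedicated software, which is not shown here.

```python
from itertools import combinations

# Hypothetical rank profiles (made-up states and values), one rank per characteristic
profiles = {
    "State X": [1, 3, 2, 4],
    "State Y": [2, 3, 1, 4],
    "State Z": [4, 1, 3, 2],
}


def manhattan(u, v):
    """Sum of absolute differences between two equally long profiles."""
    return sum(abs(a - b) for a, b in zip(u, v))


distances = {
    (s, t): manhattan(profiles[s], profiles[t])
    for s, t in combinations(profiles, 2)
}

for (s, t), d in distances.items():
    print(f"{s} - {t}: {d}")
```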

In the network graph I have highlighted Bankrate's top 10 ranked states in green and their bottom 10 states in red. Note that they do not cluster neatly in the network, emphasizing the importance of considering the different characteristics, rather than just averaging them into a single ranking.

So, the network does not represent a single trend (from best to worst) — this would produce a long thin graph. Instead, the network scatters the states broadly, indicating that they have multiple relationships with each other — the eight retirement characteristics are not highly correlated. Indeed, the network is L-shaped, suggesting two main trends. The main part of the L has the north-eastern and west-coast states at one end and the mid-western and western states at the other, while the short part of the L separates out the south-eastern and south-western states. There are several obvious exceptions to these broad patterns (eg. Kentucky).

You can see that the north-eastern states tend to cluster together as being among the most desirable retirement locations (in Bankrate's ranking), and that the southern states tend to cluster together as being among the least desirable.

California is interesting because it ranks in the top two for Weather and Culture, but near the bottom for everything else. Hawaii ranks highly on Well-being and Culture but very poorly on Taxes, Crime rate, and Cost of living (where it is dead last). Florida, naturally, ranks first for Prevalence of seniors, but it is ranked mediocre to poor on everything else (including its hurricane-prone weather). New York is ranked first for Culture but mediocre to poor for everything else (and is ranked last for Taxes).

Alaska is ranked best for Taxes, Mississippi is best for Cost of living, Vermont is top for Crime rate (being low!), and Maine is best for Health care. Of these, only the last state scores well for other characteristics, being second for Crime Rate and Prevalence of seniors. This puts it in the overall top three states, along with New Hampshire and Colorado.

New Hampshire gets the top spot by ranking well on everything except Cost of living and Weather — it is close to last for the latter characteristic!

So, the bottom line is that there is no state that particularly stands out as most suitable for retirees — in terms of desirable characteristics, what you win on the swings you lose on the roundabouts. Hardly surprising, really.

If you are interested in retiring to a particular city, then this recent web page may also be of relevance to you: Top 25 cities where you can live large on less than $70k.

Neither this nor the previous analysis (for countries) has addressed the issue of politics. Political voting is not randomly distributed, and some people prefer to live surrounded by voters similar to themselves. If this is you, then Wikipedia has a map indicating which states you might prefer.

Tuesday, October 3, 2017

Clades, cladograms, cladistics, and why networks are inevitable

During the work for another post, I stumbled on a kind of gap-in-knowledge that has nagged me for quite some time. This gap exists because researchers like to stay within chosen philosophical viewpoints, rather than reassessing their stance.

This gap involves the use of cladistic methodology in a manner that obscures information about evolutionary history, rather than revealing it. A clade, a subtree in a rooted tree that fulfills the parsimony criterion (or, indeed, any other criterion), may or may not reflect monophyly in a Hennigian sense, i.e. inclusive common origin. This is especially true for studies of extinct lineages.

I will explore this idea here in some detail.

Assumptions when studying fossils

Phylogenetic papers dealing with the evolution of extinct groups of organisms frequently use strict consensus trees (typically cladograms) of a sample of equally parsimonious trees (MPT) as the sole or main basis for their conclusions. They do this under two important implicit assumptions:
  • The morphological differentiation patterns encoded in a character matrix provide a generally treelike signal. In other words, the data patterns in the morphological matrix can be explained by a single, dichotomous, 1-dimensional graph. This assumption is also the basis for posterior filtering or down-weighting of characters that support splits (taxon bipartitions) conflicting with the branches in the inferred tree(s).
  • Morphological evolution is generally parsimonious. Although this may apply for characters that evolved only once or only evolve under very rare conditions, total evidence and DNA-constrained analysis demonstrate that this is not generally the case: the tree inferred by total-evidence or molecular constraints is typically longer than the tree(s) with the fewest character changes inferred on the morphological partition alone.
Another implicit assumption seems to be that all fossil specimens must represent extinct sister clades, and that no fossil specimen is ancestral to any other (or to an extant species) — hence, all taxa can be treated as terminals (not ancestors). Rooting typically relies on outgroups, under the assumption that ingroup-outgroup branching artefacts (such as long-branch attraction) play no role for parsimony inference when using morphological data sets.
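
To make the "fewest character changes" criterion concrete, here is a minimal Python sketch of Fitch's counting procedure for unordered characters on a rooted binary tree. The nested-tuple tree encoding, the function names and the four-taxon matrix are illustrative assumptions, not data from any of the studies discussed here.

```python
def fitch(node, states):
    """Return (possible ancestral states, minimum changes) for one character.

    A leaf is a taxon name (str); an internal node is a 2-tuple of subtrees.
    """
    if isinstance(node, str):
        return {states[node]}, 0
    left, right = node
    left_set, left_cost = fitch(left, states)
    right_set, right_cost = fitch(right, states)
    common = left_set & right_set
    if common:
        # The two subtrees can agree on a state: no extra change needed.
        return common, left_cost + right_cost
    # The subtrees disagree: at least one additional change is required.
    return left_set | right_set, left_cost + right_cost + 1


def tree_length(tree, matrix):
    """Sum the Fitch counts over all characters (columns) of the matrix."""
    n_chars = len(next(iter(matrix.values())))
    return sum(
        fitch(tree, {taxon: row[i] for taxon, row in matrix.items()})[1]
        for i in range(n_chars)
    )


# Hypothetical matrix: characters 1 and 3 support AB|CD, character 2 supports AC|BD.
matrix = {"A": "111", "B": "101", "C": "010", "D": "000"}
tree_ab = (("A", "B"), ("C", "D"))
tree_ac = (("A", "C"), ("B", "D"))

print(tree_length(tree_ab, matrix))  # 4 changes
print(tree_length(tree_ac, matrix))  # 5 changes
```

Under parsimony, the first tree is preferred because it needs one change fewer. A tree that is a step longer, as a total-evidence or molecular-constraint tree typically is for the morphological partition, is not thereby shown to be wrong.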

In many of these morphology-phylogenetic papers (using parsimony or other methods) the authors state that they have conducted a “cladistic” study (I also made this error in my masters thesis; Grimm 1999). Cladistics is a classification system established by Hennig (1950) that relies on synapomorphies, exclusively shared, derived traits, that are linked with groups of inclusive common origin, the so-called monophyla.

More than 80 years earlier, Haeckel (1866) used the German word monophyletisch to refer to “natural” groups defined by a shared evolutionary history (a common origin). The latter could also include what Hennig identified as paraphyla: groups that have a common origin, but are not inclusive. To avoid confusion between Haeckelian and Hennigian monophyletic groups, Ashlock (1971) suggested the term holophyletic for the latter. This can be useful when a classification should recognise evolutionary relationships but needs to classify potentially or definitely paraphyletic groups for reasons of practicality (see e.g. Bomfleur, Grimm & McLoughlin 2017). Here, I will stick to Hennig’s terminology, as it is much more commonly used (although not necessarily correctly applied).
Hennig’s monophyla are from a theoretical (and computational) point of view a brilliant concept, as they can be inferred using a rooted tree. The test for monophyly is simple: Do A and B have a common ancestor? If yes, identify all taxa that are part of the same subtree as A and B. Unfortunately, we often find more than one possible tree, and roots can be misleading.
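
The subtree test just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rooted tree is written as nested tuples, and the tree, taxon names and function names are hypothetical. Note that it only tells us whether a group forms a clade in the given rooted tree, which is exactly the thing that can be misleading when the tree or its root is wrong.

```python
def leaves(node):
    """Return the set of taxon names below a node (a leaf is a string)."""
    if isinstance(node, str):
        return {node}
    found = set()
    for child in node:
        found |= leaves(child)
    return found


def smallest_clade(node, taxa):
    """Return the leaf set of the smallest subtree that contains all `taxa`."""
    taxa = set(taxa)
    if not isinstance(node, str):
        for child in node:
            if taxa <= leaves(child):
                # All requested taxa sit inside this child: descend further.
                return smallest_clade(child, taxa)
    return leaves(node)


def is_clade(tree, taxa):
    """True if `taxa` are exactly the members of one subtree of the rooted tree."""
    return smallest_clade(tree, taxa) == set(taxa)


# Hypothetical rooted tree: ((A,B),C) is sister to (D,E).
tree = ((("A", "B"), "C"), ("D", "E"))

print(smallest_clade(tree, {"A", "C"}))  # the subtree of A and C also contains B
print(is_clade(tree, {"D", "E"}))        # True
print(is_clade(tree, {"A", "C"}))        # False
```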

Strict consensus trees poorly represent the alternative topologies in a MPT sample

All consensus-tree approaches are limited in their ability to depict the topological alternatives in a tree sample, but strict consensus trees are probably the worst (see e.g. Felsenstein 2004, chapter 30). They have also become obsolete with the development of consensus networks (Holland & Moulton 2003) and their subsequent implementation in freely accessible software packages such as SplitsTree (Huson 1998; Huson & Bryant 2006) and, more recently, the phangorn package for R (Schliep 2011; Schliep et al. 2017).

Figure 1 illustrates this difference for two extreme cases of binary matrices and their MPT collections. The two datasets in Fig. 1 reflect a substantially different data situation. The data in one matrix are perfectly tree-unlike (completely “confused about relationships”): any possible non-trivial bipartition of the 5-taxon set is supported by one (parsimony-informative) character. The data in the other matrix reflect two incongruent trees: each character is compatible with either one of the trees (parsimony-informative characters) or both trees (unique characters). The non-treelike matrix allows for many more MPTs than does the tree-like matrix, which results in two MPTs perfectly matching the two conflicting true trees. But both consensus analyses result in the same, unresolved (polytomous) strict consensus tree. In contrast, the two consensus networks highlight the difference in the quality between the data sets and the MPT sample.

Fig. 1 Non-treelike and treelike data, and the representation of their most-parsimonious tree collections as strict consensus trees and networks

Another example is shown in Figure 2, which shows four trees that differ only in the placement of one taxon (T8). This is a common phenomenon, particularly when dealing with extinct groups of organisms. The three main reasons for such topological ambiguity are:
  1. Indecisive data regarding the exact position of T8 with respect to the members of the red (T1–T4) and green (T5–T7) clades.
  2. Conflicting data: T8 shows a combination of traits that are otherwise restricted to (parts of) the green or red clade.
  3. T8 is an ancestor or primitive member of the green or red clade, or both. 

Fig. 2 A single rogue taxon (T8) with ambiguous affinities collapses the strict consensus tree. In contrast, the consensus network can simultaneously show all alternatives, and identifies T8 as the source of topological ambiguity.

The strict consensus tree shows only three clades (three pairs of sister taxa) and a large polytomy, but the strict consensus network shows simultaneously the topology of all four trees and the position of T8 in these trees. From the consensus network, it is clear that the members of the red and green clades share a common origin. T8 can easily be identified as the rogue taxon (lineage).
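
The difference between the two summaries can be sketched in Python, assuming that each tree is reduced to the set of groups (leaf sets of its internal nodes) that it contains. The three trees below are hypothetical and smaller than those in Fig. 2, and rooted groups are used instead of unrooted splits to keep the sketch short. A strict consensus keeps only the groups found in every tree, whereas a consensus network can display every group found in at least a chosen fraction of the trees.

```python
from collections import Counter


def clusters(tree):
    """Return the leaf sets of all internal nodes of a nested-tuple tree."""
    found = set()

    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(walk(child) for child in node))
        found.add(below)
        return below

    walk(tree)
    return found


def consensus_clusters(trees, min_fraction):
    """Groups occurring in at least `min_fraction` of the trees."""
    counts = Counter(group for tree in trees for group in clusters(tree))
    return {g for g, n in counts.items() if n / len(trees) >= min_fraction}


# Hypothetical tree sample: only the position of the rogue taxon T8 differs.
trees = [
    (((("T1", "T2"), "T3"), "T8"), ("T5", "T6")),  # T8 sister to the red group
    ((("T1", "T2"), ("T3", "T8")), ("T5", "T6")),  # T8 inside the red group
    ((("T1", "T2"), "T3"), (("T5", "T6"), "T8")),  # T8 sister to the green group
]

strict = consensus_clusters(trees, 1.0)   # groups shared by all trees
network = consensus_clusters(trees, 0.0)  # every group seen in any tree

red = frozenset({"T1", "T2", "T3"})
print(red in strict, red in network)      # False True
print(len(strict), len(network))          # 3 7
```

The red group T1–T3 is present in two of the three trees, yet it vanishes from the strict consensus; the network-style summary retains it together with the competing placements of T8.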

Cladograms are incomplete representations of evolutionary trees

Figure 3 shows one of the first phylogenetic trees ever produced, and how it would look in the results section of a cladistic study. The tree was produced 150 years ago by Franz Martin Hilgendorf, more than 100 years before Hennig’s ideas were introduced to the Anglo-Saxon world and became mainstream. Hilgendorf was a palaeontology Ph.D. student at the same institute (in Tübingen, Germany) that also awarded my doctorate. Quenstedt, his supervisor, forced a quick graduation to get him and his heretical Darwinian ideas out of his university; there are thus no figures in Hilgendorf's thesis, and he published a phylogenetic tree only after he left Tübingen. It shows the evolution of derived forms (terminals) from putative ancestral forms (placed at the nodes) of fossil snails from the Steinheimer Becken, and clearly distinguishes ancestors and sisters. At some point, Hilgendorf even considered including the reticulation of lineages to better explain some forms, but later dropped this idea, feeling it would violate Darwin’s principle (Rasser 2006; see The dilemma of evolutionary networks and Darwinian trees).

Fig. 3 Hilgendorf's phylogenetic tree of fossil snails and its representation in form of a cladogram. The coloured fields and boxes refer to a series of nested clades, which here equal monophyletic groups.

Translating Hilgendorf’s tree into a cladogram comes with a loss of information about the evolution of the snails. Some ancestors are placed as sisters to their descendants (e.g. 18 vs. 18a and 19) and others are collected in a polytomy together with their descendants/descending lineages (e.g. 15, the ancestor of the siblings 16, 17, and the 18+). The loss of information regarding assumed ancestor-descendant relationships is dramatic. But this is no problem for cladistic classification: all clades in the cladogram in Fig. 3 (boxes) refer to Hennigian monophyletic groups seen in the original phylogenetic tree (coloured backgrounds). The polytomies in the cladogram are hard polytomies and do not reflect uncertainty or ambiguity. This contrasts with most cladograms depicted in the phylogenetic (“cladistic”) literature, where polytomies can also reflect lack of support or topological ambiguity.

Accepting the possibility that some fossils (fossil forms) may be ancestral to others (or their modern counterparts), or at least represent an ancestral, underived form, we actually should not infer plain parsimony trees but median networks (Bandelt et al. 1995). Median networks and related inferences (reduced median networks: Bandelt et al. 1995; median joining networks: Bandelt, Forster & Röhl 1999) work under the same optimality criterion (evolution is parsimonious) but allow taxa to be placed at the nodes (the “median”) of the graph. In doing so, they depict ancestor-descendant relationships. That they have not been used for morphological data so far, nor in palaeophylogenetic studies (as far as I know), may have to do with their vulnerability to homoplasy and missing data. High levels of homoplasy are common in morphological matrices, and missing data can be a problem when working with extinct organisms.
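
The "median" that gives these networks their name can be illustrated with a short Python sketch: for three binary sequences, the median is the sequence that takes the majority state at every position. If that median coincides with an observed taxon, the taxon sits at an internal node rather than at a tip. The sequences and names below are hypothetical, and the sketch covers only the median operation itself, not the full network construction of Bandelt et al.

```python
from collections import Counter


def median(u, v, w):
    """Position-wise majority state of three binary character sequences."""
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(u, v, w)
    )


# Hypothetical binary data: X is the most primitive form, B and C are derived.
taxa = {"X": "0000", "A": "1000", "B": "1100", "C": "1010"}

m = median(taxa["X"], taxa["B"], taxa["C"])
print(m)               # 1000
print(m == taxa["A"])  # True: A occupies the node between X, B and C
```

With three binary sequences a majority always exists, because at least two of the three states must agree; multistate characters would need an extra rule for three-way ties.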

An ideal matrix, in which each divergence is followed by the accumulation of synapomorphies (or “autapomorphies”, unique traits, close to the tips), results in a median network perfectly depicting the evolutionary tree (Figure 4). As soon as convergent evolution steps in, a median network can easily become chaotic, although less so for a median-joining network. Note that half of the characters are homoplasious, and yet the median-joining network is still largely treelike (Fig. 4), with only one 2-dimensional box. The true tree is included in the network; but an E-G clade evolving from D is indicated as alternative to the correct (and monophyletic) FGH clade, with G and H evolving from F. Another deviation from the true tree is that A, the ancestor of B and C, is not placed at the node, but is closer to the all-common ancestor X.

Fig. 4 Two datasets, one without (left) and one with homoplasy (right), and their median(-joining) networks. Green branches refer to exact fits with the true tree, red indicate deviation or conflict with the true tree.

Paraphyletic clades...

Figures 5A and B show the corresponding MPT for the ideal matrix and the strict consensus tree vs. strict consensus network for the matrix affected by homoplasy. As our ideal matrix includes actual ancestors, the MPT rooted with the most primitive taxon X (the common ancestor of A–H) cannot resolve the exact relationships, in contrast to the median network. It thus represents the true tree only partly. But it also does not show any clade that is not monophyletic.

In the case of the partly homoplasious data, the median-joining network reconstructs a synapomorphy of the clade BC, because A is not placed on the node. This is because one character in our matrix is a methodologically undetectable parallelism — the same trait evolved in the sister taxa B and C, but only after both evolved from A. Clade BC is non-inclusive (paraphyletic), since A is the direct ancestor of both B and C and the clade BC lacks a real synapomorphy (if we go back to Hennig's concept). The reconstructed A would, however, be a stem taxon and clade BC would be inclusive (monophyletic) with one (inferred) synapomorphy. But this is a purely semantic problem of cladistics. In the real world, we will hardly have the data to discern whether A represents: the last common ancestor of B and C, a stem taxon of the ABC-lineage (a’), a very early precursor of B or C (b/c), or an ancient sister lineage of A, B, and/or C (a*). For practicality, one would eventually include all fossil forms with A-ish appearance in a paraphyletic taxon A (Fig. 5C), in (silent) violation of the cladistic rule to name only monophyletic groups.

Fig. 5A The median network compared to the single most-parsimonious tree inferred based on the ideal matrix

Fig. 5B The median-joining network compared to the strict consensus tree and networks of five most-parsimonious trees inferred based on the matrix with homoplasy. Red edges indicate deviations from or conflicts with the true tree.

Fig. 5C Potential monophyla that could be inferred from the median-joining network (Clades XY), when rooted with the most ancient taxon X. Groups that are monophyletic according to the true tree in blue, groups that are not in orange.

The strict consensus tree of the five MPTs that can be inferred from the homoplasious matrix shows only the paraphyletic (pseudo-monophyletic) clade BC and two monophyletic clades (ABC and D–H); and it contains no further information about the actual topology of the five MPTs. Its lack of resolution is due to the ancestors, which typically have fewer derived traits (no autapomorphies and fewer synapomorphies), in combination with the homoplasy-induced topological ambiguity. In contrast, the strict consensus networks reveal that all five MPTs place D, the ancestor of the D–H lineage, as (zero branch length) sister to a technically paraphyletic E–H clade, thereby identifying D as the most primitive form of the monophyletic D–H clade. Furthermore, all MPTs recognise a paraphyletic FH clade (F again a zero-length branch). They disagree in the placement of G, which is either sister to F+H (monophyletic FGH clade) or sister to E (a wrong EG clade).

... and monophyletic grades

Figure 6 shows a scenario in which paraphyletic groups are resolved as clades and monophyletic groups form grades, both because of outgroup-ingroup branching artefacts. The derived outgroup O is notably distinct from all ingroup taxa showing a character suite of convergently evolved traits that are randomly shared with parts of the ingroup. Within the ingroup, members of clade DEF are much more derived than are A and C.

Fig. 6 Ingroup-outgroup long-branch attraction can turn monophyla into grades and paraphyla into clades. The ingroup (A–F) consists of a sequence of nested monophyletic lineages (green shades) including two taxa (lowercase letters) that are ancestral to others. Each ingroup lineage evolved (convergent) traits also found in the outgroup O. The data allow inferring two MPTs that misplace O. The outgroup-misinformed root leads to a series of nested clades that are paraphyletic. Splits congruent with the actual monophyletic groups in green, those in conflict with the true tree in red.

Parsimony-tree inference finds two MPTs, which, rooted with the outgroup O, recognise a distinctly paraphyletic A–D+X clade. In both outgroup-rooted MPTs, the monophyletic DEF group is dissolved into a grade. By the way: using neighbour-joining (NJ) to find a tree fulfilling the least-squares (LS) criterion based on the corresponding pairwise mean distance matrix, the outgroup-inferred root is still misplaced with respect to the primitive taxa (X, A–C), but the DEF monophylum is correctly resolved as a clade. Call the Spanish Inquisition! A “phenetic” clustering algorithm finds a tree that is less wrong than the MPTs.
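
As a rough illustration of what a pairwise mean distance matrix contains, the following Python sketch computes, for each pair of taxa, the proportion of differing characters among those scored in both. How missing data were handled in the analysis above is not stated, so skipping them pairwise is an assumption here, and the small matrix is hypothetical rather than the one behind Fig. 6.

```python
from itertools import combinations


def mean_distance(row_a, row_b, missing="?"):
    """Proportion of differing characters among those scored in both taxa."""
    compared = [(a, b) for a, b in zip(row_a, row_b) if missing not in (a, b)]
    if not compared:
        raise ValueError("no characters scored in both taxa")
    return sum(a != b for a, b in compared) / len(compared)


def distance_matrix(matrix):
    """Mean pairwise distances for every unordered pair of taxa."""
    return {
        (s, t): mean_distance(matrix[s], matrix[t])
        for s, t in combinations(sorted(matrix), 2)
    }


# Hypothetical data: X is primitive, A differs in one trait,
# O is a derived outgroup with one unscored character.
matrix = {"X": "00000", "A": "10000", "O": "1?011"}

for pair, d in distance_matrix(matrix).items():
    print(pair, d)
# ('A', 'O') 0.5
# ('A', 'X') 0.2
# ('O', 'X') 0.75
```

A matrix of this kind is the input for distance-based methods such as neighbour-joining or the neighbour-net.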

The most comprehensive display of the misleading signal in this matrix is nevertheless the neighbour-net (NNet; Figure 7), which includes both the parsimony and LS-solutions, and it can be used to map the competing support patterns surfacing in a bootstrap analysis of the data. In this network we can see that the signal is not compatible with a single tree, and that the signal from the distant outgroup O is too ambiguous for rooting the ingroup. Based on this graph, one can argue to delete the outgroup, thereby deleting all non-treelike signal — a NNet (or median network) excluding O matches exactly the true tree.

Fig. 7 Neighbour-net based on mean pairwise distances (same data as in Fig. 6). The outgroup O provides a strongly ambiguous (non-treelike) signal, thus triggering a series of splits (in red) conflicting with the true tree (shown in grey). Edges compatible with the true tree shown in green. The numbers refer to non-parametric bootstrap support estimated under three optimality criteria: least-squares (LS; via neighbour-joining), maximum likelihood (ML; using Lewis' 1-parameter Mk model), and maximum parsimony (MP), with 10,000 (pseudo)replicates each. Upper right: A splits-rose illustrating the competing support patterns for proximal splits involving O: green — split seen in the true tree, reddish — the competing splits seen in the two MPTs.

We need to accept that a clade, a subtree in a rooted tree (see e.g. Felsenstein 2004) fulfilling the parsimony criterion (or any other criterion), may or may not reflect monophyly in a Hennigian sense, i.e. inclusive common origin. Thus, it is imperative to distinguish between a classification concept that interprets trees (cladistics) and the method used to infer trees (typically parsimony, in the case of extinct lineages). This is especially so when one has to work with stand-alone data, such as morphological data of extinct groups of organisms.

Aside from the clades/grades ↔ monophyla / paraphyla / can't-say problem, the instability of clades in a parsimony or otherwise optimised rooted tree, or the alternative clades that can be inferred from the more data-comprehensive networks, make it difficult to enforce a strictly cladistic naming scheme. For the example shown in Fig. 2, we would be unable to name the red and green clades until the exact position of T8 is settled (see also Bomfleur, Grimm & McLoughlin 2017). In the end, the overall diversity patterns (studied using exploratory data analysis) may remain the most solid ground for classification.

It should also be obligatory in phylogenetic studies to use networks to display both competing topological alternatives and incompatible data patterns. There should also always be some information on edge-lengths. Consensus trees are insufficient, as they mask conflicting data patterns, and cladograms mask the amount of change.


Ashlock PD. (1971) Monophyly and associated terms. Systematic Zoology 20:63–69.

Bandelt H-J, Forster P, Röhl A. (1999) Median-joining networks for inferring intraspecific phylogenies. Molecular Biology and Evolution 16:37–48.

Bandelt H-J, Forster P, Sykes BC, Richards MB. (1995) Mitochondrial portraits of human populations using median networks. Genetics 141:743–753.

Bomfleur B, Grimm GW, McLoughlin S. (2017) Figure 8 of: The fossil Osmundales (Royal Ferns)—a phylogenetic network analysis, revised taxonomy, and evolutionary classification of anatomically preserved trunks and rhizomes. PeerJ 5:e3433.

Felsenstein J. (2004) Inferring phylogenies. Sunderland, MA, U.S.A.: Sinauer Associates Inc.

Grimm GW. (1999) Phylogenie der Cycadales. Diploma thesis. Eberhard Karls Universität. [in German]

Haeckel E. (1866) Generelle Morphologie der Organismen. Berlin: Georg Reimer.

Hennig W. (1950) Grundzüge einer Theorie der phylogenetischen Systematik. Berlin: Dt. Zentralverlag.

Holland B, Moulton V. (2003) Consensus networks: A method for visualising incompatibilities in collections of trees. In: Benson G, and Page R, eds. Algorithms in Bioinformatics: Third International Workshop, WABI, Budapest, Hungary Proceedings. Berlin, Heidelberg, Stuttgart: Springer Verlag, p. 165–176.

Huson DH. (1998) SplitsTree: analyzing and visualizing evolutionary data. Bioinformatics 14:68–73.

Huson DH, Bryant D. (2006) Application of phylogenetic networks in evolutionary studies. Molecular Biology and Evolution 23:254–267.

Rasser MW. (2006) 140 Jahre Steinheimer Schnecken-Stammbaum: der älteste fossile Stammbaum aus heutiger Sicht. Online version, originally published in Geologica et Palaeontologica, vol. 40.

Schliep K, Potts AJ, Morrison DA, Grimm GW. (2017) Intertwining phylogenetic trees and networks. Methods in Ecology and Evolution DOI:10.1111/2041-210X.12760.

Schliep KP. (2011) Phangorn: phylogenetic analysis in R. Bioinformatics 27:592–593.

Tuesday, September 26, 2017

Some desiderata for using splits graphs for exploratory data analysis

This is the 500th post from this blog, making it one of the longest-running blogs in phylogenetics, if not the longest. For example, among the phylogenetics blogs that I have previously listed, there has been only one post so far this year that has not been about a specific computer program.

Our first blog post was on Saturday 25 February 2012; and most weeks since then have had one or two posts. We have covered a lot of ground during that time, focusing on the use of network graphs for phylogenetic data, broadly defined (ie. including biology, linguistics, and stemmatology). However, we have not been averse to applying what are known as "phylogenetic networks" to other data, as well; and to discussing phylogenetic trees, when appropriate.

For this 500th post, I thought that I should focus on what seems to me to be one of the least appreciated aspects of biology: the need to look at data before formally analyzing it.

Phylogeneticists, for example, have a tendency to rush into some specified form of phylogenetic analysis, without first considering whether that analysis is actually suitable for the data at hand. It is therefore wise to investigate the nature of the data first, before formal analysis, using what is known as exploratory data analysis (EDA).

EDA involves getting a picture of the data, literally. That picture should be clear, as well as informative. That is, it should highlight some particular characteristics of the data, whatever they may be. Different EDA tools are likely to reveal different characteristics; there is no single tool that does it all. That is why it is called "exploration", because you need to have a look around the data using different tools.

This is where splits graphs come into play, perhaps the most important tool developed for phylogenetics over the past 50 years.

Splits graphs

Splits graphs are the best current tools for visualizing phylogenetic data. They were developed back in 1992, by Hans-Jürgen Bandelt & Andreas Dress. These graphs had a checkered career for the first 15 years, or so, but they have become increasingly popular over the past 10 years.

It is important to note that splits graphs are not intended to represent phylogenetic histories, in the sense of showing the historical connections between ancestors and descendants. This does not mean that there is any reason why they should not do so, but it is not their intended purpose. Their purpose is to display phenetic data patterns efficiently. In this sense, calling them "phylogenetic networks" may be somewhat misleading: they are data-display networks, not evolutionary networks.

A split is simply a partitioning of a group of objects into two mutually exclusive subgroups (a bipartition). In biology, these objects can be individuals, populations, species, or even higher taxonomic groups (OTUs); and in the social sciences, they might be languages or language groups, or they could be written texts, or verbal tales, or tools or any other human artifacts. Any collection of objects will contain a set of such splits, either explicitly (eg. based on character data) or implicitly (eg. based on inter-object distances). A splits graph simultaneously displays some subset of the splits.
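
As a small illustration, a split can be represented in Python as a pair of taxon sets, and two splits can be tested for compatibility: they can sit on the same tree only if at least one of the four pairwise intersections of their sides is empty. The function names and the binary characters below are hypothetical, chosen only to show one compatible and one incompatible pair.

```python
def split_from_character(states):
    """Turn a binary character {taxon: 0 or 1} into a split (two frozensets)."""
    zeros = frozenset(t for t, s in states.items() if s == 0)
    ones = frozenset(t for t, s in states.items() if s == 1)
    return (zeros, ones)


def compatible(split_1, split_2):
    """True if the two splits can be displayed together on a single tree."""
    return any(not (side_1 & side_2) for side_1 in split_1 for side_2 in split_2)


# Three hypothetical binary characters scored for four taxa.
ab_cd = split_from_character({"A": 1, "B": 1, "C": 0, "D": 0})
ac_bd = split_from_character({"A": 1, "B": 0, "C": 1, "D": 0})
a_bcd = split_from_character({"A": 1, "B": 0, "C": 0, "D": 0})

print(compatible(ab_cd, a_bcd))  # True: both fit on the tree ((A,B),(C,D))
print(compatible(ab_cd, ac_bd))  # False: displaying both requires a box
```

A set of splits in which every pair is compatible can be drawn as a tree; each incompatible pair adds a parallelogram (box) to the splits graph.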

Ideally, a splits graph would display all of the splits; but for realistic biological data this is not likely to happen — the graph would simply be too complex for interpretation. So, a series of graphing algorithms have been developed that will display different subsets of the splits. That is, splits graphs actually form a family of closely related graphs. Technically, the Median Network is the only graph type that tries to display all of the splits; however, the result will usually be too complicated to be useful for EDA.

So, these days there is a range of splits-graph methods available for character-based data (such as Median Networks and Parsimony Splits), distance-based data (such as NeighborNet and Split Decomposition), and tree-based data (such as Consensus Networks and SuperNetworks). In population genetics, haplotype networks can be produced by methods that conceptually modify Median Networks (such as Reduced Median Networks and Median-Joining Networks).

The purpose of this post, however, is not to discuss all of the types of splits graphs, but to consider what computer tools we would need in order to successfully use this family of graphs for EDA in phylogenetics.


The basic idea of EDA is to have a picture of the data. So, any computer program for EDA in phylogenetics needs to be able to quickly and easily produce the splits graph, and then allow us to explore and manipulate it interactively.

To do this, the features listed below are the ones that I consider to be most helpful for EDA (and thanks to Guido Grimm and Scot Kelchner for making some of the suggestions). It would be great to have a computer program that implements all of these features, but this does not yet exist. SplitsTree has some of them, making it the current program of choice. However, there is quite some way to go before a truly suitable program could exist.

Note that these desiderata fall into several groups:
  1. evaluating the network itself
  2. comparing the network to other possible representations of the data
  3. manipulating the presentation of the network
It is desirable to be able to interactively:
  • specify which supported splits are shown in the graph — eg. show only those explicitly supported by characters
  • list the split-support values
  • highlight particular splits in the graph — eg. by clicking on one of the edges
  • identify splits for specified taxon partitions (if the split is supported) — this is the complement to the previous one, in which we specify the split from a list of objects, not from the graph itself
  • identify which splits are sensitive to the model used — eg. different network algorithms
  • identify which edges are missing when comparing a planar graph with an n-dimensional one — this would potentially be complex if one compares, say, a NeighborNet to a Median Network
  • map support values onto the graph (ie. other than split support, which is usually the edge length) — eg. bootstrap values
  • evaluate the tree-likeness of the network — ie. the extent of reticulation needed to display the data
  • map edges from other networks or trees onto the graph — this allows us to compare graphs, or to superimpose a specified tree onto the network
  • find out if the network is tree-based, by breaking it down into a defined number of trees — along with a measure of how comprehensively these trees capture the network
  • create a tree-based network by having the network be the super-set of some specified tree — eg. the NeighborNet graph could be a superset of the Neighbor-Joining tree
  • manipulate the presentation of the graph — eg. orientation, colours, fonts, etc
  • remove trivial splits — eg. those with edges shorter than some specified minimum, assuming that edge length represents split support
  • plot characters onto the graph — possibly next to the object labels, but preferably on the edges if they are associated with particular partitions
  • examine which subsets of the data are responsible for the reticulations — eg. for character-based inputs this might be a sliding window that updates the network for each region of an alignment, or for tree-based inputs it might be a tree inclusion-exclusion list.
Other relevant posts

Here are some other blog posts that discuss the use of splits graphs for exploring genealogical data.

How to interpret splits graphs

Recognizing groups in splits graphs

Splits and neighborhoods in splits graphs

Mis-interpreting splits graphs

Tuesday, September 19, 2017

Arguments from authority, and the Cladistic Ghost, in historical linguistics

Arguments from authority play an important role in our daily lives and our societies. In political discussions, we often point to the opinion of trusted authorities if we do not know enough about the matter at hand. In medicine, favorable opinions by respected authorities function as one of four levels of evidence (admittedly, the lowest) to judge the strength of a medicament. In advertising, the (at times doubtful) authority of celebrities is used to convince us that a certain product will change our lives.

Arguments from authority are useful, since they allow us to have an opinion without fully understanding it. Given the ever-increasing complexity of the world in which we live, we could not do without them. We need to build on the opinions and conclusions of others in order to construct our personal little realm of convictions and insights. This is specifically important for scientific research, since it is based on a huge network of trust in the correctness of previous studies which no single researcher could check in a lifetime.

Arguments from authority are, however, also dangerous if we blindly trust them without critical evaluation. To err is human, and there is no guarantee that the analysis of our favorite authorities is always error proof. For example, famous linguists, such as Ferdinand de Saussure (1857-1913) or Antoine Meillet (1866-1936), revolutionized the field of historical linguistics, and their theories had a huge impact on the way we compare languages today. Nevertheless, this does not mean that they were right in all their theories and analyses, and we should never trust any theory or methodological principle only because it was proposed by Meillet or Saussure.

Since people tend to avoid asking why their authority came to a certain conclusion, arguments from authority can be easily abused. In the extreme, this may culminate in totalitarian societies, or societies ruled by religious fanaticism. To a smaller degree, we can also find this totalitarian attitude in science, where researchers may end up blindly trusting the theory of a certain authority without further critically investigating it.

The comparative method

The authority in this context does not necessarily need to be a real person; it can also be a theory or a certain methodology. The financial crisis of 2008 can be taken as an example of a methodology, namely classical "economic forecasting", that turned out to be trusted much more than it deserved. In historical linguistics, we have a similar quasi-religious attitude towards our traditional comparative method (see Weiss 2014 for an overview), which we use in order to compare languages. This "method" is in fact no method at all, but rather a huge bunch of techniques by which linguists have been comparing and reconstructing languages during the past 200 years. These include the detection of cognate or "homologous" words across languages, and the inference of regular sound correspondence patterns (which I discussed in a blog post from October last year), but also the reconstruction of sounds and words of ancestral languages not attested in written records, and the inference of the phylogeny of a given language family.

In all of these matters, the comparative method enjoys a quasi-religious authority in historical linguistics. Saying that they do not follow the comparative method in their work is among the worst things you can say to historical linguists. It hurts. We are conditioned from when we were small to feel this pain. This is all the more surprising, given that scholars rarely agree on the specifics of the methodology, as one can see from the table below, where I compare the key tasks that different authors attribute to the "method" in the literature. I think one can easily see that there is not much of an overlap, nor a pattern.

Varying accounts of the "comparative method" in the linguistic literature

It is difficult to tell how this attitude evolved. The foundations of the comparative method go back to the early work of scholars in the 19th century, who managed to demonstrate the genealogical relationship of the Indo-European languages. Already in these early times, we can find hints regarding the "methodology" of "comparative grammar" (see for example Atkinson 1875), but judging from the literature I have read, it seems that it was not until the early 20th century that people began to present the techniques for historical language comparison as a methodological framework.

How this framework became the framework for language comparison, although it was never really established as such, is even less clear to me. At some point the linguistic world (which was always characterized by aggressive battles among colleagues, which were fought in the open in numerous publications) decided that the numerous techniques for historical language comparison which turned out to be the most successful ones up to that point are a specific method, and that this specific method was so extremely well established that no alternative approach could ever compete with it.

Biologists, who have experienced drastic methodological changes during the last decades, may wonder how scientists could believe that any practice, theory, or method is everlasting, untouchable and infallible. In fact, the comparative method in historical linguistics is always changing, since it is a label rather than a true framework with fixed rules. Our insights into various aspects of language change are constantly increasing, and the way we practice the comparative method is improving accordingly. As a result, we keep using the same label, but the product we sell is different from the one we sold decades ago. Historical linguists are, however, very conservative regarding the authorities they trust, and our field has always been very skeptical of any newly proposed methodology.

Morris Swadesh (1909-1967), for example, proposed a quantitative approach to infer divergence dates of language pairs (Swadesh 1950 and later), which was rejected almost immediately after he proposed it (Hoijer 1956, Bergsland and Vogt 1962). Swadesh's assumption of constant rates of lexical change was surely problematic, but his general idea of looking at lexical change from the perspective of a fixed set of meanings was very creative at that time, and it has given rise to many interesting investigations (see, among others, Haspelmath and Tadmor 2009). As a result of this early rejection, quantitative work was largely disregarded in the following decades. Not many people paid any attention to David Sankoff's (1969) PhD thesis, in which he tried to develop improved models of lexical change in order to infer language phylogenies, which is probably the reason why Sankoff later turned to biology, where his work received the appreciation it deserved.
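To make the constant-rate assumption concrete, here is a minimal sketch of the calculation usually associated with this approach. The formula t = ln(c) / (2 ln(r)) and the retention rate of 0.86 per millennium are the values commonly quoted in the literature for a 100-item list; the function name and the example counts are hypothetical and serve only as an illustration.

```python
import math


def divergence_time(shared_cognates, list_size, retention_rate=0.86):
    """Estimate the separation time of two languages in millennia.

    Assumes, as the classical approach does, that every language retains
    the same share of its basic vocabulary per millennium (retention_rate).
    This is exactly the assumption that critics found problematic.
    """
    if not 0 < shared_cognates <= list_size:
        raise ValueError("shared_cognates must be in (0, list_size]")
    if not 0 < retention_rate < 1:
        raise ValueError("retention_rate must be strictly between 0 and 1")
    c = shared_cognates / list_size  # proportion of shared cognates
    # Both lineages change independently, hence the factor 2.
    return math.log(c) / (2 * math.log(retention_rate))


if __name__ == "__main__":
    # Hypothetical counts: 74 of 100 meanings expressed by cognate words.
    print(round(divergence_time(74, 100), 2), "millennia")
```

Because the estimate depends entirely on the single parameter `retention_rate`, any variation of the real rate across languages or meanings translates directly into an error in the inferred date.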

Shared innovations

Since the beginning of the third millennium, quantitative studies have enjoyed a new popularity in historical linguistics, as can be seen in the numerous papers that have been devoted to automatically inferred phylogenies (see Gray and Atkinson 2003 and passim). The field has begun to accept these methods as additional tools to provide an understanding of how our languages evolved into their current shape. But scholars tend to contrast these new techniques sharply with the "classical approaches", namely the different modules of the comparative method. Many scholars also still assume that the only valid technique by which phylogenies (be they trees or networks) can be inferred is to identify shared innovations in the languages under investigation (Donohue et al. 2012, François 2014).

The idea of shared innovations was first proposed by Brugmann (1884), and has its direct counterpart in Hennig's (1950) framework of cladistics. In a later book by Brugmann, we find the following passage on shared innovations (or synapomorphies in Hennig's terminology):
The only thing that can shed light on the relation among the individual language branches [...] are the specific correspondences between two or more of them, the innovations, by which each time certain language branches have advanced in comparison with other branches in their development. (Brugmann 1967[1886]:24, my translation)
Unfortunately, not many people seem to have read Brugmann's original text in full. Brugmann says that subgrouping requires the identification of shared innovative traits (as opposed to shared retentions), but he remains skeptical about whether this can be done in a satisfying way, since we often do not know whether certain traits developed independently, were borrowed at later stages, or are simply being misidentified as being "shared". Brugmann's proposed solution to this is to claim that shared, potentially innovative traits, should be numerous enough to reduce the possibility of chance.
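Brugmann's remedy can be illustrated with a toy calculation. It assumes (which Brugmann does not state, and which real linguistic traits will rarely satisfy) that each trait has the same probability of being shared for reasons other than common descent, and that the traits are independent of each other. The function name and the probability value are hypothetical.

```python
def chance_of_spurious_group(p_single, n_traits):
    """Probability that ALL n_traits shared traits are spurious.

    p_single is the assumed probability that one shared trait is due to
    chance, borrowing, or misidentification rather than common descent.
    Independence of the traits is an assumption of this sketch.
    """
    if not 0 <= p_single <= 1:
        raise ValueError("p_single must be a probability")
    if n_traits < 0:
        raise ValueError("n_traits must not be negative")
    return p_single ** n_traits


if __name__ == "__main__":
    # Even with a pessimistic 50% chance per trait, the probability that
    # every trait is misleading shrinks quickly as traits accumulate.
    for n in (1, 3, 5, 10):
        print(n, chance_of_spurious_group(0.5, n))
```

The sketch also shows where the argument is fragile: if the traits are not independent, for example because they were borrowed together, the product no longer applies and numerous traits offer much less protection.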

While biology has long since abandoned the cladistic idea, turning instead to quantitative (mostly stochastic) approaches in phylogenetic reconstruction, linguists are surprisingly stubborn in this regard. It is beyond question that those uniquely shared traits among languages that are unlikely to have evolved by chance or language contact are good proxies for subgrouping. But they are often very hard to identify, and this is probably also the reason why our understanding of the phylogeny of the Indo-European language family has not improved much during the past 100 years. In situations where we lack any striking evidence, quantitative approaches may as well be used to infer potentially innovated traits. If we do a better job in listing these cases (current software, which was designed by biologists, is not really helpful in logging all decisions and inferences that were made by the algorithms), we could profit a lot from turning to computer-assisted frameworks, in which experts thoroughly evaluate the inferences made by the automatic approaches in order to generate new hypotheses and improve our understanding of our languages' past.

A further problem with cladistics is that scholars often use the term shared innovation for inferences, while the cladistic toolkit, and the reason why Brugmann and Hennig thought that shared innovations are needed for subgrouping, rests on the assumption that one knows the true evolutionary history (De Laet 2005: 85). Since the true evolutionary history is a tree in the cladistic sense, an innovation can only be identified if one knows the tree. This means, however, that one cannot use the innovations to infer the tree (if it has to be known in advance). What scholars thus mean when talking about shared innovations in linguistics are potentially shared innovations, that is, characters that are potentially diagnostic of subgrouping.
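The circularity can be made tangible with a small sketch. It assumes that trees are written as nested tuples of language names, and it treats a derived trait as a candidate shared innovation only if the languages carrying it form exactly one clade of the given tree. Losses and reversals are ignored, and the language names, trees, and function names are all hypothetical.

```python
def clades(tree):
    """Collect the leaf set of every subtree of a nested-tuple tree."""
    found = set()

    def leaves(node):
        if isinstance(node, str):  # a leaf is a language name
            leafset = frozenset([node])
        else:  # an inner node is a tuple of subtrees
            leafset = frozenset().union(*(leaves(child) for child in node))
        found.add(leafset)
        return leafset

    leaves(tree)
    return found


def is_shared_innovation(carriers, tree):
    """True if the languages carrying the trait form a clade of the tree."""
    return frozenset(carriers) in clades(tree)


if __name__ == "__main__":
    carriers = {"A", "B"}  # languages that show the derived trait
    tree_1 = (("A", "B"), ("C", "D"))
    tree_2 = (("A", "C"), ("B", "D"))
    # The same trait is judged differently depending on the assumed tree.
    print(is_shared_innovation(carriers, tree_1))
    print(is_shared_innovation(carriers, tree_2))
```

The same trait counts as a shared innovation on the first tree but not on the second, so the judgment presupposes the very tree it is supposed to support.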


Given how quickly science evolves and how non-permanent our knowledge and our methodologies are, I would never claim that the new quantitative approaches are the only way to deal with trees or networks in historical linguistics. The last word on this debate has not yet been spoken, and while I view many aspects critically, there is also much room for concrete improvement (List 2016). But I see very clearly that our tendency as historical linguists to take the comparative method as the only authoritative way to arrive at a valid subgrouping is not leading us anywhere.

Do computational approaches really switch off the light which illuminates classical historical linguistics?

In a recent review, Stefan Georg, an expert on Altaic languages, writes that the recent computational approaches to phylogenetic reconstruction in historical linguistics "switch out the light which has illuminated Indo-European linguistics for generations (by switching on some computers)", and that they "reduce this discipline to the pre-modern guesswork stage [...] in the belief that all that processing power can replace the available knowledge about these languages [...] and will produce ‘results’ which are worth the paper they are printed on" (Georg 2017: 372, footnote). It seems to me that, if a discipline has been enlightened too much by its blind trust in authorities, it is not the worst idea to switch off the light once in a while.

  • Anttila, R. (1972): An introduction to historical and comparative linguistics. Macmillan: New York.
  • Atkinson, R. (1875): Comparative grammar of the Dravidian languages. Hermathena 2.3. 60-106.
  • Bergsland, K. and H. Vogt (1962): On the validity of glottochronology. Current Anthropology 3.2. 115-153.
  • Brugmann, K. (1884): Zur Frage nach den Verwandtschaftsverhältnissen der indogermanischen Sprachen [Questions regarding the closer relationship of the Indo-European languages]. Internationale Zeitschrift für allgemeine Sprachwissenschaft 1. 228-256.
  • Bußmann, H. (2002): Lexikon der Sprachwissenschaft. Kröner: Stuttgart.
  • De Laet, J. (2005): Parsimony and the problem of inapplicables in sequence data. In: Albert, V. (ed.): Parsimony, phylogeny, and genomics. Oxford University Press: Oxford. 81-116.
  • Donohue, M., T. Denham, and S. Oppenheimer (2012): New methodologies for historical linguistics? Calibrating a lexicon-based methodology for diffusion vs. subgrouping. Diachronica 29.4. 505-522.
  • Fleischhauer, J. (2009): A Phylogenetic Interpretation of the Comparative Method. Journal of Language Relationship 2. 115-138.
  • Fox, A. (1995): Linguistic reconstruction. An introduction to theory and method. Oxford University Press: Oxford.
  • François, A. (2014): Trees, waves and linkages: models of language diversification. In: Bowern, C. and B. Evans (eds.): The Routledge handbook of historical linguistics. Routledge: New York. 161-189.
  • Georg, S. (2017): The Role of Paradigmatic Morphology in Historical, Areal and Genealogical Linguistics. Journal of Language Contact 10. 353-381.
  • Glück, H. (2000): Metzler-Lexikon Sprache. Metzler: Stuttgart.
  • Gray, R. and Q. Atkinson (2003): Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature 426.6965. 435-439.
  • Harrison, S. (2003): On the limits of the comparative method. In: Joseph, B. and R. Janda (eds.): The handbook of historical linguistics. Blackwell: Malden and Oxford and Melbourne and Berlin. 213-243.
  • Haspelmath, M. and U. Tadmor (2009): The Loanword Typology project and the World Loanword Database. In: Haspelmath, M. and U. Tadmor (eds.): Loanwords in the world’s languages. de Gruyter: Berlin and New York. 1-34.
  • Hennig, W. (1950): Grundzüge einer Theorie der phylogenetischen Systematik. Deutscher Zentralverlag: Berlin.
  • Hoenigswald, H. (1960): Phonetic similarity in internal reconstruction. Language 36.2. 191-192.
  • Hoijer, H. (1956): Lexicostatistics. A critique. Language 32.1. 49-60.
  • Jarceva, V. (1990): . Sovetskaja Enciklopedija: Moscow.
  • Klimov, G. (1990): Osnovy lingvističeskoj komparativistiki [Foundations of comparative linguistics]. Nauka: Moscow.
  • Lehmann, W. (1969): Einführung in die historische Linguistik. Carl Winter:
  • List, J.-M. (2016): Beyond cognacy: Historical relations between words and their implication for phylogenetic reconstruction. Journal of Language Evolution 1.2. 119-136.
  • Makaev, E. (1977): Obščaja teorija sravnitel’nogo jazykoznanija [Common theory of comparative linguistics]. Nauka: Moscow.
  • Matthews, P. (1997): Oxford concise dictionary of linguistics. Oxford University Press: Oxford.
  • Rankin, R. (2003): The comparative method. In: Joseph, B. and R. Janda (eds.): The handbook of historical linguistics. Blackwell: Malden and Oxford and Melbourne and Berlin.
  • Sankoff, D. (1969): Historical linguistics as stochastic process. PhD thesis. McGill University: Montreal.
  • Weiss, M. (2014): The comparative method. In: Bowern, C. and B. Evans (eds.): The Routledge handbook of historical linguistics. Routledge: New York. 127-145.