Wednesday, October 31, 2012

When is there support for a large phylogeny?


I have commented before (see this post: Networks and bootstraps as tree-support criteria) that there is often a discrepancy between bootstrap support on a phylogenetic tree (performed as a data analysis) and the output of a network analysis (performed as an exploratory data analysis, EDA). Here I will present a couple of examples of large phylogenies.

Soltis et al. (2011) report for their study of angiosperm phylogeny that:

  • We conducted two primary analyses based on 640 species representing 330 families. The first included 25,260 aligned base pairs from 17 genes, representing all three plant genomes, i.e., nucleus, plastid, and mitochondrion ... Phylogenetic analyses using maximum likelihood were conducted in the program RAxML ... Many important questions of deep-level relationships in the non-monocot angiosperms have now been resolved with strong support ... Our analyses confirm that with large amounts of sequence data, most deep-level relationships within the angiosperms can be resolved.

By "strong support" the authors mean that most of the branches on their phylogenetic tree have >85% bootstrap support.

I performed a NeighborNet analysis on their data file of aligned sequences (available in TreeBASE). This was quite a challenge for the SplitsTree program, as it tested whether the program can handle 640 taxa. It turns out that it can, but it is very slow to redraw the figure, so rotating the figure, as I did here, took a very long time.

The network shows that the bootstrap support is not as convincing as it sounds: there is little clear tree-like structure in the dataset.


As an alternative example, Decker et al. (2009) used a NeighborNet to display their data about cattle domestication:

  • We constructed a phylogenomic network to accurately describe the relationships between 48 cattle breeds and facilitate inferences concerning the history of domestication and breed formation ... Due to memory limitations in SplitsTree, genotypes at 14,023 SNPs were used to construct a network of 372 individuals belonging to 48 breeds. Default settings in SplitsTree were used to construct the networks ... This figure reveals that the history of breed formation in cattle has been complicated and has involved bottlenecks, evolution in isolation, coancestry, migration, and admixture.

The network simply shows that the different breeds can be recognized but that their relationships are not easy to resolve.


In neither of these two examples does there seem to be much reason to be confident in any conclusions about evolutionary relationships, as the network in both cases is essentially an unresolved bush.

The problem is likely to be the size of the phylogenies, as the potential complexity of a dataset increases combinatorially with the number of taxa (each added taxon can potentially have a reticulation with every one of the existing taxa). A dataset thus needs a very strong tree signal when there are hundreds of taxa, if the network is to show anything more than the disorganized blobs displayed here. This seems to be an unlikely scenario for most taxonomic groups, especially when using genetic data.
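To give a rough feel for this growth, here is a minimal sketch in Python that compares a few standard counts for the dataset sizes mentioned above. The function names are my own, chosen for illustration. The counts are the number of taxon pairs (each pair being a potential reticulation), the number of splits that a fully resolved unrooted tree can display, and the total number of possible bipartitions of the taxa.

```python
def taxon_pairs(n):
    """Number of distinct pairs of taxa, i.e. potential pairwise reticulations."""
    return n * (n - 1) // 2


def tree_splits(n):
    """Number of splits (edges) in a fully resolved unrooted tree on n taxa."""
    return 2 * n - 3


def possible_splits(n):
    """Number of ways to divide n taxa into two non-empty groups."""
    return 2 ** (n - 1) - 1


if __name__ == "__main__":
    # 372 and 640 are the dataset sizes discussed in the text;
    # 10 is included only as a small point of comparison.
    for n in (10, 372, 640):
        digits = len(str(possible_splits(n)))
        print(f"n={n}: pairs={taxon_pairs(n)}, "
              f"tree splits={tree_splits(n)}, "
              f"possible splits has {digits} digits")
```

A tree can display only a number of splits that grows linearly with the number of taxa, whereas the number of taxon pairs grows quadratically and the number of possible splits grows exponentially, so the room for non-tree-like signal expands much faster than the room for tree-like signal.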

If this idea is correct then we will need to start thinking about potential solutions, in order to fully utilize networks for EDA. Perhaps the most obvious approach is to filter out the smaller patterns before constructing the network, with "smaller" being defined relative to the objective of the analysis. This approach is already used, for example, for consensus networks (where only a specified percentage of the splits in the input trees is included in the network) and super networks (where splits are also filtered in order to keep the network planar; Huson et al. 2006; Whitfield et al. 2008).
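The threshold-based filtering used for consensus networks can be sketched as follows. This is only an illustration, not the SplitsTree implementation: each input tree is represented simply as a collection of splits, each split is given as the set of taxa on one side, and the function names and toy data are assumptions of mine. A split is kept if it occurs in more than the specified proportion of the input trees.

```python
from collections import Counter


def canonical(side, taxa):
    """Represent a split by the side that does not contain the reference taxon."""
    side = frozenset(side)
    reference = min(taxa)  # arbitrary but fixed reference taxon
    return frozenset(taxa) - side if reference in side else side


def filter_splits(trees, taxa, threshold):
    """Return the splits found in more than `threshold` of the trees,
    mapped to the proportion of trees that contain them."""
    counts = Counter()
    for tree in trees:
        # Count each split at most once per tree.
        counts.update({canonical(side, taxa) for side in tree})
    n_trees = len(trees)
    return {split: count / n_trees
            for split, count in counts.items()
            if count / n_trees > threshold}


if __name__ == "__main__":
    # Hypothetical toy data: three trees on five taxa, non-trivial splits only.
    taxa = {"A", "B", "C", "D", "E"}
    trees = [
        [{"A", "B"}, {"A", "B", "C"}],
        [{"A", "B"}, {"A", "B", "D"}],
        [{"A", "C"}, {"A", "B", "C"}],
    ]
    for split, freq in filter_splits(trees, taxa, 0.5).items():
        print(sorted(split), round(freq, 2))
```

Raising the threshold removes the rarer splits, leaving a simpler network that shows only the more strongly supported patterns. What counts as an appropriate threshold depends on the objective of the analysis.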

References

Decker J.E., Pires J.C., Conant G.C., McKay S.D., Heaton M.P., Chen K., Cooper A., Vilkki J., Seabury C.M., Caetano A.R., Johnson G.S., Brenneman R.A., Hanotte O., Eggert L.S., Wiener P., Kim J.-J., Kim K.S., Sonstegard T.S., Van Tassell C.P., Neibergs H.L., McEwan J.C., Brauning R., Coutinho L.L., Babar M.E., Wilson G.A., McClure M.C., Rolf M.M., Kim J., Schnabel R.D., Taylor J.F. (2009) Resolving the evolution of extant and extinct ruminants with high-throughput phylogenomics. Proceedings of the National Academy of Sciences of the U.S.A. 106: 18644-18649.

Huson D.H., Steel M., Whitfield J.B. (2006) Reducing distortion in phylogenetic networks. Lecture Notes in Bioinformatics 4175: 150-161.

Soltis D.E., Smith S.A., Cellinese N., Wurdack K.J., Tank D.C., Brockington S.F., Refulio-Rodriguez N.F., Walker J.B., Moore M.J., Carlsward B.S., Bell C.D., Latvis M., Crawley S., Black C., Diouf D., Xi Z., Rushworth C.A., Gitzendanner M.A., Sytsma K.J., Qiu Y.L., Hilu K.W., Davis C.C., Sanderson M.J., Beaman R.S., Olmstead R.G., Judd W.S., Donoghue M.J., Soltis P.S. (2011) Angiosperm phylogeny: 17 genes, 640 taxa. American Journal of Botany 98: 704-730.

Whitfield J.B., Cameron S.A., Huson D.H., Steel M.A. (2008) Filtered z-closure supernetworks for extracting and visualizing recurrent signal from incongruent gene trees. Systematic Biology 57: 939-947.
