Complex models for genetic sequence data

Hannaford, Naomi Elizabeth

Please use this identifier to cite or link to this item: http://theses.ncl.ac.uk/jspui/handle/10443/5458

Title:	Complex models for genetic sequence data
Authors:	Hannaford, Naomi Elizabeth
Issue Date:	2021
Publisher:	Newcastle University
Abstract:	In this thesis, the aim is to develop biologically motivated Bayesian models in two areas: molecular phylogenetics and time-series metagenomics. In molecular phylogenetics, the goal is generally to learn about the evolutionary history of a collection of species using molecular sequence data, for example, DNA. Evolutionary history is represented graphically using evolutionary trees, where the root of a tree represents the most recent common ancestor of all species in the tree. Substitutions in sequences are modelled through a continuous-time Markov process, characterised by an instantaneous rate matrix, which standard models assume is stationary and time-reversible. These assumptions are biologically questionable and induce a likelihood function which is invariant to a tree’s root position. This is detrimental to inference, since a tree’s biological interpretation depends on where it is rooted. By relaxing both assumptions, we introduce two new models whose likelihoods can distinguish between rooted trees. These models are non-stationary, with step changes in the rate matrix on each branch. Each rate matrix belongs to a non-reversible family of Lie Markov models, which are closed under matrix multiplication. The two models differ in which non-reversible Lie Markov model they use. We perform our analysis in the Bayesian framework using Markov chain Monte Carlo methods. We assess the performance of our models using a simulation study, before considering an application to a Drosophila data set, where most models fail to identify a plausible root position.
In time-series metagenomics, counts of operational taxonomic units (OTUs), which are pragmatic proxies for microbial species, are modelled over time. We have weekly counts of different OTUs from two tanks in a wastewater treatment plant. We develop a Bayesian hierarchical vector autoregressive model to model the dynamics of the OTUs, whilst also incorporating environmental and chemical data. Clustering methods are explored to reduce the dimensionality of our data and mitigate the issue of large proportions of zero counts in the data. We use a seasonal phase-based clustering approach and a symmetric, circulant, tridiagonal error structure. The autoregressive coefficient matrix is assumed to be sparse, so we explore different priors that allow for sparsity by analysing simulated data sets before selecting the regularised horseshoe prior for our hierarchical model. The chemical and environmental covariates are incorporated through a time-varying mean. Finally, we fit the model to the data from each tank using Hamiltonian Monte Carlo.
Description:	PhD Thesis
URI:	http://hdl.handle.net/10443/5458
Appears in Collections:	School of Mathematics, Statistics and Physics

Files in This Item:

File	Description	Size	Format
Hannaford N E 2021.pdf		6.94 MB	Adobe PDF
dspacelicence.pdf		43.82 kB	Adobe PDF
