Oct 31, 15:30 - 16:30

A scalable and parametrizable pipeline for the Bayesian phylodynamic inference of SARS-CoV-2

The recent COVID-19 pandemic led to the adoption of genomic biosurveillance to identify variants that arise from genome variation in the course of evolution. Integrating models of evolution and epidemiology gives rise to the field of phylodynamics, which is normally analyzed using Bayesian approaches. Although these approaches offer advantages such as summarizing phylogenetic uncertainty, they also pose challenges due to the scale of a global pandemic, the complexity of the bioinformatics pipeline, and the inherent computational intractability of Bayesian inference. To address these challenges, we have developed a scalable and parametrizable pipeline based on the principles of nf-core. It streamlines the process of compiling sequence data from public databases, setting up prior distributions, performing various preliminary genomic analyses, and finally calculating the posterior distributions of the epidemiological parameters being inferred using Markov chain Monte Carlo (MCMC) methods, while simultaneously sampling the phylogeny and population size trajectory. The main inputs of the pipeline are sequence and metadata files of SARS-CoV-2, taken from the EpiCoV database of the Global Initiative on Sharing All Influenza Data (GISAID), from sequences we generated in-house at the Philippine Genome Center, or from both. The first filtering steps subset the data by the time period of the analysis and by subsampling fraction. Outgroups are then added to stabilize the phylogenetic tree, and the filtered dataset is passed to an Augur pipeline for preliminary genomic analyses. The multiple sequence alignment is loaded into an Extensible Markup Language (XML) file together with the necessary models and parameter configurations, such as MCMC initial values, prior distributions, chain length, and the number of CPU cores to be used. The XML file is then used as input to Bayesian Evolutionary Analysis by Sampling Trees (BEAST2).
Overall, the pipeline provides a straightforward way to deploy the analyses on different high-performance computing clusters and to configure the different combinations of parameters that are crucial in Bayesian phylodynamic inference.
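
As an illustration of the first filtering steps (subsetting by time period, then subsampling by fraction), a minimal Python sketch is given below. It is not the pipeline's actual code: the function name `subset_and_subsample`, the record fields `strain` and `date`, and the sample values are all hypothetical, and it assumes every record carries a complete ISO date (YYYY-MM-DD).

```python
import random
from datetime import date


def subset_and_subsample(records, start, end, fraction, seed=0):
    """Keep records dated within [start, end], then draw a random fraction of them.

    `records` is a list of dicts with a hypothetical "date" field holding a
    complete ISO date string (YYYY-MM-DD).
    """
    in_window = [
        r for r in records
        if start <= date.fromisoformat(r["date"]) <= end
    ]
    # Number of sequences to keep after subsampling.
    k = round(len(in_window) * fraction)
    # A fixed seed keeps the subsample reproducible across runs.
    rng = random.Random(seed)
    return rng.sample(in_window, k)


# Made-up metadata records, for illustration only.
records = [
    {"strain": "sample_A", "date": "2020-11-02"},
    {"strain": "sample_B", "date": "2021-01-15"},
    {"strain": "sample_C", "date": "2021-03-09"},
    {"strain": "sample_D", "date": "2021-04-21"},
    {"strain": "sample_E", "date": "2021-06-30"},
    {"strain": "sample_F", "date": "2021-08-05"},
]

subset = subset_and_subsample(
    records, start=date(2021, 1, 1), end=date(2021, 6, 30), fraction=0.5
)
print([r["strain"] for r in subset])
```

Exposing the time window, the fraction, and the seed as arguments mirrors the idea of a parametrizable pipeline: the same step can be rerun with a different combination of settings without editing the code.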

Speaker

Co-authors

Francis Tablizo, John Justine Villar