Speaker
Description
One way to find new features of biological importance in an organism is to look for regions where its genetic sequence is unexpectedly highly conserved, signifying that deviations from the conserved sequence affect the ability to produce viable progeny. Many methods implementing this approach carry an implicit assumption that each locus in a sequence should be treated equally. In fact, some loci yield more information than others. For example, in nucleotide-level analysis of coding regions, information content differs because codon biases impose different levels of constraint for different amino acids. Methods that assume equal information still work in organisms with high genetic diversity, but in organisms with low genetic diversity, accounting for information content may be crucial for detecting weak signals of conservation. In newly emerged pathogens, whose populations have low genetic diversity, early detection of candidate drug targets is crucial for the development of therapeutics.
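To make the point about unequal loci concrete, the sketch below counts, for each position of a codon, how many single-nucleotide changes leave the encoded amino acid unchanged under the standard genetic code. This degeneracy count is only a simplified proxy for the constraint described above and is not the measure used by the method presented in this talk; the function name `synonymous_alternatives` is hypothetical.

```python
# Standard genetic code, codons enumerated in TCAG order (NCBI translation table 1).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO)
)


def synonymous_alternatives(codon, pos):
    """Count the alternative bases at `pos` (0, 1 or 2) that keep the amino acid.

    A count of 3 means the site is free to vary at the protein level, so
    conservation there is surprising; a count of 0 means the protein already
    constrains the site, so conservation there says little about the RNA itself.
    """
    amino_acid = CODON_TABLE[codon]
    count = 0
    for base in BASES:
        if base == codon[pos]:
            continue  # skip the base already present
        mutant = codon[:pos] + base + codon[pos + 1:]
        if CODON_TABLE[mutant] == amino_acid:
            count += 1
    return count


if __name__ == "__main__":
    for codon in ("GGT", "ATG", "TTA"):
        print(codon, [synonymous_alternatives(codon, p) for p in range(3)])
```

Under this proxy, the third position of a glycine codon and any position of the methionine codon sit at opposite extremes, which is one reason a method might weight observed conservation differently from locus to locus.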
In newly emerged pathogens, a further problem in sequence analysis is overdispersion of measures of sequence variability, a result of early mutation patterns. In such scenarios, methods with underlying parametric assumptions may be inappropriate; even methods that appeal to the Central Limit Theorem may be unreliable because convergence to the limiting distribution is very slow.
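As a generic illustration of a test that avoids parametric assumptions, the sketch below estimates a one-sided permutation p-value for whether a window of loci has unusually low mean variability compared with randomly drawn loci from the same genome. It is not the procedure used by the method presented in this talk; the function name, the per-locus `scores` input, and the default parameters are all assumptions made for the example, and drawing loci independently ignores any correlation between neighbouring sites.

```python
import random


def window_permutation_pvalue(scores, start, end, n_perm=2000, seed=0):
    """One-sided permutation p-value that mean(scores[start:end]) is unusually low.

    `scores` is a list of per-locus variability values (hypothetical input).
    The null distribution is built by drawing the same number of loci at
    random, without replacement, from the whole sequence.
    """
    rng = random.Random(seed)
    width = end - start
    observed = sum(scores[start:end]) / width
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(scores, width)
        if sum(sample) / width <= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Toy data: 90 variable loci followed by 10 fully conserved loci.
    toy_scores = [1.0] * 90 + [0.0] * 10
    print(window_permutation_pvalue(toy_scores, 90, 100))
    print(window_permutation_pvalue(toy_scores, 0, 10))
```

Because the null distribution is built from the data themselves, no distributional form is assumed for the variability scores, which is the property that matters when the scores are overdispersed.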
We present a method, RNAdescent, that accounts for the variable information obtained as one traverses a viral genome, and that allows a non-parametric approach to finding conservation in coding regions. We apply this method to the large dataset of SARS-CoV-2 genomes (>5 million sequences) to characterise regions of nucleic acid that must remain conserved, such as a packaging signal and regions key to the formation of subgenomic RNAs. We further show that the method is sufficiently sensitive that it can identify some analogous regions in SARS-CoV, despite the much smaller and less varied set of viral sequences available (119 sequences).