Speaker
Description
With dozens or hundreds of minor variants of SARS-CoV-2 circulating in the global population, there is an urgent need to predict the future frequency of a new variant when it emerges. Such predictions would allow more focused experimental efforts and timely formulation of new vaccines. To address this need, we constructed machine learning models based on the transformer architecture (used in modern language processing models). These models take pango lineage frequency time series as input and predict future lineage frequencies. We trained the models on data collected in the US and the UK before the end of 2022 and tested them against data collected in 2023. The best model predicts the frequency of a newly emerged lineage two months in the future with a high level of accuracy, i.e. a mean absolute error of less than 0.4 on a log10 scale. Surprisingly, the model makes predictions with similar accuracy on data collected from other countries (where the total number of sequences is greater than 10000) without retraining. We compared our model's performance with the NextStrain prediction (based on a multinomial logistic model) and found that our model outperformed NextStrain substantially, especially for newly emerged lineages. These results demonstrate that machine learning approaches, such as natural language processing models, represent promising new methods for using genomic data to forecast SARS-CoV-2 lineage frequencies.
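
To make the error metric quoted above concrete, the following is a minimal sketch that reads it as the mean absolute error between log10-transformed predicted and observed lineage frequencies. The abstract does not give the exact definition, so this reading is an assumption, and the function names `log10_mae` and `fold_error` and the `floor` parameter are hypothetical, introduced only for illustration.

```python
import math


def log10_mae(predicted, observed, floor=1e-6):
    """Mean absolute error between log10 frequencies.

    Assumption: frequencies are proportions in [0, 1]. `floor` is a
    hypothetical lower bound that keeps log10 defined for zero values;
    the abstract does not say how zero frequencies are handled.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not predicted:
        raise ValueError("at least one frequency pair is required")
    errors = [
        abs(math.log10(max(p, floor)) - math.log10(max(o, floor)))
        for p, o in zip(predicted, observed)
    ]
    return sum(errors) / len(errors)


def fold_error(log10_error):
    """Convert an error on the log10 scale into a multiplicative factor."""
    return 10 ** log10_error


if __name__ == "__main__":
    # Illustrative values only, not data from the study.
    predicted = [0.10, 0.01]
    observed = [0.01, 0.01]
    print(log10_mae(predicted, observed))
    print(fold_error(0.4))
```

Under this reading, an error of 0.4 on the log10 scale means a predicted frequency is off by a factor of roughly 2.5 on average, since 10^0.4 is about 2.51.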