Speaker
Description
The SARS-CoV-2 pandemic showed how the spillover of a novel virus can grow from a localised outbreak into a global pandemic within weeks. In the early stages of an outbreak, information is scarce, and what is available can be exceedingly valuable. Experimental data are time-consuming to produce and often do not become available until much later in an outbreak.
Protein language models (PLMs) such as ESM-2 are trained on millions of protein sequences and learn the statistical properties of amino acid sequences. Recently, ESM-2 representations were shown to support prediction of a protein's 3D structure from a single sequence, with no alignment or other additional information required. This indicates that the embeddings PLMs produce carry information about protein structure and evolutionary constraints.
Here, we describe how ESM-2 can be used to extract meaningful information about a novel virus from single sequences at various stages of an outbreak. Using the SARS-CoV-2 pandemic as a case study, we show how these models could have been applied, both in the early stages before experimental observations were available and in later stages for monitoring and horizon scanning. We show how the model outputs can supplement the available information, and we make a case for their use in future outbreaks and pandemics.
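As a rough illustration of the single-sequence idea, the sketch below ranks variant sequences by how far their embedding lies from that of a reference sequence. It is not the method presented in the talk. The names `toy_embed`, `cosine_distance` and `rank_by_distance` are hypothetical, and `toy_embed` is a deliberately simple stand-in (amino acid composition) so that the example runs without any model download; in practice one would pass in a function that returns a pooled ESM-2 embedding instead.

```python
import math
from typing import Callable, Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_embed(seq: str) -> List[float]:
    """Placeholder embedder: amino acid composition. This is NOT ESM-2;
    it only stands in for a function mapping a sequence to a fixed-length vector."""
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]


def cosine_distance(u: List[float], v: List[float]) -> float:
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cannot compare a zero-length embedding vector")
    return 1.0 - dot / (norm_u * norm_v)


def rank_by_distance(
    reference: str,
    variants: Dict[str, str],
    embed: Callable[[str], List[float]] = toy_embed,
) -> List[Tuple[str, float]]:
    """Rank variants by embedding distance from the reference, farthest first.
    Each sequence is embedded on its own; no alignment is used."""
    ref_vec = embed(reference)
    scored = [
        (name, cosine_distance(ref_vec, embed(seq)))
        for name, seq in variants.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Arbitrary toy strings for illustration only, not real viral sequences.
    reference = "MKTAYIAKQR"
    variants = {"one_sub": "MKTAYIAKQK", "many_sub": "WWWWWIAKQR"}
    for name, dist in rank_by_distance(reference, variants):
        print(f"{name}\t{dist:.3f}")
```

Keeping the embedder as a parameter means the same ranking logic could be reused unchanged with any PLM; only the `embed` callable would need replacing.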