May 6 – 9, 2025
Abbaye de Royaumont, Asnières-sur-Oise, France
Europe/Paris timezone

SARITA: a generative large language model accurately predicting the emergence of SARS-CoV-2 critical mutations

Not scheduled
20m
Abbaye de Royaumont, 95270 Asnières-sur-Oise, France
Oral
Genomics & bioinformatics

Speaker

Prof. Marco Salemi (University of Florida Emerging Pathogens Institute)

Description

The COVID-19 pandemic has caused over 776 million cases and 7 million deaths globally, highlighting the need for predictive tools to anticipate SARS-CoV-2 evolution. The S1 subunit of the Spike glycoprotein, essential for viral entry into human cells, undergoes frequent mutations that influence transmissibility and immune evasion. Predicting such mutations could be crucial for the development of future vaccines and therapies. We present SARITA, a generative large language model (LLM) based on the GPT-3 architecture, designed to generate high-quality synthetic sequences of the SARS-CoV-2 Spike S1 subunit. SARITA is available in four sizes, ranging from 85 million (SARITA-S) to 1.2 billion parameters (SARITA-XL), enabling efficient sequence generation. For training, we downloaded 16,187,950 Spike protein sequences from the GISAID database (December 2019–November 2023) and filtered them for quality. A curated, balanced set of 150,000 Spike sequences (December 2019–March 2021) was used for model training to avoid overrepresentation of dominant lineages. SARITA's performance was evaluated on a test set of 145,059 unique sequences collected between March 2021 and November 2023, covering emerging variants (Delta, Omicron). The evaluation focused on sequence quality (valid amino acids), similarity (alignment with real-world sequences), and the prediction of biologically plausible single-point mutations. SARITA generated high-quality sequences in 97–99% of cases, and 98–99% of generated sequences had a Levenshtein distance of less than 10, reflecting strong similarity to real data. High similarity was further confirmed by PAM30 scores, obtained by aligning the generated sequences with the Wuhan reference strain.
Notably, SARITA accurately reproduced the emergence of critical mutations, including L212I in the Omicron variant and T19L in the Delta variant, demonstrating its capacity to model biologically relevant evolutionary changes. These results highlight SARITA's robust capability to predict SARS-CoV-2 evolution, offering valuable support for the proactive development of adaptable vaccines and targeted treatments.
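
The three evaluation criteria named in the abstract (valid amino acids, Levenshtein distance below 10, and single-point mutations relative to a reference) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the 20-letter amino acid alphabet, and the toy sequences are assumptions made for illustration, and the abstract does not state which real sequence each generated sequence is compared against. The PAM30 alignment step is not covered here.

```python
# Illustrative sketch only; names and toy data are hypothetical, not from SARITA.

# Assumption: "valid" means composed solely of the 20 standard amino acids.
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_valid_sequence(seq: str) -> bool:
    """Return True if seq is non-empty and uses only standard amino acid letters."""
    return len(seq) > 0 and set(seq) <= VALID_AMINO_ACIDS


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def point_substitutions(reference: str, generated: str) -> list[str]:
    """List substitutions in the usual <ref><position><new> notation (1-based).

    Assumes equal-length, already aligned sequences; indels need a real aligner.
    """
    if len(reference) != len(generated):
        raise ValueError("sequences must have equal length (align them first)")
    return [f"{ref}{pos}{gen}"
            for pos, (ref, gen) in enumerate(zip(reference, generated), start=1)
            if ref != gen]


# Toy strings, not real Spike data.
reference = "ACDEFGHIKL"
generated = "ACDEFGHIKV"

print(is_valid_sequence(generated))               # quality check
print(levenshtein(reference, generated) < 10)     # similarity threshold from the abstract
print(point_substitutions(reference, generated))  # substitution labels
```

With real data, the position-wise comparison would have to be run on sequences aligned to the reference, since a single insertion or deletion shifts every downstream position.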

Primary authors

Prof. Marco Salemi (University of Florida Emerging Pathogens Institute)
Dr Simone Marini (University of Florida Emerging Pathogens Institute)
Mr Simone Rancati (University of Pavia, Italy)

Co-authors

Dr Giovanna Nicora (University of Pavia, Italy)
Prof. Riccardo Bellazzi (University of Pavia, Italy)
Prof. Mattia Prosperi (University of Florida Emerging Pathogens Institute)
