May 6 – 9, 2025
Abbaye de Royaumont, Asnières-sur-Oise, France
Europe/Paris timezone

SARITA: A Large Language Model for Generating the S1 Subunit of the SARS-CoV-2 Spike

Not scheduled
20m
Abbaye de Royaumont, 95270 Asnières-sur-Oise, France
Poster Software, tools & methods Virtual posters

Speakers

Simone Rancati (University of Pavia) Prof. Marco Salemi (University of Florida)

Description

The COVID-19 pandemic has caused over 776 million cases and 7 million deaths globally, highlighting the need for predictive tools to anticipate SARS-CoV-2 evolution. The S1 subunit of the Spike glycoprotein, essential for viral entry into human cells, undergoes frequent mutations that influence transmissibility and immune evasion. Predicting these mutations is crucial for developing vaccines and therapies. We present SARITA, a generative large language model (LLM) based on the GPT-3 architecture, designed to generate high-quality synthetic sequences of the SARS-CoV-2 Spike S1 subunit. SARITA is available in four sizes, ranging from 85 million (SARITA-S) to 1.2 billion parameters (SARITA-XL), enabling efficient sequence generation. For training, we downloaded 16,187,950 Spike protein sequences from the GISAID database (December 2019–November 2023) and filtered them according to quality criteria. A curated set of 150,000 balanced Spike sequences (December 2019–March 2021) was used for model training to avoid overrepresentation of dominant lineages. SARITA's performance was evaluated on a test set of 145,059 unique sequences collected between March 2021 and November 2023, covering emerging variants (Delta, Omicron). The evaluation focused on sequence quality (valid amino acids), similarity (alignment with real-world sequences), and single-point mutation prediction (biological plausibility of mutations). Results showed that SARITA generated high-quality sequences in 97–99% of cases and achieved a Levenshtein distance of less than 10 in 98–99% of generated sequences, reflecting strong similarity to real data. High similarity was further confirmed by PAM30 scores, obtained by aligning the generated sequences with the Wuhan reference strain. Notably, SARITA accurately reproduced critical mutations, including L212I in the Omicron variant and T19L in the Delta variant, demonstrating its capacity to model biologically relevant evolutionary changes.
These results highlight SARITA's robust capability to predict SARS-CoV-2 evolution, offering valuable support for the proactive development of adaptable vaccines and targeted treatments.
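
The three evaluation criteria named in the abstract (valid amino acids, Levenshtein distance below 10, single-point mutations) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the restriction to the 20 standard amino-acid letters, the choice to measure distance against the nearest real sequence, and the toy sequences are all assumptions made for illustration.

```python
# Illustrative sketch only; names and design choices are assumptions,
# not the SARITA evaluation pipeline.

# Assumption: a "valid" sequence uses only the 20 standard amino-acid letters.
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_valid_sequence(seq: str) -> bool:
    """Return True if seq is non-empty and contains only standard residues."""
    return bool(seq) and set(seq) <= VALID_AMINO_ACIDS


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_similar(generated: str, real_sequences: list[str], threshold: int = 10) -> bool:
    """Assumption: similarity means distance < threshold to the nearest real sequence."""
    return min(levenshtein(generated, real) for real in real_sequences) < threshold


def point_mutations(reference: str, variant: str) -> list[str]:
    """List substitutions such as 'I6L' (1-based) between equal-length sequences."""
    if len(reference) != len(variant):
        raise ValueError("sequences must be aligned to the same length")
    return [f"{r}{i}{v}"
            for i, (r, v) in enumerate(zip(reference, variant), start=1)
            if r != v]


# Toy sequences, not real Spike data.
reference = "MKTAYIAK"
generated = "MKTAYLAK"
print(is_valid_sequence(generated))         # validity check
print(levenshtein(reference, generated))    # edit distance to the reference
print(is_similar(generated, [reference]))   # threshold check
print(point_mutations(reference, generated))
```

For full-length S1 sequences and test sets of the size reported here, a pure-Python distance loop would be slow; a compiled edit-distance library or a proper alignment tool would be the more practical choice.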

Primary authors

Dr Giovanna Nicora (University of Pavia) Dr Tommaso Buonocore (University of Pavia) Dr Riccardo Bellazzi (University of Pavia) Ms Laura Bergomi (University of Pavia) Dr Mattia Prosperi (University of Florida)

Co-authors

Simone Rancati (University of Pavia) Dr Simone Marini (University of Florida) Prof. Marco Salemi (University of Florida)
