May 19 – 22, 2026
Canada/Pacific timezone

vir2vec: a pan-viral genomic language model and benchmark for viral genome understanding

May 20, 2026, 12:10 PM
20m
Oral
Software, Tools & Methods

Speakers

Dr Marco Salemi (Emerging Pathogens Institute, University of Florida)
Dr Simone Marini (Department of Epidemiology, College of Public Health and Health Professions, University of Florida)

Description

Genomic language models (gLMs) have emerged as powerful tools for learning numerical representations of DNA sequences. Most existing models, however, are either not trained on viral genomes or trained only on limited viral references, and they lack systematic evaluation frameworks tailored to virology. Here, we introduce vir2vec, a 422-million-parameter decoder-only genomic language model obtained through continual pretraining of Mistral-DNA on a curated pan-viral corpus comprising 565,747 complete genomes from 295 viral species. vir2vec operates on byte-pair–encoded DNA subwords and produces fixed-length, genome-level embeddings via pooling over contextualized token representations, enabling reuse across heterogeneous downstream tasks without task-specific fine-tuning. We evaluate vir2vec on Viral Genomic Understanding Evaluation (vGUE), a benchmark of prediction tasks that we created, spanning multiple levels of viral organization: organism-level discrimination (virus vs non-virus genomes and reads), genome-wide evolutionary signatures (DNA vs RNA viruses and host-range prediction), intra-genus species separation (HIV-1 vs HIV-2), fine-grained variant and subtype typing (SARS-CoV-2 lineages), and phenotypic context signal detection (HIV-1 brain vs plasma tropism). vir2vec outperforms the competing approaches, namely a genomic foundation model and a viral-specific embedding model based on ModernBERT. Performance is particularly strong in genome-wide and evolutionary tasks, with balanced accuracies of 0.96 for DNA vs RNA virus discrimination, 0.84 for host prediction, 1.00 for HIV-1 vs HIV-2 classification, and 0.99 for SARS-CoV-2 lineage identification. By coupling a domain-specialized genomic language model with a standardized viral benchmark, vir2vec and vGUE provide an open foundation for future viral genomic models, surveillance applications, and discovery pipelines.
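
As an illustration of two ideas mentioned above, the following minimal Python sketch shows (a) how pooling can turn a variable number of token vectors into one fixed-length embedding and (b) how balanced accuracy is computed as the unweighted mean of per-class recall. The abstract does not state which pooling operator vir2vec uses, so masked mean pooling is an assumption here. The function names `mean_pool` and `balanced_accuracy` and all input values are hypothetical toy choices, not the authors' code or data.

```python
from collections import defaultdict
from typing import Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]],
              attention_mask: Sequence[int]) -> list[float]:
    """Average the token vectors whose mask entry is 1 (padding is skipped).

    Assumption: mean pooling; the abstract only says "pooling".
    """
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("no unmasked tokens to pool")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def balanced_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class recall over the classes in y_true."""
    if not y_true:
        raise ValueError("y_true is empty")
    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy inputs for illustration only; these are not values from the study.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]  # last row stands in for padding
mask = [1, 1, 0]
embedding = mean_pool(tokens, mask)  # same length regardless of token count

labels = ["DNA", "DNA", "DNA", "RNA"]
preds = ["DNA", "DNA", "RNA", "RNA"]
score = balanced_accuracy(labels, preds)  # recall is 2/3 for DNA, 1/1 for RNA
```

Because each class contributes equally to the mean, balanced accuracy is not inflated by a majority class, which is relevant when class sizes differ widely across tasks.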

Primary authors

Simone Rancati (Department of Electrical, Computer and Biomedical Engineering, University of Pavia)
Mr Pablo Arozarena Donelli (Department of Electrical, Computer and Biomedical Engineering, University of Pavia)

Co-authors

Dr Giovanna Nicora (Department of Electrical, Computer and Biomedical Engineering, University of Pavia)
Mr Laura Bergomi (Department of Electrical, Computer and Biomedical Engineering, University of Pavia)
Dr Tommaso Mario Buonocore (Department of Electrical, Computer and Biomedical Engineering, University of Pavia)
Mr Micheal Aaron Sy (Department of Epidemiology, College of Public Health and Health Professions, University of Florida)
Mr Sakshi Pandey (Department of Computer and Information Science and Engineering, University of Florida)
Dr Marco Salemi (Emerging Pathogens Institute, University of Florida)
Dr Riccardo Bellazzi (Department of Electrical, Computer and Biomedical Engineering, University of Pavia)
Dr Christina Boucher (Department of Computer and Information Science and Engineering)
Dr Enea Parimbelli (Department of Electrical, Computer and Biomedical Engineering, University of Pavia)
Dr Simone Marini (Department of Epidemiology, College of Public Health and Health Professions, University of Florida)
