Speaker
Description
Astroviruses comprise a genetically diverse viral family linked to diseases in both humans and birds, resulting in substantial health impacts and economic burdens. They have traditionally been classified into the genera Avastrovirus and Mamastrovirus based on host species, but next-generation sequencing has revealed broader transmission patterns, necessitating a reevaluation of the current classification approach. In response to these challenges, a novel alignment-free taxonomic classification method is introduced that leverages whole-genome k-mer composition alongside host information. To control for the impact of genetic recombination, an optional component for identifying recombinant sequences is incorporated into the method's pipeline. This three-pronged classification approach integrates a supervised machine learning method (support vector machine), an unsupervised machine learning method (K-means++), and consideration of host species. Applied to 191 unclassified astrovirus genomes (with continuous updates for emergent genomes), the approach successfully proposes genus labels. Additionally, eight genomes whose proposed labels are incompatible with their reported host species suggest cross-species infections. Notably, the machine learning-based approach, enhanced by principal component analysis (PCA), supports the hypothesis of a human astrovirus (HAstV) subgenus within Mamastrovirus and a goose astrovirus (GoAstV) subgenus within Avastrovirus, based on differences in their genome compositions. In summary, this multipronged machine learning approach offers a rapid, reliable, and scalable method for predicting taxonomic labels. It addresses the challenges posed by emerging viruses and the exponential growth in genome sequencing output, and it facilitates viral taxonomy research.
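To make the alignment-free idea concrete, the following is a minimal Python sketch of how a genome can be turned into a k-mer composition vector, and how three sources of evidence might be reconciled. It is an illustration under stated assumptions, not the pipeline described in the talk: the function names `kmer_frequencies` and `combine_predictions`, the choice of k, the handling of ambiguous bases, and the consensus rule are all hypothetical. The SVM and K-means++ steps themselves are not implemented here; their outputs are simply passed in as labels.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(sequence, k=3):
    """Return normalized counts of all 4**k k-mers, in lexicographic order.

    Assumption: windows containing ambiguous bases (e.g. N) are skipped.
    """
    seq = sequence.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = Counter()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if all(base in ALPHABET for base in window):
            counts[window] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


def combine_predictions(svm_label, cluster_label, host_label):
    """Hypothetical consensus rule over the three prongs (not from the source).

    svm_label:     genus predicted by a supervised classifier
    cluster_label: genus assigned to the genome's unsupervised cluster
    host_label:    genus implied by the reported host (bird or mammal)
    """
    if svm_label == cluster_label == host_label:
        return svm_label, "consistent"
    if svm_label == cluster_label:
        # Both sequence-based prongs agree but contradict the reported host,
        # which could indicate a cross-species infection.
        return svm_label, "host-incompatible"
    return None, "unresolved"


# Usage: a toy sequence with two ambiguous bases at the end.
vector = kmer_frequencies("ACGTACGTNN", k=2)
label, status = combine_predictions(
    "Mamastrovirus", "Mamastrovirus", "Avastrovirus"
)
```

In a fuller pipeline, vectors like `vector` would serve as the feature input to the classifier and the clustering step. A fixed-length vector per genome is what allows sequences of different lengths to be compared without alignment.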