Description
Since October 2021, the Ontario wastewater surveillance initiative has used next-generation sequencing to monitor the composition of SARS-CoV-2 RNA in wastewater samples. The fragmented nature of these data precludes the use of comparative methods that require full-length genomes. We developed a method that represents each sample as a partial vector of mutation frequencies and maps it into a kernel space, allowing us to quantify the temporal and spatial structure of these data.
Nucleic acids extracted from wastewater samples were sequenced on the Illumina platform using the ARTIC SARS-CoV-2 tiled-PCR approach. Raw reads were trimmed using cutadapt and mapped to the reference NC_045512 using minimap2. Mutation frequencies and coverage statistics were extracted from the output stream with Python. We then excluded samples with incomplete metadata, positions with insufficient coverage (<100 reads), and mutations with frequencies <1%. For every pair of samples, we calculated the dot product $D(x,y)$ of their mutation frequency vectors and normalized it by $\sqrt{D(x,x)D(y,y)}$. We used permutation tests to compare the mean normalized $D$ between samples from the same region against the mean for samples from different regions within a given one-month period.
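The normalized dot product and the same-region versus different-region comparison can be illustrated with a minimal Python sketch. It is not the pipeline used in this study. The function names `normalized_kernel` and `permutation_test` are hypothetical, missing (low-coverage) positions are assumed to be encoded as NaN, and restricting each pairwise comparison to positions covered in both samples is one possible way to handle partial vectors, assumed here for illustration. The one-sided test statistic (mean same-region value minus mean different-region value) is likewise an assumption.

```python
import numpy as np


def normalized_kernel(freqs):
    """Normalized dot product D(x,y) / sqrt(D(x,x) D(y,y)) for all sample pairs.

    freqs: (n_samples, n_mutations) array of mutation frequencies.
    NaN marks a position without sufficient coverage (assumed encoding).
    Each pair is compared only over positions covered in both samples
    (an assumption of this sketch).
    """
    freqs = np.asarray(freqs, dtype=float)
    covered = ~np.isnan(freqs)
    x = np.where(covered, freqs, 0.0)   # zero out missing entries
    mask = covered.astype(float)

    dot = x @ x.T                       # D(x, y) over shared positions
    norm_x = (x ** 2) @ mask.T          # D(x, x) over positions covered in y
    norm_y = norm_x.T                   # D(y, y) over positions covered in x
    denom = np.sqrt(norm_x * norm_y)

    with np.errstate(divide="ignore", invalid="ignore"):
        # Pairs with no shared signal get a value of 0
        return np.where(denom > 0, dot / denom, 0.0)


def permutation_test(kernel, regions, n_perm=1000, seed=0):
    """One-sided permutation test on region labels.

    Statistic: mean kernel value for same-region pairs minus the mean
    for different-region pairs. Assumes at least one pair of each kind.
    Returns (observed statistic, permutation p-value).
    """
    regions = np.asarray(regions)
    rng = np.random.default_rng(seed)
    rows, cols = np.triu_indices(len(regions), k=1)  # each pair once
    pair_values = kernel[rows, cols]

    def statistic(labels):
        same = labels[rows] == labels[cols]
        return pair_values[same].mean() - pair_values[~same].mean()

    observed = statistic(regions)
    null = np.array([statistic(rng.permutation(regions)) for _ in range(n_perm)])
    # Add-one correction keeps the p-value strictly positive
    p_value = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return observed, p_value
```

In this sketch, `permutation_test` would be called once per month on the samples collected in that month, with `regions` holding the health region label of each sample.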
In total, we processed 1,619 samples from October 2021 to June 2023 (20 months). The average depth was 8,359 reads, with mean coverage of 24,853 nt. A total of 241,078 mutations were detected in these samples. We restricted our analysis to 17 months with samples from >1 health region. A PCA of the distance matrix revealed substantial temporal structure largely driven by variants of concern. Distances between samples from the same region were significantly shorter for 20 out of 70 region/month-specific permutation tests. These results suggest that spatial differences in the genomic variation of SARS-CoV-2 among wastewater samples can be detected, even at the relatively small scale of a single province.