Ethical considerations
This project was undertaken following approval from the relevant bodies, including Kenya Medical Research Institute (KEMRI) Scientific Review Unit (Approval # KEMRI/RES/7/3/1 and KEMRI/SERU/CGHR/035/3864), Kenya’s National Commission for Science, Technology, and Innovation (License # NACOSTI/P/15/9609/4270 and NACOSTI/P/22/14839), Kenya Wildlife Services (permit # 0004754 and # WRTI-0136–02-22), and National Environment, Management Authority (permit # NEMA/AGR/46/2014 – Registration # 0178 and NEMA/AGR/159/2022 – Registration # 201). Collections of S. mansoni from schoolchildren were approved by KEMRI’s Scientific and Ethics Review Unit (SERU), reference SERU No. 3540, by Institutional Review Board of the Western University of Health Sciences No. FB19/IRB/111 2021, the and by the Institutional Review Board of the University of New Mexico (UNM) No.18115. Informed consent was obtained from the parents of five children that were deemed to be positive for S. mansoni from fecal samples using egg microscopy, of which remaining samples were used for S. mansoni miracidial hatching. All five children were treated with praziquantel during a follow-up visit by a physician.
Snail phenotyping
In total, 329 B. sudanica snails were collected from Anyanga Beach Kanyibok, Kenya (Latitude: −00.08958°, Longitude: 34.08592°) on March 27th 2018. These snails were screened the following day for any patent echinostome infections, of which eight were found and discarded, leaving 321 snails that formed the parental generation. Parent snails were bred in plastic aquaria in an outdoor rearing facility at Kenya Medical Research Institute (KEMRI) in Kisumu. Snails housed in aquaria were allowed to breed and lay eggs for 7–14 days, then the adults moved to a new aquarium so that the eggs can hatch, and juveniles develop. This process was repeated every 7–14 days for approximately two months to reach the goal of 1400 offspring juvenile snails with a shell diameter of 4–6 mm to use in our infection cohort, which was reached in May 2018.
Schistosoma mansoni eggs were obtained from five local school children, aged 6–12 years old that attended the regional primary school, Kanyibok Primary School, Usenge, Kenya (Latitude: −00.08552°, Longitude: 34.07778°), on May 30th 2018, following informed consent from parents. Infected children were given praziquantel (40 mg/kg body weight) treatment during a follow-up visit by a KEMRI physician. Individual fecal samples were screened for S. mansoni eggs using the Kato-Katz method and then five samples with high egg intensity were selected for use in snail challenges that were performed over the course of the next two days, storing fecal samples at 4 °C overnight. The eggs hatch upon exposure to freshwater and light and the resulting free-swimming miracidia were used for snail exposures.
In total, 1400 offspring juvenile snails (4–6 mm) were each exposed to eight freshly-hatched miracidia in the wells of a 24-well tissue culture plates each filled with ~2 ml of bottled water overnight. Exposed snails were maintained in large plastic aquaria (~20 L) in an open-air snail facility under ambient light and fed lettuce. Each snail was checked for infection via cercarial shedding every three to five days from 5- to 9-weeks post exposure. Shedding snails were considered “true positive” and preserved in 100% ethanol. Any snails that did not release S. mansoni cercariae by 9-weeks post-exposure were later subjected to a PCR diagnostic to ensure they were “true negative” (uninfected) as opposed to harboring a prepatent/latent infection (“false negative”)29. Only snails that were cercariae and PCR negative were considered negative and used for GWAS.
Pooled genome-wide association study
Genomic DNA (gDNA) from each snail was extracted from headfoot tissue following a modified CTAB protocol for freshwater snails71. For the positive and negative pools, 10 ng of gDNA from each snail was included in the appropriate pool after quantification by two Qubit DS DNA assays (Invitrogen, CA, USA). Two Nextera libraries were created per pool as technical replicates to help ensure robustness and mitigate potential biases, and each was subjected to paired end 150 bp sequencing across two lanes of the Illumina NovaSeq 6000 S4 flow cell, with the positive and negative snail libraries on one lane and the technical replicate libraries of each on the other.
All command line and custom Perl scripts used in the following bioinformatic analysis are available in the FigShare project repository (https://figshare.com/projects/Genome_Wide_Association_Study_of_Biomphalaria_sudanica/230615). To process the resulting DNA sequence read data from each pool, Nextera Illumina Adapters were cut in cutadapt v3.172 and reads trimmed in trimmomatic v0.30 using options: LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:5073. Schistosoma mansoni reads were then removed from the DNA libraries by aligning data to the S. mansoni v9 genome (GCA_000237925.5)74 using bwa v0.7.17 mem75,76 and removing reads that mapped with CIGAR value > 70 using a custom Perl script (FilterMappedBamCigar.pl, options: -m 70) whilst removing secondary and supplementary reads in samtools v1.19.2 using options: -f 1 -F 2304 -bS77 to retain only reads that are B. sudanica. FASTQ files of the filtered B. sudanica reads were generated using BEDtools function bamtofastq78 then aligned to the B. sudanica genome27 using bwa mem75,76. PCR duplicate reads were marked, and removed using samtools fixmate (options: -m) and samtools markdup (options: -s -f), along with unmapped, secondary and supplementary alignments, and only paired reads were kept (samtools77 options: view -f 1 -F 3332 -bS). Due to large file size, some files had to be partitioned for improved processing, these were merged using samtools merge, before each bam file representing each library was indexed; samtools index.
The allele depth was used to calculate allele counts and therefore frequencies to analyze genotype-phenotype associations between our positive and negative pools of snails. First a Variant Call Format (VCF) file was generated using the bcftools v1.979 mpileup (options: -A -a AD) and call (options -m -Ov) functions. Subsequently, a custom Perl script that extracted information from the allele depth region (AlleleCountsFromGwasVcf.pl, options: -d 20 -q 0.05 -l 0,1 -s 100 -i), was used to make a table of counts of each allele for each sample in the VCF. The options correspond to filtering the variant data by using a minimum allele count of 20, and removing variants if any of the four quality metrics (RPB, MQB, MQSB, and BQB) contained in the VCF were <0.05, or if allele depth was <100. Indels were removed. This resulted in a table of counts of each allele for each sample in the VCF. Technical replicates were compared by calculating the allele frequency difference between positive and negative pools at each variant, and regressing these values from one technical replicate against those from the other technical replicate. Technical replicates were then combined evenly in downstream analysis due to observed similarity.
For each variant in the B. sudanica genome that passed the filtering process, we calculated Fishers exact test p values from 2 × 2 contingency tables representing allele count by phenotype (positive or negative), and the odds ratios for each variant, in R v4.380. This resulted in a table denoting one row per variant with columns providing the allele count in positive and negative pools, odds ratio, Fisher’s and Chi-squared p values. This was then summarized using a custom Perl script (SummarizeGWAS_p2.pl).
B. sudanica contigs/scaffolds were ordered (Fig. 1) based on orthology to B. glabrata iM scaffolds and linkage groups22, following our previous approach27. Namely, we used BLASTp (see BLAST analysis) of all B. glabrata proteins against all B. sudanica proteins, and vice versa, to identify reciprocal best hits. B. sudanica contigs/scaffolds for which a majority of ortholog-matched genes occurred on the same B. glabrata scaffold were assigned a position based on the location of those orthologs.
Validation of variants by amplicon genotyping
To validate the pooled-GWAS outliers, a multiplex amplicon panel was designed using the Genotyping-in-Thousands by sequencing (GT-Seq) method30. The panel contains 470 amplicons in total, including markers for 234 dual-variants i.e., genomic locations with two or more proximate variants (<50 kb) that were strong outliers (Fishers exact test p < 2.5e-9), and 12 singleton-variants (with p < 1e-13 from the pooled-GWAS), and 22 markers for a priori gene candidates based on variants located near candidate immune genes of B. glabrata or exceptionally diverse regions of B. sudanica (panel details given in Supplementary Data 2)27. We applied the panel to the 220 independent validation snails (122 positive, 98 negative) and 276 of the pooled-GWAS snails (138 positive, 138 negative), referred to as genotyped-validation and genotyped-pooled-GWAS snails respectively, to confirm genotypes of snail individuals and validate the differential allele frequencies observed in the pooled-GWAS sequencing data.
In addition to the pooled-GWAS variants and a priori gene markers were an additional 201 ‘neutral’ markers (i.e., with no expected phenotypic association) included in the amplicon panel to facilitate the inclusions of large contigs on a linkage map (see Linkage mapping, below). Also, a Schistosoma 16S rRNA (mitochondrial) amplicon to confirm parasite presence was included in the panel for future applications since a PCR diagnostic29 was used to determine infected snails in this study.
In addition to the GWAS positive and negative B. sudanica genotyped by the amplicon panel, two outbred B. choanomphala snails collected in Lake Victoria using offshore deep-water dredging in 2019 from Ndiara Beach (Latitude:−00.090925°, Longitude:34.085067°) were also included for purposes of investigating population structure and utility of the developed panel on this taxon. We also genotyped seven “false negative” snails (did not shed parasites but PCR positive for infection) but we excluded these from phenotypic analysis due to their ambiguous status.
Amplicon genotyping was conducted by GTseek (Twin Falls, Idaho), which reports read counts of each allele per sample30. Sequencing libraries were prepared using the GT-seq genotyping method30 modified by using Nate’s Plates tagging and Normalization kits (https://gtseek.com/products/). Post sequencing, a custom Perl script was used to count allele specific sequences for each locus from raw sequencing reads (https://github.com/GTseq/GTseek_utils). Amplicon genotypes were inferred from read counts as follows: <20x coverage considered missing, if each allele has >10% the coverage of the other allele the genotype is designated heterozygote, otherwise the genotype is designated homozygote (Fig. S4). We used Fisher’s exact tests at each marker to confirm that genotype missingness was not systematically different between phenotype pools (fewer markers showed p < 0.05 than expected by chance, and their signal between pools was not in a consistent direction).
We assessed population structure in genotyped-validation snails and genotyped-pooled-GWAS snails using ADMIXTURE v1.3.031 and principal component analysis (PCA) in R. All command line, Perl and R scripts used in the following bioinformatic analysis are available in the FigShare project repository (https://figshare.com/projects/Genome_Wide_Association_Study_of_Biomphalaria_sudanica/230615). Specifically, we used the snpgdsIBDKING function from the SNPRelate v1.40.0 package81 and PC-AiR in GENESIS v2.36.082 to identify a set of unrelated individuals, due to the possibility of close relatives in our set of F1 snails. We then ran ADMIXTURE on this unrelated set and projected the remaining samples onto the resulting ancestry inferences with the -P option. Standard errors were inferred by bootstrapping with the -B option. For both PCA and ADMIXTURE, we only used the ‘neutral’ linkage map markers rather than the GWAS outliers or a priori candidate markers. For ADMIXTURE we ran models with K (number of ancestral populations) ranging from 2 to 7 and chose the value of K with the lowest Cross-validation (CV) error in the unrelated samples. We also performed PCA including genotypes from the two outbred B. choanomphala samples, and the genotypes of two previously sequenced B. sudanica inbred line snail genomes27. We did not use all four Illumina-sequenced inbred lines as two were parents in our linkage cross and our markers were chosen to be different between them, causing a strong ascertainment bias. Given a signal of population structure, we accounted for ancestry in our validation analysis using two complementary statistical approaches. For both approaches, we converted genotype markers to two sets of binary variables, alternately designating heterozygotes as 0 or 1, effectively modeling a dominant effect on phenotype. We chose this approach rather than an additive model because deletions and duplications can lead to incorrect inferences about diploid genotypes83, and we avoid this complication by simply asking if an allele is present in at least one copy, or absent. In our first approach, we fit a logistic regression model for each marker in the set of genotyped-validation samples, assuming a dominant effect on phenotype, and including ancestry as a quantitative cofactor (proportion of ancestry from second population). This was calculated in R using the glm function with family = binomial, with phenotype (binary 0 or 1) as the response variable, and ancestry (continuous from 0 to 1) and genotype (binary 0 or 1) as independent variables. In this way a genetic marker will only be significant if it has an effect on the phenotype that isn’t already captured by the ancestry variable in the regression. We designated significance using a Bonferroni threshold of 0.05/2 N, where N is the number of candidate amplicons (i.e., excluding linkage map amplicons) with minor allele frequency >5% and <20% missing genotypes (≥176 samples with non-missing genotypes) in genotyped-validation samples, doubled to account for two tests per marker as either allele could be dominant. Second, to test for effects specific to ancestry groups, we divided genotyped-validation samples into two groups based on ancestry (group A and B), and performed Fisher’s exact tests (in R with the fisher.test function) between genotypes and phenotypes within each ancestry group. To minimize ancestry effects, we required snails in these ancestry groups to show at least 90% ancestry from one ancestral population, and therefore ignored the more admixed samples. We designated a significance threshold as before, again requiring <20% missing genotypes per group (≥37 samples with non-missing genotypes in Group A; ≥28 samples with non-missing genotypes in Group B). The sample size used for Group A and Group B in this analysis is 46 and 34, respectively. To assess whether close family relationships drive these results, we first used snpgdsIBDKING81 to identify snail pairs with estimated KING-robust kinship coefficients. For each marker we then fit a linear mixed model against the infection phenotype using the lmm.diago function from the gaston v1.6 package84, incorporating the eigen decomposition of the kinship matrix with the eigenK argument. As above, we ran two different analyses: one including our previously-defined ancestry estimate as a covariable, and one restricted to just ancestry group A or B. We calculated p values from the z-score obtained as BLUP_beta/sqrt(varbeta).
To find the best-fitting model relating genotype to phenotype from the amplicon panel data, we combined both genotyped-validation and genotyped-pooled-GWAS snail data. We tested all combinations of logistic multiple regression models (base e) encompassing ancestry and all successfully validated loci, considering all combinations of dominance at the genetic markers. We chose the model with the lowest AIC, and for which all individual variables have p < 0.05.
BLAST analysis
BLAST searches (BLASTn, tBLASTn, BLASTp and BLASTx) described in the following sections were all performed using BLAST v2.14.0+ or BLAST v2.15.0+ using default parameters (E-value threshold 10 and scoring matrix of BLOSUM62). The queries consisted of nucleotide or protein sequences in FASTA format, and searches were run against the respective genome or protein databases created using the function ‘makeblastdb’.
Linkage mapping
To generate crossed snails for linkage mapping, two juvenile (<4 mm) B. sudanica, one from each line 163 and KEMRI, which demonstrate opposite patterns of susceptibility to S. mansoni (UNMKenya line)85, were reared in a ~ 1 liter breeding tank. Once egg laying and hatched juveniles were present in the tank, the two parent snails and eight randomly selected juveniles had gDNA extracted using the Qiagen Blood & Tissue kit (Qiagen, MD, USA). Outcrossing was confirmed using a developed PCR and restriction enzyme based diagnostic marker86. Remaining F1 juveniles were moved to a new tank and then reared until eggs and F2 juveniles were present. These F2 juveniles were reared until >3 mm size. F2s were challenged with S. mansoni miracidia and classified as positive or negative for infection using the same methods as for the GWAS snails. gDNA of the Bs163 and BsKEMRI parents and 87 F2 B. sudanica snails were genotyped using the amplicon panel as described above.
Genotype data from parents and F2s were used to create genomic linkage maps with OneMap v3.0.087, using minimum logarithm of the odds (LOD) of 4.5 from the rf_2pts option to unite markers into linkage groups. We included all amplicon markers that segregated in the F2s, whether they were included as GWAS variants or linkage map markers. Amplicon sequences were then matched to the B. glabrata genome assembly (xgBioGlab47.1, NCBI RefSeq GCF_947242115.1) using BLASTn, facilitating synteny comparison between our B. sudanica map and the B. glabrata genome. We did not perform a map-wide analysis for quantitative trait loci due to low statistical power given our samples size. However, for the validated genomic regions from the GWAS, we tested for phenotype-genotype associations in the linkage cross using Fisher’s exact tests.
Characterization of validated genomic regions
Regions of the B. sudanica genome that were significantly associated with S. mansoni resistance and then validated were characterized using the B. sudanica genome annotation27 to determine the function of protein coding genes, and annotate any potentially missing genes from the current B. sudanica annotation. Synteny of regions to the B. glabrata genome assembly (xgBioGlab47.1, NCBI RefSeq GCF_947242115.1) and B. pfeifferi genome assembly UNM_Bpfe_1.0 (GenBank GCA_030265305.1) were assessed using D-GENIES88. tBLASTn searches against the B. sudanica genome were conducted using proteins in the syntenic regions of the B. glabrata genome. Alignments of B. glabrata and B. pfeifferi homologs/orthologs (see Dataset S5 and S7) to B. sudanica sequences were used where possible to determine complete gene sequences in the case that the reference protein sequences were suspected to be truncated. Open reading frames (ORFs) were assessed in the B. sudanica validated genomic regions using the in-built tool in Geneious v2022.0.2 (Biomatters Ltd.) and predicted protein sequences generated that were characterized based on the protein families and functional domains predicted using InterProScan89 and DeepTMHMM90. Neighboring ORFs in the same direction were merged (internal stop codons in ORFs, predicted to be in intronic regions, were removed) to construct complete coding sequences of suspected proteins. Available PacBio and Illumina RNA-seq transcript data for B. sudanica (NCBI accessions: SRX22544968- SRX22544970) were aligned using bwa mem75 to predicted mRNA sequences of these proteins.
BLASTn searches of predicted mRNA sequences generated following manual annotation were performed against the B. sudanica genome to determine any other homologous genes in the genome that may have not been annotated. For the 23,598 genes assigned to have open reading frames in the B. sudanica genome27, these were searched against using key descriptors of genes of interest (i.e., PTPRA and GR101). In addition, InterPro accessions attributed to each gene in the annotated B. sudanica genome were searched against InterPro gene families and functional domains of interest. Specifically, to determine homologous genes to GRL101-like G-protein coupled receptor (GRL101) proteins containing a leucine rich repeat (LRR) region, a low-density lipoprotein receptor class A repeat (LDL) and a C-type lectin-like (CTL) domain, we filtered gene data by those matching InterPro accessions: IPR000276; IPR017452; IPR001304; IPR016187; IPR016186; IPR032675; IPR001611; IPR036055; IPR002172. To determine homologous genes to receptor-like protein-tyrosine-specific phosphatases (RPTP) containing multiple epidermal growth factor (MEGF) and galactose binding domain(s) (GBD), we filtered gene data by those matching InterPro accessions: IPR000242; IPR02902; IPR016130; IPR000387; IPR003595; IPR008979; IPR009030; IPR000742; IPR002049.
We examined read depth in our original pooled-GWAS data at our validated regions, to test if it was unusual relative to the rest of the genome. We calculated mean depth for each phenotype pool in 10 kb windows and normalized to the genome-wide median for that pool (Figs. S15, S18). We also calculated mean depth in 100 kb sliding windows (step size = 10 kb) and plotted this for positive and negative pools (Fig. S19).
PacBio whole genome sequencing and assembly of a resistant Biomphalaria sudanica snail genome
High molecular weight DNA from a single snail included in the pooled-GWAS (Bs2280) that was phenotypically resistant (not shedding cercariae or S. mansoni PCR positive) and contained non-reference (resistant) haplotypes assessed using the amplicon panel at both SudRes1 (contigs c582_65596: TT, contig_6_3490: AT and contig_6844_72816: GT, the latter two not indicating heterozygosity but rather paralogs, since there are never any TT’s) and SudRes2 loci (contig sc94 positions 2174117: AA, 2150435: CC and 2149367: AA) was sequenced on the PacBio platform. Genomic DNA was cleaned, fragmented and library produced following the standard HiFi SMRTbell Library preparation by Novogene (Sacramento, CA). The library was sequenced on a single PacBio Revio using a single SMRT cell, resulting in 13.4 G total bases. The PacBio ccs reads (raw reads available in NCBI BioProject: PRJNA1149315, BioSample: SAMN45084274) were assembled using Flye91 v2.9.4, settings flye –pacbio-hifi -g 1 g -t 16. Basic statistics of the genome assembly were calculated in seqkits v0.16.192.
Following assembly, orthologous contigs to SudRes1 and SudRes2 regions were identified by performing mutual best hit BLASTn searches for orthologous contigs to those in the Bs111 SudRes1 and Sudres2 region, in addition to tBLASTn and BLASTx searches to orthologous protein sequences. Orthologous regions were then aligned using MAFFT93, and orthologous protein coding regions annotated in Geneious v2022.0.2 (Biomatters Ltd.).
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.