Contaminated drinking water facilitates Escherichia coli strain-sharing within households in urban informal settlements

Categories: Disease & Virus

May 2, 2025

Study sites and sample collection

To investigate E. coli strain-sharing between humans, domesticated animals and the environment, 50 poultry-owning households with at least one child under the age of 5 years were enrolled in two subcounties, Dagoretti South (n = 25) and Kibera (n = 25), in Nairobi, Kenya, during June to August 2019 as part of a cross-sectional study. One household was randomly enrolled from each compound that owned poultry. Written informed consent was obtained from each adult participant before stool sampling, and child assent and parental written consent were obtained for each child participant. Participants received a pack of soap as compensation. A household survey was performed along with the sample collection to collect demographic information and household characteristics including water, sanitation, and hygiene access and practices. No statistical methods were used to pre-determine sample sizes, but our sample sizes are similar to those reported in previous publications¹². Data collection and analysis were not performed blind to the conditions of the experiments.

Each household was visited twice for the sample collection. During the first visit, we collected up to three poultry cloacal swabs (depending on the number of poultry present), one canine faecal sample, one household soil and one stored drinking water sample from each household. A trained veterinary student administered the poultry cloacal swabs and placed them in storage tubes filled with Cary–Blair transport medium. To collect canine faeces, the top layer of fresh faeces from the centre of the pile was transferred into a 50 ml centrifuge tube with a sterile plastic scoop. For household soil collection, a 30 cm × 50 cm square area closest to the entrance of the household was marked, and the top layer of the entire surface of the marked area was scraped with a sterile disposable plastic scoop. Approximately 150 g of soil was collected and transferred into a 500 ml sterile Whirlpak bag. The stored water within each household was collected by withdrawing 350 ml of the water from the bottom of the storage container with a sterile serological pipette. The water was then transferred into a 500 ml sterile Whirlpak bag containing 10 mg of sodium thiosulfate. All samples were immediately placed in a cooler filled with ice and transported to the Kenya Medical Research Institute (KEMRI).

To capture transmission of bacterial strains from the environment to humans, approximately 1 week after the initial environmental sampling visit, households were revisited to collect human stool from one household member in the following three age groups: child aged 0–4 years, child aged 5 14 years and adult aged 15 years or older. A stool collection kit was provided during the first visit, which included a 50 ml plastic pot with a sterile scoop for each member with instructions on how to collect the sample. Caretakers were instructed to collect faeces on aluminium foil, then, using sterile gloves and scoops, to transfer the faeces to the plastic pot for collection. The primary caretaker of each household was informed by mobile phone 1 day before the revisit to collect stool from the previous night or the morning of the revisit day. If stool samples were not available on the visit day, households were revisited up to three times to collect the remaining stool samples. After pick-up, the received stool samples were placed in a cooler filled with ice and transported to KEMRI.

The study received ethical approval from the KEMRI Scientific and Ethics Review Unit (12/3823) and the Tufts Health Sciences Institutional Review Board (13205). In addition, a research permit was granted by the Kenyan National Commission for Science, Technology, and Innovation.

PIC-seq

Here we describe our method of pooling presumptive isolated E. coli colonies for sequencing (PIC-seq). E. coli colonies were first cultured from the collected samples within 6 h of sample collection. A sterile swab was inserted into 1 g of each collected human stool and canine faeces sample, and the swab was streaked on Tryptone Bile X-glucuronide (TBX) agar plates. Poultry cloacal swabs were directly streaked on the agar plates. Household soil and stored water samples were membrane filtered and cultured on TBX. Collected soil was first screened through 2 mm pore mesh sieves to remove rocks and leaves, then 5 g was homogenized with 50 ml of distilled water. After allowing soil particles to settle for 1 min, 100 μl of supernatant was diluted with 10 ml of distilled water, vacuumed through a 0.45 μm membrane filter and plated. About 100 ml of stored water was directly filtered, then plated on TBX. All plates were incubated overnight (18–24 h) at 44 °C.

After incubation, up to five presumptive E. coli colonies (blue-green colour) were selected from each agar plate for pooling. The selected colonies from each plate were transferred into a single 2 ml tube containing 1 ml of Luria–Bertani medium. The pooled E. coli colony cultures were incubated overnight (18 h) at 37 °C with shaking at 200 r.p.m. After incubation, 100 μl of the cultures was mixed with 1 ml of 30% glycerol–70% tryptone soy broth solution and stored at −80 °C. The samples were then shipped on dry ice to Tuft University.

Pooled isolates were thawed on ice and inoculated into 5 ml of fresh tryptone soy broth medium with a sterile inoculating loop. The cultures were incubated overnight at 37 °C with 200 r.p.m. shaking, then 1.5 ml of the culture was pelleted by centrifugation at 5,000 × g for 10 min. The DNA was then extracted from pellets using DNeasy PowerSoil Pro Kit (Qiagen) following the protocol provided by the manufacturer. The concentration and purity of the extracted DNA were measured by a Qubit 4.0 Fluorometer (Thermo Fisher Scientific), a sequencing library was prepared using Nextera XT kit (Illumina), and shotgun sequencing was performed on Illumina NovaSeq 6000 (Illumina) with a target sequencing depth of 1.5 Gb at the Broad Institute.

Identification of E. coli strains and calculation of strain-sharing rates

The quality of raw sequence reads was examined using FastQC v0.12.1 (ref. ⁴⁵), and low-quality reads were filtered out using Trimmomatic v0.39 (ref. ⁴⁶) with parameters set to LEADING:3, TRAILING:3, SLIDINGWINDOW:4:15 and MINLEN:70. The subsequent reads were used for the identification and comparison of Escherichia strains using the software StrainGE v1.3.3, the software specifically designed for accurate calling and comparison of strains in the complex metagenome data based on k-mer comparison, bypassing assembly process²¹. The database for StrainGE was constructed using the complete genome sequences (2,496 genomes) of the genus of Escherichia or Shigella downloaded from the National Center for Biotechnology Information (NCBI) RefSeq database on 20 August 2022, by using the built-in database construction pipeline (‘ncbi-genome-download’, ‘kmerize’, ‘kmersim’, ‘cluster’ and ‘createdb’). The reference genomes in the database were clustered if they shared 99% of their k-mer content with a Jaccard similarity over 0.90, and to minimize redundancy, only the representative genome sequences of each cluster were retained, resulting in a final set of 885 genomes.

After strain identification, we used StrainGR, the second module of StrainGE v1.3.3, to call sequence variants among samples²¹. Sequence reads were aligned to the genomes of the identified strains in each sample using bwa mem v0.7.17 with default parameters⁴⁷. For sample pairs with the same reference strain, the average ACNI and gap similarity were calculated based on the callable regions, using ‘call’ and ‘compare’ command in StrainGR²¹. The threshold for defining the same strain was set to >99.95% ACNI.

The strain-sharing rates between sample types were calculated, using the results from StrainGR. Within each household, we normalized the number of sample pairs sharing the same strain by the total number of combinations between the certain sample types. We then computed the average rates based on the rates from all households for comparison. For strain-sharing rates between different households, we normalized the number of sample pairs with the same strain by the total number of possible sample pairs, between all combinations of different households. We then calculated the average rates from these inter-household rates for comparisons between certain types of sample. Stored water samples without E. coli contamination were included in calculating water-related total number of possible pairs.

Phylogeny and phylogroup of detected E. coli strains

To infer the phylogenetic relationships among the strains identified from the samples, we used the reference genome sequences included in the StrainGE database. Prokka v1.14.5 was used to annotate genes, with the genus parameter set to Escherichia⁴⁸. The core genes that exist in 99% of all reference genomes were identified using Roary v3.13.0 (ref. ⁴⁹). The concatenated core gene alignment was generated using MAFFT v7.508, and SNP-sites v2.5.1 (ref. ⁵⁰) was used to extract the alignment regions with single-nucleotide polymorphisms to reduce the size of the alignment. The reduced alignment was uploaded to CIPRES Science Gateway v3.3 (ref. ⁵¹) (www.phylo.org), and a maximum likelihood tree was constructed using RAxML v8.2.12 (ref. ⁵²) with the nucleotide substitution model set to GTRGAMMA with 500 bootstraps. E. coli phylogroup was predicted using ClermonTyping v20.03 (ref. ⁵³) with default parameters. MLST of the StrainGE-identified reference strains were assigned using the software mlst v2.22.1 (https://github.com/tseemann/mlst) based on the ecoli_achtman_4 scheme.

Identification of pathogenic Escherichia and Shigella strains

The identification of pathogenic E. coli strains was inferred based on the presence of pathotype-associated virulence genes detected in the assembled contigs for each sample. These virulence genes were annotated from the contigs using ABRicate v1.0.1, in combination with the Virulence Factor Database (download in May 2023)⁵⁴. The presence of different E. coli pathotypes was based on the detection of the following gene sets: agg, aat and aai for EAEC; eae and bfp for EPEC; elt and est for ETEC; stx for enterohaemorrhagic E. coli; afa and drg for diffusely adherent E. coli; and ipa for enteroinvasive E. coli^55,56,57. The presence of Shigella was additionally investigated based on StrainGST results.

Annotations of ARGs and MGEs

The integrated assembly pipeline, which combines reference-based binning and de novo assembly, was used to generate contigs for ARGs and MGEs annotation (Supplementary Fig. 4). Plasmid sequences were assembled from sequencing reads using metaplasmidSPAdes v3.15.4 (ref. ⁵⁸) with default parameters. Subsequently, the original reads were aligned and binned to the set of the assembled plasmid sequences and the reference genomes of strains identified through StrainGST, using mSWEEP v.2.0.0 (ref. ²⁷) and mGEMs v.1.3.0 (ref. ²⁶) with default parameters. For each reference strain, the reads binned to their genome sequences were individually assembled de novo using metaSPAdes v3.15.4 (ref. ⁵⁹) with default parameters. The sequence reads that did not align to any sequences were separately assembled de novo using metaSPAdes. The gene-coding sequences in the assembled contigs were predicted using Prodigal v2.6.3 (ref. ⁶⁰) with default parameters. The translated amino acid sequences of the predicted gene-coding sequences were annotated using DIAMOND blastp v2.0.14.152 (ref. ⁶¹) against the Comprehensive Antibiotic Resistance Database (CARD) v3.2.4 (ref. ⁶²) with the parameters set to ‘–id 90’, ‘–min-orf 25’ and ‘–query-cover 80’. The annotations with the highest bitscore for each gene-coding sequence were retained for the downstream analysis.

To predict the mobility of ARGs, we examined the presence of any MGEs in the flanking regions of the annotated ARGs. We extracted up to 5,000 base pairs from both upstream and downstream of ARGs on the contigs using a custom bash script. The mobileOG-pl pipeline was then used to annotate MGEs with mobileOG-db release beatrix-1.6 as the reference database⁶³, which is a manually curated database that includes sequences from eight different databases (ICEBerg, ACLAME, GutPhage Database, Prokaryotic viral orthologous groups, COMPASS, NCBI Plasmid RefSeq, immedb and ISfinder), along with homologues of the manually curated sequences. ARGs with identified MGEs in their flanking region were presumed to be mobile, while those without MGEs were categorized as non-mobile. In instances where the contig length was insufficient (<5,000 base pairs) to detect MGEs in the flanking region, the mobility of these ARGs was considered ambiguous.

Preparation of simulated datasets for benchmarking

To benchmark our bioinformatics pipeline, we generated simulated datasets that emulate sequence reads we would expect from PIC-seq on five E. coli isolates. The paired-end Illumina sequence reads (2 × 150 bp) and the complete genome sequences of five clinically relevant antibiotic-resistant E. coli isolates, possessing either bla_CTX-M-15 or mcr-1 genes along with other ARGs, were provided by the Broad Institute. We characterized each isolate based on their complete genome sequences. The phylogroup of each E. coli isolate was predicted using ClermonTyping v20.03 (ref. ⁵³) with default parameters. MOB-suite v3.1.0 was used to identify any plasmid contents included in the genome files⁶⁴. The gene coding sequences within genomes were predicted using Prodigal v2.6.3 (ref. ⁶⁰) and were annotated using DIAMOND blastp v2.0.14.152 (ref. ⁶¹) against the CARD v3.2.4 with parameters set to minimum identity of 90%, minimum query cover 80% and minimum amino acid length of 25. The proximity of MGEs to genes annotated as ARGs was examined in a 5,000 bp region both upstream and downstream of the ARGs, using the mobileOG-pl pipeline and the mobileOG-db release beatrix-1.6 as the reference database⁶³.

To generate simulated datasets, we subsampled the sequence reads from each E. coli isolate and pooled with 6 different proportions, ranging from equal ratios to 400-fold difference in sequence coverage, targeting a total sequencing depth of 2 Gb using seqtk v1.3 (Supplementary Table 7). After pooling sequences, the quality of the reads was screened using Trimmomatic v0.39 (ref. ⁴⁶) with parameters set to LEADING: 3, TRAILING: 3, SLIDINGWINDOW: 4:15, and MINLEN: 70. The downstream analyses were performed subsequently based on these datasets.

Comparison of resistomes reconstructed from two different assembly strategies

To evaluate the efficacy of using a reference-based binning approach before de novo assembly in reconstructing resistomes, we compared resistome profiles generated from assemblies with and without the binning process. (1) For assemblies that used the binning approach, the procedure followed was identical to the approach used for real samples. The plasmid-like contigs were first assembled out of initial data, and the initial reads were aligned to these plasmid contigs and the reference genomes of strains predicted by StrainGE²¹ using mSWEEP v2.0.0 (ref. ²⁷) and mGEMs v1.3.1 (ref. ²⁶) with default parameters. The reads binned to each reference genome were individually de novo assembled using metaSPAdes v3.15.4 (ref. ⁵⁹). The unaligned reads were separately de novo assembled using metaSPAdes. (2) The assembly without binning process was simply performed by running metaSPAdes on the initial simulated dataset with default parameters.

For all contigs assembled from two different approaches, the coding gene sequences were predicted using Prodigal v2.6.3 (ref. ⁶⁰), and the genes were annotated by running DIAMOND blastp v2.0.14.152 (ref. ⁶¹) against the CARD with the same parameter used for the real samples. The co-localization of MGEs was investigated by annotating coding genes located within 5,000 bp both upstream and downstream of identified ARGs. ARGs that had MGEs in their flanking regions were considered to have the potential for mobility. ARGs with no MGEs in their flanking regions, but whose flanking region were shorter than 5,000 base pairs due to contig size limitations, were categorized as having ambiguous mobility. The coding regions of all ARGs were clustered at 100% nucleotide identity together with ARGs from reference genomes for the comparison of resistomes.

Assembly using short or long sequence reads of simulated datasets

To examine the efficacy of using long read sequence data for investigating ARG sharing dynamics, we compared the assembly metrics of contigs assembled using short or long sequence reads. The long sequence reads of the matched five E. coli isolates were generated using Oxford Nanopore Technology MinION Platform and provided by Broad, along with the paired-end Illumina reads. We generated the simulated PIC-seq E. coli dataset for long reads using the methods described for the short reads.

In addition to the contigs assembled using only short reads, we used two additional approaches: (1) assembling using short reads first, then correcting with long reads and (2) first assembling long reads and then polishing the resulting contigs with the short reads mapped onto them. Binning of sequence reads to the reference genomes identified by StrainGE was performed before assembly, as described for the short reads. To assemble contigs starting with short reads and then correct the resulting contigs with long reads, we used HybridSPAdes v3.15.4 (ref. ⁶⁵) with default parameters. For the assembly of contigs beginning with long reads, we used metaFlye v.2.8.1 (ref. ⁶⁶) with default parameters, followed by medaka v1.7.0 for error correction in the contigs, using the same long reads as input. Short reads were then mapped on to the contigs using bbmap⁶⁷, and the resulting alignment files were used to polish errors in the contigs using pilon v1.24 (ref. ⁶⁸). The assembly metrics of the contigs were estimated by using MetaQUAST v5.2.0 (ref. ⁶⁹), based on reference genomes provided as the complete genomes of E. coli isolates used for simulated datasets.

Statistical analysis

All statistical analyses in this study were carried out using R version 4.2.1 (ref. ⁷⁰). All collected samples were included in the analyses, with no exclusions. The normality of the data distribution was tested before applying appropriate statistical tests. The differences in mean strain-sharing rates were compared using a two-sided nonparametric permutation test, with 1,000 iterations of resampling. Resistome similarity was estimated by Jaccard similarity of ARG cluster compositions, defined as the number of unique ARG clusters shared divided by the total number of unique ARG clusters from both samples. The pairwise Mann–Whitney U test was used for comparing resistome similarities between sample types, using the ‘wilcox.test’ command from the ‘stats’ package in R⁷⁰. All P values were adjusted accordingly using Benjamini–Hochberg correction. Finally, the correlation analyses among strain-sharing and resistome similarity patterns were performed using Spearman’s rank correlation method, as implemented in the ‘cor.test’ function from the ‘stats‘ package in R⁷⁰.