All code used for the methods described below can be found at https://github.com/bastiaanvdroest/TB_SNP_cutoff_assessment.
Data
Our data contains 2008 whole genome sequences from Mtbc culture-positive cases sampled by the National Institute for Public Health and the Environment (RIVM) in the Netherlands between December 2015 and December 2019 with only strains typed as M. tuberculosis sensu stricto (lineages 1-4) or M. africanum (lineages 5 and 6). Isolates from the Netherlands were sequenced at the RIVM Bilthoven as follows: genomic DNA was extracted from positive MGIT culture tubes (Becton Dickinson, NJ, USA) using the QIAamp DNA mini kit (QIAGEN GmbH, Hilden, Germany) and sequencing was performed on Illumina HiSeq2500 with 2 x 125bp reads following Nextera XT library preparation. Sequences are stored under BioProjects PRJNA882748, PRJNA1079737, PRJEB32037 and PRJEB25592, and SRA accessions found in Supplementary Table 2. Reads were first trimmed using fastp v0.20.1.28 and aligned to the reference genome H37Rv (accession NC_000962.3) using BWA mem v0.7.17 (parameters: default)29. Duplicate reads were removed with Picard v2.25.1 (http://broadinstitute.github.io/picard/). SNPs were called using Pilon v1.24 (parameters: default) requiring confidence base calls (pass filter)30. To analyze the genetic variation in our sample we created a genotypes matrix based on the variant call format files. We excluded all sites that had an Empirical Base-level Recall (EBR) score of<0.931, and all sites located in mobile genetic elements (e.g., transposases, integrases, phages, or insertion sequences). We also dropped SNP sites that are missing in \(>10\%\) of our strains. We cannot infer any information on these sites, as they are missing because of real deletions or low sequence quality. If a base call at a specific reference position for an isolate did not meet the filter criteria that allele was coded as missing. Our filtered genotypes matrix had dimensions of 2,008×325275 SNPs representing 325275 SNP sites across 2,008 strains. The phylogenetic lineage was determined using a refined lineage-calling scheme based on 96-SNPs with fast-lineage-caller32. For all isolates, the phylogenetic lineage type was known (Table 3). Besides sequences, we had clinical and demographic data, comorbidity information, and data on social risk factors for TB for all cases33. Gröschel et al.33 described data of close contacts. However, we could not use these data, because close contacts were mostly latent TB cases, of which no sequence was available, and we did not have information to link active TB cases via contact. Furthermore, we used both pulmonary TB and extra pulmonary TB (EPTB) cases in our study, although cases with EPTB have a significant lower infectiousness. EPTB are unlikely to be infectors but could be present in transmission events.
Assessment pipeline
The reconstruction of transmission events and assessment of the SNP cut-offs involves multiple steps (Figure 1). First, we split the set of isolates into genetic clusters with a genetic distance between clusters of more than 20 SNPs, so that recent transmission can be ruled out for isolates in different clusters34. Next, we used the sequences to infer phylogenetic trees of Mtbc lineages, to obtain mutation rates of these lineages. Then, we inferred for each genetic cluster potential transmission events. Finally, we use the inferred transmission events as a reference for the assessment of the SNP cut-offs.
Flow diagram of the SNP cut-off assessment pipeline. Sequences, and their sampling times, are used to infer phylogenetic trees per Mtbc lineage and to estimate mutation rates for these lineages with BEAST. Independently of the phylogenetic tree inference, the sequences are grouped in genetic clusters based on a SNP cut-off. These clusters are used, together with the sampling times of sequences, and the mutation rate of the corresponding Mtbc lineage, to infer transmission trees with phybreak. Finally, inferred transmission events and unconnected cases serve as a reference to assess the performance of SNP cut-offs in determining transmission events.
Cluster formation
We used the R package adegenet35 to perform genetic transitive clustering with a distance cut-off of 20 SNPs. This ensures that each sequence in a cluster of size two or larger has a SNP distance of 20 or less to at least one other sequence in the same cluster. A distance of 20 SNPs is assumed to be large enough to ensure recent transmission events between clusters can be ruled out. Clusters of size 2 are dropped from the analysis because they contain too little information for reliable inference. Once genetic clusters were defined, SNPs that were found in subsets of isolates in different clusters were removed. These are believed to be due to miscalling of nucleotides.
Phylogenetic tree inference
Since the genetic clusters are too small to estimate a reliable mutation rate, we estimated the mutation rate from sequences from all isolates with the same Mtbc lineage in a phylogenetic analysis with BEAST 1.736. We used the GTR mutation model with a relaxed clock23,37 combined with a Skyride demographic model38. We ran two MCMC chains of length 100,000,000 and derived the mean mutation rate and standard deviation over all branches. The mean mutation rate per lineage was used as a fixed parameter in the transmission tree inference.
TreeAnnotater, included in the BEAST software package36, was used to get the consensus phylogenetic tree to find the posterior support of the genetic clusters made by the 20 SNP cut-off in the phylogenetic tree. If the posterior support of the most recent common ancestor of a genetic cluster is 1, we conclude that this cluster is complete, i.e. the cluster is fully distinctive from all other isolates.
Transmission tree inference
For the inference of transmission, we used the phybreak package24 in the R programming language39, which contains a Bayesian inference method to reconstruct transmission trees. Using genetic sequences and their sampling times, phybreak infers transmission chains by simultaneously sampling phylogenetic and transmission trees from the posterior distribution. In particular, phybreak allows for multiple transmission trees in a genetic cluster representing multiple index cases coming from an unsampled source population. In phybreak, a complete bottleneck at transmission is assumed, which is in line with observation of genetic bottlenecks at transmission in case of Mtbc33. On the other hand, the model does allow for within-host diversity, resulting genetic variation within and between hosts as observed in experiments40. In a sensitivity analysis we used the package TransPhylo22, which is also a Bayesian inference method for transmission trees. TransPhylo uses a phylogenetic tree as input on which it places transmission events including potential unobserved cases between the observed cases. The input phylogenetic trees are inferred from sequences and sampling times in BEAST.
We ran the transmission tree inference with phybreak on each cluster defined above, using the prior distributions for the generation and sampling time interval shown as in23 (Table 1), assuming that all cases other than the index cases developed active TB soon after infection, without a phase of latent TB. Furthermore, we chose the substitution rate to be constant at the mean substitution rate of the cluster’s lineage, estimated in the phylogenetic analysis with BEAST: 0.355 and 0.3 SNPs/genome/year for lineages 3 and 4, which corresponds to a substitution rate of 0.2-0.5 SNPs per genome per year found in literature41. We ran for each cluster two MCMC chains of 10,000 cycles and combine the results of both chains. The time of the transmission cluster inference strongly depends on the genetic cluster size, but was less then 10 minutes on a Apple M1 chip for the largest cluster of 23 isolates and two MCMC chains of 10,000 cycles each. The sensitivity analyses with TransPhylo were done with the settings as in23 (Table 1).
We decided not to estimate the mean generation time and mean sampling time from the data because most clusters were small and did therefore not contain sufficient information for estimation, and to ensure that the means were the same across clusters. Instead, we carried out sensitivity analyses for the means. We varied means for the generation time, i.e. 1, 1.3, and 2 years, and the sampling time, i.e. 0.3, 0.8, and 1.3 years.
Assessment of SNP cut-offs
We assessed SNP cut-offs from 0 to 20 to either rule out recent transmission or identify transmission clusters. The SNP distance between two isolates is calculated using all SNPs in the dataset, including the non-cluster-unique SNPs, to make a better comparison with previous SNP cut-offs. We used three definitions of pairs being linked by recent transmission (Figure 2): direct linkage, i.e. one having infected the other (in red); infector linkage, i.e. both cases having the same infector (in green); transmission tree linkage, i.e. both cases being part of the same transmission tree (in blue). We reserve the words linked and linkage to describe the relation of a pair in the inferred transmission trees. Two isolates are truly identified as having a transmission event between them, i.e. True Positive, if the event is inferred and the distance between the isolates is lower than or equal to the SNP cut-off (Table 2). We calculated the sensitivity, i.e. the fraction of isolate pairs with a distance below or equal to the SNP cut-off among all linked pairs: \(\frac{\text {True Positives}}{\text {True Positives} + \text {False Negatives}}\). In addition, we calculated the precision, i.e. the fraction of pairs with an inferred transmission event among all pairs with a distance below or equal to the SNP cut-off: \(\frac{\text {True Positives}}{\text {True Positives} + \text {False Positives}}\).
Three possible types of linkage between a pair of isolates. The linkage types are given between two cases (colored dots) in a transmission tree, with the transmission direction given by arrows. In red; direct linkage, i.e., infector-infectee pair, in green; infector linkage, i.e., linkage via the same infector, and in blue; transmission tree linkage by the same transmission tree.
To account for the uncertainty in the transmission trees, we built the confusion matrix with 1000 samples from the phybreak posterior distributions of trees instead of a single consensus tree. As a result, the linkage relations between all pairs were appropriately weighed by its support in the posterior. Confidence intervals for sensitivity and precision were obtained by bootstrapping the phybreak posterior distributions of trees and thus creating 1000 bootstrap samples of the confusion matrix.
A sensitivity analysis on the effect of transmission inference with phybreak on the sensitivity and precision scores was done by inferring transmission events with TransPhylo.