NGS-based Aspergillus detection in plasma and lung lavage of children with invasive pulmonary aspergillosis

Aim of this study

The aim of this study is optimization of the microbial NGS workflow tailored to Aspergillus detection in liquid biopsy samples. This involves shotgun sequencing of liquid biopsy samples, such as blood or BAL fluid, with the objective of sequencing intact microbial wcDNA or small DNA molecules from degraded microbes (i.e., cell-free DNA, cfDNA). Following microbial NGS, taxonomic identification of the DNA source is performed by tracing the origin of sequenced DNA molecules. This process includes human read subtraction, taxonomic classification, and subsequent statistical analysis to determine if there is supporting evidence for a pathogenic microbe. In this study, we refined both the wet-lab and computational workflow (cfSPI) by optimizing six key steps (Q#1-6) in microbial NGS-based fungal diagnostics, as detailed in Fig. 1.

IPA patients

Diagnostic work-up of IMD suspected patients includes clinical microbiology

At the Princess Máxima Center for Pediatric Oncology (PMC) in Utrecht, The Netherlands, patients suspected of IMD are evaluated by a chest high-resolution computerized tomography (HRCT) scan. If suspected lesions for IMD are seen on HRCT, a BAL is performed. The BAL fluid provides material for microscopy, fungal culture, galactomannan (GM) assay (Platelia Aspergillus Ag, Bio-Rad), and molecular diagnostics. A BAL GM > 1.0 is considered to be positive according to the criteria of the European Organization for Research and Treatment of Cancer and Mycoses Study Group (EORTC/MSG) criteria9.

Retrospective proof-of-principle: inclusion of IPA patient samples

We included plasma and BAL fluid from immunocompromised pediatric patients who were diagnosed with an IPA at the PMC between 2020 and 2023. We searched manually for cases with probable or proven IPA according to the EORTC/MSG criteria9 (Table 1).

The date of diagnosis was based on the date of BAL retrieval with a positive microbiological finding. One BAL fluid sample and one blood plasma sample (time-matching the BAL fluid sample or at least one or two days around the date of diagnosis) were used from each case, as well as a plasma sample collected approximately fourteen days prior to the date of diagnosis (Supplementary Fig. 11). These plasma samples collected earlier in time served as internal control samples. We also included BAL fluid and plasma samples from pediatric immunocompromised patients without IPA, so called external control samples. BAL fluid control samples were used from patients pre-HSCT, during an anesthetic procedure for line insertion32. All materials used for this study were clinical samples routinely stored at -70°C. Before freezing, the plasma samples were prepared by centrifuging EDTA plasma to remove the cells. BAL fluids were stored directly after use without prior centrifugation. Our study sets a high standard by incorporating both internal and external control samples, surpassing other studies that solely reported sequencing results from suspected IPA cases.

In total, this study encompassed 25 pediatric patients, including 7 diagnosed with probable IPA and 18 external immunocompromised controls (9 plasma samples and 9 BAL fluids). All included patients provided written informed consent for participation in the biobank for storage and use of their rest materials (International Clinical Trials Registry Platform: NL7744; https://onderzoekmetmensen.nl/en/trial/21619). For use of samples and data in this study we refer to local Biobank and Data Access Committee approval (approval number PMCLAB2022.364). All patients have also provided informed consent for sequencing and use of these data for publication. This study was conducted in compliance with the principles of the Declaration of Helsinki.

Diagnostic sensitivity and specificity

One of the aims of this pilot study was to examine the diagnostic sensitivity of microbial NGS-based detection of Aspergillus DNA in immunocompromised pediatric patients. The diagnostic sensitivity was gauged by the proportion of probable IPA patients displaying significantly heightened levels of at least one Aspergillus species, or at the Aspergillus genus level, in at least one liquid biopsy sample (blood or BAL fluid) in comparison to external controls.

The specificity of microbial NGS was determined by the proportion of external control patient samples that displayed significantly elevated levels of at least one Aspergillus species or at the Aspergillus genus level, in at least one liquid biopsy sample (blood or BAL fluid) in comparison to all other external controls.

Wet-lab

Sample preparation and DNA isolation

Plasma was obtained from EDTA blood samples by centrifugation, 10 min at 1500 g, followed by an additional centrifugation step for 5 min at 12,000 g, to remove all cells (both centrifugation steps at room temperature). Plasma was subsequently stored at -80 °C. Fresh BAL fluid samples were stored at -80 °C until further processing. To separate the (whole) cells from the BAL fluid the samples were centrifuged for 5 min at 13,000 rpm at 4 °C. Subsequently, the pellet and supernatant were processed in separate DNA isolation methods for wcDNA sequencing and cfDNA sequencing, respectively.

cfDNA nucleic acids were isolated from BAL supernatant, EDTA plasma and sterile Dulbecco’s Phosphate Buffered Saline (DPBS), using the Circulating Nucleic Acid Kit (Qiagen, 55114) with the following modifications to the manufacturer’s protocol: A select set of our BAL fluid and plasma samples were complemented with DPBS, up to a total volume of 2 mL. Furthermore, the lysis time was increased from 30 to 60 minutes and the final elution of cfDNA was done with Nuclease Free water (Invitrogen, 10977-035).

DNA from 5 (out of 13) BAL pellets was extracted from whole cells by mechanical bead beating (see Supplementary Data 2, “internal” DNA isolation) where 50 µL DPBS and a 5 mm stainless steel-bead (Qiagen, 69989) were added to the BAL pellet. Samples were beaten for 2 min (with a frequency of 2 5 1/s) on the TissueLyser II (Qiagen) and subsequently diluted in ATL buffer (Qiagen, 939011) and transferred to a fresh tube. Additional ATL buffer was added to a total volume of 300 µL, with an overnight incubation at 56°C after addition of Proteinase K (Qiagen, 19131). wcDNA isolation was performed using the DNeasy Blood & Tissue Kit (Qiagen, 69506), following manufacturers protocol adjusting subsequent volumes to accommodate the larger lysis volume.

From the remaining 8 of our 13 BAL pellet samples (see Supplementary Data 2, external DNA isolation) wcDNA was isolated according to a standard protocol for fungal DNA isolation prior to RT PCR (UMC Utrecht, Dept. Medical Microbiology, The Netherlands). In short, BAL pellets were bead beaten and snap lysed (by freezing them at -80 °C and heating them to 96 °C). After addition of a lysis buffer, nucleic acids were isolated using the MagNA Pure system (Roche).

All DNA samples were quantified using the Qubit dsDNA High Sensitivity assay kit or Broad Range assay kit (Thermofisher Scientific, Q32854 and Q32853, respectively). The DNA fragment length distribution and concentration of the cfDNA was evaluated using the TapeStation 2200, D1000HS kit (Agilent, 5067-5585).

Preparation of next-generation sequencing libraries

For ss-ligation based DNA-capture the SRSLY PicoPlus NGS Library Prep Kit for Illumina (Claret Bioscience, CBS-K250B-96) was used, according to the manufacturer’s Moderate Fragment Retention version of the protocol, using max 5 ng cfDNA as input and 11 PCR cycles. For ds-ligation-based cf/wcDNA-capture the KAPA Library Preparation Kit (Roche) was used, with a maximum of 50 ng of input DNA and 8-12 PCR cycles. For ds-capture of the BAL pellet DNA, the manufacturer’s protocol was followed. For ds-capture of cfDNA the following changes to the manufacturer’s protocol have been made: no fragmentation step and following similar bead clean up steps as the moderate fragment retention protocol to improve the yield of small fragments. All library preparations were quantified using the Qubit dsDNA High Sensitivity Assay Kit (Thermofisher Scientific, 32854) and size distribution was analyzed using the TapeStation 2200, either the D1000 and/or D5000 kits (Agilent, 5067-5583/5067-5589). Some samples showed an aberrant size distribution (substantial fraction of >700 bp fragments); these samples were subjected to a bead-based size selection protocol (see Supplementary Data 2), to remove long fragments as Illumina sequencing is optimized for fragments <700 bp. Concentration (see Supplementary Data 2, total DNA yield in ng) and size were re-evaluated after size selection. Samples were pooled equimolarly and submitted for sequencing.

Next-generation sequencing

Sequencing of all 60 Illumina libraries was conducted on the NovaSeq 6000, 2×150 bp reads, resulting in 42 to 218 million reads per ss-cfDNA sample, 37 to 126 million reads per ds-cfDNA patient sample, and 34 to 62 million reads per wcDNA patient sample (see Supplementary Data 2).

Read simulations

In order to evaluate the impact of kraken2’s reference hash-table database composition and confidence threshold on the Aspergillus classification accuracy, we simulated cfDNA-like datasets from 87 complete, scaffold, or draft genomes derived from the NCBI RefSeq, among which 55 Aspergillus, 7 Penicillium, and 25 other pathogenic fungal genomes with a Illumina read simulator tool named ReSeq33. To create a realistic error profile of sequencing reads, the unprocessed ss-cfDNA sequencing reads from plasma of patient A01 (i.e., A01Pasp, for details on sample workup see Supplementary Data 2) sequenced with Illumina Novaseq 6000 2×150 bp were mapped to the human GRCh38.p14 reference genome using Bowtie2 (run with option “-X 2000”) alignment software34. Subsequently, the reference sequence statistics were determined using ReSeq (option “–statsOnly”), and 151 bp paired-end reads were simulated in-silico using the illuminaPE ReSeq command (option “–noBias”)33. For each genome we simulated between 99,050 and 101,023 reads; for details on simulated datasets and the number of reads per dataset see Supplementary Data 1.

Database construction

For the purpose of fungal nucleic acid detection, we utilized 9 kraken2 classification databases, namely uR.7, uR.7 w/o CHM13v2, cRE.21, dRE.21, uRE.21, uRE.31, dRE.31, dREM.258 and dREM.260 of which details on the construction and composition are reported in Supplementary Data 5.

Prior to hash-table construction, specific genomic regions of the reference genomes were masked to prevent spurious misclassifications. This masking procedure involved dustmasking, where low-complexity regions were masked using the DUST algorithm21 as advised by the developers of kraken2, more rigorous decontamination efforts thereby masking contaminant organism sequences, or a combination of dustmasking and decontamination (i.e., full cleanup), such as in the work of Lu and Salzberg35. Contaminant organisms’ sequences are sequences within genome assemblies that do not accurately represent the organism’s genetic information.

Database names indicate the masking procedure used (unaltered, dustmasked or cleaned), the database sources (RefSeq, EuPathDB and/or MycoCosm) and the number of Aspergillus species included (Supplementary Fig. 2). For example, cRE.21 refers to the cleaned version of a combination of RefSeq and EuPathDB with a total of 21 Aspergillus species. Each database was built using the NCBI taxonomic information, which was downloaded on 05-05-2023.

The NCBI RefSeq22 is a comprehensive and curated collection of nucleotide sequences, encompassing a wide range of species, which — in the context of kraken2 classification — are often used as a standard reference hash-table database construction. Using the “kraken2-build –download-taxonomy command, genomes of the NCBI RefSeq were downloaded on 15-05-2023 (RefSeq release 218, file creation date 05-05-2023), including 1,495 archaea, 285,827 bacteria, 498 fungi, 98 protozoa, 14,979 viral and 1 human sequence plus 3,137 contigs which were part of the UniVec_Core. UniVec_Core comprises oligonucleotide and vector sequences sourced from bacteria, phage, yeast, and synthetic constructs, excluding vector sequences of mammalian origin.

The EuPathDB encompasses a curated set of genomic sequences of 386 pathogenic fungi, protists, oomycetes as well as evolutionarily related non-pathogenic species26. EuPathDB genomic sequences were downloaded on 16-05-2023 (file creation date was 28-10-2020), consisting of the following subsets: AmoebaDB (n = 30), CryptoDB (n = 18), FungiDB (n = 164), GiardiaDB (n = 10), MicrosporidiaDB (n = 35), PiroplasmaDB (n = 10), PlasmoDB (n = 45), ToxoDB (n = 33), TrichDB (n = 1) and TriTrypDB (n = 42). Upon inspection post-download, we observed that two fungal genomes were omitted from the purified edition of FungiDB-46. Jennifer Lu, one of the authors of the contaminant removal paper35, kindly supplied us with the latest version of these two genomes:

FungiDB-54_PgraminisCRL75-36-700-3_Genome_cleaned_v_final.fna and FungiDB-54_Ptriticina1-1BBBDRace1_Genome_cleaned_v_final.fna.

An updated version of the seqid2taxid.map used for database construction was also provided by Jennifer Lu. In total, EuPathDB version 46 includes 27 Aspergillus genomes, representing 21 Aspergillus species, while version 64 contains 38 genomes, representing 31 Aspergillus species.

The MycoCosm27 is a web-based resource and information portal developed by the Joint Genome Institute (JGI) for fungal genomics. It provides access to a comprehensive collection of fungal genomes, associated functional annotations, and tools for comparative analysis. We acquired all 763 genomic sequences of Aspergillus assembly scaffolds from this resource, along with their corresponding taxonomic annotations.

Host read subtraction

Our objective was to eliminate potential false positives microbial reads originating from incomplete host read subtraction. To achieve this, we employed mapping to either the GRCh38.p14, CHM13v2, or a dual-mapping utilizing a consolidated reference index for Bowtie2 alignment, encompassing both GRCh38.p14 and CHM13v2, thereby avoiding a two-step mapping process.

Sequencing and synthetic data processing: the cfSPI-pipeline

Illumina sequencing and synthetic data were processed using the Snakemake36 cfSPI-pipeline available in our Github repository (https://github.com/AEWesdorp/cfSPI/cfspi/). In short, duplicates were removed (using nubeam37), after which high-quality sequencing data was generated (using fastp38) by default removal of low quality reads and usage of a low complexity filter as well as by adapter removal and removal of short (<35 bp) reads (using AdapterRemoval39). After subtraction of host sequences by mapping to the human reference genome using Bowtie234, the remaining paired-end reads were taxonomically classified using kraken220 with the 9 kraken2 databases specified above, employing a confidence threshold (CT) ranging from 0.0 (no filter on the fraction of k-mers matching, default setting) to 1.0 (100% of k-mers within the read match a taxa, very stringent setting), increments of 0.1.

Remapping of classified reads

Reads classified as Aspergillus at the species level with the cRE.21 database, along with all lower-ranking taxa within the same clade, were aligned to the respective Aspergillus species genomic sequences used for the cRE.21 database construction, through Bowtie2.

Relative abundance per taxon

The relative abundance per taxon within each sample was quantified as Reads Per Million (RPM), a normalized measure that accounts for i.e., differences in sequencing depth. RPM was calculated according to the following formula:

$${RPM}={totalSumTaxonReads}/{totalNumQCReads}* {1,000,000}$$

(1)

totalSumTaxonReads corresponds to the number of reads classified at each taxon (e.g., genus- or species level) and all lower-ranking taxons belonging to the same clade. totalNumQCReads is the count of reads that passed quality control (i.e., number of reads remained after duplicate removal, low-quality and low-complexity reads filtering, and adapter removal; see Supplementary Fig. 1a).

Limits of Significant Detection

In order to explore the relationship between fungal background levels in immunocompromised individuals, the observed classification rate in simulated datasets, and the resulting theoretical ‘Limits of Significant Detection’, we formulated the following methodology. First, we defined our taxon, database, and CT of interest. Second, we set the theoretical sequencing depth (tSD), ranging from 10 to 100 million reads (with intervals of 15 million) as well as a theoretical number of molecules per million (mpm) ranging from 0.25 to 4096. Third, we obtained the median background abundance (ba) observed in our external immunocompromised samples for the specified taxon, database, and CT of interest (normalized, RPM). Fourth, we determined the classification rate (cr) observed in our simulated datasets for the specified taxon, database, and CT of interest. Subsequently, we applied the following calculation to determine the total taxon read count in our artificial sample, rounded to the nearest integer:

$${total\; taxon\; count}=\left\lfloor ({tSD}* {ba})+({tSD}* {mpm}* {cr})+0.5\right\rfloor$$

(2)

Following this, we applied a one-tailed Fisher’s exact test to assess whether the observed total taxon count in our theoretical samples significantly differed from that in our external control samples. A mean p ≤ 0.001 was deemed statistically significant. The lowest mpm value with a p-value ≤ 0.001 was reported as the MPM.

Identification of elevated Aspergillus levels

Following taxonomic classification of all clinical samples, we conducted a one-tailed Fisher’s exact test to assess statistical differences in the read count of a specified taxon between our patient samples and internal/external control samples. This test was based on comparing the following two counts in the contingency tables:

  1. 1.

    The number of reads classified at the taxon of interest, including reads at the specified taxonomic level (e.g., genus- or species level) and all lower-ranking taxa within the same clade.

  2. 2.

    The number of reads remaining after duplicate removal, low-quality, and low-complexity read filtering, excluding those classified at the taxon of interest.

The significance level was set at p ≤ 0.001, calculated by deriving the mean of all Fisher’s exact tests conducted across the samples. This analysis aimed to identify meaningful differences in taxon-specific read counts between patient and control groups, considering the overall composition and diversity of microbial taxa in the studied samples.

Statistical analysis

To evaluate the influence of computational choices, including host-read subtraction and database composition, we utilized the one-tailed paired t-test. For comparisons involving sample types and library preparation, we applied the one-tailed Wilcoxon rank-test. To account for multiple testing, we employed Bonferroni correction in these analyses.

Software

Data and statistical analyses were conducted in R (v.4.2.0). Figures were generated in R (v.4.2.0), and illustrations created using BioRender and Adobe Illustrator (2024, v28.6).

Source link

Get RawNews Daily

Stay informed with our RawNews daily newsletter email

Differential efficacy of first licensed western vaccines protecting without immunopathogenesis Wuhan-1-challenged hamsters from severe COVID-19

IL-1 protects from fatal systemic candidiasis in mice by inhibiting oxidative phosphorylation and hypoxia

NGS-based Aspergillus detection in plasma and lung lavage of children with invasive pulmonary aspergillosis

Ubiquitination of gasdermin D N-terminal domain directs its membrane translocation and pore formation during pyroptosis