sWGA was optimized for amount of input DNA, dNTP composition, debranching, and reaction temperature
Initial optimizations were conducted using the BL720 cell line (27 EBV copies/cell) and subsequent mixtures after dilution to create viral dilutions with EBV-negative human DNA representing 27, 13.5, 6.75, 3.375, and 1 EBV copies / human cell (Supplementary Table 1). In these optimizations, we first sought to optimize the sWGA reaction. To maximize the relative yield of EBV to human DNA, various aspects of the selective amplification pipeline were tested using BL720 cell line DNA (Table 1). Stock sample as well as all serial dilutions with human control DNA (13.5, 6.75, 3.375, and 1 copy/cell) were confirmed by EBV qPCR (Supplementary Table 1). For sWGA, the highest percentage of EBV-mapping reads for all BL720 dilutions was obtained with an input DNA volume of 12.5 ng (Fig. 1, Supplementary Table 3). More than 12.5 ng of input DNA (100, 50, and 25 ng) had a slightly decreased yield, while less than 12.5 (6.25 and 1 ng) did not reach the enzyme’s maximum capacity for amplification. Additionally, since EBV has a high GC content (59.5%),18 the relative concentrations of dNTPs were explored. Similar to prior findings14, the percentage of reads mapping to EBV was maximized with greater ratio of dGTP and dCTP. While 30/30/5/5 and 30/30/10/10 of dG/dC/dA/dT achieved similar percentages, 30/30/5/5 was selected, because it achieved a slightly higher percentage at the highest EBV concentration for BL720 at 27 copies/cell, and most patient tumor samples have at least 27 EBV copies/cell.
Next, since the EquiPhi29™ polymerase amplifies DNA through multiple displacement amplification, it produces DNA branches that can clog pores on the ONT flow cell60. In terms of sequence optimization, the sWGA product underwent a debranching reaction with the T7 endonuclease to prevent clogging and maximize sequencing output. There are currently two debranching protocols for ONT sequencing: one from New England Biolabs (NEB)61 and the other from ONT (Supplemental Material 1). The main difference between the two protocols was the amount of input DNA with NEB’s protocol requiring 200 ng and ONT’s requiring 1500 ng. Both debranching reactions yielded a lower amount of DNA compared to the initial input, and subsequent reactions in the pipeline such as PCR required greater than 200 ng of DNA after debranching. As a result, we adopted ONT’s debranching protocol for our pipeline. Initial optimized output of the debranched sWGA from sequencing, hereafter described as post-sWGA, showed a median read length of 1449 bps, significant 1410-fold enrichment of EBV, 0.83-fold enrichment of human, and the total amount of DNA increased from 12.5 ng to 2652 ng (Table 2, Supplementary Table 4). Ultimately, the sWGA reaction was finalized in a 20 µL reaction with 12.5 ng of total input DNA and 30/30/5/5 dG/dC/dA/dTTP.
Supplementation with region-specific multiplex PCR to optimize coverage
Optimized sWGA did not show uniform amplification with an overrepresentation of the region spanning 110,000 to 130,000 bases in the reference genome (Fig. 2). As a result, we added two-tube multiplex PCR leveraging previous primers that targeted all parts of the EBV genome except for bases 110,000-130,000 to the post-sWGA product. These PCR reactions were optimized with 200 ng of input DNA (100 ng for each reaction) and a gradient annealing temperature ranging from 58° to 49 °C to accommodate for the multiple primers and their varying melting temperatures. From 200 ng of input DNA after sWGA, the PCR reactions, henceforward referred to as post-multiplex PCR, outputted median read lengths of 1233 bps and an average of 3540 ng of DNA across all BL720 dilutions (Table 2). Post-multiplex PCR provided more uniform coverage (Fig. 2) with a 6617-fold increase in enrichment of EBV (p = 0.05) and 0.42-fold enrichment of human (p < 0.01, Table 2). Finally, after PCR, we obtained increased sequencing output by debranching the reaction again with T7 endonuclease. Note that we observed significantly decreased sequencing throughput when only doing a single round of debranching for post-multiplex PCR compared to two rounds total with one round for post-sWGA and another for post-multiplex PCR. Overall, all non-repetitive regions of the EBV genome, henceforth referred to as the unique genome, were covered by sequencing reads with a median read depth of 1124 across all measured ranges of EBV copies/cell for BL720 (Fig. 3). The 14 repeat regions observed a median read depth of 83 for all BL720 dilutions with all repeat regions smaller than 1238 bases being covered across all dilutions (Table 2, Supplementary Tables 6, 7). In contrast, the largest regions (IR1 at 23,355 bps, IR4 at 2517 bps, TR at 2137 bps, and IR2 at 1538 bps) did not have read coverage in at least one BL720 dilution (Supplementary Tables 6, 7). Altogether, across all measured ranges of EBV copies/cell for BL720, the combination of post-sWGA and post-multiplex PCR followed by ONT long-read sequencing demonstrated significant enrichment of EBV relative to human and relatively uniform read coverage across the unique genome (Table 2; Fig. 3).
(A) Read Coverage Across Genome of BL720 (27 copies/cell down to 1 copy/cell) (B) Read Coverage Across Genome of Tumor Cell Lines (C) Read Coverage Across Genome of Patient Tumor Samples. Note that EBNA2 is at genomic positions 36,216–37,679, and the targeted LMP2 region is at 0-1680 and 171,772–171,823. Created with BioRender.com.
Sequencing of tumor cell lines and further optimization for low coverage in EBNA2 and LMP2
With our optimized parameters for sWGA, multiplex PCR, and debranching, we sequenced a broader array of BL cell lines (Akata, Akata GFP, Daudi, BL717, BL719, BL725, BL740, BL760, Jijoye, Makau, Mutu I, Namalwa, Raji, Wewak2, MBL118, MBL120, and MBL121). 18 total cell lines were also sequenced, of which 9 were derived from male patients and 6 female (Table 1, Supplementary Table 2). The ages of these patients ranged from 3 to 16 years, and the original tumor sites included abdomen, jaw, neck, orbit, and ascitic fluid. The median EBV copies/cell for the cell lines was 27 copies/cell (range 1-132) with 77.78% being EBV type 1 and 22.22% type 2. All showed significant fold EBV enrichment (631) and human enrichment (0.71) after sWGA with a DNA yield of 2790 ng and median read length of 1795 bps and 1306-fold EBV enrichment and 0.44-fold human enrichment after multiplex PCR (Table 2), yielding 4586 ng of DNA with median read length of 1437 bps. Similar to the BL720 dilutions, all cell lines demonstrated biased coverage at bases 110,000-130,000 after sWGA and more uniform coverage after multiplex PCR (Fig. 2).
While the BL720 dilutions had sufficient coverage across all genes after sWGA and multiplex PCR, there were regions of consistently lower coverage in the cell lines, specifically in EBNA2 and LMP2A/2B. These regions likely observed relatively lower coverage due to the repeat regions flanking EBNA2 and the terminal repeat region within LMP2A/2B62,63. Since increasing the concentration of PCR primers at these regions in the multiplex PCR reaction did not significantly increase coverage, we designed single primer-pairs (singleplex) PCR amplifying these genes after the initial sWGA, henceforth referred to as post-singleplex PCR. Post-singleplex PCR resulted in 1287-fold EBV enrichment and 0.56-fold human enrichment, a total output of 1806 ng of DNA, median read length of 1483 bps, and increased coverage at EBNA2 and LMP2A/2B compared to the original, unamplified DNA (Figs. 2, 3, Supplementary Table 4). For the final coverage, the median read depth was 1232, and there was statistically significant positive correlation between GC content of assemblies and reads mapping to EBV after sWGA (p = 0.03, Pearson’s correlation; Supplementary Table 4). The median read depth for the repeat regions was 316 (Table 2). 12 of the 14 repeat regions had read coverage across all cell lines. The two regions that failed included IR1, which measures 23,355 bps, and IR2 at 1538 bps (Supplementary Table 6, Supplementary Table 7). Note that multiplex PCR and singleplex PCR were performed concurrently, meaning that both reactions used the post-sWGA product (Fig. 1). Thus, each sample was assigned three barcodes during the native barcode ligation step of ONT’s Native Barcoding Kit 24 V14 protocol. After the native barcode ligation step, however, all barcodes were combined into a single library for adapter ligation and priming and loading the flow cell.
We then applied our final optimizations on the cell lines for three tumor samples, which consisted of one male and two female patients aged nine to ten years-old with the two known tumor sites being the abdomen (Supplementary Table 2). All contained type 1 EBV. The three patient tumor samples had 26.4%, 67.5%, and 46.8% of total bases mapping to EBV for post-sWGA, post-multiplex PCR, and post-singleplex PCR, respectively (Table 2; Fig. 3). For these patient tumors, the median read depth was 1094, and patterns in read coverage across the genome for post-sWGA, post-multiplex PCR, and post-singleplex PCR resembled those of the cell lines and BL720 dilutions (Figs. 2, 3). Ten of the 14 repeat regions demonstrated read coverage across all patient tumor samples. The four repeat regions that failed were some of the longest repeat regions including IR1 (23,355 bps), IR2 (1,538 bps), IR4 (2,517 bps), and TR (2,137 bps).
ONT assemblies covered all key genes with L50 of 1 and high identity to illumina
Across all assemblies, there were no consistent gaps in any genes across the EBV genome (Fig. 4). For the BL720 sample that was diluted with additional human DNA to generate 1 EBV copy/cell, the EBV assembly covered 93.90% of the unique genome at 99.91% similarity to the Illumina assembly (Table 2, Supplementary Table 8). In comparison, the undiluted BL720 at 27 copies/cell was sequenced and assembled at 98.60% coverage breadth and 99.86% similarity to the Illumina assembly. Of the 18 cell lines, 14 had Illumina assemblies, which served as the reference genomes for coverage breadth and similarity measurements. All cell line assemblies observed at least 98.60% coverage breadth (Supplementary Table 8). The average similarity of the ONT assemblies to the Illumina assemblies was 99.61%. Due to the absence of any prior assemblies, assembly breadth of coverage for the patient tumor samples was estimated by comparing to the EBV type 1 reference genome NC_007605. The estimated assembly breadths of coverage for MBL144, MBL213, and BL694 were 98.72%, 98.71%, and 98.60%, respectively (Supplementary Table 4). All BL720 dilution, cell line, and patient sample final assemblies had L50s of 1 with average N50s of 166,059, 170,257, and 168,673, respectively. In 13 of the 21 repeat regions defined in NC_007605 including all repeat regions less than 993 bps except the 183-bp R gene, at least 80% of the BL720 dilution assemblies completely covered the region, hereon termed continuous assembly coverage (Table 3, Supplementary Tables 9, 10). Similarly, at least 80% of the cell line assemblies had continuous assembly coverage in 14 repeat regions including all repeats less than 993 bps (Table 3, Supplementary Tables 9, 10). Finally, all patient tumor assemblies had continuous assembly coverage in 12 repeat regions including all regions smaller than 708 bps (Table 3, Supplementary Tables 9, 10).
Assemblies for the 18 tumor cell lines (Akata, Akata GFP, Daudi, BL717, BL719, BL720, BL725, BL740, BL760, Jijoye, Makau, Mutu I, Namalwa, Raji, Wewak2, MBL118, MBL120, and MBL121) and 3 FNA tumor samples (MBL213, MBL144, and BL694) were mapped to the reference genome NC_007605. All assemblies covered all genes (i.e. EBNA2, EBNA3, BZLF1, EBNA1, BALF5, LMP1, and LMP2), and the median continuous assembly coverage across all assemblies for repeat regions up to 1000 bps was 100.00%.
ONT assemblies most resembled illumina counterparts
A phylogenetic tree of all ONT and Illumina assemblies showed that the ONT assemblies most resembled their previous Illumina assemblies counterparts when available (Fig. 5). As expected, there was a clear separation between assemblies based on EBV type. All ONT assemblies also had shorter or similar branch lengths compared to their Illumina assemblies, which may be indicative of a lower or comparable error rate in the new long-read assemblies, as random errors are often expected to extend branch lengths (Supplementary Figs. 1–14)64,65.
Phylogenetic tree for ONT assemblies of the 18 tumor cell lines and 3 patient tumor samples as well as 14 of their available Illumina assemblies (Akata, Daudi, BL719, BL720, BL725, BL740, Jijoye, Makau, Mutu I, Namalwa, Raji, Wewak2, MBL118, and MBL121).