RA genome sequencing and assembly
Whole genome sequencing of epimastigote forms of the hybrid RA strain (TcVI) was performed using PacBio RSII technology. A total of 702,578 reads with a 9394.9 bp average length (range 35–66,694 bp) were assembled using HGAP4 software into 1815 contigs with 73 × mean coverage (Table 1). Of these, 1812 corresponded to the nuclear genome and summed to 91.2 Mpb with a gapless assembly (N50 of 132 Kb) and 51.8% GC content (Table 1). The remaining 3 contigs corresponded to mitochondrial DNA; 2 of them to maxicircles (RA_345 and RA_766, with 58,044 bp and 23,936 bp, respectively) and the third one to minicircle sequences (RA_1814, 992 bp). Completeness of the nuclear genome was assessed using Benchmarking Universal Single-Copy Orthologs (BUSCO), by mapping single-copy genes conserved in Trypanosomatid clade (trypanosoma_odb12). A total of 5383 out of 5397 BUSCO genes were identified and classified as complete. For 4901 of these genes, two copies, most likely alleles, were found, suggesting a high degree of resolution of parental haplotypes. Overall, the RA genome presented a 99.7% completeness, which is in line with the 99.8% obtained for the TCC strain25 , also affiliated to TcVI and with an overall 98% identity to the RA genome (Suppl. Table 1, Table 1).
RA strain nuclear genome annotation
The performance of currently used genome annotation protocols in T. cruzi is curtailed by the level of sequence fragmentation. In addition, these protocols often rely on the migration by homology transfer of spurious and erroneous annotations from other genomes, e.g. CL Brener. These common drawbacks are particularly troublesome for the annotation of multigene families, because their variants tend to collapse during genome assembly and also because a substantial fraction of their reference protein datasets is made up of partial, out-of-frame and/or inaccurately annotated sequences45. To overcome this limitation, we firstly performed a comprehensive manual curation of the sequences of 4 out of the 6 most expanded and complex multigene families (TcMUC, TS, MASP and GP63 metalloproteases) (Suppl File 1)45,46. The same methodology was followed to annotate other, less represented gene families such as Ser-, Ala- and Pro-rich proteins (SAP) and Small Mucin-like Genes (TcSMUG)47,48. Within the TS family, eight groups of CDS showing structural coherence (termed TS-GI to TS-GVIII) were defined as in49. A ninth group, termed ‘TS unspecified’ encompassed sequences that were not assigned to either group (Suppl File 1) . In the same line, sequences from the TcSMUG gene family were split into two robust groups (TcSMUGL and TcSMUGS) differing in their structure, expression pattern and function48,50. Sequences of the remaining 2 most expanded multigene families (Dispersed Gene Family-1 [DGF-1] and Retrotransposon hot spot proteins [RHS]), and of UDP-Gal/GlcNAc-dependent glycosyltransferases (GT) and Thr-, Ala-, Ser- and Val-rich proteins (TcTASV) were extracted from reference datasets25,51,52. Despite their relevance, some of these T. cruzi gene families, e.g. TcTASV and SAP, are not annotated in currently available genomes. Certain groups of transposable elements and non-coding RNAs (ncRNAs) were also annotated as per guidelines provided in Methods.
Following the consolidation of our annotation protocol (Suppl Fig. 1), we obtained 29,456 features in the RA nuclear genome. These summed up to 42.11 Mb, covering 46.5% of the genome, with 1186 out of 1812 contigs containing at least one annotated feature (Table 2 and Suppl Fig. 2). When only contigs > 50 Kb were examined (n = 386), the annotated fraction increased up to 52.4% (Table 2 and Suppl Fig. 2). Of these features, 19,946 corresponded to CDS: 17,037 to Trypanosomatid-conserved proteins (TCP), which either bore functional annotation (henceforth TCFP, n = 9258) or encode hypothetical proteins of unknown function (henceforth TCHP, n = 7779), and 2909 to full-length members of the above mentioned multigenic families (Table 2). The number of multigenic families’ sequences increased up to 6897 when truncated variants (pseudogenes) were considered (Tables 2 and 3).
As expected, TS, MASP, TcMUC, RHS, GP63 and DGF-1 were the most represented multigenic families in the nuclear RA genome, with 1644, 1566, 1073, 1068, 548 and 541 members, respectively (Table 3). TcSMUG, GT, TcTASV and SAP followed with 203, 140, 65 and 65 members, respectively. Among TS groups, TS-GII and TS-GV were the most numerous (608 and 409 sequences, respectively) whereas TS-GII and TS-GVII displayed the highest proportion of pseudogenes (~ 80%). Consistent with previous data4, all of the analyzed gene families comprised large amounts of pseudogenes, which added up to ~ 30–70% of their total sequences (Table 3). This was particularly remarkable for RHS, in which pseudogenes constitute ~ 95% of the total count of sequences, suggesting a faster evolution pace for this family. Overall, multigene families accounted for ~ 13.19 Mb (14.46%) of the RA nuclear genome (Table 2).
In addition to CDS, 3452 transposable elements and 2070 ncRNAs were identified (Table 2). The most abundant transposon was the non-autonomous Short Interspersed Repetitive Element (SIRE), with 2233 copies. Other transposons such as the Long Autonomous Terminal Repeat Retrotransposons (L1Tc), Vestigial Interposed Retroelement (VIPER), Short non-Autonomous Terminal Repeat Retrotransposons (NARTc) and cruzi-associated retrotransposon (CZAR) were represented by 512, 497, 186 and 24 sequences, respectively (Table 4). Among ncRNA sequences we identified 119 tRNAs, 329 rRNAs (rRNA 5S, rRNA 18S, Large subunit (LSU)-rRNA, etc.) and 150 copies of the SL-RNA. Other kinds of ncRNAs, including small nuclear RNAs (snRNAs) and small nucleolar RNAs (snoRNAS) such as H/ACA snoRNAs and C/D snoRNAs, were also annotated (Table 4). Overall, transposable elements and ncRNAs accounted for ~ 4.67 Mb (~ 5,12%) and ~ 257.4 Kb (~ 0.28%) of the RA nuclear genome, respectively (Table 2).
For comparison, the sequences from TCP, multigene families, transposons and ncRNAs were also assessed in the genome of TCC25 using our annotation protocol (Suppl Fig. 1). This analysis evidenced that the larger genome size of the RA strain compared to TCC does not correlate with an increase in feature content. In fact, the TCC genome assembly shows a higher percentage of occupancy (51.95%) compared to RA (46.15%) (Tables 1 and 2). As shown in Tables 2 and 3, multigene families in RA and TCC strains exhibited quite similar genomic landscapes, both in quantitative and qualitative terms. The most noticeable differences were i) a slight but consistent increase (~ 8–12%) in the number of TS, RHS, GP63 and DGF-1 pseudogenes in the RA genome; and ii) an increased dosage of TcSMUGL sequences in RA as compared to TCC (104 vs 71). A closer examination of the contigs harbouring TcSMUGL sequences however suggested that this discrepancy may arise from assembly differences between the strains (Suppl Fig. 3).
The dosages of most types of analyzed transposable elements were also rather conserved between RA and TCC; with apparent expansions of NARTc sequences in RA (186 vs 121 in TCC) and of CZAR elements in TCC (43 vs 24 in RA) (Table 4). This conservation extends to every analyzed ncRNA except for SL-RNA that was found to be more abundant in TCC than in RA (232 vs 150 copies; Table 4).
RA genome organization
To gain deeper insights into RA genome organization we used a recently developed pipeline for the high-throughput assessment of isochores on DNA datasets53. Using default parameters of 500 bp windows with 300 bp sliding step and a previously established 51% GC cutoff, this pipeline allows for the unbiased, i.e. independent of gene annotation, classification of core and disruptive compartments on T. cruzi genomes53. A total of 386 contigs > 50 Kb were processed, collectively spanning ~ 64.4 Mb (70.44% of the RA nuclear genome). From the 1331 regions that could be defined, 703 were classified as disruptive (smoothed GC content ≥ 0.51%) and 628 as core compartments (smoothed GC content < 51%), accounting for 55% (~ 35.5 Mb) and 45% (~ 28.9 Mb) of the analyzed genomic space, respectively (Table 5 and Fig. 1a). These figures are in the range of other T. cruzi genomes sequenced by long reads-based methods (Suppl Table 2). The length distribution was similar for both kinds of compartments (median length [Q1-Q3] of 31.89 [14.9–60.9] Kb for disruptive regions and 26.7 [13.9–59.6] Kb for core regions) (Table 5 and Fig. 1b). The percentage of occupancy within the corresponding contigs was also similar for core and disruptive compartments (Table 5). As previously demonstrated25, and at variance with T. brucei, disruptive regions in RA were not restricted to the subtelomeres, but distributed throughout the genome. Although most of the contigs displayed a typical mosaic structure, 63 of them were exclusively disruptive (ranging from 50.1 to 792.8 Kb), and 33 were exclusively core (ranging from 50.7 to 226.9 Kb) (Table 5). Most notably, the eight largest compartments were classified as disruptive, thus arguing for the capability of the herein used sequencing and assembly approaches in the recovery and deconvolution of repetitive regions (Fig. 1b).
Structural features of disruptive and core regions in the RA genome. a, c, d and e. Scatter and box and whiskers plots in log scale showing the median %GC (a), the CDS per PTU (c), the density of SSR (d), and the distance (in Kb) between genes adjacent to convergent (C) and divergent (D) SSR (e) in core (green) or disruptive (pink) compartments. Each box represents the first quartile, median, and third quartile, with whiskers extending 1.5 times the IQR. b. Jointplot displaying the percentage of occupancy of each region (calculated as its length divided by the length of the corresponding contig, multiplied by 100) according to its length. Kernel density estimate plots for disruptive (pink) and core (green) regions are shown along each axis, representing the distribution of each variable across the compartments. In c and d, P-values were derived from Mann–Whitney U tests comparing the indicated feature between disruptive and core regions. In e, Kruskal Wallis and Dunn’s post-hoc tests were performed to compare distances among compartments and SSR configurations.
Taking into account the pivotal role of SSRs in T. cruzi DNA transcription, we assessed their distribution on the RA genome. Of note, our preliminary characterizations revealed overlaps between PTUs encoded on complementary strands of certain contigs. A closer look at these regions, however, indicated that such events most likely resulted from the migration by homology transfer of erroneously annotated sequences (Suppl Fig. 4), and the SSR defining these PTUs were thus not further considered. Global analyses indicated that core compartments present significantly less density of SSRs and more CDSs per PTU as compared to disruptive compartments (Table 5 and Fig. 1c, d). Notably, more divergent than convergent SSR were counted in core compartments (131 vs 58); a bias not observed in disruptive compartments (472 divergent SSR vs 495 convergent SSR, Table 5). The distances between CDS flanking convergent and divergent SSR in disruptive compartments, as well as those next to divergent SSR in core compartments, were relatively uniform (Table 5 and Fig. 1e). On the other hand, the distances between CDS adjacent to convergent SSR in core compartments were the shortest (Fig. 1e).
Next, we moved on to assess in more detail the effect of isochore-based compartmentalization of the RA genome on the distribution of annotated features. Though CDS were roughly equally represented in each type of compartment, GC-poor, core regions presented more density of CDS than GC-rich, disruptive regions (Table 5 and Fig. 2a). Trypanosomatid-conserved proteins (TCPs), irrespective if they have functional annotation (TCFP) or are tagged as ‘hypothetical’ (TCHP), were distributed throughout core and disruptive compartments, although they were more represented (~ 68%) in core compartments (Fig. 2d), which is consistent with the observed higher conservancy (among strains and also among trypanosomatids) of the core genome. In line with their metabolic and/or housekeeping roles, TCFP were usually found either as single-copy genes (2378; 82.56%) or as a small set of 3 to 10 highly homologous genes showing head-to-tail disposition that likely emerged by sequence duplication (441; 15%) (Suppl Table 3 and Suppl Fig. 5). It should be noted, however, that some small families of TCFP genes bearing up to 200 sequences could be identified in the RA genome (Suppl Table 4). These may be confined to a single or few contigs, i.e. histone H4 and Histone H2A, or be conspicuously distributed on the parasite genome, i.e. target of rapamycin (TOR) kinase 1 (Suppl Table 4). In addition to gene dosage and genome distribution, such small families of TCFP genes also differ in their degree of representation in core and disruptive compartments (Suppl Table 4). As a general trend, a positive correlation between gene count and accumulation in the disruptive genome may be observed (Suppl Fig. 5).
Distribution of functional features among core and disruptive regions in the RA genome. a, b and c. Scatter plots showing the density of CDS (a), transposon (b) and non-coding RNA (ncRNA) (c) in core (green) or disruptive (pink) compartments in log scale. Each box represents the first quartile, median, and third quartile, with whiskers extending 1.5 times the IQR for each compartment class. d, e, f, g and h. Barplots depicting the distribution as percentage of CDS and pseudogenes (d), multigene families (e), TS groups (f) transposon (g) and non-coding RNAs (ncRNAs) (h) in core (green) or disruptive (pink) compartments. MF: multigenic families. In a, b and c, P-values were derived from Mann–Whitney U tests comparing the indicated feature between disruptive and core regions. In d (CDS and MF data), e and f, pseudogenes are included in the analysis.
It is noteworthy the case of glycine dehydrogenase [decarboxylating] (GlyDh). On one hand, and despite the considerable size of this TCFP gene family (n = 61), GlyDh sequences were exclusively found in the core genome (Suppl Table 4). Moreover, inspection of GlyDh sequences revealed that they comprise solely two full-length members, showing 98.14% of identity between them, which were intriguingly non-syntenic with their orthologs in related trypanosomatids. The remaining 59 GlyDh sequences were unique to T. cruzi and corresponded to non-functional, fragmented variants (most likely pseudogenes), with a conspicuous core genomic distribution (Suppl Table 4). This is most striking, as it suggests i) that GlyDh amplification in T. cruzi was not linked to cell physiological demands, i.e. increase of protein dosage, as in other housekeeping genes; and ii) that massive amplification and pseudogenization of sequences in this organism is not a process restricted to the disruptive genome.
At variance with TCP, sequences from large multigene families were conspicuously distributed on the RA genome, though mainly in disruptive compartments (Fig. 2d). This enrichment was near absolute for DGF-1, TcMUC, MASP, TcSMUG, and TcTASV, with over 99.5% of their members located within GC-rich isochores (Fig. 2e). In this regard, it should be noted that DGF-1 genes and pseudogenes are large sequences (7–12 Kb) with a high content of GC (> 60%)52, which may have by themselves a major impact on local isochore definition. Other gene families such as TS, GP63, RHS, GT and SAP did not show such a strongly biased distribution, with ~ 81–97% of their sequences found in disruptive compartments (Fig. 2e and Suppl Table 3). This finding is particularly relevant in the case of TS (93.6% of association with GC-rich compartments), as it challenges the proposal of this multigenic family as a diagnostic marker of disruptive compartments25. When different groups of TS were analysed separately, TS-GI (30.77%), TS-GIII (21.74%), TS-GII (9.78%) and TS ‘unspecified’ (8.3%) turned out to be the most represented in core compartments (Fig. 2f and Suppl Table 3). It is particularly interesting the case of TS-GI, bearing enzymatically active TSs, in which the genomic distribution of its members seems to correlate with their evolutionary track. Briefly, the most parsimonious hypothesis states that a molecule with trans-sialidase activity emerged in an ancestor of the trypanosome lineage, and was readily adopted by these parasites for their interaction with arthropod vectors54. Following speciation, T. cruzi further elaborated on the TS scaffold, evolving a huge repository of polymorphic sequences (genes and pseudogenes). Even though most of them lack TS activity, they are nevertheless unified by certain structural features, including a sequence associated with tissue tropism known as FLY as well as typical bacterial/viral sialidase motifs such as Asp-boxes. Only a few TS molecules (restricted to TS-GI) retained trans-sialylation capacity; and a fraction of them, displaying the antigenic SAPA repeats as signature, were also repurposed as key determinants of infection and pathogenesis in the mammalian host13,55. Our analyses revealed that out of the 17 TS-GI CDS annotated in the RA genome, solely 11 presented the N[S/A]AYS catalytic motif and shared > 70% identity in pairwise alignments with sequences with experimentally demonstrated TS activity56. Of these, those likely corresponding to ‘ancestral’, insect-dwelling stages expressed variants57 were found in core compartments whereas those ‘novel’, i.e. bearing SAPA repeats, segregated to disruptive compartments (Suppl Table 5).
As described, transposable elements were also more represented in disruptive compartments (Fig. 2b). SIRE, VIPER, L1Tc and NARTc were predominantly (but not exclusively) found in disruptive compartments (Fig. 2g). RNaseH, which is encoded within L1Tc as part of its own retrotransposition machinery12, displayed the same distribution profile (Fig. 2g). In contrast, CZAR transposons were exclusively found in disruptive compartments (Fig. 2g and Suppl Table 3). As for ncRNAs, we found that tRNAs and snRNAs were predominantly found in core compartments while rRNAs and snoRNAs were mostly found in disruptive ones (Fig. 2c, h and Suppl Table 3).
Integrated data on the features (annotations of selected CDS, transposons and ncRNAs) encompassed in the 40 largest RA contigs along with the GC content-based classification for each region is shown in Fig. 3.
RA genome organization. a. In-scale circos plot of the 40 largest contigs from the RA strain. Pink and green bands indicate the disruptive (GC content > 51%) and core (GC content < 51%) genome compartments, respectively, predicted with GCanner (A). Densities of specific gene families and genetic elements are displayed as heatmaps: TCHP (B), TcMUC (C), MASP (D), TS (E), GP63 (F), DGF-1 (G), RHS (H), ncRNAs (I), transposons (J). Green and gray scales indicate the density of each feature.
Disruptive genome architecture
Visual analysis of disruptive compartments revealed recurrent patterns of features’ distributions, thereby indicating putative genomic co-occurrences within regions (Fig. 3). Some of these genomic associations already have been reported, such as between MASP and TcMUC sequences10. To explore this issue at a genome-wide scale, we first conducted an analysis of the representation of selected features across disruptive compartments (Fig. 4a, Suppl Table 6). Correlation matrices built upon all the features’ densities in compartments > 20 Kb (n = 454), not only confirmed the strong TcMUC-MASP genomic association (Pearson coefficient = 0.65) but also revealed genomic co-occurrences involving other gene families/transposons. As shown in the network plot, 2 major clusters of positively correlated features could be outlined (Fig. 4b). The first one involved MASP, TcMUC, TS-GV and TS-GVI, with loose connections with GP63-SIRE-VIPER-TS-GVIII (via MASP) and L1Tc-RNaseH-NARTc, via TcMUC. Genomic association between MASP and TS-GV is of particular interest, as we have recently shown that the majority of MASP-TS chimeric genes involve TS-GV sequences45. Considering that TcMUC-MASP chimeras have also been demonstrated45, it could be hypothesized that genomic associations favor recombination events between members of these genomic families.
Protein association in disruptive compartments. (a). Correlation matrix generated using the densities of the indicated features within disruptive compartments > 20 Kb (n = 454), with at least one annotation. The divergent scale displays positive correlations in blue and negative correlations in red, with correlation coefficients indicated within each cell. (b). Robust associations (Pearson correlation > 0.21) between the genome features analyzed in a were summarized using a network plot. The thickness represents the absolute value of correlation and solid and dotted links depict positive and negative correlations, respectively. The dot size represents the count of disruptive regions harboring the corresponding feature.
The second cluster involved a core of robust associations between DGF-1, RHS and TS-GII, with a weak connection to TS-GVII (via DGF-1). Both MASP/TcMUC/TS-GV/TS-GVI and DGF-1/RHS/TS-GII clusters displayed negative correlation with TCHP, which was more marked for the latter (Fig. 4b). An additional positive association was verified for TS-GIV and SAP (Pearson coefficient = 0.41), which could not be linked to any of the above mentioned clusters (Fig. 4).
Clustergrams built upon the densities of the most represented ‘classical’ multigenic families/groups (TcMUC, MASP, TS-GV, RHS, DGF-1, TS-GII and GP63), were coherent with these correlations and allowed for the robust delineation of 4 major categories of disruptive regions in the T. cruzi RA genome (Fig. 5a). The first category was composed of regions with high MASP, TcMUC and TS-GV sequences, and largely overlapped the first cluster derived from the correlation matrix. This category encompassed 100 compartments, including the largest ones, and was termed ‘TcMUC/MASP/TS-GV’ (Fig. 5a). TcMUC/MASP/TS-GV regions displayed negligible or null densities of RHS, TS-GII and DGF-1 and a range of densities (from null to medium) of TCHP (Fig. 5b). Some of these regions also showed a range of GP63 densities (Fig. 5a, b). The second category, composed of 176 regions, was enriched in RHS, DGF-1 and/or TS-GII, and corresponded to the second cluster derived from the correlation matrix (Fig. 5a). This category was accordingly named as ‘RHS/DGF-1/TS-GII’ and displayed both negligible or null amounts of MASP, TcMUC and TS-GV sequences and a range of densities (from null to medium) of TCHP (Fig. 5b). RHS/DGF-1/TS-GII regions were frequently found at the terminal ends of the assembled contigs (Fig. 3), which in certain cases may correspond to telomeric/sub-telomeric chromosome regions. As described27,31,58, DGF-1, RHS and TS-GII are indeed enriched at T. cruzi telomeric/sub-telomeric positions and, due to their extremely high evolution rate and recombination frequency, they were proposed to have a major role in shaping the structure, dynamics and gene expression regulation of these regions. In line with this, it is worth noting that RHS, TS-GII and DGF-1 displayed the highest proportion of pseudogenes among multigenic groups/families in the RA genome (Table 3).
Enrichment of T. cruzi gene families within the genomic compartments. (a) Clustergram and heatmap depicting all the disruptive compartments > 20 Kb with at least one annotated feature (n = 454). Each column represents a region tagged as the corresponding category according to the enrichment shown (see reference). (b) Scatter and box and whiskers plots showing the density of TCHP, TcMUC, MASP, TS-GV, RHS, DGF-1, TS-GII and GP63 in the disruptive subcompartments. Each box represents the first quartile, median, and third quartile, with whiskers extending 1.5 times the IQR for each compartment class. (c). Heatmap depicting the densities of proteins, transposons and snoRNAs for each region. The regions within each category are ordered by decreasing length. EF1-alpha: Elongation factor 1 alpha; ESAG: expression site-associated gene; GT: UDP-Gal or UDP-GlcNAc-dependent glycosyltransferase; SBP: Syntaxin binding protein; SAP: Ser-, Ala- and Pro-rich proteins; TOR: target of rapamycin.
The third and less represented category (termed ‘GP63’) comprised 32 regions containing GP63 sequences (Fig. 5a). These were almost devoid of any other multigene families and contained high to very high densities of TCHP sequences (Fig. 5b). Most notably, GP63 sequences found in these compartments were different from those occasionally found in ‘TcMUC/MASP/TS-GV’ regions (see above). A recent evolutionary study of GP63 in T. cruzi allowed for the substructuring of this gene family into multiple groups based on sequence alignments46. An in-depth analysis of GP63 sequences displaying co-occurrence with ‘GP63’ or ‘TcMUC/MASP/TS-GV’ regions showed that they belong to different groups, thereby indicating that in addition to structural variations, groups of GP63 genes/pseudogenes delineated in the above study also differ in their genomic distribution (our unpublished results).
Finally, a fourth category, termed ‘Other’, was built upon the 146 regions not included in previous categories (Fig. 5a). Regions tagged as ‘Other’ were unified by their null densities of sequences from major multigene families (Fig. 5b). In addition, these regions displayed as a general trend higher densities of TCHP than those lying in other categories (Fig. 5b).
To further characterize the GC-rich, ‘disruptive’ compartments in RA, we performed heatmaps to analyze the densities of less numerous gene families (including the TS groups not considered before), TCFP gene families (Suppl Table 4), transposable elements and ncRNAs across the 4 proposed categories (Fig. 5c). This analysis supported correlation data, i.e. genomic associations between TS-GVI, L1Tc, RNaseH and SIRE with ‘TcMUC/MASP/TS-GV’ category, and unveiled previously overlooked co-occurrences of these regions with a variety of TCFP gene families, including Lys-63-specific deubiquitinase BRCC36, flagellar calcium-binding 24 kDa protein and casein kinase (Fig. 5c and Suppl Table 7). Most interestingly, it revealed the enrichment in the ‘TcMUC/MASP/TS-GV’ category of SAP sequences (Fig. 5c), involved in the invasion of mammalian cells by metacyclic trypomastigotes47 and syntaxin binding proteins (SBP) (Fig. 5c and Suppl Table 7), involved in the docking and fusion of vesicles in other organisms59. Though T. cruzi SBP have not been characterized, vesicle dynamics is essential for the trafficking, processing, surface disposition and shedding of GPI-anchored mucins, MASP and TS molecules60,61,62. GT sequences, involved in the elaboration of complex glycans that decorate mucins and MASP molecules and determine their functional properties63, were shown to be underrepresented in ‘TcMUC/MASP/TS-GV’ regions (Fig. 5c and Suppl Table 7). Low levels of C/D snoRNAs were also detected in regions lying under the ‘TcMUC/MASP/TS-GV’ category (Fig. 5c and Suppl Table 7).
On the other hand, TS-III, TS-IV, TS-VII, and TS-VIII were primarily located within regions of the ‘RHS/DGF-1/TS-GII’ category (Fig. 5c and Suppl Table 7). The N-acetyltransferase complex ARD1 subunit and ESAG-like protein families showed a similar distribution. Interestingly, ESAG proteins in T. brucei, which are critical for antigenic variation in this parasite, are also codified in subtelomeric regions64. Although not exclusively, key protein-glycosylation enzymes such as GT and beta-galactofuranosyl glycosyltransferase proteins63 were also highly represented in ‘RHS/DGF-1/TS-GII’ regions. In contrast to ‘TcMUC/MASP/TS-GV’ regions, the ‘RHS/DGF-1/TS-GII’ category showed a reduced representation of L1Tc (and RNAseH), SIRE and VIPER, and an overrepresentation of C/D snoRNA sequences (Fig. 5c and Suppl Table 7).
The ‘GP63’ regions showed co-occurrences with the GT gene family and, to a lesser extent, with the glutamamyl carboxypeptidase and amastin families (Fig. 5c and Suppl Table 7). As evidenced in the correlation matrix (Fig. 4), SIRE elements were also overrepresented in ‘GP63’ regions (Fig. 5c and Suppl Table 7). The role of these transposons in the transcription regulation and evolution mechanisms of the GP63 gene family was recently discussed46. Approximately 20% of ‘GP63’ regions showed high and concomitant density increase of C/D and H/ACA snoRNAs (Fig. 5c and Suppl Table 7). A closer inspection revealed that GP63 sequences were embedded within tandem repeats, typically comprising two H/ACA snoRNAs for each C/D snoRNA and GP63 sequence. This arrangement suggests a role for such snoRNAs in the localized amplification of GP63 sequences. Associations of some groups of T. cruzi GP63 with snoRNAs have been recently reported46.
All low copy number of T. cruzi specific gene families such as TcSMUG, Trypomastigote Small Surface Antigen (TSSA)65 and TcTASV as well as most of the small TCFP families, i.e. histones, cystathionine beta-synthase, sigma-adaptin 3, glycerate kinase, elongation factor 1-alpha, dynein light chain, chaperonin HSP60 mitochondrial precursor, folate/pteridine transporter, amastin and glutamamyl carboxypeptidase, were found on ‘Other’ regions (Fig. 5c and Suppl Tables 4 and 6). These were mostly arranged in a discrete number of tandems of highly homologous genes (likely paralogues) showing head-to-tail disposition that likely emerged by sequence duplication. These small families display low diversification, with minimal pseudogenization and/or genome translocation as compared to more complex gene families such as TS or DGF-1, suggesting a different mode of evolution. In line with this, regions from the ‘Other’ category were shown to be poor in most transposable elements, e.g. L1Tc, while exhibiting varying densities of SIRE elements (Fig. 5c and Suppl Table 7). A small proportion (~ 10%) of these regions displayed a high density of snoRNAs, which were also arranged in tandem arrays (Fig. 5c). As described above for GP63 sequences, some TCFP gene families, i.e. tyrosine aminotransferases, displayed ‘array units’ made up of the specific CDS surrounded by different kinds of snoRNAs (see RA_174 and RA_73), suggesting that these ncRNAs play an underappreciated role in T. cruzi genome evolution. It should also be mentioned that a few of the analyzed features, i.e. TOR kinase 1, showed a non-biased distribution across our defined categories of the disruptive T. cruzi genome (Fig. 5c and Suppl Table 7).
From a structural standpoint, the proposed categories of RA disruptive regions displayed differences in their length, CDS density, SSR density and, though to a lower extent, also in average GC content (Fig. 6). Overall, these findings, together with above shown variations in gene/feature composition and density, genomic distribution and mode of evolution, provide support to our proposed categorization of the T. cruzi disruptive genome.
Structural features of disruptive subcompartments in the RA genome. Scatter and box and whiskers plots showing the region length (in Kb) (a), CDS (*and pseudogene) density (b), median %GC (c) and the density of SSR (d) in each kind of disruptive subcompartment (in log scale). Each box represents the first quartile, median, and third quartile, with whiskers extending 1.5 times the IQR. Statistical differences were assessed using the Kruskal–Wallis test, followed by Dunn’s post-hoc tests to compare individual features among compartments, from which the P-values were derived.
In summary, we have generated a high-quality whole genome assembly of RA (TcVI), a virulent T. cruzi strain commonly used as a model in Chagas disease research laboratories. The completion and release of this genome, along with the highly curated databases underlying its annotation, will provide the scientific community with valuable resources to improve functional, epidemiological and comparative evolutionary studies into this neglected parasite. The exhaustive analysis of the RA genomic assembly, carried out using custom-built bioinformatic tools, revealed novel aspects of the T. cruzi genome architecture, dynamics and evolution.





