Study population
We used data from UK biobank and restricted the sample to participants who had received at least one dose of a COVID-19 vaccine. The outcome of COVID-19 vaccine immunogenicity was assessed using immunoglobulin (IgG) result by self-administered lateral flow test, specifically the Fortress Fast COVID-19 Test or AbC-19TM kit. We used IgG test positive as the indicator of immunogenicity. We also extracted information on whether participants received a second booster dose, the interval between the first or second vaccination and antibody testing, and the type of antibody test kit. Because lateral flow tests cannot distinguish between vaccine-induced and infection-induced antibodies, an additional Thriva coronavirus antibody test was further used to collect capillary blood from participants. Laboratory analysis was then performed to detect IgG antibodies against the nucleocapsid protein of SARS-CoV-2, in order to determine prior infection status. We excluded all participants who were ever infected with SARS-CoV-2 (Fig. 1). We extracted individual-level data from multiple databases in the UK Biobank, including COVID-19-related data, Genomics data, Hospital Inpatient data, Primary Care records, and Death Register data. The research has been approved by the UK Biobank Ethics and Governance Council. All participants in the UK Biobank provided written informed consent at the time of recruitment, including consent for their data to be used in future health-related research. Therefore, no additional informed consent was required for this study.
Genome-wide association study (GWAS)
We performed linear association analyses between each genetic variant and immunogenicity of COVID-19 vaccines. We performed quality control filter on the variants according to: information score ≥0.427, minor allele frequency ≥0.001, Hardy-Weinberg equilibrium -log10(P) ≤ 7, and inclusion of only autosomal and biallelic variants. These inclusion criteria were consistent with previous studies on the same study population28. For the GWAS analysis, IgG positivity after COVID-19 vaccination was used as the primary phenotype and dependent variable; independent variables included SNP term, genotyping array, age, genetic sex, and 15 principal components to control for population stratification. We applied a genome-wide significance threshold of p-value < 5 × 10−8 to account for multiple testing. Related individuals with high kinship coefficients (>0.3) were removed to reduce the impact of cryptic relatedness. Above genome-wide association analyses were implemented using Plink229. Moreover, we performed conditional and joint analysis to identify multiple independent signals. Specially, we conditioned on the primary associated SNP, selected other significant SNPs in a stepwise manner with p-value threshold of 5 × 10−8. We set the window size of selection 10 Mb, assuming that SNPs 10 Mb away or on other chromosomes are in linkage equilibrium. Among all SNPs reached genome-wide significance, we identified three multiple independent SNPs associated with immunogenicity of COVID-19 vaccination, located near HLA-DQA1 and HLA-DQB1 on chromosome 6, and IGHV1-69 on chromosome 14 (Supplementary Table 2). Analysis was performed using GCTA-COJO package30.
Stratified linkage disequilibrium score regression
LD score regression assumes that heritability uniformly distributes across the genome. For complex traits, however, associated heritability may cluster in regions. To deal with this drawback, stratified LD score regression was proposed to dissect SNP-based heritability into different categories. We applied a previously published method to attribute SNP-based heritability from GWAS into different functional categories and cell types31. Only cell type group related to immune system was significantly associated with vaccination immunogenicity after Bonferroni correction (p = 0.006 ≤ 0.05/9) (Supplementary Table 5). For SNPs related to immune system, they accounted for 23.34% of total SNPs explained but explained almost 99.8% of \({\hat{h}}_{{SNP}}^{2}\), with fold change of 4.28. This functional relevance was supported by existing knowledge that immunogenicity was related to immune system.
GWAS using REGENIE framework
To account for relatedness in our analysis, we utilized the REGENIE framework, which allows the inclusion of related individuals without the need for prior exclusion14. For the generation of the bed, bim, and fam genotype files used in step 1, we applied stringent quality control filters: SNPs with minor allele frequency (MAF) less than 1%, minor allele count (MAC) less than 100, or missing genotype call rate greater than 1% were excluded. We also removed variants showing significant deviation from Hardy–Weinberg equilibrium (HWE) in the overall sample (p < 1 × 10⁻¹⁵), as such deviations may indicate genotyping artifacts or underlying population structure. Individuals with a genotype missingness rate greater than 2% were also excluded. Furthermore, our analysis was restricted to biallelic SNPs only, since multiallelic variants may complicate modeling and are generally less well-supported by GWAS tools. We also retained only SNPs with unambiguous A, C, G, or T alleles, excluding indels, structural variants, and ambiguous base calls to ensure consistency in downstream analyses. To select representative SNPs and corresponding samples for step 1 modeling, we additionally applied filters based on imputation quality (INFO score > 0.8) and individual missingness (–mind 0.02). Based on available computational resources, we set the block size (–bsize) to 4000. For step 2 association testing, we included the following covariates in the model: GenotypeBatch, Age, GeneticSex, and the first 15 principal components (PC1–PC15). Firth logistic regression with approximate inference was applied to account for potential bias in unbalanced binary traits. To reduce false positives while maintaining sensitivity, we adopted a conservative significance threshold of –pThresh 0.01.
HLA type imputation and HLA-immunogenicity association analysis
Classical HLA types with four-digit resolution at the class II regions were imputed from the SNP genotyping in the UK Biobank population. Imputation of four-digit HLA alleles was carried out using HLA*IMP:0232. For each HLA allele, we performed logistic regression to analyze its association with immunogenicity, adjusting for covariates including the influential factors mentioned in the GWAS section and the type of antibody test kit. Such logistic regression produced a beta estimate: positive value of beta coefficient indicated additional copy of this HLA allele associated with elevated immunogenicity, while negative value indicated reduced immunogenicity. The significance of HLA alleles in the logistic regression model was tested while accounting for multiple testing.
HLA-peptide binding affinity prediction and binding-affinity-immunogenicity association analysis
To assess peptide binding affinity between HLA alleles with peptide from COVID-19 vaccine, we obtained BNT-162b2 sequence in fasta format from GenBank (Accession No. OR134577.1). To predict peptide binding affinity, we used NetMHCpan-4.1 for MHC class I peptide affinity prediction33, and NetMHCIIPan-4.1 and 4.3 for MHC class II peptide affinity prediction34. This machine learning models were trained on existing binding affinity and eluted ligand data as outputs, using peptide sequence and HLA allele sequences as inputs. We derived the BNT-162b2 sequence into 12- to 18-mers and predicted peptide affinity in nanomolar IC50, eluted ligand prediction score, and log-scaled predicted binding affinity using netMHCPan. Specifically, for each HLA-DQ allele, we calculated the average peptide binding affinity by averaging the binding affinity across all 12- to 18-mers, and assigned this value to the corresponding individuals. For individuals that were heterozygote (two different HLA-DQA1 alleles or two different HLA-DQB1 alleles) or double heterozygote (two different HLA-DQA1 alleles and two different HLA-DQB1 alleles), we calculated binding affinity across all alleles, and assigned the averaged binding affinity to that person.
To explore the association between binding affinity and immunogenicity, we first analyzed associations between immunogenicity and binding affinity at the allelic level. For all HLA alleles, we performed linear regression to examine the association between the beta coefficients representing the relationship between HLA alleles and immunogenicity, and the averaged HLA-peptide binding affinity. Second, for all individuals, we analyzed associations between immunogenicity and binding affinity at the population level. We used logistic regression to analyze the association between individual-level binding affinity and immunogenicity, and adjusting for other covariates. To assess nonlinear dose-response relationships, we repeated the above regression model but replaced the linear term of binding affinity with splines.
Functional genomic analysis of causative variants near IGHV1-69
We used LocusZoom35 to plot SNP-gene relationship in a specific genomic region. We performed tracks analysis to assess potential regulatory functions of the risk SNPs. The conversion of SNPs in different genome assemblies was performed using liftover36. We collected functional genomics data from ENCODE (https://www.encodeproject.org/)37, including chromatin accessibility (DNase-seq, ATAC-seq), histone H3 modification (H3K27ac, H3K4me1, H3K4me3), and ChIP-seq of critical factors (Supplementary Table 6). H3K27ac is a mark of active promoters and enhancers, while H3K4me3 and H3K4me1 usually serve as marks of promoters and enhancers, respectively. Besides 7 cell lines in ENCODE histone modification collection, DND41 and DOHH2 were added in our analysis for their high expression of IGHV1-69, revealed by Harmonizome38. We incorporated ChIP-seq peaks of CCCTC-Binding factor (CTCF) and RNA polymerase II subunit A (POLR2A) in B cell samples in our analysis. We applied annotations of cis-regulatory elements using resources from UCSC genome browser39.
Mendelian randomization analysis of causative variants near IGHV1-69
In addition, we used Mendelian randomization analysis to determine the causal association between IGHV1-69 expression level and the immunogenicity of COVID-19 vaccination, based on summary data from GWAS and eQTL. In the framework of Mendelian randomization analysis, the outcome of interest (Y) is the immunogenicity of COVID-19 vaccination. The gene expression (X) stands as the exposure of interest (i.e., IGHV1-69), while the single nucleotide polymorphism (SNP) or genetic variant (Z) acts as the instrumental variable. To denote the effect size of gene expression on the outcome of interest, we use bxy. Similarly, bzy represents the effect size of the SNP on the outcome of interest, and bzx is the effect size of the SNP on gene expression. In the Mendelian randomization framework, our primary objective is to ascertain the effect size of gene expression on immunogenicity, denoted by bxy, which can be calculated as bxy = bzy/bzx. Based on previous research, bxy can be estimated from summary-level data, despite the outcome of interest, gene expression, and SNP being obtained from different data samples; the statistical power of bxy estimation can be significantly increased if bzy and bzx are derived from two independent datasets with large sample sizes40. Our GWAS analysis was based on nearly 200,000 individuals, while eQTL data were procured from a separate large-scale sample, eQTLGen consortium18. Therefore, we calculated the test statistic of bxy by: \(T\approx \frac{{Z}_{{zy}}^{2}{Z}_{{zx}}^{2}}{{Z}_{{zy}}^{2}+{Z}_{{zx}}^{2}}\), where zzy and zzx were z-test statistics obtained from GWAS and eQTL data set. The positive association between gene expression and outcome was visualized by scatter plot of Z-scores derived from the GWAS and eQTL summary data. Among all 10 SNPs located near IGHV1-69 and reached genome-wide significance (i.e., 5 × 10−8), 7 of them had available records in eQTL. The expression level of IGHV1-69 was positively associated with immunogenicity, since SNPs of interest were (1) all negatively associated with immunogenicity, according to GWAS results, (2) all negatively associated with reduced expression of IGHV1-69, as indicated by negative z-value in Supplementary Table 7.
Interactive effects of genetic factors on the immunogenicity of COVID-19 vaccine
To evaluate possible interaction between HLA alleles and lead SNP near IGHV1-69 on COVID-19 vaccine immunogenicity, we first calculated an HLA-DQ binding affinity by treating continuous nanomolar IC50 of binding affinity at individual level into a binary variable. Individuals were categorized into two groups according to whether their binding affinity was above or below the average based on the nanomolar IC50 of binding affinity. Separate logistic regression models were performed to evaluate associations between COVID-19 vaccine immunogenicity and genetic factors: (1) the HLA-DQ binding affinity, (2) the lead SNP near IGHV1-69, and (3) the interaction term of HLA-DQ binding affinity * lead SNP near IGHV1-69, adjusting for other covariates. The interaction model additionally included the main effects of both genetic variables.
Modification of genetic factors on relationships between date of vaccinations and immunogenicity
To further investigate whether genetic factors contribute to the heterogeneity of associations and how they modify the relationships between date of vaccinations and immunogenicity, we calculated the proportion of vaccinated individuals with IgG test positive (i.e., immunogenicity) by days after vaccination among groups with: (1) HLA-DQ binding affinity above average versus HLA-DQ binding affinity below average, (2) homozygous or heterozygous reference allele (AA or AT) of the lead SNP near IGHV1-69 versus homozygous risk allele (TT) of the lead SNP near IGHV1-69, and (3) HLA-DQ binding affinity above average and AA of the lead SNP near IGHV1-69 versus HLA-DQ binding affinity below average and TT of the lead SNP near IGHV1-69 (comparisons of other groups were presented in Supplementary Table 9). For days after the first vaccination, the analysis was conducted in individuals who received only one dose of the COVID-19 vaccine. For days after the second vaccination and interval days between the first and second vaccination, the analysis focused on those who received a second booster dose. We also excluded participants from cohorts if they received the first or the second vaccine on the day of the IgG antibody test.