“Hidden batch effects and study design bias impact identification of genetic risk factors in large genetic sequencing studies”
Abstract: After decades of success identifying common disease variants through Genome-Wide Association Studies (GWAS) using SNP chips, genetic studies have shifted to sequencing-based discovery of rare variants. These studies require large sample sizes for statistical power, but large samples often inadvertently introduce batch effects and other confounding factors and biases. We investigated batch effects and confounding factors, and their impact on association analysis, in the Alzheimer’s Disease Sequencing Project (ADSP) exome dataset of more than 10,000 cases and controls, which were processed and sequenced at three sequencing centers (Broad Institute, Washington University and Baylor College of Medicine) using two exome capture kits (Illumina and NimbleGen). In addition, the controls in ADSP were intentionally older than the cases, to favor the detection of disease-causal variants that are absent from older but cognitively normal individuals. Age is therefore confounded with Alzheimer’s disease (AD) status in this dataset.
As expected, Principal Components Analysis (PCA) revealed population substructure, but it showed no obvious sample batches attributable to sequencing center or capture kit. However, after association analyses that included both ancestry and sequencing center as covariates, PCA of our top AD-associated variants revealed significant batch differences related to sequencing center. Almost all of these top variants were significant in samples processed using the Illumina kit (at Broad) but not in those processed using the NimbleGen kit (at Washington University and Baylor), despite adjustment for sequencing center in the association models. Further investigation revealed clear batch differences in genotype quality (GQ) scores and minor allele concentrations (the percent of reads supporting alternative alleles in a sample) between variants sequenced at Broad and those sequenced at the other two centers. These variant-quality batch differences should be considered during association analyses.
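To make the two variant-quality metrics concrete, the sketch below computes the fraction of reads supporting the alternative allele and averages it, together with GQ, within each sequencing center. It is only an illustration, not the analysis used in the study: the record layout, the function names and all numbers are hypothetical and are not ADSP data.

```python
from statistics import mean

# Hypothetical per-genotype records: (center, ref reads, alt reads, GQ).
# All values are invented for illustration; they are not ADSP data.
records = [
    ("Broad", 18, 14, 99),
    ("Broad", 20, 12, 95),
    ("WashU", 15, 15, 80),
    ("Baylor", 16, 16, 70),
]


def alt_read_fraction(ref_reads, alt_reads):
    """Return the fraction of reads supporting the alternative allele."""
    total = ref_reads + alt_reads
    if total == 0:
        return None  # no coverage, so the fraction is undefined
    return alt_reads / total


def summarize_by_batch(rows):
    """Return mean alt-read fraction, mean GQ and count per center."""
    grouped = {}
    for center, ref_reads, alt_reads, gq in rows:
        frac = alt_read_fraction(ref_reads, alt_reads)
        if frac is None:
            continue  # skip genotypes without any reads
        grouped.setdefault(center, []).append((frac, gq))
    return {
        center: {
            "mean_alt_fraction": mean(f for f, _ in vals),
            "mean_gq": mean(g for _, g in vals),
            "n": len(vals),
        }
        for center, vals in grouped.items()
    }


summary = summarize_by_batch(records)
for center, stats in sorted(summary.items()):
    print(center, stats)
```

A systematic gap in either summary between centers or capture kits would be one signal of the kind of variant-level batch difference described above. A real analysis would read these fields from the genotype calls and test the difference formally.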
Division of Biostatistics seminars
For inquiries contact Chengjie Xiong.