Abstract
Most human traits, ranging from physical appearance to behavior and disease susceptibility, are in part inherited through genetic material. Whole-genome sequencing has enabled the complete characterization of human genetic variation. While most common DNA sequence variation has been observed in genetic studies of worldwide populations, rare genetic variation is more geographically clustered and requires many more individuals from diverse populations to be studied. In this thesis I describe the genetic variation in 250 Dutch parent-offspring families from the Genome of the Netherlands (GoNL) Project obtained through whole-genome sequencing. A total of 20.4 million single nucleotide variants (SNVs), 1.2 million short insertions and deletions (indels) and 27.5 thousand structural variants (SVs) were discovered in these families. While most of the SNVs were known, the majority of the indels and almost all SVs are novel, partly due to the ability to identify mid-size deletions of 30-500 bp for the first time on a population scale. Taking advantage of the trio design, the SNVs and indels were phased into a highly accurate haplotype panel, which improves imputation accuracy, especially for alleles of lower frequency. In addition to describing the inherited DNA sequence variation in the Dutch population, I was also able to characterize de novo mutations at an unprecedented scale. Indeed, 11,020 de novo SNVs, 291 de novo indels and 41 de novo SVs were identified, from which a mutation rate of 1.15 × 10⁻⁸ SNVs/bp, 0.68 × 10⁻⁹ indels/bp and 0.16 SVs per generation can be estimated. Despite their much lower rate, de novo SVs affect 91 times more bases on average, including 52 times more protein-coding bases, than de novo SNVs. This is in contrast with the relatively similar footprint of inherited SNVs and SVs, likely indicating much stronger selection pressure on SVs than on SNVs. Looking at the distribution of mutations across offspring, I confirmed the previously reported increase of de novo SNVs with paternal age and furthermore showed that paternal age also influences their chromosomal location. In particular, de novo SNVs in offspring of older fathers occur more frequently in early-replicating, genic regions than those in offspring of younger fathers.
Although there was no significant association between father’s age and the rates of de novo indels and SVs (possibly due to limited detection power), both of these variant types are also enriched in the paternal germline. De novo SNVs clustered significantly within individuals at distances of up to 20 kbp. Mutations in these clusters represent 1.5% of the SNVs and exhibit a unique mutational spectrum, which may point to a novel mutational mechanism. Local SNV mutation rates vary across the genome and are elevated in functional regions owing to the influence of CpG dinucleotide context. Transcribed regions exhibit a pronounced strand asymmetry compatible with the action of transcription-coupled repair. On larger scales, I observed that SNV mutation and recombination rates independently associate with nucleotide diversity, and that variation in substitution rates based on species divergence is only partly explained by mutation rate heterogeneity. Overall, this thesis presents key results from large-scale sequencing data collected within a single population. The parent-offspring design has proved beneficial, allowing in-depth characterization of common, rare and de novo SNVs, indels and SVs. Analyses of these data have revealed novel insights into inherited and de novo variation in human genomes.