Long-read sequencing of complex genomes (#14)
The human genome is arguably the best-assembled reference genome, yet many gaps remain and aspects of its structural variation are still poorly understood even ten years after its completion. The discovery and resolution of this variation is critical to understanding disease. I will present our most recent work sequencing human and nonhuman primate genomes using single-molecule sequencing (SMS) technology. We have developed methods to detect indels and structural variants ranging from several bases to 50 kbp in length. We have closed or extended ~50% of the remaining interstitial gaps in the human genome and find that 80% of these carry long polypyrimidine/purine tracts multiple kilobases in length. Comparing a single sequenced haplotype to the human reference, we resolve >35,000 structural variants and >500,000 indels at the base-pair level with 99.9% sequence accuracy. More than 50% of insertions and deletions <2 kbp in length are novel, representing large swaths (>10 Mbp) of undiscovered genetic variation within human genomes. We find that such sequences vary extensively in copy number and affect functional elements in the genome. In addition, the analysis uncovers other categories of complex variation that have been difficult to assess, including mobile element insertions and inversions that map within more complex and GC-rich regions of the genome. Our results suggest a systematic bias against longer and more complex repetitive DNA, a bias that can now be partially resolved with new sequencing technologies. I will discuss the potential of this technology to create accurate de novo assemblies of additional human and nonhuman primate genomes that more comprehensively capture the full spectrum of human genetic diversity, and the importance of such assemblies to our understanding of genetic variation and disease.