Nature Methods Paper Uses Long-Read Data for Highly Contiguous Diploid Human Genome
Monday, June 29, 2015
A new publication in Nature Methods describes a single-molecule assembly approach that resulted in “the most contiguous clone-free human genome assembly to date,” according to lead authors Matthew Pendleton, Robert Sebra, Andy Pang, and Ajay Ummat.
The paper, “Assembly and Diploid Architecture of an Individual Human Genome via Single Molecule Technologies,” comes from a large team of collaborators at the Icahn School of Medicine at Mount Sinai, Cornell, Cold Spring Harbor Laboratory, and other institutions.
The approach draws on the strengths of each single-molecule data type, using long-read sequencing for de novo assembly and single-molecule genome maps for scaffolding. The resulting hybrid assembly combines SMRT® Sequencing data with single-molecule genome maps from BioNano Genomics’ NanoChannel Arrays.
The paper describes sequencing the well-studied NA12878 genome using SMRT Sequencing and generating single-molecule genome maps with nicking enzymes. “Individually, the assemblies and genome maps markedly improve contiguity and completeness compared with de novo assemblies from clone-free, short-read shotgun sequencing data,” the authors write. “Moreover, by combining the two platforms, we achieve scaffold N50 values greater than 28 Mb, improving the contiguity of the initial sequence assembly nearly 30-fold and of the initial genome map nearly 8-fold.”
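For readers unfamiliar with the metric, N50 is the length L such that sequences of length L or longer account for at least half of the total assembly length. The short Python sketch below illustrates that standard definition. The function name and the toy lengths are illustrative assumptions; they are not values or code from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths.

    Standard definition: the length L such that pieces of length >= L
    together cover at least half of the total assembled length.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy, made-up lengths (arbitrary units), NOT data from the paper.
before_scaffolding = [40, 30, 20, 10, 5, 5]
# Hypothetical result of joining some pieces into longer scaffolds;
# the total length is unchanged, only the contiguity differs.
after_scaffolding = [70, 30, 10]

print(n50(before_scaffolding))
print(n50(after_scaffolding))
```

Because the total length stays the same while the pieces get longer, joining sequences into scaffolds raises the N50, which is why the metric is commonly used as a shorthand for contiguity.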
The scientists then compared their assembly to the human reference genome to identify a comprehensive set of genetic variants, including larger structural variants that are often overlooked by short-read sequencing-by-synthesis (SBS) approaches. They note that while short-read technologies are frequently used to survey genomes for single nucleotide variants, they cannot resolve most large-scale genetic variation, such as structural variants and the repetitive regions that confound short-read assemblies.
“Though the cost of sequencing has markedly decreased, de novo human genome analysis has, to some extent, regressed,” the authors report. “Although HuRef and the original Celera whole-genome shotgun assembly have scaffold N50 values … of 19.5 Mb and 29 Mb respectively, the best next-generation sequencing (NGS) assemblies have scaffold N50 values of 11.5 Mb, even with the use of high-coverage fosmid jumping libraries.” The biggest challenges in these short-read assemblies, they add, are repetitive structures, transposable elements, segmental duplications, and heterochromatin.
The high contiguity of the single-molecule assembly, to which short-read NGS data was later added, made it possible to detect large structural variants and to phase both single nucleotide and structural variants. Comparing the assembly to reference genomes allowed the team to resolve and phase structural variants such as tandem repeats across the genome. They also separated maternal and paternal alleles, revealing complex events that had been missed in previous assemblies.
For structural elements, the authors report that “a major benefit of continuous long reads is the ability to directly observe structural variants,” an approach they say is more effective than relying on breakpoint analysis or local realignment.
The combination of SMRT Sequencing data, genome maps, and NGS data “allowed us to resolve long-standing assembly discrepancies,” the scientists write.