Effect of coverage depth and haplotype phasing on structural variant detection with PacBio long reads
Each human genome has thousands of structural variants relative to the reference assembly, up to 85% of which are difficult or impossible to detect with Illumina short reads and are only visible with long, multi-kilobase reads. The PacBio RS II and Sequel single molecule, real-time (SMRT) sequencing platforms have made it practical to generate long reads at high throughput. These platforms enable the discovery of structural variants just as short-read platforms did for single nucleotide variants. Numerous software algorithms call structural variants effectively from PacBio long reads, but sensitivity is lower for insertions and for all heterozygous variants. Furthermore, the impact of coverage depth and read length on sensitivity has not been fully characterized. To quantify how zygosity, coverage depth, and read length affect the sensitivity of structural variant detection, we obtained high-coverage PacBio sequence data for three human samples: haploid CHM1, diploid NA12878, and diploid SK-BR-3. For each dataset, reads were randomly subsampled to titrate coverage from 0.5- to 50-fold. The structural variants detected at each coverage level were compared to the set detected at “full” 50-fold coverage. For the diploid samples, additional titrations were performed with reads first partitioned by phase using single nucleotide variants, which makes structural variant discovery essentially haploid. Even at low coverage (1- to 5-fold), PacBio long reads reveal hundreds of structural variants that are not seen in deep 50-fold Illumina whole-genome sequence data. At moderate 10-fold PacBio coverage, a majority of structural variants are detected. Sensitivity begins to level off at around 40-fold coverage, though it does not fully saturate before 50-fold. Phasing improves sensitivity for all variant types, especially at moderate 10- to 20-fold coverage. Long reads are an effective tool to identify and phase structural variants in the human genome.
The majority of variants are detected at moderate 10-fold coverage, and even extremely low long-read coverage (1- to 5-fold) reveals variants that are invisible to short-read sequencing. Performance will continue to improve with better software and longer reads, which will empower studies that connect structural variants to health and disease traits in the human population.
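
The coverage titration and the comparison against the full-coverage call set can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline used in this work: the function names, the independent per-read sampling, and the exact-match comparison on (chromosome, position, type) keys are all hypothetical choices made for the example. Practical comparisons usually allow some positional tolerance when matching structural variant calls.

```python
import random


def subsample_fraction(target_cov, full_cov):
    """Fraction of reads to keep so expected coverage falls from full_cov to target_cov."""
    if not 0 < target_cov <= full_cov:
        raise ValueError("target coverage must be in (0, full coverage]")
    return target_cov / full_cov


def subsample_reads(reads, fraction, seed=0):
    """Keep each read independently with probability `fraction` (assumed sampling scheme)."""
    rng = random.Random(seed)  # fixed seed makes the titration reproducible
    return [read for read in reads if rng.random() < fraction]


def sensitivity(calls, full_calls):
    """Share of the full-coverage variant set that is recovered in `calls`.

    Variants are compared by exact key equality, e.g. (chrom, pos, sv_type).
    This is a simplification; it is an assumption of this sketch.
    """
    full_set = set(full_calls)
    if not full_set:
        raise ValueError("full-coverage call set is empty")
    return len(set(calls) & full_set) / len(full_set)


if __name__ == "__main__":
    # Hypothetical toy data: read identifiers and variant keys.
    reads = [f"read_{i}" for i in range(10_000)]
    kept = subsample_reads(reads, subsample_fraction(10, 50))
    print(f"kept {len(kept)} of {len(reads)} reads for a 10-fold titration point")

    full_calls = [("chr1", 1000, "DEL"), ("chr1", 5000, "INS"),
                  ("chr2", 2500, "DEL"), ("chr3", 800, "INS")]
    low_cov_calls = [("chr1", 1000, "DEL"), ("chr2", 2500, "DEL")]
    print(f"sensitivity: {sensitivity(low_cov_calls, full_calls):.2f}")
```

With a 50-fold dataset, a 10-fold titration point keeps each read with probability 0.2, and the sensitivity at that point is the fraction of the 50-fold call set that is still recovered.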