Many applications of high-throughput sequencing rely on an accurate reference genome, and errors in the reference assembly increase the number of false positives in downstream analyses. We recently showed that over 33% of the current pig reference genome, Sscrofa10.2, is either misassembled or otherwise unreliable for genomic analyses. In addition, ~10% of the bases in the assembly are Ns in gaps of arbitrary size, thousands of highly fragmented contigs remain unplaced, and many genes are known to be missing from the assembly. Here we present a new assembly of the pig genome, Sscrofa11, built from 65x PacBio sequencing of T.J. Tabasco, the same Duroc sow used in the assembly of Sscrofa10.2. The PacBio reads were assembled with the Falcon assembly pipeline, yielding 3,206 contigs with an initial contig N50 of 14.5 Mb. We scaffolded the PacBio contigs using Sscrofa10.2 as a template, under the assumption that its gross structure is correct, and filled gaps with PBJelly. Additional gaps were filled using large, sequenced BACs from the original assembly. Following gap filling, the assembly is substantially more contiguous and contains more sequence than Sscrofa10.2. The assembly was then polished with Arrow and Pilon. The contig N50 is now 58.5 Mb, with 103 gaps remaining. By comparing regions of the two assemblies, we show that regions with structural abnormalities that we identified in Sscrofa10.2 are resolved in the new PacBio assembly.
Organization: Roslin Institute
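
The abstract quotes two contiguity measures, contig N50 and the number and share of N-filled gaps. The following is a minimal sketch of how such figures are conventionally computed, not the pipeline used for Sscrofa11. The function names `n50` and `gap_stats` and the toy inputs are hypothetical and chosen for illustration only.

```python
import re


def n50(lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("at least one contig length is required")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def gap_stats(sequence):
    """Return (number of gaps, fraction of bases that are N) for a sequence.

    A gap is counted here as any maximal run of N or n characters.
    """
    if not sequence:
        return 0, 0.0
    runs = re.findall(r"[Nn]+", sequence)
    n_bases = sum(len(run) for run in runs)
    return len(runs), n_bases / len(sequence)


if __name__ == "__main__":
    # Toy data, not taken from either pig assembly.
    print(n50([2, 3, 4, 5, 6]))
    print(gap_stats("ACGTNNNNACGTNNAC"))
```

On a real assembly, the lengths would come from the contig records of a FASTA file and the gap statistics from each scaffold sequence. Because N50 depends only on the length distribution, it says nothing about correctness, which is why the abstract pairs it with a comparison of structurally abnormal regions.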