June 1, 2021  |  

FALCON-Phase integrates PacBio and HiC data for de novo assembly, scaffolding and phasing of a diploid Puerto Rican genome (HG00733)

Haplotype-resolved genomes are important for understanding how combinations of variants impact phenotypes. Studies of disease, quantitative traits, forensics, and organ donor matching are aided by phased genomes. Phase is commonly resolved using familial data, population-based imputation, or by isolating and sequencing single haplotypes using fosmids, BACs, or haploid tissues. Because these methods can be prohibitively expensive, or samples may not be available, alternative approaches are required. De novo genome assembly with PacBio Single Molecule, Real-Time (SMRT) data produces highly contiguous, accurate assemblies. For non-inbred samples, including humans, the separate resolution of haplotypes results in higher base accuracy and more contiguous assembled sequences. Two primary methods exist for phased diploid genome assembly. The first, TrioCanu, requires Illumina data from parents and PacBio data from the offspring. The long reads from the child are partitioned into maternal and paternal bins using parent-specific sequences; the separate PacBio read bins are then assembled, generating two fully phased genomes. An alternative approach, FALCON-Unzip, does not require parental information and separates PacBio reads during genome assembly using heterozygous SNPs. The length of haplotype phase blocks in FALCON-Unzip is limited by the magnitude and distribution of heterozygosity, the length of sequence reads, and read coverage. Because of this, FALCON-Unzip contigs typically contain haplotype-switch errors between phase blocks, resulting in primary contigs of mixed parental origin. We developed FALCON-Phase, which integrates Hi-C data downstream of FALCON-Unzip to resolve phase switches along contigs. We applied the method to human (Puerto Rican, HG00733) and non-human genome assemblies and evaluated accuracy using samples with trio data. In a cattle genome, we observe >96% accuracy in phasing when compared to TrioCanu assemblies as well as parental SNPs.
For a high-quality PacBio assembly (>90-fold Sequel coverage) of a Puerto Rican individual, we scaffolded the FALCON-Phase contigs and re-phased them, creating a de novo scaffolded, phased diploid assembly with chromosome-scale contiguity.
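
The trio-binning step mentioned in this abstract (partitioning the child's long reads into maternal and paternal bins using parent-specific sequences) can be illustrated with a toy sketch. This is not the TrioCanu implementation: the function names, the use of exact k-mer sets, the choice of k, and the simple majority rule are all assumptions made for illustration, and real data would also need reverse-complement handling and error tolerance.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(maternal_seqs, paternal_seqs, k):
    """Return (maternal-only, paternal-only) k-mer sets.

    Hypothetical helper: k-mers seen in one parent's sequences
    but never in the other's are treated as parent-specific.
    """
    mat = set().union(*(kmers(s, k) for s in maternal_seqs))
    pat = set().union(*(kmers(s, k) for s in paternal_seqs))
    return mat - pat, pat - mat


def bin_read(read, mat_only, pat_only, k):
    """Assign a read to a bin by counting parent-specific k-mers it contains.

    Ties (including reads with no informative k-mers) stay unassigned.
    """
    read_kmers = kmers(read, k)
    m = len(read_kmers & mat_only)
    p = len(read_kmers & pat_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unassigned"


if __name__ == "__main__":
    # Tiny made-up sequences, for demonstration only.
    mat_only, pat_only = parent_specific_kmers(["ACGTACGTAA"], ["ACGTACGTCC"], k=4)
    for read in ["TACGTAA", "ACGTCC", "ACGTAC"]:
        print(read, bin_read(read, mat_only, pat_only, k=4))
```

Reads landing in the "unassigned" bin carry no parent-specific signal in this sketch; how such reads are handled in practice is outside what the abstract states.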

April 21, 2020  |  

Extended haplotype phasing of de novo genome assemblies with FALCON-Phase

Haplotype-resolved genome assemblies are important for understanding how combinations of variants impact phenotypes. These assemblies can be created in various ways, such as by using tissues that contain single-haplotype (haploid) genomes or by co-sequencing parental genomes, but these approaches can be impractical in many situations. We present FALCON-Phase, which integrates long-read sequencing data and ultra-long-range Hi-C chromatin interaction data of a diploid individual to create high-quality, phased diploid genome assemblies. The method was evaluated by application to three datasets, including human, cattle, and zebra finch, for which high-quality, fully haplotype-resolved assemblies were available for benchmarking. Phasing algorithm accuracy was affected by heterozygosity of the individual sequenced, with higher accuracy for cattle and zebra finch (>97%) compared to human (82%). In addition, scaffolding with the same Hi-C chromatin contact data resulted in phased chromosome-scale scaffolds.

April 21, 2020  |  

Critical length in long-read resequencing

Long-read sequencing has substantial advantages for structural variant discovery and phasing of variants compared to short-read technologies, but the required and optimal read length has not been assessed. In this work, we used long reads simulated from human genomes and evaluated structural variant discovery and variant phasing using current best practice bioinformatics methods. We determined that optimal discovery of structural variants from human genomes can be obtained with reads of minimally 20 kb. Haplotyping variants across genes only reaches its optimum from reads of 100 kb. These findings are important for the design of future long-read sequencing projects.

April 21, 2020  |  

Chromosome-level assembly of the water buffalo genome surpasses human and goat genomes in sequence contiguity.

Rapid innovation in sequencing technologies and improvement in assembly algorithms have enabled the creation of highly contiguous mammalian genomes. Here we report a chromosome-level assembly of the water buffalo (Bubalus bubalis) genome using single-molecule sequencing and chromatin conformation capture data. PacBio Sequel reads, with a mean length of 11.5 kb, helped to resolve repetitive elements and generate sequence contiguity. All five B. bubalis sub-metacentric chromosomes were correctly scaffolded with centromeres spanned. Although the index animal was partly inbred, 58% of the genome was haplotype-phased by FALCON-Unzip. This new reference genome improves the contig N50 of the previous short-read based buffalo assembly more than a thousand-fold and contains only 383 gaps. It surpasses the human and goat references in sequence contiguity and facilitates the annotation of hard-to-assemble gene clusters such as the major histocompatibility complex (MHC).
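
For readers unfamiliar with the contig N50 statistic cited in this abstract, a minimal sketch follows. It uses the common definition (the largest length L such that contigs of length at least L together cover at least half of the total assembly length); the function name and the example lengths are illustrative assumptions, not values from the buffalo assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    summed length of all contigs.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 requires at least one contig length")


if __name__ == "__main__":
    # Made-up contig lengths, for demonstration only.
    print(n50([2, 3, 4, 5, 6]))
```

Because the statistic depends only on the sorted lengths, a thousand-fold N50 improvement as reported above reflects far fewer, far longer contigs rather than a change in total assembly size.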
