Read-based phasing deduces the haplotypes of an individual from sequencing reads that cover multiple variants, while genetic phasing takes only genotypes as input and applies the rules of Mendelian inheritance to infer haplotypes within a pedigree of individuals. Combining both into an approach that uses these two independent sources of information-reads and pedigree-has the potential to deliver results better than each individually.We provide a theoretical framework combining read-based phasing with genetic haplotyping, and describe a fixed-parameter algorithm and its implementation for finding an optimal solution. We show that leveraging reads of related individuals jointly in this way yields more phased variants and at a higher accuracy than when phased separately, both in simulated and real data. Coverages as low as 2× for each member of a trio yield haplotypes that are as accurate as when analyzed separately at 15× coverage per individual.https://email@example.com.© The Author 2016. Published by Oxford University Press.
The Genome in a Bottle Consortium, hosted by the National Institute of Standards and Technology (NIST) is creating reference materials and data for human genome sequencing, as well as methods for genome comparison and benchmarking. Here, we describe a large, diverse set of sequencing data for seven human genomes; five are current or candidate NIST Reference Materials. The pilot genome, NA12878, has been released as NIST RM 8398. We also describe data from two Personal Genome Project trios, one of Ashkenazim Jewish ancestry and one of Chinese ancestry. The data come from 12 technologies: BioNano Genomics, Complete Genomics paired-end and LFR, Ion Proton exome, Oxford Nanopore, Pacific Biosciences, SOLiD, 10X Genomics GemCode WGS, and Illumina exome and WGS paired-end, mate-pair, and synthetic long reads. Cell lines, DNA, and data from these individuals are publicly available. Therefore, we expect these data to be useful for revealing novel information about the human genome and improving sequencing technologies, SNP, indel, and structural variant calling, and de novo assembly.
Short tandem repeat number estimation from paired-end reads for multiple individuals by considering coalescent tree.
Two types of approaches are mainly considered for the repeat number estimation in short tandem repeat (STR) regions from high-throughput sequencing data: approaches directly counting repeat patterns included in sequence reads spanning the region and approaches based on detecting the difference between the insert size inferred from aligned paired-end reads and the actual insert size. Although the accuracy of repeat numbers estimated with the former approaches is high, the size of target STR regions is limited to the length of sequence reads. On the other hand, the latter approaches can handle STR regions longer than the length of sequence reads. However, repeat numbers estimated with the latter approaches is less accurate than those with the former approaches.We proposed a new statistical model named coalescentSTR that estimates repeat numbers from paired-end read distances for multiple individuals simultaneously by connecting the read generative model for each individual with their genealogy. In the model, the genealogy is represented by handling coalescent trees as hidden variables, and the summation of the hidden variables is taken on coalescent trees sampled based on phased genotypes located around a target STR region with Markov chain Monte Carlo. In the sampled coalescent trees, repeat number information from insert size data is propagated, and more accurate estimation of repeat numbers is expected for STR regions longer than the length of sequence reads. For finding the repeat numbers maximizing the likelihood of the model on the estimation of repeat numbers, we proposed a state-of-the-art belief propagation algorithm on sampled coalescent trees.We verified the effectiveness of the proposed approach from the comparison with existing methods by using simulation datasets and real whole genome and whole exome data for HapMap individuals analyzed in the 1000 Genomes Project.
We report on the sequencing of 10,545 human genomes at 30×-40× coverage with an emphasis on quality metrics and novel variant and sequence discovery. We find that 84% of an individual human genome can be sequenced confidently. This high-confidence region includes 91.5% of exon sequence and 95.2% of known pathogenic variant positions. We present the distribution of over 150 million single-nucleotide variants in the coding and noncoding genome. Each newly sequenced genome contributes an average of 8,579 novel variants. In addition, each genome carries on average 0.7 Mb of sequence that is not found in the main build of the hg38 reference genome. The density of this catalog of variation allowed us to construct high-resolution profiles that define genomic sites that are highly intolerant of genetic variation. These results indicate that the data generated by deep genome sequencing is of the quality necessary for clinical use.
Read-based phasing allows to reconstruct the haplotype structure of a sample purely from sequencing reads. While phasing is a required step for answering questions about population genetics, compound heterozygosity, and to aid in clinical decision making, there has been a lack of an accurate, usable and standards-based software. WhatsHap is a production-ready tool for highly accurate read-based phasing. It was designed from the beginning to leverage third-generation sequencing technologies, whose long reads can span many variants and are therefore ideal for phasing. WhatsHap works also well with second-generation data, is easy to use and will phase not only SNVs, but also indels and other variants. It is unique in its ability to combine read-based with genetic phasing, allowing to further improve accuracy if multiple related samples are provided.