PacBio 2013 User Group Meeting Presentation Slides: Lisbeth Guethlein from Stanford University School of Medicine looked at highly repetitive and variable immune regions of the orangutan genome. Guethlein reported that “PacBio managed to accomplish in a week what I have been working on for a couple years” (with Sanger sequencing), and the results were concordant. “Long story short, I was a happy customer.”
Long Amplicon Analysis: Highly accurate, full-length, phased, allele-resolved gene sequences from multiplexed SMRT Sequencing data.
The correct phasing of genetic variations is a key challenge for many applications of DNA sequencing. Allele-level resolution is strongly preferred for histocompatibility sequencing where recombined genes can exhibit different compatibilities than their parents. In other contexts, gene complementation can provide protection if deleterious mutations are found on only one allele of a gene. These problems are especially pronounced in immunological domains given the high levels of genetic diversity and recombination seen in regions like the Major Histocompatibility Complex. A new tool for analyzing Single Molecule, Real-Time (SMRT) Sequencing data – Long Amplicon Analysis (LAA) – can generate highly accurate, phased and full-length consensus sequences for multiple genes in a single sequencing run.
Complete resequencing of extended genomic regions using fosmid target capture and single molecule real-time (SMRT) long read sequencing technology.
A longstanding goal of genomic analysis is the identification of causal genetic factors contributing to disease. While the common disease/common variant hypothesis has been tested in many genome-wide association studies, few advancements in identifying causal variation have been realized, and instead recent findings point away from common variants towards aggregate rare variants as causal. A challenge is obtaining complete phased genomic sequences over extended genomic regions from sufficient numbers of cases and controls to identify all variation that is potentially causal for a disease. To address this, we modified methods for targeted DNA isolation using fosmid technology and single-molecule, long-sequence-read generation that combine for complete, haplotype-resolved resequencing across extended genomic subregions. As proof of principle, we validated the approach by resequencing four 800 kbp segments that span a major histocompatibility complex (MHC) common extended haplotype (CEH) associated with disease. The data revealed the extent of conservation, exposing a near identity among four DR4 CEHs over conserved regions, detailing rare variation and measuring sequence accuracy. In a second test, we sequenced the complete KIR haplotypes from 8 individuals within a specific timeframe and cost. Single molecule long-read sequencing technology generated contiguous full-length fosmid sequences of 30 to 40 kb in a single read, allowing assembly of resolved haplotypes with very little data processing. All of the sequences produced from these projects were contiguous and phased, with accuracy above 99.99%. The results demonstrated that cost-effective scale-up is possible to generate scores to hundreds of phased chromosomal sequences of extended lengths that can encompass genomic regions associated with disease.
The complex immune regions of the genome, including MHC and KIR, contain large copy number variants (CNVs), a high density of genes, hyper-polymorphic gene alleles, and conserved extended haplotypes (CEHs) with enormous linkage disequilibrium (LD). This level of complexity and the inherent biases of short-read sequencing make it challenging to extract immune region haplotype information from reference-reliant, shotgun sequencing and GWAS methods. As NGS-based genome and exome sequencing and SNP arrays have become routine for population studies, numerous efforts are being made to develop software to extract and/or impute the immune gene information from these datasets. Despite these efforts, the fine mapping of causal variants of immune genes for their well-documented association with cancer, drug-induced hypersensitivity and immune-related diseases has been slower than expected. This has in many ways limited our understanding of the mechanisms leading to immune disease. In the present work, we demonstrate the advantages of long reads delivered by SMRT Sequencing for assembling complete haplotypes of MHC and KIR gene clusters, as well as calling correct genotypes of the genes within them. All the genotype information is detected at allele level with full phasing information across SNP-poor regions. Genotypes were called correctly from targeted gene amplicons, haplotypes, as well as from a completely assembled 5 Mb contig of the MHC region from a de novo assembly of whole genome shotgun data. The de novo analysis pipeline used in all these approaches allowed for reference-free analysis without imputation, a key for interrogation without prior knowledge about ethnic backgrounds. These methods are thus easily adoptable for previously uncharacterized human or non-human species.
The MHC Diversity in Africa Project (MDAP) pilot – 125 African high resolution HLA types from 5 populations
The major histocompatibility complex (MHC), or human leukocyte antigen (HLA) in humans, is a highly diverse gene family with a key role in immune response to disease, and has been implicated in auto-immune disease, cancer, infectious disease susceptibility, and vaccine response. It has clinical importance in the field of solid organ and bone marrow transplantation, where matching of donor and recipient HLA types is key to transplant outcomes. The Sanger based typing (SBT) methods currently used in clinical practice do not capture the full diversity across this region, and require specific reference sequences to deconvolute ambiguity in HLA types. However, reference databases are based largely on European populations, and the full extent of diversity in Africa remains poorly understood. Here, we present the first systematic characterisation of HLA diversity within Africa in the pilot phase of the MHC Diversity in Africa Project, together with an evaluation of methods to carry out scalable, cost-effective, and reliable typing of this region in African populations. To sample a geographically representative panel of African populations we obtained 125 samples, 25 each from the Zulu (South Africa), Igbo (Nigeria), Kalenjin (Kenya), Moroccan and Ashanti (Ghana) groups. For methods validation we included two controls from the International Histocompatibility Working Group (IHWG) collection with known typing information. Sanger typing and Illumina HiSeq X sequencing of these samples indicated potentially novel Class I and Class II alleles; however, we found poor correlation between HiSeq X sequencing and SBT for both classes. Long Range PCR and high resolution PacBio RS-II typing of 4 of these samples identified 7 novel Class II alleles, highlighting the high levels of diversity in these populations, and the need for long read sequencing approaches to characterise this comprehensively.
We have now expanded this approach to the entire pilot set of 125 samples. We present these confirmed types and discuss a workflow for scaling this to 5000 individuals across Africa. The large number of new alleles identified in our pilot points to the high level of African HLA diversity and the utility of high resolution methods. The MDAP project will provide a framework for accurate HLA typing, in addition to providing an invaluable resource for imputation in GWAS, boosting power to identify and resolve HLA disease associations.
Dan Geraghty explains that while there have been decades’ worth of studies associating the genetics of the major histocompatibility complex (MHC), and the highly polymorphic HLA class 1 and 2…
ASHG Virtual Poster: The MHC Diversity in Africa Project (MDAP) pilot – 125 African high resolution HLA types from 5 populations
In this ASHG 2016 poster video, Martin Pollard from the Wellcome Trust Sanger Institute and the University of Cambridge describes an ambitious project to better represent natural variation in the…
Pacific Biosciences’ SMRT sequencing method was used to extend the sequence of HLA-A*02:13. © 2019 John Wiley & Sons A/S. Published by John Wiley & Sons Ltd.
Background: Assemblies of diploid genomes are generally unphased, pseudo-haploid representations that do not correctly reconstruct the two parental haplotypes present in the individual sequenced. Instead, the assembly alternates between parental haplotypes and may contain duplications in regions where the parental haplotypes are sufficiently different. Trio binning is an approach to genome assembly that uses short reads from both parents to classify long reads from the offspring according to maternal or paternal haplotype origin, and is thus helped rather than impeded by heterozygosity. Using this approach, it is possible to derive two assemblies from an individual, accurately representing both parental contributions in their entirety with higher continuity and accuracy than is possible with other methods. Results: We used trio binning to assemble reference genomes for two species from a single individual using an interspecies cross of yak (Bos grunniens) and cattle (Bos taurus). The high heterozygosity inherent to interspecies hybrids allowed us to confidently assign >99% of long reads from the F1 offspring to parental bins using unique k-mers from parental short reads. Both the maternal (yak) and paternal (cattle) assemblies contain over one third of the acrocentric chromosomes, including the two largest chromosomes, in single haplotigs. Conclusions: These haplotigs are the first vertebrate chromosome arms to be assembled gap-free and fully phased, and the first time assemblies for two species have been created from a single individual. Both assemblies are the most continuous currently available for non-model vertebrates. Abbreviations: Mb, megabases; kb, kilobases; MYA, millions of years ago; MHC, major histocompatibility complex; SMRT, single molecule real time.
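The read-classification step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`kmers`, `parent_specific_kmers`, `bin_read`), the toy sequences, and the simple majority rule are assumptions made for illustration, and details a real implementation would likely need (reverse-complement handling, filtering of error k-mers, normalising for k-mer set size) are omitted.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(maternal_reads, paternal_reads, k):
    """Return (maternal-only, paternal-only) k-mer sets from parental short reads."""
    mat = set().union(*(kmers(r, k) for r in maternal_reads))
    pat = set().union(*(kmers(r, k) for r in paternal_reads))
    return mat - pat, pat - mat


def bin_read(read, mat_only, pat_only, k):
    """Assign an offspring long read to the parent whose unique k-mers it shares most.

    Hypothetical majority rule: ties and reads with no informative k-mers
    are left unassigned.
    """
    read_kmers = kmers(read, k)
    m = len(read_kmers & mat_only)
    p = len(read_kmers & pat_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unassigned"


# Toy data (invented): the two parents differ at a single base.
mat_only, pat_only = parent_specific_kmers(["ACGTACGTAA"], ["ACGTTCGTAA"], k=4)
for read in ["GTACGT", "CGTTCG", "ACGT"]:
    print(read, bin_read(read, mat_only, pat_only, k=4))
```

The sketch also shows why heterozygosity helps: the more the parental genomes differ, the more parent-specific k-mers each long read carries, and the fewer reads end up unassigned.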
Over the past decade, RNA sequencing (RNA-seq) has become an indispensable tool for transcriptome-wide analysis of differential gene expression and differential splicing of mRNAs. However, as next-generation sequencing technologies have developed, so too has RNA-seq. Now, RNA-seq methods are available for studying many different aspects of RNA biology, including single-cell gene expression, translation (the translatome) and RNA structure (the structurome). Exciting new applications are being explored, such as spatial transcriptomics (spatialomics). Together with new long-read and direct RNA-seq technologies and better computational tools for data analysis, innovations in RNA-seq are contributing to a fuller understanding of RNA biology, from questions such as when and where transcription occurs to the folding and intermolecular interactions that govern RNA function.
African cichlid fishes are well known for their rapid radiations and are a model system for studying evolutionary processes. Here we compare multiple, high-quality, chromosome-scale genome assemblies to elucidate the genetic mechanisms underlying cichlid diversification and study how genome structure evolves in rapidly radiating lineages. We re-anchored our recent assembly of the Nile tilapia (Oreochromis niloticus) genome using a new high-density genetic map. We also developed a new de novo genome assembly of the Lake Malawi cichlid, Metriaclima zebra, using high-coverage Pacific Biosciences sequencing, and anchored contigs to linkage groups (LGs) using 4 different genetic maps. These new anchored assemblies allow the first chromosome-scale comparisons of African cichlid genomes. Large intra-chromosomal structural differences (~2-28 megabase pairs) among species are common, while inter-chromosomal differences are rare (<10 megabase pairs total). Placement of the centromeres within the chromosome-scale assemblies identifies large structural differences that explain many of the karyotype differences among species. Structural differences are also associated with unique patterns of recombination on sex chromosomes. Structural differences on LG9, LG11, and LG20 are associated with reduced recombination, indicative of inversions between the rock- and sand-dwelling clades of Lake Malawi cichlids. M. zebra has a larger number of recent transposable element insertions compared with O. niloticus, suggesting that several transposable element families have a higher rate of insertion in the haplochromine cichlid lineage. This study identifies novel structural variation among East African cichlid genomes and provides a new set of genomic resources to support research on the mechanisms driving cichlid adaptation and speciation. © The Author(s) 2019. Published by Oxford University Press.
High-throughput amplicon sequencing of the full-length 16S rRNA gene with single-nucleotide resolution.
Targeted PCR amplification and high-throughput sequencing (amplicon sequencing) of 16S rRNA gene fragments is widely used to profile microbial communities. New long-read sequencing technologies can sequence the entire 16S rRNA gene, but higher error rates have limited their attractiveness when accuracy is important. Here we present a high-throughput amplicon sequencing methodology based on PacBio circular consensus sequencing and the DADA2 sample inference method that measures the full-length 16S rRNA gene with single-nucleotide resolution and a near-zero error rate. In two artificial communities of known composition, our method recovered the full complement of full-length 16S sequence variants from expected community members without residual errors. The measured abundances of intra-genomic sequence variants were in the integral ratios expected from the genuine allelic variants within a genome. The full-length 16S gene sequences recovered by our approach allowed Escherichia coli strains to be correctly classified to the O157:H7 and K12 sub-species clades. In human fecal samples, our method showed strong technical replication and was able to recover the full complement of 16S rRNA alleles in several E. coli strains. There are likely many applications beyond microbial profiling for which high-throughput amplicon sequencing of complete genes with single-nucleotide resolution will be of use. © The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research.
Human metapneumovirus (HMPV) has been a notable etiological agent of acute respiratory infection in humans, but it was not discovered until 2001, because HMPV replicates only in a limited number of cell lines and the cytopathic effect (CPE) is often mild. To promote the study of HMPV, several groups have generated green fluorescent protein (GFP)-expressing recombinant HMPV strains (HMPVGFP). However, the growing evidence has complicated the understanding of cell line specificity of HMPV, because it seems to vary notably among HMPV strains. In addition, unique A2b clade HMPV strains with a 180-nucleotide duplication in the G gene (HMPV A2b180nt-dup strains) have recently been detected. In this study, we re-evaluated and compared the cell line specificity of clinical isolates of HMPV strains, including the novel HMPV A2b180nt-dup strains, and six recombinant HMPVGFP strains, including the newly generated recombinant HMPV A2b180nt-dup strain, MG0256-EGFP. Our data demonstrate that VeroE6 and LLC-MK2 cells generally showed the highest infectivity with any clinical isolates and recombinant HMPVGFP strains. Other human-derived cell lines (BEAS-2B, A549, HEK293, MNT-1, and HeLa cells) showed certain levels of infectivity with HMPV, but these were significantly lower than those of VeroE6 and LLC-MK2 cells. Also, the infectivity in these suboptimal cell lines varied greatly among HMPV strains. The variations were not directly related to HMPV genotypes, cell lines used for isolation and propagation, specific genome mutations, or nucleotide duplications in the G gene. Thus, these variations in suboptimal cell lines are likely intrinsic to particular HMPV strains.
Benchmark small variant calls are required for developing, optimizing and assessing the performance of sequencing and bioinformatics methods. Here, as part of the Genome in a Bottle (GIAB) Consortium, we apply a reproducible, cloud-based pipeline to integrate multiple short- and linked-read sequencing datasets and provide benchmark calls for human genomes. We generate benchmark calls for one previously analyzed GIAB sample, as well as six genomes from the Personal Genome Project. These new genomes have broad, open consent, making this a ‘first of its kind’ resource that is available to the community for multiple downstream applications. We produce 17% more benchmark single nucleotide variations, 176% more indels and 12% larger benchmark regions than previously published GIAB benchmarks. We demonstrate that this benchmark reliably identifies errors in existing callsets and highlight challenges in interpreting performance metrics when using benchmarks that are not perfect or comprehensive. Finally, we identify strengths and weaknesses of callsets by stratifying performance according to variant type and genome context.
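Stratifying performance by variant type, as mentioned in this abstract, can be sketched as follows. This is a minimal illustration using the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN); the function name `stratified_metrics`, the record format, and the toy records are assumptions and do not reflect the GIAB pipeline or its data.

```python
from collections import defaultdict


def stratified_metrics(records):
    """Compute precision and recall per stratum.

    records: iterable of (stratum, outcome) pairs, where outcome is
    'TP', 'FP', or 'FN' relative to a benchmark callset (hypothetical format).
    """
    tallies = defaultdict(lambda: {"TP": 0, "FP": 0, "FN": 0})
    for stratum, outcome in records:
        tallies[stratum][outcome] += 1

    metrics = {}
    for stratum, t in tallies.items():
        called = t["TP"] + t["FP"]   # everything the query callset reported
        truth = t["TP"] + t["FN"]    # everything the benchmark contains
        metrics[stratum] = {
            "precision": t["TP"] / called if called else None,
            "recall": t["TP"] / truth if truth else None,
        }
    return metrics


# Toy records (invented), stratified by variant type.
records = [("SNV", "TP")] * 3 + [("SNV", "FP")] + [("indel", "TP"), ("indel", "FN")]
for stratum, m in stratified_metrics(records).items():
    print(stratum, m)
```

Because a variant missing from an imperfect benchmark is counted as a false positive even when the call is correct, metrics computed this way only measure agreement with the benchmark inside its benchmark regions.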
Potential of TLR-gene diversity in Czech indigenous cattle for resistance breeding as revealed by hybrid sequencing
A production herd of Czech Simmental cattle (Czech Red Pied, CRP), the conserved subpopulation of this breed, and the ancient local breed Czech Red cattle (CR) were screened for diversity in the antibacterial toll-like receptors (TLRs), which are members of the innate immune system. Polymerase chain reaction (PCR) amplicons of TLR1, TLR2, TLR4, TLR5, and TLR6 from pooled DNA samples were sequenced with PacBio technology, with 3–5× coverage per gene per animal. To increase the reliability of variant detection, the gDNA pools were sequenced in parallel with the Illumina X-ten platform at low coverage (60× per gene). The diversity in conserved CRP and CR was similar to the diversity in conserved and modern CRP, representing 76.4% and 70.9% of its variants, respectively. Sixty-eight (54.4%) polymorphisms in the five TLR genes were shared by the two breeds, whereas 38 (30.4%) were specific to the production herd of CRP; 4 (3.2%) were specific to the broad CRP population; 7 (5.6%) were present in both conserved populations; 5 (4.0%) were present solely for the conserved CRP; and 3 (2.4%) were restricted to CR. Consequently, gene pool erosion related to intensive breeding did not occur in Czech Simmental cattle. Similarly, no considerable consequences were found from known bottlenecks in the history of Czech Red cattle. On the other hand, the distinctness of the conserved populations and their potential for resistance breeding were only moderate. This relationship might be transferable to other non-abundant historical cattle breeds that are conserved as genetic resources. The estimates of polymorphism impact using Variant Effect Predictor and SIFT software tools allowed for the identification of candidate single-nucleotide polymorphisms (SNPs) for association studies related to infection resistance and targeted breeding.
Knowledge of TLR-gene diversity present in Czech Simmental populations may aid in the potential transfer of variant characteristics from other breeds.
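The polymorphism counts and percentages quoted in this abstract can be cross-checked with a short Python calculation. The category labels are paraphrased from the abstract, and the total of 125 polymorphisms is not stated in the source; it is inferred here by summing the six reported counts.

```python
# Counts as reported in the abstract; labels are paraphrased for brevity.
counts = {
    "shared by the two breeds": 68,
    "specific to CRP production herd": 38,
    "specific to broad CRP population": 4,
    "present in both conserved populations": 7,
    "solely in conserved CRP": 5,
    "restricted to CR": 3,
}


def shares(counts):
    """Return each category's share of the total, in percent, rounded to one decimal."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


total = sum(counts.values())  # inferred total, not stated in the abstract
print("total polymorphisms:", total)
for label, pct in shares(counts).items():
    print(f"{label}: {counts[label]} ({pct}%)")
```

The six counts sum to 125, and each count divided by that total reproduces the percentage given in the text, so the reported breakdown is internally consistent.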