June 1, 2021  |  

Complete HIV-1 genomes from single molecules: Diversity estimates in two linked transmission pairs using clustering and mutual information.

We sequenced complete HIV-1 genomes from single molecules using Single Molecule, Real-Time (SMRT) Sequencing and derived de novo full-length genome sequences. SMRT sequencing yields long-read sequencing results from individual DNA molecules with a rapid time-to-result. These attributes make it a useful tool for continuous monitoring of viral populations. The single-molecule nature of the sequencing method allows us to estimate variant subspecies and relative abundances by counting methods. We detail mathematical techniques used in viral variant subspecies identification including clustering distance metrics and mutual information. Sequencing was performed in order to better understand the relationships between the specific sequences of transmitted viruses in linked transmission pairs. Samples representing HIV transmission pairs were selected from the Zambia Emory HIV Research Project (Lusaka, Zambia) and sequenced. We examine Single Genome Amplification (SGA) prepped samples and samples containing complex mixtures of genomes. Whole genome consensus estimates for each of the samples were made. Genome reads were clustered using a simple distance metric on aligned reads. Appropriate thresholds were chosen to yield distinct clusters of HIV genomes within samples. Mutual information between columns in the genome alignments was used to measure dependence. In silico mixtures of reads from the SGA samples were made to simulate samples containing exactly controlled complex mixtures of genomes and our clustering methods were applied to these complex mixtures. SMRT Sequencing data contained multiple full-length (greater than 9 kb) continuous reads for each sample. Simple whole genome consensus estimates easily identified transmission pairs. The clustering of the genome reads showed diversity differences between the samples, allowing us to characterize the diversity of the individual quasi-species comprising the patient viral populations across the full genome.
Mutual information identified possible dependencies of different positions across the full HIV-1 genome. The SGA consensus genomes agreed with prior Sanger sequencing. Our clustering methods correctly segregated reads to their correct originating genome for the synthetic SGA mixtures. The results open up the potential for reference-agnostic and cost-effective full genome sequencing of HIV-1.
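
The abstract above names two techniques without giving their details: clustering aligned reads with a simple distance metric and a threshold, and measuring dependence between alignment columns with mutual information. The following Python sketch illustrates one plausible reading of each. It is not the authors' implementation: the normalized Hamming distance, the greedy seed-based clustering, the threshold value, and the function names `hamming`, `cluster_reads`, and `mutual_information` are all assumptions made for illustration, and the four short reads are toy data.

```python
from collections import Counter
from math import log2


def hamming(a, b):
    """Fraction of aligned positions at which two equal-length reads differ."""
    if len(a) != len(b):
        raise ValueError("reads must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_reads(reads, threshold):
    """Greedy clustering (an assumed scheme): each read joins the first
    cluster whose seed read is within `threshold`, else starts a new one."""
    clusters = []  # each cluster is a list; its first element is the seed
    for read in reads:
        for cluster in clusters:
            if hamming(read, cluster[0]) <= threshold:
                cluster.append(read)
                break
        else:
            clusters.append([read])
    return clusters


def mutual_information(reads, i, j):
    """Mutual information (in bits) between alignment columns i and j,
    estimated from the observed base frequencies across reads."""
    n = len(reads)
    col_i = Counter(r[i] for r in reads)
    col_j = Counter(r[j] for r in reads)
    joint = Counter((r[i], r[j]) for r in reads)
    # sum over observed pairs of p(a,b) * log2( p(a,b) / (p(a) * p(b)) )
    return sum(
        (c / n) * log2(c * n / (col_i[a] * col_j[b]))
        for (a, b), c in joint.items()
    )


# Toy aligned reads from two hypothetical variants (not real HIV-1 data).
reads = ["ACGT", "ACGT", "TCGA", "TCGA"]
clusters = cluster_reads(reads, threshold=0.25)
mi_linked = mutual_information(reads, 0, 3)    # columns that vary together
mi_constant = mutual_information(reads, 0, 1)  # column 1 never varies
```

In this toy input, columns 0 and 3 always change together, so their mutual information is high, while a column that never varies carries no information about any other. Because each read comes from a single molecule, the size of each cluster is a direct count-based estimate of that variant's relative abundance.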


April 21, 2020  |  

Long-read sequencing for rare human genetic diseases.

During the past decade, the search for pathogenic mutations in rare human genetic diseases has involved huge efforts to sequence coding regions, or the entire genome, using massively parallel short-read sequencers. However, the approximate current diagnostic rate is <50% using these approaches, and there remain many rare genetic diseases with unknown cause. There may be many reasons for this, but one plausible explanation is that the responsible mutations are in regions of the genome that are difficult to sequence using conventional technologies (e.g., tandem-repeat expansion or complex chromosomal structural aberrations). Despite the drawbacks of high cost and a shortage of standard analytical methods, several studies have analyzed pathogenic changes in the genome using long-read sequencers. The results of these studies provide hope that further application of long-read sequencers to identify the causative mutations in unsolved genetic diseases may expand our understanding of the human genome and diseases. Such approaches may also be applied to molecular diagnosis and therapeutic strategies for patients with genetic diseases in the future.


April 21, 2020  |  

A chromosome-level sequence assembly reveals the structure of the Arabidopsis thaliana Nd-1 genome and its gene set.

In addition to the BAC-based reference sequence of the accession Columbia-0 from the year 2000, several short read assemblies of THE plant model organism Arabidopsis thaliana were published during the last years. Also, a SMRT-based assembly of Landsberg erecta has been generated that identified translocation and inversion polymorphisms between two genotypes of the species. Here we provide a chromosome-arm level assembly of the A. thaliana accession Niederzenz-1 (AthNd-1_v2c) based on SMRT sequencing data. The best assembly comprises 69 nucleome sequences and displays a contig length of up to 16 Mbp. Compared to an earlier Illumina short read-based NGS assembly (AthNd-1_v1), a 75 fold increase in contiguity was observed for AthNd-1_v2c. To assign contig locations independent from the Col-0 gold standard reference sequence, we used genetic anchoring to generate a de novo assembly. In addition, we assembled the chondrome and plastome sequences. Detailed analyses of AthNd-1_v2c allowed reliable identification of large genomic rearrangements between A. thaliana accessions contributing to differences in the gene sets that distinguish the genotypes. One of the differences detected identified a gene that is lacking from the Col-0 gold standard sequence. This de novo assembly extends the known proportion of the A. thaliana pan-genome.


April 21, 2020  |  

Genetic Variation, Comparative Genomics, and the Diagnosis of Disease.

The discovery of mutations associated with human genetic disease is an exercise in comparative genomics (see Glossary). Although there are many different strategies and approaches, the central premise is that affected persons harbor a significant excess of pathogenic DNA variants as compared with a group of unaffected persons (controls) that is either clinically defined1 or established by surveying large swaths of the general population.2 The more exclusive the variant is to the disease, the greater its penetrance, the larger its effect size, and the more relevant it becomes to both disease diagnosis and future therapeutic investigation. The most popular approach used by researchers in human genetics is the case–control design, but there are others that can be used to track variants and disease in a family context or that consider the probability of different classes of mutations based on evolutionary patterns of divergence or de novo mutational change.3,4 Although the approaches may be straightforward, the discovery of pathogenic variation and its mechanism of action often is less trivial, and decades of research can be required in order to identify the variants underlying both mendelian and complex genetic traits.


April 21, 2020  |  

Antarctic blackfin icefish genome reveals adaptations to extreme environments.

Icefishes (suborder Notothenioidei; family Channichthyidae) are the only vertebrates that lack functional haemoglobin genes and red blood cells. Here, we report a high-quality genome assembly and linkage map for the Antarctic blackfin icefish Chaenocephalus aceratus, highlighting evolved genomic features for its unique physiology. Phylogenomic analysis revealed that Antarctic fish of the teleost suborder Notothenioidei, including icefishes, diverged from the stickleback lineage about 77 million years ago and subsequently evolved cold-adapted phenotypes as the Southern Ocean cooled to sub-zero temperatures. Our results show that genes involved in protection from ice damage, including genes encoding antifreeze glycoprotein and zona pellucida proteins, are highly expanded in the icefish genome. Furthermore, genes that encode enzymes that help to control cellular redox state, including members of the sod3 and nqo1 gene families, are expanded, probably as evolutionary adaptations to the relatively high concentration of oxygen dissolved in cold Antarctic waters. In contrast, some crucial regulators of circadian homeostasis (cry and per genes) are absent from the icefish genome, suggesting compromised control of biological rhythms in the polar light environment. The availability of the icefish genome sequence will accelerate our understanding of adaptation to extreme Antarctic environments.


April 21, 2020  |  

Patterns of non-ARD variation in more than 300 full-length HLA-DPB1 alleles.

Our understanding of sequence variation in the HLA-DPB1 gene is largely restricted to the hypervariable antigen recognition domain (ARD) encoded by exon 2. Here, we employed a redundant sequencing strategy combining long-read and short-read data to accurately phase and characterise in full length the majority of common and well-documented (CWD) DPB1 alleles as well as alleles with an observed frequency of at least 0.0006% in our predominantly European sample set. We generated 664 DPB1 sequences, comprising 279 distinct allelic variants. This allows us to present the, to date, most comprehensive analysis of the nature and extent of DPB1 sequence variation. The full-length sequence analysis revealed the existence of two highly diverged allele clades. These clades correlate with the rs9277534 A→G variant, a known expression marker located in the 3′-UTR. The two clades are fully differentiated by 174 fixed polymorphisms throughout a 3.6 kb stretch at the 3′-end of DPB1. The region upstream of this differentiation zone is characterised by increasingly shared variation between the clades. The low-expression A clade comprises 59% of the distinct allelic sequences including the three by far most frequent DPB1 alleles, DPB1*04:01, DPB1*02:01 and DPB1*04:02. Alleles in the A clade show reduced nucleotide diversity with an excess of rare variants when compared to the high-expression G clade. This pattern is consistent with a scenario of recent proliferation of A-clade alleles. The full-length characterisation of all but the most rare DPB1 alleles will benefit the application of NGS for DPB1 genotyping and provides a helpful framework for a deeper understanding of high- and low-expression alleles and their implications in the context of unrelated haematopoietic stem-cell transplantation. Copyright © 2018 The Authors. Published by Elsevier Inc. All rights reserved.


April 21, 2020  |  

Long-read sequencing identified intronic repeat expansions in SAMD12 from Chinese pedigrees affected with familial cortical myoclonic tremor with epilepsy.

The locus for familial cortical myoclonic tremor with epilepsy (FCMTE) has long been mapped to 8q24 in linkage studies, but the causative mutations remain unclear. Recently, expansions of intronic TTTCA and TTTTA repeat motifs within SAMD12 were found to be involved in the pathogenesis of FCMTE in Japanese pedigrees. We aim to identify the causative mutations of FCMTE in Chinese pedigrees. We performed genetic linkage analysis by microsatellite markers in a five-generation Chinese pedigree with 55 members. We also used array-comparative genomic hybridisation (CGH) and next-generation sequencing (NGS) technologies (whole-exome sequencing, capture region deep sequencing and whole-genome sequencing) to identify the causative mutations in the disease locus. Recently, we used low-coverage (~10×) long-read genome sequencing (LRS) on the PacBio Sequel and Oxford Nanopore platforms to identify the causative mutations, and used repeat-primed PCR for validation of the repeat expansions. Linkage analysis mapped the disease locus to 8q23.3-24.23. Array-CGH and NGS failed to identify causative mutations in this locus. LRS identified the intronic TTTCA and TTTTA repeat expansions in SAMD12 as the causative mutations, thus corroborating the recently published results in Japanese pedigrees. We identified the pentanucleotide repeat expansion in SAMD12 as the causative mutation in Chinese FCMTE pedigrees. Our study also suggested that LRS is an effective tool for molecular diagnosis of genetic disorders, especially for neurological diseases that cannot be positively diagnosed by conventional clinical microarray and NGS technologies. © Author(s) (or their employer(s)) 2019. No commercial re-use. See rights and permissions. Published by BMJ.


April 21, 2020  |  

Population Genome Sequencing of the Scab Fungal Species Venturia inaequalis, Venturia pirina, Venturia aucupariae and Venturia asperata.

The Venturia genus comprises fungal species that are pathogens on Rosaceae host plants, including V. inaequalis and V. asperata on apple, V. aucupariae on sorbus and V. pirina on pear. Although the genetic structure of V. inaequalis populations has been investigated in detail, genomic features underlying these subdivisions remain poorly understood. Here, we report whole genome sequencing of 87 Venturia strains that represent each species and each population within V. inaequalis. We present a PacBio genome assembly for the V. inaequalis EU-B04 reference isolate. The size of selected genomes was determined by flow cytometry, and varied from 45 to 93 Mb. Genome assemblies of V. inaequalis and V. aucupariae contain a high content of transposable elements (TEs), most of which belong to the Gypsy or Copia LTR superfamilies and have been inactivated by Repeat-Induced Point mutations. The reference assembly of V. inaequalis presents a mosaic structure of GC-equilibrated regions that mainly contain predicted genes and AT-rich regions, mainly composed of TEs. Six pairs of strains were identified as clones. Single-Nucleotide Polymorphism (SNP) analysis between these clones revealed a high number of SNPs that are mostly located in AT-rich regions due to misalignments and allowed determining a false discovery rate. The availability of these genome sequences is expected to stimulate genetics and population genomics research of Venturia pathogens. In particular, it will help in understanding the evolutionary history of Venturia species that are pathogenic on different hosts, a history that has probably been substantially influenced by TEs. Copyright © 2019 Le Cam et al.


April 21, 2020  |  

Detecting a long insertion variant in SAMD12 by SMRT sequencing: implications of long-read whole-genome sequencing for repeat expansion diseases.

Long-read sequencing technology is now capable of reading single-molecule DNA with an average read length of more than 10 kb, fully enabling the coverage of large structural variations (SVs). This advantage may pave the way for the detection of unprecedented SVs as well as repeat expansions. Pathogenic SVs of only known genes used to be selectively analyzed based on prior knowledge of target DNA sequence. The unbiased application of long-read whole-genome sequencing (WGS) for the detection of pathogenic SVs has just begun. Here, we apply PacBio SMRT sequencing in a Japanese family with benign adult familial myoclonus epilepsy (BAFME). Our SV selection of low-coverage WGS data (7×) narrowed down the candidates to only six SVs in a 7.16-Mb region of the BAFME1 locus and correctly determined an approximately 4.6-kb SAMD12 intronic repeat insertion, which is causal of BAFME1. These results indicate that long-read WGS is potentially useful for evaluating all of the known SVs in a genome and identifying new disease-causing SVs in combination with other genetic methods to resolve the genetic causes of currently unexplained diseases.


April 21, 2020  |  

Long-read sequencing identifies GGC repeat expansions in NOTCH2NLC associated with neuronal intranuclear inclusion disease.

Neuronal intranuclear inclusion disease (NIID) is a progressive neurodegenerative disease that is characterized by eosinophilic hyaline intranuclear inclusions in neuronal and somatic cells. The wide range of clinical manifestations in NIID makes ante-mortem diagnosis difficult1-8, but skin biopsy enables its ante-mortem diagnosis9-12. The average onset age is 59.7 years among approximately 140 NIID cases consisting of mostly sporadic and several familial cases. By linkage mapping of a large NIID family with several affected members (Family 1), we identified a 58.1 Mb linked region at 1p22.1-q21.3 with a maximum logarithm of the odds score of 4.21. By long-read sequencing, we identified a GGC repeat expansion in the 5′ region of NOTCH2NLC (Notch 2 N-terminal like C) in all affected family members. Furthermore, we found similar expansions in 8 unrelated families with NIID and 40 sporadic NIID cases. We observed abnormal anti-sense transcripts in fibroblasts specifically from patients but not unaffected individuals. This work shows that repeat expansion in human-specific NOTCH2NLC, a gene that evolved by segmental duplication, causes a human disease.


April 21, 2020  |  

Mutation of a bHLH transcription factor allowed almond domestication.

Wild almond species accumulate the bitter and toxic cyanogenic diglucoside amygdalin. Almond domestication was enabled by the selection of genotypes harboring sweet kernels. We report the completion of the almond reference genome. Map-based cloning using an F1 population segregating for kernel taste led to the identification of a 46-kilobase gene cluster encoding five basic helix-loop-helix transcription factors, bHLH1 to bHLH5. Functional characterization demonstrated that bHLH2 controls transcription of the P450 monooxygenase-encoding genes PdCYP79D16 and PdCYP71AN24, which are involved in the amygdalin biosynthetic pathway. A nonsynonymous point mutation (Leu to Phe) in the dimerization domain of bHLH2 prevents transcription of the two cytochrome P450 genes, resulting in the sweet kernel trait. Copyright © 2019 The Authors, some rights reserved; exclusive licensee American Association for the Advancement of Science. No claim to original U.S. Government Works.


April 21, 2020  |  

Improvement of the Pacific bluefin tuna (Thunnus orientalis) reference genome and development of male-specific DNA markers.

The Pacific bluefin tuna, Thunnus orientalis, is a highly migratory species that is widely distributed in the North Pacific Ocean. Like other marine species, T. orientalis has no external sexual dimorphism; thus, identifying sex-specific variants from whole genome sequence data is a useful approach to develop an effective sex identification method. Here, we report an improved draft genome of T. orientalis and male-specific DNA markers. Combining PacBio long reads and Illumina short reads sufficiently improved genome assembly, with a 38-fold increase in scaffold contiguity (to 444 scaffolds) compared to the first published draft genome. Through analysing re-sequence data of 15 males and 16 females, 250 male-specific SNPs were identified from more than 30 million polymorphisms. All male-specific variants were male-heterozygous, suggesting that T. orientalis has a male heterogametic sex-determination system. The largest linkage disequilibrium block (3,174 bp on scaffold_064) contained 51 male-specific variants. PCR primers and a PCR-based sex identification assay were developed using these male-specific variants. The sex of 115 individuals (56 males and 59 females; sex was diagnosed by visual examination of the gonads) was identified with high accuracy using the assay. This easy, accurate, and practical technique facilitates the control of sex ratios in tuna farms. Furthermore, this method could be used to estimate the sex ratio and/or the sex-specific growth rate of natural populations.


April 21, 2020  |  

Wild relatives of maize

Crop domestication changed the course of human evolution, and domestication of maize (Zea mays L. subspecies mays), today the world’s most important crop, enabled civilizations to flourish and has played a major role in shaping the world we know today. Archaeological and ethnobotanical research help us understand the development of the cultures and the movements of the peoples who carried maize to new areas where it continued to adapt. Ancient remains of maize cobs and kernels have been found in the place of domestication, the Balsas River Valley (~9,000 years before present era), and the cultivation center, the Tehuacan Valley (~5,000 years before present era), and have been used to study the process of domestication. Paleogenomic data showed that some of the genes controlling the stem and inflorescence architecture were comparable to modern maize, while other genes controlling ear shattering and starch biosynthesis retain high levels of variability, similar to those found in the wild relative teosinte. These results indicate that the domestication process was both gradual and complex, where different genetic loci were selected at different points in time, and that the transformation of teosinte to maize was completed in the last 5,000 years. Mesoamerican native cultures domesticated teosinte and developed maize from a 6 cm long, popping-kernel ear to what we now recognize as modern maize with its wide variety in ear size, kernel texture, color, size, and adequacy for diverse uses and also invented nixtamalization, a process key to maximizing its nutrition. Used directly for human and animal consumption, processed food products, bioenergy, and many cultural applications, it is now grown on six of the world’s seven continents. The study of its evolution and domestication from the wild grass teosinte helps us understand the nature of genetic diversity of maize and its wild relatives and gene expression. 
Genetic barriers to direct use of teosinte or Tripsacum in maize breeding have challenged our ability to identify valuable genes and traits, let alone incorporate them into elite, modern varieties. Genomic information and newer genetic technologies will facilitate the use of wild relatives in crop improvement; hence it is more important than ever to ensure their conservation and availability, fundamental to future food security. In situ conservation efforts dedicated to preserving remnant populations of wild relatives in Mexico are key to safeguarding the genetic diversity of maize and its genepool, as well as enabling these species to continue to adapt to dynamic climate and environmental changes. Genebank ex situ efforts are crucial to securely maintain collected wild relative resources and to provide them for gene discovery and other research efforts.


April 21, 2020  |  

Tandem-genotypes: robust detection of tandem repeat expansions from long DNA reads.

Tandemly repeated DNA is highly mutable and causes at least 31 diseases, but it is hard to detect pathogenic repeat expansions genome-wide. Here, we report robust detection of human repeat expansions from careful alignments of long but error-prone (PacBio and nanopore) reads to a reference genome. Our method is robust to systematic sequencing errors, inexact repeats with fuzzy boundaries, and low sequencing coverage. By comparing to healthy controls, we prioritize pathogenic expansions within the top 10 out of 700,000 tandem repeats in whole genome sequencing data. This may help to elucidate the many genetic diseases whose causes remain unknown.

