The expression of androgen receptor (AR) variants is a frequent, yet poorly understood mechanism of clinical resistance to AR-targeted therapy for castration-resistant prostate cancer (CRPC). Among the multiple AR variants expressed in CRPC, AR-V7 is considered the most clinically relevant due to its broad expression in CRPC, correlations of AR-V7 expression with clinical resistance, and growth inhibition when AR-V7 is knocked down in CRPC models. Therefore, efforts are under way to develop strategies for monitoring and inhibiting AR-V7 in CRPC. The aim of this study was to understand whether other AR variants are co-expressed with AR-V7 and promote resistance to AR-targeted therapies. To test this, we utilized RNA-seq to characterize AR expression in CRPC models. RNA-seq revealed the frequent co-expression of AR-V9 and AR-V7 in multiple CRPC models and metastases. Furthermore, long-read single-molecule real-time (SMRT) sequencing of AR isoforms revealed that AR-V7 and AR-V9 shared a common 3′ terminal cryptic exon. To test this, we knocked down AR-V7 in prostate cancer cell lines and confirmed that AR-V9 mRNA and protein expression were also impacted. In reporter assays with AR-responsive promoters, AR-V9 functioned as a constitutive activator of androgen/AR signaling. Similarly, infection of LNCaP cells with an AR-V9 lentiviral construct induced androgen-independent cell proliferation. In conclusion, these data implicate co-expression of AR-V9 with AR-V7 as an important component of constitutive AR signaling and therapeutic resistance in CRPC.
Fast and effective variant calling algorithms have been crucial to the successful application of DNA sequencing in human genetics. In particular, joint calling – in which reads from multiple individuals are pooled to increase power for shared variants – is an important tool for population surveys of variation. Joint calling was applied by the 1000 Genomes Project to identify variants across many individuals each sequenced to low coverage (about 5-fold). This approach successfully found common small variants, but broadly missed structural variants and large indels for which short-read sequencing has limited sensitivity. To support use of large variants in rare disease and common trait association studies, it is necessary to perform population-scale surveys with a technology effective at detecting indels and structural variants, such as PacBio SMRT Sequencing. For these studies, it is important to have a joint calling workflow that works with PacBio reads. We have developed pbsv, an indel and structural variant caller for PacBio reads, that provides a two-step joint calling workflow similar to that used to build the ExAC database. The first stage, discovery, is performed separately for each sample and consolidates whole genome alignments into a sparse representation of potentially variant loci. The second stage, calling, is performed on all samples together and considers only the signatures identified in the discovery stage. We applied the pbsv joint calling workflow to PacBio reads from twenty human genomes, with coverage ranging from 5-fold to 80-fold per sample for a total of 460-fold. The analysis required only 102 CPU hours, and identified over 800,000 indels and structural variants, including hundreds of inversions and translocations, many times more than discovered with short-read sequencing. The workflow is scalable to thousands of samples. 
The ongoing application of this workflow to thousands of samples will provide insight into the evolution and functional importance of large variants in human evolution and disease.
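The two-stage structure described above (per-sample discovery followed by a joint call over all samples) can be illustrated with a toy sketch. This is not pbsv's implementation or interface: the evidence tuples, the function names `discover` and `joint_call`, the 100 bp bin size and the support threshold are all hypothetical choices made only for illustration.

```python
from collections import defaultdict

# Hypothetical toy model, not pbsv's data structures or algorithm.
# Each piece of evidence is a (chrom, pos, svtype) tuple taken from one read.

def discover(evidence, bin_size=100):
    """Stage 1 (per sample): collapse read evidence into sparse signature counts."""
    signatures = defaultdict(int)
    for chrom, pos, svtype in evidence:
        # Nearby observations of the same type fall into the same bin.
        signatures[(chrom, pos // bin_size, svtype)] += 1
    return dict(signatures)

def joint_call(per_sample, min_total_support=3):
    """Stage 2 (all samples): pool signature counts and keep supported loci."""
    pooled = defaultdict(int)
    for sigs in per_sample.values():
        for key, count in sigs.items():
            pooled[key] += count
    calls = {}
    for key, total in pooled.items():
        if total >= min_total_support:
            # Report per-sample support for every locus that passes the pooled threshold.
            calls[key] = {name: sigs.get(key, 0) for name, sigs in per_sample.items()}
    return calls

samples = {
    "sampleA": [("chr1", 1010, "DEL"), ("chr1", 1040, "DEL")],
    "sampleB": [("chr1", 1020, "DEL"), ("chr2", 500, "INS")],
}
signatures = {name: discover(ev) for name, ev in samples.items()}
calls = joint_call(signatures)
print(calls)
```

In this toy input neither sample alone reaches three supporting reads for the chr1 deletion, but the pooled evidence does, which is the power gain that joint calling is meant to provide for shared variants.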
High-throughput NGS methods are increasingly utilized in the clinical genomics market. However, short-read sequencing data remain challenged by mapping inaccuracies in low-complexity regions or regions of high homology, and may not provide adequate coverage within GC-rich regions of the genome. Thus, Sanger sequencing remains popular in many clinical sequencing labs as the gold-standard approach for orthogonal validation of variants and for interrogating regions poorly covered by second-generation sequencing. Sanger sequencing can be less than ideal, as it is costly for high-volume assays and projects. Additionally, Sanger sequencing generates read lengths shorter than the region of interest, which limits its ability to accurately phase allelic variants. High-throughput SMRT Sequencing overcomes the challenges of both the first- and second-generation sequencing methods. PacBio’s long read capability allows sequencing of full-length amplicons
FALCON-Phase integrates PacBio and HiC data for de novo assembly, scaffolding and phasing of a diploid Puerto Rican genome (HG00733)
Haplotype-resolved genomes are important for understanding how combinations of variants impact phenotypes. The study of disease, quantitative traits, forensics, and organ donor matching are aided by phased genomes. Phase is commonly resolved using familial data, population-based imputation, or by isolating and sequencing single haplotypes using fosmids, BACs, or haploid tissues. Because these methods can be prohibitively expensive, or samples may not be available, alternative approaches are required. De novo genome assembly with PacBio Single Molecule, Real-Time (SMRT) data produces highly contiguous, accurate assemblies. For non-inbred samples, including humans, the separate resolution of haplotypes results in higher base accuracy and more contiguous assembled sequences. Two primary methods exist for phased diploid genome assembly. The first, TrioCanu, requires Illumina data from parents and PacBio data from the offspring. The long reads from the child are partitioned into maternal and paternal bins using parent-specific sequences; the separate PacBio read bins are then assembled, generating two fully phased genomes. An alternative approach (FALCON-Unzip) does not require parental information and separates PacBio reads, during genome assembly, using heterozygous SNPs. The length of haplotype phase blocks in FALCON-Unzip is limited by the magnitude and distribution of heterozygosity, the length of sequence reads, and read coverage. Because of this, FALCON-Unzip contigs typically contain haplotype-switch errors between phase blocks, resulting in primary contigs of mixed parental origin. We developed FALCON-Phase, which integrates Hi-C data downstream of FALCON-Unzip to resolve phase switches along contigs. We applied the method to human (Puerto Rican, HG00733) and non-human genome assemblies and evaluated accuracy using samples with trio data. In a cattle genome, we observe >96% accuracy in phasing when compared to TrioCanu assemblies as well as parental SNPs.
For a high-quality PacBio assembly (>90-fold Sequel coverage) of a Puerto Rican individual we scaffolded the FALCON-Phase contigs, and re-phased the contigs creating a de novo scaffolded, phased diploid assembly with chromosome-scale contiguity.
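The read-partitioning step attributed to TrioCanu above (binning the child's long reads with parent-specific sequences) can be sketched with k-mers. This is a minimal illustration under stated assumptions, not TrioCanu's implementation: the helper names `kmers` and `bin_read`, the tiny k of 4 and the toy parental sequences are hypothetical, and real workflows derive parent-specific k-mers from parental short reads with much larger k.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def bin_read(read, maternal_only, paternal_only, k):
    """Assign a read to the parent whose private k-mers it shares most."""
    read_kmers = kmers(read, k)
    maternal_hits = len(read_kmers & maternal_only)
    paternal_hits = len(read_kmers & paternal_only)
    if maternal_hits > paternal_hits:
        return "maternal"
    if paternal_hits > maternal_hits:
        return "paternal"
    return "unassigned"  # no informative k-mers, or a tie

k = 4  # hypothetical; far too small for real genomes
maternal = kmers("ACGTACGGTTCA", k)  # toy maternal haplotype
paternal = kmers("ACGTACCGTTCA", k)  # toy paternal haplotype, one SNP away
maternal_only = maternal - paternal
paternal_only = paternal - maternal

for read in ("TACGGTT", "TACCGTT", "ACGTAC"):
    print(read, bin_read(read, maternal_only, paternal_only, k))
```

Reads that overlap the heterozygous site carry parent-private k-mers and are binned; the third read covers only shared sequence and stays unassigned, which is why homozygous stretches cannot be partitioned this way.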
Targeted sequencing of genomic DNA requires an enrichment method to generate detectable amounts of sequencing products. Genomic regions with extreme composition bias and repetitive sequences can pose a significant enrichment challenge. Many genetic diseases caused by repeat element expansions are representative of these challenging enrichment targets. PCR amplification, used either alone or in combination with a hybridization capture method, is a common approach for target enrichment. While PCR amplification can be used successfully with genomic regions of moderate to high complexity, it is the low-complexity regions and regions containing repetitive elements sometimes of indeterminate lengths due to repeat expansions that can lead to poor or failed PCR enrichment. We have developed an enrichment method for targeted SMRT Sequencing on the PacBio Sequel System using the CRISPR-Cas9 system that requires no PCR amplification. Briefly, a preformed SMRTbell library containing the target region of interest is cleaved with Cas9 through direct interaction with a sequence-specific guide RNA. After ligation with new poly(A) hairpin adapters, the asymmetric SMRTbell templates are enriched by magnetic bead separation. This method, paired with SMRT Sequencing’s long reads, high consensus accuracy, and uniform coverage, allows sequencing of genomic regions regardless of challenging sequence context that cannot be investigated with other technologies. The method is amenable to analyzing multiple samples and/or targets in a single reaction. In addition, this method also preserves epigenetic modifications allowing for the detection and characterization of DNA methylation which has been shown to be a key factor in the disease mechanism for some repeat expansion diseases. Here we present results of our latest No-Amp Targeted Sequencing procedure applied to the characterization of CAG triplet repeat expansions in the HTT gene responsible for Huntington’s Disease.
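Once a read spanning the HTT repeat is available, the simplest way to characterize the expansion is to count the repeat units in the longest uninterrupted CAG run. The sketch below is only an illustration of that counting step: the function name `longest_cag_run` and the example read are hypothetical, and it ignores sequencing errors and repeat interruptions that a real analysis would need to handle.

```python
import re

def longest_cag_run(read):
    """Return the number of units in the longest uninterrupted CAG run."""
    runs = re.findall(r"(?:CAG)+", read.upper())
    return max((len(run) // 3 for run in runs), default=0)

# Hypothetical read: flanking bases, 42 CAG units, then non-repeat sequence.
read = "TTCC" + "CAG" * 42 + "CAACAGCCGCCA"
repeat_units = longest_cag_run(read)
print(repeat_units)
```

Because the count comes from a single molecule that spans the whole repeat, no assumption about amplification behavior is needed, which matches the motivation for an amplification-free enrichment.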
Full-length transcriptome sequencing of melanoma cell line complements long-read assessment of genomic rearrangements
Transcriptome sequencing has proven to be an important tool for understanding the biological changes in cancer genomes including the consequences of structural rearrangements. Short read sequencing has been the method of choice, as the high throughput at low cost allows for transcript quantitation and the detection of even rare transcripts. However, the reads are generally too short to reconstruct complete isoforms. Conversely, long-read approaches can provide unambiguous full-length isoforms, but lower throughput has complicated quantitation and high RNA input requirements have made working with cancer samples challenging. Recently, the COLO 829 cell line was sequenced to 50-fold coverage with PacBio SMRT Sequencing. To validate and extend the findings from this effort, we have generated long-read transcriptome data using an updated PacBio Iso-Seq method, the results of which will be shared at the AACR 2019 General Meeting. With this complementary transcriptome data, we demonstrate how recent innovations in the PacBio Iso-Seq method sample preparation and sequencing chemistry have made long-read sequencing of cancer transcriptomes more practical. In particular, library preparation has been simplified and throughput has increased. The improved protocol has reduced sample prep time from several days to one day while reducing the sample input requirements ten-fold. In addition, the incorporation of unique molecular identifier (UMI) tags into the workflow has improved the bioinformatics analysis. Yield has also increased, with v3 sequencing chemistry typically delivering > 30 Gb per SMRT Cell 1M. By integrating long and short read data, we demonstrate that the Iso-Seq method is a practical tool for annotating cancer genomes with high-quality transcript information.
SMRT Sequencing is a DNA sequencing technology characterized by long read lengths and high consensus accuracy, regardless of the sequence complexity or GC content of the DNA sample. These characteristics…
During the past decade, the search for pathogenic mutations in rare human genetic diseases has involved huge efforts to sequence coding regions, or the entire genome, using massively parallel short-read sequencers. However, the approximate current diagnostic rate is <50% using these approaches, and there remain many rare genetic diseases with unknown cause. There may be many reasons for this, but one plausible explanation is that the responsible mutations are in regions of the genome that are difficult to sequence using conventional technologies (e.g., tandem-repeat expansion or complex chromosomal structural aberrations). Despite the drawbacks of high cost and a shortage of standard analytical methods, several studies have analyzed pathogenic changes in the genome using long-read sequencers. The results of these studies provide hope that further application of long-read sequencers to identify the causative mutations in unsolved genetic diseases may expand our understanding of the human genome and diseases. Such approaches may also be applied to molecular diagnosis and therapeutic strategies for patients with genetic diseases in the future.
Improved assembly and variant detection of a haploid human genome using single-molecule, high-fidelity long reads.
The sequence and assembly of human genomes using long-read sequencing technologies has revolutionized our understanding of structural variation and genome organization. We compared the accuracy, continuity, and gene annotation of genome assemblies generated from either high-fidelity (HiFi) or continuous long-read (CLR) datasets from the same complete hydatidiform mole human genome. We find that the HiFi sequence data assemble an additional 10% of duplicated regions and more accurately represent the structure of tandem repeats, as validated with orthogonal analyses. As a result, an additional 5 Mbp of pericentromeric sequences are recovered in the HiFi assembly, resulting in a 2.5-fold increase in the NG50 within 1 Mbp of the centromere (HiFi 480.6 kbp, CLR 191.5 kbp). Additionally, the HiFi genome assembly was generated in significantly less time with fewer computational resources than the CLR assembly. Although the HiFi assembly has significantly improved continuity and accuracy in many complex regions of the genome, it still falls short of the assembly of centromeric DNA and the largest regions of segmental duplication using existing assemblers. Despite these shortcomings, our results suggest that HiFi may be the most effective standalone technology for de novo assembly of human genomes. © 2019 John Wiley & Sons Ltd/University College London.
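The continuity comparison above is expressed as NG50, the contig length at which the longest contigs together cover at least half of an assumed genome (or region) size. The sketch below shows the general definition with made-up contig lengths; the function name `ng50` is hypothetical, and restricting the calculation to sequence within 1 Mbp of the centromere, as the abstract does, is not modeled here.

```python
def ng50(contig_lengths, genome_size):
    """Length of the contig at which the cumulative sum of contigs,
    taken longest first, reaches half of genome_size.
    Returns 0 if the assembly covers less than half."""
    half = genome_size / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return 0

# Toy example: cumulative lengths 400, 700; 700 >= 600, so NG50 is 300.
toy_ng50 = ng50([400, 300, 200, 100], genome_size=1200)

# Fold change between the two NG50 values reported in the abstract.
fold_increase = 480.6 / 191.5
print(toy_ng50, round(fold_increase, 1))
```

Unlike N50, which uses the total assembly length, NG50 uses a fixed reference size, so two assemblies of the same genome can be compared directly even if they recover different amounts of sequence.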
Zygotic genome activation (ZGA) following fertilization is accomplished through a process termed the maternal-to-zygotic transition, during which the maternal RNAs and proteins are degraded and the zygotic genome is transcriptionally activated.1 In mice, minor ZGA occurs from S phase of the zygote to G1 phase of the two-cell (2C) embryo, while major ZGA takes place during the middle-to-late 2C stage with a burst of transcription of totipotent cleavage stage-specific genes and retrotransposons.2 Dux has been recently identified and considered as a master inducer that regulates the ZGA process.3–5 Dux can directly bind and robustly activate 2C stage-specific ZGA transcripts and convert mouse embryonic stem cells (mESCs) to a 2C-like state with unique features that resemble the 2C embryos.4 Intriguingly, ~20% of embryos with zygotic depletion of Dux unexpectedly reached the morula or blastocyst stage even though a defective ZGA program was detected.
N6-methyladenosine (m6A) is a widespread RNA modification that influences nearly every aspect of the messenger RNA lifecycle. Our understanding of m6A has been facilitated by the development of global m6A mapping methods, which use antibodies to immunoprecipitate methylated RNA. However, these methods have several limitations, including high input RNA requirements and cross-reactivity to other RNA modifications. Here, we present DART-seq (deamination adjacent to RNA modification targets), an antibody-free method for detecting m6A sites. In DART-seq, the cytidine deaminase APOBEC1 is fused to the m6A-binding YTH domain. APOBEC1-YTH expression in cells induces C-to-U deamination at sites adjacent to m6A residues, which are detected using standard RNA-seq. DART-seq identifies thousands of m6A sites in cells from as little as 10 ng of total RNA and can detect m6A accumulation in cells over time. Additionally, we use long-read DART-seq to gain insights into m6A distribution along the length of individual transcripts.
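Because C-to-U deamination appears as a C-to-T mismatch in RNA-seq reads, the detection step can be pictured as scanning reference cytidines for a sufficient fraction of T-containing reads. The sketch below is a simplified illustration only, not the authors' pipeline: the function name `editing_sites`, the pileup layout and both thresholds are hypothetical, and a real analysis would also need controls to separate induced editing from SNPs and background deamination.

```python
def editing_sites(pileup, min_depth=10, min_fraction=0.1):
    """Return (chrom, pos, fraction) for reference C positions whose
    C-to-T read fraction passes the (hypothetical) thresholds."""
    sites = []
    for (chrom, pos), counts in sorted(pileup.items()):
        c_reads = counts.get("C", 0)
        t_reads = counts.get("T", 0)
        depth = c_reads + t_reads
        if depth == 0 or depth < min_depth:
            continue  # too little coverage to judge this position
        fraction = t_reads / depth
        if fraction >= min_fraction:
            sites.append((chrom, pos, fraction))
    return sites

# Hypothetical read counts at three reference cytidines.
pileup = {
    ("chr1", 100): {"C": 80, "T": 20},  # 20% edited, well covered
    ("chr1", 250): {"C": 99, "T": 1},   # below the fraction threshold
    ("chr1", 300): {"C": 3, "T": 2},    # below the depth threshold
}
candidates = editing_sites(pileup)
print(candidates)
```

Each reported cytidine would then point to a candidate m6A site nearby, since the abstract describes the editing as occurring adjacent to the modified residue.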
Tandem repeats lead to sequence assembly errors and impose multi-level challenges for genome and protein databases.
The widespread occurrence of repetitive stretches of DNA in genomes of organisms across the tree of life imposes fundamental challenges for sequencing, genome assembly, and automated annotation of genes and proteins. This multi-level problem can lead to errors in genome and protein databases that are often not recognized or acknowledged. As a consequence, end users working with sequences with repetitive regions are faced with ‘ready-to-use’ deposited data whose trustworthiness is difficult to determine, let alone to quantify. Here, we provide a review of the problems associated with tandem repeat sequences that originate from different stages during the sequencing-assembly-annotation-deposition workflow, and that may proliferate in public database repositories affecting all downstream analyses. As a case study, we provide examples of the Atlantic cod genome, whose sequencing and assembly were hindered by a particularly high prevalence of tandem repeats. We complement this case study with examples from other species, where mis-annotations and sequencing errors have propagated into protein databases. With this review, we aim to raise the awareness level within the community of database users, and alert scientists working in the underlying workflow of database creation that the data they omit or improperly assemble may well contain important biological information valuable to others. © The Author(s) 2019. Published by Oxford University Press on behalf of Nucleic Acids Research.
Domestication of clonally propagated crops such as pineapple from South America was hypothesized to be a ‘one-step operation’. We sequenced the genome of Ananas comosus var. bracteatus CB5 and assembled 513 Mb into 25 chromosomes with 29,412 genes. Comparison of the genomes of CB5, F153 and MD2 elucidated the genomic basis of fiber production, color formation, sugar accumulation and fruit maturation. We also resequenced 89 Ananas genomes. Cultivars ‘Smooth Cayenne’ and ‘Queen’ exhibited ancient and recent admixture, while ‘Singapore Spanish’ supported a one-step operation of domestication. We identified 25 selective sweeps, including a strong sweep containing a pair of tandemly duplicated bromelain inhibitors. Four candidate genes for self-incompatibility were linked in F153, but were not functional in self-compatible CB5. Our findings support the coexistence of sexual recombination and a one-step operation in the domestication of clonally propagated crops. This work guides the exploration of sexual and asexual domestication trajectories in other clonally propagated crops.
A Gram-stain-negative bacterial strain, designated CA10T, was isolated from bovine raw milk sampled in Anseong, Republic of Korea. Cells were yellow-pigmented, aerobic, non-motile bacilli and grew optimally at 30 °C and pH 7.0 on tryptic soy agar without supplementation of NaCl. Phylogenetic analysis based on the 16S rRNA gene sequences revealed that strain CA10T belonged to the genus Chryseobacterium, family Flavobacteriaceae, and was most closely related to Chryseobacterium indoltheticum ATCC 27950T (98.75% similarity). The average nucleotide identity and digital DNA-DNA hybridization values of strain CA10T were 94.4 and 56.9%, respectively, relative to Chryseobacterium scophthalmum DSM 16779T, being lower than the cut-off values of 95-96 and 70%, respectively. The predominant respiratory quinone was menaquinone-6; major polar lipid, phosphatidylethanolamine; major fatty acids, iso-C15:0, summed feature 9 (iso-C17:1 ω9c and/or C16:0 10-methyl), summed feature 3 (iso-C15:0 2-OH and/or C16:1 ω7c) and iso-C17:0 3-OH. The results of physiological, chemotaxonomic and biochemical analyses suggested that strain CA10T is a novel species of genus Chryseobacterium, for which the name Chryseobacterium mulctrae sp. nov. is proposed. The type strain is CA10T (=KACC 21234T=JCM 33443T).
A Gram-stain-negative, rod-shaped and red-pigmented strain, HME7025T, was isolated from freshwater sampled in the Republic of Korea. Phylogenetic analysis based on its 16S rRNA gene sequence revealed that strain HME7025T formed a lineage within the family Cytophagaceae of the phylum Bacteroidetes. Strain HME7025T was closely related to the genera Pseudarcicella, Arcicella and Flectobacillus. The 16S rRNA gene sequence similarity values of strain HME7025T were under 94.5% to its closest phylogenetic neighbours. The major fatty acids of strain HME7025T were iso-C15:0 (41.9%), summed feature 3 (comprising C16:1 ω7c and/or C16:1 ω6c; 12.2%) and anteiso-C15:0 (10.8%). The major respiratory quinone was menaquinone-7. The major polar lipids were phosphatidylethanolamine, two unidentified aminophospholipids and one unidentified polar lipid. The DNA G+C content of strain HME7025T was 37.9 mol%. On the basis of the evidence presented in this study, strain HME7025T represents a novel species of a novel genus within the family Cytophagaceae, for which the name Allopseudarcicella aquatilis gen. nov., sp. nov. is proposed. The type strain is HME7025T (=KCTC 23617T=CECT 7957T).