DNA extracted from tissue samples typically derives from a complex mixture of cell types. Without single-cell analysis, it has generally been impossible to determine the cell type of origin for most molecules. A clear example is the complex cellular milieu of a human neoplasm. Here, we develop ROCIT (https://github.com/tobybaker/rocit), a transformer-based model that classifies individual reads from bulk tumor samples, sequenced with long-read whole-genome sequencing, as tumor or non-tumor in origin. With training data derived from somatic mutations, ROCIT uses read-level methylation patterns to accurately classify reads from anywhere in the genome, without requiring adjacent normal tissue or the explicit identification of tumor differentially methylated regions. We apply ROCIT to a cohort of prostate and ovarian tumors and demonstrate high classification accuracy across the entire genome. We then demonstrate the potential of ROCIT predictions to improve somatic variant calling. ROCIT represents a major step forward in the analysis of bulk tumors with long reads, enabling the accurate and sensitive identification of reads with specific cell types of origin genome-wide.
Journal: bioRxiv
DOI: 10.64898/2026.03.03.709085
Year: 2026
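
The abstract states that training data are derived from somatic mutations but does not give the procedure. The following is a minimal sketch of one plausible labeling step, not ROCIT's actual implementation: a read that carries the alternate allele of a known somatic SNV is assumed to come from a tumor cell. The names `SomaticSNV`, `Read`, `base_at`, and `label_read` are hypothetical, reads are assumed to align without indels, and the sketch does not attempt to derive non-tumor labels.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class SomaticSNV:
    """Hypothetical record for a somatic single-nucleotide variant."""
    chrom: str
    pos: int   # 0-based reference position
    ref: str
    alt: str


@dataclass(frozen=True)
class Read:
    """Hypothetical, simplified aligned read (assumes no indels or clipping)."""
    name: str
    chrom: str
    start: int  # 0-based reference position of the first base
    bases: str


def base_at(read: Read, pos: int) -> Optional[str]:
    """Return the read base aligned to reference position `pos`, if covered."""
    offset = pos - read.start
    if 0 <= offset < len(read.bases):
        return read.bases[offset]
    return None


def label_read(read: Read, snvs: Iterable[SomaticSNV]) -> Optional[str]:
    """Label a read 'tumor' if it carries any somatic alternate allele.

    Reads that carry no somatic allele stay unlabeled (None): they may come
    from non-tumor cells or from tumor cells' unmutated copies, so this
    sketch makes no call for them.
    """
    for snv in snvs:
        if snv.chrom == read.chrom and base_at(read, snv.pos) == snv.alt:
            return "tumor"
    return None
```

In a pipeline of this shape, the labeled reads and their methylation calls would form the training set for a read-level classifier, which could then be applied to reads that overlap no somatic mutation.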