Editor | coisini
Accurate genome assembly is a cornerstone of biological research, but even the highest-quality assemblies still contain errors introduced by the methods used to build them. The human genome contains roughly 3 billion nucleotides, so even a tiny per-base error rate adds up to a surprisingly large total number of errors, diminishing the value of the genomic data.
Base-level errors are typically corrected in an additional polishing step, which aligns sequencing reads against the initial assembly to identify the edits that are needed. However, existing methods struggle to strike a balance between over-polishing (introducing edits that are themselves wrong) and under-polishing (leaving real errors uncorrected).
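To make the polishing idea concrete, here is a minimal sketch of a naive majority-vote polisher. It is not DeepPolisher's method; the function name `naive_polish`, the gap-free `(start, sequence)` read format, and the `min_fraction` and `min_depth` thresholds are all assumptions made for illustration.

```python
from collections import Counter

def naive_polish(assembly, reads, min_fraction=0.8, min_depth=3):
    """Replace an assembly base when enough aligned reads agree on another base.

    `reads` is a list of (start, sequence) pairs, assumed to be aligned
    to the assembly without gaps (a toy simplification).
    """
    polished = list(assembly)
    for pos, assembly_base in enumerate(assembly):
        # Collect the read bases that cover this assembly position.
        column = [seq[pos - start] for start, seq in reads
                  if start <= pos < start + len(seq)]
        if len(column) < min_depth:
            continue  # too little evidence: leave the base untouched
        base, count = Counter(column).most_common(1)[0]
        if base != assembly_base and count / len(column) >= min_fraction:
            polished[pos] = base
    return "".join(polished)

assembly = "ACGTTGCA"
reads = [(0, "ACGATG"), (1, "CGATGC"), (2, "GATGCA"), (0, "ACGATGCA")]
print(naive_polish(assembly, reads))  # position 3 is corrected from T to A
```

In this toy setting, raising `min_depth` or `min_fraction` pushes the polisher toward under-polishing, while lowering them risks over-polishing. A learned model is meant to replace such fixed thresholds.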
To address this, Google, in collaboration with institutions such as the UC Santa Cruz Genomics Institute, developed a new deep learning tool called DeepPolisher. Its goal is to significantly improve genome assembly accuracy by precisely correcting base-level errors.
Paper Link: https://genome.cshlp.org/content/35/7/1595
Open-Source Link: https://github.com/google/deeppolisher
DeepPolisher recently played a crucial role in refining the Human Pangenome Reference. Google's Chief Scientist, Jeff Dean, praised it, stating: "(DeepPolisher) has made exciting progress in genome assembly accuracy!"
DeepPolisher's Innovative Breakthrough
DeepPolisher is an encoder-only Transformer model that uses alignments of PacBio HiFi reads to a diploid assembly to predict corrections to the underlying sequence.
The DeepPolisher pipeline also introduces a method called PHARAOH (Phasing Reads in Areas Of Homozygosity), which uses ONT ultra-long read data to ensure that alignments are accurately phased and to correctly introduce heterozygous edits in falsely homozygous regions.
DeepPolisher's training data comes from the genome of a human cell line donated to the Personal Genomes Project. This reference genome has been thoroughly characterized by the National Institute of Standards and Technology (NIST) and the National Human Genome Research Institute (NHGRI) and validated using various sequencing technologies, with an estimated completeness of 100% and an estimated accuracy of 99.99999%.
The research team used human chromosomes 1-19 for training, chromosomes 21 and 22 for model selection, and held out chromosome 20 for the final accuracy evaluation.
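The chromosome-level split can be written down as a small lookup. This is only a sketch: the `chr`-prefixed names, the split labels, and the `split_for` helper are assumptions for illustration, and chromosomes the article does not mention are marked as unused.

```python
# Chromosome-level data split as described in the article.
TRAIN = {f"chr{i}" for i in range(1, 20)}  # chromosomes 1-19
TUNE = {"chr21", "chr22"}                  # model selection
TEST = {"chr20"}                           # held-out final evaluation

def split_for(chrom):
    """Return the split a chromosome belongs to (hypothetical helper)."""
    if chrom in TRAIN:
        return "train"
    if chrom in TUNE:
        return "tune"
    if chrom in TEST:
        return "test"
    return "unused"  # e.g. sex chromosomes, which the article does not mention

for name in ("chr1", "chr20", "chr22", "chrX"):
    print(name, split_for(name))
```

Splitting by whole chromosome keeps examples from the evaluation chromosome out of training entirely.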
The model input has four channels: the sequenced bases, the base quality scores reported by the sequencer, the mapping quality of each read alignment, and a marker for bases that mismatch the assembly. From this input, DeepPolisher classifies assembly errors and proposes corrections, which are then applied to the genome assembly.
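A rough sketch of how such a four-channel input might be laid out is shown below. This is not DeepPolisher's actual encoding: the `encode_pileup` function, the read dictionary fields, the integer base codes, and the gap-free alignment are all assumptions chosen to keep the example small.

```python
BASE_CODE = {"A": 1, "C": 2, "G": 3, "T": 4}  # 0 means "no read base here"
CHANNELS = ("base", "base_quality", "mapping_quality", "mismatch")

def encode_pileup(assembly, reads):
    """Build one reads-by-positions matrix per channel (toy encoding).

    Each read is a dict with hypothetical fields: `start`, `bases`,
    `quals` (one score per base) and `mapq` (one score per read).
    """
    channels = {name: [[0] * len(assembly) for _ in reads] for name in CHANNELS}
    for row, read in enumerate(reads):
        for offset, base in enumerate(read["bases"]):
            col = read["start"] + offset
            if not 0 <= col < len(assembly):
                continue  # ignore bases hanging off the window
            channels["base"][row][col] = BASE_CODE.get(base, 0)
            channels["base_quality"][row][col] = read["quals"][offset]
            channels["mapping_quality"][row][col] = read["mapq"]
            channels["mismatch"][row][col] = int(base != assembly[col])
    return channels

assembly = "ACGT"
reads = [
    {"start": 0, "bases": "ACTT", "quals": [40, 40, 12, 40], "mapq": 60},
    {"start": 1, "bases": "CGT", "quals": [38, 39, 40], "mapq": 20},
]
encoded = encode_pileup(assembly, reads)
print(encoded["mismatch"])
```

Stacking the channels this way lets a model weigh a mismatch against how reliable the base call and the alignment are.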
Performance
DeepPolisher can reduce genome assembly errors by about 50%. The improvement is particularly large for insertion-deletion (indel) errors, which decrease by over 70%.
Correcting indel errors is crucial because inserting or deleting bases can shift a gene's reading frame (a "frameshift"). This can cause genome annotation programs to miss the affected genes, which in turn affects the results reported in clinical analysis or drug development.
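The frameshift effect is easy to see on a toy sequence. The sketch below translates a short made-up coding sequence with the standard genetic code, then deletes a single base and translates again; the sequence and the `translate` helper are illustrative assumptions, not data from the study.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)}

def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    usable = len(dna) - len(dna) % 3  # drop a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

reference = "ATGGCCAAGGTTTGA"                  # toy sequence: ATG GCC AAG GTT TGA
with_deletion = reference[:4] + reference[5:]  # a single base is lost

print(translate(reference))      # MAKV*
print(translate(with_deletion))  # MARF
```

After the deletion, every downstream codon changes and the stop codon is no longer read in frame, which is why a one-base assembly error can hide an entire gene from an annotation tool.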
To evaluate DeepPolisher, the research team applied it to 180 assemblies from the new data release of the Human Pangenome Reference Consortium (HPRC). By cross-checking each assembly against data from the same sample produced with different sequencing technologies, they identified nucleotide combinations in the assembled sequences that are likely errors. For the majority of the genome, the predicted quality value (QV) rose from an average of Q66.7 to Q70.1, an average improvement of 3.4 (equivalent to a 54% reduction in error rate), and every evaluated sample improved.
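The link between a QV gain of 3.4 and a 54% error reduction follows from the Phred scale, where QV = -10 * log10(error rate). A short check, assuming that standard definition applies to the predicted QV reported here (the function names are illustrative):

```python
def qv_to_error_rate(qv):
    """Per-base error rate implied by a Phred-scaled quality value."""
    return 10 ** (-qv / 10)

def error_reduction(qv_before, qv_after):
    """Fraction of errors removed when QV rises from qv_before to qv_after."""
    return 1 - 10 ** (-(qv_after - qv_before) / 10)

before, after = 66.7, 70.1
print(f"error rate before: {qv_to_error_rate(before):.2e}")
print(f"error rate after:  {qv_to_error_rate(after):.2e}")
print(f"reduction: {error_reduction(before, after):.0%}")  # 54%
```

Because the scale is logarithmic, a gain of about 3 QV points corresponds to halving the error rate, and a gain of 10 points to a tenfold reduction.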
DeepPolisher is already in practical use. In May of this year, HPRC announced its second data release, which was processed with DeepPolisher. Single-nucleotide and indel error rates were halved, ultimately achieving an extremely low error rate of less than one error per 500,000 assembled bases.
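To put that rate in absolute terms, the sketch below scales it to the roughly 3 billion nucleotides of the human genome mentioned earlier. The helper name is an assumption, and the result is an upper bound, since the article gives the rate as "less than" one error per 500,000 bases.

```python
def max_expected_errors(genome_length, bases_per_error):
    """Upper bound on the error count at one error per `bases_per_error` bases."""
    return genome_length / bases_per_error

HUMAN_GENOME_BASES = 3_000_000_000  # approximate figure used in the article

print(max_expected_errors(HUMAN_GENOME_BASES, 500_000))  # 6000.0
```

Even at this low rate, a genome-scale assembly can still contain a few thousand base-level errors, which is why further gains from polishing remain worthwhile.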
Google states that DeepPolisher is being released as an open-source tool to make it more widely available to the research community. DeepPolisher will continue to optimize genomic resources for the scientific community.