LTR-checker: deep-learning (DL) guided structural identification of LTR retrotransposons with reduced computational demand and high flexibility
1 : Department of Plant Physiology, Umea Plant Science Centre, Umea University
Linnaeus väg 6 -
Sweden
- LTR retrotransposons (LTR-RTs), a ubiquitous and dominant component of plant genomes, play critical roles in functional variation, genome plasticity and evolution. However, the computational demand of LTR-RT identification remains high, especially for large genomes such as those of wheat (5-15 Gb) and conifers (10-40 Gb). The rapid accumulation of whole-genome sequences driven by ever-advancing sequencing technologies calls for efficient computational tools for LTR-RT identification.
- Here, we developed a deep-learning (DL) guided LTR-RT identification method, LTR-checker, which provides reduced computational demand and high flexibility. (1) We collected full-length LTR-RTs from most released plant whole genomes, covering a broad phylogenetic range and thus capturing broad LTR-RT diversity, and constructed a consensus LTR-RT dataset. (2) Using this consensus LTR-RT dataset, we built a convolutional neural network (CNN) model to predict the occurrence of LTR-RTs. We then used the DL model as a guide that directs the finer structural identification procedure to potential LTR-RT locations in the whole genome, thereby decreasing the computational demand. (3) We examined the performance of LTR-checker by comparing it with popular LTR-RT identification methods.
- A dataset of 200,000 consensus LTR-RTs was generated and is publicly available at: https://zenodo.org/records/10454902. Our DL model shows very promising accuracy in LTR-RT prediction. Our new method, LTR-checker, performs competitively with popular methods (LTR-finder, LTR-harvest, LTR-retriever); for large genomes (maize, wheat, pine) it is 10-20x faster in CPU time and uses 4-5x less RAM. The low computational demand enables high flexibility in LTR-RT identification: loosening the range width settings in a genomic survey of LTR-RTs allows hundreds of nested LTR-RTs to be detected.
- The new method is publicly available at: https://github.com/morningsun77/ltr_checker.
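
The guiding idea described in step (2) can be illustrated with a minimal Python sketch. It is not the LTR-checker implementation: the function name `candidate_regions`, the window and step sizes, the threshold, and the toy scorer standing in for the trained CNN are all assumptions made for illustration. The sketch slides a window along a sequence, keeps windows whose score passes the threshold, and merges them into candidate regions, so that a costlier structural identification step would only need to run on those regions.

```python
from typing import Callable, List, Tuple


def candidate_regions(
    genome: str,
    score_window: Callable[[str], float],
    window: int = 1000,
    step: int = 500,
    threshold: float = 0.5,
) -> List[Tuple[int, int]]:
    """Return merged half-open [start, end) intervals of high-scoring windows.

    `score_window` is a placeholder for a trained classifier (hypothetical
    here); it must map a window sequence to a score in [0, 1].
    """
    regions: List[Tuple[int, int]] = []
    last_start = max(len(genome) - window, 0)
    # Only full-length windows are scanned; a tail shorter than one step
    # may remain unscanned in this simplified sketch.
    for start in range(0, last_start + 1, step):
        end = min(start + window, len(genome))
        if score_window(genome[start:end]) >= threshold:
            if regions and start <= regions[-1][1]:
                # Overlaps or touches the previous hit: extend that region.
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


def toy_score(seq: str) -> float:
    """Toy stand-in for the CNN: fraction of 'G' bases in the window."""
    return seq.count("G") / len(seq) if seq else 0.0


# Synthetic 5 kb sequence with a 1 kb 'G'-rich block in the middle.
genome = "A" * 2000 + "G" * 1000 + "A" * 2000
regions = candidate_regions(genome, toy_score)
print(regions)  # [(1500, 3500)]
```

In this toy run only 2 kb of the 5 kb sequence would be handed to the downstream structural step, which is the kind of search-space reduction the guided approach relies on. Widening `window` or lowering `threshold` trades extra computation for sensitivity.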