Transposable elements (TEs) are repetitive DNA sequences that can change their position within a genome, and they can occupy large proportions of eukaryotic genomes. Many tools have been developed for de novo TE identification, including RepeatModeler, EDTA, REPET, HiTE, and EarlGrey. However, a high-quality TE annotation still requires manual curation by experts, which is very time-consuming. We developed TETrimmer, a software tool that automates the main tasks of manual TE curation.
Because sequence divergence among TE subfamilies can be small, a single file often contains more than one subfamily after BLASTn and multiple sequence alignment (MSA). TETrimmer combines a maximum-likelihood phylogenetic tree with DBSCAN clustering to separate each MSA into groups based on sequence relatedness. TEs annotated by de novo software can also be fragmented, so TETrimmer automatically identifies an appropriate extension size, cleans the MSA, and defines TE boundaries. Its cleaning module uses a new algorithm to efficiently remove gaps and poorly conserved regions from the MSA. Finally, TETrimmer provides a graphical user interface that allows users to easily review and modify its outputs. So far, we have tested TETrimmer on Drosophila melanogaster, Danio rerio, Oryza sativa, Zea mays, Blumeria hordei, and Homo sapiens. Compared with the direct RepeatModeler2 outputs, TETrimmer can substantially improve TE annotation quality.
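To make the idea of MSA cleaning concrete, the following is a minimal sketch of column-wise filtering, not TETrimmer's actual algorithm. The function name `clean_msa` and the thresholds `max_gap_frac` and `min_conservation` are hypothetical and chosen only for illustration: a column is dropped if it is mostly gaps, or if its most common base is too rare among the non-gap characters.

```python
from collections import Counter


def clean_msa(seqs, max_gap_frac=0.8, min_conservation=0.5):
    """Drop alignment columns that are mostly gaps or poorly conserved.

    Illustrative sketch only; names and thresholds are assumptions,
    not TETrimmer's implementation.
    """
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")

    keep = []
    for col in range(length):
        column = [s[col].upper() for s in seqs]
        # Fraction of sequences that have a gap at this position.
        gap_frac = column.count("-") / len(column)
        if gap_frac > max_gap_frac:
            continue
        bases = [b for b in column if b != "-"]
        if not bases:
            continue
        # Conservation = share of the most common base among non-gap characters.
        top_count = Counter(bases).most_common(1)[0][1]
        if top_count / len(bases) < min_conservation:
            continue
        keep.append(col)

    # Rebuild every sequence from the retained columns only.
    return ["".join(s[c] for c in keep) for s in seqs]


msa = ["ACG-TA", "ACG-TC", "ACGATG", "ACG-TT"]
cleaned = clean_msa(msa, max_gap_frac=0.5, min_conservation=0.5)
print(cleaned)
```

In this toy alignment, the fourth column is removed because three of four sequences carry a gap, and the last column is removed because no base reaches the conservation threshold. Real alignments would typically be read with an alignment parser rather than written as string literals.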
- Poster