quarTeT: Telomere-to-telomere Toolkit

This page largely mirrors the GitHub README.

However, it is updated less frequently than the GitHub version.

quarTeT is a collection of tools for T2T genome assembly and basic analysis in an automated workflow.

Tasks include:

AssemblyMapper : reference-guided genome assembly
GapFiller : long-read-based gap filling
TeloExplorer : telomere identification
CentroMiner : centromere candidate prediction

Getting Started

Dependencies

Python3 (>3.6, tested on 3.7.4 and 3.9.12)
Minimap2 (tested on 2.24-r1122 and 2.24-r1155-dirty)
MUMmer4 (tested on 4.0.0rc1)
trf (tested on 4.09)
CD-hit (tested on 4.6 and 4.8.1)
BLAST+ (tested on 2.8.1 and 2.11.0)
tidk (tested on 0.2.1 and 0.2.31)
gnuplot (tested on 4.6 patchlevel 2 and 6)
R (>3.5.0, tested on 3.6.0 and 4.2.2)
RIdeogram (tested on 0.2.2)
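
Before the first run, it can help to confirm that the external programs are reachable on PATH. The following is a minimal sketch, not part of quarTeT. The executable names in ASSUMED_EXECUTABLES are assumptions (for example, the trf binary is often installed under a versioned name), so adjust them to your system. It does not check versions or the RIdeogram R package.

```python
import shutil

# Assumed executable names for the dependencies listed above.
# These names are NOT taken from quarTeT itself; edit them as needed.
ASSUMED_EXECUTABLES = [
    "python3", "minimap2", "nucmer", "delta-filter", "trf",
    "cd-hit-est", "blastn", "tidk", "gnuplot", "Rscript",
]


def find_missing(executables):
    """Return the names from `executables` that are not found on PATH."""
    return [name for name in executables if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing(ASSUMED_EXECUTABLES)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All assumed executables were found.")
```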

Installation

quarTeT does not require installation.

Download quarTeT
Extract the files with tar -xf {path}/quartet.tar.gz
Run python3 {path}/quartet.py to start.

Usage

quarTeT: Telomere-to-telomere Toolkit

Usage: python3 quartet.py <module> <parameters>

Modules:

AssemblyMapper	am	Assemble draft genome.
GapFiller	gf	Fill gaps in draft genome.
TeloExplorer	te	Identify telomeres.
CentroMiner	cm	Identify centromere candidates.

Use <module> -h for module usage.
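
The four modules can be chained, using the output of one as the input of the next. The sketch below only builds the command lines from the flags and output file names documented on this page. The input file names are hypothetical, and it assumes that outputs are written to the current working directory and that the gap-filled genome is the one you want to annotate.

```python
def build_commands(reference, contigs, gap_closer, prefix="quarTeT",
                   threads=1, quartet="quartet.py"):
    """Build argument lists for an AssemblyMapper -> GapFiller ->
    TeloExplorer -> CentroMiner run. Nothing is executed here."""
    draft = f"{prefix}.draftgenome.fasta"      # AssemblyMapper output
    filled = f"{prefix}.genome.filled.fasta"   # GapFiller output
    base = ["python3", quartet]
    t = str(threads)
    return [
        base + ["AssemblyMapper", "-r", reference, "-q", contigs,
                "-p", prefix, "-t", t],
        base + ["GapFiller", "-d", draft, "-g", gap_closer,
                "-p", prefix, "-t", t],
        base + ["TeloExplorer", "-i", filled, "-p", prefix],
        base + ["CentroMiner", "-i", filled, "-p", prefix, "-t", t],
    ]


if __name__ == "__main__":
    # Hypothetical input file names, for illustration only.
    for cmd in build_commands("reference.fasta", "hap1.fasta",
                              "polished_contigs.fasta", threads=8):
        print(" ".join(cmd))
```

Each list can be passed to subprocess.run(cmd, check=True) once the previous step has finished.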

AssemblyMapper

AssemblyMapper is a reference-guided assembly tool.

A phased contig-level assembly and a closely related reference genome are required as input, both in fasta format.

Note that contigs should be phased.

It's recommended to obtain such an assembly using hifiasm.

You can convert the {prefix}.bp.hap1.p_ctg.gfa and {prefix}.bp.hap2.p_ctg.gfa files generated by hifiasm to FASTA format and use each of them separately as input.
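
One way to do this conversion is to write every segment (S) line of the GFA file as a FASTA record. The sketch below is not part of quarTeT. The function names are made up for illustration, and it assumes that the GFA file stores sequences inline in the third column.

```python
def gfa_to_fasta_lines(gfa_lines):
    """Yield FASTA lines for every GFA segment record.

    S lines are tab-separated: record type, segment name, sequence,
    then optional tags. Segments without an inline sequence ("*")
    are skipped.
    """
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[0] == "S" and fields[2] != "*":
            yield f">{fields[1]}"
            yield fields[2]


def convert_gfa_file(gfa_path, fasta_path):
    """Convert one GFA file to FASTA, streaming line by line."""
    with open(gfa_path) as gfa, open(fasta_path, "w") as fasta:
        for out_line in gfa_to_fasta_lines(gfa):
            fasta.write(out_line + "\n")
```

Call convert_gfa_file once per haplotype file, so that each haplotype ends up in its own FASTA file.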

Usage: python3 quartet.py AssemblyMapper <parameters>

-h, --help	show this help message and exit
-r REFERENCE_GENOME	(*Required) Reference genome file, FASTA format.
-q CONTIGS	(*Required) phased contigs file, FASTA format.
-c MIN_CONTIG_LENGTH	Contigs shorter than INT (bp) will be removed, default: 50000
-l MIN_ALIGNMENT_LENGTH	The min alignment length to be selected (bp), default: 10000
-i MIN_ALIGNMENT_IDENTITY	The min alignment identity to be selected (%), default: 90
-p PREFIX	The prefix used on generated files, default: quarTeT
-t THREADS	Number of threads to use, default: 1
-a {minimap2,mummer}	Specify alignment program (support minimap2 and mummer), default: minimap2
--plot	Plot a colinearity graph of the draft genome to reference alignments (takes more time).
--overwrite	Overwrite the existing alignment file instead of reusing it.
--minimapoption MINIMAPOPTION	Pass additional parameters to minimap2 program, default: -x asm5
--nucmeroption NUCMEROPTION	Pass additional parameters to nucmer program.
--deltafilteroption DELTAFILTEROPTION	Pass additional parameters to delta-filter program.
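
To get a feel for what the -l and -i thresholds do, the sketch below applies the same kind of filter to minimap2 PAF output. It is an illustration only: this page does not say how AssemblyMapper measures length and identity, so the sketch assumes the alignment block length (PAF column 11) and matches divided by block length (column 10 / column 11).

```python
def filter_paf(paf_lines, min_length=10000, min_identity=90.0):
    """Yield (query, target, block_length, identity_percent) for PAF
    records that pass both thresholds."""
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record
        matches = int(fields[9])     # column 10: residue matches
        block_len = int(fields[10])  # column 11: alignment block length
        if block_len == 0:
            continue
        identity = 100.0 * matches / block_len
        if block_len >= min_length and identity >= min_identity:
            yield fields[0], fields[5], block_len, round(identity, 2)
```

Lowering -i or -l keeps more alignments, which can help when the reference is more divergent but also lets in weaker placements.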

Output files should be as follows:

{prefix}.draftgenome.fasta	The pseudo-chromosome-level assembly, fasta format.
{prefix}.draftgenome.agp	The structure of this assembly, AGP format.
{prefix}.draftgenome.stat	Statistics for this assembly, including total size and each chromosome's size, GC content, gap count and gap locations.
{prefix}.draftgenome.png	A figure showing the relative lengths of chromosomes and the gap locations in the assembly.
{prefix}.contig.mapinfo	Statistics for the input contigs, including total mapped and discarded sizes and each contig's destination.
{prefix}.contig_map_ref.png	The alignment colinearity graph between the contigs and the reference genome.
{prefix}.draftgenome_map_ref.png	The alignment colinearity graph between this assembly and the reference genome. Only available with --plot.
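
The AGP file can be inspected with a few lines of code. The sketch below lists gap coordinates per chromosome, relying only on the standard AGP layout (tab-separated, with component type N or U in the fifth column marking a gap). It is a reading aid and not part of quarTeT.

```python
from collections import defaultdict


def summarize_agp_gaps(agp_lines):
    """Return {object_name: [(gap_start, gap_end), ...]} from AGP lines.

    Coordinates are 1-based and inclusive, as in the AGP format.
    Objects without gaps do not appear in the result.
    """
    gaps = defaultdict(list)
    for line in agp_lines:
        if line.startswith("#") or not line.strip():
            continue  # comment or blank line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        if fields[4] in ("N", "U"):  # gap of known / unknown size
            gaps[fields[0]].append((int(fields[1]), int(fields[2])))
    return dict(gaps)
```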

GapFiller

GapFiller is a long-read-based gap-filling tool.

A genome containing gaps and the corresponding long reads are required as input, both in fasta format.

If possible, using contigs assembled and polished from long reads instead of the reads themselves may improve the quality.

Usage: python3 quartet.py GapFiller <parameters>

-h, --help	show this help message and exit
-d DRAFT_GENOME	(*Required) Draft genome file to be filled, FASTA format.
-g GAPCLOSER_CONTIG [GAPCLOSER_CONTIG ...]	(*Required) All contig files (accepts multiple files) used to fill gaps, FASTA format.
-f FLANKING_LEN	The flanking seq length of gap used to anchor (bp), default: 5000
-l MIN_ALIGNMENT_LENGTH	The min alignment length to be selected (bp), default: 1000
-i MIN_ALIGNMENT_IDENTITY	The min alignment identity to be selected (%), default: 40
-m MAX_FILLING_LEN	The max sequence length acceptable to fill any gaps, default: 1000000
-p PREFIX	The prefix used on generated files, default: quarTeT
-t THREADS	Number of threads to use, default: 1
--overwrite	Overwrite the existing alignment file instead of reusing it.
--minimapoption MINIMAPOPTION	Pass additional parameters to minimap2 program, default: -x asm5
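
The -f option refers to the sequence on either side of a gap that is used as an anchor. As a rough illustration of that idea, the sketch below locates gaps and extracts their flanks. It assumes gaps are written as runs of N, and it is not GapFiller's actual implementation.

```python
import re


def gap_flanks(sequence, flank_len=5000):
    """Return (gap_start, gap_end, left_flank, right_flank) for each
    run of N in `sequence`. Coordinates are 0-based, end-exclusive.
    Flanks are truncated at the sequence ends."""
    results = []
    for match in re.finditer(r"[Nn]+", sequence):
        start, end = match.span()
        left = sequence[max(0, start - flank_len):start]
        right = sequence[end:end + flank_len]
        results.append((start, end, left, right))
    return results
```

A larger flank gives the aligner more sequence to anchor on, at the cost of requiring longer gap-free sequence next to each gap.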

Output files should be as follows:

{prefix}.genome.filled.fasta	The gap-filled genome, fasta format.
{prefix}.genome.filled.detail	Detailed information for each gap, including which gaps were closed and which remain, total filled size, and each closer's ID, range, etc.
{prefix}.genome.filled.stat	Statistics for the filled genome, including total size and each chromosome's size, GC content, gap count and gap locations.
{prefix}.genome.filled.png	A figure showing the relative lengths of chromosomes and the gap locations in the assembly.

TeloExplorer

TeloExplorer is a telomere identification tool.

A genome file in fasta format is required as input.

Usage: python3 quartet.py TeloExplorer <parameters>

-h, --help	show this help message and exit
-i GENOME	(*Required) Genome file to be identified, FASTA format.
-c {plant,animal,other}	Specify clade of this genome. Plant will search TTTAGGG, animal will search TTAGGG, other will use tidk explore's suggestion, default: other
-m MIN_REPEAT_TIMES	The min number of repeats to be reported, default: 100
-p PREFIX	The prefix used on generated files, default: quarTeT
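
To make the "repeat times" notion concrete, the sketch below counts the longest run of back-to-back monomer copies near each end of a sequence, checking both the monomer and its reverse complement. This is only an illustration under stated assumptions: how TeloExplorer counts repeats is not described on this page, and the window parameter is hypothetical.

```python
import re

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def longest_tandem_run(sequence, monomer):
    """Largest number of back-to-back copies of `monomer` in `sequence`."""
    pattern = re.compile(f"(?:{re.escape(monomer)})+")
    runs = [len(m.group(0)) // len(monomer) for m in pattern.finditer(sequence)]
    return max(runs, default=0)


def telomere_repeat_counts(sequence, monomer, window=50000):
    """Return (count_at_start, count_at_end) within `window` bp of each end."""
    seq = sequence.upper()
    mono = monomer.upper()
    candidates = (mono, reverse_complement(mono))
    head, tail = seq[:window], seq[-window:]
    return (
        max(longest_tandem_run(head, m) for m in candidates),
        max(longest_tandem_run(tail, m) for m in candidates),
    )
```

With such counts, the -m threshold would decide whether an end is reported as carrying a telomere.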

Output files should be as follows:

{prefix}.telo.info	Telomere statistics, including the monomer and its repeat count at both ends of each chromosome.
{prefix}.telo.png	A figure showing telomere locations, alongside the relative lengths of chromosomes and the gap locations in the assembly.

CentroMiner

CentroMiner is a centromere prediction tool.

A genome file in fasta format is required as input.

Optionally, an additional input of TE annotation (or just LTR annotation) in gff3 format can improve the performance.

It's recommended to obtain TE annotation using EDTA.

{genome file}.mod.EDTA.TEanno.gff3 generated by EDTA can be fed directly to CentroMiner, unless any sequence ID is longer than 15 characters.

Note that the sequence IDs in the first column should match those in the genome file. Some tools may change a sequence ID if it is too long.

The sequence ontology in the third column should include "LTR" to be recognized.
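
Both requirements can be checked before running CentroMiner. The sketch below reports gff3 sequence IDs that are missing from the genome and counts features whose third column contains "LTR". It assumes that a FASTA ID is the header text up to the first whitespace and that the match is case-sensitive; neither detail is confirmed on this page.

```python
def fasta_ids(fasta_lines):
    """Collect sequence IDs (header text up to the first whitespace)."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">") and line[1:].strip():
            ids.add(line[1:].split()[0])
    return ids


def check_te_annotation(fasta_lines, gff3_lines):
    """Return (sorted IDs present in the gff3 but not in the genome,
    number of features whose type contains "LTR")."""
    ids = fasta_ids(fasta_lines)
    unknown = set()
    ltr_features = 0
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # directive, comment or blank line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete gff3 feature line
        if fields[0] not in ids:
            unknown.add(fields[0])
        if "LTR" in fields[2]:
            ltr_features += 1
    return sorted(unknown), ltr_features
```

A non-empty first element or a zero count suggests the annotation needs fixing before it is passed with --TE.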

Usage: python3 quartet.py CentroMiner <parameters>

-h, --help	show this help message and exit
-i GENOME_FASTA	(*Required) Genome file, FASTA format.
--TE TE	TE annotation file, gff3 format.
-n MIN_PERIOD	Min period to be considered as centromere repeat monomer. Default: 100
-m MAX_PERIOD	Max period to be considered as centromere repeat monomer. Default: 200
-s CLUSTER_IDENTITY	Min identity between TR monomers to be clustered (Cannot be smaller than 0.8). Default: 0.8
-d CLUSTER_MAX_DELTA	Max period delta for TR monomers in a cluster. Default: 10
-e EVALUE	E-value threshold in blast. Default: 0.00001
-g MAX_GAP	Max allowed gap size between two tandem repeats to be considered as in one tandem repeat region. Default: 50000
-l MIN_LENGTH	Min size of tandem repeat region to be selected as candidate. Default: 100000
-t THREADS	Limit the number of threads used, default: 1
-p PREFIX	Prefix used by generated files. Default: quarTeT
--trf [TRF_PARAMETER ...]	Change TRF parameters: <match> <mismatch> <delta> <PM> <PI> <minscore> Default: 2 7 7 80 10 50
--overwrite	Overwrite the existing trf dat file instead of reusing it.
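
The -g and -l options describe a merge-then-filter step: nearby tandem repeats are joined into one region, and short regions are dropped. The sketch below shows that idea on plain coordinate pairs. It is an interpretation of the option descriptions, not CentroMiner's code, and the function name is made up.

```python
def merge_repeat_regions(intervals, max_gap=50000, min_length=100000):
    """Merge (start, end) intervals separated by at most `max_gap` bp,
    then keep merged regions of at least `min_length` bp."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous region: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged if e - s >= min_length]
```

Raising max_gap tolerates more interruptions inside a candidate region, while raising min_length discards short repeat arrays.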

Output files should be as follows:

{prefix}.best.candidate	The best centromere candidate on each chromosome, and the corresponding monomers.
{prefix}.centro.png	A figure showing the best centromere candidate locations, alongside the relative lengths of chromosomes and the gap locations in the assembly.
candidate/	A folder containing all centromere candidates. Check here if the best candidate doesn't look right.
TRfasta/	A folder containing all tandem repeat monomers identified by trf and the clustering results for each chromosome.
TRgff3/	A folder containing all tandem repeat hits found by BLAST on each chromosome, in gff3 format.
