TaxID-mapping file
Why is it needed?
- During the taxonomic annotation process, BlobTools uses a nodesDB file to infer the taxonomy at each rank for each hit in a hits file, based on the TaxID of the hit.
- Sometimes a hits file does not include the NCBI TaxIDs of the subject sequences.
- Using the BlobTools module taxify and a TaxID-mapping file, the subject sequence IDs in a hits file can be annotated with TaxIDs.
Format
- Any type of TSV file, in which one column lists a sequence ID (of a subject) and another the NCBI TaxID
- An example of a UniProt TaxID-mapping file:
Q6GZX4 NCBI_TaxID 654924
Q6GZX3 NCBI_TaxID 654924
Q197F8 NCBI_TaxID 345201
Q197F7 NCBI_TaxID 345201
Q6GZX2 NCBI_TaxID 654924
Q6GZX1 NCBI_TaxID 654924
Q197F5 NCBI_TaxID 345201
Q6GZX0 NCBI_TaxID 654924
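To make the format concrete, here is a minimal Python sketch of how such a file could be read into a lookup table. It is an illustration only, not BlobTools' own code: the function name load_taxid_mapping and its parameters are hypothetical, and the file is assumed to be tab-separated with 0-based column indices (sequence ID in column 0, TaxID in column 2, as in the UniProt example above).

```python
import csv
import io


def load_taxid_mapping(handle, seqid_col=0, taxid_col=2):
    """Return {sequence ID: TaxID} from a tab-separated mapping file.

    Hypothetical helper for illustration; column indices are 0-based.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) <= max(seqid_col, taxid_col):
            continue  # skip blank lines and lines with too few columns
        mapping[row[seqid_col]] = row[taxid_col]
    return mapping


# Two rows taken from the UniProt example, with tabs as separators.
example = io.StringIO(
    "Q6GZX4\tNCBI_TaxID\t654924\n"
    "Q197F8\tNCBI_TaxID\t345201\n"
)
mapping = load_taxid_mapping(example)
print(mapping["Q197F8"])  # prints 345201
```

Because the two columns are passed as indices, the same sketch works for any TSV layout in which the sequence ID and the TaxID sit in different columns.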
Use with BlobTools taxify
Assuming the TaxID-mapping file from UniProt (uniprot_ref_proteomes.taxids) and a Diamond blastx search result against UniProt Reference Proteomes (diamond.out):
contig_1 P41846 66.9 885 42 10 8878 6224 127 760 7.5e-257 897.1
contig_2 O62095 48.2 502 42 7 4342 2837 104 387 5.2e-101 380.2
contig_3 Q19945 49.4 969 64 15 18326 15420 308 850 1.8e-189 674.5
contig_4 Q9TZG1 95.6 389 0 1 7647 8813 1 372 1.2e-202 718.0
BlobTools taxify can be run as follows:
# -s: column of the subject sequence ID in the TaxID-mapping file (0-based)
# -t: column of the TaxID in the TaxID-mapping file (0-based)
blobtools taxify \
-f diamond.out \
-m uniprot_ref_proteomes.taxids \
-s 0 \
-t 2
This generates a taxified Diamond output file, listing for each hit the query ID, the TaxID, the bitscore, and the subject sequence ID:
contig_1 6239 897.1 P41846
contig_2 6239 380.2 O62095
contig_3 6239 674.5 Q19945
contig_4 6239 718.0 Q9TZG1
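The transformation from the Diamond result to the taxified output can be sketched in Python as below. This is a rough illustration under stated assumptions, not the BlobTools implementation: taxify_hits is a hypothetical name, the hits are assumed to be in Diamond's 12-column tabular layout (subject ID in column 1, bitscore in column 11, 0-based), and hits whose subject is missing from the mapping are simply skipped here, which may differ from what BlobTools does.

```python
def taxify_hits(hit_lines, mapping, subject_col=1, score_col=11):
    """Yield (query, taxid, score, subject) for each hit found in mapping.

    Hypothetical helper for illustration; column indices are 0-based.
    """
    for line in hit_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(subject_col, score_col):
            continue  # skip blank or truncated lines
        subject = fields[subject_col]
        taxid = mapping.get(subject)
        if taxid is None:
            continue  # assumption: unmapped subjects are dropped in this sketch
        yield fields[0], taxid, fields[score_col], subject


# Two hits from diamond.out above, re-joined with tabs.
hits = [
    "\t".join("contig_1 P41846 66.9 885 42 10 8878 6224 127 760 7.5e-257 897.1".split()),
    "\t".join("contig_2 O62095 48.2 502 42 7 4342 2837 104 387 5.2e-101 380.2".split()),
]
# TaxIDs as they appear in the taxified output above.
mapping = {"P41846": "6239", "O62095": "6239"}

rows = list(taxify_hits(hits, mapping))
for row in rows:
    print("\t".join(row))
```

Each printed line has the same column order as the taxified file shown above: query ID, TaxID, bitscore, subject ID.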