TaxID-mapping file


Why is it needed?

  • During taxonomic annotation, BlobTools uses a nodesDB file to infer the taxonomy at each rank for every hit in a hits file, based on the TaxID of the hit.
  • Sometimes a hits file does not include the NCBI TaxIDs of the subject sequences.
  • Using the BlobTools module taxify and a TaxID-mapping file, sequence IDs (of the subject) in hits files can be annotated with TaxIDs.

Format

  • Any TSV file in which one column lists the sequence ID (of a subject) and another column lists the NCBI TaxID.
  • An example of a UniProt TaxID-mapping file:
Q6GZX4	NCBI_TaxID	654924
Q6GZX3	NCBI_TaxID	654924
Q197F8	NCBI_TaxID	345201
Q197F7	NCBI_TaxID	345201
Q6GZX2	NCBI_TaxID	654924
Q6GZX1	NCBI_TaxID	654924
Q197F5	NCBI_TaxID	345201
Q6GZX0	NCBI_TaxID	654924
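
To illustrate the format, here is a minimal Python sketch that reads such a file into a dictionary. The function name `read_taxid_mapping` and its parameters are hypothetical and not part of BlobTools. Column indices are zero-based, so in the example above the sequence ID is in column 0 and the TaxID in column 2.

```python
import io


def read_taxid_mapping(handle, seqid_col=0, taxid_col=2):
    """Return {sequence_id: taxid} from a tab-separated mapping file.

    Hypothetical helper for illustration only; not BlobTools code.
    """
    mapping = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        mapping[fields[seqid_col]] = int(fields[taxid_col])
    return mapping


# Two lines taken from the UniProt example above
example = io.StringIO(
    "Q6GZX4\tNCBI_TaxID\t654924\n"
    "Q197F8\tNCBI_TaxID\t345201\n"
)
taxids = read_taxid_mapping(example)
print(taxids["Q6GZX4"])  # 654924
```

Because the columns are passed as parameters, the same sketch would work for any TSV layout, as long as the two column indices are given correctly.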

Use with BlobTools taxify

Assume a TaxID-mapping file from UniProt (uniprot_ref_proteomes.taxids) and a Diamond blastx search result against UniProt Reference Proteomes (diamond.out):

contig_1        P41846  66.9    885     42      10      8878    6224    127     760     7.5e-257        897.1
contig_2        O62095  48.2    502     42      7       4342    2837    104     387     5.2e-101        380.2
contig_3        Q19945  49.4    969     64      15      18326   15420   308     850     1.8e-189        674.5
contig_4        Q9TZG1  95.6    389     0       1       7647    8813    1       372     1.2e-202        718.0

BlobTools taxify can be run as follows:

# -s: column of the subject sequence ID in the TaxID-mapping file
# -t: column of the TaxID in the TaxID-mapping file
blobtools taxify \
 -f diamond.out \
 -m uniprot_ref_proteomes.taxids \
 -s 0 \
 -t 2

This generates a taxified Diamond output file, listing the query ID, the TaxID, the bitscore and the subject sequence ID:

contig_1        6239    897.1   P41846
contig_2        6239    380.2   O62095
contig_3        6239    674.5   Q19945
contig_4        6239    718.0   Q9TZG1
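
Conceptually, this step looks up each subject sequence ID in the mapping and writes out its TaxID. The following Python sketch shows the idea only; it is not the actual BlobTools implementation. The function `taxify_hits` is hypothetical. It assumes the 12-column tabular layout shown above (query in column 0, subject in column 1, bitscore in column 11) and produces the four-column output shown above. Skipping subjects that are missing from the mapping is an assumption of this sketch, not documented BlobTools behaviour.

```python
def taxify_hits(hit_lines, taxid_by_subject):
    """Yield 'query<TAB>taxid<TAB>score<TAB>subject' for each mappable hit.

    Hypothetical helper for illustration only; not BlobTools code.
    """
    for line in hit_lines:
        fields = line.split()
        if len(fields) < 12:
            continue  # not a complete 12-column hit line
        query, subject, score = fields[0], fields[1], fields[11]
        taxid = taxid_by_subject.get(subject)
        if taxid is None:
            continue  # assumption: hits without a known TaxID are skipped
        yield "\t".join([query, str(taxid), score, subject])


# Two hits from diamond.out and their TaxIDs, as in the example above
hits = [
    "contig_1\tP41846\t66.9\t885\t42\t10\t8878\t6224\t127\t760\t7.5e-257\t897.1",
    "contig_4\tQ9TZG1\t95.6\t389\t0\t1\t7647\t8813\t1\t372\t1.2e-202\t718.0",
]
mapping = {"P41846": 6239, "Q9TZG1": 6239}
taxified = list(taxify_hits(hits, mapping))
for row in taxified:
    print(row)
```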