Hits file

Why is it needed?

A hits file is a TSV file which links sequence IDs in a assembly to NCBI TaxIDs, with a given score.
These can be the results of sequence similarity searches of the assembly against a sequence database (e.g. BLAST/Diamond output files) or custom input in TSV format.
BlobTools create parses all hits present in a hits-file (i.e. a sequence can have more than one hit) and assigns a taxonomy under a given tax-rule to each sequence in the assembly. This process relies entirely on the hits file(s) provided by the user.
For a more detailed description of the taxonomic annotation process, see taxonomic annotation.

Format

The required format is TAB-separated (TSV) and is composed of three columns (additional columns are allowed, but ignored)

1st column: sequenceID (must be part of the assembly)
2nd column: TaxID (a NCBI TaxID)
3rd column: score (a numerical score)

Generating hits files based on custom input

Using the BlobTools module taxify, the user can create their own Hits file

based on other TSV input, such as the output of other contaminant screening pipelines
based on a single NCBI TaxID and a score

Generating hits files based on sequence similarity searches

Sequence databases against which to search

A wide variety of curated and un-curated sequence collections are publicly available (e.g. NCBI, UniProt, RNAcentral) and the choice of sequence collection against which to search depends on the desired goal.
NCBI BLAST DBs include taxIDs for each of the sequences they contain, for other sequence collections the user will have to annotate hit-files a posteriori using BlobTools taxify and providing a TaxID-mapping file.
The user can also conduct sequence similarity searches against custom BLAST/Diamond sequence database.

Public sequence collections

The collections differ in their inclusion criteria of sequences, phylogenetic sampling and the effort made to reduce redundancy. A selection of commonly used public sequence collections:

NCBI nt (recommended for BLASTn)
entries from traditional divisions of GenBank, EMBL and DDBJ. Sequences from bulk divisions, i.e., gss, sts, pat, est, htg, wgs, and environmental sequences are excluded. RefSeq genomic entries are also excluded.
UniProt Reference Proteomes (recommended for Diamond blastx)
8392 proteomes @ Release 2017_07, 05-Jul-2017
NCBI RefSeq
65,964,245 protein sequences @ Release 77, manually curated
NCBI nr
all non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects
RNAcentral SILVA
4,478,614 rRNA sequences @ version 122
quality checked and aligned ribosomal RNA sequence data

For all available NCBI sequence collections, see NCBI BLAST FTP site

Custom sequence collections

Custom sequence collections can be constructed based on:

subsets of public databases
unpublished reference genomes/proteomes

Setting up sequence databases

NCBI sequence collections

NCBI offers pre-formatted blastDBs for BLAST through the NCBI BLASTDB FTP site
In order to construct DiamondDBs, users have to retrieve the FASTA files for the sequence collections from the NCBI BLASTDB FASTA FTP site and build there own database (see Diamond documentation).

RNAcentral SILVA collection

# Download ID mapping file from RNAcentral release 5.0
wget ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/releases/5.0/id_mapping/id_mapping.tsv.gz

# extract SILVA sequence IDs (this is the TaxID-mapping file)
gunzip -c id_mapping.tsv.gz | grep SILVA | sort | uniq > id_mapping.SILVA.tsv

# generate a list of sequence IDs
cut -f1 id_mapping.SILVA.tsv > id_mapping.SILVA.names.txt

# Download FASTA from RNAcentral release 5.0 (contains all sequences)
wget ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/releases/5.0/sequences/rnacentral_active.fasta.gz
gunzip rnacentral_active.fasta.gz

# Subselect only SILVA rRNA sequences
./blobtools seqfilter -i rnacentral_active.fasta -l id_mapping.SILVA.names.tsv 

# Convert rRNA to rDNA
perl -ne 'chomp; if(m/^>/){@temp=split(" "); print $temp[0]."\n";} else {$_ =~ tr/U/T/; print $_."\n"}' rnacentral_active.filtered.fna > rnacentral_active.SILVA.rDNA.fasta

# make the blastDB
makeblastdb -dbtype nucl -in rnacentral_active.SILVA.rDNA.fasta

Sequence similarity search algorithms

Since the goal is to obtain hits for the sequences in the assembly, the queries of the sequence similarity searches are nucleotide sequences.
The sequence similarity search algorithm determines which type (nucleotide or protein) of sequence collection can be used.
The following sequence similarity search algorithms can be used:


Algorithm	scope	database	speed
BLASTn megablast	used to find very similar (e.g., intraspecies or closely related species) sequences	nucleotide	fastest
BLASTn dc-megablast	used to find more distant (e.g., interspecies) sequences	nucleotide	slow
BLASTx	used to find distant sequences	protein	slowest
Diamond blastx	used to find distant sequences	protein	faster than BLASTx

Example commands

blastn \
-task megablast \
-query assembly.fna \
-db nt \
-outfmt '6 qseqid staxids bitscore std' \
-max_target_seqs 1 \
-max_hsps 1 \
-num_threads 16 \
-evalue 1e-25 \
-out assembly.vs.nt.mts1.hsp1.1e25.megablast.out

blastn \
-task dc-megablast \
-query assembly.fna \
-db nt \
-outfmt '6 qseqid staxids bitscore std sscinames sskingdoms stitle' \
-max_target_seqs 1 \
-max_hsps 1 \
-num_threads 16 \
-evalue 1e-25 \
-out assembly.vs.nt.mts1.hsp1.1e25.dc_megablast.out

diamond blastx \
--query assembly.fna \
--max-target-seqs 1 \
--sensitive \
--threads 16 \
--db uniprot_ref_proteomes.diamond.dmnd \
--evalue 1e-25 \
--outfmt 6 \
--out assembly.vs.uniprot_ref.mts1.1e25.out

# taxify results
blobtools taxify -f assembly.vs.uniprot_ref.mts1.1e25.out -m uniprot_ref_proteomes.taxids -s 0 -t 2

blastn \
-task megablast \
-query assembly.fna \
-db rnacentral_active.SILVA.rDNA.fasta \
-outfmt '6 qseqid staxids bitscore std' \
-num_threads 48 \
-evalue 1e-65 \
-out assembly.vs.silva.1e65.megablast.out

# filtering the BLAST result further based on pID, ALN and Evalue is recommended
awk '$5 >= 97 && $6 >= 500 && $13 <= 1e-150' \ assembly.vs.silva.1e65.megablast.out \
> assembly.vs.silva.1e65.megablast.stringent.out

Recommendations

Best tradeoff between speed and accuracy of BlobTools taxonomy assignment is (usually) achieved with the following searches, supplied in this order and using taxrule ‘bestsumorder’.

blastn \
 -query $ASSEMBLY \
 -db nt \
 -outfmt ’6 qseqid staxids bitscore std’ \
 -max_target_seqs 1 \
 -max_hsps 1 \
 -evalue 1e-25

diamond blastx \
 --query $ASSEMBLY \
 --db uniprot_ref_proteomes.diamond.dmnd \
 --outfmt 6 \
 --sensitive \
 --max-target-seqs 1 \
 --evalue 1e-25

Most accurate results are obtained with the following searches, supplied in this order and using taxrule ‘bestsumorder’.

blastn \
 -query $ASSEMBLY \
 -db nt \
 -outfmt ’6 qseqid staxids bitscore std’ \
 -max_target_seqs 10 \
 -max_hsps 1 \
 -evalue 1e-25

diamond blastx \
 --query $ASSEMBLY \
 --db uniprot_ref_proteomes.diamond.dmnd \
 --outfmt 6 \
 --sensitive \
 --max-target-seqs 1 \
 --evalue 1e-25