Hits file


Why is it needed?

  • A hits file is a TSV file which links sequence IDs in a assembly to NCBI TaxIDs, with a given score.
  • These can be the results of sequence similarity searches of the assembly against a sequence database (e.g. BLAST/Diamond output files) or custom input in TSV format.
  • BlobTools create parses all hits present in a hits-file (i.e. a sequence can have more than one hit) and assigns a taxonomy under a given tax-rule to each sequence in the assembly. This process relies entirely on the hits file(s) provided by the user.
  • For a more detailed description of the taxonomic annotation process, see taxonomic annotation.

Format

The required format is TAB-separated (TSV) and is composed of three columns (additional columns are allowed, but ignored)

  • 1st column: sequenceID (must be part of the assembly)
  • 2nd column: TaxID (a NCBI TaxID)
  • 3rd column: score (a numerical score)

Generating hits files based on custom input

Using the BlobTools module taxify, the user can create their own Hits file

  • based on other TSV input, such as the output of other contaminant screening pipelines
  • based on a single NCBI TaxID and a score

Generating hits files based on sequence similarity searches


Sequence databases against which to search

  • A wide variety of curated and un-curated sequence collections are publicly available (e.g. NCBI, UniProt, RNAcentral) and the choice of sequence collection against which to search depends on the desired goal.
  • NCBI BLAST DBs include taxIDs for each of the sequences they contain, for other sequence collections the user will have to annotate hit-files a posteriori using BlobTools taxify and providing a TaxID-mapping file.
  • The user can also conduct sequence similarity searches against custom BLAST/Diamond sequence database.

Public sequence collections

The collections differ in their inclusion criteria of sequences, phylogenetic sampling and the effort made to reduce redundancy. A selection of commonly used public sequence collections:

  • NCBI nt (recommended for BLASTn)
  • entries from traditional divisions of GenBank, EMBL and DDBJ. Sequences from bulk divisions, i.e., gss, sts, pat, est, htg, wgs, and environmental sequences are excluded. RefSeq genomic entries are also excluded.
  • UniProt Reference Proteomes (recommended for Diamond blastx)
  • 8392 proteomes @ Release 2017_07, 05-Jul-2017
  • NCBI RefSeq
  • 65,964,245 protein sequences @ Release 77, manually curated
  • NCBI nr
  • all non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects
  • RNAcentral SILVA
  • 4,478,614 rRNA sequences @ version 122
  • quality checked and aligned ribosomal RNA sequence data

For all available NCBI sequence collections, see NCBI BLAST FTP site


Custom sequence collections

Custom sequence collections can be constructed based on:

  • subsets of public databases
  • unpublished reference genomes/proteomes

Setting up sequence databases

NCBI sequence collections


RNAcentral SILVA collection

# Download ID mapping file from RNAcentral release 5.0
wget ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/releases/5.0/id_mapping/id_mapping.tsv.gz

# extract SILVA sequence IDs (this is the TaxID-mapping file)
gunzip -c id_mapping.tsv.gz | grep SILVA | sort | uniq > id_mapping.SILVA.tsv

# generate a list of sequence IDs
cut -f1 id_mapping.SILVA.tsv > id_mapping.SILVA.names.txt

# Download FASTA from RNAcentral release 5.0 (contains all sequences)
wget ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/releases/5.0/sequences/rnacentral_active.fasta.gz
gunzip rnacentral_active.fasta.gz

# Subselect only SILVA rRNA sequences
./blobtools seqfilter -i rnacentral_active.fasta -l id_mapping.SILVA.names.tsv 

# Convert rRNA to rDNA
perl -ne 'chomp; if(m/^>/){@temp=split(" "); print $temp[0]."\n";} else {$_ =~ tr/U/T/; print $_."\n"}' rnacentral_active.filtered.fna > rnacentral_active.SILVA.rDNA.fasta

# make the blastDB
makeblastdb -dbtype nucl -in rnacentral_active.SILVA.rDNA.fasta

Sequence similarity search algorithms

  • Since the goal is to obtain hits for the sequences in the assembly, the queries of the sequence similarity searches are nucleotide sequences.
  • The sequence similarity search algorithm determines which type (nucleotide or protein) of sequence collection can be used.
  • The following sequence similarity search algorithms can be used:
Algorithmscopedatabasespeed
BLASTn megablastused to find very similar (e.g., intraspecies or closely related species) sequencesnucleotidefastest
BLASTn dc-megablastused to find more distant (e.g., interspecies) sequencesnucleotideslow
BLASTxused to find distant sequencesproteinslowest
Diamond blastxused to find distant sequencesproteinfaster than BLASTx

Example commands

blastn \
-task megablast \
-query assembly.fna \
-db nt \
-outfmt '6 qseqid staxids bitscore std' \
-max_target_seqs 1 \
-max_hsps 1 \
-num_threads 16 \
-evalue 1e-25 \
-out assembly.vs.nt.mts1.hsp1.1e25.megablast.out
blastn \
-task dc-megablast \
-query assembly.fna \
-db nt \
-outfmt '6 qseqid staxids bitscore std sscinames sskingdoms stitle' \
-max_target_seqs 1 \
-max_hsps 1 \
-num_threads 16 \
-evalue 1e-25 \
-out assembly.vs.nt.mts1.hsp1.1e25.dc_megablast.out
diamond blastx \
--query assembly.fna \
--max-target-seqs 1 \
--sensitive \
--threads 16 \
--db uniprot_ref_proteomes.diamond.dmnd \
--evalue 1e-25 \
--outfmt 6 \
--out assembly.vs.uniprot_ref.mts1.1e25.out

# taxify results
blobtools taxify -f assembly.vs.uniprot_ref.mts1.1e25.out -m uniprot_ref_proteomes.taxids -s 0 -t 2
blastn \
-task megablast \
-query assembly.fna \
-db rnacentral_active.SILVA.rDNA.fasta \
-outfmt '6 qseqid staxids bitscore std' \
-num_threads 48 \
-evalue 1e-65 \
-out assembly.vs.silva.1e65.megablast.out

# filtering the BLAST result further based on pID, ALN and Evalue is recommended
awk '$5 >= 97 && $6 >= 500 && $13 <= 1e-150' \ assembly.vs.silva.1e65.megablast.out \
> assembly.vs.silva.1e65.megablast.stringent.out

Recommendations

Best tradeoff between speed and accuracy of BlobTools taxonomy assignment is (usually) achieved with the following searches, supplied in this order and using taxrule ‘bestsumorder’.

blastn \
 -query $ASSEMBLY \
 -db nt \
 -outfmt ’6 qseqid staxids bitscore std’ \
 -max_target_seqs 1 \
 -max_hsps 1 \
 -evalue 1e-25

diamond blastx \
 --query $ASSEMBLY \
 --db uniprot_ref_proteomes.diamond.dmnd \
 --outfmt 6 \
 --sensitive \
 --max-target-seqs 1 \
 --evalue 1e-25

Most accurate results are obtained with the following searches, supplied in this order and using taxrule ‘bestsumorder’.

blastn \
 -query $ASSEMBLY \
 -db nt \
 -outfmt ’6 qseqid staxids bitscore std’ \
 -max_target_seqs 10 \
 -max_hsps 1 \
 -evalue 1e-25

diamond blastx \
 --query $ASSEMBLY \
 --db uniprot_ref_proteomes.diamond.dmnd \
 --outfmt 6 \
 --sensitive \
 --max-target-seqs 1 \
 --evalue 1e-25