Hits file
Why is it needed?
- A hits file is a TSV file which links sequence IDs in a assembly to NCBI TaxIDs, with a given score.
- These can be the results of sequence similarity searches of the assembly against a sequence database (e.g. BLAST/Diamond output files) or custom input in TSV format.
- BlobTools
create
parses all hits present in a hits-file (i.e. a sequence can have more than one hit) and assigns a taxonomy under a given tax-rule to each sequence in the assembly. This process relies entirely on the hits file(s) provided by the user. - For a more detailed description of the taxonomic annotation process, see taxonomic annotation.
Format
The required format is TAB-separated (TSV) and is composed of three columns (additional columns are allowed, but ignored)
- 1st column: sequenceID (must be part of the assembly)
- 2nd column: TaxID (a NCBI TaxID)
- 3rd column: score (a numerical score)
Generating hits files based on custom input
Using the BlobTools module taxify
, the user can create their own Hits file
- based on other TSV input, such as the output of other contaminant screening pipelines
- based on a single NCBI TaxID and a score
Generating hits files based on sequence similarity searches
Sequence databases against which to search
- A wide variety of curated and un-curated sequence collections are publicly available (e.g. NCBI, UniProt, RNAcentral) and the choice of sequence collection against which to search depends on the desired goal.
- NCBI BLAST DBs include taxIDs for each of the sequences they contain, for other sequence collections the user will have to annotate hit-files a posteriori using BlobTools
taxify
and providing a TaxID-mapping file. - The user can also conduct sequence similarity searches against custom BLAST/Diamond sequence database.
Public sequence collections
The collections differ in their inclusion criteria of sequences, phylogenetic sampling and the effort made to reduce redundancy. A selection of commonly used public sequence collections:
- NCBI nt (recommended for BLASTn)
- entries from traditional divisions of GenBank, EMBL and DDBJ. Sequences from bulk divisions, i.e., gss, sts, pat, est, htg, wgs, and environmental sequences are excluded. RefSeq genomic entries are also excluded.
- UniProt Reference Proteomes (recommended for Diamond blastx)
- 8392 proteomes @ Release 2017_07, 05-Jul-2017
- NCBI RefSeq
- 65,964,245 protein sequences @ Release 77, manually curated
- NCBI nr
- all non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects
- RNAcentral SILVA
- 4,478,614 rRNA sequences @ version 122
- quality checked and aligned ribosomal RNA sequence data
For all available NCBI sequence collections, see NCBI BLAST FTP site
Custom sequence collections
Custom sequence collections can be constructed based on:
- subsets of public databases
- unpublished reference genomes/proteomes
Setting up sequence databases
NCBI sequence collections
- NCBI offers pre-formatted blastDBs for BLAST through the NCBI BLASTDB FTP site
- In order to construct DiamondDBs, users have to retrieve the FASTA files for the sequence collections from the NCBI BLASTDB FASTA FTP site and build there own database (see Diamond documentation).
RNAcentral SILVA collection
# Download ID mapping file from RNAcentral release 5.0
wget ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/releases/5.0/id_mapping/id_mapping.tsv.gz
# extract SILVA sequence IDs (this is the TaxID-mapping file)
gunzip -c id_mapping.tsv.gz | grep SILVA | sort | uniq > id_mapping.SILVA.tsv
# generate a list of sequence IDs
cut -f1 id_mapping.SILVA.tsv > id_mapping.SILVA.names.txt
# Download FASTA from RNAcentral release 5.0 (contains all sequences)
wget ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/releases/5.0/sequences/rnacentral_active.fasta.gz
gunzip rnacentral_active.fasta.gz
# Subselect only SILVA rRNA sequences
./blobtools seqfilter -i rnacentral_active.fasta -l id_mapping.SILVA.names.tsv
# Convert rRNA to rDNA
perl -ne 'chomp; if(m/^>/){@temp=split(" "); print $temp[0]."\n";} else {$_ =~ tr/U/T/; print $_."\n"}' rnacentral_active.filtered.fna > rnacentral_active.SILVA.rDNA.fasta
# make the blastDB
makeblastdb -dbtype nucl -in rnacentral_active.SILVA.rDNA.fasta
Sequence similarity search algorithms
- Since the goal is to obtain hits for the sequences in the assembly, the queries of the sequence similarity searches are nucleotide sequences.
- The sequence similarity search algorithm determines which type (nucleotide or protein) of sequence collection can be used.
- The following sequence similarity search algorithms can be used:
Algorithm | scope | database | speed |
BLASTn megablast | used to find very similar (e.g., intraspecies or closely related species) sequences | nucleotide | fastest |
BLASTn dc-megablast | used to find more distant (e.g., interspecies) sequences | nucleotide | slow |
BLASTx | used to find distant sequences | protein | slowest |
Diamond blastx | used to find distant sequences | protein | faster than BLASTx |
Example commands
blastn \
-task megablast \
-query assembly.fna \
-db nt \
-outfmt '6 qseqid staxids bitscore std' \
-max_target_seqs 1 \
-max_hsps 1 \
-num_threads 16 \
-evalue 1e-25 \
-out assembly.vs.nt.mts1.hsp1.1e25.megablast.out
blastn \
-task dc-megablast \
-query assembly.fna \
-db nt \
-outfmt '6 qseqid staxids bitscore std sscinames sskingdoms stitle' \
-max_target_seqs 1 \
-max_hsps 1 \
-num_threads 16 \
-evalue 1e-25 \
-out assembly.vs.nt.mts1.hsp1.1e25.dc_megablast.out
diamond blastx \
--query assembly.fna \
--max-target-seqs 1 \
--sensitive \
--threads 16 \
--db uniprot_ref_proteomes.diamond.dmnd \
--evalue 1e-25 \
--outfmt 6 \
--out assembly.vs.uniprot_ref.mts1.1e25.out
# taxify results
blobtools taxify -f assembly.vs.uniprot_ref.mts1.1e25.out -m uniprot_ref_proteomes.taxids -s 0 -t 2
blastn \
-task megablast \
-query assembly.fna \
-db rnacentral_active.SILVA.rDNA.fasta \
-outfmt '6 qseqid staxids bitscore std' \
-num_threads 48 \
-evalue 1e-65 \
-out assembly.vs.silva.1e65.megablast.out
# filtering the BLAST result further based on pID, ALN and Evalue is recommended
awk '$5 >= 97 && $6 >= 500 && $13 <= 1e-150' \ assembly.vs.silva.1e65.megablast.out \
> assembly.vs.silva.1e65.megablast.stringent.out
Recommendations
Best tradeoff between speed and accuracy of BlobTools taxonomy assignment is (usually) achieved with the following searches, supplied in this order and using taxrule ‘bestsumorder’.
blastn \
-query $ASSEMBLY \
-db nt \
-outfmt ’6 qseqid staxids bitscore std’ \
-max_target_seqs 1 \
-max_hsps 1 \
-evalue 1e-25
diamond blastx \
--query $ASSEMBLY \
--db uniprot_ref_proteomes.diamond.dmnd \
--outfmt 6 \
--sensitive \
--max-target-seqs 1 \
--evalue 1e-25
Most accurate results are obtained with the following searches, supplied in this order and using taxrule ‘bestsumorder’.
blastn \
-query $ASSEMBLY \
-db nt \
-outfmt ’6 qseqid staxids bitscore std’ \
-max_target_seqs 10 \
-max_hsps 1 \
-evalue 1e-25
diamond blastx \
--query $ASSEMBLY \
--db uniprot_ref_proteomes.diamond.dmnd \
--outfmt 6 \
--sensitive \
--max-target-seqs 1 \
--evalue 1e-25
Updated almost 3 years ago