{"__v":0,"_id":"5761947faa540f3600bfed4f","category":{"__v":0,"_id":"57619455a7c9f729009a74e0","project":"57618347b65324200072d6a5","version":"57618347b65324200072d6a8","sync":{"url":"","isSync":false},"reference":false,"createdAt":"2016-06-15T17:45:57.617Z","from_sync":false,"order":5,"slug":"file-descriptions","title":"File formats"},"parentDoc":null,"project":"57618347b65324200072d6a5","user":"57617c8caa540f3600bfed20","version":{"__v":8,"_id":"57618347b65324200072d6a8","project":"57618347b65324200072d6a5","createdAt":"2016-06-15T16:33:11.587Z","releaseDate":"2016-06-15T16:33:11.587Z","categories":["57618347b65324200072d6a9","5761912d207db7170022fbe9","57619455a7c9f729009a74e0","576e8ae1f37ab41700147471","5797b8e5209a6e0e00b8321b","57989a8817ced017003c4c69","579ca6f3d46f960e0029a8ec","579ca703fefb1d0e00c94f06"],"is_deprecated":false,"is_hidden":false,"is_beta":false,"is_stable":true,"codename":"blobtools v0.9.19","version_clean":"0.9.19","version":"0.9.19"},"updates":[],"next":{"pages":[],"description":""},"createdAt":"2016-06-15T17:46:39.659Z","link_external":false,"link_url":"","githubsync":"","sync_unique":"","hidden":false,"api":{"results":{"codes":[]},"settings":"","auth":"required","params":[],"url":""},"isReference":false,"order":2,"body":"## Why is it needed?\n\n* A hits file contains the results of a sequence similarity search of the assembly against a sequence database (e.g. BLAST/Diamond output files)\n\n* ```blobtools create``` parses **all** hits present in a hits-file (i.e. a sequence can have more than one hit) and assigns a [taxonomy](doc:taxonomy) under a given [tax-rule](doc:tax-rule) to each sequence in the assembly. This process relies entirely on the [hits file(s)](doc:taxonomy-file) provided by the user.     \n\n* For a more detailed description of the taxonomic annotation process, see [taxonomic annotation](doc:taxonomic-annotation).      \n\n## Required format\n\nThe required format is TAB-separated and should show the following information in the first three columns (additional columns are ignored)\n * 1st column: sequenceID\n * 2nd column: taxID\n * 3rd column: score\n\n# Generating hits files\n\n## 1. Sequence databases against which to search\n\n* A wide variety of curated and un-curated sequence collections are publicly available (e.g. NCBI, UniProt, RNAcentral) and the choice of sequence collection against which to search depends on the desired goal.\n\n* NCBI BLAST DBs include [taxIDs](doc:taxid) for each of the sequences they contain, for other sequence collections the user will have to annotate hit-files *a posteriori* using ```blobtools taxify``` and providing a [TaxID-mapping file](doc:taxid-mapping-file).  \n\n### 1.1 Public sequence collections \nThe collections differ in their inclusion criteria of sequences, phylogenetic sampling and the effort made to reduce [redundancy](http://arep.med.harvard.edu/seqanal/db.html). A selection of commonly used public sequence collections:  \n\n* [UniProt Reviewed (Swiss-Prot)](http://www.uniprot.org/downloads) (551,705 protein sequences :::at::: 2016-06-30): \n * manually curated\n\n\n* [UniProt UniRef90](http://www.uniprot.org/downloads) (42,554,046 protein sequences @ 2016-06-27): \n * provides automatically clustered sets of sequences from the UniProt Knowledgebase and selected UniParc records (contains Swiss-Prot + non-reviewed UniProtKB/TrEMBL)\n\n\n* [RNAcentral SILVA](http://rnacentral.org/expert-database/silva) (4,478,614 rRNA sequences @ version 122)\n * quality checked and aligned ribosomal RNA sequence data\n\n\n* [NCBI RefSeq](ftp://ftp.ncbi.nlm.nih.gov/blast/db/) (65,964,245 protein sequences @ Release 77)\n *  manually curated GenBank sequences\n\n\n* [NCBI nr](ftp://ftp.ncbi.nlm.nih.gov/blast/db/) (many protein sequences)\n * all non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects\n\n\n* [NCBI nt](ftp://ftp.ncbi.nlm.nih.gov/blast/db/) (many nucleotide sequences)\n * entries from traditional divisions of GenBank, EMBL and DDBJ. Sequences from bulk divisions, i.e., gss, sts, pat, est, htg, wgs, and environmental sequences are excluded. RefSeq genomic entries are also excluded.\n\n\n* For all available NCBI sequence collections, see [NCBI BLAST FTP site](https://www.ncbi.nlm.nih.gov/books/NBK62345/)\n\n### 1.2 Custom sequence collections\n\nCustom sequence collections can be constructed based on:\n* subsets of public databases\n* unpublished reference genomes/proteomes\n\n## 2. Setting up sequence databases\n\n### 2.1. UniProt sequence collections\n\n* Sujai Kumar has written a [gist](https://gist.github.com/sujaikumar/9ad04e62449a2d7025b17144de67038b) on how to construct a  blastDB/diamondDB from [UniProt UniRef90](http://www.uniprot.org/downloads), the same steps can be used to create blastDB/diamondDB from other UniProt sequence collections.\n  \n### 2.2. NCBI sequence collections\n\n* NCBI offers pre-formatted blastDBs for BLAST through the [NCBI BLASTDB FTP site](ftp://ftp.ncbi.nih.gov/blast/db/)\n* In order to construct diamondDBs, users have to retrieve the FASTA files for the sequence collections from the [NCBI BLASTDB FASTA FTP site](ftp://ftp.ncbi.nih.gov/blast/db/FASTA/) and build there own database (see [Diamond documentation](https://github.com/bbuchfink/diamond)).\n\n### 2.3 RNAcentral SILVA collection\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"# Download ID mapping file from RNAcentral release 5.0\\nwget ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/releases/5.0/id_mapping/id_mapping.tsv.gz\\n\\n# extract SILVA sequence IDs (this is the TaxID-mapping file)\\ngunzip -c id_mapping.tsv.gz | grep SILVA | sort | uniq > id_mapping.SILVA.tsv\\n\\n# generate a list of sequence IDs\\ncut -f1 id_mapping.SILVA.tsv > id_mapping.SILVA.names.txt\\n\\n# Download FASTA from RNAcentral release 5.0 (contains all sequences)\\nwget ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/releases/5.0/sequences/rnacentral_active.fasta.gz\\ngunzip rnacentral_active.fasta.gz\\n\\n# Subselect only SILVA rRNA sequences\\n./blobtools seqfilter -i rnacentral_active.fasta -l id_mapping.SILVA.names.tsv \\n\\n# Convert rRNA to rDNA\\nperl -ne 'chomp; if(m/^>/){@temp=split(\\\" \\\"); print $temp[0].\\\"\\\\n\\\";} else {$_ =~ tr/U/T/; print $_.\\\"\\\\n\\\"}' rnacentral_active.filtered.fna > rnacentral_active.SILVA.rDNA.fasta\\n\\n# make the blastDB\\nmakeblastdb -dbtype nucl -in rnacentral_active.SILVA.rDNA.fasta\",\n      \"language\": \"shell\",\n      \"name\": \"Shell\"\n    }\n  ]\n}\n[/block]\n## 3. Running sequence similarity searches\n\n### 3.1 Sequence similarity search algorithms\n\n* Since the goal is to obtain hits for the sequences in the assembly, the queries of the sequence similarity searches are nucleotide sequences. \n\n* The sequence similarity search algorithm determines which type (nucleotide or protein) of sequence collection can be used. \n* The following sequence similarity search algorithms can be used:\n[block:parameters]\n{\n  \"data\": {\n    \"0-0\": \"Algorithm\",\n    \"1-0\": \"[BLASTn megablast](https://www.ncbi.nlm.nih.gov/books/NBK279675/)\",\n    \"0-1\": \"scope\",\n    \"0-2\": \"database\",\n    \"0-3\": \"speed\",\n    \"1-1\": \"used to find very similar (e.g., intraspecies or closely related species) sequences\",\n    \"1-2\": \"nucleotide\",\n    \"1-3\": \"fastest\",\n    \"2-0\": \"[BLASTn dc-megablast](https://www.ncbi.nlm.nih.gov/books/NBK279675/)\",\n    \"2-1\": \"used to find more distant (e.g., interspecies) sequences\",\n    \"2-2\": \"nucleotide\",\n    \"2-3\": \"slow\",\n    \"3-1\": \"used to find distant sequences\",\n    \"3-2\": \"protein\",\n    \"3-0\": \"[BLASTx](https://www.ncbi.nlm.nih.gov/books/NBK279675/)\",\n    \"3-3\": \"slowest\",\n    \"4-0\": \"[Diamond blastx](https://github.com/bbuchfink/diamond)\",\n    \"4-1\": \"used to find distant sequences\",\n    \"4-2\": \"protein\",\n    \"4-3\": \"faster than BLASTx\"\n  },\n  \"cols\": 4,\n  \"rows\": 5\n}\n[/block]\n### 3.2 Example commands\n[block:code]\n{\n  \"codes\": [\n    {\n      \"code\": \"blastn \\\\\\n-task megablast \\\\\\n-query assembly.fna \\\\\\n-db nt \\\\\\n-outfmt '6 qseqid staxids bitscore std sscinames sskingdoms stitle' \\\\\\n-culling_limit 5 \\\\\\n-num_threads 48 \\\\\\n-evalue 1e-25 \\\\\\n-out assembly.vs.nt.cul5.1e25.megablast.out\",\n      \"language\": \"shell\",\n      \"name\": \"megablast\"\n    },\n    {\n      \"code\": \"blastn \\\\\\n-task dc-megablast \\\\\\n-query assembly.fna \\\\\\n-db nt \\\\\\n-outfmt '6 qseqid staxids bitscore std sscinames sskingdoms stitle' \\\\\\n-culling_limit 5 \\\\\\n-num_threads 48 \\\\\\n-evalue 1e-25 \\\\\\n-out assembly.vs.nt.cul5.1e25.megablast.out\",\n      \"language\": \"shell\",\n      \"name\": \"dc-megablast\"\n    },\n    {\n      \"code\": \"diamond blastx \\\\\\n--query assembly.fna \\\\\\n--max-target-seqs 100 \\\\\\n--sensitive \\\\\\n--index-chunks 1 \\\\\\n--threads 48 \\\\\\n--db swissprot.dmnd \\\\\\n--evalue 1e-25 \\\\\\n--outfmt tab \\\\\\n--out assembly.vs.swissprot.dmnd \",\n      \"language\": \"shell\",\n      \"name\": \"diamond-blastx\"\n    },\n    {\n      \"code\": \"blastn \\\\\\n-task megablast \\\\\\n-query assembly.fna \\\\\\n-db rnacentral_active.SILVA.rDNA.fasta \\\\\\n-outfmt '6 qseqid staxids bitscore std' \\\\\\n-num_threads 48 \\\\\\n-evalue 1e-65 \\\\\\n-out assembly.vs.silva.1e65.megablast.out\\n\\n# filtering the BLAST result further based on pID, ALN and Evalue is recommended\\nawk '$5 >= 97 && $6 >= 500 && $13 <= 1e-150' \\\\ assembly.vs.silva.1e65.megablast.out \\\\\\n> assembly.vs.silva.1e65.megablast.stringent.out\",\n      \"language\": \"shell\",\n      \"name\": \"SILVA-megablast\"\n    }\n  ]\n}\n[/block]\n* The use of ```[--max-target-seqs INT]``` in BLAST is discouraged, see [Sujai's gist](https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5) and [Peter's blogpost](https://blastedbio.blogspot.co.uk/2015/12/blast-max-target-sequences-bug.html)","excerpt":"","slug":"taxonomy-file","type":"basic","title":"hits file"}
## Why is it needed? * A hits file contains the results of a sequence similarity search of the assembly against a sequence database (e.g. BLAST/Diamond output files) * ```blobtools create``` parses **all** hits present in a hits-file (i.e. a sequence can have more than one hit) and assigns a [taxonomy](doc:taxonomy) under a given [tax-rule](doc:tax-rule) to each sequence in the assembly. This process relies entirely on the [hits file(s)](doc:taxonomy-file) provided by the user. * For a more detailed description of the taxonomic annotation process, see [taxonomic annotation](doc:taxonomic-annotation). ## Required format The required format is TAB-separated and should show the following information in the first three columns (additional columns are ignored) * 1st column: sequenceID * 2nd column: taxID * 3rd column: score # Generating hits files ## 1. Sequence databases against which to search * A wide variety of curated and un-curated sequence collections are publicly available (e.g. NCBI, UniProt, RNAcentral) and the choice of sequence collection against which to search depends on the desired goal. * NCBI BLAST DBs include [taxIDs](doc:taxid) for each of the sequences they contain, for other sequence collections the user will have to annotate hit-files *a posteriori* using ```blobtools taxify``` and providing a [TaxID-mapping file](doc:taxid-mapping-file). ### 1.1 Public sequence collections The collections differ in their inclusion criteria of sequences, phylogenetic sampling and the effort made to reduce [redundancy](http://arep.med.harvard.edu/seqanal/db.html). A selection of commonly used public sequence collections: * [UniProt Reviewed (Swiss-Prot)](http://www.uniprot.org/downloads) (551,705 protein sequences @ 2016-06-30): * manually curated * [UniProt UniRef90](http://www.uniprot.org/downloads) (42,554,046 protein sequences @ 2016-06-27): * provides automatically clustered sets of sequences from the UniProt Knowledgebase and selected UniParc records (contains Swiss-Prot + non-reviewed UniProtKB/TrEMBL) * [RNAcentral SILVA](http://rnacentral.org/expert-database/silva) (4,478,614 rRNA sequences @ version 122) * quality checked and aligned ribosomal RNA sequence data * [NCBI RefSeq](ftp://ftp.ncbi.nlm.nih.gov/blast/db/) (65,964,245 protein sequences @ Release 77) * manually curated GenBank sequences * [NCBI nr](ftp://ftp.ncbi.nlm.nih.gov/blast/db/) (many protein sequences) * all non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF excluding environmental samples from WGS projects * [NCBI nt](ftp://ftp.ncbi.nlm.nih.gov/blast/db/) (many nucleotide sequences) * entries from traditional divisions of GenBank, EMBL and DDBJ. Sequences from bulk divisions, i.e., gss, sts, pat, est, htg, wgs, and environmental sequences are excluded. RefSeq genomic entries are also excluded. * For all available NCBI sequence collections, see [NCBI BLAST FTP site](https://www.ncbi.nlm.nih.gov/books/NBK62345/) ### 1.2 Custom sequence collections Custom sequence collections can be constructed based on: * subsets of public databases * unpublished reference genomes/proteomes ## 2. Setting up sequence databases ### 2.1. UniProt sequence collections * Sujai Kumar has written a [gist](https://gist.github.com/sujaikumar/9ad04e62449a2d7025b17144de67038b) on how to construct a blastDB/diamondDB from [UniProt UniRef90](http://www.uniprot.org/downloads), the same steps can be used to create blastDB/diamondDB from other UniProt sequence collections. ### 2.2. NCBI sequence collections * NCBI offers pre-formatted blastDBs for BLAST through the [NCBI BLASTDB FTP site](ftp://ftp.ncbi.nih.gov/blast/db/) * In order to construct diamondDBs, users have to retrieve the FASTA files for the sequence collections from the [NCBI BLASTDB FASTA FTP site](ftp://ftp.ncbi.nih.gov/blast/db/FASTA/) and build there own database (see [Diamond documentation](https://github.com/bbuchfink/diamond)). ### 2.3 RNAcentral SILVA collection [block:code] { "codes": [ { "code": "# Download ID mapping file from RNAcentral release 5.0\nwget ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/releases/5.0/id_mapping/id_mapping.tsv.gz\n\n# extract SILVA sequence IDs (this is the TaxID-mapping file)\ngunzip -c id_mapping.tsv.gz | grep SILVA | sort | uniq > id_mapping.SILVA.tsv\n\n# generate a list of sequence IDs\ncut -f1 id_mapping.SILVA.tsv > id_mapping.SILVA.names.txt\n\n# Download FASTA from RNAcentral release 5.0 (contains all sequences)\nwget ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/releases/5.0/sequences/rnacentral_active.fasta.gz\ngunzip rnacentral_active.fasta.gz\n\n# Subselect only SILVA rRNA sequences\n./blobtools seqfilter -i rnacentral_active.fasta -l id_mapping.SILVA.names.tsv \n\n# Convert rRNA to rDNA\nperl -ne 'chomp; if(m/^>/){@temp=split(\" \"); print $temp[0].\"\\n\";} else {$_ =~ tr/U/T/; print $_.\"\\n\"}' rnacentral_active.filtered.fna > rnacentral_active.SILVA.rDNA.fasta\n\n# make the blastDB\nmakeblastdb -dbtype nucl -in rnacentral_active.SILVA.rDNA.fasta", "language": "shell", "name": "Shell" } ] } [/block] ## 3. Running sequence similarity searches ### 3.1 Sequence similarity search algorithms * Since the goal is to obtain hits for the sequences in the assembly, the queries of the sequence similarity searches are nucleotide sequences. * The sequence similarity search algorithm determines which type (nucleotide or protein) of sequence collection can be used. * The following sequence similarity search algorithms can be used: [block:parameters] { "data": { "0-0": "Algorithm", "1-0": "[BLASTn megablast](https://www.ncbi.nlm.nih.gov/books/NBK279675/)", "0-1": "scope", "0-2": "database", "0-3": "speed", "1-1": "used to find very similar (e.g., intraspecies or closely related species) sequences", "1-2": "nucleotide", "1-3": "fastest", "2-0": "[BLASTn dc-megablast](https://www.ncbi.nlm.nih.gov/books/NBK279675/)", "2-1": "used to find more distant (e.g., interspecies) sequences", "2-2": "nucleotide", "2-3": "slow", "3-1": "used to find distant sequences", "3-2": "protein", "3-0": "[BLASTx](https://www.ncbi.nlm.nih.gov/books/NBK279675/)", "3-3": "slowest", "4-0": "[Diamond blastx](https://github.com/bbuchfink/diamond)", "4-1": "used to find distant sequences", "4-2": "protein", "4-3": "faster than BLASTx" }, "cols": 4, "rows": 5 } [/block] ### 3.2 Example commands [block:code] { "codes": [ { "code": "blastn \\\n-task megablast \\\n-query assembly.fna \\\n-db nt \\\n-outfmt '6 qseqid staxids bitscore std sscinames sskingdoms stitle' \\\n-culling_limit 5 \\\n-num_threads 48 \\\n-evalue 1e-25 \\\n-out assembly.vs.nt.cul5.1e25.megablast.out", "language": "shell", "name": "megablast" }, { "code": "blastn \\\n-task dc-megablast \\\n-query assembly.fna \\\n-db nt \\\n-outfmt '6 qseqid staxids bitscore std sscinames sskingdoms stitle' \\\n-culling_limit 5 \\\n-num_threads 48 \\\n-evalue 1e-25 \\\n-out assembly.vs.nt.cul5.1e25.megablast.out", "language": "shell", "name": "dc-megablast" }, { "code": "diamond blastx \\\n--query assembly.fna \\\n--max-target-seqs 100 \\\n--sensitive \\\n--index-chunks 1 \\\n--threads 48 \\\n--db swissprot.dmnd \\\n--evalue 1e-25 \\\n--outfmt tab \\\n--out assembly.vs.swissprot.dmnd ", "language": "shell", "name": "diamond-blastx" }, { "code": "blastn \\\n-task megablast \\\n-query assembly.fna \\\n-db rnacentral_active.SILVA.rDNA.fasta \\\n-outfmt '6 qseqid staxids bitscore std' \\\n-num_threads 48 \\\n-evalue 1e-65 \\\n-out assembly.vs.silva.1e65.megablast.out\n\n# filtering the BLAST result further based on pID, ALN and Evalue is recommended\nawk '$5 >= 97 && $6 >= 500 && $13 <= 1e-150' \\ assembly.vs.silva.1e65.megablast.out \\\n> assembly.vs.silva.1e65.megablast.stringent.out", "language": "shell", "name": "SILVA-megablast" } ] } [/block] * The use of ```[--max-target-seqs INT]``` in BLAST is discouraged, see [Sujai's gist](https://gist.github.com/sujaikumar/504b3b7024eaf3a04ef5) and [Peter's blogpost](https://blastedbio.blogspot.co.uk/2015/12/blast-max-target-sequences-bug.html)