-num_threads option for a decent speedup. Finally, you can cut your query into many chunks and run them all in parallel, giving a more linear speedup than simply increasing the number of threads.
If none of that works, you can turn to an alternative BLAST-like tool. Today we are comparing PLAST, a command line tool that promises significant acceleration over BLAST, and DIAMOND, a tool that boasts up to a 20,000x speedup over BLAST. I routinely have to BLAST large numbers of sequences (transcriptomes) against large databases (nr, Trembl), so these tools immediately piqued my interest. Which one is better though? And are they just as good as BLAST? Only one way to find out:
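The chunking approach can be sketched in a few lines of shell. This is only a minimal sketch: the `split_fasta` helper and the `chunk.N.fa` naming scheme are hypothetical, it assumes a well-formed FASTA file (the first line is a header), and the blastx options in the commented usage are simply the ones used later in this post.

```shell
#!/usr/bin/env bash
# split_fasta FILE N PREFIX   (hypothetical helper, not part of BLAST)
# Distributes the records of FILE round-robin into N files named
# PREFIX.1.fa ... PREFIX.N.fa, keeping multi-line sequences together.
split_fasta() {
    local fasta="$1" n="$2" prefix="$3"
    awk -v n="$n" -v prefix="$prefix" '
        # A header line starts a new record: pick the next output file.
        /^>/ { out = sprintf("%s.%d.fa", prefix, (count++ % n) + 1) }
        # Every line (header or sequence) goes to the current output file.
        { print > out }
    ' "$fasta"
}

# Usage sketch (assumes blastx is on the PATH and the database exists):
# split_fasta ~/test_race.fa 8 chunk
# for c in chunk.*.fa; do
#     blastx -query "$c" -db ~/dbs/trembl -out "${c%.fa}.blastx.txt" \
#         -outfmt 6 -evalue 0.00005 -max_target_seqs 10 -max_hsps 1 &
# done
# wait                                  # block until every chunk finishes
# cat chunk.*.blastx.txt > all.blastx.txt
```

Round-robin assignment keeps the chunks roughly the same size in number of sequences, although not necessarily in total length, so some chunks may still finish earlier than others.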
Let's race em!
The Race: As a query I am using a random set of 1369 sequences generated from a Trinity assembly. I knew that many of these had no BLASTx hit from a previous run so they'd make a good 'challenge' for the programs.
The overall stats on the query file are as follows:
For a database I am using a local copy of Trembl, which, although not as large as nr, is still quite formidable.
One annoyance I encountered with DIAMOND is that I had to make a separate database file for it:
/scripts/diamond/./diamond makedb --in ~/dbs/uniprot_trembl.fasta -d ~/dbs/trembl
PLAST, on the other hand, works with a database generated with NCBI's makeblastdb command. You provide the .pal file to PLAST.
Setting up equivalent calls was pretty straightforward except for a parameter unique to PLAST (the seed use ratio) and a parameter unique to DIAMOND (the --more-sensitive option). Both were set to the most sensitive. Here are the calls for each program:
#DIAMOND blastx
date ; /scripts/diamond/./diamond blastx \
  -d ~/dbs/trembl \
  -q ~/test_race.fa \
  --threads 8 \
  --more-sensitive \
  --out test_race.diamonblastx.txt \
  --outfmt 6 \
  --max-target-seqs 10 \
  --evalue 0.00005 \
  --max-hsps 1 ; date

#PLASTx
date ; /scripts/plastbinary_linux_20160121/scripts/../build/bin/plast \
  -p plastx \
  -d ~/dbs/trembl.pal \
  -i ~/test_race.fa \
  -o /home/jwarner/test_race.plastx.txt \
  -e 5e-5 -a 8 \
  -max-hsp-per-hit 1 \
  -max-hit-per-query 10 \
  -seeds-use-ratio 100 \
  -outfmt 1 \
  -force-query-order 1000 \
  -bargraph ; date

#BLASTx
date ; blastx \
  -query ~/test_race.fa \
  -db ~/dbs/trembl \
  -out /home/jwarner/test_race.blastx.txt -outfmt 6 -evalue 0.00005 \
  -max_target_seqs 10 -max_hsps 1 -num_threads 8 ; date
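Wrapping each call in `date ; ... ; date` works, but it leaves the subtraction to you. As a small convenience, here is a sketch of a wrapper that reports elapsed wall-clock minutes directly. The `time_minutes` name is hypothetical, and it assumes a `date` that supports `+%s` (seconds since the epoch), which GNU and BSD `date` both do.

```shell
#!/usr/bin/env bash
# time_minutes CMD [ARGS...]   (hypothetical helper)
# Runs the command, then prints the elapsed wall-clock time in whole
# minutes to stderr, and returns the exit status of the command itself.
time_minutes() {
    local start end status
    start=$(date +%s)          # seconds since the epoch, before the run
    "$@"
    status=$?
    end=$(date +%s)            # seconds since the epoch, after the run
    echo "elapsed: $(( (end - start) / 60 )) min" >&2
    return "$status"
}

# Usage sketch, with the BLASTx call from this post:
# time_minutes blastx -query ~/test_race.fa -db ~/dbs/trembl \
#     -out /home/jwarner/test_race.blastx.txt -outfmt 6 -evalue 0.00005 \
#     -max_target_seqs 10 -max_hsps 1 -num_threads 8
```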
And the winner is...
The results can be binned into two categories: Alignment time and sensitivity.
Both DIAMOND and PLAST are several times faster than BLAST, clocking in at 346 and 202 minutes respectively. Compared to 1474 minutes for BLAST, that is some serious computational time savings. Good to see, since this speedup is the raison d'être for these alternative programs.
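To put a number on those savings, the speedup is just the BLAST run time divided by the run time of the faster tool. A quick sketch using the timings above (the `speedup` function name is hypothetical):

```shell
#!/usr/bin/env bash
# speedup BASELINE_MINUTES TOOL_MINUTES   (hypothetical helper)
# Prints how many times faster the tool is than the baseline, to one decimal.
speedup() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.1f\n", base / t }'
}

# With the run times measured in this post:
# speedup 1474 346    # BLASTx vs DIAMOND -> 4.3
# speedup 1474 202    # BLASTx vs PLASTx  -> 7.3
```

So on this query set DIAMOND comes out a little over 4x faster than BLASTx and PLAST a little over 7x faster.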
As a proxy for sensitivity, I examined two metrics: the total number of hits found by each program and the number of queries with at least one hit. We can see that in both cases BLASTx is the most sensitive, followed by PLASTx and then DIAMOND.
While BLASTx is still the gold standard for sensitivity, its incredibly long alignment times make it a major bottleneck when many thousands of sequences need to be aligned. This is a serious consideration when we are, for example, annotating a large transcriptome (our last assembly was over 200,000 sequences). Both PLAST and DIAMOND provide a major run-time advantage without a considerable loss of sensitivity.
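Both metrics can be pulled from a tabular result file with a short awk script. This is a sketch under a few assumptions: the `summarize_hits` name is hypothetical, the file is tab-delimited with the query ID in the first column (as in BLAST's outfmt 6), and each line is one hit, which holds here because every call limited the output to one HSP per hit.

```shell
#!/usr/bin/env bash
# summarize_hits FILE   (hypothetical helper)
# Prints two tab-separated numbers for a tabular result file:
#   <total hits>  <queries with at least one hit>
summarize_hits() {
    local hits="$1"
    awk -F'\t' '
        # Skip comment lines and blank lines; count everything else as a hit.
        !/^#/ && NF {
            total++
            if (!($1 in seen)) { seen[$1] = 1; queries++ }
        }
        END { printf "%d\t%d\n", total, queries }
    ' "$hits"
}

# Usage sketch, with the output files named in this post:
# for f in test_race.diamonblastx.txt /home/jwarner/test_race.plastx.txt \
#          /home/jwarner/test_race.blastx.txt; do
#     printf '%s\t' "$f"; summarize_hits "$f"
# done
```

Dividing the second number by the 1369 query sequences gives the fraction of the query set that each program managed to annotate.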
In a head-to-head comparison of PLAST and DIAMOND, PLAST is the clear winner in terms of both speed and sensitivity. Besides this, I found PLAST preferable for three primary reasons:
- PLAST allows for finer tuning of sensitivity versus speed by modifying the seed use ratio.
- PLAST is compatible with databases created using BLAST's makeblastdb command.
- PLAST supports the entire BLAST family. DIAMOND only includes blastp and blastx.
Also, PLAST's -bargraph option, which estimates the remaining run time, is nice.
In any event both tools provide a (much) faster alternative to BLAST and are certainly worth considering the next time a big annotation project comes down the pipe.
- Buchfink B, Xie C, Huson DH. (2015) Fast and sensitive protein alignment using DIAMOND. Nature Methods, vol 12, pp 59-60.
- Nguyen VH, Lavenier D. (2009) PLAST: parallel local alignment search tool for database comparison. BMC Bioinformatics, vol 10, no 329.