Tuesday, July 4, 2017

BLASTx is too slow! Here are some alternatives.

BLASTx has long been the gold standard for nucleotide-to-amino-acid alignments. For most applications it works very well. In some cases, though, such as aligning thousands of sequences against the non-redundant (nr) database, it is just too slow. If you're facing this problem, there are a few ways to speed up BLASTx. You can increase the -word_size option for a marginal speedup with little loss of sensitivity. You can increase the -num_threads option for a decent speedup. Finally, you can cut your query into many chunks and run them all in parallel, which scales more linearly than simply increasing the number of threads.
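The chunk-and-parallelize approach can be sketched in a few lines of shell. This is a minimal sketch rather than a tuned pipeline: the function names split_fasta and blastx_chunks, the chunk file naming, and the chunk size in the usage comment are all assumptions made for illustration, and the BLASTx options simply mirror the ones used in the race further down.

```shell
#!/usr/bin/env bash
# Split a FASTA file into chunks of at most N sequences each.
# Usage: split_fasta <input.fa> <seqs_per_chunk> <output_prefix>
# Writes <output_prefix>.0000.fa, <output_prefix>.0001.fa, ...
split_fasta() {
    local input=$1 per_chunk=$2 prefix=$3
    awk -v n="$per_chunk" -v prefix="$prefix" '
        /^>/ {
            # Start a new chunk file every n headers.
            if (count % n == 0) {
                if (out != "") close(out)
                out = sprintf("%s.%04d.fa", prefix, int(count / n))
            }
            count++
        }
        out != "" { print > out }
    ' "$input"
}

# Run one single-threaded blastx job per chunk, at most <jobs> at a time.
# Usage: blastx_chunks <db> <jobs> <chunk.fa>...
blastx_chunks() {
    local db=$1 jobs=$2
    shift 2
    printf '%s\0' "$@" | xargs -0 -P "$jobs" -I{} \
        blastx -query {} -db "$db" -out {}.blastx.txt \
               -outfmt 6 -evalue 0.00005 -max_target_seqs 10 -max_hsps 1
}

# Example (hypothetical chunk size and prefix, not run here):
# split_fasta ~/test_race.fa 200 chunk
# blastx_chunks ~/dbs/trembl 8 chunk.*.fa
# cat chunk.*.fa.blastx.txt > test_race.blastx.txt
```

Each chunk runs as its own process, so the jobs do not wait on each other the way threads inside a single BLASTx run can.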

If none of that works, you can turn to an alternative BLAST-like tool. Today we are comparing PLAST, a command-line tool that promises significant acceleration over BLAST, and DIAMOND, a tool that boasts up to a 20,000x speedup over BLAST. I routinely have to BLAST large numbers of sequences (transcriptomes) against large databases (nr, TrEMBL), so these tools immediately piqued my interest. Which one is better, though? And are they just as good as BLAST? Only one way to find out:

Let's race 'em!

The Race:

As a query I am using a random set of 1,369 sequences taken from a Trinity assembly. I knew from a previous run that many of these had no BLASTx hit, so they'd make a good 'challenge' for the programs.
The overall stats on the query file are as follows:

Sequences: 1,369
Mean_length: 349
Median_length: 275
Min_length: 201
Max_length: 2,460
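
Stats like these can be pulled from any FASTA file with a short awk pipeline. The sketch below is one way to do it, not necessarily how the numbers above were produced; the function name fasta_stats is made up for this example, and it assumes a plain (optionally line-wrapped) FASTA file.

```shell
#!/usr/bin/env bash
# Print sequence count plus mean, median, min and max length of a FASTA file.
# Usage: fasta_stats <input.fa>
fasta_stats() {
    # First pass: emit one sequence length per line.
    awk '
        /^>/ { if (seen) print len; seen = 1; len = 0; next }
        { gsub(/[ \t\r]/, ""); len += length($0) }
        END { if (seen) print len }
    ' "$1" | sort -n | awk '
        # Second pass: lengths arrive sorted, so min, median and max are positional.
        { a[NR] = $1; sum += $1 }
        END {
            if (NR == 0) exit 1
            median = (NR % 2) ? a[(NR + 1) / 2] : (a[NR / 2] + a[NR / 2 + 1]) / 2
            printf "Sequences: %d\n", NR
            printf "Mean_length: %.0f\n", sum / NR
            printf "Median_length: %g\n", median
            printf "Min_length: %d\n", a[1]
            printf "Max_length: %d\n", a[NR]
        }
    '
}

# Example (not run here):
# fasta_stats ~/test_race.fa
```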

For a database I am using a local copy of TrEMBL, which, although not as large as nr, is still quite formidable.

One annoyance I encountered with DIAMOND is that I had to make a separate database file for it:


/scripts/diamond/./diamond makedb --in ~/dbs/uniprot_trembl.fasta -d ~/dbs/trembl 

PLAST on the other hand works with a database generated with NCBI's makeblastdb command. You provide the .pal file to PLAST.

Setting up equivalent calls was pretty straightforward except for a parameter unique to PLAST (the seed use ratio, -seeds-use-ratio) and a parameter unique to DIAMOND (the --more-sensitive option). Both were set to their most sensitive settings. Here are the calls for each program:


#DIAMOND blastx 
date ;
/scripts/diamond/./diamond blastx \
-d ~/dbs/trembl \
-q ~/test_race.fa \
--threads 8 \
--more-sensitive \
--out test_race.diamonblastx.txt \
--outfmt 6 \
--max-target-seqs 10 \
--evalue 0.00005 \
--max-hsps 1 ;
date 

#PLASTx
date ;
/scripts/plastbinary_linux_20160121/scripts/../build/bin/plast \
-p plastx \
-d ~/dbs/trembl.pal \
-i ~/test_race.fa \
-o /home/jwarner/test_race.plastx.txt \
-e 5e-5 -a 8 \
-max-hsp-per-hit 1 \
-max-hit-per-query 10 \
-seeds-use-ratio 100 \
-outfmt 1 \
-force-query-order 1000 \
-bargraph ;
date 

#BLASTx
date ; 
blastx \
-query ~/test_race.fa \
-db ~/dbs/trembl \
-out /home/jwarner/test_race.blastx.txt  -outfmt 6 -evalue 0.00005 \
-max_target_seqs 10 -max_hsps 1 -num_threads 8 ;
date
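
The date ; command ; date pattern above leaves the subtraction to the reader. A small wrapper can report the elapsed time directly. This is a sketch under the assumption that whole-second resolution is good enough for runs measured in minutes; the function name timed and the output format are made up for this example.

```shell
#!/usr/bin/env bash
# Run a command and report its wall-clock time in minutes on stderr.
# Usage: timed <label> <command> [args...]
timed() {
    local label=$1
    shift
    local start end status
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    awk -v l="$label" -v s="$((end - start))" \
        'BEGIN { printf "%s: %.1f min\n", l, s / 60 }' >&2
    # Pass the wrapped command's exit status through to the caller.
    return "$status"
}

# Example (not run here):
# timed DIAMOND /scripts/diamond/./diamond blastx -d ~/dbs/trembl \
#     -q ~/test_race.fa --out test_race.diamonblastx.txt --outfmt 6
```

Writing the timing to stderr keeps it out of any results that are redirected from stdout.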

And the winner is...

Results:



The results can be binned into two categories: alignment time and sensitivity.

Alignment Time:
Both DIAMOND and PLAST are several times faster than BLAST, clocking in at 346 and 202 minutes respectively. Compared to 1474 minutes for BLAST, that is roughly a 4x and a 7x speedup, which is some serious computational time savings. Good to see, since this speedup is the raison d'être for these alternative programs.
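
The speedups implied by those run times are a one-line calculation. The helper name speedup is made up for this example; the run times are the ones reported above.

```shell
# Fold speedup of a tool relative to a baseline run time (same units for both).
# Usage: speedup <baseline_minutes> <tool_minutes>
speedup() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.1fx\n", base / t }'
}

speedup 1474 346   # DIAMOND vs BLASTx: prints 4.3x
speedup 1474 202   # PLAST vs BLASTx:   prints 7.3x
```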


Sensitivity:
As a proxy for sensitivity, I examined two metrics: the total number of hits found by each program and the number of queries with at least one hit. In both cases BLASTx is the most sensitive, followed by PLASTx and then DIAMOND.
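
Both metrics can be read directly off tabular output. The sketch below assumes one hit per line with the query ID in the first tab-separated column, which is what -outfmt 6 produces for BLAST and DIAMOND; that PLAST's -outfmt 1 follows the same layout is an assumption worth checking against your own output. The function name hit_stats is made up for this example.

```shell
#!/usr/bin/env bash
# Print "<total hits><TAB><queries with at least one hit>" for a tabular hit file.
# Usage: hit_stats <hits.txt>
hit_stats() {
    awk -F'\t' '
        # Skip comment and empty lines; count every remaining line as one hit.
        !/^#/ && NF {
            hits++
            if (!($1 in seen)) { seen[$1] = 1; queries++ }
        }
        END { printf "%d\t%d\n", hits, queries }
    ' "$1"
}

# Example (not run here):
# for f in test_race.blastx.txt test_race.plastx.txt test_race.diamonblastx.txt; do
#     printf '%s\t' "$f"; hit_stats "$f"
# done
```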


Conclusion:


While BLASTx is still the gold standard for sensitivity, its incredibly long alignment times make it a major bottleneck when many thousands of sequences need to be aligned. This is a serious consideration when we are, for example, annotating a large transcriptome (our last assembly was over 200,000 sequences). Both PLAST and DIAMOND provide a major run-time advantage without a considerable loss of sensitivity.

In a head-to-head comparison of PLAST and DIAMOND, PLAST is the clear winner in terms of both speed and sensitivity. Besides this, I found PLAST preferable for three primary reasons:
  1. PLAST allows for finer tuning of sensitivity versus speed by modifying the seed use ratio.
  2. PLAST is compatible with databases created using BLAST's makeblastdb command.
  3. PLAST supports the entire blast family. DIAMOND only includes blastp and blastx.
The second point is especially important if you maintain a local copy of NCBI's nr and update it with update_blastdb.pl or something similar. There are other reasons to like PLAST too; for example, its -bargraph option, which estimates the remaining run time, is nice.

In any event, both tools provide a (much) faster alternative to BLAST and are certainly worth considering the next time a big annotation project comes down the pipe.

References:
  1. Buchfink B, Xie C, Huson DH. (2015) Fast and sensitive protein alignment using DIAMOND. Nature Methods, vol 12, pp 59-60.
  2. Nguyen VH, Lavenier D. (2009) PLAST: parallel local alignment search tool for database comparison. BMC Bioinformatics, vol 10, no 329.
