Monday, April 3, 2017

Bash commands for manipulating fasta sequence files

Here's a collection of bash one liners to manipulate fasta files that I've found incredibly useful. Some by me, some found on various threads on Biostars and stackexchange.

Just a reminder, a fasta file is a collection of sequence where each sequence identifier is preceded by a '>' and a newline. Example:

>Sequence_1
ATGCGGCAGAGTCTAGT
>Sequence_2
GTCTAGCTCGTCATGTA

Count the number of sequences in a fasta:
grep -c "^>" file.fa
Add something to the end of a fasta line. In this case, YOURID:
sed 's/>.*/&YOURID/' file.fa > output.fa
Add an an ID with increasing integer to a fasta:
awk '/^>/ {$1=$1 " YOURID_" ++i}1' file.fa > output.fa
Remove everything in the header of a fasta after the first space:
awk '{print $1;next}1' file.fa > output.fa
This does the same thing with sed:
sed 's/\s.*$//' < file.fa > output.fa
Remove everything after the second field of a fasta header (you could add as many fields as you like by adding " " $3 " " $4 " " $5 etc):
awk '/^>/ {print $1 " " $2;next}1' file.fa > output.fa
Get a list of IDs from a fasta file:
grep -o -E "^>\w+" file.fa | tr -d ">"
Split a fasta into files containing X number of sequences. This splits a fasta into several files containing 1000 sequences each:
awk 'BEGIN {n_seq=0;} /^>/ {if(n_seq%1000==0){file=sprintf("output.split.%d.fa",n_seq);} print >> file; n_seq++; next;} { print >> file; }' < file.fa
Drop sequences shorter than X basepairs in a fasta (200 in the example below):
awk '!/^>/ { next } { getline seq } length(seq) >= 200 { print $0 "\n" seq }' file.fa > output.fa
Convert multiline fasta to single line:
sed ':a;N;/^>/M!s/\n//;ta;P;D' < file.fa > output.fa
Also with awk:
awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < file.fa > output.fa
Get the sequence of 'geneID' from a multi-fasta.
awk '/'geneID'/{flag=1;print $0;next}/^>/{flag=0}flag' file.fasta >> output.fa

No comments:

Post a Comment

Clustering RNAseq data using K-means: how many clusters?

Clustering RNAseq data using K-means: how many clusters? Note this is part 2 of a ser...