Machine learning (in the informatics world) is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are too. Juvenile comparisons aside, the power of these tools can't be ignored. What really piqued my interest was reading Adam Geitgey's post on Medium where he builds a Super Mario level using a neural network. His entire eight-part series, by the way, is an awesome primer on machine learning. More recently, I read a paper by Wang et al. that applied deep learning to transcription factor binding, and I was inspired to learn more. Using deep learning tools for DNA analysis first requires converting DNA sequences to numbers. We can do this by one hot encoding our DNA sequence.

One Hot Encoding and DNA sequences:


One hot encoding is a way to represent categorical data as binary vectors. For DNA, we have four categories: A, T, G, and C.

Thus a one hot code for DNA could be:
A = [1, 0, 0, 0]
T = [0, 1, 0, 0]
G = [0, 0, 1, 0]
C = [0, 0, 0, 1]

So the sequence AATTC would be:
[[1, 0, 0, 0],
[1, 0, 0, 0],
[0, 1, 0, 0],
[0, 1, 0, 0],
[0, 0, 0, 1]]
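
As a quick sanity check, the mapping above can be written as a plain Python dictionary lookup. This is only a minimal sketch: the names ONE_HOT and one_hot_encode are made up for illustration, and it assumes the sequence contains only uppercase A, T, G, and C.

```python
# Hypothetical lookup table for illustration; uses the A, T, G, C column order shown above.
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "T": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "C": [0, 0, 0, 1],
}

def one_hot_encode(sequence):
    """Return one row of four 0/1 values per base (raises KeyError on any other character)."""
    return [ONE_HOT[base] for base in sequence]

encoded = one_hot_encode("AATTC")
for row in encoded:
    print(row)
```

Printing the rows one at a time gives the same five-row layout as the AATTC example.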

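You might be asking, 'why not just use A=1, T=2, G=3, C=4?'
The answer is that we can. This is called integer encoding, and we need it as an intermediate step in our encoding solution below. On its own, though, integer encoding implies an ordering among the bases (C > G > T > A) that has no biological meaning, which is why we go one step further to one hot encoding.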
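
To see how integer encoding feeds into one hot encoding, here is a minimal NumPy sketch (not the scikit-learn approach used below). The names BASE_TO_INT, integer_encode, and to_one_hot are hypothetical, and the codes start at 0 rather than 1 so that each code can double as a column index.

```python
import numpy as np

# Hypothetical mapping for illustration; 0-based so each code is also a column index.
BASE_TO_INT = {"A": 0, "T": 1, "G": 2, "C": 3}

def integer_encode(sequence):
    """Map each base to its integer code."""
    return np.array([BASE_TO_INT[base] for base in sequence])

def to_one_hot(int_codes, n_categories=4):
    """Pick row i of the identity matrix for each integer code i."""
    return np.eye(n_categories, dtype=int)[int_codes]

int_codes = integer_encode("AATTC")
one_hot = to_one_hot(int_codes)
print(int_codes)
print(one_hot)
```

Because n_categories is fixed at 4, a sequence that happens to lack one of the bases still comes out with four columns.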

To do this in Python, I found a great one hot encoding tutorial by Jason Brownlee that takes advantage of the scikit-learn library, and I adapted it to work with DNA sequences. Note that scikit-learn comes pre-installed with the Anaconda distribution. It made the most sense to me to build this as a Python class that can be reused for many sequences and that stores their attributes for later use. The class hot_dna takes a FASTA string as its argument. The first chunk checks for and stores the sequence name (the header line that starts with '>', up to the newline). Then the sequence is converted to an array for integer encoding. The integer encoding is carried out using LabelEncoder(), which numbers the labels in sorted order, so A=0, C=1, G=2, T=3 and the one hot columns come out as A, C, G, T rather than the A, T, G, C order in the illustration above. Next, the integer encoded DNA is one hot encoded using OneHotEncoder(). Finally, these encodings and the original sequence, along with its name, get loaded as attributes. Check it out below:

from numpy import array
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

class hot_dna:
    def __init__(self, fasta):

        # check for and grab the sequence name (the header line starting with '>')
        if fasta.startswith(">"):
            lines = fasta.strip().split("\n")
            name = lines[0]
            # join the remaining lines so a wrapped FASTA sequence stays whole
            sequence = "".join(lines[1:])
        else:
            name = 'unknown_sequence'
            # drop any newlines or other whitespace
            sequence = "".join(fasta.split())

        # get sequence into an array of single characters
        seq_array = array(list(sequence))

        # integer encode the sequence
        # (LabelEncoder sorts the labels: A=0, C=1, G=2, T=3 when all four bases are present)
        label_encoder = LabelEncoder()
        integer_encoded_seq = label_encoder.fit_transform(seq_array)

        # reshape to a single column because OneHotEncoder expects 2D input
        integer_encoded_seq = integer_encoded_seq.reshape(len(integer_encoded_seq), 1)

        # one hot the sequence
        # (sparse_output was called sparse before scikit-learn 1.2)
        onehot_encoder = OneHotEncoder(sparse_output=False)
        onehot_encoded_seq = onehot_encoder.fit_transform(integer_encoded_seq)

        # add the attributes to self
        self.name = name
        self.sequence = sequence
        self.integer = integer_encoded_seq
        self.onehot = onehot_encoded_seq

And here's what a FASTA looks like going through:
  
# EXAMPLE
fasta = ">fake_sequence\nATGTGTCGTAGTCGTACG"
my_hottie = hot_dna(fasta)

print(my_hottie.name)
>fake_sequence
print(my_hottie.sequence)
ATGTGTCGTAGTCGTACG
print(my_hottie.integer)
[[0]
 [3]
 [2]
 [3]
 [2]
 [3]
 [1]
 [2]
 [3]
 [0]
 [2]
 [3]
 [1]
 [2]
 [3]
 [0]
 [1]
 [2]]
print(my_hottie.onehot)
[[ 1.  0.  0.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  0.  1.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  0.  1.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  1.  0.  0.]
 [ 0.  0.  1.  0.]
 [ 0.  0.  0.  1.]
 [ 1.  0.  0.  0.]
 [ 0.  0.  1.  0.]
 [ 0.  0.  0.  1.]
 [ 0.  1.  0.  0.]
 [ 0.  0.  1.  0.]
 [ 0.  0.  0.  1.]
 [ 1.  0.  0.  0.]
 [ 0.  1.  0.  0.]
 [ 0.  0.  1.  0.]]
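
Going the other way is handy for checking the encoding. Here is a minimal sketch, assuming the columns are in the sorted order A, C, G, T that LabelEncoder produces when all four bases are present. The helper decode_onehot is hypothetical and not part of the class above.

```python
import numpy as np

def decode_onehot(onehot, alphabet="ACGT"):
    """Recover the sequence by finding the position of the 1 in each row."""
    indices = np.argmax(onehot, axis=1)
    return "".join(alphabet[i] for i in indices)

# First four rows of the one hot output above
rows = np.array([
    [1., 0., 0., 0.],
    [0., 0., 0., 1.],
    [0., 0., 1., 0.],
    [0., 0., 0., 1.],
])
decoded = decode_onehot(rows)
print(decoded)
```

Under the same assumption, passing my_hottie.onehot to this helper should give back the original DNA sequence.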

All set for your machine learning algorithms! Which may or may not be the topic of a future post.

LearnMeSomeBiology(my_hottie.onehot)


Until next time!