
Arvind Ravi

Decoding a Virus

BioInformatics, Software · 4 min read

This post is an effort to understand how viruses work from a computational perspective. As someone from a software engineering background, I’m not very familiar with viruses, but after a tiny virus brought the entire world to a standstill, I got curious.

This post doesn’t aim to educate or teach anything and is simply meant to journal my research into how viruses function. So, by no means is this any kind of reference.


Simply put, a virus is a microscopic agent that causes your body’s biological systems to malfunction. Viruses are usually made of long molecules of DNA or RNA, a protein coat called the capsid that protects the genetic material, and in some cases an outer layer of lipids called the envelope.


In the case of Covid-19, the virus (SARS-CoV-2) is RNA based rather than DNA based. Let’s take a look at what it looks like.

The genome sequence looks like this:

The sequence of letters here is a representation of the genetic material of the virus. It encodes the proteins that the virus is composed of, and cells in the human body can decode it. In the case of viruses, these proteins carry out instructions that are malicious in nature. Very often, this has to do with replication of the virus itself: our cells happily carry out the virus’ instructions as long as they are provided in the appropriate format.

In the case of SARS-CoV-2, the entire attack happens in a few stages:

  • Once the virus enters the human body, one of its proteins, known as the Spike (S) protein, targets the enzyme known as ACE2, which is attached to the outer surface of lung, spleen or intestinal cells.
  • After binding to the surface of the cell, the virus goes through a process called cleavage, which enables it to inject its genetic material into the cell it’s attached to.
  • The other proteins have very clear instructions for the cell, which are to replicate and then self-destruct.

Now it’s a perfect closed system for the virus, in which it can thrive and exploit the cells, multiplying itself and destroying its host cells, very much like a virus that operates in a computer system.

In addition to this, the interaction between the spike protein and the ACE2 receptor induces a drop in levels of ACE2 because of the degrading effect it has on the cells, leading to lung damage.

Looking into the Virus

Let’s take a look at how much information the virus carries:

That’s about 29k of data. Nature is really good at compression, especially with RNA based viruses. This is due to something called gene overlap, where some nucleotides code for two genes. Nature likes to keep it [DRY].

Let’s compress our genome to get an idea of how small the virus could be, using [LZMA] to perform the compression:

That’s 8.4k. That’s how big it is. 8k.
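The size check and compression above could be sketched roughly as follows. This is a minimal sketch using Python's standard `lzma` module; `sizes` and `sample` are hypothetical names, and `sample` is a short made-up stand-in rather than the real genome, so the numbers it prints are not the ones quoted above.

```python
import lzma
from typing import Tuple


def sizes(genome: str) -> Tuple[int, int]:
    """Return (raw size, LZMA-compressed size) in bytes for a genome string."""
    raw = genome.encode("ascii")  # one byte per nucleotide letter
    return len(raw), len(lzma.compress(raw))


# Hypothetical stand-in sequence, NOT the real genome.
sample = "attaaaggtttataccttcccaggtaacaaacc" * 100

raw_size, packed_size = sizes(sample)
print(raw_size, packed_size)
```

Passing the full genome string to `sizes` instead of `sample` would give the two figures discussed in this section. The repeated stand-in compresses far better than a real genome would, so its ratio should not be read as meaningful.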

Decoding / Translating

Within our body, our cells decode the genetic material of the virus into instructions. There are broadly two processes that take place in our cells:

  • one is to convert DNA into RNA, which is known as transcription.
  • the other is to convert RNA into proteins, which is known as translation.

In the case of SARS-CoV-2, which is an RNA based virus, what happens is translation, which is simply the conversion of RNA based genetic material into smaller components called proteins. Luckily for us, we can simulate this translation by using an RNA codon table and converting our string representation of the RNA sequence into smaller proteins.

RNA Codon Table —


Before we translate our sequence into proteins using the codon table, the representation needs to be tidied up into a format which we could work with:

We initially get a reference to the codon table, then it is put through a series of cleanup and mapping operations to build our decoder —

  1. dec represents the final value
  2. We loop through the codon table string by splitting it up using the newline character
  3. Then split each line using the tab character (\x09) so this gives us key-value pairs of each codon
  4. Replace our STOP codon with a #
  5. Sanitize our value string by removing special characters, and replacing “u” with “t” since our sequence uses “t” in place of “u”
  6. Finally, loop through the values and add them to the dec dictionary with their corresponding keys
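The steps above could be sketched roughly like this. The table layout is an assumption: each line is taken to be an amino acid label, a tab, then a comma-separated list of codons. `CODON_TABLE` is a short hypothetical excerpt rather than the full table, and `build_decoder` is a name introduced here for illustration.

```python
# Hypothetical excerpt of a codon table: "Name / Letter<TAB>codons".
# The real table has an entry for every amino acid; this layout is assumed.
CODON_TABLE = "\n".join([
    "Phe / F\tUUU, UUC",
    "Met / M\tAUG",
    "Trp / W\tUGG",
    "Lys / K\tAAA, AAG",
    "STOP\tUAA, UGA, UAG",
])


def build_decoder(table: str) -> dict:
    """Map each lowercase codon (with t in place of u) to a one-letter code."""
    dec = {}
    for line in table.split("\n"):
        if not line.strip():
            continue  # skip blank lines
        key, value = line.split("\t", 1)
        key = key.split("/")[-1].strip()  # keep only the one-letter code
        if key == "STOP":
            key = "#"  # mark STOP codons with a single character
        # Drop separators, lowercase, and swap u for t to match the sequence.
        codons = value.replace(",", " ").replace(";", " ").lower().replace("u", "t").split()
        for codon in codons:
            dec[codon] = key
    return dec


dec = build_decoder(CODON_TABLE)
print(dec)
```

With the full table as input, the resulting dictionary would have one entry for each of the 64 codons.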

This would give us a decoder which we could use for translation, that looks something like —

Now that we have a decoder, all that’s left to do is to translate our RNA sequence string and inspect it. This could be done like so:
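A minimal sketch of the translation step, under a few assumptions: to keep the block self-contained, `dec` is rebuilt here from the standard genetic code in compact form instead of being parsed from a table, and `translate` and its `frame` parameter are hypothetical names. Codons the decoder does not know are shown as `?`.

```python
from itertools import product

# Standard genetic code, with codons enumerated in t, c, a, g order.
# "#" stands for a STOP codon, matching the decoder convention used here.
BASES = "tcag"
AMINO = "FFLLSSSSYY##CC#WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
dec = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq: str, frame: int = 0) -> str:
    """Translate a nucleotide string three letters at a time from `frame`."""
    seq = seq.lower()
    out = []
    for i in range(frame, len(seq) - 2, 3):
        out.append(dec.get(seq[i:i + 3], "?"))  # "?" marks unknown codons
    return "".join(out)


print(translate("atgtttaaatggtaa"))  # -> MFKW#
```

Reading from a different `frame` (0, 1 or 2) shifts which triplets are read, so the same sequence can decode into very different chains.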

In order to inspect chains of proteins from the outcome of this, we could split the chains at the STOP codons:
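Splitting at STOP codons could be sketched as below, assuming STOP is encoded as `#` in the translated string. `split_chains` and `min_length` are hypothetical names, and the input string is a made-up example rather than real output.

```python
def split_chains(translated: str, min_length: int = 1) -> list:
    """Split a translated string at STOP markers and drop short chains."""
    return [chain for chain in translated.split("#") if len(chain) >= min_length]


# Made-up translated string with three STOP markers.
chains = split_chains("MFKW#KK#MWWFKM#", min_length=3)
print(chains)  # -> ['MFKW', 'MWWFKM']
```

Raising `min_length` is one simple way to filter out the many short chains and keep only the long ones.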

This gives us a list of proteins. I've removed most small chain proteins from this list to keep this brief (it's a REALLY long list of proteins otherwise) and have kept the interesting long chain ones:

Spike Protein

The spike protein is the protein in the virus that’s responsible for cell entry. It binds with the target cell, mediates further virus-cell interaction and at times issues instructions for replication. Fun fact: spike proteins can usually be observed under an electron microscope and are about 20nm long. In the case of SARS-CoV-2, the Spike glycoprotein is responsible for cell entry and binds to the surface of lung or spleen cells. It’s usually a long chain protein and can be easily detected once we’ve decoded our sequence. In our case, we could zero in on it by cross-checking it against the NCBI’s report about the polyprotein chain.

Looking for the spike protein's site from the NIH's spec sheet, we can figure out where the spike protein is in our sequence and take a look at it:
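Pulling a gene out of the genome by its annotated position could look roughly like this. GenBank-style annotations use 1-based, inclusive coordinates, so they need shifting for Python slicing. `extract_region` is a hypothetical helper and the toy genome is made up. For reference, the NCBI record NC_045512.2 annotates the S gene at 21563..25384, though whether the sequence used in this post shares those coordinates is an assumption.

```python
def extract_region(genome: str, start: int, end: int) -> str:
    """Return genome[start..end] using 1-based, inclusive coordinates."""
    if not 1 <= start <= end <= len(genome):
        raise ValueError("coordinates fall outside the genome")
    return genome[start - 1:end]  # shift to Python's 0-based, end-exclusive slice


# Toy genome: 3 filler letters, a 15-letter "gene", then 2 filler letters.
toy_genome = "ccc" + "atgtttaaatggtaa" + "gg"
gene = extract_region(toy_genome, 4, 18)
print(gene)  # -> atgtttaaatggtaa
```

Slicing the real genome with the annotated coordinates and translating the result is the step being performed here.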

Which gives us the Spike Glycoprotein --

Pretty fascinating, isn't it?


On the surface of our lungs, there’s a certain enzyme called Angiotensin-converting enzyme 2 (ACE2), which among other things is responsible for keeping our blood pressure stable. The spike protein in SARS-CoV-2 targets this receptor and binds with it, exploiting an otherwise closed biological system. Once the binding takes place, a process called cleaving happens, which then enables information transfer between the virus and the cell. Not long after, replication instructions are issued. The whole process is extremely well orchestrated and personally, I find this really cool.


Because the virus is really contagious, and has also led to a collapse in the economy, it has naturally garnered a lot of interest in the development of a vaccine. One of the teams at Oxford University has successfully been able to activate an antibody response by using a strain of common cold virus as a vehicle to carry the spike protein from SARS-CoV-2. While vaccines generally take a lot of time to go through clinical trials and approval, we should remain hopeful and do everything to minimise any further damage to ourselves.


Here's the IPython notebook in case you're interested.