Post-sequencing: sequence annotation  

Image Credit: Wellcome Sanger Institute

How do you identify the genes in a genome?

This is part 3 in our series looking at what happens after DNA is sequenced. You might want to start with part 1, looking at the first step of quality control.


  • After the sections of DNA sequence have been assembled into a complete genome sequence we need to identify where the genes and key features are. 
  • But how do we do this?
  • This step is our annotation step.


Illustration highlighting the third step of the bioinformatics pipeline after sequencing: quality control, assembly, annotation, comparison of genomes. Image credit: Laura Olivares Boldú / Wellcome Connecting Science


What’s the challenge?


  • Following the early steps, we now have our aligned and assembled genome sequence. 
  • How do we identify where the genes and other functional regions of the genome are located?


What do we need to do?


  • Annotation involves marking where the genes start and stop in the DNA sequence, and also where other relevant and interesting regions are in the sequence.
  • Pipelines can differ – for example, some elements can be manual while others are automated – but they all share a core set of features.
  • They are generally divided into two distinct phases: gene prediction and manual annotation.


Gene prediction

  • There are two types of gene prediction: ab initio (meaning ‘from the beginning’ in Latin) and evidence based. 
  • The information taken from these two prediction methods is then combined and lined up with the sequenced genome.

Ab initio gene prediction

  • This technique relies on signals within the DNA sequence. 
  • It’s an automated process where a computer is given instructions for finding genes in the sequence and is then left to find them. The computer looks for common sequences known to be found at the start and end of genes such as promoter sequences (where proteins bind that switch on genes), start codons (where the code for the gene product, RNA or protein, starts) and stop codons (where the code for the gene product ends).


Illustration showing the structure of a gene. Image credit: Laura Olivares Boldú / Wellcome Connecting Science


Evidence based gene prediction

  • This technique relies on evidence beyond the DNA sequence. 
  • It involves gathering various pieces of genetic information from the transcript sequence (mRNA) and known protein sequences of the genome. 
  • With these pieces of evidence it is then possible to get an idea of the original DNA sequence by working backwards through transcription and translation (known as reverse transcription and translation). 
  • For example, if you have the protein sequence it is possible to work out the family of possible DNA sequences it could be derived from by working out which amino acids make up the protein and then which combination of codons could code for those amino acids and so on, until you get to the DNA sequence.


Annotating the genome


  • Once gene prediction is complete, the second phase – manual annotation or “curation” – can begin.
  • This is when the information gathered from the prediction phase is looked at, by a person, to find a particular gene or answer a particular question.

Sequence comparisons:  Read on for part 4 of this miniseries, looking at how you find out the significance of a genome after sequencing.