How do you identify the genes in a genome?

After the sections of DNA sequence have been assembled into a complete genome sequence, we need to identify where the genes and other key features are. But how do we do this?

Bioinformatics 3: annotation

What’s the challenge?

  • We have our aligned and assembled genome sequence but how do we identify where the genes and other functional regions of the genome are located?

What do we need to do?

  • Annotation involves marking where the genes start and stop in the DNA sequence and also where other relevant and interesting regions are in the sequence.
  • Genome annotation pipelines differ from one another (for example, some steps are manual while others are automated), but they all share a core set of features.
  • They are generally divided into two distinct phases: gene prediction and manual annotation.

Gene prediction

  • There are two types of gene prediction: 
    • Ab initio – this technique relies on signals within the DNA sequence. It is an automated process whereby a computer is given instructions for finding genes in the sequence and is then left to find them. The computer looks for common sequences known to be found at the start and end of genes, such as promoter sequences (where proteins bind that switch on genes), start codons (where the protein-coding sequence starts) and stop codons (where the protein-coding sequence ends).
Illustration showing the structure of a gene. Image credit: Genome Research Limited

    • Evidence-based – this technique relies on evidence beyond the DNA sequence. It involves gathering information from transcript (mRNA) sequences and known protein sequences. With this evidence it is possible to infer the original DNA sequence by working backwards through translation and transcription. For example, if you have a protein sequence, you can work out the family of possible DNA sequences it could be derived from: identify the amino acids that make up the protein, then the codons that could code for each amino acid, and so on until you reach the DNA sequence.
    • The information taken from these two prediction methods is then combined and lined up with the sequenced genome.
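
As a rough illustration of the two prediction methods and how their results can be lined up, here is a minimal Python sketch. It is a toy under stated assumptions, not how production gene predictors work: it scans only the forward strand, ignores promoters and introns, and treats any ATG-to-stop stretch as a candidate gene. The function names (find_orfs, translate, back_translation_pattern, count_back_translations), the short genome string and the protein "MKWF" are all invented for this example.

```python
import re
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
STOP_CODONS = {codon for codon, aa in CODON_TABLE.items() if aa == "*"}

# Reverse lookup: amino acid -> every codon that can encode it.
CODONS_FOR = {}
for codon, aa in CODON_TABLE.items():
    CODONS_FOR.setdefault(aa, []).append(codon)


def find_orfs(dna, min_codons=3):
    """Ab initio step: return (start, end) of forward-strand ATG...stop stretches.

    Positions are 0-based and end is exclusive; min_codons counts the start
    and stop codons too.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


def translate(dna):
    """Translate a coding sequence, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def back_translation_pattern(protein):
    """Evidence step: regex matching every DNA sequence that could encode protein."""
    return "".join("(?:" + "|".join(CODONS_FOR[aa]) + ")" for aa in protein)


def count_back_translations(protein):
    """Size of the family of DNA sequences that could encode protein."""
    total = 1
    for aa in protein:
        total *= len(CODONS_FOR[aa])
    return total


# Toy data (invented for illustration).
genome = "GGCATGAAATGGTTCTGATTA"
protein_evidence = "MKWF"

orfs = find_orfs(genome)
match = re.search(back_translation_pattern(protein_evidence), genome)

# Combine both lines of evidence: keep ORFs that contain the protein match.
supported = [(s, e) for s, e in orfs
             if match and s <= match.start() and match.end() <= e]

print("Predicted ORFs:", orfs)
print("Possible DNA sequences for the protein:", count_back_translations(protein_evidence))
print("ORFs supported by protein evidence:", supported)
```

Representing the protein as a pattern of alternative codons avoids listing every possible DNA sequence, which grows quickly with protein length. Real pipelines rely on statistical gene models and sequence alignment tools rather than exact matching.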

Manual annotation

  • Once gene prediction is completed, the second phase, manual annotation or “curation”, can begin.
  • This is when a person reviews the information gathered in the prediction phase in order to find a particular gene or answer a particular question.

Comparing genomes

  • Once annotated, the sequence can be compared to the known genome sequence of similar or closely related organisms in order to identify any key similarities or differences. 
    • For example, the genome sequence data of an animal, or model organism, can be annotated and then compared to the annotated sequence of a human. 
    • By comparing them it is possible to identify any similar genes. The mouse genome, for example, is very similar to the human genome.
    • This information can then be used to investigate similarities in the phenotypes of the mouse and human. For example, a genetic variant is linked to deafness in the mouse, but is this the case in the human as well?
    • Mutants can also be created (an organism with a specific genetic mutation) in order to investigate the function of a particular gene. For example, this gene is linked to ear development, but what is the effect when that gene is not functioning? 
  • Alternatively, the sequencing data can be placed alongside the reference genome for that species in order to find out more about the origins of particular characteristics or diseases.
    • The 1000 Genomes Project, which launched in 2008, aimed to produce a catalogue of human genetic variation by sequencing the genomes of around 2,500 anonymous people from 26 populations around the world.
    • The UK10K project, launched by the Wellcome Trust in 2010, aimed to compare the genomes of 4,000 healthy people with those of 6,000 people living with a disease of suspected genetic cause, such as severe obesity.
  • Once the sequencing data is aligned to the reference genome it is possible to compare them in order to highlight where the differences are.
  • This information is then compared with data from existing gene annotations.
  • Conclusions can then be drawn about the significance of the differences, and how they may affect gene expression and contribute to a specific disease or trait.  
Illustration showing the point mutation in the β-globin gene responsible for the genetic blood disorder β-thalassaemia. Image credit: Genome Research Limited.
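
The comparison steps described above (find the differences from the reference, then check them against existing gene annotations) can be sketched in a few lines of Python. This is a simplified illustration that assumes the sample is already aligned base-for-base to the reference and contains only single-base substitutions, with no insertions or deletions. The sequences, the Gene record and the gene name "toy_gene" are made up for the example and do not represent the β-globin gene.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    """A hypothetical annotation record: gene name and position on the reference."""
    name: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def find_differences(reference, sample):
    """Return (position, reference_base, sample_base) for every mismatch."""
    if len(reference) != len(sample):
        raise ValueError("this sketch assumes equal-length, pre-aligned sequences")
    return [(pos, ref, alt)
            for pos, (ref, alt) in enumerate(zip(reference, sample))
            if ref != alt]


def annotate(differences, genes):
    """Label each difference with the annotated genes that overlap it."""
    annotated = []
    for pos, ref, alt in differences:
        hits = [g.name for g in genes if g.start <= pos < g.end]
        annotated.append((pos, ref, alt, hits or ["intergenic"]))
    return annotated


# Toy data (invented for illustration).
reference = "GGCATGAAATGGTTCTGATTA"
sample    = "GGCATGAAATGATTCTGATTC"
genes = [Gene("toy_gene", 3, 18)]

differences = find_differences(reference, sample)
annotated = annotate(differences, genes)

for pos, ref, alt, hits in annotated:
    print(f"position {pos}: {ref} -> {alt} ({', '.join(hits)})")
```

In this toy example the first difference falls inside the annotated gene and turns the codon TGG into the stop codon TGA, the kind of change that would be flagged for closer study. The second difference lies outside any annotated gene.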


  • The speed of annotation depends on the research question and the accuracy needed to answer it. As a result, annotating a genome can take anywhere from days to years.


This page was last updated on 2016-01-25