Post-sequencing: sequence annotation

Image Credit: Wellcome Sanger Institute

Someone discussing sequencing on a screen projected on the wall

How do you identify the genes in a genome?

This is part 3 in our series looking at what happens after DNA is sequenced. You might want to start with part 1, looking at the first step of quality control.

After the sections of DNA sequence have been assembled into a complete genome sequence we need to identify where the genes and key features are.
But how do we do this?
This step is our annotation step.

Illustration highlighting the annotation step of the bioinformatics pipeline after sequencing. Image credit: Laura Olivares Boldú / Wellcome Connecting Science

What’s the challenge?

Following the early steps, we now have our aligned and assembled genome sequence.
How do we identify where the genes and other functional regions of the genome are located?

What do we need to do?

Annotation involves marking where the genes start and stop in the DNA sequence, and also where other relevant and interesting regions are in the sequence.
Pipelines can differ – for example, some elements can be manual while others are automated – but they all share a core set of features.
They are generally divided into two distinct phases: gene prediction and manual annotation.

Gene prediction

There are two types of gene prediction: ab initio (meaning ‘from the beginning’ in Latin) and evidence based.
The information taken from these two prediction methods is then combined and lined up with the sequenced genome.

Ab initio gene prediction

This technique relies on signals within the DNA sequence.
It’s an automated process where a computer is given instructions for finding genes in the sequence and is then left to find them. The computer looks for common sequences known to be found at the start and end of genes such as promoter sequences (where proteins bind that switch on genes), start codons (where the code for the gene product, RNA or protein, starts) and stop codons (where the code for the gene product ends).

Illustration showing the structure of a gene. Image credit: Laura Olivares Boldú / Wellcome Connecting Science

Evidence based gene prediction

This technique relies on evidence beyond the DNA sequence.
It involves gathering various pieces of genetic information from the transcript sequence (mRNA) and known protein sequences of the genome.
With these pieces of evidence it is then possible to get an idea of the original DNA sequence by working backwards through transcription and translation (known as reverse transcription and translation).
For example, if you have the protein sequence it is possible to work out the family of possible DNA sequences it could be derived from by working out which amino acids make up the protein and then which combination of codons could code for those amino acids and so on, until you get to the DNA sequence.

Annotating the genome

Once gene prediction is complete, the second phase – manual annotation or “curation” – can begin.
This is when the information gathered from the prediction phase is looked at, by a person, to find a particular gene or answer a particular question.

Post-sequencing: sequence annotation

How do you identify the genes in a genome?

What’s the challenge?

What do we need to do?

Gene prediction

Ab initio gene prediction

Evidence based gene prediction

Annotating the genome

Sequence comparisons: Read on for part 4 of this miniseries, looking at how you find out the significance of a genome after sequencing.