How do you put a genome back together after sequencing?

After DNA sequencing is complete, the fragments of DNA that come out of the machine are all jumbled up. As with a jigsaw puzzle, we need to take the pieces of the genome and put them back together.

Bioinformatics 2: assembly

What’s the challenge?

  • The technology of DNA sequencing is not 100 per cent accurate and therefore there are likely to be errors in the DNA sequence that is produced.
  • To account for these potential errors, each base in the genome is sequenced a number of times over. This is called coverage. For example, 30 times (30-fold) coverage means each base is sequenced 30 times.
  • Effectively, the more times you sequence, or “read”, the same section of DNA, the more confidence you have that the final sequence is correct.
  • 30- to 50-fold coverage is currently the standard used when sequencing human genomes to a high level of accuracy.
  • During the Human Genome Project, coverage was only between 5- and 10-fold, and the sequencing technology was different from those used today. Coverage has increased for a few reasons:
    • Although most current sequencing techniques are now faster than they were during the Human Genome Project, some sequencing technologies have a higher error rate.
    • Some sequencing technologies deal with shorter reads of DNA which means that gaps are more likely to occur when the genome is assembled. Having a higher coverage reduces the likelihood of there being gaps in the final assembled sequence.
    • It is also much cheaper to carry out sequencing to a higher coverage than it was at the time of the Human Genome Project.
  • High coverage means that after sequencing DNA we have lots and lots of pieces of DNA sequence (reads).
  • To put this into perspective, once a human genome has been fully sequenced we have around 100 gigabases (100,000,000,000 bases) of sequence data.
  • Like the pieces of a jigsaw puzzle, these DNA reads are jumbled up so we need to piece them together and put them in the correct order to assemble the genome sequence.
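
The coverage arithmetic above can be sketched in a few lines of Python. This is only an illustration: the genome size of roughly 3 billion bases is an assumption that is not stated above, and the function names are hypothetical. The second function shows, in a very simplified way, why reading the same base many times gives more confidence: the most frequently observed base is taken as the answer.

```python
from collections import Counter

# Assumption: the human genome is roughly 3 billion bases long.
HUMAN_GENOME_SIZE = 3_000_000_000


def total_bases_sequenced(genome_size_bases, coverage):
    """Approximate number of bases read when each base is sequenced `coverage` times."""
    return genome_size_bases * coverage


def consensus_base(observed_bases):
    """Pick the base seen most often among the reads covering one position."""
    return Counter(observed_bases).most_common(1)[0][0]


for fold in (30, 50):
    gigabases = total_bases_sequenced(HUMAN_GENOME_SIZE, fold) / 1e9
    print(f"{fold}-fold coverage: about {gigabases:.0f} gigabases of reads")

# Ten reads cover the same position; one of them contains an error (G).
print(consensus_base("AAAGAAAAAA"))
```

Under that assumed genome size, 30-fold coverage gives about 90 gigabases, which is in line with the figure of around 100 gigabases quoted above.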

What do we need to do?

  • Put the pieces together in the correct order to construct the complete genome sequence and identify any areas of interest.
  • This is done using processes called alignment and assembly:  
    • Alignment is when the new DNA sequence is compared to existing DNA sequences to find any similarities or discrepancies between them and then arranged to show these features. Alignment is a vital part of assembly.
    • Assembly involves taking a large number of DNA reads, looking for areas in which they overlap with each other and then gradually piecing together the ‘jigsaw’. It is an attempt to reconstruct the original genome. This is primarily carried out for de novo sequences.
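
The idea of looking for overlaps and piecing reads together can be illustrated with a toy Python sketch. It assumes error-free reads that are already in the right order and on the same strand, which real data never guarantees. The function names and the minimum overlap of 3 bases are hypothetical choices for this example, not how any particular assembly programme works.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest end of `left` that matches the start of `right`."""
    max_len = min(len(left), len(right))
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def merge_reads(left, right, min_overlap=3):
    """Join two reads on their overlap, or return None if the overlap is too short."""
    length = overlap_length(left, right, min_overlap)
    if length == 0:
        return None
    return left + right[length:]


# Three short, made-up reads that overlap end to end.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]

assembled = reads[0]
for read in reads[1:]:
    assembled = merge_reads(assembled, read)

print(assembled)
```

Real assemblers also have to work out the order of the reads, cope with sequencing errors and handle repeated regions, none of which this sketch attempts.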

De novo sequencing

  • De novo sequencing is when the genome of an organism is sequenced for the first time.
  • In de novo assembly there is no existing reference genome sequence for that species to use as a template for the assembly of its genome sequence.
  • If you know that the new species is very similar to another species that does have a reference genome, it is possible to assemble the sequence using a similar genome as a guide.
  • To help assemble a de novo sequence a physical gene map can be developed before sequencing to highlight the “landmarks” so the scientists know where sections of DNA are located in relation to each other.
  • Producing a gene map can be an expensive process, so some assembly programmes rely on data consisting of a mix of single and paired-end reads (see illustration below):
    • Single reads are where one end or the whole of a fragment of DNA is sequenced. These sequences can then be joined together by finding overlapping regions in the sequence to create the full DNA sequence.
    • Paired-end reads are where both ends of a fragment of DNA are sequenced. The distance between paired-end reads can be anywhere between 200 base pairs and several thousand. The key advantage of paired-end reads is that scientists know how far apart the two ends are. This makes it easier to assemble them into a continuous DNA sequence. Paired-end reads are particularly useful when assembling a de novo sequence as they provide long-range information that you wouldn’t otherwise have in the absence of a gene map.
Illustration showing the difference between single and paired-end reads. Image credit: Genome Research Limited
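
As a rough sketch of why knowing the distance between paired-end reads helps, the Python below works out where the second read of a pair would be expected once the first has been placed. The fragment length of 500 bases, the read length of 100 bases and the function names are all assumptions made for illustration. In practice the fragment length is only known approximately.

```python
def unsequenced_middle(fragment_length, read_length):
    """Number of bases in the middle of the fragment that neither read covers."""
    return max(fragment_length - 2 * read_length, 0)


def expected_mate_start(read_start, fragment_length, read_length):
    """Expected start position of the second read, given where the first one starts."""
    return read_start + fragment_length - read_length


FRAGMENT_LENGTH = 500  # assumed fragment size in bases
READ_LENGTH = 100      # assumed read length in bases

print(unsequenced_middle(FRAGMENT_LENGTH, READ_LENGTH))
print(expected_mate_start(1000, FRAGMENT_LENGTH, READ_LENGTH))
```

If the first read is placed at position 1,000, the second read is expected about 400 bases further along, even though the 300 bases between the two reads were never sequenced from this fragment.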

  • Assembly of a de novo sequence begins with a large number of short sections or “reads” of DNA.
  • These reads are compared to each other and those sharing the same DNA sequence are grouped together.
  • From here they are assembled into progressively larger sections to form long, unbroken stretches of sequence called “contigs” (short for contiguous sequences).
  • These contigs can then be grouped together with information taken from other technologies to provide clues for how to stitch the contigs together and roughly how far apart to place them, even if the sequence in between is still unknown. This is called “scaffolding”.
  • The assembly can be further refined by ordering the individual scaffolds into chromosomes. A physical gene map is a useful tool for doing this.
  • The resulting assembly is then fed on to the next stage of the process – annotation, which identifies where the genes and other features in the sequence start and stop.
  • The assembly of a genome is a computationally intensive job. It usually takes around 20 hours per gigabase of sequence for genome assembly programmes to stitch together an organism’s genome sequence from the reads of DNA sequence generated by the sequencing machines.
  • So, with the 100 gigabases of sequence data we have after sequencing a human genome, it will take 2,000 hours or around 83 days to assemble the complete sequence.
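
The time estimate above is simple arithmetic, shown here as a short Python sketch. It assumes that assembly time grows in direct proportion to the amount of sequence data and that the job runs continuously around the clock. The function name is hypothetical.

```python
HOURS_PER_GIGABASE = 20  # figure quoted in the text above


def assembly_time(gigabases, hours_per_gigabase=HOURS_PER_GIGABASE):
    """Return the estimated assembly time as (hours, days)."""
    hours = gigabases * hours_per_gigabase
    return hours, hours / 24


hours, days = assembly_time(100)
print(f"{hours} hours, or about {days:.0f} days")
```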

Resequencing

  • This is when the genome being sequenced is known to be from a species that has been sequenced before and therefore a reference genome is available.
  • Resequencing is a term that can be used to describe two distinct processes:
    • One use of resequencing is for improving the quality of the existing DNA sequence for that organism.
      • For example, the Human Genome Project, which was completed in 2003, provided the first fully assembled sequence of the human genome.
      • Since then scientists have been working to produce a reference sequence of a higher quality and accuracy.
      • As a result, the human reference genome has been vastly improved since 2003, with scientists correcting errors, rearranging the order of the individual contigs and filling any remaining gaps in the sequence.
    • Another use of resequencing is when we sequence the genome of an individual from a species that we already have a reference genome for and know a bit about. We can then compare the new genome sequence with that of the reference and find out how they vary.
      • For example, if there is a base-pair change in the new genome that isn’t present in the reference genome it may give a clue as to the genetic origin of a particular trait or disease.
      • The availability of a reference human genome since 2003 has allowed for projects such as the 1000 Genomes Project and UK10K.
      • The 1000 Genomes Project, which launched in 2008, was the first project to sequence the genomes of a large number of people (at least 1,000), to provide a comprehensive resource on human genetic variation.
      • The UK10K project was launched by the Wellcome Trust in 2010 and aimed to analyse the DNA of one in every 6,000 individuals in the UK in order to uncover rare genetic variants important to human disease.
      • The Genomics England 100,000 Genomes Project, which was launched in late 2012, will focus on patients with rare diseases and their families and patients with cancer. By comparing many genomes and combining the findings with the patients’ medical information it is hoped that they will identify common genetic trends to help with making diagnoses. With better diagnoses doctors have a better chance of providing the most appropriate medication. 
  • Resequencing for comparison with the reference genome generally doesn’t involve any assembly because this has already been done for the reference genome. Instead alignment is used. This means that the sections of DNA or “reads” produced after sequencing are compared to the reference genome and placed alongside their most similar (ideally identical) counterpart.
  • Once all the sections are aligned, it is then possible to look for differences between the individual sequence and the reference sequence.
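
The two steps above, placing a read alongside its most similar stretch of the reference and then looking for differences, can be sketched in Python as follows. This is a deliberately naive illustration: it tries every position in a short, made-up reference and counts mismatches, with no allowance for inserted or deleted bases. The sequences and function names are hypothetical, and real alignment programmes use far more efficient methods.

```python
def best_alignment(reference, read):
    """Return (position, mismatches) for the placement with the fewest mismatches."""
    best_pos, best_mismatches = None, None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(window, read) if a != b)
        if best_mismatches is None or mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos, best_mismatches


def differences(reference, read, pos):
    """List (reference_position, reference_base, read_base) wherever the read differs."""
    return [
        (pos + i, reference[pos + i], base)
        for i, base in enumerate(read)
        if reference[pos + i] != base
    ]


reference = "ATGGCGTACCTGATTCAG"  # made-up reference sequence
read = "GTACGTGA"                 # made-up read with one base changed

pos, mismatches = best_alignment(reference, read)
print(pos, mismatches, differences(reference, read, pos))
```

Positions here are counted from zero. The single difference reported is the kind of base change that would then be examined as a possible variant.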

 

This page was last updated on 2016-01-25