De novo genome assembly pipeline

Recent advances in sequencing technologies allow highly accurate and cost-efficient de novo genome assemblies of almost any species. Over the past years, we have combined a set of complementary sequencing technologies with a highly efficient genome assembly pipeline to assemble genomes de novo. We generate long, highly accurate PacBio HiFi reads, which are used for the first genome assembly step. Depending on the continuity of these assemblies, we can improve them by adding optical mapping data (Bionano Genomics), genomic linked-read data (e.g. TELL-Seq™), and Oxford Nanopore ultra-long sequence data. To achieve chromosome-level resolution, we integrate long-range chromosomal-contact reads (Hi-C) into the assembly process. We have applied this genome assembly pipeline successfully to a wide variety of species covering different taxa, such as flatworms, insects, fishes, reptiles, birds, and mammals.
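
Assembly continuity is commonly summarized by the N50 value: the contig length at which contigs of that length or longer contain at least half of all assembled bases. The following is a minimal sketch of that calculation, not part of the pipeline described here; the function name and the toy contig lengths are illustrative assumptions.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly size.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)  # longest contigs first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example (made-up lengths in kb, total 300 kb):
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))  # 80 + 70 = 150 reaches half of 300, so N50 is 70
```

A higher N50 after adding optical maps, linked reads, or Hi-C data indicates that contigs have been joined into longer scaffolds, although N50 alone says nothing about the correctness of those joins.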

Long read technologies

Long-read single molecule DNA sequencing

In contrast to Sanger and Illumina sequencing, long-read single-molecule methods sequence DNA molecules directly, without any amplification step. This allows long sequence reads (average read length circa 30 kb, maximum read length > 100 kb) of individual DNA fragments. Since each DNA fragment is read only once, these methods come at the expense of accuracy, with an error rate of about 10-12%.

We apply two different single molecule sequencing technologies:

Pacific Biosciences single molecule DNA sequencing

PacBio single molecule real time (SMRT) DNA sequencing follows, in real time, the incorporation of fluorescently labeled nucleotides into a growing primer strand on a circular template. These incorporation events are imaged in small cavities of the SMRT cell and translated into a nucleotide sequence.

Oxford Nanopore single molecule sequencing

Oxford Nanopore sequencing of single native DNA (and RNA) molecules relies on the movement of a single DNA strand through a membrane channel (nanopore) in a microchip. This movement is mediated by a motor protein that links the DNA to the channel. Since each of the different nucleotides leads to a characteristic change in the electric current through the nanopore channel, base calling can be performed in real time while the DNA molecule is passing through the pore.

Both technologies are highly suitable for sequencing long and ultra-long genomic DNA, RNA isoforms, long-range PCR fragments, and targeted long genomic regions. PacBio HiFi sequencing is a special flavor of PacBio sequencing that combines sequencing of medium-sized DNA fragments (about 12-15 kb on average) with high accuracy (mean accuracy 99.9%, i.e. QV30 or higher) and is highly suited to resolving repetitive and low-complexity regions of genomes.
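
The accuracy figures quoted above can be related to quality values (QV) through the standard Phred scale, QV = -10 * log10(error rate). The sketch below only illustrates this conversion; the function names are illustrative and not taken from any vendor software.

```python
import math


def accuracy_to_qv(accuracy):
    """Convert a per-base accuracy (0 <= accuracy < 1) to a Phred-scaled QV."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10.0 * math.log10(1.0 - accuracy)


def qv_to_accuracy(qv):
    """Convert a Phred-scaled QV back to a per-base accuracy."""
    return 1.0 - 10.0 ** (-qv / 10.0)


# 99.9% accuracy (HiFi reads) corresponds to QV30.
print(round(accuracy_to_qv(0.999), 1))  # 30.0
# A 12% error rate (single-pass long reads) corresponds to roughly QV9.
print(round(accuracy_to_qv(0.88), 1))   # 9.2
```

On this scale, every increase of 10 QV units means a tenfold reduction in the error rate, which is why the step from single-pass reads (about QV9-10) to HiFi reads (QV30 or more) matters so much for resolving repeats.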

Additional genome biology technologies

Bionano optical mapping

Optical mapping is used to improve the continuity of scaffolds in the de novo genome assembly pipeline. Ultra-long gDNA fragments are labeled at specific sequence motifs across the whole genome and visualized as linearized molecules in a microchip. The resulting images are converted into digitized molecules (label patterns), which are in turn converted into genome maps. These maps can be aligned to reference genomes in order to detect structural variations such as deletions, insertions, inversions, and translocations.
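
To compare measured label patterns with a sequence, the expected label positions can be predicted in silico by searching the sequence for the labeling motif. The following is a minimal sketch of that idea under stated assumptions: the motif `CTTAAG` and the toy sequence are placeholders chosen for illustration, not necessarily the labeling chemistry used in this pipeline, and the function names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def label_positions(sequence, motif):
    """Return sorted 0-based start positions of the motif on either strand."""
    sequence = sequence.upper()
    motif = motif.upper()
    # A set avoids searching twice when the motif is palindromic.
    targets = {motif, reverse_complement(motif)}
    positions = set()
    for target in targets:
        start = sequence.find(target)
        while start != -1:
            positions.add(start)
            start = sequence.find(target, start + 1)
    return sorted(positions)


def label_spacings(positions):
    """Return distances between consecutive labels (the map 'fingerprint')."""
    return [b - a for a, b in zip(positions, positions[1:])]


# Toy sequence with two motif sites 10 bases apart (placeholder data).
toy_sequence = "AA" + "CTTAAG" + "GGGG" + "CTTAAG" + "TT"
sites = label_positions(toy_sequence, "CTTAAG")
print(sites)                  # [2, 12]
print(label_spacings(sites))  # [10]
```

Real optical maps work at a resolution of kilobases rather than single bases, so production tools additionally merge labels that lie too close together to be resolved optically; that step is omitted here.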

Hi-C chromatin conformation capture

Hi-C is used to analyse the spatial organization of chromatin in the nucleus of a cell. Neighboring chromatin regions are chemically cross-linked, converted into a short-read sequencing library in a multi-step protocol, and ultimately sequenced on Illumina short-read devices. Depending on experimental and sequencing conditions, intra-chromosomal interactions over different distances can be visualized. Long-range interactions help to scaffold fragmented genome assemblies to chromosome level, whereas short-range interactions can help to elucidate promoter-enhancer interactions, topologically associating domains (TADs), or chromatin loops.
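
The scaffolding principle can be illustrated with a toy calculation: contigs that originate from the same chromosome share more Hi-C read pairs than contigs from different chromosomes. The sketch below counts read pairs that link different contigs and picks the most strongly linked partner. It is a deliberately simplified illustration, not the scaffolding method used in this pipeline; the function names and contig names are hypothetical, and real scaffolders also normalize for contig length and infer order and orientation.

```python
from collections import Counter


def contig_link_counts(read_pairs):
    """Count Hi-C read pairs whose two mates map to different contigs.

    read_pairs is an iterable of (contig_of_mate1, contig_of_mate2) tuples.
    Keys of the result are alphabetically sorted contig pairs.
    """
    counts = Counter()
    for contig_a, contig_b in read_pairs:
        if contig_a != contig_b:  # pairs within one contig carry no link
            counts[tuple(sorted((contig_a, contig_b)))] += 1
    return counts


def best_partner(counts, contig):
    """Return the contig sharing the most read pairs with `contig`, or None."""
    candidates = {}
    for (a, b), n in counts.items():
        if a == contig:
            candidates[b] = n
        elif b == contig:
            candidates[a] = n
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Toy mapping results (made-up data): ctg1 and ctg2 share 8 pairs,
# ctg1 and ctg3 share 2, and 10 pairs fall entirely within ctg1.
pairs = (
    [("ctg1", "ctg2")] * 5
    + [("ctg2", "ctg1")] * 3
    + [("ctg1", "ctg3")] * 2
    + [("ctg1", "ctg1")] * 10
)
links = contig_link_counts(pairs)
print(links[("ctg1", "ctg2")])      # 8
print(best_partner(links, "ctg1"))  # ctg2
```

In this toy data set, ctg2 would be the first candidate to join with ctg1 in a scaffold.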