Scaffolding and completing genome assemblies in real-time with nanopore sequencing (#178)
Genome assemblies built from short-read sequencing data are often fragmented into many contigs because of the abundance of repetitive sequences. Long-read sequencing technologies can generate reads spanning most repeat sequences, providing the opportunity to complete these assemblies. However, substantial amounts of sequence data and computational resources are required to overcome the high per-base error rate inherent to these technologies. Furthermore, most existing methods assemble genomes only after sequencing has completed, which can result either in more sequence data being generated than is required, at greater cost, or in a low-quality assembly if insufficient data are generated. Here we present the first computational method utilising real-time nanopore sequencing to scaffold and complete short-read assemblies while the long-read sequence data are being generated. The method reports the progress of completing the assembly in real time, so users can terminate sequencing once an assembly of sufficient quality and completeness is obtained. We use our method to assemble four bacterial genomes and one eukaryotic genome, and show that it constructs more complete and more accurate assemblies while requiring less sequencing data and fewer computational resources than existing pipelines. We also demonstrate that the method can facilitate real-time analyses of positional information, such as identification of bacterial genes encoded in plasmids and pathogenicity islands. The method thus provides a time-effective and cost-effective solution for completing existing bacterial genome assemblies.
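The idea of reporting progress while reads arrive, and stopping once the assembly is complete enough, can be illustrated with a toy sketch. This is not the authors' algorithm: it assumes each long read has already been reduced to the ordered list of contigs it aligns to, treats any read spanning two contigs as a valid bridge, and ignores orientation, gap estimation, and alignment errors. The names `ScaffoldTracker`, `scaffold_stream`, and `target_scaffolds` are hypothetical and introduced only for illustration.

```python
from __future__ import annotations

from typing import Iterable, Iterator


class ScaffoldTracker:
    """Toy union-find over contigs (illustrative only, not the published method)."""

    def __init__(self, contigs: Iterable[str]) -> None:
        # Every contig starts as its own scaffold.
        self.parent = {c: c for c in contigs}
        self.n_scaffolds = len(self.parent)

    def find(self, contig: str) -> str:
        # Walk to the representative contig, shortening the path as we go.
        while self.parent[contig] != contig:
            self.parent[contig] = self.parent[self.parent[contig]]
            contig = self.parent[contig]
        return contig

    def add_read(self, aligned_contigs: list[str]) -> None:
        # Assumption: a read aligning to consecutive contigs bridges them.
        for a, b in zip(aligned_contigs, aligned_contigs[1:]):
            root_a, root_b = self.find(a), self.find(b)
            if root_a != root_b:
                self.parent[root_a] = root_b
                self.n_scaffolds -= 1


def scaffold_stream(
    contigs: Iterable[str],
    reads: Iterable[list[str]],
    target_scaffolds: int = 1,
) -> Iterator[tuple[int, int]]:
    """Yield (reads_seen, scaffolds_remaining) after each read.

    Stops consuming reads once the hypothetical completeness target is met,
    mirroring the idea of terminating a sequencing run early.
    """
    tracker = ScaffoldTracker(contigs)
    for reads_seen, read in enumerate(reads, start=1):
        tracker.add_read(read)
        yield reads_seen, tracker.n_scaffolds
        if tracker.n_scaffolds <= target_scaffolds:
            return


if __name__ == "__main__":
    contigs = ["c1", "c2", "c3", "c4"]
    reads = [["c1", "c2"], ["c1"], ["c3", "c4"], ["c2", "c3"], ["c1", "c4"]]
    for reads_seen, remaining in scaffold_stream(contigs, reads):
        print(f"after {reads_seen} reads: {remaining} scaffold(s)")
```

In this toy run the fifth read is never consumed, because the four contigs are already joined into a single scaffold after the fourth read. A real implementation would read alignments from the sequencer's output stream rather than from an in-memory list.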