3.08.2007

Quality Standards for Genome Sequence

I have been involved in sequencing and analyzing microbial genomes for over 5 years, and I have been very fortunate to author and co-author a series of papers describing the genome sequence analysis of several organisms. One thing has always surprised me during the submission and review process: none of the journals or the editors handling the papers has ever asked to see or check the sequence itself or, more importantly, the quality of the assembly. For example, I have never been asked to provide the sequence read coverage over the entire genome, the quality of each base pair in the genome sequence, the criteria used to define the quality of the sequence, or even the genome sequence itself!

The reality is that there are no universal quality standards for genome sequences.

Due to ever-decreasing costs, microbial whole genome sequencing is now performed in many different sequencing centers and by individual groups. This expansion is a good thing: genomics, just like molecular biology not long ago, is becoming a tool for scientists to address fundamental questions. However, I believe that this expansion should not come at the expense of quality. While sequence data from genome projects are made available to all scientists through GenBank, it is no longer sufficient for just the final consensus sequence to be made available on project publication. The assembly and the quality scores that underlie each base call within the consensus sequence must also be made available. This information is critical to evaluate the overall quality of the work.
The NCBI Assembly Archive is a resource where sequence data, quality scores, sequence chromatograms (even pyrosequencing flows) and assembly data can be uploaded to a publicly accessible central repository. However, this or any other such initiative can only be successful if the scientific community populates the repositories with both finished and draft genome sequence data in compliance with an accepted community standard. To date, only a few sequencing centers have deposited data into the Assembly Archive (TIGR, JCVI, IMBGA and CRA).

The broader microbial genomics community, together with scientific funding and publishing bodies, needs to meet and develop new data release standards for microbial genomes. This is even more important as new sequencing technologies are driving costs down, making it affordable for many to sequence genomes. These standards need to account for these new technologies.

For any particular genome, these standards could embrace the timely release of trace and contig data and their associated quality scores, as well as the consensus sequence, using the timeline agreed to by the sequencing center and the funding agency for that genome project. The standards could also define the minimum sequence quality and coverage required for the release or publication of complete and draft microbial genome sequence data. There is an enormous amount of sequence data pending, planned and anticipated. Complete and open access to the underlying quality information, as well as the consensus sequence, will be needed to best capitalize on this forthcoming deluge.
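
To make the idea of a minimum quality and coverage requirement concrete, here is a minimal sketch of the kind of summary a journal or repository could ask for. It assumes the per-base quality scores of the consensus are Phred-scaled (Q = -10 log10 of the error probability) and that the read depth at each consensus position is known. The function names and the thresholds (`min_q=40`, `min_depth=8`) are hypothetical placeholders for illustration, not values from any agreed standard.

```python
def phred_to_error_prob(q):
    """Convert a Phred-scaled quality score to an error probability."""
    return 10 ** (-q / 10)


def summarize_quality(base_quals, read_depths, min_q=40, min_depth=8):
    """Summarize per-base quality and coverage of a consensus sequence.

    base_quals  -- Phred-scaled quality score for each consensus base
    read_depths -- number of reads covering each consensus base
    min_q, min_depth -- hypothetical thresholds, NOT from any real standard
    """
    n = len(base_quals)
    if n == 0 or n != len(read_depths):
        raise ValueError("inputs must be non-empty and of equal length")

    low_quality = sum(1 for q in base_quals if q < min_q)
    low_depth = sum(1 for d in read_depths if d < min_depth)

    return {
        "length": n,
        "mean_depth": sum(read_depths) / n,
        "fraction_low_quality": low_quality / n,
        "fraction_low_depth": low_depth / n,
        # Expected number of wrong bases, summed over all positions.
        "expected_errors": sum(phred_to_error_prob(q) for q in base_quals),
    }


if __name__ == "__main__":
    # Toy four-base "consensus" purely for illustration.
    report = summarize_quality([40, 40, 20, 30], [10, 10, 4, 8])
    for key, value in report.items():
        print(f"{key}: {value}")
```

A real standard would have to decide which of these numbers to report and what thresholds to enforce; the point of the sketch is only that such a report is cheap to produce once the assembly and quality scores are available.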

3 comments:

Jonathan Eisen said...

I agree such standards are needed. Any idea how to go about getting this done? A workshop? A white paper?

Jacques Ravel said...

I think that a white paper would be a great start to make people aware of the problem. People's reaction to it will certainly dictate the next step. Do you think PLoS Biology would be a good place for it? Would they support any standards for genome sequence?

Jonathan Eisen said...

PLOS Bio I am sure would support such standards. But they have a 2 page limit for "community" pages so you would have to be concise.