Affymetrix Microbiology Symposia Series

A little bit of self-publicity! I will be giving a webinar on Thursday March 29th, 2007 on the work we have done with Affymetrix tiling arrays. This webinar series is sponsored by Affymetrix and highlights creative uses of their technology.
We have been working with Affymetrix tiling arrays for about 3 years now, with great success. We have designed two tiling arrays for Bacillus anthracis, the causative agent of anthrax. One of the major problems in working with such custom-designed arrays is that Affymetrix does not provide any software support for data display and analysis. I'll be describing the array design and a series of algorithms for data analysis, in particular TadPol (Tiling Array Discovery of Polymorphism). TadPol identifies polymorphic regions (SNPs and INDELs) from comparative genome hybridization data, and these polymorphisms are then used for genotyping.

If you are interested, here is the web link to register for the webinar. It will be my first webinar, so I'm not sure how it will turn out, but it should be an interesting experience!

Here is the announcement:

SNPs, Chips and Transcriptomics in Bacillus anthracis
Thursday, March 29, 2007, 9:00 am (PDT)
Participate in a conference call seminar and Q & A featuring Jacques Ravel, Ph.D. from The Institute for Genomic Research

Dr. Ravel will discuss his research using a custom-designed Affymetrix GeneChip® array that tiles the entire 5.5 Mb genome of Bacillus anthracis Ames Ancestor with 2.1 million overlapping 25-mer oligonucleotides on a single array. Examples of genotyping, polymorphism discovery (SNPs and INDELs), transcript mapping and expression studies will also be discussed. REGISTER FOR SYMPOSIUM
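To give a feel for what the numbers in the announcement imply, here is a minimal Python sketch of how densely 2.1 million 25-mers tile a 5.5 Mb genome. It is only an illustration, not the actual array design: the function names, the fixed step and the toy sequence are made up for the example, and the announcement does not say whether the probes cover one strand or both, so both cases are computed.

```python
def tile_probes(genome, probe_len=25, step=3):
    """Yield (start, probe) pairs, one probe every `step` bases (hypothetical helper)."""
    for start in range(0, len(genome) - probe_len + 1, step):
        yield start, genome[start:start + probe_len]


def average_step(genome_len, n_probes, strands=1):
    """Average distance between consecutive probe starts on each tiled strand."""
    return genome_len * strands / n_probes


# Figures from the announcement: 5.5 Mb genome, 2.1 million 25-mer probes.
# Whether one or both strands are tiled is an assumption, so compute both.
step_one_strand = average_step(5_500_000, 2_100_000, strands=1)
step_two_strands = average_step(5_500_000, 2_100_000, strands=2)

# Toy 100-base sequence, tiled every 3 bases (close to the one-strand average).
toy_genome = "ACGT" * 25
probes = list(tile_probes(toy_genome, probe_len=25, step=3))

print(f"average step, one strand:  {step_one_strand:.2f} bases")
print(f"average step, two strands: {step_two_strands:.2f} bases")
print(f"toy example: {len(probes)} probes of length {len(probes[0][1])}")
```

In either case the average step is far shorter than the 25-base probe length, so neighbouring probes overlap heavily and every base is interrogated by several probes.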


The Shortest Genome Sequence Paper!

The JGI Los Alamos National Lab has just published one of the shortest (if not the shortest) genome sequence papers. The paper is published online ahead of print in the Journal of Bacteriology and reports on the genome sequence of Bacillus thuringiensis Al Hakam, a strain isolated in Iraq by the United Nations Special Commission. The author list is about 1 and 1/2 pages and the text is not much longer. This paper was published with the sole intention of announcing the release of the sequence in GenBank. There is no science besides the fact that the cry, cyt and vip genes (toxin genes) were not found. Did you need the genome to get to that conclusion?
Is this where the future of genome sequence papers is heading? Announcements of GenBank releases? I'm not sure I like this.

We are now sequencing several strains of the same species, and I think this can become somewhat of a problem because some of the sequences turn out not to be very informative. But are we selecting the strains to sequence rationally? A lot of work goes into sequencing, assembling, closing, annotating and analyzing a genome. If nothing novel or interesting comes out of this, that is a lot of time and effort spent without returns, in terms of papers, for the people involved. I think this work should still be published. I had suggested a while back that we convince a journal (PLoS?) to have one issue (or part of one), or one section in each issue, dedicated to genome sequences, just like NAR does for databases. I think these types of papers would fit better in that context. A standalone paper in the middle of a new issue of the Journal of Bacteriology might not be the best use of journal pages. What about a community page, just like in PLoS Biology? Wouldn't that be a better format? I think it is important to recognize the work that has gone into sequencing and analyzing a genome, but what is the best format? Genome sequences are still very important resources that the scientific community uses. Are we getting to the point where a genome sequence is simply released into GenBank and does not warrant a publication unless its analysis advances scientific knowledge? That is just what happened to gene sequences not that long ago!


Hawkeye: Finally a tool for assembly validation!

This is related to my first posting. Mike Schatz, a very talented graduate student in Computer Science at the University of Maryland College Park Center for Bioinformatics and Computational Biology (CBCB) working with Steven Salzberg, has published today a really good paper entitled "Hawkeye: a visual analytics tool for genome assemblies" in Genome Biology. Finally, a tool that allows any scientist to evaluate the quality of a genome assembly. The paper describes the capabilities of the software, including assembly visualization (including traces), assembly analysis (assembly statistics), a contig viewer and a scaffold viewer, among others. The software is amazingly fast, doesn't slow down when working with large genomes, and allows for the diagnosis and validation of assembly problems. Mike even describes how Hawkeye can be used for biological investigation. Using Hawkeye, Mike was able to identify the assembly of the 6 plasmids harbored by Bacillus megaterium, one of my projects.

I think this software should become a standard that reviewers, editors and others could use to validate genome assemblies prior to publication. Now all we need is for the genome sequencing community to agree on data release standards: more than the consensus sequence will be needed!

This paper is published as "Open Access" and Hawkeye is freely available as part of the AMOS package on SourceForge. Steven Salzberg's group published another paper in the same issue of Genome Biology on "computational discovery of Rho-independent transcription terminators", as well as a very interesting editorial on "Genome re-annotation: a wiki solution?", unfortunately not as Open Access.

One more thing, Hawkeye works on a Mac. A huge plus to me!!

Microbial Genome News on Technorati

I have now registered the Microbial Genome News on Technorati.

Technorati Profile

You can view my profile on Technorati by clicking on the link above.


Quality Standards for Genome Sequence

I have been involved in sequencing and analyzing microbial genomes for over 5 years. I have been very fortunate to author and co-author a series of papers describing the genome sequence analysis of several organisms. During the publication submission/review process, one thing has always surprised me: none of the journals or editors handling the papers has ever asked to see or check the sequence itself or, more importantly, the quality of the assembly. For example, I have never been asked to provide information about the sequence read coverage over the entire genome, the quality of each base pair in the genome sequence, the criteria used to define the quality of the sequence, or even the genome sequence itself!

The reality is that there are no universal quality standards for genome sequences.

Due to an ever-decreasing cost, microbial whole genome sequencing is now performed in many different sequencing centers and individual groups. This expansion is a good thing, as genomics, just like molecular biology not long ago, is becoming a tool for scientists to address fundamental questions. However, I believe that this expansion should not come at the expense of quality. While sequence data from genome projects are made available to all scientists through GenBank, it is no longer sufficient for just the final consensus sequence to be made available on project publication. The assembly and the quality scores that underlie each base call within the consensus sequence must also be made available. This information is critical to evaluate the overall quality of the work.
The NCBI Assembly Archive is a resource where sequence data, quality scores, sequence chromatograms (even pyrosequencing flows) and assembly data can be uploaded to a publicly accessible central repository. However, this or any other such initiative can only be successful if the scientific community populates the repositories with both finished and draft genome sequence data in compliance with an accepted community standard. To date, only a few sequencing centers have deposited data into the Assembly Archive (TIGR, JCVI, IMBGA and CRA).

The broader microbial genomics community, together with scientific funding and publishing bodies, needs to meet and develop new data release standards for microbial genomes. This is even more important as new sequencing technologies are driving costs down, making it affordable for many to sequence genomes. These standards need to account for these new technologies.

For any particular genome, these standards could embrace the timely release of trace, contig and associated quality scores as well as the consensus sequence, using the timeline agreed to by the sequencing center and the funding agency for that genome project. The standard could also define the minimum sequence quality and coverage required for the release or publication of complete and draft microbial genome sequence data. There is an enormous amount of sequence data pending, planned and anticipated. Complete and open access to the underlying quality information, as well as the consensus sequence, will be needed to best capitalize on this forthcoming deluge.
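As a rough illustration of what a minimum quality and coverage requirement could look like in practice, here is a small Python sketch. It assumes that each consensus base comes with a Phred-style quality score (Q = -10 log10 of the error probability) and that read placements are known as start/end coordinates. The thresholds, function names and toy data are hypothetical; no community standard has set such values yet.

```python
def error_probability(phred):
    """Convert a Phred quality score into the probability that the base call is wrong."""
    return 10 ** (-phred / 10)


def depth_per_base(genome_len, reads):
    """Count the reads covering each position; reads are (start, end) half-open intervals."""
    depth = [0] * genome_len
    for start, end in reads:
        for pos in range(max(start, 0), min(end, genome_len)):
            depth[pos] += 1
    return depth


def flag_weak_positions(depth, quality, min_depth=2, min_quality=30):
    """Return positions failing either threshold (both thresholds are hypothetical)."""
    return [
        pos
        for pos, (d, q) in enumerate(zip(depth, quality))
        if d < min_depth or q < min_quality
    ]


# Toy data: a 10-base "genome", three reads, and one low-quality consensus base.
reads = [(0, 6), (2, 8), (4, 10)]
quality = [40] * 10
quality[5] = 20

depth = depth_per_base(10, reads)
mean_depth = sum(depth) / len(depth)
weak = flag_weak_positions(depth, quality)

print("depth per base:", depth)
print("mean coverage:", mean_depth)
print("positions below threshold:", weak)
print("error probability at Q30:", error_probability(30))
```

A check of this kind is only possible when the reads, their placement in the assembly and the quality scores are released along with the consensus sequence, which is exactly the information a data release standard would need to require.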