November 30, 2000
Posts selected for this feature rarely stand alone. They are usually a part of an ongoing thread, and are out of context when presented here. The material should be read in that light.
The Celera Pitch
I'm going to take this discussion way OT and relay my impressions of the Celera (NYSE: CRA) sales pitch. Yesterday at about 11:30 I came upon a somewhat obscure flier entitled "Speed Matters: Genomics in 2000 -- The Celera Genomics Database: What it is and how to use it," presented by Dr. Kennan Kellaris, Scientific Applications Specialist at Celera. It was hosted by the Framingham Genetics Laboratory (a big cardiovascular genetics project affiliated with BU Med) and Heidi Meisenkothen (Celera NE account manager). Realizing I was already a half hour late for the presentation, I sacrificed a Northern so I could report back to y'all (and perhaps distract from the great who-the-heck-is-PhaseDeviant debate). :)
I snuck into the relatively underpopulated conference room (a big one here) to hear what I assumed was an academic subscription sales pitch. I missed out on the snazzy blue brochures they handed out, but think I took pretty good notes. Fortunately for me, I had only missed the description of the sequencing and assembly of the human and drosophila genomes. Ancient history, as far as I'm concerned, so I was glad to arrive at the beginning of the interesting stuff. A final note before getting into the meat: Dr. Kellaris did an excellent job. She really knew her stuff and put together a compelling argument that Celera has an outstanding product. Whether it makes sense for academic labs to pony up the cash they're charging is a whole other issue...
I'll basically relate my notes in the order I took them. Sorry if it sounds a bit like a rough sketch...
Computational annotation methods, in their current form, are inadequate. Common problems include identification of gene fragments (rather than whole genes), over-prediction of exons, and inappropriately merged genes. Basically, a combination of computation and human "curation" is necessary.
Published work will include only "basic" annotation. That is, the high-quality assembled sequence and a conservative gene annotation based on proprietary computational tools. My impression is they will essentially tune the computer tools to only recognize high-probability assignments (no hanging chads for these folks!).
The "enhanced" annotation efforts will remain proprietary. This will include in-house, expert human annotation (something like 50 scientists working on this stuff), results from the jamborees, data derived from the Panther functional protein classification software, and info derived using the mouse genome data. Obviously, they are also using all available public resources (gene indices, EST databases, etc.) to further enhance this process.
She took a few minutes to further explain the value of the Panther tool. Basically, it is able to functionally classify proteins into sub- and super-families using Hidden Markov Models (HMMs). Roughly, HMMs are statistical models that generate weighted scoring matrices to classify proteins by possible function that isn't apparent from simple DNA sequence similarity.
Benefits: high likelihood of functional assignment, increased sensitivity, improved classification, and it represents a comparative genomic tool at the protein level. The software was trained on SwissProt and a bunch of other protein databases.
She then went through a number of examples where Panther was able to classify proteins more correctly (often dramatically so) than existing genomic software tools. They spent quite a bit of time on this stuff, so I'm guessing this represents a major differentiating factor.
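To give a flavor of what family classification by weighted scoring matrices looks like, here is a minimal sketch in Python. It is not Panther (that's proprietary) and a real profile HMM also models insertions and deletions; this toy just builds a position-specific log-odds profile from a few aligned family members (made-up sequences) and shows that a family-like query outscores an unrelated one.

```python
import math

def build_profile(aligned_seqs, alphabet="ACDEFGHIKLMNPQRSTVWY", pseudocount=1.0):
    """Build a position-specific log-odds profile from an alignment of
    same-length family members (a simplified stand-in for a profile HMM)."""
    length = len(aligned_seqs[0])
    background = 1.0 / len(alphabet)  # uniform background frequency
    profile = []
    for pos in range(length):
        column = [seq[pos] for seq in aligned_seqs]
        scores = {}
        for aa in alphabet:
            # pseudocounts keep unseen residues from scoring -infinity
            freq = (column.count(aa) + pseudocount) / (len(column) + pseudocount * len(alphabet))
            scores[aa] = math.log(freq / background)  # log-odds vs. background
        profile.append(scores)
    return profile

def score(profile, query):
    """Sum per-position log-odds; higher means more family-like."""
    return sum(col[aa] for col, aa in zip(profile, query))

# Toy "family": three aligned 5-residue fragments (hypothetical, for illustration)
family = ["GKSTA", "GKSSA", "GKSTV"]
profile = build_profile(family)
member_like = score(profile, "GKSTA")  # matches the family consensus
unrelated = score(profile, "WWYFH")    # residues never seen in the family
```

The point is that the profile captures which residues are tolerated at each position, so classification works even when overall sequence identity is low.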
129SvJ, DBA/2J, and 646A/J are being sequenced at 1X coverage apiece. The mouse gene index will be generated from a combination of the TIGR index and proprietary sources (in-house, I imagine). It currently contains _complete_ mouse gene data, which they claim is another unique resource.
Additionally, the comparative genomic reader interface is ready (or almost ready). This will allow researchers and curators to overlay sequences, allowing enhanced annotation of genes and (with significant emphasis) regulatory sequences.
Truly a state-of-the-art ASP system. Everything is available through your browser and all computation and storage takes place at Celera. Real snazzy looking. Demonstration in a bit.
Everything there. All public and Celera proprietary databases and every tool under the sun. Took us through several pages listing every database I've ever heard of and many I haven't. Same thing for analysis and query tools.
Included will be the human stuff and drosophila stuff (genome, gene, etc. NOT SNP). Mouse fragments are currently available and overlay will be included when the mouse is complete.
SNP database is an extra charge (more on my secret knowledge of pricing later).
The SNP database currently contains over 2 million unique, high quality SNPs derived from 5 individuals. The database offering also includes all public SNP efforts as well and Celera has spent a lot of time going through this data and keeping only the good ones.
Celera SNP value:
Whole genome approach allows for even coverage and distribution, the ability to discern paralogous regions (not really sure why this is important), the ability to discern haplotypes, and no redundancy. So, after cleanup, Celera was able to pare down to 2.5 million or so good SNPs.
Comparison to public data: First, less than 11% of public SNPs can be accurately mapped to a unique genomic position. Why? Reads are too short so the SNP matches multiple regions when BLASTed. Second, after validating using something like 160 individuals (or maybe 160 SNPs on some other number of individuals), 95% of Celera SNPs proved real, while only 77% of public SNPs held up under closer scrutiny.
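The unique-mapping problem above is easy to illustrate. Here is a hedged toy sketch (made-up genome and flanking reads, not real data): if the flanking sequence around a SNP also occurs in a paralogous or repeated region, an exact search finds it at more than one genomic position and the SNP can't be placed.

```python
def count_occurrences(genome, flank):
    """Count (possibly overlapping) exact matches of a flanking read in the genome."""
    count, start = 0, 0
    while True:
        idx = genome.find(flank, start)
        if idx == -1:
            return count
        count += 1
        start = idx + 1

def uniquely_mappable(genome, flanks):
    """Keep only SNP flanks that match exactly one genomic position."""
    return [f for f in flanks if count_occurrences(genome, f) == 1]

# Toy genome containing a duplicated (paralogous) segment "ACGTACGT"
genome = "TTACGTACGTGGACGTACGTCCGATTACA"
flanks = ["ACGTACGT",  # falls in the repeat: maps twice, discard
          "GATTACA"]   # matches once: keep
kept = uniquely_mappable(genome, flanks)  # ["GATTACA"]
```

Longer flanking reads make a unique match more likely, which is presumably why Celera's whole-genome reads fare better than the short public ones.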
One drawback: no allelic frequency data. Only 5 people, remember...
Whole Genome SNP map
SNP assay kits (no real info on this. The expert in this area was unavailable).
Functional Polymorphisms (City of Hope type projects, I assume).
OK, time for some questions:
So how many genes are there? Sheepish answer, since the number is the lowest I've heard thus far: 25,000. Celera has fully annotated approx. 20,000 in-house by conservative methods. Obviously, this low number does not take into account alternate splicing, transcripts, whatnot.
When will the mouse be done? Q1 2001. Now that the computers are freer, they can use more processing time for mouse assembly. However, much of the data is there and, with overlay software and fragments, "completion" is somewhat academic.
A few other silly questions...
Outstanding. I'm not going to go into detail since you have to see it to appreciate the scale, but it is by far the best software tool I've seen (remember, however, I'm in the Dark Ages doing Westerns over here). Detail is unbelievable. For the SNP database, you can even access the original read traces to convince yourself the SNP is real. Wow.
The problem: A first-class product that I wouldn't have the first clue how to use in my day-to-day research. Those of us still living in a Molecular Biology dreamland have yet to really learn how to integrate large data set tools into our research. Sure, it could be nice every now and again for a sophisticated BLAST search, but average Joe researchers like me can get more than enough from NCBI.
My suggestion: Host workshops. Present at conferences (not the Genomics conferences, but basic biology conferences). Teach us not only how to physically manipulate the tool, but how to really use it. Until then, only the superstar labs will want the product.
Other problem: Cost. My rumor-laden numbers are 15K per year per investigator (lab), minimum of 10 labs, minimum of 3 years. They seem to be pretty soft on the single lab requirement, so I bet entire departments can get away with buying one subscription. But that's still about 450K for 3 years. The SNP database is 4K per year per investigator extra. Since it seems to hold the most value, I wouldn't subscribe without it. Add that in as well.
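For the curious, here's the arithmetic on those rumored numbers (again: rumor, not official pricing):

```python
# Rumored pricing, per my notes -- not official Celera figures.
base_per_lab_per_year = 15_000  # core database subscription
snp_per_lab_per_year = 4_000    # SNP database surcharge
min_labs, min_years = 10, 3     # rumored minimum commitment

base_total = base_per_lab_per_year * min_labs * min_years  # the ~450K above
with_snp = base_total + snp_per_lab_per_year * min_labs * min_years
```

So adding the SNP database to the minimum commitment brings the 3-year tab to 570K.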
For a lab using this data regularly, no problem. A centrifuge rotor can cost 15K. But the problem is labs (and, sadly, departments) like mine haven't the foggiest where to start. No one is going to cough up that kind of cash unless they know how to use it well.
So what's the solution? Either lower the academic price to get people using the product, or make a concerted effort to teach. Without one of these, we're stuck in that chicken-egg chasm.
Worn out from rambling, so that's it. Hope it wasn't too rough. Any comments?