I have a more general question. Many recent metagenomics studies work with read lengths that make assembly difficult or impossible. After some preprocessing, people therefore search the reads directly against protein databases (blastx, or translate and then blastp, HMMER etc.) or nucleotide databases, or against COGs, KEGG etc., to annotate them.

So far so good, but none of these papers ever answers my simple question. As I understand it, 454, Solexa, or really any sequencing method will produce many reads that map to the same gene of the same genome, just shifted slightly. That means that for one gene, say a DNA polymerase, I'd get 10 reads with similarly good BLAST scores or HMM hits. What I don't get is how these papers can say 'we got 10 hits in this and that COG category'. How can they be sure the hits are not reads of the same gene, from the same physical piece of DNA in the sample? (I'm not talking about very similar genes from closely related species here; I mean the exact same piece of DNA, so this part is more of a technical question.)

If my idea of multiple reads per gene is right and you functionally annotate from the reads directly (as outlined above), you can only talk about relative abundances of hits, not actual gene counts, no? Help would be much appreciated.
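To make the concern concrete, here is a minimal sketch of what I mean. It is only a toy model: reads are assumed to start uniformly at random along one linear genome that carries a single copy of the gene, and all names and numbers (`simulate_hits`, `expected_hits`, the genome and gene sizes) are made up for illustration, not taken from any paper or tool.

```python
import random


def simulate_hits(genome_len, gene_start, gene_len, read_len, n_reads, seed=0):
    """Count reads that overlap ONE gene copy in ONE genome (toy model)."""
    rng = random.Random(seed)
    gene_end = gene_start + gene_len  # exclusive end coordinate
    hits = 0
    for _ in range(n_reads):
        # Assumption: read start positions are uniform over the genome.
        start = rng.randrange(0, genome_len - read_len + 1)
        if start < gene_end and start + read_len > gene_start:
            hits += 1
    return hits


def expected_hits(genome_len, gene_len, read_len, n_reads):
    """Expected number of overlapping reads for a gene away from the genome ends."""
    overlapping_starts = gene_len + read_len - 1
    total_starts = genome_len - read_len + 1
    return n_reads * overlapping_starts / total_starts


if __name__ == "__main__":
    # Hypothetical numbers: 1 Mb genome, 3 kb gene, 250 bp reads, 40,000 reads.
    observed = simulate_hits(1_000_000, 500_000, 3_000, 250, 40_000)
    expected = expected_hits(1_000_000, 3_000, 250, 40_000)
    print(f"reads hitting the single gene copy: {observed} (expected ~{expected:.0f})")
```

Under these assumptions a single physical gene copy collects many hits, and the number grows with gene length and sequencing depth, which is why I suspect raw hit counts per category cannot be read as gene counts.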
Cheers.
Rob