Old 11-08-2012, 01:32 PM   #5

But what is puzzling me is the resulting SAM with too many reads, many of them without an ID, rather than the low alignment percentage (at least for now).
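For what it's worth, here is a quick way to quantify that kind of count mismatch with nothing but standard Unix tools. This is a sketch on toy stand-in files (real data would use your actual FASTQ and a SAM converted from the BAM, e.g. with samtools view):

```shell
# Toy stand-ins for the real files: 2 FASTQ reads; a SAM where
# read r1 is reported twice and r2 once.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n' > reads.fastq
printf '@HD\tVN:1.4\nr1\t0\tchr1\t1\t50\t4M\t*\t0\t0\tACGT\tIIII\nr1\t256\tchr2\t9\t0\t4M\t*\t0\t0\tACGT\tIIII\nr2\t0\tchr1\t5\t50\t4M\t*\t0\t0\tACGT\tIIII\n' > accepted_hits.sam

# One FASTQ record is 4 lines
fastq_reads=$(( $(wc -l < reads.fastq) / 4 ))
# Alignment lines vs. distinct read IDs in the SAM (headers start with '@')
sam_lines=$(grep -vc '^@' accepted_hits.sam)
sam_ids=$(grep -v '^@' accepted_hits.sam | cut -f1 | sort -u | wc -l | tr -d ' ')
# With the toy files above this prints "2 3 2": more alignment lines
# than input reads, but the same number of distinct IDs.
echo "$fastq_reads $sam_lines $sam_ids"
```

If the number of distinct IDs also exceeds the FASTQ read count (or IDs are missing entirely), something is genuinely wrong with the output, not just multi-mapping.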
So, I have seen TopHat report more reads in accepted_hits.bam than there are in the original FASTQ, but that was because it reported more than one alignment for some reads. I parsed those out using an awk/uniq pipe on the SAM file. I have not seen your situation, with no read IDs at all.
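Roughly the sort of pipe I mean, sketched on a toy SAM (not necessarily the exact command; it passes header lines through and keeps only the first alignment seen for each read ID, i.e. SAM column 1):

```shell
# Toy SAM: one header line, read r1 reported twice, read r2 once.
printf '@HD\tVN:1.4\nr1\t0\tchr1\t100\t50\t4M\t*\t0\t0\tACGT\tIIII\nr1\t256\tchr2\t200\t0\t4M\t*\t0\t0\tACGT\tIIII\nr2\t0\tchr1\t300\t50\t4M\t*\t0\t0\tACGT\tIIII\n' > accepted_hits.sam

# Print headers as-is; for alignment lines, keep only the first
# occurrence of each read ID using an awk associative array.
awk '/^@/ { print; next } !seen[$1]++' accepted_hits.sam > dedup.sam
```

After this, dedup.sam contains the header plus one line each for r1 and r2. Note that "first alignment seen" is an arbitrary choice; if you care about which alignment survives (e.g. the primary one, FLAG without 0x100), filter on the FLAG field instead.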

This is true, in a way, but I don't agree completely: these are high-throughput techniques, so I will never be able to check whether all the results make sense. The overall idea is to get sensible data, so if two different versions of a piece of software give me different outputs (one of which seems to depart from the standard format it should follow), I worry about bugs rather than biological soundness. The latter is an issue I hope to face later.
Before I started doing RNA-Seq data analysis, I thought that if a software tool was so widely used, it must be reliable, reproducible, and sound. After trying or using most of the available software, I have come to the conclusion that the field is a "Wild West" at the moment. There are rewards for publishing a tool that works (for the authors, to some extent, and in many cases the analyses rest on "a custom perl script" - it is not funny how often that phrase appears in a methods section, with no details on the script itself or on what algorithm it actually implements). The paper will get cited, mostly in reviews of available methods. But there is no incentive for most groups to maintain the software, fix bugs, or write decent documentation, since that does not get you more papers => more grants => promotions... Basically, I think it's shoddy science, but there is nothing I can do about it, except spend boatloads of time benchmarking, reading this forum and help lists about the bugs people encounter, and running dummy datasets to get an idea of how a particular tool handles a particular scenario.

And of course there's the brilliant scenario of Cufflinks, which has been updated to include a plethora of new methods, yet apart from the "How Cufflinks Works" page there is no peer-reviewed update describing what has been incorporated and whether, taken together, it is statistically sound.

/end rant/

Coming back to your problem (and I know this is not exactly a solution): have you tried using 2.0.5/6 on your reads? I've had much better results (I'm happier with how the mapping "looks") with the .5 version, since it does seem to improve mapping accuracy by using both the genome and the transcriptome as references. Perhaps there was a bug in 2.0.0 that has been fixed in the later versions?
dvanic