Old 11-07-2012, 11:49 PM   #4
Location: Turin, Italy

Join Date: Oct 2010
Posts: 66

Originally Posted by dvanic View Post
How many reads is tophat telling you it is filtering in
I had to delete the tophat 2.0.0 results to avoid filling our HD with data while doing "trial and error". Tophat 2.0.4 reports:
prep_reads v2.0.4 (3480M)
204037 out of 18850030 reads have been filtered out
So in this case it is OK.
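For the record, the fraction filtered can be computed directly from those two numbers. A quick Python check, using the figures from the prep_reads log above:

```python
# Fraction of reads filtered out by prep_reads (numbers from the log above).
filtered = 204037
total = 18850030
fraction = filtered / total
print(f"{100 * fraction:.2f}% of reads filtered out")  # prints "1.08% of reads filtered out"
```

About 1% filtered out is indeed unremarkable.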

Originally Posted by dvanic View Post
How is the overall quality of your reads, especially the ends? Is your sequencing machine calling bases irrespective of what the quality of that base is, or does it start calling an N at low quality?

Tophat, unlike BWA, does not clip reads to remove low-quality ends. If your ends are <=Q20 and your sequencer force-called the bases, you may have nucleotides at the end that are making your reads unalignable, because those ends prevent Tophat from positioning the read where it belongs - you're getting too many mismatches.
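For what it's worth, the kind of 3'-end quality clipping being described can be sketched in a few lines. This is a hypothetical standalone helper assuming Phred+33 (Sanger/Illumina 1.8+) quality encoding, not TopHat's or BWA's actual implementation:

```python
def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Trim bases from the 3' end while their Phred quality is below min_q.

    Assumes Phred+33 encoding; older Illumina data (Phred+64) would need
    offset=64. This is an illustrative sketch, not BWA's exact algorithm.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]

# Example: 'I' is Q40, '#' is Q2 and '$' is Q3, so the last two bases go.
seq, qual = trim_low_quality_tail("ACGTACGT", "IIIIII#$")
print(seq, qual)  # prints "ACGTAC IIIIII"
```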
OK, I see, thanks. These are public data, and the original article aligned them without any trimming; I will check the quality of the ends anyway. But what is puzzling me is the resulting SAM with too many reads, many of them without an ID - not the low alignment percentage (at least for now).
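One quick way to quantify those empty-ID records would be to scan the SAM output and count alignment lines whose QNAME field is empty or "*". A minimal sketch (the `count_missing_qnames` helper and the sample records below are hypothetical; a real pipeline would more likely go through samtools or pysam):

```python
def count_missing_qnames(sam_lines):
    """Count alignment records whose QNAME (first tab-separated field)
    is empty or '*'. Header lines starting with '@' are skipped."""
    missing = 0
    total = 0
    for line in sam_lines:
        if line.startswith("@") or not line.strip():
            continue
        total += 1
        qname = line.split("\t", 1)[0]
        if qname in ("", "*"):
            missing += 1
    return missing, total

# Two made-up records: one normal, one with an empty QNAME.
sam = [
    "@HD\tVN:1.0",
    "read1\t0\tchr1\t100\t50\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "\t0\tchr1\t200\t50\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
]
missing, total = count_missing_qnames(sam)
print(missing, total)  # prints "1 2"
```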

Originally Posted by dvanic View Post

I think you're wrong
Wrong in trying to understand whether the software I'm using is bug-free? I'm not happy about trying a newer, less widely used piece of software, yes... I just want more unit tests, that's it.
Yup, I only understood just now that your two sentences were connected... I will definitely test STAR if I choose to use it, maybe more thoroughly than tophat.
Originally Posted by dvanic View Post

Everyone (especially the biologists around me (and I used to be one)) thinks that NGS data analysis is easy, a technique, a service that can be provided, and not something that involves a boatload of time, benchmarking and intellectual effort no less complicated than designing some "pretty" wet lab experiments. So, yes, if you're going to do an analysis you need to know what your tools are doing. The sad state of the field is format incompatibility, weird mappings, and different software giving different outputs; you need to look at the data and the biology of your system to figure out what makes sense and what is probably an artefact.
This is true, in a way, but I don't agree completely: these are high-throughput techniques, so I will never be able to check whether all the results make sense. The overall idea is to get sensible data, so when two versions of the same software give me different outputs (one of which seems to deviate from the standard format it should follow), I'm worried about bugs, not biological soundness. That is an issue I hope to face later.

Thank you, I will let you know about the quality of the ends... I can also add that the counts obtained by running htseq-count on the accepted_hits.bam files from the two tophat versions had a high and significant Pearson correlation... but I would still like to understand what's the matter with the empty IDs.
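For reference, the Pearson correlation between two count vectors can be checked with a few lines of plain Python; the gene counts below are made-up placeholders, not the actual htseq-count output:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-gene counts from the two tophat versions:
counts_v200 = [10, 250, 33, 1200, 5]
counts_v204 = [12, 245, 30, 1180, 7]
r = pearson(counts_v200, counts_v204)
# r is very close to 1 for near-identical count vectors like these.
```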

Last edited by EGrassi; 11-08-2012 at 05:50 AM.