SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   Bioinformatics (http://seqanswers.com/forums/forumdisplay.php?f=18)
-   -   -r option - Does it make a difference to the alignment? (http://seqanswers.com/forums/showthread.php?t=20087)

vkartha 05-15-2012 05:36 PM

-r option - Does it make a difference to the alignment?
 
Hey guys,
I have paired end reads on RNASeq data and I wanted to test run Tophat2 to map these reads for a given sample using 5 different -r values(that is the mean inner distance between paired ends) just to see how it would affect the alignment.

I'm not exactly sure what I should look at once the alignment is over to figure out the so called 'alignment results'.

Tophat2 gives out a log folder along with the accepted_hits.bam, junction.bed files etc. in which I looked at two files namely:

bowtie.left_kept_reads.fixmap.log
bowtie.right_kept_reads.fixmap.log

These give summaries on the alignment from each end such as:

22630038 reads; of these:
22630038 (100.00%) were unpaired; of these:
2447457 (10.82%) aligned 0 times
15049769 (66.50%) aligned exactly 1 time
5132812 (22.68%) aligned >1 times
89.18% overall alignment rate

Now I notice there are also log files namely:

bowtie.right_kept_reads_seg1.fixmap.log
bowtie.right_kept_reads_seg2.fixmap.log
bowtie.right_kept_reads_seg3.fixmap.log
bowtie.right_kept_reads_seg4.fixmap.log

(And the same 4 for the left kept reads)

I don't understand what these files are or how they are different from the file I mentioned earlier (fixmap.log file)

I'm not sure if I'm looking at the correct log file, and if the fixmap.log file is indeed the right one to look at, I noticed that the alignment percentage is the SAME (i.e. 89%) for ALL -r options (I tried -50, -16, -33, 0, 50) using the same std deviation value. I just cannot understand why!!! - Does this mean that the -r option doesn't make a big difference in the alignment???

Hope to hear back from you guys soon. Thanks a lot.

rnaseek 05-16-2012 06:59 PM

Try using samtools flagstat command. flagstat will give the simple alignement stats using the bam file. With different "r" values you can see how they vary.

vkartha 05-16-2012 08:08 PM

I tried using samtools flagstat on one of the output bam files and this is what I got :


43603331 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
43603331 + 0 mapped (100.00%:nan%)
43603331 + 0 paired in sequencing
21831004 + 0 read1
21772327 + 0 read2
38524104 + 0 properly paired (88.35%:nan%)
42357862 + 0 with itself and mate mapped
1245469 + 0 singletons (2.86%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Why am I seeing the +0's with every result???

Also, I'm sorry I'm new to this - how exactly do I interpret this?
Thanks a lot.

swbarnes2 05-17-2012 08:54 AM

Quote:

Originally Posted by vkartha (Post 73693)
I tried using samtools flagstat on one of the output bam files and this is what I got :


43603331 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
43603331 + 0 mapped (100.00%:nan%)
43603331 + 0 paired in sequencing
21831004 + 0 read1
21772327 + 0 read2
38524104 + 0 properly paired (88.35%:nan%)
42357862 + 0 with itself and mate mapped
1245469 + 0 singletons (2.86%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Why am I seeing the +0's with every result???

Also, I'm sorry I'm new to this - how exactly do I interpret this?
Thanks a lot.

Just like it says...the +0 means that no reads in your .bam were flagged as QC-failures. Which makes sense, Bowtie wouldn't add that flag to the .bam it makes.

You have no unmapped reads, because bowtie by default doesn't include them, unlike, say, bwa, which does.

So basically, you want to compare the number of mapped reads, depending on the alignment settings.

vkartha 05-17-2012 08:59 AM

So in terms of studying the effect of different -r options on aligning a given sample, I should look at the 'properly paired' statistic?

It is weird because I estimated the mean and std dev inner distance using picard tools. For the two groups (say diseased cases vs control) I got an average inner distance of -16 and -33 respectively - which means there is overlap of the two reads being sequenced, and I know Tophat accepts a negative -r value for the same reason.

On running samtools flagstat, I get the highest % for the 'properly paired' stat as +50 at 92% and it is lower for the above two -r options I tried! Why is that??

Thanks for all your help guys. Appreciate it

rnaseek 05-17-2012 01:16 PM

I think the properly paired is a good metric to use and I would use the one with higher "Properly Paired" reads.

Depending on the library prep, rna fragments can be shorter and result in overlapping paired end reads. I am not sure trying negative r value is a good idea.

Anyway with positive "r" values you can vary different variances and choose the one that gives the most number of properly paired reads.

cedance 05-17-2012 01:30 PM

I usually estimate the inner distance by mapping reads using BWA and computing the mean inner distance and standard deviation. And for my samples, most of the time the reads overlap. The mean inner distance can be negative and I would say its important to choose a negative inner distance if your fragment size selection itself is low. If your chosen fragment size is 170 bases and your reads are 100 bases, of course your reads will have a lot of overlap and your mean inner distance is negative. So, I prefer to estimate it for each data every time. And it definitely is much better than choosing default parameters. The number of reads that map and the mapping efficiency is very reliable this way.

vkartha 05-23-2012 05:36 PM

Still unsure why I see +50 giving the best result
 
Quote:

Originally Posted by vkartha (Post 73752)
So in terms of studying the effect of different -r options on aligning a given sample, I should look at the 'properly paired' statistic?

It is weird because I estimated the mean and std dev inner distance using picard tools. For the two groups (say diseased cases vs control) I got an average inner distance of -16 and -33 respectively - which means there is overlap of the two reads being sequenced, and I know Tophat accepts a negative -r value for the same reason.

On running samtools flagstat, I get the highest % for the 'properly paired' stat as +50 at 92% and it is lower for the above two -r options I tried! Why is that??

Thanks for all your help guys. Appreciate it

Can someone please address the issue I brought up here? How can the -r option that gives the best alignment results be way off? from the means I had estimated from the samples? In fact that option that gives the best result was chosen as an extreme in the + direction just to see what effect it would have on the alignment had one taken a very large value.

Help?

arvid 05-24-2012 12:19 AM

Quote:

Originally Posted by vkartha (Post 74259)
Can someone please address the issue I brought up here? How can the -r option that gives the best alignment results be way off? from the means I had estimated from the samples? In fact that option that gives the best result was chosen as an extreme in the + direction just to see what effect it would have on the alignment had one taken a very large value.

Help?

Define "best". If you say the more flagged as properly aligned, the better, then setting a really high allowed insert size will give you more properly paired reads - this seems quite obvious. However, you might risk more falsely mapped pairs than you would want to.
In my way of thinking, it would be better to try to estimate the real mean insert size first (e.g. like in post #7), and then adapt the variance to allow for a sensible inclusion of deviating fragments. Then you aren't lying to the tools that include that parameter in their alignment/assembly/quantification algorithms and heuristics.

vkartha 05-24-2012 12:26 AM

So when I say best, I mean that I observed the highest % looking at the 'properly paired' flagstat output for +50.

I read that this 'properly paired' flag in the samtools flagstat output is a measure of how accurate your -r option was for that tophat run in some sense.

The -r option is mean insert size, so we have to provide Tophat with a rough estimate. It isn't really 'allowance' as you put it right? It isn't a threshold where everything less than that will get mapped. That is what I didn't get by your reply.

arvid 05-24-2012 12:39 AM

Yes, it is a mean, not a threshold, but to me it seems that the 'properly paired' flagging depends on fragments being smaller than the mean added by some multiple of the supplied mate-std-dev. I might be wrong, but it doesn't seem to be actually checking whether fragments are much *smaller* than expected, which would explain the behaviour in your case.

Still I believe it is less important to have a high % of 'properly paired' pairs, than using settings that are close to the actual fragment spectrum - this would lead to more cautious use of the outlier fragments for transcript assembly in the downstream analysis (if you are looking for previously unknown transcripts), or e.g. using quantification tools that model the insert size to assign weights for multi-mapping read pairs...

EGrassi 05-24-2012 12:58 AM

Just to report my experience.

without -r (that is with the default value of 50):
samtools flagstat S2_003.th/accepted_hits.bam
28257215 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
28257215 + 0 mapped (100.00%:-nan%)
28257215 + 0 paired in sequencing
14154028 + 0 read1
14103187 + 0 read2
23419378 + 0 properly paired (82.88%:-nan%)
26873012 + 0 with itself and mate mapped
1384203 + 0 singletons (4.90%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

with -r -20 (value gotten from analysis of an alignment done with bowtie on a subset of reads and from the wet lab infos):
samtools flagstat r20_S2_003.th/accepted_hits.bam
28386256 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
28386256 + 0 mapped (100.00%:-nan%)
28386256 + 0 paired in sequencing
14218550 + 0 read1
14167706 + 0 read2
17439266 + 0 properly paired (61.44%:-nan%)
27002056 + 0 with itself and mate mapped
1384200 + 0 singletons (4.88%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

It is true that in principle it seems better to give tophat the actual value but the difference in the properly paired percentages is relevant and with 100bp reads I doubt that all the properly paired reads in the first case that are missing in the second are wrongly aligned...

cedance 05-24-2012 01:04 AM

Do you also choose a value for --mate-std-dev?

arvid 05-24-2012 01:12 AM

Maybe you have skewed fragment size distributions (many I've seen are) - do you have wet lab info on that? In that case, I'd definitely choose a bigger --mate-std-dev, I think the default 20 is on the low side for typical distributions, anyway.

EGrassi 05-24-2012 01:17 AM

I got a SD of 37 on a subset of aligned reads so in order to reduce the number of tests I sticked with the default value (20). If I get some cpu time for other tests I will let you know the results...right now as long as I have an ETA of 3 weeks to complete the alignment on my data I will go without -r (as long as I have 1 week of work already done in this way...I know that this does not sound right but when I started to align I had read the other thread here about differences in results with negative values and I didn't have the lab data to confirm my extimated value and as long as someone stated that the -r value is less important in the recent releases of tophat I decided to start without an explicit -r [though I've not understood completely what "less important" means and I don't have the time to check in the source code...])

arvid 05-24-2012 01:23 AM

It depends on what you are doing downstream, as I said before. Are you looking for new transcripts, or using a quantification scheme where the insert size is used? I'd be especially careful if you are looking for new transcripts/splice variants, for other applications I find this setting to be less critical.

EGrassi 05-24-2012 01:29 AM

Quote:

Originally Posted by arvid (Post 74277)
Maybe you have skewed fragment size distributions (many I've seen are) - do you have wet lab info on that?

Nope, I've got only the fragment size calculated from the bioanalyzer trace (btw I checked and the picard analysis on the alignments gave me -20 as a result which was in accordance with a fragment size of 300 [adapters are 122] which was the first value given to me, but then I checked the precise numbers and they vary between 365 and 327, therefore I am not certain of which should be this "real" -r...).

EGrassi 05-24-2012 01:34 AM

Quote:

Originally Posted by arvid (Post 74281)
It depends on what you are doing downstream, as I said before. Are you looking for new transcripts, or using a quantification scheme where the insert size is used? I'd be especially careful if you are looking for new transcripts/splice variants, for other applications I find this setting to be less critical.

I see, thank you. We would like to start from known transcripts and compare this results with microarray data (and we are in a kind of a hurry to see some first numbers, the dimensions of this dataset as in some way surprised us - and our servers :) ), only afterwards we will look for new transcripts/splice variants...I hope to be able to compare results with different -r values or at least re-do the alignments with a sensible -r by that time.

cedance 05-24-2012 03:28 AM

If you are curious about trying to estimate the mean inner distance and its deviation, then you can take a look at the first reply here. I wrote this quite a while ago on biostars. It is what I have been doing to estimate the -r parameter and --mean-std-dev for quite a while. And with samtools flagstats I get over 75-90% of the reads as properly paired with ALL the data sets I have worked so far.

arvid 05-24-2012 03:35 AM

Quote:

Originally Posted by cedance (Post 74292)
If you are curious about trying to estimate the mean inner distance and its deviation, then you can take a look at the first reply here. I wrote this quite a while ago on biostars. It is what I have been doing to estimate the -r parameter and --mean-std-dev for quite a while. And with samtools flagstats I get over 75-90% of the reads as properly paired with ALL the data sets I have worked so far.

Just for clarification: do you run the alignment with BWA on the genome or on the transcriptome files? If it's on the genome, isn't it messy to deal with the inner distances for read pairs that span introns? Or does Picard do this for you?


All times are GMT -8. The time now is 11:25 AM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.