SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Make my own blast DB detq182 Bioinformatics 9 04-18-2012 01:45 PM
Zygosity difference oyvindbusk Bioinformatics 3 01-31-2012 05:56 AM
Can't make Maq nightsun Bioinformatics 1 12-30-2011 08:24 PM
Make Clean and Make all not working qnc Bioinformatics 27 10-21-2011 10:17 AM
Can't make Maq RockChalkJayhawk Bioinformatics 4 08-24-2009 10:32 AM

Reply
 
Thread Tools
Old 05-15-2012, 05:36 PM   #1
vkartha
Member
 
Location: Boston

Join Date: Feb 2012
Posts: 28
Default -r option - Does it make a difference to the alignment?

Hey guys,
I have paired end reads on RNASeq data and I wanted to test run Tophat2 to map these reads for a given sample using 5 different -r values(that is the mean inner distance between paired ends) just to see how it would affect the alignment.

I'm not exactly sure what I should look at once the alignment is over to figure out the so called 'alignment results'.

Tophat2 gives out a log folder along with the accepted_hits.bam, junction.bed files etc. in which I looked at two files namely:

bowtie.left_kept_reads.fixmap.log
bowtie.right_kept_reads.fixmap.log

These give summaries on the alignment from each end such as:

22630038 reads; of these:
22630038 (100.00%) were unpaired; of these:
2447457 (10.82%) aligned 0 times
15049769 (66.50%) aligned exactly 1 time
5132812 (22.68%) aligned >1 times
89.18% overall alignment rate

Now I notice there are also log files namely:

bowtie.right_kept_reads_seg1.fixmap.log
bowtie.right_kept_reads_seg2.fixmap.log
bowtie.right_kept_reads_seg3.fixmap.log
bowtie.right_kept_reads_seg4.fixmap.log

(And the same 4 for the left kept reads)

I don't understand what these files are or how they are different from the file I mentioned earlier (fixmap.log file)

I'm not sure if I'm looking at the correct log file, and if the fixmap.log file is indeed the right one to look at, I noticed that the alignment percentage is the SAME (i.e. 89%) for ALL -r options (I tried -50, -16, -33, 0, 50) using the same std deviation value. I just cannot understand why!!! - Does this mean that the -r option doesn't make a big difference in the alignment???

Hope to hear back from you guys soon. Thanks a lot.
vkartha is offline   Reply With Quote
Old 05-16-2012, 06:59 PM   #2
rnaseek
Member
 
Location: USA

Join Date: Nov 2011
Posts: 22
Default

Try using samtools flagstat command. flagstat will give the simple alignement stats using the bam file. With different "r" values you can see how they vary.
rnaseek is offline   Reply With Quote
Old 05-16-2012, 08:08 PM   #3
vkartha
Member
 
Location: Boston

Join Date: Feb 2012
Posts: 28
Default

I tried using samtools flagstat on one of the output bam files and this is what I got :


43603331 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
43603331 + 0 mapped (100.00%:nan%)
43603331 + 0 paired in sequencing
21831004 + 0 read1
21772327 + 0 read2
38524104 + 0 properly paired (88.35%:nan%)
42357862 + 0 with itself and mate mapped
1245469 + 0 singletons (2.86%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Why am I seeing the +0's with every result???

Also, I'm sorry I'm new to this - how exactly do I interpret this?
Thanks a lot.
vkartha is offline   Reply With Quote
Old 05-17-2012, 08:54 AM   #4
swbarnes2
Senior Member
 
Location: San Diego

Join Date: May 2008
Posts: 912
Default

Quote:
Originally Posted by vkartha View Post
I tried using samtools flagstat on one of the output bam files and this is what I got :


43603331 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
43603331 + 0 mapped (100.00%:nan%)
43603331 + 0 paired in sequencing
21831004 + 0 read1
21772327 + 0 read2
38524104 + 0 properly paired (88.35%:nan%)
42357862 + 0 with itself and mate mapped
1245469 + 0 singletons (2.86%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Why am I seeing the +0's with every result???

Also, I'm sorry I'm new to this - how exactly do I interpret this?
Thanks a lot.
Just like it says...the +0 means that no reads in your .bam were flagged as QC-failures. Which makes sense, Bowtie wouldn't add that flag to the .bam it makes.

You have no unmapped reads, because bowtie by default doesn't include them, unlike, say, bwa, which does.

So basically, you want to compare the number of mapped reads, depending on the alignment settings.
swbarnes2 is offline   Reply With Quote
Old 05-17-2012, 08:59 AM   #5
vkartha
Member
 
Location: Boston

Join Date: Feb 2012
Posts: 28
Default

So in terms of studying the effect of different -r options on aligning a given sample, I should look at the 'properly paired' statistic?

It is weird because I estimated the mean and std dev inner distance using picard tools. For the two groups (say diseased cases vs control) I got an average inner distance of -16 and -33 respectively - which means there is overlap of the two reads being sequenced, and I know Tophat accepts a negative -r value for the same reason.

On running samtools flagstat, I get the highest % for the 'properly paired' stat as +50 at 92% and it is lower for the above two -r options I tried! Why is that??

Thanks for all your help guys. Appreciate it
vkartha is offline   Reply With Quote
Old 05-17-2012, 01:16 PM   #6
rnaseek
Member
 
Location: USA

Join Date: Nov 2011
Posts: 22
Default

I think the properly paired is a good metric to use and I would use the one with higher "Properly Paired" reads.

Depending on the library prep, rna fragments can be shorter and result in overlapping paired end reads. I am not sure trying negative r value is a good idea.

Anyway with positive "r" values you can vary different variances and choose the one that gives the most number of properly paired reads.
rnaseek is offline   Reply With Quote
Old 05-17-2012, 01:30 PM   #7
cedance
Senior Member
 
Location: Germany

Join Date: Feb 2011
Posts: 108
Default

I usually estimate the inner distance by mapping reads using BWA and computing the mean inner distance and standard deviation. And for my samples, most of the time the reads overlap. The mean inner distance can be negative and I would say its important to choose a negative inner distance if your fragment size selection itself is low. If your chosen fragment size is 170 bases and your reads are 100 bases, of course your reads will have a lot of overlap and your mean inner distance is negative. So, I prefer to estimate it for each data every time. And it definitely is much better than choosing default parameters. The number of reads that map and the mapping efficiency is very reliable this way.
cedance is offline   Reply With Quote
Old 05-23-2012, 05:36 PM   #8
vkartha
Member
 
Location: Boston

Join Date: Feb 2012
Posts: 28
Default Still unsure why I see +50 giving the best result

Quote:
Originally Posted by vkartha View Post
So in terms of studying the effect of different -r options on aligning a given sample, I should look at the 'properly paired' statistic?

It is weird because I estimated the mean and std dev inner distance using picard tools. For the two groups (say diseased cases vs control) I got an average inner distance of -16 and -33 respectively - which means there is overlap of the two reads being sequenced, and I know Tophat accepts a negative -r value for the same reason.

On running samtools flagstat, I get the highest % for the 'properly paired' stat as +50 at 92% and it is lower for the above two -r options I tried! Why is that??

Thanks for all your help guys. Appreciate it
Can someone please address the issue I brought up here? How can the -r option that gives the best alignment results be way off? from the means I had estimated from the samples? In fact that option that gives the best result was chosen as an extreme in the + direction just to see what effect it would have on the alignment had one taken a very large value.

Help?
vkartha is offline   Reply With Quote
Old 05-24-2012, 12:19 AM   #9
arvid
Senior Member
 
Location: Berlin

Join Date: Jul 2011
Posts: 156
Default

Quote:
Originally Posted by vkartha View Post
Can someone please address the issue I brought up here? How can the -r option that gives the best alignment results be way off? from the means I had estimated from the samples? In fact that option that gives the best result was chosen as an extreme in the + direction just to see what effect it would have on the alignment had one taken a very large value.

Help?
Define "best". If you say the more flagged as properly aligned, the better, then setting a really high allowed insert size will give you more properly paired reads - this seems quite obvious. However, you might risk more falsely mapped pairs than you would want to.
In my way of thinking, it would be better to try to estimate the real mean insert size first (e.g. like in post #7), and then adapt the variance to allow for a sensible inclusion of deviating fragments. Then you aren't lying to the tools that include that parameter in their alignment/assembly/quantification algorithms and heuristics.
arvid is offline   Reply With Quote
Old 05-24-2012, 12:26 AM   #10
vkartha
Member
 
Location: Boston

Join Date: Feb 2012
Posts: 28
Default

So when I say best, I mean that I observed the highest % looking at the 'properly paired' flagstat output for +50.

I read that this 'properly paired' flag in the samtools flagstat output is a measure of how accurate your -r option was for that tophat run in some sense.

The -r option is mean insert size, so we have to provide Tophat with a rough estimate. It isn't really 'allowance' as you put it right? It isn't a threshold where everything less than that will get mapped. That is what I didn't get by your reply.
vkartha is offline   Reply With Quote
Old 05-24-2012, 12:39 AM   #11
arvid
Senior Member
 
Location: Berlin

Join Date: Jul 2011
Posts: 156
Default

Yes, it is a mean, not a threshold, but to me it seems that the 'properly paired' flagging depends on fragments being smaller than the mean added by some multiple of the supplied mate-std-dev. I might be wrong, but it doesn't seem to be actually checking whether fragments are much *smaller* than expected, which would explain the behaviour in your case.

Still I believe it is less important to have a high % of 'properly paired' pairs, than using settings that are close to the actual fragment spectrum - this would lead to more cautious use of the outlier fragments for transcript assembly in the downstream analysis (if you are looking for previously unknown transcripts), or e.g. using quantification tools that model the insert size to assign weights for multi-mapping read pairs...
arvid is offline   Reply With Quote
Old 05-24-2012, 12:58 AM   #12
EGrassi
Member
 
Location: Turin, Italy

Join Date: Oct 2010
Posts: 66
Default

Just to report my experience.

without -r (that is with the default value of 50):
samtools flagstat S2_003.th/accepted_hits.bam
28257215 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
28257215 + 0 mapped (100.00%:-nan%)
28257215 + 0 paired in sequencing
14154028 + 0 read1
14103187 + 0 read2
23419378 + 0 properly paired (82.88%:-nan%)
26873012 + 0 with itself and mate mapped
1384203 + 0 singletons (4.90%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

with -r -20 (value gotten from analysis of an alignment done with bowtie on a subset of reads and from the wet lab infos):
samtools flagstat r20_S2_003.th/accepted_hits.bam
28386256 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
28386256 + 0 mapped (100.00%:-nan%)
28386256 + 0 paired in sequencing
14218550 + 0 read1
14167706 + 0 read2
17439266 + 0 properly paired (61.44%:-nan%)
27002056 + 0 with itself and mate mapped
1384200 + 0 singletons (4.88%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

It is true that in principle it seems better to give tophat the actual value but the difference in the properly paired percentages is relevant and with 100bp reads I doubt that all the properly paired reads in the first case that are missing in the second are wrongly aligned...
EGrassi is offline   Reply With Quote
Old 05-24-2012, 01:04 AM   #13
cedance
Senior Member
 
Location: Germany

Join Date: Feb 2011
Posts: 108
Default

Do you also choose a value for --mate-std-dev?
cedance is offline   Reply With Quote
Old 05-24-2012, 01:12 AM   #14
arvid
Senior Member
 
Location: Berlin

Join Date: Jul 2011
Posts: 156
Default

Maybe you have skewed fragment size distributions (many I've seen are) - do you have wet lab info on that? In that case, I'd definitely choose a bigger --mate-std-dev, I think the default 20 is on the low side for typical distributions, anyway.

Last edited by arvid; 05-24-2012 at 01:13 AM. Reason: typo
arvid is offline   Reply With Quote
Old 05-24-2012, 01:17 AM   #15
EGrassi
Member
 
Location: Turin, Italy

Join Date: Oct 2010
Posts: 66
Default

I got a SD of 37 on a subset of aligned reads so in order to reduce the number of tests I sticked with the default value (20). If I get some cpu time for other tests I will let you know the results...right now as long as I have an ETA of 3 weeks to complete the alignment on my data I will go without -r (as long as I have 1 week of work already done in this way...I know that this does not sound right but when I started to align I had read the other thread here about differences in results with negative values and I didn't have the lab data to confirm my extimated value and as long as someone stated that the -r value is less important in the recent releases of tophat I decided to start without an explicit -r [though I've not understood completely what "less important" means and I don't have the time to check in the source code...])
EGrassi is offline   Reply With Quote
Old 05-24-2012, 01:23 AM   #16
arvid
Senior Member
 
Location: Berlin

Join Date: Jul 2011
Posts: 156
Default

It depends on what you are doing downstream, as I said before. Are you looking for new transcripts, or using a quantification scheme where the insert size is used? I'd be especially careful if you are looking for new transcripts/splice variants, for other applications I find this setting to be less critical.
arvid is offline   Reply With Quote
Old 05-24-2012, 01:29 AM   #17
EGrassi
Member
 
Location: Turin, Italy

Join Date: Oct 2010
Posts: 66
Default

Quote:
Originally Posted by arvid View Post
Maybe you have skewed fragment size distributions (many I've seen are) - do you have wet lab info on that?
Nope, I've got only the fragment size calculated from the bioanalyzer trace (btw I checked and the picard analysis on the alignments gave me -20 as a result which was in accordance with a fragment size of 300 [adapters are 122] which was the first value given to me, but then I checked the precise numbers and they vary between 365 and 327, therefore I am not certain of which should be this "real" -r...).
EGrassi is offline   Reply With Quote
Old 05-24-2012, 01:34 AM   #18
EGrassi
Member
 
Location: Turin, Italy

Join Date: Oct 2010
Posts: 66
Default

Quote:
Originally Posted by arvid View Post
It depends on what you are doing downstream, as I said before. Are you looking for new transcripts, or using a quantification scheme where the insert size is used? I'd be especially careful if you are looking for new transcripts/splice variants, for other applications I find this setting to be less critical.
I see, thank you. We would like to start from known transcripts and compare this results with microarray data (and we are in a kind of a hurry to see some first numbers, the dimensions of this dataset as in some way surprised us - and our servers ), only afterwards we will look for new transcripts/splice variants...I hope to be able to compare results with different -r values or at least re-do the alignments with a sensible -r by that time.
EGrassi is offline   Reply With Quote
Old 05-24-2012, 03:28 AM   #19
cedance
Senior Member
 
Location: Germany

Join Date: Feb 2011
Posts: 108
Default

If you are curious about trying to estimate the mean inner distance and its deviation, then you can take a look at the first reply here. I wrote this quite a while ago on biostars. It is what I have been doing to estimate the -r parameter and --mean-std-dev for quite a while. And with samtools flagstats I get over 75-90% of the reads as properly paired with ALL the data sets I have worked so far.
cedance is offline   Reply With Quote
Old 05-24-2012, 03:35 AM   #20
arvid
Senior Member
 
Location: Berlin

Join Date: Jul 2011
Posts: 156
Default

Quote:
Originally Posted by cedance View Post
If you are curious about trying to estimate the mean inner distance and its deviation, then you can take a look at the first reply here. I wrote this quite a while ago on biostars. It is what I have been doing to estimate the -r parameter and --mean-std-dev for quite a while. And with samtools flagstats I get over 75-90% of the reads as properly paired with ALL the data sets I have worked so far.
Just for clarification: do you run the alignment with BWA on the genome or on the transcriptome files? If it's on the genome, isn't it messy to deal with the inner distances for read pairs that span introns? Or does Picard do this for you?
arvid is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 11:39 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO