Seqanswers Leaderboard Ad

**rnaseek** · 05-16-2012, 06:59 PM

Try using samtools flagstat command. flagstat will give the simple alignement stats using the bam file. With different "r" values you can see how they vary.

**vkartha** · 05-16-2012, 08:08 PM

I tried using samtools flagstat on one of the output bam files and this is what I got :

43603331 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
43603331 + 0 mapped (100.00%:nan%)
43603331 + 0 paired in sequencing
21831004 + 0 read1
21772327 + 0 read2
38524104 + 0 properly paired (88.35%:nan%)
42357862 + 0 with itself and mate mapped
1245469 + 0 singletons (2.86%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Why am I seeing the +0's with every result???

Also, I'm sorry I'm new to this - how exactly do I interpret this?
Thanks a lot.

**swbarnes2** · 05-17-2012, 08:54 AM

Originally posted by vkartha View Post

I tried using samtools flagstat on one of the output bam files and this is what I got :

43603331 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
43603331 + 0 mapped (100.00%:nan%)
43603331 + 0 paired in sequencing
21831004 + 0 read1
21772327 + 0 read2
38524104 + 0 properly paired (88.35%:nan%)
42357862 + 0 with itself and mate mapped
1245469 + 0 singletons (2.86%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Why am I seeing the +0's with every result???

Also, I'm sorry I'm new to this - how exactly do I interpret this?
Thanks a lot.

Just like it says...the +0 means that no reads in your .bam were flagged as QC-failures. Which makes sense, Bowtie wouldn't add that flag to the .bam it makes.

You have no unmapped reads, because bowtie by default doesn't include them, unlike, say, bwa, which does.

So basically, you want to compare the number of mapped reads, depending on the alignment settings.

**vkartha** · 05-17-2012, 08:59 AM

So in terms of studying the effect of different -r options on aligning a given sample, I should look at the 'properly paired' statistic?

It is weird because I estimated the mean and std dev inner distance using picard tools. For the two groups (say diseased cases vs control) I got an average inner distance of -16 and -33 respectively - which means there is overlap of the two reads being sequenced, and I know Tophat accepts a negative -r value for the same reason.

On running samtools flagstat, I get the highest % for the 'properly paired' stat as +50 at 92% and it is lower for the above two -r options I tried! Why is that??

Thanks for all your help guys. Appreciate it

**rnaseek** · 05-17-2012, 01:16 PM

I think the properly paired is a good metric to use and I would use the one with higher "Properly Paired" reads.

Depending on the library prep, rna fragments can be shorter and result in overlapping paired end reads. I am not sure trying negative r value is a good idea.

Anyway with positive "r" values you can vary different variances and choose the one that gives the most number of properly paired reads.

**cedance** · 05-17-2012, 01:30 PM

I usually estimate the inner distance by mapping reads using BWA and computing the mean inner distance and standard deviation. And for my samples, most of the time the reads overlap. The mean inner distance can be negative and I would say its important to choose a negative inner distance if your fragment size selection itself is low. If your chosen fragment size is 170 bases and your reads are 100 bases, of course your reads will have a lot of overlap and your mean inner distance is negative. So, I prefer to estimate it for each data every time. And it definitely is much better than choosing default parameters. The number of reads that map and the mapping efficiency is very reliable this way.

**vkartha** · 05-23-2012, 05:36 PM

Still unsure why I see +50 giving the best result

Originally posted by vkartha View Post

So in terms of studying the effect of different -r options on aligning a given sample, I should look at the 'properly paired' statistic?

It is weird because I estimated the mean and std dev inner distance using picard tools. For the two groups (say diseased cases vs control) I got an average inner distance of -16 and -33 respectively - which means there is overlap of the two reads being sequenced, and I know Tophat accepts a negative -r value for the same reason.

On running samtools flagstat, I get the highest % for the 'properly paired' stat as +50 at 92% and it is lower for the above two -r options I tried! Why is that??

Thanks for all your help guys. Appreciate it

Can someone please address the issue I brought up here? How can the -r option that gives the best alignment results be way off? from the means I had estimated from the samples? In fact that option that gives the best result was chosen as an extreme in the + direction just to see what effect it would have on the alignment had one taken a very large value.

Help?

**arvid** · 05-24-2012, 12:19 AM

Originally posted by vkartha View Post

Can someone please address the issue I brought up here? How can the -r option that gives the best alignment results be way off? from the means I had estimated from the samples? In fact that option that gives the best result was chosen as an extreme in the + direction just to see what effect it would have on the alignment had one taken a very large value.

Help?

Define "best". If you say the more flagged as properly aligned, the better, then setting a really high allowed insert size will give you more properly paired reads - this seems quite obvious. However, you might risk more falsely mapped pairs than you would want to.
In my way of thinking, it would be better to try to estimate the real mean insert size first (e.g. like in post #7), and then adapt the variance to allow for a sensible inclusion of deviating fragments. Then you aren't lying to the tools that include that parameter in their alignment/assembly/quantification algorithms and heuristics.

**vkartha** · 05-24-2012, 12:26 AM

So when I say best, I mean that I observed the highest % looking at the 'properly paired' flagstat output for +50.

I read that this 'properly paired' flag in the samtools flagstat output is a measure of how accurate your -r option was for that tophat run in some sense.

The -r option is mean insert size, so we have to provide Tophat with a rough estimate. It isn't really 'allowance' as you put it right? It isn't a threshold where everything less than that will get mapped. That is what I didn't get by your reply.

**arvid** · 05-24-2012, 12:39 AM

Yes, it is a mean, not a threshold, but to me it seems that the 'properly paired' flagging depends on fragments being smaller than the mean added by some multiple of the supplied mate-std-dev. I might be wrong, but it doesn't seem to be actually checking whether fragments are much *smaller* than expected, which would explain the behaviour in your case.

Still I believe it is less important to have a high % of 'properly paired' pairs, than using settings that are close to the actual fragment spectrum - this would lead to more cautious use of the outlier fragments for transcript assembly in the downstream analysis (if you are looking for previously unknown transcripts), or e.g. using quantification tools that model the insert size to assign weights for multi-mapping read pairs...

**EGrassi** · 05-24-2012, 12:58 AM

Just to report my experience.

without -r (that is with the default value of 50):
samtools flagstat S2_003.th/accepted_hits.bam
28257215 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
28257215 + 0 mapped (100.00%:-nan%)
28257215 + 0 paired in sequencing
14154028 + 0 read1
14103187 + 0 read2
23419378 + 0 properly paired (82.88%:-nan%)
26873012 + 0 with itself and mate mapped
1384203 + 0 singletons (4.90%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

with -r -20 (value gotten from analysis of an alignment done with bowtie on a subset of reads and from the wet lab infos):
samtools flagstat r20_S2_003.th/accepted_hits.bam
28386256 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
28386256 + 0 mapped (100.00%:-nan%)
28386256 + 0 paired in sequencing
14218550 + 0 read1
14167706 + 0 read2
17439266 + 0 properly paired (61.44%:-nan%)
27002056 + 0 with itself and mate mapped
1384200 + 0 singletons (4.88%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

It is true that in principle it seems better to give tophat the actual value but the difference in the properly paired percentages is relevant and with 100bp reads I doubt that all the properly paired reads in the first case that are missing in the second are wrongly aligned...

**cedance** · 05-24-2012, 01:04 AM

Do you also choose a value for --mate-std-dev?

**arvid** · 05-24-2012, 01:12 AM

Maybe you have skewed fragment size distributions (many I've seen are) - do you have wet lab info on that? In that case, I'd definitely choose a bigger --mate-std-dev, I think the default 20 is on the low side for typical distributions, anyway.

**EGrassi** · 05-24-2012, 01:17 AM

I got a SD of 37 on a subset of aligned reads so in order to reduce the number of tests I sticked with the default value (20). If I get some cpu time for other tests I will let you know the results...right now as long as I have an ETA of 3 weeks to complete the alignment on my data I will go without -r (as long as I have 1 week of work already done in this way...I know that this does not sound right but when I started to align I had read the other thread here about differences in results with negative values and I didn't have the lab data to confirm my extimated value and as long as someone stated that the -r value is less important in the recent releases of tophat I decided to start without an explicit -r [though I've not understood completely what "less important" means and I don't have the time to check in the source code...])

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 18 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 22 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 47 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

-r option - Does it make a difference to the alignment?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News