Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • -r option - Does it make a difference to the alignment?

    Hey guys,
    I have paired end reads on RNASeq data and I wanted to test run Tophat2 to map these reads for a given sample using 5 different -r values(that is the mean inner distance between paired ends) just to see how it would affect the alignment.

    I'm not exactly sure what I should look at once the alignment is over to figure out the so called 'alignment results'.

    Tophat2 gives out a log folder along with the accepted_hits.bam, junction.bed files etc. in which I looked at two files namely:

    bowtie.left_kept_reads.fixmap.log
    bowtie.right_kept_reads.fixmap.log

    These give summaries on the alignment from each end such as:

    22630038 reads; of these:
    22630038 (100.00%) were unpaired; of these:
    2447457 (10.82%) aligned 0 times
    15049769 (66.50%) aligned exactly 1 time
    5132812 (22.68%) aligned >1 times
    89.18% overall alignment rate

    Now I notice there are also log files namely:

    bowtie.right_kept_reads_seg1.fixmap.log
    bowtie.right_kept_reads_seg2.fixmap.log
    bowtie.right_kept_reads_seg3.fixmap.log
    bowtie.right_kept_reads_seg4.fixmap.log

    (And the same 4 for the left kept reads)

    I don't understand what these files are or how they are different from the file I mentioned earlier (fixmap.log file)

    I'm not sure if I'm looking at the correct log file, and if the fixmap.log file is indeed the right one to look at, I noticed that the alignment percentage is the SAME (i.e. 89%) for ALL -r options (I tried -50, -16, -33, 0, 50) using the same std deviation value. I just cannot understand why!!! - Does this mean that the -r option doesn't make a big difference in the alignment???

    Hope to hear back from you guys soon. Thanks a lot.

  • #2
    Try using samtools flagstat command. flagstat will give the simple alignement stats using the bam file. With different "r" values you can see how they vary.

    Comment


    • #3
      I tried using samtools flagstat on one of the output bam files and this is what I got :


      43603331 + 0 in total (QC-passed reads + QC-failed reads)
      0 + 0 duplicates
      43603331 + 0 mapped (100.00%:nan%)
      43603331 + 0 paired in sequencing
      21831004 + 0 read1
      21772327 + 0 read2
      38524104 + 0 properly paired (88.35%:nan%)
      42357862 + 0 with itself and mate mapped
      1245469 + 0 singletons (2.86%:nan%)
      0 + 0 with mate mapped to a different chr
      0 + 0 with mate mapped to a different chr (mapQ>=5)

      Why am I seeing the +0's with every result???

      Also, I'm sorry I'm new to this - how exactly do I interpret this?
      Thanks a lot.

      Comment


      • #4
        Originally posted by vkartha View Post
        I tried using samtools flagstat on one of the output bam files and this is what I got :


        43603331 + 0 in total (QC-passed reads + QC-failed reads)
        0 + 0 duplicates
        43603331 + 0 mapped (100.00%:nan%)
        43603331 + 0 paired in sequencing
        21831004 + 0 read1
        21772327 + 0 read2
        38524104 + 0 properly paired (88.35%:nan%)
        42357862 + 0 with itself and mate mapped
        1245469 + 0 singletons (2.86%:nan%)
        0 + 0 with mate mapped to a different chr
        0 + 0 with mate mapped to a different chr (mapQ>=5)

        Why am I seeing the +0's with every result???

        Also, I'm sorry I'm new to this - how exactly do I interpret this?
        Thanks a lot.
        Just like it says...the +0 means that no reads in your .bam were flagged as QC-failures. Which makes sense, Bowtie wouldn't add that flag to the .bam it makes.

        You have no unmapped reads, because bowtie by default doesn't include them, unlike, say, bwa, which does.

        So basically, you want to compare the number of mapped reads, depending on the alignment settings.

        Comment


        • #5
          So in terms of studying the effect of different -r options on aligning a given sample, I should look at the 'properly paired' statistic?

          It is weird because I estimated the mean and std dev inner distance using picard tools. For the two groups (say diseased cases vs control) I got an average inner distance of -16 and -33 respectively - which means there is overlap of the two reads being sequenced, and I know Tophat accepts a negative -r value for the same reason.

          On running samtools flagstat, I get the highest % for the 'properly paired' stat as +50 at 92% and it is lower for the above two -r options I tried! Why is that??

          Thanks for all your help guys. Appreciate it

          Comment


          • #6
            I think the properly paired is a good metric to use and I would use the one with higher "Properly Paired" reads.

            Depending on the library prep, rna fragments can be shorter and result in overlapping paired end reads. I am not sure trying negative r value is a good idea.

            Anyway with positive "r" values you can vary different variances and choose the one that gives the most number of properly paired reads.

            Comment


            • #7
              I usually estimate the inner distance by mapping reads using BWA and computing the mean inner distance and standard deviation. And for my samples, most of the time the reads overlap. The mean inner distance can be negative and I would say its important to choose a negative inner distance if your fragment size selection itself is low. If your chosen fragment size is 170 bases and your reads are 100 bases, of course your reads will have a lot of overlap and your mean inner distance is negative. So, I prefer to estimate it for each data every time. And it definitely is much better than choosing default parameters. The number of reads that map and the mapping efficiency is very reliable this way.

              Comment


              • #8
                Still unsure why I see +50 giving the best result

                Originally posted by vkartha View Post
                So in terms of studying the effect of different -r options on aligning a given sample, I should look at the 'properly paired' statistic?

                It is weird because I estimated the mean and std dev inner distance using picard tools. For the two groups (say diseased cases vs control) I got an average inner distance of -16 and -33 respectively - which means there is overlap of the two reads being sequenced, and I know Tophat accepts a negative -r value for the same reason.

                On running samtools flagstat, I get the highest % for the 'properly paired' stat as +50 at 92% and it is lower for the above two -r options I tried! Why is that??

                Thanks for all your help guys. Appreciate it
                Can someone please address the issue I brought up here? How can the -r option that gives the best alignment results be way off? from the means I had estimated from the samples? In fact that option that gives the best result was chosen as an extreme in the + direction just to see what effect it would have on the alignment had one taken a very large value.

                Help?

                Comment


                • #9
                  Originally posted by vkartha View Post
                  Can someone please address the issue I brought up here? How can the -r option that gives the best alignment results be way off? from the means I had estimated from the samples? In fact that option that gives the best result was chosen as an extreme in the + direction just to see what effect it would have on the alignment had one taken a very large value.

                  Help?
                  Define "best". If you say the more flagged as properly aligned, the better, then setting a really high allowed insert size will give you more properly paired reads - this seems quite obvious. However, you might risk more falsely mapped pairs than you would want to.
                  In my way of thinking, it would be better to try to estimate the real mean insert size first (e.g. like in post #7), and then adapt the variance to allow for a sensible inclusion of deviating fragments. Then you aren't lying to the tools that include that parameter in their alignment/assembly/quantification algorithms and heuristics.

                  Comment


                  • #10
                    So when I say best, I mean that I observed the highest % looking at the 'properly paired' flagstat output for +50.

                    I read that this 'properly paired' flag in the samtools flagstat output is a measure of how accurate your -r option was for that tophat run in some sense.

                    The -r option is mean insert size, so we have to provide Tophat with a rough estimate. It isn't really 'allowance' as you put it right? It isn't a threshold where everything less than that will get mapped. That is what I didn't get by your reply.

                    Comment


                    • #11
                      Yes, it is a mean, not a threshold, but to me it seems that the 'properly paired' flagging depends on fragments being smaller than the mean added by some multiple of the supplied mate-std-dev. I might be wrong, but it doesn't seem to be actually checking whether fragments are much *smaller* than expected, which would explain the behaviour in your case.

                      Still I believe it is less important to have a high % of 'properly paired' pairs, than using settings that are close to the actual fragment spectrum - this would lead to more cautious use of the outlier fragments for transcript assembly in the downstream analysis (if you are looking for previously unknown transcripts), or e.g. using quantification tools that model the insert size to assign weights for multi-mapping read pairs...

                      Comment


                      • #12
                        Just to report my experience.

                        without -r (that is with the default value of 50):
                        samtools flagstat S2_003.th/accepted_hits.bam
                        28257215 + 0 in total (QC-passed reads + QC-failed reads)
                        0 + 0 duplicates
                        28257215 + 0 mapped (100.00%:-nan%)
                        28257215 + 0 paired in sequencing
                        14154028 + 0 read1
                        14103187 + 0 read2
                        23419378 + 0 properly paired (82.88%:-nan%)
                        26873012 + 0 with itself and mate mapped
                        1384203 + 0 singletons (4.90%:-nan%)
                        0 + 0 with mate mapped to a different chr
                        0 + 0 with mate mapped to a different chr (mapQ>=5)

                        with -r -20 (value gotten from analysis of an alignment done with bowtie on a subset of reads and from the wet lab infos):
                        samtools flagstat r20_S2_003.th/accepted_hits.bam
                        28386256 + 0 in total (QC-passed reads + QC-failed reads)
                        0 + 0 duplicates
                        28386256 + 0 mapped (100.00%:-nan%)
                        28386256 + 0 paired in sequencing
                        14218550 + 0 read1
                        14167706 + 0 read2
                        17439266 + 0 properly paired (61.44%:-nan%)
                        27002056 + 0 with itself and mate mapped
                        1384200 + 0 singletons (4.88%:-nan%)
                        0 + 0 with mate mapped to a different chr
                        0 + 0 with mate mapped to a different chr (mapQ>=5)

                        It is true that in principle it seems better to give tophat the actual value but the difference in the properly paired percentages is relevant and with 100bp reads I doubt that all the properly paired reads in the first case that are missing in the second are wrongly aligned...

                        Comment


                        • #13
                          Do you also choose a value for --mate-std-dev?

                          Comment


                          • #14
                            Maybe you have skewed fragment size distributions (many I've seen are) - do you have wet lab info on that? In that case, I'd definitely choose a bigger --mate-std-dev, I think the default 20 is on the low side for typical distributions, anyway.
                            Last edited by arvid; 05-24-2012, 01:13 AM. Reason: typo

                            Comment


                            • #15
                              I got a SD of 37 on a subset of aligned reads so in order to reduce the number of tests I sticked with the default value (20). If I get some cpu time for other tests I will let you know the results...right now as long as I have an ETA of 3 weeks to complete the alignment on my data I will go without -r (as long as I have 1 week of work already done in this way...I know that this does not sound right but when I started to align I had read the other thread here about differences in results with negative values and I didn't have the lab data to confirm my extimated value and as long as someone stated that the -r value is less important in the recent releases of tophat I decided to start without an explicit -r [though I've not understood completely what "less important" means and I don't have the time to check in the source code...])

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              18 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              22 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              16 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              47 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X