Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • BWA manual's -q description too technical

    Can someone please explain in normal language what is meant with the option -q in BWA manual and what criteria one should take into consideration when choosing a value for this parameter?


    i.e.:

    -q INT Parameter for read trimming. BWA trims a read down to argmax_x{\sum_{i=x+1}^l(INT-q_i)} if q_l<INT where l is the original read length. [0]


    Thanks,
    C.

  • #2
    yeah that is poorly phrased unless you are a robot.

    if you study this perl interpretation it will make more sense:


    basically envision scanning from the right of the read accumulating a badness sum or "area" based on encountering positions of lower quality than your -q parameter, and losing some of that area when you encounter positions that are higher than your -q. Now plot out that area over the read and trim to the position where the area was greatest (the point of no return)

    This differs from just trimming where quality drops off momentarily because subsequent bases might actually be pretty decent.
    Last edited by Zigster; 08-03-2010, 07:42 AM.
    --
    Jeremy Leipzig
    Bioinformatics Programmer
    --
    My blog
    Twitter

    Comment


    • #3
      Hi Zigster,

      Thank you very much. It is now clear. Especially the perl example was, as you've pointed out, very enlightening.

      A pity that your explanation is not included in the official manual, as this would have saved time and a few neurons to the most of the human users.


      Best regards,
      C.

      Comment


      • #4
        Originally posted by Zigster View Post
        yeah that is poorly phrased unless you are a robot.

        if you study this perl interpretation it will make more sense:


        basically envision scanning from the right of the read accumulating a badness sum or "area" based on encountering positions of lower quality than your -q parameter, and losing some of that area when you encounter positions that are higher than your -q. Now plot out that area over the read and trim to the position where the area was greatest (the point of no return)

        This differs from just trimming where quality drops off momentarily because subsequent bases might actually be pretty decent.
        hi Zigster,
        i have the same question .. but i didn't get your point .. can you please explain it a little bit more ??
        thank you

        Comment


        • #5
          here is an awesome diagram I have devoted hours to:

          accumulate badness right to left
          trim at max badness



          so notice if that small hump was higher then we would trim before encountering it, at the "first' badness peak on the right
          --
          Jeremy Leipzig
          Bioinformatics Programmer
          --
          My blog
          Twitter

          Comment


          • #6
            Zigstar,
            Say the quality scores are
            QS=c(30,30,30,30,27,28,24,20,17,20,20,17,17,10,17,17,19,20,12,10)
            and THR=17
            then
            cumsum(QS-THR)
            c(13 26 39 52 62 73 80 83 83 86 89 89 89 82 82 82 84 87 82 75)
            Thus the read would get trimmed at score of 89, i.e at positions 12,13, or 14.
            I stop where I have most goodness accumulated.

            Btw does trimming happen after alignment during writing cigars etc, or before it?
            I would like to think after, that way I can still use this information but not report it.

            Comment


            • #7
              well i think you accumulate badness from the 3' end, though I suppose you wind up with the same result in your approach

              Code:
              > cumsum(-(rev(QS)-THR))
               [1]   7  12   9   7   7   7  14  14  14  11   8   8   5  -2 -13 -23 -36 -49 -62 -75
              
              > QS[1:(length(QS)-which(cumsum(-(rev(QS)-THR))==max(cumsum(-(rev(QS)-THR))))[1])]
               [1] 30 30 30 30 27 28 24 20 17 20 20 17 17
              Last edited by Zigster; 04-12-2012, 01:45 PM.
              --
              Jeremy Leipzig
              Bioinformatics Programmer
              --
              My blog
              Twitter

              Comment


              • #8
                Yes,
                I completely agree,
                Thanks for demystifying the cryptic formula from the manual.

                Comment


                • #9
                  To dig up an old thread: I believe the perl approach to this function does not fit with the (correct) graphical interpretation that was provided here. As can be seen on the attached plots, reads will be discarded completely by the perl function if the area-value does not reach zero again. This might happen if the good part of the read is too short, or when the quality is just above the threshold. The top plot will be discarded incorrectly, the bottom plot will be trimmed correctly.

                  I believe that to correct this behaviour, reads should only be discarded if the maximal 'badness' value (area in the perl function) is reached at position 0.

                  I've included my version of the perl script here (snippet from a threaded program):

                  Code:
                  sub bwasingle {
                  	# local variables speed up threading
                  	my $tq = $trimq;
                  	while ( defined(my $rstring = $totrim->dequeue())){
                  	  my @reads = split(/\|/,$rstring);   # each rstring has 10000 reads
                            my $printstring = '';
                  	  my $failedstring = '';
                            foreach (@reads) {
                  		my $read = $_;
                  		my ($h1a,$s1,$h1b,$q1) = split(/\n/,$read);
                  		#############
                  		## trim 3' ##
                  		#############
                  		if ($trim3 == 1) {
                  			my $pos = length($q1) - 1;
                  			my $maxPos = $pos;
                  			## skip empty read.
                  			if ($pos <= 0) {   
                  				$discarded++;
                  				$failedstring .= $read . '|';	
                  				next;
                  			}  
                  			my $area = 0;
                  			my $maxArea = 0;
                  			my @qarray = unpack("C*",$q1);
                  			while ($pos>0 && $area>=0) {
                  				$area += $tq - ($qarray[($pos)] - 33);
                  				## not <=  for more aggressive trimming : if multiple equal maxima => use shortest read.
                  				if ($area < $maxArea) {
                  					$pos--;
                  					next;
                  				}
                  				else {
                  					$maxArea = $area;
                  					$maxPos = $pos;
                  					$pos--;
                  					next;
                  				}
                  			}
                  			## The online perl approach is wrong if the high-quality part is not long enough to reach 0.
                  			if ($maxPos == 0) {  # maximal badness on position 0 => no point of no return reached. 
                  				$discarded++;
                  				$failedstring .= $read . '|';	
                  				next;
                  			}  
                  			# otherwise trim before position where area reached a maximum 
                  			# $maxPos as length of substr gives positions zero to just before the maxpos.
                  			$s1 = substr($s1,0,$maxPos);
                  			$q1 = substr($q1,0,$maxPos);
                  		}
                  		#############
                  		## trim 5' ##
                  		#############
                  		if ($trim5 == 1) {
                  			my $skip = 0; #length($q1);
                  			my $maxPos = $skip;
                  			my $area = 0;
                  			my $maxArea = 0;
                  			my @qarray = unpack("C*",$q1);
                  			while ($skip<length($q1) && $area>=0) {
                  				#$area += $tq - (ord(unpack("x$skip A1",$q1)) - 33);
                  				$area += $tq - ($qarray[($skip)] - 33);
                  				if ($area < $maxArea) {
                  					$skip++;
                  					next;
                  				}
                  				else {
                  					$maxArea = $area;
                  					$maxPos = $skip;
                  					$skip++;
                  					next;
                  				}
                  			}
                  			#if ($skip==length($q1)) { 		# scanned whole read and didn't integrate to zero? skip (no output for this read
                  			if ($maxPos == (length($q1)-1)) { # => maximal value at end of string => no point of no return found.
                  				$discarded++;
                  				$failedstring .= $read . '|';	
                  				next;
                  			}  
                  			# trim after position where area reached a maximum 
                  			$s1 = substr($s1,($maxPos+1));
                  			$q1 = substr($q1,($maxPos+1));
                  		}
                  		# queue for printing
                  		my @arr = ($h1a,$s1,$h1b,$q1);
                  		$printstring .= join("\n",@arr) . "\n";  # join components by newline character for printing 
                  	   }
                  	   
                  	   if ($printstring ne '') {
                  		   $toprint->enqueue($printstring);
                  	   }
                  	   if ($failedstring ne '') {
                  		   $failed->enqueue($failedstring);
                             }
                  	}
                  }
                  Attached Files

                  Comment


                  • #10
                    Does BWA do this before aligning the reads? So as to begin mapping with a smaller set than the original?

                    Comment


                    • #11
                      Geertyvendeweyer,

                      Would you care to explain in more detail why the perl interpretation is wrong?

                      I want to understand your logic.

                      Thanks,
                      carmen

                      Comment


                      • #12
                        Hi cdry7ue
                        Thanks for your wonderful explanation. But I din't get what is THR. Can you or anybody else explain what is meant by THR or it's full form?
                        Thanks.

                        Comment


                        • #13
                          Hi cdry7ue,
                          Thanks for your wonderful explanation. But I din't get what is THR. Can you or anybody else explain what is meant by THR or it's full form?
                          Thanks.

                          Comment


                          • #14
                            I received a request for permission to use my awesome diagram from someone but they have Private Messages blocked so I could not respond directly.

                            In any case, yes feel free to use my awesome diagram as you please.
                            --
                            Jeremy Leipzig
                            Bioinformatics Programmer
                            --
                            My blog
                            Twitter

                            Comment

                            Latest Articles

                            Collapse

                            • seqadmin
                              Strategies for Sequencing Challenging Samples
                              by seqadmin


                              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                              03-22-2024, 06:39 AM
                            • seqadmin
                              Techniques and Challenges in Conservation Genomics
                              by seqadmin



                              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                              Avian Conservation
                              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                              03-08-2024, 10:41 AM

                            ad_right_rmr

                            Collapse

                            News

                            Collapse

                            Topics Statistics Last Post
                            Started by seqadmin, Yesterday, 06:37 PM
                            0 responses
                            11 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, Yesterday, 06:07 PM
                            0 responses
                            10 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-22-2024, 10:03 AM
                            0 responses
                            51 views
                            0 likes
                            Last Post seqadmin  
                            Started by seqadmin, 03-21-2024, 07:32 AM
                            0 responses
                            68 views
                            0 likes
                            Last Post seqadmin  
                            Working...
                            X