SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Exome sequencing analysis manual ulz_peter Bioinformatics 138 03-09-2017 04:26 AM
technical replicates uncorrelate with each other tujchl Epigenetics 4 05-26-2011 06:16 PM
GS FLX Manual jordi 454 Pyrosequencing 0 03-21-2011 09:24 AM
SOAP output format - description? Aengus Bioinformatics 2 09-08-2010 06:26 AM
Comprehensive TopHat manual? Gus RNA Sequencing 1 12-21-2009 04:54 AM

Reply
 
Thread Tools
Old 08-03-2010, 06:22 AM   #1
CHRYSES
Member
 
Location: Netherlands

Join Date: Dec 2009
Posts: 13
Default BWA manual's -q description too technical

Can someone please explain in normal language what is meant with the option -q in BWA manual and what criteria one should take into consideration when choosing a value for this parameter?


i.e.:

-q INT Parameter for read trimming. BWA trims a read down to argmax_x{\sum_{i=x+1}^l(INT-q_i)} if q_l<INT where l is the original read length. [0]


Thanks,
C.
CHRYSES is offline   Reply With Quote
Old 08-03-2010, 07:24 AM   #2
Zigster
(Jeremy Leipzig)
 
Location: Philadelphia, PA

Join Date: May 2009
Posts: 116
Default

yeah that is poorly phrased unless you are a robot.

if you study this perl interpretation it will make more sense:
http://wiki.bioinformatics.ucdavis.e...rimBWAstyle.pl

basically envision scanning from the right of the read accumulating a badness sum or "area" based on encountering positions of lower quality than your -q parameter, and losing some of that area when you encounter positions that are higher than your -q. Now plot out that area over the read and trim to the position where the area was greatest (the point of no return)

This differs from just trimming where quality drops off momentarily because subsequent bases might actually be pretty decent.
__________________
--
Jeremy Leipzig
Bioinformatics Programmer
--
My blog
Twitter

Last edited by Zigster; 08-03-2010 at 07:42 AM.
Zigster is offline   Reply With Quote
Old 08-03-2010, 01:14 PM   #3
CHRYSES
Member
 
Location: Netherlands

Join Date: Dec 2009
Posts: 13
Default

Hi Zigster,

Thank you very much. It is now clear. Especially the perl example was, as you've pointed out, very enlightening.

A pity that your explanation is not included in the official manual, as this would have saved time and a few neurons to the most of the human users.


Best regards,
C.
CHRYSES is offline   Reply With Quote
Old 11-03-2011, 03:13 AM   #4
wisosonic
Member
 
Location: Rouen-France

Join Date: Mar 2011
Posts: 14
Default

Quote:
Originally Posted by Zigster View Post
yeah that is poorly phrased unless you are a robot.

if you study this perl interpretation it will make more sense:
http://wiki.bioinformatics.ucdavis.e...rimBWAstyle.pl

basically envision scanning from the right of the read accumulating a badness sum or "area" based on encountering positions of lower quality than your -q parameter, and losing some of that area when you encounter positions that are higher than your -q. Now plot out that area over the read and trim to the position where the area was greatest (the point of no return)

This differs from just trimming where quality drops off momentarily because subsequent bases might actually be pretty decent.
hi Zigster,
i have the same question .. but i didn't get your point .. can you please explain it a little bit more ??
thank you
wisosonic is offline   Reply With Quote
Old 11-03-2011, 04:51 AM   #5
Zigster
(Jeremy Leipzig)
 
Location: Philadelphia, PA

Join Date: May 2009
Posts: 116
Default

here is an awesome diagram I have devoted hours to:

accumulate badness right to left
trim at max badness



so notice if that small hump was higher then we would trim before encountering it, at the "first' badness peak on the right
__________________
--
Jeremy Leipzig
Bioinformatics Programmer
--
My blog
Twitter
Zigster is offline   Reply With Quote
Old 04-12-2012, 12:35 PM   #6
cdry7ue
Member
 
Location: Austin

Join Date: Feb 2011
Posts: 20
Default

Zigstar,
Say the quality scores are
QS=c(30,30,30,30,27,28,24,20,17,20,20,17,17,10,17,17,19,20,12,10)
and THR=17
then
cumsum(QS-THR)
c(13 26 39 52 62 73 80 83 83 86 89 89 89 82 82 82 84 87 82 75)
Thus the read would get trimmed at score of 89, i.e at positions 12,13, or 14.
I stop where I have most goodness accumulated.

Btw does trimming happen after alignment during writing cigars etc, or before it?
I would like to think after, that way I can still use this information but not report it.
cdry7ue is offline   Reply With Quote
Old 04-12-2012, 01:38 PM   #7
Zigster
(Jeremy Leipzig)
 
Location: Philadelphia, PA

Join Date: May 2009
Posts: 116
Default

well i think you accumulate badness from the 3' end, though I suppose you wind up with the same result in your approach

Code:
> cumsum(-(rev(QS)-THR))
 [1]   7  12   9   7   7   7  14  14  14  11   8   8   5  -2 -13 -23 -36 -49 -62 -75

> QS[1:(length(QS)-which(cumsum(-(rev(QS)-THR))==max(cumsum(-(rev(QS)-THR))))[1])]
 [1] 30 30 30 30 27 28 24 20 17 20 20 17 17
__________________
--
Jeremy Leipzig
Bioinformatics Programmer
--
My blog
Twitter

Last edited by Zigster; 04-12-2012 at 01:45 PM.
Zigster is offline   Reply With Quote
Old 04-12-2012, 02:54 PM   #8
cdry7ue
Member
 
Location: Austin

Join Date: Feb 2011
Posts: 20
Default

Yes,
I completely agree,
Thanks for demystifying the cryptic formula from the manual.
cdry7ue is offline   Reply With Quote
Old 06-19-2012, 02:21 AM   #9
geertvandeweyer
Member
 
Location: Antwerp, Belgium

Join Date: Jan 2011
Posts: 14
Default

To dig up an old thread: I believe the perl approach to this function does not fit with the (correct) graphical interpretation that was provided here. As can be seen on the attached plots, reads will be discarded completely by the perl function if the area-value does not reach zero again. This might happen if the good part of the read is too short, or when the quality is just above the threshold. The top plot will be discarded incorrectly, the bottom plot will be trimmed correctly.

I believe that to correct this behaviour, reads should only be discarded if the maximal 'badness' value (area in the perl function) is reached at position 0.

I've included my version of the perl script here (snippet from a threaded program):

Code:
sub bwasingle {
	# local variables speed up threading
	my $tq = $trimq;
	while ( defined(my $rstring = $totrim->dequeue())){
	  my @reads = split(/\|/,$rstring);   # each rstring has 10000 reads
          my $printstring = '';
	  my $failedstring = '';
          foreach (@reads) {
		my $read = $_;
		my ($h1a,$s1,$h1b,$q1) = split(/\n/,$read);
		#############
		## trim 3' ##
		#############
		if ($trim3 == 1) {
			my $pos = length($q1) - 1;
			my $maxPos = $pos;
			## skip empty read.
			if ($pos <= 0) {   
				$discarded++;
				$failedstring .= $read . '|';	
				next;
			}  
			my $area = 0;
			my $maxArea = 0;
			my @qarray = unpack("C*",$q1);
			while ($pos>0 && $area>=0) {
				$area += $tq - ($qarray[($pos)] - 33);
				## not <=  for more aggressive trimming : if multiple equal maxima => use shortest read.
				if ($area < $maxArea) {
					$pos--;
					next;
				}
				else {
					$maxArea = $area;
					$maxPos = $pos;
					$pos--;
					next;
				}
			}
			## The online perl approach is wrong if the high-quality part is not long enough to reach 0.
			if ($maxPos == 0) {  # maximal badness on position 0 => no point of no return reached. 
				$discarded++;
				$failedstring .= $read . '|';	
				next;
			}  
			# otherwise trim before position where area reached a maximum 
			# $maxPos as length of substr gives positions zero to just before the maxpos.
			$s1 = substr($s1,0,$maxPos);
			$q1 = substr($q1,0,$maxPos);
		}
		#############
		## trim 5' ##
		#############
		if ($trim5 == 1) {
			my $skip = 0; #length($q1);
			my $maxPos = $skip;
			my $area = 0;
			my $maxArea = 0;
			my @qarray = unpack("C*",$q1);
			while ($skip<length($q1) && $area>=0) {
				#$area += $tq - (ord(unpack("x$skip A1",$q1)) - 33);
				$area += $tq - ($qarray[($skip)] - 33);
				if ($area < $maxArea) {
					$skip++;
					next;
				}
				else {
					$maxArea = $area;
					$maxPos = $skip;
					$skip++;
					next;
				}
			}
			#if ($skip==length($q1)) { 		# scanned whole read and didn't integrate to zero? skip (no output for this read
			if ($maxPos == (length($q1)-1)) { # => maximal value at end of string => no point of no return found.
				$discarded++;
				$failedstring .= $read . '|';	
				next;
			}  
			# trim after position where area reached a maximum 
			$s1 = substr($s1,($maxPos+1));
			$q1 = substr($q1,($maxPos+1));
		}
		# queue for printing
		my @arr = ($h1a,$s1,$h1b,$q1);
		$printstring .= join("\n",@arr) . "\n";  # join components by newline character for printing 
	   }
	   
	   if ($printstring ne '') {
		   $toprint->enqueue($printstring);
	   }
	   if ($failedstring ne '') {
		   $failed->enqueue($failedstring);
           }
	}
}
Attached Images
File Type: png BWA_Trim_Principle.png (76.0 KB, 82 views)
geertvandeweyer is offline   Reply With Quote
Old 12-19-2012, 08:17 AM   #10
carmeyeii
Senior Member
 
Location: Mexico

Join Date: Mar 2011
Posts: 137
Default

Does BWA do this before aligning the reads? So as to begin mapping with a smaller set than the original?
carmeyeii is offline   Reply With Quote
Old 12-19-2012, 11:47 AM   #11
carmeyeii
Senior Member
 
Location: Mexico

Join Date: Mar 2011
Posts: 137
Default

Geertyvendeweyer,

Would you care to explain in more detail why the perl interpretation is wrong?

I want to understand your logic.

Thanks,
carmen
carmeyeii is offline   Reply With Quote
Old 12-21-2012, 01:48 AM   #12
dd_genome
Junior Member
 
Location: India

Join Date: Oct 2012
Posts: 4
Default

Hi cdry7ue
Thanks for your wonderful explanation. But I din't get what is THR. Can you or anybody else explain what is meant by THR or it's full form?
Thanks.
dd_genome is offline   Reply With Quote
Old 12-21-2012, 01:48 AM   #13
dd_genome
Junior Member
 
Location: India

Join Date: Oct 2012
Posts: 4
Default

Hi cdry7ue,
Thanks for your wonderful explanation. But I din't get what is THR. Can you or anybody else explain what is meant by THR or it's full form?
Thanks.
dd_genome is offline   Reply With Quote
Old 01-09-2014, 05:56 AM   #14
Zigster
(Jeremy Leipzig)
 
Location: Philadelphia, PA

Join Date: May 2009
Posts: 116
Default

I received a request for permission to use my awesome diagram from someone but they have Private Messages blocked so I could not respond directly.

In any case, yes feel free to use my awesome diagram as you please.
__________________
--
Jeremy Leipzig
Bioinformatics Programmer
--
My blog
Twitter
Zigster is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:10 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO