SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Quality-, adapter- and RRBS-trimming with Trim Galore! fkrueger Bioinformatics 138 11-12-2020 04:58 PM
Adapter trimming figo1019 RNA Sequencing 2 07-17-2018 05:00 AM
Adapter trimming and trimming by quality question alisrpp Bioinformatics 5 04-08-2013 05:55 PM
adapter trimming - help a_mt Bioinformatics 6 11-12-2012 08:36 PM
3' Adapter Trimming caddymob Bioinformatics 0 05-27-2009 01:53 PM

Reply
 
Thread Tools
Old 07-29-2015, 05:33 AM   #121
bruce01
Senior Member
 
Location: .

Join Date: Mar 2011
Posts: 157
Default

Hi Brian,

I have started using BBDuk as it seemed more well documented and supported than alternatives, and alternatives did not function as I expected. Thanks for writing these tools.

Earlier in the thread you discuss kmer lengths:
Quote:
Originally Posted by Brian Bushnell View Post
If you want to determine the ideal kmer length, it may be useful to characterize the error rates of your data, or quantify your true and false positive assignment rates based on filtering when using synthetic reads with their origin tagged. "randomreads.sh" generates and tags reads from reference, with adjustable error rates; I use it a lot to validate approaches and determine optimal settings.
I am wondering how I might use this in my context. I have methylation data that suffers from read-through into adapter. Basically the insert sizes are small due to fragmented nature of the DNA, so there is a lot of adapter contamination. I have tested a few values for kmer (15,20,25). Instead of this arbitrary choosing, could you give me an example of what 'error rates' I can use in my context? I have a downstream plot to essentially determine by eye if trimming of adapter is successful, but it would be nice to explain how we arrive at the kmer value which resolves our issue.

Thanks again for the tools,

Bruce.
bruce01 is offline   Reply With Quote
Old 07-29-2015, 02:47 PM   #122
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Hi Bruce,

I don't know an easy way to find the true error rate of bisulfite treated reads. But if you assume the quality values are correct, you can do this, starting with your real reads:

addadapters.sh in=reads.fq right literal=TCGGATAAGGCGCTCGCGCCGCATCCGACAAATGTGTTCAGCGA rate=0.5 adderrors out=x.fq reads=1m

This will add adapters to the reads and rename them to indicate where they were placed. Then, errors will be added to the adapters based on the quality value at each position. The sequence I specified is just a random piece of E.coli; don't use real adapter sequence, which would interfere with the test. Next, trim with BBDuk:

bbduk.sh in=x.fq out=y.fq literal=TCGGATAAGGCGCTCGCGCCGCATCCGACAAATGTGTTCAGCGA k=23 mink=11 hdist=1

Then grade the result:

addadapters.sh grade in=y.fq


It will print something like this:

Code:
Total output:                           3076 reads                      244458 bases
Perfectly Correct (% of output):        2765 reads (89.889%)            213358 bases (87.278%)
Incorrect (% of output):                311 reads (10.111%)             31100 bases (12.722%)

Adapters Remaining (% of adapters):     311 reads (21.917%)             3518 bases (1.439%)
Non-Adapter Removed (% of valid):       0 reads (0.0000%)               0 bases (0.0000%)
Note that this will only test kmer-based trimming. In practice, for paired reads, I recommend adding the flags "tbo tpe" which will extinguish virtually all adapter sequence, but those are hard to test on real data since you don't know the insert size. They can be tested on synthetic data, though.

At any rate, this will allow you to determine the expected true- and false-positive rates for your data for a given value of k and hdist.
Brian Bushnell is offline   Reply With Quote
Old 07-30-2015, 08:56 AM   #123
bruce01
Senior Member
 
Location: .

Join Date: Mar 2011
Posts: 157
Default

Hi Brian,

thanks for the reply. I have gone about testing different kmer to see what they result in in terms of retained bases. My thinking was that lower kmer should result in retaining less sequence, and at the point that the amount of sequence retained plateaus is a good kmer value for the data. As it stands this was a reasonable assumption (see attached plot).

However I have neglected 'mink' in this regard. I had determined a mink=5 based on not calling methylation events from within 5bp of 5', 3' which is standard given there is extreme fluctuation of quality at ends of reads. But does mink actually mean that you can reduce kmer to 'mink' bp at ends of reads for trimming? I.e. if mink=10, then adapter in the final 10 bp of a read can be trimmed, but <10bp is untrimmed? I have tested one sample (12T1-10) for this and we see marginal increase (~1.7% bases more). Are these therefore more likely adapter?

Sorry if this is obvious, just want to check method before running all samples.

Thanks,

Bruce.
Attached Files
File Type: pdf trim_test_9-19.pdf (5.1 KB, 11 views)

Last edited by bruce01; 07-30-2015 at 08:57 AM. Reason: readability
bruce01 is offline   Reply With Quote
Old 07-30-2015, 10:10 AM   #124
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by bruce01 View Post
But does mink actually mean that you can reduce kmer to 'mink' bp at ends of reads for trimming? I.e. if mink=10, then adapter in the final 10 bp of a read can be trimmed, but <10bp is untrimmed?
That's correct. For "ktrim=r k=20 mink=10", a read will be trimmed if there is a 20-mer match anywhere in the read; but at the rightmost end of the read, it will also look for a 19-mer match, 18-mer match, etc. down to 10-mer but not below. So if the last 10bp are adapter they will be trimmed, but if only the last 9bp are adapter, they won't.

Quote:
I have tested one sample (12T1-10) for this and we see marginal increase (~1.7% bases more). Are these therefore more likely adapter?
On your graph, 12T1-10 retains more than 12T1 (for longer kmer lengths). This is a bit counter-intuitive - normally, setting mink below k should strictly result in fewer bases retained - but the reason is due to the "mm" (maskmiddle) flag. When mink is set, this flag gets disabled, which reduces sensitivity slightly for full-length kmers. As a result, even though sensitivity is being increased at the end of the read, full-length kmer hits in the middle of the read are being missed due to errors. You can recover this sensitivity by increasing hdist by 1 to allow an additional mismatch. For an apples-to-apples comparison, you could run 12T1 and 12T1-10 both with the "mm=f" flag.

I'm not sure what's going on at and below k=10 when mink=10; mink is supposed to always be less than k, so those configurations are untested. And double-checking my code, mink does in fact get disabled when mink>=k, so the results of 12T1-10 and 12T1 should be identical at K=10 or less.

Judging by the effects of disabling "mm", it looks like this data has a sufficient error rate that it would benefit from increasing hdist; perhaps "k=21 hdist=2".

Last edited by Brian Bushnell; 07-30-2015 at 10:13 AM.
Brian Bushnell is offline   Reply With Quote
Old 07-30-2015, 11:05 AM   #125
bruce01
Senior Member
 
Location: .

Join Date: Mar 2011
Posts: 157
Default

Hi Brian,

to clarify, I did have mink=5 for the rest of the runs based on my previous assertion that SNP, methylation would not be called in the first/final 5bp of a read, so the 'mm' would have been disabled for both. This should result in fewer bases between mink=5 vs. mink=10, based on missing those adapters that are <10, versus <5. Right?

I will test with an extra mismatch based on expecting increased error rates from the previous error estimation you specified. Comparing hdist=1 to =2 should be a good determinant of final parameter.

Thanks for your help,

Bruce.
bruce01 is offline   Reply With Quote
Old 07-30-2015, 11:12 AM   #126
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by bruce01 View Post
Hi Brian,

to clarify, I did have mink=5 for the rest of the runs based on my previous assertion that SNP, methylation would not be called in the first/final 5bp of a read, so the 'mm' would have been disabled for both. This should result in fewer bases between mink=5 vs. mink=10, based on missing those adapters that are <10, versus <5. Right?
Ah! OK, I understand now. Yes, the additional bases trimmed with mink=5 and not with mink=10 are adapter... or false-positive. 5 is pretty short and will be expected to yield false-positives roughly 2*1/(4^5) or .2% of the time, or 15x as much - 3% of the time - with hdist=1. But only 5 extra bases would get trimmed so that's not very important.

Quote:
Thanks for your help,

Bruce.
You're welcome! Remember, if you are using paired ends, I highly recommend the "tbo" and "tpe" flags.
Brian Bushnell is offline   Reply With Quote
Old 08-16-2015, 09:36 AM   #127
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 467
Default

Hi Brian,

do BBMap/BBDuk by chance have a low complexity sequence filtering or masking option?

Thanks!
luc is offline   Reply With Quote
Old 08-17-2015, 09:40 AM   #128
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Hi Luc,

BBDuk has an entropy filtering option:
Code:
Entropy/Complexity parameters:
entropy=-1          Set between 0 and 1 to filter reads with entropy below
                    that value.  Higher is more stringent.
entropywindow=50    Calculate entropy using a sliding window of this length.
entropyk=5          Calculate entropy using kmers of this length.
For example:
bbduk.sh in=reads.fq out=filtered.fq entropy=0.01

That will probably only filter homopolymers... while this:

bbduk.sh in=reads.fq out=filtered.fq entropy=0.95

...will filter anything that is not very high complexity.

BBMask, on the other hand, can do entropy-masking as well as masking of simple repeats. The parameters are similar:

Code:
Processing parameters:
maskrepeats=f       (mr) Mask areas covered by exact repeat kmers.
kr=5                Kmer size to use for repeat detection (1-15).  Use minkr and maxkr to sweep a range of kmers.
minlen=40           Minimum length of repeat area to mask.
mincount=4          Minimum number of repeats to mask.
masklowentropy=t    (mle) Mask areas with low complexity by calculating entropy over a window for a fixed kmer size.
ke=5                Kmer size to use for entropy calculation (1-15).  Use minke and maxke to sweep a range.  Large ke uses more memory.
window=80           (w) Window size for entropy calculation.
entropy=0.70        (e) Mask windows with entropy under this value (0-1).  0.0001 will mask only homopolymers and 1 will mask everything.
lowercase=f         (lc) Convert masked bases to lower case.  Default is to convert them to N.
split=f             Split into unmasked pieces and discard masked pieces.
Entropy is calculated in a standard way, using the counts of unique short kmers that occur in a window, such that the more unique kmers occur within the window - and the more even the distribution of counts - the closer the value approaches 1. But, you will probably have to play with the exact cutoff a bit to get the result you want. Kmer length and window size are not overly important, but window size should be shorter than read length and similar to the length of the shortest features you want to mask, and kmer length should be much shorter then window size. The defaults for BBMask are the settings I used to mask the human genome to prevent non-vertebrate reads from mapping to it.

Last edited by Brian Bushnell; 08-17-2015 at 09:47 AM.
Brian Bushnell is offline   Reply With Quote
Old 08-18-2015, 06:24 PM   #129
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 467
Default

Hi Brian,

thanks a lot for the detailed information and for providing this great tool! Looks like it will do (more than) everything I will need.
luc is offline   Reply With Quote
Old 12-14-2015, 06:57 AM   #130
kcamnairb
Junior Member
 
Location: New Orleans, LA

Join Date: Sep 2013
Posts: 3
Default

Hi Brian,

I used bbduk.sh to remove adapters and quality trim my fastq files and for some reason it removes the x and y coordinates of the cluster within the read name and replaces it with 0. This makes most of my read names no longer unique. Is there a way to avoid this?

I ran it with:
Code:
bbduk.sh qtrim=rl trimq=20 in=../Gbar_Lib1_Index025_L001_R#_001.fastq out1=../Gbar_Lib1_Index025_L001_R#_001_bbduk_q20.fastq minlen=25 ktrim=r k=25 mink=11 ref=~/bbmap/resources/index25.fa

head  ../../Gbar_Lib1_Index025_L001_R1_001.fastq
@SN677:200:H5FCKBCXX:1:1106:2130:10004 1:N:0:ACTGAT
GTTAAAAAGATTAAAGCTACAAGAGCGAATCTTACTCCCCAGGCGATGCAGTGGAGCAGATTAAAGCCACAACAGTGAATCTTGCTATCCCGACATTGCAGTTAAAAAGATTAAAGCTACAGCAGCGAATCTTACCTCCCAGGCGGTGCATTGGAACAGATTAAAGACACATCGGTGAATCTTATTTCCCCGGTATTGCAGTTAAAAAGATTAAAGCCACAGCGGCGAATCCTACTTCCCTAGTGGTGC
+
GGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIGGIIIIIIIIIGIIIIAGGIGIIIIIIGGGIIGGIGIGGGGGGGGIGGGG.GGGAGGIA7GAAGIIIGGGGAGGGIGGIIIIIIGGGAG.7..<<<.GGGGGGGAAAA7.7AAA.
@SN677:200:H5FCKBCXX:1:1106:2046:10016 1:N:0:ACTGAT
AAATCGAGAGACGAAAAAAAGAAATGAAAGAAAAGGAACAAAAAATCAAAAAAGTGAGAGAAAAAGAAAATGAAGAGAAAGAATTCGAAAAAGAAATTGAGATTGAACAAGAATGCAAGCAAGAAGGAGAAATCGAAAGAAAAGAAGAAAAGATTGAAAAAGAAAAAGAAATCCAAAAAGAGAGAGAGGTCGAAAATAAAAGTGAAAATGAGAGAGGAAATGAAAAGAGAACAAAAGAAGTTGAGATG
+
<<GA<.<<AAGGGAGAGG<GGAGGIIIAGGGGGGGGGGGGGGGGAAGGGG.<GG<<<GGG.<A.GGGGGAAGGGGIGIGGGGGG.AGGGGGAGAGAA.<AA.<AGGGGGGIGGGGGGGGGGIIGAAGGIGGGAGGGG<AGAAGIGIIIGGGGG..<AGA.AGGGAGGGGGA<.77.7A.AGGAA.7GGA..7<7.<.7A.A7.77GA7.7G.G.A.....7A7.....G.7....7.7AA7..7....
@SN677:200:H5FCKBCXX:1:1106:2068:10031 1:N:0:ACTGAT
AGAGAGCTGTCTATACCACTGGCAAAGGGGCTTCTGCCGTGGGGCTGACAGCAGCAGTGCACAAGGATCCGGTAACCAGGGAGTGGACCCTTGAAGGAGGAGCCCTTGTTTTAGCTGACAAAGGGATTTGCCTTATTGACGAATTTGATAAAATGAATGATCAAGACAGGTAAGGGAAAGCCTGGCATAAATTTAGCCACTATAATTAGATAACTTCAGCAAACACCTTTCGTCGTTTGCTTTACTTTTT

head  ../Gbar_Lib1_Index025_L001_R1_001_bbduk_q20.fastq
@SN677:200:H5FCKBCXX:1:1106:0:0 1:N:0:ACTGAT
CTTGTTAACCAATGCTATTATAGGTTTGATGTCTCATACAGGAGTATAGGATAAAGCTCTCACGTTTTGTTTCAAAGAAGGGCCTGTAACTCATTATGAACTGCTGCTTGCCAACACTTGTGTTGCATTGCTTCATAGATGTCAGCAGGTGAATCATTTGACTTAGCAGGAAGGTGAATCGTTTTGTTTCATATATTGCAGTAGTTCATAATGAGTTACAG
+
GGGAGIIIIIIIIIIIIIGIGIIGIIIIGIGIIGGGIIIIGGGGIGGGGIIIIIIIIIIIGIIIIIIIGGIIIIIGIIIGGGIIIGIIGIIIIGIIIIGGGGGGIIIIIIIGGAGIGGIIIIIIIGGGIIIIIIIAGGGGGIGGGGAGGGGIGGGGIIIIGIGGGGIGGGAGGIGIG<AGAGGAGG7AGAGGAGGGGAGAGGGGGGGGGIGGGGGGGAGAA
@SN677:200:H5FCKBCXX:1:1106:0:0 1:N:0:ACTGAT
AAAATCTGATTTAAAAAGAATATTTATTTTAAAAATTATTGCATGACTATTAATATGATAGGAAAGTCGTATAAAAATATCGATAGAAAAATTTTACCGATTTAGTGGTATTGAAAAAATTGGGTAAAAACAAGGTATCGAGACCTTGATCTCATAAAACTGAGTAGAAAATATTTTTATAAATATTTATGGAGTGTCATTTAGTTAGTATTAAAGTTTTGTTAGAAAATTTTAAC
+
GGGGGIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGGIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIGGIGGIGIIIIIIIGIIIIIIIGGIGGIGIIIGGIGIGGGGGIIIGGAGGGIIIGGIIIIGGGGGGIGIIGIGGGGGG.GGGGGGIIGGGGGGGAAAAG.<GGAGGGGIGGGGGIIGGGG.AGG7AGGGGAAG.7GGGG
@SN677:200:H5FCKBCXX:1:1106:0:0 1:N:0:ACTGAT
ATCTCTTGGCTTTTATCATGTTTAAATCATGATGGCGAGTTGGCTAAGACATGGCAACCATTTACTACTTCCAGTGCGTTAGCAGTCAACTTATGCTTTTAGGATTTTCTATTTCTCCAGTTGGGTCATGGTTTATATGCTTAGTGCTTCCACCTGCTTATAACATTTTTCATTATGTTGGTTGCTCTACTTTGGATAGCATATTGGCTGTTCACTAAGCTTTAATGATTTTGAAAGACTTGCCATGCA

Last edited by kcamnairb; 12-14-2015 at 07:05 AM.
kcamnairb is offline   Reply With Quote
Old 12-14-2015, 11:02 AM   #131
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

I can't replicate this...

Code:
bushnell@gpint109:/global/projectb/scratch/bushnell/temp$ bbduk.sh qtrim=rl trimq=20 in=x.fq out1=y.fq minlen=25 ktrim=r k=25 mink=11 ref=z.fa
java -Djava.library.path=/usr/common/jgi/utilities/bbtools/prod-v35.78/lib -ea -Xmx46673m -Xms46673m -cp /usr/common/jgi/utilities/bbtools/prod-v35.78/lib/BBTools.jar jgi.BBDukF qtrim=rl trimq=20 in=x.fq out1=y.fq minlen=25 ktrim=r k=25 mink=11 ref=z.fa
Executing jgi.BBDukF [qtrim=rl, trimq=20, in=x.fq, out1=y.fq, minlen=25, ktrim=r, k=25, mink=11, ref=z.fa]

BBDuk version 35.78
maskMiddle was disabled because useShortKmers=true
Initial:
Memory: max=46901m, free=46167m, used=734m

Added 15 kmers; time:   0.031 seconds.
Memory: max=46901m, free=43965m, used=2936m

Input is being processed as unpaired
Started output streams: 0.014 seconds.
Processing time:                0.013 seconds.

Input:                          2 reads                 497 bases.
QTrimmed:                       2 reads (100.00%)       102 bases (20.52%)
KTrimmed:                       0 reads (0.00%)         0 bases (0.00%)
Result:                         2 reads (100.00%)       395 bases (79.48%)

Time:                           0.063 seconds.
Reads Processed:           2    0.03k reads/sec
Bases Processed:         497    0.01m bases/sec
bushnell@gpint109:/global/projectb/scratch/bushnell/temp$ cat x.fq
@SN677:200:H5FCKBCXX:1:1106:2130:10004 1:N:0:ACTGAT
GTTAAAAAGATTAAAGCTACAAGAGCGAATCTTACTCCCCAGGCGATGCAGTGGAGCAGATTAAAGCCACAACAGTGAATCTTGCTATCCCGACATTGCAGTTAAAAAGATTAAAGCTACAGCAGCGAATCTTACCTCCCAGGCGGTGCATTGGAACAGATTAAAGACACATCGGTGAATCTTATTTCCCCGGTATTGCAGTTAAAAAGATTAAAGCCACAGCGGCGAATCCTACTTCCCTAGTGGTGC
+
GGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIGGIIIIIIIIIGIIIIAGGIGIIIIIIGGGIIGGIGIGGGGGGGGIGGGG.GGGAGGIA7GAAGIIIGGGGAGGGIGGIIIIIIGGGAG.7..<<<.GGGGGGGAAAA7.7AAA.
@SN677:200:H5FCKBCXX:1:1106:2046:10016 1:N:0:ACTGAT
AAATCGAGAGACGAAAAAAAGAAATGAAAGAAAAGGAACAAAAAATCAAAAAAGTGAGAGAAAAAGAAAATGAAGAGAAAGAATTCGAAAAAGAAATTGAGATTGAACAAGAATGCAAGCAAGAAGGAGAAATCGAAAGAAAAGAAGAAAAGATTGAAAAAGAAAAAGAAATCCAAAAAGAGAGAGAGGTCGAAAATAAAAGTGAAAATGAGAGAGGAAATGAAAAGAGAACAAAAGAAGTTGAGATG
+
<<GA<.<<AAGGGAGAGG<GGAGGIIIAGGGGGGGGGGGGGGGGAAGGGG.<GG<<<GGG.<A.GGGGGAAGGGGIGIGGGGGG.AGGGGGAGAGAA.<AA.<AGGGGGGIGGGGGGGGGGIIGAAGGIGGGAGGGG<AGAAGIGIIIGGGGG..<AGA.AGGGAGGGGGA<.77.7A.AGGAA.7GGA..7<7.<.7A.A7.77GA7.7G.G.A.....7A7.....G.7....7.7AA7..7....
bushnell@gpint109:/global/projectb/scratch/bushnell/temp$ cat y.fq
@SN677:200:H5FCKBCXX:1:1106:2130:10004 1:N:0:ACTGAT
GTTAAAAAGATTAAAGCTACAAGAGCGAATCTTACTCCCCAGGCGATGCAGTGGAGCAGATTAAAGCCACAACAGTGAATCTTGCTATCCCGACATTGCAGTTAAAAAGATTAAAGCTACAGCAGCGAATCTTACCTCCCAGGCGGTGCATTGGAACAGATTAAAGACACATCGGTGAATCTTATTTCCCCGGTATTGCAGTTAAAAAGATTAAAGCCACAGC
+
GGGGGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGIIIIIGGIIIIIIIIIGIIIIAGGIGIIIIIIGGGIIGGIGIGGGGGGGGIGGGG.GGGAGGIA7GAAGIIIGGGGAGGGIGGIIIIIIGGGAG
@SN677:200:H5FCKBCXX:1:1106:2046:10016 1:N:0:ACTGAT
AAATCGAGAGACGAAAAAAAGAAATGAAAGAAAAGGAACAAAAAATCAAAAAAGTGAGAGAAAAAGAAAATGAAGAGAAAGAATTCGAAAAAGAAATTGAGATTGAACAAGAATGCAAGCAAGAAGGAGAAATCGAAAGAAAAGAAGAAAAGATTGAAAAAGAAAAAGAAAT
+
<<GA<.<<AAGGGAGAGG<GGAGGIIIAGGGGGGGGGGGGGGGGAAGGGG.<GG<<<GGG.<A.GGGGGAAGGGGIGIGGGGGG.AGGGGGAGAGAA.<AA.<AGGGGGGIGGGGGGGGGGIIGAAGGIGGGAGGGG<AGAAGIGIIIGGGGG..<AGA.AGGGAGGGGGA<
bushnell@gpint109:/global/projectb/scratch/bushnell/temp$ cat z.fa
>1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
BBDuk does not change read names. So, I can only imagine that many of the names already had zeroes going in. Note that BBDuk does reorder reads (unless you use the "ordered" flag), and some of the reads may have been filtered out, so the top 3 reads before and after are not necessarily the same.
Brian Bushnell is offline   Reply With Quote
Old 12-14-2015, 11:33 AM   #132
kcamnairb
Junior Member
 
Location: New Orleans, LA

Join Date: Sep 2013
Posts: 3
Default

You are right, sorry about that.
kcamnairb is offline   Reply With Quote
Old 01-05-2016, 12:44 PM   #133
salamay
Member
 
Location: canada

Join Date: May 2014
Posts: 20
Default

Hi Brian,

Would you mind explaining the the sliding window option for quality trimming (qtrim) in bbduk? How big is the window and is this similar to trimmomatic's sliding window approach? I notice I have less reads trimmed with the sliding window compared to rl trimming, although the % of bases trimmed is the same (so longer stretches of sequence are trimmed with the sliding window I suppose). Is one approach preferred over the other?

Thanks!
salamay is offline   Reply With Quote
Old 01-05-2016, 09:51 PM   #134
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by salamay View Post
Hi Brian,

Would you mind explaining the the sliding window option for quality trimming (qtrim) in bbduk? How big is the window and is this similar to trimmomatic's sliding window approach? I notice I have less reads trimmed with the sliding window compared to rl trimming, although the % of bases trimmed is the same (so longer stretches of sequence are trimmed with the sliding window I suppose). Is one approach preferred over the other?

Thanks!
I added a sliding window just so that people could use it as a drop-in replacement for Trimommatic, but I do not recommend using it. The normal "rl" trim mode guarantees optimal results (assuming the quality scores are accurate, of course) while window-based trimming is a heuristic that cannot ensure optimal results. It works by trimming until the average quality score in a user-specified window size exceeds the threshold.

You can set the window size by adding a comma, like this:

"qtrim=window,5 trimq=10"

That will use a 5bp window and trim until the average quality in the window is at least 10. The default window size is 4.
Brian Bushnell is offline   Reply With Quote
Old 01-11-2016, 02:39 AM   #135
willd
Junior Member
 
Location: France

Join Date: Jan 2016
Posts: 4
Default

Dear Brian,

I have many reads from a very specific scRNA contaminant of length 299 bp. Would BBDuk's contaminant filtering be a suitable tool for removing these, and what parameters would you recommend? (my reads are 100 bp PE)

Would it be possible to do it in one run along with adapter and quality trimming, e.g. by adding this 299bp sequence to the current adapters.fa file that you have provided in the resources folder?

Many thanks for any help

Last edited by willd; 01-11-2016 at 04:22 AM. Reason: Added 'along with adapter and quality trimming'
willd is offline   Reply With Quote
Old 01-11-2016, 04:01 AM   #136
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,132
Default

Quote:
Originally Posted by willd View Post
Dear Brian,

I have many reads from a very specific scRNA contaminant of length 299 bp. Would BBDuk's contaminant filtering be a suitable tool for removing these, and what parameters would you recommend? (my reads are 100 bp PE)

Would it be possible to do it in one run, e.g. by adding this 299bp sequence to the current adapters.fa file that you have provided in the resources folder?

Many thanks for any help
While you can probably use BBDuk, you may want to look at BBSplit which should be able to bin reads that map to your sequence away from the rest of the data.
GenoMax is offline   Reply With Quote
Old 01-11-2016, 07:23 PM   #137
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by willd View Post
Dear Brian,

I have many reads from a very specific scRNA contaminant of length 299 bp. Would BBDuk's contaminant filtering be a suitable tool for removing these, and what parameters would you recommend? (my reads are 100 bp PE)

Would it be possible to do it in one run along with adapter and quality trimming, e.g. by adding this 299bp sequence to the current adapters.fa file that you have provided in the resources folder?

Many thanks for any help
You can do this with either mapping (BBMap/BBSplit) or kmer-filtering (BBDuk/Seal); the result should be the same. BBDuk will be faster though.

Assuming the reads are named "read1.fq" and "read2.fq", and your 299bp sequence is in sequence.fa:

bbduk.sh in=read#.fq out=filtered#.fq ref=sequence.fa k=31

You can adjust the sensitivity/specificity with other flags (like adding hdist=1, to catch the very-low-quality reads) but the command above should be adequate. It's technically possible to do this at the same time as adapter-trimming by adding the sequence to the adapters file; reads of that sequence should end up fully trimmed and thus will be discarded because they are too short (shorter than minlen which is 10 by default, I think). But you would achieve higher sensitivity and specificity by filtering in a different pass, so I'd recommend doing it in 2 steps.
Brian Bushnell is offline   Reply With Quote
Old 01-13-2016, 05:11 AM   #138
willd
Junior Member
 
Location: France

Join Date: Jan 2016
Posts: 4
Default

Dear Brian - thanks very much!

In case it's of interest: I tried both 1 pass (contaminant filtering at same time as adapter-trimming) and 2 pass (adapter trimming then contaminant filtering) on two of my samples, and the output was close to identical: the number of remaining reads was ~0.4% lower for 1 pass v 2 pass (for both samples); differences in plots of per-base sequence quality, GC curves, etc. were almost unnoticeable.

Last edited by willd; 01-13-2016 at 05:13 AM. Reason: adding Pass 1 lower than pass 2
willd is offline   Reply With Quote
Old 01-14-2016, 03:08 AM   #139
willd
Junior Member
 
Location: France

Join Date: Jan 2016
Posts: 4
Default

Dear Brian, for one fastq file I got the error output below. FastQC did not show any particular problems with this file. The error refers to a specific read - could there be a problem for BBDuk to parse this read?

Code:
    Executing jgi.BBDukF [in1=../../C467_RE_B00GGOO_8_1_C68T3ACXX.IND1.fastq.gz, in2=../../C467_RE_B00GGOO_8_2_C68T3ACXX.IND1.fastq.gz, out1=B00GGOO_1_TrimPass1.fastq, out2=B00GGOO_2_TrimPass1.fastq, minlen=25, qtrim=r, trimq=10, ktrim=r, k=25, mink=11, ref=my_adapters.fa, hdist=1, tpe, tbo]
    
    BBDuk version 35.82
    maskMiddle was disabled because useShortKmers=true
    Initial:
    Memory: max=777m, free=754m, used=23m
    
    Added 64944 kmers; time: 	0.212 seconds.
    Memory: max=777m, free=717m, used=60m
    
    Input is being processed as paired
    Started output streams:	0.176 seconds.
    java.lang.AssertionError: 
    Error in ../../C467_RE_B00GGOO_8_1_C68T3ACXX.IND1.fastq.gz, line 1210999, with these 4 lines:
    CTCGGTCACAACCTGAGGGCTCAGCCCCAGGGCACGGCCATAGCCGGCGCGAGGCCCAAGGCTGCGGTGCTGCGGCTACTGGCGGCGACGAGCCGATCGGA
    +
    CCCFFFFFHGHFHIJIJJJJJJIICHJIIGIICHIEGGBHHIHIJJJHFDBD:@BDDDDDDD>ABDD>@@CACDD9>@CDDDBBD################
    @HISEQ10:C68T3ACXX:8:1101:19859:76022/1
    
    	at stream.FASTQ.quadToRead(FASTQ.java:722)
    	at stream.FASTQ.toReadList(FASTQ.java:653)
    	at stream.FastqReadInputStream.fillBuffer(FastqReadInputStream.java:111)
    	at stream.FastqReadInputStream.nextList(FastqReadInputStream.java:96)
    	at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656)
    	at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)
    Exception in thread "Thread-19" java.lang.NullPointerException
    	at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399)
    Exception in thread "Thread-20" java.lang.NullPointerException
    	at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399)
    Exception in thread "Thread-18" java.lang.NullPointerException
    	at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399)
    Exception in thread "Thread-23" java.lang.NullPointerException
    	at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399)
    Exception in thread "Thread-21" Exception in thread "Thread-24" java.lang.NullPointerException
    	at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399)
    java.lang.NullPointerException
    	at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399)
    Exception in thread "Thread-22" java.lang.NullPointerException
    	at jgi.BBDukF$ProcessThread.run(BBDukF.java:2399)
    Processing time:   		3.675 seconds.
    
    Input:                  	603200 reads 		60923200 bases.
    QTrimmed:               	67817 reads (11.24%) 	3329362 bases (5.46%)
    KTrimmed:               	52402 reads (8.69%) 	1521198 bases (2.50%)
    Trimmed by overlap:     	36830 reads (6.11%) 	190222 bases (0.31%)
    Result:                 	575826 reads (95.46%) 	55882418 bases (91.73%)
    
    Time:   			4.073 seconds.
    Reads Processed:        603k 	148.08k reads/sec
    Bases Processed:      60923k 	14.96m bases/sec
    Exception in thread "main" java.lang.RuntimeException: jgi.BBDukF terminated in an error state; the output may be corrupt.
    	at jgi.BBDukF.process(BBDukF.java:822)
    	at jgi.BBDukF.main(BBDukF.java:70)
willd is offline   Reply With Quote
Old 01-14-2016, 07:14 AM   #140
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by willd View Post
Dear Brian, for one fastq file I got the error output below. FastQC did not show any particular problems with this file. The error refers to a specific read - could there be a problem for BBDuk to parse this read?
It looks like the fastq file is corrupt and is missing a line, or something similar.

Code:
CTCGGTCACAACCTGAGGGCTCAGCCCCAGGGCACGGCCATAGCCGGCGCGAGGCCCAAGGCTGCGGTGCTGCGGCTACTGGCGGCGACGAGCCGATCGGA
+
CCCFFFFFHGHFHIJIJJJJJJIICHJIIGIICHIEGGBHHIHIJJJHFDBD:@BDDDDDDD>ABDD>@@CACDD9>@CDDDBBD################
@HISEQ10:C68T3ACXX:8:1101:19859:76022/1#ID Median_fold
should be more like

Code:
@HISEQ10:C68T3ACXX:8:1101:19859:76022/1#ID Median_fold
CTCGGTCACAACCTGAGGGCTCAGCCCCAGGGCACGGCCATAGCCGGCGCGAGGCCCAAGGCTGCGGTGCTGCGGCTACTGGCGGCGACGAGCCGATCGGA
+
CCCFFFFFHGHFHIJIJJJJJJIICHJIIGIICHIEGGBHHIHIJJJHFDBD:@BDDDDDDD>ABDD>@@CACDD9>@CDDDBBD################
Try looking at the file in that region, by piping zcat to grep around the term "@HISEQ10:C68T3ACXX:8:1101:19859:76022/1" with plus and minus 10 lines, then posting the output.

Actually, "#ID Median_fold" should not be in there at all. It looks like you somehow got unrelated text mixed into your fastq file, which is what is corrupting it.
Brian Bushnell is offline   Reply With Quote
Reply

Tags
adapter, bbduk, bbtools, cutadapt, trimmomatic

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:02 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO