SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Quality-, adapter- and RRBS-trimming with Trim Galore! fkrueger Bioinformatics 136 01-18-2019 01:24 AM
Adapter trimming figo1019 RNA Sequencing 2 07-17-2018 04:00 AM
Adapter trimming and trimming by quality question alisrpp Bioinformatics 5 04-08-2013 04:55 PM
adapter trimming - help a_mt Bioinformatics 6 11-12-2012 07:36 PM
3' Adapter Trimming caddymob Bioinformatics 0 05-27-2009 12:53 PM

Reply
 
Thread Tools
Old 11-03-2016, 09:16 AM   #161
JVGen
Member
 
Location: East Coast

Join Date: Jul 2016
Posts: 39
Default

Quote:
Originally Posted by Brian Bushnell View Post
Hi Jake,

It's still important to process both files together even if you have no minimum length cutoff, because the output order of BBDuk is not guaranteed to be the same as the input order (unless you add the "ordered" flag). So, I guess, if Geneious is running BBDuk on paired files individually, please add the "ordered" flag, and report that issue to the Geneious developers - it should process them together.
Hey Brian. I heard back from Geneious. Turns out that I had to pair the reads before running BBDuk. After pairing I'm left with a single file in Geneious, though the reads have not been merged. Running BBDuk on the 'combined' file results in the removal of both pairs, yay!
JVGen is offline   Reply With Quote
Old 11-07-2016, 10:32 AM   #162
JVGen
Member
 
Location: East Coast

Join Date: Jul 2016
Posts: 39
Default

Hi Brian,

I just recently started using BBDuk to adapter and quality trim. I then use the reads to assemble in Spades, but ran into an issue. The error correction software that Spades uses cannot recognize the reads names.

The input read format is:
M00281:69:000000000-D22HU:1:1101:15164:1363 1:N:0:53

The output read format is:
M00281:69:000000000-D22HU:1:1101:15164:1363_1:N:0:53

The underscore isn't recognized by BWA, which is activated when the "careful" mode is used in Spades. Any way to switch that _ back to a space?

Thanks
Jake
JVGen is offline   Reply With Quote
Old 11-07-2016, 10:53 AM   #163
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Hi Jake,

BBDuk does not add an underscore to read names. Reformat can, if you run it with the flag "addunderscore", but doesn't by default. Can you list all of the steps you are doing prior to running Spades and verify that the underscores are not present for the BBDuk input?
Brian Bushnell is offline   Reply With Quote
Old 11-07-2016, 11:02 AM   #164
JVGen
Member
 
Location: East Coast

Join Date: Jul 2016
Posts: 39
Default

Hi Brian,

You are correct. While the file names were correct in the program interface, upon export the "_" were introduced. I apologize for not opening up the fastq file in a texteditor to be sure.

Jake
JVGen is offline   Reply With Quote
Old 11-07-2016, 12:30 PM   #165
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

No problem!
Brian Bushnell is offline   Reply With Quote
Old 12-02-2016, 08:57 PM   #166
chiayi
Member
 
Location: New York

Join Date: Dec 2016
Posts: 22
Default

Hi Brian,

I just started to try BBDuk after being too used to the alternative trimmer and would like some stimulant. I found the functions in BBTools comprehensive and sophisticated. Thank you!

I have a couple hundred of PE libries to trim. I started by using the default for one of them,

Quote:
bbduk.sh -Xmx1g in1=in1.fastq in2=in2.fastq out1=out1.fastq out2.fastq ref=~/package/bbmap/resources/adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
The standard err output looks like this,

Quote:
Input: 50775382 reads 7667082682 bases.
KTrimmed: 5738256 reads (11.30%) 185600890 bases (2.42%)
Trimmed by overlap: 2313702 reads (4.56%) 17335164 bases (0.23%)
Total Removed: 16926 reads (0.03%) 202936054 bases (2.65%)
Result: 50758456 reads (99.97%) 7464146628 bases (97.35%)
Is the total removed rate (0.03%) expected to be smaller than the KTrimmed and overlay, which are 11.30% and 4.56% respectively?

Also, I found the size of output file somehow larger than the input file (2454285909 vs 2385789236), is is expected?

Thank you in advance for your answers.
chiayi is offline   Reply With Quote
Old 12-02-2016, 10:46 PM   #167
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by chiayi View Post
Hi Brian,

I just started to try BBDuk after being too used to the alternative trimmer and would like some stimulant. I found the functions in BBTools comprehensive and sophisticated. Thank you!

I have a couple hundred of PE libries to trim. I started by using the default for one of them, (...)
The standard err output looks like this,

Quote:
Input: 50775382 reads 7667082682 bases.
KTrimmed: 5738256 reads (11.30%) 185600890 bases (2.42%)
Trimmed by overlap: 2313702 reads (4.56%) 17335164 bases (0.23%)
Total Removed: 16926 reads (0.03%) 202936054 bases (2.65%)
Result: 50758456 reads (99.97%) 7464146628 bases (97.35%)
Is the total removed rate (0.03%) expected to be smaller than the KTrimmed and overlay, which are 11.30% and 4.56% respectively?
Hi Chiayi,

The reads that were trimmed will only be removed if their length drops below minlength, which is by default 10, so virtually nothing gets totally discarded except for adapter-dimers; you just lose 2.65% of your bases to trimming. That's the advantage of adapter-trimming rather than adapter-filtering. However, feel free to set minlength to a higher value; length-20 reads are not usually very useful. Typically, if I am going to do assembly, I set minlength to the shortest kmer length I plan to use.

Quote:
Also, I found the size of output file somehow larger than the input file (2454285909 vs 2385789236), is is expected?
Yes, that's because BBDuk defaults to lower compression (level 2) than gzip normally does (level 6), to increase speed. You can increase compression to gzip's default level by adding the flag "zl=6". That will make it run a little slower, unless you are on a multi-core computer and have pigz installed - which I highly recommend, especially if you will be processing hundreds of libraries.

Quote:
Thank you in advance for your answers.
You're welcome! Incidentally, if you are concerned about file size and want the files to be as small as possible, give Clumpify a try. It only supports interleaved files for paired reads (rather than dual files) but for libraries with sufficient coverage, meaning >10x or so, it can reduce filesize by around 30% losslessly by reordering the reads. I've found that this also typically accelerates subsequent analysis pipelines by a similar factor (up to 30%). Usage:

clumpify.sh in=reads.fastq.gz out=clumped.fastq.gz
Brian Bushnell is offline   Reply With Quote
Old 12-03-2016, 02:06 PM   #168
chiayi
Member
 
Location: New York

Join Date: Dec 2016
Posts: 22
Default

Quote:
The reads that were trimmed will only be removed if their length drops below minlength, which is by default 10, so virtually nothing gets totally discarded except for adapter-dimers; you just lose 2.65% of your bases to trimming. That's the advantage of adapter-trimming rather than adapter-filtering. However, feel free to set minlength to a higher value; length-20 reads are not usually very useful. Typically, if I am going to do assembly, I set minlength to the shortest kmer length I plan to use.
Thank you, Brian. I guess I was confused with adapter trimming and adapter filtering, and it's clear now with your explanation. I have a follow-up question. I thought the percentage of reads with adapters with be higher than 10%, is not correct? Aren't all reads fused with adapters during library preparation? Thank you.

Last edited by chiayi; 12-03-2016 at 03:00 PM.
chiayi is offline   Reply With Quote
Old 12-03-2016, 02:22 PM   #169
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Hi Chiayi,

Reads are not usually supposed to have adapters. The DNA fragments that are being sequenced are fused with adapters. But sequencing starts at the adapter and proceeds in the opposite direction, into the genomic portion of the molecule. So you do not see adapter sequence from the end of the molecule you are sequencing; rather, if the genomic part of the read is shorter than the number of cycles you run sequencing, adapter sequence from the other end of the molecule will appear at the end of the read.

In other words, if you do 2x150bp sequencing, you try to design the library with an insert size of over 150bp; perhaps 250bp so that the reads overlap by 50bp. Only the short fragments, under 150bp, will have adapter sequence appear in the reads which needs trimming.
Brian Bushnell is offline   Reply With Quote
Old 12-03-2016, 03:01 PM   #170
chiayi
Member
 
Location: New York

Join Date: Dec 2016
Posts: 22
Default

Hi Brian,

Thank you again. That was very helpful. One last quesiton, is bz2 not compatible with BBDuk? Some of my files are bz2 and for those runs it looked like the files were not loaded and the std err stuck at memory usage line without going further.
chiayi is offline   Reply With Quote
Old 12-03-2016, 04:25 PM   #171
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Bzip2 should work fine if you have the bzip2 or pbzip2 executable in your path, and files are named with the .bz2 extension. Can you post your exact command line, the complete screen output, and the results when you type from the command line "bzip2"?
Brian Bushnell is offline   Reply With Quote
Old 12-03-2016, 05:37 PM   #172
chiayi
Member
 
Location: New York

Join Date: Dec 2016
Posts: 22
Default

Sure.

Quote:
$ bbduk.sh -Xmx1g in=in.fastq.bz2 out=clean.fastq ref=adapters.fa ktrim=r k=25 mink=11 hdist=1 lhist=length-hist
java -Djava.library.path=jni/ -ea -Xmx1g -Xms1g -cp current/ jgi.BBDukF -Xmx1g in=in.fastq.bz2 out=clean.fastq ref=adapters.fa ktrim=r k=25 mink=11 hdist=1 lhist=length-hist
Executing jgi.BBDukF [-Xmx1g, in=in.fastq.bz2, out=clean.fastq, ref=adapters.fa, ktrim=r, k=25, mink=11, hdist=1, lhist=length-hist]

BBDuk version 36.62
Set length histogram output to length-hist
maskMiddle was disabled because useShortKmers=true
Initial:
Memory: max=1029m, free=1004m, used=25m

Added 252967 kmers; time: 0.367 seconds.
Memory: max=1029m, free=956m, used=73m
This is where the run would stuck.
chiayi is offline   Reply With Quote
Old 12-03-2016, 11:06 PM   #173
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

I just tested it, and both compression and decompression of .bz2 files is working fine for me. It should work as long as either bzip2 or pbzip2 is in your path. What happens when you type "bzip2" and "pbzip2"? Also, bzip2 decompression is pretty slow. When you run "top -c" do you see a java process hanging around with 0 cpu utilization?
Brian Bushnell is offline   Reply With Quote
Old 12-04-2016, 10:53 AM   #174
chiayi
Member
 
Location: New York

Join Date: Dec 2016
Posts: 22
Default

Hi Brian,

bzip2 was in my path yet pbzip2 was not. As you predicted, when I checked the status using "top -c" a java was hanging around with 0 cpu utilization. I intalled pbzip2 and the issue was resolved. Thank you so much for your patience and all the guidance.
chiayi is offline   Reply With Quote
Old 01-04-2017, 11:47 PM   #175
dejsha
Junior Member
 
Location: Czech Republic

Join Date: Mar 2013
Posts: 2
Default

Hi Brian,
I have a question regarding bbduk singleton reads output (outs). My impression was that outs should contain singleton reads which passed filtering criteria, while their mates did not pass the filtering criteria.
But when I filter for a contaminant (35x polyG string) and at the same time I trim and filter for quality and minimum read length ( qtrim=rl trimq=15 maq=15 minlength=42), the outs file contains also reads shorter than 42bp with very poor quality (e.g. 35bp long reads consisting of Ns with 0 phred scores). Is this result supposed to happen?

the bbduk stand. output looks like this:
Input: 6882588 reads 1013379854 bases.
Contaminants: 320048 reads (4.65%) 42774629 bases (4.22%)
QTrimmed: 1971637 reads (28.65%) 71416735 bases (7.05%)
Low quality discards: 0 reads (0.00%) 0 bases (0.00%)
Total Removed: 707746 reads (10.28%) 114191364 bases (11.27%)
Result: 6174842 reads (89.72%) 899188490 bases (88.73%)

Thank you in advance for an advice,
dejsha
dejsha is offline   Reply With Quote
Old 01-12-2017, 04:17 AM   #176
SylvainL
Senior Member
 
Location: Geneva

Join Date: Feb 2012
Posts: 179
Default

Hi Brian,
I also have a question. I downloaded today the latest version of BBmap and I have a strange behaviour with left- and right-trimming

For example:
Code:
TCCGAGGCAGTAGGCTCAACAGCAATATACCTTCTCGAGAGGTCTCTGTCCACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGGAATCTC
being a read (in red is the adapter I want to look for).

I used this command line:
Code:
~/Applications/bbmap/bbduk.sh in=./Data/To_scan.fastq out=./Data/Rejected.fastq outm=./out/No_mm_No_Indel/${Samplename}.fastq literal=CAACAGCAATATACCTTCTCGAGAGGTCT k=29 mm=f hdist=0 edist=0 ktrim=l rcomp=f
and I don't get
Code:
CTGTCCACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGGAATCTC
as expected...


EDIT:
Funnily, I have to set maxlength=1 to make it work...

Last edited by SylvainL; 01-12-2017 at 04:52 AM.
SylvainL is offline   Reply With Quote
Old 01-12-2017, 04:54 AM   #177
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,053
Default

Here is what I get. Using your sequence as a fasta file. Looks right to me.

Code:
$ cat nn.fa

>test
TCCGAGGCAGTAGGCTCAACAGCAATATACCTTCTCGAGAGGTCTCTGTCCACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGGAATCTC

$ bbduk.sh in=nn.fa  out=stdout.fa mm=f hdist=0 edist=0 ktrim=l rcomp=f k=29 literal=CAACAGCAATATACCTTCTCGAGAGGTCT

java -Djava.library.path=/path_to/bbmap-36.81/bbmap/jni/ -ea -Xmx19502m -Xms19502m -cp /path_to/bbmap-36.81/bbmap/current/ jgi.BBDukF in=nn.fa out=stdout.fa mm=f hdist=0 edist=0 ktrim=l rcomp=f k=29 literal=CAACAGCAATATACCTTCTCGAGAGGTCT
Executing jgi.BBDukF [in=nn.fa, out=stdout.fa, mm=f, hdist=0, edist=0, ktrim=l, rcomp=f, k=29, literal=CAACAGCAATATACCTTCTCGAGAGGTCT]

BBDuk version 36.81
Initial:
Memory: max=19597m, free=19188m, used=409m

Added 1 kmers; time:    0.027 seconds.
Memory: max=19597m, free=18472m, used=1125m

Input is being processed as unpaired
Started output streams: 0.015 seconds.

>test
CTGTCCACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGGAATCTC

Processing time:                0.006 seconds.

Input:                          1 reads                 100 bases.
KTrimmed:                       1 reads (100.00%)       45 bases (45.00%)
Total Removed:                  0 reads (0.00%)         45 bases (45.00%)
Result:                         1 reads (100.00%)       55 bases (55.00%)

Time:                           0.055 seconds.
Reads Processed:           1    0.02k reads/sec
Bases Processed:         100    0.00m bases/sec

Last edited by GenoMax; 01-12-2017 at 05:04 AM.
GenoMax is offline   Reply With Quote
Old 01-12-2017, 05:19 AM   #178
SylvainL
Senior Member
 
Location: Geneva

Join Date: Feb 2012
Posts: 179
Default

Yes I know...

As I edited, it works only if I set maxlength=1 !!

At least for the version I downloaded this morning
SylvainL is offline   Reply With Quote
Old 01-12-2017, 05:24 AM   #179
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,053
Default

It worked for me with your original command (minus the maxlength). I used your sequence as a fasta file though. Let me try as fastq.

Edit: Worked as fastq as well.

Code:
$ cat nn.fq
@test
TCCGAGGCAGTAGGCTCAACAGCAATATACCTTCTCGAGAGGTCTCTGTCCACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGGAATCTC
+
CCCCCGGGGGFFFGGGGGGGGGGGGGGGGE7FGGGGGGGGGGGGGFGGGCEGGGGIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIGIIGFIIFIIIIIII

$ bbduk.sh in=nn.fq out=stdout.fq mm=f hdist=0 edist=0 ktrim=l rcomp=f k=29 literal=CAACAGCAATATACCTTCTCGAGAGGTCT

java -Djava.library.path=/path_to/bbmap-36.81/bbmap/jni/ -ea -Xmx19502m -Xms19502m -cp /path_to/bbmap-36.81/bbmap/current/ jgi.BBDukF in=nn.fq out=stdout.fq mm=f hdist=0 edist=0 ktrim=l rcomp=f k=29 literal=CAACAGCAATATACCTTCTCGAGAGGTCT
Executing jgi.BBDukF [in=nn.fq, out=stdout.fq, mm=f, hdist=0, edist=0, ktrim=l, rcomp=f, k=29, literal=CAACAGCAATATACCTTCTCGAGAGGTCT]

BBDuk version 36.81
Initial:
Memory: max=19597m, free=19188m, used=409m

Added 1 kmers; time:    0.031 seconds.
Memory: max=19597m, free=18472m, used=1125m

Input is being processed as unpaired
Started output streams: 0.019 seconds.

@test
CTGTCCACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACGTTTCGGAATCTC
+
FGGGCEGGGGIIIIIIIIIIIIIIIIGIIIIIIIIIIIIIGIIGFIIFIIIIIII
Processing time:                0.009 seconds.

Input:                          1 reads                 100 bases.
KTrimmed:                       1 reads (100.00%)       45 bases (45.00%)
Total Removed:                  0 reads (0.00%)         45 bases (45.00%)
Result:                         1 reads (100.00%)       55 bases (55.00%)

Time:                           0.066 seconds.
Reads Processed:           1    0.02k reads/sec
Bases Processed:         100    0.00m bases/sec

Last edited by GenoMax; 01-12-2017 at 05:29 AM.
GenoMax is offline   Reply With Quote
Old 01-12-2017, 05:36 AM   #180
SylvainL
Senior Member
 
Location: Geneva

Join Date: Feb 2012
Posts: 179
Default

I am using BBDuk version 36.84, I guess that's the main difference...
SylvainL is offline   Reply With Quote
Reply

Tags
adapter, bbduk, bbtools, cutadapt, trimmomatic

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 03:46 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO