11-09-2015, 06:29 AM   #1
ea11
BBDuk2 not trimming whole file

Hi,

I am running BBDuk2 to trim adapters from my fastq files and remove low-quality reads. My RNA-seq was paired-end and sequenced on 8 lanes per sample, so for each sample I have 8 forward (R1) read files and 8 reverse (R2) read files.

I concatenated the 8 R1 fastq files into 1 big file, and did the same for the R2 fastq files, so in the end I have 2 files per sample (R1 and R2).
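The concatenation was done with plain cat, roughly along these lines (file names below are placeholders, not my actual paths):

Code:
# concatenate the 8 gzipped lane files per read direction (placeholder names)
cat sample_L00*_R1_001.fastq.gz > R1_001.fastq.gz
cat sample_L00*_R2_001.fastq.gz > R2_001.fastq.gz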

I then run BBDuk2 using the following options:

bash bbduk2.sh -Xmx60g in1=/PATH/R1_001.fastq.gz in2=/PATH/R2_001.fastq.gz out1=/PATH/trimmedR1_001.fastq.gz out2=/PATH/trimmedR2_001.fastq.gz ref="/home/ea11g10/bbmap/resources/truseq.fa.gz" ktrim=r literal=GCTCTTCCGATCT ktrim=l k=13 mink=11 hdist=1 rcomp=t minlen=25 qtrim=rl trimq=10 tpe tbo

However, the process completes very quickly, and below is the output I get:

Code:
Input is being processed as paired
Started output streams: 0.126 seconds.
Processing time:                283.543 seconds.

Input:                          21426254 reads          2164051654 bases.
QTrimmed:                       4185899 reads (19.54%)  253488944 bases (11.71%)
KTrimmed:                       2490280 reads (11.62%)  97961555 bases (4.53%)
Trimmed by overlap:             1425360 reads (6.65%)   7478966 bases (0.35%)
Result:                         19219200 reads (89.70%)         1816127246 bases (83.92%)

Time:                           283.852 seconds.
Reads Processed:      21426k    75.48k reads/sec
Bases Processed:       2164m    7.62m bases/sec
Each individual file was around 10-11 million reads, so the fact that it is only processing 21 million reads suggests that it is not getting the whole concatenated fastq file, which in total should be around 160 million reads (80 million from R1 and 80 million from R2).

Would anyone be able to help me with this?

Thanks
11-09-2015, 06:46 AM   #2
GenoMax

See post #4 here. You may want to recombine your fastq files with the method described there.
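Roughly, the idea is to decompress the lane files and recompress them as a single gzip stream rather than cat-ing the .gz files together; a minimal sketch with placeholder file names (the exact command in the linked post may differ):

Code:
# rebuild each concatenated file as a single-member gzip stream (placeholder names)
zcat sample_L00*_R1_001.fastq.gz | gzip > combined_R1.fastq.gz
zcat sample_L00*_R2_001.fastq.gz | gzip > combined_R2.fastq.gz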
11-09-2015, 06:52 AM   #3
ea11

Thanks for that, I shall give it a try and see if it works.

The thing is though, when I run the concatenated file through FastQC, it shows the correct number of reads you would expect (~80 million). Does that still mean it didn't merge correctly?
11-09-2015, 07:12 AM   #4
GenoMax

Simon (the author of FastQC) is probably accounting for that in his code (he was the one who posted this observation first).

When @Brian comes along later today he will comment on the BBDuk behavior (he generally takes these kinds of things into account, but there are so many of them and there is only one @Brian).
11-09-2015, 07:25 AM   #5
ea11

Ah right, OK, thanks for the clarification. I am re-merging the files using the command in the other post, and will hopefully hear back from Brian when he gets round to it.

Thanks
11-09-2015, 09:42 AM   #6
Brian Bushnell

I've concatenated gzipped files before and don't remember ever having a problem with it, and I tried just now and it worked fine... It might be related to the Java version? Do you mind running "java -Xmx20m -version" and posting the output? It could also have to do with the program used to do the compression...

Either way, GenoMax's post (zcatting them) should solve it.

And by the way - BBDuk2 is not quite a drop-in replacement for BBDuk. They have somewhat different syntax. In this case, if you are trying to trim to the right using "ref" and to the left using "literal", you need to use the flag "rref" instead of "ref" and "lliteral" instead of "literal", so that it knows to use the ref for right-trimming and the literal for left-trimming. That is what you want to do, correct?
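If so, here is a sketch of the adjusted call, keeping your original paths and parameters; it assumes the trim sides are implied by rref/lliteral, so the ktrim flags are dropped:

Code:
# BBDuk2 sketch: right-trim by reference kmers, left-trim by the literal sequence
bash bbduk2.sh -Xmx60g in1=/PATH/R1_001.fastq.gz in2=/PATH/R2_001.fastq.gz \
    out1=/PATH/trimmedR1_001.fastq.gz out2=/PATH/trimmedR2_001.fastq.gz \
    rref="/home/ea11g10/bbmap/resources/truseq.fa.gz" lliteral=GCTCTTCCGATCT \
    k=13 mink=11 hdist=1 rcomp=t minlen=25 qtrim=rl trimq=10 tpe tbo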
11-09-2015, 10:23 AM   #7
ea11

Thanks for the reply Brian. I am zcatting the files at the moment, so hopefully they will work like that, but it would be nice to work out why my files didn't work when I just concatenated them.

java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.1) (rhel-1.45.1.11.1.el6-x86_64)
OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

Yes that is what I am trying to do. Thanks for letting me know about the slight differences I need to make in the code. So is BBDuk2 good to use for what I want to do or should I go back to BBDuk?

Thanks
11-09-2015, 10:35 AM   #8
Brian Bushnell

BBDuk2 should work fine if you adjust the parameters as I indicated. That said - personally, I would use 2 passes of BBDuk because BBDuk2 is a bit more confusing and less flexible (you can't use different kmer lengths for left and right trimming, for example). I designed BBDuk2 for integration into pipelines that get written once and then run exactly the same way for years, to achieve maximal efficiency, since it can do all kmer operations in a single pass (filtering, left-trimming, right-trimming, and masking). But actually I never use it because I usually want different values of K and a different hamming distance for the different steps.
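As a sketch of what the two-pass approach might look like, reusing your parameters (file names here are placeholders):

Code:
# pass 1: right-trim adapters by kmer against the reference, plus quality trimming
bash bbduk.sh in1=R1.fastq.gz in2=R2.fastq.gz out1=pass1_R1.fastq.gz out2=pass1_R2.fastq.gz \
    ref=truseq.fa.gz ktrim=r k=13 mink=11 hdist=1 rcomp=t minlen=25 qtrim=rl trimq=10 tpe tbo

# pass 2: left-trim the literal sequence from the pass-1 output
bash bbduk.sh in1=pass1_R1.fastq.gz in2=pass1_R2.fastq.gz out1=trimmed_R1.fastq.gz out2=trimmed_R2.fastq.gz \
    literal=GCTCTTCCGATCT ktrim=l k=13 mink=11 hdist=1 minlen=25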

The issue here is either that you are running OpenJDK or that you are running version 1.6, and probably the combination of both. I only test with Oracle's JDK, using versions 1.7 and 1.8.
11-09-2015, 10:40 AM   #9
ea11

Ok, I shall give BBDuk a look and see the results of that.

With regards to the Java version, I'm running all my analysis on a computer cluster so the only version of Java installed by the administrators is 1.6. Should zcatting work with version 1.6?
11-09-2015, 10:46 AM   #10
Brian Bushnell

Yes, zcatting should work with any version of Java; the only disadvantage is that it takes longer than cat. But I recommend that you ask your sysadmin to upgrade to the latest supported version, which is 1.8 for Oracle and, I believe, 1.8 for OpenJDK as well (I would suggest Oracle's, but that's just a personal preference since I test on Oracle's - they are supposed to be identical). Java is backwards compatible, and 1.6 is quite old now.
11-09-2015, 10:48 AM   #11
ea11

Yep, I am currently finding out that zcatting takes longer, which is why I'm running it overnight. I shall put forward a request to upgrade the version of Java we have.

Thanks for your help Brian
11-09-2015, 10:52 AM   #12
GenoMax

You could trim the 8 file pairs independently and then combine the BAMs into one at a later step. This would provide some (brute force) parallelization :-)
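A rough sketch, assuming the usual Illumina lane naming (adjust to your actual file names; each iteration could be submitted as its own cluster job):

Code:
# trim each lane pair on its own; per-lane BAMs can be merged after alignment (e.g. with samtools merge)
for lane in L001 L002 L003 L004 L005 L006 L007 L008; do
    bash bbduk.sh in1=sample_${lane}_R1_001.fastq.gz in2=sample_${lane}_R2_001.fastq.gz \
        out1=trimmed_${lane}_R1.fastq.gz out2=trimmed_${lane}_R2.fastq.gz \
        ref=truseq.fa.gz ktrim=r k=13 mink=11 hdist=1 minlen=25 qtrim=rl trimq=10 tpe tbo
done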
11-09-2015, 10:58 AM   #13
ea11

Yeah Geno, that was an option we talked through, but as a lab group we decided to merge all the files from the start and work with 2 files per sample rather than 16.
11-09-2015, 11:04 AM   #14
GenoMax

It is easy enough for whoever did your sequencing to generate the output as a single file instead of in pieces. Next time you may want to request that.
11-09-2015, 11:06 AM   #15
ea11

Yeah, that is what I was expecting from the people who did the sequencing, which is why I was shocked when I was given 640 files rather than the 80 I expected.
11-09-2015, 11:11 AM   #16
Brian Bushnell

I don't know why they do that; I blame it on Illumina's demultiplexing software defaults. For example, on the NextSeq, which only has one physical lane (in terms of library separation), we still produce 8 files per library, which is really inefficient from a tracking and labor perspective.
11-09-2015, 11:18 AM   #17
GenoMax

Quote:
Originally Posted by Brian Bushnell
Like on NextSeq, which only has one physical lane (in terms of library separation), we still produce 8 files per library, which is really inefficient from a tracking and labor perspective.
bcl2fastq (v. 2.17.x) has an option for this (--no-lane-splitting). I assume your NextSeq data is processed off-line, so it may be worth looking into that (I have not personally used this option).
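An untested sketch of what that invocation might look like (directories are placeholders):

Code:
# demultiplex without splitting the output per lane (bcl2fastq >= v2.17; placeholder paths)
bcl2fastq --runfolder-dir /path/to/run_folder \
          --output-dir /path/to/fastq_output \
          --sample-sheet /path/to/SampleSheet.csv \
          --no-lane-splitting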
11-10-2015, 04:32 AM   #18
ea11

Quote:
Originally Posted by Brian Bushnell
BBDuk2 should work fine if you adjust the parameters as I indicated. That said - personally, I would use 2 passes of BBDuk because BBDuk2 is a bit more confusing and less flexible (you can't use different kmer lengths for left and right trimming, for example). I designed BBDuk2 for integration into pipelines that get written once and then run exactly the same way for years, to achieve maximal efficiency, since it can do all kmer operations in a single pass (filtering, left-trimming, right-trimming, and masking). But actually I never use it because I usually want different values of K and a different hamming distance for the different steps.

The issue here is either that you are running OpenJDK, or version 1.6, and probably both combined. I only test with Oracle's JDK, and use version 1.7 and 1.8.

Hi Brian,

So I have decided to use BBDuk instead of BBDuk2 and do it twice. The following is my code:

bash bbduk.sh in1=/PATH/M_R1_001.fastq.gz in2=/PATH/M_R2_001.fastq.gz out1=/PATH/M1_R1_001.fastq.gz out2=/PATH/M1_R2_001.fastq.gz ref="/truseq.fa.gz" ktrim=r k=13 mink=11 hdist=1 rcomp=t minlen=25 qtrim=rl trimq=10 tpe tbo

For some of my samples, this works fine and I get a complete output and a summary of the input number of reads and reads left after trimming.

However, for some of my samples I am not getting this; instead I get the following:

BBDuk version 35.66
maskMiddle was disabled because useShortKmers=true
Initial:
Memory: max=41160m, free=40086m, used=1074m

Added 182 kmers; time: 0.032 seconds.
Memory: max=41160m, free=39012m, used=2148m

Input is being processed as paired
Started output streams: 0.221 seconds.
bbduk.sh: line 282: 888 Killed java -Djava.library.path=/PATH/bbmap/jni/ -ea -Xmx40g -Xms40g -cp /PATH/bbmap/current/ jgi.BBDukF -Xmx40g in1=/PATH/M_R1_001.fastq.gz in2=/PATH/M_R2_001.fastq.gz out1=/scratch/ea11g10/PATH/M1_R1_001.fastq.gz out2=/PATH/M1_R2_001.fastq.gz literal=GCTCTTCCGATCT ktrim=l k=13 mink=11 hdist=1 rcomp=t minlen=25 qtrim=rl trimq=10 tpe tbo


Do you know why it would be getting killed? I am now using Java 1.8 as it has been upgraded:
java version "1.8.0_51"
Java(TM) SE Runtime Environment (build 1.8.0_51-b16)
Java HotSpot(TM) 64-Bit Server VM (build 25.51-b03, mixed mode)

Am I running out of memory and so it is being killed?
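One way to check whether the kernel's OOM killer was responsible, assuming a Linux node whose kernel log I can read:

Code:
# look for OOM-killer messages around the time the java process died
dmesg | grep -iE 'killed process|out of memory' | tail

If a scheduler memory cap is in play, the JVM's overhead on top of -Xmx40g could also push the job past a 40 GB allocation.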

Thanks
11-10-2015, 04:39 AM   #19
GenoMax

I wouldn't think so since you are using 40G. Are you hitting a storage quota limit?
11-10-2015, 04:40 AM   #20
ea11

Nope, that's one of the first things I checked. I still have 100GB storage available