  • BBDuk2 not trimming whole file

    Hi,

    I am running BBDuk to trim adapters from my fastq files and remove reads with low quality. My RNAseq was PE and done on 8 lanes for each sample, so for each sample I have 8 forward read files and 8 reverse read files.

    I concatenated the 8 R1 fastq files into 1 big file and did the same for the R2 fastq files, so in the end I have 2 files per sample (R1 and R2).

    I then run BBDuk2 using the following options:

    bash bbduk2.sh -Xmx60g in1=/PATH/R1_001.fastq.gz in2=/PATH/R2_001.fastq.gz out1=/PATH/trimmedR1_001.fastq.gz out2=/PATH/trimmedR2_001.fastq.gz ref="/home/ea11g10/bbmap/resources/truseq.fa.gz" ktrim=r literal=GCTCTTCCGATCT ktrim=l k=13 mink=11 hdist=1 rcomp=t minlen=25 qtrim=rl trimq=10 tpe tbo

    However, the process completes very quickly and below is the output I get:

    Input is being processed as paired
    Started output streams:         0.126 seconds.
    Processing time:                283.543 seconds.

    Input:                          21426254 reads          2164051654 bases.
    QTrimmed:                       4185899 reads (19.54%)  253488944 bases (11.71%)
    KTrimmed:                       2490280 reads (11.62%)  97961555 bases (4.53%)
    Trimmed by overlap:             1425360 reads (6.65%)   7478966 bases (0.35%)
    Result:                         19219200 reads (89.70%)         1816127246 bases (83.92%)

    Time:                           283.852 seconds.
    Reads Processed:      21426k    75.48k reads/sec
    Bases Processed:       2164m    7.62m bases/sec
    Each individual file was around 10-11 million reads, so the fact that it's only processing 21 million reads suggests that it's not reading the whole concatenated fastq file, which in total should contain around 160 million reads (80 million from R1 and 80 million from R2).

    Would anyone be able to help me with this?

    Thanks

  • #2
    See post#4 here. You may want to recombine your fastq files with the method described there.



    • #3
      Thanks for that, I shall give it a try and see if it works.

      The thing is though, when I run the concatenated file through FastQC, it shows the number of reads you would expect (~80 million). Does that still mean it didn't merge correctly?
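
A read count can also be checked independently of FastQC and BBDuk. The following is only a minimal sketch, assuming standard 4-line FASTQ records; the `count_reads` helper and the tiny demo lane files are hypothetical and stand in for the real per-lane files. It compares the read count of a `cat`-combined file against the sum of its parts, using `gzip -cd`, which decompresses every member of a concatenated gzip file.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Count reads in a gzipped FASTQ file, assuming 4 lines per record.
# gzip -cd reads all gzip members of a concatenated file, not just the first.
count_reads() {
    local lines
    lines=$(gzip -cd "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Demo data: two tiny hypothetical lane files with 2 reads each.
tmpdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n' | gzip > "$tmpdir/lane1.fastq.gz"
printf '@r3\nACGT\n+\nIIII\n@r4\nACGT\n+\nIIII\n' | gzip > "$tmpdir/lane2.fastq.gz"

# Combine the compressed files with plain cat, as in the original post.
cat "$tmpdir/lane1.fastq.gz" "$tmpdir/lane2.fastq.gz" > "$tmpdir/combined.fastq.gz"

expected=$(( $(count_reads "$tmpdir/lane1.fastq.gz") + $(count_reads "$tmpdir/lane2.fastq.gz") ))
observed=$(count_reads "$tmpdir/combined.fastq.gz")
echo "expected=$expected observed=$observed"

rm -rf "$tmpdir"
```

If the two numbers agree for the real files, the concatenated file itself is intact and the shortfall is more likely on the reading side.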



      • #4
        Simon (author of FastQC) is probably accounting for that in his code (he was the one who posted this observation first).

        As @Brian comes along later today he will comment on BBDuk (he generally takes these kinds of things into account but there are so many and there is only one @Brian).



        • #5
          Ah right ok, thanks for the clarification. Well I am re-merging the files using the command in the other post, and will hopefully wait to hear back from Brian when he gets round to it.

          Thanks



          • #6
            I've concatenated gzipped files before and I don't remember ever having a problem with it; I tried just now and it worked fine... It might be related to the Java version? Do you mind running "java -Xmx20m -version" and posting the output? Also, it could have to do with the program used to do the compression...

            Either way, Genomax's post (zcatting them) should solve it.

            And by the way - BBDuk2 is not quite a drop-in replacement for BBDuk. They have somewhat different syntax. In this case, if you are trying to trim to the right using "ref" and to the left using "literal", you need to use the flag "rref" instead of "ref" and "lliteral" instead of "literal", so that it knows to use the ref for right-trimming and the literal for left-trimming. That is what you want to do, correct?
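
As a rough illustration of the flag change described above, the original command could be adjusted along these lines. This is a sketch, not a verified invocation: only the `ref` to `rref` and `literal` to `lliteral` renames come from this thread. Dropping the two `ktrim` flags is an assumption (on the reasoning that `rref`/`lliteral` already encode the trimming direction) and should be checked against the `bbduk2.sh` help text. The script only prints the command; `data_dir`, `adapters`, `cmd` and `cmd_str` are illustrative variable names.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder locations copied from the original post.
data_dir=/PATH
adapters=/home/ea11g10/bbmap/resources/truseq.fa.gz

cmd=(
    bash bbduk2.sh -Xmx60g
    in1="$data_dir/R1_001.fastq.gz" in2="$data_dir/R2_001.fastq.gz"
    out1="$data_dir/trimmedR1_001.fastq.gz" out2="$data_dir/trimmedR2_001.fastq.gz"
    rref="$adapters"          # right-trim with the adapter reference (was ref=)
    lliteral=GCTCTTCCGATCT    # left-trim with the literal sequence (was literal=)
    k=13 mink=11 hdist=1 rcomp=t
    minlen=25 qtrim=rl trimq=10 tpe tbo
)

cmd_str="${cmd[*]}"
echo "$cmd_str"     # dry run: print the command for inspection
# "${cmd[@]}"       # uncomment to execute once the flags have been verified
```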



            • #7
              Thanks for the reply Brian. I am zcatting the files at the moment, so hopefully they should work like that, but it would be nice to work out why my files didn't work when I just concatenated them.

              java version "1.6.0_24"
              OpenJDK Runtime Environment (IcedTea6 1.11.1) (rhel-1.45.1.11.1.el6-x86_64)
              OpenJDK 64-Bit Server VM (build 20.0-b12, mixed mode)

              Yes that is what I am trying to do. Thanks for letting me know about the slight differences I need to make in the code. So is BBDuk2 good to use for what I want to do or should I go back to BBDuk?

              Thanks



              • #8
                BBDuk2 should work fine if you adjust the parameters as I indicated. That said - personally, I would use 2 passes of BBDuk because BBDuk2 is a bit more confusing and less flexible (you can't use different kmer lengths for left and right trimming, for example). I designed BBDuk2 for integration into pipelines that get written once and then run exactly the same way for years, to achieve maximal efficiency, since it can do all kmer operations in a single pass (filtering, left-trimming, right-trimming, and masking). But actually I never use it because I usually want different values of K and a different hamming distance for the different steps.
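
The two-pass alternative mentioned above might be laid out roughly as follows. This is a sketch under assumptions: the file names are hypothetical, and the specific values (`k=23` and `hdist=1` for the adapter pass, `k=13` and `hdist=0` for the 13-base literal) are only illustrative choices to show that each pass can have its own kmer length and hamming distance. The script prints both commands rather than running them.

```shell
#!/usr/bin/env bash
set -euo pipefail

adapters=/home/ea11g10/bbmap/resources/truseq.fa.gz   # path from the original post

# Pass 1: right-trim adapters using the reference file.
pass1=(
    bbduk.sh -Xmx60g
    in1=R1_001.fastq.gz in2=R2_001.fastq.gz
    out1=pass1_R1.fastq.gz out2=pass1_R2.fastq.gz
    ref="$adapters" ktrim=r k=23 mink=11 hdist=1 tpe tbo
)

# Pass 2: left-trim the literal sequence, then quality-trim and length-filter.
# k cannot exceed the literal's length (13 bases).
pass2=(
    bbduk.sh -Xmx60g
    in1=pass1_R1.fastq.gz in2=pass1_R2.fastq.gz
    out1=trimmedR1_001.fastq.gz out2=trimmedR2_001.fastq.gz
    literal=GCTCTTCCGATCT ktrim=l k=13 mink=11 hdist=0
    qtrim=rl trimq=10 minlen=25
)

pass1_str="${pass1[*]}"
pass2_str="${pass2[*]}"
echo "$pass1_str"    # dry run: inspect before executing
echo "$pass2_str"
# "${pass1[@]}" && "${pass2[@]}"   # uncomment to run both passes in order
```

The intermediate `pass1_*` files cost disk space and time compared with a single BBDuk2 pass, which is the trade-off described in the post.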

                The issue here is either that you are running OpenJDK, or version 1.6, and probably both combined. I only test with Oracle's JDK, and use version 1.7 and 1.8.



                • #9
                  Ok, I shall give BBDuk a look and see the results of that.

                  With regards to the Java version, I'm running all my analysis on a computer cluster so the only version of Java installed by the administrators is 1.6. Should zcatting work with version 1.6?



                  • #10
                    Yes, zcatting should work with any version of Java; the only disadvantage is that it takes longer than cat. But, I recommend that you request your sysadmin upgrade to the latest supported version, which is 1.8 for Oracle and I believe 1.8 for OpenJDK (and I would suggest Oracle's, but that's just a personal preference since I test on Oracle's - they are supposed to be identical). Java is backwards compatible, and 1.6 is quite old now.
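
For readers without access to the linked post, the zcat approach discussed here amounts to decompressing all lane files in order and recompressing them as a single gzip stream. The sketch below is an assumption about what that looks like, not a copy of the linked command; the lane file names are hypothetical, and `gzip -cd` is used as a portable equivalent of `zcat`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Demo data: three tiny hypothetical lane files with one read each.
tmpdir=$(mktemp -d)
for lane in 1 2 3; do
    printf '@read%s\nACGT\n+\nIIII\n' "$lane" | gzip > "$tmpdir/sample_L00${lane}_R1.fastq.gz"
done

# Decompress every lane in sorted order and write one single-member gzip file.
gzip -cd "$tmpdir"/sample_L00?_R1.fastq.gz | gzip > "$tmpdir/sample_R1.fastq.gz"

# Sanity check: the merged file should hold the sum of the per-lane reads.
merged_reads=$(( $(gzip -cd "$tmpdir/sample_R1.fastq.gz" | wc -l) / 4 ))
echo "merged reads: $merged_reads"

rm -rf "$tmpdir"
```

The R2 files need to be merged in the same lane order as the R1 files so that read pairs stay in sync. The extra decompress and recompress step is why this takes longer than plain cat.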



                    • #11
                      Yep, am currently finding out that zcatting takes longer, which is why I'm running it overnight. I shall put forward a request to upgrade the version of Java that we have.

                      Thanks for your help Brian



                      • #12
                        You could trim the 8 file pairs independently and then combine the BAMs into one at a later step. This would provide some (brute force) parallelization :-)
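
The brute-force parallelization suggested above could be sketched as one background job per lane followed by `wait`. Everything specific here is an assumption for illustration: the lane naming (`sample_L001_R1.fastq.gz` and so on), the `trim_lane` helper, and the trimming flags. The script only records the commands it would launch instead of running BBDuk.

```shell
#!/usr/bin/env bash
set -euo pipefail

sample=sample        # hypothetical sample prefix
log=$(mktemp)        # collects the commands that would be launched

trim_lane() {
    local lane=$1
    local cmd=(
        bbduk.sh
        in1="${sample}_${lane}_R1.fastq.gz" in2="${sample}_${lane}_R2.fastq.gz"
        out1="trimmed_${sample}_${lane}_R1.fastq.gz" out2="trimmed_${sample}_${lane}_R2.fastq.gz"
        ref=truseq.fa.gz ktrim=r k=23 mink=11 hdist=1 tpe tbo
    )
    # Dry run: record the command; replace this line with "${cmd[@]}" to execute it.
    echo "${cmd[*]}" >> "$log"
}

for lane in L001 L002 L003 L004 L005 L006 L007 L008; do
    trim_lane "$lane" &     # one background job per lane pair
done
wait                        # block until every lane has finished

jobs_launched=$(( $(wc -l < "$log") ))
echo "launched $jobs_launched trimming jobs"
rm -f "$log"
```

On a shared cluster, memory per job (the `-Xmx` setting) and the number of simultaneous jobs would need to fit the node's limits.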



                        • #13
                          Yea Geno, that was an option that we talked through, but as a lab group we decided to merge all the files from the start and work on 2 per sample rather than 16 per sample.



                          • #14
                            It is easy enough for whoever you got the sequencing done from to generate the output as a single file, instead of the pieces. Next time you may want to request that.



                            • #15
                              Yea that is what I was expecting from the people who did the sequencing, which is why I was shocked when I was given 640 files rather than the 80 I was expecting.
