Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • TauOvermind
    Member
    • Jul 2012
    • 14

    Illumina PE reads optimal pre-processing sequence

    Hello, everyone. I am trying to establish an optimal sequence of Illumina PE reads pre-processing steps and I have come up with the following protocol so far (the best tool for the step, in my opinion, is given in parentheses):
    1. Initial quality control (FastQC)
    2. Quality and/or length-based reads discarding; trimming/discarding of N-containing reads (bbduk.sh from BBMap/BBtools package)
    3. Deduplication (dedupe.sh from BBMap/BBtools package)
    4. Adapters removal (bbduk.sh from BBMap/BBtools package)
    5. [optional] Error-correction (ecc.sh from BBMap/BBtools package)
    6. Merging of PE reads (bbmerge.sh from BBMap/BBtools package)
    7. Hard and/or soft quality trimming (bbduk.sh from BBMap/BBtools package)
    8. Contamination check (Fastq Screen, maybe bbduk.sh from BBMap/BBtools package)


    I am not completely sure about the best order of several steps; for instance, what is it better to do first, deduplication or adapters removal? Also, isn't it better to conduct quality trimming first and then merge PE reads? Thank you in advance.
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    Deduplication of reads is possible with dedupe, but it requires a lot of memory (~1kb per read) so it is only practical with HiSeq data if you have high-memory nodes. If you have a reference, deduplication based on mapping will use much less memory, though it will also be much slower. And whether or not deduplication is a good idea depends on the experiment; i.e., probably not for quantitative RNA-seq, and also probably not for unamplified libraries in general.

    I suggest removing adapters before deduplication since adapter removal is less sensitive to errors, and identical reads will remain identical after adapter removal. And I suggest contaminant removal before quality trimming, if you do so with BBDuk (kmer-based) because it will be more sensitive. You can add phix/other contaminant removal into step 7 - BBDuk can do contaminant removal and quality trimming in one pass, and the contaminants are looked for prior to trimming each read.

    So, I would reorder it like this:

    1. Initial quality control
    2. Quality and/or length-based reads discarding; trimming/discarding of N-containing reads
    3. Adapter removal
    4. phix/contaminant removal
    5. [optional] Error-correction
    6. [optional] Deduplication (depending on experiment)
    7. Merging of PE reads
    8. Quality trimming + phix/contaminant removal
    9. Contamination check

    As for the order of quality trimming and merge, they could be done either way. But, BBMerge internally performs quality-trimming for reads that don't merge prior to quality trimming, so for that particular program, quality-trimming after merging tends to be better.

    Note that after merging, you will have 2 sets of reads (merged and unmerged) that must be processed independently, since not all of them will successfully merge.

    Comment

    • TauOvermind
      Member
      • Jul 2012
      • 14

      #3
      Brian, thank you very much for your comprehensive answers and also for your BBtools package, it is very good and allows to do almost anything with NGS data.

      Thank you for your advice regarding deduplication, but I will probably try to deduplicate my reads anyway, as I work with a metagenomic dataset and, due to a limiting step down the analysis pipeline, have to minimise the number of reads.

      May I ask you to clarify the meaning of
      Code:
      trimbyoverlap
      flag in bbduk? What is the mechanism of trimming adapters when this flag is set to true, does bbduk rely completely on reads' sequences in this situation or still use reference file? If I use it, will I be able to merge my PE reads afterwards?

      Just wondering, are you planning to publish a paper describing the toolkit any time soon? I think that it deserves to be better known to researchers, I found it just by chance, looking for alternatives to nesoni and trimmomatic.

      Comment

      • Brian Bushnell
        Super Moderator
        • Jan 2014
        • 2709

        #4
        Originally posted by TauOvermind View Post
        Brian, thank you very much for your comprehensive answers and also for your BBtools package, it is very good and allows to do almost anything with NGS data.
        You're welcome!

        May I ask you to clarify the meaning of
        Code:
        trimbyoverlap
        flag in bbduk? What is the mechanism of trimming adapters when this flag is set to true, does bbduk rely completely on reads' sequences in this situation or still use reference file? If I use it, will I be able to merge my PE reads afterwards?
        "trimbyoverlap" or "tbo" is intended for use alongside a reference, when trimming normal PE fragment libraries. Normally, if you specify something like "k=25 mink=12" then adapters will be trimmed starting whenever a 25-mer match is detected; at the very end of the read, it will look for as little as a 12-mer match. But what if you have a read ending in only 2bp of adapter sequence? That will be missed. The "tbo" flag attempts to overlap the two reads. If you have 2x100bp reads, and they overlap perfectly in an orientation indicating a 98bp insert size, that means the last 2 bp of each read is adapter sequence. So, in practice, "k=25 mink=12" will not trim adapter sequences shorter than 12bp, but "tbo" can trim down to 1bp of adapter, as long as the reads have a low error rate so the overlap can be detected. I usually pair this with the "tpe" flag ("trimpairsequally") because with fragment libraries, the adapter is expected to be in the same position for each read; so if it is detected in one read but not the other (because one read has errors) they will still both get trimmed to the same length. Thus, the settings I generally use for paired fragment libraries (assuming the reads are interleaved):

        bbduk.sh in=reads.fq out=trimmed.fq ref=adapters.fa ktrim=r k=25 mink=12 hdist=1 tbo tpe

        You should NOT use the "tbo" or "tpe" flags for libraries with other protocols, such as long-mate-pair libraries where maybe one read contains an adapter and the other doesn't; it's just for normal paired libraries, and greatly increases sensitivity.

        You can still merge the reads afterward. BBDuk will not actually merge the reads, it just finds overlaps to calculate the insert size.

        Just wondering, are you planning to publish a paper describing the toolkit any time soon? I think that it deserves to be better known to researchers, I found it just by chance, looking for alternatives to nesoni and trimmomatic.
        I certainly plan to publish it and hopefully will be able to do so soon (particularly since I have a collaborator now); it's mainly a problem of free time.
        Last edited by Brian Bushnell; 10-01-2014, 02:23 PM.

        Comment

        • TauOvermind
          Member
          • Jul 2012
          • 14

          #5
          Thank you, Brian, everything is clear now. Good luck with the paper!

          Comment

          • TauOvermind
            Member
            • Jul 2012
            • 14

            #6
            EDIT: Sorry, it seems that problem disappeared after updating to BBtools v33.54.

            Brian, may I ask for your help once again? I am getting these strange errors when I run dedupe.sh (I use BBtools v33.51):

            Code:
            dedupe.sh in=1pW_interleaved.fastq out=1pW_dd.fastq outd=1pW_dupes.fastq ac=f interleaved=t overwrite=t
            java -Djava.library.path=/home/calculon/PROGRAMS/bbmap/jni/ -ea -Xmx22936m -Xms22936m -cp /home/calculon/PROGRAMS/bbmap/current/ jgi.Dedupe in=1pW_interleaved.fastq out=1pW_dd.fastq outd=1pW_dupes.fastq ac=f interleaved=t overwrite=t
            Executing jgi.Dedupe [in=1pW_interleaved.fastq, out=1pW_dd.fastq, outd=1pW_dupes.fastq, ac=f, interleaved=t, overwrite=t]
            
            Set INTERLEAVED to true
            Initial:
            Memory: free=22687m, used=361m
            
            Found 4 duplicates.
            Finished exact matches.    Time: 0.357 seconds.
            Memory: free=21124m, used=1924m
            
            Input:                  	194776 reads 		48427883 bases.
            Duplicates:             	4 reads (0.00%) 	836 bases (0.00%)     	39916 collisions.
            Result:                 	194772 reads (100.00%) 	48427047 bases (100.00%)
            
            Exception in thread "main" java.lang.AssertionError: 97386, 194772
            	at jgi.Dedupe.addToArray(Dedupe.java:1331)
            	at jgi.Dedupe.writeOutput(Dedupe.java:1342)
            	at jgi.Dedupe.process2(Dedupe.java:531)
            	at jgi.Dedupe.process(Dedupe.java:430)
            	at jgi.Dedupe.main(Dedupe.java:57)

            It seems that the script doesn't want to write the deduplicated interleaved PE .fastq file to the disk, but it works flawlessly when I use it to deduplicate just one of the PE files:

            Code:
            dedupe.sh in=1pW_R1_100000.fastq  ac=f interleaved=f overwrite=t
            java -Djava.library.path=/home/calculon/PROGRAMS/bbmap/jni/ -ea -Xmx22931m -Xms22931m -cp /home/calculon/PROGRAMS/bbmap/current/ jgi.Dedupe in=1pW_R1_100000.fastq ac=f interleaved=f overwrite=t
            Executing jgi.Dedupe [in=1pW_R1_100000.fastq, ac=f, interleaved=f, overwrite=t]
            
            Set INTERLEAVED to false
            Initial:
            Memory: free=22803m, used=240m
            
            Found 32 duplicates.
            Finished exact matches.    Time: 0.251 seconds.
            Memory: free=21360m, used=1683m
            
            Input:                  	100000 reads 		24971771 bases.
            Duplicates:             	32 reads (0.03%) 	8918 bases (0.04%)     	0 collisions.
            Result:                 	99968 reads (99.97%) 	24962853 bases (99.96%)
            
            Time:   			0.262 seconds.
            Reads Processed:        100k 	381.80k reads/sec
            Bases Processed:      24971k 	95.34m bases/sec


            or when I omit setting output file:



            Code:
            dedupe.sh in=1pW_interleaved.fastq outd=1pW_dupes.fastq ac=f interleaved=t overwrite=t
            java -Djava.library.path=/home/calculon/PROGRAMS/bbmap/jni/ -ea -Xmx22936m -Xms22936m -cp /home/calculon/PROGRAMS/bbmap/current/ jgi.Dedupe in=1pW_interleaved.fastq outd=1pW_dupes.fastq ac=f interleaved=t overwrite=t
            Executing jgi.Dedupe [in=1pW_interleaved.fastq, outd=1pW_dupes.fastq, ac=f, interleaved=t, overwrite=t]
            
            Set INTERLEAVED to true
            Initial:
            Memory: free=22687m, used=361m
            
            Found 4 duplicates.
            Finished exact matches.    Time: 0.367 seconds.
            Memory: free=21124m, used=1924m
            
            Input:                  	194776 reads 		48427883 bases.
            Duplicates:             	4 reads (0.00%) 	836 bases (0.00%)     	39916 collisions.
            Result:                 	194772 reads (100.00%) 	48427047 bases (100.00%)
            
            Time:   			0.376 seconds.
            Reads Processed:        194k 	517.84k reads/sec
            Bases Processed:      48427k 	128.75m bases/sec
            Last edited by TauOvermind; 10-02-2014, 07:58 AM. Reason: Problem disappeared after updating to BBtools v33.54

            Comment

            • Brian Bushnell
              Super Moderator
              • Jan 2014
              • 2709

              #7
              Sorry about that; there was an incorrect assertion checking for the size of an array that assumed unpaired reads. It's fixed in the latest version (33.54). Alternately, you can add the "-da" flag to disable assertions, but I don't recommend that.

              Comment

              • TauOvermind
                Member
                • Jul 2012
                • 14

                #8
                Thank you, Brian, I updated to the latest version and everything works perfectly now.

                Comment

                Latest Articles

                Collapse

                • SEQadmin2
                  Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                  by SEQadmin2


                  I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                  Here are nine questions we think about, in roughly the order they matter, before...
                  06-18-2026, 07:11 AM
                • SEQadmin2
                  From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                  by SEQadmin2


                  Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                  The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                  ...
                  06-02-2026, 10:05 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by SEQadmin2, 06-26-2026, 11:10 AM
                0 responses
                12 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-17-2026, 06:09 AM
                0 responses
                46 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-09-2026, 11:58 AM
                0 responses
                106 views
                0 reactions
                Last Post SEQadmin2  
                Started by SEQadmin2, 06-05-2026, 10:09 AM
                0 responses
                125 views
                0 reactions
                Last Post SEQadmin2  
                Working...