  • Illumina PE reads optimal pre-processing sequence

    Hello, everyone. I am trying to establish an optimal sequence of Illumina PE read pre-processing steps, and I have come up with the following protocol so far (the tool I consider best for each step is given in parentheses):
    1. Initial quality control (FastQC)
    2. Quality- and/or length-based read discarding; trimming/discarding of N-containing reads (bbduk.sh from BBMap/BBtools package)
    3. Deduplication (dedupe.sh from BBMap/BBtools package)
    4. Adapter removal (bbduk.sh from BBMap/BBtools package)
    5. [optional] Error-correction (ecc.sh from BBMap/BBtools package)
    6. Merging of PE reads (bbmerge.sh from BBMap/BBtools package)
    7. Hard and/or soft quality trimming (bbduk.sh from BBMap/BBtools package)
    8. Contamination check (Fastq Screen, maybe bbduk.sh from BBMap/BBtools package)


    I am not completely sure about the best order of several steps; for instance, which is better to do first, deduplication or adapter removal? Also, wouldn't it be better to quality-trim first and then merge the PE reads? Thank you in advance.

  • #2
    Deduplication of reads is possible with dedupe, but it requires a lot of memory (~1kb per read) so it is only practical with HiSeq data if you have high-memory nodes. If you have a reference, deduplication based on mapping will use much less memory, though it will also be much slower. And whether or not deduplication is a good idea depends on the experiment; i.e., probably not for quantitative RNA-seq, and also probably not for unamplified libraries in general.
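    For scale: at ~1 kB per read, a HiSeq lane of ~200 million reads would need on the order of 200 GB of RAM. As a minimal sketch (file names are placeholders), the memory ceiling can be raised via the -Xmx flag, which the BBTools wrapper scripts pass through to the JVM:

    Code:
    dedupe.sh in=reads.fq out=deduped.fq ac=f -Xmx40g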

    I suggest removing adapters before deduplication, since adapter removal is less sensitive to errors, and identical reads will remain identical after adapter removal. And I suggest contaminant removal before quality trimming if you do it with BBDuk (kmer-based), because it will be more sensitive that way. You can also fold phix/other contaminant removal into step 7: BBDuk can do contaminant removal and quality trimming in one pass, and contaminants are searched for before each read is trimmed.
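    For instance, a single BBDuk pass doing both might look like this (a sketch, not a recommendation: file names are placeholders, the phix reference is assumed to be the copy shipped in BBTools' resources/ directory, and the k/trimq values are illustrative):

    Code:
    bbduk.sh in=reads.fq out=clean.fq ref=phix174_ill.ref.fa.gz k=31 hdist=1 qtrim=rl trimq=10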

    So, I would reorder it like this:

    1. Initial quality control
    2. Quality- and/or length-based read discarding; trimming/discarding of N-containing reads
    3. Adapter removal
    4. phix/contaminant removal
    5. [optional] Error-correction
    6. [optional] Deduplication (depending on experiment)
    7. Merging of PE reads
    8. Quality trimming + phix/contaminant removal
    9. Contamination check
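
    A minimal sketch of steps 1-6 as commands, assuming interleaved reads and the adapters.fa/phix174_ill.ref.fa.gz references shipped in BBTools' resources/ directory; all file names and parameter values are illustrative, and steps 7-9 are sketched after the next two paragraphs:

    Code:
    # 1. Initial quality control
    fastqc reads.fq
    # 2. Quality/length-based discarding; discard reads containing Ns
    bbduk.sh in=reads.fq out=filtered.fq maq=10 minlen=50 maxns=0
    # 3. Adapter removal (kmer-based right-trimming)
    bbduk.sh in=filtered.fq out=trimmed.fq ref=adapters.fa ktrim=r k=25 mink=12 hdist=1 tbo tpe
    # 4. phix/contaminant removal
    bbduk.sh in=trimmed.fq out=clean.fq ref=phix174_ill.ref.fa.gz k=31 hdist=1
    # 5. [optional] Error correction
    ecc.sh in=clean.fq out=corrected.fq
    # 6. [optional] Deduplication
    dedupe.sh in=corrected.fq out=deduped.fq ac=f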

    As for the order of quality trimming and merging, they could be done either way. But BBMerge internally quality-trims reads that fail to merge on the first attempt and then retries the overlap, so for that particular program, quality-trimming after merging tends to be better.

    Note that after merging, you will have 2 sets of reads (merged and unmerged) that must be processed independently, since not all of them will successfully merge.
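
    Continuing the sketch above with steps 7-9 (again, names and parameters are placeholders; the contamination check assumes a pre-configured fastq_screen.conf):

    Code:
    # 7. Merge PE reads; unmerged pairs go to a separate file
    bbmerge.sh in=deduped.fq out=merged.fq outu=unmerged.fq
    # 8. Quality trimming, run separately on each stream
    bbduk.sh in=merged.fq out=merged_qt.fq qtrim=rl trimq=10 minlen=50
    bbduk.sh in=unmerged.fq out=unmerged_qt.fq qtrim=rl trimq=10 minlen=50
    # 9. Contamination check
    fastq_screen merged_qt.fq unmerged_qt.fq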



    • #3
      Brian, thank you very much for your comprehensive answers, and also for your BBtools package; it is very good and lets you do almost anything with NGS data.

      Thank you for your advice regarding deduplication, but I will probably try to deduplicate my reads anyway: I work with a metagenomic dataset and, due to a limiting step further down the analysis pipeline, have to minimise the number of reads.

      May I ask you to clarify the meaning of
      Code:
      trimbyoverlap
      flag in BBDuk? What is the mechanism of adapter trimming when this flag is set to true: does BBDuk rely entirely on the reads' sequences in this situation, or does it still use the reference file? And if I use it, will I still be able to merge my PE reads afterwards?

      Just wondering, are you planning to publish a paper describing the toolkit any time soon? I think it deserves to be better known to researchers; I found it just by chance while looking for alternatives to nesoni and Trimmomatic.



      • #4
        Originally posted by TauOvermind View Post
        Brian, thank you very much for your comprehensive answers, and also for your BBtools package; it is very good and lets you do almost anything with NGS data.
        You're welcome!

        May I ask you to clarify the meaning of
        Code:
        trimbyoverlap
        flag in BBDuk? What is the mechanism of adapter trimming when this flag is set to true: does BBDuk rely entirely on the reads' sequences in this situation, or does it still use the reference file? And if I use it, will I still be able to merge my PE reads afterwards?
        "trimbyoverlap" or "tbo" is intended for use alongside a reference, when trimming normal PE fragment libraries. Normally, if you specify something like "k=25 mink=12" then adapters will be trimmed starting whenever a 25-mer match is detected; at the very end of the read, it will look for as little as a 12-mer match. But what if you have a read ending in only 2bp of adapter sequence? That will be missed. The "tbo" flag attempts to overlap the two reads. If you have 2x100bp reads, and they overlap perfectly in an orientation indicating a 98bp insert size, that means the last 2 bp of each read is adapter sequence. So, in practice, "k=25 mink=12" will not trim adapter sequences shorter than 12bp, but "tbo" can trim down to 1bp of adapter, as long as the reads have a low error rate so the overlap can be detected. I usually pair this with the "tpe" flag ("trimpairsequally") because with fragment libraries, the adapter is expected to be in the same position for each read; so if it is detected in one read but not the other (because one read has errors) they will still both get trimmed to the same length. Thus, the settings I generally use for paired fragment libraries (assuming the reads are interleaved):

        Code:
        bbduk.sh in=reads.fq out=trimmed.fq ref=adapters.fa ktrim=r k=25 mink=12 hdist=1 tbo tpe

        You should NOT use the "tbo" or "tpe" flags for libraries prepared with other protocols, such as long-mate-pair libraries where one read may contain an adapter and the other doesn't; they are just for normal paired fragment libraries, where they greatly increase sensitivity.

        You can still merge the reads afterward. BBDuk will not actually merge the reads; it just finds overlaps to calculate the insert size.
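
        For instance, picking up the trimmed output from the command above (file names hypothetical; ihist= optionally writes the insert-size histogram):

        Code:
        bbmerge.sh in=trimmed.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt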

        Just wondering, are you planning to publish a paper describing the toolkit any time soon? I think it deserves to be better known to researchers; I found it just by chance while looking for alternatives to nesoni and Trimmomatic.
        I certainly plan to publish it and hopefully will be able to do so soon (particularly since I have a collaborator now); it's mainly a problem of free time.
        Last edited by Brian Bushnell; 10-01-2014, 02:23 PM.



        • #5
          Thank you, Brian, everything is clear now. Good luck with the paper!



          • #6
            EDIT: Sorry, it seems that the problem disappeared after updating to BBtools v33.54.

            Brian, may I ask for your help once again? I am getting these strange errors when I run dedupe.sh (I use BBtools v33.51):

            Code:
            dedupe.sh in=1pW_interleaved.fastq out=1pW_dd.fastq outd=1pW_dupes.fastq ac=f interleaved=t overwrite=t
            java -Djava.library.path=/home/calculon/PROGRAMS/bbmap/jni/ -ea -Xmx22936m -Xms22936m -cp /home/calculon/PROGRAMS/bbmap/current/ jgi.Dedupe in=1pW_interleaved.fastq out=1pW_dd.fastq outd=1pW_dupes.fastq ac=f interleaved=t overwrite=t
            Executing jgi.Dedupe [in=1pW_interleaved.fastq, out=1pW_dd.fastq, outd=1pW_dupes.fastq, ac=f, interleaved=t, overwrite=t]
            
            Set INTERLEAVED to true
            Initial:
            Memory: free=22687m, used=361m
            
            Found 4 duplicates.
            Finished exact matches.    Time: 0.357 seconds.
            Memory: free=21124m, used=1924m
            
            Input:                  	194776 reads 		48427883 bases.
            Duplicates:             	4 reads (0.00%) 	836 bases (0.00%)     	39916 collisions.
            Result:                 	194772 reads (100.00%) 	48427047 bases (100.00%)
            
            Exception in thread "main" java.lang.AssertionError: 97386, 194772
            	at jgi.Dedupe.addToArray(Dedupe.java:1331)
            	at jgi.Dedupe.writeOutput(Dedupe.java:1342)
            	at jgi.Dedupe.process2(Dedupe.java:531)
            	at jgi.Dedupe.process(Dedupe.java:430)
            	at jgi.Dedupe.main(Dedupe.java:57)

            It seems that the script fails when writing the deduplicated interleaved PE .fastq file to disk, but it works flawlessly when I use it to deduplicate just one of the PE files:

            Code:
            dedupe.sh in=1pW_R1_100000.fastq  ac=f interleaved=f overwrite=t
            java -Djava.library.path=/home/calculon/PROGRAMS/bbmap/jni/ -ea -Xmx22931m -Xms22931m -cp /home/calculon/PROGRAMS/bbmap/current/ jgi.Dedupe in=1pW_R1_100000.fastq ac=f interleaved=f overwrite=t
            Executing jgi.Dedupe [in=1pW_R1_100000.fastq, ac=f, interleaved=f, overwrite=t]
            
            Set INTERLEAVED to false
            Initial:
            Memory: free=22803m, used=240m
            
            Found 32 duplicates.
            Finished exact matches.    Time: 0.251 seconds.
            Memory: free=21360m, used=1683m
            
            Input:                  	100000 reads 		24971771 bases.
            Duplicates:             	32 reads (0.03%) 	8918 bases (0.04%)     	0 collisions.
            Result:                 	99968 reads (99.97%) 	24962853 bases (99.96%)
            
            Time:   			0.262 seconds.
            Reads Processed:        100k 	381.80k reads/sec
            Bases Processed:      24971k 	95.34m bases/sec


            or when I omit the output file:

            Code:
            dedupe.sh in=1pW_interleaved.fastq outd=1pW_dupes.fastq ac=f interleaved=t overwrite=t
            java -Djava.library.path=/home/calculon/PROGRAMS/bbmap/jni/ -ea -Xmx22936m -Xms22936m -cp /home/calculon/PROGRAMS/bbmap/current/ jgi.Dedupe in=1pW_interleaved.fastq outd=1pW_dupes.fastq ac=f interleaved=t overwrite=t
            Executing jgi.Dedupe [in=1pW_interleaved.fastq, outd=1pW_dupes.fastq, ac=f, interleaved=t, overwrite=t]
            
            Set INTERLEAVED to true
            Initial:
            Memory: free=22687m, used=361m
            
            Found 4 duplicates.
            Finished exact matches.    Time: 0.367 seconds.
            Memory: free=21124m, used=1924m
            
            Input:                  	194776 reads 		48427883 bases.
            Duplicates:             	4 reads (0.00%) 	836 bases (0.00%)     	39916 collisions.
            Result:                 	194772 reads (100.00%) 	48427047 bases (100.00%)
            
            Time:   			0.376 seconds.
            Reads Processed:        194k 	517.84k reads/sec
            Bases Processed:      48427k 	128.75m bases/sec
            Last edited by TauOvermind; 10-02-2014, 07:58 AM. Reason: Problem disappeared after updating to BBtools v33.54



            • #7
              Sorry about that; there was an incorrect assertion checking the size of an array that assumed unpaired reads. It's fixed in the latest version (33.54). Alternatively, you can add the "-da" flag to disable assertions, but I don't recommend that.
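
              For example, appended to the original command (a stopgap only, per the above):

              Code:
              dedupe.sh in=1pW_interleaved.fastq out=1pW_dd.fastq outd=1pW_dupes.fastq ac=f interleaved=t overwrite=t -da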



              • #8
                Thank you, Brian, I updated to the latest version and everything works perfectly now.
