
  • Trimmomatic Paired End - Low number of surviving reads

    Hi Friends,

    I am having a problem with Illumina HiSeq (v. 1.9) paired-end libraries (150 nt reads). The number of reads surviving after trimming is very low.

    This is my command:
    Code:
    java -jar trimmomatic-0.32.jar PE -threads 28 -phred33 WT_CTTGTA_L001_R1_001.fastq WT_CTTGTA_L001_R2_001.fastq Out_paired_WT_CTTGTA_L001_R1_001.fastq.gz Out_unpaired_WT_CTTGTA_L001_R1_001.fastq.gz Out_paired_WT_CTTGTA_L001_R2_001.fastq.gz Out_unpaired_WT_CTTGTA_L001_R2_001.fastq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:8:TRUE LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
    My adapter FASTA file:
    >PrefixPE/1
    TACACTCTTTCCCTACACGACGCTCTTCCGATCT
    >PrefixPE/2
    GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT
    >ReadThrough_PE/1
    AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
    >ReadThrough_PE/2
    AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA
    >PCR_Primer/1.1/1
    AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTA
    >PCR_Primer/1.2/1
    ATCTCGTATGCCGTCTTCTGCTTG
    >PCR_Primer/1.1/2
    CAAGCAGAAGACGGCATACGAGAT
    >PCR_Primer/1.2/2
    TAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
    Stats from trimmomatic:
    Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
    Using Long Clipping Sequence: 'ATCTCGTATGCCGTCTTCTGCTTG'
    Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
    Using Long Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTA'
    Using Long Clipping Sequence: 'TAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
    Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
    Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGAT'
    ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 3 forward only sequences, 3 reverse only sequences
    Input Read Pairs: 22060013 Both Surviving: 14586934 (66.12%) Forward Only Surviving: 6580960 (29.83%) Reverse Only Surviving: 237640 (1.08%) Dropped: 654479 (2.97%)
    TrimmomaticPE: Completed successfully
    I would appreciate your help and/or suggestions.


    BADE
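
A minimal sketch for reading the summary line above: it parses the TrimmomaticPE counts and works out how often each mate was discarded. The function names are hypothetical, and the only input is the summary line copied from the post.

```python
import re

# Summary line copied from the Trimmomatic output in the post above.
SUMMARY = (
    "Input Read Pairs: 22060013 Both Surviving: 14586934 (66.12%) "
    "Forward Only Surviving: 6580960 (29.83%) "
    "Reverse Only Surviving: 237640 (1.08%) Dropped: 654479 (2.97%)"
)

def parse_trimmomatic_pe_summary(line):
    """Return the five counts from a TrimmomaticPE summary line as a dict."""
    pattern = (
        r"Input Read Pairs: (\d+) Both Surviving: (\d+) .*?"
        r"Forward Only Surviving: (\d+) .*?"
        r"Reverse Only Surviving: (\d+) .*?Dropped: (\d+)"
    )
    match = re.search(pattern, line)
    if match is None:
        raise ValueError("not a TrimmomaticPE summary line")
    keys = ("input", "both", "forward_only", "reverse_only", "dropped")
    return dict(zip(keys, map(int, match.groups())))

def mate_loss(counts):
    """Fraction of input pairs in which R1 and R2, respectively, were discarded."""
    total = counts["input"]
    # R2 is lost when only the forward read survives or the whole pair is dropped.
    r2_lost = (counts["forward_only"] + counts["dropped"]) / total
    # R1 is lost when only the reverse read survives or the whole pair is dropped.
    r1_lost = (counts["reverse_only"] + counts["dropped"]) / total
    return r1_lost, r2_lost

counts = parse_trimmomatic_pe_summary(SUMMARY)
r1_lost, r2_lost = mate_loss(counts)
print(f"R1 discarded in {r1_lost:.2%} of pairs, R2 in {r2_lost:.2%}")
```

Almost all of the lost pairs fall in the "Forward Only Surviving" category, which means it is the reverse read that is being discarded.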

  • #2
    Can you give some more information about the run itself? Sample, Cluster density, Insert size, size selection and library prep might be helpful for troubleshooting.

    It might be worth checking if the quality of the reverse read or the insert size is the actual issue by running the adapter trimming and quality trimming in two separate steps.
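
One way to set up the two-step run suggested above is sketched here. It only builds and prints the two TrimmomaticPE command lines and does not run them. The jar, adapter file, input files and trimming steps are taken from the original command; the helper name and the intermediate output names are hypothetical.

```python
import shlex

ADAPTERS = "TruSeq3-PE-2.fa"      # adapter file named in the original command
JAR = "trimmomatic-0.32.jar"      # jar named in the original command

def trimmomatic_pe(r1, r2, prefix, steps, threads=28):
    """Build (not run) a TrimmomaticPE command; output names are hypothetical."""
    outputs = [
        f"{prefix}_R1_paired.fastq.gz", f"{prefix}_R1_unpaired.fastq.gz",
        f"{prefix}_R2_paired.fastq.gz", f"{prefix}_R2_unpaired.fastq.gz",
    ]
    cmd = ["java", "-jar", JAR, "PE", "-threads", str(threads), "-phred33",
           r1, r2, *outputs, *steps]
    return cmd, outputs

# Step 1: adapter clipping only.
adapter_cmd, adapter_out = trimmomatic_pe(
    "WT_CTTGTA_L001_R1_001.fastq", "WT_CTTGTA_L001_R2_001.fastq",
    "step1_adapters", [f"ILLUMINACLIP:{ADAPTERS}:2:30:10:8:TRUE"])

# Step 2: quality trimming only, run on the paired output of step 1.
quality_cmd, _ = trimmomatic_pe(
    adapter_out[0], adapter_out[2], "step2_quality",
    ["LEADING:3", "TRAILING:3", "SLIDINGWINDOW:4:15", "MINLEN:36"])

for cmd in (adapter_cmd, quality_cmd):
    print(shlex.join(cmd))
```

Comparing the summary line of each step shows whether pairs are lost at adapter clipping or at quality trimming.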

    • #3
      Give BBDuk a try on the side to see if you get better results.

      • #4
        Hi Avo,

        Can you give some more information about the run itself? Sample, Cluster density, Insert size, size selection and library prep might be helpful for troubleshooting.
        I have e-mailed our sequencing center for the information and am waiting for their reply.

        It might be worth checking if the quality of the reverse read or the insert size is the actual issue by running the adapter trimming and quality trimming in two separate steps.
        You are right: the quality of the reverse reads is really low. Running Trimmomatic with only the adapter-trimming options (no quality trimming) reports 100% of reads surviving in both:

        TrimmomaticPE: Started with arguments: -threads 28 -phred33 _WT_CTTGTA_L001_R1_001.fastq _WT_CTTGTA_L001_R2_001.fastq Out_paired_WT_CTTGTA_L001_R1_001.fastq.gz Out_unpaired__WT_CTTGTA_L001_R1_001.fastq.gz Out_paired_WT_CTTGTA_L001_R2_001.fastq.gz Out_unpaired_WT_CTTGTA_L001_R2_001.fastq.gz ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10:8:TRUE
        Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
        Using Long Clipping Sequence: 'ATCTCGTATGCCGTCTTCTGCTTG'
        Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
        Using Long Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTA'
        Using Long Clipping Sequence: 'TAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
        Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
        Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGAT'
        ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 3 forward only sequences, 3 reverse only sequences
        Input Read Pairs: 22060013 Both Surviving: 22059856 (100.00%) Forward Only Surviving: 2 (0.00%) Reverse Only Surviving: 154 (0.00%) Dropped: 1 (0.00%)
        TrimmomaticPE: Completed successfully
        I have attached the quality scores for read 1 (forward) and read 2 (reverse) below. What could be the reason for such low-quality reverse reads? Is there a way I can rescue these low-quality reads? For my analysis I need paired output.

        Thanks

        BADE
        Attached Files

        • #5
          Median scores for R2 are still above Q30 so things are not that bad. If this is a re-sequencing project you shouldn't worry about trimming based on Q-scores. Is this a MiSeq run?

          • #6
            Hi Genomax,

            Median scores for R2 are still above Q30 so things are not that bad. If this is a re-sequencing project you shouldn't worry about trimming based on Q-scores.
            But the problem is that I need paired files for my analysis, and only 66% of the reads survive as pairs. Any suggestions on how to increase the number of pairs in which both the forward and reverse reads survive? Or should I combine the unpaired reads and treat them as single-end reads for my RNA-seq analysis, which aims to identify the top expressed and differentially expressed genes?

            Is this a MiSeq run?
            It's from a HiSeq 2500.

            Thanks

            BADE

            • #7
              Reason I asked about this being a MiSeq run was because of the # of reads. 22 million PE reads seems to be on the low end (11 mil unique clusters) for a HiSeq 2500 run.

              If you have a reference genome available then I would suggest that you trim only the adapters (and very low Q-scores (< 5), if you are worried about that). That should leave you with more reads to go forward.
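
To illustrate why the quality threshold matters so much for read survival, here is a simplified sketch of sliding-window trimming combined with a minimum-length filter. It is not Trimmomatic's exact implementation: it simply cuts the read at the start of the first window whose mean quality falls below the threshold. The toy quality values are made up for illustration.

```python
def sliding_window_keep(quals, window=4, threshold=15):
    """Number of leading bases kept by a simplified sliding-window trim.

    Scans from the 5' end and cuts at the start of the first window whose
    mean quality falls below `threshold`.
    """
    for start in range(0, max(len(quals) - window + 1, 0)):
        chunk = quals[start:start + window]
        if sum(chunk) / window < threshold:
            return start
    return len(quals)

def survives(quals, threshold, minlen=36, window=4):
    """True if the trimmed read is still at least `minlen` bases long."""
    return sliding_window_keep(quals, window, threshold) >= minlen

# Toy 150 nt read (made-up values): 30 good bases, then a mediocre tail at Q12.
toy_read = [35] * 30 + [12] * 120

print(survives(toy_read, threshold=15))  # read is cut where the Q12 tail begins
print(survives(toy_read, threshold=5))   # a Q12 tail passes a threshold of 5
```

With a threshold of 15 the toy read is cut to 30 bases and then fails a minimum length of 36; with a threshold of 5 it is kept whole, so its mate stays paired.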

              • #8
                Hi Genomax,

                Reason I asked about this being a MiSeq run was because of the # of reads. 22 million PE reads seems to be on the low end (11 mil unique clusters) for a HiSeq 2500 run.
                The reason for such a low read count in one paired library is that 6 samples (3 control and 3 test) were multiplexed on a single lane. I am not sure if that's good or bad for a standard RNA-seq analysis of a species with a gold-standard reference genome like mouse. Maybe you can comment on it.

                If you have a reference genome available then I would suggest that you trim only the adapters (and very low Q-scores (< 5), if you are worried about that). That should leave you with more reads to go forward.
                Thanks for your suggestion. With the parameter SLIDINGWINDOW:4:5, I am getting:

                TrimmomaticPE: Started with arguments: -threads 28 -phred33 _WT_CTTGTA_L001_R1_001.fastq _WT_CTTGTA_L001_R2_001.fastq Out_paired_WT_CTTGTA_L001_R1_001.fastq.gz Out_unpaired_WT_CTTGTA_L001_R1_001.fastq.gz Out_paired_WT_CTTGTA_L001_R2_001.fastq.gz Out_unpaired_WT_CTTGTA_L001_R2_001.fastq.gz ILLUMINACLIP:Trimmomatic-0.32/TruSeq3-PE-2.fa:2:30:10:8:true LEADING:3 TRAILING:3 SLIDINGWINDOW:4:5 MINLEN:36
                Using PrefixPair: 'TACACTCTTTCCCTACACGACGCTCTTCCGATCT' and 'GTGACTGGAGTTCAGACGTGTGCTCTTCCGATCT'
                Using Long Clipping Sequence: 'ATCTCGTATGCCGTCTTCTGCTTG'
                Using Long Clipping Sequence: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'
                Using Long Clipping Sequence: 'AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTA'
                Using Long Clipping Sequence: 'TAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT'
                Using Long Clipping Sequence: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA'
                Using Long Clipping Sequence: 'CAAGCAGAAGACGGCATACGAGAT'
                ILLUMINACLIP: Using 1 prefix pairs, 0 forward/reverse sequences, 3 forward only sequences, 3 reverse only sequences
                Input Read Pairs: 22060013 Both Surviving: 20067327 (90.97%) Forward Only Surviving: 1989348 (9.02%) Reverse Only Surviving: 3200 (0.01%) Dropped: 138 (0.00%)
                I will continue with this and see how it goes. I was also thinking of combining the unpaired reads and treating that file as single-end data for further analysis.

                Any further suggestions would be helpful.

                Bade

                • #9
                  Hi All,

                  As suggested in this thread, I pre-processed all the samples and then mapped the reads with TopHat2, keeping the standard analysis options (pasted below):

                  * FASTQ Quality Scale: Sanger (PHRED33)
                  * Anchor length: 8
                  * Maximum number of mismatches that can appear in the anchor region of spliced alignment: 0
                  * The minimum intron length: 70
                  * The maximum intron length: 50000
                  * Minimum isoform fraction: 0.15
                  * Maximum number of alignments to be allowed: 20
                  * Minimum intron length that may be found during split-segment (default) search: 50
                  * Maximum intron length that may be found during split-segment (default) search: 500000
                  * Number of mismatches allowed in each segment alignment for reads mapped independently: 2
                  * Minimum length of read segments: 20
                  * Mate-Pair Inner Distance: 50
                  * Bowtie 2 speed and sensitivity: Sensitive (slower)
                  The TopHat alignment summary for WT and cKO samples is pasted below:

                  Sample: WT_CTTGTA_L001_R1_001.fastq
                  Left reads:
                  Input: 20067327
                  Mapped: 13674480 (68.1% of input)
                  of these: 1296943 ( 9.5%) have multiple alignments (23368 have >20)
                  Right reads:
                  Input: 15421666
                  Mapped: 6407370 (41.5% of input)
                  of these: 605201 ( 9.4%) have multiple alignments (6202 have >20)
                  56.6% overall read alignment rate.

                  Aligned pairs: 5421652
                  of these: 48595 ( 0.9%) have multiple alignments
                  and: 5298036 (97.7%) are discordant alignments
                  0.8% concordant pair alignment rate.

                  Sample: cKO_CTTGTA_L001_R1_001.fastq
                  Left reads:
                  Input: 18672105
                  Mapped: 10964334 (58.7% of input)
                  of these: 1118743 (10.2%) have multiple alignments (15345 have >20)
                  Right reads:
                  Input: 18672105
                  Mapped: 7635944 (40.9% of input)
                  of these: 736048 ( 9.6%) have multiple alignments (8155 have >20)
                  49.8% overall read alignment rate.

                  Aligned pairs: 7384018
                  of these: 550466 ( 7.5%) have multiple alignments
                  and: 13750 ( 0.2%) are discordant alignments
                  39.5% concordant pair alignment rate.
                  I am wondering why there are so many “discordant alignments” in the WT sample. Can the cKO sample be considered “good” and used for further analysis?

                  Please suggest.

                  BADE

                  • #10
                    That normally means that your read ordering got messed up by some preprocessing step, and thus the reads are no longer properly paired. Note, for example -

                    Left reads:
                    Input: 20067327
                    Mapped: 13674480 (68.1% of input)
                    of these: 1296943 ( 9.5%) have multiple alignments (23368 have >20)
                    Right reads:
                    Input: 15421666
                    Properly paired files should have the same number of left and right reads. You need to redo the preprocessing on that data and ensure pairs are kept together.
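
A quick way to test for this before aligning is sketched below: walk both FASTQ files in parallel and confirm that the read counts and read names agree. It assumes well-formed four-line FASTQ records, with mates sharing a name up to the first space or a trailing `/1` or `/2`. The helper names and toy files are hypothetical.

```python
import gzip
import os
import tempfile
from itertools import zip_longest

def read_names(path):
    """Yield the read name (header up to the first space or '/') per record."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 0 and line.strip():  # header line of each 4-line record
                yield line[1:].split()[0].split("/")[0]

def check_pairing(r1_path, r2_path):
    """Return (pairs_checked, first_problem or None)."""
    n = 0
    for a, b in zip_longest(read_names(r1_path), read_names(r2_path)):
        if a is None or b is None:
            return n, "files contain different numbers of reads"
        if a != b:
            return n, f"name mismatch at record {n + 1}: {a} vs {b}"
        n += 1
    return n, None

def _write_toy_fastq(path, names):
    """Write a tiny FASTQ file with made-up sequence and quality strings."""
    with open(path, "w") as fh:
        for name in names:
            fh.write(f"@{name}\nACGT\n+\nIIII\n")

with tempfile.TemporaryDirectory() as tmp:
    r1 = os.path.join(tmp, "toy_R1.fastq")
    r2 = os.path.join(tmp, "toy_R2.fastq")
    _write_toy_fastq(r1, ["read1/1", "read2/1", "read3/1"])
    _write_toy_fastq(r2, ["read1/2", "read3/2"])  # read2's mate is missing
    result = check_pairing(r1, r2)
print(result)
```

On real data, `check_pairing` would be called on the two paired output files passed to the aligner; any result other than `None` in the second position means the files are out of sync.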

                    • #11
                      @BADE: You had 22059856 pairs surviving at the end of trimmomatic run. Did you do something to the files afterwards?

                      • #12
                        @ Brian Properly paired files should have the same number of left and right reads. You need to redo the preprocessing on that data and ensure pairs are kept together.
                        Yes, the ordering of two samples was messed up. I ran TopHat again, and the output is below:
                        Sample: WT_CTTGTA_L001_R1_001.fastq
                        Left reads:
                        Input: 20067327
                        Mapped: 12967700 (64.6% of input)
                        of these: 1229841 ( 9.5%) have multiple alignments (21956 have >20)
                        Right reads:
                        Input: 20067327
                        Mapped: 8661002 (43.2% of input)
                        of these: 784306 ( 9.1%) have multiple alignments (11071 have >20)
                        53.9% overall read alignment rate.

                        Aligned pairs: 8312616
                        of these: 551497 ( 6.6%) have multiple alignments
                        and: 33317 ( 0.4%) are discordant alignments
                        41.3% concordant pair alignment rate.
                        I understand that the overall read alignment rate is only 53.9% because of the low-quality reverse reads. Also, the concordant pair rate is only 41.3%. I am getting similar alignment and concordant rates for all the other samples. Is it appropriate to proceed with this data to the next step, Cufflinks?

                        @GenoMax:You had 22059856 pairs surviving at the end of trimmomatic run. Did you do something to the files afterwards?
                        I have 20067327 read pairs surviving out of 22060013 (from the SLIDINGWINDOW:4:5 run). For TopHat I used the Out_paired files.

                        Please suggest

                        Thanks,

                        BADE
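
The two rates in the re-run summary above can be reproduced from the raw counts. The formulas in this sketch are inferred from the numbers in the thread (they also match the other two summaries posted earlier) and are not taken from the TopHat documentation, so treat them as an assumption.

```python
# Counts copied from the re-run WT alignment summary above.
left_input, left_mapped = 20067327, 12967700
right_input, right_mapped = 20067327, 8661002
aligned_pairs, discordant = 8312616, 33317

def pct(numerator, denominator):
    """Percentage rounded to one decimal place, as in the summary."""
    return round(100.0 * numerator / denominator, 1)

# Assumed: overall rate = all mapped mates / all input mates.
overall = pct(left_mapped + right_mapped, left_input + right_input)
# Assumed: concordant rate = (aligned pairs - discordant pairs) / input pairs.
concordant = pct(aligned_pairs - discordant, min(left_input, right_input))
left_rate = pct(left_mapped, left_input)
right_rate = pct(right_mapped, right_input)

print(f"left {left_rate}%, right {right_rate}%")
print(f"overall {overall}%, concordant {concordant}%")
```

The gap between the left and right mapping rates is what holds the pair-level rate down: a pair can only align concordantly if both mates map.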

                        • #13
                          The data has an unexpectedly low mapping and pairing rate. You may want to do quality-trimming first, or use local alignment, or use a more error-tolerant aligner. As a first step, I would suggest quality-trimming. It's also possible that the quality is so low that adapter-trimming tools can't detect adapter sequence. In that case, unless the genomic material is incredibly precious, you should just resequence it.

                          What organism is this, and do you have a reference or at least some assembly?

                          • #14
                            Hi BADE,
                            You could try skewer for preprocessing your data. It has been shown to produce better input for downstream analysis of RNA-Seq data. It is easy to use and runs fast.
                            Last edited by relipmoc; 10-22-2014, 10:12 PM. Reason: typo

                            • #15
                              @Brian: The data has an unexpectedly low mapping and pairing rate. You may want to do quality-trimming first, or use local alignment, or use a more error-tolerant aligner. As a first step, I would suggest quality-trimming. It's also possible that the quality is so low that adapter-trimming tools can't detect adapter sequence. In that case, unless the genomic material is incredibly precious, you should just resequence it.
                              I also performed quality trimming, but only 66% of the read pairs survived (as described in my earlier post). I am not sure how many of them will align to the genome.

                              What organism is this, and do you have a reference or at least some assembly?
                              The data are from mouse samples, and the reference genome is Mus musculus.

                              @ Relipmoc: You may try skewer for preprocessing your data. It's demonstrated to produce better input for downstream analysis of RNA-Seq data. It's easy to use and runs fast.
                              Thanks for the suggestions. I will try it.
