Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • Zapages
    Member
    • Oct 2012
    • 98

    High amounts of Kmer at the 3' end after Trimming

    Hi Everyone,

    I have weird situation. I am getting high Kmers on my 3' end (right end) of the Illumina reads. I have been using trimmomatic 0.32 and its been fairly easy to work with. Also I have paired end sequences and this is present with both forward and reverse reads.

    I have already filtered out adapter sequences by using the most common Illumina sequences by using the iplant's illumina adapter file. Also I have removed the over expressed sequences. Usually in the past filtering out the over expressed sequences took care of the Kmers… But this time around it hasn't.

    I was wondering what I was doing wrong? Should I trim the sequences so that there are only 30 bp left? I would be losing approximately 50 bps.

    Also all the quality of the reads are great by being between 38 and 24.

    Any ideas will be appreciated on how I could go about fixing this issue.

    Thank you,

    Zapages

    EDIT: I was able to fix the Kmers, by removing the last 30 bps in the reads. I was wondering if this was correct manner of handling this or should I have used a different strategy for this? Everything passed except for high duplication levels, which is to be expected due to the data set being RNA-Seq.
    Attached Files
    Last edited by Zapages; 02-14-2014, 09:25 PM.
  • TiborNagy
    Senior Member
    • Mar 2010
    • 329

    #2
    It is a small RNA sequenceing?

    Comment

    • Zapages
      Member
      • Oct 2012
      • 98

      #3
      Originally posted by TiborNagy View Post
      It is a small RNA sequenceing?
      It was just total mRNA to cDNA and then sequencing using Illumina pipeline. The data was downloaded from NCBI GEO/SRA.

      Comment

      • mastal
        Senior Member
        • Mar 2009
        • 666

        #4
        Originally posted by Zapages View Post

        I was wondering what I was doing wrong? Should I trim the sequences so that there are only 30 bp left? I would be losing approximately 50 bps.

        Also all the quality of the reads are great by being between 38 and 24.

        Any ideas will be appreciated on how I could go about fixing this issue.

        Thank you,

        Zapages

        EDIT: I was able to fix the Kmers, by removing the last 30 bps in the reads. I was wondering if this was correct manner of handling this or should I have used a different strategy for this? Everything passed except for high duplication levels, which is to be expected due to the data set being RNA-Seq.
        Having a look at the sequence of the kmers at the 3' end in your plot, it looks like it is part of the Illumina adapter sequence, so maybe the trimming procedure didn't remove all the adapters. Try running trimmomatic with the option '-k 10' (default is '-k 5'), it will give a better idea of what the over-represented kmers are.

        Comment

        • GenoMax
          Senior Member
          • Feb 2008
          • 7142

          #5
          Originally posted by mastal View Post
          Try running trimmomatic with the option '-k 10' (default is '-k 5'), it will give a better idea of what the over-represented kmers are.
          @Maria: Did you mean to say fastqc?

          Comment

          • mastal
            Senior Member
            • Mar 2009
            • 666

            #6
            @GenoMax, Yes, I meant FastQC, thanks for spotting the error.

            Comment

            • relipmoc
              Member
              • Jul 2011
              • 58

              #7
              Originally posted by Zapages View Post
              It was just total mRNA to cDNA and then sequencing using Illumina pipeline. The data was downloaded from NCBI GEO/SRA.
              Could you tell us the SRR number?

              Comment

              • Zapages
                Member
                • Oct 2012
                • 98

                #8
                Thank you GenoMax and mastal. I tried to use the non-integrative mode for FastQC and unfortunately I received the heap out of memory error in terminal was trying to extract kmer (-k 10 values).

                Regardless, I tried different settings and I ended up doing various tests and trimming the datasets to 50bps. The original length was 101 or 76 bps. I believe this should be fine, right?

                @relipmoc, the SRR data set was: SRR330569

                Comment

                • relipmoc
                  Member
                  • Jul 2011
                  • 58

                  #9
                  try skewer

                  Originally posted by Zapages View Post
                  @relipmoc, the SRR data set was: SRR330569
                  The adapter sequence of the first pair is different from canonical TruSeq3 adapter.
                  >Real/1
                  AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
                  >Real/2
                  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGAT

                  the following are TruSeq3 adapters
                  >TruSeq3/1
                  AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
                  >TruSeq3/2
                  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA

                  Nevertheless, when I used trimmomatic for trimming, I got similar Kmer Content plot as yours.
                  Click image for larger version

Name:	trimmomatic-kmer-content.png
Views:	1
Size:	53.6 KB
ID:	304469

                  Then I used skewer for trimming instead and got a better Kmer Content plot.
                  Click image for larger version

Name:	skewer-kmer-content.png
Views:	1
Size:	96.8 KB
ID:	304470

                  The following are log information:
                  COMMAND LINE: skewer -x AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG -y AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGAT -t 8 SRR330569_1.fastq SRR330569_2.fastq
                  Input file: SRR330569_1.fastq
                  Paired file: SRR330569_2.fastq
                  trimmed: SRR330569_1.trimmed-Q0L18-pair1.fastq, SRR330569_1.trimmed-Q0L18-pair2.fastq

                  Parameters used:
                  -- 3' end adapter sequence (-x): AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
                  -- paired 3' end adapter sequence (-y): AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGAT
                  -- maximum error ratio allowed (-r): 0.100
                  -- maximum indel error ratio allowed (-d): 0.030
                  -- minimum read length allowed after trimming (-l): 18
                  -- file format (-f): Sanger/Illumina 1.8+ FASTQ (auto detected)
                  -- number of concurrent threads (-t): 8
                  Tue Feb 25 11:32:32 2014 >> started
                  |=================================================>| (100.00%)
                  Tue Feb 25 11:34:14 2014 >> done (102.664s)
                  27005344 read pairs processed; of these:
                  1058 ( 0.00%) short read pairs filtered out after trimming by size control
                  924 ( 0.00%) empty read pairs filtered out after trimming by size control
                  27003362 (99.99%) read pairs available; of these:
                  11959533 (44.29%) trimmed read pairs available after processing
                  15043829 (55.71%) untrimmed read pairs available after processing
                  log has been saved to "SRR330569_1.trimmed-Q0L18.log".
                  Attached Files

                  Comment

                  • tonybolger
                    Senior Member
                    • Feb 2010
                    • 156

                    #10
                    Originally posted by Zapages View Post
                    Hi Everyone,

                    I have weird situation. I am getting high Kmers on my 3' end (right end) of the Illumina reads. I have been using trimmomatic 0.32 and its been fairly easy to work with. Also I have paired end sequences and this is present with both forward and reverse reads.
                    Based on relipmoc's reply, and the vintage of the dataset, i would say these are TruSeq2 adapters (i haven't downloaded the dataset to verify) - have you tried using the TruSeq2 PE adapter file supplied with Trimmomatic?

                    Comment

                    • relipmoc
                      Member
                      • Jul 2011
                      • 58

                      #11
                      Comparison of the trimming results of skewer and trimmomatic

                      The trimmed reads of skewer were aligned to dsim reference using tophat, with the following command:

                      Code:
                      $ tophat dsim SRR330569_1.trimmed-Q0L18-pair1.fastq SRR330569_1.trimmed-Q0L18-pair2.fastq
                      where dsim is the prefix of bowtie2 index files which were built from the dsim-r1.4 reference genome (ftp://ftp.flybase.net/genomes/Drosop...-r1.4.fasta.gz).

                      the content of "tophat_out/align_summary.txt" is:
                      PHP Code:
                      Left reads:
                                
                      Input     :  27003362
                                 Mapped   
                      :  22637908 (83.8of input)
                                  
                      of these:  13345683 (59.0%) have multiple alignments (2504776 have >20)
                      Right reads:
                                
                      Input     :  27003362
                                 Mapped   
                      :  21275973 (78.8of input)
                                  
                      of these:  12805304 (60.2%) have multiple alignments (2504737 have >20)
                      81.3overall read mapping rate.

                      Aligned pairs:  20542511
                           of these
                      :  12581418 (61.2%) have multiple alignments
                                       2509666 
                      (12.2%) are discordant alignments
                      66.8
                      concordant pair alignment rate
                      The trimmed reads of trimmomatic were aligned to dsim reference using tophat, with the following command:

                      Code:
                      $ tophat dsim real-10-pair1.paired.fq real-10-pair2.paired.fq
                      where real-10-pair1.paired.fq real-10-pair2.paired.fq were produced by trimmomatic-0.32.jar using default parameter (ILLUMINACLIP:real-adapters:2:30:10:1:1 MINLEN:18).

                      the content of "tophat_out/align_summary.txt" is:
                      PHP Code:
                      Left reads:
                                
                      Input     :  26987807
                                 Mapped   
                      :  17819297 (66.0of input)
                                  
                      of these:  10586873 (59.4%) have multiple alignments (2255543 have >20)
                      Right reads:
                                
                      Input     :  26987807
                                 Mapped   
                      :  16885216 (62.6of input)
                                  
                      of these:  10213191 (60.5%) have multiple alignments (2255529 have >20)
                      64.3overall read mapping rate.

                      Aligned pairs:  15637748
                           of these
                      :   9694646 (62.0%) have multiple alignments
                                       2074039 
                      (13.3%) are discordant alignments
                      50.3
                      concordant pair alignment rate

                      Comment

                      • Apexy
                        Member
                        • Apr 2011
                        • 62

                        #12
                        Hi Zapages and other,
                        If you plan on performining a de novo assembly of the trimmed reads, it may be good for you to have a look at this paper. http://journal.frontiersin.org/Journ...014.00017/full

                        Comment

                        • tonybolger
                          Senior Member
                          • Feb 2010
                          • 156

                          #13
                          I've finally gotten around to testing this and it appears that the appropriate included adapter sequences (TruSeq2-PE) do a perfectly fine job with this data. Of course, it is important to actually use the correct adapters

                          Using just adapter trimming (ILLUMINACLIP:adapters/TruSeq2-PE.fa:30:12:1:true, MINLEN:18), the aligned pairs are 20,335,738, slightly behind (~1.0%) the 20,542,511 achieved by skewer.

                          When the suggested LEADING:3 / TRAILING:3 steps are added, to remove the super-low quality stretches, 22,222,781 pairs align, about 8.2% more than skewer.

                          Addition of further quality trimming, e.g using MAXINFO:40:0.9, brings that up to 22,910,973 pairs aligning, ~11.5% more.

                          In any case, as pointed out by Apexy above, very aggressive trimming isn't always a good idea - it depends on the purpose, amount of data etc.

                          Comment

                          • modi2020
                            Member
                            • May 2012
                            • 22

                            #14
                            Hi Toney,

                            I noticed your ILLUMINACLIP command and wanted to verify I got the manual correctly. So the manual say its:
                            ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip
                            threshold>:<simple clip threshold>:<minAdapterLength>:<keepBothReads>

                            and the command you specified is ILLUMINACLIP:adapters/TruSeq2-PE.fa:30:12:1:true, MINLEN:18

                            As I understand, 30 is the seed mismatch, 12 is the palindrome clip threshold, 1 is the simple clip thrshold but I don't know what true is when I compare it to the command in the manual. I suppose its for keepBothReads but isn't that specified last.

                            Thank you
                            Originally posted by tonybolger View Post
                            I've finally gotten around to testing this and it appears that the appropriate included adapter sequences (TruSeq2-PE) do a perfectly fine job with this data. Of course, it is important to actually use the correct adapters

                            Using just adapter trimming (ILLUMINACLIP:adapters/TruSeq2-PE.fa:30:12:1:true, MINLEN:18), the aligned pairs are 20,335,738, slightly behind (~1.0%) the 20,542,511 achieved by skewer.

                            When the suggested LEADING:3 / TRAILING:3 steps are added, to remove the super-low quality stretches, 22,222,781 pairs align, about 8.2% more than skewer.

                            Addition of further quality trimming, e.g using MAXINFO:40:0.9, brings that up to 22,910,973 pairs aligning, ~11.5% more.

                            In any case, as pointed out by Apexy above, very aggressive trimming isn't always a good idea - it depends on the purpose, amount of data etc.

                            Comment

                            • mastal
                              Senior Member
                              • Mar 2009
                              • 666

                              #15
                              True is for 'keep both reads', and it is specified last in the ILLUMINACLIP command.

                              Comment

                              Latest Articles

                              Collapse

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Yesterday, 10:09 AM
                              0 responses
                              10 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-04-2026, 08:59 AM
                              0 responses
                              19 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              26 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 11:40 AM
                              0 responses
                              21 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...