Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • High amounts of Kmer at the 3' end after Trimming

    Hi Everyone,

    I have weird situation. I am getting high Kmers on my 3' end (right end) of the Illumina reads. I have been using trimmomatic 0.32 and its been fairly easy to work with. Also I have paired end sequences and this is present with both forward and reverse reads.

    I have already filtered out adapter sequences by using the most common Illumina sequences by using the iplant's illumina adapter file. Also I have removed the over expressed sequences. Usually in the past filtering out the over expressed sequences took care of the Kmers… But this time around it hasn't.

    I was wondering what I was doing wrong? Should I trim the sequences so that there are only 30 bp left? I would be losing approximately 50 bps.

    Also all the quality of the reads are great by being between 38 and 24.

    Any ideas will be appreciated on how I could go about fixing this issue.

    Thank you,

    Zapages

    EDIT: I was able to fix the Kmers, by removing the last 30 bps in the reads. I was wondering if this was correct manner of handling this or should I have used a different strategy for this? Everything passed except for high duplication levels, which is to be expected due to the data set being RNA-Seq.
    Attached Files
    Last edited by Zapages; 02-14-2014, 09:25 PM.

  • #2
    It is a small RNA sequenceing?

    Comment


    • #3
      Originally posted by TiborNagy View Post
      It is a small RNA sequenceing?
      It was just total mRNA to cDNA and then sequencing using Illumina pipeline. The data was downloaded from NCBI GEO/SRA.

      Comment


      • #4
        Originally posted by Zapages View Post

        I was wondering what I was doing wrong? Should I trim the sequences so that there are only 30 bp left? I would be losing approximately 50 bps.

        Also all the quality of the reads are great by being between 38 and 24.

        Any ideas will be appreciated on how I could go about fixing this issue.

        Thank you,

        Zapages

        EDIT: I was able to fix the Kmers, by removing the last 30 bps in the reads. I was wondering if this was correct manner of handling this or should I have used a different strategy for this? Everything passed except for high duplication levels, which is to be expected due to the data set being RNA-Seq.
        Having a look at the sequence of the kmers at the 3' end in your plot, it looks like it is part of the Illumina adapter sequence, so maybe the trimming procedure didn't remove all the adapters. Try running trimmomatic with the option '-k 10' (default is '-k 5'), it will give a better idea of what the over-represented kmers are.

        Comment


        • #5
          Originally posted by mastal View Post
          Try running trimmomatic with the option '-k 10' (default is '-k 5'), it will give a better idea of what the over-represented kmers are.
          @Maria: Did you mean to say fastqc?

          Comment


          • #6
            @GenoMax, Yes, I meant FastQC, thanks for spotting the error.

            Comment


            • #7
              Originally posted by Zapages View Post
              It was just total mRNA to cDNA and then sequencing using Illumina pipeline. The data was downloaded from NCBI GEO/SRA.
              Could you tell us the SRR number?

              Comment


              • #8
                Thank you GenoMax and mastal. I tried to use the non-integrative mode for FastQC and unfortunately I received the heap out of memory error in terminal was trying to extract kmer (-k 10 values).

                Regardless, I tried different settings and I ended up doing various tests and trimming the datasets to 50bps. The original length was 101 or 76 bps. I believe this should be fine, right?

                @relipmoc, the SRR data set was: SRR330569

                Comment


                • #9
                  try skewer

                  Originally posted by Zapages View Post
                  @relipmoc, the SRR data set was: SRR330569
                  The adapter sequence of the first pair is different from canonical TruSeq3 adapter.
                  >Real/1
                  AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
                  >Real/2
                  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGAT

                  the following are TruSeq3 adapters
                  >TruSeq3/1
                  AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
                  >TruSeq3/2
                  AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA

                  Nevertheless, when I used trimmomatic for trimming, I got similar Kmer Content plot as yours.
                  Click image for larger version

Name:	trimmomatic-kmer-content.png
Views:	1
Size:	53.6 KB
ID:	304469

                  Then I used skewer for trimming instead and got a better Kmer Content plot.
                  Click image for larger version

Name:	skewer-kmer-content.png
Views:	1
Size:	96.8 KB
ID:	304470

                  The following are log information:
                  COMMAND LINE: skewer -x AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG -y AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGAT -t 8 SRR330569_1.fastq SRR330569_2.fastq
                  Input file: SRR330569_1.fastq
                  Paired file: SRR330569_2.fastq
                  trimmed: SRR330569_1.trimmed-Q0L18-pair1.fastq, SRR330569_1.trimmed-Q0L18-pair2.fastq

                  Parameters used:
                  -- 3' end adapter sequence (-x): AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCG
                  -- paired 3' end adapter sequence (-y): AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGAT
                  -- maximum error ratio allowed (-r): 0.100
                  -- maximum indel error ratio allowed (-d): 0.030
                  -- minimum read length allowed after trimming (-l): 18
                  -- file format (-f): Sanger/Illumina 1.8+ FASTQ (auto detected)
                  -- number of concurrent threads (-t): 8
                  Tue Feb 25 11:32:32 2014 >> started
                  |=================================================>| (100.00%)
                  Tue Feb 25 11:34:14 2014 >> done (102.664s)
                  27005344 read pairs processed; of these:
                  1058 ( 0.00%) short read pairs filtered out after trimming by size control
                  924 ( 0.00%) empty read pairs filtered out after trimming by size control
                  27003362 (99.99%) read pairs available; of these:
                  11959533 (44.29%) trimmed read pairs available after processing
                  15043829 (55.71%) untrimmed read pairs available after processing
                  log has been saved to "SRR330569_1.trimmed-Q0L18.log".
                  Attached Files

                  Comment


                  • #10
                    Originally posted by Zapages View Post
                    Hi Everyone,

                    I have weird situation. I am getting high Kmers on my 3' end (right end) of the Illumina reads. I have been using trimmomatic 0.32 and its been fairly easy to work with. Also I have paired end sequences and this is present with both forward and reverse reads.
                    Based on relipmoc's reply, and the vintage of the dataset, i would say these are TruSeq2 adapters (i haven't downloaded the dataset to verify) - have you tried using the TruSeq2 PE adapter file supplied with Trimmomatic?

                    Comment


                    • #11
                      Comparison of the trimming results of skewer and trimmomatic

                      The trimmed reads of skewer were aligned to dsim reference using tophat, with the following command:

                      Code:
                      $ tophat dsim SRR330569_1.trimmed-Q0L18-pair1.fastq SRR330569_1.trimmed-Q0L18-pair2.fastq
                      where dsim is the prefix of bowtie2 index files which were built from the dsim-r1.4 reference genome (ftp://ftp.flybase.net/genomes/Drosop...-r1.4.fasta.gz).

                      the content of "tophat_out/align_summary.txt" is:
                      PHP Code:
                      Left reads:
                                
                      Input     :  27003362
                                 Mapped   
                      :  22637908 (83.8of input)
                                  
                      of these:  13345683 (59.0%) have multiple alignments (2504776 have >20)
                      Right reads:
                                
                      Input     :  27003362
                                 Mapped   
                      :  21275973 (78.8of input)
                                  
                      of these:  12805304 (60.2%) have multiple alignments (2504737 have >20)
                      81.3overall read mapping rate.

                      Aligned pairs:  20542511
                           of these
                      :  12581418 (61.2%) have multiple alignments
                                       2509666 
                      (12.2%) are discordant alignments
                      66.8
                      concordant pair alignment rate
                      The trimmed reads of trimmomatic were aligned to dsim reference using tophat, with the following command:

                      Code:
                      $ tophat dsim real-10-pair1.paired.fq real-10-pair2.paired.fq
                      where real-10-pair1.paired.fq real-10-pair2.paired.fq were produced by trimmomatic-0.32.jar using default parameter (ILLUMINACLIP:real-adapters:2:30:10:1:1 MINLEN:18).

                      the content of "tophat_out/align_summary.txt" is:
                      PHP Code:
                      Left reads:
                                
                      Input     :  26987807
                                 Mapped   
                      :  17819297 (66.0of input)
                                  
                      of these:  10586873 (59.4%) have multiple alignments (2255543 have >20)
                      Right reads:
                                
                      Input     :  26987807
                                 Mapped   
                      :  16885216 (62.6of input)
                                  
                      of these:  10213191 (60.5%) have multiple alignments (2255529 have >20)
                      64.3overall read mapping rate.

                      Aligned pairs:  15637748
                           of these
                      :   9694646 (62.0%) have multiple alignments
                                       2074039 
                      (13.3%) are discordant alignments
                      50.3
                      concordant pair alignment rate

                      Comment


                      • #12
                        Hi Zapages and other,
                        If you plan on performining a de novo assembly of the trimmed reads, it may be good for you to have a look at this paper. http://journal.frontiersin.org/Journ...014.00017/full

                        Comment


                        • #13
                          I've finally gotten around to testing this and it appears that the appropriate included adapter sequences (TruSeq2-PE) do a perfectly fine job with this data. Of course, it is important to actually use the correct adapters

                          Using just adapter trimming (ILLUMINACLIP:adapters/TruSeq2-PE.fa:30:12:1:true, MINLEN:18), the aligned pairs are 20,335,738, slightly behind (~1.0%) the 20,542,511 achieved by skewer.

                          When the suggested LEADING:3 / TRAILING:3 steps are added, to remove the super-low quality stretches, 22,222,781 pairs align, about 8.2% more than skewer.

                          Addition of further quality trimming, e.g using MAXINFO:40:0.9, brings that up to 22,910,973 pairs aligning, ~11.5% more.

                          In any case, as pointed out by Apexy above, very aggressive trimming isn't always a good idea - it depends on the purpose, amount of data etc.

                          Comment


                          • #14
                            Hi Toney,

                            I noticed your ILLUMINACLIP command and wanted to verify I got the manual correctly. So the manual say its:
                            ILLUMINACLIP:<fastaWithAdaptersEtc>:<seed mismatches>:<palindrome clip
                            threshold>:<simple clip threshold>:<minAdapterLength>:<keepBothReads>

                            and the command you specified is ILLUMINACLIP:adapters/TruSeq2-PE.fa:30:12:1:true, MINLEN:18

                            As I understand, 30 is the seed mismatch, 12 is the palindrome clip threshold, 1 is the simple clip thrshold but I don't know what true is when I compare it to the command in the manual. I suppose its for keepBothReads but isn't that specified last.

                            Thank you
                            Originally posted by tonybolger View Post
                            I've finally gotten around to testing this and it appears that the appropriate included adapter sequences (TruSeq2-PE) do a perfectly fine job with this data. Of course, it is important to actually use the correct adapters

                            Using just adapter trimming (ILLUMINACLIP:adapters/TruSeq2-PE.fa:30:12:1:true, MINLEN:18), the aligned pairs are 20,335,738, slightly behind (~1.0%) the 20,542,511 achieved by skewer.

                            When the suggested LEADING:3 / TRAILING:3 steps are added, to remove the super-low quality stretches, 22,222,781 pairs align, about 8.2% more than skewer.

                            Addition of further quality trimming, e.g using MAXINFO:40:0.9, brings that up to 22,910,973 pairs aligning, ~11.5% more.

                            In any case, as pointed out by Apexy above, very aggressive trimming isn't always a good idea - it depends on the purpose, amount of data etc.

                            Comment


                            • #15
                              True is for 'keep both reads', and it is specified last in the ILLUMINACLIP command.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Current Approaches to Protein Sequencing
                                by seqadmin


                                Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                04-04-2024, 04:25 PM
                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 04-11-2024, 12:08 PM
                              0 responses
                              25 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 10:19 PM
                              0 responses
                              28 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-10-2024, 09:21 AM
                              0 responses
                              24 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 04-04-2024, 09:00 AM
                              0 responses
                              52 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X