
  • strange FastQC kmer plot even after trimming

    Hi,
    I've attached a strange FastQC k-mer plot that persists even after adapter and quality trimming. The data are from a 400bp PE library run on a GAII. I used Trimmomatic to trim the TruSeq adapter.
    Code:
    GATCGGAAGAGCACACGTCTGAACTCCAGTCACCTTGTAATCTCGTATGCCGTCTTCTGCTTG
    Many other posts suggest not worrying too much about things like this. However, I am coming back to it only after getting a highly fragmented de novo assembly of a large genome. I understand that a de novo assembly can turn out like that for many reasons, but I want to make sure I am supplying high-quality reads to the assembler, not to mention that the plot looks ugly.
    Thanks for any suggestions.
    Attached Files

  • #2
    Oh, for de novo assembly you should definitely worry about that (if you were just mapping reads to a genome, it likely wouldn't matter). Those 5-mers are being generated from two dinucleotide repeats (just in different frames and strands). That is going to screw up your assembly if very many of them are infiltrating your reads. We can't tell how many from that plot, because it's just relative to the most abundant k-mer.

    Are you sure you put in the correct adapter for trimming? Just the TruSeq adapter is often not enough; rather, you need some set of indexed adapters, PCR primers, etc. I generally give Trimmomatic a pretty long list of every adapter/primer set that was used in the whole group of library preps being sequenced, just to be sure. If you don't, you'll find adapter/primer sequence from all kinds of stuff in your assemblies.
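
To illustrate the point in #2 about frames and strands, here is a minimal Python sketch. It lists the 5-mers that a single dinucleotide repeat produces on both strands. The repeat unit `CT` is only a hypothetical example; the actual k-mers in the attached plot are not reproduced in this thread.

```python
def revcomp(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def kmers(seq, k=5):
    """Set of all distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def repeat_kmers(unit, k=5, copies=10):
    """Distinct k-mers from a tandem repeat of `unit`, on both strands."""
    repeat = unit * copies
    return kmers(repeat, k) | kmers(revcomp(repeat), k)


# Hypothetical repeat unit, chosen only for illustration.
print(sorted(repeat_kmers("CT")))
```

One repeat gives two 5-mers per strand (the two frames), so a (CT)n repeat read from both strands shows up as four different 5-mers in a k-mer table.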



    • #3
      Here are two k-mer plots from before (bottom) and after (top) aggressive trimming with Trimmomatic (including a quality trim) and overlapping with FLASH (do your 150bp reads overlap?). The CCCCC repeat in the top plot is very much lower than the spikes you see in the bottom one. Here is the adapter file I went with too; as you can see, it was a bit of the kitchen sink.
      [Image: kmer_profiles.png]

      [Image: kmer_profiles_1.png]

      Adapters.txt



      • #4
        Hi Wallysb01,

        Thanks for your reply. I haven't explored overlapping the reads.
        I used your adapters, and it seems most of those k-mers are still having fun out there.
        Also, for your info, here is the Trimmomatic command I used:
        Code:
        java -classpath trimmomatic-0.30.jar org.usadellab.trimmomatic.TrimmomaticPE -threads 16 -phred33 ../lane2_NoIndex_L002_R1_001_val_1.fq ../lane2_NoIndex_L002_R2_001_val_2.fq paired21.fq unpaired21.fq paired22.fq unpaired22.fq ILLUMINACLIP:Adapters.txt:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:30 MINLEN:50
        Attached Files



        • #5
          Eek, OK. Did the frequency drop much? You can tell from the table under that figure.

          Also, how big are your inserts? Can you even attempt overlapping the reads?

          And you may just want to trim off those first 10bp for your next assembly. That may help.

          Finally, what kind of coverage do you have?



          • #6
            Nope, the frequency doesn't drop much. Reads are 150bp, and the insert size is 400bp for two lanes and 700bp for another two lanes, hence not much chance of overlaps.
            Yes, I did trim off 10bp in both directions and it's almost the same. It seems like I am running out of options.
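
The "not much chance of overlaps" reasoning in #6 is simple arithmetic, sketched here in Python. It assumes that "insert size" means the full fragment length without adapters and that both mates are full length; the function name is just for illustration.

```python
def expected_overlap(read_len, insert_size):
    """Bases shared by the two mates of a pair.

    A negative value is the size of the unsequenced gap between the mates.
    """
    return 2 * read_len - insert_size


for insert_size in (400, 700):
    print(insert_size, expected_overlap(150, insert_size))
```

With 2x150bp reads, the nominal 400bp library leaves a gap of 100bp between mates and the 700bp library a gap of 400bp, so direct overlapping only works if the real fragments are much shorter than the nominal size.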
            Attached Files



            • #7
              You might try COPE (http://sourceforge.net/projects/coperead/). It can overlap reads using k-mers, so reads don't have to actually overlap; they just need to be close enough for high-frequency k-mers to span the gap. It may work pretty well with 2x150bp reads, because you could push the k-mer size up a little, assuming your coverage is pretty high too. And you can use the 700bp insert library to add to the k-mer pool without attempting overlaps on it.

              With my shorter 170bp library, I found FLASH to work better, but that library really was that small, with very few fragments >190bp, so the k-mer method didn't seem to help much. And while your library may look like it's 400bp, I've generally found libraries to be shorter than what sequencing cores say.
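
For readers unfamiliar with what "overlapping" a pair means, here is a deliberately simplified Python sketch. It is not the algorithm used by FLASH or COPE: it only accepts exact matches, ignores base qualities, and the `min_overlap` default is an arbitrary illustrative value.

```python
def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def merge_pair(read1, read2, min_overlap=10):
    """Merge a read pair into one fragment if the mates overlap exactly.

    read2 is given as sequenced (from the opposite strand), so it is
    reverse-complemented first. Returns the merged sequence, or None if
    no exact overlap of at least min_overlap bases is found.
    """
    mate = revcomp(read2)
    # Try the longest possible overlap first.
    for olen in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1[-olen:] == mate[:olen]:
            return read1 + mate[olen:]
    return None


# Toy 40bp fragment sequenced as 2x25bp, so the mates share 10 bases.
fragment = "ACGTTGCAAGCTTAGGCTAACCGTATCGGATTCAGCTAGT"
r1 = fragment[:25]
r2 = revcomp(fragment[-25:])
print(merge_pair(r1, r2, min_overlap=5) == fragment)
```

Real tools tolerate mismatches and score candidate overlaps, which matters a lot at the error-prone 3' ends of reads.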



              • #8
                Thanks for your suggestions. It would be good to get longer reads through overlaps; however, I think I need to get rid of those funny k-mers first, don't I? I can't find a way to deal with that. Once I have quality data I can move on to the next step.



                • #9
                  Those repetitive k-mers are really just dinucleotide repeats, so at k-mer lengths around 21, as used for error correction and overlapping, they may not be a huge obstacle.

                  In fact, you could raise the k-mer length to 10bp in FastQC to see whether these sequences continue to be a problem. It may be that certain reads are just filled with them, in which case they could be removed with very strict DUST filtering. Say, you remove reads with a DUST score above 30? There is really no reason to try to keep reads made up of so much very low-complexity sequence. Ideally, of course, you'd want to assemble low-complexity sequence too, but in this case it may be an artifact and causing more problems than it's worth.

                  Prinseq can do dust filtering, if you want to give it a shot. And it will separate out the good and bad seqs for inspection.

                  After playing with prinseq, you might actually want to drop that score a little lower, 20-ish?
                  Last edited by Wallysb01; 08-01-2013, 10:27 PM.
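
As a rough illustration of the DUST idea in #9, here is a minimal Python sketch of a triplet-based complexity score. It is not PRINSEQ's implementation: it scores the whole read in one go rather than in windows, and the 0-100 scaling (homopolymer = 100) is an assumption chosen so that thresholds like 20 or 30 are meaningful. The function names are hypothetical.

```python
from collections import Counter


def dust_score(seq):
    """Simplified DUST-like score: 0 (all triplets unique) to 100 (homopolymer)."""
    seq = seq.upper()
    n = len(seq) - 2  # number of overlapping triplets
    if n < 2:
        return 0.0
    counts = Counter(seq[i:i + 3] for i in range(n))
    # Count ordered pairs of identical triplets, then normalise by the maximum.
    repeated_pairs = sum(c * (c - 1) for c in counts.values())
    return 100.0 * repeated_pairs / (n * (n - 1))


def split_by_dust(reads, threshold=30.0):
    """Separate reads into (kept, rejected) by the simplified score."""
    kept, rejected = [], []
    for read in reads:
        (rejected if dust_score(read) > threshold else kept).append(read)
    return kept, rejected


# Toy reads: a dinucleotide repeat and a short non-repetitive sequence.
kept, rejected = split_by_dust(["CT" * 25, "ACGTTGCA"])
print(len(kept), len(rejected))
```

Under this scaling a pure dinucleotide repeat scores just under 50, so a threshold of 30, or even 20, would reject reads dominated by such repeats while leaving ordinary sequence alone.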
