  • MiSeq gDNA reads still fail "Kmer content" and "per base seq content" after trimming

    I have genomic DNA that was PE sequenced on the MiSeq platform. I understand there must have been some adapter read-through because of the long read lengths. Even after trimming, I still get some enriched kmers and skewed GC content at both ends of both reads in each pair. Here are some Kmer content graphs: [images not preserved]

    Here are some examples of per base GC content: [images not preserved]


    I ran trimmomatic with
    PE -phred33 ILLUMINACLIP:TruSeq2-PE.fa:2:20:7:2 LEADING:13 TRAILING:13 SLIDINGWINDOW:4:15 MINLEN:36

    My adapter file
    $ cat TruSeq2-PE.fa
    >PrefixPE/1
    AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
    >PrefixPE/2
    CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
    >PCR_Primer1
    AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
    >PCR_Primer1_rc
    AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
    >PCR_Primer2
    CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
    >PCR_Primer2_rc
    AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG
    >FlowCell1
    TTTTTTTTTTAATGATACGGCGACCACCGAGATCTACAC
    >FlowCell2
    TTTTTTTTTTCAAGCAGAAGACGGCATACGA
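
    A quick sanity check on an adapter file like the one above is to confirm that each `_rc` entry really is the reverse complement of its partner. The sketch below is illustrative Python and not part of Trimmomatic; `parse_fasta` and `check_rc_pairs` are hypothetical helper names, and only the two PCR primer pairs from the file are included.

```python
# Sketch: verify that every "<name>_rc" record is the reverse complement
# of "<name>". Helper names are made up for illustration.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def parse_fasta(text):
    """Parse FASTA text into a {name: sequence} dict."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records

def check_rc_pairs(records, suffix="_rc"):
    """Map each name that has a '<name>_rc' partner to True/False."""
    return {
        name: reverse_complement(seq) == records[name + suffix]
        for name, seq in records.items()
        if name + suffix in records
    }

ADAPTERS = """\
>PCR_Primer1
AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
>PCR_Primer1_rc
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
>PCR_Primer2
CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT
>PCR_Primer2_rc
AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG
"""

if __name__ == "__main__":
    for name, ok in check_rc_pairs(parse_fasta(ADAPTERS)).items():
        print(name, "OK" if ok else "MISMATCH")
```

    Both `_rc` entries start with AGATCGGAAGAGC, which is the sequence that shows up at the 3' end of a read when it runs through a short insert into the adapter.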

  • #2
    First, do you know what kind of library prep was used? If it was Nextera, that would explain the biased sequence near the beginning, and also why some adapters are not being removed, since you're trimming for TruSeq sequences. But if it was in fact TruSeq, then I'm not really sure about the biased composition near the beginning.

    Unfortunately, because of the way FastQC compresses the base positions after base 9, it's impossible to get a good idea of what's going on at the end of the read from those graphs. But note that typical adapter trimming will not remove adapter fragments shorter than X bp at the very end of a read, because a fragment that short cannot be matched to the adapter sequence confidently (X is usually a parameter). However, BBDuk can still remove those very short adapter sequences from PE reads by overlapping the two reads to determine the insert size, so you might give that a try; just use the "tbo" flag.
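
    The overlap idea can be sketched as follows. This is a minimal Python illustration of the principle only, not BBDuk's actual implementation: when the insert is shorter than the read length, the first n bases of read 1 should equal the reverse complement of the first n bases of read 2, and everything past position n is adapter, however short. The function names, the `min_insert` cutoff, the mismatch threshold, and the toy sequences are all assumptions.

```python
# Sketch: infer a short insert from a read pair and trim both reads to it.
# Not BBDuk's algorithm; thresholds and names are illustrative assumptions.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def infer_short_insert(r1, r2, min_insert=8, max_mismatch_rate=0.1):
    """Return the insert length n (n <= read length) at which r1[:n]
    matches the reverse complement of r2[:n], or None if no such n exists.
    Longer candidates are tried first."""
    for n in range(min(len(r1), len(r2)), min_insert - 1, -1):
        mismatches = hamming(r1[:n], reverse_complement(r2[:n]))
        if mismatches <= max_mismatch_rate * n:
            return n
    return None

def trim_by_overlap(r1, r2, **kwargs):
    """Trim both reads to the inferred insert length; leave them unchanged
    if no short insert is detected."""
    n = infer_short_insert(r1, r2, **kwargs)
    if n is None:
        return r1, r2
    return r1[:n], r2[:n]

if __name__ == "__main__":
    # Toy pair: a 24 bp insert read with 28 bp reads, so each read ends in
    # 4 bp of adapter ("AGAT"), too short for sequence matching alone.
    insert = "ACGTTGCAAGGCTTAACGGATCCA"
    r1 = insert + "AGAT"
    r2 = reverse_complement(insert) + "AGAT"
    print(trim_by_overlap(r1, r2))
```

    In this toy pair only 4 bp of adapter are present, which an adapter-sequence match with a sensible minimum length would skip, but the overlap between the two reads still pins down where the insert ends.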


    • #3
      Just trim off the ends.
      That is probably less of a headache than trying to figure out the problem.

      For the high GC at the end: it seems that, in general, the longer reads have a higher chance of ending in GC rather than AT.
      So if your reads are of unequal length, you'll just get an increase in GC content at the end, because the AT-rich ends are more likely to have been trimmed off.
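
      One way to check whether the rise in GC at the 3' end coincides with fewer, longer reads is to tabulate GC fraction per position together with the number of reads that reach that position. The sketch below is a plain-Python illustration, not how FastQC computes its plot; `gc_profile` is a hypothetical helper and the three reads are made-up toy data.

```python
# Sketch: per-position GC fraction plus read depth for reads of unequal
# length. Function name and example reads are illustrative assumptions.
def gc_profile(reads):
    """Return a list of (depth, gc_fraction) tuples, one per read position.
    depth is the number of reads long enough to cover that position."""
    max_len = max((len(r) for r in reads), default=0)
    gc = [0] * max_len
    depth = [0] * max_len
    for read in reads:
        for i, base in enumerate(read.upper()):
            depth[i] += 1
            if base in "GC":
                gc[i] += 1
    return [(d, g / d) for g, d in zip(gc, depth)]

if __name__ == "__main__":
    reads = ["ACGT", "GG", "GCGTAC"]
    for pos, (d, frac) in enumerate(gc_profile(reads), start=1):
        print(f"pos {pos}: depth={d} gc={frac:.2f}")
```

      If the depth column drops sharply over the same positions where GC climbs, the skew is coming from a small subset of long reads rather than from the library as a whole.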


      • #4
        Originally posted by ysnapus
        I ran trimmomatic with
        PE -phred33 ILLUMINACLIP:TruSeq2-PE.fa:2:20:7:2 LEADING:13 TRAILING:13 SLIDINGWINDOW:4:15 MINLEN:36
        I agree with Brian. Are you sure it is a TruSeq2 library? We often see this kind of sequence content plot for Nextera libraries. In that case you should just use the NexteraPE-PE.fa adapter file.


        • #5
          Originally posted by avo
          I agree with Brian. Are you sure it is a TruSeq2 library? We often see this kind of sequence content plot for Nextera libraries. In that case you should just use the NexteraPE-PE.fa adapter file.
          It definitely looks like a TruSeq (or other mechanically fragmented) library to me. Nextera (tagmentase-fragmented) libraries have a very distinct and more exaggerated base composition bias at the 5' end. TruSeq and other libraries in which the input DNA is fragmented in a Covaris still show a slight bias in their 5' base composition, because base composition influences fragmentation sensitivity.
