  • 5' Kmers problem with whole genome re-sequencing data

    Hi all,

    I am having trouble getting rid of overrepresented kmers in my sequencing data sets. We re-sequenced several African buffalo genomes on an Illumina HiSeq X machine with 150 bp paired-end reads. I ran FastQC on the raw reads and all samples pass the adapter content and overrepresented sequences tests, but some fail the kmer content test (kmer length = 7 nucleotides). The failing samples have kmers overrepresented at either the 5' end or the 3' end. I can see that at least the 3' kmers are part of the Illumina adapters used (although FastQC does not pick these up in the adapter test), but the 5' kmers do not seem to be part of the adapters, as far as I can tell.
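
To make the idea of a position-specific kmer concrete, here is a minimal sketch that counts 7-mers by start position and reports an observed/expected ratio per position. It is a simplified illustration, not FastQC's actual statistical test; the function name `positional_kmer_enrichment` and the toy reads are assumptions made for this example.

```python
from collections import Counter, defaultdict

def positional_kmer_enrichment(reads, k=7):
    """Return (kmer, position, observed/expected) tuples, highest ratio first.

    'expected' assumes each kmer is spread evenly over all read positions,
    which is a simplification and not the exact test FastQC applies.
    """
    counts = defaultdict(Counter)   # counts[kmer][pos] = reads with kmer starting at pos
    windows_at = Counter()          # number of k-mer windows seen at each position
    for read in reads:
        for pos in range(len(read) - k + 1):
            counts[read[pos:pos + k]][pos] += 1
            windows_at[pos] += 1
    total_windows = sum(windows_at.values())

    results = []
    for kmer, by_pos in counts.items():
        total = sum(by_pos.values())
        for pos, observed in by_pos.items():
            expected = total * windows_at[pos] / total_windows
            results.append((kmer, pos, observed / expected))
    results.sort(key=lambda r: r[2], reverse=True)
    return results

# Toy data (hypothetical): two reads share a 5' motif, one read does not.
toy_reads = ["GATTACA" + "C" * 13, "GATTACA" + "C" * 13, "C" * 20]
enrichment = positional_kmer_enrichment(toy_reads, k=7)
for kmer, pos, ratio in enrichment[:3]:
    print(kmer, pos, round(ratio, 2))
```

A kmer that only ever starts at position 0 gets a high ratio there, which is the pattern the FastQC kmer plot flags at the 5' end.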

    Trimmomatic successfully removes the 3' kmers of one sample, but not the 5' kmers of another sample. The same goes for Trim Galore (cutadapt), which can only remove adapter sequences from the 3' end of the reads (as far as I can tell from the cutadapt documentation). The per-base quality is almost always above Phred 20. These are the relevant parts of the commands I used for Trimmomatic and Trim Galore:

    Trimmomatic: ILLUMINACLIP:/apps/chpc/bio/trimmomatic/0.36/bin/adapters/TruSeq3-PE-2.fa:2:30:10:1:true SLIDINGWINDOW:4:20 MINLEN:36

    Trim Galore: trim_galore -q 20 --phred33 --stringency 1 -e 0.1 --length 36 -o /mnt/lustre/users/djager/trim_galore_out --paired --retain_unpaired -r1 37 -r2 37

    Is there any way I can remove the 5' kmers without having to trim the 5' nucleotides of all the reads? Are these 5' kmers perhaps biological sequences?
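
One quick way to check whether a flagged kmer could come from an adapter is to search for it, and for its reverse complement, in the adapter sequences. The sketch below assumes the adapters are supplied as a name-to-sequence dictionary; the 13 bp sequence `AGATCGGAAGAGC` is the common Illumina adapter prefix, while the function names and the query kmers are made up for illustration. In practice the dictionary would be filled from the adapter FASTA passed to the trimmer.

```python
def reverse_complement(seq):
    """Reverse complement of an upper-case DNA string (A, C, G, T, N)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def kmer_in_adapters(kmer, adapters):
    """Return the names of adapters that contain kmer or its reverse complement."""
    kmer = kmer.upper()
    rc = reverse_complement(kmer)
    return [name for name, seq in adapters.items()
            if kmer in seq.upper() or rc in seq.upper()]

# Illustrative adapter set; replace with the sequences actually used for the library.
adapters = {"illumina_universal_prefix": "AGATCGGAAGAGC"}

hits_forward = kmer_in_adapters("GATCGGA", adapters)   # substring of the adapter
hits_revcomp = kmer_in_adapters("TCCGATC", adapters)   # reverse complement matches
hits_none = kmer_in_adapters("GATTACA", adapters)      # unrelated kmer
print(hits_forward, hits_revcomp, hits_none)
```

A kmer with no hit in any adapter, as reported above for the 5' kmers, is a candidate for being genomic rather than synthetic.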

    The FastQC results are attached. The examples are all for the forward reads.

  • #2
    5' is most likely biological, a known artifact of fragmentation site bias and should be ignored. Nextera libraries in particular demonstrate this bias very strongly but it does not seem to have much impact downstream, and trimming it won't accomplish anything.

    • #3
      Originally posted by Brian Bushnell:
      5' is most likely biological, a known artifact of fragmentation site bias and should be ignored. Nextera libraries in particular demonstrate this bias very strongly but it does not seem to have much impact downstream, and trimming it won't accomplish anything.
      Thanks for the reply, Brian! I remember reading something about this fragmentation bias, now that you mention it, but I wasn't too sure if it applied to TruSeq as well. So I thought I'd rather be as thorough as possible and get some thoughts from the community before moving onto the next step.

      • #4
        With TruSeq the shearing and adapters are independent, so I suppose it depends on your shearing process. I don't have any data for this but I would expect any enzymatic process to incur bias but sonication much less so. You can always test whether the uneven base composition is genomic or synthetic by mapping and calculating the mismatch rate by read position. If it is no higher at the spiky 3' end then the spikes are genomic and just caused by bias. If there is a high mismatch rate they are probably synthetic and should be trimmed, but I have never seen that happen in a normal fragment library.
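
The mismatch-rate-by-read-position test described above can be sketched in plain Python from SAM text records. This is a rough illustration, not the BBTools implementation: it assumes the aligner wrote `MD:Z:` tags (if not, `samtools calmd` can add them), it only looks at primary alignments, and the helper `_toy_record` plus the toy records are invented for the example. Reverse-strand reads are flipped so that position 0 is always the first sequenced base.

```python
import re
from collections import Counter

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
MD_RE = re.compile(r"(\d+)|(\^[A-Za-z]+)|([A-Za-z])")

def aligned_read_positions(cigar):
    """0-based read offsets of bases aligned to the reference (M, =, X)."""
    positions, offset = [], 0
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op in "M=X":
            positions.extend(range(offset, offset + length))
            offset += length
        elif op in "IS":            # insertions and soft clips consume read bases only
            offset += length
    return positions

def mismatch_rate_by_cycle(sam_lines):
    """Return {sequencing cycle: mismatch fraction} over primary alignments."""
    aligned, mismatched = Counter(), Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag, cigar, seq = int(fields[1]), fields[5], fields[9]
        md = next((f[5:] for f in fields[11:] if f.startswith("MD:Z:")), None)
        # Skip unmapped, secondary and supplementary records, or records without MD.
        if flag & (4 | 256 | 2048) or cigar == "*" or md is None:
            continue
        positions = aligned_read_positions(cigar)
        idx = 0
        for num, deletion, sub in MD_RE.findall(md):
            if deletion:            # deleted reference bases have no read position
                continue
            span = int(num) if num else 1
            for pos in positions[idx:idx + span]:
                cycle = len(seq) - 1 - pos if flag & 16 else pos
                aligned[cycle] += 1
                if sub:
                    mismatched[cycle] += 1
            idx += span
    return {c: mismatched[c] / aligned[c] for c in sorted(aligned)}

def _toy_record(flag, cigar, md):
    # Hypothetical helper: builds a minimal SAM line for a 10 bp read.
    return "\t".join(["read", str(flag), "chr1", "100", "60", cigar, "*",
                      "0", "0", "ACGTACGTAC", "IIIIIIIIII", "MD:Z:" + md])

toy_sam = [
    _toy_record(0, "10M", "0A9"),    # forward read, mismatch at first base
    _toy_record(16, "10M", "9C0"),   # reverse read, mismatch at its first sequenced base
    _toy_record(0, "10M", "10"),     # perfect match
    _toy_record(0, "2S8M", "8"),     # soft-clipped start, no mismatches
]
rates = mismatch_rate_by_cycle(toy_sam)
print(rates)
```

On real data the function could be fed the lines produced by `samtools view` on the sorted BAM; a profile that is no higher at the spiky end than elsewhere points to genomic sequence.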

        • #5
          Originally posted by Brian Bushnell:
          With TruSeq the shearing and adapters are independent, so I suppose it depends on your shearing process. I don't have any data for this but I would expect any enzymatic process to incur bias but sonication much less so. You can always test whether the uneven base composition is genomic or synthetic by mapping and calculating the mismatch rate by read position. If it is no higher at the spiky 3' end then the spikes are genomic and just caused by bias. If there is a high mismatch rate they are probably synthetic and should be trimmed, but I have never seen that happen in a normal fragment library.
          Thanks for this suggestion. I mapped the sample with kmers at the 5' end (sample 2 in graphs of first post = sample B98_289 in histograms below) using bwa, and also mapped a sample with no overrepresented kmers (B98_579 in histograms below) and then calculated the mismatch (or actually the match) rate per read position using the mhist function in bbduk.

          Code:
          ~/programs/bbmap/bbmap/bbduk.sh in=B98_289_DSW37627_HJMJMALXX_L2_clean_aln-PE_sorted.bam mhist=mhist.txt qhist=qhist.txt out=bbdukhist.txt
          Assuming that this was the correct way to look at the mismatch rate per read position, it would seem that the spikes are genomic and just caused by a bias of some sort, because the distribution looks the same as in the sample without overrepresented kmers (B98_579). The reduced match rate at the ends of the reads therefore seems to be normal, and it is still high (>96%).

          • #6
            Originally posted by Brian Bushnell:
            You can always test whether the uneven base composition is genomic or synthetic by mapping and calculating the mismatch rate by read position. If it is no higher at the spiky 3' end then the spikes are genomic and just caused by bias. If there is a high mismatch rate they are probably synthetic and should be trimmed, but I have never seen that happen in a normal fragment library.
            Hi again Brian,

            I calculated the mismatch (or actually the match) rate from the .bam file using the mhist function in bbduk (mapping was done with bwa and of course the .bam files were produced with samtools).

            Code:
            ~/programs/bbmap/bbmap/bbduk.sh in=B98_289_DSW37627_HJMJMALXX_L2_clean_aln-PE_sorted.bam mhist=mhist.txt qhist=qhist.txt out=bbdukhist.txt
            For the sample with overrepresented kmers at the 5' end (sample2 in fastqc plots above = sample B98_289 in histograms below) there is a slight decrease in the match rate for the 5' and 3' ends of the reads, compared to the middle of the reads. However, this is not different from a sample that had no overrepresented kmers anywhere (sample B98_579). Therefore, it seems that the overrepresented kmers are genomic and not synthetic. Would you agree?
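
The comparison described above can also be put into numbers. The sketch below takes two per-position match-rate profiles (one value per read position, for example read out of the two match histograms) and summarises how far they differ overall and at each end. The function name, the `end_window` parameter and the toy values are assumptions for illustration only; they are not the real values from the two samples.

```python
def compare_match_profiles(profile_a, profile_b, end_window=10):
    """Summarise the absolute per-position difference between two match-rate profiles."""
    n = min(len(profile_a), len(profile_b))
    if n == 0:
        raise ValueError("both profiles must contain at least one position")
    diffs = [abs(a - b) for a, b in zip(profile_a[:n], profile_b[:n])]

    def mean(values):
        return sum(values) / len(values)

    return {
        "max_abs_diff": max(diffs),
        "mean_diff_5prime": mean(diffs[:end_window]),    # first positions of the read
        "mean_diff_3prime": mean(diffs[-end_window:]),   # last positions of the read
    }

# Toy profiles (hypothetical values), four read positions each.
sample_with_kmers = [0.960, 0.990, 0.990, 0.970]
sample_without_kmers = [0.965, 0.990, 0.990, 0.970]
summary = compare_match_profiles(sample_with_kmers, sample_without_kmers, end_window=1)
print(summary)
```

Differences close to zero at the 5' end would support the reading that the kmer spikes are not accompanied by extra mismatches.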

            • #7
              Originally posted by DeondeJager:
              For the sample with overrepresented kmers at the 5' end (sample2 in fastqc plots above = sample B98_289 in histograms below) there is a slight decrease in the match rate for the 5' and 3' ends of the reads, compared to the middle of the reads. However, this is not different from a sample that had no overrepresented kmers anywhere (sample B98_579). Therefore, it seems that the overrepresented kmers are genomic and not synthetic. Would you agree?
              Yes, it sounds like it's probably fine to me... no trimming needed.
