Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • I'd like some guidance on bbduk.sh parameters for trimming and filtering raw reads that would fit best for meeting the following criterion. I'm dealing with PE 150bp raw reads and bbduk.sh version 38.84

    1) Discard a read pair if either one read contains adapter contamination;
    2) Discard a read pair if more than 10% of bases are uncertain in either one read;
    3) Discard a read pair if the proportion of low quality bases is over 50% in either one read.​

    From my understanding of the parameters, point 2 could be met by setting maxns=15. I am not sure on what paramters to use for points 1 and 3. Any help would be much appreciated.
    Last edited by brewseeker; 06-22-2023, 06:09 AM.

    Comment


    • Hi Brian,
      This question is regarding the parameter "mincovfraction or mcf" which is available in BBDuk but I notice that it is not available in Seal. Am I missing anything or was this a deliberate choice, perhaps due to computational considerations. Is there a way around? I know I could use "minkmerfraction" but I am a bit worried about missing a few corner cases. Any suggestions to accomplish this functionality using Seal would be very helpful.
      Thanks in advance.

      Comment


      • Hi Brian,

        Thanks for the super nice tool! I have been doing some mock tests to try to understand a bit better how BBDuk works, and found the results of one of them to be somewhat counterintuitive. In short, I was trying to right-trim a single adapter of length 19 from the same FASTQ file (coming from a 100bp single-read experiment), using two different values of kmin (10/19), while allowing a hamming distance of 1, and filtering out reads shorter than 50 bases after trimming. The exact commands that I ran were:

        Code:
        bbduk.sh in=SomeSample.fastq.gz out=SomeSample.Clean.10.fastq.gz ref=adapters.fa ktrim=r k=19 mink=10 hdist=1 minlength=50 ordered=t
        bbduk.sh in=SomeSample.fastq.gz out=SomeSample.Clean.19.fastq.gz ref=adapters.fa ktrim=r k=19 mink=19 hdist=1 minlength=50 ordered=t​
        ​
        where adapters.fa contains only my 19 base-long adapter. And here is the relevant output for both commands:

        A) For kmin=10:

        Added 832 kmers
        Input: 24597549 reads 2459754900 bases.
        KTrimmed: 792343 reads (3.22%) 14335014 bases (0.58%)
        Total Removed: 2618 reads (0.01%) 14335014 bases (0.58%)
        Result: 24594931 reads (99.99%) 2445419886 bases (99.42%)
        B) For kmin=19:
        Added 55 kmers
        Input: 24597549 reads 2459754900 bases.
        KTrimmed: 283169 reads (1.15%) 7480620 bases (0.30%)
        Total Removed: 2620 reads (0.01%) 7480620 bases (0.30%)
        Result: 24594929 reads (99.99%) 2452274280 bases (99.70%)​
        As expected, trimming with kmin=10 led to a larger number of reads being trimmed, due to the less stringent conditions imposed at the 3' end of the reads. What I found counterintuitive was the fact that trimming with kmin=19 actually resulted in 2 extra reads being discarded after trimming.

        Given that I am forcing right-trimming and the adapter sequence is only 19 bases long, while my input reads are all 100 bases long, I would have expected the same number of reads being discarded in both cases (or, if anything, more reads being discarded for kmin=10). My rationale for this is that, since k=19 in both cases, trimming at the 3' end of a read would in the worst case result in an 81 base-long trimmed read (regardless of kmin being 10, 19, or any other value lower than 19). Therefore, only trimming in the middle of a read (or, in the most extreme situation, trimming the entire read due to a kmer match at the 5' end) should potentially lead to a read being shorter than 50 bases (and thus discarded) after trimming. However, since again k=19 in both cases, any trimming happening in the middle of the reads should also be identical in both cases, whereas any potential 5' trimming should in fact be more aggressive in the case of kmin=10, correct?

        Is there something I am missing, specifically on how ktrim=r works? I would really appreciate it if you (or any of the good samaritans inhabiting this forum) could help me understand these results.

        Thanks a lot in advance for any insights you can provide, and thanks again for all your efforts in developing these awesome tools!

        Cheers,

        -Juan







        Last edited by razorofockham; 03-21-2024, 02:10 PM.

        Comment

        Latest Articles

        Collapse

        • seqadmin
          Essential Discoveries and Tools in Epitranscriptomics
          by seqadmin




          The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
          Yesterday, 07:01 AM
        • seqadmin
          Current Approaches to Protein Sequencing
          by seqadmin


          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
          04-04-2024, 04:25 PM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by seqadmin, 04-11-2024, 12:08 PM
        0 responses
        57 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 10:19 PM
        0 responses
        53 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 09:21 AM
        0 responses
        45 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-04-2024, 09:00 AM
        0 responses
        55 views
        0 likes
        Last Post seqadmin  
        Working...
        X