Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    To expand upon that, BBMap has the options for internally quality-trimming (and untrimming) reads, but I only use them in special cases. For example, BBMap is used a lot for filtering out contamination. To ensure high specificity, I only want to remove reads that map to contaminant sequences with high identity; but I don't want the identity calculation to include low-quality bases. So, I map with "qtrim=rl trimq=15 untrim" or similar, which will trim before mapping but then output the untrimmed read.

    For normal cases such as variant calling or coverage calculation, don't quality-trim unless you have really low quality data. I would not expect that to be the case for a 100bp library.

    Comment


    • #17
      Aah, I see. Thanks to both of you. Would these flags work with BBDuk then?

      Data quality is decent. I will map to a de novo metagenome assembly generated from the same data. I am kind of hopeful to be able to do a little variant calling after metagenome binning and mapping. Given this situation, would you still rather not quality trim?

      Thanks

      Comment


      • #18
        Those flags work in BBDuk as well, except for "untrim", which is unique to BBMap.

        A good variant caller will take the quality into account when weighing possible variations. Quality-trimming before mapping gives the mapper less information, and thus a higher probability of ambiguously-mapped or incorrectly-mapped reads (though adapter-trimming before mapping is always good). Even if the trimmed bases are low quality - say, 50 bases at Q13 - that's still an expected 45 matches and 5 mismatches, which can greatly increase mapping confidence in a repetitive area, and thus improve the ability to call the variations from the high-quality part of the read.

        On the other hand, quality-trimming before assembly is often a good idea, though the exact threshold depends on the assembler and dataset.

        Comment


        • #19
          Is there a way of efficiently using the BBMap tools to cut variable length and variable sequence primer pairs from amplicon sequencing data either pre-alignment (fastq) or post-alignment (BAM)? I am thinking in regards to RainDance and Fluidigm data where there are hundreds of amplicons, many of which overlap, each with different forward/reverse primers.

          Comment


          • #20
            BBTools has is a pair of programs for cutting (and outputting to a file) the sequence between primer pairs, but not for cutting out primers themselves. Is that what you are after?

            The usage is like this:

            Assume your set of primers for one end is AAAAAA and AAAAAT, and for the other end is GGGGGG and GGGCCC:

            msa.sh in=amplicons.fq out=primer1.sam literal=AAAAAA,AAAAAT
            msa.sh in=amplicons.fq out=primer2.sam literal=GGGGGG,GGGCCC


            That will find the optimal alignment for the optimal primer, and output one line per amplicon (twice). Then run this:

            cutprimers.sh in=amplicons.fq out=middle.fq sam1=primer1.sam sam2=primer2.sam

            "middle.fq" will contain the sequence between the primers, one per amplicon, using the best alignment. I designed this to cut V4 regions from full-length 16s.

            Notes: msa.sh stands for MultiStateAligner, not for Multiple Sequence Alignment. All sequences can be in any orientation (forward or reverse complement).

            Please let me know if that doesn't address your question; I may have misunderstood it.

            Comment


            • #21
              I think @duane.hassane is asking to remove primers from amplicons with a common core sequence but with variable (length and/or sequence?) primers on the ends of that core sequence for each group.

              Comment


              • #22
                Thanks. To be more clear, I am after removing the primers themselves (either from the fastq or BAM) and preserving the intervening sequence only. In this use case, there are hundreds of different primer pairs with known mapping locations (because they are used for targeted enrichment of different exons). Thus, for each primer pair, the intervening sequence (target being enriched) is different.

                Comment


                • #23
                  Have you tried to remove these primer sequences with bbduk?

                  Comment


                  • #24
                    Yes, bbduk with restrictright and/or restrictleft set to constrain the search to the ends where primers are located. This is actually extremely fast and efficient even with hundreds of primers. Unfortunately, we have encountered some specificity issues with this approach (over- and under-trimming) leading to small gaps in coverage ... with no one set of parameters being optimal for all amplicons.

                    Comment


                    • #25
                      Hmm, I don't currently have a better solution than BBDuk or the msa/cutprimers pair. You might get a better result if you first bin the amplicons by which primers they use, though, so that then you can trim them only using the correct primers using aggressive settings, rather than with all possible primers. You can try using Seal for that. For example:

                      seal.sh in=amplicons.fq ref=primer1.fa pattern=out%.fq k=(whatever)

                      That will produce one file per sequence in the primer file, with all the reads that best match it (meaning, share the most kmers). For that purpose you should only use the left primers or right primers as reference, not both at the same time. Or, ideally, the reference would look like this:

                      Primer pairs: {(AAAT, GGCC), (TTTT, GGGG)}

                      Reference file:

                      >1
                      AAATNGGCC
                      >2
                      TTTTNGGGG

                      ...etc, where each primer pair is concatenated together with a single N between the sequences.
                      Last edited by Brian Bushnell; 06-11-2015, 11:17 AM.

                      Comment


                      • #26
                        Ok. Will take a look at this approach. Thank you!

                        Comment


                        • #27
                          Just a comment, not sure if this is the place:
                          It would be helpful for downstream analysis if the refstat files that can be output using bbsplit produced 0s (zeros) for references that have no reads mapped to them instead of omitting data for these references. I was working on generating a table for visualization when I noticed missing data.

                          Comment


                          • #28
                            Zero-count lines are suppressed by default, but they should be printed if you include the flag "nzo=f" (nonzeroonly=false). That's not documented in BBSplit's shellscript, just BBMap's; I'll add it to the documentation.

                            Comment


                            • #29
                              Wonderful, thanks!

                              Comment


                              • #30
                                I have been using bbmap (and other bbtools) for some of my analyses, and so far they are a great set of tools! Unfortunately I have come across a problem, and some preliminary searching hasn't turned up an answer.

                                I am trying to build bbmap into a galaxy workflow, using it to remove reads that map to a reference of a contaminant (E.coli in this case). The problem is that I just want the unmapped reads in fastq format, but because galaxy names everything with the .dat extension, I cannot tell bbmap to output in fastq format and it defaults to sam. Is there another way to specify the output format?

                                I'm sure I could convert the sam back to fastq, but the picard tools sam to fastq tool is throwing an error saying that there are unpaired reads in the sam. I know I can probably do this another way, but it seems silly to use another tool when bbmap can already output in the format I want!

                                I'm also confused as to why there would be unpaired reads, since the bbmap documentation for "outu" says:
                                Write only unmapped reads to this file. Does not include unmapped paired reads with a mapped mate.
                                Sounds to me like all the reads should be paired?


                                Thank you in advance, and thanks for the great tools!

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM
                                • seqadmin
                                  The Impact of AI in Genomic Medicine
                                  by seqadmin



                                  Artificial intelligence (AI) has evolved from a futuristic vision to a mainstream technology, highlighted by the introduction of tools like OpenAI's ChatGPT and Google's Gemini. In recent years, AI has become increasingly integrated into the field of genomics. This integration has enabled new scientific discoveries while simultaneously raising important ethical questions1. Interviews with two researchers at the center of this intersection provide insightful perspectives into...
                                  02-26-2024, 02:07 PM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 03-14-2024, 06:13 AM
                                0 responses
                                32 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-08-2024, 08:03 AM
                                0 responses
                                71 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-07-2024, 08:13 AM
                                0 responses
                                80 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-06-2024, 09:51 AM
                                0 responses
                                68 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X