Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #76
    Originally posted by Brian Bushnell View Post
    Hmmm. Odd; I assume the base frequency drift is some kind of machine issue, and the quality scores of those bases (at the tail of read 2) are inaccurate. But I can't really explain the seeming discrepancy between the bhist and mhist; after quality-trimming, I would expect the base frequencies and mismatches to both flatten out, not just one of them. Unless the reads with the divergent base frequencies are not mapping because (for example) they are something synthetic rather than genomic, though synthetic contaminant molecules don't tend to be so long.
    Perhaps I should generate the SAM file from bbmap and then generate the bhist from just the mapped reads in the SAM? Can a bbmap tool generate the bhist from mapped reads of a SAMfile?

    Also, the mhist isn't completely flat, it slopes downwards at the ends. Though it looks flatter than the bhist.

    Originally posted by Brian Bushnell View Post
    Overall, I'm not certain what's going on, or how to fix it, short of forcibly trimming the reads to ~280bp and ~220bp, which may not be necessary or even beneficial.
    I could do that, though I guess I'll have to manually set the forcetrimright and forcetrimleft for each file.

    Comment


    • #77
      Originally posted by ysnapus View Post
      Perhaps I should generate the SAM file from bbmap and then generate the bhist from just the mapped reads in the SAM? Can a bbmap tool generate the bhist from mapped reads of a SAMfile?
      BBMap can output mapped reads in fastq, which will make it a little easier to keep the pairs straight, then you can calculate the bhist with reformat:

      bbmap.sh in=reads.fq outm=mapped.fq outu=unmapped.fq po

      You can use the "outm2" and "outu2" flags if the reads are in two files.
      The "po" (pairedonly) flag will make reads that did not map paired go to the "outu" stream.

      Also, the mhist isn't completely flat, it slopes downwards at the ends. Though it looks flatter than the bhist.
      Exactly - I would expect the tail of the mhist plot to deflect at least as much as the bhist plot, if the bases with divergent frequencies are being mapped and are incorrect.

      I could do that, though I guess I'll have to manually set the forcetrimright and forcetrimleft for each file.
      Depending on your goal, it may not be necessary, even though the graphs look ugly. Most programs can tolerate some degree of error; trimming becomes more important when that error is biased such that it can change a quantitative result, or the amount of error makes you run out of memory, etc. Often the best approach is to try both ways and see if it made a positive impact, though obviously, that's labor-intensive.
      Last edited by Brian Bushnell; 12-01-2014, 06:03 PM.

      Comment


      • #78
        error in trimming

        Hi, I recently got RNA-seq fastq files generated from Ion Torrent.

        I tried using the BBDuk program to trim the reads on quality, but I ran into this error and I don't know what I did wrong. I'm also very unfamiliar with coding in general.

        java -ea -Xmx1g -cp C:\bbmap\current\ jgi.BBDukF in=.\reads.fastq out=Q10_reads.fastq qtrim=rl trimq=10 minlen=20

        Exception in thread "main" java.lang.NullPointerException
        at jgi.BBDukF.spawnProcessThreads(BBDukF.java:1414)
        at jgi.BBDukF.process2(BBDukF.java:780)
        at jgi.BBDukF.process(BBDukF.java:716)
        at jgi.BBDukF.main(BBDukF.java:65)

        Thanks

        Comment


        • #79
          What version do you have? The current version does not even have anything on line 1414. If you download the current version it should work.

          An even easier solution is to use reformat instead, which has the same syntax, since you are not doing any kmer-based operations:

          java -ea -Xmx1g -cp C:\bbmap\current\ jgi.ReformatReads in=.\reads.fastq out=Q10_reads.fastq qtrim=rl trimq=10 minlen=20

          Comment


          • #80
            I have version 33.92. I'll download the newest version.

            But the Reformat worked! Thanks.

            Comment


            • #81
              Brian,

              Do you foresee making it possible to use kmers longer than 31bp at any point? Just curious.

              Also could you add the 'trd' parameter to seal so that when one enables the 'rename' option one gets those nicely trimmed reference names as in bbmap?

              Thanks!
              /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
              Salk Institute for Biological Studies, La Jolla, CA, USA */

              Comment


              • #82
                I am considering making an assembler based on BBDuk/Seal's data structures. When I do that I will probably support k>31, which will probably percolate back to BBDuk/Seal, but that won't be anytime soon.

                BBDuk does allow processing of k>31 in emulated mode when filtering; specifically, if you set k=35, for example, it would require 5 consecutive 31-mer hits to trigger a match. It's not quite the same, but it does greatly increase specificity. Seal lacks this ability because it would be kind of difficult to track the number of consecutive matches for a kmer hit to an arbitrary number of different references simultaneously.

                Seal and BBDuk - in the latest version, 34.08 - already support the trd flag, it's just undocumented. I will document it. Between 34.00 and 34.08 I unified a lot of the parsing so there are a lot of BBMap flags that are suddenly now supported by other tools.

                Comment


                • #83
                  Hi Brian,

                  I'm using bbduk to trim adapters from a microRNA seq run and in this case I only want to keep reads that were successfully trimmed. My command line is something like this:

                  Code:
                   bbduk.sh in=reads.fastq.gz out=trimmed.fq ref=mirna_adapter.fa ktrim=r k=20 mink=12 hdist=1 minlength=15
                  bbduk finishes successfully and reports:

                  Code:
                  Input:                  	22241069 reads 		1134294519 bases.
                  KTrimmed:               	16273134 reads (73.17%) 	401433668 bases (35.39%)
                  Result:                 	21270190 reads (95.63%) 	732860851 bases (64.61%)
                  If I count the number of reads in my output file, trimmed.fq, I get 21270190, however according to bbduk only 16273134 reads were trimmed. What is going on here?
                  /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                  Salk Institute for Biological Studies, La Jolla, CA, USA */

                  Comment


                  • #84
                    22241069 reads were processed. Of those, 16273134 had adapter sequence detected, so they were trimmed. The output was 21270190 reads, which includes both trimmed reads (which had adapter detected) and untrimmed reads (which didn't). (22241069 - 21270190) = 970879 reads were discarded because after trimming, they were too short to use (default 10bp).

                    And that accounts for all the reads. Does that answer your question?

                    Comment


                    • #85
                      Yes. What option should I used so that I can get a FASTQ file with ONLY the reads that were trimmed?
                      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                      Salk Institute for Biological Studies, La Jolla, CA, USA */

                      Comment


                      • #86
                        Originally posted by sdriscoll View Post
                        Yes. What option should I used so that I can get a FASTQ file with ONLY the reads that were trimmed?
                        Oh, oops, I didn't read your question carefully enough. Assuming the reads were the same length going in, you can subsequently filter the output file to retain only reads shorter than the original length:

                        bbduk.sh in=trimmed.fq outm=filtered.fq minlength=100

                        You could also use the minlength and outm flags on your original command line instead. "outm" is a little confusing but it gathers all the reads that 'fail', meaning they either match something (presumed to be a contaminant), OR after trimming are shorter than minlength. "outu" (aka "out") gets everything else.

                        If your input reads were variable-length then this approach would not work, but you could still do it like this:

                        Code:
                        bbduk.sh in=reads.fastq.gz [COLOR="DarkRed"][B]mlf=1.0 outm[/B][/COLOR]=trimmed.fq ref=mirna_adapter.fa ktrim=r k=20 mink=12 hdist=1 minlength=15
                        ...where "mlf" or "minlengthfraction" means that reads will fail if they are trimmed to less than that fraction of their original length.

                        Comment


                        • #87
                          Now that I think about it, I guess the problem is that it is not clear what "out" and "outm" do for BBDuk.

                          out: Accepts all reads that pass criteria. If BBDuk is used for kmer filtering, that means all reads that do not have kmer matches to the reference. If BBDuk is used for trimming, that means all reads that, after trimming, are longer than a minimum length, whether or not they had any matching kmers.

                          outm: Accepts all reads that fail criteria and thus would not go to "out". "m" stands for "match", and when kmer filtering it accepts reads with matching kmers, but it also gets reads shorter than minlen (again, the default is 10, so you'd have to set it to 0 to disable this behavior).
                          Last edited by Brian Bushnell; 12-16-2014, 01:19 PM.

                          Comment


                          • #88
                            OK, thanks that's useful. I guess logically if I'm also specifying mink=12 and I only want reads that were trimmed I could also use maxlength=[read_length - mink] and then out would contain only reads reads that were trimmed. I didn't notice the mlf option...that should be handy as well.
                            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                            Salk Institute for Biological Studies, La Jolla, CA, USA */

                            Comment


                            • #89
                              For seal, would there be any logical way to add in number of unique hits to the rpkm= output? I'd also be interested to know what it does when a read has equally good matches to two or more target references.

                              Thanks!
                              /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                              Salk Institute for Biological Studies, La Jolla, CA, USA */

                              Comment


                              • #90
                                By unique, do you mean "only matching that sequence" or "occurring in a unique position in that sequence"? The latter cannot be done, but the former can be controlled with the "ambig" flag that I forgot to document. It is similar to BBMap's ambig flag. This is the description (which will be present in the next release):

                                Code:
                                ambiguous=random    (ambig) Set behavior on ambiguously-mapped reads (with an
                                                    equal number of kmer matches to multiple sequences).
                                                         first:  Use the first best-matching sequence.
                                                         toss:   Consider unmapped.
                                                         random: Select one best-matching sequence randomly.
                                                         all:    Use all best-matching sequences.
                                So, the default is random. With BBMap the default is "first".

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Current Approaches to Protein Sequencing
                                  by seqadmin


                                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                                  04-04-2024, 04:25 PM
                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 04-11-2024, 12:08 PM
                                0 responses
                                17 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 10:19 PM
                                0 responses
                                22 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-10-2024, 09:21 AM
                                0 responses
                                16 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 04-04-2024, 09:00 AM
                                0 responses
                                46 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X