Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #61
    By default BBDuk only tries to grab about 42% of physical memory. You can override the amount of RAM BBDuk uses with the -Xmx flag. And you can increase the efficiency of memory usage (as well as indexing speed) with the "prealloc" flag. For example:

    bbduk.sh -Xmx100g prealloc (other arguments)

    Note that you should not go all the way to 128g, though the exact amount you can use before disk-swapping or refusal to initialize depends on the system configuration. For our 128g machines, it's about 105g, but I think on most systems 120g should be fine.

    The memory usage is proportional to the reference, and specifically, "hdist=1" will increase the memory requirements by a factor of 3*K, which is pretty immense when you have a large reference. It roughly needs 15b per kmer at a minimum when you preallocate (up to 45 if you don't). You can also adjust memory usage by playing with the "skip" flag.

    In a little while I will release a new version that has some additional features for controlling memory use and speed.

    Comment


    • #62
      Oh yeah, sorry the -Xmx flag of course. I have a 3 year old and a 7 week old baby so I'm not all there all the time.

      Something that may be useful is a feature like that in some other tools which allows you to load a reference into memory and keep it there for subsequent runs or to allow multiple instances to access the same loaded reference in memory. That sometimes comes in handy with STAR, for example, which uses nearly 30BG for the human genome and sometimes that can take several minutes to load. As you know anything that takes more than a few seconds is practically intolerable these days.
      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
      Salk Institute for Biological Studies, La Jolla, CA, USA */

      Comment


      • #63
        BBMap has a wrapper, BBWrap, that allows you to load the index once and process multiple datasets with it. I could do something similar for BBDuk, though sharing it between independent processes would be much more difficult, if possible at all.

        Comment


        • #64
          I believe you can do it.

          Also I think you should call these things BBWeapons and not BBTools. They are much more like Weapons. Maybe one of my favorite tricks is this one:

          Code:
          samtools view -F 0x4 some_alignments.bam | \
            reformat.sh in=stdin.sam out=aligned_reads.fq.gz pigz=t
          That's got to be the smoothest way I've seen to translate aligned reads in a BAM file into FASTQ (or FASTA for that matter).
          /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
          Salk Institute for Biological Studies, La Jolla, CA, USA */

          Comment


          • #65
            It seems like using mm=t may be enough for me in this case. I assume that means I'll be missing potential mismatches within K/2-1 of the ends of reads though. I guess I can live with that.
            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
            Salk Institute for Biological Studies, La Jolla, CA, USA */

            Comment


            • #66
              Originally posted by sdriscoll View Post
              It seems like using mm=t may be enough for me in this case. I assume that means I'll be missing potential mismatches within K/2-1 of the ends of reads though. I guess I can live with that.
              Yes, that's true. Though depending on exactly what you are doing, you could split the reference into smaller pieces and run the reads through in multiple passes.

              I just released a new version of BBTools (33.95) which contains some features that you might find helpful. BBDuk now has a "speed" flag and "qskip" (query skip) flag. "speed" ranges from 0 to 15 (default 0), and ignores some fraction of the kmer space. So for example "speed=0" uses all kmers (highest sensitivity), while "speed=8" uses half of all kmers, and "speed=15" uses 15/16 of the kmers. The load time, run time, and memory use are all inversely proportional to the speed setting. "qskip=X" (default 1) uses only every Xth kmer in the read. This does not affect memory usage, but does affect speed. "skip" was renamed "rskip" (reference skip), and it affects which reference kmers are used, so it affects load time and memory usage, but not run time. Note that it is unwise to use more than one of speed, qskip, or rskip at one time. E.g. qskip=2 and rskip=2 you process every other kmer in the reference and in the read, so half of all reads will be out of phase with the reference and thus not match. Time and memory-wise, "speed=8" would be similar to "qskip=2 rskip=2" but without that disadvantage.

              Also, Seal is now finished and tested.

              Comment


              • #67
                Sorry, what is Seal?
                /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                Salk Institute for Biological Studies, La Jolla, CA, USA */

                Comment


                • #68
                  Seal is kind of a relative of BBDuk that can associate each kmer with multiple values rather than just one. So, it can disambiguate in cases when a kmer comes from multiple sequences or references. As such, you can use it to for example calculate FPKM values in RNA-seq or annotate reads/sequences with the names of all other sequences with which they share kmers, and how many are shared.

                  Comment


                  • #69
                    Oh that sounds fun.
                    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                    Salk Institute for Biological Studies, La Jolla, CA, USA */

                    Comment


                    • #70
                      Originally posted by Brian Bushnell View Post
                      Adapter contamination should be symmetric. Therefore, the mismatch rate caused by adapter sequences in read 1 should be the same as in read 2. But in your mhist graphs, read 2 has a much more dramatic increase than read 1 (and the base frequencies are much more clearly divergent) - this is not caused by adapters, but just a decline in quality, so quality trimming is probably a good idea; it would not surprise me if that greatly improved the base frequency divergence.

                      Judging visually, it seems like the draft genome is only about 97-98% identity to your organism, which makes it very noisy since most of the mismatches come from genomic difference rather than sequencing errors (or adapters). But certainly it would be good to trim the very last base (I'm guessing this is a 2x301bp run?) which you can do with the "ftr=299" flag, since it has a super-high 20%+ error rate.
                      I retried it with quality trimming (qtrim=t), removing the last base (ftr=299), and using both nextera and truseq sequences.

                      Code:
                      in1=../../raw-reads/MiSeq/JBad-Wichita2_140714_R1_001.fastq.gz in2=../../raw-reads/MiSeq/JBad-Wichita2_140714_R2_001.fastq.gz out1=Wichita2_140714_R1.fastq.gz out2=Wichita2_140714_R2.fastq.gz k=28 mink=13 hdist=1 qtrim=t ftr=299 ktrim=r ref=/myhome/software/bbmap/resources/truseq.fa.gz,/myhome/software/bbmap/resources/nextera.fa.gz bhist=bhist140714.txt stats=stats140714 tpe tbo
                      I still see a bias in the base composition towards the ends, and lower match rates in read 2.






                      Do you think there is still some contamination that remains in the reads? I can't of where the base composition bias might be coming from. Would fragmentation bias affect both ends? I need the reads for de novo assembly.

                      Comment


                      • #71
                        "qtrim=t" turns on quality-trimming, but does not actually specify the threshold; the default is very low, at 6. To see if quality-trimming helps with the base composition drift, I would suggest a higher threshold, for example, "qtrim=t trimq=20". That's probably too high for the actual assembly (where a value of around 10 is probably better), but it would be useful to run with aggressive trimming and generate the match histogram and base-frequency histograms, just to see if indeed the divergent and mismatching bases do indeed have lower quality. If so, then they are not caused by adapters.

                        By the way - I can't remember if you've already posted this, but generating the insert size histogram (ihist flag, in BBMap) tends to also be useful, as adapters should only appear on reads with insert size shorter than read length. Thus if you do not have a substantial population of reads shorter than insert size, adapters are not responsible for the issues at the read tails.

                        Comment


                        • #72
                          I tried it with more stringent quality trimming (trimq=20). First of all, in the bbduk.sh stderror output, it says that more than 100% reads were quality trimmed.
                          Code:
                          Input:                  	26109034 reads 		7577351149 bases.
                          QTrimmed:               	31595789 reads (121.01%) 	1299437454 bases (17.15%)
                          KTrimmed:               	119335 reads (0.46%) 	9161702 bases (0.12%)
                          Trimmed by overlap:     	2811266 reads (10.77%) 	34402173 bases (0.45%)
                          Result:                 	25717096 reads (98.50%) 	6234349820 bases (82.28%)
                          The stringent quality trimming seems to remove the read1 vs read2 mapping difference.


                          However, the base composition bias is still there at either end.


                          These are the insert size histograms before and after filtering (x-clipped to 1000 bases, but the tails continue to 5000 bases).



                          It seems there are a minority, but significant number of inserts shorter than the read lengths.

                          Comment


                          • #73
                            Originally posted by ysnapus View Post
                            I tried it with more stringent quality trimming (trimq=20). First of all, in the bbduk.sh stderror output, it says that more than 100% reads were quality trimmed.
                            Thanks for noting that I'm counting the number of reads in the numerator, and number of pairs in the denominator; I'll fix it in the next release.

                            The stringent quality trimming seems to remove the read1 vs read2 mapping difference.
                            ...
                            However, the base composition bias is still there at either end.
                            Just to confirm, are you generating the base frequency histogram from the trimmed reads in a second pass, or while trimming? The frequencies are calculated from the input reads prior to trimming, so the effects of trimming will not be reflected in the histogram generated by the same process that does the trimming.

                            Comment


                            • #74
                              Originally posted by Brian Bushnell View Post
                              Just to confirm, are you generating the base frequency histogram from the trimmed reads in a second pass, or while trimming? The frequencies are calculated from the input reads prior to trimming, so the effects of trimming will not be reflected in the histogram generated by the same process that does the trimming.
                              I am generating the bhist histograms from bbmap.sh, so they are after trimming (or before trimming, if I use the untrimmed reads as the input). I generated the figures for both before and after trimming. The figure in my last post is for reads post bbduk strict quality trimming, ktrim, tpem and tbo.

                              Comment


                              • #75
                                Hmmm. Odd; I assume the base frequency drift is some kind of machine issue, and the quality scores of those bases (at the tail of read 2) are inaccurate. But I can't really explain the seeming discrepancy between the bhist and mhist; after quality-trimming, I would expect the base frequencies and mismatches to both flatten out, not just one of them. Unless the reads with the divergent base frequencies are not mapping because (for example) they are something synthetic rather than genomic, though synthetic contaminant molecules don't tend to be so long.

                                Overall, I'm not certain what's going on, or how to fix it, short of forcibly trimming the reads to ~280bp and ~220bp, which may not be necessary or even beneficial.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM
                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 06:37 PM
                                0 responses
                                10 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, Yesterday, 06:07 PM
                                0 responses
                                10 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-22-2024, 10:03 AM
                                0 responses
                                51 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-21-2024, 07:32 AM
                                0 responses
                                67 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X