SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Quality-, adapter- and RRBS-trimming with Trim Galore! fkrueger Bioinformatics 132 04-18-2017 01:04 AM
Adapter trimming figo1019 RNA Sequencing 1 04-07-2014 10:58 AM
Adapter trimming and trimming by quality question alisrpp Bioinformatics 5 04-08-2013 04:55 PM
adapter trimming - help a_mt Bioinformatics 6 11-12-2012 07:36 PM
3' Adapter Trimming caddymob Bioinformatics 0 05-27-2009 12:53 PM

Reply
 
Thread Tools
Old 11-21-2014, 10:24 AM   #61
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default

By default BBDuk only tries to grab about 42% of physical memory. You can override the amount of RAM BBDuk uses with the -Xmx flag. And you can increase the efficiency of memory usage (as well as indexing speed) with the "prealloc" flag. For example:

bbduk.sh -Xmx100g prealloc (other arguments)

Note that you should not go all the way to 128g, though the exact amount you can use before disk-swapping or refusal to initialize depends on the system configuration. For our 128g machines, it's about 105g, but I think on most systems 120g should be fine.

The memory usage is proportional to the reference, and specifically, "hdist=1" will increase the memory requirements by a factor of 3*K, which is pretty immense when you have a large reference. It roughly needs 15b per kmer at a minimum when you preallocate (up to 45 if you don't). You can also adjust memory usage by playing with the "skip" flag.

In a little while I will release a new version that has some additional features for controlling memory use and speed.
Brian Bushnell is offline   Reply With Quote
Old 11-21-2014, 10:38 AM   #62
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 423
Default

Oh yeah, sorry the -Xmx flag of course. I have a 3 year old and a 7 week old baby so I'm not all there all the time.

Something that may be useful is a feature like that in some other tools which allows you to load a reference into memory and keep it there for subsequent runs or to allow multiple instances to access the same loaded reference in memory. That sometimes comes in handy with STAR, for example, which uses nearly 30BG for the human genome and sometimes that can take several minutes to load. As you know anything that takes more than a few seconds is practically intolerable these days.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 11-21-2014, 10:42 AM   #63
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default

BBMap has a wrapper, BBWrap, that allows you to load the index once and process multiple datasets with it. I could do something similar for BBDuk, though sharing it between independent processes would be much more difficult, if possible at all.
Brian Bushnell is offline   Reply With Quote
Old 11-21-2014, 10:48 AM   #64
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 423
Default

I believe you can do it.

Also I think you should call these things BBWeapons and not BBTools. They are much more like Weapons. Maybe one of my favorite tricks is this one:

Code:
samtools view -F 0x4 some_alignments.bam | \
  reformat.sh in=stdin.sam out=aligned_reads.fq.gz pigz=t
That's got to be the smoothest way I've seen to translate aligned reads in a BAM file into FASTQ (or FASTA for that matter).
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 11-21-2014, 11:01 AM   #65
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 423
Default

It seems like using mm=t may be enough for me in this case. I assume that means I'll be missing potential mismatches within K/2-1 of the ends of reads though. I guess I can live with that.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 11-21-2014, 01:40 PM   #66
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default

Quote:
Originally Posted by sdriscoll View Post
It seems like using mm=t may be enough for me in this case. I assume that means I'll be missing potential mismatches within K/2-1 of the ends of reads though. I guess I can live with that.
Yes, that's true. Though depending on exactly what you are doing, you could split the reference into smaller pieces and run the reads through in multiple passes.

I just released a new version of BBTools (33.95) which contains some features that you might find helpful. BBDuk now has a "speed" flag and "qskip" (query skip) flag. "speed" ranges from 0 to 15 (default 0), and ignores some fraction of the kmer space. So for example "speed=0" uses all kmers (highest sensitivity), while "speed=8" uses half of all kmers, and "speed=15" uses 15/16 of the kmers. The load time, run time, and memory use are all inversely proportional to the speed setting. "qskip=X" (default 1) uses only every Xth kmer in the read. This does not affect memory usage, but does affect speed. "skip" was renamed "rskip" (reference skip), and it affects which reference kmers are used, so it affects load time and memory usage, but not run time. Note that it is unwise to use more than one of speed, qskip, or rskip at one time. E.g. qskip=2 and rskip=2 you process every other kmer in the reference and in the read, so half of all reads will be out of phase with the reference and thus not match. Time and memory-wise, "speed=8" would be similar to "qskip=2 rskip=2" but without that disadvantage.

Also, Seal is now finished and tested.
Brian Bushnell is offline   Reply With Quote
Old 11-24-2014, 01:58 PM   #67
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 423
Default

Sorry, what is Seal?
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 11-24-2014, 04:21 PM   #68
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default

Seal is kind of a relative of BBDuk that can associate each kmer with multiple values rather than just one. So, it can disambiguate in cases when a kmer comes from multiple sequences or references. As such, you can use it to for example calculate FPKM values in RNA-seq or annotate reads/sequences with the names of all other sequences with which they share kmers, and how many are shared.
Brian Bushnell is offline   Reply With Quote
Old 11-24-2014, 04:22 PM   #69
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 423
Default

Oh that sounds fun.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 12-01-2014, 08:38 AM   #70
ysnapus
Member
 
Location: Mid-Atlantic

Join Date: Jun 2013
Posts: 22
Default

Quote:
Originally Posted by Brian Bushnell View Post
Adapter contamination should be symmetric. Therefore, the mismatch rate caused by adapter sequences in read 1 should be the same as in read 2. But in your mhist graphs, read 2 has a much more dramatic increase than read 1 (and the base frequencies are much more clearly divergent) - this is not caused by adapters, but just a decline in quality, so quality trimming is probably a good idea; it would not surprise me if that greatly improved the base frequency divergence.

Judging visually, it seems like the draft genome is only about 97-98% identity to your organism, which makes it very noisy since most of the mismatches come from genomic difference rather than sequencing errors (or adapters). But certainly it would be good to trim the very last base (I'm guessing this is a 2x301bp run?) which you can do with the "ftr=299" flag, since it has a super-high 20%+ error rate.
I retried it with quality trimming (qtrim=t), removing the last base (ftr=299), and using both nextera and truseq sequences.

Code:
in1=../../raw-reads/MiSeq/JBad-Wichita2_140714_R1_001.fastq.gz in2=../../raw-reads/MiSeq/JBad-Wichita2_140714_R2_001.fastq.gz out1=Wichita2_140714_R1.fastq.gz out2=Wichita2_140714_R2.fastq.gz k=28 mink=13 hdist=1 qtrim=t ftr=299 ktrim=r ref=/myhome/software/bbmap/resources/truseq.fa.gz,/myhome/software/bbmap/resources/nextera.fa.gz bhist=bhist140714.txt stats=stats140714 tpe tbo
I still see a bias in the base composition towards the ends, and lower match rates in read 2.






Do you think there is still some contamination that remains in the reads? I can't of where the base composition bias might be coming from. Would fragmentation bias affect both ends? I need the reads for de novo assembly.
ysnapus is offline   Reply With Quote
Old 12-01-2014, 10:03 AM   #71
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default

"qtrim=t" turns on quality-trimming, but does not actually specify the threshold; the default is very low, at 6. To see if quality-trimming helps with the base composition drift, I would suggest a higher threshold, for example, "qtrim=t trimq=20". That's probably too high for the actual assembly (where a value of around 10 is probably better), but it would be useful to run with aggressive trimming and generate the match histogram and base-frequency histograms, just to see if indeed the divergent and mismatching bases do indeed have lower quality. If so, then they are not caused by adapters.

By the way - I can't remember if you've already posted this, but generating the insert size histogram (ihist flag, in BBMap) tends to also be useful, as adapters should only appear on reads with insert size shorter than read length. Thus if you do not have a substantial population of reads shorter than insert size, adapters are not responsible for the issues at the read tails.
Brian Bushnell is offline   Reply With Quote
Old 12-01-2014, 03:12 PM   #72
ysnapus
Member
 
Location: Mid-Atlantic

Join Date: Jun 2013
Posts: 22
Default

I tried it with more stringent quality trimming (trimq=20). First of all, in the bbduk.sh stderror output, it says that more than 100% reads were quality trimmed.
Code:
Input:                  	26109034 reads 		7577351149 bases.
QTrimmed:               	31595789 reads (121.01%) 	1299437454 bases (17.15%)
KTrimmed:               	119335 reads (0.46%) 	9161702 bases (0.12%)
Trimmed by overlap:     	2811266 reads (10.77%) 	34402173 bases (0.45%)
Result:                 	25717096 reads (98.50%) 	6234349820 bases (82.28%)
The stringent quality trimming seems to remove the read1 vs read2 mapping difference.


However, the base composition bias is still there at either end.


These are the insert size histograms before and after filtering (x-clipped to 1000 bases, but the tails continue to 5000 bases).



It seems there are a minority, but significant number of inserts shorter than the read lengths.
ysnapus is offline   Reply With Quote
Old 12-01-2014, 03:32 PM   #73
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default

Quote:
Originally Posted by ysnapus View Post
I tried it with more stringent quality trimming (trimq=20). First of all, in the bbduk.sh stderror output, it says that more than 100% reads were quality trimmed.
Thanks for noting that I'm counting the number of reads in the numerator, and number of pairs in the denominator; I'll fix it in the next release.

Quote:
The stringent quality trimming seems to remove the read1 vs read2 mapping difference.
...
However, the base composition bias is still there at either end.
Just to confirm, are you generating the base frequency histogram from the trimmed reads in a second pass, or while trimming? The frequencies are calculated from the input reads prior to trimming, so the effects of trimming will not be reflected in the histogram generated by the same process that does the trimming.
Brian Bushnell is offline   Reply With Quote
Old 12-01-2014, 03:43 PM   #74
ysnapus
Member
 
Location: Mid-Atlantic

Join Date: Jun 2013
Posts: 22
Default

Quote:
Originally Posted by Brian Bushnell View Post
Just to confirm, are you generating the base frequency histogram from the trimmed reads in a second pass, or while trimming? The frequencies are calculated from the input reads prior to trimming, so the effects of trimming will not be reflected in the histogram generated by the same process that does the trimming.
I am generating the bhist histograms from bbmap.sh, so they are after trimming (or before trimming, if I use the untrimmed reads as the input). I generated the figures for both before and after trimming. The figure in my last post is for reads post bbduk strict quality trimming, ktrim, tpem and tbo.
ysnapus is offline   Reply With Quote
Old 12-01-2014, 03:58 PM   #75
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default

Hmmm. Odd; I assume the base frequency drift is some kind of machine issue, and the quality scores of those bases (at the tail of read 2) are inaccurate. But I can't really explain the seeming discrepancy between the bhist and mhist; after quality-trimming, I would expect the base frequencies and mismatches to both flatten out, not just one of them. Unless the reads with the divergent base frequencies are not mapping because (for example) they are something synthetic rather than genomic, though synthetic contaminant molecules don't tend to be so long.

Overall, I'm not certain what's going on, or how to fix it, short of forcibly trimming the reads to ~280bp and ~220bp, which may not be necessary or even beneficial.
Brian Bushnell is offline   Reply With Quote
Old 12-01-2014, 04:41 PM   #76
ysnapus
Member
 
Location: Mid-Atlantic

Join Date: Jun 2013
Posts: 22
Default

Quote:
Originally Posted by Brian Bushnell View Post
Hmmm. Odd; I assume the base frequency drift is some kind of machine issue, and the quality scores of those bases (at the tail of read 2) are inaccurate. But I can't really explain the seeming discrepancy between the bhist and mhist; after quality-trimming, I would expect the base frequencies and mismatches to both flatten out, not just one of them. Unless the reads with the divergent base frequencies are not mapping because (for example) they are something synthetic rather than genomic, though synthetic contaminant molecules don't tend to be so long.
Perhaps I should generate the SAM file from bbmap and then generate the bhist from just the mapped reads in the SAM? Can a bbmap tool generate the bhist from mapped reads of a SAMfile?

Also, the mhist isn't completely flat, it slopes downwards at the ends. Though it looks flatter than the bhist.

Quote:
Originally Posted by Brian Bushnell View Post
Overall, I'm not certain what's going on, or how to fix it, short of forcibly trimming the reads to ~280bp and ~220bp, which may not be necessary or even beneficial.
I could do that, though I guess I'll have to manually set the forcetrimright and forcetrimleft for each file.
ysnapus is offline   Reply With Quote
Old 12-01-2014, 05:00 PM   #77
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default

Quote:
Originally Posted by ysnapus View Post
Perhaps I should generate the SAM file from bbmap and then generate the bhist from just the mapped reads in the SAM? Can a bbmap tool generate the bhist from mapped reads of a SAMfile?
BBMap can output mapped reads in fastq, which will make it a little easier to keep the pairs straight, then you can calculate the bhist with reformat:

bbmap.sh in=reads.fq outm=mapped.fq outu=unmapped.fq po

You can use the "outm2" and "outu2" flags if the reads are in two files.
The "po" (pairedonly) flag will make reads that did not map paired go to the "outu" stream.

Quote:
Also, the mhist isn't completely flat, it slopes downwards at the ends. Though it looks flatter than the bhist.
Exactly - I would expect the tail of the mhist plot to deflect at least as much as the bhist plot, if the bases with divergent frequencies are being mapped and are incorrect.

Quote:
I could do that, though I guess I'll have to manually set the forcetrimright and forcetrimleft for each file.
Depending on your goal, it may not be necessary, even though the graphs look ugly. Most programs can tolerate some degree of error; trimming becomes more important when that error is biased such that it can change a quantitative result, or the amount of error makes you run out of memory, etc. Often the best approach is to try both ways and see if it made a positive impact, though obviously, that's labor-intensive.

Last edited by Brian Bushnell; 12-01-2014 at 05:03 PM.
Brian Bushnell is offline   Reply With Quote
Old 12-03-2014, 10:34 AM   #78
tkc.yang
Junior Member
 
Location: canada

Join Date: Dec 2014
Posts: 3
Default error in trimming

Hi, I recently got RNA-seq fastq files generated from Ion Torrent.

I tried using the BBDuk program to trim the reads on quality, but I ran into this error and I don't know what I did wrong. I'm also very unfamiliar with coding in general.

java -ea -Xmx1g -cp C:\bbmap\current\ jgi.BBDukF in=.\reads.fastq out=Q10_reads.fastq qtrim=rl trimq=10 minlen=20

Exception in thread "main" java.lang.NullPointerException
at jgi.BBDukF.spawnProcessThreads(BBDukF.java:1414)
at jgi.BBDukF.process2(BBDukF.java:780)
at jgi.BBDukF.process(BBDukF.java:716)
at jgi.BBDukF.main(BBDukF.java:65)

Thanks
tkc.yang is offline   Reply With Quote
Old 12-03-2014, 10:41 AM   #79
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,668
Default

What version do you have? The current version does not even have anything on line 1414. If you download the current version it should work.

An even easier solution is to use reformat instead, which has the same syntax, since you are not doing any kmer-based operations:

java -ea -Xmx1g -cp C:\bbmap\current\ jgi.ReformatReads in=.\reads.fastq out=Q10_reads.fastq qtrim=rl trimq=10 minlen=20
Brian Bushnell is offline   Reply With Quote
Old 12-03-2014, 01:10 PM   #80
tkc.yang
Junior Member
 
Location: canada

Join Date: Dec 2014
Posts: 3
Default

I have version 33.92. I'll download the newest version.

But the Reformat worked! Thanks.
tkc.yang is offline   Reply With Quote
Reply

Tags
adapter, bbduk, bbtools, cutadapt, trimmomatic

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:49 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO