SEQanswers

Old 10-27-2014, 11:10 PM   #41
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 308
Default

Thanks a lot Brian!

Quote:
Originally Posted by Brian Bushnell View Post
@luc,

That bug is now fixed in v33.73. Thanks for reporting it!

....

-Brian
Old 10-28-2014, 10:45 AM   #42
lankage
Member
 
Location: Madison, WI

Join Date: Oct 2014
Posts: 20
Default bbduk forcetrimright option

Hi Brian,
Thanks for making the additions! That was exactly what I needed to improve my workflow. One observation I made while comparing the output of bbduk and flexbar is that the forcetrimright trimming appears to be applied before the primer-trimming step.

>read_right_trimmed by flexbar length = 154
ACTCGGCCCAAGACTGCCCCATCGGCGTCACCTCTCGGTACACCCCCACGTCGCTGTCGAAGCGCGCGAACTCTTCGCGGTTATAGACGTGTCTGGCTACTAACCGCACGCGCTCTGTCCCGTTGGTGAAGTAGCACATGCCTTTAAACTGGTA

position of primer on read yet to be right trimmed
>>flexbar trimmed left_1 (230 aa)
initn: 153 init1: 153 opt: 153 Z-score: 122.1 bits: 26.2 E(1): 5.4e-05
Smith-Waterman score: 153; 100.0% identity (100.0% similar) in 18 aa overlap (1-18:155-172)

10
DQB192 CACGAAATCCTCTGCGGG
::::::::::::::::::
flexba GGTGAAGTAGCACATGCCTTTAAACTGGTACACGAAATCCTCTGCGGGAGACCAAGTCTC
130 140 150 160 170 180



>read_right_trimmed_by_bbduk forcetrimright=158 length = 159
ACTCGGCCCAAGACTGCCCCATCGGCGTCACCTCTCGGTACACCCCCACGTCGCTGTCGAAGCGCGCGAACTCTTCGCGGTTATAGACGTGTCTGGCTACTAACCGCACGCGCTCTGTCCCGTTGGTGAAGTAGCACATGCCTTTAAACTGGTACACGA

Position of primer on read yet to be right trimmed

>>left trimmed 1 bbduk1 (230 aa)
initn: 153 init1: 153 opt: 153 Z-score: 118.1 bits: 25.4 E(1): 9.1e-05
Smith-Waterman score: 153; 100.0% identity (100.0% similar) in 18 aa overlap (1-18:155-172)

10
DQB192 CACGAAATCCTCTGCGGG
::::::::::::::::::
left GGTGAAGTAGCACATGCCTTTAAACTGGTACACGAAATCCTCTGCGGGAGACCAAGTCTC
130 140 150 160 170 180


So while an exact primer match should result in trimming to 154 bases, it seems forcetrimright is applied first and no primer match is found.

Is there a way to change the order of these operations through argument usage?

Last edited by lankage; 10-28-2014 at 10:50 AM.
Old 10-28-2014, 11:00 AM   #43
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

lankage,

It's not possible to change the order of processing operations in BBDuk, and that would not be easy to do. The order is currently:

1) filter by minAvgQuality
2) force-trim
3) kmer-analyze (trim, filter, or mask)
4) trim by overlap
5) quality-trim

However, you can accomplish the same thing by piping, like this:

bbduk.sh -Xmx1g in=reads.fq out=stdout.fq int=t other bbduk flags | reformat.sh in=stdin.fq out=out.fq forcetrimright=158 int=t


For the second stage either reformat.sh or bbduk.sh will work. The pipe has to be a single stream, so if you have two input files and want to generate two output files, the command would be:


bbduk.sh -Xmx1g in1=r1.fq in2=r2.fq out=stdout.fq other bbduk flags | reformat.sh in=stdin.fq out1=out1.fq out2=out2.fq forcetrimright=158 int=t


When reading an interleaved file from stdin, you must always specify "int=t", because interleaving is only autodetected when reading from a file. Let me know if this didn't make sense.

-Brian
Old 11-12-2014, 11:46 AM   #44
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 423
Default

Hi Brian (and anyone else in the know),

Is there a magic kmer length that is optimal for contaminant filtering? I know from short-read alignment that there are many fewer possible alignments for a 100b read than for a 20b read (in general). In some of my tests, if I use k=10 then nearly 100% of input sequences might come up as contaminated; however, if I raise that to k=20, the percentage drops significantly (to, say, less than 1%). If I'm filtering FASTQ files then it seems clear that k should always be shorter than my read length, as well as shorter than the contaminant sequence length, but I'm wondering how much shorter is really optimal. It feels a little arbitrary (not that arbitrary is uncommon in this field).
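The k-dependence described above matches a quick back-of-envelope model (a rough sketch assuming uniform random bases and hypothetical read/reference sizes, not an exact description of BBDuk): each unit increase in k shrinks the chance-collision rate by a factor of 4.

```python
def expected_chance_hits(read_len, n_ref_kmers, k):
    """Expected number of spurious k-mer matches between one random read
    and a set of n_ref_kmers distinct reference k-mers (uniform-base model)."""
    positions = read_len - k + 1          # number of k-mers in the read
    return positions * n_ref_kmers / 4 ** k

# Hypothetical numbers: 100 bp reads vs. 1 million distinct reference k-mers.
print(expected_chance_hits(100, 10**6, 10))   # ~87: nearly every read gets a chance hit
print(expected_chance_hits(100, 10**6, 20))   # ~7e-05: chance hits essentially vanish
```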
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
Old 11-12-2014, 12:10 PM   #45
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

I typically use k=25, hdist=1 for adapter trimming and removal of other short synthetic contaminant sequences, just because some of them are very short (under 30bp). For long contaminant sequences (phiX, E. coli, etc.) I always use k=31 to maximize specificity; with 100bp reads you should get a match SOMEWHERE, especially if you allow a mismatch. Note that the "mm" or "maskmiddle" flag (which defaults to true for filtering) is roughly equivalent to allowing an additional +1 hamming distance for free.
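The core of kmer filtering can be sketched in a few lines of Python (a toy model for intuition only; BBDuk's real implementation uses packed kmers in hash tables, plus the hdist/maskmiddle handling described above):

```python
def ref_kmers(seq, k):
    """Collect every k-mer of a reference sequence into a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def read_matches(read, kmer_set, k):
    """A read is flagged as soon as any one of its k-mers hits the set."""
    return any(read[i:i + k] in kmer_set for i in range(len(read) - k + 1))

ref = ref_kmers("GATTACAGATTACA", 5)           # tiny hypothetical contaminant
print(read_matches("TTTTGATTACATTT", ref, 5))  # True: shares the 5-mer GATTA
print(read_matches("CCCCCCCCCCCCCC", ref, 5))  # False: no shared 5-mer
```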

By the way - in case you did not already notice, in an effort to save memory, BBDuk automatically scales its skip size (the number of kmers it skips between reference kmers it stores) depending on contig length. By default this varies from 0 for short reference sequences (under 100kb) to K for long reference sequences (over 5Mb). For maximal sensitivity, you should disable this by setting "maxskip=0" which will store all reference kmers.

If you want to determine the ideal kmer length, it may be useful to characterize the error rates of your data, or quantify your true and false positive assignment rates based on filtering when using synthetic reads with their origin tagged. "randomreads.sh" generates and tags reads from reference, with adjustable error rates; I use it a lot to validate approaches and determine optimal settings.
Old 11-12-2014, 12:57 PM   #46
ysnapus
Member
 
Location: Mid-Atlantic

Join Date: Jun 2013
Posts: 22
Default bbduk2 and some clarifications

Thanks for bringing out a great tool. Noob here so please don't mind some questions, and I hope the answers will be of use to others as well.
  1. I noticed that there is a bbduk.sh and bbduk2.sh in the BBMap code I downloaded from Sourceforge. Is bbduk2 a replacement? Should we switch to using bbduk2.sh? Your example usage in this thread uses bbduk.sh.
  2. I have paired-end DNA Seq data, and I know there has been quite a lot of adapter readthrough since my read lengths are high and some inserts were small. Trimmomatic has the palindrome mode for adapter trimming in PE reads. Does bbduk do the palindromic alignment automatically or do we need to set tbo=t and tpe=t in command line parameters? If this is not what tbo is, what does tbo do?
  3. The usage help lists parameters like tbo=f, so to set it true you need tbo=t, but your example in a post lists it
    bbduk.sh ... tpe tbo.
    Is either usage correct?
  4. Am I correct in assuming that we just need to do ktrim=r, even for paired-end data, because of the opposite orientations of the paired reads?

Thank you.
Old 11-12-2014, 01:06 PM   #47
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 423
Default

Brian,

Would it be possible for BBDuk to optionally export the matched kmers and unmatched (from input not reference) kmers? In a way it would be like generating a kmer library with something like Jellyfish and then splitting it into two piles via BBDuk as it runs now. It just seems like BBDuk is a bit faster at generating mers than other tools and in the event someone wanted to do such a thing as this it would simplify it down into a single step and keep people within a single set of tools (i.e. more fame for you).
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
Old 11-12-2014, 01:41 PM   #48
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by ysnapus View Post
Thanks for bringing out a great tool. Noob here so please don't mind some questions, and I hope the answers will be of use to others as well.

I noticed that there is a bbduk.sh and bbduk2.sh in the BBMap code I downloaded from Sourceforge. Is bbduk2 a replacement? Should we switch to using bbduk2.sh? Your example usage in this thread uses bbduk.sh.
BBDuk can operate in one of 4 kmer-matching modes:
Right-trimming (ktrim=r), left-trimming (ktrim=l), masking (ktrim=n), and filtering (default). But it can only do one at a time because all kmers are stored in a single table. It can still do non-kmer-based operations such as quality trimming at the same time.

BBDuk2 can do all 4 kmer operations at once and is designed for integration into automated pipelines where you do contaminant removal and adapter-trimming in a single pass to minimize filesystem I/O. Personally, I never use BBDuk2 from the command line. Both have identical capabilities and functionality otherwise, but the syntax is different.

Quote:
I have paired-end DNA Seq data, and I know there has been quite a lot of adapter readthrough since my read lengths are high and some inserts were small. Trimmomatic has the palindrome mode for adapter trimming in PE reads. Does bbduk do the palindromic alignment automatically or do we need to set tbo=t and tpe=t in command line parameters? If this is not what tbo is, what does tbo do?

The usage help lists parameters like tbo=f, so to set it true you need tbo=t, but your example in a post lists it
bbduk.sh ... tpe tbo.
Is either usage correct?
"tbo" and "tbo=t" and "trimbyoverlap=true" are all equivalent. "tpe" (trim pairs equally) is related, but independent. If the mode is ktrim=r, and BBDuk finds a kmer match at position 75 of read 1 but no matches in read 2, then with "tpe=t" it will trim both read 1 and read 2 at position 75 so they end up the same length, because they both have the same insert size; presumably a kmer did not match in read 2 due to errors.

"tbo" tries to determine the insert size by reverse-complementing read 2 and finding the best overlap such that bases of the two reads match (using the same algorithm as BBMerge, with conservative settings). If a valid overlap is found, and the calculated insert size is shorter than the read length, both reads will be trimmed to the insert size (regardless of the "tpe" flag).

So they are a little different but I recommend enabling them both if you have paired fragment libraries. They are not enabled by default because with long-mate-pair libraries you would not want that behavior.
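The idea behind "tbo" can be sketched like this (exact-match overlap only, for intuition; BBMerge's actual algorithm scores candidate overlaps and tolerates mismatches, and the sequences here are hypothetical):

```python
def revcomp(s):
    return s.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def insert_size_by_overlap(r1, r2, min_olap=12):
    """Try each candidate insert size; accept the first whose implied
    overlap between r1 and revcomp(r2) matches exactly."""
    rc2 = revcomp(r2)
    L1, L2 = len(r1), len(r2)
    for ins in range(min_olap, L1 + L2 - min_olap + 1):
        s, e = max(0, ins - L2), min(L1, ins)   # overlap window in r1 coordinates
        if e - s >= min_olap and r1[s:e] == rc2[s - (ins - L2):e - (ins - L2)]:
            return ins
    return None

# Hypothetical 20 bp fragment sequenced with 30 bp reads: 10 bp of adapter read-through.
frag = "ACGTTGCAATCCGGATTACG"
r1 = frag + "AGATCGGAAG"                # insert + adapter on read 1
r2 = revcomp(frag) + "AGATCGGAAG"       # insert + adapter on read 2
ins = insert_size_by_overlap(r1, r2)
print(ins)                              # 20
print(r1[:ins], r2[:ins])               # both reads trimmed back to the insert size
```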

Please let me know if that does not fully answer your question, and I'll look into Trimmomatic's palindrome mode to see exactly what it does.

Quote:
Am I correct in assuming that we just need to do ktrim=r, even for paired-end data, because of the opposite orientations of the paired reads?

Thank you.
For fragment libraries, yes, that is correct. You only need to use "ktrim=l" in unusual situations with custom adapters, or long-mate-pair libraries.
Old 11-12-2014, 01:57 PM   #49
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by sdriscoll View Post
Brian,

Would it be possible for BBDuk to optionally export the matched kmers and unmatched (from input not reference) kmers? In a way it would be like generating a kmer library with something like Jellyfish and then splitting it into two piles via BBDuk as it runs now. It just seems like BBDuk is a bit faster at generating mers than other tools and in the event someone wanted to do such a thing as this it would simplify it down into a single step and keep people within a single set of tools (i.e. more fame for you).
There is currently another program, "kmercountexact.sh", which will count kmers in a file. It can generate a histogram of kmer frequencies, and dump a fasta file of kmers and counts, like Jellyfish, and is indeed substantially faster (and I believe more memory-efficient). It also has an optional bloom filter you can use to exclude kmers with counts below a certain level (e.g. prefilter=3) to save memory.

The speed is mainly affected by the flags:
"prefilter", "prealloc", "onepass" (if prefilter is enabled), and when dumping kmers, "mincount". Oh, and having pigz installed can help, too.

Note, though, that BBDuk and KmerCountExact are both limited to K={1 to 31}; I'm not sure about Jellyfish.
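The K={1 to 31} limit is most likely because these tools pack each k-mer into a single 64-bit word at 2 bits per base, so 31 bases (62 bits) is the largest size that fits; that rationale is my assumption, not something stated in the thread. A sketch of the encoding:

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(kmer):
    """Pack a k-mer into an integer at 2 bits per base; a 31-mer needs
    62 bits and fits in one 64-bit word, while a 32-mer would not."""
    x = 0
    for base in kmer:
        x = (x << 2) | CODE[base]
    return x

print(pack("ACGT"))                  # 0b00_01_10_11 = 27
print(pack("T" * 31).bit_length())   # 62 bits: still fits in a 64-bit word
```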

KmerCountExact does not currently support set subtraction (I think that's what you're after?), but it sounds like a useful thing to include, so I will add it to my todo list. Also, the way it's threaded is a bit different from BBDuk, so it's much faster at counting kmers from reads than BBDuk is when counting them from a reference.
Old 11-14-2014, 09:13 AM   #50
ysnapus
Member
 
Location: Mid-Atlantic

Join Date: Jun 2013
Posts: 22
Default

Quote:
Originally Posted by Brian Bushnell View Post
Second, it's important to make sure these bases should really be trimmed. We have been generating some Nextera libraries recently with very erratic base frequency for the first 20 bases:



The top is the base composition histogram before adapter-trimming, and the bottom is after (this has read 1 from 0-150 and read 2 from 152-302); note how the right part of the read looks much better after adapter trimming. But the first 20 bases look terrible!
I am getting something similar with my data. I have a couple of questions.
  1. The skewed base content on the left side could be due to bias in the enzyme-mediated fragmentation process(?). Why is there a bias on the right side of the reads? I am getting such a bias even in the bbduk- and trimmomatic-corrected reads.
  2. Are the figures generated by a BBMap component? They look different from the FastQC graphics. (Answering my own question: it's from the bhist parameter of bbduk.sh.)

Quote:
Originally Posted by Brian Bushnell View Post
However, I mapped the adapter-trimmed reads to the assembly with BBMap using the 'mhist' flag, which generates a histogram of the rates of match/substitution/insertion/deletion rates by read position:



The error rate is a little higher for the first few bases, but still well under 1%, so we are not going to trim the first 20bp off of those reads, as was initially proposed. The reads are accurate even though the base composition is highly biased, because the fragmentation was not random (this uses some kind of enzyme). Generally, before you trim off bases because of a skewed base composition histogram, I suggest mapping to see if there actually is a higher error rate there.

For reference, the command to generate those histograms:
bbmap.sh in=reads.fq ref=assembly.fa mhist=mhist.txt bhist=bhist.txt nodisk
If I don't really have a reference genome (only a related species), any other ideas to diagnose what's going on with the weird base composition at the ends?

Last edited by ysnapus; 11-14-2014 at 09:47 AM. Reason: found an answer
Old 11-14-2014, 10:20 AM   #51
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

If you have sufficient coverage (at least 20x), and you have sufficient RAM for the genome size, you can do a quick assembly with (e.g.) Velvet and map to that. The assembly doesn't have to be good to generate the 'mhist' statistics.

Alternately, if your related species is very close - say, within 98% identity - you can still get a good feel for the error rate along the reads.

There is a bias on the right end of reads for two primary reasons. One is short inserts; even if you trim adapters, you won't get all of them. Do you mind giving your BBDuk command line, and std error output?

The other is probably the degradation in quality/intensity with increasing sequencing cycles. From talking to Illumina, it sounds like they don't recalibrate per cycle for relative base intensity drift. Whatever the cause, this has been causing us concern as well, so both we and Illumina are investigating it.
Old 11-14-2014, 11:07 AM   #52
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 423
Default

Quote:
Originally Posted by Brian Bushnell View Post
KmerCountExact does not currently support set subtraction (I think that's what you're after?), but it sounds like a useful thing to include, so I will add it to my todo list.
This is great. Sorry, there are so many tools in the bbmap folder that I haven't explored them all yet. Couldn't the set subtraction be handled by BBDuk? If you run sequence sets A and B separately through kmercountexact to get their two kmer sets, and then pass them through BBDuk with the same k as in=set_a and ref=set_b, you could either get the intersection of A and B with outm= or the subtraction of B from A with out=. That's good enough for me.

Quick question about BBDuk. I see in the help the setting 'findbestmatch=f', which is described as "If multiple matches, associate read with sequence sharing most kmers." I was under the impression that BBDuk neither retains nor outputs associations between a specific read and a specific sequence; rather, it only gives you a binary matched-or-not result. Have you included some extra voodoo here that actually reports associations? If not, I'm wondering what use this setting is. Thanks.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
Old 11-14-2014, 11:54 AM   #53
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

BBDuk cannot do set subtraction alone because it only stores kmers from the reference, not the reads. But yes, you could use kmercountexact and BBDuk together as you describe to achieve various set operations.
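In Python terms, the combination described above amounts to ordinary set operations on the two k-mer sets (a conceptual sketch only; kmercountexact and BBDuk operate on files and reads, not in-memory sets, and the sequences here are hypothetical):

```python
def kmers(seq, k):
    """Distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

a = kmers("ACGTACGTGG", 4)     # k-mer set A
b = kmers("TACGTGGTTT", 4)     # k-mer set B
print(sorted(a & b))           # intersection: what BBDuk's outm= would keep
print(sorted(a - b))           # A minus B: what BBDuk's out= would keep
```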

As for the second question - BBDuk associates every kmer with a specific sequence. Normally that information is not used since the operations are binary (trimming and filtering). But if you write with the "stats=" flag, it will tell you the number and fraction of reads that matched each reference sequence, which is very useful when contaminant filtering if you want to know what kind of contaminants were present.

Normally, as soon as a single kmer match is found, the function exits early (and the counter for the associated sequence is incremented). If the "fbm"/"findbestmatch" flag is enabled, it will go through the entire read and look up every kmer, which makes it a little slower. But, this is useful if, for example, you want to see whether some long sequence is more closely related to one of two different bacterial species, and worry that they may share a few kmers (but not the vast majority).
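A toy version of the "findbestmatch" idea (illustrative only; the real BBDuk uses packed kmer tables and also applies hdist/maskmiddle, and the reference names here are hypothetical):

```python
from collections import Counter

def build_table(refs, k):
    """Map each reference k-mer to one sequence name; the first reference wins
    when a k-mer is shared, mirroring BBDuk's one-sequence-per-kmer table."""
    table = {}
    for name, seq in refs:
        for i in range(len(seq) - k + 1):
            table.setdefault(seq[i:i + k], name)
    return table

def find_best_match(read, table, k):
    """Look up every k-mer of the read and return the reference sharing the
    most k-mers, instead of exiting early at the first hit."""
    hits = Counter(table[read[i:i + k]]
                   for i in range(len(read) - k + 1) if read[i:i + k] in table)
    return hits.most_common(1)[0][0] if hits else None

table = build_table([("refA", "AAAACCCC"), ("refB", "GGGGTTTT")], 4)
print(find_best_match("TAAAACCT", table, 4))   # refA (3 shared 4-mers)
```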

BBDuk only associates each kmer with a single ref sequence, so if it is shared by two different sequences (for example, two different adapters that are the same except for the bar code), it is classified as matching the first one. I'm now mostly done with a new tool that associates each kmer with an arbitrary number of sequences, but it will be a bit slower and use more memory than BBDuk.
Old 11-14-2014, 12:23 PM   #54
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 423
Default

These tools do so many things. The BBTools are my new favorite toys.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
Old 11-14-2014, 12:43 PM   #55
ysnapus
Member
 
Location: Mid-Atlantic

Join Date: Jun 2013
Posts: 22
Default

Quote:
Originally Posted by Brian Bushnell View Post
There is a bias on the right end of reads for two primary reasons. One is short inserts; even if you trim adapters, you won't get all of them. Do you mind giving your BBDuk command line, and std error output?
The command line was
Code:
 ~/software/bbmap/bbduk.sh in1=../../raw-reads/MiSeq/JBad-Wichita2_${pre}_R1_001.fastq.gz in2=../../raw-reads/MiSeq/JBad-Wichita2_${pre}_R2_001.fastq.gz out1=Wichita2_${pre}_R1.fastq.gz out2=Wichita2_${pre}_R2.fastq.gz k=28 mink=13 hdist=1 ktrim=r ref=~/software/bbmap/resources/truseq.fa.gz tpe tbo
The stderror output is

Code:
java -ea -Xmx3235m -cp /myhome/software/bbmap/current/ jgi.BBDukF in1=../../raw-reads/MiSeq/JBad-Wichita2_140908_R1_001.fastq.gz in2=../../raw-reads/MiSeq/JBad-Wichita2_140908_R2_001.fastq.gz out1=Wichita2_140908_R1.fastq.gz out2=Wichita2_140908_R2.fastq.gz k=28 mink=13 hdist=1 ktrim=r ref=/myhome/software/bbmap/resources/truseq.fa.gz tpe tbo
Executing jgi.BBDukF [in1=../../raw-reads/MiSeq/JBad-Wichita2_140908_R1_001.fastq.gz, in2=../../raw-reads/MiSeq/JBad-Wichita2_140908_R2_001.fastq.gz, out1=Wichita2_140908_R1.fastq.gz, out2=Wichita2_140908_R2.fastq.gz, k=28, mink=13, hdist=1, ktrim=r, ref=/myhome/software/bbmap/resources/truseq.fa.gz, tpe, tbo]

maskMiddle was disabled because useShortKmers=true
Initial:
Memory: free=2036m, used=22m

Added 64696 kmers; time: 	0.143 seconds.
Memory: free=1961m, used=97m

Input is being processed as paired
Started output streams:	0.952 seconds.

Input:                  	26109034 reads 		7577351149 bases.
KTrimmed:               	58084 reads (0.22%) 	4052612 bases (0.05%)
Trimmed by overlap:     	2793852 reads (10.70%) 	35617283 bases (0.47%)
Result:                 	26108900 reads (100.00%) 	7537681254 bases (99.48%)

Time:   			2409.165 seconds.
Reads Processed:      26109k 	10.84k reads/sec
Bases Processed:       7577m 	3.15m bases/sec
Quote:
Originally Posted by Brian Bushnell View Post
The other is probably the degradation in quality/intensity with increasing sequencing cycles. From talking to Illumina, it sounds like they don't recalibrate per cycle for relative base intensity drift. Whatever the cause, this has been causing us concern as well, so both we and Illumina are investigating it.
I'd love to know what's going on. Although, looking at the stderr now, it seems that maybe it is not finding the right adapter sequence. Perhaps I should try the Nextera sequences too. I'd like to help in whatever way I can (I can privately share data if it helps).
Old 11-14-2014, 12:52 PM   #56
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by ysnapus
I'd love to know what's going on. Although looking at the stderror now it seems that may be it is not finding the right adapter sequence. Perhaps I should try the Nextera sequences too. I'd like to help in whatever way I can (I can privately share data if it helps.)
Considering that 10.7% of the reads were trimmed by overlap and only 0.22% were trimmed by kmer matching (adapter sequence identification), you are almost certainly using the wrong adapter sequences. BBTools comes with both Nextera and TruSeq adapter sequences: /bbmap/resources/nextera.fa.gz and /bbmap/resources/truseq.fa.gz. You can use both of them at once, by separating them with commas - "ref=file,file". Let me know if trimming with Nextera adapters does not solve the problem.

Quote:
Originally Posted by sdriscoll View Post
These tools do so many things. The BBTools are my new favorite toys.
Thanks! That's my new favorite quote.
Old 11-16-2014, 09:46 AM   #57
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Quote:
Originally Posted by Brian Bushnell View Post
BBDuk associates every kmer with a specific sequence. Normally that information is not used since the operations are binary (trimming and filtering). But if you write with the "stats=" flag, it will tell you the number and fraction of reads that matched each reference sequence, which is very useful when contaminant filtering if you want to know what kind of contaminants were present.

Normally, as soon as a single kmer match is found, the function exits early (and the counter for the associated sequence is incremented). If the "fbm"/"findbestmatch" flag is enabled, it will go through the entire read and look up every kmer, which makes it a little slower.
With v33.92, I added a couple new flags to BBDuk, "rename" and "refnames", which are best to use in conjunction with the "fbm" flag. Every input read that matches at least one reference kmer will be renamed, with its name indicating which reference sequences (or reference files) were matched, and how many kmers matched each sequence.

There's also a new tool, "seal.sh", which is still in beta, but can be used for the same purpose (renaming). Unlike BBDuk, it associates each reference kmer with multiple reference sequences, so it is still perfectly accurate in situations when multiple reference sequences may share kmers.

And I included sdriscoll's reformatted help menu, so that looks a lot nicer now!
Old 11-17-2014, 08:30 AM   #58
ysnapus
Member
 
Location: Mid-Atlantic

Join Date: Jun 2013
Posts: 22
Default

Quote:
Originally Posted by Brian Bushnell View Post
If you have sufficient coverage (at least 20x), and you have sufficient RAM for the genome size, you can do a quick assembly with (e.g.) Velvet and map to that. The assembly doesn't have to be good to generate the 'mhist' statistics.

Alternately, if your related species is very close - say, within 98% identity - you can still get a good feel for the error rate along the reads.

There is a bias on the right end of reads for two primary reasons. One is short inserts; even if you trim adapters, you won't get all of them. Do you mind giving your BBDuk command line, and std error output?

The other is probably the degradation in quality/intensity with increasing sequencing cycles. From talking to Illumina, it sounds like they don't recalibrate per cycle for relative base intensity drift. Whatever the cause, this has been causing us concern as well, so both we and Illumina are investigating it.
Quote:
Originally Posted by Brian Bushnell View Post
Considering that 10.7% of the reads were trimmed by overlap and only 0.22% were trimmed by kmer matching (adapter sequence identification), you are almost certainly using the wrong adapter sequences. BBTools comes with both Nextera and TruSeq adapter sequences: /bbmap/resources/nextera.fa.gz and /bbmap/resources/truseq.fa.gz. You can use both of them at once, by separating them with commas - "ref=file,file". Let me know if trimming with Nextera adapters does not solve the problem.
I ran one of my datasets against both nextera and truseq and I still don't get much kmer filtering. Edit: Perhaps I can figure out the adapter sequence from the overlapping read pairs?
Code:
java -ea -Xmx1234m -cp $HOME/software/bbmap/current/ jgi.BBDukF in1=../../raw-reads/MiSeq/JBad-Wichita2_140908_R1_001.fastq.gz in2=../../raw-reads/MiSeq/JBad-Wichita2_140908_R2_001.fastq.gz out1=Wichita2_140908_R1.fastq.gz out2=Wichita2_140908_R2.fastq.gz k=28 mink=13 hdist=1 ktrim=r ref=$HOME/software/bbmap/resources/truseq.fa.gz,/$HOME/software/bbmap/resources/nextera.fa.gz tpe tbo
Executing jgi.BBDukF [in1=../../raw-reads/MiSeq/JBad-Wichita2_140908_R1_001.fastq.gz, in2=../../raw-reads/MiSeq/JBad-Wichita2_140908_R2_001.fastq.gz, out1=Wichita2_140908_R1.fastq.gz, out2=Wichita2_140908_R2.fastq.gz, k=28, mink=13, hdist=1, ktrim=r, ref=$HOME/software/bbmap/resources/truseq.fa.gz,$HOME/software/bbmap/resources/nextera.fa.gz, tpe, tbo]

maskMiddle was disabled because useShortKmers=true
Initial:
Memory: free=107m, used=16m

Added 187372 kmers; time: 	0.469 seconds.
Memory: free=101m, used=22m

Input is being processed as paired
Started output streams:	2.701 seconds.

Input:                  	26109034 reads 		7577351149 bases.
KTrimmed:               	119429 reads (0.46%) 	9212131 bases (0.12%)
Trimmed by overlap:     	2816312 reads (10.79%) 	34210537 bases (0.45%)
Result:                 	26108264 reads (100.00%) 	7533928481 bases (99.43%)

Time:   			1153.222 seconds.
Reads Processed:      26109k 	22.64k reads/sec
Bases Processed:       7577m 	6.57m bases/sec
I also got the mhist before and after filtering against a closely related draft genome. There is a slight improvement in matching after the bbduk filtering, but it's hard to see visually.




Last edited by ysnapus; 11-17-2014 at 08:52 AM.
Old 11-17-2014, 10:45 AM   #59
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Adapter contamination should be symmetric. Therefore, the mismatch rate caused by adapter sequences in read 1 should be the same as in read 2. But in your mhist graphs, read 2 has a much more dramatic increase than read 1 (and the base frequencies are much more clearly divergent) - this is not caused by adapters, but just a decline in quality, so quality trimming is probably a good idea; it would not surprise me if that greatly improved the base frequency divergence.

Judging visually, it seems like the draft genome is only about 97-98% identity to your organism, which makes it very noisy since most of the mismatches come from genomic difference rather than sequencing errors (or adapters). But certainly it would be good to trim the very last base (I'm guessing this is a 2x301bp run?) which you can do with the "ftr=299" flag, since it has a super-high 20%+ error rate.

It's possible to use overlap to determine adapter sequences, though it will currently require some manual work.

bbmerge.sh in=reads.fq out=joinable.fq join=f reads=200000 maxlength=280

...will create a file containing only reads that overlap, and have an insert size of at most 280bp, but not actually join them. And it will only process the first 200k reads, which will take a few seconds.

bbmerge.sh in=joinable.fq outinsert=insertSizes.txt

That will give you a file containing the insert sizes of the reads and their names. So you can find a read with an insert size of, say, 270 and then fetch the sequence from its name, and the last 30bp should be adapter.
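The manual step can be sketched as follows (a hypothetical helper, not part of BBTools: tally the bases found past each insert boundary and take the most common tail as the likely adapter start; the read/insert pairs below are made up):

```python
from collections import Counter

def likely_adapter_start(reads_with_inserts, probe=12):
    """For each (read, insert_size) pair with adapter read-through, the bases
    past the insert boundary are adapter; the most common tail wins."""
    tails = Counter(read[ins:ins + probe]
                    for read, ins in reads_with_inserts
                    if len(read) - ins >= probe)
    return tails.most_common(1)[0][0] if tails else None

pairs = [("ACGTACGTACGTACGTACGT" + "AGATCGGAAGAG", 20),
         ("TTGGCCAATTGGCCAATTGGCCAAT" + "AGATCGGAAGAG", 25)]
print(likely_adapter_start(pairs))   # AGATCGGAAGAG
```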

I'll add a mode that will actually rename the reads based on their insert size, which would make this easier.
Old 11-21-2014, 10:54 AM   #60
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 423
Default

Hey Brian,

When we use hdist=1 with BBDuk, is the increase in memory use proportional to the reference or the input? If I keep getting "java.lang.OutOfMemoryError: Java heap space" errors, then should I:

a. split the reference up into smaller files and run them separately or
b. split the input into multiple files and run them separately

?
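For intuition on the hdist memory cost, here's a rough sketch (my assumption about the mechanism, building on the earlier note in this thread that BBDuk stores only reference kmers: if mismatch tolerance is implemented by storing all hdist=1 mutants of each reference k-mer, each one expands to itself plus 3 substitutions at each of k positions):

```python
def hdist1_variants(k):
    """Number of k-mers within Hamming distance 1 of a reference k-mer:
    the k-mer itself plus 3 alternative bases at each of k positions."""
    return 1 + 3 * k

print(hdist1_variants(27))   # 82 table entries per reference 27-mer
print(hdist1_variants(31))   # 94 table entries per reference 31-mer
```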

Also, watching it load up in 'top', I only see java reach about 36GB of RAM usage before the errors start landing (separate ones for separate threads). My system has 128GB total RAM, though. Is there a Java-specific limit I can change somewhere (Ubuntu Linux)?

Thanks.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */

Last edited by sdriscoll; 11-21-2014 at 10:59 AM. Reason: more question added