![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Quality-, adapter- and RRBS-trimming with Trim Galore! | fkrueger | Bioinformatics | 132 | 04-18-2017 01:04 AM |
Adapter trimming | figo1019 | RNA Sequencing | 1 | 04-07-2014 10:58 AM |
Adapter trimming and trimming by quality question | alisrpp | Bioinformatics | 5 | 04-08-2013 04:55 PM |
adapter trimming - help | a_mt | Bioinformatics | 6 | 11-12-2012 07:36 PM |
3' Adapter Trimming | caddymob | Bioinformatics | 0 | 05-27-2009 12:53 PM |
![]() |
|
Thread Tools |
![]() |
#41 |
Senior Member
Location: US Join Date: Dec 2010
Posts: 324
|
![]() |
![]() |
![]() |
![]() |
#42 |
Member
Location: Madison, WI Join Date: Oct 2014
Posts: 20
|
![]()
Hi Brian,
Thanks for making the additions! That was exactly what i needed to improve my workflow. One observation i made while looking at differences between the output of bbduk and flexbar is that it appears that the forcetrimright trimming is being applied before the primer trimming step. >read_right_trimmed by flexbar length = 154 ACTCGGCCCAAGACTGCCCCATCGGCGTCACCTCTCGGTACACCCCCACGTCGCTGTCGAAGCGCGCGAACTCTTCGCGGTTATAGACGTGTCTGGCTACTAACCGCACGCGCTCTGTCCCGTTGGTGAAGTAGCACATGCCTTTAAACTGGTA position of primer on read yet to be right trimmed >>flexbar trimmed left_1 (230 aa) initn: 153 init1: 153 opt: 153 Z-score: 122.1 bits: 26.2 E(1): 5.4e-05 Smith-Waterman score: 153; 100.0% identity (100.0% similar) in 18 aa overlap (1-18:155-172) 10 DQB192 CACGAAATCCTCTGCGGG :::::::::::::::::: flexba GGTGAAGTAGCACATGCCTTTAAACTGGTACACGAAATCCTCTGCGGGAGACCAAGTCTC 130 140 150 160 170 180 >read_right_trimmed_by_bbduk forcetrimright=158 length = 159 ACTCGGCCCAAGACTGCCCCATCGGCGTCACCTCTCGGTACACCCCCACGTCGCTGTCGAAGCGCGCGAACTCTTCGCGGTTATAGACGTGTCTGGCTACTAACCGCACGCGCTCTGTCCCGTTGGTGAAGTAGCACATGCCTTTAAACTGGTACACGA Position of primer on read yet to be right trimmed >>left trimmed 1 bbduk1 (230 aa) initn: 153 init1: 153 opt: 153 Z-score: 118.1 bits: 25.4 E(1): 9.1e-05 Smith-Waterman score: 153; 100.0% identity (100.0% similar) in 18 aa overlap (1-18:155-172) 10 DQB192 CACGAAATCCTCTGCGGG :::::::::::::::::: left GGTGAAGTAGCACATGCCTTTAAACTGGTACACGAAATCCTCTGCGGGAGACCAAGTCTC 130 140 150 160 170 180 So while an exact primer match should result in a trimming to 154 bases, it seems the forcetrimright is getting applied first and no primer match is found. Is there a way to change the order of these operations through argument usage? Last edited by lankage; 10-28-2014 at 09:50 AM. |
![]() |
![]() |
![]() |
#43 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,706
|
![]()
lankage,
It's not possible to change the order of processing operations in BBDuk, and that would not be easy to do. The order is currently: 1) filter by minAvgQuality 2) force-trim 3) kmer-analyze (trim, filter, or mask) 4) trim by overlap 5) quality-trim However, you can accomplish the same thing via piping it, like this: bbduk.sh -Xmx1g in=reads.fq out=stdout.fq int=t other bbduk flags | reformat.sh in=stdin.fq out=out.fq forcetrimright=158 int=t For the second stage either reformat.sh or bbduk.sh will work. The pipe has to be a single stream, so if you have two input files and want to generate two output files, the command would be: bbduk.sh -Xmx1g in1=r1.fq in2=r2.fq out=stdout.fq other bbduk flags | reformat.sh in=stdin.fq out1=out1.fq out2=out2.fq forcetrimright=158 int=t When reading an interleaved file from stdin, you must always specify "int=t" because it is only autodetected if reading from a file. Let me know if this is didn't make sense. -Brian |
![]() |
![]() |
![]() |
#44 |
I like code
Location: San Diego, CA, USA Join Date: Sep 2009
Posts: 438
|
![]()
Hi Brian (and anyone else in the know),
Is there a magic kmer length that is optimal for contaminant filtering? I know from short read alignment there are many fewer possible alignments for a 100b read than a 20b read (in general). In some of my tests if I use k=10 then nearly 100% of input sequences might come up as contaminated however if I raise that to k=20 the % drops significantly (to say less than 1%). If I'm filtering FASTQ files then it seems clear that k should always be shorter than my read length as well as shorter than the contaminant sequence length however I'm wondering how much shorter is really optimal. It feels a little arbitrary (not that arbitrary is uncommon in this field).
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff Salk Institute for Biological Studies, La Jolla, CA, USA */ |
![]() |
![]() |
![]() |
#45 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,706
|
![]()
I typically use k=25, hdist=1 for adapter trimming and removal of other short synthetic contaminant sequences, just because some of them are very short (under 30bp). For long contaminant sequences (phiX, ecoli, etc) I always use k=31 to maximize the specificity; with 100bp reads you should get a match SOMEWHERE, especially if you allow a mismatch. Note that the "mm" or "maskmiddle" flag (which defaults to true for filtering) is roughly equivalent to allowing an additional +1 hamming distance for free.
By the way - in case you did not already notice, in an effort to save memory, BBDuk automatically scales its skip size (the number of kmers it skips between reference kmers it stores) depending on contig length. By default this varies from 0 for short reference sequences (under 100kb) to K for long reference sequences (over 5Mb). For maximal sensitivity, you should disable this by setting "maxskip=0" which will store all reference kmers. If you want to determine the ideal kmer length, it may be useful to characterize the error rates of your data, or quantify your true and false positive assignment rates based on filtering when using synthetic reads with their origin tagged. "randomreads.sh" generates and tags reads from reference, with adjustable error rates; I use it a lot to validate approaches and determine optimal settings. |
![]() |
![]() |
![]() |
#46 |
Member
Location: Mid-Atlantic Join Date: Jun 2013
Posts: 22
|
![]()
Thanks for bringing out a great tool. Noob here so please don't mind some questions, and I hope the answers will be of use to others as well.
Thank you. |
![]() |
![]() |
![]() |
#47 |
I like code
Location: San Diego, CA, USA Join Date: Sep 2009
Posts: 438
|
![]()
Brian,
Would it be possible for BBDuk to optionally export the matched kmers and unmatched (from input not reference) kmers? In a way it would be like generating a kmer library with something like Jellyfish and then splitting it into two piles via BBDuk as it runs now. It just seems like BBDuk is a bit faster at generating mers than other tools and in the event someone wanted to do such a thing as this it would simplify it down into a single step and keep people within a single set of tools (i.e. more fame for you).
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff Salk Institute for Biological Studies, La Jolla, CA, USA */ |
![]() |
![]() |
![]() |
#48 | |||
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,706
|
![]() Quote:
Right-trimming (ktrim=r), left-trimming (ktrim=l), masking (ktrim=n), and filtering (default). But it can only do one at a time because all kmers are stored in a single table. It can still do non-kmer-based operations such as quality trimming at the same time. BBDuk2 can do all 4 kmer operations at once and is designed for integration into automated pipelines where you do contaminant removal and adapter-trimming in a single pass to minimize filesystem I/O. Personally, I never use BBDuk2 from the command line. Both have identical capabilities and functionality otherwise, but the syntax is different. Quote:
"tbo" tries to determine the insert size by reverse-complimenting read 2 and finding the best overlap such that bases of the two reads match (using the same algorithm as BBMerge, with conservative settings). If a valid overlap is found, and the calculated insert size is shorter than the read length, both reads will be trimmed to the insert size (regardless of the "tpe" flag). So they are a little different but I recommend enabling them both if you have paired fragment libraries. They are not enabled by default because with long-mate-pair libraries you would not want that behavior. Please let me know if that does not fully answer your question, and I'll look into Trimmomatic's palindrome mode to see exactly what it does. Quote:
|
|||
![]() |
![]() |
![]() |
#49 | |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,706
|
![]() Quote:
The speed is mainly affected by the flags: "prefilter", "prealloc", "onepass" (if prefilter is enabled), and when dumping kmers, "mincount". Oh, and having pigz installed can help, too. Note, though, that BBDuk and KmerCountExact are both limited to K={1 to 31}; I'm not sure about Jellyfish. KmerCountExact does not currently support set subtraction (I think that's what you're after?), but it sounds like a useful thing to include, so I will add it to my todo list. The way it's threaded is a bit different from BBDuk, also, so it's much faster at counting kmers from reads than BBDuk is when counting them from a reference. |
|
![]() |
![]() |
![]() |
#50 | ||
Member
Location: Mid-Atlantic Join Date: Jun 2013
Posts: 22
|
![]() Quote:
Quote:
Last edited by ysnapus; 11-14-2014 at 08:47 AM. Reason: found an answer |
||
![]() |
![]() |
![]() |
#51 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,706
|
![]()
If you have sufficient coverage (at least 20x), and you have sufficient RAM for the genome size, you can do a quick assembly with (e.g.) Velvet and map to that. The assembly doesn't have to be good to generate the 'mhist' statistics.
Alternately, if your related species is very close - say, within 98% identity - you can still get a good feel for the error rate along the reads. There is a bias on the right end of reads for two primary reasons. One is short inserts; even if you trim adapters, you won't get all of them. Do you mind giving your BBDuk command line, and std error output? The other is probably the degradation in quality/intensity with increasing sequencing cycles. From talking to Illumina, it sounds like they don't recalibrate per cycle for relative base intensity drift. Whatever the cause, this has been causing us concern as well, so both we and Illumina are investigating it. |
![]() |
![]() |
![]() |
#52 | |
I like code
Location: San Diego, CA, USA Join Date: Sep 2009
Posts: 438
|
![]() Quote:
Quick question about BBDuk. I see in the help the setting 'findbestmatch=f' which is described "If multiple matches, associate read with sequence sharing most kmers." I was under the impression that BBDuk does not retain nor output such information as associations between a specific read and a specific sequence rather it only gives you a binary matched or not type result. Have you included some extra voodoo in here that actually reports associations? If not I'm wondering what use this setting is. thanks.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff Salk Institute for Biological Studies, La Jolla, CA, USA */ |
|
![]() |
![]() |
![]() |
#53 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,706
|
![]()
BBDuk cannot do set subtraction alone because it only stores kmers from the reference, not the reads. But yes, you could use kmercountexact and BBDuk together as you describe to achieve various set operations.
As for the second question - BBDuk associates every kmer with a specific sequence. Normally that information is not used since the operations are binary (trimming and filtering). But if you write with the "stats=" flag, it will tell you the number and fraction of reads that matched each reference sequence, which is very useful when contaminant filtering if you want to know what kind of contaminants were present. Normally, as soon as a single kmer match is found, the function exits early (and the counter for the associated sequence is incremented). If the "fbm"/"findbestmatch" flag is enabled, it will go through the entire read and look up every kmer, which makes it a little slower. But, this is useful if, for example, you want to see whether some long sequence is more closely related to one of two different bacterial species, and worry that they may share a few kmers (but not the vast majority). BBDuk only associates each kmer with a single ref sequence, so if it is shared by two different sequences (for example, two different adapters that are the same except for the bar code), it is classified as matching the first one. I'm now mostly done with a new tool that associates each kmer with an arbitrary number of sequences, but it will be a bit slower and use more memory than BBDuk. |
![]() |
![]() |
![]() |
#54 |
I like code
Location: San Diego, CA, USA Join Date: Sep 2009
Posts: 438
|
![]()
These tools do so many things. The BBTools are my new favorite toys.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff Salk Institute for Biological Studies, La Jolla, CA, USA */ |
![]() |
![]() |
![]() |
#55 | ||
Member
Location: Mid-Atlantic Join Date: Jun 2013
Posts: 22
|
![]() Quote:
Code:
~/software/bbmap/bbduk.sh in1=../../raw-reads/MiSeq/JBad-Wichita2_${pre}_R1_001.fastq.gz in2=../../raw-reads/MiSeq/JBad-Wichita2_${pre}_R2_001.fastq.gz out1=Wichita2_${pre}_R1.fastq.gz out2=Wichita2_${pre}_R2.fastq.gz k=28 mink=13 hdist=1 ktrim=r ref=~/software/bbmap/resources/truseq.fa.gz tpe tbo Code:
java -ea -Xmx3235m -cp /myhome/software/bbmap/current/ jgi.BBDukF in1=../../raw-reads/MiSeq/JBad-Wichita2_140908_R1_001.fastq.gz in2=../../raw-reads/MiSeq/JBad-Wichita2_140908_R2_001.fastq.gz out1=Wichita2_140908_R1.fastq.gz out2=Wichita2_140908_R2.fastq.gz k=28 mink=13 hdist=1 ktrim=r ref=/myhome/software/bbmap/resources/truseq.fa.gz tpe tbo Executing jgi.BBDukF [in1=../../raw-reads/MiSeq/JBad-Wichita2_140908_R1_001.fastq.gz, in2=../../raw-reads/MiSeq/JBad-Wichita2_140908_R2_001.fastq.gz, out1=Wichita2_140908_R1.fastq.gz, out2=Wichita2_140908_R2.fastq.gz, k=28, mink=13, hdist=1, ktrim=r, ref=/myhome/software/bbmap/resources/truseq.fa.gz, tpe, tbo] maskMiddle was disabled because useShortKmers=true Initial: Memory: free=2036m, used=22m Added 64696 kmers; time: 0.143 seconds. Memory: free=1961m, used=97m Input is being processed as paired Started output streams: 0.952 seconds. Input: 26109034 reads 7577351149 bases. KTrimmed: 58084 reads (0.22%) 4052612 bases (0.05%) Trimmed by overlap: 2793852 reads (10.70%) 35617283 bases (0.47%) Result: 26108900 reads (100.00%) 7537681254 bases (99.48%) Time: 2409.165 seconds. Reads Processed: 26109k 10.84k reads/sec Bases Processed: 7577m 3.15m bases/sec Quote:
|
||
![]() |
![]() |
![]() |
#56 | ||
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,706
|
![]() Quote:
Quote:
![]() |
||
![]() |
![]() |
![]() |
#57 | |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,706
|
![]() Quote:
There's also a new tool, "seal.sh", which is still in beta, but can be used for the same purpose (renaming). Unlike BBDuk, it associates each reference kmer with multiple reference sequences, so it is still perfectly accurate in situations when multiple reference sequences may share kmers. And I included sdriscoll's reformatted help menu, so that looks a lot nicer now! |
|
![]() |
![]() |
![]() |
#58 | ||
Member
Location: Mid-Atlantic Join Date: Jun 2013
Posts: 22
|
![]() Quote:
Quote:
Code:
java -ea -Xmx1234m -cp $HOME/software/bbmap/current/ jgi.BBDukF in1=../../raw-reads/MiSeq/JBad-Wichita2_140908_R1_001.fastq.gz in2=../../raw-reads/MiSeq/JBad-Wichita2_140908_R2_001.fastq.gz out1=Wichita2_140908_R1.fastq.gz out2=Wichita2_140908_R2.fastq.gz k=28 mink=13 hdist=1 ktrim=r ref=$HOME/software/bbmap/resources/truseq.fa.gz,/$HOME/software/bbmap/resources/nextera.fa.gz tpe tbo Executing jgi.BBDukF [in1=../../raw-reads/MiSeq/JBad-Wichita2_140908_R1_001.fastq.gz, in2=../../raw-reads/MiSeq/JBad-Wichita2_140908_R2_001.fastq.gz, out1=Wichita2_140908_R1.fastq.gz, out2=Wichita2_140908_R2.fastq.gz, k=28, mink=13, hdist=1, ktrim=r, ref=$HOME/software/bbmap/resources/truseq.fa.gz,$HOME/software/bbmap/resources/nextera.fa.gz, tpe, tbo] maskMiddle was disabled because useShortKmers=true Initial: Memory: free=107m, used=16m Added 187372 kmers; time: 0.469 seconds. Memory: free=101m, used=22m Input is being processed as paired Started output streams: 2.701 seconds. Input: 26109034 reads 7577351149 bases. KTrimmed: 119429 reads (0.46%) 9212131 bases (0.12%) Trimmed by overlap: 2816312 reads (10.79%) 34210537 bases (0.45%) Result: 26108264 reads (100.00%) 7533928481 bases (99.43%) Time: 1153.222 seconds. Reads Processed: 26109k 22.64k reads/sec Bases Processed: 7577m 6.57m bases/sec ![]() ![]() ![]() ![]() Last edited by ysnapus; 11-17-2014 at 07:52 AM. |
||
![]() |
![]() |
![]() |
#59 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,706
|
![]()
Adapter contamination should be symmetric. Therefore, the mismatch rate caused by adapter sequences in read 1 should be the same as in read 2. But in your mhist graphs, read 2 has a much more dramatic increase than read 1 (and the base frequencies are much more clearly divergent) - this is not caused by adapters, but just a decline in quality, so quality trimming is probably a good idea; it would not surprise me if that greatly improved the base frequency divergence.
Judging visually, it seems like the draft genome is only about 97-98% identity to your organism, which makes it very noisy since most of the mismatches come from genomic difference rather than sequencing errors (or adapters). But certainly it would be good to trim the very last base (I'm guessing this is a 2x301bp run?) which you can do with the "ftr=299" flag, since it has a super-high 20%+ error rate. It's possible to use overlap to determine adapter sequences, though it will currently require some manual work. bbmerge.sh in=reads.fq out=joinable.fq join=f reads=200000 maxlength=280 ...will create a file containing only reads that overlap, and have an insert size of at most 280bp, but not actually join them. And it will only process the first 200k reads, which will take a few seconds. bbmerge.sh in=joinable.fq outinsert=insertSizes.txt That will give you a file containing the insert sizes of the reads and their names. So you can find a read with an insert size of, say, 270 and then fetch the sequence from its name, and the last 30bp should be adapter. I'll add a mode that will actually rename the reads based on their insert size, which would make this easier. |
![]() |
![]() |
![]() |
#60 |
I like code
Location: San Diego, CA, USA Join Date: Sep 2009
Posts: 438
|
![]()
Hey Brian,
When we use hdist=1 with BBDuk is the increase in memory use proportional to the reference or the input? If I keep getting "java.lang.OutOfMemoryError: Java heap space" errors then should I: a. split the reference up into smaller files and run them separately or b. split the input into multiple files and run them separately ? Also watching it load up in 'top' I only see java reach about 36GB of RAM usage before the errors start landing (separate ones for separate threads). My system has 128GB total RAM though. Is there a java specific limit I can change somewhere (Ubuntu Linux). Thanks.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff Salk Institute for Biological Studies, La Jolla, CA, USA */ Last edited by sdriscoll; 11-21-2014 at 09:59 AM. Reason: more question added |
![]() |
![]() |
![]() |
Tags |
adapter, bbduk, bbtools, cutadapt, trimmomatic |
Thread Tools | |
|
|