SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Adapter trimming figo1019 RNA Sequencing 2 07-17-2018 04:00 AM
Quality-, adapter- and RRBS-trimming with Trim Galore! fkrueger Bioinformatics 132 04-18-2017 01:04 AM
Adapter trimming and trimming by quality question alisrpp Bioinformatics 5 04-08-2013 04:55 PM
adapter trimming - help a_mt Bioinformatics 6 11-12-2012 07:36 PM
3' Adapter Trimming caddymob Bioinformatics 0 05-27-2009 12:53 PM

Reply
 
Thread Tools
Old 12-09-2014, 02:21 PM   #81
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Brian,

Do you foresee making it possible to use kmers longer than 31bp at any point? Just curious.

Also could you add the 'trd' parameter to seal so that when one enables the 'rename' option one gets those nicely trimmed reference names as in bbmap?

Thanks!
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 12-09-2014, 02:48 PM   #82
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

I am considering making an assembler based on BBDuk/Seal's data structures. When I do that I will probably support k>31, which will probably percolate back to BBDuk/Seal, but that won't be anytime soon.

BBDuk does allow processing of k>31 in emulated mode when filtering; specifically, if you set k=35, for example, it would require 5 consecutive 31-mer hits to trigger a match. It's not quite the same, but it does greatly increase specificity. Seal lacks this ability because it would be kind of difficult to track the number of consecutive matches for a kmer hit to an arbitrary number of different references simultaneously.

Seal and BBDuk - in the latest version, 34.08 - already support the trd flag, it's just undocumented. I will document it. Between 34.00 and 34.08 I unified a lot of the parsing so there are a lot of BBMap flags that are suddenly now supported by other tools.
Brian Bushnell is offline   Reply With Quote
Old 12-16-2014, 10:54 AM   #83
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Hi Brian,

I'm using bbduk to trim adapters from a microRNA seq run and in this case I only want to keep reads that were successfully trimmed. My command line is something like this:

Code:
 bbduk.sh in=reads.fastq.gz out=trimmed.fq ref=mirna_adapter.fa ktrim=r k=20 mink=12 hdist=1 minlength=15
bbduk finishes successfully and reports:

Code:
Input:                  	22241069 reads 		1134294519 bases.
KTrimmed:               	16273134 reads (73.17%) 	401433668 bases (35.39%)
Result:                 	21270190 reads (95.63%) 	732860851 bases (64.61%)
If I count the number of reads in my output file, trimmed.fq, I get 21270190, however according to bbduk only 16273134 reads were trimmed. What is going on here?
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 12-16-2014, 11:36 AM   #84
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

22241069 reads were processed. Of those, 16273134 had adapter sequence detected, so they were trimmed. The output was 21270190 reads, which includes both trimmed reads (which had adapter detected) and untrimmed reads (which didn't). (22241069 - 21270190) = 970879 reads were discarded because after trimming, they were too short to use (default 10bp).

And that accounts for all the reads. Does that answer your question?
Brian Bushnell is offline   Reply With Quote
Old 12-16-2014, 11:37 AM   #85
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Yes. What option should I used so that I can get a FASTQ file with ONLY the reads that were trimmed?
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 12-16-2014, 12:09 PM   #86
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by sdriscoll View Post
Yes. What option should I used so that I can get a FASTQ file with ONLY the reads that were trimmed?
Oh, oops, I didn't read your question carefully enough. Assuming the reads were the same length going in, you can subsequently filter the output file to retain only reads shorter than the original length:

bbduk.sh in=trimmed.fq outm=filtered.fq minlength=100

You could also use the minlength and outm flags on your original command line instead. "outm" is a little confusing but it gathers all the reads that 'fail', meaning they either match something (presumed to be a contaminant), OR after trimming are shorter than minlength. "outu" (aka "out") gets everything else.

If your input reads were variable-length then this approach would not work, but you could still do it like this:

Code:
bbduk.sh in=reads.fastq.gz mlf=1.0 outm=trimmed.fq ref=mirna_adapter.fa ktrim=r k=20 mink=12 hdist=1 minlength=15
...where "mlf" or "minlengthfraction" means that reads will fail if they are trimmed to less than that fraction of their original length.
Brian Bushnell is offline   Reply With Quote
Old 12-16-2014, 12:17 PM   #87
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Now that I think about it, I guess the problem is that it is not clear what "out" and "outm" do for BBDuk.

out: Accepts all reads that pass criteria. If BBDuk is used for kmer filtering, that means all reads that do not have kmer matches to the reference. If BBDuk is used for trimming, that means all reads that, after trimming, are longer than a minimum length, whether or not they had any matching kmers.

outm: Accepts all reads that fail criteria and thus would not go to "out". "m" stands for "match", and when kmer filtering it accepts reads with matching kmers, but it also gets reads shorter than minlen (again, the default is 10, so you'd have to set it to 0 to disable this behavior).

Last edited by Brian Bushnell; 12-16-2014 at 12:19 PM.
Brian Bushnell is offline   Reply With Quote
Old 12-16-2014, 12:34 PM   #88
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

OK, thanks that's useful. I guess logically if I'm also specifying mink=12 and I only want reads that were trimmed I could also use maxlength=[read_length - mink] and then out would contain only reads reads that were trimmed. I didn't notice the mlf option...that should be handy as well.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 12-16-2014, 01:11 PM   #89
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

For seal, would there be any logical way to add in number of unique hits to the rpkm= output? I'd also be interested to know what it does when a read has equally good matches to two or more target references.

Thanks!
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 12-16-2014, 01:41 PM   #90
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

By unique, do you mean "only matching that sequence" or "occurring in a unique position in that sequence"? The latter cannot be done, but the former can be controlled with the "ambig" flag that I forgot to document. It is similar to BBMap's ambig flag. This is the description (which will be present in the next release):

Code:
ambiguous=random    (ambig) Set behavior on ambiguously-mapped reads (with an
                    equal number of kmer matches to multiple sequences).
                         first:  Use the first best-matching sequence.
                         toss:   Consider unmapped.
                         random: Select one best-matching sequence randomly.
                         all:    Use all best-matching sequences.
So, the default is random. With BBMap the default is "first".
Brian Bushnell is offline   Reply With Quote
Old 12-16-2014, 02:35 PM   #91
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Yes, "only matching that sequence" is what I mean. So in seal if you use ambig=all then that read (hit) is added to all targets it matched (if there were multiple best matches)?

Or say...what happens when a read can match one sequence with 20 kmers and a second with 18 kmers?
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 12-16-2014, 03:20 PM   #92
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by sdriscoll View Post
So in seal if you use ambig=all then that read (hit) is added to all targets it matched (if there were multiple best matches)?
That's correct.
Quote:
Or say...what happens when a read can match one sequence with 20 kmers and a second with 18 kmers?
Seal will only consider something ambiguous if it matches multiple sequences with the same number of kmers, which is also to the highest number. However, I could add a flag that changes it to allow any number of kmers.
Brian Bushnell is offline   Reply With Quote
Old 12-19-2014, 03:21 PM   #93
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Hey Brian,

bbduk and seal are certainly great programs and since I have downloaded these tools I've been using them pretty much daily. While most kmer tools ignore the idea of allowing a mismatch (since obviously this complicates hash lookups) I like that you have included this however using it imposes such a severe spike in memory usage that it becomes basically unusable. Case in point - using seal to produce hits to genes in a full transcriptome while allowing a mismatch per kmer. Is there a reason that you put the burden of additional kmers on the reference and not on the read? It seems to me that there would be a performance hit but I would much rather have that than the massive amount of memory needed to produce a kmer index of a full transcriptome allowing even one mismatch. So during matching the read could be kmer'd into all possible single mismatch variants and matched to the normal 0-mismatch kmer index. Or does that bring in additional issues that are not easily resolved?
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 12-19-2014, 03:42 PM   #94
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

I really kind of like the "mm=t" option because it's similar to allowing a mismatch, but with no speed or memory penalty. It's also possible, though, to generally reduce memory consumption when allowing mismatches using the "speed" or "rskip" flag. "speed=8", for example, would hash only 50% of the kmer space, so it would halve memory consumption; "speed=12" would quarter memory consumption. Of course these reduce sensitivity, but "speed=12 hdist=1" would probably be more sensitive than "speed=0 hdist=0" if there are a lot of mismatches.

There is no reason for me to not generate read kmers with mismatches in them, other than the speed penalty. But, that speed penalty is hefty - with 31mers, allowing 1 mismatch might make the program 93x slower (particularly if most reads do not match the reference; not so much if they mostly do).

I will plan to add that capability in, though I'm not convinced that it will be viable on large datasets (raw reads). However, it WOULD be viable for matching a handful of sequences to a very large reference.
Brian Bushnell is offline   Reply With Quote
Old 12-19-2014, 04:04 PM   #95
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Yeah that last point is totally fair. I'm also not totally sure that for most applications allowing a mismatch really tells me what I want to know anyways.

On another subject, are you familiar with the program Sailfish which uses kmer matching and the EM algorithm for gene expression? It is pretty much seal with an EM step that probabilistically estimates abundance.
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 12-19-2014, 04:10 PM   #96
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

I have heard of expectation maximization but not actually studied what it does. I should probably read about it to see if it seems useful.

I was not really aware of Sailfish until today or yesterday, when the author posted on SeqAnswers that it now has a successor named Salmon. I should probably do a comparative benchmark sometime.

Last edited by Brian Bushnell; 12-19-2014 at 04:33 PM.
Brian Bushnell is offline   Reply With Quote
Old 12-19-2014, 04:22 PM   #97
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

Yeah they came up with some kind of so-called perfect hash so that every kmer lookup is O(1). The slowest part of their program is probably the EM part (it slows down with increasing size of reference). I think these days if a program intends to put out expression levels for a set of references with ambiguity of mapping between features the EM is the standard. Until something better is figured out...
__________________
/* Shawn Driscoll, Gene Expression Laboratory, Pfaff
Salk Institute for Biological Studies, La Jolla, CA, USA */
sdriscoll is offline   Reply With Quote
Old 12-24-2014, 04:46 PM   #98
michaellim
Member
 
Location: England

Join Date: Dec 2014
Posts: 28
Default

Quote:
Originally Posted by Brian Bushnell View Post
I'd like to introduce another member of the BBTools package, BBDuk. "Duk" stands for Decontamination Using Kmers. BBDuk is extremely fast, scalable, and memory-efficient, while maintaining greater sensitivity and specificity than other tools. It can do lots of different things, for example -

Adapter trimming:
bbduk.sh -Xmx1g in=reads.fq out=clean.fq ref=adapters.fa ktrim=r k=28 mink=12 hdist=1

...where "ktrim=r" is for right-trimming (3' adapters), "ktrim=l" is for left-trimming (5' adapters), and "ktrim=N" just masks the kmers with "N". "k" specifies the kmer size to use (must be at most the length of the adapters) while "mink" allows it to use shorter kmers at the ends of the read (for example, k=12 for the last 12 bases). "hdist" means "hamming distance"; this allows one mismatch. Instead of "ref=file" you can alternately (or additionally) say "literal=ACTGGT,TTTGGTG" for those two literal strings. The BBTools package currently includes Truseq and Nextera adapters sequences in /bbmap/resources/truseq.fa.gz and /bbmap/resources/nextera.fa.gz.
Hi Brian,

Got a bit confused with these commands, so I did RNA sequencing using Truseq Stranded mRNA Kit, 200-bp paired-end.

I understand that I'm supposed to trim the adapters off

*Read 1: I will have to trim (AGATCGGAAGAGCACACGTCTGAACTCCAGTCA)

*Read 2: I will have to trim (AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT)

How should the command be? Could you please advise?

Thank you. Merry Christmas!
michaellim is offline   Reply With Quote
Old 12-24-2014, 06:49 PM   #99
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,795
Default

Basically something like this:

Code:
$ replace_windows_OS_specific_bbduk_command_here -Xmx2g in1=/path/to/sample_R1.fastq in2=/path/to/sample_R2.fastq out1=/path/to/sample_R1_no_adapters.fastq out2=/path/to/sample_R2_no_adapters.fastq ref=/path/to/bbmap/resources/truseq.fa.gz ktrim=r [and other options for bbduk to follow]
GenoMax is offline   Reply With Quote
Old 12-25-2014, 04:35 AM   #100
michaellim
Member
 
Location: England

Join Date: Dec 2014
Posts: 28
Default

Quote:
Originally Posted by GenoMax View Post
Basically something like this:

Code:
$ replace_windows_OS_specific_bbduk_command_here -Xmx2g in1=/path/to/sample_R1.fastq in2=/path/to/sample_R2.fastq out1=/path/to/sample_R1_no_adapters.fastq out2=/path/to/sample_R2_no_adapters.fastq ref=/path/to/bbmap/resources/truseq.fa.gz ktrim=r [and other options for bbduk to follow]
Hi Genomax,

Sorry, but I don't understand what is "/path/to/", if my files are all in C: drive Windows, will it be

jave -ea -Xmx2g in1=C:\sequencing\ST131_1_R1.fastq in2=C:\sequencing\ST131_1_R2 out1=??? out2=??? ref=C:\bbmap\resources\truseq.fa.gz ktrim=r k=???? mink=???? hdist=????

??? For out1 and out2 do I need to create some kind of new file?
??? I don't know what to put for k, mink and hdist, should k be 33 because the adapter sequences are 33 bases? not sure about mink and hdist.

Thank you. Merry Christmas
michaellim is offline   Reply With Quote
Reply

Tags
adapter, bbduk, bbtools, cutadapt, trimmomatic

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:30 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO