Old 09-12-2017, 06:50 AM   #261
cuencam

Hi Brian,
Along the same lines as my previous question: what is the rationale for using maq=10? We are interested in de novo assembly of metagenomic data, and we were worried that low-quality bases at the ends of the reads might feed artificial k-mers into the assembler (SPAdes). I read that you recommend read normalization, but since our coverage is highly unequal (due to unequal species abundance, not sequencing artifacts), we are worried that this might introduce more bias than it removes.

We were thinking of using your newly implemented "mbq" option to ensure that all bases have a minimum quality of 20. Do you believe this is a good alternative?
Old 09-12-2017, 01:43 PM   #262
Brian Bushnell

"maq=10" is to throw away really junky reads. The only way to really verify whether a setting is beneficial is to actually test it, unfortunately. But personally, I think "mbq=20" would be too aggressive (particularly if your sequencing run had a single low-quality cycle, in which case it would discard all of the data)... if you really want to get rid of the low-quality trailing bases, I'd suggest quality-trimming instead (qtrim=r trimq=14 or something like that). Spades is pretty robust with respect to low-quality data anyway; the biggest problem is that it low quality reads balloon the kmer-space which can make it run out of memory.

The main advantage of normalization with metagenomes, in fact, is that it removes a lot of data, which allows SPAdes to run on datasets it otherwise couldn't handle. It's not strictly beneficial, and if you can assemble a metagenome without normalization, that may be better; sometimes normalization improves the assembly, sometimes it doesn't.
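
If you do try normalization, bbnorm.sh is the tool for it; a minimal sketch, assuming a coverage target of 100 and discarding reads below depth 5 (filenames and both thresholds are placeholders to tune):

Code:
bbnorm.sh in=reads.fq out=normalized.fq target=100 min=5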
Old 09-13-2017, 04:19 AM   #263
cuencam

Thanks for this response! I'm pretty sure your excellent user support is matched only by the high quality of your tools!

I will implement quality-trimming at a higher threshold and then test. I do agree that mbq=20 is too harsh for assembly (though probably useful for SNV calling).
Cheers
Old 09-15-2017, 05:45 AM   #264
EssigSchurke

Hi Brian,

I tried to filter out reads longer than 10 bp. I used the following command:

Code:
bbduk.sh -in=input.fq -out=output.fq -maxlength=10
However, nothing happens: I get the same number of reads as in the input, even though all reads are longer than 10 bp.
I used the latest version of BBDuk, 37.53.

Test Input:

Code:
@test
ACTGGACTTGGAGTCAGAAGGC
+
b\\[\ZZ[][a]_]]cbbbabc
And the resulting log:

Code:
Input:                  	1 reads 		22 bases.
Total Removed:          	0 reads (0.00%) 	0 bases (0.00%)
Result:                 	1 reads (100.00%) 	22 bases (100.00%)
Old 09-15-2017, 06:11 AM   #265
jazz710

BBDuk options don't have '-' before them. Your command should read:

bbduk.sh in=input.fq out=output.fq maxlength=10

Give that a shot?
Old 09-15-2017, 06:15 AM   #266
EssigSchurke

With or without "-" makes no difference; I get the same results.
Old 09-15-2017, 06:30 AM   #267
cuencam

Hi EssigSchurke,
The flag is minlength=10

The whole command is
bbduk.sh in=input.fq out=output.fq minlength=10

Edit:

I misread your question. The command provided by jazz710 is the appropriate one, and it works on my computer. You want to remove the long reads, correct?

Last edited by cuencam; 09-15-2017 at 06:33 AM.
Old 09-15-2017, 06:35 AM   #268
EssigSchurke

Hi cuencam,

minlength=10 only filters out reads shorter than 10 bp. I want to filter out reads longer than 10 bp; 10 bp is just a dummy value for my test case.
Old 09-15-2017, 07:31 AM   #269
EssigSchurke

Yes, I want to exclude long reads, but I already tested the command provided by jazz710. It produces the same result; the test read is still in the output.
Old 09-15-2017, 10:31 AM   #270
Brian Bushnell

Actually, all the BBTools strip off leading "-" characters, so you can put as many of them as you want.

This is a bug. Thanks for the report! It looks like BBDuk only removes reads under minlen or over maxlen if they were trimmed; untrimmed sequences will pass regardless of their length. Sorry about that! Reformat actually works correctly in this case, though:

Code:
reformat.sh in=x.fq out=y.fq minlen=A maxlen=B
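For the test case above, that would be something like this (keeping only reads of at most 10 bp):

Code:
reformat.sh in=input.fq out=output.fq maxlen=10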
I'll fix BBDuk ASAP. Thanks again!
Old 09-17-2017, 11:01 PM   #271
EssigSchurke

Thanks for the fast response. I will try reformat.
Old 09-28-2017, 02:40 AM   #272
boulund

Hi,

I got the idea to use BBDuk as a tool to filter out kmers that are shared between two samples. Is this a good idea? It's metagenomics, and the concept is that one sample represents the background composition of bacteria, which I want to remove from another sample. One challenge is that the samples are rather big, about 2x35 GB of compressed fastq (paired-end), each containing about 1 billion reads in total (both read pairs combined).
Old 09-28-2017, 10:48 AM   #273
Brian Bushnell

Quote:
Originally Posted by boulund View Post
Hi,

I got the idea to use BBDuk as a tool to filter out kmers that are shared between two samples. Is this a good idea? It's metagenomics, and the concept is that one sample represents the background composition of bacteria, which I want to remove from another sample. One challenge is that the samples are rather big, about 2x35 GB of compressed fastq (paired-end), each containing about 1 billion reads in total (both read pairs combined).
You can find kmers that are shared between two samples like this (bearing in mind that it may take a lot of memory):

Code:
kcompress.sh in=a.fq.gz out=a_kmers.fa.gz
kcompress.sh in=b.fq.gz out=b_kmers.fa.gz
kcompress.sh in=a_kmers.fa.gz,b_kmers.fa.gz out=shared_kmers.fa.gz mincount=2
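
To then remove reads containing those shared kmers from a sample, BBDuk's regular kmer-filtering mode should work; a sketch, with filenames as placeholders and k chosen to match the kmer length kcompress used:

Code:
bbduk.sh in=sample.fq.gz out=background_removed.fq.gz ref=shared_kmers.fa.gz k=31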

However, I would probably tend to just assemble what you consider to be the background, then map reads to that assembly requiring fairly high identity, and keep the reads that don't map. Either approach works (and each has its disadvantages), but whole reads tend to be more specific than kmers.
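
A sketch of that mapping approach with BBMap; the identity cutoff is a placeholder to tune, and outu captures the reads that don't map:

Code:
bbmap.sh ref=background_assembly.fa in=sample.fq.gz outu=nonbackground.fq.gz minid=0.95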

Quote:
Originally Posted by EssigSchurke View Post
Thanks for the fast response. I will try reformat.
BBDuk is fixed now, by the way.
Old 09-28-2017, 11:13 AM   #274
GenoMax

"just assemble what you consider to be the background"

Assemble as?
Old 09-28-2017, 11:27 AM   #275
Brian Bushnell

To clarify, I meant running Megahit (for example) on the reads of the sample considered to be the background, then mapping the reads of the other sample to that assembly.
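
A minimal sketch of that pipeline, with filenames as placeholders (Megahit writes its contigs to final.contigs.fa inside the output directory):

Code:
megahit -1 background_R1.fq.gz -2 background_R2.fq.gz -o background_asm
bbmap.sh ref=background_asm/final.contigs.fa in=sample.fq.gz outu=filtered.fq.gz minid=0.95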
Old 09-28-2017, 11:32 PM   #276
boulund

OK, that's interesting. Our current approach was just that: assembling the background sample with Megahit, and then mapping the sample to be filtered against the background assembly to remove anything that matches. I was hoping it would be possible to do this without the massive overhead of assembling the background sample, as that's fairly time-consuming and memory-hungry for samples this large. I will have to evaluate the different approaches against each other to see which one fits our setup best. Thanks for your input, and thanks for pointing me to kcompress.sh!
Old 10-15-2017, 07:30 AM   #277
dariober
Different output with interleaved input

Hello again-

I think there is an inconsistency in how BBDuk handles interleaved input depending on whether the interleaved option is set to "true" or "auto".

Consider the case where only one read in a pair is discarded because it is too short. With "interleaved=auto", the (interleaved) output contains only the read that passes the filter, which thus appears as a single-end read. With "interleaved=true", both reads are discarded.

Is this difference intentional? In my opinion, interleaved=auto does the right thing in discarding only the bad read and keeping the other. However, this creates an interleaved output with single-end reads (which are actually paired-end, but with the mate gone) interspersed among paired-end ones. I'm not sure the interleaved format has ever been defined to allow such a case (for the record, bwa seems to handle it correctly).

I just thought it useful to point this out...

This should reproduce the issue with BBDuk version 37.54:

Code:
bbduk.sh in=int2.fq out=stdout.fq qtrim=rl minlength=35 trimq=15 interleaved=auto
bbduk.sh in=int2.fq out=stdout.fq qtrim=rl minlength=35 trimq=15 interleaved=true
With int2.fq being:

Code:
@r1
ATGGCATGCACCTGTAATCCCGCTACTTGTGAGGCTGAAGCAGGAGAAT
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@r1
TAATTATATGTTTAAGTAAATGAGTAAAATTCAAGATTGCTATCGGATT
+
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@s1
ATGGCATGCACCTGTAATCCCGCTACTTGTGAGGCTGAAGCAGGAGAAT
+
AAFFFJJJJJFJFJJJJJFAFJJJFJ7<FAJJJJJJJJAAFJJJJJJJJ
@s1
TAATTATATGTTTAAGTAAATGAGTAAAATTCAAGATTGCTATCGGATT
+
#################################################
Old 10-15-2017, 03:25 PM   #278
GenoMax

@Dario: Have you tried this with a file that has read-pair identifiers (/1 in the old Illumina style, or 1:N:0 in the new-style read headers)? Interleaved files may need those identifiers to be present. They can be added by reformat.sh.
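
If I recall the flags correctly, reformat.sh's addslash option does this; a sketch, with filenames as placeholders (int=t forces the input to be treated as interleaved pairs; check reformat.sh's help output for the exact flag names):

Code:
reformat.sh in=int2.fq out=int2_slash.fq int=t addslash=t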
Old 10-16-2017, 01:28 AM   #279
dariober

Quote:
Originally Posted by GenoMax View Post
@Dario: Have you tried this with a file that has read-pair identifiers (/1 in the old Illumina style, or 1:N:0 in the new-style read headers)? Interleaved files may need those identifiers to be present. They can be added by reformat.sh.
Hi GenoMax- These are the real R1 and R2 identifiers:

Code:
@E00295:75:H7LLTALXX:8:1101:4553:1643 1:N:0:8
@E00295:75:H7LLTALXX:8:1101:4553:1643 2:N:0:8
I think one issue is that the interleaved format has never been formally defined, so it's not clear which interleaved option, auto or true, is doing the right thing.

By the way, with interleaved=auto bbduk prints this message to stderr:

Code:
Input is being processed as unpaired
If I append /1 and /2 to the read names, then I get the same output as with interleaved=true (both reads discarded), and the message becomes:

Code:
Input is being processed as paired
so you are right: /1 and /2 must be there for interleaved=auto to work as expected. Maybe it would be useful to make this behavior explicit in the documentation (unless I missed it)?

Having said that, I think discarding both reads in a pair when only one read fails is unnecessary.
Old 10-16-2017, 05:39 AM   #280
GenoMax

Quote:
Having said that, I think discarding both reads in a pair when only one read fails is unnecessary.
Not having reads in sync causes issues when mapping with most aligners (you note that bwa is an exception). I assume the only case where R2 would be eliminated while R1 remains is when trimming based on quality. If R2 contains adapter contamination, then the entire insert is likely short, which may result in the elimination of R1 as well. I generally never have to trim data based on quality, so I have not run into this issue so far.
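
If an output does end up with orphaned mates, repair.sh from BBTools can restore pairing; a minimal sketch, with filenames as placeholders (outs collects the singletons):

Code:
repair.sh in=mixed.fq out=paired.fq outs=singletons.fq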