SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Quality-, adapter- and RRBS-trimming with Trim Galore! fkrueger Bioinformatics 132 04-18-2017 02:04 AM
Adapter trimming figo1019 RNA Sequencing 1 04-07-2014 11:58 AM
Adapter trimming and trimming by quality question alisrpp Bioinformatics 5 04-08-2013 05:55 PM
adapter trimming - help a_mt Bioinformatics 6 11-12-2012 08:36 PM
3' Adapter Trimming caddymob Bioinformatics 0 05-27-2009 01:53 PM

Reply
 
Thread Tools
Old 07-18-2014, 08:00 PM   #21
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

I just released a new version of BBTools, 33.11. BBDuk received a slight upgrade, and now has a much-needed additional flag, 'outsingle' (or 'outs'). This is the output stream for reads that pass filtering requirements, but their mate does not. It's most relevant if you set "minlen=X" for quality-trimming, when one read gets trimmed to be shorter than X and the other read is longer than X. It also captures reads such that one had a kmer match to an artifact and the other didn't, or one was adapter-trimmed to be shorter than minlen and the other wasn't, but those cases are probably less interesting.
Brian Bushnell is offline   Reply With Quote
Old 07-24-2014, 01:03 PM   #22
scottdson
Junior Member
 
Location: Canada

Join Date: Jul 2011
Posts: 5
Default

Hi Brian,

Thanks for releasing this tools it's a great help. I have a set of fastq files for paired-end reads, 9 sets of paired files with around 46 million reads in each. You're tool has been great for getting rid of left hand adapter contamination. I do have a bit of a weird experience though, one pair of files behaves strangely.

For this pair of files bbduk finishes without incident, but after roughly 12 million reads in the out1 file there is an error, while the out2 file is fine. Looking into the error it seems that around 4 hundred sequences were lost in the out1 file but remained in the out2 file.

out1 looked like this
Quote:
@HWI-D00423:52:H8TMKADXX:1:1116:16453:62330 1:N:0:GCCAAT
GCTGAGTCTGAAGAGTTTATTGCCTTCTCTGCTTCGTAGAGAACGTAGCATGTTTCTTCAGCACACCTTTTGAAGCGCATCGAGCATGGCTGAGTGTGTTGTATCTCTACTGTTAGTCCACTTCTCAGGACCAAATCATCAACAGATCGGA
+
CCCFFFFFHHHHHJJHHIJJJJJJJJJJJJJJJJJJHJJJIJJJJHIJIJIJJJJJJJJJJJJJJJJJJIJIHHEHFFDDDDDDDDDDDDDDDDCDDDDDDDCDEEDEEDEDEDDDDDDDDDDDDDEDDDDDDDDDDDDDDDDDDDDDDDD
@HWI-D00423:52:H8TMKADXX:1:1116:16262:62363 1:N:0:GCCAAT
GCCTGCATTAAACATTGAGTGAACTTTCCAGAAACACTCTTTCAGAGATCTTCAATGCGTGGGAAGAGTTTTTCACCTGCAGTATAGCTTGCCTCTACCAACTGTTTGCCCTTAGAAATGTTCAATTTCACAAAGCTATGTAJJJJJJJJHGIJJIJJJJJJIJJJHHHHHHFFFFFEEEEFFFEDDDDDDDDDECDEDDDDDDDDDDEDDEDDCDDDDDDDDDDDDDCCEEDDDEDDDDEEEEE
@HWI-D00423:52:H8TMKADXX:1:1116:20066:62469 1:N:0:GCCAAT
GTTCCAAGCTCCGGCGAGGGAGGCATCCGCCCCGACTCGGGGCTTCTCCTGCCCAGTCTGCCCAAGCGTAGAGCCCTGCTCTCTGGGAACTCACCTCCCCGCTCGGGGAGAGCCCGGTTAGGGCCGCGGGAGGCCCCAGTCTCAGACCTTC
+
CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJHHFFDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDBBBBDDDDDDDDDDDD>BBDDDDDDDDDDDDDDDDDDDDDDDECDDDDDD
At first i thought the quality line of the "@HWI-D00423:52:H8TMKADXX:1:1116:16262:62363 1:N:0:GCCAAT" block had been concatenated to the sequence, but i soon saw that the fastq blocks below this one were out of step with the ones in the file from the second ends. Hence finding out that over 400 fastq blocks disappeared.

I've run this a couple of times and it happens in the same place which is suggesting a programmatic cause rather than a memory leak.

I also selected 10 fastq blocks from across the "@HWI-D00423:52:H8TMKADXX:1:1116:16262:62363 1:N:0:GCCAAT" block and bbduk works perfectly on these so it doesn't seem to be anything with the read or fastq format of the region.

As a quick work around i'm going to try splitting the fastq files in half and running the two smaller sets through bbduk.

Thanks again for the great tool, just thought i would provide this feed back here.

Scott
scottdson is offline   Reply With Quote
Old 07-24-2014, 01:53 PM   #23
scottdson
Junior Member
 
Location: Canada

Join Date: Jul 2011
Posts: 5
Default

Hi Brian,

I was using version 33.04 so i'll update and try this again.

Mnay thanks,

Scott
scottdson is offline   Reply With Quote
Old 07-24-2014, 02:04 PM   #24
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Scott,

Thanks a lot for the feedback. Would you mind also posting or emailing me the command line and stdout/stderr (especially if there is an error message), and telling me what version of BBTools this is? Also, if there's any way it is possible for me to get the input files, that would be great, though I realize it's unlikely.

My assumption is that the input file is either corrupted in some way, or contains some symbols that my I/O routines were not expecting, or that multiple programs/threads were writing to the same file at the same time (that's the only case I've encountered that gave similar-looking output).
Brian Bushnell is offline   Reply With Quote
Old 08-20-2014, 12:17 AM   #25
Paul Newport
Member
 
Location: Bristol, UK

Join Date: May 2014
Posts: 10
Default

Hi there! First of all, thanks for that nice piece of work!

I have a question regarding the 'ref=' option. In what format do I have to place the adapter sequences? Let's say for a illumina paired-end run.

Form the Illumina Customer Sequence Letter I get the following information:
PE Adapters
5' P-GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG
5' ACACTCTTTCCCTACACGACGCTCTTCCGATCT

What would I write in the file I hand over to 'ref='? Or is there another possibility using 'literal='? How does the algorithm distinguish between 5' and 3' adapter?


Thanks for your help!
Paul Newport is offline   Reply With Quote
Old 08-20-2014, 12:37 AM   #26
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Paul,

For the "ref=" flag, the adapters should be in fasta format. To use the "literal=" flag, you don't need to put them in a file. I'm not sure what the "P-" prefix signifies on the first adapter, but the way you would trim PE reads with those sequences is like this:

bbduk.sh -Xmx1g in1=read1.fq in2=read2.fq out1=trimmed1.fq out2=trimmed2.fq k=28 mink=13 hdist=1 ktrim=r literal=GATCGGAAGAGCGGTTCAGCAGGAATGCCGAG,ACACTCTTTCCCTACACGACGCTCTTCCGATCT tpe tbo

(all one line)

5' adapters will end up being seen on the 3' (right) end of the reads when the insert size is too short, so for that, you use the "ktrim=r" flag (meaning when a kmer matches the reference, trim that kmer and everything to the right).

I also included a file "truseq.fa.gz" in the /resources/ directory which contains the most commonly-used Illumina adapter sequences, though the ones you mention are not in it.

P.S. I added the "tpe" (trim pairs equally) and "tbo" (trim by overlap) flags. These improve the sensitivity for paired-end fragment libraries, in which the adapter should occur in the same place on each read and reads with adapter sequences should overlap when reverse-complemented. Adapter-trimming is actually possible using these flags only, without even specifying the sequences to trim, but the highest efficiency is attained by doing both overlap and sequence trimming. These flags should NOT be used with long-mate-pair libraries, and have no effect on single-ended libraries.

Last edited by Brian Bushnell; 08-20-2014 at 10:08 AM.
Brian Bushnell is offline   Reply With Quote
Old 08-28-2014, 10:20 AM   #27
lsayaved
Junior Member
 
Location: Germany

Join Date: Jul 2014
Posts: 2
Default

Dear Brian,
Thanks for creating this useful set of tools. I would like to use the sam output file from bbmap for SNP calling with Freebayes. The problem is that the sam file produced with bbmap includes identifiers that do not exist in the original fasta file (this does not happen with e.g. bowtie). For example a sequence id NODE_14_length_18854_cov_104.3_ID_21 will show up in the sam file multiple times with _ and a number like this:
@HD VN:1.3 SO:unsorted
@SQ SN:NODE_14_length_18854_cov_104.3_ID_21 LN:500
@SQ SN:NODE_14_length_18854_cov_104.3_ID_21_2 LN:500
@SQ SN:NODE_14_length_18854_cov_104.3_ID_21_3 LN:500
@SQ SN:NODE_14_length_18854_cov_104.3_ID_21_4 LN:500
@SQ SN:NODE_14_length_18854_cov_104.3_ID_21_5 LN:500
@SQ SN:NODE_14_length_18854_cov_104.3_ID_21_6 LN:500
@SQ SN:NODE_14_length_18854_cov_104.3_ID_21_7 LN:500
@SQ SN:NODE_14_length_18854_cov_104.3_ID_21_8 LN:500
@SQ SN:NODE_14_length_18854_cov_104.3_ID_21_9 LN:500


Is there a way to change the sam format to the standard?

Thanks for any advice,
Liz
lsayaved is offline   Reply With Quote
Old 08-28-2014, 10:26 AM   #28
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Liz,

There was a bug in a version of BBMap that was released briefly (and since been removed) in which the reference was loaded incorrectly. What version are you using? The latest one on Sourceforge (33.21) works correctly.
Brian Bushnell is offline   Reply With Quote
Old 08-28-2014, 02:43 PM   #29
lsayaved
Junior Member
 
Location: Germany

Join Date: Jul 2014
Posts: 2
Default

I downloaded the new version 33_21 but I still got the same result. I removed the "." from the my fasta headers and now seems to work. Thanks!
lsayaved is offline   Reply With Quote
Old 08-28-2014, 02:58 PM   #30
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Oh, yes, you just needed to regenerate the index, which could have been done by deleting the 'ref' folder, but which happened automatically when BBMap noticed the fasta reference had been modified. Sorry for the inconvenience!
Brian Bushnell is offline   Reply With Quote
Old 09-08-2014, 09:08 AM   #31
biodan
Junior Member
 
Location: NYC

Join Date: Oct 2012
Posts: 2
Default

Hi Brian,

Thanks for the bbduk tool, it is indeed very fast. I just downloaded bbmap today (Aug29 release), and get errors when i try to process a pair of files in 1 line:

/opt/bbmap/bbduk.sh -Xmx4g in1=clean_1g_C10_R1.fastq.gz in2=clean_1g_C10_R2.fastq.gz out1=tc_clean_1g_C10_R1.fastq.gz out2=tc_clean_1g_C10_R2.fastq.gz minlen=50 qtrim=rl trimq=20

Initial:
Memory: free=2013m, used=12m

Input is being processed as paired
Started output streams: 0.046 seconds.
Exception in thread "Thread-4" java.lang.AssertionError:
There appear to be different numbers of reads in the paired input files.
The pairing may have been corrupted by an upstream process. It may be fixable by running repair.sh.
at stream.ConcurrentGenericReadInputStream.pair(ConcurrentGenericReadInputStream.java:501)
at stream.ConcurrentGenericReadInputStream.readLists(ConcurrentGenericReadInputStream.java:367)
at stream.ConcurrentGenericReadInputStream.run(ConcurrentGenericReadInputStream.java:205)
at java.lang.Thread.run(Thread.java:744)

However, when i do 1 file at a time, it works with no errors. The input files are relatively small (about 1.5M reads from single cells).
biodan is offline   Reply With Quote
Old 09-08-2014, 01:37 PM   #32
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

If you are sure those files go together, then the most likely problem is that you used a different program upstream that was not pair-aware and removed some reads from one file but not the other. The best thing to do is redo the processing on the original files with something that retains pairing. Failing that, you can attempt to fix them like this:

cat clean_1g_C10_R2.fastq.gz tc_clean_1g_C10_R2.fastq.gz | repair.sh -Xmx4g in=stdin.fq.gz out1=r1.fq.gz out2=r2.fq.gz outs=singletons.fq

...which will redo the pairings based on the read names. It may help if you could print the first and last 4 lines of each file, as well as the word counts, and/or explain exactly what processing steps were done on the reads prior to quality-trimming.
Brian Bushnell is offline   Reply With Quote
Old 09-08-2014, 02:00 PM   #33
biodan
Junior Member
 
Location: NYC

Join Date: Oct 2012
Posts: 2
Default

Thanks for the tip Brian. The files are indeed pairs, the 'clean' files were generated by trimming the adapters (commands shown below) from parent files which have been previously, successfully mapped by Bowtie2 without adapter trimming.

/opt/bbmap/bbduk.sh -Xmx1g in=mrg_1g_C10_R1.fastq.gz out=clean_1g_C10_R1.fastq.gz ref=/opt/bbmap/resources/nextera.fa.gz ktrim=rl hdist=1

/opt/bbmap/bbduk.sh -Xmx1g in=mrg_1g_C10_R2.fastq.gz out=clean_1g_C10_R2.fastq.gz ref=/opt/bbmap/resources/nextera.fa.gz ktrim=rl hdist=1

In any event, your repair.sh script fixed the files.
biodan is offline   Reply With Quote
Old 09-08-2014, 03:31 PM   #34
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

I'm glad that repair.sh worked for you. In the future, I suggest running paired files simultaneously when doing any sort of trimming/filtering, like this:

bbduk.sh -Xmx1g in1=mrg_1g_C10_R1.fastq.gz in2=mrg_1g_C10_R2.fastq.gz out1=clean1.fastq.gz out2=clean2.fastq.gz ref=/opt/bbmap/resources/nextera.fa.gz ktrim=r hdist=1

That way, pairs will be kept together throughout the pipeline.
Brian Bushnell is offline   Reply With Quote
Old 10-26-2014, 10:48 PM   #35
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 305
Default

Hello Brian,
when using the kmer trimming/conversion option "ktrim=x" (i.e. conversion to a letter instead of trimming the bases) the quality scores of all the entire reads get converted to "!" , even the parts of the reads that are not supposed to be trimmed. This might not be supposed to be this way?

Last edited by luc; 10-27-2014 at 01:23 AM.
luc is offline   Reply With Quote
Old 10-27-2014, 01:05 AM   #36
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

That's definitely not supposed to happen! I will try to replicate and fix it tomorrow.
Brian Bushnell is offline   Reply With Quote
Old 10-27-2014, 10:41 AM   #37
lankage
Member
 
Location: Madison, WI

Join Date: Oct 2014
Posts: 20
Default bbduk end trimming

Hi Brian,
I have been trying to use bbduk to trim primer sequences from paired end fastq files and came across a problem. I have been wanting to replace a primer trimming tool, flexbar, with bbduk for speed reasons. One option i don't see in bbduk that prevents unwanted trimming is a way to specify only to left-trim primer sequences within the first 50 bases of a given read. Explanation here:
http://sourceforge.net/p/flexbar/wik...trim-end-modes

Are there options i can specify to replicate this behavior with bbduk?
lankage is offline   Reply With Quote
Old 10-27-2014, 12:15 PM   #38
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

@luc,

That bug is now fixed in v33.73. Thanks for reporting it!

@lankage,

There were no such options, so I just added them in v33.73. The flags are:

restrictleft=X If positive, only look for kmer matches in the leftmost X bases.
restrictright=X If positive, only look for kmer matches in the rightmost X bases.


Both default to 0, meaning they will be ignored unless you set them.

-Brian
Brian Bushnell is offline   Reply With Quote
Old 10-27-2014, 01:11 PM   #39
lankage
Member
 
Location: Madison, WI

Join Date: Oct 2014
Posts: 20
Default Version 33.73 BBduk

Thanks alot for the addition! I downloaded version 33.73 of BBtools and I can see the options for restrictleft and restrictright in the help text when running the script with no parameters. I get a java exception when trying to use them however.

Error message:
Exception in thread "main" java.lang.RuntimeException: Unknown parameter restrictleft=50
at jgi.BBDukF.<init>(BBDukF.java:427)
at jgi.BBDukF.main(BBDukF.java:66)
lankage is offline   Reply With Quote
Old 10-27-2014, 02:51 PM   #40
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,695
Default

Hmmm, somehow the altered file didn't make it into the package. I've reuploaded it as v33.73b. Sorry about that!
Brian Bushnell is offline   Reply With Quote
Reply

Tags
adapter, bbduk, bbtools, cutadapt, trimmomatic

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:35 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO