SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Quality-, adapter- and RRBS-trimming with Trim Galore! fkrueger Bioinformatics 138 11-12-2020 04:58 PM
Adapter trimming figo1019 RNA Sequencing 2 07-17-2018 05:00 AM
Adapter trimming and trimming by quality question alisrpp Bioinformatics 5 04-08-2013 05:55 PM
adapter trimming - help a_mt Bioinformatics 6 11-12-2012 08:36 PM
3' Adapter Trimming caddymob Bioinformatics 0 05-27-2009 01:53 PM

Reply
 
Thread Tools
Old 01-15-2016, 03:02 AM   #141
willd
Junior Member
 
Location: France

Join Date: Jan 2016
Posts: 4
Default

Dear Brian, indeed there seems to be a "@HISEQ....." line missing in this region of the file, and I found that this has happened in at least one other part of the file as well. I guess I will check with the platform who generated this if they have another version of the file, though the md5sum checks out so I guess this sample might be lost if we can't trust it.

Anyway... NO PROBLEM WITH YOUR EXCELLENT BBDuk! (sorry to have suggested that, and thanks again for your help).

Code:
@HISEQ10:C68T3ACXX:8:1101:20187:76019/1
ATTTATTTATTTTATTATTTTTTTGAAACATTTTTACATGTTTATTTATCAAAAGTGAAAAAGCATACAATTCAACACAATTTCAATAGCATTTGACACCA
+
CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJJJJJJJIJJJJJJJJJJJJJJFHIJJJJJIGGIJJHHHHGHEDFFDEEEEEEEEDCDEDDEEDDDBB
@HISEQ10:C68T3ACXX:8:1101:20097:76043/1
GGTTAGTATAGCTTAGTTAAACTTTCGTTTATTGCTAAAGGTTAATCACTGCTGTTTCCCGTGGGGGTGTGGCTAGGCTAAGCGTTTTGAGCTGCATTGCT
+
@@?DDFADFHHHHJIJHGIJJIJJIJIFHIJGJIAC,=?B01:19795:76016/1
CTCGGTCACAACCTGAGGGCTCAGCCCCAGGGCACGGCCATAGCCGGCGCGAGGCCCAAGGCTGCGGTGCTGCGGCTACTGGCGGCGACGAGCCGATCGGA
+
CCCFFFFFHGHFHIJIJJJJJJIICHJIIGIICHIEGGBHHIHIJJJHFDBD:@BDDDDDDD>ABDD>@@CACDD9>@CDDDBBD################
@HISEQ10:C68T3ACXX:8:1101:19859:76022/1
GTACATTTAACCCAGTTTAGTGGCAAGTTCTTTAGCCTTTGCCTTTTCGAGCTTGGCGATACGAGCCACAGACTTAGGACCCAGGACGTTGCCACCCCAGT
+
C@BFFFFFHHHHHJJHIJJIJIJIJJJHIIIIJIJIJJJJJJIIIJJJJIJIIJJJJJJIJIJHHHFFFEEEEEDCCCCDDDDDDDDDDDDDDDDDDDDD>
@HISEQ10:C68T3ACXX:8:1101:19768:76027/1
ACACGCTGGACTCCCCTGGCCATTTCTGGGTTATATAAATGTTTGATGAAAAATCTCTCAGGGGAAATCAAATTTTTGTAAGAAGTTAATGACAGCCCATC
+
@@CFFFFFHHHHHJJJJJJDIJJJJJJIIJDHIJIIFIIJJGHIJIJJGIFIBHJJJJJHIIJJGHFHHFFFFFF;CDEEEDDDDCCDEEDDDDDDDDDDC
@HISEQ10:C68T3ACXX:8:1101:19877:76028/1
AGCCATCCAACCACTCAGTCTTGGCAGTGCAGATGAAAAACTGGGAACCATTTGTGTTGGATCGGAAGAGCACACGACTGAACTCAAGTCACATCACGATC
+
;;?DDDDDFFFAFD3C@:<+C<CBEEG@9CFFCAFGI9DD)?F<2)9?1D3??4).8)==C@)'5C1=263?;@@<',;8:@B(5(5>>:,>AABBB####
(PS: the "#ID Median_fold" that you mentioned was not in my output)
willd is offline   Reply With Quote
Old 03-02-2016, 12:15 AM   #142
chrism
Junior Member
 
Location: Germany

Join Date: Jul 2013
Posts: 2
Default Filtering different contaminants with different k-mer sizes

Dear Brian,

I use BBDuk for filtering reads based on quality, adapter content, and chloroplast content. It is a great tool!

For some applications, I also want to filter out reads that contain a restriction enzyme motif (a sequence of length 4-8).

Currently, this leads to up to four BBduk jobs that are connected by a pipe:

Code:
bbduk.sh  in=in_R1.fastq in2=in_R2.fastq removeifeitherbad=t out=stdout.fq ref=restriction_motifs.fa k=4 rcomp=t maskmiddle=f qout=33  | bbduk.sh  in=stdin.fq int=t removeifeitherbad=t out=stdout.fq minlength=30 qout=33 ref=adapters.fa hammingdistance=2 k=20 mink=10 hammingdistance2=1 ktrim=n maxns=0 maxbadkmers=0 rcomp=t | bbduk.sh  in=stdin.fq int=t removeifeitherbad=t out=stdout.fq ref=chloroplast.fa k=31 rcomp=t qout=33  | bbduk.sh in=stdin.fq int=t out1=out_R1.fastq removeifeitherbad=t out2=out_R2.fastq minlength=30 qtrim=r trimq=15 minavgquality=15 qout=33
My question is:
Can I condense these four jobs into less (ideally one) jobs?

I assume that if a FASTA entry in the file given with "ref=" is smaller than the k-mer size, it is effectively not used for the filtering.
Therefore, the main problem that I see is to define different k-mer sizes for different references (e.g. 4 for the restriction motif, and 31 for the chloroplast).

Thanks!

Chris
chrism is offline   Reply With Quote
Old 03-02-2016, 08:05 PM   #143
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Hi Chris,

You can merge the last two stages:

Code:
bbduk.sh in=stdin.fq int=t out=out_R#.fastq removeifeitherbad=t minlength=30 qtrim=r trimq=15 minavgquality=15 qout=33 ref=chloroplast.fa k=31 rcomp=t
And, you don't actually need to set "qout=33 rcomp=t removeifeitherbad=t" since those are all defaults, so the last two stages shorten to:

Code:
bbduk.sh in=stdin.fq int=t out=out_R#.fastq minlength=30 qtrim=r trimq=15 minavgquality=15 ref=chloroplast.fa k=31
Unfortunately, you cannot combine stages 1, 2, and 3, as they each need different kmer lengths or hamming distances. So the best way to do it is in a pipeline. Your assumption about ignoring sequences shorter than kmer length is correct, but unfortunately it does not help in this situation.

Note that (depending on the environment) BBDuk will try to grab around half of available memory for each process, which may lead to unpredictable results. I suggest using the -Xmx flag in each stage to tell it how much memory to use (-Xmx1g should be fine in each case) whenever piping BBDuk.

The way you accomplished variable-kmer-length adapter-filtering via the "ktrim=n" and "maxns=0" combo was clever, by the way!
Brian Bushnell is offline   Reply With Quote
Old 03-07-2016, 09:40 AM   #144
chrism
Junior Member
 
Location: Germany

Join Date: Jul 2013
Posts: 2
Default

Dear Brian,

thanks for your help!

Best,
Chris
chrism is offline   Reply With Quote
Old 06-14-2016, 02:08 PM   #145
cmbetts
Senior Member
 
Location: Bay Area

Join Date: Jun 2012
Posts: 121
Default Some pairs breaking after multiple steps

I've been doing some multi-step trimming of paired libraries that has been causing a low frequency of pairs breaking, and I'm trying to figure out if there's a way to avoid having this happen.
These libraries have a structure where read1 contains exclusively sample barcoding information, and read2 contains the sequences that will eventually be aligned. Ideally, I would like to preserve every base of read1, but on read2, trim several bases from the 5' of the read, as well as do some kmer and quality trimming on read2.

my solution was to do this in two steps, first doing the 5' trim on just read2, then doing the kmer trim with the skipr1 flag active. (I found out the hard way that the skipr1 flag doesn't skip ftl operations)
ie
Code:
bbduk.sh in=read2.fq.gz ftl=6 out=read2_ftl6.fq.gz
bbduk.sh in1=read1.fq in2=read2_ftl6.fq.gz out=read#_trimmed.fq.gz literal=<adapter sequence> ktrim=r k=23 mink=11 hdist=1 skipr1=TRUE
However, even though there's the same number of reads before and after 5' trimming, and the same number of reads in the final trimmed reads, some of the pairing breaks at a low frequency (I haven't picked out all the broken pairs yet, but the head and tail of the file have the same read names, and the first broken pair is ~2000 reads into the files). Is there a way to run this sort of trimming on only read2 while keeping the pairs properly mated? My other alternative would be to do the trimming on just read2, and then write a custom python script to pull out the correct ordered reads in the read1 fastq.

Thanks
cmbetts is offline   Reply With Quote
Old 06-14-2016, 02:15 PM   #146
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

BBDuk is multithreaded, and will reorder reads unless you explicitly tell it not to via the "ordered" flag. So in this case, you could do this:

Code:
bbduk.sh in=read2.fq.gz ftl=6 out=read2_ftl6.fq.gz ordered minlength=1
bbduk.sh in1=read1.fq in2=read2_ftl6.fq.gz out=read#_trimmed.fq.gz literal=<adapter sequence> ktrim=r k=23 mink=11 hdist=1 skipr1=TRUE
It does not matter whether the reads are re-ordered in the second pass, because pairs will be kept together; but when processing just one of the files alone, it's important to retain the original order, and not discard any reads (which is the reason for the minlength=1 flag; the default is 10).

It's possible to repair the pairing with "repair.sh" but of course it's always most efficient to just avoid the problem in the first place.
Brian Bushnell is offline   Reply With Quote
Old 06-14-2016, 03:41 PM   #147
cmbetts
Senior Member
 
Location: Bay Area

Join Date: Jun 2012
Posts: 121
Default

Thanks!
I thought it was probably something simple I was leaving out. Adding in the "ordered minlength=1" to the first command preserved the pairs as checked with reformat.sh vpair=TRUE
cmbetts is offline   Reply With Quote
Old 09-09-2016, 01:04 PM   #148
Mizzou55
Junior Member
 
Location: st. louis

Join Date: Mar 2010
Posts: 7
Default

Hi, I'm trying to trim fasta reads w/ quals but get this error.

Thanks,

Pat

~pminx/bin/bbmap/bbduk.sh in=Cat_BES_325_500.fasta qfin=Cat_BES_325_500.qual out=Cat_BES_325_500_trim.fa qfout=Cat_BES_325_500_trim.qual ftl=125 ftr=125

error:
java -Djava.library.path=/gscuser/pminx/bin/bbmap/jni/ -ea -Xmx991m -Xms991m -cp /gscuser/pminx/bin/bbmap/current/ jgi.BBDukF in=Cat_BES_325_500.fasta qfin=Cat_BES_325_500.qual out=Cat_BES_325_500_trim.fa qfout=Cat_BES_325_500_trim.qual ftl=125 ftr=125
Executing jgi.BBDukF [in=Cat_BES_325_500.fasta, qfin=Cat_BES_325_500.qual, out=Cat_BES_325_500_trim.fa, qfout=Cat_BES_325_500_trim.qual, ftl=125, ftr=125]
BBDuk version 36.30
Exception in thread "main" java.lang.RuntimeException: Unknown parameter qfout=Cat_BES_325_500_trim.qual
at jgi.BBDukF.<init>(BBDukF.java:421)
at jgi.BBDukF.main(BBDukF.java:66)
Mizzou55 is offline   Reply With Quote
Old 09-09-2016, 01:38 PM   #149
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Hi Pat,

My apologies, I somehow forgot to include support for qual file output with BBDuk! And, reformat does not support ftl/ftr. I will fix BBDuk's support for qual files in the next release. What you can do now is this (and apologies for it being tedious):

reformat.sh in=Cat_BES_325_500.fasta qfin=Cat_BES_325_500.qual out=x.fq
bbduk.sh in=x.fq out=y.fq ftl=125 ftr=125
reformat.sh in=y.fq out=Cat_BES_325_500_trim.fa qfout=Cat_BES_325_500_trim.qual


However, please note that "ftl=125 ftr=125" will trim everything to the left of position 125, and everything to the right of position 125, leaving you with only 1bp reads, which are not very useful. What are you trying to accomplish with this command?
Brian Bushnell is offline   Reply With Quote
Old 09-12-2016, 07:54 PM   #150
Mizzou55
Junior Member
 
Location: st. louis

Join Date: Mar 2010
Posts: 7
Default

Thanks for the quick response. I'll give it a try.
Mizzou55 is offline   Reply With Quote
Old 09-22-2016, 02:23 PM   #151
Mizzou55
Junior Member
 
Location: st. louis

Join Date: Mar 2010
Posts: 7
Default

Hi Brian,

In this case I'm trying to trim vector and low qual from Sanger (BAC End Seq) reads (~850bp) to run through SSPACE. Will your script have problems with this?
Mizzou55 is offline   Reply With Quote
Old 09-22-2016, 04:01 PM   #152
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

No, that will work... but generally I would recommend quality-trimming using "qtrim=r trimq=15" or similar, using the vector sequence as the reference with "ktrim=r" or "kmask=N" for vector trimming. "ref=vector kmask=N k=31 edist=1" would, for example, mask all of the vector sequence with Ns, which you could then remove via subsequent quality-trimming on both ends with "qtrim=rl trimq=2" (Ns have quality 0).

If you just want to remove the first and last 125bp the command would be "ftl=125 ftr2=125".
Brian Bushnell is offline   Reply With Quote
Old 09-24-2016, 07:58 AM   #153
germelcar
Junior Member
 
Location: Mexico

Join Date: Sep 2016
Posts: 1
Default

Hi Brian:

I ran your "repair.sh" script yesterday and it has generated only "singletons.fq" file, the "r1.fq" and "r2.fq" files are empty. The command that I ran was the same as you mention on the page 2:

Quote:
cat SRR867646_1.fastq SRR867646_2.fastq | repair.sh -Xmx4g in=stdin.fq out1=r1.fq out2=r2.fq outs=singletons.fq
I ran it with SRR867646's reads which seems to have beem trimmed previously to be uploaded to SRA archive files, and in fact, I am not able to do an alignment because R1 has 10919266 sequences and R2 has 2177589 sequences.

But... I don't know how to interpret that, why r1.fq and r2.fq files are empty and singletons.fq file not? Should I treat the reads as single-end using the singletons.fq file?

Also, the singletons.fq file contains 13096855 sequences, the same amount that R1+R2 from the original reads.

How should I handle that?
What recommendations do you have?


Thanks in advance.
~g
germelcar is offline   Reply With Quote
Old 09-25-2016, 08:07 AM   #154
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,138
Default

@germelcar's questions has been answered over at Biostars.
GenoMax is offline   Reply With Quote
Old 09-26-2016, 10:27 AM   #155
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Thanks, Genomax!
Brian Bushnell is offline   Reply With Quote
Old 11-02-2016, 09:09 AM   #156
JVGen
Member
 
Location: East Coast

Join Date: Jul 2016
Posts: 39
Default BBDuk - Remove singletons?

Hi Brian,

I'm using BBDuk to adapter and quality trim Illumina reads that were prepped with the Nextera kit. I'm using the Plugin within Geneious, and I have a few questions:

1) I'm getting different a number or reads as outputs for pairs. I tried using removeifeitherbad=t, but still get different number of reads in each file. Would this command normally result in an equal number of reads for each file-pair? If so, I'll inquire with Geneious.

2) When trimming adapters, what is the benefit of selecting only the right or left end? I'm guessing my reads will only have adapter sequence on the 5' end, correct? This assumes that the insert size is longer than the read length, as I think in principal I should have adapters on both ends? Why not just trim both ends to be safe? Does it result in non-specific trimming of sample sequence?

Next, I intend to pair and merge the reads, and then de novo assemble.

Thanks for any help.

Jake
JVGen is offline   Reply With Quote
Old 11-02-2016, 10:18 AM   #157
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Hi Jake,

1) When using paired reads with BBDuk, if the reads are in two files, you must run BBDuk just once using both files as input (using the in1= and in2= flags), rather than on one file at a time. As long as both files are used as input together, pairs will always be kept together.

2) Adapter-trimming should only be done on the right (5') end for fragment libraries. Left-trimming is only for special circumstances like specific long-mate pair protocols and amplicons with custom inline barcodes. For fragment libraries, the original molecule has adapters on both ends, but reading starts just after the adapter so the reads have no adapter sequence on the left end, and only have adapter sequence on the right end if the insert size was shorter than read length. If you left-trim a fragment library after right-trimming it, nothing will happen except that, as you note, you will get occasional random trimming of genomic sequence, though that will be very rare. Also note that BBDuk can't do left and right trimming simultaneously, as based on how it does trimming (when a reference kmer is found, trim that kmer and everything to the left or right) it would trim all bases in the entire read.
Brian Bushnell is offline   Reply With Quote
Old 11-02-2016, 11:05 AM   #158
JVGen
Member
 
Location: East Coast

Join Date: Jul 2016
Posts: 39
Default

Quote:
Originally Posted by Brian Bushnell View Post
Hi Jake,

1) When using paired reads with BBDuk, if the reads are in two files, you must run BBDuk just once using both files as input (using the in1= and in2= flags), rather than on one file at a time. As long as both files are used as input together, pairs will always be kept together.
Thanks Brian. I found that the unequal read number following trimming was due to the " were being generated from read length cut-off. I was specifying a minimum read length of 30, but it was removing the read's mate. It could be an issue with Geneious. I can quality- and adapter- trim and get the same read numbers in each file. I'll just dictate read length at a later step.

Jake
JVGen is offline   Reply With Quote
Old 11-02-2016, 11:19 AM   #159
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Hi Jake,

It's still important to process both files together even if you have no minimum length cutoff, because the output order of BBDuk is not guaranteed to be the same as the input order (unless you add the "ordered" flag). So, I guess, if Geneious is running BBDuk on paired files individually, please add the "ordered" flag, and report that issue to the Geneious developers - it should process them together.
Brian Bushnell is offline   Reply With Quote
Old 11-02-2016, 11:45 AM   #160
JVGen
Member
 
Location: East Coast

Join Date: Jul 2016
Posts: 39
Default

Hi Brian, I'll report the issue to Geneious. I'm using the "keep order" feature in Geneious. I've attached a screenshot. The check boxes and insert fields that Geneious created are nice for those of us that don't code, but only if they actually do what they say :P

Jake

Never mind, forums have some pretty stringent rules on picture attachment dimension...painful.
JVGen is offline   Reply With Quote
Reply

Tags
adapter, bbduk, bbtools, cutadapt, trimmomatic

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 09:28 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2022, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO