SEQanswers

Old 11-16-2017, 05:25 PM   #1
lamon
Junior Member
 
Location: Seattle, WA

Join Date: Aug 2011
Posts: 9
bbduk.sh barcode filter

Does the barcode filter in bbduk.sh (v37.66) only allow perfect matches or are mismatches allowed?
Thanks,
Lynn
Old 11-17-2017, 04:26 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,579

Which barcode filter are you referring to? In general, the "hdist=N" parameter lets you allow mismatches (hdist=N) or disallow them (hdist=0).
Old 11-17-2017, 12:48 PM   #3
lamon
Junior Member
 
Location: Seattle, WA

Join Date: Aug 2011
Posts: 9

I'm referring to these parameters:
barcodefilter=t barcodes=TCTCGCGC
As far as I can see from my results right now, only reads with exactly this sequence in the header are retained. I think this may be too stringent. Does the 'hdist' parameter affect the barcode given how short it is?
Old 11-17-2017, 12:53 PM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,579

I see. I have not personally used this feature, since most barcode work is done at the bcl2fastq stage, where you can allow for errors in the barcode sequence.

What exactly are you trying to do? Eliminate reads with certain barcodes? I don't think the hdist= parameter applies to the barcodes. It is for errors in the main read. You may have to look for an alternate way to do this. "demuxbyname.sh" may be a better option. Take a look at that.
Old 11-17-2017, 01:37 PM   #5
lamon
Junior Member
 
Location: Seattle, WA

Join Date: Aug 2011
Posts: 9

Thanks for the reply. I have a set of large paired-end fastq files that were preprocessed by a sequencing core. I suspect they were never demultiplexed, because the files are quite large and the sample may have had its own lane or flowcell. Most of the barcodes in the headers are TCTCGCGC, or that string with one or two mismatches, but some are wildly different. I want to strip out the wildly different ones without stripping what are probably legitimate barcodes with 1 or 2 mismatches. When I tried demuxbyname.sh, it started writing out 2x80,000 files, two for every barcode variant present.
Old 11-17-2017, 01:53 PM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,579

Use the code I have in this post to enumerate all the different barcodes present in your file. That should give you an idea of the complexity of the problem. Then choose the ones you want (that actually should belong to your samples since you made them) to demux and use only those with demuxbyname.sh.
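
For readers without access to the linked post, here is a rough sketch of one way to enumerate barcodes. It is not the code from that post. It assumes Illumina-style headers where the barcode is whatever follows the last ':' on the header line (for example "1:N:0:TCTCGCGC"); the function names are made up for illustration.

```python
from collections import Counter
from itertools import islice


def barcode_from_header(header):
    """Return the text after the last ':' in a FASTQ header line.

    Assumption: the barcode is the final colon-separated field.
    """
    return header.rstrip().rsplit(":", 1)[-1]


def count_barcodes(fastq_lines):
    """Tally barcodes from an iterable of FASTQ lines (4 lines per record)."""
    # Every 4th line, starting with the first, is a header line.
    headers = islice(fastq_lines, 0, None, 4)
    return Counter(barcode_from_header(h) for h in headers)


if __name__ == "__main__":
    # Tiny in-memory example; for a real file, pass an open file handle.
    example = [
        "@read1 1:N:0:TCTCGCGC\n", "ACGT\n", "+\n", "IIII\n",
        "@read2 1:N:0:TCTCGCGA\n", "ACGT\n", "+\n", "IIII\n",
        "@read3 1:N:0:TCTCGCGC\n", "ACGT\n", "+\n", "IIII\n",
    ]
    for barcode, n in count_barcodes(example).most_common():
        print(barcode, n)
```

For a gzipped file, something like `count_barcodes(gzip.open(path, "rt"))` should work the same way, since the function only needs an iterable of text lines.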
Old 11-17-2017, 01:59 PM   #7
lamon
Junior Member
 
Location: Seattle, WA

Join Date: Aug 2011
Posts: 9

I have already looked at all the barcodes. I've got 76,959 different barcodes that are not exact matches, accounting for about 48 million reads. If I allow 1 mismatch, I could recover 16 million of those. If I allow 2 mismatches, I could recover 24 million.
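
A tally like this can be produced from a table of barcode counts. The sketch below is only an illustration, not the method actually used here: it assumes you already have a mapping of barcode to read count, and it counts mismatches as Hamming distance (position-by-position differences), with any length difference counted as extra mismatches. The counts in the example are invented.

```python
from collections import Counter


def hamming(a, b):
    """Count positions where a and b differ; length differences count too."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def reads_by_distance(barcode_counts, expected):
    """Sum read counts per number of mismatches against the expected barcode."""
    totals = Counter()
    for barcode, n in barcode_counts.items():
        totals[hamming(barcode, expected)] += n
    return totals


if __name__ == "__main__":
    # Invented counts, for illustration only.
    counts = {"TCTCGCGC": 10, "TCTCGCGA": 3, "TCTCGAGA": 2, "AAAAAAAA": 5}
    totals = reads_by_distance(counts, "TCTCGCGC")
    for distance in sorted(totals):
        print(distance, "mismatches:", totals[distance], "reads")
```

Summing the totals for distances 1 and 2 gives the number of reads that would be recovered by allowing up to two mismatches.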
Old 11-17-2017, 08:37 PM   #8
lamon
Junior Member
 
Location: Seattle, WA

Join Date: Aug 2011
Posts: 9

I wrote a perl script using the fuzzy match module (Text::Fuzzy) to pull all the entries with no more than two mismatches in the barcode. It's not much, but I can send it to anyone who is interested.
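
The Perl script itself is not shown in the thread. As a rough stand-in, here is a minimal Python sketch of the same idea under a few assumptions: the barcode is the text after the last ':' in each header, mismatches are counted as Hamming distance (Text::Fuzzy computes an edit distance, which also allows insertions and deletions), and the function names are hypothetical.

```python
from itertools import islice


def mismatches(a, b):
    """Hamming-style mismatch count; length differences count as mismatches."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def fastq_records(lines):
    """Yield FASTQ records as lists of 4 lines; a trailing partial record is dropped."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            return
        yield record


def filter_by_barcode(lines, expected, max_mismatches=2):
    """Yield only records whose header barcode is close enough to `expected`."""
    for record in fastq_records(lines):
        # Assumption: the barcode is the last colon-separated header field.
        barcode = record[0].rstrip().rsplit(":", 1)[-1]
        if mismatches(barcode, expected) <= max_mismatches:
            yield record


if __name__ == "__main__":
    example = [
        "@read1 1:N:0:TCTCGCGC\n", "ACGT\n", "+\n", "IIII\n",
        "@read2 1:N:0:TCTCGAGA\n", "ACGT\n", "+\n", "IIII\n",
        "@read3 1:N:0:AAAAAAAA\n", "ACGT\n", "+\n", "IIII\n",
    ]
    for record in filter_by_barcode(example, "TCTCGCGC"):
        print("".join(record), end="")
```

For paired-end data, both files would need the same records kept so the pairs stay in sync. Running the filter on R1 and R2 separately only works if both headers carry the same barcode.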
Old 11-18-2017, 04:51 AM   #9
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,579

In future just ask the sequence provider to re-do the demultiplexing with bcl2fastq. You are paying them for it anyway :-)