Thread | Thread Starter | Forum | Replies | Last Post |
Trimmomatic vs bbduk.sh | mslider | Bioinformatics | 1 | 04-18-2017 11:10 AM |
Adapter trimming with BBduk | PeatMaster | Illumina/Solexa | 4 | 04-01-2016 12:18 PM |
bbduk for mirna mapping | danova | Bioinformatics | 1 | 02-03-2016 08:01 PM |
bbduk and kmer masking | cmccabe | Bioinformatics | 2 | 10-30-2015 11:16 AM |
what are the output files (barcode name: no barcode) after running sequencing? | super0925 | Ion Torrent | 2 | 09-02-2014 03:24 AM |
#1
Junior Member
Location: Seattle, WA Join Date: Aug 2011
Posts: 9
Does the barcode filter in bbduk.sh (v37.66) only allow perfect matches or are mismatches allowed?
Thanks, Lynn
#2
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,088
Which barcode filter are you referring to? Generally, the "hdist=N" parameter lets you allow mismatches (N > 0) or disallow them (hdist=0).
#3
Junior Member
Location: Seattle, WA Join Date: Aug 2011
Posts: 9
I'm referring to these parameters:
barcodefilter=t barcodes=TCTCGCGC
As far as I can see from my results right now, only reads with exactly this sequence in the header are retained. I think this may be too stringent. Does the 'hdist' parameter affect the barcode, given how short it is?
#4
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,088
I see. I have not personally used this feature, since most barcode work is done at the bcl2fastq stage, where you can allow for errors in the barcode sequence.
What exactly are you trying to do? Eliminate reads with some barcodes? I don't think the hdist= parameter applies to the barcodes; it is for errors in the main read. You may have to look for an alternate way to do this. "demuxbyname.sh" may be a better option, so take a look at that.
#5
Junior Member
Location: Seattle, WA Join Date: Aug 2011
Posts: 9
Thanks for the reply. I have a set of large paired-end fastq files that were preprocessed by a sequencing core, and I suspect that they were never demultiplexed because the file is quite large and may have had its own lane or flowcell. The headers contain barcodes that are mostly TCTCGCGC, or that string with one or two mismatches, but there are also barcodes that are wildly different. I want to strip those out without stripping what are probably legit barcodes with 1 or 2 mismatches. When I tried demuxbyname.sh, it started writing out 2x80,000 files, two for every barcode variant present.
#6
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,088
Use the code I have in this post to enumerate all the different barcodes present in your file. That should give you an idea of the complexity of the problem. Then choose the barcodes that actually belong to your samples (you made the libraries, so you should know them) and demux only those with demuxbyname.sh.
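
For anyone following along without the linked code, here is a minimal Python sketch of the enumeration step. It is not the code from that post. It assumes Illumina-style headers where the barcode is the text after the last ':' on each header line, and that every record is exactly four lines; the function names `barcode_of` and `count_barcodes` are made up for this example.

```python
from collections import Counter
from typing import Iterable


def barcode_of(header: str) -> str:
    """Return the text after the last ':' in a FASTQ header line.

    Assumption: the barcode is the final colon-separated field,
    as in '@... 1:N:0:TCTCGCGC'.
    """
    return header.rstrip().rsplit(":", 1)[-1]


def count_barcodes(lines: Iterable[str]) -> Counter:
    """Tally how many reads carry each barcode string.

    Assumption: strict 4-line FASTQ records, so the header is
    every fourth line starting with the first.
    """
    counts = Counter()
    for line_number, line in enumerate(lines):
        if line_number % 4 == 0:
            counts[barcode_of(line)] += 1
    return counts
```

Passing an open file handle (or a `gzip.open(path, "rt")` handle) as `lines` and then calling `.most_common()` on the result should list the barcodes from most to least frequent.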
#7
Junior Member
Location: Seattle, WA Join Date: Aug 2011
Posts: 9
I have already looked at all the barcodes. I've got 76,959 different barcodes that are not exact matches, accounting for about 48 million reads. If I allow 1 mismatch, I could recover 16 million of those. If I allow 2 mismatches, I could recover 24 million.
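
A breakdown like this can be sketched in Python as below. This is only an illustration of the idea, not how the numbers above were produced. It assumes you already have a mapping of barcode string to read count, and it counts mismatches as Hamming distance (substitutions only); `hamming` and `reads_by_distance` are hypothetical helper names.

```python
from collections import Counter
from typing import Mapping


def hamming(a: str, b: str) -> int:
    """Count positions where two barcodes differ.

    Assumption: barcodes of different lengths are treated as
    completely different rather than aligned.
    """
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def reads_by_distance(barcode_counts: Mapping[str, int], expected: str) -> Counter:
    """Sum read counts per mismatch distance from the expected barcode."""
    by_distance = Counter()
    for barcode, n_reads in barcode_counts.items():
        by_distance[hamming(barcode, expected)] += n_reads
    return by_distance
```

Reading off the totals at distance 1 and distance 2 shows how many reads each extra allowed mismatch would bring back.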
#8
Junior Member
Location: Seattle, WA Join Date: Aug 2011
Posts: 9
I wrote a Perl script using the fuzzy match module (Text::Fuzzy) to pull all the entries with no more than two mismatches in the barcode. It's not much, but I can send it to anyone who is interested.
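
The Perl script itself is not shown in this thread, so the following is a rough Python sketch of the same idea, not a port of it. It uses plain Hamming distance (substitutions only), which may not match what Text::Fuzzy computes. It assumes 4-line FASTQ records, R1 and R2 files in the same read order, and the barcode as the last ':'-separated field of the R1 header; `hamming`, `records` and `filter_pairs` are names invented for this example.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def hamming(a: str, b: str) -> int:
    """Count differing positions; unequal lengths count as fully different."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group lines into 4-line FASTQ records, dropping a trailing partial one."""
    it = iter(lines)
    while True:
        rec = list(islice(it, 4))
        if len(rec) < 4:
            return
        yield rec


def filter_pairs(
    r1_lines: Iterable[str],
    r2_lines: Iterable[str],
    expected: str,
    max_mismatches: int = 2,
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (R1, R2) record pairs whose R1 header barcode is close enough."""
    for rec1, rec2 in zip(records(r1_lines), records(r2_lines)):
        barcode = rec1[0].rstrip().rsplit(":", 1)[-1]
        if hamming(barcode, expected) <= max_mismatches:
            yield rec1, rec2
```

Each yielded pair can be written straight to two output files with `out1.writelines(rec1)` and `out2.writelines(rec2)`, which keeps the mates in sync.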
#9
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,088
In the future, just ask the sequence provider to redo the demultiplexing with bcl2fastq. You are paying them for it anyway :-)