#1 |
Junior Member
Location: Sweden Join Date: Jan 2014
Posts: 4
I have received new NextSeq reads from our core facility in a semi-demultiplexed state. The P5 and P7 indices are placed in the sequence ID in an unsorted state. Here is an example of four reads in one of the pair's fastq files:
Code:
@NS500551:36:H5VJNBGXX:1:11101:17033:1044 2:N:0:GGACTCCT+GCGATCTA
GGGAGGTCTATATAAGCAGAGCTGGTACCA............
+
AAAAA.FF<)<.<FFFFFFA<.FFFFFF.F.FFFFA..........
@NS500551:36:H5VJNBGXX:1:11101:2211:1044 2:N:0:TAAGGCGA+TCTACTCT
GGGAGGTCTATATAAGCAGAGCTATAACCTC.......
+
AAA<A.FFF)7.<FFF7.AFFAA)F<AA)FFFFAA.......
@NS500551:36:H5VJNBGXX:1:11101:24462:1044 2:N:0:TCCTGAGC+GCGATCTA
GGGAGGTCTATATAAGCAGAGCTGGTACCAC........
+
<AA.A.FA<.7.FFFF<)FFFFAFF<A<.<FF<FF.....
@NS500551:36:H5VJNBGXX:1:11101:16844:1044 2:N:0:AGGCAGAA+TCTACTCT
GGGAGGTCTATATAAGCAGAGCTATAACTTCG........
+
AAA<A.F.F<A<.FFFF<F)F.FAFFF<FFAFFFFFFFFF......

All help is very much appreciated.
#2 |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,080
Was the facility unwilling to demultiplex these for you? That seems kind of odd (unless you chose not to provide them with index information beforehand).
#3 |
Senior Member
Location: US Join Date: Dec 2010
Posts: 452
The most flexible demultiplexing tool I am aware of is this:
http://comailab.genomecenter.ucdavis...paration_tools
I assume that it should work. Perhaps you have to remove the "+" between the barcodes or modify the script so that it ignores the "+".
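If removing the "+" turns out to be necessary, it could be done with a few lines of Python before running any demultiplexer. This is only a sketch under the assumption that the index pair is the last colon-separated field of the header, as in the example reads above; `strip_plus` and `fix_headers` are hypothetical helper names, not part of any of the tools mentioned in this thread.

```python
def strip_plus(header):
    """Remove the '+' joining the two indices in the last ':'-separated field."""
    prefix, sep, index = header.rpartition(":")
    if not sep:
        # No colon found: leave the header untouched.
        return header
    return prefix + sep + index.replace("+", "")


def fix_headers(lines):
    """Yield FASTQ lines with the '+' removed from header lines only.

    Headers are identified by position (every 4th line), not by a leading
    '@', because quality strings may also start with '@' and the separator
    line is itself a '+'.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        yield strip_plus(line) if i % 4 == 0 else line
```

To use it on a file, open the FASTQ, pass the file handle to `fix_headers`, and write each yielded line followed by a newline to a new file.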
#4 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
BBTools has a program, "demuxbyname", which will do this. Usage:
demuxbyname.sh in=r#.fq out=out_%_#.fq prefixmode=f names=GGACTCCT+GCGATCTA,TAAGGCGA+TCTACTCT,...

"Names" can also be a text file with one barcode per line (in exactly the format found in the read header). You do have to include all of the expected barcodes, though.
In the output filename, the "%" symbol gets replaced by the barcode; in both the input and output names, the "#" symbol gets replaced by 1 or 2 for read 1 or read 2. It's optional, though; you can leave it out for interleaved input/output, or specify in1=/in2=/out1=/out2= if you want custom naming.
Oh, and it's extremely fast.
Last edited by Brian Bushnell; 04-07-2015 at 03:37 PM.
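For anyone curious what demultiplexing by header suffix amounts to, here is a rough Python sketch of the idea. It is not how demuxbyname is implemented, and `read_fastq` and `demux_exact` are made-up names for illustration. It assumes four-line FASTQ records with the barcode as the last colon-separated field of the header, and it bins reads whose barcode is not in the expected set under `None`.

```python
from collections import defaultdict
from itertools import islice


def read_fastq(handle):
    """Yield FASTQ records as lists of four stripped lines."""
    handle = iter(handle)  # so plain lists work as well as open files
    while True:
        record = list(islice(handle, 4))
        if len(record) < 4:
            return
        yield [line.rstrip("\n") for line in record]


def demux_exact(records, expected):
    """Group records by exact barcode match; unexpected barcodes go to None."""
    bins = defaultdict(list)
    for rec in records:
        # Barcode is assumed to follow the last ':' in the header line.
        barcode = rec[0].rsplit(":", 1)[-1]
        bins[barcode if barcode in expected else None].append(rec)
    return bins
```

This keeps everything in memory, so it is only suitable for small test files; a real run would write each record to a per-barcode output file as it is read.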
#5 |
Junior Member
Location: Sweden Join Date: Jan 2014
Posts: 4
Thank you all for your help! The "demuxbyname.sh" approach works well and is fast.
Could I ask you two more things, Brian?
First, I cannot find an option in this script for saving the unmatched reads to a separate file. Is there such an option? I would like this because the reads are from NextSeq v1 chemistry, so a significant number of reads have mismatches in the indices.
Second, I cannot find any option for allowing 1 or 2 mismatches (the Illumina demultiplexer normally allows 1 mismatch per index, i.e., a total of two mismatches).
As for why I need to do this: the core facility can and will demultiplex the files for me. They skipped it out of misdirected kindness following a miscommunication. I just need the data really soon, and they need some time to do it.
Thank you again!
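To make the matching rule concrete (at most one mismatch per index), here is a small Python sketch. It is only an illustration of the rule as described in this post, not the behavior of any existing tool; `hamming` and `assign` are hypothetical names, and rejecting reads that fit more than one expected barcode is an extra safety assumption.

```python
def hamming(a, b):
    """Count mismatched positions; unequal lengths count as fully mismatched."""
    if len(a) != len(b):
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def assign(barcode, expected, max_mismatch=1):
    """Return the single expected barcode within max_mismatch per index, else None."""
    parts = barcode.split("+")
    hits = []
    for cand in expected:
        cand_parts = cand.split("+")
        if len(cand_parts) != len(parts):
            continue
        # Each index (P7 and P5) is checked separately against its counterpart.
        if all(hamming(p, c) <= max_mismatch for p, c in zip(parts, cand_parts)):
            hits.append(cand)
    # Ambiguous reads (more than one candidate) are rejected rather than guessed.
    return hits[0] if len(hits) == 1 else None
```

With the barcodes from the example reads, `assign("GGACTCCA+GCGATCTA", ["GGACTCCT+GCGATCTA", "TAAGGCGA+TCTACTCT"])` returns `"GGACTCCT+GCGATCTA"`, since only the first index differs by one base.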
#6 |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,080
Brian's programs share common options so you may want to try adding "outu=file_name" to your command to see if the unmatched reads are captured there.
#7 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
Hmmm, actually it doesn't have an "outu" flag right now; I'll add that for the next release.
We strictly throw away all reads with imperfect barcodes to minimize the risk of cross-contamination. But, I could add an option to the program to allow mismatches, I suppose; I might as well.
This will require a lot of memory, but if you want to capture all of the reads that did not have matching barcodes right now, you can do so like this:

1) Concatenate all of the output files that did have correct barcodes into a single file:
cat out_*_1.fq > combined.fq

2) Run filterbyname.sh:
filterbyname.sh in=r#.fq out=nonmatching#.fq names=combined.fq include=f
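Conceptually, the two steps above collect the names of all reads that were assigned to a barcode and then keep every read whose name is not in that collection. A minimal Python sketch of that logic follows; it is not the filterbyname implementation, and `read_name` and `nonmatching` are hypothetical names. It assumes the read name is the header text between the leading "@" and the first whitespace.

```python
def read_name(header):
    """Return the read name: header without the leading '@', up to the first space."""
    return header[1:].split()[0]


def nonmatching(all_headers, matched_headers):
    """Return headers from all_headers whose read name is absent from matched_headers."""
    # Every matched name is held in a set, so memory grows with the number
    # of matched reads in this sketch.
    matched = {read_name(h) for h in matched_headers}
    return [h for h in all_headers if read_name(h) not in matched]
```

Comparing on the name alone (without the "1:N:0:..." or "2:N:0:..." part) means the names gathered from read 1 can be used to filter both files of a pair in this sketch.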
#8 |
Super Moderator
Location: Walnut Creek, CA Join Date: Jan 2014
Posts: 2,707
Demuxbyname now supports an "outu" flag. Does not support substitutions yet, though.