SEQanswers

Old 04-07-2015, 11:35 AM   #1
TompaB
Junior Member
 
Location: Sweden

Join Date: Jan 2014
Posts: 4
Efficient way to split FASTQ files based on Illumina indexes in the ID

I have received new NextSeq reads from our core facility in a semi-demultiplexed state: the P5 and P7 indices are recorded in each read's sequence ID, but the reads are not sorted by index. Here is an example of four reads from one of the paired FASTQ files:
Code:
@NS500551:36:H5VJNBGXX:1:11101:17033:1044 2:N:0:GGACTCCT+GCGATCTA
GGGAGGTCTATATAAGCAGAGCTGGTACCA............
+
AAAAA.FF<)<.<FFFFFFA<.FFFFFF.F.FFFFA..........
@NS500551:36:H5VJNBGXX:1:11101:2211:1044 2:N:0:TAAGGCGA+TCTACTCT
GGGAGGTCTATATAAGCAGAGCTATAACCTC.......
+
AAA<A.FFF)7.<FFF7.AFFAA)F<AA)FFFFAA.......
@NS500551:36:H5VJNBGXX:1:11101:24462:1044 2:N:0:TCCTGAGC+GCGATCTA
GGGAGGTCTATATAAGCAGAGCTGGTACCAC........
+
<AA.A.FA<.7.FFFF<)FFFFAFF<A<.<FF<FF.....
@NS500551:36:H5VJNBGXX:1:11101:16844:1044 2:N:0:AGGCAGAA+TCTACTCT
GGGAGGTCTATATAAGCAGAGCTATAACTTCG........
+
AAA<A.F.F<A<.FFFF<F)F.FAFFF<FFAFFFFFFFFF......
I would like to split the reads into separate FASTQ files based on the indices, but I cannot find a suitable tool for this. It also needs to be reasonably fast, as this sequencing run has 400 million reads.

Any help is much appreciated.
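To make the task concrete, here is a minimal Python sketch of the split being asked for. It assumes the index pair is everything after the last ":" in the header line, as in the example reads above. The function names, the `expected` set, and the `unmatched.fq` output name are hypothetical, and pure Python like this is unlikely to be fast enough for 400 million reads; it only illustrates the logic.

```python
import os
from typing import Dict, Iterator, Set, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (header, sequence, separator, quality) for each 4-line record."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline()
        sep = handle.readline()
        qual = handle.readline()
        yield (header.rstrip("\n"), seq.rstrip("\n"),
               sep.rstrip("\n"), qual.rstrip("\n"))


def index_pair(header: str) -> str:
    """Return the text after the last ':' (assumed to be the index pair)."""
    return header.rsplit(":", 1)[1]


def demultiplex(fastq_path: str, out_dir: str, expected: Set[str]) -> Dict[str, int]:
    """Write one file per expected index pair; everything else goes to unmatched.fq."""
    os.makedirs(out_dir, exist_ok=True)
    outputs: Dict[str, TextIO] = {}
    counts: Dict[str, int] = {}
    try:
        with open(fastq_path) as handle:
            for record in read_fastq(handle):
                barcode = index_pair(record[0])
                # Only exact matches to an expected pair get their own file.
                key = barcode if barcode in expected else "unmatched"
                if key not in outputs:
                    outputs[key] = open(os.path.join(out_dir, key + ".fq"), "w")
                    counts[key] = 0
                outputs[key].write("\n".join(record) + "\n")
                counts[key] += 1
    finally:
        for out in outputs.values():
            out.close()
    return counts
```

Restricting the per-barcode files to an expected list keeps the number of open files bounded even when sequencing errors produce many distinct index strings.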
Old 04-07-2015, 12:42 PM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,814

Was the facility unwilling to demultiplex these for you? That seems kind of odd (unless you chose not to provide them with index information beforehand).
Old 04-07-2015, 01:49 PM   #3
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 344

The most flexible demultiplexing tool I am aware of is this:
http://comailab.genomecenter.ucdavis...paration_tools

I assume that it should work. Perhaps you have to remove the "+" between the barcodes or modify the script so that it ignores the "+".
Old 04-07-2015, 03:34 PM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

BBTools has a program, "demuxbyname", which will do this. Usage:

demuxbyname.sh in=r#.fq out=out_%_#.fq prefixmode=f names=GGACTCCT+GCGATCTA,TAAGGCGA+TCTACTCT,...

"Names" can also be a text file with one barcode per line (in exactly the format found in the read header). You do have to include all of the expected barcodes, though.

In the output filename, the "%" symbol gets replaced by the barcode; in both the input and output names, the "#" symbol gets replaced by 1 or 2 for read 1 or read 2. It's optional, though; you can leave it out for interleaved input/output, or specify in1=/in2=/out1=/out2= if you want custom naming.
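Since the names list must contain every expected barcode in exactly the format found in the read header, it can help to tally what is actually present before writing that list. Below is a rough Python sketch of such a tally; it is not part of BBTools. The function names and the `min_reads` threshold are hypothetical, and the sketch assumes strict 4-line FASTQ records with the barcode after the last ":" of the header.

```python
from collections import Counter
from typing import Iterable, List


def count_barcodes(lines: Iterable[str]) -> Counter:
    """Count the barcode string found at the end of every header line."""
    counts: Counter = Counter()
    for i, line in enumerate(lines):
        if i % 4 == 0:  # header line of each 4-line record
            counts[line.rstrip("\n").rsplit(":", 1)[1]] += 1
    return counts


def top_barcodes(counts: Counter, min_reads: int) -> List[str]:
    """Return barcodes seen at least min_reads times, most frequent first."""
    return [bc for bc, n in counts.most_common() if n >= min_reads]
```

Writing the returned list to a text file, one barcode per line, gives a file in the shape described above for the names parameter. Barcodes that appear only a handful of times are more likely sequencing errors than real samples, so they are best checked against the sample sheet rather than added blindly.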

Oh, and it's extremely fast.

Last edited by Brian Bushnell; 04-07-2015 at 03:37 PM.
Old 04-08-2015, 11:38 AM   #5
TompaB
Junior Member
 
Location: Sweden

Join Date: Jan 2014
Posts: 4
bbmap is the solution!

Thank you all for your help! The "demuxbyname.sh" approach works well and is fast.

Could I ask you two more things, Brian? First, I cannot find an option in this script to save the unmatched reads to a separate file. Is there one? I would like this because the reads are from NextSeq v1 chemistry, so a significant number of reads have mismatches in the indices. Second, I cannot find an option to allow 1 or 2 mismatches (the Illumina demultiplexer normally allows 1 mismatch per index, i.e., a total of two mismatches).

As for the other question, why I need to do this: the core facility can and will demultiplex the files for me. They held off out of misdirected kindness after a miscommunication. I simply need the data very soon, and they need some time to do it.
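To illustrate what "1 mismatch per index" means in practice, here is a small Python sketch of mismatch-tolerant assignment. It is not a feature of demuxbyname and not Illumina's implementation; the function names and the `max_mismatch` parameter are hypothetical. It assumes barcodes are written as two indices joined by "+", as in the headers above, and it refuses to assign a read that is within tolerance of more than one expected pair.

```python
from typing import Iterable, Optional


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("indices must have the same length")
    return sum(x != y for x, y in zip(a, b))


def assign(observed: str, expected: Iterable[str], max_mismatch: int = 1) -> Optional[str]:
    """Return the single expected pair within max_mismatch per index, else None."""
    obs_first, obs_second = observed.split("+")
    hits = []
    for pair in expected:
        exp_first, exp_second = pair.split("+")
        if len(exp_first) != len(obs_first) or len(exp_second) != len(obs_second):
            continue  # different index lengths cannot be compared position by position
        if (hamming(obs_first, exp_first) <= max_mismatch
                and hamming(obs_second, exp_second) <= max_mismatch):
            hits.append(pair)
    # Ambiguous reads (several candidates) are treated as unmatched.
    return hits[0] if len(hits) == 1 else None
```

Tolerating mismatches is only safe when every pair of expected indices differs at more positions than twice the allowed mismatches; otherwise one observed index can sit within tolerance of two samples.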

Thank you again!
Old 04-08-2015, 11:42 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,814

Brian's programs share common options so you may want to try adding "outu=file_name" to your command to see if the unmatched reads are captured there.
Old 04-08-2015, 01:39 PM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Hmmm, actually it doesn't have an "outu" flag right now; I'll add that for the next release.

We strictly throw away all reads with imperfect barcodes to minimize the risk of cross-contamination. But, I could add an option to the program to allow mismatches, I suppose; I might as well.

This will require a lot of memory, but if you want to capture all of the reads that did not have matching barcodes right now, you can do so like this:

1) Concatenate all of the output files that did have correct barcodes into a single file:
cat out_*_1.fq > combined.fq

2) Run filterbyname.sh:
filterbyname.sh in=r#.fq out=nonmatching#.fq names=combined.fq include=f
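Conceptually, the two steps amount to a set difference on read names: collect the names of every read that was matched, then keep only the reads whose names are not in that set. The Python sketch below shows that idea and is not how filterbyname.sh is implemented. The function names are hypothetical, and comparing only the part of the header before the first space is an assumption made here so that read 1 names also identify their read 2 mates.

```python
from typing import Iterable, Iterator, List, Set


def read_ids(lines: Iterable[str]) -> Set[str]:
    """Collect the read ID (header text before the first space) of every record."""
    return {line.split()[0] for i, line in enumerate(lines) if i % 4 == 0}


def exclude_named(lines: Iterable[str], ids: Set[str]) -> Iterator[List[str]]:
    """Yield 4-line records whose read ID is NOT in ids."""
    record: List[str] = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            if record[0].split()[0] not in ids:
                yield record
            record = []
```

Holding every matched read ID in a set is what makes this approach memory-hungry on a run with hundreds of millions of reads.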
Old 04-09-2015, 05:01 PM   #8
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Demuxbyname now supports an "outu" flag. Does not support substitutions yet, though.