SEQanswers

06-25-2014, 07:30 AM   #1
a.cardilini
Junior Member
 
Location: Melbourne, VIc, Australia

Join Date: Mar 2012
Posts: 4
Splitting fastq file by barcodes without producing unmatched.fq file?

G'day Everyone,

I am trying to split six plates of data by the barcodes of 576 individuals. We are splitting our data using 'fastx_barcode_splitter.pl', but unfortunately this tool can't work with a barcodes file where the barcodes have different lengths. To get around this we decided to feed in one barcode at a time, which works fine.

Our problem is that during this process 'fastx_barcode_splitter.pl' also writes out an unmatched.fq file, which is about 50X larger than the file we are interested in and takes up a lot of the processing time. It took longer than an hour to split out 1 individual; multiplied by 576, that means it will take way too long to run.

Is there a way to stop 'fastx_barcode_splitter.pl' producing an unmatched.fq file? I think this would save us a lot of unnecessary read and write time.

Thanks for your help in advance.

Cheers,
Adam

06-25-2014, 08:33 AM   #2
wolma
Member
 
Location: Germany

Join Date: May 2014
Posts: 23

I don't think there's a command line switch for it, but this is just a perl script so it shouldn't be too hard to modify it to do what you want.
I haven't tested this, but in the fastx_barcode_splitter.pl file there is this function:

Code:
sub match_sequences {

	# .. lots of truncated lines ..
	# .. but ending in ..

		$best_barcode_ident = 'unmatched' 
			if ( (!defined $best_barcode_ident) || $best_barcode_mismatches_count>$allowed_mismatches) ;

		print STDERR "sequence $seq_bases matched barcode: $best_barcode_ident\n" if $debug;

		$counts{$best_barcode_ident}++;

		#get the file associated with the matched barcode.
		#(note: there's also a file associated with 'unmatched' barcode)
		my $file = $files{$best_barcode_ident};

		write_record($file);
	}
}
I think if you just enclose the write_record($file); in an if clause like this:

Code:
if ($best_barcode_ident ne 'unmatched') {
    write_record($file);
}
it should help. The unmatched output file will still be generated, but nothing should be written into it.

As I said untested, but I hope it helps,
Wolfgang

06-25-2014, 07:23 PM   #3
a.cardilini
Junior Member
 
Location: Melbourne, VIc, Australia

Join Date: Mar 2012
Posts: 4

Thanks Wolfgang,

that works great! I no longer get the unmatched.fq file printed out.

Unfortunately, it is still pretty slow because it is processing these reads. Do you think it is possible to skip the processing of unmatched reads, or is this likely to cause problems with running the script? This python script is largely illegible to me so I am not sure how intertwined the unmatched stuff is with the match stuff.

Thanks again for your help, I really appreciate it.

All the best,
Adam

06-25-2014, 11:38 PM   #4
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Adam,

That's perl, not python... and if you want it to run faster, you might need to write or find a version written in a compiled language like C or Java, rather than an interpreted language like perl and python. It depends on whether it is CPU or I/O limited; run "top" and see if the cpu load is 100% while running. If it is, then you're cpu-limited and probably need a different language to speed it up.
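
If watching "top" by eye is awkward, a rough alternative is to compare the CPU time a run uses with its wall-clock time. Below is a minimal Perl sketch of that idea, not anything shipped with fastx; the function name cpu_fraction and the script name in the usage line are made up for illustration.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Run a command and return (wall_seconds, cpu_seconds, cpu_fraction).
# A cpu_fraction near 1 suggests the run is CPU-limited; a value well
# below 1 suggests it mostly waits, e.g. on disk reads and writes.
sub cpu_fraction {
    my @cmd = @_;

    my $wall_start = time;
    # times() returns (user, system, child_user, child_system).
    my (undef, undef, $cuser_start, $csys_start) = times;

    system(@cmd) == 0 or warn "command exited with status $?\n";

    my (undef, undef, $cuser_end, $csys_end) = times;
    my $wall = time - $wall_start;
    my $cpu  = ($cuser_end - $cuser_start) + ($csys_end - $csys_start);

    return ($wall, $cpu, $wall > 0 ? $cpu / $wall : 0);
}

# Usage (hypothetical script name): perl cpu_fraction.pl <command> [args...]
if (@ARGV) {
    my ($wall, $cpu, $fraction) = cpu_fraction(@ARGV);
    printf "wall %.2fs, cpu %.2fs, cpu/wall %.2f\n", $wall, $cpu, $fraction;
}
```

Because the splitter reads from standard input, the command would probably need wrapping, e.g. sh -c 'cat reads.fq | fastx_barcode_splitter.pl ...' passed as the arguments. With several processes in the pipeline the ratio can exceed 1, so treat it as a rough indicator only.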

And you probably can't speed it up by skipping reads. You have to process a read (or bar code) at least once in order to determine whether or not it matches one of your bins!

If it runs fast enough when you run it once, rather than 576 times, you may be able to trick it by padding your short barcodes with extra characters, so that all are the same length.

06-26-2014, 01:34 AM   #5
wolma
Member
 
Location: Germany

Join Date: May 2014
Posts: 23

Adam,
there is no way to skip processing of the reads. As Brian correctly points out, you need to look at each read to see whether it matches. My suggestion should save some time by minimizing disk write access, but that's all you can do.

Another option would be to in fact write the unmatched reads, then use only this file as input in the next round. With such a subtraction approach, the unmatched reads file would become smaller at every step, so even though your first round may take very long, subsequent steps would run faster.
Depending on how similar your barcodes are this would also eliminate the risk of accidentally assigning the same read to two different barcodes.
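
The subtraction approach could be planned roughly as follows. This is only a sketch that prints the commands instead of running them. It assumes the unmodified script (so unmatched reads are still written), barcodes at the start of the read (--bol), and one barcode file per round. The function name plan_rounds and all file names are made up for illustration.

```perl
use strict;
use warnings;

# Build one shell command per round. Round 1 reads the original FASTQ;
# every later round reads the unmatched file written by the round before.
sub plan_rounds {
    my ($input, $prefix, @bcfiles) = @_;
    my @commands;
    my $round = 0;
    for my $bcfile (@bcfiles) {
        $round++;
        # A per-round prefix keeps each round's unmatched file separate.
        my $round_prefix = "${prefix}round${round}_";
        push @commands,
            "cat $input | fastx_barcode_splitter.pl --bcfile $bcfile"
          . " --bol --prefix $round_prefix --suffix .fq";
        # The splitter names its leftover file <prefix>unmatched<suffix>.
        $input = "${round_prefix}unmatched.fq";
    }
    return @commands;
}

# Hypothetical file names, for illustration only.
print "$_\n"
    for plan_rounds('plate1.fq', 'split/', 'bc_001.txt', 'bc_002.txt', 'bc_003.txt');
```

File names are pasted into the command unquoted, so names containing spaces would need quoting before these commands are run for real.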

06-26-2014, 03:03 PM   #6
gringer
David Eccles (gringer)
 
Location: Wellington, New Zealand

Join Date: May 2011
Posts: 836

Or modify the algorithm so that it works with barcodes of different lengths. If you're already changing the code, you might as well make that little fix as well.
/usr/bin/fastx_barcode_splitter.pl
edit: or not so little.... If you can give me a bit more information about how your barcoding system works (e.g. do you have a separate barcode file and sequence file? do barcodes always appear in the first 10 bases? Do long barcodes always start at the same place in the sequence?), I might be able to crank out something that works.
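
To make the idea concrete while those questions are open, here is a minimal standalone sketch, not a patch to fastx_barcode_splitter.pl. It guesses at the answers: barcodes sit at the very start of each read, must match exactly (no mismatches allowed), the barcode file has "identifier barcode" on each line, and the FASTQ has four lines per record. All function names are made up for illustration.

```perl
use strict;
use warnings;

# Load "identifier<whitespace>barcode" lines into a barcode => identifier hash.
sub load_barcodes {
    my ($path) = @_;
    my %ident_for;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    while (my $line = <$fh>) {
        next if $line =~ /^\s*(#|$)/;      # skip comments and blank lines
        my ($ident, $barcode) = split ' ', $line;
        die "Bad barcode line $. in $path\n" unless defined $barcode;
        $ident_for{ uc $barcode } = $ident;
    }
    close $fh;
    return \%ident_for;
}

# Return the identifier whose barcode is an exact prefix of the read,
# trying longer barcodes first; returns undef if none matches.
sub assign_read {
    my ($seq, $ident_for, $lengths) = @_;
    for my $len (@$lengths) {
        next if $len > length $seq;
        my $ident = $ident_for->{ uc substr($seq, 0, $len) };
        return $ident if defined $ident;
    }
    return;
}

# Read the FASTQ once and write each matched record to <prefix><ident>.fq.
sub split_fastq {
    my ($fastq, $bcfile, $prefix) = @_;
    my $ident_for = load_barcodes($bcfile);
    my %seen;
    # Distinct barcode lengths, longest first.
    my @lengths = sort { $b <=> $a } grep { !$seen{$_}++ } map { length } keys %$ident_for;
    my (%out, %count);
    open my $in, '<', $fastq or die "Cannot open $fastq: $!";
    while (defined(my $header = <$in>)) {
        my $seq  = <$in>;
        my $plus = <$in>;
        my $qual = <$in>;
        die "Truncated FASTQ record in $fastq\n" unless defined $qual;
        chomp(my $bases = $seq);
        my $ident = assign_read($bases, $ident_for, \@lengths);
        next unless defined $ident;        # unmatched reads are never written
        unless ($out{$ident}) {
            open $out{$ident}, '>', "$prefix$ident.fq"
                or die "Cannot write $prefix$ident.fq: $!";
        }
        print { $out{$ident} } $header, $seq, $plus, $qual;
        $count{$ident}++;
    }
    close $in;
    close $_ or die "Error closing output: $!" for values %out;
    return \%count;
}

# Usage: perl script.pl reads.fq barcodes.txt output_prefix
split_fastq(@ARGV) if @ARGV == 3;
```

Each read is looked at once, so all 576 individuals would come out of a single pass per plate. Note that it keeps one output file open per matched barcode, which may run into the per-process open-file limit on some systems.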

Last edited by gringer; 06-26-2014 at 03:12 PM.
Tags
barcode, fastx, splitting
