  • Splitting fastq file by barcodes without producing unmatched.fq file?

    G'day Everyone,

    I am trying to split six plates of data by the barcodes of 576 individuals. We are splitting our data using 'fastx_barcode_splitter.pl', but unfortunately this tool can't work with a barcodes file where the barcodes have different lengths. To get around this, we decided to feed in one barcode at a time, which works fine.

    Our problem is that during this process 'fastx_barcode_splitter.pl' also writes out an unmatched.fq file, which is about 50X larger than the file we are interested in and takes up a lot of the processing time. It took longer than an hour to split out 1 individual; multiplied by 576, that means it will take way too long to run.

    Is there a way to stop 'fastx_barcode_splitter.pl' from producing an unmatched.fq file? I think this would help us cut a lot of unnecessary reading and writing time.

    Thanks for your help in advance.

    Cheers,
    Adam

  • #2
    I don't think there's a command line switch for it, but this is just a perl script so it shouldn't be too hard to modify it to do what you want.
    I haven't tested this, but in the fastx_barcode_splitter.pl file there is this function:

    Code:
    sub match_sequences {
    
    .. lots of truncated lines ..
    .. but ending in ..
    
    		$best_barcode_ident = 'unmatched' 
    			if ( (!defined $best_barcode_ident) || $best_barcode_mismatches_count>$allowed_mismatches) ;
    
    		print STDERR "sequence $seq_bases matched barcode: $best_barcode_ident\n" if $debug;
    
    		$counts{$best_barcode_ident}++;
    
    		#get the file associated with the matched barcode.
    		#(note: there's also a file associated with 'unmatched' barcode)
    		my $file = $files{$best_barcode_ident};
    
    		write_record($file);
    	}
    }
    I think if you just enclose the write_record($file); in an if clause like this:

    Code:
    if ($best_barcode_ident ne 'unmatched') {
        write_record($file);
    }
    it should help. The unmatched output file will still be generated, but nothing should be written into it.

    As I said, this is untested, but I hope it helps,
    Wolfgang


    • #3
      Thanks wolfgang,

      that works great! I no longer get the unmatched.fq file printed out.

      Unfortunately, it is still pretty slow because it is processing these reads. Do you think it is possible to skip the processing of unmatched reads, or is this likely to cause problems with running the script? This python script is largely illegible to me so I am not sure how intertwined the unmatched stuff is with the match stuff.

      Thanks again for your help, I really appreciate it.

      All the best,
      Adam


      • #4
        Adam,

        That's Perl, not Python... and if you want it to run faster, you might need to write or find a version written in a compiled language like C or Java, rather than an interpreted language like Perl or Python. It depends on whether it is CPU-limited or I/O-limited; run "top" and see if the CPU load is 100% while it is running. If it is, then you're CPU-limited and probably need a different language to speed it up.
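A rough way to put a number on the CPU-versus-I/O check described above, offered as a sketch rather than a tested recipe: compare the CPU time a child process used with the wall-clock time it took. The helper name `cpu_fraction` is made up for illustration, and it assumes a Unix-like system (on Windows, `os.times()` reports zero for child CPU time).

```python
import os
import subprocess
import time


def cpu_fraction(command, **run_kwargs):
    """Run `command` and return (child CPU time) / (wall-clock time).

    A value near 1.0 suggests a single-threaded job that is CPU-limited;
    a value well below 1.0 suggests it spends most of its time waiting,
    typically on disk I/O. Extra keyword arguments (for example stdin=)
    are passed straight to subprocess.run.
    """
    before = os.times()
    start = time.perf_counter()
    subprocess.run(command, check=True, **run_kwargs)
    wall = time.perf_counter() - start
    after = os.times()
    # CPU time consumed by finished child processes (user + system).
    cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return cpu / wall if wall > 0 else 0.0
```

To try it on the splitter, pass the same command you normally run as a list of strings and hand the input FASTQ over with `stdin=`.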

        And you probably can't speed it up by skipping reads. You have to process a read (or barcode) at least once in order to determine whether or not it matches one of your bins!

        If it runs fast enough when you run it once, rather than 576 times, you may be able to trick it by padding your short barcodes with extra characters, so that all are the same length.


        • #5
          Adam,
          there is no way to skip processing of the reads. As Brian correctly points out, you need to look at them to see if they match. My suggestion should save some time by minimizing disk writes, but that's all you can do.

          Another option would be to write the unmatched reads after all, then use only that file as input in the next round. With such a subtraction approach, the unmatched reads file would become smaller at every step, so even though your first round may take very long, subsequent rounds would run faster.
          Depending on how similar your barcodes are, this would also eliminate the risk of accidentally assigning the same read to two different barcodes.
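The subtraction idea above can be sketched in a few lines of Python. This is only an illustration and not how fastx_barcode_splitter.pl works internally: it assumes exact barcode matches at the start of each read, keeps everything in memory, and uses made-up names (`split_by_subtraction`, reads as `(header, sequence, quality)` tuples). A real run would stream FASTQ files instead.

```python
def split_by_subtraction(reads, barcodes):
    """Assign reads to barcodes one round at a time.

    reads    -- iterable of (header, sequence, quality) tuples
    barcodes -- dict mapping sample name -> barcode string
    Returns (bins, unmatched): bins maps sample name -> list of reads,
    unmatched is the list of reads left over after every round.
    """
    bins = {name: [] for name in barcodes}
    remaining = list(reads)
    # Longest barcodes first, so a short barcode that happens to be a
    # prefix of a longer one cannot claim the longer barcode's reads.
    ordered = sorted(barcodes.items(), key=lambda item: -len(item[1]))
    for name, barcode in ordered:
        still_unmatched = []
        for read in remaining:
            if read[1].startswith(barcode):
                bins[name].append(read)
            else:
                still_unmatched.append(read)
        # The next round only sees what this round did not match.
        remaining = still_unmatched
    return bins, remaining
```

Because each read leaves the pool as soon as it is matched, no read can land in two bins, and every round scans fewer reads than the one before.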


          • #6
            Or modify the algorithm so that it works with barcodes of different lengths. If you're already changing the code, you might as well make that little fix as well.
            /usr/bin/fastx_barcode_splitter.pl
            edit: or not so little.... If you can give me a bit more information about how your barcoding system works (e.g. do you have a separate barcode file and sequence file? do barcodes always appear in the first 10 bases? Do long barcodes always start at the same place in the sequence?), I might be able to crank out something that works.
            Last edited by gringer; 06-26-2014, 02:12 PM.
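For what it's worth, here is a minimal single-pass sketch of matching barcodes of different lengths. It is not the modification discussed above, and it makes assumptions the thread leaves open: barcodes sit at the very start of the read, only exact matches count, and all names (`build_lookup`, `assign_read`, `demultiplex`) are hypothetical.

```python
from collections import defaultdict


def build_lookup(barcodes):
    """Group barcodes by length, longest length first.

    barcodes -- dict mapping sample name -> barcode string
    Returns a list of (length, {barcode: sample}) pairs.
    """
    by_length = defaultdict(dict)
    for sample, barcode in barcodes.items():
        by_length[len(barcode)][barcode] = sample
    return sorted(by_length.items(), reverse=True)


def assign_read(sequence, lookup):
    """Return the sample whose barcode starts the sequence, or None."""
    for length, table in lookup:
        sample = table.get(sequence[:length])
        if sample is not None:
            return sample
    return None


def demultiplex(reads, barcodes):
    """Bin (header, sequence, quality) reads by sample in one pass."""
    lookup = build_lookup(barcodes)
    bins = defaultdict(list)
    for read in reads:
        sample = assign_read(read[1], lookup)
        if sample is not None:
            # Unmatched reads are simply dropped, never written out.
            bins[sample].append(read)
    return dict(bins)
```

Each read costs one dictionary lookup per distinct barcode length, so all 576 barcodes are handled in a single pass over the data instead of 576 passes.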
