Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • perl woes and the processing with big fatq files

    Hi,
    the last fastq files I got for an RNA-seq experiment had for some reason a glitch, that is
    there was at least one field which was not starting with + like:
    +HWI-ST667_0144:3:1101:1140:2047#GTGGCC/1

    So I thought I could easily get around this and write a short script to filter out all weird entries:

    Code:
    #!/usr/bin/perl
    
    use warnings;
    use strict;
    use Carp;
    
    my $a = {};
    my $n = 0;
    my $k = 0;
    
    my $filenames = $ARGV[0];
    
    unless( open( FILEIOFH, $filenames ) ){
    	croak("Cannot open file" . $filenames );
    }
    while (defined(my $line = <FILEIOFH>) ) {
    	unless ( eof(FILEIOFH) ){
    		chomp $line;
    		if ($n < 2 ){
    			$a->{$n} = $line;
    			$n++;
    			next;
    		}
    		if ($n == 2 && $line =~/^\+HWI/){
    			print $a->{0},"\n";
    			print $a->{1},"\n";
    			print $line,"\n";
    			$n++;
    			$k=1;
    			next;
    		}
    		if ($n == 2){
    			$n++;
    			next;
    		
    		}
    		if ($k){
    			print $line,"\n";
    			$n = 0;$k = 0;
    		}else{
    			$n = 0;
    		}
    	}
    }
    close(FILEIOFH);
    So, while the the code is probably not the smartest way to do it, it runs fine for small files but a 12 gig fastq file will abort after processing 130 Mb, though I explicitly state it should run until the end with eof(). I tried reading everything in an arrray first and then iterating through the array but nope, still aborts after 130 Mb (and uses awfully amount of memory).
    So, I am missing something here, could someone help me out?

    Thanks,
    Marc

  • #2
    Are you sure it aborts, or does it just produce 130Mb of output from the whole file? The problem might be that if you are missing a line from somewhere in the middle of your file (which we've seen happen when transfers have corrupted a file) then your program will end up out of phase for the rest of the file and won't print any more entries.

    I actually posted the script we used to fix a similar problem a while back. It might be worth running that to see if it finds anything odd in your file.

    http://seqanswers.com/forums/showpos...75&postcount=8

    Hope this helps

    Simon.

    Comment


    • #3
      Hi Simon,

      thanks a lot, I was not aware that you could capture the next line with <> inside a while loop (but then every time you call <> it iterates to the next line, so I was not seeing the obvious ) .
      You are right, I think I had a file error which resulted in an out of phase.
      Running your script, I get a clean file again

      Marc

      Comment


      • #4
        Originally posted by simonandrews View Post
        Are you sure it aborts, or does it just produce 130Mb of output from the whole file? The problem might be that if you are missing a line from somewhere in the middle of your file (which we've seen happen when transfers have corrupted a file) then your program will end up out of phase for the rest of the file and won't print any more entries.

        I actually posted the script we used to fix a similar problem a while back. It might be worth running that to see if it finds anything odd in your file.

        http://seqanswers.com/forums/showpos...75&postcount=8

        Hope this helps

        Simon.
        If I understand this line:

        Code:
        if ($qual =~/^[@+]/) {
            warn "Quality '$qual' looked like an id";
            next;
          }
        You are checking to see if the quality string begins with the '@' but isn't this character allowed in Sanger encoded Fastq files (i.e., probably most Fastq files being generated these days)? I remember this discussion coming up previously and I think the conclusion was that checking to see that a record has 4 lines (4 newlines specifically) may be a safer approach. I ran into this issue myself when the Illumina format changed to Sanger and I started getting odd results when using grep or Perl to count the '@' characters at the beginning of that line. Nonetheless, this is a really good thing to keep in mind because I often have issues when trying to transfer a batch of Fastq files.
        Last edited by SES; 04-19-2012, 03:35 AM.

        Comment


        • #5
          You're right - that script is from a while back when we were using the older Illumina encoding. It would indeed produce some false positives for Sanger encoded qualities. Unless you're prepared to include part of the sequencer name in the pattern I'm not sure you can rigorously tell apart a header line from a quality line in the general case.

          Now I'm conflicted about who I should be more annoyed at - Illumina for using a non-standard encoding in the first place, or Sanger for not using a title delimiter which can't be found at the start of non-title lines

          Comment


          • #6
            Originally posted by simonandrews View Post
            Unless you're prepared to include part of the sequencer name in the pattern I'm not sure you can rigorously tell apart a header line from a quality line in the general case.
            That is exactly what I ended up doing for some previous work, and here is some Bash code to do that (assuming the first argument would be the forward reads from a paired-end Fastq):

            Code:
            leftmate=$(echo ${1%.*})
            leftids=$leftmate"_ids.txt"
            read -r fq_header < $1
            instr_name=$(echo $fq_header | grep -oE '\@\w+')
            grep "$instr_name" $1 | sed 's/\/1$//;s/^\@//' | sort > $leftids # you would probably want to fine tune the sort command for the data
            This worked for me as a test for whether or not two files are actually forward and reverse reads from the same paired-end run. Of course, this is not good because you have to keep changing the code, and the '/1' convention is another moving target. Yes, I've also been really frustrated trying to incorporate tests with multiple data sources, species, etc.

            Comment


            • #7
              OK, but that's solving a different problem. My original script was to fix a single fastq file which had become corrupted at some point internally (ie an unknown number of lines were either missing or truncated).

              Trying to match up and validate paired files is a different problem. The worst ones we found were from NCBI SRA where we got a pair of fastq files where every read id in both forward and reverse files was the sample accession number. We've given up validating pairs now, and just assume that the read ordering is correct. If it's wrong then you end up getting stupid distances between the ends so it's normally pretty obvious (if not fixable).

              Comment


              • #8
                You are correct, what I was describing specifically was related to something different. Though, I was just trying to provide some context for the original issue of identifying the qual string correctly. If the goal is to identify the qual string then the general approach is relevant (if you modify the code to fit your needs).

                Comment

                Latest Articles

                Collapse

                • seqadmin
                  Current Approaches to Protein Sequencing
                  by seqadmin


                  Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                  04-04-2024, 04:25 PM
                • seqadmin
                  Strategies for Sequencing Challenging Samples
                  by seqadmin


                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                  03-22-2024, 06:39 AM

                ad_right_rmr

                Collapse

                News

                Collapse

                Topics Statistics Last Post
                Started by seqadmin, 04-11-2024, 12:08 PM
                0 responses
                30 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 10:19 PM
                0 responses
                32 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-10-2024, 09:21 AM
                0 responses
                28 views
                0 likes
                Last Post seqadmin  
                Started by seqadmin, 04-04-2024, 09:00 AM
                0 responses
                53 views
                0 likes
                Last Post seqadmin  
                Working...
                X