Seqanswers Leaderboard Ad

**simonandrews** · 04-17-2012, 11:50 PM

Are you sure it aborts, or does it just produce 130Mb of output from the whole file? The problem might be that if you are missing a line from somewhere in the middle of your file (which we've seen happen when transfers have corrupted a file) then your program will end up out of phase for the rest of the file and won't print any more entries.

I actually posted the script we used to fix a similar problem a while back. It might be worth running that to see if it finds anything odd in your file.

http://seqanswers.com/forums/showpos...75&postcount=8

Hope this helps

Simon.

**Kcornelius** · 04-18-2012, 09:12 AM

Hi Simon,

thanks a lot, I was not aware that you could capture the next line with <> inside a while loop (but then every time you call <> it iterates to the next line, so I was not seeing the obvious

) .
You are right, I think I had a file error which resulted in an out of phase.
Running your script, I get a clean file again

Marc

**SES** · 04-19-2012, 03:31 AM

Originally posted by simonandrews View Post

Are you sure it aborts, or does it just produce 130Mb of output from the whole file? The problem might be that if you are missing a line from somewhere in the middle of your file (which we've seen happen when transfers have corrupted a file) then your program will end up out of phase for the rest of the file and won't print any more entries.

I actually posted the script we used to fix a similar problem a while back. It might be worth running that to see if it finds anything odd in your file.

http://seqanswers.com/forums/showpos...75&postcount=8

Hope this helps

Simon.

If I understand this line:

Code:

if ($qual =~/^[@+]/) {
    warn "Quality '$qual' looked like an id";
    next;
  }

You are checking to see if the quality string begins with the '@' but isn't this character allowed in Sanger encoded Fastq files (i.e., probably most Fastq files being generated these days)? I remember this discussion coming up previously and I think the conclusion was that checking to see that a record has 4 lines (4 newlines specifically) may be a safer approach. I ran into this issue myself when the Illumina format changed to Sanger and I started getting odd results when using grep or Perl to count the '@' characters at the beginning of that line. Nonetheless, this is a really good thing to keep in mind because I often have issues when trying to transfer a batch of Fastq files.

**simonandrews** · 04-19-2012, 03:41 AM

You're right - that script is from a while back when we were using the older Illumina encoding. It would indeed produce some false positives for Sanger encoded qualities. Unless you're prepared to include part of the sequencer name in the pattern I'm not sure you can rigorously tell apart a header line from a quality line in the general case.

Now I'm conflicted about who I should be more annoyed at - Illumina for using a non-standard encoding in the first place, or Sanger for not using a title delimiter which can't be found at the start of non-title lines

**SES** · 04-19-2012, 04:01 AM

Originally posted by simonandrews View Post

Unless you're prepared to include part of the sequencer name in the pattern I'm not sure you can rigorously tell apart a header line from a quality line in the general case.

That is exactly what I ended up doing for some previous work, and here is some Bash code to do that (assuming the first argument would be the forward reads from a paired-end Fastq):

Code:

leftmate=$(echo ${1%.*})
leftids=$leftmate"_ids.txt"
read -r fq_header < $1
instr_name=$(echo $fq_header | grep -oE '\@\w+')
grep "$instr_name" $1 | sed 's/\/1$//;s/^\@//' | sort > $leftids # you would probably want to fine tune the sort command for the data

This worked for me as a test for whether or not two files are actually forward and reverse reads from the same paired-end run. Of course, this is not good because you have to keep changing the code, and the '/1' convention is another moving target. Yes, I've also been really frustrated trying to incorporate tests with multiple data sources, species, etc.

**simonandrews** · 04-19-2012, 04:18 AM

OK, but that's solving a different problem. My original script was to fix a single fastq file which had become corrupted at some point internally (ie an unknown number of lines were either missing or truncated).

Trying to match up and validate paired files is a different problem. The worst ones we found were from NCBI SRA where we got a pair of fastq files where every read id in both forward and reverse files was the sample accession number. We've given up validating pairs now, and just assume that the read ordering is correct. If it's wrong then you end up getting stupid distances between the ends so it's normally pretty obvious (if not fixable).

**SES** · 04-19-2012, 04:37 AM

You are correct, what I was describing specifically was related to something different. Though, I was just trying to provide some context for the original issue of identifying the qual string correctly. If the goal is to identify the qual string then the general approach is relevant (if you modify the code to fit your needs).

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 30 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

perl woes and the processing with big fatq files

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News