
**jkbonfield** · 04-09-2009, 05:02 AM

I have a totally hideous, but FAST, script to convert Solexa fastq (log-odds, +64) to Phred +33 format.

The horrid tr is basically just doing the quality mapping and was generated by a simple perl 1-liner.

That said, your data doesn't look like it is in Solexa format anyway. All those 'I's are quality 40 (ASCII 73 = 33 + 40).

Code:

# Read the fastq file, with blind faith it's in the correct format.
while (<>) {
    print;                      # name
    $_=<>; print;               # sequence
    $_=<>; print "+\n";         # quality header (was name)
    $_=<>;
    tr/\041\042\043\044\045\046\047\050\051\052\053\054\055\056\057\060\061\062\063\064\065\066\067\070\071\072\073\074\075\076\077\100\101\102\103\104\105\106\107\110\111\112\113\114\115\116\117\120\121\122\123\124\125\126\127\130\131\132\133\134\135\136\137\140\141\142\143\144\145\146\147\150\151\152\153\154\155\156\157\160\161\162\163\164\165\166\167\170\171\172\173\174\175/\041\041\041\041\041\041\041\041\041\041\041\041\041\041\041\041\041\041\041\041\041\041\042\042\042\042\042\042\043\043\044\044\045\045\046\046\047\050\051\052\053\053\054\055\056\057\060\061\062\063\064\065\066\067\070\071\072\073\074\075\076\077\100\101\102\103\104\105\106\107\110\111\112\113\114\115\116\117\120\121\122\123\124\125\126\127\130\131\132\133\134\135\136/;
    print;                      # quality
}

edit: that hideous auto-generated tr I think actually boils down to:

tr/!-\175/!!!!!!!!!!!!!!!!!!!!!!""""""##$$%%&&-++,-\136/;

It still looks like wonderful line noise though :-)
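
For readers wondering where the table comes from: the tr mapping above appears to tabulate the usual Solexa log-odds to Phred relation, rounded to the nearest integer. This is a reference sketch of that formula with a few worked values, not something stated in the post itself.

```latex
\[
Q_{\text{phred}} = 10 \log_{10}\!\left(10^{Q_{\text{solexa}}/10} + 1\right)
\]
\begin{align*}
Q_{\text{solexa}} = -5 &:\quad 10\log_{10}(1.316) \approx 1.19 \;\to\; 1 \\
Q_{\text{solexa}} = 10 &:\quad 10\log_{10}(11) \approx 10.41 \;\to\; 10 \\
Q_{\text{solexa}} = 40 &:\quad 10\log_{10}(10001) \approx 40.0004 \;\to\; 40
\end{align*}
```

The two scales only differ noticeably at low quality, which is why the first part of the table is lumpy and the rest is a straight shift from +64 to +33.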

**kmcarr** · 04-09-2009, 05:08 AM

The file you downloaded appears to already be in standard Sanger FASTQ format, so there is no reason to convert. For Sanger FASTQ the conversion to a Phred score is ASCII(n)-33 (where 'n' is the character in the quality string). The majority of your quality values are 'I', which is ASCII 73, so 73-33 = 40, a reasonable Phred score. If the file were using Solexa scoring (ASCII(n)-64), the majority of scores would be 9! Further, the original file has one '3' (ASCII 51) in the quality string; if this were a Solexa file this would translate to a Q score of -13, which I think is below the lower limit for Solexa Q scores.

I think you can skip the sol2sanger step and proceed with the file as downloaded.
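
The arithmetic above can be turned into a quick sanity check. This is a minimal sketch in C, not part of maq or any existing tool: `decode_qual` and `could_be_solexa64` are hypothetical names, and the lower bound of -5 for Solexa scores is an assumption built into the check.

```c
/* Decode one quality character for a given ASCII offset
 * (33 for Sanger/Phred, 64 for Solexa). */
int decode_qual(char c, int offset)
{
    return (int)(unsigned char)c - offset;
}

/* Return 1 if every character in the quality string decodes to a
 * plausible Solexa score under the +64 offset, 0 otherwise.
 * ASSUMPTION: Solexa scores are never below -5. */
int could_be_solexa64(const char *qual)
{
    for (; *qual != '\0'; qual++) {
        if (decode_qual(*qual, 64) < -5)
            return 0;
    }
    return 1;
}
```

With this, `decode_qual('I', 33)` gives 40 and `decode_qual('I', 64)` gives 9, and a quality string containing a '3' fails the `could_be_solexa64` check, which is the same reasoning used above to conclude the file is already Sanger-encoded.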

**aaronh** · 04-09-2009, 09:42 AM

I had this problem and solved it by removing the spaces from the sequence name. If you look at the fastq definition on the MAQ page, you will see that spaces are not allowed: <seqname>:=[A-Za-z0-9_.:-]. I'm not sure if this is the official definition of a fastq file, but that is what MAQ uses. Get rid of the spaces and you should be fine.
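
One way to get rid of the spaces without editing the file by hand is sketched below in C. This is only an illustration under stated assumptions: `sanitize_name` and `sanitize_fastq` are made-up names, the input is assumed to be strict 4-line FASTQ records, and replacing whitespace with '_' is my choice (it is one of the characters in the MAQ definition quoted above), not something MAQ requires.

```c
#include <stdio.h>
#include <string.h>

/* Replace spaces and tabs in one header line with '_', in place.
 * The trailing newline, if any, is left untouched. */
void sanitize_name(char *line)
{
    for (; *line != '\0'; line++) {
        if (*line == ' ' || *line == '\t')
            *line = '_';
    }
}

/* Copy FASTQ from `in` to `out`, sanitizing the '@name' and '+name'
 * lines of each record. ASSUMPTION: strict 4-line records. */
void sanitize_fastq(FILE *in, FILE *out)
{
    char buf[4096];
    unsigned long line_no = 0;   /* 0-based line index in the file */

    while (fgets(buf, sizeof buf, in) != NULL) {
        size_t len = strlen(buf);

        /* Lines 0 and 2 of every 4-line record carry the name. */
        if (line_no % 2 == 0)
            sanitize_name(buf);
        fputs(buf, out);

        /* Only count a line once its newline has been seen, so that
         * lines longer than the buffer are handled in pieces. */
        if (len > 0 && buf[len - 1] == '\n')
            line_no++;
    }
}
```

Calling `sanitize_fastq(stdin, stdout)` from a small `main` would turn this into a filter. The sequence and quality lines are copied through unchanged, so any '@' characters in the quality string are left alone.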

**Ender985** · 04-14-2009, 02:43 AM

As I'm still fairly new to the world of DNA-seq, I didn't realise the sequences were already in Sanger fastq format, so the sol2sanger step was indeed not needed at all. Nonetheless, the problem with maq persisted when I tried to run maq match using those sequences.

So I tried aaronh's solution, and it worked perfectly! After replacing all of the blank spaces, everything is running smoothly with no errors.
I still don't get why only the sequences containing an @ in their quality string were failing, since all of them contained blank spaces in the name, but I guess it is just the way it is coded.

Lots of thanks!

**aaronh** · 04-14-2009, 01:04 PM

From what I recall, actually all of the reads are failing but it is only complaining about the ones with the @. If you take the bfq file and convert it back to fastq, I think it looks like junk.

**polivares** · 08-01-2009, 11:25 AM

As Wikipedia's article states, SRA files already use Sanger qualities. You should only need to remove the spaces. Please tell me if I am wrong.

**abelcable** · 07-28-2010, 11:23 AM

Space in the name

In case anyone stumbles across this problem again, I figured out how to solve it in the code. Open the file seq.c in the top directory of maq (this is for maq 0.7.1; it may work for other versions if the code in this file is the same) and look for the function called seq_read_fastq. Look for this while loop:

Code:

while (!feof(fp) && (c = fgetc(fp)) != ' ' && c != '\t' && c != '\n')
    if (c != '\r' && *p++ != c) {
        fprintf(stderr, "[seq_read_fastq] Inconsistent sequence name: %s. Continue anyway.\n", name);
        return seq->l;
    }

Insert this code immediately after the loop and re-compile.

Code:

 if (c != '\n') while (!feof(fp) && fgetc(fp) != '\n');

That should fix the space-in-the-name problem. The @ symbols have nothing to do with it: the code uses that symbol as an anchor, and when there are spaces in the name it messes everything up. So you don't have to remove the @ symbols from the quality scores.

This makes the code ignore anything after the first white space. If your name includes spaces, this will truncate the name to the part before the space. I guess that's not that great if you need the whole name, but this will at least give you a hint as to how to fix that too.
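
To make the effect of the patch easier to see in isolation, here is a small in-memory analogue in C. It is not maq's code: `read_name` is a hypothetical helper that works on a string rather than a FILE, but it follows the same two steps, first taking characters up to the first whitespace, then discarding the rest of the line.

```c
#include <stddef.h>

/* Copy the first whitespace-delimited token at *pos into `name`
 * (at most cap-1 characters, always NUL-terminated when cap > 0),
 * then advance *pos past the end of that line.
 * Returns the number of characters stored in `name`. */
size_t read_name(const char **pos, char *name, size_t cap)
{
    const char *p = *pos;
    size_t n = 0;

    /* Step 1: take characters up to the first space, tab or newline,
     * skipping carriage returns as the original loop does. */
    while (*p != '\0' && *p != ' ' && *p != '\t' && *p != '\n') {
        if (*p != '\r' && n + 1 < cap)
            name[n++] = *p;
        p++;
    }
    if (cap > 0)
        name[n] = '\0';

    /* Step 2: the analogue of the inserted line, which throws away
     * whatever follows the first whitespace on this line. */
    while (*p != '\0' && *p != '\n')
        p++;
    if (*p == '\n')
        p++;

    *pos = p;
    return n;
}
```

Given a pointer just past the '@' of a header such as "read1 extra text", this stores "read1" and leaves the position at the start of the sequence line. Keeping the full name instead would mean copying the characters seen in step 2 rather than dropping them.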

Short Read Archive format problems