  • Short Read Archive format problems

    Hello all,

    I've been downloading some Illumina/Solexa short read files from SRA such as this one, to test and get used to MAQ and BWA.

    It seems the format of the provided short reads is Solexa fastq, i.e.,
    Code:
    @SRR002322.60 080317_CM-KID-LIV-2-REPEAT_0003:1:1:88:275 length=36
    TCTGTCTCAAAAACAAAACAAAACAAAACAAAAAAA
    +SRR002322.60 080317_CM-KID-LIV-2-REPEAT_0003:1:1:88:275 length=36
    IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIAII1
    However, whenever I try to either convert this format to Sanger format using
    $ maq sol2sanger SRR002322.fastq SRR002322.sang.fastq
    or try to convert this .fastq file to binary .bfq format, an extremely large warning list is shown on the terminal, spanning several thousands of errors like these,

    Code:
    [seq_read_fastq] Inconsistent sequence name: II9II<%IIIII6I. Continue anyway.
    [seq_read_fastq] Inconsistent sequence name: +II'IIII(). Continue anyway.
    [seq_read_fastq] Inconsistent sequence name: .IIIIII'IIIIIIIIIIIIIII3IIE. Continue anyway.
    (...)
    Well, I've investigated a little, and I think I've found the origin of all these errors. All the problems concern short reads whose quality string contains an '@' symbol. For example, the three short reads matching the three errors I've just shown are

    Code:
    @SRR002322.11 080317_CM-KID-LIV-2-REPEAT_0003:1:1:121:511 length=36
    GTTTGGCTAAGGTTGTCTGGTAGTTAGGTGGAGTTG
    +SRR002322.11 080317_CM-KID-LIV-2-REPEAT_0003:1:1:121:511 length=36
    IIIIIIIIIDIIHIIIIIIII@II9II<%IIIII6I
    
    @SRR002322.33 080317_CM-KID-LIV-2-REPEAT_0003:1:1:110:444 length=36
    TGTATTTTTAGTAGAGACGTGGTTTCACCATCTTGT
    +SRR002322.33 080317_CM-KID-LIV-2-REPEAT_0003:1:1:110:444 length=36
    IIIIIIIII%III+IIIIIIIIIII@+II'IIII()
    
    @SRR002322.63 080317_CM-KID-LIV-2-REPEAT_0003:1:1:108:770 length=36
    TAAAAATGCCCTAGCCTACTTCTTACCACAAGGCAC
    +SRR002322.63 080317_CM-KID-LIV-2-REPEAT_0003:1:1:108:770 length=36
    IIIIIIII@.IIIIII'IIIIIIIIIIIIIII3IIE
    all the other sequences are converted just fine.

    My bet is that MAQ interprets everything after an @ as a sequence name and thus misinterprets the following lines as well. If I let the command run to the end of the file, the resulting .sang.fastq file contains some funny short reads. Apart from the normal reads like this one,

    Code:
    @SRR002322.11
    GTTTGGCTAAGGTTGTCTGGTAGTTAGGTGGAGTTG
    +
    !"!!!"@&.!,+&!-+7!!!3'1'%5@!!!!"!"!"
    I also get a ton of for example

    Code:
    @II9II<%IIIII6I
    SRR.CM-KID-LIV--REPEATlengthTTTTTGCATCAAAAAGCTTTATTTCCATTTGGTCCA
    +
    %&%%%&B)0%.-)%/-9%%%5*3*(7B%%%%&%&%&!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    Note how the 'name' of this nonsensical short read is the end of the first problematic quality string I've shown before, II9II<%IIIII6I.


    So since I've searched this forum and haven't found anyone else with the same problems as me, I think I must be doing something wrong. Are the SRA files not in Solexa/Illumina fastq format? What am I missing?

    Lots of thanks!

  • #2
    I have a totally hideous script, but FAST, to convert solexa fastq with log-odds +64 to phred +33 format.

    The horrid tr is basically just doing the quality mapping and was generated by a simple perl 1-liner.

    However that said, your data doesn't look to be in solexa format anyway. All those 'I's are quality 40 (ascii 73 => 33+40).

    Code:
    # Read the fastq file, with blind faith it's in the correct format.
    while (<>) {
        print;                      # name
        $_=<>; print;               # sequence
        $_=<>; print "+\n";         # quality header (was name)
        $_=<>;
        tr/\041\042\043\044\045\046\047\050\051\052\053\054\055\056\057\060\061\062\063\064\065\066\067\070\071\072\073\074\075\076\077\100\101\102\103\104\105\106\107\110\111\112\113\114\115\116\117\120\121\122\123\124\125\126\127\130\131\132\133\134\135\136\137\140\141\142\143\144\145\146\147\150\151\152\153\154\155\156\157\160\161\162\163\164\165\166\167\170\171\172\173\174\175/\041\041\041\041\041\041\041\041\041\041\041\041\041\041\041\041\041\041\041\041\041\041\042\042\042\042\042\042\043\043\044\044\045\045\046\046\047\050\051\052\053\053\054\055\056\057\060\061\062\063\064\065\066\067\070\071\072\073\074\075\076\077\100\101\102\103\104\105\106\107\110\111\112\113\114\115\116\117\120\121\122\123\124\125\126\127\130\131\132\133\134\135\136/;
        print;                      # quality
    }
    edit: that hideous auto-generated tr I think actually boils down to:

    tr/!-\175/!!!!!!!!!!!!!!!!!!!!!!""""""##$$%%&&-++,-\136/;

    It still looks like wonderful line noise though :-)
    Last edited by jkbonfield; 04-09-2009, 05:13 AM.
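
    For anyone who would rather not trust the line noise, here is a minimal Python sketch of the same mapping. It assumes the usual Solexa-to-Phred relation, Q_phred = 10*log10(10^(Q_solexa/10) + 1), rounded to the nearest integer, and it covers the same character range as the tr above ('!' to '}'). The function and table names are made up for this example and are not part of MAQ.

```python
import math


def solexa_to_phred(q_solexa):
    """Map a Solexa log-odds score to the nearest integer Phred score."""
    return int(10 * math.log10(10 ** (q_solexa / 10) + 1) + 0.5)


# Translation table for the same input range as the tr above: '!' (33) to '}' (125).
# Input characters are read as Solexa+64, output characters are written as Phred+33.
SOLEXA64_TO_SANGER33 = {
    code: chr(solexa_to_phred(code - 64) + 33) for code in range(33, 126)
}


def solexa64_to_sanger(quality):
    """Re-encode a Solexa+64 quality string as a Sanger (Phred+33) string."""
    return quality.translate(SOLEXA64_TO_SANGER33)
```

    Under these assumptions a Solexa 'h' (score 40) comes out as a Sanger 'I' (Phred 40), which is another way of seeing that a file full of 'I' characters is not Solexa+64.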



    • #3
      The file you downloaded appears to already be in standard Sanger FASTQ format, so there is no reason to convert. For Sanger FASTQ, the Phred score is ASCII(n)-33 (where 'n' is the character in the quality string). The majority of your quality values are 'I', which is ASCII 73, so 73-33 = 40, a reasonable Phred score. If the file were using Solexa scoring (ASCII(n)-64), the majority of scores would be 9! Further, the original file has a '3' (ASCII 51) in one quality string; if this were a Solexa file, this would translate to a Q score of 51-64 = -13, which is below the lower limit of -5 for Solexa Q scores.

      I think you can skip the sol2sanger step and proceed with the file as downloaded.
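
      The arithmetic above can be checked with a few lines of Python. This is only a rough sketch: the helper names are invented for this example, and the offset guess is a heuristic that relies on Solexa+64 files never containing characters below ';' (ASCII 59, score -5). A file with only high-quality reads can still be ambiguous.

```python
def decode_quality(quality, offset):
    """Return the numeric scores for a quality string under a given ASCII offset."""
    return [ord(ch) - offset for ch in quality]


def guess_offset(quality_strings):
    """Guess 33 (Sanger) or 64 (Solexa/early Illumina) from the lowest character seen.

    Heuristic only: any character below ';' (ASCII 59) cannot be Solexa+64,
    because that would mean a Solexa score below -5.
    """
    lowest = min(ord(ch) for quality in quality_strings for ch in quality)
    return 33 if lowest < 59 else 64
```

      For the first read in this thread, decode_quality("IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIAII1", 33) gives mostly 40s, while an offset of 64 would give 9s and a -15 for the final '1'.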



      • #4
        I had this problem and solved it by removing the spaces from the sequence name. If you look at the fastq definition on the MAQ page, you will see that spaces are not allowed, <seqname>:=[A-Za-z0-9_.:-]. I'm not sure if this is the official definition of a fastq file but that is what MAQ uses. Get rid of the spaces and you should be fine.
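
        One way to get rid of the spaces, sketched in Python. It assumes strict four-line records (no wrapped sequence lines, no blank lines), keeps only the part of the name before the first whitespace, and replaces the repeated name on the '+' line with a bare '+'. The function name and the sample data are just for illustration, and the '+' appended to the pattern is my addition so that it matches a whole name.

```python
import re

# Name pattern quoted above from the MAQ page, with "+" appended here
# so that it matches a whole name rather than a single character.
VALID_NAME = re.compile(r"[A-Za-z0-9_.:-]+")


def clean_fastq(lines):
    """Yield FASTQ lines with each name cut at the first whitespace."""
    for index, line in enumerate(lines):
        line = line.rstrip("\r\n")
        field = index % 4              # 0: name, 1: sequence, 2: '+', 3: quality
        if field == 0:
            name = line[1:].split()[0]
            if not VALID_NAME.fullmatch(name):
                raise ValueError(f"name not accepted by the pattern: {name!r}")
            yield "@" + name
        elif field == 2:
            yield "+"
        else:
            yield line


RAW = [
    "@SRR002322.11 080317_CM-KID-LIV-2-REPEAT_0003:1:1:121:511 length=36\n",
    "GTTTGGCTAAGGTTGTCTGGTAGTTAGGTGGAGTTG\n",
    "+SRR002322.11 080317_CM-KID-LIV-2-REPEAT_0003:1:1:121:511 length=36\n",
    "IIIIIIIIIDIIHIIIIIIII@II9II<%IIIII6I\n",
]
CLEANED = list(clean_fastq(RAW))
```

        Truncating rather than replacing the spaces with underscores is a deliberate choice here: the rest of the SRA header contains '=', which is not in the allowed character set either.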



        • #5
          As I'm still fairly new to the world of DNA-seq I didn't realise the sequences were already in Sanger fastq format, so the sol2sanger step was indeed not needed at all. Nonetheless, the problem with maq still persisted when I tried to run maq match on those sequences.

          So I tried aaronh's solution, and it worked perfectly! After replacing all of the blank spaces, the mapping is running smoothly and with no errors.
          I still don't get why only the sequences containing an @ in their quality string were failing, since all of them contained blank spaces in the name, but I guess it is just the way it is coded.

          Lots of thanks!



          • #6
            From what I recall, actually all of the reads are failing but it is only complaining about the ones with the @. If you take the bfq file and convert it back to fastq, I think it looks like junk.



            • #7
              As the Wikipedia article states, SRA files already use Sanger quality encoding. You should only need to remove the spaces. Please tell me if I am wrong.



              • #8
                Space in the name

                In case anyone stumbles across this problem again, I figured out how to solve it in the code. Open the file seq.c in the top directory of maq (this is for maq 0.7.1; it may work for other versions if the code in this file is the same) and look for the function called seq_read_fastq. Look for this while loop:
                Code:
                   while (!feof(fp) && (c = fgetc(fp)) != ' ' && c != '\t' && c != '\n')
                		if (c != '\r' && *p++ != c) {
                			fprintf(stderr, "[seq_read_fastq] Inconsistent sequence name: %s. Continue anyway.\n", name);
                			return seq->l;
                		}
                Insert this code immediately after the loop and re-compile.
                Code:
                 if (c != '\n') while (!feof(fp) && fgetc(fp) != '\n');
                That should fix the space-in-the-name problem. The @ symbols have nothing to do with it: the code uses that symbol as an anchor, and when there are spaces in the name it messes it all up. So you don't have to remove the @ symbols in the quality scores.

                This makes the code ignore anything after the first white space. If your name includes spaces, this will truncate the name to the part before the space. I guess that's not that great if you need the whole name, but this will at least give you a hint as to how to fix that too.
                Last edited by abelcable; 07-28-2010, 11:25 AM.
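
                The behaviour described above (keep the name up to the first whitespace, ignore the rest of the header line) can also be tried outside maq. The following is a rough Python sketch, not maq's code: it assumes four lines per record and checks the quality length against the sequence length, so an '@' inside a quality string is never mistaken for the start of a record. read_fastq and SAMPLE are names made up for this example.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) tuples from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return                      # end of file
        if not header.strip():
            continue                    # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got {header!r}")
        fields = header[1:].split()
        name = fields[0] if fields else ""   # drop everything after the first whitespace
        sequence = handle.readline().strip()
        plus = handle.readline()
        if not plus.startswith("+"):
            raise ValueError(f"expected '+' line for read {name!r}")
        quality = handle.readline().strip()
        if len(quality) != len(sequence):
            raise ValueError(f"sequence/quality length mismatch for read {name!r}")
        yield name, sequence, quality


SAMPLE = (
    "@SRR002322.63 080317_CM-KID-LIV-2-REPEAT_0003:1:1:108:770 length=36\n"
    "TAAAAATGCCCTAGCCTACTTCTTACCACAAGGCAC\n"
    "+SRR002322.63 080317_CM-KID-LIV-2-REPEAT_0003:1:1:108:770 length=36\n"
    "IIIIIIII@.IIIIII'IIIIIIIIIIIIIII3IIE\n"
)
RECORDS = list(read_fastq(io.StringIO(SAMPLE)))
```

                As with the patch, the name is truncated at the first whitespace, so anything after it in the header is lost.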

