
  • Illumina pipeline 1.3 fastq and Maq sol2sanger

    Hi,

    I see that Maq's sol2sanger (v0.7.1) has not yet been updated to handle the fastq files from the new pipeline 1.3, which now contain Phred scores.

    Does anyone have a handy script or advice for formatting the Illumina fastq files for Maq?

    thanks folks

    Rick

  • #2
    It's a relatively simple change, since the format is now (phred+64) and the standard is (phred+33).

    So, I added a function to the fq_all2std.pl script in the MAQ scripts subdirectory:

    Code:
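    sub sol2std2 {
    	while (<>) {
    		if (/^@/) {
    			print;           # header line, unchanged
    			$_ = <>;
    			print;           # sequence line, unchanged
    			$_ = <>;         # skip the '+' separator line
    			$_ = <>;         # quality line, phred+64 encoded

    			# Strip the trailing newline so it is not shifted along with the qualities
    			chomp;
    			my @t = split( '', $_ );
    			my $qual = '';
    			# Subtract 31 from each character: (phred+64) -> (phred+33)
    			$qual .= chr(ord($_) - 31) for (@t);
    			print "+\n$qual\n";
    		}
    	}
    }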
    Then just add it as a valid command by adding

    Code:
    sol2std2    => \&sol2std2,
    to the my %cmd_hash line.
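
For anyone who wants to check the arithmetic outside of Perl, here is a minimal Python sketch of the same per-character shift. The function name `illumina13_to_sanger` is made up for illustration and is not part of Maq; it only assumes the input really is phred+64.

```python
def illumina13_to_sanger(quality):
    """Shift a phred+64 quality string to phred+33 (subtract 31 per character)."""
    shifted = []
    for ch in quality:
        code = ord(ch)
        if code < 64:
            # Characters below '@' cannot occur in phred+64 data, so the
            # input is probably not pipeline 1.3 output.
            raise ValueError(f"character {ch!r} is outside the phred+64 range")
        shifted.append(chr(code - 31))
    return "".join(shifted)


print(illumina13_to_sanger("hhhD@"))  # prints III%!
```

The range check is an extra safety net that the Perl subroutine does not have; drop it if you prefer identical behaviour.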


    • #3
      The main problem with this new format is that it's now nigh on impossible to tell the difference between phred+64 and logodds+64 formats without resorting to a large amount of statistical analysis on the file contents.
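
To illustrate why the two +64 formats are hard to tell apart, here is a rough Python sketch of the usual "look at the lowest character" heuristic. It assumes Solexa scores bottom out at -5 (ASCII 59) and that the file contains at least some low-quality bases; the function name is hypothetical. Anything whose lowest character is '@' or above stays ambiguous, which is exactly the problem described above.

```python
def guess_quality_encoding(quality_lines):
    """Guess the FASTQ quality encoding from the lowest character seen.

    Raises ValueError if no quality characters are supplied.
    """
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    if lowest < 59:
        # Only the Sanger encoding uses characters below ';'.
        return "phred+33"
    if lowest < 64:
        # Characters ';' to '?' correspond to negative Solexa scores.
        return "solexa+64"
    # No negative scores seen: cannot separate phred+64 from solexa+64.
    return "phred+64 or solexa+64 (ambiguous)"


print(guess_quality_encoding(["@Dhh", "hhhh"]))
```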

      It's easy enough to convert of course, but knowing precisely what format your input data is in is getting trickier by the day. Time for fastq to retire I think!

      James


      • #4
        I totally second that thought. Mapping algorithms that expect one form of quality values will still give you mapped reads when given another! But the accuracy and efficiency can be very different.
        --
        bioinfosm


        • #5
          Originally posted by jkbonfield
          It's easy enough to convert of course, but knowing precisely what format your input data is in is getting trickier by the day. Time for fastq to retire I think!
          Fastq just needs to be standardized. It looks to me like everyone is eventually moving to Sanger/Phred scores for all fastq files; hopefully the next Illumina pipeline version will produce this as well.
          @1
          NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
          +
          """"""""""""""""""""""""""""""""""""


          • #6
            sol2std2 function

            Hello lparsons

            While trying to convert my Solexa phred+64 ASCII qualities to phred+33, I realised I was unable to use ./maq sol2sanger. I came across your method and added this command to the fq_all2std.pl script in Maq. However, when I run it I get an error:
            ./fq_all2std.pl test.txt one.fastq
            ** Unrecognized command test.txt at ./fq_all2std.pl line 45.

            #line 45 in the script is die("** Unrecognized command $cmd");

            I added sol2std2 => \&sol2std2 to the my %cmd_hash line and also pasted the subroutine into the script just before the sub instruction { line.

            Any help would be much appreciated

            Cheers

            L


            • #7
              command missed out

              Oops, I had forgotten to specify the sol2std2 command.

              However, is there anywhere I can look up the meaning of the #, !, @ etc. symbols?

              Cheers
              L


              • #8
                It sounds like you were able to get things working. Let me know if you are still having trouble.

                As for understanding the meaning of the symbols, do you mean you would like to get the corresponding numerical qualities? If so, you could modify the script to output the numeric qualities or just look at an ASCII table and subtract the appropriate value.
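
As a quick sketch of "look at an ASCII table and subtract the appropriate value", the following Python snippet turns a quality string into numbers. The function name and the example strings are made up; the offset is 33 for Sanger-style files and 64 for pipeline 1.3 files.

```python
def decode_qualities(quality, offset):
    """Return the numeric quality for each symbol: ASCII code minus the offset."""
    return [ord(ch) - offset for ch in quality]


# Symbols from a Sanger (phred+33) file
print(decode_qualities("#!@", 33))  # prints [2, 0, 31]

# Symbols from a pipeline 1.3 (phred+64) file
print(decode_qualities("@Dh", 64))  # prints [0, 4, 40]
```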

                If you would like an explanation of what the numbers mean, you could start here: http://maq.sourceforge.net/qual.shtml


                • #9
                  I used Maq to call SNPs on a dataset. Using sol2sanger I get 800-odd SNPs reported after the recommended filtering. However, not using sol2sanger gives a whopping 11000-odd SNP calls, with the rest of the pipeline remaining the same!

                  These are Solexa v1.3 generated reads, and I am not sure why there is such a huge difference or which result to trust.
                  --
                  bioinfosm


                  • #10
                    hi bioinfosm

                    I can try and help you but someone correct me if I am wrong .

                    Solexa v1.3 reads are phred 64 probability scores instead of absolute base values. These need to be converted to phred 33 probabilities.

                    sol2sanger is OK for converting the absolute base values to phred 33, but it is not suitable for converting phred 64 to phred 33 unless you adjust the fq_all2std.pl script using lparsons' method, which worked for me.

                    Phred scores are probability scores of how likely each called nucleotide is to be correct, and you would need to adjust the v1.3 probability scores to this standard Sanger format before using Maq.

                    HTH
                    L


                    • #11
                      Thanks lparsons, I stumbled upon a PDF table online showing what the symbols mean. Do you have any idea how Maq handles Ns? I have reads with many Ns and was thinking of eliminating reads with 20 or more Ns from the raw Solexa data before I do any conversions with Maq.
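
A pre-filter like the one described could be sketched in Python as below. This says nothing about how Maq itself treats Ns; the function name and the default threshold of 20 (taken from the post) are only illustrative.

```python
def too_many_ns(sequence, max_n=20):
    """Return True when the read contains max_n or more N base calls."""
    return sequence.upper().count("N") >= max_n


reads = ["ACGTNACGT", "N" * 36]
# Keep only the reads that fall below the N threshold
kept = [read for read in reads if not too_many_ns(read)]
print(kept)  # prints ['ACGTNACGT']
```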

                      Cheers
                      L


                      • #12
                        Originally posted by bioinfosm
                        I used Maq to call SNPs on a dataset. Using sol2sanger I get 800-odd SNPs reported after the recommended filtering. However, not using sol2sanger gives a whopping 11000-odd SNP calls, with the rest of the pipeline remaining the same!

                        These are Solexa v1.3 generated reads, and I am not sure why there is such a huge difference or which result to trust.
                        Trust the first one, using the sol2sanger conversion. The pipeline 1.3 scores are represented as ASCII(phred+64). Maq expects the qualities to be represented in the Sanger manner, as ASCII(phred+33). If you do not first run sol2sanger, then when Maq encounters, for example, a 'D' (ASCII=68) in the quality string, it will subtract 33 and assign a phred score of 35, which is pretty darn good. But since the file is still in Illumina FASTQ format, the true phred score is 4 (68-64), which is pretty darn bad. By not running the file through sol2sanger you have essentially added 31 to the phred score of each and every base. Since Maq believes every mismatch it sees comes from a high-quality base call, it will call them as SNPs, but they are really just sequencing errors.
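
The 'D' example can be worked through in a few lines of Python. The only thing added here is the standard Phred definition, error probability = 10^(-Q/10); the variable names are made up.

```python
def error_probability(q):
    """Standard Phred definition: probability that the base call is wrong."""
    return 10 ** (-q / 10)


symbol = "D"                      # ASCII 68
as_sanger = ord(symbol) - 33      # what Maq assumes: Q35
as_illumina = ord(symbol) - 64    # what pipeline 1.3 meant: Q4

print(as_sanger, as_illumina, as_sanger - as_illumina)  # prints 35 4 31
print(f"Maq believes P(error) = {error_probability(as_sanger):.4f}")
print(f"Actual       P(error) = {error_probability(as_illumina):.4f}")
```

So a base with roughly a 40% chance of being wrong is treated as if it were wrong about 3 times in 10,000.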


                        • #13
                          Thanks kmcarr. One follow-up query: what were the pipeline 1.1 scores then? I heard that there has been a change in Solexa's fastq qualities.
                          --
                          bioinfosm


                          • #14
                            If we call the current (pipeline 1.3.2) Q(phred)+64 then the previous version could be called Q(solexa)+64. The difference between Phred and Solexa qualities has been well described by Heng Li in the documentation of his Maq package (http://maq.sourceforge.net/qual.shtml). These differ most significantly at the low end, with Q(solexa) allowing negative numbers. At Q scores above ~11 the two are essentially identical.
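
To see how the two scales diverge at the low end, here is a small Python sketch using the conversion given on the Maq quality page linked above, Q(phred) = 10 * log10(10^(Q(solexa)/10) + 1). The function name is made up for illustration.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa (log-odds) quality to the equivalent Phred quality."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


# Compare the two scales from the lowest Solexa score upwards
for q in (-5, 0, 5, 11, 20, 40):
    print(f"Q(solexa) = {q:3d}  ->  Q(phred) = {solexa_to_phred(q):6.2f}")
```

The gap is about 3 at Q(solexa) = 0 and shrinks to a fraction of a unit above roughly 11.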

                            Technically the sol2sanger conversion is meant to convert Q(solexa)+64 into Q(phred)+33. There will be slight errors in the scores assigned for low quality bases. I actually added a new command and subroutine to the fq_all2std.pl script to deal with Solexa FASTQ from v1.3.2.

                            Add a new command named "solP2std" to the %cmd_hash:

                            solP2std=>\&solP2std,

                            Add the following to create a hash to convert from Q(phred)+64 to Q(phred)+33.

                            --

                            my %solP2stdP;
                            for (64..126) {
                                $solP2stdP{chr($_)} = chr($_-31);
                            }

                            --

                            Add the following subroutine to do the conversion:

                            --

                            sub solP2std {
                                while (<>) {
                                    if (/^@/) {
                                        print;
                                        $_ = <>; print; $_ = <>; $_ = <>;
                                        chomp;
                                        my @t = split('', $_);
                                        my $qual = '';
                                        $qual .= $solP2stdP{$_} for (@t);
                                        print "+\n$qual\n";
                                    }
                                }
                            }

                            --

                            To use this on a fastq produced by the v1.3.2 pipeline:

                            fq_all2std.pl solP2std mySolexa_1.3.2_File.fastq > myStandardSanger_File.fastq
                            Last edited by kmcarr; 05-20-2009, 09:32 AM.
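
For comparison, the same lookup-table approach can be sketched in Python as a standalone converter. This is not part of Maq or fq_all2std.pl; the names `PHRED64_TO_PHRED33` and `convert_fastq` are made up, and it assumes plain four-line FASTQ records.

```python
import io

# Lookup table mapping each phred+64 character to its phred+33 equivalent,
# mirroring the %solP2stdP hash (ASCII 64..126 -> 33..95).
PHRED64_TO_PHRED33 = str.maketrans({chr(c): chr(c - 31) for c in range(64, 127)})


def convert_fastq(src, dst):
    """Copy four-line FASTQ records from src to dst, re-encoding the qualities."""
    while True:
        header = src.readline()
        if not header:
            break                      # end of input
        if not header.strip():
            continue                   # tolerate stray blank lines
        sequence = src.readline()
        src.readline()                 # '+' line, rewritten below as a bare '+'
        quality = src.readline().rstrip("\r\n")
        dst.write(header)
        dst.write(sequence)
        dst.write("+\n" + quality.translate(PHRED64_TO_PHRED33) + "\n")


example = io.StringIO("@read1\nACGT\n+read1\nhhD@\n")
converted = io.StringIO()
convert_fastq(example, converted)
print(converted.getvalue(), end="")
```

With real files you would pass open file handles instead of the `io.StringIO` objects used here for the demonstration.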


                            • #15
                              Originally posted by Layla
                              hi bioinfosm

                              I can try and help you but someone correct me if I am wrong .

                              Solexa v1.3 reads are phred 64 probability scores instead of absolute base values. These need to be converted to phred 33 probabilities.

                              sol2sanger is OK for converting the absolute base values to phred 33, but it is not suitable for converting phred 64 to phred 33 unless you adjust the fq_all2std.pl script using lparsons' method, which worked for me.

                              Phred scores are probability scores of how likely each called nucleotide is to be correct, and you would need to adjust the v1.3 probability scores to this standard Sanger format before using Maq.

                              HTH
                              L
                              Thanks Layla. As I understand it, after converting phred 64 to phred 33 there is no need to run sol2sanger, and one can directly convert the reads to bfq and run maq map.
                              --
                              bioinfosm
