
  • FASTQC guessing wrong quality encoding

    Hello,

    I have some Illumina files processed with CASAVA 1.8.
    The program FASTQC is guessing the format to be Illumina 1.5.
    Is there a way to explicitly tell fastqc what encoding the data is? If not, what else can I do?

    Thanks!

  • #2
    Just to make sure, do you have the most recent version of FastQC? 9-9-11: Version 0.10.0 released. That version added support for CASAVA 1.8 type of files and thus may be a solution to your problem.



    • #3
      I thought that v.0.9 should be able to distinguish between encodings (see below) ... but I will try to see if the latest version can help.

      Thanks!


      From the release notes:
      "30-3-11: Version 0.9.1 released
      Added --quiet and --nogroup options to command line
      Added encoding type to the basic stats
      Added detection of Illumina <1.3 1.3 1.5 and 1.9 encodings"



      • #4
        The encoding detection hasn't changed since v0.9.1 so moving to 0.10.0 won't help.

        The encoding detection is done entirely on the basis of the range of Phred values seen in the file. In order to incorrectly detect Sanger encoded data as Illumina 1.5 you'd have to have a dataset where no base call's quality value was lower than 31. This would seem very unlikely in any normal illumina dataset, unless it had been (very harshly?) quality trimmed before being put through fastqc.
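The range-based detection described above can be sketched roughly as follows. This is a minimal illustration, not FastQC's actual code: the function name `guess_encoding` is hypothetical, the 59 and 64 breakpoints are the ones quoted later in this thread, and the split between Illumina 1.3 and 1.5 at character code 66 is an assumption made for the example (both use the same offset, so it does not affect the decoded values).

```python
def guess_encoding(quality_strings):
    """Guess a FASTQ quality encoding from the lowest quality character seen.

    Hypothetical sketch: the thresholds are assumptions taken from this
    thread, not copied from the FastQC source.
    """
    # Only the lowest raw (untransformed) character code is used.
    lowest = min(ord(ch) for qual in quality_strings for ch in qual)

    if lowest < 33:
        raise ValueError(f"character code {lowest} is below any known encoding")
    if lowest < 59:
        return ("Sanger / Illumina 1.9", 33)
    if lowest < 64:
        # Illumina <1.3 used offset 64 but allowed scores down to -5.
        return ("Illumina <1.3", 64)
    if lowest < 66:
        # Assumed boundary between 1.3 and 1.5; the offset is identical.
        return ("Illumina 1.3", 64)
    return ("Illumina 1.5", 64)


# Untrimmed Sanger-encoded data contains low characters such as '#'.
print(guess_encoding(["IIIIIHG#", "IIIIFFFF"]))
# Sanger-encoded data trimmed so that nothing is below Q31 ('@').
print(guess_encoding(["IIII@@", "JJJIIH"]))
```

The second call shows how heavy trimming removes the low characters that the detection relies on, so offset-33 data ends up looking like offset-64 data.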

        I've just double checked on some of our casava 1.8 data and the encoding is correctly detected in all of the cases I looked at.

        Is there something unusual about the sequence file you analysed? Very low number of reads, or very unusual quality distribution? If it's not obvious what went wrong in this case would you be willing to make a small subset of the data available so I can see what happened?



        • #5
          Maybe it's time to add a feature that allows users to specifically tell the program what encoding it should use (especially considering the ambiguity between the different formats/encodings).



          • #6
            Originally posted by robs
            Maybe it's time to add a feature that allows users to specifically tell the program what encoding it should use (especially considering the ambiguity between the different formats/encodings).
            I'm really not keen on doing this. In practice there is very little ambiguity between the different encodings and in real samples it's extremely unlikely that the encoding will be mis-detected (I'm still waiting for the original author of this thread to get back to me about their sample). The only cases we've ever seen where this went wrong were in simulated datasets where samples were being given an artificially narrow range of quality values.

            What we have seen numerous times is complaints that FastQC was getting the quality detection wrong when it was actually correct. Providing an option to set the encoding type will result in people getting it wrong, and this is not going to be handled well in the program. You're likely to end up with corrupted plots and odd errors which are just going to generate confusion and unnecessary bug reports.

            If there are cases starting to crop up where the detection is actually wrong then please let me know. We're not seeing them, but I'm absolutely prepared to believe they exist. It may be that we can improve the algorithm which guesses the encoding to cope with them or there may be other bugs we can fix, but I think the correct answer is to get the automatic detection correct rather than have people specify the encoding manually.



            • #7
              I think you should give users more credit for knowing what they do. Having the automatic detection as default, but still offering an option to specify the encoding would be nice to have. You could add a meaningful warning if someone specifies an encoding that the program does not agree with. (The overlap between the different encodings allows an incorrect prediction, no matter how good your automatic detection is.)

              Given the "numerous times" people complained, maybe a short report/output why the specific encoding has been selected by the program might be quite useful for both sides.



              • #8
                The point I'd stress is that we have never yet seen a real sample where the encoding was guessed incorrectly (maybe between illumina 1.3 and 1.5, but the offset is the same for those two anyway so it makes no difference). I know there are cases where this could theoretically happen but until we actually see that then adding this option is just something to go wrong.

                The complaints we've had before have all been resolved either by finding that the pipeline version used wasn't what people expected, or that the encodings had been altered by a third party (SRA recodes into Sanger encoding in some cases, for example), or on a couple of occasions by finding that the file had become corrupted. None of these cases would have been helped by adding a forced encoding mode.

                In terms of reporting why an encoding was selected, it's really just done off the lowest untransformed value so there's not much which could be reported.



                • #9
                  Simon,

                  First, we love FastQC, and are particularly addicted to having it available in our local Galaxy installation! It has saved us from many headaches.

                  So, I'm not sure you would consider this a "real" sample, but it's a real nuisance for us. We're working on a type of metagenomics project where we must use only reads with no low-quality bases. So, after FastQC'ing the raw reads, we *do* filter them very aggressively. We then run FastQC again to see what our selected subpopulation of high quality reads looks like. Unfortunately, FastQC decided our Illumina 1.9/fastqsanger reads are really Illumina 1.3 reads, and the result is hard to work with. So, we will implement the ability to pass the encoding type down from Galaxy. Is there an easy way to contribute a code modification back to FastQC? I don't see an associated SourceForge site...



                  • #10
                    Originally posted by curtish
                    Is there an easy way to contribute a code modification back to FastQC? I don't see an associated SourceForge site...
                    We don't have a publicly accessible source repository for FastQC, but I'm happy to take patches against the source of the latest release.

                    If you want to add this option then it will require a change to the wrapper to collect and validate the forced offset. This will then need to be picked up in the Sequence.QualityEncoding.PhredEncoding class. I'd suggest that the change be structured such that the supplied offset is overridden if the lowest encoding found in the file is lower than that offset, to avoid odd errors elsewhere. Alternatively you could have the getFastQEncodingOffset method throw an exception if the supplied encoding isn't compatible with the data, but this will require modifications in a number of places.
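The two behaviours suggested above (silently overriding an incompatible forced offset, or raising an error) could look something like the sketch below. It is written in Python purely for illustration; FastQC itself is Java, and `resolve_offset`, its parameters, and the simplified two-way auto-detection inside it are all hypothetical, not part of FastQC.

```python
def resolve_offset(lowest_char, forced_offset=None, strict=False):
    """Pick the quality offset to use for a file.

    lowest_char   -- lowest raw quality character code seen in the file
    forced_offset -- offset requested by the user (33 or 64), or None
    strict        -- raise instead of falling back when the request is invalid

    Hypothetical sketch; the auto-detection here is deliberately simplified.
    """
    # Simplified auto-detection (assumption): anything below 64 is offset 33.
    detected = 33 if lowest_char < 64 else 64

    if forced_offset is None:
        return detected

    if lowest_char < forced_offset:
        # The data contains characters that would decode to negative scores.
        if strict:
            raise ValueError(
                f"offset {forced_offset} is incompatible with lowest "
                f"character code {lowest_char}"
            )
        return detected  # override the user's choice

    return forced_offset


# Heavily trimmed offset-33 data: auto-detection says 64, the user forces 33.
print(resolve_offset(64))                    # auto-detected
print(resolve_offset(64, forced_offset=33))  # forced value is honoured
print(resolve_offset(40, forced_offset=64))  # incompatible, falls back
```

The fallback variant keeps the rest of the program safe from negative scores, while the strict variant surfaces the mismatch to the user at the cost of needing error handling wherever the offset is requested.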
                    Last edited by simonandrews; 10-07-2011, 10:24 AM. Reason: Spelling fail!



                    • #11
                      encoding-specification through command-line option would be welcome

                      Hey Simon,

                      I can only second curtish, both in how useful FastQC is as a tool and in how useful it would be to have a command-line option that specifies a certain quality encoding.

                      In my case, I did some strong quality trimming, resulting in no quality scores of 31 or lower. That in turn makes FastQC guess it is Illumina <1.3 encoding as opposed to the correct encoding, which is Illumina 1.8+.

                      So is there a patch available yet, curtish? Or is this planned for future versions of FastQC?

                      Thanks,
                      David



                      • #12
                        Same problem as above.
                        Does someone know a way to fix or bypass it?

                        Thanks,

                        Leo.



                        • #13
                          Same problem as those above. I have reads encoded at Illumina 1.9 which a first pass of FastQC correctly identifies. I filter my reads very heavily leaving no reads with quality below 31. On the second pass FastQC mis-identifies the encoding as Illumina <1.3.

                          I love the tool as it is and will continue using it, but a function where the user can specify encoding in addition to the automatic detection would be really good.



                          • #14
                            We've had an ongoing discussion about this issue for some time and we've gone over this again this morning and I think we've decided on a way forward.

                            Our basic position has always been that we didn't want to introduce a flag to force an encoding since our experience has been that the vast majority (but not all) of reports of mis-detection we've had have turned out to be correct detection, and the file wasn't what the user thought it was. True mis-detection only occurs on data which has been manipulated (usually by quality trimming) - we've never seen a raw sequencing file which got the detection wrong.

                            The problem is that for trimmed data the window for unambiguous detection isn't as wide as we'd like. From a base 33 encoding you become ambiguous at 59, meaning that data trimmed to a Phred of above 26 (about 3/1000 errors) can be mis-detected, and that is a realistic level at which people could filter.

                            The reason for putting the break at 59 was to support the Illumina <1.3 files, which used a base 64 encoding, but which allowed quality scores down to -5. Normal Phred 64 wouldn't become ambiguous until 64, which would be a Phred of 31 (below 1/1000 errors).

                            To try to alleviate this situation we're therefore going to remove support for the Illumina <1.3 encoding in the next (imminent) fastqc release. Since this was replaced in 2009 we don't envisage that this will have much of an effect on anyone, and will mean that as long as data is not trimmed so that no base is less than Q31 the auto-detection will still work.
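The arithmetic behind the two breakpoints can be checked with a few lines of Python. This is only a worked example of the standard Phred relation (error probability = 10^(-Q/10)) applied to the character codes 59 and 64 mentioned above; the helper names are made up for the illustration.

```python
def phred_from_char(char_code, offset):
    """Convert a raw quality character code to a Phred score."""
    return char_code - offset


def error_probability(phred):
    """Standard Phred definition: P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-phred / 10)


# Character codes at which offset-33 data starts to look like offset-64 data.
for breakpoint_code in (59, 64):
    q = phred_from_char(breakpoint_code, 33)
    per_thousand = error_probability(q) * 1000
    print(f"break at {breakpoint_code}: Q{q}, about {per_thousand:.2f} errors per 1000 bases")

# break at 59: Q26, about 2.51 errors per 1000 bases
# break at 64: Q31, about 0.79 errors per 1000 bases
```

Moving the break from 59 to 64 therefore raises the trimming level at which offset-33 data becomes ambiguous from Q26 to Q31.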



                            • #15
                              Could you include a read at the beginning of the fastq file with the following structure:
                              @readName
                              AA
                              +
                              mM
                              where m and M are the min and max possible quality scores used by your encoding, respectively? Sure, this will throw off your data, but since it's only one read, I think that it won't make that much of a difference. I'm not sure how FASTQC works, but I assume that it keeps track of the 'smallest' and 'biggest' qual scores that are observed throughout all of the reads. If both extremes are present right at the start, it would seem to me that it wouldn't have much of a chance at getting it wrong.
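As a rough sketch of this workaround, the function below writes one two-base sentinel read ahead of the existing records. The function name, the read name, and the default characters `!` and `J` (assumed here to be the lowest and highest quality characters for Sanger / Illumina 1.8+ data) are all assumptions for the example; adjust them to the encoding actually in use. Whether FastQC's detection responds to this as hoped is the poster's guess, not something verified here.

```python
import io


def prepend_sentinel_read(src, dst, min_char="!", max_char="J", name="sentinel"):
    """Copy FASTQ text from src to dst, preceded by one sentinel read.

    The sentinel carries the lowest and highest quality characters of the
    intended encoding so that a range-based detector sees both extremes.
    Default characters are an assumption for Sanger / Illumina 1.8+ data.
    """
    dst.write(f"@{name}\nAA\n+\n{min_char}{max_char}\n")
    for line in src:
        dst.write(line)


# Example with in-memory text; real use would pass opened file handles.
trimmed = io.StringIO("@read1\nACGT\n+\nIIJJ\n")
patched = io.StringIO()
prepend_sentinel_read(trimmed, patched)
print(patched.getvalue())
```

The sentinel does appear in every per-base and per-sequence statistic, so it is worth removing it again before any downstream analysis.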
