SEQanswers

Old 09-12-2011, 05:52 AM   #1
PFS
Member
 
Location: USA

Join Date: Mar 2010
Posts: 55
Default FASTQC guessing wrong quality encoding

Hello,

I have some Illumina files processed with CASAVA 1.8.
The program FastQC is guessing the format to be Illumina 1.5.
Is there a way to explicitly tell FastQC what encoding the data uses? If not, what else can I do?

Thanks!
Old 09-12-2011, 06:27 AM   #2
westerman
Rick Westerman
 
Location: Purdue University, Indiana, USA

Join Date: Jun 2008
Posts: 1,104
Default

Just to make sure: do you have the most recent version of FastQC? Version 0.10.0 was released on 9-9-11 and added support for CASAVA 1.8-type files, so it may be a solution to your problem.
Old 09-12-2011, 07:23 AM   #3
PFS
Member
 
Location: USA

Join Date: Mar 2010
Posts: 55
Default

I thought that v.0.9 should be able to distinguish between encodings (see below) ... but I will try to see if the latest version can help.

Thanks!


From the release notes:
"30-3-11: Version 0.9.1 released
Added --quiet and --nogroup options to command line
Added encoding type to the basic stats
Added detection of Illumina <1.3, 1.3, 1.5 and 1.9 encodings"
Old 09-12-2011, 08:19 AM   #4
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

The encoding detection hasn't changed since v0.9.1 so moving to 0.10.0 won't help.

The encoding detection is done entirely on the basis of the range of Phred values seen in the file. To incorrectly detect Sanger-encoded data as Illumina 1.5 you'd need a dataset in which no base call's quality value was lower than 31. That would be very unlikely in any normal Illumina dataset, unless it had been (very harshly) quality trimmed before being put through FastQC.
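The range-based rule described here can be sketched roughly as follows. This is a simplified Python illustration using the cut-off values discussed later in this thread, not FastQC's actual Java code:

```python
def guess_encoding(qual_strings):
    """Guess the FASTQ quality encoding from the lowest ASCII code seen.

    Simplified illustration of range-based detection; FastQC's real
    logic lives in its Java source and differs in detail.
    """
    lowest = min(min(ord(c) for c in q) for q in qual_strings)
    if lowest < 59:   # below ';': only Phred+33 (Sanger/Illumina 1.8+) goes this low
        return "Sanger / Illumina 1.9"
    if lowest < 64:   # 59-63: Solexa/Illumina <1.3 (offset 64, minimum Phred -5)
        return "Illumina <1.3"
    return "Illumina 1.3-1.5"  # offset 64, minimum Phred 0 (or 2 for 1.5)

# Sanger data trimmed so harshly that every base is Q31+ (ASCII 64+)
# becomes indistinguishable from untrimmed Phred+64 data:
print(guess_encoding(["II@F"]))  # -> Illumina 1.3-1.5 (misdetected)
```

This makes clear why only heavily trimmed or artificially narrow data can be misdetected: any read with a single base below Q26 resolves the ambiguity.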

I've just double checked on some of our casava 1.8 data and the encoding is correctly detected in all of the cases I looked at.

Is there something unusual about the sequence file you analysed? Very low number of reads, or very unusual quality distribution? If it's not obvious what went wrong in this case would you be willing to make a small subset of the data available so I can see what happened?
Old 09-13-2011, 06:50 PM   #5
robs
Senior Member
 
Location: San Diego, CA

Join Date: May 2010
Posts: 116
Default

Maybe it's time to add a feature that allows users to tell the program explicitly which encoding it should use (especially considering the ambiguity between the different formats/encodings).
Old 09-14-2011, 12:04 AM   #6
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by robs View Post
Maybe it's time to add a feature that allows users to specifically tell the program what encoding it should use (especially considering the ambiguity between the different formats/encodings).
I'm really not keen on doing this. In practice there is very little ambiguity between the different encodings and in real samples it's extremely unlikely that the encoding will be mis-detected (I'm still waiting for the original author of this thread to get back to me about their sample). The only cases we've ever seen where this went wrong were in simulated datasets where samples were being given an artificially narrow range of quality values.

What we have seen numerous times is complaints that FastQC was getting the quality detection wrong when it was actually correct. Providing an option to set the encoding type will result in people getting it wrong, and this is not going to be handled well in the program. You're likely to end up with corrupted plots and odd errors which are just going to generate confusion and unnecessary bug reports.

If there are cases starting to crop up where the detection is actually wrong then please let me know. We're not seeing them, but I'm absolutely prepared to believe they exist. It may be that we can improve the algorithm which guesses the encoding to cope with them or there may be other bugs we can fix, but I think the correct answer is to get the automatic detection correct rather than have people specify the encoding manually.
Old 09-14-2011, 09:53 AM   #7
robs
Senior Member
 
Location: San Diego, CA

Join Date: May 2010
Posts: 116
Default

I think you should give users more credit for knowing what they're doing. Having automatic detection as the default, but still offering an option to specify the encoding, would be nice to have. You could add a meaningful warning if someone specifies an encoding the program disagrees with. (The overlap between the different encodings makes an incorrect prediction possible no matter how good your automatic detection is.)

Given the "numerous times" people complained, maybe a short report/output why the specific encoding has been selected by the program might be quite useful for both sides.
Old 09-15-2011, 12:43 AM   #8
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

The point I'd stress is that we have never yet seen a real sample where the encoding was guessed incorrectly (maybe between Illumina 1.3 and 1.5, but the offset is the same for those two anyway, so it makes no difference). I know there are cases where this could theoretically happen, but until we actually see one, adding this option is just something else to go wrong.

The complaints we've had before have all been resolved by finding either that the pipeline version used wasn't what people expected, that the encodings had been altered by a third party (the SRA recodes into Sanger encoding in some cases, for example), or, on a couple of occasions, that the file had become corrupted. None of these cases would have been helped by adding a forced encoding mode.

In terms of reporting why an encoding was selected, it's really just done off the lowest untransformed value so there's not much which could be reported.
Old 10-07-2011, 09:31 AM   #9
curtish
Junior Member
 
Location: UAB (Birmingham, AL, USA)

Join Date: Oct 2011
Posts: 2
Default

Simon,

First, we love FastQC, and are particularly addicted to having it available in our local Galaxy installation! It has saved us from many headaches.

So, I'm not sure you would consider this a "real" sample, but it's a real nuisance for us. We're working on a type of metagenomics project where we must use only reads with no low-quality bases. So, after FastQC'ing the raw reads, we *do* filter them very aggressively. We then run FastQC again to see what our selected subpopulation of high-quality reads looks like. Unfortunately, FastQC decided our Illumina 1.9/fastqsanger reads are really Illumina 1.3 reads, and the result is hard to work with. So, we will implement the ability to pass the encoding type down from Galaxy. Is there an easy way to contribute a code modification back to FastQC? I don't see an associated SourceForge site...
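The kind of all-bases filter described here can be sketched in a few lines of Python. The Q30 floor and the Phred+33 offset are illustrative assumptions, not the poster's actual pipeline:

```python
def keep_read(qual, min_phred=30, offset=33):
    """True if every base in the read meets the quality floor."""
    return all(ord(c) - offset >= min_phred for c in qual)

def filter_fastq(lines, min_phred=30):
    """Yield 4-line FASTQ records in which every base is >= min_phred.

    `lines` is an iterable of the file's lines, newlines stripped.
    """
    it = iter(lines)
    for header in it:
        seq, plus, qual = next(it), next(it), next(it)
        if keep_read(qual, min_phred):
            yield header, seq, plus, qual
```

After such a filter every surviving quality character is at least chr(33 + 30), i.e. ASCII 63, which is already inside the window where the Phred+33 and Phred+64 ranges overlap. That is exactly the situation that trips the auto-detection.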
Old 10-07-2011, 10:23 AM   #10
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by curtish View Post
Is there an easy way to contribute a code modification back to FastQC? I don't see an associated SourceForge site...
We don't have a publicly accessible source repository for FastQC, but I'm happy to take patches against the source of the latest release.

If you want to add this option it will require a change to the wrapper to collect and validate the forced offset. This will then need to be picked up in the Sequence.QualityEncoding.PhredEncoding class. I'd suggest structuring the change so that the suggested offset is overridden if the lowest encoding found in the file is lower than the supplied offset, to avoid odd errors elsewhere. Alternatively you could have the getFastQEncodingOffset method throw an exception if the supplied encoding isn't compatible with the data, but that would require modifications in a number of places.
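The override behaviour suggested here can be illustrated in a few lines. FastQC itself is Java; this Python sketch, with hypothetical function names, only shows the validation rule:

```python
def auto_detect_offset(lowest_char_in_file):
    """Range-based fallback, using the rule discussed in this thread."""
    return 33 if lowest_char_in_file < 59 else 64

def resolve_offset(forced_offset, lowest_char_in_file):
    """Fall back to auto-detection when a forced offset is impossible.

    If any quality character's ASCII code is below the user-supplied
    offset, that offset cannot be right (it would imply a negative
    Phred score), so it is ignored rather than producing corrupt plots.
    """
    if forced_offset is not None and lowest_char_in_file >= forced_offset:
        return forced_offset
    return auto_detect_offset(lowest_char_in_file)
```

Silently overriding, as sketched here, keeps downstream plotting code safe; the alternative of raising an exception surfaces the mismatch to the user but, as noted above, would touch more of the codebase.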

Old 03-02-2012, 07:53 AM   #11
david_2012
Junior Member
 
Location: Germany

Join Date: Mar 2012
Posts: 4
Default encoding-specification through command-line option would be welcome

Hey Simon,

I can only second curtish, both on how useful FastQC is as a tool and on how useful it would be to have a command-line option that specifies a particular quality encoding.

In my case, I did some strong quality trimming, leaving no quality scores of 31 or lower. That in turn makes FastQC guess the encoding is Illumina <1.3 rather than the correct one, Illumina 1.8+.

So is there a patch available yet, curtish? Or is this planned for future versions of FastQC?

Thanks,
David
Old 04-20-2012, 01:20 AM   #12
magofiura
Junior Member
 
Location: Siena (Italy)

Join Date: Jan 2012
Posts: 2
Default

Same problem as above.
Does someone know a way to fix or bypass it?

Thanks,

Leo.
Old 05-20-2014, 09:29 AM   #13
Axel
Junior Member
 
Location: St Andrews

Join Date: Feb 2014
Posts: 8
Default

Same problem as those above. I have reads encoded as Illumina 1.9, which a first pass of FastQC correctly identifies. I then filter my reads very heavily, leaving no reads with quality below 31. On the second pass FastQC mis-identifies the encoding as Illumina <1.3.

I love the tool as it is and will continue using it, but a function where the user can specify encoding in addition to the automatic detection would be really good.
Old 05-21-2014, 02:56 AM   #14
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

We've had an ongoing discussion about this issue for some time and we've gone over this again this morning and I think we've decided on a way forward.

Our basic position has always been that we didn't want to introduce a flag to force an encoding since our experience has been that the vast majority (but not all) of reports of mis-detection we've had have turned out to be correct detection, and the file wasn't what the user thought it was. True mis-detection only occurs on data which has been manipulated (usually by quality trimming) - we've never seen a raw sequencing file which got the detection wrong.

The problem is that for trimmed data the window for unambiguous detection isn't as wide as we'd like. A base-33 encoding becomes ambiguous at ASCII 59, meaning that data trimmed so that no base falls below a Phred of 26 (about 3/1000 errors) can no longer be identified unambiguously, and that is a realistic level at which people might filter.

The reason for putting the break at 59 was to support the Illumina <1.3 files, which used a base-64 offset encoding but allowed quality scores down to -5. Normal Phred+64 data wouldn't become ambiguous until ASCII 64, which corresponds to a Phred of 31 in base-33 encoding (below 1/1000 errors).

To try to alleviate this situation we're therefore going to remove support for the Illumina <1.3 encoding in the next (imminent) FastQC release. Since that encoding was replaced in 2009 we don't envisage this having much of an effect on anyone, and it will mean that auto-detection keeps working as long as data is not trimmed so harshly that every base is Q31 or above.
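Spelling out the arithmetic behind those cut-offs (a worked restatement of the numbers above, not code from FastQC):

```python
# Phred+33 (Sanger/Illumina 1.8+): quality q is stored as chr(q + 33)
# Phred+64 (Illumina 1.3-1.5):     quality q is stored as chr(q + 64)
# Solexa/Illumina <1.3:            offset 64, but scores go down to -5,
#                                  so its lowest character is chr(59) == ';'

# With <1.3 supported, Phred+33 data becomes ambiguous once every
# character reaches ASCII 59, i.e. Phred 59 - 33 = 26
# (error rate 10**-2.6, about 3/1000):
assert 59 - 33 == 26

# With <1.3 support dropped, ambiguity only starts at ASCII 64,
# i.e. Phred 64 - 33 = 31 (error rate 10**-3.1, below 1/1000):
assert 64 - 33 == 31
```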
Old 05-21-2014, 07:41 AM   #15
blakeoft
Member
 
Location: Connecticut

Join Date: Oct 2013
Posts: 79
Default

Could you include a read at the beginning of the fastq file with the following structure:
Quote:
@readName
AA
+
mM
where m and M are the minimum and maximum possible quality scores used by your encoding, respectively? Sure, this will throw off your data, but since it's only one read I don't think it would make much of a difference. I'm not sure how FastQC works, but I assume it keeps track of the smallest and biggest quality scores observed across all of the reads. If both extremes are present right at the start, it would seem to have little chance of getting the encoding wrong.
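This workaround amounts to prepending a sentinel record whose quality string spans the encoding's full range. A sketch for Phred+33 data follows; the Q0-Q41 range, the read name and the function name are assumptions for illustration:

```python
def add_sentinel(fastq_path, out_path, min_q=0, max_q=41, offset=33):
    """Prepend a two-base sentinel read covering the full quality range.

    This pins the observed minimum and maximum so that range-based
    detection cannot mistake harshly trimmed Phred+33 data for a
    Phred+64 encoding. The sentinel does slightly skew the statistics.
    """
    sentinel = "@sentinel\nAA\n+\n{}{}\n".format(
        chr(min_q + offset), chr(max_q + offset))
    with open(out_path, "w") as out, open(fastq_path) as src:
        out.write(sentinel)
        for line in src:
            out.write(line)
```

One caveat with this approach: the sentinel read ends up in every downstream plot, so it would need to be stripped again before any further processing.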