SEQanswers

Old 06-20-2010, 11:39 PM   #61
mard
Member
 
Location: Melbourne

Join Date: Jan 2010
Posts: 21
Default

Quote:
Originally Posted by simonandrews View Post
Maybe it's the longer sequence length which is causing the problem. Can you try changing the -Xmx250m to -Xmx500m and see if that works.
That worked. Thanks!
Old 06-21-2010, 04:56 AM   #62
Martin R
Junior Member
 
Location: Germany

Join Date: May 2010
Posts: 7
Default

Thanks for such a nice tool. It saves me a lot of time that I would otherwise spend developing such statistics myself.

There are two tiny things that could be easily included:
1) labels for the x and y axes of the plots
2) an option to convert colorspace (CS) reads not to real nucleotide space (NS) but to pseudo-NS, e.g. 0->A, 1->C, 2->G, 3->T
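
To illustrate the second request, here is a minimal Python sketch of such a pseudo-NS substitution. The function name is hypothetical, and dropping a leading primer base and turning "." into "N" are assumptions, not something FastQC is known to do. It is not a real colorspace decode, which depends on the previous base.

```python
# Hypothetical helper: direct digit-to-letter substitution ("pseudo NS").
# A real colorspace decode would need the previous base; this does not.
PSEUDO_NS = str.maketrans("0123", "ACGT")

def to_pseudo_bases(cs_read: str) -> str:
    """Map colour digits straight to letters; '.' (no call) becomes 'N'."""
    # Assumption: a leading letter is the primer base and is dropped.
    colours = cs_read[1:] if cs_read[:1].isalpha() else cs_read
    return colours.translate(PSEUDO_NS).replace(".", "N")

print(to_pseudo_bases("T0123.10"))
```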

Regards
M
Old 06-24-2010, 04:34 AM   #63
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

I've just put FastQC v0.4.1 up on our website.

This is a bugfix release which should hopefully fix the out of memory problems people were seeing when analysing files containing longer sequences.

It also changes the way the duplicate levels are calculated (each sequence is now tracked to the end of the file), to give more realistic duplication counts. The cutoffs have also been altered to accommodate the new counts.
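
As a rough sketch of what "tracking each sequence to the end of the file" could look like, the Python below counts every sequence over the whole input and then tallies how many distinct sequences occur once, twice, and so on. This is an illustration under that assumption only; the function name is hypothetical and the code is not FastQC's implementation. Holding every sequence in memory would also be costly for large files.

```python
from collections import Counter

def duplication_levels(sequences):
    """Return {copies: number of distinct sequences seen that many times}."""
    counts = Counter(sequences)  # every sequence is counted over the whole input
    return dict(Counter(counts.values()))

reads = ["ACGT", "ACGT", "TTTT", "ACGT", "GGGG", "TTTT"]
levels = duplication_levels(reads)
print(levels)
```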

You can get the new version from:

http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

[If you don't see the new version of any page hit control+refresh to force our cache to update]
Old 07-11-2010, 07:54 AM   #64
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Just a quick plug - there's a poster on FastQC at ISMB this week, and I'm around if anyone has any questions or suggestions about the package. It's poster U37.
Old 07-11-2010, 02:50 PM   #65
Howie Goodell
Member
 
Location: Boston, MA

Join Date: Feb 2010
Posts: 10
Default

First, congratulations and thanks for producing such a useful application! I've spent much too much time hacking to figure out stupid problems (e.g. short sequence+adapter n-mers we size-selected) that just pop out visually running your tool. I added it to our standard pipeline and just ran it retrospectively on the past 6 months of data, and I've already recommended it to several people.

One small but annoying problem: I think when you added colorspace support, you unwittingly created a failure mode for low-quality non-colorspace data with a "." in the first base position. If FastQC hasn't seen any bases yet, it obviously falls through to testing for colorspace, where a missing initial base call is an illegal situation. Note that this is guaranteed to happen for paired-end runs on Illumina, since they apparently mark the second paired-end read as PF, whatever its quality, if its mate passed filter. Example:

Processing s_1_2_sequence.txt
Exception in thread "main" java.lang.IllegalArgumentException: Refbase was . at position 1
at uk.ac.bbsrc.babraham.FastQC.Sequence.FastQFile.convertColorspaceToBases(FastQFile.java:179)
at uk.ac.bbsrc.babraham.FastQC.Sequence.FastQFile.readNext(FastQFile.java:124)
at uk.ac.bbsrc.babraham.FastQC.Sequence.FastQFile.<init>(FastQFile.java:54)
at uk.ac.bbsrc.babraham.FastQC.Analysis.OfflineRunner.processFile(OfflineRunner.java:45)
at uk.ac.bbsrc.babraham.FastQC.Analysis.OfflineRunner.<init>(OfflineRunner.java:28)
at uk.ac.bbsrc.babraham.FastQC.FastQCApplication.main(FastQCApplication.java:71)

This hangs the pipeline in non-X server mode; not sure if it's X related as in the previous posts, but it's not what you want to find in the morning ;-)

Cheers!
Howie
Old 07-11-2010, 04:13 PM   #66
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by Howie Goodell View Post
One small but annoying problem: I think when you added colorspace support, you unwittingly created a failure mode for low-quality non-colorspace data with a "." in the first base position. If FastQC hasn't seen any bases yet, it obviously falls through to testing for colorspace, where a missing initial base call is an illegal situation. Note that this is guaranteed to happen for paired-end runs on Illumina, since they apparently mark the second paired-end read as PF, whatever its quality, if its mate passed filter.
That's strange - we've never seen that with our Illumina data. I wasn't aware that . was a valid character in a base call fastq file.

Any chance you could post a few entries which exhibit this problem, so I can adjust the colorspace detection so it recognises this kind of file correctly?
Old 07-11-2010, 06:34 PM   #67
Howie Goodell
Member
 
Location: Boston, MA

Join Date: Feb 2010
Posts: 10
Default

Sure Simon (one example is enough -- there are many just like it at the start of the filtered file):
$head -4 s_1_2_sequence.txt
@HWUSI-EAS572_0001:1:1:1066:17989#0/2
............................................................................
+HWUSI-EAS572_0001:1:1:1066:17989#0/2
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

I originally planned to temporarily create unfiltered FASTQ files just for FastQC, so I'd have stats on the total run. This bug forced me to filter first, but for paired-end even that fails.

The ideal might be to defer the determination of colorspace or not until you find a called base. Then either start over or just put the reads with no called bases in a bin -- they are pretty homogeneous and pretty uninteresting; so not much to compute?
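
A minimal Python sketch of that idea: skip reads that contain no calls at all and decide only once a called base or colour is seen. The function name and the simple character tests are assumptions for illustration, not FastQC's detection code.

```python
def guess_encoding(sequences):
    """Return 'colorspace', 'bases', or 'unknown' if no call was ever seen."""
    for seq in sequences:
        calls = seq.replace(".", "").upper()
        if not calls:
            continue  # read has no calls at all: defer the decision
        # Assumption: colorspace reads carry digits 0-3 after an optional primer base.
        if any(ch in "0123" for ch in calls):
            return "colorspace"
        if set(calls) <= set("ACGTN"):
            return "bases"
    return "unknown"

print(guess_encoding(["........", "..ACGT.."]))
```

Reads made only of "." could then be counted in a separate bin, as suggested above.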

Cheers!
Howie
Old 07-12-2010, 06:54 AM   #68
zlu
Member
 
Location: UK

Join Date: Nov 2008
Posts: 32
Default

Running FastQC and fastx_toolkit on the same Sanger fastq file, the per base quality plots (attached) seem to give different results. I know that the boxplots from the two programs vary slightly, but I'd expect the medians to be the same.

I'm using the latest GUI fastqc downloaded from your site.
Attached Images
File Type: jpg per_base_quality.jpg (18.7 KB, 72 views)
Attached Files
File Type: pdf fastx_per_base_quality.pdf (5.3 KB, 59 views)
Old 07-12-2010, 06:58 AM   #69
lletourn
Member
 
Location: Montreal

Join Date: Oct 2009
Posts: 63
Default

Just to be sure, did you use the -Q33 flag with fastx? Illumina fastqs are phred+64 and standard fastqs are phred+33.

FastQC guesses the right format. fastx doesn't.
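
For readers unfamiliar with the two offsets, here is a rough Python sketch of how an offset could be guessed from the quality characters. It is a heuristic written for illustration, not FastQC's actual logic, and the function names are hypothetical. The threshold assumes that +64 encodings never go below ASCII 59 (";"), which is where Solexa scores of -5 land.

```python
def guess_phred_offset(quality_lines):
    """Guess 33 or 64 from the lowest quality character seen."""
    lowest = min(ord(ch) for line in quality_lines for ch in line)
    # Assumption: characters below ';' (ASCII 59) cannot occur with a +64 offset.
    return 33 if lowest < 59 else 64

def decode(quality_line, offset):
    """Convert a quality string to a list of integer scores."""
    return [ord(ch) - offset for ch in quality_line]

print(guess_phred_offset(["IIII5555"]), decode("BBB", 64))
```

A +33 file that happens to contain only high scores would be misread as +64 by this test, which is one reason such guesses are worth checking by hand.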


I must admit I've seen slight differences between the plots, but nothing as far off as your results.
Old 07-12-2010, 07:00 AM   #70
zlu
Member
 
Location: UK

Join Date: Nov 2008
Posts: 32
Default

Yes, I did:

$ fastx_quality_stats -Q 33 -i s_5_1_BC1_36bp_q5.fq -o s_5_1_BC1_36bp_q5.qualstats &
Old 07-16-2010, 07:06 AM   #71
sowmyai
Member
 
Location: America

Join Date: Jan 2010
Posts: 27
Default

Sounds like a great tool. Thanks. I am going through the documentation before trying it and I agree with Martin that it would be great to label the X and Y axes.

On that note, could you please explain the X and Y axes in "Per Sequence Quality Scores" and "Per Base Sequence Quality"?

And why do these look quite similar for the "good dataset" and "poor dataset"?
Old 07-16-2010, 07:10 AM   #72
sowmyai
Member
 
Location: America

Join Date: Jan 2010
Posts: 27
Default

Also, what could be the reason for a dataset passing the "Per sequence Quality Score" test and failing the "Per base sequence quality" test ?
Old 07-16-2010, 07:32 AM   #73
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

The labels for the Y axes appear in the inset boxes at the top right. There is a full description of each graph in the documentation. I'll look at adding a label for the x-axis.

The quality graphs look quite similar in the good and bad datasets because they're both real datasets, and it's hard to find a dataset which is poor in every respect! If you have a worse dataset you're prepared to (anonymously) donate to produce a worse 'bad' report then please contact me off list (simon.andrews@bbsrc.ac.uk).

As for why you could get different results from the per-base and per-sequence qualities - they tell you quite different things. The per-base quality plot will tell you if there was a systematic problem with your run and whether this only affected a few cycles or all of them. If you find you have poor quality it would also give you an idea of which cycle you could trim your sequences at to leave mostly good sequence.

The per-sequence plot would allow you to distinguish a run where all of the sequences showed poor quality from a run where a subset of sequences (say one end of the flowcell) had generally poor quality and the other end had good quality. If 5% of your sequences were of poor quality then you could pass the per base quality check, but fail the per sequence check.
Old 07-16-2010, 07:37 AM   #74
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by zlu View Post
Running FastQC and fastx_toolkit on the same Sanger fastq file, the per base quality plots (attached) seem to give different results. I know that the boxplots from the two programs vary slightly, but I'd expect the medians to be the same.

I'm using the latest GUI fastqc downloaded from your site.
That's odd. The relative scales of the two graphs are the same, but all of the FastQC ones are offset by 6. Can you check the text file generated by FastQC and see whether the numbers there agree with the FastQC or FastX plots? If the text values differ from the plotted scores then it could be a plotting bug.
Old 07-16-2010, 07:42 AM   #75
sowmyai
Member
 
Location: America

Join Date: Jan 2010
Posts: 27
Default

Thanks very much. I am a newbie to NGS, so please bear with me. I apologize if my questions are too basic.

How is the "Per Sequence Quality" calculated ? Is it an average of the quality of each base in the sequence ? Or is it more complicated than that ?
Old 07-16-2010, 07:44 AM   #76
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by sowmyai View Post
How is the "Per Sequence Quality" calculated ? Is it an average of the quality of each base in the sequence ? Or is it more complicated than that ?
No, that's it. We calculate the mean quality for each sequence and then display the distribution of those means.
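
In Python terms, the calculation described here could be sketched roughly as follows. The function names, the default +33 offset, and the integer binning are assumptions for illustration, not FastQC's code.

```python
from collections import Counter

def per_sequence_means(quality_lines, offset=33):
    """Mean Phred score of each read, from its raw FASTQ quality string."""
    return [sum(ord(ch) - offset for ch in q) / len(q) for q in quality_lines]

def mean_distribution(means):
    """Count reads per mean quality, truncated to an integer bin."""
    return Counter(int(m) for m in means)

means = per_sequence_means(["IIII", "II55", "5555"])  # 'I' = Q40, '5' = Q20
print(means, mean_distribution(means))
```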
Old 07-16-2010, 07:54 AM   #77
sowmyai
Member
 
Location: America

Join Date: Jan 2010
Posts: 27
Default

Thanks.

In that case, I don't understand how my data could fail the "Per Base" sequence test (miserably at that - the blue line dips sharply after 35 bp, and these are 76 bp reads) and pass the "Per Sequence" test. Thanks for your patience.
Old 07-16-2010, 08:04 AM   #78
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by sowmyai View Post
Thanks.

In that case, I don't understand how my data could fail the "Per Base" sequence test (miserably at that - the blue line dips sharply after 35 bp, and these are 76 bp reads) and pass the "Per Sequence" test. Thanks for your patience.
Ah right - if you look in the documentation you'll see that the per-sequence quality check won't actually issue a warning or a fail - it just shows you the results and lets you decide. There are a couple of tests like this (the GC plot I think is another one). If anyone has ideas for an easy metric to decide if these tests should warn or fail I'd be interested to hear.
Old 07-16-2010, 08:10 AM   #79
sowmyai
Member
 
Location: America

Join Date: Jan 2010
Posts: 27
Default

I was not so much concerned about the pass/fail from the software.

When I say "Passed the Per Sequence quality test" I meant that the histogram peaks very steeply at 34. How can most reads have an average quality of 34 while the individual base qualities are very poor ?
Old 07-16-2010, 08:30 AM   #80
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by sowmyai View Post
I was not so much concerned about the pass/fail from the software.

When I say "Passed the Per Sequence quality test" I meant that the histogram peaks very steeply at 34. How can most reads have an average quality of 34 while the individual base qualities are very poor ?
Without seeing your data it's difficult to say exactly, but you should note that in the per-base plots things can look worse than they are. You tend to find that there is a sudden drop from high to low quality rather than a steady decline. This means that even if you see the bottom of the yellow box extending far down the graph, that only represents 25% of your sequences being poor. This could easily average out to a good mean score across a sequence.
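
A small synthetic example in Python may make this concrete. All the numbers are invented for illustration (76 bp reads, 70% staying at Q36, 30% dropping suddenly to Q4 after base 35), and they are not taken from the dataset under discussion.

```python
import statistics

READ_LEN, GOOD, BAD = 76, 36, 4

# 70 reads stay at Q36; 30 reads drop suddenly to Q4 after base 35.
good_reads = [[GOOD] * READ_LEN for _ in range(70)]
dropped_reads = [[GOOD] * 35 + [BAD] * (READ_LEN - 35) for _ in range(30)]
reads = good_reads + dropped_reads

# Per-base view at the last position: median and a simple lower-quartile pick.
last_column = sorted(r[-1] for r in reads)
median_last = statistics.median(last_column)
lower_quartile_last = last_column[len(last_column) // 4]

# Per-sequence view: mean quality of each read.
means = [sum(r) / READ_LEN for r in reads]
share_high = sum(m >= 30 for m in means) / len(means)

print(median_last, lower_quartile_last, share_high)
```

In this made-up case the lower quartile at the last position falls to Q4 while the median stays at Q36, and 70% of the reads still have a mean quality of 36, so the per-sequence histogram would peak high even though the per-base boxes reach far down.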
Tags
fastq, quality, report

All times are GMT -8.