![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Quality Score: FastQC vs Illumina | ericguo | Illumina/Solexa | 8 | 10-22-2015 05:08 AM |
about illumina reads quality score | gridbird | Illumina/Solexa | 4 | 08-08-2011 06:10 AM |
Illumina quality score | whereisshe | Bioinformatics | 3 | 11-26-2010 07:45 AM |
Threshold quality score to determine the quality read of ILLUMINA reads problem | edge | General | 1 | 09-13-2010 03:22 PM |
Quality score threshold? | chris | Bioinformatics | 8 | 04-29-2008 01:43 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Senior Member
Location: China Join Date: Sep 2009
Posts: 199
|
![]()
Hi,
Does anybody have idea regarding the general threshold average quality score to determine whether the ILLUMINA raw reads is good or bad quality? Below is one of my ILLUMINA raw reads in fastq format: @HWI-EAS001 AACATTCAACGCTGNCGGTGAGTTGGAATTCTCGGGTGCCAAGGAACTCC +HWI-EAS001 ^`Y^aa__^_]\`_B\U][RV`W`^`][``__Z^P[UUZZUUa^Z[^^Z[ The above reads got 50 nucleotide in length. Is it got any program or script able to calculate the average quality score of above read? My purpose is hope to calculate the average quality score of each read. Based on the average quality score of each read, I plan to filter out those "low quality reads" that below threshold of average quality score. Thanks a lot for any advice. |
![]() |
![]() |
![]() |
#2 |
Senior Member
Location: 41°17'49"N / 2°4'42"E Join Date: Oct 2008
Posts: 323
|
![]()
Try: http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/ (fastqc)
__________________
-drd |
![]() |
![]() |
![]() |
#3 |
Member
Location: Cambridge, UK Join Date: Mar 2010
Posts: 21
|
![]() Code:
cat fastq_file | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} split(""); print " avg score: " .($score/length($_))."\n"' @HWI-EAS001 avg score: 25.7647058823529 |
![]() |
![]() |
![]() |
#4 |
Senior Member
Location: China Join Date: Sep 2009
Posts: 199
|
![]()
Hi Brugger,
If my fastq_file is a multi-sequence (multi-fastq sequence). Your command will work as well? Do you know what is the general threshold average quality score for determine the good or bad quality of ILLUMINA data? (eg. if the read average score above 25, we consider good quality ILLUMINA reads or etc ) Thanks for advice ya ![]() |
![]() |
![]() |
![]() |
#5 |
Member
Location: Cambridge, UK Join Date: Mar 2010
Posts: 21
|
![]()
Yes it should work for a whole fastq file as well. I do not do any filtering like this so I do not have any recommended cut-off. I recommend collecting the stats for the whole dataset and plot them to see where the majority of the reads are, and then filter based on this.
BTW: You can alter the oneliner slightly and it can do the filtering for you as well when you have established a usable cutoff. |
![]() |
![]() |
![]() |
#6 | |
Senior Member
Location: China Join Date: Sep 2009
Posts: 199
|
![]()
Hi Brugger,
Thanks a lot for your knowledge sharing. I'm very appreciate it ![]() eg. If I got 10000 fastq read and 95% of the fastq read shown average score of at least 40 or above. Can I use that 40 average score as a threshold and remove those read that average score below 40? Which perl line that I can edit to filter the low quality read once I have established a usable cutoff? "You can alter the oneliner slightly and it can do the filtering for you as well when you have established a usable cutoff" Thanks again for your advice. I'm still fresh with perl script ![]() Quote:
|
|
![]() |
![]() |
![]() |
#7 |
Senior Member
Location: 41°17'49"N / 2°4'42"E Join Date: Oct 2008
Posts: 323
|
![]()
Kungfu time:
Code:
cat input.fastq |\ perl -ne '$o=$_; $w=<STDIN>; $r=<STDIN>; $_=<STDIN>; \ map{ $score += ord($_)-64} split("");\ print "$o$w$r$_" if ($score/length($_)) > 36; ' Example: Code:
*$ cat input.fastq @30BL3AAXX081011:2:4:689:1812#0 GCGTTCATTCTGAGCCAGGATCAAACTCTCAAGTTG + ffffffffffffffffffffffffffffffffffff @30BL3AAXX081011:6:9:58:234#0 AGCGTTCATTCTGAGCCAGGATCAAACTCTCAAGTT + ffffffffffffffffffffffffffffffffffff@30BL3AAXX081011:6:3:1471:1594#0 TTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGC + ffffffffffffffffffffffffffffffffffff @30BL3AAXX081011:6:9:152:70#0 TTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGC + ffffffffffffffffffffffffffffffffffff $ cat input.fastq | perl -ne '$o=$_; $w=<STDIN>; $r=<STDIN>; $_=<STDIN>; map{ $score += ord($_)-64} split(""); print "$o$w$r$_" if ($score/length($_)) > 36; ' @30BL3AAXX081011:6:9:58:234#0 AGCGTTCATTCTGAGCCAGGATCAAACTCTCAAGTT + ffffffffffffffffffffffffffffffffffff @30BL3AAXX081011:6:3:1471:1594#0 TTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGC + ffffffffffffffffffffffffffffffffffff @30BL3AAXX081011:6:9:152:70#0 TTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGC + ffffffffffffffffffffffffffffffffffff
__________________
-drd |
![]() |
![]() |
![]() |
#8 |
Member
Location: Cambridge, UK Join Date: Mar 2010
Posts: 21
|
![]()
Drio: Your kung-fu is faulty as the score is not reset after each entry. Actually so is mine above, but I never tested it on more than one entry... I am sure that there is something to be said about doing oneliners later on Fridays.
More important this makes the analysis above wrong. If you had looked at the avg scores in greater detail you would have caught it as the quality value should not exceed 40. My initial oneliner should have been: Code:
cat fastq_file | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} split(""); print " avg score: " .($score/length($_))."\n";$code=0;' Code:
cat fastq_file | perl -ne '$t=$_ . <> . <>; $_ = <>; chomp; map{ $score += ord($_)-64} split(""); print "$t$_\n" if ($score/length($_)) > 36; $score= 0' |
![]() |
![]() |
![]() |
#9 |
Senior Member
Location: China Join Date: Sep 2009
Posts: 199
|
![]()
Hi Brugger,
Thanks again for your explanation in detail. I will try it out the perl one-liner to the real case later. Hopefully it is worked perfectly. I will update the latest status to you for verify the result quality at that moment ![]() |
![]() |
![]() |
![]() |
#10 | |
Senior Member
Location: China Join Date: Sep 2009
Posts: 199
|
![]()
Thanks a lot, drio.
Below is the output result once I apply your perl script to my input data: Code:
cat input_file.txt @sample_1 AACATTCAACGCTGNCGGTGAGTTGGAATTCTCGGGTGCCAAGGAACTCC +sample_1 ^`Y^aa__^_]\`_B\U][RV`W`^`][``__Z^P[UUZZUUa^Z[^^Z[ @sample_2 TATTGCACTTGTCCNGGCCTGTTGGAATTCGCGGGTGCCAAGGAACTCCA +sample_2 aabbb_a_ab``a_BaX`abb_aba^abb[N]P\a]`a`^\`a`a`aS`] @sample_3 GCCAAACAATACCANGGCTGCCGGGCTGGACTTCTCGGGTGCCAAGGAAC +sample_3 aWW[]\VY\`V_L\B^TR\ZQO_W\MXTXPS]SJVPWMYNYKDQV^`BBB cat input_file.txt | perl -ne '$o=$_; $w=<STDIN>; $r=<STDIN>; $_=<STDIN>; map{ $score += ord($_)-64} split(""); print "$o$w$r$_" if ($score/length($_)) > 36; ' @sample_2 TATTGCACTTGTCCNGGCCTGTTGGAATTCGCGGGTGCCAAGGAACTCCA +sample_2 aabbb_a_ab``a_BaX`abb_aba^abb[N]P\a]`a`^\`a`a`aS`] @sample_3 GCCAAACAATACCANGGCTGCCGGGCTGGACTTCTCGGGTGCCAAGGAAC +sample_3 aWW[]\VY\`V_L\B^TR\ZQO_W\MXTXPS]SJVPWMYNYKDQV^`BBB Thanks a lot for your advice ya ![]() Quote:
|
|
![]() |
![]() |
![]() |
#11 | |
Senior Member
Location: China Join Date: Sep 2009
Posts: 199
|
![]()
Thanks, Brugger.
I just try both of your perl script. The first script we calculate the average quality score of each read is worked perfectly and nice ![]() But the second perl script seems like can't apply to filter the read based on pre-determined average quality score cut-off? Code:
cat input_file.txt @sample_1 AACATTCAACGCTGNCGGTGAGTTGGAATTCTCGGGTGCCAAGGAACTCC +sample_1 ^`Y^aa__^_]\`_B\U][RV`W`^`][``__Z^P[UUZZUUa^Z[^^Z[ @sample_2 TATTGCACTTGTCCNGGCCTGTTGGAATTCGCGGGTGCCAAGGAACTCCA +sample_2 aabbb_a_ab``a_BaX`abb_aba^abb[N]P\a]`a`^\`a`a`aS`] @sample_3 GCCAAACAATACCANGGCTGCCGGGCTGGACTTCTCGGGTGCCAAGGAAC +sample_3 aWW[]\VY\`V_L\B^TR\ZQO_W\MXTXPS]SJVPWMYNYKDQV^`BBB cat input_file.txt | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} split(""); print " avg score: " .($score/length($_))."\n";$code=0;' @sample_1 avg score:25.7647058823529 @sample_2 avg score:54.4705882352941 @sample_3 avg score:73.7058823529412 cat input_file.txt | perl -ne '$t=$_ . <> . <>; $_ = <>; chomp; map{ $score += ord($_)-64} split(""); print "$t$_\n" if ($score/length($_)) > 36; $score= 0' Thanks a lot for your advice to improve it ![]() Quote:
|
|
![]() |
![]() |
![]() |
#12 |
Senior Member
Location: 41°17'49"N / 2°4'42"E Join Date: Oct 2008
Posts: 323
|
![]()
Brugger made a little typo, change the last assignment from $code=0 to $score=0 and you'll have the behaviour you were looking for (in the first oneliner):
Code:
cat fastq_file | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} split(""); print " avg score: " .($score/length($_))."\n";$score=0;' Back to watching enter the dragon. ![]()
__________________
-drd |
![]() |
![]() |
![]() |
#13 |
Senior Member
Location: China Join Date: Sep 2009
Posts: 199
|
![]()
Hi drio,
Thanks for your perl script ![]() I just try it out and it gives the following info: Code:
cat input_file.txt | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} split(""); print " avg score: " .($score/length($_))."\n";$score=0;' @sample_1 avg score:25.7647058823529 @sample_2 avg score:28.7058823529412 @sample_3 avg score:13.2352941176471 Thanks for your advice and enjoy "enter the dragon" ya ![]() |
![]() |
![]() |
![]() |
#14 |
Simon Andrews
Location: Babraham Inst, Cambridge, UK Join Date: May 2009
Posts: 871
|
![]()
I think the focus on 'average quality' may be a bit misleading here, as it probably isn't the most useful measure to make. What you will often find is that for runs which have problems with poor quality it tends to be that the quality drops off as the run progresses. In general therefore you will see all reads showing poor quality later on in the run. Finding individual reads which are poor in an otherwise good run won't normally remove many sequences from your set.
What we've sometimes found to be beneficial is trimming the entire set of sequences to a length where the quality across the run was still OK. This will leave you with a shorter read length, but with sequence which is of much higher average quality. I'd not recommend doing this unless there is a really serious quality problem in your data, but we've seen quite a few instances of longer read lengths (75bp+) where the later stages of the read were just noise. |
![]() |
![]() |
![]() |
#15 | |
Senior Member
Location: China Join Date: Sep 2009
Posts: 199
|
![]()
Hi simon,
Thanks a lot for your knowledge sharing regarding dealing which fastq format read ![]() Actually I got long list of read in fastq format from ILLUMINA GENOME ANALYZER now. Each of the read is 50 nucleotide in length. My initial purpose of calculating the average quality score of each read is planned to determine a cut-off to filter "low quality read". After then, I planned to preprocessing the raw fastq format data and use those "high quality read" for further analysis. Based on your understanding and experience regarding fastq format read, what is the best way I should do when receiving the fastq read from ILLUMINA? Thanks a lot for your advice. This is the first time I dealing with ILLUMINA data in fastq format. Thus still fresh on dealing with their data ![]() Quote:
|
|
![]() |
![]() |
![]() |
#16 | |
Simon Andrews
Location: Babraham Inst, Cambridge, UK Join Date: May 2009
Posts: 871
|
![]() Quote:
I'd run the sequences through FastQC. This will generate a per-base quality plot and a per-sequence quality plot. The per base plot will tell you if there is a systematic loss of quality as your run progresses. If there is you might want to consider trimming a bit off the length of all of your sequences. You can also see if there is a sub-population of sequences which have universally poor quality within a generally good run. This is much more unusual in my experience, but if it is a problem in your data you could filter these out as well. I have to say that in most cases we don't filter our data unless the qualities are really poor. Many downstream analyses can take the assigned quality values into account so they don't need to have it all removed, but this may be different in your case depending on what you're intending doing with the data. |
|
![]() |
![]() |
![]() |
#17 |
Senior Member
Location: China Join Date: Sep 2009
Posts: 199
|
![]()
Thanks a lot and again for your explanation in detail, simon
![]() I'm trying out my data regarding your suggested pipeline now. Will ask your advice or comment if I'm facing any error or misunderstanding when dealing for the data read later. Thanks again, simon ^^ |
![]() |
![]() |
![]() |
#18 | |
Junior Member
Location: US Join Date: Sep 2010
Posts: 1
|
![]()
Excellent post! Two questions:
1) Is there a way to include standard dev of the quality scores along with the average using the one-liner? 2) I agree with Simon in that the quality of a read degrades as the length of the read increases. Is there a way (using the same one-liner) one can limit to say 30 base pairs instead of the entire line of a seq read and get the ave and std dev of the quality scores? Thanks, and your help is immense! Quote:
|
|
![]() |
![]() |
![]() |
#19 |
Simon Andrews
Location: Babraham Inst, Cambridge, UK Join Date: May 2009
Posts: 871
|
![]()
One other thing to mention here - quality scores are nowhere near normally distributed. They are heavily skewed towards high values, with a fairly rapid descent to values close to zero (especially for Illumina data where they use a score of 2 as a sort of flag rather than a true probability measure). This means that using mean and stdev tend to give a fairly poor reflection of the range of values you have. We tend to prefer looking at percentiles through the distribution to get a clearer picture of what's going on (we often find that the 25th, 50th, 75th and 90th percentiles can all have the same value, but that the 10th can vary wildly).
|
![]() |
![]() |
![]() |
#20 | |
Member
Location: Cambridge, UK Join Date: Mar 2010
Posts: 21
|
![]() Quote:
Code:
cat fastq_file | perl -ne 'chomp;$_ = substr($_,0,30); print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $_=ord($_)-64; $s+=$_;$s2=$_*$_;} split(""); $l=length($_); print " avg score: " .($score/$l)." stddev".(sqrt(($l*$s2-$s*$s)/($l*($l-1)))) ."\n";$s=$s2=$l=0;' And I agree with simon, this does not really make any sense. We normally run bwa with aln with a '-q 15' flag to remove low quality ends of if you do assemblies, velvet comes with a suite of scripts for clipping off the low quality ends. |
|
![]() |
![]() |
![]() |
Thread Tools | |
|
|