SEQanswers

Go Back   SEQanswers > Sequencing Technologies/Companies > Illumina/Solexa



Similar Threads
Thread Thread Starter Forum Replies Last Post
Quality Score: FastQC vs Illumina ericguo Illumina/Solexa 8 10-22-2015 04:08 AM
about illumina reads quality score gridbird Illumina/Solexa 4 08-08-2011 05:10 AM
Illumina quality score whereisshe Bioinformatics 3 11-26-2010 06:45 AM
Threshold quality score to determine the quality read of ILLUMINA reads problem edge General 1 09-13-2010 02:22 PM
Quality score threshold? chris Bioinformatics 8 04-29-2008 12:43 AM

Reply
 
Thread Tools
Old 09-13-2010, 06:55 AM   #1
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default Threshold quality score to determine the quality read of ILLUMINA reads problem

Hi,

Does anybody have idea regarding the general threshold average quality score to determine whether the ILLUMINA raw reads is good or bad quality?

Below is one of my ILLUMINA raw reads in fastq format:

@HWI-EAS001
AACATTCAACGCTGNCGGTGAGTTGGAATTCTCGGGTGCCAAGGAACTCC
+HWI-EAS001
^`Y^aa__^_]\`_B\U][RV`W`^`][``__Z^P[UUZZUUa^Z[^^Z[

The above reads got 50 nucleotide in length.
Is it got any program or script able to calculate the average quality score of above read?
My purpose is hope to calculate the average quality score of each read.
Based on the average quality score of each read, I plan to filter out those "low quality reads" that below threshold of average quality score.
Thanks a lot for any advice.
edge is offline   Reply With Quote
Old 09-13-2010, 08:01 AM   #2
drio
Senior Member
 
Location: 4117'49"N / 24'42"E

Join Date: Oct 2008
Posts: 323
Default

Try: http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/ (fastqc)
__________________
-drd
drio is offline   Reply With Quote
Old 09-24-2010, 04:33 AM   #3
Brugger
Member
 
Location: Cambridge, UK

Join Date: Mar 2010
Posts: 21
Default

Code:
cat fastq_file | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} 
split(""); print "  avg score: " .($score/length($_))."\n"'
result:

@HWI-EAS001 avg score: 25.7647058823529
Brugger is offline   Reply With Quote
Old 09-24-2010, 03:33 PM   #4
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default

Hi Brugger,

If my fastq_file is a multi-sequence (multi-fastq sequence).
Your command will work as well?
Do you know what is the general threshold average quality score for determine the good or bad quality of ILLUMINA data? (eg. if the read average score above 25, we consider good quality ILLUMINA reads or etc )
Thanks for advice ya
Quote:
Originally Posted by Brugger View Post
Code:
cat fastq_file | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} 
split(""); print "  avg score: " .($score/length($_))."\n"'
result:

@HWI-EAS001 avg score: 25.7647058823529
edge is offline   Reply With Quote
Old 09-25-2010, 12:36 AM   #5
Brugger
Member
 
Location: Cambridge, UK

Join Date: Mar 2010
Posts: 21
Default

Yes it should work for a whole fastq file as well. I do not do any filtering like this so I do not have any recommended cut-off. I recommend collecting the stats for the whole dataset and plot them to see where the majority of the reads are, and then filter based on this.

BTW: You can alter the oneliner slightly and it can do the filtering for you as well when you have established a usable cutoff.
Brugger is offline   Reply With Quote
Old 09-25-2010, 02:57 AM   #6
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default

Hi Brugger,

Thanks a lot for your knowledge sharing.
I'm very appreciate it
eg.
If I got 10000 fastq read and 95% of the fastq read shown average score of at least 40 or above. Can I use that 40 average score as a threshold and remove those read that average score below 40?

Which perl line that I can edit to filter the low quality read once I have established a usable cutoff?
"You can alter the oneliner slightly and it can do the filtering for you as well when you have established a usable cutoff"

Thanks again for your advice.
I'm still fresh with perl script

Quote:
Originally Posted by Brugger View Post
Yes it should work for a whole fastq file as well. I do not do any filtering like this so I do not have any recommended cut-off. I recommend collecting the stats for the whole dataset and plot them to see where the majority of the reads are, and then filter based on this.

BTW: You can alter the oneliner slightly and it can do the filtering for you as well when you have established a usable cutoff.
edge is offline   Reply With Quote
Old 09-25-2010, 05:53 AM   #7
drio
Senior Member
 
Location: 4117'49"N / 24'42"E

Join Date: Oct 2008
Posts: 323
Default

Kungfu time:

Code:
cat input.fastq |\ 
perl -ne '$o=$_; $w=<STDIN>; $r=<STDIN>; $_=<STDIN>; \
map{ $score += ord($_)-64} split("");\ 
print "$o$w$r$_" if ($score/length($_)) > 36; '
The cutoff is 36 in this case.

Example:

Code:
*$ cat input.fastq @30BL3AAXX081011:2:4:689:1812#0
GCGTTCATTCTGAGCCAGGATCAAACTCTCAAGTTG
+
ffffffffffffffffffffffffffffffffffff
@30BL3AAXX081011:6:9:58:234#0
AGCGTTCATTCTGAGCCAGGATCAAACTCTCAAGTT
+
ffffffffffffffffffffffffffffffffffff@30BL3AAXX081011:6:3:1471:1594#0
TTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGC
+
ffffffffffffffffffffffffffffffffffff
@30BL3AAXX081011:6:9:152:70#0
TTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGC
+
ffffffffffffffffffffffffffffffffffff
$ cat input.fastq | perl -ne '$o=$_; $w=<STDIN>; $r=<STDIN>; $_=<STDIN>; map{ $score += ord($_)-64} split(""); print "$o$w$r$_" if ($score/length($_)) > 36; '
@30BL3AAXX081011:6:9:58:234#0
AGCGTTCATTCTGAGCCAGGATCAAACTCTCAAGTT
+
ffffffffffffffffffffffffffffffffffff
@30BL3AAXX081011:6:3:1471:1594#0
TTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGC
+
ffffffffffffffffffffffffffffffffffff
@30BL3AAXX081011:6:9:152:70#0
TTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGC
+
ffffffffffffffffffffffffffffffffffff
__________________
-drd
drio is offline   Reply With Quote
Old 09-25-2010, 11:22 AM   #8
Brugger
Member
 
Location: Cambridge, UK

Join Date: Mar 2010
Posts: 21
Default

Drio: Your kung-fu is faulty as the score is not reset after each entry. Actually so is mine above, but I never tested it on more than one entry... I am sure that there is something to be said about doing oneliners later on Fridays.


More important this makes the analysis above wrong. If you had looked at the avg scores in greater detail you would have caught it as the quality value should not exceed 40.


My initial oneliner should have been:
Code:
cat fastq_file | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} 
split(""); print "  avg score: " .($score/length($_))."\n";$code=0;'
And to do the filtration how ever one solution would be:
Code:
cat fastq_file | perl -ne '$t=$_ . <> . <>; $_ = <>; chomp; map{ $score += ord($_)-64} split(""); print "$t$_\n" if ($score/length($_)) > 36; $score= 0'
Another bug is that the \n for the quality line would be included in the calculations and there by making them wrong. However, as the same logic is used for all entries this makes the avg. values comparable.
Brugger is offline   Reply With Quote
Old 09-25-2010, 04:07 PM   #9
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default

Hi Brugger,

Thanks again for your explanation in detail.
I will try it out the perl one-liner to the real case later.
Hopefully it is worked perfectly.
I will update the latest status to you for verify the result quality at that moment
edge is offline   Reply With Quote
Old 09-26-2010, 02:45 AM   #10
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default

Thanks a lot, drio.
Below is the output result once I apply your perl script to my input data:
Code:
cat input_file.txt
@sample_1
AACATTCAACGCTGNCGGTGAGTTGGAATTCTCGGGTGCCAAGGAACTCC
+sample_1
^`Y^aa__^_]\`_B\U][RV`W`^`][``__Z^P[UUZZUUa^Z[^^Z[
@sample_2
TATTGCACTTGTCCNGGCCTGTTGGAATTCGCGGGTGCCAAGGAACTCCA
+sample_2
aabbb_a_ab``a_BaX`abb_aba^abb[N]P\a]`a`^\`a`a`aS`]
@sample_3
GCCAAACAATACCANGGCTGCCGGGCTGGACTTCTCGGGTGCCAAGGAAC
+sample_3
aWW[]\VY\`V_L\B^TR\ZQO_W\MXTXPS]SJVPWMYNYKDQV^`BBB

cat input_file.txt | perl -ne '$o=$_; $w=<STDIN>; $r=<STDIN>; $_=<STDIN>; map{ $score += ord($_)-64} split(""); print "$o$w$r$_" if ($score/length($_)) > 36; '
@sample_2
TATTGCACTTGTCCNGGCCTGTTGGAATTCGCGGGTGCCAAGGAACTCCA
+sample_2
aabbb_a_ab``a_BaX`abb_aba^abb[N]P\a]`a`^\`a`a`aS`]
@sample_3
GCCAAACAATACCANGGCTGCCGGGCTGGACTTCTCGGGTGCCAAGGAAC
+sample_3
aWW[]\VY\`V_L\B^TR\ZQO_W\MXTXPS]SJVPWMYNYKDQV^`BBB
Do you familiar to write a perl script to generate another output file which contain the average quality score of each read in my input_data.txt?
Thanks a lot for your advice ya

Quote:
Originally Posted by drio View Post
Kungfu time:

Code:
cat input.fastq | perl -ne '$o=$_; $w=<STDIN>; $r=<STDIN>; $_=<STDIN>; map{ $score += ord($_)-64} split(""); print "$o$w$r$_" if ($score/length($_)) > 36; '
The cutoff is 36 in this case.

Example:

Code:
*$ cat input.fastq @30BL3AAXX081011:2:4:689:1812#0
GCGTTCATTCTGAGCCAGGATCAAACTCTCAAGTTG
+
ffffffffffffffffffffffffffffffffffff
@30BL3AAXX081011:6:9:58:234#0
AGCGTTCATTCTGAGCCAGGATCAAACTCTCAAGTT
+
ffffffffffffffffffffffffffffffffffff@30BL3AAXX081011:6:3:1471:1594#0
TTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGC
+
ffffffffffffffffffffffffffffffffffff
@30BL3AAXX081011:6:9:152:70#0
TTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGC
+
ffffffffffffffffffffffffffffffffffff
$ cat input.fastq | perl -ne '$o=$_; $w=<STDIN>; $r=<STDIN>; $_=<STDIN>; map{ $score += ord($_)-64} split(""); print "$o$w$r$_" if ($score/length($_)) > 36; '
@30BL3AAXX081011:6:9:58:234#0
AGCGTTCATTCTGAGCCAGGATCAAACTCTCAAGTT
+
ffffffffffffffffffffffffffffffffffff
@30BL3AAXX081011:6:3:1471:1594#0
TTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGC
+
ffffffffffffffffffffffffffffffffffff
@30BL3AAXX081011:6:9:152:70#0
TTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGC
+
ffffffffffffffffffffffffffffffffffff
edge is offline   Reply With Quote
Old 09-26-2010, 02:54 AM   #11
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default

Thanks, Brugger.
I just try both of your perl script.
The first script we calculate the average quality score of each read is worked perfectly and nice
But the second perl script seems like can't apply to filter the read based on pre-determined average quality score cut-off?
Code:
cat input_file.txt
@sample_1
AACATTCAACGCTGNCGGTGAGTTGGAATTCTCGGGTGCCAAGGAACTCC
+sample_1
^`Y^aa__^_]\`_B\U][RV`W`^`][``__Z^P[UUZZUUa^Z[^^Z[
@sample_2
TATTGCACTTGTCCNGGCCTGTTGGAATTCGCGGGTGCCAAGGAACTCCA
+sample_2
aabbb_a_ab``a_BaX`abb_aba^abb[N]P\a]`a`^\`a`a`aS`]
@sample_3
GCCAAACAATACCANGGCTGCCGGGCTGGACTTCTCGGGTGCCAAGGAAC
+sample_3
aWW[]\VY\`V_L\B^TR\ZQO_W\MXTXPS]SJVPWMYNYKDQV^`BBB

cat input_file.txt | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} 
split(""); print "  avg score: " .($score/length($_))."\n";$code=0;'

@sample_1   avg score:25.7647058823529
@sample_2   avg score:54.4705882352941
@sample_3   avg score:73.7058823529412

cat input_file.txt | perl -ne '$t=$_ . <> . <>; $_ = <>; chomp; map{ $score += ord($_)-64} split(""); print "$t$_\n" if ($score/length($_)) > 36; $score= 0'
When I try for the second perl script. It seems like can't function well and generate any output result?
Thanks a lot for your advice to improve it
Quote:
Originally Posted by Brugger View Post
Drio: Your kung-fu is faulty as the score is not reset after each entry. Actually so is mine above, but I never tested it on more than one entry... I am sure that there is something to be said about doing oneliners later on Fridays.


More important this makes the analysis above wrong. If you had looked at the avg scores in greater detail you would have caught it as the quality value should not exceed 40.


My initial oneliner should have been:
Code:
cat fastq_file | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} 
split(""); print "  avg score: " .($score/length($_))."\n";$code=0;'
And to do the filtration how ever one solution would be:
Code:
cat fastq_file | perl -ne '$t=$_ . <> . <>; $_ = <>; chomp; map{ $score += ord($_)-64} split(""); print "$t$_\n" if ($score/length($_)) > 36; $score= 0'
Another bug is that the \n for the quality line would be included in the calculations and there by making them wrong. However, as the same logic is used for all entries this makes the avg. values comparable.
edge is offline   Reply With Quote
Old 09-26-2010, 06:31 AM   #12
drio
Senior Member
 
Location: 4117'49"N / 24'42"E

Join Date: Oct 2008
Posts: 323
Default

Brugger made a little typo, change the last assignment from $code=0 to $score=0 and you'll have the behaviour you were looking for (in the first oneliner):

Code:
cat fastq_file | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} split(""); print "  avg score: " .($score/length($_))."\n";$score=0;'
The second one works just fine.

Back to watching enter the dragon.
__________________
-drd
drio is offline   Reply With Quote
Old 09-26-2010, 07:35 PM   #13
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default

Hi drio,
Thanks for your perl script
I just try it out and it gives the following info:
Code:
cat input_file.txt | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} split(""); print "  avg score: " .($score/length($_))."\n";$score=0;'

@sample_1   avg score:25.7647058823529
@sample_2   avg score:28.7058823529412
@sample_3   avg score:13.2352941176471
I just wondering why the average quality score by your perl script is different with the average quality score given by Brugger's perl script?
Thanks for your advice and enjoy "enter the dragon" ya
edge is offline   Reply With Quote
Old 09-26-2010, 11:39 PM   #14
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

I think the focus on 'average quality' may be a bit misleading here, as it probably isn't the most useful measure to make. What you will often find is that for runs which have problems with poor quality it tends to be that the quality drops off as the run progresses. In general therefore you will see all reads showing poor quality later on in the run. Finding individual reads which are poor in an otherwise good run won't normally remove many sequences from your set.

What we've sometimes found to be beneficial is trimming the entire set of sequences to a length where the quality across the run was still OK. This will leave you with a shorter read length, but with sequence which is of much higher average quality. I'd not recommend doing this unless there is a really serious quality problem in your data, but we've seen quite a few instances of longer read lengths (75bp+) where the later stages of the read were just noise.
simonandrews is offline   Reply With Quote
Old 09-27-2010, 12:08 AM   #15
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default

Hi simon,

Thanks a lot for your knowledge sharing regarding dealing which fastq format read
Actually I got long list of read in fastq format from ILLUMINA GENOME ANALYZER now. Each of the read is 50 nucleotide in length.
My initial purpose of calculating the average quality score of each read is planned to determine a cut-off to filter "low quality read".
After then, I planned to preprocessing the raw fastq format data and use those "high quality read" for further analysis.
Based on your understanding and experience regarding fastq format read, what is the best way I should do when receiving the fastq read from ILLUMINA?
Thanks a lot for your advice.
This is the first time I dealing with ILLUMINA data in fastq format. Thus still fresh on dealing with their data

Quote:
Originally Posted by simonandrews View Post
I think the focus on 'average quality' may be a bit misleading here, as it probably isn't the most useful measure to make. What you will often find is that for runs which have problems with poor quality it tends to be that the quality drops off as the run progresses. In general therefore you will see all reads showing poor quality later on in the run. Finding individual reads which are poor in an otherwise good run won't normally remove many sequences from your set.

What we've sometimes found to be beneficial is trimming the entire set of sequences to a length where the quality across the run was still OK. This will leave you with a shorter read length, but with sequence which is of much higher average quality. I'd not recommend doing this unless there is a really serious quality problem in your data, but we've seen quite a few instances of longer read lengths (75bp+) where the later stages of the read were just noise.
edge is offline   Reply With Quote
Old 09-27-2010, 12:20 AM   #16
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by edge View Post
Hi simon,
Based on your understanding and experience regarding fastq format read, what is the best way I should do when receiving the fastq read from ILLUMINA?
I think the posts above have covered what I'd do, one way or another.

I'd run the sequences through FastQC. This will generate a per-base quality plot and a per-sequence quality plot. The per base plot will tell you if there is a systematic loss of quality as your run progresses. If there is you might want to consider trimming a bit off the length of all of your sequences.

You can also see if there is a sub-population of sequences which have universally poor quality within a generally good run. This is much more unusual in my experience, but if it is a problem in your data you could filter these out as well.

I have to say that in most cases we don't filter our data unless the qualities are really poor. Many downstream analyses can take the assigned quality values into account so they don't need to have it all removed, but this may be different in your case depending on what you're intending doing with the data.
simonandrews is offline   Reply With Quote
Old 09-27-2010, 12:30 AM   #17
edge
Senior Member
 
Location: China

Join Date: Sep 2009
Posts: 199
Default

Thanks a lot and again for your explanation in detail, simon
I'm trying out my data regarding your suggested pipeline now.
Will ask your advice or comment if I'm facing any error or misunderstanding when dealing for the data read later.
Thanks again, simon ^^
edge is offline   Reply With Quote
Old 09-30-2010, 09:45 AM   #18
Seq_in_Perl
Junior Member
 
Location: US

Join Date: Sep 2010
Posts: 1
Default Include Std dev?

Excellent post! Two questions:

1) Is there a way to include standard dev of the quality scores along with the average using the one-liner?

2) I agree with Simon in that the quality of a read degrades as the length of the read increases. Is there a way (using the same one-liner) one can limit to say 30 base pairs instead of the entire line of a seq read and get the ave and std dev of the quality scores?

Thanks, and your help is immense!

Quote:
Originally Posted by Brugger View Post
Drio: Your kung-fu is faulty as the score is not reset after each entry. Actually so is mine above, but I never tested it on more than one entry... I am sure that there is something to be said about doing oneliners later on Fridays.


More important this makes the analysis above wrong. If you had looked at the avg scores in greater detail you would have caught it as the quality value should not exceed 40.


My initial oneliner should have been:
Code:
cat fastq_file | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} 
split(""); print "  avg score: " .($score/length($_))."\n";$code=0;'
And to do the filtration how ever one solution would be:
Code:
cat fastq_file | perl -ne '$t=$_ . <> . <>; $_ = <>; chomp; map{ $score += ord($_)-64} split(""); print "$t$_\n" if ($score/length($_)) > 36; $score= 0'
Another bug is that the \n for the quality line would be included in the calculations and there by making them wrong. However, as the same logic is used for all entries this makes the avg. values comparable.
Seq_in_Perl is offline   Reply With Quote
Old 09-30-2010, 11:10 AM   #19
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871
Default

Quote:
Originally Posted by Seq_in_Perl View Post
Is there a way to include standard dev of the quality scores along with the average using the one-liner?
One other thing to mention here - quality scores are nowhere near normally distributed. They are heavily skewed towards high values, with a fairly rapid descent to values close to zero (especially for Illumina data where they use a score of 2 as a sort of flag rather than a true probability measure). This means that using mean and stdev tend to give a fairly poor reflection of the range of values you have. We tend to prefer looking at percentiles through the distribution to get a clearer picture of what's going on (we often find that the 25th, 50th, 75th and 90th percentiles can all have the same value, but that the 10th can vary wildly).
simonandrews is offline   Reply With Quote
Old 10-01-2010, 02:30 AM   #20
Brugger
Member
 
Location: Cambridge, UK

Join Date: Mar 2010
Posts: 21
Default

Quote:
Originally Posted by Seq_in_Perl View Post
Excellent post! Two questions:

1) Is there a way to include standard dev of the quality scores along with the average using the one-liner?

2) I agree with Simon in that the quality of a read degrades as the length of the read increases. Is there a way (using the same one-liner) one can limit to say 30 base pairs instead of the entire line of a seq read and get the ave and std dev of the quality scores?

Thanks, and your help is immense!
Code:
cat fastq_file | perl -ne 'chomp;$_ = substr($_,0,30); print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $_=ord($_)-64; $s+=$_;$s2=$_*$_;} split(""); $l=length($_); print "  avg score: " .($score/$l)." stddev".(sqrt(($l*$s2-$s*$s)/($l*($l-1)))) ."\n";$s=$s2=$l=0;'
I believe (not really tested). But now things are getting silly. A proper script would increase readability greatly...


And I agree with simon, this does not really make any sense. We normally run bwa with aln with a '-q 15' flag to remove low quality ends of if you do assemblies, velvet comes with a suite of scripts for clipping off the low quality ends.
Brugger is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 11:57 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO