Seqanswers Leaderboard Ad

**drio** · 09-13-2010, 08:01 AM

Try: http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/ (fastqc)

**Brugger** · 09-24-2010, 04:33 AM

Code:

cat fastq_file | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} 
split(""); print "  avg score: " .($score/length($_))."\n"'

result:

@HWI-EAS001 avg score: 25.7647058823529

**edge** · 09-24-2010, 03:33 PM

Hi Brugger,

If my fastq_file is a multi-sequence (multi-fastq sequence).
Your command will work as well?
Do you know what is the general threshold average quality score for determine the good or bad quality of ILLUMINA data? (eg. if the read average score above 25, we consider good quality ILLUMINA reads or etc )
Thanks for advice ya

Originally posted by Brugger View Post

Code:

cat fastq_file | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} 
split(""); print "  avg score: " .($score/length($_))."\n"'

result:

@HWI-EAS001 avg score: 25.7647058823529

**Brugger** · 09-25-2010, 12:36 AM

Yes it should work for a whole fastq file as well. I do not do any filtering like this so I do not have any recommended cut-off. I recommend collecting the stats for the whole dataset and plot them to see where the majority of the reads are, and then filter based on this.

BTW: You can alter the oneliner slightly and it can do the filtering for you as well when you have established a usable cutoff.

**edge** · 09-25-2010, 02:57 AM

Hi Brugger,

Thanks a lot for your knowledge sharing.
I'm very appreciate it

eg.
If I got 10000 fastq read and 95% of the fastq read shown average score of at least 40 or above. Can I use that 40 average score as a threshold and remove those read that average score below 40?

Which perl line that I can edit to filter the low quality read once I have established a usable cutoff?
"You can alter the oneliner slightly and it can do the filtering for you as well when you have established a usable cutoff"

Thanks again for your advice.
I'm still fresh with perl script

Originally posted by Brugger View Post

Yes it should work for a whole fastq file as well. I do not do any filtering like this so I do not have any recommended cut-off. I recommend collecting the stats for the whole dataset and plot them to see where the majority of the reads are, and then filter based on this.

BTW: You can alter the oneliner slightly and it can do the filtering for you as well when you have established a usable cutoff.

**drio** · 09-25-2010, 05:53 AM

Kungfu time:

Code:

cat input.fastq |\ 
perl -ne '$o=$_; $w=<STDIN>; $r=<STDIN>; $_=<STDIN>; \
map{ $score += ord($_)-64} split("");\ 
print "$o$w$r$_" if ($score/length($_)) > 36; '

The cutoff is 36 in this case.

Example:

Code:

*$ cat input.fastq @30BL3AAXX081011:2:4:689:1812#0
GCGTTCATTCTGAGCCAGGATCAAACTCTCAAGTTG
+
ffffffffffffffffffffffffffffffffffff
@30BL3AAXX081011:6:9:58:234#0
AGCGTTCATTCTGAGCCAGGATCAAACTCTCAAGTT
+
ffffffffffffffffffffffffffffffffffff@30BL3AAXX081011:6:3:1471:1594#0
TTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGC
+
ffffffffffffffffffffffffffffffffffff
@30BL3AAXX081011:6:9:152:70#0
TTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGC
+
ffffffffffffffffffffffffffffffffffff
$ cat input.fastq | perl -ne '$o=$_; $w=<STDIN>; $r=<STDIN>; $_=<STDIN>; map{ $score += ord($_)-64} split(""); print "$o$w$r$_" if ($score/length($_)) > 36; '
@30BL3AAXX081011:6:9:58:234#0
AGCGTTCATTCTGAGCCAGGATCAAACTCTCAAGTT
+
ffffffffffffffffffffffffffffffffffff
@30BL3AAXX081011:6:3:1471:1594#0
TTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGC
+
ffffffffffffffffffffffffffffffffffff
@30BL3AAXX081011:6:9:152:70#0
TTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGC
+
ffffffffffffffffffffffffffffffffffff

**Brugger** · 09-25-2010, 11:22 AM

Drio: Your kung-fu is faulty as the score is not reset after each entry. Actually so is mine above, but I never tested it on more than one entry... I am sure that there is something to be said about doing oneliners later on Fridays.

More important this makes the analysis above wrong. If you had looked at the avg scores in greater detail you would have caught it as the quality value should not exceed 40.

My initial oneliner should have been:

Code:

cat fastq_file | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} 
split(""); print "  avg score: " .($score/length($_))."\n";$code=0;'

And to do the filtration how ever one solution would be:

Code:

cat fastq_file | perl -ne '$t=$_ . <> . <>; $_ = <>; chomp; map{ $score += ord($_)-64} split(""); print "$t$_\n" if ($score/length($_)) > 36; $score= 0'

Another bug is that the \n for the quality line would be included in the calculations and there by making them wrong. However, as the same logic is used for all entries this makes the avg. values comparable.

**edge** · 09-25-2010, 04:07 PM

Hi Brugger,

Thanks again for your explanation in detail.
I will try it out the perl one-liner to the real case later.
Hopefully it is worked perfectly.
I will update the latest status to you for verify the result quality at that moment

**edge** · 09-26-2010, 02:45 AM

Thanks a lot, drio.
Below is the output result once I apply your perl script to my input data:

Code:

cat input_file.txt
@sample_1
AACATTCAACGCTGNCGGTGAGTTGGAATTCTCGGGTGCCAAGGAACTCC
+sample_1
^`Y^aa__^_]\`_B\U][RV`W`^`][``__Z^P[UUZZUUa^Z[^^Z[
@sample_2
TATTGCACTTGTCCNGGCCTGTTGGAATTCGCGGGTGCCAAGGAACTCCA
+sample_2
aabbb_a_ab``a_BaX`abb_aba^abb[N]P\a]`a`^\`a`a`aS`]
@sample_3
GCCAAACAATACCANGGCTGCCGGGCTGGACTTCTCGGGTGCCAAGGAAC
+sample_3
aWW[]\VY\`V_L\B^TR\ZQO_W\MXTXPS]SJVPWMYNYKDQV^`BBB

cat input_file.txt | perl -ne '$o=$_; $w=<STDIN>; $r=<STDIN>; $_=<STDIN>; map{ $score += ord($_)-64} split(""); print "$o$w$r$_" if ($score/length($_)) > 36; '
@sample_2
TATTGCACTTGTCCNGGCCTGTTGGAATTCGCGGGTGCCAAGGAACTCCA
+sample_2
aabbb_a_ab``a_BaX`abb_aba^abb[N]P\a]`a`^\`a`a`aS`]
@sample_3
GCCAAACAATACCANGGCTGCCGGGCTGGACTTCTCGGGTGCCAAGGAAC
+sample_3
aWW[]\VY\`V_L\B^TR\ZQO_W\MXTXPS]SJVPWMYNYKDQV^`BBB

Do you familiar to write a perl script to generate another output file which contain the average quality score of each read in my input_data.txt?
Thanks a lot for your advice ya

Originally posted by drio View Post

Kungfu time:

Code:

cat input.fastq | perl -ne '$o=$_; $w=<STDIN>; $r=<STDIN>; $_=<STDIN>; map{ $score += ord($_)-64} split(""); print "$o$w$r$_" if ($score/length($_)) > 36; '

The cutoff is 36 in this case.

Example:

Code:

*$ cat input.fastq @30BL3AAXX081011:2:4:689:1812#0
GCGTTCATTCTGAGCCAGGATCAAACTCTCAAGTTG
+
ffffffffffffffffffffffffffffffffffff
@30BL3AAXX081011:6:9:58:234#0
AGCGTTCATTCTGAGCCAGGATCAAACTCTCAAGTT
+
ffffffffffffffffffffffffffffffffffff@30BL3AAXX081011:6:3:1471:1594#0
TTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGC
+
ffffffffffffffffffffffffffffffffffff
@30BL3AAXX081011:6:9:152:70#0
TTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGC
+
ffffffffffffffffffffffffffffffffffff
$ cat input.fastq | perl -ne '$o=$_; $w=<STDIN>; $r=<STDIN>; $_=<STDIN>; map{ $score += ord($_)-64} split(""); print "$o$w$r$_" if ($score/length($_)) > 36; '
@30BL3AAXX081011:6:9:58:234#0
AGCGTTCATTCTGAGCCAGGATCAAACTCTCAAGTT
+
ffffffffffffffffffffffffffffffffffff
@30BL3AAXX081011:6:3:1471:1594#0
TTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGC
+
ffffffffffffffffffffffffffffffffffff
@30BL3AAXX081011:6:9:152:70#0
TTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGC
+
ffffffffffffffffffffffffffffffffffff

**edge** · 09-26-2010, 02:54 AM

Thanks, Brugger.
I just try both of your perl script.
The first script we calculate the average quality score of each read is worked perfectly and nice

But the second perl script seems like can't apply to filter the read based on pre-determined average quality score cut-off?

Code:

cat input_file.txt
@sample_1
AACATTCAACGCTGNCGGTGAGTTGGAATTCTCGGGTGCCAAGGAACTCC
+sample_1
^`Y^aa__^_]\`_B\U][RV`W`^`][``__Z^P[UUZZUUa^Z[^^Z[
@sample_2
TATTGCACTTGTCCNGGCCTGTTGGAATTCGCGGGTGCCAAGGAACTCCA
+sample_2
aabbb_a_ab``a_BaX`abb_aba^abb[N]P\a]`a`^\`a`a`aS`]
@sample_3
GCCAAACAATACCANGGCTGCCGGGCTGGACTTCTCGGGTGCCAAGGAAC
+sample_3
aWW[]\VY\`V_L\B^TR\ZQO_W\MXTXPS]SJVPWMYNYKDQV^`BBB

cat input_file.txt | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} 
split(""); print "  avg score: " .($score/length($_))."\n";$code=0;'

@sample_1   avg score:25.7647058823529
@sample_2   avg score:54.4705882352941
@sample_3   avg score:73.7058823529412

cat input_file.txt | perl -ne '$t=$_ . <> . <>; $_ = <>; chomp; map{ $score += ord($_)-64} split(""); print "$t$_\n" if ($score/length($_)) > 36; $score= 0'

When I try for the second perl script. It seems like can't function well and generate any output result?
Thanks a lot for your advice to improve it

Originally posted by Brugger View Post

Drio: Your kung-fu is faulty as the score is not reset after each entry. Actually so is mine above, but I never tested it on more than one entry... I am sure that there is something to be said about doing oneliners later on Fridays.

More important this makes the analysis above wrong. If you had looked at the avg scores in greater detail you would have caught it as the quality value should not exceed 40.

My initial oneliner should have been:

Code:

cat fastq_file | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} 
split(""); print "  avg score: " .($score/length($_))."\n";$code=0;'

And to do the filtration how ever one solution would be:

Code:

cat fastq_file | perl -ne '$t=$_ . <> . <>; $_ = <>; chomp; map{ $score += ord($_)-64} split(""); print "$t$_\n" if ($score/length($_)) > 36; $score= 0'

Another bug is that the \n for the quality line would be included in the calculations and there by making them wrong. However, as the same logic is used for all entries this makes the avg. values comparable.

**drio** · 09-26-2010, 06:31 AM

Brugger made a little typo, change the last assignment from $code=0 to $score=0 and you'll have the behaviour you were looking for (in the first oneliner):

Code:

cat fastq_file | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} split(""); print "  avg score: " .($score/length($_))."\n";$score=0;'

The second one works just fine.

Back to watching enter the dragon.

**edge** · 09-26-2010, 07:35 PM

Hi drio,
Thanks for your perl script

I just try it out and it gives the following info:

Code:

cat input_file.txt | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} split(""); print "  avg score: " .($score/length($_))."\n";$score=0;'

@sample_1   avg score:25.7647058823529
@sample_2   avg score:[B]28.7058823529412[/B]
@sample_3   avg score:[B]13.2352941176471[/B]

I just wondering why the average quality score by your perl script is different with the average quality score given by Brugger's perl script?
Thanks for your advice and enjoy "enter the dragon" ya

**simonandrews** · 09-26-2010, 11:39 PM

I think the focus on 'average quality' may be a bit misleading here, as it probably isn't the most useful measure to make. What you will often find is that for runs which have problems with poor quality it tends to be that the quality drops off as the run progresses. In general therefore you will see all reads showing poor quality later on in the run. Finding individual reads which are poor in an otherwise good run won't normally remove many sequences from your set.

What we've sometimes found to be beneficial is trimming the entire set of sequences to a length where the quality across the run was still OK. This will leave you with a shorter read length, but with sequence which is of much higher average quality. I'd not recommend doing this unless there is a really serious quality problem in your data, but we've seen quite a few instances of longer read lengths (75bp+) where the later stages of the read were just noise.

**edge** · 09-27-2010, 12:08 AM

Hi simon,

Thanks a lot for your knowledge sharing regarding dealing which fastq format read

Actually I got long list of read in fastq format from ILLUMINA GENOME ANALYZER now. Each of the read is 50 nucleotide in length.
My initial purpose of calculating the average quality score of each read is planned to determine a cut-off to filter "low quality read".
After then, I planned to preprocessing the raw fastq format data and use those "high quality read" for further analysis.
Based on your understanding and experience regarding fastq format read, what is the best way I should do when receiving the fastq read from ILLUMINA?
Thanks a lot for your advice.
This is the first time I dealing with ILLUMINA data in fastq format. Thus still fresh on dealing with their data

Originally posted by simonandrews View Post

I think the focus on 'average quality' may be a bit misleading here, as it probably isn't the most useful measure to make. What you will often find is that for runs which have problems with poor quality it tends to be that the quality drops off as the run progresses. In general therefore you will see all reads showing poor quality later on in the run. Finding individual reads which are poor in an otherwise good run won't normally remove many sequences from your set.

What we've sometimes found to be beneficial is trimming the entire set of sequences to a length where the quality across the run was still OK. This will leave you with a shorter read length, but with sequence which is of much higher average quality. I'd not recommend doing this unless there is a really serious quality problem in your data, but we've seen quite a few instances of longer read lengths (75bp+) where the later stages of the read were just noise.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 59 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 57 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 53 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 56 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Threshold quality score to determine the quality read of ILLUMINA reads problem

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News