  • Threshold quality score to determine the read quality of Illumina reads

    Hi,

    Does anybody have an idea of the general threshold average quality score used to determine whether raw Illumina reads are of good or bad quality?

    Below is one of my Illumina raw reads in FASTQ format:

    @HWI-EAS001
    AACATTCAACGCTGNCGGTGAGTTGGAATTCTCGGGTGCCAAGGAACTCC
    +HWI-EAS001
    ^`Y^aa__^_]\`_B\U][RV`W`^`][``__Z^P[UUZZUUa^Z[^^Z[

    The read above is 50 nucleotides in length.
    Is there any program or script that can calculate the average quality score of the read above?
    My goal is to calculate the average quality score of each read.
    Based on each read's average quality score, I plan to filter out the "low quality reads" that fall below a threshold.
    Thanks a lot for any advice.
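    The read above appears to use the Illumina 1.3+ encoding, where each quality character's Phred score is its ASCII value minus 64 (an assumption; other pipeline versions use an offset of 33). A quick sketch of the decoding:

    ```shell
    # Decode one Illumina-1.3+ quality character: score = ASCII value - 64.
    # The first quality character of the read above is '^' (ASCII 94).
    perl -e 'print ord("^") - 64, "\n"'
    # prints 30
    ```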

  • #2
    Try: http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/ (FastQC)
    -drd



    • #3
      Code:
      cat fastq_file | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} 
      split(""); print "  avg score: " .($score/length($_))."\n"'
      result:

      @HWI-EAS001 avg score: 25.7647058823529



      • #4
        Hi Brugger,

        If my fastq_file is a multi-record (multi-sequence) FASTQ file, will your command still work?
        Do you know the general threshold average quality score for deciding whether Illumina data is of good or bad quality? (e.g. if a read's average score is above 25, do we consider it a good-quality Illumina read?)
        Thanks for the advice.

        • #5
          Yes, it should work for a whole FASTQ file as well. I don't do any filtering like this, so I don't have a recommended cut-off. I'd suggest collecting the stats for the whole dataset and plotting them to see where the majority of the reads fall, then filtering based on that.

          BTW: you can alter the one-liner slightly so that it does the filtering for you once you have established a usable cutoff.
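          One way to collect those stats might be a one-liner along these lines (a sketch, not a tested recipe; input.fastq is a placeholder name, and the offset-64 encoding is assumed):

          ```shell
          # Bin each read's average quality score into integer buckets and print the
          # distribution as "score<TAB>count", to help pick a cutoff by eye.
          cat input.fastq | perl -ne '<>; <>; $q = <>; chomp $q; $s = 0; $s += ord($_) - 64 for split("", $q); $h{int($s / length($q))}++; END { printf "%d\t%d\n", $_, $h{$_} for sort { $a <=> $b } keys %h }'
          ```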



          • #6
            Hi Brugger,

            Thanks a lot for sharing your knowledge.
            I really appreciate it.
            For example, if I have 10,000 FASTQ reads and 95% of them show an average score of at least 40, can I use 40 as the threshold and remove the reads whose average score is below 40?

            Which part of the Perl line should I edit to filter out the low-quality reads once I have established a usable cutoff?
            "You can alter the oneliner slightly and it can do the filtering for you as well when you have established a usable cutoff"

            Thanks again for your advice.
            I'm still new to Perl scripting.


            • #7
              Kungfu time:

              Code:
              cat input.fastq | perl -ne '$o=$_; $w=<STDIN>; $r=<STDIN>; $_=<STDIN>; map{ $score += ord($_)-64} split(""); print "$o$w$r$_" if ($score/length($_)) > 36; '
              The cutoff is 36 in this case.

              Example:

              Code:
              $ cat input.fastq
              @30BL3AAXX081011:2:4:689:1812#0
              GCGTTCATTCTGAGCCAGGATCAAACTCTCAAGTTG
              +
              ffffffffffffffffffffffffffffffffffff
              @30BL3AAXX081011:6:9:58:234#0
              AGCGTTCATTCTGAGCCAGGATCAAACTCTCAAGTT
              +
              ffffffffffffffffffffffffffffffffffff
              @30BL3AAXX081011:6:3:1471:1594#0
              TTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGC
              +
              ffffffffffffffffffffffffffffffffffff
              @30BL3AAXX081011:6:9:152:70#0
              TTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGC
              +
              ffffffffffffffffffffffffffffffffffff
              $ cat input.fastq | perl -ne '$o=$_; $w=<STDIN>; $r=<STDIN>; $_=<STDIN>; map{ $score += ord($_)-64} split(""); print "$o$w$r$_" if ($score/length($_)) > 36; '
              @30BL3AAXX081011:6:9:58:234#0
              AGCGTTCATTCTGAGCCAGGATCAAACTCTCAAGTT
              +
              ffffffffffffffffffffffffffffffffffff
              @30BL3AAXX081011:6:3:1471:1594#0
              TTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGC
              +
              ffffffffffffffffffffffffffffffffffff
              @30BL3AAXX081011:6:9:152:70#0
              TTGAGAGTTTGATCCTGGCTCAGAATGAACGCTGGC
              +
              ffffffffffffffffffffffffffffffffffff
              -drd



              • #8
                Drio: your kung-fu is faulty, as the score is not reset after each entry. Actually, so is mine above, but I never tested it on more than one entry... I am sure there is something to be said about writing one-liners late on a Friday.


                More importantly, this makes the analysis above wrong. If you had looked at the avg scores in greater detail you would have caught it, as the quality value should not exceed 40.


                My initial oneliner should have been:
                Code:
                cat fastq_file | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} 
                split(""); print "  avg score: " .($score/length($_))."\n";$code=0;'
                And to do the filtration, one solution would be:
                Code:
                cat fastq_file | perl -ne '$t=$_ . <> . <>; $_ = <>; chomp; map{ $score += ord($_)-64} split(""); print "$t$_\n" if ($score/length($_)) > 36; $score= 0'
                Another bug is that the \n of the quality line is included in the calculations, thereby making them wrong. However, as the same logic is used for all entries, the avg. values are still comparable.
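                The size of that \n effect can be seen on a toy quality string (a sketch; 'f' is Q38 under the offset-64 encoding):

                ```shell
                # Without chomp, the trailing "\n" (ASCII 10) contributes 10 - 64 = -54 to
                # the sum and one extra character to the length: "ffff" should average 38,
                # but "ffff\n" averages (4*38 - 54) / 5 instead.
                printf 'ffff\n' | perl -ne '$s += ord($_) - 64 for split(""); print $s / length($_), "\n"'
                # prints 19.6
                ```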



                • #9
                  Hi Brugger,

                  Thanks again for your detailed explanation.
                  I will try out the Perl one-liner on a real case later.
                  Hopefully it will work perfectly.
                  I will report back with the results so you can verify their quality.



                  • #10
                    Thanks a lot, drio.
                    Below is the output after applying your Perl script to my input data:
                    Code:
                    cat input_file.txt
                    @sample_1
                    AACATTCAACGCTGNCGGTGAGTTGGAATTCTCGGGTGCCAAGGAACTCC
                    +sample_1
                    ^`Y^aa__^_]\`_B\U][RV`W`^`][``__Z^P[UUZZUUa^Z[^^Z[
                    @sample_2
                    TATTGCACTTGTCCNGGCCTGTTGGAATTCGCGGGTGCCAAGGAACTCCA
                    +sample_2
                    aabbb_a_ab``a_BaX`abb_aba^abb[N]P\a]`a`^\`a`a`aS`]
                    @sample_3
                    GCCAAACAATACCANGGCTGCCGGGCTGGACTTCTCGGGTGCCAAGGAAC
                    +sample_3
                    aWW[]\VY\`V_L\B^TR\ZQO_W\MXTXPS]SJVPWMYNYKDQV^`BBB
                    
                    cat input_file.txt | perl -ne '$o=$_; $w=<STDIN>; $r=<STDIN>; $_=<STDIN>; map{ $score += ord($_)-64} split(""); print "$o$w$r$_" if ($score/length($_)) > 36; '
                    @sample_2
                    TATTGCACTTGTCCNGGCCTGTTGGAATTCGCGGGTGCCAAGGAACTCCA
                    +sample_2
                    aabbb_a_ab``a_BaX`abb_aba^abb[N]P\a]`a`^\`a`a`aS`]
                    @sample_3
                    GCCAAACAATACCANGGCTGCCGGGCTGGACTTCTCGGGTGCCAAGGAAC
                    +sample_3
                    aWW[]\VY\`V_L\B^TR\ZQO_W\MXTXPS]SJVPWMYNYKDQV^`BBB
                    Would you be able to write a Perl script that generates another output file containing the average quality score of each read in my input_file.txt?
                    Thanks a lot for your advice.
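                    For what it's worth, a sketch of such a per-read score file (assuming the offset-64 encoding; scores.txt is just an example output name) could look like:

                    ```shell
                    # For each 4-line FASTQ record, print "read_id<TAB>average_score",
                    # redirected to a separate file.
                    cat input_file.txt | perl -ne 'chomp; $id = $_; <>; <>; $q = <>; chomp $q; $s = 0; $s += ord($_) - 64 for split("", $q); print "$id\t", $s / length($q), "\n"' > scores.txt
                    ```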


                    • #11
                      Thanks, Brugger.
                      I just tried both of your Perl scripts.
                      The first script, which calculates the average quality score of each read, works nicely.
                      But the second script does not seem to filter the reads by a pre-determined average quality score cut-off?
                      Code:
                      cat input_file.txt
                      @sample_1
                      AACATTCAACGCTGNCGGTGAGTTGGAATTCTCGGGTGCCAAGGAACTCC
                      +sample_1
                      ^`Y^aa__^_]\`_B\U][RV`W`^`][``__Z^P[UUZZUUa^Z[^^Z[
                      @sample_2
                      TATTGCACTTGTCCNGGCCTGTTGGAATTCGCGGGTGCCAAGGAACTCCA
                      +sample_2
                      aabbb_a_ab``a_BaX`abb_aba^abb[N]P\a]`a`^\`a`a`aS`]
                      @sample_3
                      GCCAAACAATACCANGGCTGCCGGGCTGGACTTCTCGGGTGCCAAGGAAC
                      +sample_3
                      aWW[]\VY\`V_L\B^TR\ZQO_W\MXTXPS]SJVPWMYNYKDQV^`BBB
                      
                      cat input_file.txt | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} 
                      split(""); print "  avg score: " .($score/length($_))."\n";$code=0;'
                      
                      @sample_1   avg score: 25.7647058823529
                      @sample_2   avg score: 54.4705882352941
                      @sample_3   avg score: 73.7058823529412
                      
                      cat input_file.txt | perl -ne '$t=$_ . <> . <>; $_ = <>; chomp; map{ $score += ord($_)-64} split(""); print "$t$_\n" if ($score/length($_)) > 36; $score= 0'
                      When I try the second Perl script, it does not seem to work and produces no output.
                      Thanks a lot for your advice on improving it.

                      • #12
                        Brugger made a little typo; change the last assignment from $code=0 to $score=0 and you'll have the behaviour you were looking for (in the first one-liner):

                        Code:
                        cat fastq_file | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} split(""); print "  avg score: " .($score/length($_))."\n";$score=0;'
                        The second one works just fine.

                        Back to watching Enter the Dragon.
                        -drd



                        • #13
                          Hi drio,
                          Thanks for your Perl script.
                          I just tried it out and it gives the following output:
                          Code:
                          cat input_file.txt | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} split(""); print "  avg score: " .($score/length($_))."\n";$score=0;'
                          
                          @sample_1   avg score: 25.7647058823529
                          @sample_2   avg score: 28.7058823529412
                          @sample_3   avg score: 13.2352941176471
                          I am just wondering why the average quality scores from your Perl script differ from those given by Brugger's Perl script?
                          Thanks for your advice, and enjoy Enter the Dragon.



                          • #14
                            I think the focus on 'average quality' may be a bit misleading here, as it probably isn't the most useful measure to make. What you will often find is that for runs which have problems with poor quality it tends to be that the quality drops off as the run progresses. In general therefore you will see all reads showing poor quality later on in the run. Finding individual reads which are poor in an otherwise good run won't normally remove many sequences from your set.

                            What we've sometimes found to be beneficial is trimming the entire set of sequences to a length where the quality across the run was still OK. This will leave you with a shorter read length, but with sequence which is of much higher average quality. I'd not recommend doing this unless there is a really serious quality problem in your data, but we've seen quite a few instances of longer read lengths (75bp+) where the later stages of the read were just noise.
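                            The trimming approach described above could be sketched as a one-liner in the style of the rest of the thread (36 is an arbitrary example length, input.fastq a placeholder, and a strict 4-line-per-record FASTQ is assumed):

                            ```shell
                            # Hard-clip every sequence and quality line (the even-numbered lines
                            # of a 4-line-per-record FASTQ file) to the first 36 bases; header
                            # and '+' lines pass through unchanged.
                            cat input.fastq | perl -pe 'if ($. % 2 == 0) { chomp; $_ = substr($_, 0, 36) . "\n" }'
                            ```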



                            • #15
                              Hi simon,

                              Thanks a lot for sharing your knowledge about handling FASTQ-format reads.
                              I now have a long list of reads in FASTQ format from an Illumina Genome Analyzer; each read is 50 nucleotides in length.
                              My initial purpose in calculating the average quality score of each read was to determine a cut-off for filtering out "low quality reads".
                              After that, I planned to preprocess the raw FASTQ data and use the "high quality reads" for further analysis.
                              Based on your understanding and experience with FASTQ reads, what is the best approach when receiving FASTQ reads from Illumina?
                              Thanks a lot for your advice.
                              This is my first time dealing with Illumina data in FASTQ format, so I'm still new to handling their data.
