Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Originally posted by edge View Post
    Hi simon,
    Based on your understanding and experience regarding fastq format read, what is the best way I should do when receiving the fastq read from ILLUMINA?
    I think the posts above have covered what I'd do, one way or another.

    I'd run the sequences through FastQC. This will generate a per-base quality plot and a per-sequence quality plot. The per base plot will tell you if there is a systematic loss of quality as your run progresses. If there is you might want to consider trimming a bit off the length of all of your sequences.

    You can also see if there is a sub-population of sequences which have universally poor quality within a generally good run. This is much more unusual in my experience, but if it is a problem in your data you could filter these out as well.

    I have to say that in most cases we don't filter our data unless the qualities are really poor. Many downstream analyses can take the assigned quality values into account so they don't need to have it all removed, but this may be different in your case depending on what you're intending doing with the data.

    Comment


    • #17
      Thanks a lot and again for your explanation in detail, simon
      I'm trying out my data regarding your suggested pipeline now.
      Will ask your advice or comment if I'm facing any error or misunderstanding when dealing for the data read later.
      Thanks again, simon ^^

      Comment


      • #18
        Include Std dev?

        Excellent post! Two questions:

        1) Is there a way to include standard dev of the quality scores along with the average using the one-liner?

        2) I agree with Simon in that the quality of a read degrades as the length of the read increases. Is there a way (using the same one-liner) one can limit to say 30 base pairs instead of the entire line of a seq read and get the ave and std dev of the quality scores?

        Thanks, and your help is immense!

        Originally posted by Brugger View Post
        Drio: Your kung-fu is faulty as the score is not reset after each entry. Actually so is mine above, but I never tested it on more than one entry... I am sure that there is something to be said about doing oneliners later on Fridays.


        More important this makes the analysis above wrong. If you had looked at the avg scores in greater detail you would have caught it as the quality value should not exceed 40.


        My initial oneliner should have been:
        Code:
        cat fastq_file | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} 
        split(""); print "  avg score: " .($score/length($_))."\n";$code=0;'
        And to do the filtration how ever one solution would be:
        Code:
        cat fastq_file | perl -ne '$t=$_ . <> . <>; $_ = <>; chomp; map{ $score += ord($_)-64} split(""); print "$t$_\n" if ($score/length($_)) > 36; $score= 0'
        Another bug is that the \n for the quality line would be included in the calculations and there by making them wrong. However, as the same logic is used for all entries this makes the avg. values comparable.

        Comment


        • #19
          Originally posted by Seq_in_Perl View Post
          Is there a way to include standard dev of the quality scores along with the average using the one-liner?
          One other thing to mention here - quality scores are nowhere near normally distributed. They are heavily skewed towards high values, with a fairly rapid descent to values close to zero (especially for Illumina data where they use a score of 2 as a sort of flag rather than a true probability measure). This means that using mean and stdev tend to give a fairly poor reflection of the range of values you have. We tend to prefer looking at percentiles through the distribution to get a clearer picture of what's going on (we often find that the 25th, 50th, 75th and 90th percentiles can all have the same value, but that the 10th can vary wildly).

          Comment


          • #20
            Originally posted by Seq_in_Perl View Post
            Excellent post! Two questions:

            1) Is there a way to include standard dev of the quality scores along with the average using the one-liner?

            2) I agree with Simon in that the quality of a read degrades as the length of the read increases. Is there a way (using the same one-liner) one can limit to say 30 base pairs instead of the entire line of a seq read and get the ave and std dev of the quality scores?

            Thanks, and your help is immense!
            Code:
            cat fastq_file | perl -ne 'chomp;$_ = substr($_,0,30); print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $_=ord($_)-64; $s+=$_;$s2=$_*$_;} split(""); $l=length($_); print "  avg score: " .($score/$l)." stddev".(sqrt(($l*$s2-$s*$s)/($l*($l-1)))) ."\n";$s=$s2=$l=0;'
            I believe (not really tested). But now things are getting silly. A proper script would increase readability greatly...


            And I agree with simon, this does not really make any sense. We normally run bwa with aln with a '-q 15' flag to remove low quality ends of if you do assemblies, velvet comes with a suite of scripts for clipping off the low quality ends.

            Comment


            • #21
              An IMPORTANT bug

              Thanks for the code snip -- very nice!

              I just found a bug in the code -- without chomp the newline character, the score will be wrong. Here is the new code:

              cat fastq_file | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>; chomp; $score=0; map{ $score += ord($_)-33}
              split(""); print " avg score: " .($score/length($_))."\n";'


              I tested this by following example:
              @HWI-ST570:300EKUABXX:3:2208:7666:82627 2:N:0:
              AGGTGGGGGGGGGTGGGGGTGTGGGGTGGGGTGGGTGGGTGTTGGGGGGATGGGGGTGTGAGGTGGGGGGGGGGGGGGGGGGGGGGGTTATGGTGT
              +
              ################################################################################################



              I also change the above code to Phred+33, which works for Illumina 1.8+ and Sanger. For Illumina 1.5+, use ord($_)-64

              Comment


              • #22
                hi

                I tried your code for getting quality averages, howvere this is what it returned to me:

                @M01232:82:000000000-AHHB7:1:2119:22801:25267 1:N:0:33 avg score: 635857.304635762
                @M01232:82:000000000-AHHB7:1:2119:10835:25269 1:N:0:33 avg score: 635856.986754967
                @M01232:82:000000000-AHHB7:1:2119:17952:25279 1:N:0:33 avg score: 635852.953642384

                I think I have an idea of how they got to this score but what am I doing wrong or how can I fix this?
                I copied the code exactley and added my fastq file.

                Also once/if this can be fixed is it possible to run multiple separate fastq files at a time instead of having to do one at a time. All my files contains 2 mill reads per file, it takes incredibly long.

                Any help?

                Auberi

                Comment


                • #23
                  Din't see the post prior to mine and that seems to have fixed my bug as I worked with ILLUMINA 1.8+.
                  Sorry about that.

                  Thanks

                  Comment


                  • #24
                    Sorry. It is a little off topic but ...

                    I am trying to merge paired ends fastq files.

                    Currently I have only been able to perform this task with BBmerge, mergePairs.py and a custom software developed by myself.

                    I consistently get low percentages of merging: about 20% when limiting quality of overlap above 90%.

                    Is this normal? I assumed that merging would be much above 50%.

                    Thanks.

                    Comment


                    • #25
                      Originally posted by bi_maniac View Post
                      Sorry. It is a little off topic but ...

                      I am trying to merge paired ends fastq files.

                      Currently I have only been able to perform this task with BBmerge, mergePairs.py and a custom software developed by myself.

                      I consistently get low percentages of merging: about 20% when limiting quality of overlap above 90%.

                      Is this normal? I assumed that merging would be much above 50%.

                      Thanks.
                      It depends on the insert size you have in your library. If a lot of your inserts are long enough that the overlap between your reads will be short or non-existent then you won't be able to merge them. If the reads do overlap, but the quality is poor then you might have trouble uniquely identifying the correct place to join the reads.

                      Comment


                      • #26
                        Thanks. If I understand correctly, given a library and a particular gene to be studied, insert size should be known. OK?

                        Comment


                        • #27
                          Originally posted by bi_maniac View Post
                          Thanks. If I understand correctly, given a library and a particular gene to be studied, insert size should be known. OK?
                          Whoever made the library should have a rough idea of the size selection they used when creating the library from which you can work out the insert size range. It will be an approximate range though, not a fixed value.

                          Comment


                          • #28
                            Thanks a lot. You are being very helpful

                            Comment


                            • #29
                              @bi_maniac: While you wait to hear back from people who made the libraries, you can estimate the insert size based on data at hand. See Brian's suggestion here: http://seqanswers.com/forums/showpos...13&postcount=2

                              Comment


                              • #30
                                It would help if you could post the insert-size histogram from merging, by the way. If it is monotonically increasing and then suddenly drops to zero just before 2*(read length), then they mostly don't overlap. If there is a prominent bell-shaped peak well below 2*(read length), then the problem is likely quality. But at a 20% merge rate I assume they mostly don't overlap.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM
                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 06:37 PM
                                0 responses
                                10 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, Yesterday, 06:07 PM
                                0 responses
                                9 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-22-2024, 10:03 AM
                                0 responses
                                49 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-21-2024, 07:32 AM
                                0 responses
                                67 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X