Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16
    Originally posted by edge View Post
    Hi simon,
    Based on your understanding and experience regarding fastq format read, what is the best way I should do when receiving the fastq read from ILLUMINA?
    I think the posts above have covered what I'd do, one way or another.

    I'd run the sequences through FastQC. This will generate a per-base quality plot and a per-sequence quality plot. The per base plot will tell you if there is a systematic loss of quality as your run progresses. If there is you might want to consider trimming a bit off the length of all of your sequences.

    You can also see if there is a sub-population of sequences which have universally poor quality within a generally good run. This is much more unusual in my experience, but if it is a problem in your data you could filter these out as well.

    I have to say that in most cases we don't filter our data unless the qualities are really poor. Many downstream analyses can take the assigned quality values into account so they don't need to have it all removed, but this may be different in your case depending on what you're intending doing with the data.

    Comment


    • #17
      Thanks a lot and again for your explanation in detail, simon
      I'm trying out my data regarding your suggested pipeline now.
      Will ask your advice or comment if I'm facing any error or misunderstanding when dealing for the data read later.
      Thanks again, simon ^^

      Comment


      • #18
        Include Std dev?

        Excellent post! Two questions:

        1) Is there a way to include standard dev of the quality scores along with the average using the one-liner?

        2) I agree with Simon in that the quality of a read degrades as the length of the read increases. Is there a way (using the same one-liner) one can limit to say 30 base pairs instead of the entire line of a seq read and get the ave and std dev of the quality scores?

        Thanks, and your help is immense!

        Originally posted by Brugger View Post
        Drio: Your kung-fu is faulty as the score is not reset after each entry. Actually so is mine above, but I never tested it on more than one entry... I am sure that there is something to be said about doing oneliners later on Fridays.


        More important this makes the analysis above wrong. If you had looked at the avg scores in greater detail you would have caught it as the quality value should not exceed 40.


        My initial oneliner should have been:
        Code:
        cat fastq_file | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $score += ord($_)-64} 
        split(""); print "  avg score: " .($score/length($_))."\n";$code=0;'
        And to do the filtration how ever one solution would be:
        Code:
        cat fastq_file | perl -ne '$t=$_ . <> . <>; $_ = <>; chomp; map{ $score += ord($_)-64} split(""); print "$t$_\n" if ($score/length($_)) > 36; $score= 0'
        Another bug is that the \n for the quality line would be included in the calculations and there by making them wrong. However, as the same logic is used for all entries this makes the avg. values comparable.

        Comment


        • #19
          Originally posted by Seq_in_Perl View Post
          Is there a way to include standard dev of the quality scores along with the average using the one-liner?
          One other thing to mention here - quality scores are nowhere near normally distributed. They are heavily skewed towards high values, with a fairly rapid descent to values close to zero (especially for Illumina data where they use a score of 2 as a sort of flag rather than a true probability measure). This means that using mean and stdev tend to give a fairly poor reflection of the range of values you have. We tend to prefer looking at percentiles through the distribution to get a clearer picture of what's going on (we often find that the 25th, 50th, 75th and 90th percentiles can all have the same value, but that the 10th can vary wildly).

          Comment


          • #20
            Originally posted by Seq_in_Perl View Post
            Excellent post! Two questions:

            1) Is there a way to include standard dev of the quality scores along with the average using the one-liner?

            2) I agree with Simon in that the quality of a read degrades as the length of the read increases. Is there a way (using the same one-liner) one can limit to say 30 base pairs instead of the entire line of a seq read and get the ave and std dev of the quality scores?

            Thanks, and your help is immense!
            Code:
            cat fastq_file | perl -ne 'chomp;$_ = substr($_,0,30); print;<STDIN>;<STDIN>;$_ = <STDIN>;map{ $_=ord($_)-64; $s+=$_;$s2=$_*$_;} split(""); $l=length($_); print "  avg score: " .($score/$l)." stddev".(sqrt(($l*$s2-$s*$s)/($l*($l-1)))) ."\n";$s=$s2=$l=0;'
            I believe (not really tested). But now things are getting silly. A proper script would increase readability greatly...


            And I agree with simon, this does not really make any sense. We normally run bwa with aln with a '-q 15' flag to remove low quality ends of if you do assemblies, velvet comes with a suite of scripts for clipping off the low quality ends.

            Comment


            • #21
              An IMPORTANT bug

              Thanks for the code snip -- very nice!

              I just found a bug in the code -- without chomp the newline character, the score will be wrong. Here is the new code:

              cat fastq_file | perl -ne 'chomp;print;<STDIN>;<STDIN>;$_ = <STDIN>; chomp; $score=0; map{ $score += ord($_)-33}
              split(""); print " avg score: " .($score/length($_))."\n";'


              I tested this by following example:
              @HWI-ST570:300EKUABXX:3:2208:7666:82627 2:N:0:
              AGGTGGGGGGGGGTGGGGGTGTGGGGTGGGGTGGGTGGGTGTTGGGGGGATGGGGGTGTGAGGTGGGGGGGGGGGGGGGGGGGGGGGTTATGGTGT
              +
              ################################################################################################



              I also change the above code to Phred+33, which works for Illumina 1.8+ and Sanger. For Illumina 1.5+, use ord($_)-64

              Comment


              • #22
                hi

                I tried your code for getting quality averages, howvere this is what it returned to me:

                @M01232:82:000000000-AHHB7:1:2119:22801:25267 1:N:0:33 avg score: 635857.304635762
                @M01232:82:000000000-AHHB7:1:2119:10835:25269 1:N:0:33 avg score: 635856.986754967
                @M01232:82:000000000-AHHB7:1:2119:17952:25279 1:N:0:33 avg score: 635852.953642384

                I think I have an idea of how they got to this score but what am I doing wrong or how can I fix this?
                I copied the code exactley and added my fastq file.

                Also once/if this can be fixed is it possible to run multiple separate fastq files at a time instead of having to do one at a time. All my files contains 2 mill reads per file, it takes incredibly long.

                Any help?

                Auberi

                Comment


                • #23
                  Din't see the post prior to mine and that seems to have fixed my bug as I worked with ILLUMINA 1.8+.
                  Sorry about that.

                  Thanks

                  Comment


                  • #24
                    Sorry. It is a little off topic but ...

                    I am trying to merge paired ends fastq files.

                    Currently I have only been able to perform this task with BBmerge, mergePairs.py and a custom software developed by myself.

                    I consistently get low percentages of merging: about 20% when limiting quality of overlap above 90%.

                    Is this normal? I assumed that merging would be much above 50%.

                    Thanks.

                    Comment


                    • #25
                      Originally posted by bi_maniac View Post
                      Sorry. It is a little off topic but ...

                      I am trying to merge paired ends fastq files.

                      Currently I have only been able to perform this task with BBmerge, mergePairs.py and a custom software developed by myself.

                      I consistently get low percentages of merging: about 20% when limiting quality of overlap above 90%.

                      Is this normal? I assumed that merging would be much above 50%.

                      Thanks.
                      It depends on the insert size you have in your library. If a lot of your inserts are long enough that the overlap between your reads will be short or non-existent then you won't be able to merge them. If the reads do overlap, but the quality is poor then you might have trouble uniquely identifying the correct place to join the reads.

                      Comment


                      • #26
                        Thanks. If I understand correctly, given a library and a particular gene to be studied, insert size should be known. OK?

                        Comment


                        • #27
                          Originally posted by bi_maniac View Post
                          Thanks. If I understand correctly, given a library and a particular gene to be studied, insert size should be known. OK?
                          Whoever made the library should have a rough idea of the size selection they used when creating the library from which you can work out the insert size range. It will be an approximate range though, not a fixed value.

                          Comment


                          • #28
                            Thanks a lot. You are being very helpful

                            Comment


                            • #29
                              @bi_maniac: While you wait to hear back from people who made the libraries, you can estimate the insert size based on data at hand. See Brian's suggestion here: http://seqanswers.com/forums/showpos...13&postcount=2

                              Comment


                              • #30
                                It would help if you could post the insert-size histogram from merging, by the way. If it is monotonically increasing and then suddenly drops to zero just before 2*(read length), then they mostly don't overlap. If there is a prominent bell-shaped peak well below 2*(read length), then the problem is likely quality. But at a 20% merge rate I assume they mostly don't overlap.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Genetic Variation in Immunogenetics and Antibody Diversity
                                  by seqadmin



                                  The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben Martínez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...
                                  11-06-2024, 07:24 PM
                                • seqadmin
                                  Choosing Between NGS and qPCR
                                  by seqadmin



                                  Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...
                                  10-18-2024, 07:11 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 11-08-2024, 11:09 AM
                                0 responses
                                35 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 11-08-2024, 06:13 AM
                                0 responses
                                28 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 11-01-2024, 06:09 AM
                                0 responses
                                32 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 10-30-2024, 05:31 AM
                                0 responses
                                23 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X