Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • bio_informatics
    Senior Member
    • Nov 2013
    • 182

    Sequence Quality for large number of fastqs

    Hi Members,

    Organism: S. aureus
    I've 100 isolates' Illumina WGS. That is 2*100 fastq.gz files. Read length 100.
    Unfortunately, the run failed before completing last 6 cycles.
    Sequencing centre has de-multiplexed data, and we've it in our court now.

    My challenge here is to strike out isolates which have poor quality. I'm interested for R2 only.

    One approach I know is:
    Get fast-qc report for all the R2, and visually decide(pass,fail) respective data one by one.
    Then again that would be qualitative, and thus I do not want to go for this approach (for the time being).
    Also, poor is a subjective term.

    Let's say, I'd like to have isolates with an average of 25 coverage. Is there way to do so?

    I'm sure this situation is very common with other labs.

    Looking forward for guidance!
    Last edited by bio_informatics; 11-12-2015, 07:55 AM.
    Bioinformaticscally calm
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    I would quality-trim (discarding reads below a certain length after trimming, say, 70bp) or quality-filter. Then discard libraries that are below your coverage target after the operation. For example, using BBDuk on files named "read1.fq.gz" and "read2.fq.gz" and trimming to q15:

    bbduk.sh in=read#.fq.gz out=trimmed#.fq.gz qtrim=r trimq=15 minlen=70

    You'll get some result like this:

    Code:
    Input:                          200000 reads            60200000 bases.
    QTrimmed:                       199996 reads (100.00%)  19609694 bases (32.57%)
    Result:                         191928 reads (95.96%)   40590306 bases (67.43%)
    Then you can decide empirically based on the number of resulting bases whether the library is sufficient for your needs.

    Alternately, if the only problem is the last 6 bases of read 2, then just trim the last 6 bases of read 2 and keep all of the libraries.

    Comment

    • GenoMax
      Senior Member
      • Feb 2008
      • 7142

      #3
      Remember to trim R1/R2 together even if you inspect/suspect only R2. @Brian's solution will ensure that is done.

      Basically, what you have in hand is a 100 x 94 run.

      Comment

      • bio_informatics
        Senior Member
        • Nov 2013
        • 182

        #4
        Originally posted by Brian Bushnell View Post

        bbduk.sh in=read#.fq.gz out=trimmed#.fq.gz qtrim=r trimq=15 minlen=70


        Code:
        Input:                          200000 reads            60200000 bases.
        QTrimmed:                       199996 reads (100.00%)  19609694 bases (32.57%)
        Result:                         191928 reads (95.96%)   40590306 bases (67.43%)
        Then you can decide empirically based on the number of resulting bases whether the library is sufficient for your needs.
        Hi Brian,
        Thanks much for your reply, and suggesting bbduk. This is the exact thing I was looking for.
        I'll give this a try.
        Bioinformaticscally calm

        Comment

        • bio_informatics
          Senior Member
          • Nov 2013
          • 182

          #5
          Originally posted by GenoMax View Post
          Remember to trim R1/R2 together even if you inspect/suspect only R2. @Brian's solution will ensure that is done.

          Basically, what you have in hand is a 100 x 94 run.
          Hi GM,
          Oh, I see.
          I couldn't imagine to that extent. Shall include both reads.

          Thank you.
          Bioinformaticscally calm

          Comment

          • bio_informatics
            Senior Member
            • Nov 2013
            • 182

            #6
            Originally posted by Brian Bushnell View Post
            bbduk.sh in=read#.fq.gz out=trimmed#.fq.gz qtrim=r trimq=15 minlen=70
            Hi Brian,
            Thanks again for sharing tool, and command.
            I've following output:

            [in=R2_001.fastq.gz, out=trimmed.fq.gz, qtrim=r, trimq=25, minlen=70]

            Input: 2301451 reads 195623335 bases.
            QTrimmed: 370763 reads (16.11%) 22690551 bases (11.60%)
            Result: 2043361 reads (88.79%) 172932784 bases (88.40%)
            I've following doubt:



            1) If I'm understanding output properly, here, 88.79% of the input reads passed my filters (Length 70, and Quality 25).
            Kindly correct me if I'm wrong.
            Last edited by bio_informatics; 11-11-2015, 11:32 AM.
            Bioinformaticscally calm

            Comment

            • GenoMax
              Senior Member
              • Feb 2008
              • 7142

              #7
              Here is what @Brian had to say about this question

              Why Input =/= KTrimmed + Result at times?

              There are two reasons for that, and they involve the "minlen" parameter and read pairing. For trimming, generally, the result should equal the input unless some reads are trimmed below the minimum length and discarded; you can, for example, have 100% trimmed and a result of 100% as long as they are only trimmed a little bit.

              With pairs, it's even more confusing as sometimes you have a pair discarded because one read got trimmed too short, so you can end up with, theoretically, 50% of reads trimmed and a result of 0%.

              If you set "minlen=0", then nothing will get discarded on the basis of being too short and result will equal input.
              I have already put a feature request in to include a stat line about number of reads that get discarded. @Brian has promised to add that in a future BBMap release.

              Comment

              • bio_informatics
                Senior Member
                • Nov 2013
                • 182

                #8
                Originally posted by GenoMax View Post
                Here is what @Brian had to say about this question


                I have already put a feature request in to include a stat line about number of reads that get discarded. @Brian has promised to add that in a future BBMap release.
                Hi Genomax,
                Sorry, I don't understand what it mean.
                Let me take baby steps here. Input R2 - lagging strand.


                Input: 2301451 reads 195623335 bases.
                QTrimmed: 370763 reads (16.11%) 22690551 bases (11.60%)
                Result: 2043361 reads (88.79%) 172932784 bases (88.40%)
                Qtrimmed (reads) % + Result (reads) % >100

                However,
                Qtrimmed (bases) % + Result (bases) % =100

                I'm confused here.
                Were 16.11% of my reads discarded?
                If so, then my output should have 83-ish% of result. Why there's a conflict?

                References:


                Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc
                Bioinformaticscally calm

                Comment

                • GenoMax
                  Senior Member
                  • Feb 2008
                  • 7142

                  #9
                  Originally posted by bio_informatics View Post
                  Hi Genomax,
                  Qtrimmed (reads) % + Result (reads) % >100

                  However,
                  Qtrimmed (bases) % + Result (bases) % =100

                  I'm confused here.
                  Were 16.11% of my reads discarded?
                  If so, then my output should have 83-ish% of result. Why there's a conflict?
                  ~16% of the reads were Qtrimmed (but not all were discarded). Based on the parameter set for minlen (default = 10) some reads still remain in the result file after being trimmed. If you set minlen=0, no read will get discarded.

                  BBDuk currently does not provide a stat line which tells you exactly how many reads were discarded. I have asked Brian to add that as a feature request.

                  BTW: Why are you not trimming R1/R2 together? Are you not interested in R1 at all and will not use that read for analysis?

                  Comment

                  • bio_informatics
                    Senior Member
                    • Nov 2013
                    • 182

                    #10
                    Originally posted by GenoMax View Post
                    ~16% of the reads were Qtrimmed (but not all were discarded). Based on the parameter set for minlen (default = 10) some reads still remain in the result file after being trimmed. If you set minlen=0, no read will get discarded.

                    BBDuk currently does not provide a stat line which tells you exactly how many reads were discarded. I have asked Brian to add that as a feature request.

                    BTW: Why are you not trimming R1/R2 together? Are you not interested in R1 at all and will not use that read for analysis?
                    Hi GM,
                    Thanks for your reply.
                    That helps.

                    But, then why doesn't this apply to the bases? Base %-age seems to add up to 100.
                    Yes, I had posted output for Read 2 only.

                    I'm definitely going to use R1 for analysis.

                    If I include both my reads, with q25 and length 70, output looks like:

                    Processing time: 4.471 seconds.

                    Input: 2301450 reads 195623250 bases.
                    QTrimmed: 459677 reads (19.97%) 51656212 bases (26.41%)
                    Result: 1701048 reads (73.91%) 143967038 bases (73.59%)

                    Time: 4.911 seconds.
                    Reads Processed: 2301k 468.65k reads/sec
                    Bases Processed: 195m 39.84m bases/sec
                    The reason I was using R2 only is if Read 2's result (from bbduk) percentage is below 80% all those isolate will be discarded for the analysis.
                    Bioinformaticscally calm

                    Comment

                    • bio_informatics
                      Senior Member
                      • Nov 2013
                      • 182

                      #11
                      Since I'm interested to see how many reads/bases are good. Why would I trim them?
                      Can I use maq (minimum average quality) argument in bbduk? My goal is to get count for the reads which have quality say xyz.
                      Sorry, I'm making waters murky. I think it's important for me to understand what to use, when and why.
                      Bioinformaticscally calm

                      Comment

                      • GenoMax
                        Senior Member
                        • Feb 2008
                        • 7142

                        #12
                        This is a re-sequencing project correct? Minq=25 seems pretty strict unless you have a specific reason. Do you have situations where there are areas in the middle of the read that are bad quality (based on your FastQC scans)?

                        It is not logical to calculate "average" quality for a read. Is that what you are referring to?

                        Looks like most of your R2 is ok (though slightly short since the run stopped after 94 cycles?)

                        Comment

                        • bio_informatics
                          Senior Member
                          • Nov 2013
                          • 182

                          #13
                          Originally posted by GenoMax View Post
                          This is a re-sequencing project correct? Minq=25 seems pretty strict unless you have a specific reason. Do you have situations where there are areas in the middle of the read that are bad quality (based on your FastQC scans)?

                          It is not logical to calculate "average" quality for a read. Is that what you are referring to?

                          Looks like most of your R2 is ok (though slightly short since the run stopped after 94 cycles?)
                          Hi,

                          - Yes, average read quality I referred to.
                          - Yes, the run stopped after 96 cycles.
                          - I pulled QC for Read 2, it looks good than I had expected. Nothing below quality 36. There are no regions where quality slacks in middle.

                          Yes, I'm interested to find out reads which are ugly and then can be sent for re-sequencing.

                          There's no reason to select Q25.
                          Q1) Is quality 25 while working with "Read 2" alone is high? Or
                          while working with both "R1 and R2" is high?

                          That is the issue, should I pull the bar for result percent below, that is select reads which have passed Q25, L 70? I'm considering to take reads which give 80% in result. (this is while running Read 2 only).

                          Or

                          Pull down Quality at which I'm obtaining my result which in this case 25.

                          Thanks again for your support, and guiding through these discussions.
                          Last edited by bio_informatics; 11-12-2015, 08:43 AM.
                          Bioinformaticscally calm

                          Comment

                          • GenoMax
                            Senior Member
                            • Feb 2008
                            • 7142

                            #14
                            Based on what you are describing this run seems fully acceptable. Just 6 (or 4 bp) shorter on R2 than you expected, which is no big deal (unless there are NNNNN's at the end of every read in R2, where the cycles failed).

                            I would say just scan/trim for presence of adapters (and if you must, trim <Q20 bases) and then on towards mapping.

                            Comment

                            • Brian Bushnell
                              Super Moderator
                              • Jan 2014
                              • 2709

                              #15
                              For trimming, Q25 is very high. The more bases you trim, the less accurate your mapping. Even low-quality bases (under Q10) can increase the specificity when mapping. I don't trim above ~Q15 except in very rare circumstances; Q12 is more typical.

                              To clarify a couple of things:

                              "maq", or minimum average quality, translates the quality scores to probabilities, averages them, then transforms back to a PHRED-scaled Q-score. So, maq=20 would remove all reads with an average error rate of over 1%, or an average of 1 error per 100bp read. This is applied after trimming, if trimming is performed. Ns are considered to have a 75% error rate, so it's useful to at least trim trailing Ns (trimq=1) if you use maq; otherwise, any read with 4 trailing Ns would be discarded. And its mate would also be discarded.

                              And - it's important to always trim (and otherwise process) R1 and R2 together, at the same time, in any process that might discard or reorder the reads. Otherwise you will end up with unmatched pairs that won't map correctly.

                              Lastly, note that you can trim the last 4 bases of read 2 only, and keep the result in the same order as before, like this:

                              bbduk.sh in=read2.fq out=trimmed2.fq ftr2=4 ordered

                              In this situation, it would be helpful if you could post a couple charts from one of the bad batches of raw data:

                              Either post the quality histogram according to Illumina:
                              bbduk.sh in=read1.fq in2=read2.fq qhist=qhist.txt bhist=bhist.txt

                              Or, even better, post the quality histograms as determined by mapping:
                              bbmap.sh in1=read1.fq in2=read2.fq ref=reference.fasta qhist=qhist.txt bhist=bhist.txt mhist=mhist.txt

                              Comment

                              Latest Articles

                              Collapse

                              • SEQadmin2
                                Nine Things a Sample Prep Scientist Thinks About Before Sequencing
                                by SEQadmin2


                                I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

                                Here are nine questions we think about, in roughly the order they matter, before...
                                06-18-2026, 07:11 AM
                              • SEQadmin2
                                From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                by SEQadmin2


                                Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                ...
                                06-02-2026, 10:05 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, Today, 05:37 AM
                              0 responses
                              1 view
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-26-2026, 11:10 AM
                              0 responses
                              16 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-17-2026, 06:09 AM
                              0 responses
                              49 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-09-2026, 11:58 AM
                              0 responses
                              109 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...