  • How many poor quality bases define a bad read (using fastx or equivalent)?

    I want to know what people who have used fastx or similar have done to define a poor-quality read. What do you call a bad base (I was thinking of calling one bad if it had a score of 20 or below on the phred33 scale), and how many bad bases do you allow per read? I wondered if >=1 bad base per 10 would define a bad read. So if I had a 100 bp read, 10 or more bases with a score of 20 or less would get it rejected. Does this sound right? Too strict? Too relaxed?
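
    For concreteness, I think this criterion would translate into something like the FASTX-Toolkit's fastq_quality_filter, which keeps only reads in which a given percentage of bases meet a quality cutoff (assuming a build that accepts -Q33 for phred+33 input; file names are placeholders):

    fastq_quality_filter -Q33 -q 21 -p 90 -i reads.fq -o filtered.fq

    If I understand the options correctly, -q 21 -p 90 keeps only reads in which at least 90% of bases have a quality of 21 or higher, i.e. it rejects reads where more than 10% of bases are Q20 or below, which is roughly the 1-in-10 rule above.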

    Thanks for any advice

  • #2
    Where exactly are these bad bases located? If you are aligning to a reference you may be able to tolerate "bad" bases but for de novo assemblies you would want to be strict.

    • #3
      Originally posted by GenoMax
      Where exactly are these bad bases located? If you are aligning to a reference you may be able to tolerate "bad" bases but for de novo assemblies you would want to be strict.
      Before mapping they are located throughout the whole read; after mapping with TopHat you see a 3' bias in the poor-quality reads. Normally I would use mapping to filter out bad reads, but the sheer number of them makes me concerned that some low-quality reads have been falsely mapped.

      • #4
        Are the bad quality bases located at constant positions (cycle #) in reads? What fraction of reads are affected? This could be indicative of some sort of a problem with the run itself.

        • #5
          Originally posted by GenoMax
          Are the bad quality bases located at constant positions (cycle #) in reads? What fraction of reads are affected? This could be indicative of some sort of a problem with the run itself.
          I am not sure what you mean. No, the poor bases occur throughout the reads, not at constant positions (see the FastQC output attached). It could have been a poor run, as both rep1 files from this study look like this. I am re-analysing data published by another lab in 2013 to get used to our computational pipeline and to see whether I can reproduce their results.
          Attached Files

          • #6
            Are you sure this is Sanger FASTQ quality encoding? That run looks pretty marginal, with Q-scores all over the place for all cycles. You would definitely want to trim this data before alignment.

            • #7
              Originally posted by GenoMax
              Are you sure this is Sanger FASTQ quality encoding? That run looks pretty marginal, with Q-scores all over the place for all cycles. You would definitely want to trim this data before alignment.
              TopHat said it is phred33, and this is the output from FastQC, so I think the encoding is probably correct. Given the low mapping rate of about 50%, I think the reads really are low quality, so that makes sense to me.

              Looking at this, it suggests the bias is weak and trimming would only have a marginal effect; am I wrong to think that? If so, where would you trim to?

              I have attached the post-mapping quality, where you can see a 3' bias. I was thinking of trimming reads to 80 nt, then mapping and seeing what the mapping rate is, but also of just removing some poor reads as described earlier.
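
              To make the 80 nt idea concrete, I think the fixed-length trim could be done with the FASTX-Toolkit's fastx_trimmer, something like this (again assuming -Q33 is accepted for phred+33 input; file names are placeholders):

              fastx_trimmer -Q33 -l 80 -i reads.fq -o trimmed80.fq

              As I understand it, -l 80 keeps only the first 80 bases of each read, so this is purely positional and ignores quality.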
              Attached Files

              • #8
                It looks like a possibly bimodal distribution where some reads are basically good and others are basically bad. As a result, your data will have a quantitative bias, since quality is somewhat sequence-dependent on Illumina platforms. This bias will be present whether you quality-trim, quality-filter, or map raw reads, but will be minimized by using an aligner that is very robust to errors (e.g. BBMap rather than TopHat). Because the quality issue is present at all positions in entire reads, quality-insensitive trimming to 80 bp may not help much.

                If you do quality-trimming on this data, I'd suggest a tool that implements an optimal trimming algorithm (BBDuk or seqtk) rather than something like fastx, which will give inferior results, and I'd recommend that you trim on both ends. So, for example:

                bbduk.sh in=reads.fq out=trimmed.fq qtrim=rl trimq=15 minlen=50

                That will ensure that the resulting trimmed read has an average quality of at least Q15 (and, in fact, is the longest possible substring such that every terminal substring has an average quality of at least Q15), which is fine for accurate mapping with BBMap. Reads shorter than 50 bp after trimming will be discarded.

                Alternatively, you can trim them on the fly while mapping with BBMap, which will prevent them from being trimmed shorter than a certain length and will just try mapping them anyway. That may minimize the bias incurred by trimming. I can try to give you an optimal command line for BBMap if you want to try it, and if you tell me what kingdom of organism you are mapping to or what its intron lengths are.
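
                Just to illustrate the on-the-fly approach (this is a rough sketch, not a tuned command; genome.fa, reads.fq, and mapped.sam are placeholders, and maxindel=200000 is only an assumption for a genome with long introns, so it would need to match your organism):

                bbmap.sh in=reads.fq out=mapped.sam ref=genome.fa qtrim=rl trimq=15 untrim maxindel=200000

                Here qtrim=rl trimq=15 quality-trims both ends to Q15 before alignment, untrim restores the trimmed bases (soft-clipped) in the output after mapping, and maxindel caps the longest splice/intron the aligner will consider.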
