  • Problem with quality in fastq file

    Hi all,

    I have a little problem with bowtie2 aligner related to the quality of my reads in fastq file. I have some raw RNAseq data (Illumina, single end, 50pb), and when I try to align it against my reference sequence, it pops me after a while messages like "read HWI-ST766:125........ has more quality values than read characters" or ""read HWI-ST766:125........ has spaces in quality check".

    So, obviously, there are some reads in my file that are "corrupted" in some way and bowtie2 doesn't like that.

    I tried deleting these specific lines with grep and sed. It worked, but it takes too long and I can't do it every time I run into this issue.

    So I was wondering whether I could somehow clean my data according to quality, or perhaps remove all the reads that make bowtie2 fail...

    Does anyone have a clue how to do this? It's frustrating, because I feel I'm not far from getting my results, but there is always something else!

    Here is my command for alignment (if it can help):

    bowtie2 -q -a -p 6 -t -x IndexFile -U FastqFile -S SamFile

    Thank you in advance!
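For the "remove every read that would make bowtie2 fail" idea, a minimal Python sketch could look like the following. It is not a bowtie2 feature, just an illustration: the function names are made up here, and it assumes a plain four-lines-per-record FASTQ. It drops records whose quality string differs in length from the sequence or contains a space, which are the two conditions named in the error messages above.

```python
def read_fastq(handle):
    """Yield (header, seq, plus, qual) from a four-lines-per-record FASTQ."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        yield tuple(s.rstrip("\r\n") for s in (header, seq, plus, qual))


def is_well_formed(record):
    """Check the two conditions from the error messages, plus basic layout."""
    header, seq, plus, qual = record
    return (
        header.startswith("@")
        and plus.startswith("+")
        and len(seq) > 0
        and len(seq) == len(qual)
        and " " not in qual
    )


def filter_fastq(in_handle, out_handle):
    """Copy well-formed records to out_handle; return (kept, dropped)."""
    kept = dropped = 0
    for record in read_fastq(in_handle):
        if is_well_formed(record):
            out_handle.write("\n".join(record) + "\n")
            kept += 1
        else:
            dropped += 1
    return kept, dropped
```

Call `filter_fastq` with two open text files, for example `open("reads.fastq")` and `open("clean.fastq", "w")` (hypothetical file names). If a record is missing a whole line, the four-line framing goes out of step and most of the following records will be dropped too, so a large `dropped` count is a hint that the file is damaged rather than merely noisy.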

  • #2
    Personally I would first try to redownload the FASTQ file in case it was corrupted over the network, and if applicable repeat the decompression as well - again, just in case there is a bad sector on your drive or something. It might also be worth running a test on your RAM (e.g. memcheck) to make sure that is working fine - otherwise you can get problems from that too, e.g. bases flipping as in http://mira-assembler.sourceforge.ne...onus_part.html



    • #3
      How did it get that way?

      Assuming the bowtie2 error messages speak the truth (which you can verify by examining the relevant fastq lines), I'd sure recommend tracing the problem back to its source, rather than trying to clean up the data after the fact.

      Are the bad reads interspersed with good ones, or do they fall at the end of the fastq file? In the latter case, you may have filled up your disk.

      What is the output of the instrument -- .bcl files? How do you turn that into fastq files?

      Do the reads which bowtie2 does NOT complain about look plausible? E.g., quality characters in the correct range?

      If all else fails, post a few (6) reads here, showing a bad read in context with other 'good' ones.

      --SP
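To answer the "interspersed, or at the end of the file?" question without reading the file by eye, something like this rough Python sketch might help. The function names are hypothetical, and it assumes four lines per record with no wrapped sequence lines. It returns the index of every malformed record, and a second helper says whether they all sit in one run at the end of the file (the pattern you would expect from a full disk or a truncated copy).

```python
from itertools import islice


def bad_record_positions(lines):
    """Return (total_records, indices of malformed records), 0-based.

    `lines` can be an open file or any iterable of strings.
    A trailing partial record (fewer than 4 lines) counts as malformed.
    """
    it = iter(lines)
    total = 0
    bad = []
    while True:
        chunk = [ln.rstrip("\r\n") for ln in islice(it, 4)]
        if not chunk:
            break
        ok = (
            len(chunk) == 4
            and chunk[0].startswith("@")
            and chunk[2].startswith("+")
            and len(chunk[1]) == len(chunk[3])
        )
        if not ok:
            bad.append(total)
        total += 1
    return total, bad


def only_at_end(total, bad):
    """True if the bad records form one contiguous run ending at the last record."""
    return bool(bad) and bad == list(range(total - len(bad), total))
```

The list of bad indices stays small unless most of the file is broken, so this streams through large files without loading them into memory.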



      • #4
        I too am running into this issue with quite a few datasets, using bowtie2 2.0.0-beta6.

        Examples:

        @SRR387921.488948 0303_20110429_2_SL_AWG_TG_NA11829_4_2pA_01003434289_1_4_41_117/1
        T10331232322220002110220221110022020211222032021222
        +
        !%85117+****&7(&=,'%%).%'((4).)61)%.,(&''7='10%-&,)

        @SRR096575.4651 VAB_0513_20101119_1_SP_ANG_TG_NA11830_3_1sA_01003380693_2853_102_63/1
        T322003021302112201213211122322210023002300122221.1
        +
        !9,%.7%6-/9.)975+%),8+(<.(*19*%+&%%*2%<)'*5*&.)%(!&

        @SRR096590.1165 VAB_0510_20101117_2_SP_ANG_TG_NA11831_5_1sA_01003380706_11279_16_41/1
        T300321021030023001031320311113312212333223222232.3
        +
        !557*.7;6925=46+:>-9:-690>;%(3-2-&5)/&'5)%8%&)*(%!2


        Strangely enough, these are all the first read in their respective files, and all of them appear to be correct (i.e. same number of quality values as read chars.)



        • #5
          Originally posted by kz26
          I too am running into this issue with quite a few datasets, using bowtie2 2.0.0-beta6.

          ...

          Strangely enough, these are all the first read in their respective files, and all of them appear to be correct (i.e. same number of quality values as read chars.)

          Those are colour space FASTQ, and frustratingly there seem to be two schools of thought on how many quality scores are needed, specifically whether there should be a score for the adaptor base or not.



          • #6
            maubp, what does that mean? I have the same problem as kz26. Help please!



            • #7
              I mean some sources include a quality for the adaptor, e.g. here we have an adapter plus 50 colour space calls. Should there be 51 qualities or just 50?

              Code:
              @SRR387921.488948 0303_20110429_2_SL_AWG_TG_NA11829_4_2pA_01003434289_1_4_41_117/1
              T10331232322220002110220221110022020211222032021222
              +
              !%85117+****&7(&=,'%%).%'((4).)61)%.,(&''7='10%-&,)
              That file has 51 quality scores, including one for the adapter. Some tools do not expect a quality for the adapter. So if we remove the "!" for the adapter "T" in this case we'd get:

              Code:
              @SRR387921.488948 0303_20110429_2_SL_AWG_TG_NA11829_4_2pA_01003434289_1_4_41_117/1
              T10331232322220002110220221110022020211222032021222
              +
              %85117+****&7(&=,'%%).%'((4).)61)%.,(&''7='10%-&,)
              I don't do any work with colour space, so I've not researched this issue. But this is my observation and guess about the apparent problem.
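The edit shown above (dropping the quality score that belongs to the leading base) could be automated along these lines. This is only a sketch of that guess, not something any particular aligner documents: `drop_primer_quality` is a made-up name, and the test for "looks like colour space" (one leading A/C/G/T followed only by 0-3 or '.') is an assumption based on the example reads in this thread.

```python
def drop_primer_quality(seq, qual):
    """Remove the first quality value when it belongs to the leading base.

    Only acts when `seq` looks like colour space (one base letter followed
    by colour calls 0-3 or '.') and `qual` has one value per character of
    `seq`, i.e. including the leading base. Otherwise `qual` is returned
    unchanged.
    """
    looks_like_colour_space = (
        len(seq) > 1
        and seq[0] in "ACGT"
        and all(c in "0123." for c in seq[1:])
    )
    if looks_like_colour_space and len(qual) == len(seq):
        return qual[1:]
    return qual
```

For the first read above, this turns the 51 quality characters into 50, matching the 50 colour calls. Whether a given tool wants 50 or 51 is something to check against that tool's own documentation before rewriting a whole file.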



              • #8
                What I have is this:

                @HWI-ST1146:66:C0YHCACXX:7:1101:2909:2074 1:N:0:ATCACG
                CCACTAGCTTTCCTGGCAC
                +
                JJEHIJIIJJJHEHFHFFF

                So the number of letters is the same for the read and the quality. I'm using Bowtie 0.12.7, and I've used it dozens of times before, but with output from older machines. This new data is from a HiSeq.



                • #9
                  Originally posted by afadda
                  What I have is this:

                  @HWI-ST1146:66:C0YHCACXX:7:1101:2909:2074 1:N:0:ATCACG
                  CCACTAGCTTTCCTGGCAC
                  +
                  JJEHIJIIJJJHEHFHFFF

                  So the number of letters is the same for the read and the quality. I'm using Bowtie 0.12.7, and I've used it dozens of times before, but with output from older machines. This new data is from a HiSeq.

                  Is there an error message? The recent Illumina pipelines use the original Sanger FASTQ encoding for quality scores - perhaps you are using an option specific to the obsolete Illumina-specific FASTQ encoding?
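If you are unsure which encoding a file uses, a common rule of thumb is to look at the range of quality characters. The Python sketch below is an illustration only (`guess_phred_offset` is a made-up name). It assumes offset-33 (Sanger style) data does not go above 'J' (Q41) and that offset-64 data never goes below ';', so any character below ';' points to offset 33 and any character above 'J' points to offset 64. When neither shows up, it returns None rather than guessing.

```python
def guess_phred_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) from quality strings.

    Returns None when the observed characters fit both encodings
    or when no quality characters were given.
    """
    codes = [ord(c) for q in quality_lines for c in q]
    if not codes:
        return None
    if min(codes) < 59:   # below ';' is impossible with an offset of 64
        return 33
    if max(codes) > 74:   # above 'J' is assumed not to occur with offset 33
        return 64
    return None
```

The single read quoted above only contains characters between 'E' and 'J', so on its own it is ambiguous. Feeding in the quality lines of a few thousand reads usually settles it.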



                  • #10
                    Yes. The message is:
                    Too few quality values for read: HWI-ST1146:66:C0YHCACXX:7:1101:8166:5424 1:N:0:ACTTGA
                    are you sure this is a FASTQ-int file?

                    My command line is:
                    bowtie -S -a --best --strata -v2 -m14 $reference $seqfile > $samfile --un $unalignfile



                    • #11
                      OK - so what does that read look like in the FASTQ input file? You showed a different read (which was only 19 bases long, and had as expected a matching 19 quality scores).



                      • #12
                        You're absolutely right. It was a programming mistake on my side when I was trimming the reads, so the read in the error message ended up with a quality string of a different length.
                        Thanks for troubleshooting!
                        (I should never program when sleepy.)
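For anyone writing their own trimming script, the usual way to avoid this kind of mismatch is to cut the sequence and the quality string in one place with the same slice. A minimal Python sketch, where `trim_record` is a hypothetical helper and not the script used in this thread:

```python
def trim_record(seq, qual, keep):
    """Keep the first `keep` positions of both the read and its qualities.

    Raises ValueError if the input is already inconsistent, so a bad
    record is caught here rather than later by the aligner.
    """
    if len(seq) != len(qual):
        raise ValueError(
            f"sequence has {len(seq)} characters but quality has {len(qual)}"
        )
    return seq[:keep], qual[:keep]
```

With the read posted earlier, `trim_record("CCACTAGCTTTCCTGGCAC", "JJEHIJIIJJJHEHFHFFF", 10)` gives a 10-base read with exactly 10 quality values.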
