  • prep_reads error when running Tophat

    I am running TopHat on a set of test reads and got the following error:

    [Thu Apr 29 16:48:07 2010] Beginning TopHat run (v1.0.13)
    -----------------------------------------------
    [Thu Apr 29 16:48:07 2010] Preparing output location ./tophat_out/
    [Thu Apr 29 16:48:07 2010] Checking for Bowtie index files
    [Thu Apr 29 16:48:07 2010] Checking for reference FASTA file
    [Thu Apr 29 16:48:07 2010] Checking for Bowtie
    Bowtie version: 0.12.5.0
    [Thu Apr 29 16:48:07 2010] Checking reads
    seed length: 101bp
    format: fastq
    quality scale: phred33 (default)
    [FAILED]
    Error: could not execute prep_reads

    The prep_reads.log file contains this information:

    prep_reads v1.0.13
    ---------------------------
    Saw ASCII character 10 but expected 33-based Phred qual.
    terminate called after throwing an instance of 'int'

    I looked through the data, and the only ASCII character 10s I could find are the newlines at the end of each line. The test data is attached. Can someone help?

  • #2
    If this is Illumina data, were your reads processed with pipeline v1.3 or later? If so, you have to include the --solexa-quals option in your TopHat run.


    • #3
      This is Illumina data. What I received was a sequence.txt file, and I have converted it into FASTQ (Sanger) format. Do I still need to use --solexa-quals?


      • #4
        Fastq files include quality scores, so the answer would be yes (once again, only if your reads were processed with pipeline v1.3 or later).


        • #5
          I have already converted the Illumina quality score to Sanger standard quality score (shift each character by 31). Do I still need to use the option?
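A minimal sketch of the conversion described in #5, assuming the input uses the Phred+64 encoding of Illumina pipeline v1.3 or later. The function name `phred64_to_phred33` is made up for illustration. A plain shift by 31 is not an exact conversion for the older Solexa scale, so characters below ASCII 64 are rejected rather than silently shifted.

```python
def phred64_to_phred33(quality: str) -> str:
    """Re-encode a Phred+64 quality string as Phred+33 (Sanger).

    Each character is shifted down by 31 (64 - 33). Characters below
    ASCII 64 cannot be Phred+64 scores, so they raise an error.
    """
    converted = []
    for ch in quality:
        if ord(ch) < 64:
            raise ValueError(
                f"character {ch!r} (ASCII {ord(ch)}) is below the Phred+64 range"
            )
        converted.append(chr(ord(ch) - 31))
    return "".join(converted)


if __name__ == "__main__":
    # 'h' is Phred 40 in Phred+64 and becomes 'I' in Phred+33.
    print(phred64_to_phred33("hhhh@"))
```

Only the quality line of each record should be passed through a function like this; the header, sequence, and separator lines stay unchanged.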


          • #6
            I guess not. At this point my knowledge ends and I would go running to the nearest full-time bioinformatics geek. One last thing though: I do see an extra newline at the end of the sample you posted, so I would double-check your input file to make sure that you don't have any in there.

            Sorry and best of luck,

            Shurjo


            • #7
              Shurjo, thanks for the help. I have checked the file again to make sure there is no extra newline. These two reads were taken from a large data file. prep_reads apparently runs fine for the first 200,000 or so reads and then chokes on these two, and I just could not see how they differ from the other reads.


              • #8
                Can you verify that the FASTQ file is correctly formatted? The fact that TopHat is choosing a seed length of 101bp tells me something's up with that file. The seed length ought to be 25 for 50bp reads or longer. TopHat's FASTQ parser occasionally screws up when FASTQ records are incorrectly formatted or when the read and/or quality sequences span more than one line in the file. We plan to replace the parser in an upcoming version to make it more robust to this kind of thing.


                • #9
                  Cole, could you take a look at the fastq file I attached? The original fastq file was converted from the Illumina SCARF format and contains millions of reads. prep_reads gave the error after 10 minutes, and the two reads I attached seem to be responsible for the problem.


                  • #10
                    Originally posted by bzhang
                    Saw ASCII character 10 but expected 33-based Phred qual.
                    terminate called after throwing an instance of 'int'

                    I looked through data and the only ASCII character 10s I could find are the newlines at the end of each line. The test data is attached. Can someone help?
                    Are you on Linux/Unix? It sounds like the file has DOS/Windows newlines (CR, LF - i.e. ASCII 13, 10) rather than Unix-style ones (LF only). Try using dos2unix on it (or a similar tool).
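One way to test the line-ending hypothesis from #10 before reaching for dos2unix is to count the line endings in the raw bytes. This is a small sketch; the function names are illustrative and not part of TopHat or dos2unix.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF (Windows), bare LF (Unix) and bare CR (old Mac) endings."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LFs that are not part of a CRLF pair
    cr = data.count(b"\r") - crlf   # CRs that are not part of a CRLF pair
    return {"crlf": crlf, "lf": lf, "cr": cr}


def to_unix_newlines(data: bytes) -> bytes:
    """Rewrite CRLF and bare CR endings as LF, roughly what dos2unix does."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


if __name__ == "__main__":
    sample = b"@read1\r\nACGT\r\n+\r\nIIII\r\n"
    print(count_line_endings(sample))
    print(to_unix_newlines(sample))
```

If the `crlf` and `cr` counts are both zero, the file already uses Unix newlines and the cause of the error lies elsewhere.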


                    • #11
                      I think I figured out the problem. The Illumina sequence file uses '.' for undetermined bases, and prep_reads filters these out when reading the sequence. This creates a mismatch between the sequences and the quality scores. For the problematic reads I attached, the first sequence contains 11 '.'s, so prep_reads reads in only 90 bases. There happens to be a '@' in the quality scores after position 90, and prep_reads treats it as the start of a new record. This messes up the next record, hence the error. I don't know whether using '.' in the sequences is a new convention adopted by Illumina. I am surprised that I am the first one to encounter this problem. For now I guess I'll just convert all those '.'s into 'N's, but prep_reads could certainly be more robust.

                      I am sort of lucky, in the sense that my data contains enough reads to expose this problem. If I had only 200,000 reads, I might not have seen it and would have happily carried on with the downstream analysis, unaware of the mismatch between the sequences and the quality scores.
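The workaround from #11 (turning '.' into 'N') can be sketched as follows, assuming strict four-line FASTQ records with no wrapped sequence or quality lines. The function name `mask_dots` and the file names in the usage comment are hypothetical. Only the sequence line is rewritten, because '.' is also a legitimate Phred+33 quality character (ASCII 46) and a global search-and-replace would corrupt the quality strings.

```python
from typing import Iterable, Iterator


def mask_dots(lines: Iterable[str]) -> Iterator[str]:
    """Replace '.' with 'N' on sequence lines only.

    Assumes each record is exactly four lines: header, sequence,
    separator, quality. The sequence is the second line of each record.
    """
    for index, line in enumerate(lines):
        if index % 4 == 1:
            yield line.replace(".", "N")
        else:
            yield line


if __name__ == "__main__":
    record = ["@read1\n", "AC..GT\n", "+\n", "II.@II\n"]
    print("".join(mask_dots(record)), end="")

    # Usage on real files (file names are placeholders):
    # with open("reads.fastq") as src, open("reads.masked.fastq", "w") as dst:
    #     dst.writelines(mask_dots(src))
```

Because the replacement is one character for one character, the sequence and quality lines keep the same length, which is the property that was broken when the '.' characters were dropped.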


                      • #12
                        Thanks for the heads up. We'll add the bug to our tracker and address it in the next release. Others are likely to have this problem.


                        • #13
                          Originally posted by Cole Trapnell
                          Can you verify that the FASTQ file is correctly formatted? The fact that TopHat is choosing a seed length of 101bp tells me something's up with that file. The seed length ought to be 25 for 50bp reads or longer.
                          I am also getting seed lengths equal to the read length (54, 76bp). TopHat runs fine to the end, but accepted_hits.sam has zero spliced reads (for the 76bp run). I run it in paired-end mode, so I assumed that something was wrong with my --mate-inner-dist / --mate-std-dev values (60, 20). I checked with the lab and corrected these to (20, 20), but still got no splices. The input FASTQ files were filtered using the R ShortRead package. The same files seem to work fine with other mappers (SOAP, GEM).

                          Is there any way I can check that my FASTQ files are Tophat compatible?


                          • #14
                            From what I understand by reading the code, at least in the recent versions, the seed length is equal to the shortest read length. So if all the reads are of the same length, the seed length is set to the read length. I am not sure about the impact of setting the seed length this way; I guess I have to read more of the paper to understand this.


                            • #15
                              Originally posted by darked89
                              I am also getting seed lengths equal to the read length (54, 76bp). TopHat runs fine to the end, but accepted_hits.sam has zero spliced reads (for the 76bp run). I run it in paired-end mode, so I assumed that something was wrong with my --mate-inner-dist / --mate-std-dev values (60, 20). I checked with the lab and corrected these to (20, 20), but still got no splices. The input FASTQ files were filtered using the R ShortRead package. The same files seem to work fine with other mappers (SOAP, GEM).

                              Is there any way I can check that my FASTQ files are Tophat compatible?
                              It seems TopHat calls Bowtie with the option -v 2, which, according to the manual, means that at most 2 mismatches are allowed and that the option -l (which specifies the seed length) is ignored. I think your FASTQ files are fine as long as they don't contain non-alphabetical characters in the sequences.
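For the question of how to check a FASTQ file before running it through TopHat, a rough pre-flight check along the lines discussed in this thread might look like the sketch below. It is not TopHat's own validation. The rules it applies (four-line records, letters only in the sequence, equal sequence and quality lengths, printable quality characters from ASCII 33 upward) are assumptions drawn from the posts above, and `check_fastq` is a made-up name.

```python
import string
from typing import Iterable, List


def check_fastq(lines: Iterable[str], max_problems: int = 10) -> List[str]:
    """Return a list of human-readable problems found in four-line FASTQ records."""
    problems: List[str] = []
    record: List[str] = []
    for number, raw in enumerate(lines, start=1):
        record.append(raw.rstrip("\r\n"))
        if len(record) < 4:
            continue
        header, seq, plus, qual = record
        first = number - 3  # line number of this record's header
        record = []
        if not header.startswith("@"):
            problems.append(f"line {first}: header does not start with '@'")
        if not plus.startswith("+"):
            problems.append(f"line {first + 2}: separator does not start with '+'")
        if not seq or any(c not in string.ascii_letters for c in seq):
            problems.append(f"line {first + 1}: sequence is empty or has non-letter characters")
        if len(seq) != len(qual):
            problems.append(
                f"line {first + 3}: {len(qual)} quality values for {len(seq)} bases"
            )
        if any(ord(c) < 33 or ord(c) > 126 for c in qual):
            problems.append(f"line {first + 3}: quality character outside ASCII 33-126")
        if len(problems) >= max_problems:
            break
    if record:
        problems.append("file ends with an incomplete record")
    return problems


if __name__ == "__main__":
    sample = ["@read1\n", "AC..T\n", "+\n", "IIII\n"]
    for problem in check_fastq(sample):
        print(problem)
```

A file that passes this check can still fail for other reasons, but a record containing '.' in the sequence, as in #11, would be reported on its sequence line.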
