  • TopHat fails to catch error thrown by Bowtie, gives incomplete results

    Hi,

    I've recently discovered a strange behaviour in TopHat: it can sometimes give incomplete (or even incorrect) results, due to an error while running Bowtie on the junction sequence database.

    I'm using TopHat 1.0.13, with Bowtie 0.12.5 (or 0.12.3), on a Linux x86_64 computation cluster with a Lustre filesystem. The data I'm using are single-end, 76bp long reads. I'm running TopHat with the following parameters:

    -p 1 -a 8 -i 40 -m 1 -I 1000000 -F 0 --coverage-search --microexon-search

    For one of the runs where I get incomplete results, I had noticed this weird thing in the output:

    [Thu May 27 17:25:49 2010] Mapping reads against segment_juncs with Bowtie
    [Thu May 27 17:25:50 2010] Mapping reads against segment_juncs with Bowtie
    [Thu May 27 17:25:51 2010] Mapping reads against segment_juncs with Bowtie

    The weird thing is that mapping the reads against segment_juncs should take a lot more time, since I have about 20 million reads. So I thought that there might be an error in building the bowtie index for the splice junctions, but the bowtie_build.log shows no error. However, I find the following type of errors in some other log files from the run:

    ############################################

    filebd4xji.log

    Error reading ebwt array: returned 41750080, length was 168445184
    Your index files may be corrupt; please try re-building or re-downloading.
    A complete index consists of 6 files: XYZ.1.ebwt, XYZ.2.ebwt, XYZ.3.ebwt,
    XYZ.4.ebwt, XYZ.rev.1.ebwt, and XYZ.rev.2.ebwt. The XYZ.1.ebwt and
    XYZ.rev.1.ebwt files should have the same size, as should the XYZ.2.ebwt and
    XYZ.rev.2.ebwt files.

    ############################################

    So it seems that even though the Bowtie index for the junction sequences was built correctly, the alignment of reads against the junction index fails. I've run several series of tests and found that this Bowtie error does not occur every time (it seems to be more or less random), but it is quite frequent for large datasets. It is not yet clear why this happens; it might be OS-specific or filesystem-specific, so I am currently testing several solutions (see also the parallel thread "Bowtie can't read index files").

    However, the bigger issue here is that TopHat does not catch the error thrown by Bowtie, and finishes with apparent success, while giving only an incomplete set of exon-exon junctions. This is quite dangerous, since most users will not search for "Error" messages in the log files if TopHat has finished successfully. So I would advise TopHat users to check the log files for Bowtie errors before proceeding with their analyses.
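A minimal sketch of that log check, assuming the run's log files end in `.log` and sit together in one directory (the directory path is an assumption; adjust it to wherever your TopHat run writes its logs). The search patterns come from the Bowtie message quoted above, and `find_log_errors` is a hypothetical helper name, not part of TopHat.

```python
import re
import sys
from pathlib import Path

# Patterns based on the Bowtie message quoted in this post; extend as needed.
ERROR_PATTERN = re.compile(r"error|may be corrupt", re.IGNORECASE)


def find_log_errors(log_dir):
    """Return (file name, line number, text) for every suspicious log line."""
    hits = []
    for path in sorted(Path(log_dir).glob("*.log")):
        # errors="replace" keeps the scan going if a log has odd bytes in it.
        with open(path, errors="replace") as handle:
            for lineno, line in enumerate(handle, start=1):
                if ERROR_PATTERN.search(line):
                    hits.append((path.name, lineno, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Usage: python check_logs.py <log directory>
    found = find_log_errors(sys.argv[1])
    for name, lineno, text in found:
        print(f"{name}:{lineno}: {text}")
    # A non-zero exit status lets a pipeline stop before downstream analysis.
    sys.exit(1 if found else 0)
```

A case-insensitive match on "error" can also flag harmless lines, so treat the hits as a prompt to read the log rather than as proof of failure.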

    Any comments or suggestions on how to solve this problem would be much appreciated.

    Best wishes,

    Anamaria

  • #2
    In reply to anecsulea:

    This is an interesting bug - thanks for reporting it. There is code to check that the call to bowtie-build succeeded and that the index is good (or at least passes bowtie-build's internal checks), but for some reason that code is not catching the exception. I'll look into it further.

    Can you re-run this with --keep-tmp enabled, and then try to run the bowtie-build step listed in run.log manually? If that step is failing (some or all of the time), you might want to check the size of the juncs_db.fa file that TopHat generates and feeds to bowtie-build. I'm curious as to how big it is and/or whether it's corrupt in some way.
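For the size and sanity check on juncs_db.fa, a rough sketch along these lines may be enough. It only assumes the file is plain FASTA; `fasta_summary` is a hypothetical helper name, and the script does not depend on TopHat or Bowtie.

```python
import os
import sys


def fasta_summary(path):
    """Report file size, number of FASTA records, and total sequence length."""
    n_records = 0
    n_bases = 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                n_records += 1  # header line starts a new record
            else:
                n_bases += len(line)  # sequence line
    return {
        "bytes": os.path.getsize(path),
        "records": n_records,
        "bases": n_bases,
    }


if __name__ == "__main__":
    # Usage: python fasta_summary.py juncs_db.fa
    print(fasta_summary(sys.argv[1]))
```

An empty file, zero records, or sequence lines with no preceding header would all point to a problem in how the file was written.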


    • #3
      In reply to Cole Trapnell (#2):

      As far as I can see, there is no reason why the code that checks that bowtie-build succeeded should catch this exception, since the error comes not from bowtie-build but from the bowtie aligner. As I explained above, the index is built correctly and is definitely not corrupt, yet bowtie fails to read it into memory when trying to align the reads. This issue is discussed in more detail in a parallel thread in this forum ("Bowtie fails to read index files"), and I have managed to find a solution that works on the computation cluster that I'm using. However, I still believe that TopHat's failure to catch this error is a serious problem that needs to be corrected in future versions of the software.

      Best wishes,

      Anamaria


      • #4
        In reply to anecsulea (#3):

        OK - I see where things are going awry. From the parallel thread, it sounds like your filesystem/OS is interacting with Bowtie in a way that produces the failure. A recent version of TopHat streamlined the way Bowtie is called, and it looks like I failed to put back some of the exception handling code. It's there now and will be present in the next release.
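For readers wrapping aligner calls in their own scripts, the general idea of such a check can be sketched as follows. This is not TopHat's actual code: `run_checked` is a hypothetical helper, and it assumes the child program signals failure through a non-zero exit status, which is worth confirming for the Bowtie version in use.

```python
import subprocess


def run_checked(cmd, log_path):
    """Run cmd with stderr written to log_path; raise if it exits non-zero."""
    with open(log_path, "w") as log:
        result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=log)
    if result.returncode != 0:
        # Stop here instead of carrying on with incomplete alignments.
        raise RuntimeError(
            f"{cmd[0]} exited with status {result.returncode}; see {log_path}"
        )
    return result.returncode
```

Raising on failure means the calling pipeline aborts visibly rather than finishing with apparent success on partial results. If the child can fail while still exiting with status 0, a scan of the log for error messages is needed as well.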
