  • Slightly different TopHat output from # of threads

    Dear all,

    I sent this to the Tuxedo mailing list 3 days ago and so far there haven't been any replies (though, yes, it's been the weekend). I'm repeating that message here in the hope that someone can offer some advice.

    I am encountering some strange output with TopHat -- hopefully someone here can explain to me what's going on or what I'm doing wrong.

    Basically, the results from TopHat version 2.0.13 seem to change drastically with the number of threads used for the data sets I'm analyzing. Generally, the more threads I use, the fewer alignments there are.

    For example, when I use fewer threads, the output BAM file is clearly larger, both in size and in number of lines. Going from 12 threads to 2 threads, for example, the file grows from 250 MB to 7 GB. At first, I thought there might be some redundancy in the file or something else that doesn't affect the final results. But I loaded the BAM files into IGV and I do see a difference in the number of reads that align. I expected no difference at all from changing the number of threads.

    In the attached screen capture of IGV, the three tracks from top to bottom are from using the same input file with TopHat. It depicts a region of 1.5 genes in mouse; but I see something like this throughout the data. The only difference is that the number of threads increases from 2 to 8 to 12. Am I missing something obvious?

    Any help would be appreciated!

    Ray
    Attached Files
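File size and an IGV view give only a rough picture of the difference described above. A minimal sketch of a more direct check is to count the alignment records in each output and test whether the counts agree. It assumes `samtools` is installed and that each run wrote TopHat's usual `accepted_hits.bam`; the directory names `p2` and `p12` and both function names are hypothetical.

```shell
#!/bin/sh
# Count alignment records in a BAM file (requires samtools on PATH).
count_alignments() {
    samtools view -c "$1"
}

# Succeed (exit 0) only if every count passed as an argument is identical.
all_counts_equal() {
    first=$1
    for n in "$@"; do
        [ "$n" = "$first" ] || return 1
    done
    return 0
}

# Hypothetical usage, one TopHat output directory per thread count:
#   c2=$(count_alignments p2/accepted_hits.bam)
#   c12=$(count_alignments p12/accepted_hits.bam)
#   all_counts_equal "$c2" "$c12" || echo "counts differ: $c2 vs $c12"
```

Equal counts do not prove the alignments are identical, but unequal counts confirm that the thread count changed the result.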

  • #2
    I don't think you are missing anything obvious... sounds like a major bug, in which only one thread is writing data. I suggest you roll back to an earlier version that you believe was working correctly.



    • #3
      Hi Brian,

      Previously, I had expected a larger file if I used more threads because of some kind of duplication in results. But this was the opposite of what I had expected. What you said makes sense and it never occurred to me at all..

      I guess I didn't expect there to be such a bug and presumed its use of threads was working correctly.

      Actually, I found this out by accident and don't know how far back I have to go to get a working TopHat. But it sounds like if I use a single thread, it'll run longer, but the results might be more correct.

      Thanks a lot for the advice!

      Ray



      • #4
        Hate to point out the obvious, but you should start by trying the latest version of TopHat, version 2.1.0.

        Another possibility is that you are running TopHat on a computing cluster and are requesting fewer processors than the number of threads. On my own computing cluster, I get all sorts of strange results if I request only one processor from the scheduler and then try to run 2 or more threads.

        Those are the two potential solutions that I would investigate: upgrading to the latest version of TopHat, which is my standard advice for any bug with any program, and verifying the number of processors available for multi-threading.

        P.S. I run TopHat just about every day, and the only serious problem I have ever experienced was when I asked the scheduler for fewer processors than the number of threads given as an argument to TopHat.
        Last edited by blancha; 11-01-2015, 05:51 PM.
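The processor check suggested above can be sketched as a small guard that runs before the aligner. This is only an illustration: the function names are hypothetical, and whether `nproc` reflects a scheduler's allocation depends on how that scheduler restricts jobs, which is an assumption here.

```shell
#!/bin/sh
# Succeed only if the requested thread count fits in the available processors.
# Both arguments are positive integers.
threads_fit() {
    requested=$1
    available=$2
    [ "$requested" -le "$available" ]
}

# Number of processors available to the current process (GNU coreutils nproc).
# Note: nproc also takes OMP_NUM_THREADS into account when that variable is set.
available_cpus() {
    nproc
}

# Hypothetical guard to place before an aligner call:
#   THREADS=8
#   threads_fit "$THREADS" "$(available_cpus)" || {
#       echo "requested $THREADS threads but only $(available_cpus) CPUs" >&2
#       exit 1
#   }
```

Failing early with a message is the point of the guard, since the problem described above produced no error in the log.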



        • #5
          Another environment variable that you may want to investigate is OMP_NUM_THREADS.

          It really depends on what kind of system you are running TopHat on, so it's hard to do more troubleshooting without more information about the system.

          It could help to set OMP_NUM_THREADS to the number of threads TopHat is asked to run with, if these processors are available on the system.



          • #6
            Hi,


            Originally posted by blancha View Post
            Hate to point out the obvious, but you should start by trying the latest version of TopHat, version 2.1.0.

            True, but 2.0.13 was the latest version packaged for Ubuntu 15.04, so I'm sure I wasn't the only one still running it. I just moved to 15.10, which does provide 2.1.0, and I am looking into it.


            Originally posted by blancha View Post
            Another possibility is that you are running TopHat on a computing cluster and are requesting fewer processors than the number of threads. On my own computing cluster, I get all sorts of strange results if I request only one processor from the scheduler and then try to run 2 or more threads.

            Actually, I'm running on a single computer but it is running a scheduler. However, even in such a scenario, there shouldn't be any "strange results", should there? I mean, logically there shouldn't be. In your example, the scheduler or OS should put a stop to it but if it does not, then TopHat shouldn't give strange results, should it?

            Anyway, in my case, I'm passing the same value for -p to both TopHat and the scheduler. But that ended up being error prone so I made the scheduler processor number very large. I can't imagine *that* being a problem...if you suspect it could be, I can remove the scheduler from the equation and run via the command line. Hmmmmm, might be worth trying.

            Originally posted by blancha View Post
            Those are the two potential solutions that I would investigate: upgrading to the latest version of TopHat, which is my standard advice for any bug with any program, and verifying the number of processors available for multi-threading.

            P.S. I run TopHat just about every day, and the only serious problem I have ever experienced was when I asked the scheduler for fewer processors than the number of threads given as an argument to TopHat.

            So, may I ask if you've run TopHat before on the same input but with various values of -p? I've tried several values from 2 to 16 (the limit of the computer I'm using) and see a gradual decrease in output size and reads mapped (as shown in IGV).

            I never did this before and just thought of doing it on a whim. So, I'm a bit surprised with the results.

            I'll give TopHat 2.1.0 a try and post what I find. But even if this is a problem with an older version, that's still a serious problem, isn't it? What I mean is, I have had projects using the older version of TopHat and never checked the effect of -p...

            Thanks a lot for your comments! It certainly gives me some things to try...

            Ray



            • #7
              Originally posted by blancha View Post
              Another environment variable that you may want to investigate is OMP_NUM_THREADS.

              It really depends on what kind of system you are running TopHat on, so it's hard to do more troubleshooting without more information about the system.

              It could help to set OMP_NUM_THREADS to the number of threads TopHat is asked to run with, if these processors are available on the system.
              Thanks for this as well! I've run OMP-based programs before but didn't know I had to set this environment variable. I thought the program could determine it by itself.

              I will give it a try -- thank you!

              Ray



              • #8
                Originally posted by rwan View Post
                I'll give TopHat 2.1.0 a try and post what I find. But even if this is a problem with an older version, that's still a serious problem, isn't it? What I mean is, I have had projects using the older version of TopHat and never checked the effect of -p...
                Unfortunately, it sounds like you should go back and re-evaluate all the data you processed with that version of TopHat, to be safe.

                As for the number of processors and number of threads... the number of processors should be completely transparent, and a deterministic program should give identical results for a large number of threads whether there is 1 processor or many processors.

                And by the way, I'd like to toss in a recommendation that you try BBMap for RNA-seq, as long as you're (possibly) going back and reprocessing a lot of data.
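The expectation stated above, that a deterministic program gives identical results regardless of thread count, can be tested directly rather than by file size. The following is a rough sketch: it compares two text dumps of alignment records while ignoring line order. The function name and file names are hypothetical, and it assumes the records were exported as text beforehand (for example with `samtools view`).

```shell
#!/bin/sh
# Succeed only if two text files contain the same lines, ignoring line order.
same_records() {
    a=$(mktemp) || return 2
    b=$(mktemp) || { rm -f "$a"; return 2; }
    # Sort both inputs bytewise so the comparison does not depend on locale.
    LC_ALL=C sort "$1" > "$a"
    LC_ALL=C sort "$2" > "$b"
    if cmp -s "$a" "$b"; then status=0; else status=1; fi
    rm -f "$a" "$b"
    return "$status"
}

# Hypothetical usage with alignments dumped to text:
#   samtools view p2/accepted_hits.bam  > p2.sam
#   samtools view p12/accepted_hits.bam > p12.sam
#   same_records p2.sam p12.sam && echo "identical alignments"
```

Sorting first matters because multi-threaded runs may write the same records in a different order, which alone would not indicate a problem.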



                • #9
                  So, may I ask if you've run TopHat before on the same input but with various values of -p? I've tried several values from 2 to 16 (the limit of the computer I'm using) and see a gradual decrease in output size and reads mapped (as shown in IGV).
                  Yes, the results were absolutely identical.
                  Only the runtime was shorter, obviously.

                  Actually, I'm running on a single computer but it is running a scheduler. However, even in such a scenario, there shouldn't be any "strange results", should there? I mean, logically there shouldn't be. In your example, the scheduler or OS should put a stop to it but if it does not, then TopHat shouldn't give strange results, should it?
                  I seem to remember that TopHat would run to completion and give bewildering results, without any error messages. After some unfortunate experiences, I was always very careful to request a number of processors equal to or greater than the number of threads on which TopHat runs. It's a dangerous "bug", since there are no error messages in the log.

                  So, you might want to check that you are requesting from the scheduler a number of processors equal to or greater than the number of threads on which TopHat will run.

                  You can also add the following command in your submission script to the scheduler, before running TopHat.

                  export OMP_NUM_THREADS=#threads_requested_for_TopHat

                  Either due to updates to the scheduler or to TopHat, I no longer need to export this variable in my job submission scripts.
                  Until two years ago, users on my computing cluster had to export this variable when submitting multi-threaded TopHat jobs, or they would get incorrect results.

                  I should mention too that TopHat is really just calling Bowtie1 or 2 to do the actual alignment, so you might want to verify that you also have the latest version of Bowtie1 or 2.
                  Last edited by blancha; 11-01-2015, 07:26 PM.
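One way to keep the exported variable and the `-p` value from drifting apart is to derive both from a single number. The sketch below only prints the two lines of a job script rather than running anything. The function name, output directory, index prefix, and reads file are hypothetical, and whether TopHat itself reads `OMP_NUM_THREADS` is not something the thread establishes.

```shell
#!/bin/sh
# Print the environment setting and the TopHat command line for one thread
# count, so the same number is used for OMP_NUM_THREADS and for -p.
# Arguments: threads, output directory, Bowtie index prefix, reads file.
tophat_job_lines() {
    threads=$1; outdir=$2; index=$3; reads=$4
    printf 'export OMP_NUM_THREADS=%s\n' "$threads"
    printf 'tophat -p %s -o %s %s %s\n' "$threads" "$outdir" "$index" "$reads"
}

# Hypothetical usage: generate the body of a submission script.
#   tophat_job_lines 8 tophat_p8 mm10_index sample.fastq > job.sh
```

Generating the lines from one argument removes the manual step of keeping two thread settings in sync.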



                  • #10
                    Originally posted by blancha View Post
                    Yes, the results were absolutely identical.
                    Only the runtime was shorter, obviously.

                    Ok! That was what I was expecting so I'll try to figure out what's going on.


                    Originally posted by blancha View Post
                    I seem to remember that TopHat would run to completion and give bewildering results, without any error messages. After some unfortunate experiences, I was always very careful to request a number of processors equal to or greater than the number of threads on which TopHat runs. It's a dangerous "bug", since there are no error messages in the log.

                    So, you might want to check that you are requesting from the scheduler a number of processors equal to or greater than the number of threads on which TopHat will run.

                    You can also add the following command in your submission script to the scheduler, before running TopHat.

                    export OMP_NUM_THREADS=#threads_requested_for_TopHat

                    Either due to updates to the scheduler or to TopHat, I no longer need to export this variable in my job submission scripts.
                    Until two years ago, users on my computing cluster had to export this variable when submitting multi-threaded TopHat jobs, or they would get incorrect results.

                    I should mention too that TopHat is really just calling Bowtie1 or 2 to do the actual alignment, so you might want to verify that you also have the latest version of Bowtie1 or 2.

                    I'm actually using the packages included with Ubuntu. I know that means the software could be 6 months (or so) out of date compared to downloading the latest version, but it's easier to maintain since upgrading the OS also upgrades the software. (And I hope there are other users as lazy as me who will end up using the same set of program versions as me.)

                    I will give what you suggest a try. Fortunately, I'm on a single-user system (i.e., it's just me) so I can bypass the scheduler if there's a possibility that the scheduler is the cause of the problems.

                    Thank you!

                    Ray



                    • #11
                      Originally posted by Brian Bushnell View Post
                      Unfortunately, it sounds like you should go back and re-evaluate all the data you processed with that version of TopHat, to be safe.

                      As for the number of processors and number of threads... the number of processors should be completely transparent, and a deterministic program should give identical results for a large number of threads whether there is 1 processor or many processors.

                      And by the way, I'd like to toss in a recommendation that you try BBMap for RNA-seq, as long as you're (possibly) going back and reprocessing a lot of data.

                      Yes, that is what I was expecting though I wouldn't be surprised if a careless mistake on my part was the cause of what I'm seeing. The old data was passed on to someone else -- I'll have to let them know.

                      Thanks for the suggestion about BBMap! I wasn't aware of it.

                      I did run STAR, and its output file was larger than TopHat's with 2 threads. I am currently running it with 1 thread, and if the file size ends up being similar, then your original suspicion was correct. STAR, regardless of the number of threads, seems to give similar outputs (though I have only checked file sizes and have not yet loaded the BAM files into IGV).

                      Ray
