Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Tophat halts for no apparent reason

    We're trying to align a set of stranded, paired-end reads to the rat genome with Tophat. Tophat runs through the early stages just fine, but then halts during segment search -- no errors, no warnings, nothing. This is an example of our output:

    Code:
    [2014-05-07 13:22:44] Preparing reads
    	 left reads: min. length=101, max. length=101, 27329923 kept reads (10347 discarded)
    	right reads: min. length=101, max. length=101, 27195050 kept reads (145220 discarded)
    [2014-05-07 13:35:52] Mapping left_kept_reads to genome Rn5 with Bowtie2 
    [2014-05-07 14:34:36] Mapping left_kept_reads_seg1 to genome Rn5 with Bowtie2 (1/3)
    [2014-05-07 15:06:40] Mapping left_kept_reads_seg2 to genome Rn5 with Bowtie2 (2/3)
    [2014-05-07 15:43:09] Mapping left_kept_reads_seg3 to genome Rn5 with Bowtie2 (3/3)
    [2014-05-07 16:34:38] Mapping right_kept_reads to genome Rn5 with Bowtie2 
    [2014-05-07 17:36:40] Mapping right_kept_reads_seg1 to genome Rn5 with Bowtie2 (1/3)
    [2014-05-07 18:11:10] Mapping right_kept_reads_seg2 to genome Rn5 with Bowtie2 (2/3)
    We've got a big machine for this (512GB memory, ~7TB free disk space, 24 processors) so I cannot believe it's a system issue.

    I'm going to try several avenues: I'll run as root in case file permissions are an issue, I'll try running it on a tiny subset of the data to see if it's related to the reads. Otherwise I'm a bit stumped.

    I should mention we're using Python 2.7.3 and running under Bio-Linux (Ubuntu 12.04).

    Any ideas greatly appreciated!

  • #2
    I would give a look at that file - right_kept_reads_seg2
    Check if it was created at all.

    Comment


    • #3
      One should never run regular programs as root.

      Are you running bio-linux as a virtual machine or is that what is running natively on server hardware?

      Comment


      • #4
        Thanks for your fast responses. I agree that running as root isn't a great idea, but I'm grasping at straws a bit here. As it turns out, I haven't had to try running as root yet (see below). To answer Ohad's question, it did create a right_kept_reads_seg2 file.

        We started the same job twice using the command:

        Code:
        tophat -a 10 -g 1 --microexon-search -p 1 --segment-length=33 --segment-mismatches=2 --mate-inner-dist 31 --mate-std-dev 107 -o test_dir --solexa-quals --library-type=fr-firststrand Rn5 test_dir/sample1_R1_1.fastq test_dir/sample1_R1_2.fastq
        Both times it halted early at the same step, but with no indication of a problem.

        Next I simply restarted the alignments using the '--restart' option and this time they ran to completion?! This was a surprise.

        Unfortunately we've got about 30 more FASTQ pairs we want to run and I'm not much closer to an answer than I was before. Running tophat twice on each pair doesn't seem like a good solution. I'm concerned there is something systemic that is not set up properly and that we'll get poor alignment fidelity as a result.

        Comment


        • #5
          You need to take a look at your log file. I think run.log
          Also , what do you mean by "halts" ? are you sure it's not running ? does it crash at all ?
          You are using -p 1 , why ? you have many processors...

          Comment


          • #6
            @CompBio: Why are you using "--solexa-quals"? If your data is of recent vintage it is almost certainly in sanger fastq format.

            Has this install of TopHat been validated (used) before? If this is the first time you are using the install then do yourself a favor and run the test data that is available (http://tophat.cbcb.umd.edu/tutorial.shtml: see "test the install") with Tophat to make sure there is no problem with the install.

            Comment


            • #7
              Thanks again for the quick responses. We are currently testing our pipline, including timing. As we have 24 processors (and plenty of memory/disk space), we anticipate running 24 alignments concurrently. Hence I use '-p 1' to restrict it to a single processor.

              As for the quality scores, our options (tophat 2.0.9) are as follows:
              Code:
                  --solexa-quals
                  --solexa1.3-quals          (same as phred64-quals)
                  --phred64-quals            (same as solexa1.3-quals)
                  -Q/--quals
                  --integer-quals
              Our reads are Illumina 1.9 (Sanger, verified using FASTQC, as you rightly guessed). As this is a recent vintage I was under the impression that solexa1.3/phred64 was correct, but when I tried it, tophat gave me errors. We bypassed that particular error by switching to solexa-quals; unfortunately the Tophat documentation doesn't mention Sanger at all.

              Comment


              • #8
                Sanger quality is phred+33 (http://en.wikipedia.org/wiki/FASTQ_format#Encoding). No need to explicitly specify quality since your reads are already in sanger format (phred33 is TopHat default).

                You have not addressed a couple of previous questions:

                Have you tried the test data set to verify if the install of tophat is working as expected (or have you used it for other analysis before this)?

                Are you running biolinux in a virtual machines or as the main OS on this hardware?

                Comment


                • #9
                  Originally posted by GenoMax View Post
                  Sanger quality is phred+33 (http://en.wikipedia.org/wiki/FASTQ_format#Encoding). No need to explicitly specify quality since your reads are already in sanger format (phred33 is TopHat default).

                  You have not addressed a couple of previous questions:

                  Have you tried the test data set to verify if the install of tophat is working as expected (or have you used it for other analysis before this)?

                  Are you running biolinux in a virtual machines or as the main OS on this hardware?
                  Thanks re: tophat defaults, didn't see that in the documentation.

                  To the other questions:

                  When I say the process 'halts' I mean we use 'ps' or 'top' and see no evidence of any tophat-related scripts running. CPU and disk usage are nominal. The last modification time for tophat-related files is not within the past 2-3 hours. And as I mentioned before, the log file shows no progress after "Mapping right_kept_reads_seg2..."

                  I have run tophat on a tiny subset of the data (100,000 records) and it worked just fine. I have also run it on another data set without problems. The key differences seem to be the data sets and the user/login name. Since 'tophat --restart' worked, I don't think the data sets are the issue.

                  Bio-Linux (Ubuntu 12.04) is indeed the main OS for this hardware.

                  Update: I did find a possible culprit: typically we run our wrapper script in background, redirect stderr and stdout to a file, and log off. Thus I tried using 'nohup' to bypass SIGHUPs and this time it ran to completion.

                  I'm hopeful the solution really is that simple, but I'm not convinced because we do not set 'huponexit' for any users, either in the system-wide bashrc file or in user-specific files. Thus as I understand it, SIGHUP signals should not be an issue. Perhaps an experienced Linux sysadmin can shed some light.

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Current Approaches to Protein Sequencing
                    by seqadmin


                    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                    04-04-2024, 04:25 PM
                  • seqadmin
                    Strategies for Sequencing Challenging Samples
                    by seqadmin


                    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                    03-22-2024, 06:39 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 04-11-2024, 12:08 PM
                  0 responses
                  30 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 10:19 PM
                  0 responses
                  32 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-10-2024, 09:21 AM
                  0 responses
                  28 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 04-04-2024, 09:00 AM
                  0 responses
                  52 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X