SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
reason for low mapping rate?? miaom RNA Sequencing 3 05-10-2014 08:25 AM
Tophat fusion crashed for unknown reason ymc Bioinformatics 0 06-21-2013 04:18 AM
any reason to compile GATK from source? Kotoro Bioinformatics 2 04-22-2012 04:36 PM
Apparent duplication levels incongruence between bismark and fastqc with BS-Seq data gcarbajosa Bioinformatics 2 12-13-2011 09:43 AM
Getting started 454, academic center has no apparent core facility dkoelleseattle General 1 09-28-2009 07:20 PM

Reply
 
Thread Tools
Old 05-10-2014, 01:35 AM   #1
CompBio
Member
 
Location: Bristol, UK

Join Date: Aug 2009
Posts: 26
Default Tophat halts for no apparent reason

We're trying to align a set of stranded, paired-end reads to the rat genome with Tophat. Tophat runs through the early stages just fine, but then halts during segment search -- no errors, no warnings, nothing. This is an example of our output:

Code:
[2014-05-07 13:22:44] Preparing reads
	 left reads: min. length=101, max. length=101, 27329923 kept reads (10347 discarded)
	right reads: min. length=101, max. length=101, 27195050 kept reads (145220 discarded)
[2014-05-07 13:35:52] Mapping left_kept_reads to genome Rn5 with Bowtie2 
[2014-05-07 14:34:36] Mapping left_kept_reads_seg1 to genome Rn5 with Bowtie2 (1/3)
[2014-05-07 15:06:40] Mapping left_kept_reads_seg2 to genome Rn5 with Bowtie2 (2/3)
[2014-05-07 15:43:09] Mapping left_kept_reads_seg3 to genome Rn5 with Bowtie2 (3/3)
[2014-05-07 16:34:38] Mapping right_kept_reads to genome Rn5 with Bowtie2 
[2014-05-07 17:36:40] Mapping right_kept_reads_seg1 to genome Rn5 with Bowtie2 (1/3)
[2014-05-07 18:11:10] Mapping right_kept_reads_seg2 to genome Rn5 with Bowtie2 (2/3)
We've got a big machine for this (512GB memory, ~7TB free disk space, 24 processors) so I cannot believe it's a system issue.

I'm going to try several avenues: I'll run as root in case file permissions are an issue, I'll try running it on a tiny subset of the data to see if it's related to the reads. Otherwise I'm a bit stumped.

I should mention we're using Python 2.7.3 and running under Bio-Linux (Ubuntu 12.04).

Any ideas greatly appreciated!
CompBio is offline   Reply With Quote
Old 05-10-2014, 03:33 AM   #2
Ohad
Member
 
Location: Israel TA

Join Date: Jul 2013
Posts: 28
Default

I would give a look at that file - right_kept_reads_seg2
Check if it was created at all.
Ohad is offline   Reply With Quote
Old 05-10-2014, 06:31 AM   #3
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079
Default

One should never run regular programs as root.

Are you running bio-linux as a virtual machine or is that what is running natively on server hardware?
GenoMax is offline   Reply With Quote
Old 05-10-2014, 11:58 PM   #4
CompBio
Member
 
Location: Bristol, UK

Join Date: Aug 2009
Posts: 26
Default

Thanks for your fast responses. I agree that running as root isn't a great idea, but I'm grasping at straws a bit here. As it turns out, I haven't had to try running as root yet (see below). To answer Ohad's question, it did create a right_kept_reads_seg2 file.

We started the same job twice using the command:

Code:
tophat -a 10 -g 1 --microexon-search -p 1 --segment-length=33 --segment-mismatches=2 --mate-inner-dist 31 --mate-std-dev 107 -o test_dir --solexa-quals --library-type=fr-firststrand Rn5 test_dir/sample1_R1_1.fastq test_dir/sample1_R1_2.fastq
Both times it halted early at the same step, but with no indication of a problem.

Next I simply restarted the alignments using the '--restart' option and this time they ran to completion?! This was a surprise.

Unfortunately we've got about 30 more FASTQ pairs we want to run and I'm not much closer to an answer than I was before. Running tophat twice on each pair doesn't seem like a good solution. I'm concerned there is something systemic that is not set up properly and that we'll get poor alignment fidelity as a result.
CompBio is offline   Reply With Quote
Old 05-11-2014, 03:03 AM   #5
Ohad
Member
 
Location: Israel TA

Join Date: Jul 2013
Posts: 28
Default

You need to take a look at your log file. I think run.log
Also , what do you mean by "halts" ? are you sure it's not running ? does it crash at all ?
You are using -p 1 , why ? you have many processors...
Ohad is offline   Reply With Quote
Old 05-11-2014, 11:07 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079
Default

@CompBio: Why are you using "--solexa-quals"? If your data is of recent vintage it is almost certainly in sanger fastq format.

Has this install of TopHat been validated (used) before? If this is the first time you are using the install then do yourself a favor and run the test data that is available (http://tophat.cbcb.umd.edu/tutorial.shtml: see "test the install") with Tophat to make sure there is no problem with the install.
GenoMax is offline   Reply With Quote
Old 05-11-2014, 09:58 PM   #7
CompBio
Member
 
Location: Bristol, UK

Join Date: Aug 2009
Posts: 26
Default

Thanks again for the quick responses. We are currently testing our pipline, including timing. As we have 24 processors (and plenty of memory/disk space), we anticipate running 24 alignments concurrently. Hence I use '-p 1' to restrict it to a single processor.

As for the quality scores, our options (tophat 2.0.9) are as follows:
Code:
    --solexa-quals
    --solexa1.3-quals          (same as phred64-quals)
    --phred64-quals            (same as solexa1.3-quals)
    -Q/--quals
    --integer-quals
Our reads are Illumina 1.9 (Sanger, verified using FASTQC, as you rightly guessed). As this is a recent vintage I was under the impression that solexa1.3/phred64 was correct, but when I tried it, tophat gave me errors. We bypassed that particular error by switching to solexa-quals; unfortunately the Tophat documentation doesn't mention Sanger at all.
CompBio is offline   Reply With Quote
Old 05-12-2014, 05:12 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079
Default

Sanger quality is phred+33 (http://en.wikipedia.org/wiki/FASTQ_format#Encoding). No need to explicitly specify quality since your reads are already in sanger format (phred33 is TopHat default).

You have not addressed a couple of previous questions:

Have you tried the test data set to verify if the install of tophat is working as expected (or have you used it for other analysis before this)?

Are you running biolinux in a virtual machines or as the main OS on this hardware?
GenoMax is offline   Reply With Quote
Old 05-13-2014, 10:59 PM   #9
CompBio
Member
 
Location: Bristol, UK

Join Date: Aug 2009
Posts: 26
Default

Quote:
Originally Posted by GenoMax View Post
Sanger quality is phred+33 (http://en.wikipedia.org/wiki/FASTQ_format#Encoding). No need to explicitly specify quality since your reads are already in sanger format (phred33 is TopHat default).

You have not addressed a couple of previous questions:

Have you tried the test data set to verify if the install of tophat is working as expected (or have you used it for other analysis before this)?

Are you running biolinux in a virtual machines or as the main OS on this hardware?
Thanks re: tophat defaults, didn't see that in the documentation.

To the other questions:

When I say the process 'halts' I mean we use 'ps' or 'top' and see no evidence of any tophat-related scripts running. CPU and disk usage are nominal. The last modification time for tophat-related files is not within the past 2-3 hours. And as I mentioned before, the log file shows no progress after "Mapping right_kept_reads_seg2..."

I have run tophat on a tiny subset of the data (100,000 records) and it worked just fine. I have also run it on another data set without problems. The key differences seem to be the data sets and the user/login name. Since 'tophat --restart' worked, I don't think the data sets are the issue.

Bio-Linux (Ubuntu 12.04) is indeed the main OS for this hardware.

Update: I did find a possible culprit: typically we run our wrapper script in background, redirect stderr and stdout to a file, and log off. Thus I tried using 'nohup' to bypass SIGHUPs and this time it ran to completion.

I'm hopeful the solution really is that simple, but I'm not convinced because we do not set 'huponexit' for any users, either in the system-wide bashrc file or in user-specific files. Thus as I understand it, SIGHUP signals should not be an issue. Perhaps an experienced Linux sysadmin can shed some light.
CompBio is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 11:32 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO