  • Running tophat on a cluster

    Hello, all.

    I'm currently having issues while running tophat on a cluster with a pbs scheduler. From what I can tell, it's taking far too long to perform the analysis (it hasn't even completed yet - I've had to keep reallocating time on the cluster since it's taking longer than I had anticipated - currently I'm sitting at 100+ hours run-time). The data is paired-end GAIIx data - each lane comes in at about 9.7GB for the fastq file.

    I guess my question is: when running tophat, do I set the -p flag to the number of nodes I have allocated, or only to the number of cores on each node? Or is it a combination of the two, nodes*cores?

    Sorry if this seems like a silly question, but google didn't return anything helpful from here (some information was close though), so I thought I'd give it a shot.

  • #2
    -p is the number of threads you want to run on each node. I think you just need to run n lanes on n nodes. However, if your nodes have multicore processors, you can set -p to the number of cores on each node to make your run faster. Otherwise, using -p is not helpful.
    Last edited by Daehwan; 12-03-2010, 09:39 AM.
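    A minimal sketch of the "n lanes on n nodes" idea above, assuming a PBS scheduler. The lane names, core count, index basename, and script name are all placeholders, not tested values:

    ```shell
    #!/bin/bash
    # Sketch: submit one tophat job per lane, each confined to a single node,
    # with -p matching the cores requested for that node (8 here is an example).
    for lane in lane1 lane2 lane3; do
        qsub -l select=1:ncpus=8 -v LANE="$lane" run_tophat.pbs
    done

    # run_tophat.pbs would then contain something like:
    #   tophat -p 8 -o "${LANE}_out" hg19_index "${LANE}_R1.fastq" "${LANE}_R2.fastq"
    ```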

    • #3
      That was what I figured, and is what I have started doing with other runs. It just still feels like it's taking too long to complete a run. But I guess this is just what happens when dealing with a massive amount of biological data.

      • #4
        Originally posted by NM_010117 View Post
        Hello, all.

        I'm currently having issues while running tophat on a cluster with a pbs scheduler. From what I can tell, it's taking far too long to perform the analysis (it hasn't even completed yet - I've had to keep reallocating time on the cluster since it's taking longer than I had anticipated - currently I'm sitting at 100+ hours run-time). The data is paired-end GAIIx data - each lane comes in at about 9.7GB for the fastq file.

        I guess my question is: when running tophat, do I set the -p flag to the number of nodes I have allocated, or only to the number of cores on each node? Or is it a combination of the two, nodes*cores?

        Sorry if this seems like a silly question, but google didn't return anything helpful from here (some information was close though), so I thought I'd give it a shot.
        I am facing a similar issue...were you able to resolve it?
        ~Thanks!

        • #5
          Originally posted by rpauly View Post
          I am facing a similar issue...were you able to resolve it?
          ~Thanks!
          It is normal for a tophat run to take a few hours depending on the amount of your input data. You just need to be patient.

          If you are running on a cluster it is important to keep the threads for an individual job confined to one physical node (depending on the scheduler your cluster uses, you would need to provide the right options).
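          As a sketch, assuming PBSpro's "select" syntax (Torque would use -l nodes=1:ppn=8 instead), a job header that keeps all tophat threads on one physical node might look like this. The core count, memory, and file names are illustrative only:

          ```shell
          #!/bin/bash
          # Sketch: request ONE node and run tophat with -p equal to the cores
          # requested on that node, so threads never span physical machines.
          #PBS -l select=1:ncpus=8:mem=16gb
          #PBS -l walltime=48:00:00

          tophat -p 8 -o tophat_out hg19_index reads_R1.fastq reads_R2.fastq
          ```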

          • #6
            Originally posted by GenoMax View Post
            It is normal for a tophat run to take a few hours depending on the amount of your input data. You just need to be patient.

            If you are running on a cluster it is important to keep the threads for an individual job confined to one physical node (depending on the scheduler your cluster uses, you would need to provide the right options).
            Thank you for the quick reply!
            I am using a PBS cluster and analyzing 101 bp paired-end Illumina RNA-seq data, with -p 12 for tophat and ncpus=24:mem=16gb on the cluster. Is there another way I could optimize the process? I gave it a walltime of 30 hours, which does not seem to be sufficient, so I am going to increase it to 72 hrs.

            • #7
              What is the size of your input data and what genome are you aligning against?

              Did the job get killed after 30 h? That should be enough unless you have a billion-read dataset; if that is the case, you may want to split it and start multiple tophat jobs.
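              For the split-and-run approach, one hedged sketch (a FASTQ record is 4 lines, so the chunk size must be a multiple of 4; the tiny demo file and chunk size below are only for illustration, and for paired-end data R1 and R2 must be split identically):

              ```shell
              #!/bin/bash
              # Sketch: split a FASTQ into fixed-size chunks so each chunk can be
              # aligned by its own tophat job.
              printf '@r%d\nACGT\n+\nIIII\n' 1 2 3 4 5 6 > reads_R1.fastq  # 6-read demo file
              split -l 8 -d reads_R1.fastq chunk_R1_   # 8 lines = 2 reads per chunk
              ls chunk_R1_*                            # chunk_R1_00 chunk_R1_01 chunk_R1_02
              ```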

              • #8
                Originally posted by GenoMax View Post
                What is the size of your input data and what genome are you aligning against?

                Did the job get killed after 30 h? That should be enough unless you have a billion-read dataset; if that is the case, you may want to split it and start multiple tophat jobs.
                The fastq files are close to 10GB; I am aligning to the older human reference, hg19.
                Yes, the job got killed after 30 hrs, and it has close to 3 million reads. I had also previously dedicated the process to 1 node. I have 10 samples, so splitting the files would be hard...maybe I should give STAR a shot.

                • #9
                  10G is not that big. Something here does not sound right. Do you have an idea of how many reads the job went through before it got killed? Did you get a partial accepted_hits.bam file?

                  If you want to post your PBS script we can take a look at how you submitted the job (remove any identifying information such as file paths/names).

                  • #10
                    Originally posted by GenoMax View Post
                    10G is not that big. Something here does not sound right. Do you have an idea of how many reads the job went through before it got killed? Did you get a partial accepted_hits.bam file?

                    If you want to post your PBS script we can take a look at how you submitted the job (remove any identifying information such as file paths/names).

                    No, I did not get a partial accepted_hits.bam file, but it did give me an error saying the walltime was exceeded. Please see my PBS script below:
                    #!/bin/bash
                    #PBS -N tophat_cms23055_2624-40399001
                    #PBS -l walltime=30:00:00
                    #PBS -l select=1:ncpus=24:mem=16gb


                    #PBS -o /home/rpauly/2624-40399001/cms23055.log
                    #PBS -e /home/rpauly/2624-40399001/cms23055.err

                    module load samtools/0.1.19
                    module load bowtie/1.0.1

                    cd /home/rpauly


                    /home/rpauly/tophat-2.1.1.Linux_x86_64/tophat --bowtie1 --fusion-search --no-coverage-search -o /scratch1/rpauly/2624-40399001/cms23055_tophat_output -p 20 -G /home/rpauly/refFlat_Oct_2016.gtf /home/rpauly/tophat-2.1.1.Linux_x86_64/bowtie2-2.2.9/genomes/hg19 /home/rpauly/2624-40399001/cms23055_S28_L006_R1_001.fastq.gz /home/rpauly/2624-40399001/cms23055_S28_L006_R2_001.fastq.gz >/home/rpauly/2624-40399001/cms23055_error

                    ~Thanks!

                    • #11
                      Did you look at the log and err files to see if they had anything related?

                      Is "/home/rpauly/tophat-2.1.1.Linux_x86_64/bowtie2-2.2.9/genomes/hg19" the basename for your bowtie1 index files?

                      • #12
                        Originally posted by GenoMax View Post
                        Did you look at the log and err files to see if they had anything related?

                        Is "/home/rpauly/tophat-2.1.1.Linux_x86_64/bowtie2-2.2.9/genomes/hg19" the basename for your bowtie1 index files?
                        There was no error in the log or err file; it just stopped running.
                        I have attached a screenshot of the bowtie1 index files.

                        Thanks again!
                        Attached Files

                        • #13
                          Those appear to be bowtie2 genome index files (if I follow the directory names at the top of the page). They will not work with bowtie1 (its index format is different), which you are specifying in your tophat command. Is there a reason you are using --bowtie1?
                          Last edited by GenoMax; 10-31-2016, 07:59 AM.
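                          To see the difference at the file level, a small sketch (the directory and dummy files below are created purely for demonstration; real indexes come from bowtie-build or bowtie2-build):

                          ```shell
                          #!/bin/bash
                          # Sketch: bowtie1 indexes end in .ebwt, bowtie2 indexes end in .bt2;
                          # tophat's --bowtie1 flag requires the former. Dummy files stand in
                          # for a real index here.
                          mkdir -p demo_genomes
                          touch demo_genomes/hg19.1.bt2 demo_genomes/hg19.rev.1.bt2  # pretend bowtie2 index
                          index_base="demo_genomes/hg19"

                          if compgen -G "${index_base}*.ebwt" > /dev/null; then
                              echo "bowtie1 index found: compatible with --bowtie1"
                          elif compgen -G "${index_base}*.bt2" > /dev/null; then
                              echo "bowtie2 index found: drop --bowtie1 or build .ebwt files"
                          fi
                          ```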

                          • #14
                            Originally posted by GenoMax View Post
                            Those appear to be bowtie2 genome index files (if I follow the directory names at the top of the page). They will not work with bowtie1 (those are different), which you are specifying in your tophat command. Is there a reason you are using --bowtie1?
                            So that was the problem? But I did not get any error messages indicating this!
                            The only reason I was using bowtie1 was because I read it does better with fusion detection than bowtie2.

                            Also, I just downloaded the hg19_c.ebwt.zip file (which I assume contains bowtie1 index files?) and added it to the same folder. I will try rerunning the process and see if that helps.
                            ~Thanks !

                            • #15
                               Hopefully getting the bowtie1 indexes (not sure where you got them from, but there should be multiple files) will do the trick. I would put them in a different directory and change the name in your TopHat command to avoid any further "issues".
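                               If the downloaded archive does not work out, building the bowtie1 indexes from a reference FASTA is another option. A sketch, assuming bowtie-build (shipped with bowtie1) is on the PATH and hg19.fa is the reference file (both assumptions; paths are placeholders):

                               ```shell
                               #!/bin/bash
                               # Sketch: build bowtie1 (.ebwt) indexes into their own directory,
                               # then point tophat at the new basename.
                               mkdir -p bowtie1_index
                               bowtie-build hg19.fa bowtie1_index/hg19
                               # produces hg19.1.ebwt .. hg19.4.ebwt plus hg19.rev.1.ebwt and hg19.rev.2.ebwt

                               # tophat --bowtie1 --fusion-search --no-coverage-search \
                               #   -o tophat_out bowtie1_index/hg19 R1.fastq.gz R2.fastq.gz
                               ```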
