  • Optimal amount of RAM per processor in TopHat?

    Hi folks, just a quick observation and question.

    I recently ran TopHat for human derived mRNA 2x100bp paired end files with 12411993 reads for each pair. I used 5 processors and 20GB of RAM to process on our SGE cluster. When I got the report file back it was reported that the max memory used was just under 4.25GB (which surprised me a little).

    So, I want to limit the resources that I'm using on our cluster (obviously). Does anyone have a feel for how much memory I should request as a function of the number of processors that I use in TopHat?

    Thanks,
    Sam
    Last edited by sdarko; 05-19-2011, 05:39 AM.

  • #2
    Originally posted by sdarko
    Hi folks, just a quick observation and question.

    I recently ran TopHat for human derived mRNA 2x100bp paired end files with 12411993 reads for each pair. I used 5 processors and 20GB of RAM to process on our SGE cluster. When I got the report file back it was reported that the max memory used was just under 4.25GB (which surprised me a little).

    So, I want to limit the resources that I'm using on our cluster (obviously). Does anyone have a feel for how much memory I should request as a function of the number of processors that I use in TopHat?

    Thanks,
    Sam
    I believe the amount of RAM depends mostly (if not only) on the size of the reference genome and ~4 GB sounds about right for the human one. So, yes, I think requesting 20 GB was unnecessary! Adding more RAM doesn't speed up the algorithm.
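
    For what it's worth, on our SGE cluster I size the request per slot, since h_vmem is typically enforced per slot rather than per job. Here's a minimal sketch of how I'd submit the run above (assuming your grid defines an "smp"-style parallel environment; the PE name and whether h_vmem is per-slot are site-specific):

    Code:
    #!/bin/sh
    #$ -S /bin/sh
    #$ -cwd
    #$ -pe smp 5        # 5 slots on one node, matching tophat -p 5
    #$ -l h_vmem=2G     # per-slot: 5 x 2G = 10G total, well above the ~4.25G peak

    tophat -p $NSLOTS -o tophat_out hg19_index reads_1.fastq reads_2.fastq

    (hg19_index and the read files are placeholders.) Since the Bowtie index, not the thread count, dominates the footprint, the total request shouldn't need to grow much with -p.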

    Anyone please correct me if I got this wrong...

    Dario



    • #3
      How do you run TopHat in an SGE environment?

      Here are the contents of a shell script I use, but it constantly returns the error:
      no such file or directory!

      Code:
      #!/bin/sh
      #$ -V
      #$ -N SEQ_practical
      #$ -S /bin/sh
      #$ -cwd
      #$ -m abes
      #$ -M [email protected]
      #$ -pe ompi 4-12
      #$ -q all.q

      mpirun -np $NSLOTS tophat -p 1 -i 30 -I 20000 --segment-length 16 --segment-mismatches 1 -G /opt/bowtie-0.12.7/indexes/refGene_mm8.gtf -o ~/set11/tophatOut11/11Tophat.out /opt/bowtie-0.12.7/indexes/mm8 ~/set11/Galaxy11_Ba_wt_starved_fastqsanger.txt

      sleep 60


      It doesn't matter whether I use the full path or just ~/.....; I get the same error.

      If I run it as a serial job, no error is returned, but that way I'm not utilizing the cluster's power.
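
      Come to think of it, tophat isn't an MPI program, so mpirun -np $NSLOTS is probably just launching $NSLOTS identical single-threaded copies that all write to the same output directory. Next I'll try letting tophat do its own threading instead (a sketch, assuming the grid defines an "smp"-style parallel environment; the PE name is site-specific):

      Code:
      #!/bin/sh
      #$ -V
      #$ -N SEQ_practical
      #$ -S /bin/sh
      #$ -cwd
      #$ -pe smp 4
      #$ -q all.q

      # one tophat process using all granted slots as threads
      tophat -p $NSLOTS -i 30 -I 20000 --segment-length 16 --segment-mismatches 1 \
          -G /opt/bowtie-0.12.7/indexes/refGene_mm8.gtf \
          -o ~/set11/tophatOut11/11Tophat.out \
          /opt/bowtie-0.12.7/indexes/mm8 \
          ~/set11/Galaxy11_Ba_wt_starved_fastqsanger.txt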



      • #4
        Can anyone corroborate Dario's understanding? Is there really no advantage to using more than 4 GB of RAM for the human genome? I can see logical reasons for both yes and no answers.

        Cole Trapnell's 2009 paper mentions memory usage twice.

        (1) "This memory-efficient data structure allows Bowtie to scan reads against a mammalian genome using around 2 GB of memory (within what is commonly available on a standard desktop computer)."
        It's hard to tell whether that's a sales pitch to PC users or a nice way of telling me that I can't utilize my machine's full potential.

        (2) "The entire TopHat run took 21 h, 50 min on a 3.0 GHz Intel Xeon 5160 processor, using <4 GB of RAM, a throughput of nearly 2.2 million reads per CPU hour."
        This suggests to me that 4GB really is a cap. Why would he choose low RAM usage over speed if >4GB was an option? That seems silly.

        Though it could be a red herring, here's someone claiming >20 GB of RAM usage:
        Application of sequencing to RNA analysis (RNA-Seq, whole transcriptome, SAGE, expression analysis, novel organism mining, splice variants)


        In reading through the bowtie-build documentation and building hg19, there are mentions of 64-bit options (--bmax in particular) for utilizing >4GB of RAM. At first, I interpreted these options as a way to build ebwt indexes that specifically allow >4GB memory usage for bowtie/tophat, thus speeding up alignment. On second glance, it seems the >4GB RAM option applies only to the indexing/building process. Ostensibly, this would speed up ebwt index production but yield indexes identical to those produced with the default options, and thus have no effect on bowtie/tophat memory usage.
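
        For concreteness, this is the kind of build invocation I mean (a sketch based on my reading of the bowtie-build 0.12.x docs, where --noauto disables the automatic memory-fitting heuristic and --bmaxdivn caps the suffix-sorting bucket size as a fraction of the reference length):

        Code:
        # larger sort buckets = more RAM and a faster *build* only;
        # the resulting hg19.*.ebwt files should be identical to a default build
        bowtie-build --noauto --bmaxdivn 2 --dcv 1024 hg19.fa hg19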

        Is there any clearly-written documentation on this? Every link I can Google is purple. Thanks.

        --------------------------------------------
        I'm in the middle of the first run on a new 16GB machine and I haven't seen more than 3.6GB used yet (see below for progress). At any given moment, seven of the eight cores are maxed out and the eighth sits around 50% usage, while a similar run on bowtie proper keeps all cores maxed constantly. I'm not sure whether that indicates memory is a bottleneck for tophat, but if the processor were the only bottleneck, I'd expect all cores to run at full tilt constantly.

        Code:
        [Wed Jan 4 20:54:56 2012] Beginning TopHat run (v1.3.3)
        -----------------------------------------------
        [Wed Jan 4 20:54:56 2012] Preparing output location /media/data/grimmer/tophat-out//
        [Wed Jan 4 20:54:56 2012] Checking for Bowtie index files
        [Wed Jan 4 20:54:56 2012] Checking for reference FASTA file
        [Wed Jan 4 20:54:56 2012] Checking for Bowtie
        Bowtie version: 0.12.7.0
        [Wed Jan 4 20:54:56 2012] Checking for Samtools
        Samtools Version: 0.1.18
        [Wed Jan 4 20:54:56 2012] Generating SAM header for /hg19
        [Wed Jan 4 20:54:58 2012] Preparing reads
        format: fastq
        quality scale: solexa33 (reads generated with GA pipeline version < 1.3)
        Left reads: min. length=101, count=84425814
        Right reads: min. length=101, count=84440371
        [Wed Jan 4 21:59:02 2012] Mapping left_kept_reads against hg19 with Bowtie
        [Wed Jan 4 23:02:20 2012] Processing bowtie hits
        [Wed Jan 4 23:48:57 2012] Mapping left_kept_reads_seg1 against hg19 with Bowtie (1/4)
        [Thu Jan 5 00:41:23 2012] Mapping left_kept_reads_seg2 against hg19 with Bowtie (2/4)
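
        For anyone else watching usage, I'm capturing the actual peak with GNU time rather than eyeballing a monitor (a sketch; /usr/bin/time -v is the GNU binary, not the shell builtin, and the index base and read files below are placeholders for my run):

        Code:
        /usr/bin/time -v tophat -p 8 -o tophat-out /hg19 left.fq right.fq 2> tophat.time
        grep 'Maximum resident set size' tophat.time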
        Last edited by grimmer; 01-05-2012, 01:25 AM.



        • #5
          Originally posted by SEQond
          How do you run TopHat in an SGE environment?

          Here are the contents of a shell script I use, but it constantly returns the error:
          no such file or directory!

          Code:
          #!/bin/sh
          #$ -V
          #$ -N SEQ_practical
          #$ -S /bin/sh
          #$ -cwd
          #$ -m abes
          #$ -M [email protected]
          #$ -pe ompi 4-12
          #$ -q all.q
          
          mpirun -np $NSLOTS tophat -p 1 -i 30 -I 20000 --segment-length 16 --segment-mismatches 1 -G /opt/bowtie-0.12.7/indexes/refGene_mm8.gtf -o ~/set11/tophatOut11/11Tophat.out /opt/bowtie-0.12.7/indexes/mm8 ~/set11/Galaxy11_Ba_wt_starved_fastqsanger.txt
          
          sleep 60

          It doesn't matter whether I use the full path or just ~/.....; I get the same error.

          If I run it as a serial job, no error is returned, but that way I'm not utilizing the cluster's power.
          Here is my script to run tophat and then kick off cufflinks in SGE.

          Code:
          #!/bin/sh
          #$ -N tophat_20111220_ucsc
          #$ -S /bin/bash
          #$ -q long.q 
          #$ -M [email protected]
          #$ -m be
          #$ -l h_vmem=4G
          #$ -t 1-8
          
          flowcell=$1
          index=$2
          lane=$SGE_TASK_ID
          
          echo $flowcell
          echo $lane
          echo $index
          
          export PATH=$PATH:/usr/local/bio_apps/fastx_toolkit/bin:/usr/local/bio_apps/tophat-1.3.3/bin:/usr/local/bio_apps/bowtie:/usr/local/bio_apps/samtools:/usr/local/bio_apps/cufflinks-1.2.1
          export PATH=$PATH:/usr/sge/bin/lx24-amd64
          
          output="ucsc_index_"$index"_20111221"
          mate_inner_dist=-35
          std_dev=45
          reference=/data/lab/reference_sets/Homo_sapiens/UCSC/genomic_alignment/bowtie/genome
          gtf=/data/lab/reference_sets/Homo_sapiens/UCSC/annotation/genes/genes.gtf
          mask=/data/lab/reference_sets/Homo_sapiens/UCSC/annotation/genes/repeat_masker_UCSC.gtf
          
          left="s_"$lane"_1_"$index"_sequence"
          right="s_"$lane"_2_"$index"_sequence"
          dir="/data/lab/data/mRNA_Seq_20110426/fc"$flowcell"/"$lane"/"
          cd $dir
          tophat -o $output -r $mate_inner_dist --solexa1.3-quals --max-multihits 10 -G $gtf --library-type fr-unstranded $reference $left $right
          cd $dir$output
          samtools index accepted_hits.bam
          samtools idxstats accepted_hits.bam
          
          reference=/data/lab/reference_sets/Homo_sapiens/UCSC/genomic_alignment/bowtie/genome.fa
          
          dir="/data/lab/data/mRNA_Seq_20110426/fc"$flowcell"/"$lane"/"$output
          cd $dir
          output="ucsc_cufflinks_index_"$index
          
          cufflinks -M $mask -u -q --no-update-check -o $output -g $gtf -b $reference --library-type fr-unstranded -L $flowcell"_"$lane"_"$index accepted_hits.bam
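
          To submit it, I pass the flowcell and index barcode as arguments, and the -t 1-8 task array fans out over the lanes via $SGE_TASK_ID. For example (script name, flowcell, and barcode below are made-up placeholders):

          Code:
          qsub run_tophat_cufflinks.sh 42 ACAGTG

          The h_vmem=4G request caps each task at 4 GB, which lines up with the ~4 GB peak reported above for a human Bowtie index.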



          • #6
            Originally posted by dariober
            I believe the amount of RAM depends mostly (if not only) on the size of the reference genome and ~4 GB sounds about right for the human one. So, yes, I think requesting 20 GB was unnecessary! Adding more RAM doesn't speed up the algorithm.

            Anyone please correct me if I got this wrong...

            Dario
            I think you're right.



            • #7
              If bowtie used more RAM, it would probably just be loading the human genome reference multiple times =)

              I think the main goals of these algorithms are speed and minimizing the amount of RAM required, so that people don't need a special computer to analyze their data.
