  • sdarko
    Member
    • Apr 2009
    • 52

    Optimal amount of RAM per processor in TopHat?

    Hi folks, just a quick observation and question.

    I recently ran TopHat on human-derived mRNA 2x100bp paired-end files with 12411993 reads for each pair. I used 5 processors and 20GB of RAM on our SGE cluster. When I got the report file back, the max memory used was just under 4.25GB (which surprised me a little).

    So, I want to limit the resources that I'm using on our cluster (obviously). Does anyone have a feel for how much memory I should request as a function of the number of processors that I use in TopHat?

    Thanks,
    Sam
    Last edited by sdarko; 05-19-2011, 05:39 AM.
  • dariober
    Senior Member
    • May 2010
    • 311

    #2
    I believe the amount of RAM depends mostly (if not only) on the size of the reference genome and ~4 GB sounds about right for the human one. So, yes, I think requesting 20 GB was unnecessary! Adding more RAM doesn't speed up the algorithm.

    Anyone please correct me if I got this wrong...

    Dario
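
To put rough numbers on this: on many SGE installations `h_vmem` is configured as a per-slot limit, so the scheduler multiplies the request by the number of slots. Below is a minimal sketch of sizing a per-slot request from an observed peak, assuming that per-slot configuration (check `qconf -sc` on your own cluster). The helper name `per_slot_mb` and the default 20% headroom are illustrative assumptions, not anything TopHat or SGE prescribes.

```shell
#!/bin/sh
# Sketch: per-slot h_vmem request (in MB) for a multi-threaded job.
# Assumes h_vmem is enforced per slot; helper name and headroom are illustrative.
per_slot_mb() {
    peak_mb=$1              # observed peak memory of the whole job, in MB
    slots=$2                # number of slots (threads) requested
    headroom_pct=${3:-20}   # safety margin on top of the observed peak
    total_mb=$(( peak_mb * (100 + headroom_pct) / 100 ))
    # Ceiling division so the slots together cover the total.
    echo $(( (total_mb + slots - 1) / slots ))
}

# Example: ~4.25 GB peak (4352 MB) spread over 5 slots.
per_slot_mb 4352 5   # prints 1045
```

With that result one might request something like `-l h_vmem=1100M` alongside a 5-slot parallel environment. If your cluster treats `h_vmem` as a per-job limit instead, request the total rather than the per-slot figure.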

    • SEQond
      Member
      • Jul 2010
      • 27

      #3
      How do you run TopHat in the SGE environment?

      Here are the contents of a shell script I use, but it constantly returns the error
      "no such file or directory".

      #!/bin/sh
      #$ -V
      #$ -N SEQ_practical
      #$ -S /bin/sh
      #$ -cwd
      #$ -m abes
      #$ -M [email protected]
      #$ -pe ompi 4-12
      #$ -q all.q

      mpirun -np $NSLOTS tophat -p 1 -i 30 -I 20000 --segment-length 16 --segment-mismatches 1 -G /opt/bowtie-0.12.7/indexes/refGene_mm8.gtf -o ~/set11/tophatOut11/11Tophat.out /opt/bowtie-0.12.7/indexes/mm8 ~/set11/Galaxy11_Ba_wt_starved_fastqsanger.txt

      sleep 60


      It doesn't matter whether I use the full path or just ~/.....; I get the same error.

      If I run it as a serial job, no error is returned, but that way I am not using the cluster's power.

      • grimmer
        Junior Member
        • Jun 2011
        • 5

        #4
        Can anyone corroborate Dario's understanding? Is there really no advantage / use for more than 4GB RAM for the human genome? I can see logical reasons for both yes and no answers.

        Cole Trapnell's 2009 paper mentions memory usage twice.

        (1) "This memory-efficient data structure allows Bowtie to scan reads against a mammalian genome using around 2 GB of memory (within what is commonly available on a standard desktop computer)."
        It's hard to tell whether that's a sales pitch to PC users or a nice way of telling me that I can't utilize my machine's full potential.

        (2) "The entire TopHat run took 21 h, 50 min on a 3.0 GHz Intel Xeon 5160 processor, using <4 GB of RAM, a throughput of nearly 2.2 million reads per CPU hour."
        This suggests to me that 4GB really is a cap. Why would he choose low RAM usage over speed if >4GB was an option? That seems silly.

        Though it could be a red herring, here's someone claiming >20GB of RAM usage:
        Application of sequencing to RNA analysis (RNA-Seq, whole transcriptome, SAGE, expression analysis, novel organism mining, splice variants)


        In reading through the bowtie-build documentation and building hg19, I found mentions of options (--bmax in particular) that let the build use >4GB RAM. At first, I interpreted these options as a way to build ebwt indexes that allow >4GB memory usage for bowtie/tophat, thus speeding up alignment. On second glance, it seems the >4GB RAM option only applies to the indexing/building process. Ostensibly, this would speed up ebwt index production but result in ebwt indexes identical to those produced with default options. Thus, no effect on bowtie/tophat memory usage.

        Is there any clearly-written documentation on this? Every link I can Google is purple. Thanks.

        --------------------------------------------
        I'm in the middle of the first run on a new 16GB machine and I haven't seen more than 3.6GB used yet (see below for progress). At any given moment, 7/8 cores are maxed and 1/8 cores is around 50% usage, while a similar run on bowtie proper has all cores maxed constantly. I'm not sure if that's an indication that memory is a bottleneck for tophat or not. It's hard to believe that the processor is the only bottleneck and all cores aren't constantly at full tilt.

        [Wed Jan 4 20:54:56 2012] Beginning TopHat run (v1.3.3)
        -----------------------------------------------
        [Wed Jan 4 20:54:56 2012] Preparing output location /media/data/grimmer/tophat-out//
        [Wed Jan 4 20:54:56 2012] Checking for Bowtie index files
        [Wed Jan 4 20:54:56 2012] Checking for reference FASTA file
        [Wed Jan 4 20:54:56 2012] Checking for Bowtie
        Bowtie version: 0.12.7.0
        [Wed Jan 4 20:54:56 2012] Checking for Samtools
        Samtools Version: 0.1.18
        [Wed Jan 4 20:54:56 2012] Generating SAM header for /hg19
        [Wed Jan 4 20:54:58 2012] Preparing reads
        format: fastq
        quality scale: solexa33 (reads generated with GA pipeline version < 1.3)
        Left reads: min. length=101, count=84425814
        Right reads: min. length=101, count=84440371
        [Wed Jan 4 21:59:02 2012] Mapping left_kept_reads against hg19 with Bowtie
        [Wed Jan 4 23:02:20 2012] Processing bowtie hits
        [Wed Jan 4 23:48:57 2012] Mapping left_kept_reads_seg1 against hg19 with Bowtie (1/4)
        [Thu Jan 5 00:41:23 2012] Mapping left_kept_reads_seg2 against hg19 with Bowtie (2/4)
        Last edited by grimmer; 01-05-2012, 01:25 AM.
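
One way to check peak memory without watching a system monitor is GNU time's verbose mode, which reports a "Maximum resident set size (kbytes)" line when the process exits. The sketch below pulls that figure out of the report. It assumes GNU time is installed at /usr/bin/time; the helper name `peak_rss_mb` and the file names in the usage comment are hypothetical.

```shell
#!/bin/sh
# Sketch: extract peak resident memory (in MB) from a `/usr/bin/time -v` report.
# peak_rss_mb is an illustrative helper, not part of TopHat or Bowtie.
peak_rss_mb() {
    # GNU time -v writes a line such as:
    #   Maximum resident set size (kbytes): 3774873
    awk -F': ' '/Maximum resident set size/ { printf "%d\n", $2 / 1024 }' "$1"
}

# Typical use (commented out; index and read file names are placeholders):
#   /usr/bin/time -v -o tophat.time.log tophat -p 8 -o tophat-out hg19 left.fq right.fq
#   peak_rss_mb tophat.time.log
```

As far as I know this figure reflects the largest single process rather than the sum over all child processes, so for a pipeline like TopHat it is an approximation. On SGE, `qacct -j <job_id>` reports `maxvmem` once the job has finished, which is another way to get a comparable number.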

        • sdarko
          Member
          • Apr 2009
          • 52

          #5
          Originally posted by SEQond
          How do you run TopHat in the SGE environment?

          Here are the contents of a shell script I use, but it constantly returns the error
          "no such file or directory".

          Code:
          #!/bin/sh
          #$ -V
          #$ -N SEQ_practical
          #$ -S /bin/sh
          #$ -cwd
          #$ -m abes
          #$ -M [email protected]
          #$ -pe ompi 4-12
          #$ -q all.q
          
          mpirun -np $NSLOTS tophat -p 1 -i 30 -I 20000 --segment-length 16 --segment-mismatches 1 -G /opt/bowtie-0.12.7/indexes/refGene_mm8.gtf -o ~/set11/tophatOut11/11Tophat.out /opt/bowtie-0.12.7/indexes/mm8 ~/set11/Galaxy11_Ba_wt_starved_fastqsanger.txt
          
          sleep 60

          It doesn't matter whether I use the full path or just ~/.....; I get the same error.

          If I run it as a serial job, no error is returned, but that way I am not using the cluster's power.
          Here is my script to run tophat and then kick off cufflinks in SGE.

          Code:
          #!/bin/sh
          #$ -N tophat_20111220_ucsc
          #$ -S /bin/bash
          #$ -q long.q 
          #$ -M [email protected]
          #$ -m be
          #$ -l h_vmem=4G
          #$ -t 1-8
          
          # Positional arguments: flowcell ID and index; the lane comes from the SGE array task ID (-t 1-8).
          flowcell=$1
          index=$2
          lane=$SGE_TASK_ID
          
          echo "$flowcell"
          echo "$lane"
          echo "$index"
          
          export PATH=$PATH:/usr/local/bio_apps/fastx_toolkit/bin:/usr/local/bio_apps/tophat-1.3.3/bin:/usr/local/bio_apps/bowtie:/usr/local/bio_apps/samtools:/usr/local/bio_apps/cufflinks-1.2.1
          export PATH=$PATH:/usr/sge/bin/lx24-amd64
          
          output="ucsc_index_${index}_20111221"
          mate_inner_dist=-35
          std_dev=45   # defined but not passed to tophat below
          reference=/data/lab/reference_sets/Homo_sapiens/UCSC/genomic_alignment/bowtie/genome
          gtf=/data/lab/reference_sets/Homo_sapiens/UCSC/annotation/genes/genes.gtf
          mask=/data/lab/reference_sets/Homo_sapiens/UCSC/annotation/genes/repeat_masker_UCSC.gtf
          
          # Align the paired reads for this lane, then index the resulting BAM.
          left="s_${lane}_1_${index}_sequence"
          right="s_${lane}_2_${index}_sequence"
          dir="/data/lab/data/mRNA_Seq_20110426/fc${flowcell}/${lane}/"
          cd "$dir" || exit 1
          tophat -o "$output" -r "$mate_inner_dist" --solexa1.3-quals --max-multihits 10 -G "$gtf" --library-type fr-unstranded "$reference" "$left" "$right"
          cd "$dir$output" || exit 1
          samtools index accepted_hits.bam
          samtools idxstats accepted_hits.bam
          
          # Cufflinks bias correction (-b) takes the genome FASTA file, not the Bowtie index prefix.
          reference=/data/lab/reference_sets/Homo_sapiens/UCSC/genomic_alignment/bowtie/genome.fa
          
          dir="/data/lab/data/mRNA_Seq_20110426/fc${flowcell}/${lane}/${output}"
          cd "$dir" || exit 1
          output="ucsc_cufflinks_index_${index}"
          
          cufflinks -M "$mask" -u -q --no-update-check -o "$output" -g "$gtf" -b "$reference" --library-type fr-unstranded -L "${flowcell}_${lane}_${index}" accepted_hits.bam
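
For the multi-core case asked about in #3: TopHat parallelizes with threads on a single node (its `-p` option), not with MPI, so launching it under `mpirun` starts several independent copies rather than one parallel run. Below is a hedged sketch of a threaded variant. The parallel environment name `smp` is an assumption (list the ones on your cluster with `qconf -spl`), the helper `tophat_cmd` and all file names are placeholders, and the command is only printed unless `DRY_RUN=0` is set.

```shell
#!/bin/sh
#$ -N tophat_threads
#$ -S /bin/sh
#$ -cwd
#$ -pe smp 4
# h_vmem is per slot on many clusters, so this would be 4 x 2G = 8G in total.
#$ -l h_vmem=2G

# Build the TopHat command using the slot count SGE granted (defaults to 1 outside SGE).
tophat_cmd() {
    threads=${NSLOTS:-1}
    echo "tophat -p $threads -o $1 $2 $3"
}

cmd=$(tophat_cmd tophat_out /path/to/bowtie_index reads.fastq)
echo "$cmd"

# Only execute when explicitly requested, so the sketch is safe to try.
# Unquoted expansion is fine here because the placeholder paths contain no spaces.
if [ "${DRY_RUN:-1}" = "0" ]; then
    $cmd
fi
```

Submitting with `qsub` should then give TopHat as many threads as slots were granted, all on one node, which matches how the memory figures earlier in the thread were measured.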

          • sdarko
            Member
            • Apr 2009
            • 52

            #6
            Originally posted by dariober
            I believe the amount of RAM depends mostly (if not only) on the size of the reference genome and ~4 GB sounds about right for the human one. So, yes, I think requesting 20 GB was unnecessary! Adding more RAM doesn't speed up the algorithm.

            Anyone please correct me if I got this wrong...

            Dario
            I think you're right.

            • pongorlorinc
              Junior Member
              • Aug 2010
              • 1

              #7
              If bowtie used more RAM, it would probably just be loading the human genome reference multiple times =)

              I think the main goals of these algorithms are speed and minimal RAM usage, so people don't need a special computer to analyze their data.
