**dariober** · 05-19-2011, 06:10 AM

Originally posted by sdarko

Hi folks, just a quick observation and question.

I recently ran TopHat on human-derived mRNA 2x100bp paired-end files with 12411993 reads for each pair. I used 5 processors and 20GB of RAM to process them on our SGE cluster. When I got the report file back, the max memory used was just under 4.25GB (which surprised me a little).

So, I want to limit the resources that I'm using on our cluster (obviously). Does anyone have a feel for how much memory I should request as a function of the number of processors that I use in TopHat?

Thanks,
Sam

I believe the amount of RAM depends mostly (if not only) on the size of the reference genome and ~4 GB sounds about right for the human one. So, yes, I think requesting 20 GB was unnecessary! Adding more RAM doesn't speed up the algorithm.

Anyone please correct me if I got this wrong...

Dario
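
For turning a measured peak into a request, here is a minimal sketch. It assumes the cluster treats `h_vmem` as a per-slot limit (a common SGE setup, but site-specific, so check with your admins) and that memory is roughly independent of thread count, as suggested above. The function name `per_slot_mem_mb` and the 20% headroom are illustrative choices, not anything TopHat or SGE prescribes. The peak itself can be read from the `maxvmem` field of `qacct -j <job_id>` after a trial run.

```shell
#!/bin/sh
# per_slot_mem_mb PEAK_MB SLOTS [HEADROOM_PCT]
# Print the per-slot memory request (in MB) for a job whose total peak
# memory was PEAK_MB, spread over SLOTS slots, with HEADROOM_PCT spare.
# Assumption: the scheduler multiplies the per-slot request by SLOTS.
per_slot_mem_mb() {
    peak_mb=$1
    slots=$2
    headroom_pct=${3:-20}
    # Total with headroom, rounded up to a whole MB.
    total_mb=$(( (peak_mb * (100 + headroom_pct) + 99) / 100 ))
    # Split across slots, rounded up.
    echo $(( (total_mb + slots - 1) / slots ))
}

# Peak of ~4.25 GB (4352 MB) on 5 slots with 20% headroom.
per_slot_mem_mb 4352 5 20    # prints 1045
```

With those numbers the request would be something like `-l h_vmem=1045M` on 5 slots, about 5 GB in total rather than 20 GB.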

**SEQond** · 12-12-2011, 06:10 AM

How do you run TopHat in an SGE environment?

Here are the contents of the shell file I use, but it constantly returns the error
"no such file or directory"!

Code:

#!/bin/sh
#$ -V
#$ -N SEQ_practical
#$ -S /bin/sh
#$ -cwd
#$ -m abes
#$ -M [email protected]
#$ -pe ompi 4-12
#$ -q all.q

mpirun -np $NSLOTS tophat -p 1 -i 30 -I 20000 --segment-length 16 --segment-mismatches 1 -G /opt/bowtie-0.12.7/indexes/refGene_mm8.gtf -o ~/set11/tophatOut11/11Tophat.out /opt/bowtie-0.12.7/indexes/mm8 ~/set11/Galaxy11_Ba_wt_starved_fastqsanger.txt

sleep 60

It doesn't matter whether I use the full path or just ~/...; I get the same error.

If I run it as a serial job, no error is returned, but that way I am not using the cluster's power.
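
One thing worth noting about the script above: TopHat parallelizes with threads (`-p`), not MPI, so `mpirun -np $NSLOTS tophat -p 1` would start `$NSLOTS` independent copies that all write to the same output directory. Below is a rough sketch of a threaded alternative. The parallel environment name `smp` is a placeholder (real names are listed by `qconf -spl`), and `check_inputs` is a hypothetical helper added here only to name whichever path is missing on the compute node. Whether a missing path is actually the cause of the error is not confirmed in this thread.

```shell
#!/bin/sh
#$ -S /bin/sh
#$ -cwd
#$ -V
# "smp" is a placeholder for a single-host (shared-memory) parallel environment.
#$ -pe smp 4

# NSLOTS is set by SGE; default to 1 so the script also runs outside the scheduler.
threads=${NSLOTS:-1}

# Use $HOME rather than ~ so the paths also expand inside quotes.
out_dir="$HOME/set11/tophatOut11/11Tophat.out"
reads="$HOME/set11/Galaxy11_Ba_wt_starved_fastqsanger.txt"
index=/opt/bowtie-0.12.7/indexes/mm8
gtf=/opt/bowtie-0.12.7/indexes/refGene_mm8.gtf

# Report each missing input by name; return 1 if any is missing.
check_inputs() {
    missing=0
    for f in "$@"; do
        if [ ! -e "$f" ]; then
            echo "missing on $(hostname): $f" >&2
            missing=1
        fi
    done
    return $missing
}

# Run only if tophat is on PATH and every input is visible from this node.
if command -v tophat >/dev/null 2>&1 &&
   check_inputs "$reads" "$gtf" "$index.1.ebwt"; then
    tophat -p "$threads" -i 30 -I 20000 --segment-length 16 \
        --segment-mismatches 1 -G "$gtf" -o "$out_dir" "$index" "$reads"
fi
```

Because all threads share one process, the slots have to be on a single host, which is why a shared-memory parallel environment is assumed instead of `ompi`.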

**grimmer** · 01-05-2012, 12:56 AM

Can anyone corroborate Dario's understanding? Is there really no advantage / use for more than 4GB RAM for the human genome? I can see logical reasons for both yes and no answers.

Cole Trapnell's 2009 paper mentions memory usage twice.

(1) "This memory-efficient data structure allows Bowtie to scan reads against a mammalian genome using around 2 GB of memory (within what is commonly available on a standard desktop computer)."
It's hard to tell whether that's a sales pitch to PC users or a nice way of telling me that I can't utilize my machine's full potential.

(2) "The entire TopHat run took 21 h, 50 min on a 3.0 GHz Intel Xeon 5160 processor, using <4 GB of RAM, a throughput of nearly 2.2 million reads per CPU hour."
This suggests to me that 4GB really is a cap. Why would he choose low RAM usage over speed if >4GB was an option? That seems silly.

Though it could be a red herring, here's someone claiming >20GB of RAM usage:

Tophat memory usage during "Searching for junctions via segment mapping" - SEQanswers

http://seqanswers.com/forums/showthread.php?t=15708&highlight=memory+tophat

In reading through the bowtie-build documentation and building hg19, I found mentions of 64-bit options (--bmax in particular) to utilize >4GB RAM. At first, I interpreted the options as a way to build ebwt indexes that allow >4GB memory usage for bowtie/tophat, thus speeding up alignment. On second glance, it seems the >4GB RAM option only applies to the indexing/building process. Ostensibly, this would speed up ebwt index production but result in ebwt indexes identical to those produced with default options. Thus, no effect on bowtie/tophat memory usage.

Is there any clearly-written documentation on this? Every link I can Google is purple. Thanks.

--------------------------------------------
I'm in the middle of the first run on a new 16GB machine and I haven't seen more than 3.6GB used yet (see below for progress). At any given moment, 7/8 cores are maxed and 1/8 cores is around 50% usage, while a similar run on bowtie proper has all cores maxed constantly. I'm not sure if that's an indication that memory is a bottleneck for tophat or not. It's hard to believe that the processor is the only bottleneck and all cores aren't constantly at full tilt.

[Wed Jan 4 20:54:56 2012] Beginning TopHat run (v1.3.3)
-----------------------------------------------
[Wed Jan 4 20:54:56 2012] Preparing output location /media/data/grimmer/tophat-out//
[Wed Jan 4 20:54:56 2012] Checking for Bowtie index files
[Wed Jan 4 20:54:56 2012] Checking for reference FASTA file
[Wed Jan 4 20:54:56 2012] Checking for Bowtie
Bowtie version: 0.12.7.0
[Wed Jan 4 20:54:56 2012] Checking for Samtools
Samtools Version: 0.1.18
[Wed Jan 4 20:54:56 2012] Generating SAM header for /hg19
[Wed Jan 4 20:54:58 2012] Preparing reads
format: fastq
quality scale: solexa33 (reads generated with GA pipeline version < 1.3)
Left reads: min. length=101, count=84425814
Right reads: min. length=101, count=84440371
[Wed Jan 4 21:59:02 2012] Mapping left_kept_reads against hg19 with Bowtie
[Wed Jan 4 23:02:20 2012] Processing bowtie hits
[Wed Jan 4 23:48:57 2012] Mapping left_kept_reads_seg1 against hg19 with Bowtie (1/4)
[Thu Jan 5 00:41:23 2012] Mapping left_kept_reads_seg2 against hg19 with Bowtie (2/4)
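
Rather than watching a system monitor during a run like the one above, the peak can be recorded. This is a small sketch assuming GNU `time` is installed at `/usr/bin/time` (the shell built-in `time` does not support `-v`). The helper name `peak_rss_gb` and the file name `tophat.time` are made up for illustration. Note that the figure is the largest single process in the tree, not the sum over all child processes.

```shell
#!/bin/sh
# peak_rss_gb FILE
# Print the "Maximum resident set size" from a GNU `time -v` report,
# converted from kilobytes to gigabytes with two decimals.
peak_rss_gb() {
    awk -F': ' '/Maximum resident set size/ { printf "%.2f\n", $2 / 1048576 }' "$1"
}

# Intended use (placeholders in angle brackets):
#   /usr/bin/time -v -o tophat.time tophat -p 8 <index> <left reads> <right reads>
#   peak_rss_gb tophat.time
```

A value that stays near the size of the index regardless of `-p` would support the idea that memory is set by the reference rather than by thread count.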

**sdarko** · 01-05-2012, 05:09 AM

Originally posted by SEQond

How do you run TopHat in an SGE environment?

Here are the contents of the shell file I use, but it constantly returns the error
"no such file or directory"!

Code:

#!/bin/sh
#$ -V
#$ -N SEQ_practical
#$ -S /bin/sh
#$ -cwd
#$ -m abes
#$ -M [email protected]
#$ -pe ompi 4-12
#$ -q all.q

mpirun -np $NSLOTS tophat -p 1 -i 30 -I 20000 --segment-length 16 --segment-mismatches 1 -G /opt/bowtie-0.12.7/indexes/refGene_mm8.gtf -o ~/set11/tophatOut11/11Tophat.out /opt/bowtie-0.12.7/indexes/mm8 ~/set11/Galaxy11_Ba_wt_starved_fastqsanger.txt

sleep 60

It doesn't matter whether I use the full path or just ~/...; I get the same error.

If I run it as a serial job, no error is returned, but that way I am not using the cluster's power.

Here is my script to run tophat and then kick off cufflinks in SGE.

Code:

#!/bin/sh
#$ -N tophat_20111220_ucsc
#$ -S /bin/bash
#$ -q long.q
#$ -M [email protected]
#$ -m be
#$ -l h_vmem=4G
#$ -t 1-8

# Usage: qsub <this script> <flowcell> <index>
# Each array task (1-8) processes one lane.
flowcell=$1
index=$2
lane=$SGE_TASK_ID

echo "$flowcell"
echo "$lane"
echo "$index"

export PATH=$PATH:/usr/local/bio_apps/fastx_toolkit/bin:/usr/local/bio_apps/tophat-1.3.3/bin:/usr/local/bio_apps/bowtie:/usr/local/bio_apps/samtools:/usr/local/bio_apps/cufflinks-1.2.1
export PATH=$PATH:/usr/sge/bin/lx24-amd64

output="ucsc_index_${index}_20111221"
mate_inner_dist=-35
# std_dev is defined here but not passed to tophat below.
std_dev=45
reference=/data/lab/reference_sets/Homo_sapiens/UCSC/genomic_alignment/bowtie/genome
gtf=/data/lab/reference_sets/Homo_sapiens/UCSC/annotation/genes/genes.gtf
mask=/data/lab/reference_sets/Homo_sapiens/UCSC/annotation/genes/repeat_masker_UCSC.gtf

left="s_${lane}_1_${index}_sequence"
right="s_${lane}_2_${index}_sequence"
dir="/data/lab/data/mRNA_Seq_20110426/fc${flowcell}/${lane}/"

# Align the paired reads for this lane; stop if the lane directory is missing.
cd "$dir" || exit 1
tophat -o "$output" -r "$mate_inner_dist" --solexa1.3-quals --max-multihits 10 -G "$gtf" --library-type fr-unstranded "$reference" "$left" "$right"

# Index the alignments and print per-reference read counts.
cd "$dir$output" || exit 1
samtools index accepted_hits.bam
samtools idxstats accepted_hits.bam

# Cufflinks takes the FASTA file (not the Bowtie index prefix) for -b.
reference=/data/lab/reference_sets/Homo_sapiens/UCSC/genomic_alignment/bowtie/genome.fa

dir="/data/lab/data/mRNA_Seq_20110426/fc${flowcell}/${lane}/${output}"
cd "$dir" || exit 1
output="ucsc_cufflinks_index_${index}"

cufflinks -M "$mask" -u -q --no-update-check -o "$output" -g "$gtf" -b "$reference" --library-type fr-unstranded -L "${flowcell}_${lane}_${index}" accepted_hits.bam
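
To see how the array job in the script above maps onto lanes, here is a dry-run sketch that only prints what each task would work on. It mirrors the naming scheme of the script. `task_inputs` is a hypothetical helper, and the default values `FC1` and `IDX` are placeholders, not real flowcell or index names.

```shell
#!/bin/sh
# task_inputs FLOWCELL INDEX LANE
# Print the lane directory and the two mate files one array task would use.
task_inputs() {
    flowcell=$1
    index=$2
    lane=$3
    dir="/data/lab/data/mRNA_Seq_20110426/fc${flowcell}/${lane}/"
    echo "${dir} s_${lane}_1_${index}_sequence s_${lane}_2_${index}_sequence"
}

# Dry run over the task IDs requested by `#$ -t 1-8`;
# under SGE each task would see its own ID in SGE_TASK_ID instead.
for task_id in 1 2 3 4 5 6 7 8; do
    task_inputs "${1:-FC1}" "${2:-IDX}" "$task_id"
done
```

Since each task is a separate single-threaded TopHat run, the `h_vmem=4G` request applies to each lane on its own.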

**sdarko** · 01-05-2012, 05:11 AM

Originally posted by dariober

I believe the amount of RAM depends mostly (if not only) on the size of the reference genome and ~4 GB sounds about right for the human one. So, yes, I think requesting 20 GB was unnecessary! Adding more RAM doesn't speed up the algorithm.

Anyone please correct me if I got this wrong...

Dario

I think you're right.

**pongorlorinc** · 01-25-2012, 11:11 PM

If bowtie used more RAM, it would probably be loading the human genome reference multiple times =)

I think the main goals for these algorithms are speed and minimal RAM usage, so people don't need a special computer to analyze their data.

Optimal amount of RAM per processor in TopHat?