Dear all,
I am a longtime fan of TopHat, but I have now hit a problem: it is too slow for a large dataset.
I have 3 lanes of HiSeq data for each sample, about 670 million 100 bp PE reads per sample, and I want to process all 3 lanes' reads together in TopHat.
Mapping to the genome with Bowtie was fast, but the "Searching for junctions via segment mapping" step has now been running for almost 2 weeks. The log "segment_juncs.log" shows that only chromosomes 4-9 have been processed, which means only about 1/5 of the whole genome has been done in the last 2 weeks. This step is using 68G of memory, but only one thread.
These are the options I used. I know "--coverage-search --microexon-search" will slow it down, but I am not sure by how much:
tophat -o tophat_${d}_PE -F 0.05 -i 50 -p 32 --library-type fr-unstranded --mate-std-dev 110 -g 30 --coverage-search --microexon-search --initial-read-mismatches 3
Any suggestions for speeding up TopHat, especially the "Searching for junctions via segment mapping" step? It looks to me as though each chromosome is analyzed one at a time in this step. Is there any way to run it in multi-process mode?
I have more than 20 samples on hand now, so at this rate it looks like mission impossible.
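To put a rough number on "mission impossible", here is a back-of-the-envelope sketch in Python. It only uses the figures quoted above (2 weeks, 1/5 of the genome, 20 samples) and assumes, which may well be wrong, that this step progresses linearly with the fraction of the genome done and that samples are run one after another.

```python
# Rough extrapolation from the numbers in this post.
# Assumption: the segment-mapping step scales linearly with the
# fraction of the genome processed; real progress may differ.
days_elapsed = 14      # "almost 2 weeks" spent on this step so far
parts_of_genome = 5    # about 1/5 of the genome done, so 5 equal parts in total
samples = 20           # "more than 20 samples", taken as a lower bound

days_per_sample = days_elapsed * parts_of_genome
total_days_serial = days_per_sample * samples

print(f"about {days_per_sample} days per sample for this step alone")
print(f"about {total_days_serial} days for {samples} samples run serially")
```

Even if the linear assumption is off by a wide margin, the estimate suggests the step has to be sped up or parallelized rather than just waited out.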
Thanks,
Mark
P.S. The TopHat log so far:
[Sat Feb 18 23:36:38 2012] Beginning TopHat run (v1.3.2)
-----------------------------------------------
[Sat Feb 18 23:36:38 2012] Preparing output location tophat_702LP_PE/
[Sat Feb 18 23:36:38 2012] Checking for Bowtie index files
[Sat Feb 18 23:36:38 2012] Checking for reference FASTA file
[Sat Feb 18 23:36:38 2012] Checking for Bowtie
Bowtie version: 0.12.7.0
[Sat Feb 18 23:36:38 2012] Checking for Samtools
Samtools Version: 0.1.18
[Sat Feb 18 23:36:38 2012] Generating SAM header for /mnt/enclosure/mofan/database/HG19/Homo_sapiens.GRCh37.62.dna.chromosome
[Sat Feb 18 23:37:01 2012] Preparing reads
format: fastq
quality scale: phred33 (default)
[Sat Feb 18 23:37:01 2012] Reading known junctions from GTF file
Left reads: min. length=100, count=667886430
Right reads: min. length=100, count=667737832
[Sun Feb 19 07:37:30 2012] Mapping left_kept_reads against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie
[Sun Feb 19 14:11:45 2012] Processing bowtie hits
[Mon Feb 20 01:51:44 2012] Mapping left_kept_reads_seg1 against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie (1/4)
[Mon Feb 20 04:47:11 2012] Mapping left_kept_reads_seg2 against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie (2/4)
[Mon Feb 20 07:50:27 2012] Mapping left_kept_reads_seg3 against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie (3/4)
[Mon Feb 20 11:21:12 2012] Mapping left_kept_reads_seg4 against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie (4/4)
[Mon Feb 20 14:42:48 2012] Mapping right_kept_reads against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie
[Mon Feb 20 20:52:41 2012] Processing bowtie hits
[Tue Feb 21 09:09:31 2012] Mapping right_kept_reads_seg1 against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie (1/4)
[Tue Feb 21 12:36:23 2012] Mapping right_kept_reads_seg2 against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie (2/4)
[Tue Feb 21 15:54:08 2012] Mapping right_kept_reads_seg3 against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie (3/4)
[Tue Feb 21 19:50:05 2012] Mapping right_kept_reads_seg4 against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie (4/4)
[Tue Feb 21 23:22:35 2012] Searching for junctions via segment mapping
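For comparison with the 2 weeks spent on the last step, the time taken by each earlier step can be read off the timestamps in this log. The following is a small Python sketch of that; `step_durations` is a hypothetical helper written for this post, not part of TopHat, and it assumes an English locale so that the day and month abbreviations parse.

```python
import re
from datetime import datetime

# A few lines copied from the log above, used as sample input.
LOG = """\
[Sun Feb 19 07:37:30 2012] Mapping left_kept_reads against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie
[Sun Feb 19 14:11:45 2012] Processing bowtie hits
[Mon Feb 20 01:51:44 2012] Mapping left_kept_reads_seg1 against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie (1/4)
[Mon Feb 20 04:47:11 2012] Mapping left_kept_reads_seg2 against Homo_sapiens.GRCh37.62.dna.chromosome with Bowtie (2/4)
"""

# Matches lines of the form "[Sat Feb 18 23:36:38 2012] message".
STAMP = re.compile(r"^\[(\w{3} \w{3} +\d+ \d\d:\d\d:\d\d \d{4})\] (.*)$")


def step_durations(log_text):
    """Return (message, timedelta) pairs; each step lasts until the next timestamp."""
    events = []
    for line in log_text.splitlines():
        match = STAMP.match(line)
        if match:  # lines without a timestamp (e.g. "format: fastq") are skipped
            when = datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")
            events.append((when, match.group(2)))
    # The last event has no following timestamp, so it gets no duration.
    return [(msg, nxt - when) for (when, msg), (nxt, _) in zip(events, events[1:])]


durations = step_durations(LOG)
for message, delta in durations:
    print(f"{delta.total_seconds() / 3600:6.2f} h  {message}")
```

Pasting the full log into `LOG` lists every step up to the junction search, which has no end timestamp yet and is therefore left out.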