SEQanswers





Old 06-29-2011, 04:17 PM   #1
dbrazel
Junior Member
 
Location: Bay Area, CA

Join Date: Jun 2011
Posts: 1
Seeking advice on PathSeq

Hi all,

I'm interested in using the PathSeq software and I was wondering if anyone had some advice on what sort of Amazon EC2 instances result in reasonable run times for full Illumina GAIIx or HiSeq data sets.

Thanks in advance!
Old 07-02-2011, 02:44 PM   #2
pcs_murali
Member
 
Location: Boston

Join Date: May 2010
Posts: 26

Hi,

We used Large instances. In our hands, total RNA sequencing data (30 to 50 million reads) from a GAIIx takes about 1 to 1.2 days to finish, running on 20 nodes in parallel.

We are working on reducing these run times.

Please let me know if you need more information on PathSeq.

Thanks
Chandra
Old 07-07-2011, 01:52 PM   #3
yiweiny
Junior Member
 
Location: usa

Join Date: May 2009
Posts: 6
PathSeq is very slow on EC2

I also set up PathSeq on EC2. I was able to run it, but it was very slow: a data set with 100,000 reads took 10 hours running on 10 instances. I would appreciate any advice you can give me.
Old 07-08-2011, 09:16 AM   #4
pcs_murali
Member
 
Location: Boston

Join Date: May 2010
Posts: 26

Hi,

Could you tell me which version of PathSeq you are running?

Also, is your dataset RNA-based or DNA-based?


Thanks
Chandra
Old 07-08-2011, 12:23 PM   #5
yiweiny
Junior Member
 
Location: usa

Join Date: May 2009
Posts: 6
Slow PathSeq

Hi, Chandra,
The version of PathSeq is 5.1. The data set is a sample of 100,000 reads drawn from the sample input files provided on the PathSeq web site. My problem is that there does not seem to be a difference between running on 10 nodes and on 20 nodes; both took a long time. I am concerned that the Hadoop cluster is not set up correctly.
Thanks for the prompt reply; I look forward to hearing from you.

Yi
Old 07-09-2011, 08:03 AM   #6
pcs_murali
Member
 
Location: Boston

Join Date: May 2010
Posts: 26

Hi Yi,

I will re-run it on the cloud and see how much time it takes.

In our hands, other samples with 40 million reads run in 1 to 1.2 days.

Meanwhile, please download the latest version from our website.

I will get back to you as soon as possible.

I greatly appreciate your comments.

Thanks
Chandra
Old 07-11-2011, 08:05 AM   #7
yiweiny
Junior Member
 
Location: usa

Join Date: May 2009
Posts: 6
PathSeq logs

Hi, Chandra,
Thanks for the advice. I don't know whether it will be helpful to you, but here is part of the Hadoop log I captured from the master node on EC2:


rmr: cannot remove config: No such file or directory.
rmr: cannot remove s3config: No such file or directory.
rmr: cannot remove load: No such file or directory.
Master data_loader
11/07/08 20:27:30 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/root/mapper_data_compsub.py, /mnt/hadoop/hadoop-unjar3242713623624356081/] [] /tmp/streamjob4418070527408526594.jar tmpDir=null
11/07/08 20:27:30 INFO mapred.FileInputFormat: Total input paths to process : 3
11/07/08 20:27:31 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
11/07/08 20:27:31 INFO streaming.StreamJob: Running job: job_201107082024_0001
11/07/08 20:27:31 INFO streaming.StreamJob: To kill this job, run:
11/07/08 20:27:31 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-82-215-242.ec2.internal:50002 -kill job_201107082024_0001
11/07/08 20:27:31 INFO streaming.StreamJob: Tracking URL: http://ip-10-82-215-242.ec2.internal:50030/jobdetails.jsp?jobid=job_201107082024_0001
11/07/08 20:27:32 INFO streaming.StreamJob: map 0% reduce 0%
11/07/08 20:27:44 INFO streaming.StreamJob: map 33% reduce 0%
11/07/08 20:27:47 INFO streaming.StreamJob: map 67% reduce 0%
11/07/08 20:27:48 INFO streaming.StreamJob: map 100% reduce 0%
11/07/08 20:47:03 INFO streaming.StreamJob: Job complete: job_201107082024_0001
11/07/08 20:47:03 INFO streaming.StreamJob: Output: load

real 19m33.828s
user 0m2.365s
sys 0m0.665s
Master loader completed
ERROR: Bucket 'ami-yiweijob6-stat' does not exist
Bucket 's3://ami-yiweijob6-stat/' removed
Bucket 's3://ami-yiweijob6-stat/' created
ERROR: Bucket 'ami-yiweijob6-output' does not exist
Bucket 's3://ami-yiweijob6-output/' removed
Bucket 's3://ami-yiweijob6-output/' created
File s3://reads-yiwei-regeneron/input1.local saved as '/usr/local/hadoop-0.19.0/input1.local' (75 bytes in 0.0 seconds, 4.17 kB/s)
File s3://reads-yiwei-regeneron/input10.local saved as '/usr/local/hadoop-0.19.0/input10.local' (76 bytes in 0.0 seconds, 3.08 kB/s)
File s3://reads-yiwei-regeneron/input11.local saved as '/usr/local/hadoop-0.19.0/input11.local' (76 bytes in 0.0 seconds, 2.79 kB/s)
File s3://reads-yiwei-regeneron/input12.local saved as '/usr/local/hadoop-0.19.0/input12.local' (76 bytes in 0.0 seconds, 3.16 kB/s)
File s3://reads-yiwei-regeneron/input2.local saved as '/usr/local/hadoop-0.19.0/input2.local' (75 bytes in 0.0 seconds, 2.99 kB/s)
File s3://reads-yiwei-regeneron/input3.local saved as '/usr/local/hadoop-0.19.0/input3.local' (75 bytes in 0.0 seconds, 2.59 kB/s)
File s3://reads-yiwei-regeneron/input4.local saved as '/usr/local/hadoop-0.19.0/input4.local' (75 bytes in 0.0 seconds, 3.34 kB/s)
File s3://reads-yiwei-regeneron/input5.local saved as '/usr/local/hadoop-0.19.0/input5.local' (75 bytes in 0.0 seconds, 2.99 kB/s)
File s3://reads-yiwei-regeneron/input6.local saved as '/usr/local/hadoop-0.19.0/input6.local' (75 bytes in 0.0 seconds, 3.43 kB/s)
File s3://reads-yiwei-regeneron/input7.local saved as '/usr/local/hadoop-0.19.0/input7.local' (75 bytes in 0.0 seconds, 3.18 kB/s)
File s3://reads-yiwei-regeneron/input8.local saved as '/usr/local/hadoop-0.19.0/input8.local' (75 bytes in 0.0 seconds, 3.25 kB/s)
File s3://reads-yiwei-regeneron/input9.local saved as '/usr/local/hadoop-0.19.0/input9.local' (75 bytes in 0.0 seconds, 3.46 kB/s)
rmr: cannot remove test: No such file or directory.
rmr: cannot remove maq: No such file or directory.
Maq alignments + Duplicate remover
11/07/08 20:47:10 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/root/mapper_maqalignment.py, /root/Sam2Fastq.java, /root/FQone2Fastq.java, /root/Fastq2FQone.java, /root/removeduplicates_new.java, /root/MAQunmapped2FQone.java, /root/MAQunmapped2fastq.java, /mnt/hadoop/hadoop-unjar7827192869392733442/] [] /tmp/streamjob8065656650743401458.jar tmpDir=null
11/07/08 20:47:10 INFO mapred.FileInputFormat: Total input paths to process : 12
11/07/08 20:47:11 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
11/07/08 20:47:11 INFO streaming.StreamJob: Running job: job_201107082024_0002
11/07/08 20:47:11 INFO streaming.StreamJob: To kill this job, run:
11/07/08 20:47:11 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-82-215-242.ec2.internal:50002 -kill job_201107082024_0002
11/07/08 20:47:11 INFO streaming.StreamJob: Tracking URL: http://ip-10-82-215-242.ec2.internal:50030/jobdetails.jsp?jobid=job_201107082024_0002
11/07/08 20:47:12 INFO streaming.StreamJob: map 0% reduce 0%
11/07/08 20:47:24 INFO streaming.StreamJob: map 8% reduce 0%
11/07/08 20:47:25 INFO streaming.StreamJob: map 17% reduce 0%
11/07/08 20:47:29 INFO streaming.StreamJob: map 33% reduce 0%
11/07/08 20:47:30 INFO streaming.StreamJob: map 42% reduce 0%
11/07/08 20:47:34 INFO streaming.StreamJob: map 58% reduce 0%
11/07/08 20:47:35 INFO streaming.StreamJob: map 67% reduce 0%
11/07/08 20:47:39 INFO streaming.StreamJob: map 75% reduce 0%

This is from running 100,000 reads on 3 instances on EC2. I had to shut it down after 2 hours, as the processing did not look likely to finish in a reasonable amount of time. I hope this log is useful for your troubleshooting. Thanks again for your help!

Yi Wei
Old 07-11-2011, 08:24 AM   #8
pcs_murali
Member
 
Location: Boston

Join Date: May 2010
Posts: 26

Hi Yi Wei,

Thanks for your log file.

I am re-running PathSeq with the sample file provided with the package, which contains 6 million unique reads. I will share my results with you once it is done.

I have looked at the log file you posted. There are no errors; PathSeq seems to be running fine. Keep in mind that we run 4 MAQ alignments, 2 Megablast alignments, and 2 BLASTN alignments. These take time to finish, and that time is, up to a certain extent, independent of the number of reads going in. What I mean is as follows:

If you have 100,000 reads, the run may take about 5 hours to finish.
If you have 1 million reads, the run may take about the same time as 100,000 reads.
If you have 40 million reads, the run may take about 16-18 hours to finish.
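In other words, the run time behaves like a large fixed overhead plus a small per-read cost. This can be sketched with a toy model; the constants below are rough illustrative fits to the figures quoted above, not measured PathSeq parameters:

```python
def estimated_hours(reads, nodes=20, fixed_hours=5.0, secs_per_read_per_node=0.0216):
    # Toy model: total time = fixed pipeline overhead + per-read work
    # spread over the nodes. Both constants are made-up fits to the
    # figures quoted above, not measurements from PathSeq.
    variable_hours = reads * secs_per_read_per_node / nodes / 3600.0
    return fixed_hours + variable_hours
```

With these constants, 100,000 and 1 million reads both come out near 5 hours, while 40 million reads comes out near 17 hours, matching the pattern above: the fixed overhead dominates until the read count gets very large.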

I will post the results I get from the 6 million reads.

Meanwhile, please let me know your requirements:

1. How many reads do you have in your real sequencing file?
2. Are the reads from Illumina?
3. Are you using total RNA-seq or WGS?

Thanks
Chandra
Old 07-11-2011, 08:50 AM   #9
yiweiny
Junior Member
 
Location: usa

Join Date: May 2009
Posts: 6
PathSeq questions

Hi, Chandra,
Thanks for the advice; it is very helpful. What we are trying to do is look for potential pathogen sequences in Illumina RNA-Seq data. We will probably get 40-80 million reads from each sample. Could you send me a copy of the Hadoop log from your run of the 6 million reads in the sample data file provided with the PathSeq package? I would like to run the same 6 million reads and compare the logs.
Best Regards,

Yi Wei

P.S.
1. Do you have plans to modify PathSeq so that it can be run on internal computer clusters instead of Amazon EC2?
2. Are you considering using Bowtie or BWA for the initial filtering step, as they are much faster than MAQ?
Old 07-11-2011, 09:02 AM   #10
pcs_murali
Member
 
Location: Boston

Join Date: May 2010
Posts: 26

Hi Yi Wei,

Yes, we are working on implementing BWA in PathSeq. You are correct that BWA is much faster than MAQ.

We are also working on support for Hadoop-based internal computing clusters.

What kind of internal computer cluster do you have? Is it LSF?

Thanks
Chandra
Old 07-20-2011, 08:00 AM   #11
pcs_murali
Member
 
Location: Boston

Join Date: May 2010
Posts: 26

Hi Yi Wei and Pathseq users,

Here is the log file from the PathSeq run; I removed some lines for clarity.

The log is from a PathSeq run on 6 million unique reads (the sample file shipped with the PathSeq package).

In summary:
Total compute time across all 20 nodes: 387.7 hours
Wall-clock (end-to-end) time: ~19 hours

The most important thing to highlight: 6 million reads taking 19 hours does not mean that 60 million reads take 10 times longer. In our hands, a 40 million read sequencing run takes about the same 19 hours.

Currently, we are working on a faster PathSeq. In preliminary runs, the newer version takes half the time of the current one. Once we are done with validation, we will make a public release.

Please let me know if you have more questions or need help with the PathSeq installation.

Thanks
Chandra


Log file:
******
Master data_loader
**********************************
11/07/19 14:33:53 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/root/mapper_data_compsub.py, /mnt/hadoop/hadoop-unjar3823448146028608527/] [] /tmp/streamjob9062470838490462284.jar tmpDir=null
11/07/19 14:33:54 INFO mapred.FileInputFormat: Total input paths to process : 20
11/07/19 14:33:55 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
11/07/19 14:33:55 INFO streaming.StreamJob: Running job: job_201107191423_0001
11/07/19 14:33:55 INFO streaming.StreamJob: To kill this job, run:
11/07/19 14:33:55 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0001
11/07/19 14:33:55 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0001
11/07/19 14:33:56 INFO streaming.StreamJob: map 0% reduce 0%
11/07/19 14:34:09 INFO streaming.StreamJob: map 20% reduce 0%
11/07/19 14:34:10 INFO streaming.StreamJob: map 40% reduce 0%
11/07/19 14:34:11 INFO streaming.StreamJob: map 60% reduce 0%
11/07/19 14:34:12 INFO streaming.StreamJob: map 80% reduce 0%
11/07/19 14:34:13 INFO streaming.StreamJob: map 95% reduce 0%
11/07/19 14:34:14 INFO streaming.StreamJob: map 100% reduce 0%
11/07/19 15:32:58 INFO streaming.StreamJob: Job complete: job_201107191423_0001
11/07/19 15:32:58 INFO streaming.StreamJob: Output: load

real 59m5.290s
user 0m3.278s
sys 0m1.108s
Master loader completed



Maq alignments + Duplicate remover
**********************************
11/07/19 15:33:07 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/root/mapper_maqalignment.py, /root/Sam2Fastq.java, /root/FQone2Fastq.java, /root/Fastq2FQone.java, /root/removeduplicates_new.java, /root/MAQunmapped2FQone.java, /root/MAQunmapped2fastq.java, /mnt/hadoop/hadoop-unjar2138415996895576783/] [] /tmp/streamjob4610713994932979234.jar tmpDir=null
11/07/19 15:33:08 INFO mapred.FileInputFormat: Total input paths to process : 21
11/07/19 15:33:08 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
11/07/19 15:33:08 INFO streaming.StreamJob: Running job: job_201107191423_0002
11/07/19 15:33:08 INFO streaming.StreamJob: To kill this job, run:
11/07/19 15:33:08 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0002
11/07/19 15:33:08 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0002
11/07/19 15:33:09 INFO streaming.StreamJob: map 0% reduce 0%
11/07/19 15:33:21 INFO streaming.StreamJob: map 24% reduce 0%
11/07/19 15:33:22 INFO streaming.StreamJob: map 52% reduce 0%
11/07/19 15:33:23 INFO streaming.StreamJob: map 76% reduce 0%
11/07/19 15:33:24 INFO streaming.StreamJob: map 86% reduce 0%
11/07/19 15:33:25 INFO streaming.StreamJob: map 95% reduce 0%
11/07/19 15:33:26 INFO streaming.StreamJob: map 100% reduce 0%
11/07/20 01:14:01 INFO streaming.StreamJob: Job complete: job_201107191423_0002
11/07/20 01:14:02 INFO streaming.StreamJob: Output: maq

real 580m56.490s
user 0m6.135s
sys 0m12.924s
Maq alignments + Duplicate remover completed

Repeat masker loader
********************

real 2m15.171s
user 1m11.617s
sys 0m12.480s
Repeat masker loader completed

Run repeat masker
********************
11/07/20 01:16:23 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/root/FQone2Fastq.java, /root/RepeatMaskerFormat.java, /root/ParsedBlastParser.cc, /root/BlastParser.java, /root/RepeatMaskerRead.java, /root/mapper_repeatmasker.py, /mnt/hadoop/hadoop-unjar6903467901556213816/] [] /tmp/streamjob3668474814406944845.jar tmpDir=null
11/07/20 01:16:24 INFO mapred.FileInputFormat: Total input paths to process : 60
11/07/20 01:16:24 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
11/07/20 01:16:24 INFO streaming.StreamJob: Running job: job_201107191423_0003
11/07/20 01:16:24 INFO streaming.StreamJob: To kill this job, run:
11/07/20 01:16:24 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0003
11/07/20 01:16:24 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0003
11/07/20 01:16:25 INFO streaming.StreamJob: map 0% reduce 0%
11/07/20 01:16:36 INFO streaming.StreamJob: map 5% reduce 0%
11/07/20 01:16:37 INFO streaming.StreamJob: map 10% reduce 0%
11/07/20 01:16:38 INFO streaming.StreamJob: map 20% reduce 0%
11/07/20 01:16:40 INFO streaming.StreamJob: map 32% reduce 0%
11/07/20 01:16:41 INFO streaming.StreamJob: map 37% reduce 0%
11/07/20 01:16:42 INFO streaming.StreamJob: map 43% reduce 0%
11/07/20 01:16:43 INFO streaming.StreamJob: map 53% reduce 0%
11/07/20 01:16:45 INFO streaming.StreamJob: map 65% reduce 0%
11/07/20 01:16:46 INFO streaming.StreamJob: map 72% reduce 0%
11/07/20 01:16:47 INFO streaming.StreamJob: map 77% reduce 0%
11/07/20 01:16:48 INFO streaming.StreamJob: map 85% reduce 0%
11/07/20 01:16:49 INFO streaming.StreamJob: map 88% reduce 0%
11/07/20 01:16:51 INFO streaming.StreamJob: map 97% reduce 0%
11/07/20 01:16:52 INFO streaming.StreamJob: map 98% reduce 0%
11/07/20 01:16:54 INFO streaming.StreamJob: map 100% reduce 0%
11/07/20 03:59:21 INFO streaming.StreamJob: Job complete: job_201107191423_0003
11/07/20 03:59:21 INFO streaming.StreamJob: Output: repeat

real 162m59.218s
user 0m4.786s
sys 0m1.192s
Repeat masker runs completed

Deleted hdfs://ip-10-118-59-251.ec2.internal:50001/user/root/load
Master data_loader for Post
********************
11/07/20 03:59:23 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/root/mapper_data_postsub.py, /mnt/hadoop/hadoop-unjar4739713752539699730/] [] /tmp/streamjob4058317523841970356.jar tmpDir=null
11/07/20 03:59:24 INFO mapred.FileInputFormat: Total input paths to process : 20
11/07/20 03:59:24 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
11/07/20 03:59:24 INFO streaming.StreamJob: Running job: job_201107191423_0004
11/07/20 03:59:24 INFO streaming.StreamJob: To kill this job, run:
11/07/20 03:59:24 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0004
11/07/20 03:59:24 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0004
11/07/20 03:59:25 INFO streaming.StreamJob: map 0% reduce 0%
11/07/20 03:59:36 INFO streaming.StreamJob: map 20% reduce 0%
11/07/20 03:59:37 INFO streaming.StreamJob: map 45% reduce 0%
11/07/20 03:59:38 INFO streaming.StreamJob: map 60% reduce 0%
11/07/20 03:59:39 INFO streaming.StreamJob: map 80% reduce 0%
11/07/20 03:59:40 INFO streaming.StreamJob: map 95% reduce 0%
11/07/20 03:59:41 INFO streaming.StreamJob: map 100% reduce 0%
11/07/20 04:18:27 INFO streaming.StreamJob: Job complete: job_201107191423_0004
11/07/20 04:18:28 INFO streaming.StreamJob: Output: load

real 19m5.252s
user 0m2.360s
sys 0m1.129s
Master loader completed

Postsubtraction loader
********************
real 0m27.082s
user 0m10.882s
sys 0m1.435s

Postsubstraction on the Unmapped reads
********************
11/07/20 04:19:02 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/root/FQone2Fasta.java, /root/extractFullQuert4BHitTable.java, /root/extractUnmapped_latest.java, /root/Fas2FQ1.java, /root/FQone2Fastq.java, /root/RepeatMaskerFormat.java, /root/ParsedBlastParser.cc, /root/blastxml.cc, /root/BlastParser.java, /root/RepeatMaskerRead.java, /root/mapper_postunmapped.py, /mnt/hadoop/hadoop-unjar3151623736871864286/] [] /tmp/streamjob6660721182687954669.jar tmpDir=null
11/07/20 04:19:03 INFO mapred.FileInputFormat: Total input paths to process : 40
11/07/20 04:19:03 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
11/07/20 04:19:03 INFO streaming.StreamJob: Running job: job_201107191423_0005
11/07/20 04:19:03 INFO streaming.StreamJob: To kill this job, run:
11/07/20 04:19:03 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0005
11/07/20 04:19:03 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0005
11/07/20 04:19:04 INFO streaming.StreamJob: map 0% reduce 0%
11/07/20 04:19:15 INFO streaming.StreamJob: map 5% reduce 0%
11/07/20 04:19:16 INFO streaming.StreamJob: map 15% reduce 0%
11/07/20 04:19:17 INFO streaming.StreamJob: map 18% reduce 0%
11/07/20 04:19:18 INFO streaming.StreamJob: map 28% reduce 0%
11/07/20 04:19:19 INFO streaming.StreamJob: map 48% reduce 0%
11/07/20 04:19:20 INFO streaming.StreamJob: map 55% reduce 0%
11/07/20 04:19:21 INFO streaming.StreamJob: map 65% reduce 0%
11/07/20 04:19:23 INFO streaming.StreamJob: map 72% reduce 0%
11/07/20 04:19:24 INFO streaming.StreamJob: map 92% reduce 0%
11/07/20 04:19:25 INFO streaming.StreamJob: map 95% reduce 0%
11/07/20 04:19:26 INFO streaming.StreamJob: map 97% reduce 0%
11/07/20 04:19:30 INFO streaming.StreamJob: map 100% reduce 0%
11/07/20 10:28:32 INFO streaming.StreamJob: Job complete: job_201107191423_0005
11/07/20 10:28:32 INFO streaming.StreamJob: Output: postsub

real 369m31.146s
user 0m5.041s
sys 0m1.540s

Postsubstraction on the contigs
********************
11/07/20 10:28:33 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/root/FQone2Fastq.java, /root/RepeatMaskerFormat.java, /root/ParsedBlastParser.cc, /root/blastxml.cc, /root/FQone2Fasta.java, /root/extractFullQuert4BHitTable.java, /root/extractUnmapped_latest.java, /root/BlastParser.java, /root/RepeatMaskerRead.java, /root/mapper_postvelvet.py, /mnt/hadoop/hadoop-unjar1426923625300485254/] [] /tmp/streamjob2059410816131312926.jar tmpDir=null
11/07/20 10:28:34 INFO mapred.FileInputFormat: Total input paths to process : 18
11/07/20 10:28:34 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
11/07/20 10:28:34 INFO streaming.StreamJob: Running job: job_201107191423_0006
11/07/20 10:28:34 INFO streaming.StreamJob: To kill this job, run:
11/07/20 10:28:34 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0006
11/07/20 10:28:34 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0006
11/07/20 10:28:35 INFO streaming.StreamJob: map 0% reduce 0%
11/07/20 10:28:47 INFO streaming.StreamJob: map 28% reduce 0%
11/07/20 10:28:48 INFO streaming.StreamJob: map 56% reduce 0%
11/07/20 10:28:49 INFO streaming.StreamJob: map 89% reduce 0%
11/07/20 10:28:50 INFO streaming.StreamJob: map 94% reduce 0%
11/07/20 10:28:51 INFO streaming.StreamJob: map 100% reduce 0%
11/07/20 10:31:56 INFO streaming.StreamJob: Job complete: job_201107191423_0006
11/07/20 10:31:56 INFO streaming.StreamJob: Output: postsubvel

real 3m24.070s
user 0m1.577s
sys 0m0.238s
Postsubtraction completed

File '/usr/local/hadoop-0.19.0/output/Output.tar' stored as 's3://ami-ami-QFnew-foutput/Output.tar' (106291200 bytes in 16.4 seconds, 6.19 MB/s) [1 of 1]

Results Summary:
*************

Substraction Pathseq_Cloud
Total number of reads 6369435
Total number of reads after duplicate remover 6369435
Total number of unmapped reads after Maq 1 alignment (Database: MAQ1) 1829265
Total number of unmapped reads after Maq 2 alignment (Database: MAQ2) 504427
Total number of unmapped reads after Maq 3 alignment (Database: MAQ3) 488954
Total number of unmapped reads after Maq 4 alignment (Database: MAQ4) 485479
Total number of unmapped reads after repeat masker 365393
Total number of unmapped reads after Megablast (Database: BLAST1) 70343
Total number of unmapped reads after Megablast (Database: BLAST2) 33808
Total number of unmapped reads after BlastN1 (Database: BLAST1) 33768
Total number of unmapped reads after BlastN2 (Database: BLAST2) 33746
Total number of unmapped reads 33746
Reads after computational subtraction (Unmapped reads) unmappedreads.fq1
Contigs from unmapped reads contigs.fq1
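A side note for anyone comparing logs like this: the per-stage wall times can be totaled from the `real XmY.Zs` lines that `time` prints. A small generic parser (my own sketch, not part of PathSeq):

```python
import re

def total_real_minutes(log_text):
    # Sum the wall-clock times from lines like "real 580m56.490s".
    total = 0.0
    for mins, secs in re.findall(r"^real\s+(\d+)m([\d.]+)s", log_text, flags=re.M):
        total += int(mins) + float(secs) / 60.0
    return total
```

Applied to the eight stages in the log above, this gives roughly 1198 minutes (~20 hours), consistent with the ~19-hour wall time quoted.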
Old 07-20-2011, 04:44 PM   #12
yiweiny
Junior Member
 
Location: usa

Join Date: May 2009
Posts: 6

Hi, Chandra,
Thanks so much for the log! It is very helpful. I am looking forward to the new version of PathSeq. In the meantime, I will start running PathSeq with our data and keep you updated on our progress.

Yi Wei
Old 07-22-2011, 11:04 AM   #13
Tomi
Member
 
Location: Cambridge

Join Date: Jul 2011
Posts: 12
PathSeq AMI problem

I am having problems building my own AMI as explained in the last step of the PathSeq installation.

If I execute ./create-Ami.com I receive an error that the AMI is not available. I assume that I just need this command to create an instance with PathSeq installed on it?

I am working on developing a GUI for PathSeq, so it would be nice if you could give me documentation for the tool, if you have any.

It would be great if you could help me!
Old 07-26-2011, 06:56 AM   #14
yiweiny
Junior Member
 
Location: usa

Join Date: May 2009
Posts: 6

Hi, Chandra,
The following is my experience running PathSeq with my own data:
I started with ~70 million human RNA-Seq 100 bp Illumina reads. I prefiltered these reads by running Bowtie against human 37.1 reference genome in my own desktop and ended up with ~11 million reads. After running Preprocessed_Reads.com, I got ~1.6 million reads. These reads were then uploaded onto S3 and PathSeq was launched on 20 nodes. PathSeq ran for more than 60 hour without finishing and I had to terminate the whole job. Here is the log I got from the master node.

Master data_loader
11/07/23 18:29:41 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/root/mapper_data_compsub.py, /mnt/hadoop/hadoop-unjar9078098862757602177/] [] /tmp/streamjob1077975618755703344.jar tmpDir=null
11/07/23 18:29:42 INFO mapred.FileInputFormat: Total input paths to process : 20
11/07/23 18:29:42 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
11/07/23 18:29:42 INFO streaming.StreamJob: Running job: job_201107231823_0001
11/07/23 18:29:42 INFO streaming.StreamJob: To kill this job, run:
11/07/23 18:29:42 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-204-131-114.ec2.internal:50002 -kill job_201107231823_0001
11/07/23 18:29:42 INFO streaming.StreamJob: Tracking URL: http://ip-10-204-131-114.ec2.interna...107231823_0001
11/07/23 18:29:43 INFO streaming.StreamJob: map 0% reduce 0%
11/07/23 18:29:55 INFO streaming.StreamJob: map 10% reduce 0%
11/07/23 18:29:56 INFO streaming.StreamJob: map 30% reduce 0%
11/07/23 18:29:57 INFO streaming.StreamJob: map 45% reduce 0%
11/07/23 18:29:58 INFO streaming.StreamJob: map 60% reduce 0%
11/07/23 18:29:59 INFO streaming.StreamJob: map 80% reduce 0%
11/07/23 18:30:00 INFO streaming.StreamJob: map 100% reduce 0%
11/07/23 18:54:42 INFO streaming.StreamJob: Job complete: job_201107231823_0001
11/07/23 18:54:42 INFO streaming.StreamJob: Output: load

real 25m1.703s
user 0m2.231s
sys 0m0.320s
Master loader completed

Maq alignments + Duplicate remover
11/07/23 18:54:51 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/root/mapper_maqalignment.py, /root/Sam2Fastq.java, /root/FQone2Fastq.java, /root/Fastq2FQone.java, /root/removeduplicates_new.java, /root/MAQunmapped2FQone.java, /root/MAQunmapped2fastq.java, /mnt/hadoop/hadoop-unjar6628816665337722828/] [] /tmp/streamjob6711648814308398287.jar tmpDir=null
11/07/23 18:54:52 INFO mapred.FileInputFormat: Total input paths to process : 21
11/07/23 18:54:52 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
11/07/23 18:54:52 INFO streaming.StreamJob: Running job: job_201107231823_0002
11/07/23 18:54:52 INFO streaming.StreamJob: To kill this job, run:
11/07/23 18:54:52 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-204-131-114.ec2.internal:50002 -kill job_201107231823_0002
11/07/23 18:54:52 INFO streaming.StreamJob: Tracking URL: http://ip-10-204-131-114.ec2.interna...107231823_0002
11/07/23 18:54:53 INFO streaming.StreamJob: map 0% reduce 0%
11/07/23 18:55:04 INFO streaming.StreamJob: map 29% reduce 0%
11/07/23 18:55:05 INFO streaming.StreamJob: map 43% reduce 0%
11/07/23 18:55:06 INFO streaming.StreamJob: map 52% reduce 0%
11/07/23 18:55:07 INFO streaming.StreamJob: map 67% reduce 0%
11/07/23 18:55:08 INFO streaming.StreamJob: map 90% reduce 0%
11/07/23 18:55:09 INFO streaming.StreamJob: map 100% reduce 0%
11/07/24 03:34:56 INFO streaming.StreamJob: Job complete: job_201107231823_0002
11/07/24 03:34:56 INFO streaming.StreamJob: Output: maq

real 520m5.778s
user 0m7.229s
sys 0m2.200s
Maq alignments + Duplicate remover completed

Run repeat masker
11/07/24 03:38:54 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/root/FQone2Fastq.java, /root/RepeatMaskerFormat.java, /root/ParsedBlastParser.cc, /root/BlastParser.java, /root/RepeatMaskerRead.java, /root/mapper_repeatmasker.py, /mnt/hadoop/hadoop-unjar682608492557274042/] [] /tmp/streamjob8244164963266699673.jar tmpDir=null
11/07/24 03:38:55 INFO mapred.FileInputFormat: Total input paths to process : 108
11/07/24 03:38:56 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
11/07/24 03:38:56 INFO streaming.StreamJob: Running job: job_201107231823_0003
11/07/24 03:38:56 INFO streaming.StreamJob: To kill this job, run:
11/07/24 03:38:56 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-204-131-114.ec2.internal:50002 -kill job_201107231823_0003
11/07/24 03:38:56 INFO streaming.StreamJob: Tracking URL: http://ip-10-204-131-114.ec2.interna...107231823_0003
11/07/24 03:38:57 INFO streaming.StreamJob: map 0% reduce 0%
11/07/24 03:39:09 INFO streaming.StreamJob: map 3% reduce 0%
11/07/24 03:39:10 INFO streaming.StreamJob: map 7% reduce 0%
11/07/24 03:39:11 INFO streaming.StreamJob: map 11% reduce 0%
11/07/24 03:39:12 INFO streaming.StreamJob: map 17% reduce 0%
11/07/24 03:39:14 INFO streaming.StreamJob: map 21% reduce 0%
11/07/24 03:39:15 INFO streaming.StreamJob: map 26% reduce 0%
11/07/24 03:39:16 INFO streaming.StreamJob: map 29% reduce 0%
11/07/24 03:39:17 INFO streaming.StreamJob: map 35% reduce 0%
11/07/24 03:39:19 INFO streaming.StreamJob: map 40% reduce 0%
11/07/24 03:39:20 INFO streaming.StreamJob: map 44% reduce 0%
11/07/24 03:39:21 INFO streaming.StreamJob: map 47% reduce 0%
11/07/24 03:39:22 INFO streaming.StreamJob: map 53% reduce 0%
11/07/24 03:39:24 INFO streaming.StreamJob: map 55% reduce 0%
11/07/24 03:39:25 INFO streaming.StreamJob: map 56% reduce 0%
11/07/24 10:06:27 INFO streaming.StreamJob: map 57% reduce 0%
11/07/24 10:22:28 INFO streaming.StreamJob: map 58% reduce 0%
11/07/24 10:27:19 INFO streaming.StreamJob: map 59% reduce 0%
11/07/24 10:35:28 INFO streaming.StreamJob: map 60% reduce 0%
11/07/24 10:42:34 INFO streaming.StreamJob: map 61% reduce 0%
11/07/24 10:44:05 INFO streaming.StreamJob: map 62% reduce 0%
11/07/24 10:55:36 INFO streaming.StreamJob: map 63% reduce 0%
11/07/24 11:16:47 INFO streaming.StreamJob: map 64% reduce 0%
11/07/24 11:19:35 INFO streaming.StreamJob: map 65% reduce 0%
11/07/24 11:24:42 INFO streaming.StreamJob: map 66% reduce 0%
11/07/24 11:41:27 INFO streaming.StreamJob: map 67% reduce 0%
11/07/24 11:42:06 INFO streaming.StreamJob: map 68% reduce 0%
11/07/24 11:44:15 INFO streaming.StreamJob: map 69% reduce 0%
11/07/24 11:46:07 INFO streaming.StreamJob: map 70% reduce 0%
11/07/24 11:47:14 INFO streaming.StreamJob: map 71% reduce 0%
11/07/24 11:51:46 INFO streaming.StreamJob: map 72% reduce 0%
11/07/24 11:59:00 INFO streaming.StreamJob: map 73% reduce 0%
11/07/24 12:01:08 INFO streaming.StreamJob: map 74% reduce 0%
11/07/24 12:01:29 INFO streaming.StreamJob: map 75% reduce 0%
11/07/24 12:03:11 INFO streaming.StreamJob: map 76% reduce 0%
11/07/24 12:04:47 INFO streaming.StreamJob: map 77% reduce 0%
11/07/24 12:14:29 INFO streaming.StreamJob: map 78% reduce 0%
11/07/24 12:14:52 INFO streaming.StreamJob: map 79% reduce 0%
11/07/24 12:17:10 INFO streaming.StreamJob: map 80% reduce 0%
11/07/24 12:19:49 INFO streaming.StreamJob: map 81% reduce 0%
11/07/24 12:26:02 INFO streaming.StreamJob: map 82% reduce 0%
11/07/24 12:27:51 INFO streaming.StreamJob: map 83% reduce 0%
11/07/24 12:30:37 INFO streaming.StreamJob: map 84% reduce 0%
11/07/24 12:33:11 INFO streaming.StreamJob: map 85% reduce 0%
11/07/24 12:34:36 INFO streaming.StreamJob: map 86% reduce 0%
11/07/24 12:40:57 INFO streaming.StreamJob: map 87% reduce 0%
11/07/24 12:41:13 INFO streaming.StreamJob: map 88% reduce 0%
11/07/24 12:42:51 INFO streaming.StreamJob: map 89% reduce 0%
11/07/24 12:51:58 INFO streaming.StreamJob: map 90% reduce 0%
11/07/24 12:56:46 INFO streaming.StreamJob: map 91% reduce 0%
11/07/24 13:01:17 INFO streaming.StreamJob: map 92% reduce 0%
11/07/24 13:06:20 INFO streaming.StreamJob: map 93% reduce 0%
11/07/24 13:13:11 INFO streaming.StreamJob: map 94% reduce 0%
11/07/24 13:18:50 INFO streaming.StreamJob: map 95% reduce 0%
11/07/24 13:19:26 INFO streaming.StreamJob: map 96% reduce 0%
11/07/24 13:23:19 INFO streaming.StreamJob: map 97% reduce 0%
11/07/24 13:24:00 INFO streaming.StreamJob: map 98% reduce 0%
11/07/24 13:28:37 INFO streaming.StreamJob: map 99% reduce 0%
11/07/24 13:36:03 INFO streaming.StreamJob: map 100% reduce 0%
11/07/24 22:08:03 INFO streaming.StreamJob: Job complete: job_201107231823_0003
11/07/24 22:08:03 INFO streaming.StreamJob: Output: repeat

real 1109m9.111s
user 0m10.320s
sys 0m1.582s
Repeat masker runs completed


Postsubstraction on the Unmapped reads
11/07/24 22:28:29 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/root/FQone2Fasta.java, /root/extractFullQuert4BHitTable.java, /root/extractUnmapped_latest.java, /root/Fas2FQ1.java, /root/FQone2Fastq.java,
/root/RepeatMaskerFormat.java, /root/ParsedBlastParser.cc, /root/blastxml.cc, /root/BlastParser.java, /root/RepeatMaskerRead.java, /root/mapper_postunmapped
.py, /mnt/hadoop/hadoop-unjar1994272368229376705/] [] /tmp/streamjob104650512695835986.jar tmpDir=null
11/07/24 22:28:29 INFO mapred.FileInputFormat: Total input paths to process : 40
11/07/24 22:28:30 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
11/07/24 22:28:30 INFO streaming.StreamJob: Running job: job_201107231823_0005
11/07/24 22:28:30 INFO streaming.StreamJob: To kill this job, run:
11/07/24 22:28:30 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-204-131-114.ec2.internal:50002
-kill job_201107231823_0005
11/07/24 22:28:30 INFO streaming.StreamJob: Tracking URL: http://ip-10-204-131-114.ec2.interna...107231823_0005
11/07/24 22:28:31 INFO streaming.StreamJob: map 0% reduce 0%
11/07/24 22:28:43 INFO streaming.StreamJob: map 5% reduce 0%
11/07/24 22:28:44 INFO streaming.StreamJob: map 8% reduce 0%
11/07/24 22:28:45 INFO streaming.StreamJob: map 15% reduce 0%
11/07/24 22:28:46 INFO streaming.StreamJob: map 20% reduce 0%
11/07/24 22:28:48 INFO streaming.StreamJob: map 38% reduce 0%
11/07/24 22:28:49 INFO streaming.StreamJob: map 55% reduce 0%
11/07/24 22:28:50 INFO streaming.StreamJob: map 58% reduce 0%
11/07/24 22:28:51 INFO streaming.StreamJob: map 65% reduce 0%
11/07/24 22:28:52 INFO streaming.StreamJob: map 72% reduce 0%
11/07/24 22:28:53 INFO streaming.StreamJob: map 90% reduce 0%
11/07/24 22:28:54 INFO streaming.StreamJob: map 100% reduce 0%

Job 5 ran for more than 34 hours before I terminated it.

From the output in S3 buckets I estimate that there were ~1 million reads after Maq subtraction and ~200,000 reads after repeat masking and Blast. PathSeq ran much slower than I expected and I don't know what I did wrong. Can you take a look at the logs and let me know what you think?
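A rough back-of-envelope from the figures above gives a feel for the throughput. Assuming roughly 1 million reads entered the RepeatMasker step (the post's own estimate) and taking the logged wall time of 1109 minutes:

```shell
# Rough throughput estimate from the wall times logged above.
# Assumptions: ~1,000,000 reads entered the RepeatMasker step (the
# post's own estimate); "real 1109m9.111s" rounded to 1109 minutes.
reads=1000000
minutes=1109
echo "$(( reads / minutes )) reads/min across the whole cluster"
```

At well under a thousand reads per minute cluster-wide, even a modest dataset would take many hours, which is consistent with the run times reported in this thread.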
Thanks so much for your help!

Yi Wei
yiweiny is offline   Reply With Quote
Old 07-26-2011, 12:47 PM   #15
Tomi
Member
 
Location: Cambridge

Join Date: Jul 2011
Posts: 12
Default Same Problem

Hi,

I just wanted to tell you that I have the same problem.
I looked through the log files and saw that there is an exception, and I guess that is why the results are not really good.

I tested it with some sequenced data from Illumina, but unfortunately I got no reliable results. The output file always shows that no reads were identified as human or as well-known pathogens, but that's not possible.
It also took very long, although I had used only a small amount of data.

Well here is the exception which i found:

///////////////////////////////////////////////////////////////
Exception in thread "Timer thread for monitoring dfs" java.lang.NullPointerException
at org.apache.hadoop.metrics.ganglia.GangliaContext.xdr_string(GangliaContext.java:195)
at org.apache.hadoop.metrics.ganglia.GangliaContext.emitMetric(GangliaContext.java:138)
at org.apache.hadoop.metrics.ganglia.GangliaContext.emitRecord(GangliaContext.java:123)
at org.apache.hadoop.metrics.spi.AbstractMetricsContext.emitRecords(AbstractMetricsContext.java:304)
at org.apache.hadoop.metrics.spi.AbstractMetricsContext.timerEvent(AbstractMetricsContext.java:290)
at org.apache.hadoop.metrics.spi.AbstractMetricsContext.access$000(AbstractMetricsContext.java:50)
at org.apache.hadoop.metrics.spi.AbstractMetricsContext$1.run(AbstractMetricsContext.java:249)
at java.util.TimerThread.mainLoop(Unknown Source)
at java.util.TimerThread.run(Unknown Source)

/////////////////////////////////////////////////////////////////////////

I guess there is a problem with the Hadoop cluster. I am now trying the newer version of Hadoop; maybe this will change something.

But I am quite sure that this is not a config problem.

I will let you know if I find a solution!
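For what it's worth, that stack trace comes from Hadoop's metrics subsystem, not from PathSeq's own code: the "Timer thread for monitoring dfs" dies inside GangliaContext when the Ganglia reporter is enabled but not fully configured. On its own it should not change job results, but if you want to rule it out you can route the metrics contexts to NullContext instead. A sketch, assuming the stock Hadoop 0.19 configuration layout:

```properties
# conf/hadoop-metrics.properties -- route metrics to NullContext so the
# Ganglia reporter (and its monitoring timer thread) is never started.
dfs.class=org.apache.hadoop.metrics.spi.NullContext
mapred.class=org.apache.hadoop.metrics.spi.NullContext
jvm.class=org.apache.hadoop.metrics.spi.NullContext
rpc.class=org.apache.hadoop.metrics.spi.NullContext
```

If the empty human/pathogen counts persist after silencing this, the exception probably was not the cause.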


with best regards,
Tomi

Quote:
Originally Posted by yiweiny View Post
Hi, Chandra,
The following is my experience running PathSeq with my own data:
I started with ~70 million human RNA-Seq 100 bp Illumina reads. I prefiltered these reads by running Bowtie against human 37.1 reference genome in my own desktop and ended up with ~11 million reads. After running Preprocessed_Reads.com, I got ~1.6 million reads. These reads were then uploaded onto S3 and PathSeq was launched on 20 nodes. PathSeq ran for more than 60 hour without finishing and I had to terminate the whole job. Here is the log I got from the master node.

Master data_loader
11/07/23 18:29:41 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/root/mapper_data_compsub.py, /mnt/hadoop/hadoop-unjar9078098862757602177/] [] /tmp/streamjob1077975618755703344.jar tmpDir=null
11/07/23 18:29:42 INFO mapred.FileInputFormat: Total input paths to process : 20
11/07/23 18:29:42 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
11/07/23 18:29:42 INFO streaming.StreamJob: Running job: job_201107231823_0001
11/07/23 18:29:42 INFO streaming.StreamJob: To kill this job, run:
11/07/23 18:29:42 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-204-131-114.ec2.internal:50002
-kill job_201107231823_0001
11/07/23 18:29:42 INFO streaming.StreamJob: Tracking URL: http://ip-10-204-131-114.ec2.interna...107231823_0001
11/07/23 18:29:43 INFO streaming.StreamJob: map 0% reduce 0%
11/07/23 18:29:55 INFO streaming.StreamJob: map 10% reduce 0%
11/07/23 18:29:56 INFO streaming.StreamJob: map 30% reduce 0%
11/07/23 18:29:57 INFO streaming.StreamJob: map 45% reduce 0%
11/07/23 18:29:58 INFO streaming.StreamJob: map 60% reduce 0%
11/07/23 18:29:59 INFO streaming.StreamJob: map 80% reduce 0%
11/07/23 18:30:00 INFO streaming.StreamJob: map 100% reduce 0%
11/07/23 18:54:42 INFO streaming.StreamJob: Job complete: job_201107231823_0001
11/07/23 18:54:42 INFO streaming.StreamJob: Output: load

real 25m1.703s
user 0m2.231s
sys 0m0.320s
Master loader completed

Maq alignments + Duplicate remover
11/07/23 18:54:51 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/root/mapper_maqalignment.py, /root/Sam2Fastq.java, /root/FQone2Fastq.java, /root/Fastq2FQone.java, /root/removeduplicates_new.java, /root/MA
Qunmapped2FQone.java, /root/MAQunmapped2fastq.java, /mnt/hadoop/hadoop-unjar6628816665337722828/] [] /tmp/streamjob6711648814308398287.jar tmpDir=null
11/07/23 18:54:52 INFO mapred.FileInputFormat: Total input paths to process : 21
11/07/23 18:54:52 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
11/07/23 18:54:52 INFO streaming.StreamJob: Running job: job_201107231823_0002
11/07/23 18:54:52 INFO streaming.StreamJob: To kill this job, run:
11/07/23 18:54:52 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-204-131-114.ec2.internal:50002
-kill job_201107231823_0002
11/07/23 18:54:52 INFO streaming.StreamJob: Tracking URL: http://ip-10-204-131-114.ec2.interna...107231823_0002
11/07/23 18:54:53 INFO streaming.StreamJob: map 0% reduce 0%
11/07/23 18:55:04 INFO streaming.StreamJob: map 29% reduce 0%
11/07/23 18:55:05 INFO streaming.StreamJob: map 43% reduce 0%
11/07/23 18:55:06 INFO streaming.StreamJob: map 52% reduce 0%
11/07/23 18:55:07 INFO streaming.StreamJob: map 67% reduce 0%
11/07/23 18:55:08 INFO streaming.StreamJob: map 90% reduce 0%
11/07/23 18:55:09 INFO streaming.StreamJob: map 100% reduce 0%
11/07/24 03:34:56 INFO streaming.StreamJob: Job complete: job_201107231823_0002
11/07/24 03:34:56 INFO streaming.StreamJob: Output: maq

real 520m5.778s
user 0m7.229s
sys 0m2.200s
Maq alignments + Duplicate remover completed

Tomi is offline   Reply With Quote
Old 08-01-2011, 04:17 AM   #16
Tomi
Member
 
Location: Cambridge

Join Date: Jul 2011
Posts: 12
Default

Hello,

I have a problem with the new update: I have no access to the AMI, and the old one is also not accessible any more!

I would appreciate some help.

Greetings,
Tomi
Tomi is offline   Reply With Quote
Old 09-21-2011, 07:59 AM   #17
pcs_murali
Member
 
Location: Boston

Join Date: May 2010
Posts: 26
Default

Hi Tomi,

I am sorry for the late response. Please check now; it was fixed last month.

Please let me know if you have any problems.

Thanks
Chandra
pcs_murali is offline   Reply With Quote
Old 11-29-2011, 10:13 AM   #18
YW0712
Junior Member
 
Location: MD

Join Date: Nov 2011
Posts: 7
Default

Hello,

Thanks for sharing all your helpful comments here. I'm new to PathSeq and am trying to finish the installation. Unfortunately, I'm getting this error message:

Created hadoop-0.19.0-x86
ec2-bundle-vol interrupted.
_64.part.81
Created hadoop-0.19.0-x86_64.part.82
Created hadoop-0.19.0-x86_64.part.83
Created hadoop-0.19.0-x86_64.part.84
Created hadoop-0.19.0-x86_64.part.85
Created hadoop-0.19.0-x86_64.part.86
Generating digests for each part...
Digests generated.
--manifest has invalid value '/mnt/hadoop-0.19.0-x86_64.manifest.xml': File does not exist or is not a file.
Try 'ec2-upload-bundle --help'
Done
Client.InvalidManifest: HTTP 403 (Forbidden) response for URL http://s3.amazonaws.com:80/ami-tmp/h....manifest.xml: check your S3 ACLs are correct.
Terminate with: ec2-terminate-instances i-0b0c2868
INSTANCE i-0b0c2868 running shutting-down
Creation completed

If you'd please share any insight, I'd really appreciate it! I'd be happy to provide any more info.

Thanks in advance!
YW0712 is offline   Reply With Quote
Old 11-29-2011, 01:29 PM   #19
pcs_murali
Member
 
Location: Boston

Join Date: May 2010
Posts: 26
Default

Hi,

Thanks for your interest.

Did you download the ec2-api-tools?

Also, please recheck all the steps in the installation manual.

Then please re-run the script to create your own AMI.
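Before re-running, it may help to confirm whether the interrupted ec2-bundle-vol actually left a manifest behind; the "File does not exist" and 403 errors above suggest it did not. A minimal guard, using the manifest path from the error message (`check_manifest` is just an illustrative helper, not part of PathSeq or the EC2 AMI tools):

```shell
# Guard before ec2-upload-bundle: the upload can only succeed once
# ec2-bundle-vol has finished writing the manifest to /mnt.
# check_manifest is an illustrative helper for this check only.
check_manifest() {
    if [ -f "$1" ]; then
        echo "manifest present -- safe to run ec2-upload-bundle"
    else
        echo "manifest missing -- re-run ec2-bundle-vol first"
    fi
}
check_manifest /mnt/hadoop-0.19.0-x86_64.manifest.xml
```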

Thanks
Chandra



pcs_murali is offline   Reply With Quote
Old 11-30-2011, 09:31 AM   #20
YW0712
Junior Member
 
Location: MD

Join Date: Nov 2011
Posts: 7
Default

Hi Chandra,

Thanks for the prompt response. I did download the ec2-api-tools and have double-checked the steps, but I can't seem to find what I'm missing. The first error messages I'm still getting are:

REP AVAILABLE
/xchip/pasteur/chandra/Amazon/hadoop-0.20.2/src/contrib/ec2/bin/test: No such file or directory
distrib.tar.Z: No such file or directory
install_rlib 100% 103 0.1KB/s 00:00
/bin/tar: /usr/local/RepeatMasker/repeatmaskerlibraries-20090604.tar.gz: Cannot open: No such file or directory
/bin/tar: Error is not recoverable: exiting now
/bin/tar: Child returned status 2
/bin/tar: Error exit delayed from previous errors
install_cross 100% 208 0.2KB/s 00:00
/bin/tar: /usr/local/RepeatMasker/crossmatch/distrib.tar.Z: Cannot open: No such file or directory
/bin/tar: Error is not recoverable: exiting now

I've double-checked the directories for RepeatMasker and Crossmatch specified in my cluster.config file (which is not under /usr/local/). Is there another place I should've changed the directories? Any help would be greatly appreciated!
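The tar errors above suggest the install script looked for the archives under /usr/local/RepeatMasker rather than the configured directories, so it is worth confirming where the files actually sit. A quick existence check, with the paths copied from the error log (`check_file` is just an illustrative helper; substitute the directories from your cluster.config):

```shell
# Check that the two archives the install script failed to open exist
# at the paths from the error log (substitute the directories from your
# cluster.config). check_file is an illustrative helper for this check.
check_file() {
    if [ -f "$1" ]; then
        echo "found: $1"
    else
        echo "MISSING: $1"
    fi
}
check_file /usr/local/RepeatMasker/repeatmaskerlibraries-20090604.tar.gz
check_file /usr/local/RepeatMasker/crossmatch/distrib.tar.Z
```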
YW0712 is offline   Reply With Quote
Tags
cloud computing
