Old 08-07-2012, 03:24 PM   #7
LSC
Member
 
Location: stanford

Join Date: Jul 2012
Posts: 24

Quote:
Originally Posted by ZFHans View Post
Hi LSC,

I'm trying to improve a 1.6 GB genome with PacBio data. Celera read correction is slow, so I welcome your effort. I am trying to run LSC 0.2 but have run into problems. First, in some of the scripts that make up LSC the first line is #!/home/stow/swtree/bin/python2.6. Changing this to #!/usr/bin/python got rid of some error messages.
Second, I installed novoalign v2.08 as suggested to do the alignments. In the runLSC.py script the aligner is called with no option for the output format, so novoalign produces its native format. The next script, however, expects (I assume) the SAM format. So I added -o SAM to the option list in line 207 of runLSC.py (I also had to add the path to novoalign because it would not run otherwise), and this got me to the next problem, in convertNAV.py. At line 78 and line 127, this script looks at the first character of each line in the nav file. In my version of the nav file the header character is @ instead of #, so I changed this. Now the desired .map file is produced, but with only one column of numbers. I know I have short reads aligned, so I think I should have more columns. Could you please comment on this? Below I paste an example of my SAM output, which is different from the example in your script.


Code:
@HD	VN:1.0	SO:unsorted
@PG	ID:novoalign	PN:novoalign	VN:V2.08.02	CL:novoalign -r All -F FA -o SAM -d /mnt/scrap_disk/temp2/pseudochr_LR.fa.cps.nix -f /mnt/scrap_disk/temp2/SR.fa.ai.cps
@SQ	SN:Pac1	AS:pseudochr_LR.fa.cps.nix	LN:50000716
@SQ	SN:Pac2	AS:pseudochr_LR.fa.cps.nix	LN:50000772
@SQ	SN:Pac3	AS:pseudochr_LR.fa.cps.nix	LN:50002188
@SQ	SN:Pac4	AS:pseudochr_LR.fa.cps.nix	LN:50000094
@SQ	SN:Pac5	AS:pseudochr_LR.fa.cps.nix	LN:50001433
@SQ	SN:Pac6	AS:pseudochr_LR.fa.cps.nix	LN:50001526
@SQ	SN:Pac7	AS:pseudochr_LR.fa.cps.nix	LN:50001210
@SQ	SN:Pac8	AS:pseudochr_LR.fa.cps.nix	LN:50000056
@SQ	SN:Pac9	AS:pseudochr_LR.fa.cps.nix	LN:50000143
@SQ	SN:Pac10	AS:pseudochr_LR.fa.cps.nix	LN:50002588
@SQ	SN:Pac11	AS:pseudochr_LR.fa.cps.nix	LN:50001867
@SQ	SN:Pac12	AS:pseudochr_LR.fa.cps.nix	LN:50000245
@SQ	SN:Pac13	AS:pseudochr_LR.fa.cps.nix	LN:28473695
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6307:2493	16	Pac10	21159148	3	8S67M19S	*	0	0	GAGTATACTCTCATCACATCAGTCAGAGCTGAGAGCTCTGATGAGAGTGACGTCTCAGACAGAGTCAGTGCTCTGATAGCTGACAGTGAGATAG	*	PG:Z:novoalign	AS:i:242	UQ:i:242	NM:i:0	MD:Z:67	CC:Z:Pac2	CP:i:6239884	ZS:Z:R	ZN:i:2	NH:i:2	HI:i:1	IH:i:2
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6307:2493	256	Pac2	6239884	3	14S72M8S	*	0	0	CTATCTCACTGTCAGCTATCAGAGCACTGACTCTGTCTGAGACGTCACTCTCATCAGAGCTCTCAGCTCTGACTGATGTGATGAGAGTATACTC	*	PG:Z:novoalign	AS:i:242	UQ:i:242	NM:i:1	MD:Z:32G39	ZS:Z:R	ZN:i:2	NH:i:2	HI:i:2	IH:i:2
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6754:2491	4	*	0	0	*	*	0	0	CTCTATATCATGACGAGCATGTACTATACATAGCTGTGCAGCATCTAGAGTGTATCAGAGCACACAC	*	PG:Z:novoalign	ZS:Z:NM
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6775:2489	4	*	0	0	*	*	0	0	AGTATATCTAGCATAGCTAGCACTCACTGTCATCTGTCATACATACTATATATATGTATATAGCTCTCTGAGCTAGACTGAGACTCTGATCAGACATCATGTATGAGATGTG	*	PG:Z:novoalign	ZS:Z:NM
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:6822:2491	4	*	0	0	*	*	0	0	TGATACTATAGTGAGAGATACTACATGATATCACTGCTCTCTG	*	PG:Z:novoalign	ZS:Z:NM
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:7018:2483	0	Pac10	4706429	2	28M1I16M1I29M1S	*	0	0	ATAGTATCACTGCATACTATCATCTCAGCTGCTCTGCACTGCTGACTGTACTCGCTGCAGTATATCTATGATGTAT	*	PG:Z:novoalign	AS:i:122	UQ:i:122	NM:i:2	MD:Z:73	CC:Z:Pac7	CP:i:31267643	ZS:Z:R	ZN:i:2	NH:i:2	HI:i:1	IH:i:2
ILLUMINA-52179E:60:FC70G0LAAXX:6:77:7018:2483	256	Pac7	31267643	2	6M1I21M1I47M	*	0	0	ATAGTATCACTGCATACTATCATCTCAGCTGCTCTGCACTGCTGACTGTACTCGCTGCAGTATATCTATGATGTAT	*	PG:Z:novoalign	AS:i:122	UQ:i:122	NM:i:3	MD:Z:43G30	ZS:Z:R	ZN:i:2	NH:i:2	HI:i:2	IH:i:2
Could we combine this thread with the same one in the bioinformatics section?

Many thanks, Hans Jansen
Hi Hans Jansen,
Your feedback is really helpful. Although LSC works well on my computer cluster now, I know something may go wrong when it is run on other systems. Your test is a great example that helps me find bugs.
1) Your change to the Python path is correct. I will fix it in the next version.
2) LSC uses novoalign's native output format, not SAM. Please don't change it; please try again with my original native-format setting. In addition, if BWA or bowtie2 could output ALL possible mappable hits, LSC would save over 50% of its running time by using them (novoalign is somewhat slow). Do you know of any way to make BWA and bowtie2 output all hits (including detailed indel information)?
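
For anyone who hits the same format mismatch, here is a minimal sketch (not part of LSC) for checking which format an alignment file is in before it goes to convertNAV.py. It assumes, based only on what is described in this thread, that native novoalign header lines start with # and SAM header lines start with @. It also shows how indel information can be read from a SAM CIGAR string such as the ones in the example above. The function names are hypothetical.

```python
import re

# Assumption (from this thread only): native novoalign header lines start
# with '#', while SAM header lines start with '@'.
def guess_alignment_format(lines):
    """Guess the format from the first non-empty line of an alignment file."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("@"):
            return "sam"
        if line.startswith("#"):
            return "native"
        return "unknown"
    return "unknown"

# CIGAR operation letters as defined in the SAM specification.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    if cigar == "*":  # unmapped read, no alignment
        return []
    ops = CIGAR_OP.findall(cigar)
    # Reject strings that the pattern did not consume completely.
    if "".join(n + op for n, op in ops) != cigar:
        raise ValueError("malformed CIGAR: %r" % cigar)
    return [(int(n), op) for n, op in ops]

def count_indels(cigar):
    """Return (inserted bases, deleted bases) for one alignment."""
    ops = parse_cigar(cigar)
    inserted = sum(n for n, op in ops if op == "I")
    deleted = sum(n for n, op in ops if op == "D")
    return inserted, deleted

if __name__ == "__main__":
    header = ["@HD\tVN:1.0\tSO:unsorted"]
    print(guess_alignment_format(header))
    print(parse_cigar("28M1I16M1I29M1S"))
    print(count_indels("6M1I21M1I47M"))
```

If the check reports "sam" for a file that LSC is about to convert, the aligner was most likely run with -o SAM, and rerunning it without that option should give the native format that LSC expects.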