SEQanswers

03-20-2010, 03:16 PM   #1
pedrolance
Junior Member
 
Location: Berkeley CA

Join Date: May 2008
Posts: 2
ShoRAH

I'm looking for people who have already used, or are trying to use, ShoRAH with very short reads.
10-12-2012, 12:47 PM   #2
naragam
Member
 
Location: Durham, NC

Join Date: Apr 2012
Posts: 21

Yes, I have used ShoRAH, but only with 454 reads. I am currently working with Illumina reads for the first time, and I am not sure what ShoRAH assumes about Illumina input files.

If anybody out there has experience with the SAM/BAM tools, could you please reply so I can ask you some questions? Thanks much in advance,

Nash
02-06-2013, 07:26 AM   #3
ckseq
Junior Member
 
Location: South Africa

Join Date: Feb 2013
Posts: 4

Hi Naragam and Pedrolance,

I just found this post after submitting a thread about ShoRAH. I'm currently running ShoRAH for haplotype reconstruction from Illumina short-read data. The analysis has been going for two days now, with no results yet. If I manage to get results, I will post an update on my progress.

Naragam, since you have used ShoRAH for 454 reads, can you please help me understand what the parameters mean?

I am running the analysis with input data of 90 000 short reads mapped to an 11 kb reference genome, with -j = 100, -t = 1000 and -K = 10. I left out the rest.

What is the significance/function of -a (alpha), -K (start value for number of clusters), -k (number of reads per start cluster), -t (history time) and -R (random seed)? How do they influence my analysis?

I am a complete novice. It has taken me 4 months to get the program to work, starting without any prior knowledge of Linux or the command line, and that's with lots of help from knowledgeable bioinformatics people. They have never used ShoRAH and don't know what the parameters mean in the context of my analysis. I keep reading and reading, but the literature has limited scope for someone trained only in biology.

ck
02-06-2013, 10:26 AM   #4
naragam
Member
 
Location: Durham, NC

Join Date: Apr 2012
Posts: 21

ckseq,

Okay, the first thing I'd like to ask is which version of ShoRAH you are trying to run. If it's 0.6 (the new one), a lot has CHANGED! You might want to go through the minimal documentation available on their website. If you have the older (perhaps more stable) version 0.5, then I can help you a bit...

For v0.5, my run-shorah command is as follows:

shorah.py -f $1 -r $2 -j 1000 -s 1 -w $3 -a 0.1 -k -t $4

where $1 is the FASTA-formatted input file of 454 reads (Illumina reads are completely different!), $2 is the reference (you need the shortest reference that covers your reads, or else you won't get any decent haps!), and $3 is a tricky window parameter: initially set it to a very large number (say, 10000), then look at the dec.log output to find the window size ShoRAH suggests for covering your amplicon. Finally, $4 is the threshold parameter, which is set to a fairly high value of 0.7 by default; depending on your read quality and reference, you may have to play with it (I mean, reduce it to 0.5 or so) to get the haps you are looking for.
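
For anyone who wants to script the two-pass routine described above (a first run with a very large window, then a rerun with the window size read from dec.log), here is a minimal Python sketch that only assembles the v0.5 command line. It uses exactly the flags that appear in the quoted command and has not been checked against the ShoRAH documentation; the function name and the file names are hypothetical.

```python
import shlex


def build_shorah_v05_command(reads_fasta, reference, window, threshold=0.7):
    """Assemble the v0.5 command quoted in this post as an argument list.

    Only flags that appear in the quoted command are used; their values
    are copied from it, apart from the four positional placeholders.
    """
    return [
        "shorah.py",
        "-f", reads_fasta,      # FASTA file of 454 reads ($1)
        "-r", reference,        # shortest reference covering the reads ($2)
        "-j", "1000",           # value copied from the quoted command
        "-s", "1",              # value copied from the quoted command
        "-w", str(window),      # window size ($3); very large on the first pass
        "-a", "0.1",            # value copied from the quoted command
        "-k",
        "-t", str(threshold),   # threshold ($4); 0.7 is described as the default
    ]


# Hypothetical file names, for illustration only.
first_pass = build_shorah_v05_command("reads.fasta", "amplicon_ref.fasta", window=10000)
print(shlex.join(first_pass))
```

The list form can be passed straight to `subprocess.run`, which avoids shell quoting problems with file names; `shlex.join` is used here only to print a copy-pasteable line.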

Phew...yes, it's a bit of a wild goose chase, but, you can work with 454 seqs pretty reliably using ShoRAH 0.5.

For Illumina reads, I used ShoRAH 0.6, which ONLY works with *.sorted.bam sequence files (NO, you can't use FASTA files here). There are a lot of changes, and I am still trying to make sense of my outputs so far...

As far as your specific parameter question is concerned, I haven't used the alpha (a) parameter, the -K parameter, or the -R parameter... I used the -k and -t parameter with default values, and I think -k is used just to keep the intermediate files.

Hope this helps...if you still have any more specific questions, please let me know and I shall try to help. Also, Dr. Zagordi who is the principal developer of the ShoRAH s/w is very helpful and always replies to any of your emails directly. Good luck!

Cheers, Nash
02-06-2013, 02:08 PM   #5
ckseq
Junior Member
 
Location: South Africa

Join Date: Feb 2013
Posts: 4

Hi Nash

Thanks for your helpful reply! I'm not familiar with the global analysis or the 454 data parameters. I will be sure to refer back to your post whenever I need to work with 454 data! Thanks!

I'm not sure about the version of ShoRAH, since I didn't install it on the server. I can say with certainty, though, that we had our fair share of sorted BAM file problems. We don't have a licence for Novoalign to start with. If I understood the procedure correctly, we used Bowtie in the end. It won't work with paired-end reads, though, so we used both the "forward" and "reverse" datasets but ignored the paired-end option when we mapped. We still mapped 90 000 reads to an 11 kb genome, which goes to show that paired-end reads would have been overkill in any case!

Do you perhaps know the default values for the -k and -t parameters you used? I'm not certain whether the values we set in my analysis are the defaults, and that's why I'm asking. I'm writing up a Masters thesis, so there's an inherent need to be anal about the values and parameters.

Two and a half days and counting. I hope it's not due to the 100 iterations, or -t = 1000 or -K = 10. Then again, I realize that it's necessary to have enough iterations, and I hope that 100 might be sufficient.

Oh, the joys of not being equipped with computational biology skills. Then again, one has to start somewhere!

Thanks again for the help
02-07-2013, 06:30 AM   #6
naragam
Member
 
Location: Durham, NC

Join Date: Apr 2012
Posts: 21

ck,

"-k" parameter is just a binary flag (TRUE or FALSE; default is FALSE) that saves the intermediate files for you to look at. "-t" is the threshold I talked about and I have used values as low as 0.5 at times for some of the sequences that would not generate any haps for a high default value of 0.7. However, as I found later on, the real sensitive parameter is your reference sequence.... the shorter and better overlapping with your amplicon region the ref is, the better are your haps!

Good luck again...

Nash
02-11-2013, 08:09 AM   #7
ckseq
Junior Member
 
Location: South Africa

Join Date: Feb 2013
Posts: 4

Naragam,

I am indeed using version 0.6. Thanks for your input. I also contacted Dr Zagordi, and he informed me that -t should always be significantly less than -j (which was not the case in my first attempt). I restarted the analysis and it kept running and running, without writing anything to the .smp or .dbg file.
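
The -t versus -j advice can be turned into a quick check before launching a long run. This is only a sketch in Python: "significantly less" was not quantified, so the `max_fraction` cutoff of 0.5 is an arbitrary placeholder and not a ShoRAH rule, and the function name is hypothetical.

```python
def history_is_short_enough(iterations, history, max_fraction=0.5):
    """Return True when the history (-t) is at most max_fraction of the iterations (-j).

    max_fraction is an assumed, arbitrary cutoff standing in for
    "significantly less"; adjust it to taste.
    """
    if iterations <= 0 or history < 0:
        raise ValueError("iterations must be positive and history non-negative")
    return history <= max_fraction * iterations


# Settings from the first attempt in this thread: -j = 100, -t = 1000.
print(history_is_short_enough(iterations=100, history=1000))  # False
```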

We decided to kill it again and try a pilot analysis with very little data: 8k reads mapped to an 11 kb genome.

I hope this one works so that we can maybe increase the amount of data. At least something is writing to the .smp and .dbg file now!
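
One possible way to draw a reproducible pilot subset like this is sketched below in Python. It is not necessarily how the subset in this post was made: the read IDs are synthetic stand-ins, the function name is hypothetical, and in practice the IDs would come from the mapped-read file.

```python
import random


def subsample_reads(read_ids, n, seed=0):
    """Pick n reads uniformly at random without replacement, keeping input order.

    A fixed seed makes the pilot subset reproducible between runs.
    """
    if n > len(read_ids):
        raise ValueError("cannot sample more reads than are available")
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(read_ids)), n))
    # Preserve the original ordering of the reads that were kept.
    return [read for index, read in enumerate(read_ids) if index in chosen]


# Synthetic stand-ins for the 90 000 mapped reads mentioned in this thread.
all_reads = [f"read_{i}" for i in range(90_000)]
pilot = subsample_reads(all_reads, 8_000)
print(len(pilot), f"{len(pilot) / len(all_reads):.1%}")  # 8000 8.9%
```

Keeping the input order matters if the reads come from a coordinate-sorted file, since the subset then stays sorted as well.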

Tags
illumina, shorah, virus

All times are GMT -8.