#1 |
Senior Member
Location: Ottawa Join Date: Apr 2011
Posts: 130
Greetings to you all,
First of all, thank you for creating this forum; it seems like a great way to share knowledge. I am an undergraduate student starting a job at my university that involves assembling sequence data. I am an experienced Linux user, so I will try to use Linux programs.
So far I have familiarized myself a bit with Velvet. I want to assemble some data that has previously been assembled well, to test Velvet's capabilities on it. Once I am sure I know how to use Velvet, I will also try Ray and SOAP.
I have 2 files:
100611_s_4_1_seq_GDR-7.txt (1.6 GB)
100611_s_4_2_seq_GDR-7.txt (1.6 GB)
I have used this as a reference for my work: http://wiki.bioinformatics.ucdavis.e...y_using_velvet
From what I understand, I need to merge the two files into one file with shuffleSequences_fastq.pl. Is this correct?
Code:
shuffleSequences_fastq.pl 100611_s_4_1_seq_GDR-7.txt 100611_s_4_2_seq_GDR-7.txt 100611_s_4_both_seq_GDR-7.txt
Code:
17. Do the subsetting. Soon we will compare the single ended assembly to the paired-end assembly. In order for the comparison to be fair, we must use the same total number of reads. Therefore each paired end file will contain 1/4 of the reads:
Code:
Final graph has 302039 nodes and n50 of 175, max 1779, total 5984104, using 13024530/17609332 reads
I am using the -shortPaired option for velveth.
My last question: has anyone here used consed? I am asking because I have some problems setting up that program.
Thank you.
Last edited by AdrianP; 04-23-2011 at 11:42 AM.
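For anyone following along: the merge step interleaves the two files so that each read is followed directly by its mate. Below is a minimal Python sketch of that idea. It is not the shuffleSequences_fastq.pl script itself, and it assumes both inputs are already plain 4-line FASTQ with the reads in the same order in both files. The names read_fastq and interleave are made up for illustration.

```python
from itertools import islice
from typing import Iterator, List, TextIO


def read_fastq(handle: TextIO) -> Iterator[List[str]]:
    """Yield one FASTQ record at a time as a list of 4 lines (no newlines).

    Assumption: records are exactly 4 lines each (no wrapped sequences).
    """
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(left: TextIO, right: TextIO, out: TextIO) -> int:
    """Write read 1, mate 1, read 2, mate 2, ... and return the pair count."""
    right_records = read_fastq(right)
    pairs = 0
    for rec1 in read_fastq(left):
        rec2 = next(right_records, None)
        if rec2 is None:
            raise ValueError("second file has fewer reads than the first")
        out.write("\n".join(rec1) + "\n")
        out.write("\n".join(rec2) + "\n")
        pairs += 1
    if next(right_records, None) is not None:
        raise ValueError("second file has more reads than the first")
    return pairs
```

Usage would be along the lines of opening the two read files and one output file with open(...) and calling interleave(f1, f2, out). The sketch refuses to continue if the two files hold different numbers of reads, since a shifted pairing would silently ruin a paired-end assembly.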
#2 |
Member
Location: Heidelberg Join Date: Feb 2011
Posts: 69
If you don't want to compare the paired-end assembly to a single-end one, you don't need to do the "subsetting".
The last line just tells you your N50, how many reads were used (you can also ask Velvet to output an UnusedReads.fa file), and so on. I am not sure whether the number of nodes is the number of your contigs. You can check that with grep ">" -c contigs.fa.
#3 |
Senior Member
Location: Ottawa Join Date: Apr 2011
Posts: 130
Thank you for your previous reply.
Is there any way to see which contig is the largest? Can I somehow sort the contigs by their size?
#4 |
Member
Location: Heidelberg Join Date: Feb 2011
Posts: 69
Sure, there are a lot of ways. ;-) Use a script. The contig ID even contains the length (counted in k-mers, so the length in bp is that value + kmer - 1), so it is really easy.
Here are some Perl scripts that might help: http://wiki.bioinformatics.ucdavis.e.../Data_Analysis
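As a rough illustration of the "use a script" suggestion, here is a small Python sketch that reads a contigs.fa-style FASTA file and lists the contigs from longest to shortest. It measures each sequence directly instead of trusting the length field in the ID, so it does not depend on the header format. The function names read_fasta and contigs_by_length are hypothetical, not part of Velvet or of the linked scripts.

```python
from typing import Dict, Iterable, List, Tuple


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {header: sequence}, joining wrapped lines."""
    parts: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {header: "".join(chunks) for header, chunks in parts.items()}


def contigs_by_length(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Return (header, length in bp) pairs, longest contig first."""
    sequences = read_fasta(lines)
    sizes = [(header, len(seq)) for header, seq in sequences.items()]
    return sorted(sizes, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    import sys

    # Assumed usage: python sort_contigs.py contigs.fa
    with open(sys.argv[1]) as handle:
        for header, size in contigs_by_length(handle):
            print(size, header, sep="\t")
```

The first line printed is the largest contig. Counting bases this way gives the length in bp, which can then be compared with the length value in the contig ID.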
#5 |
Senior Member
Location: Ottawa Join Date: Apr 2011
Posts: 130
If Velvet assembles poor (small) contigs when other programs with the same settings (coverage and insert size) do much, much better, what can I conclude?
By the way, most of those scripts are for FASTA, and I have FASTQ. Is there a script that converts?
Adrian
#6 |
Member
Location: Heidelberg Join Date: Feb 2011
Posts: 69
http://brianknaus.com/software/srtoolbox/fastq2fasta.pl
First hit in Google. ;-) Also, trimming "programs" normally take FASTQ as input and output FASTA.
Velvet needs good coverage to do well because it is de Bruijn graph based. Since I don't know what data you are running Velvet on or what you expect, I can't help much. Try smaller kmers, try different parameters, try multiple kmers....
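For reference, the conversion itself is simple: keep the header and the sequence of each record, and drop the "+" line and the quality line. A minimal Python sketch of that follows. It is not the linked fastq2fasta.pl, and it assumes plain 4-line FASTQ records. The function name fastq_to_fasta is made up for illustration.

```python
from typing import TextIO


def fastq_to_fasta(fastq: TextIO, fasta: TextIO) -> int:
    """Copy header and sequence of each FASTQ record to FASTA; return the count.

    Assumption: every record is exactly 4 lines (header, sequence, '+', quality).
    """
    count = 0
    while True:
        header = fastq.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines between records
        if not header.startswith("@"):
            raise ValueError("expected a FASTQ header starting with '@'")
        sequence = fastq.readline()
        fastq.readline()  # '+' separator line, not needed in FASTA
        fastq.readline()  # quality line, not needed in FASTA
        fasta.write(">" + header[1:].rstrip("\n") + "\n")
        fasta.write(sequence.rstrip("\n") + "\n")
        count += 1
    return count
```

Note that the quality values are thrown away, so any quality trimming has to happen before this step.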
#7 |
Senior Member
Location: Ottawa Join Date: Apr 2011
Posts: 130
One thing that is puzzling me is that Genegenious takes a few days to assemble the data that Velvet assembles in 10-15 minutes. Is it normal that Velvet runs so fast?
#8 |
Member
Location: Heidelberg Join Date: Feb 2011
Posts: 69
I have no clue what Genegenious is or what algorithm it is based on. If it is an overlap-based method, then yes, it is possible. Velvet's run time depends on the kmer, the number of reads you have, the expected coverage, the read length .....
#9 |
Senior Member
Location: Ottawa Join Date: Apr 2011
Posts: 130
I was wondering a bit more about the last line of Velvet's output.
What is n50? (I understand that it is a measure of quality; the higher the better?)
What is max?
What is total?
What are nodes? (This isn't as important, since I believe it relates to the graph that Velvet builds, and I do not use the graph, just the final contigs.fa file. Should I use the graph?)
Also, what is the difference between shortPaired and shortPaired2? Something to do with insert libraries...
Thanks a lot!
#10 |
Member
Location: Heidelberg Join Date: Feb 2011
Posts: 69
Come on, do a bit more research on your own. :P
Read the Velvet paper: http://genome.cshlp.org/content/18/5/821.short
You don't want to use the graph, and if you know what a graph is you should also know what nodes are. Nodes are the vertices of a graph.
shortPaired2 is the same as shortPaired but for a separate insert size library (also stated in the manual).
#11 |
Senior Member
Location: Ottawa Join Date: Apr 2011
Posts: 130
I have read the manual a few times, but what I was asking is what "separate insert size library" means. Separate from what?
As for the research paper, I had a look at it before, but I can't find a definition of n50; it jumps straight to using the term, even in the abstract. I would not have asked if I had not done the research myself. I found the answer here after googling n50: http://seqanswers.com/forums/showthread.php?t=2332
#12 |
Member
Location: Heidelberg Join Date: Feb 2011
Posts: 69
Yes, it is the first hit when you google "definition: N50".
Separate from "shortpaired", I assume. So you can use 2 different PE libraries with different insert sizes, but maybe I am wrong.
Total is the total number of base pairs they calculate, but keep in mind that the reported length of each contig is the "real" bp length minus kmer plus 1, as mentioned somewhere. ^^
Max might be the longest contig, but I don't remember exactly; you need to do more statistics anyway. ;-)
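To make the definition concrete: N50 is the length L such that contigs of length L or longer together contain at least half of all assembled bases. Here is a minimal Python sketch that computes it, along with max and total, from a list of contig lengths. The function names n50 and summarize are made up for illustration, and whether Velvet counts these values in bp or in k-mers is not something this sketch settles.

```python
from typing import Dict, Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no contig lengths given")
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest contig down until half of all bases are covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]


def summarize(lengths: Iterable[int]) -> Dict[str, int]:
    """Collect the three figures discussed above for one assembly."""
    values = list(lengths)
    return {"n50": n50(values), "max": max(values), "total": sum(values)}
```

A higher N50 means that more of the assembly sits in long contigs, but it says nothing about whether those contigs are correct.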
#13 | |
Senior Member
Location: berlin Join Date: Feb 2010
Posts: 156
You could also have one tool making a decent-sized but completely incorrect assembly, and another making a cautious but correct assembly. N50/size isn't everything. Can you validate your results somehow, e.g. against another closely related known genome?
Either way, getting the best out of a dataset may require months of trial and error, tweaking, etc. Even getting a particular tool to run properly and produce decent output can take weeks and can make a massive difference; the DBG assemblers all seem to have glass jaws. Pre-filtering the data seems to make a massive difference for most of them, though.
Incidentally, I don't have a lot of experience with Velvet in particular, since it is simply too heavy for my project (>1 Gbase genome).
#14 |
Senior Member
Location: Ottawa Join Date: Apr 2011
Posts: 130
Yeah, actually the next genome I will work with is a mitGenome of about 70k, pretty cool.
I will start working with consed. It is not an easy program to work with but, as I understand it, incredibly useful.
#15 | |
Senior Member
Location: Québec, Canada Join Date: Jul 2008
Posts: 260
Hi! I am the author of Ray, so if you have any questions, ask away.
Basically, with Ray, you will need to convert your two files to fasta or fastq format. There is a script in maq for that: http://maq.sourceforge.net/
maq-0.7.1/scripts/fq_all2std.pl export2std 100611_s_4_1_seq_GDR-7.txt > 100611_s_4_1_seq_GDR-7.txt.fastq
maq-0.7.1/scripts/fq_all2std.pl export2std 100611_s_4_2_seq_GDR-7.txt > 100611_s_4_2_seq_GDR-7.txt.fastq
Ray is available at http://denovoassembler.sf.net
Then, using Ray, you assemble these reads:
mpirun -np 8 Ray -k 31 -p 100611_s_4_1_seq_GDR-7.txt.fastq 100611_s_4_2_seq_GDR-7.txt.fastq -o Ray-test-1.4.0
I encourage you to explore the files written by Ray:
ls Ray-test-1.4.0.*
See http://denovoassembler.sourceforge.n...00000000000000
#16 |
Senior Member
Location: Ottawa Join Date: Apr 2011
Posts: 130
Quick question: does Ray support hybrid assembly? 454 and Illumina? Can Ray do a reference assembly? Is there a GUI?
I almost got to using Ray but had huge problems with openmpi.
Adrian
#17 | |
Senior Member
Location: Québec, Canada Join Date: Jul 2008
Posts: 260
Yes.
Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. Sébastien Boisvert, François Laviolette, and Jacques Corbeil. Journal of Computational Biology (Mary Ann Liebert, Inc. publishers). November 2010, 17(11): 1519-1533. doi:10.1089/cmb.2009.0238
See others here too: http://denovoassembler.sourceforge.n...lications.html
Yes. See paper above.
No. Ray only does de novo stuff.
Not in the vanilla, but I know that Applied Maths Inc. is working on one. See http://www.applied-maths.com/opensource/opensource.htm
Otherwise:
wget http://www.open-mpi.org/software/omp...-1.4.3.tar.bz2
tar xjf openmpi-1.4.3.tar.bz2
cd openmpi-1.4.3
./configure --prefix=$(pwd)/build
make
make install
You now have Open-MPI in $(pwd)/build, in particular $(pwd)/build/bin/mpic++ and $(pwd)/build/bin/mpirun.
Now add $(pwd)/build/bin to your path. For example, in your .bashrc:
export PATH=/home/boiseb01/software/openmpi-1.4.1/build/bin:$PATH
Anyway, you should use a decent GNU/Linux distribution; it makes things easier.