SEQanswers

Old 11-21-2013, 06:34 AM   #261
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default

Quote:
Originally Posted by Brian E View Post
So, to add to an already massive thread, it's looking like Ray is going to be the assembler we're going to be using for our Metagenome work,

There is a thread for Ray Meta (Ray for Metagenomics):

http://seqanswers.com/forums/showthread.php?t=26278

Paper: http://genomebiology.com/2012/13/12/R122

Quote:
Originally Posted by Brian E View Post

and I was wondering if it is possible to obtain the location that each read ends up in the assembly.
There is the -amos option, but on the mailing list, I see more people relying on post-assembly read mapping onto the contigs.

Quote:
Originally Posted by Brian E View Post

We're using multiple biological replicates, sequenced separately (multiplexed) and we know that there is some variation in the abundance of the organisms that are present in the community. What I would like is the contribution of each individual sample to the local coverage of the whole assembly. I think this could help with pulling out individual genomes.
If you are sequencing barcoded samples, then demultiplexing should be done before feeding the data to Ray.

Quote:
Originally Posted by Brian E View Post

I know I could get at this by mapping with e.g. Bowtie, but if it is possible to keep track of where these reads are ending up, it would save a significant amount of time on our computer cluster allocation. Is this possible?
As stated above, Ray can generate an AMOS file, which contains read usage.

Séb
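
For readers taking the post-assembly mapping route mentioned above, here is a minimal sketch of tallying mapped reads per contig from one SAM file per sample. It assumes each demultiplexed sample has already been mapped onto Contigs.fasta with a mapper of your choice. The function name reads_per_contig and the file names are hypothetical and not part of Ray.

```shell
# Hypothetical helper, not part of Ray: count mapped reads per contig in a SAM file.
# Assumes column 3 (RNAME) holds the contig name and is "*" for unmapped reads.
reads_per_contig() {
    awk -F '\t' '
        $1 ~ /^@[A-Z][A-Z]$/ { next }   # skip SAM header lines (@HD, @SQ, @RG, @PG, @CO)
        $3 != "*" { count[$3]++ }       # tally reads by reference (contig) name
        END { for (c in count) print c "\t" count[c] }
    ' "$1" | sort
}

# Example (file names are placeholders):
# for s in sample1 sample2; do reads_per_contig "$s.sam" > "$s.counts.tsv"; done
```

Running it once per sample gives one count table per replicate, which can then be compared contig by contig.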
Old 01-11-2014, 09:49 AM   #262
bossanova352
Junior Member
 
Location: United States

Join Date: Mar 2013
Posts: 9
Default

I'm trying to use Ray on some environmental metagenomes, but every time I set up a run, the output contig and scaffold files are empty. I haven't seen any indication of a problem during the assembly run, and all sequences are recognized by Ray. One possibility that comes to mind is some errors that came up while compiling. Specifically, when trying to use the make install command, an error pops up saying there is no README file in ray-build/RayPlatform. If I place the README file in RayPlatform and try again, I get an error saying there is already a file named RayPlatform.

Edit: After reinstalling Ray with the developer package, there are no installation errors. As of now, I'm stumped why there are no assembled sequences in either of the output Contigs.fasta or Scaffolds.fasta files.

Last edited by bossanova352; 01-11-2014 at 11:25 PM.
Old 01-13-2014, 05:14 AM   #263
VidJa
Junior Member
 
Location: The Netherlands

Join Date: Apr 2010
Posts: 7
Default

I'm trying to use Ray with a mix of Illumina SE and 454 data using the -amos option:
mpiexec -n 10 Ray -s illumina.fasta -s 454.fasta -k 31 -amos -o mix
454 reads have an avg length of 396bp and the illumina reads are 60bp

The resulting AMOS file AMOS.afg can be browsed using Tablet, but I noticed that the original read names are converted to just a number, which makes it impossible to easily trace which reads end up in a particular contig. Is it possible to save the original read names in the AMOS output?
Old 01-13-2014, 05:52 AM   #264
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default

Quote:
Originally Posted by bossanova352 View Post
I'm trying to use Ray on some environmental metagenomes, but every time I set up a run, the output contig and scaffold files are empty. I haven't seen any indication of a problem during the assembly run, and all sequences are recognized by Ray. One possibility that comes to mind is some errors that came up while compiling. Specifically, when trying to use the make install command, an error pops up saying there is no README file in ray-build/RayPlatform. If I place the README file in RayPlatform and try again, I get an error saying there is already a file named RayPlatform.

Edit: After reinstalling Ray with the developer package, there are no installation errors. As of now, I'm stumped why there are no assembled sequences in either of the output Contigs.fasta or Scaffolds.fasta files.
What is the content of RayOutput/NumberOfSequences.txt ?

Do you have any error in your standard output file ? ("grep Error log.stdout")
Old 01-13-2014, 05:52 AM   #265
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default

Quote:
Originally Posted by VidJa View Post
I'm trying to use Ray with a mix of Illumina SE and 454 data using the -amos option:
mpiexec -n 10 Ray -s illumina.fasta -s 454.fasta -k 31 -amos -o mix
454 reads have an avg length of 396bp and the illumina reads are 60bp

The resulting AMOS file AMOS.afg can be browsed using Tablet, but I noticed that the original read names are converted to just a number, which makes it impossible to easily trace which reads end up in a particular contig. Is it possible to save the original read names in the AMOS output?
This is not currently possible because Ray does not read or store read names at all.
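
Since Ray does not keep read names, one possible workaround is to rebuild a number-to-name table from the input files yourself. The sketch below is only that: it assumes the numbers in AMOS.afg follow the order of the reads in the input files, which this thread does not confirm, so check a few reads by sequence before relying on it. The helper name fasta_read_index and the offset argument are hypothetical.

```shell
# Hypothetical helper, not part of Ray: list "index<TAB>read name" for a FASTA file.
# ASSUMPTION (unverified): AMOS.afg numbers follow input order, starting at the given offset.
fasta_read_index() {
    awk -v n="${2:-0}" '/^>/ { print n "\t" substr($0, 2); n++ }' "$1"
}

# Example (placeholders); a second input file would start where the first one ends:
# fasta_read_index illumina.fasta 0 > read_names.tsv
```

The second argument lets you continue the numbering across several input files.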
Old 01-13-2014, 10:58 AM   #266
bossanova352
Junior Member
 
Location: United States

Join Date: Mar 2013
Posts: 9
Default

Quote:
Originally Posted by seb567 View Post
What is the content of RayOutput/NumberOfSequences.txt ?

Do you have any error in your standard output file ? ("grep Error log.stdout")
The output of NumberOfSequences.txt is:

Quote:
Files: 1

FileNumber: 0
FilePath: ../SFBloom_paired_trimmed_1.fa
NumberOfSequences: 7917760
FirstSequence: 0
LastSequence: 7917759

Summary
NumberOfSequences: 7917760
FirstSequence: 0
LastSequence: 7917759
This is what I find in the stdoutput file a considerable way through. I think the issue lies in what I've quoted below, as there are seeds before this and everything goes to 0 afterwards. It looks like it's skipping all paths because of short length:

Quote:
Rank 8 has 0 seeds
Rank 8 is creating seeds [147526/147526] (completed)
Rank 8: peak number of workers: 1887, maximum: 32768
Rank 8 : VirtualCommunicator (service provided by VirtualCommunicator): 494068 virtual messages generated 4040 real messages (0.817701%)
Rank 8 runtime statistics for seeding algorithm:
Rank 8 Skipped paths because of dead end for head: 0
Rank 8 Skipped paths because of dead end for tail: 0
Rank 8 Skipped paths because of two dead ends: 0
Rank 8 Skipped paths because of bubble weak component: 0
Rank 8 Skipped paths because of short length: 147526
Rank 8 Skipped paths because of bad ownership: 0
Rank 8 Skipped paths because of low coverage: 0
Rank 8 Eligible paths: 0
Rank 8: assembler memory usage: 139120 KiB
I'm working with Illumina Hiseq paired end 100 bp reads which have been trimmed based on quality scores and length, so there should be nothing shorter than 50 bp. Oh, and my command is this:

mpiexec -n 10 Ray -k 57 -i ../SFBloom_paired_trimmed_1.fa -o RayOutputTest
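
The seeding statistics are printed once per rank, so it can help to add them up across the whole log. Here is a minimal sketch that does so, assuming the log lines have the shape quoted above; the function name summarize_seeding is hypothetical and not part of Ray.

```shell
# Hypothetical helper: sum the per-rank seeding counters found in Ray's standard output.
# Assumes lines shaped like "Rank 8 Skipped paths because of short length: 147526".
summarize_seeding() {
    awk '
        /^Rank [0-9]+ (Skipped paths because of|Eligible paths)/ {
            key = $0
            sub(/^Rank [0-9]+ /, "", key)   # drop the rank prefix
            sub(/: [0-9]+$/, "", key)       # drop the trailing count
            total[key] += $NF               # accumulate the count for this category
        }
        END { for (k in total) print k "\t" total[k] }
    ' "$1" | sort
}

# Example: summarize_seeding log.stdout
```

A total of 0 eligible paths across all ranks would match the empty Contigs.fasta described here.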

Last edited by bossanova352; 01-13-2014 at 11:03 AM.
Old 01-13-2014, 01:23 PM   #267
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default

Quote:
Originally Posted by bossanova352 View Post
The output of NumberOfSequences.txt is:



This is what I find in the stdoutput file a considerable way through. I think the issue lies in what I've quoted below, as there are seeds before this and everything goes to 0 afterwards. It looks like it's skipping all paths because of short length:



I'm working with Illumina Hiseq paired end 100 bp reads which have been trimmed based on quality scores and length, so there should be nothing shorter than 50 bp. Oh, and my command is this:

mpiexec -n 10 Ray -k 57 -i ../SFBloom_paired_trimmed_1.fa -o RayOutputTest
Can you paste the 10 first lines of your file named SFBloom_paired_trimmed_1.fa ?
Old 01-13-2014, 01:49 PM   #268
bossanova352
Junior Member
 
Location: United States

Join Date: Mar 2013
Posts: 9
Default

Quote:
Originally Posted by seb567 View Post
Can you paste the 10 first lines of your file named SFBloom_paired_trimmed_1.fa ?
Yeah, here it is:

Quote:
>DB775P1:2451TDYACXX:2:1101:1582:1958_1:N:0:CGTACTAG
ATAATCGTTTGCTCGGCTATTTGAGTTGCAGATATTAATTGTTTACGACGGGATGTATCA
AGTTCTGCAAAGACATCATCCAAAATTAGAATGGGTTC
>DB775P1:2451TDYACXX:2:1101:1582:1958_2:N:0:CGTACTAG
ATGACCTACATCTACAAATCGGAGATTTTCCGGCTAAAGGTTATGCCAGCCACGGTGAAT
CCTGGTCAATGGCGATTTCACTACGCATTGGATCTTTTAAT
>DB775P1:2451TDYACXX:2:1101:1853:1966_1:N:0:CGTACTAG
TTCACCTAGAGAATGACCGGCAACAAAGTGGGGCGTAGGAAGTCGATCCTTGAAAGAGGT
GAGGACCATCCAGGAGTGCATTAAAATAGCCGGCTGAG
>DB775P1:2451TDYACXX:2:1101:1853:1966_2:N:0:CGTACTAG
Old 01-13-2014, 01:53 PM   #269
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default

Quote:
Originally Posted by bossanova352 View Post
Yeah, here it is:
Each sequence needs to be on one line (this is a current limitation of fasta support in Ray).

That's presumably the issue.
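
One way to meet the one-line-per-sequence requirement is to join the wrapped lines with awk. This is a minimal sketch; the function name unwrap_fasta and the output file name are hypothetical.

```shell
# Hypothetical helper: rewrite a wrapped FASTA so each sequence sits on one line.
unwrap_fasta() {
    awk '
        /^>/ { if (seq != "") print seq; print; seq = ""; next }  # flush sequence, emit header
        { seq = seq $0 }                                          # join wrapped sequence lines
        END { if (seq != "") print seq }                          # flush the last sequence
    ' "$1"
}

# Example (output name is a placeholder):
# unwrap_fasta SFBloom_paired_trimmed_1.fa > SFBloom_paired_trimmed_1.oneline.fa
```

Note that this joins lines as they are, so any Windows-style line endings in the input would survive and need a separate cleanup.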
Old 01-13-2014, 07:13 PM   #270
bossanova352
Junior Member
 
Location: United States

Join Date: Mar 2013
Posts: 9
Default

Quote:
Originally Posted by seb567 View Post
Each sequence needs to be on one line (this is a current limitation of fasta support in Ray).

That's presumably the issue.
How silly of me! Well I changed the formatting, but unfortunately I'm still not getting any output from Ray. This is what the file looks like now (all sequences on one line):

Quote:
>DB775P1:2451TDYACXX:2:1101:1582:1958_1:N:0:CGTACTAG
AGTTCTGCAAAGACATCATCCAAAATTAGAATGGGTTCTTGTTTACGACGGGATGTATCA
>DB775P1:2451TDYACXX:2:1101:1582:1958_2:N:0:CGTACTAG
CCTGGTCAATGGCGATTTCACTACGCATTGGATCTTTTAATTATGCCAGCCACGGTGAAT
>DB775P1:2451TDYACXX:2:1101:1853:1966_1:N:0:CGTACTAG
GAGGACCATCCAGGAGTGCATTAAAATAGCCGGCTGAGGAAGTCGATCCTTGAAAGAGGT
>DB775P1:2451TDYACXX:2:1101:1853:1966_2:N:0:CGTACTAG
GTCAAGAATGCCATCCGAGCTGCGATGACCAATATCGAGCAAAGTAGCGATGCCCGCGCT
>DB775P1:2451TDYACXX:2:1101:2768:1957_1:N:0:CGTACTAG
GTTGGAGCGCTTGGTATCCTGCGCTCCAATATTCATCACAGTGGGAATGACGCCCCCTAC
Again, it looks like this step has some clues as to what is going on:

Quote:
Rank 2 has 12675 seeds
Rank 2 is creating seeds [2985784/2985784] (completed)
Rank 2: peak number of workers: 2002, maximum: 32768
Rank 2 : VirtualCommunicator (service provided by VirtualCommunicator): 19245610
Rank 2 runtime statistics for seeding algorithm:
Rank 2 Skipped paths because of dead end for head: 0
Rank 2 Skipped paths because of dead end for tail: 0
Rank 2 Skipped paths because of two dead ends: 0
Rank 2 Skipped paths because of bubble weak component: 0
Rank 2 Skipped paths because of short length: 2960369
Rank 2 Skipped paths because of bad ownership: 12740
Rank 2 Skipped paths because of low coverage: 0
Rank 2 Eligible paths: 12675
Rank 2: assembler memory usage: 263224 KiB
Fixed! It was another formatting issue, (^M characters were showing up after the one-line formatting). Thanks, Seb! I appreciate the help.
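
For anyone hitting the same ^M problem, here is a minimal sketch for spotting and removing carriage returns. The helper names count_cr_lines and strip_cr are hypothetical.

```shell
# Hypothetical helpers: detect and strip carriage returns (shown as ^M in some editors).
count_cr_lines() {
    # Print the number of lines containing a carriage return (0 if none).
    grep -c "$(printf '\r')" "$1" || true
}

strip_cr() {
    # Write the file to standard output with every carriage return removed.
    tr -d '\r' < "$1"
}

# Example (file names are placeholders):
# count_cr_lines reads.fa
# strip_cr reads.fa > reads.clean.fa
```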

Last edited by bossanova352; 01-14-2014 at 10:34 AM.
Old 01-15-2014, 07:43 AM   #271
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default

Quote:
Originally Posted by bossanova352 View Post
How silly of me! Well I changed the formatting, but unfortunately I'm still not getting any output from Ray. This is what the file looks like now (all sequences on one line):



Again, it looks like this step has some clues as to what is going on:



Fixed! It was another formatting issue, (^M characters were showing up after the one-line formatting). Thanks, Seb! I appreciate the help.
Short seeds mean that they are not connecting with one another.

Can you provide a couple of lines from CoverageDistribution.txt (head) ?
Old 01-15-2014, 08:20 AM   #272
bossanova352
Junior Member
 
Location: United States

Join Date: Mar 2013
Posts: 9
Default

Quote:
Originally Posted by seb567 View Post
Short seeds mean that they are not connecting with one another.

Can you provide a couple of lines from CoverageDistribution.txt (head) ?
Sure! It does seem to be working now, I'm getting contigs and scaffolds in my output files.

Quote:
# KmerCoverage Frequency
# Any frequency is a even number because of odd k-mer length
2 158870850
3 43942818
4 18999600
5 10198722
6 6257290
7 4165874
8 2937460
9 2155282
Old 02-12-2014, 10:56 AM   #273
seb567
Senior Member
 
Location: Québec, Canada

Join Date: Jul 2008
Posts: 260
Default Ray 2.3.1

Hi,

Ray 2.3.1 is now available on http://denovoassembler.sourceforge.net/download.html.

Significant changes:

* This version includes "Surveyor" to compute similarity (or distance) matrices
for hundreds or possibly thousands of samples.
* fix compilation error on Apple OS X Mavericks
* fix infinite loop when running on 2 CPU cores
* fix a bug when the number of ranks is a prime number



All changes in Ray:

Rob Egan (1):
fix compilation on NERSC's edison machine using PrgEnv-intel

Sébastien Boisvert (30):
SequencesLoader: fix bad automatic pairing of sequence files
SequencesLoader: fix compilation warnings
Surveyor: verify buffer size before getting producer
Surveyor: add a variable to store the period
Surveyor: run in actor-model-only mode
spawn actors with spawn instead of spawnActor
Documentation: add some documentation for Surveyor
SeedExtender: add some assertions
Searcher: disable verbose outputs
Surveyor: skip invalid files
coloring: added comments for coloring subsystem
update release procedure
next release will be 2.3.1
fix infinite loop when running on 2 CPU cores
fix a bug when the number of ranks is a prime number
print number of payloads
add some code to test directed surveys with Surveyor
fix reproducibility issue for similarity and distance matrices
Surveyor: support nucleotides in lower case
report invalid edges as warnings instead of errors
documentation: add license in README
Surveyor: report 0 hits when necessary
SeedingData: provide prototypes for friend functions
Surveyor: fix compilation issue without debug code
seeds: add a parameter -minimum-seed-length (default 100)
add option -graph-only to stop after graph building
fix compilation error on Apple OS X Mavericks
use CONFIG_ASSERT instead of ASSERT for optional code
version 2.3.1
update releases


Changes in RayPlatform:

Rob Egan (1):
fix compilation on NERSC's edison machine using PrgEnv-intel

Sébastien Boisvert (15):
communication: relay buffer bytes instead of buffer 64-bit integers
core: add a actor-model-only mode
actors: add playground status with -debug
core: add buffer statistics with -debug
actor model: change the method name from spawnActor to spawn
fix the code for testing message integrity
fix a regression introduced in a01f97eae41bcd759bfc521d84053552cf38d521
files: add method to check if a file is valid
add mini-rank information in the message metadata
fix mini-rank runtime engine
print registered message tags in debug mode
documentation: add LGPLv3 info in README
communication: some routes don't require routing
use CONFIG_ASSERT instead of ASSERT for optional code
fix compilation warning
Old 03-05-2014, 06:19 AM   #274
canderson30
Junior Member
 
Location: Nebraska

Join Date: Nov 2012
Posts: 2
Default

Hello,

When I look through some outputs generated from the AMOS file following assembly, many of the contigs were assigned 0 reads (I used the default bank2contig after seeing that many contigs were not showing up in the generated SAM file). Obviously, this does not make much sense, but I was wondering if anyone else has come across this? I was trying to avoid mapping by using the AMOS file, and now I just want to confirm that the contigs I am getting are 'real', I suppose.

I thought this may be due to read recycling at first, but reads show up under multiple contigs still. Anyone have other ideas what is causing this issue or how to correct it during assembly?


Chris
Old 05-08-2015, 03:55 AM   #275
Zapages
Member
 
Location: NJ

Join Date: Oct 2012
Posts: 94
Default

I am trying to assemble 275 paired-end Illumina reads that I have interleaved together. Previously I successfully ran the interleaved files at a k-mer value of 137. I compiled the latest Ray version with a maximum k-mer size of 600 (technically 599).

That code was:


Code:
mpiexec -n 30 Ray -k 137 -i interleaved.fastq -o Ray_K137

Now if I try a smaller k-mer value, I run into a weird Chunk Size error.

I have tried:

Code:
mpiexec -n 10 Ray -k 51 -i interleaved.fastq -o Ray_K51_try3
Code:
mpiexec -n 30 Ray -k 51 -i interleaved.fastq -o Ray_K51_try3
All of these caused the same Chunk Size error. I even tried it without mpiexec, and I still got the error below.

Code:
Rank 0 : VirtualCommunicator (service provided by VirtualCommunicator): 2957916 virtual messages generated 115295 real messages (3.89785%)
Rank 0 freed 549453824 bytes from the path memory pool (chunks: 131)
Rank 0: gossiping generated 0 messages (gossips: 0 ---> 0)
Critical exception: The length of the requested memory exceeds the CHUNK_SIZE: 36423920 > 33554432
Ray: RayPlatform/memory/MyAllocator.cpp:97: void* MyAllocator::allocate(int): Assertion `false' failed.
[BioLinux301:05209] *** Process received signal ***
[BioLinux301:05209] Signal: Aborted (6)
[BioLinux301:05209] Signal code:  (-6)
[BioLinux301:05209] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340) [0x7f4a34d27340]
[BioLinux301:05209] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39) [0x7f4a34987bb9]
[BioLinux301:05209] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7f4a3498afc8]
[BioLinux301:05209] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2fa76) [0x7f4a34980a76]
[BioLinux301:05209] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2fb22) [0x7f4a34980b22]
[BioLinux301:05209] [ 5] Ray() [0x533b50]
[BioLinux301:05209] [ 6] Ray() [0x4f7552]
[BioLinux301:05209] [ 7] Ray() [0x551768]
[BioLinux301:05209] [ 8] Ray() [0x5550ab]
[BioLinux301:05209] [ 9] Ray() [0x5562ea]
[BioLinux301:05209] [10] Ray() [0x413379]
[BioLinux301:05209] [11] Ray() [0x40c5bf]
[BioLinux301:05209] [12] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f4a34972ec5]
[BioLinux301:05209] [13] Ray() [0x40e0cf]
[BioLinux301:05209] *** End of error message ***
zsh: abort      Ray -k 51 -i interleaved.fastq -o Ray_K51_try3
I am running this on a BioLinux 8 workstation that has 32 threads; the specs are: Intel Xeon E5-2640v2 2 GHz with 128 GB of RAM.

I would really appreciate advice on how to proceed.
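
To make the numbers in the exception easier to read: the reported limit of 33554432 bytes is 32 × 1024 × 1024, that is 32 MiB. The sketch below only pulls the two values out of a log and prints the difference. It does not explain why Ray asked for that much memory, and the function name chunk_excess is hypothetical.

```shell
# Hypothetical helper: report how far a request exceeded CHUNK_SIZE, based on
# lines shaped like "... exceeds the CHUNK_SIZE: 36423920 > 33554432".
chunk_excess() {
    sed -n 's/.*exceeds the CHUNK_SIZE: \([0-9][0-9]*\) > \([0-9][0-9]*\).*/\1 \2/p' "$1" |
    while read -r requested limit; do
        echo "requested=$requested limit=$limit excess=$((requested - limit))"
    done
}

# Example: chunk_excess log.stdout
```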

Last edited by Zapages; 05-08-2015 at 04:22 AM.
Old 06-02-2015, 10:42 PM   #276
hbn
Member
 
Location: Netherlands

Join Date: Apr 2011
Posts: 16
Default

Maybe this question was already asked somewhere, but I can not find it:

Is there a way to set the maximum insert size for paired end assembly with Ray? If not, what is the maximum insert size considered?

I have an assembly which uses both normal insert size Illumina reads (~250 bp) and some longer insert sizes (~500 bp). When adding this last library, the results do not improve, which I think is suspicious. Any ideas?
Old 06-04-2015, 06:41 AM   #277
bastianwur
Member
 
Location: Germany/Netherlands

Join Date: Feb 2014
Posts: 98
Default

Contigs, or scaffolds?
Have you tried giving the assembler a possible distance for the reads?
Old 08-31-2015, 09:01 PM   #278
shayan shams
Junior Member
 
Location: United states

Join Date: Aug 2015
Posts: 1
Default

Hi Folks,
I have a serious problem with Ray and Open MPI.
I am using a cluster with 4 nodes, each with 8 cores. Surprisingly, running Ray on a single node with mpirun -np 8 takes less time than using two nodes, and so on: one node takes 5 min, two nodes (mpirun -np 16) take 8 min, and three nodes (mpirun -np 24) take 12 min. Can anybody please help me find the problem?
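
The timings above can be turned into speedup figures relative to the single-node run, which makes the slowdown explicit. This is just worked arithmetic on the numbers in this post, and the function name speedup is hypothetical.

```shell
# Hypothetical helper: speedup of a run relative to a baseline (same time unit for both).
speedup() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.3f\n", base / t }'
}

# speedup 5 8     # one node (5 min) vs two nodes (8 min)
# speedup 5 12    # one node (5 min) vs three nodes (12 min)
```

Values below 1 mean the larger run is slower than the single-node baseline. That is often a sign of communication cost between nodes, though this is only a guess for this particular cluster.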
Old 12-10-2015, 05:25 AM   #279
jazz710
Member
 
Location: Iowa

Join Date: Oct 2012
Posts: 41
Default

Can you explain Ray Surveyor in a bit more detail? I'm having a hard time understanding the documentation but I think this could be of use to me.