
**gringer** · 02-04-2013, 08:28 PM

Originally posted by kmkocot View Post

The library was made with a Nextera kit and sequenced using the new 2 x 250 reagent kits. The average size distribution of my library was around 500 bp but some smaller fragments were present. For those fragments, the read pairs will at least partially overlap. Does Ray have a problem when the two members of a pair of reads overlap? Should I treat the data as non-paired-end?

Originally posted by seb567 View Post

Ray will be fine with those.

... as long as your computer cluster is up to the challenge (which depends more on the target genome size than the number of input reads). While Ray is quite memory efficient, you may have a bit of difficulty assembling a human genome using Ray on a small cluster or desktop.

**seb567** · 02-04-2013, 08:33 PM

Originally posted by gringer View Post

... as long as your computer cluster is up to the challenge. While Ray is quite memory efficient, you may have a bit of difficulty assembling a human genome using Ray on a small cluster or desktop.

Sure.

And it depends what is implied by "assembling a human genome".

Assemblathon 2 results indicate that Ray is really good with gene content, but its scaffolder is way too conservative.

Our group is mostly into bacterial genomes and human microbiomes.

See our recent paper: http://genomebiology.com/2012/13/12/R122/abstract

Thanks for the feedback !

-Sébastien

**gringer** · 02-04-2013, 08:41 PM

Originally posted by seb567 View Post

Assemblathon 2 results indicate that Ray is really good with gene content, but its scaffolder is way too conservative.

FWIW, I've been able to improve on Ray assemblies a little by running the scaffolds through AMOS' minimus2 (in the default all-vs-all mode). That was able to pick up a few more SNPs and merge contigs that were almost identical.

**kmkocot** · 02-25-2013, 02:54 PM

Thanks guys! Sorry for the great delay in my reply. I have been at sea.

We are working with invertebrate genomes of unknown size but we're after the mitochondrial genomes for this project and they've been shaking out OK on our 80 CPU cluster.

Best,
Kevin

**kmkocot** · 02-25-2013, 02:55 PM

Me again. seb567, the link to the visualization tool you posted (http://genome.ulaval.ca/corbeillab/Ray-Cloud-Browser) is broken.

**seb567** · 02-26-2013, 06:05 PM

Originally posted by kmkocot View Post

Thanks guys! Sorry for the great delay in my reply. I have been at sea.

We are working with invertebrate genomes of unknown size but we're after the mitochondrial genomes for this project and they've been shaking out OK on our 80 CPU cluster.

Best,
Kevin

Cool !

Originally posted by kmkocot View Post

Me again. seb567, the link to the visualization tool you posted (http://genome.ulaval.ca/corbeillab/Ray-Cloud-Browser) is broken.

It's the IT at my institution that failed I guess. Anyway, I have set up DNS canonical names (CNAME), which are more robust.

All my Ray Cloud Browser deployments are in the cloud.

4 demos (these are canonical names to cloud instances):

E. coli on a t1.micro spot instance in Amazon EC2

Some microbiomes of a colleague on 1 t1.micro spot instance in Amazon EC2

E. coli on a small Linux Virtual Machine in Windows Azure

A vertebrate genome (American eel) on a Silver instance in IBM SmartCloud

In all these links, raytrek.com can be replaced by boisvert.info (example: browser.cloud.raytrek.com and browser.cloud.boisvert.info are the same instance).

**yaximik** · 03-09-2013, 01:49 PM

A bit confused about parameters - help...

Hi,
What is the meaning of averageOuterDistance and standardDeviation for paired end files? Is it just average read length in the dataset?
If so, then why is it not required for a single read file?
If not, is it an average fragment length in the library? Such as surmised from BioAnalyzer trace, for example?
If so, then the default autocalc may give a very wrong estimate, couldn't it? For example, one of my paired read runs was done with a library of 600 bp +/- 15%, but during assembly the autocalc estimate was something like 150 bp - how can this be so far off?

**yaximik** · 03-09-2013, 07:10 PM

More help needed....

Hi,

I tried to run Ray (maxkmer 32) on 2 x quad core RHEl58 with hyper-threading enabled:

mpiexec -n 16 Ray <Ray.conf> and got the error:

Code:

........
Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SC-MILLib1-Herc2s10cFr1Fr2run2R1AdQ30.fastq (please wait...)
[Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SC-MILLib1-Herc2s10cFr1Fr2run2R1AdQ30.fastq (please wait...)
[Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SCPfx3s25cFr3-150-200run1R1AdQ30.fastq (please wait...)
[Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SCPfx3s25cFr3-150-200run1R1AdQ30.fastq (please wait...)
[Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SCPfx3s25cFr3-150-200run2R1AdQ30.fastq (please wait...)
[Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SCPfx3s25cFr3-150-200run2R1AdQ30.fastq (please wait...)
[Loader::load] File: /media/FantomHD/AssRefMap/SC/SCold/SColdAll.fasta (please wait...)
[Loader::load] File: /media/FantomHD/AssRefMap/SC/SCold/SColdAll.fasta (please wait...)
[Loader::load] File: /media/FantomHD/AssRefMap/SC/SCold/SCallSanger.fasta (please wait...)
[Loader::load] File: /media/FantomHD/AssRefMap/SC/SCold/SCallSanger.fasta (please wait...)
[Loader::load] File: /home/yaximik/AssRefMap/SC/minia/SCMiSeqAllFGMGPGIGclean_k27.contigs.fasta (please wait...)
[G5NNJN1:07040] *** Process received signal ***
[G5NNJN1:07040] Signal: Segmentation fault (11)
[G5NNJN1:07040] Signal code:  (128)
[G5NNJN1:07040] Failing at address: (nil)
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 7040 on node G5NNJN1 exited on signal 11 (Segmentation fault).

The last file loaded was a file with fasta contigs from another assembler (minia). Does this mean contigs from other assemblers cannot be used in Ray?

**yaximik** · 03-09-2013, 07:12 PM

Oops The machine has 96 GB memory

**KirillK** · 03-09-2013, 10:01 PM

Hi guys!

Is there a way to provide a reference genome for Ray?

cheers,
KK

**seb567** · 03-11-2013, 05:30 AM

Originally posted by yaximik View Post

Hi,

I tried to run Ray (maxkmer 32) on 2 x quad core RHEl58 with hyper-threading enabled:

mpiexec -n 16 Ray <Ray.conf> and got the error:

Code:

........
Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SC-MILLib1-Herc2s10cFr1Fr2run2R1AdQ30.fastq (please wait...)
[Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SC-MILLib1-Herc2s10cFr1Fr2run2R1AdQ30.fastq (please wait...)
[Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SCPfx3s25cFr3-150-200run1R1AdQ30.fastq (please wait...)
[Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SCPfx3s25cFr3-150-200run1R1AdQ30.fastq (please wait...)
[Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SCPfx3s25cFr3-150-200run2R1AdQ30.fastq (please wait...)
[Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SCPfx3s25cFr3-150-200run2R1AdQ30.fastq (please wait...)
[Loader::load] File: /media/FantomHD/AssRefMap/SC/SCold/SColdAll.fasta (please wait...)
[Loader::load] File: /media/FantomHD/AssRefMap/SC/SCold/SColdAll.fasta (please wait...)
[Loader::load] File: /media/FantomHD/AssRefMap/SC/SCold/SCallSanger.fasta (please wait...)
[Loader::load] File: /media/FantomHD/AssRefMap/SC/SCold/SCallSanger.fasta (please wait...)
[Loader::load] File: /home/yaximik/AssRefMap/SC/minia/SCMiSeqAllFGMGPGIGclean_k27.contigs.fasta (please wait...)
[G5NNJN1:07040] *** Process received signal ***
[G5NNJN1:07040] Signal: Segmentation fault (11)
[G5NNJN1:07040] Signal code:  (128)
[G5NNJN1:07040] Failing at address: (nil)
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 7040 on node G5NNJN1 exited on signal 11 (Segmentation fault).

The last file loaded was a file with fasta contigs from another assembler (minia). Does this mean contigs from other assemblers cannot be used in Ray?

The maximum read length is 65536 nucleotides.

**seb567** · 03-11-2013, 05:32 AM

Originally posted by KirillK View Post

Hi guys!

Is there a way to provide a reference genome for Ray?

cheers,
KK

You can provide reference genomes using the -search option.

Code:

       -search searchDirectory
              Provides a directory containing fasta files to be searched in the de Bruijn graph.
              Biological abundances will be written to RayOutput/BiologicalAbundances
              See Documentation/BiologicalAbundances.txt

However, this will not be used to aid in the assembly. This option is useful to report biological abundances.

See this paper for more information.

**seb567** · 03-11-2013, 05:37 AM

Originally posted by yaximik View Post

Hi,
What is the meaning of averageOuterDistance and standardDeviation for paired end files?

The outer distance is the sum of the gap size, the length of the left read and the length of the right read.

This is computed for paired reads and mate pairs.
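To make the definition above concrete, here is a minimal Python sketch of the arithmetic. The function name and the numbers are illustrative only, and treating an overlap between the two reads as a negative gap is an assumption made for this example, not something taken from Ray's source.

```python
def outer_distance(left_read_len, gap, right_read_len):
    """Outer distance = left read length + gap + right read length.

    Assumption for illustration: when the two reads of a pair overlap,
    the gap is represented as a negative number.
    """
    return left_read_len + gap + right_read_len


# 2 x 250 reads that just touch (gap of 0): the fragment is 500 bp.
print(outer_distance(250, 0, 250))

# 2 x 250 reads overlapping by 100 bp: the fragment is 400 bp.
print(outer_distance(250, -100, 250))
```

Note that the result is the fragment length without adapters, which is why it can be shorter than a BioAnalyzer estimate.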

Is it just average read length in the dataset?

No.

If so, then why is it not required for a single read file?

It only applies for pairs.

If not, is it an average fragment length in the library?

Yes.

Such as surmised from BioAnalyzer trace, for example?

Yes, but the BioAnalyzer will also include sequencing adapters in the evaluation whereas these are not included in sequencing reads usually.

If so, then the default autocalc may give a very wrong estimate, couldn't it? For example, one of my paired read runs was done with a library of 600 bp +/- 15%, but during assembly the autocalc estimate was something like 150 bp - how can this be so far off?

The 600 bp +/- 15% presumably includes adapters that are not in sequencing reads.

You can run another application on your data (like ABySS) and you'll see that Ray's right.

**yaximik** · 03-11-2013, 06:11 AM

The maximum read length is 65536 nucleotides.

Got to be another reason. The assembly file by minia includes max contig of 16091 nt. Without this dataset, Ray produced assembly with max contig/scaffold of 46428 nt.

The 600 bp +/- 15% presumably includes adapters that are not in sequencing reads.

That is puzzling. The combined adaptor length (both sides) is standard at 120 bp, so autocalc is then way off (600-120=480, but the estimate is ~150). Obviously a much smaller library size should affect scaffolding. Would it be better to provide the real numbers? Also, I guess a narrower distribution should be better, correct? This can be done by refractionating the library and collecting a narrow distribution, say +/-5%.

**seb567** · 03-11-2013, 06:17 AM

Originally posted by yaximik View Post

Got to be another reason. The assembly file by minia includes max contig of 16091 nt. Without this dataset, Ray produced assembly with max contig/scaffold of 46428 nt.

Then the problem is presumably caused by the lack of support for multiline fasta files for reads in Ray.

Please do submit a ticket if you feel this should be fixed.
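As a possible workaround while multiline fasta is unsupported for reads, the contig file could be rewritten with each sequence on a single line before passing it to Ray. The following is a minimal Python sketch; the function name `unwrap_fasta` and the file names in the usage comment are hypothetical.

```python
def unwrap_fasta(lines):
    """Yield FASTA lines with every sequence joined onto a single line."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: flush the previous one first.
            if header is not None:
                yield header
                yield "".join(chunks)
            header = line
            chunks = []
        else:
            chunks.append(line)
    # Flush the last record.
    if header is not None:
        yield header
        yield "".join(chunks)


# Usage (file names are placeholders):
# with open("contigs.fasta") as src, open("contigs.oneline.fasta", "w") as dst:
#     for out_line in unwrap_fasta(src):
#         dst.write(out_line + "\n")
```

Contigs longer than the maximum read length mentioned earlier in the thread would still need to be handled separately.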

That is puzzling. The combined adaptor length (both sides) is standard at 120 bp, so autocalc is then way off (600-120=480, but the estimate is ~150). Obviously a much smaller library size should affect scaffolding. Would it be better to provide the real numbers? Also, I guess a narrower distribution should be better, correct? This can be done by refractionating the library and collecting a narrow distribution, say +/-5%.

You can plot your distributions.

LibraryStatistics.txt contains averages, but you have all the signal in Library0.txt, Library1.txt. If you are using the git version of Ray, this information is now in LibraryData.xml
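A rough way to inspect such a distribution without a plotting package is sketched below in Python. It assumes each line of a LibraryN.txt file holds two whitespace-separated integers, an outer distance followed by the number of pairs observed at that distance; that layout is an assumption here and should be checked against your own files. The function names are hypothetical.

```python
import math


def parse_histogram(lines):
    """Parse 'distance count' lines into a dict (assumed file layout)."""
    hist = {}
    for raw in lines:
        fields = raw.split()
        if len(fields) < 2:
            continue
        try:
            distance, count = int(fields[0]), int(fields[1])
        except ValueError:
            continue  # skip headers or anything non-numeric
        hist[distance] = hist.get(distance, 0) + count
    return hist


def summarize(hist):
    """Return (weighted mean, standard deviation, most frequent distance)."""
    total = sum(hist.values())
    mean = sum(d * c for d, c in hist.items()) / total
    variance = sum(c * (d - mean) ** 2 for d, c in hist.items()) / total
    peak = max(hist, key=hist.get)
    return mean, math.sqrt(variance), peak


# Usage (path is a placeholder):
# with open("RayOutput/Library0.txt") as handle:
#     mean, sd, peak = summarize(parse_histogram(handle))
#     print(mean, sd, peak)
```

Comparing the peak with the mean can show whether a low average comes from a single short population or from a mixture of short and long fragments.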
