
  • Originally posted by kmkocot View Post
    The library was made with a Nextera kit and sequenced using the new 2 x 250 reagent kits. The average size of my library was around 500 bp, but some smaller fragments were present. For those fragments, the read pairs will at least partially overlap. Does Ray have a problem when the two members of a pair of reads overlap? Should I treat the data as non-paired-end?
    Originally posted by seb567 View Post
    Ray will be fine with those.
    ... as long as your computer cluster is up to the challenge (which depends more on the target genome size than the number of input reads). While Ray is quite memory efficient, you may have a bit of difficulty assembling a human genome using Ray on a small cluster or desktop.
    Last edited by gringer; 02-04-2013, 08:31 PM.

    Comment


    • Originally posted by gringer View Post
      ... as long as your computer cluster is up to the challenge. While Ray is quite memory efficient, you may have a bit of difficulty assembling a human genome using Ray on a small cluster or desktop.
      Sure.

      And it depends what is implied by "assembling a human genome".

      Assemblathon 2 results indicate that Ray is really good with gene content, but its scaffolder is way too conservative.

      Our group is mostly into bacterial genomes and human microbiomes.

      See our recent paper: http://genomebiology.com/2012/13/12/R122/abstract


      Thanks for the feedback !

      -Sébastien

      Comment


      • Originally posted by seb567 View Post
        Assemblathon 2 results indicate that Ray is really good with gene content, but its scaffolder is way too conservative.
        FWIW, I've been able to improve on Ray assemblies a little by running the scaffolds through AMOS' minimus2 (in the default all-vs-all mode). That was able to pick up a few more SNPs and merge contigs that were almost identical.

        Comment


        • Thanks guys! Sorry for the great delay in my reply. I have been at sea.

          We are working with invertebrate genomes of unknown size but we're after the mitochondrial genomes for this project and they've been shaking out OK on our 80 CPU cluster.

          Best,
          Kevin

          Comment


          • Me again. seb567, the link to the visualization tool you posted (http://genome.ulaval.ca/corbeillab/Ray-Cloud-Browser) is broken.

            Comment


            • Originally posted by kmkocot View Post
              Thanks guys! Sorry for the great delay in my reply. I have been at sea.

              We are working with invertebrate genomes of unknown size but we're after the mitochondrial genomes for this project and they've been shaking out OK on our 80 CPU cluster.

              Best,
              Kevin
              Cool !


              Originally posted by kmkocot View Post
              Me again. seb567, the link to the visualization tool you posted (http://genome.ulaval.ca/corbeillab/Ray-Cloud-Browser) is broken.
              I guess the IT at my institution failed. Anyway, I have set up DNS canonical names (CNAME), which are more robust.

              All my Ray Cloud Browser deployments are in the cloud.

              4 demos (these are canonical names to cloud instances):

              E. coli on a t1.micro spot instance in Amazon EC2

              Some microbiomes of a colleague on 1 t1.micro spot instance in Amazon EC2

              E. coli on a small Linux Virtual Machine in Windows Azure

              A vertebrate genome (American eel) on a Silver instance in IBM SmartCloud


              In all these links, raytrek.com can be replaced by boisvert.info (example: browser.cloud.raytrek.com and browser.cloud.boisvert.info are the same instance).

              Comment


              • A bit confused about parameters - help...

                Hi,
                What is the meaning of averageOuterDistance and standardDeviation for paired-end files? Is it just the average read length in the dataset?
                If so, then why is it not required for single-read files?
                If not, is it the average fragment length in the library? Such as surmised from a BioAnalyzer trace, for example?
                If so, then the default autocalc may give a very wrong estimate, couldn't it? For example, one of my paired-read runs was done with a library of 600 bp +/- 15%, but during assembly the autocalc estimate was something like 150 bp. How can this be so far off?

                Comment


                • More help needed....

                  Hi,

                  I tried to run Ray (maxkmer 32) on 2 x quad core RHEl58 with hyper-threading enabled:

                  mpiexec -n 16 Ray <Ray.conf> and got the error:
                  Code:
                  ........
                  Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SC-MILLib1-Herc2s10cFr1Fr2run2R1AdQ30.fastq (please wait...)
                  [Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SC-MILLib1-Herc2s10cFr1Fr2run2R1AdQ30.fastq (please wait...)
                  [Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SCPfx3s25cFr3-150-200run1R1AdQ30.fastq (please wait...)
                  [Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SCPfx3s25cFr3-150-200run1R1AdQ30.fastq (please wait...)
                  [Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SCPfx3s25cFr3-150-200run2R1AdQ30.fastq (please wait...)
                  [Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SCPfx3s25cFr3-150-200run2R1AdQ30.fastq (please wait...)
                  [Loader::load] File: /media/FantomHD/AssRefMap/SC/SCold/SColdAll.fasta (please wait...)
                  [Loader::load] File: /media/FantomHD/AssRefMap/SC/SCold/SColdAll.fasta (please wait...)
                  [Loader::load] File: /media/FantomHD/AssRefMap/SC/SCold/SCallSanger.fasta (please wait...)
                  [Loader::load] File: /media/FantomHD/AssRefMap/SC/SCold/SCallSanger.fasta (please wait...)
                  [Loader::load] File: /home/yaximik/AssRefMap/SC/minia/SCMiSeqAllFGMGPGIGclean_k27.contigs.fasta (please wait...)
                  [G5NNJN1:07040] *** Process received signal ***
                  [G5NNJN1:07040] Signal: Segmentation fault (11)
                  [G5NNJN1:07040] Signal code:  (128)
                  [G5NNJN1:07040] Failing at address: (nil)
                  --------------------------------------------------------------------------
                  mpiexec noticed that process rank 0 with PID 7040 on node G5NNJN1 exited on signal 11 (Segmentation fault).
                  The last file loaded was a file with fasta contigs from another assembler (minia). Does this mean contigs from other assemblers cannot be used in Ray?

                  Comment


                  • Oops, I forgot to mention: the machine has 96 GB of memory.

                    Comment


                    • Hi guys!

                      Is there a way to provide a reference genome for Ray?

                      cheers,
                      KK

                      Comment


                      • Originally posted by yaximik View Post
                        Hi,

                        I tried to run Ray (maxkmer 32) on 2 x quad core RHEl58 with hyper-threading enabled:


                        mpiexec -n 16 Ray <Ray.conf> and got the error:
                        Code:
                        ........
                        Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SC-MILLib1-Herc2s10cFr1Fr2run2R1AdQ30.fastq (please wait...)
                        [Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SC-MILLib1-Herc2s10cFr1Fr2run2R1AdQ30.fastq (please wait...)
                        [Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SCPfx3s25cFr3-150-200run1R1AdQ30.fastq (please wait...)
                        [Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SCPfx3s25cFr3-150-200run1R1AdQ30.fastq (please wait...)
                        [Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SCPfx3s25cFr3-150-200run2R1AdQ30.fastq (please wait...)
                        [Loader::load] File: /media/FantomHD/Data/MiSeq/SC/AdQ30/SCPfx3s25cFr3-150-200run2R1AdQ30.fastq (please wait...)
                        [Loader::load] File: /media/FantomHD/AssRefMap/SC/SCold/SColdAll.fasta (please wait...)
                        [Loader::load] File: /media/FantomHD/AssRefMap/SC/SCold/SColdAll.fasta (please wait...)
                        [Loader::load] File: /media/FantomHD/AssRefMap/SC/SCold/SCallSanger.fasta (please wait...)
                        [Loader::load] File: /media/FantomHD/AssRefMap/SC/SCold/SCallSanger.fasta (please wait...)
                        [Loader::load] File: /home/yaximik/AssRefMap/SC/minia/SCMiSeqAllFGMGPGIGclean_k27.contigs.fasta (please wait...)
                        [G5NNJN1:07040] *** Process received signal ***
                        [G5NNJN1:07040] Signal: Segmentation fault (11)
                        [G5NNJN1:07040] Signal code:  (128)
                        [G5NNJN1:07040] Failing at address: (nil)
                        --------------------------------------------------------------------------
                        mpiexec noticed that process rank 0 with PID 7040 on node G5NNJN1 exited on signal 11 (Segmentation fault).
                        The last file loaded was a file with fasta contigs from another assembler (minia). Does this mean contigs from other assemblers cannot be used in Ray?
                        The maximum read length is 65536 nucleotides.
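To check whether any record in a FASTA file exceeds that limit, one option is to sum the sequence lines of each record. The following is a minimal Python sketch and not part of Ray; the function names `fasta_record_lengths` and `too_long` are made up for illustration, and the limit is simply the 65536 figure quoted above.

```python
import io

MAX_READ_LENGTH = 65536  # limit quoted in the reply above


def fasta_record_lengths(handle):
    """Yield (name, length) per FASTA record, joining wrapped sequence lines."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            parts = line[1:].split()
            name = parts[0] if parts else ""
            length = 0
        elif name is not None:
            length += len(line)
    if name is not None:
        yield name, length


def too_long(handle, limit=MAX_READ_LENGTH):
    """Return the records whose sequence is longer than the limit."""
    return [(n, l) for n, l in fasta_record_lengths(handle) if l > limit]


if __name__ == "__main__":
    # Toy input; in practice pass open("contigs.fasta") (hypothetical path).
    demo = io.StringIO(">contig1 wrapped\nACGT\nAC\n>contig2\nGG\n")
    for record_name, record_length in fasta_record_lengths(demo):
        print(record_name, record_length)
```

If `too_long` returns an empty list for the contig file, the length limit is probably not the cause of the crash.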

                        Comment


                        • Originally posted by KirillK View Post
                          Hi guys!

                          Is there a way to provide a reference genome for Ray?

                          cheers,
                          KK
                          You can provide reference genomes using the -search option.

                          Code:
                                 -search searchDirectory
                                        Provides a directory containing fasta files to be searched in the de Bruijn graph.
                                        Biological abundances will be written to RayOutput/BiologicalAbundances
                                        See Documentation/BiologicalAbundances.txt
                          However, this will not be used to aid in the assembly. This option is useful to report biological abundances.

                          See this paper for more information: http://genomebiology.com/2012/13/12/R122/abstract
                          Last edited by seb567; 03-11-2013, 05:39 AM. Reason: added Genome Biology reference

                          Comment


                          • Originally posted by yaximik View Post
                            Hi,
                            What is the meaning of averageOuterDistance and standardDeviation for paired end files?
                            The outer distance is the sum of the gap size, the length of the left read and the length of the right read.

                            This is computed for paired reads and mate pairs.


                            Is it just average read length in the dataset?
                            No.

                            If so, then why is it not required for single-read files?
                            It only applies for pairs.


                            If not, is it an average fragment length in the library?
                            Yes.

                            Such as surmised from BioAnalyzer trace, for example?
                            Yes, but the BioAnalyzer measurement also includes the sequencing adapters, whereas these are usually not part of the sequencing reads.


                            If so, then the default autocalc may give a very wrong estimate, couldn't it? For example, one of my paired-read runs was done with a library of 600 bp +/- 15%, but during assembly the autocalc estimate was something like 150 bp. How can this be so far off?
                            The 600 bp +/- 15% presumably includes adapters that are not in sequencing reads.

                            You can run another application on your data (like ABySS) and you'll see that Ray's right.
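The definition of the outer distance given above can be written out as simple arithmetic. This is only an illustrative Python sketch; the helper names are invented here, and the numbers come from the 2 x 250 library mentioned at the start of the thread.

```python
def outer_distance(left_len, gap, right_len):
    """Outer distance = left read length + gap size + right read length."""
    return left_len + gap + right_len


def implied_gap(outer, left_len, right_len):
    """Gap implied by an outer distance; negative means the reads overlap."""
    return outer - left_len - right_len


if __name__ == "__main__":
    # 500 bp fragment, 2 x 250 reads: the reads just meet, gap is 0.
    print(outer_distance(250, 0, 250))
    # 400 bp fragment, 2 x 250 reads: gap is -100, i.e. a 100 bp overlap.
    print(implied_gap(400, 250, 250))
```

A negative gap is just the overlapping-pairs case asked about earlier: the outer distance is then shorter than the two read lengths combined.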

                            Comment


                            • The maximum read length is 65536 nucleotides.
                              There has got to be another reason. The assembly file from minia has a maximum contig length of 16091 nt. Without this dataset, Ray produced an assembly with a maximum contig/scaffold length of 46428 nt.

                              The 600 bp +/- 15% presumably includes adapters that are not in sequencing reads.
                              That is puzzling. The combined adaptor length (both sides) is standard at 120 bp, so autocalc is still way off (600-120=480, but the estimate is ~150). Obviously a much smaller library size should affect scaffolding. Would it be better to provide the real numbers? Also, I guess a narrower distribution should be better, correct? This can be done by refractionating the library and collecting a narrow distribution, say +/-5%.

                              Comment


                              • Originally posted by yaximik View Post
                                There has got to be another reason. The assembly file from minia has a maximum contig length of 16091 nt. Without this dataset, Ray produced an assembly with a maximum contig/scaffold length of 46428 nt.
                                Then the problem is presumably caused by the lack of support for multiline fasta files for reads in Ray.

                                Please do submit a ticket if you feel this should be fixed.
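In the meantime, a possible workaround is to rewrite the contig file so that each sequence sits on a single line before giving it to Ray. The following is a minimal Python sketch under that assumption; `unwrap_fasta` is a made-up helper, not a Ray tool, and whether this actually avoids the segmentation fault has not been verified here.

```python
import io


def unwrap_fasta(src, dst):
    """Copy FASTA from src to dst, writing each sequence on one line."""
    seq_parts = []
    have_record = False
    for line in src:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # Flush the previous record's sequence before the new header.
            if have_record:
                dst.write("".join(seq_parts) + "\n")
            dst.write(line + "\n")
            seq_parts = []
            have_record = True
        elif have_record and line.strip():
            seq_parts.append(line.strip())
    # Flush the final record.
    if have_record:
        dst.write("".join(seq_parts) + "\n")


if __name__ == "__main__":
    # Toy input; with real data use open(...) handles for both arguments.
    wrapped = io.StringIO(">c1\nACGT\nTTAA\n>c2\nGG\n")
    out = io.StringIO()
    unwrap_fasta(wrapped, out)
    print(out.getvalue(), end="")
```

Headers are copied unchanged, so contig names from the other assembler are preserved.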


                                That is puzzling. The combined adaptor length (both sides) is standard at 120 bp, so autocalc is still way off (600-120=480, but the estimate is ~150). Obviously a much smaller library size should affect scaffolding. Would it be better to provide the real numbers? Also, I guess a narrower distribution should be better, correct? This can be done by refractionating the library and collecting a narrow distribution, say +/-5%.
                                You can plot your distributions.

                                LibraryStatistics.txt contains averages, but you have all the signal in Library0.txt, Library1.txt. If you are using the git version of Ray, this information is now in LibraryData.xml
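Before plotting, it can help to summarize the distribution numerically. The sketch below is in Python and rests on an assumption that should be checked against your own output: that each line of a file such as Library0.txt holds two whitespace-separated integers, an observed outer distance and the number of pairs with that distance. The function names are invented for illustration.

```python
import io
import math


def read_histogram(handle):
    """Parse (distance, count) pairs; the two-column layout is an assumption."""
    pairs = []
    for line in handle:
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            pairs.append((int(fields[0]), int(fields[1])))
        except ValueError:
            continue  # skip header or malformed lines
    return pairs


def weighted_mean_sd(pairs):
    """Count-weighted mean and (population) standard deviation."""
    total = sum(count for _, count in pairs)
    if total == 0:
        raise ValueError("histogram is empty")
    mean = sum(d * count for d, count in pairs) / total
    variance = sum(count * (d - mean) ** 2 for d, count in pairs) / total
    return mean, math.sqrt(variance)


def mode(pairs):
    """Distance with the highest count."""
    return max(pairs, key=lambda p: p[1])[0]


if __name__ == "__main__":
    # Toy histogram; replace with open("RayOutput/Library0.txt") if the
    # layout assumption holds for your files.
    demo = io.StringIO("100 1\n200 2\n300 1\n")
    hist = read_histogram(demo)
    print(weighted_mean_sd(hist), mode(hist))
```

If the mode and the mean differ a lot, the distribution is probably skewed or has more than one peak, and a single average would describe the library poorly.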

                                Comment
