
  • Originally posted by Wallysb01 View Post
    Hi Sébastien,

    I have a question, Ray appears to be automatically setting the minimum kmer coverage to 1 minus the peak coverage, is that true, and what's the reason for this? I apparently have absurd peak coverage of 626, even with k=61. Could this be why my assembly was poor the first time with k=31 above?

    I'm thinking I'm going to cancel my job and drop that min coverage.
    Can you post somewhere the content of CoverageDistribution.txt and send an email
    with that information to denovoassembler-users AT lists.sourceforge.net.

    Comment


    • Originally posted by habm View Post
      Thanks for the helpful responses, Sébastien, and for addressing this problem in the next version.
      Meanwhile, are the files PREFIX.LibraryX.txt made automatically, please? I cannot see them, whether I set a mean and sd for insert size in the command line, or not.

      These are automatically generated.

      Can you send an email to denovoassembler-users AT lists.sourceforge.net with:

      Your Ray command.

      Thank you.

      Comment


      • Dear all,

        I have two libraries of genomic DNA sequenced with Illumina. I want to perform a de novo assembly using Ray (v 1.6.0).

        Library 1 (is paired-ends, 500 ± 50)
        95.607.712 reads (47.803.856 pairs)
        573.667 contigs were built by Ray
        It took 1 day, 24 minutes and 16 seconds to perform the assembly

        Library 2 (is mate-pair, 2200 ± 200)
        85.198.684 reads (42.599.342 pairs)
        28.224 contigs were built by Ray
        It took 3 hours, 7 minutes and 44 seconds to perform the assembly

        My problem is that my cluster is not big enough to run Ray using the two libraries simultaneously. So I ran Ray in two steps, first using only Library 1 and then only Library 2. Now the idea is to run Ray again, using as input the contigs generated in the two previous steps (as single-end reads). Can I do this? Will it work? (All under the assumption that I will use less memory this way.)

        My second question is: why did the assembly of Library 2 take only 3 hours? Is there something wrong with the assembly? Should I be concerned?

        My last question is related to the previous one. To double-check the assembly of Library 2, I ran it again (with the same default parameters as before). The running time was the same, but the number of contigs varied slightly between runs (I have not checked the contig sequences). So, is Ray deterministic software? Or will it generate a different output every time it is used?

        Thanks in advance.

        Comment


        • Originally posted by kail View Post

          Dear all,

          I have two libraries of genomic DNA sequenced with Illumina. I want to perform a de novo assembly using Ray (v 1.6.0).

          Library 1 (is paired-ends, 500 ± 50)
          95.607.712 reads (47.803.856 pairs)
          573.667 contigs were built by Ray
          It took 1 day, 24 minutes and 16 seconds to perform the assembly

          Library 2 (is mate-pair, 2200 ± 200)
          85.198.684 reads (42.599.342 pairs)
          28.224 contigs were built by Ray
          It took 3 hours, 7 minutes and 44 seconds to perform the assembly

          My problem is that my cluster is not big enough to run Ray using the two libraries simultaneously.
          How big is your compute cluster ?

          Originally posted by kail View Post

          So I ran Ray in two steps, first using only Library 1 and then only Library 2. Now the idea is to run Ray again, using as input the contigs generated in the two previous steps (as single-end reads). Can I do this?
          That won't work well with Ray. Furthermore, using both libraries simultaneously gives Ray much more information that helps assemble the genome.

          Originally posted by kail View Post


          Will it work?
          I don't think it will work well.

          Originally posted by kail View Post

          (All under the assumption that this way I will use less memory)

          My second question is: why did the assembly of Library 2 take only 3 hours?



          How many nucleotides were outputted by Ray ?

          Can you provide the content of CoverageDistributionAnalysis.txt files for both of them ?


          Originally posted by kail View Post


          Is there something wrong with the assembly?
          I don't know, you did not provide much information describing your assemblies aside from the number of contigs.

          Originally posted by kail View Post


          Should I be concerned?
          You should definitely run Ray on all data at once.

          You have < 100 M reads.

          Originally posted by kail View Post


          My last question is related to the previous one. To double-check the assembly of Library 2, I ran it again (with the same default parameters as before). The running time was the same, but the number of contigs varied slightly between runs (I have not checked the contig sequences). So, is Ray deterministic software? Or will it generate a different output every time it is used?
          Ray will generate different assemblies with the same input. This is caused by the randomness of the order in which messages are sent during the computation.

          But the assemblies are mostly equivalent.

          Originally posted by kail View Post



          Thanks in advance.
          ***
          Sébastien Boisvert
          Ray -- Parallel genome assemblies for parallel DNA sequencing (GitHub: sebhtml/ray)

          Comment


          • Originally posted by seb567 View Post
            How big is your compute cluster ?
            Actually, it is a single machine: a Mac Pro with 2 x 2.4 GHz Quad-Core Intel Xeon, 64 GB of RAM and 2 TB of disk.


            Originally posted by seb567 View Post
            That won't work well with Ray. Furthermore, using both libraries simultaneously give Ray much more information that helps assemble the genome.
            I tried to do that, but my "cluster" ran out of memory.

            Originally posted by seb567 View Post
            How many nucleotides were outputted by Ray ?
            I really don’t know how many nucleotides were outputted by Ray for the assembly of Library 2… I guess 26311531.

            Originally posted by seb567 View Post
            Can you provide the content of CoverageDistributionAnalysis.txt files for both of them ?
            Here are some outputs of Ray for both libraries:

            Library 1

            ----------LibraryStatistics.txt ----------
            File: Paired-Ends.fastq
            NumberOfSequences: 95607712

            Total: 95607712

            NumberOfPairedLibraries: 1

            LibraryNumber: 0
            InputFormat: Interleaved,Paired
            DetectionType: Manual
            File: Paired-Ends.fastq
            NumberOfSequences: 95607712
            AverageOuterDistance: 500
            StandardDeviation: 50

            ----------OutputNumbers.txt----------

            Number of contigs: 573667
            Total length of contigs: 88436065
            Number of contigs >= 500 nt: 1550
            Total length of contigs >= 500 nt: 936979
            Number of scaffolds: 573666
            Total length of scaffolds: 88436427
            Number of scaffolds >= 500 nt: 1549
            Total length of scaffolds >= 500: 937341

            ----------CoverageDistributionAnalysis.txt----------
            k-mer length: 21
            Lowest coverage observed: 1
            MinimumCoverage: 131
            PeakCoverage: 131
            RepeatCoverage: 139
            Number of k-mers with at least MinimumCoverage: 2305986 k-mers
            Estimated genome length: 1152993 nucleotides
            Percentage of vertices with coverage 1: 24.3365 %
            DistributionFile: RayOutput.CoverageDistribution.txt

            Library 2

            ----------LibraryStatistics.txt ----------
            File: Mate-Pair.fastq
            NumberOfSequences: 85198684

            Total: 85198684

            NumberOfPairedLibraries: 1

            LibraryNumber: 0
            InputFormat: Interleaved,Paired
            DetectionType: Manual
            File: Mate-Pair.fastq
            NumberOfSequences: 85198684
            AverageOuterDistance: 2200
            StandardDeviation: 200

            ----------OutputNumbers.txt----------
            Number of contigs: 28224
            Total length of contigs: 26311531
            Number of contigs >= 500 nt: 5609
            Total length of contigs >= 500 nt: 23362817
            Number of scaffolds: 27056
            Total length of scaffolds: 27898655
            Number of scaffolds >= 500 nt: 4441
            Total length of scaffolds >= 500: 24949941

            ----------CoverageDistributionAnalysis.txt----------
            k-mer length: 21
            Lowest coverage observed: 1
            MinimumCoverage: 31
            PeakCoverage: 172
            RepeatCoverage: 178
            Number of k-mers with at least MinimumCoverage: 55438516 k-mers
            Estimated genome length: 27719258 nucleotides
            Percentage of vertices with coverage 1: 22.5815 %
            DistributionFile: RayOutput.CoverageDistribution.txt


            Originally posted by seb567 View Post
            Ray will generate different assemblies with the same input. This is caused by the randomness of the order in which messages are sent during the computation.

            But the assemblies are mostly equivalent.
            Does this mean that if I had used a single processor (-np 1 with MPI), I would have got the same output?
            And if I run Ray twice on one processor, changing the order of the reads in the input file for the second run, should the output also differ between the two runs?

            Comment


            • Originally posted by kail;47393


              ----------CoverageDistributionAnalysis.txt----------
              k-mer length: 21
              Lowest coverage observed: 1
              MinimumCoverage: 131
              PeakCoverage: 131
              RepeatCoverage: 139
              Number of k-mers with at least MinimumCoverage: 2305986 k-mers
              Estimated genome length: 1152993 nucleotides
              Percentage of vertices with coverage 1: 24.3365 %
              DistributionFile: RayOutput.CoverageDistribution.txt


              ----------CoverageDistributionAnalysis.txt----------
              k-mer length: 21
              Lowest coverage observed: 1
              MinimumCoverage: 31
              PeakCoverage: 172
              RepeatCoverage: 178
              Number of k-mers with at least MinimumCoverage: 55438516 k-mers
              Estimated genome length: 27719258 nucleotides
              Percentage of vertices with coverage 1: 22.5815 %
              DistributionFile: RayOutput.CoverageDistribution.txt

              Does this mean that if I had used a single processor (-np 1 with MPI), I would have got the same output?
              And if I run Ray twice on one processor, changing the order of the reads in the input file for the second run, should the output also differ between the two runs?

              Something is wrong with your coverage distributions.

              Can you post PREFIX.CoverageDistribution.txt to http://pastebin.com ?
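As a rough aid for reading the CoverageDistributionAnalysis.txt summaries quoted above, here is a minimal Python sketch that parses the `Key: value` lines and flags two things visible in the posted numbers. In both summaries the estimated genome length is exactly half the number of k-mers at or above MinimumCoverage (2305986 / 2 = 1152993 and 55438516 / 2 = 27719258), and in the first summary MinimumCoverage equals PeakCoverage. The thread does not say whether these are what looked wrong; the halving rule is inferred from the posted values only, and the function names are made up for illustration.

```python
def parse_analysis(text):
    """Parse 'Key: value' lines from a CoverageDistributionAnalysis.txt dump."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def first_int(value):
    """Return the leading integer of a value such as '2305986 k-mers'."""
    return int(value.split()[0])


def summarize(text):
    """Report simple consistency checks on one analysis summary."""
    f = parse_analysis(text)
    kmers = first_int(f["Number of k-mers with at least MinimumCoverage"])
    estimate = first_int(f["Estimated genome length"])
    return {
        # Assumption inferred from the posted numbers: estimate = k-mers / 2.
        "estimate_is_half_kmers": estimate * 2 == kmers,
        # A minimum equal to the peak leaves no room below the peak.
        "minimum_equals_peak": first_int(f["MinimumCoverage"])
        == first_int(f["PeakCoverage"]),
        "estimate": estimate,
    }


# Values copied from the two summaries posted in the thread.
LIBRARY_1 = """\
k-mer length: 21
MinimumCoverage: 131
PeakCoverage: 131
RepeatCoverage: 139
Number of k-mers with at least MinimumCoverage: 2305986 k-mers
Estimated genome length: 1152993 nucleotides
"""

LIBRARY_2 = """\
k-mer length: 21
MinimumCoverage: 31
PeakCoverage: 172
RepeatCoverage: 178
Number of k-mers with at least MinimumCoverage: 55438516 k-mers
Estimated genome length: 27719258 nucleotides
"""

if __name__ == "__main__":
    for name, text in (("Library 1", LIBRARY_1), ("Library 2", LIBRARY_2)):
        print(name, summarize(text))
```

The two estimates (about 1.15 Mb and 27.7 Mb) disagree with each other, and the first also disagrees with the 88436065 nt of contigs reported for Library 1, which is one concrete reason to look at the full distribution file.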


              If you run Ray with -np 1 (1 compute core) on the same data twice, you will obtain the same
              result.

              When using more than 1 compute core, assemblies can change because of the order of the messages.
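To see how much two runs on the same data actually differ, one option is to compare the contig sets directly instead of only counting contigs. The Python sketch below is an illustration, not part of Ray: it treats a contig and its reverse complement as the same sequence and counts how many contigs are shared between two FASTA outputs. It only detects exact matches, so contigs that differ by a few bases at their ends are reported as different.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fasta(lines):
    """Yield sequences from an iterable of FASTA lines (headers are ignored)."""
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if chunks:
                yield "".join(chunks)
            chunks = []
        elif line:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)


def canonical_contigs(lines):
    """Count contigs, keeping the smaller of each sequence and its reverse complement."""
    counts = Counter()
    for seq in read_fasta(lines):
        seq = seq.upper()
        counts[min(seq, reverse_complement(seq))] += 1
    return counts


def compare(lines_a, lines_b):
    """Return how many contigs are shared or unique between two assemblies."""
    a, b = canonical_contigs(lines_a), canonical_contigs(lines_b)
    return {
        "shared": sum((a & b).values()),
        "only_a": sum((a - b).values()),
        "only_b": sum((b - a).values()),
    }


if __name__ == "__main__":
    # Toy data; with real runs, pass open file handles for the two contig files.
    run_a = [">c1", "ACGT", "TT", ">c2", "GGGA"]
    run_b = [">x", "TCCC", ">y", "AAACGT", ">z", "CCCCC"]
    print(compare(run_a, run_b))
```

Because `read_fasta` accepts any iterable of lines, the same function works on open files from two output directories.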

              Comment


              • Ray 1.7

                Dear assemblers,


                Ray v1.7 is now available (and the Assemblathon 2 is over).


                Summary of what changed:

                * MANUAL_PAGE.txt replaces the PDF manual.
                * Output files are written to the directory specified by -o (previously it was a file prefix)
                * Round-robin reception of messages
                * Bloom filter
                * Illumina mate-pairs support
                * Job checkpointing
                * New scaffolding algorithm
                * New assembly engine for the extension of seeds with mate-pairs (NovaEngine)
                * Parallel file partitioning
                * Network latency testing
                * Compiles cleanly on 32-bit systems

                All the changes:



                Ray -- Parallel genome assemblies for parallel DNA sequencing (GitHub: sebhtml/ray)

                Comment


                • Great work Seb!

                  Did you run 1.7 on the parrot data (Assemblathon 2) to compare with Ray 1.6.1?

                  Comment


                  • Originally posted by lletourn View Post
                    Great work Seb!

                    Did you run 1.7 on the parrot data (Assemblathon 2) to compare with Ray 1.6.1?
                    Of course !

                    (I am very thankful to the Assemblathon 2 organizers and data suppliers.)


                    cat k31-Ray-Bird-20110921-1/OutputNumbers.txt

                    Contigs >= 100 nt
                    Number: 88826
                    Total length: 1169161521
                    Average: 13162
                    N50: 41098
                    Median: 3368
                    Largest: 465622

                    Contigs >= 500 nt
                    Number: 68550
                    Total length: 1164709611
                    Average: 16990
                    N50: 41306
                    Median: 6862
                    Largest: 465622

                    Scaffolds >= 100 nt
                    Number: 47279
                    Total length: 1270995781
                    Average: 26882
                    N50: 567125
                    Median: 725
                    Largest: 3236250

                    Scaffolds >= 500 nt
                    Number: 27408
                    Total length: 1266700501
                    Average: 46216
                    N50: 571612
                    Median: 2137
                    Largest: 3236250
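For readers who want to recompute figures like these from a list of sequence lengths, here is a minimal Python sketch. It uses the common N50 definition (the length of the sequence at which the running total, going from longest to shortest, first reaches half of the total length); whether Ray uses exactly this convention is an assumption. The averages posted above are consistent with integer division, for example 1169161521 // 88826 = 13162.

```python
def n50(lengths):
    """Return N50: the length at which the running total, longest first, reaches half."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no lengths given")


def summarize(lengths, minimum=100):
    """Summarize sequences of at least `minimum` nucleotides."""
    kept = [length for length in lengths if length >= minimum]
    n50_value = n50(kept)  # raises ValueError if nothing passes the filter
    total = sum(kept)
    return {
        "Number": len(kept),
        "Total length": total,
        "Average": total // len(kept),  # integer division, as in the posted output
        "N50": n50_value,
        "Largest": max(kept),
    }


if __name__ == "__main__":
    # Toy lengths, not taken from any real assembly.
    print(summarize([50, 100, 200, 700], minimum=100))
```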


                    A lot of contigs end simply because there is no coverage.


                    I will post something on my blog soon describing the requirements and the outcomes.

                    See https://github.com/sebhtml/assemblathon-2-ray for the full command line required.

                    Comment


                    • Wow impressive numbers.

                      I think I'll try 1.7 on some of my "old" assemblies to see the difference.

                      Comment


                      • Also:

                        the automatic detection of "outer distances" sometimes fails for very large distances (like mate-pairs with 20000 +/- something as the outer distance).

                        So in these cases you need to provide the information manually.

                        Examples:

                        Automatic detection:

                        -p \
                        BGI_illumina_data/PARprgDAADTAAPE/110514_I263_FC81P81ABXX_L5_PARprgDAADTAAPE_1.fq.fastq \
                        BGI_illumina_data/PARprgDAADTAAPE/110514_I263_FC81P81ABXX_L5_PARprgDAADTAAPE_2.fq.fastq \

                        Manual detection:

                        -p \
                        BGI_illumina_data/PARprgDAPDUAAPEI-12/110531_I232_FCB05V6ABXX_L8_PARprgDAPDUAAPEI-12_1.fq.fastq \
                        BGI_illumina_data/PARprgDAPDUAAPEI-12/110531_I232_FCB05V6ABXX_L8_PARprgDAPDUAAPEI-12_2.fq.fastq \
                        20000 2000 \


                        See this example https://github.com/sebhtml/assemblat...ird-Testbed.sh if you are not sure what I mean.
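When a command line has many libraries, it can help to generate the `-p` arguments from a small table. The Python sketch below is only an illustration of the two forms shown above: two files alone for automatic detection, or two files followed by the average outer distance and standard deviation for manual detection. The helper name and the short file names are hypothetical, and the rest of the Ray command line (launcher, k-mer length, output directory) is left out.

```python
def paired_library_args(left, right, mean=None, sd=None):
    """Build the '-p' arguments for one paired library.

    With mean and sd omitted, Ray detects the outer distance automatically;
    with both given, they are passed after the two file names.
    """
    if (mean is None) != (sd is None):
        raise ValueError("give both mean and sd, or neither")
    args = ["-p", left, right]
    if mean is not None:
        args += [str(mean), str(sd)]
    return args


if __name__ == "__main__":
    # Hypothetical file names standing in for the long paths in the post.
    libraries = [
        ("pe_1.fastq", "pe_2.fastq", None, None),    # automatic detection
        ("mp_1.fastq", "mp_2.fastq", 20000, 2000),   # manual detection
    ]
    args = []
    for left, right, mean, sd in libraries:
        args += paired_library_args(left, right, mean, sd)
    print(" ".join(args))
```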

                        Happy assembly !

                        Comment


                        • I have a VERY large RNA-Seq dataset. Does Ray deal well with transcriptomic data and alternative splicing? Alternatively, can the Ray outputs be fed into Velvet-Oases, trans-ABySS or even TopHat?

                          Comment


                          • About mates: do we need to reverse-complement the Illumina reads to keep them as innies, or does Ray not care whether they're outies or innies?

                            Comment


                            • Originally posted by Ceratites View Post
                              I have a VERY large RNA-Seq dataset. Does Ray deal well with transcriptomic data and alternative splicing? Alternatively, can the Ray outputs be fed into Velvet-Oases, trans-ABySS or even TopHat?
                              Ray was not designed initially to do transcriptomes or metagenomes. We are presently working on modifications to handle metagenomes.

                              I don't know about the compatibility between Ray and Velvet-Oases, trans-ABySS or TopHat but I would tend to say no.

                              Comment


                              • Originally posted by lletourn View Post
                                About mates: do we need to reverse-complement the Illumina reads to keep them as innies, or does Ray not care whether they're outies or innies?
                                no

                                both are fine

                                Comment
