  • It might be possible to shoehorn Ray into doing something like the 'Inchworm' part of Trinity:



    I've had a bit of a hiatus from work on Ray due to additional projects, but I'm interested in seeing if this will work because the current transcriptome assembly programs have really high memory requirements. The memory requirements are odd because the transcript graphs should be simpler (fewer repeats because you're making things like proteins, so branches should be mostly due to different isoforms), and the transcriptome size is smaller than the genome size.

    My guess is that this would involve disabling the genome coverage graph functions (with RNA-Seq the mean coverage is per-transcript, but there can be within-transcript bias) and writing out sequences that have some minimum coverage level based on the average coverage for each disconnected graph.



    • Ray error message: Fatal error

      Hi Sébastien,

      I've been using Ray to assemble a 30-50 Mb fungal genome from 454 and PE Illumina reads. When I was testing the software with raw reads I had no trouble and the assembly ran correctly. The problem arose when I quality-filtered all the reads and created new FASTA and FASTQ files. I'm pasting the error message below.

      What could the problem be?

      Cheers,
      Santiago

      Rank 5: gathering scaffold links [1/3559] [1/28971]
      Rank 2: gathering scaffold links [1/3854] [1/30494]
      Rank 4: gathering scaffold links [1/3726] [1/56682]
      Fatal Error: ReadIndex: 18854336 but Reads: 18635750
      Ray: code/communication/MessageProcessor.cpp:127: void MessageProcessor::call_RAY_MPI_TAG_GET_READ_MARKERS(Message*): Assertion `readId<(int)m_myReads->size()' failed.
      [ipara:21878] *** Process received signal ***
      [ipara:21878] Signal: Aborted (6)
      [ipara:21878] Signal code: (-6)
      [ipara:21878] [ 0] /lib/libpthread.so.0 [0x7ff0190d3a80]
      [ipara:21878] [ 1] /lib/libc.so.6(gsignal+0x35) [0x7ff018da3ed5]
      [ipara:21878] [ 2] /lib/libc.so.6(abort+0x183) [0x7ff018da53f3]
      [ipara:21878] [ 3] /lib/libc.so.6(__assert_fail+0xe9) [0x7ff018d9cdc9]
      [ipara:21878] [ 4] Ray(_ZN16MessageProcessor33call_RAY_MPI_TAG_GET_READ_MARKERSEP7Message+0x454) [0x43fa74]
      [ipara:21878] [ 5] Ray(_ZN7Machine10runVanillaEv+0x99) [0x454f19]
      [ipara:21878] [ 6] Ray(_ZN7Machine5startEv+0x1031) [0x456c51]
      [ipara:21878] [ 7] Ray(main+0x3c) [0x4c0abc]
      [ipara:21878] [ 8] /lib/libc.so.6(__libc_start_main+0xe6) [0x7ff018d901a6]
      [ipara:21878] [ 9] Ray(__gxx_personality_v0+0x201) [0x42cd09]
      [ipara:21878] *** End of error message ***
      mpiexec noticed that job rank 0 with PID 21872 on node ipara exited on signal 15 (Terminated).
      6 additional processes aborted (not shown)



      • Originally posted by gringer
        My guess is that this would involve disabling the genome coverage graph functions (with RNA-Seq the mean coverage is per-transcript, but there can be within-transcript bias) and writing out sequences that have some minimum coverage level based on the average coverage for each disconnected graph.

        Hello,

        I don't think we can assume that each transcript will be a disconnected-from-the-rest component in the graph.

        Also, I think you should work with the mode k-mer coverage, not the mean k-mer coverage because the mean will be artificially increased by repeats.
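
A minimal sketch of why the mode is more robust than the mean here. The histogram values are invented for illustration (they are not Ray output), and `mean_coverage` and `mode_coverage` are hypothetical helper names. A small number of highly repeated k-mers pulls the mean upward, while the mode stays at the depth of the unique sequence.

```python
# Hypothetical k-mer coverage histogram: coverage depth -> number of k-mers.
# The numbers are invented for illustration; they are not Ray output.
histogram = {
    28: 900, 29: 1500, 30: 2000, 31: 1400, 32: 800,  # unique sequence
    300: 120, 600: 40,                                # repeated k-mers
}


def mean_coverage(hist):
    """Frequency-weighted mean coverage depth."""
    total = sum(hist.values())
    return sum(depth * count for depth, count in hist.items()) / total


def mode_coverage(hist):
    """Coverage depth observed for the largest number of k-mers."""
    return max(hist, key=hist.get)


print("mean:", round(mean_coverage(histogram), 1))
print("mode:", mode_coverage(histogram))
```

With these made-up numbers the mean lands at about 38 even though almost all k-mers sit between 28 and 32, whereas the mode is 30.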

        We tested Ray on the Schizosaccharomyces pombe dataset from the Trinity paper.

        Ray is quite good but presently we are focusing on assembly of metagenomes and biological abundances using virtual colors.


        Sébastien



        • Originally posted by santiagosnchez
          The problem arose when I quality-filtered all the reads and created new FASTA and FASTQ files. [...]
          Fatal Error: ReadIndex: 18854336 but Reads: 18635750

          Paired reads are usually stored in two files. For any pair of files, both files must contain the same number of sequences.

          I suspect that the FASTQ files you generated after filtering do not contain the same number of sequences.

          This happens because, for any pair of sequences, 0, 1 or 2 sequences can be filtered out. The 0 and 2 cases cause no problem because they are 'keep all' and 'remove all' scenarios, respectively.

          But when only 1 sequence is filtered out, its twin should also be filtered out, or perhaps put aside in a file containing 'alone' sequences.

          The problem arises because Ray utilises Unique Sequencer Identifiers, which are computed from the initial partition (fastq identifiers are not utilised at all).

          The problem will go away should you provide Ray with a coherent sequence count for each file.
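
A quick way to check this before launching an assembly is to count the records in both files of a pair. The sketch below assumes plain 4-line-per-record FASTQ (no wrapped sequences); `count_fastq_records` and `same_record_count` are hypothetical helper names, and the demo uses in-memory text instead of real files.

```python
import io


def count_fastq_records(handle):
    """Count records in a FASTQ stream, assuming exactly 4 lines per record."""
    n_lines = 0
    trailing_blank = 0
    for line in handle:
        n_lines += 1
        # Track blank lines at the very end so they are not counted.
        trailing_blank = trailing_blank + 1 if not line.strip() else 0
    n_lines -= trailing_blank
    if n_lines % 4 != 0:
        raise ValueError("line count is not a multiple of 4")
    return n_lines // 4


def same_record_count(left_handle, right_handle):
    """True if both files of a pair hold the same number of records."""
    return count_fastq_records(left_handle) == count_fastq_records(right_handle)


# Demo on in-memory data: 3 records on the left, 2 on the right.
record = "@read/1\nACGT\n+\nIIII\n"
left = io.StringIO(record * 3)
right = io.StringIO(record * 2)
print(same_record_count(left, right))  # prints False
```

With real data, the two handles would come from `open(...)` on the two filtered files of a pair.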


          Sébastien



          • Thanks for replying Sebastien,

            I figured out the problem right after my post. Do you recommend a way to exclude / delete unpaired filtered reads from each file? I've been trying to find some scripts, but no luck.

            By the way, excellent program(!), by far the best assembler I've used.

            Cheers,
            Santiago



            • Ray - Coverage too high

              Hello, I have been using Ray for the de novo assembly of several bacterial genomes. Overall it seems to be a really good program that has been giving me longer contigs than SOAPdenovo.

              However, recently I ran into an error that seems to be due to genome coverage that is too high:

              Rank 0: the minimum coverage is 2
              Rank 0: the peak coverage is 2
              Rank 0: Assembler panic: no peak observed in the k-mer coverage distribution.
              Rank 0: to deal with the sequencing error rate, try to lower the k-mer length (-k)

              At first I thought that I had the opposite problem, not enough coverage. I tried to lower the k as suggested, but I kept getting the same error. The only way I have been able to get Ray to run on this dataset is to either decrease the number of sequences that I input into the program (in which case I get very good contigs) or increase the k-mer length to very high values (e.g., 63).

              If possible, could you explain why high coverage would result in this type of error?

              And can you provide guidelines for the optimal genome coverage for Ray?


              Thank you.

              Jason



              • Sébastien,

                Is there a way to reuse some of Ray's output files in order to avoid some of the initial computations on the same data?

                Cheers,
                Santiago



                • Originally posted by santiagosnchez
                  Do you recommend a way to exclude / delete unpaired filtered reads from each file? I've been trying to find some scripts, but no luck.

                  I don't know any particularly good program for this precise task.
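
For readers with the same problem, here is a minimal sketch of such a script, not a tested tool. It assumes 4-line-per-record FASTQ and that the two mates share a name apart from a trailing `/1` or `/2` (or anything after the first space); all function names are hypothetical. It keeps complete pairs in matching order and puts the 'alone' reads aside, as suggested earlier in the thread. Note that it holds the whole second file in memory.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if header == "":
            return  # end of file
        if not header.strip():
            continue  # skip stray blank lines
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def pair_key(name):
    """Reduce a read name to the part shared by both mates (assumed convention)."""
    key = name.split()[0]
    if key.endswith("/1") or key.endswith("/2"):
        key = key[:-2]
    return key


def write_record(handle, record):
    name, seq, qual = record
    handle.write(f"@{name}\n{seq}\n+\n{qual}\n")


def repair(left_in, right_in, left_out, right_out, singles_out):
    """Write complete pairs to left_out/right_out and orphans to singles_out.

    Returns (number of pairs, number of orphans).
    """
    right = {pair_key(rec[0]): rec for rec in read_fastq(right_in)}
    n_pairs = n_orphans = 0
    for rec in read_fastq(left_in):
        mate = right.pop(pair_key(rec[0]), None)
        if mate is None:
            write_record(singles_out, rec)
            n_orphans += 1
        else:
            write_record(left_out, rec)
            write_record(right_out, mate)
            n_pairs += 1
    for rec in right.values():  # mates whose partner was filtered out
        write_record(singles_out, rec)
        n_orphans += 1
    return n_pairs, n_orphans


# Demo: read b lost its mate on the right, read d lost its mate on the left.
left_in = io.StringIO("@a/1\nACGT\n+\nIIII\n@b/1\nGGGG\n+\nIIII\n@c/1\nTTTT\n+\nIIII\n")
right_in = io.StringIO("@a/2\nCCCC\n+\nIIII\n@c/2\nAAAA\n+\nIIII\n@d/2\nGTGT\n+\nIIII\n")
left_out, right_out, singles_out = io.StringIO(), io.StringIO(), io.StringIO()
print(repair(left_in, right_in, left_out, right_out, singles_out))  # prints (2, 2)
```

After this step both output files contain the same number of sequences in the same order, which is what Ray needs for a pair of files.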



                  • Originally posted by jtladner
                    However, recently I ran into an error that seems to be due to genome coverage that is too high:

                    Rank 0: the minimum coverage is 2
                    Rank 0: the peak coverage is 2
                    Rank 0: Assembler panic: no peak observed in the k-mer coverage distribution.
                    Rank 0: to deal with the sequencing error rate, try to lower the k-mer length (-k)

                    This limitation was removed in Ray 2.0 release candidate 5.

                    You can try Ray 2.0-rc5.

                    We modified this to enable metagenome assemblies.


                    Originally posted by jtladner
                    At first I thought that I had the opposite problem, not enough coverage. I tried to lower the k as suggested, but I kept getting the same error. The only way I have been able to get Ray to run on this dataset is to either decrease the number of sequences that I input into the program (in which case I get very good contigs) or increase the k-mer length to very high values (e.g., 63).

                    If you plot the coverage distribution, I am sure you will see a curve that is not smooth but that still has a sizable peak.

                    To plot your data, enter these commands in your terminal:


                    Code:
                    cd Place-Where-My-Assembly-Is-Located
                    ls CoverageDistribution.txt # make sure you are in the right place
                    R --vanilla

                    # the next commands are given to R
                    data=read.table('CoverageDistribution.txt',header=TRUE)
                    pdf('MyCoverageFrequencies.pdf')
                    plot(data[,1],data[,2],xlab='k-mer coverage depth',ylab='Frequency',log='xy',type='l')
                    dev.off()

                    There is also a fancy script that ships with Ray that does this automatically.

                    Code:
                    ~/git-clones/ray/scripts/plot-coverage-distribution.R CoverageDistribution.txt

                    Originally posted by jtladner
                    If possible, could you explain why high coverage would result in this type of error?

                    We bought an Illumina HiSeq 1000 at our institution.

                    One of the acceptance tests was to sequence a whole lane of PhiX, a virus whose genome has just 5386 nucleotides.


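                    The coverage distribution was ridiculous:

                    [Figure: coverage distribution for the PhiX lane; image not preserved]

                    If we zoom in, we can see that the peak is not smooth.

                    [Figure: zoomed-in view of the peak; image not preserved]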

                    This *may* be caused by cluster complexity on the flow cell.

                    *Maybe* your data look like this also, maybe not.


                    Originally posted by jtladner
                    And can you provide guidelines for the optimal genome coverage for Ray?

                    As the saying goes, "the more, the better."

                    You should plot your distributions to assess the quality of your data.
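
Besides plotting, the distribution can be inspected numerically. The sketch below is not Ray's own peak detection; it is a simple heuristic under the assumption that the file holds two whitespace-separated columns (coverage depth, frequency), as the R snippet above suggests. It walks down the initial slope of low-coverage k-mers to the first local minimum and then reports the most frequent depth after it. On noisy data the first local minimum can be spurious, so treat the result as a hint. `parse_distribution` and `valley_and_peak` are hypothetical names.

```python
def parse_distribution(lines):
    """Parse 'depth frequency' rows; skip blanks, '#' comments and text headers."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        try:
            rows.append((int(fields[0]), int(fields[1])))
        except ValueError:
            continue  # non-numeric header line
    return sorted(rows)


def valley_and_peak(rows):
    """Return (valley_depth, peak_depth), or None if the curve never rises again."""
    i = 0
    # Walk down the initial slope (mostly erroneous k-mers) to the first minimum.
    while i + 1 < len(rows) and rows[i + 1][1] <= rows[i][1]:
        i += 1
    if i + 1 >= len(rows):
        return None  # monotonically decreasing: no peak observed
    peak = max(rows[i:], key=lambda row: row[1])
    return rows[i][0], peak[0]


# Demo on invented numbers (not real Ray output).
example = """#depth frequency
1 5000
2 1200
3 300
4 150
5 200
10 900
20 2500
30 4000
40 2600
50 700
"""
print(valley_and_peak(parse_distribution(example.splitlines())))  # prints (4, 30)
```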




                    • Greetings !

                      Originally posted by santiagosnchez
                      Is there a way to reuse some of Ray's output files in order to avoid some of the initial computations on the same data?

                      Yes, they are called checkpoints.

                      You just have to add -read-write-checkpoints to your command.

                      However, note that checkpoint files (they are binary and have the .ray extension) are only valid with the same command, using the same data and the same number of MPI ranks.

                      This mechanism is a checkpointing facility.



                      Code:
                      mpiexec -n 1 Ray -help | less
                      
                        Checkpointing
                      
                             -write-checkpoints
                                    Write checkpoint files
                      
                             -read-checkpoints
                                    Read checkpoint files
                      
                             -read-write-checkpoints
                                    Read and write checkpoint files


                      • So this could be achieved by typing something like:

                        mpiexec -n <#> Ray -o <$$$$> -read-checkpoints
                        (after you did a run with -write-checkpoints)

                        Is it possible to change the k-mer size for instance?

                        Thanks,

                        Santiago



                        • RAY on colourspace

                          Hi there,

                          Do you have any news on the colourspace issue? We ran RAY today for the first time and were very impressed, except that we mostly deal with SOLiD data and would need the contigs in base space eventually :-)

                          Thanks!

                          Anelda



                          • Problem at compilation with latest GCC version

                            Hi everyone,

                            I encountered a problem when trying to build the latest stable version of Ray (1.7) with the latest version of GCC (v4.7.0).

                            The problem occurred at the make step.

                            With GCC v4.7.0, I got the following errors:

                            Code:
                            code/communication/MessageProcessor.cpp: In member function 'void MessageProcessor::call_RAY_MPI_TAG_ASK_VERTEX_PATH(Message*)':
                            code/communication/MessageProcessor.cpp:1685:7: error: redeclaration of 'int i'
                            code/communication/MessageProcessor.cpp:1675:10: error: 'int i' previously declared here
                            make: *** [code/communication/MessageProcessor.o] Error

                            However, when I used GCC v4.1.2 (which was also installed on this machine) instead, the installation finished correctly.



                            • That's because the more recent versions of GCC do more code checking. Redeclaring variables introduces some scoping issues, and usually means that the coder hasn't realised there's an ambiguity. Luckily, these redeclaration errors are usually easily fixed, for example by changing the name of the inner loop variable to j instead of i.



                              • Originally posted by santiagosnchez
                                Is it possible to change the k-mer size for instance?

                                No, you cannot change the k-mer size if you reuse the same checkpoint files.

                                There is also the option -read-write-checkpoints, which both reads and writes checkpoints.
