  • #91
    Dear boetsie,

    On another note, for my -x 0 output, I am noticing that an 'n' is being reported instead of lower case nt's when there is a clear overlap between adjacent contigs in the scaffold output.

    For example, the first two contigs of a scaffold are:
    >contig29
    ....TTCTTTTTCTTCCCATCTTCAGCCTTCTTAGCTTCGGCTTCCTCCCTCTCTTTCAACACAACAAGGGCAT
    >contig15
    TTCTTTTTCTTCCCATCTTCAGCCTTCTTAGCTTCGGCTTCCTCCCTCTCTTTCAACACAACAAGGGCAT....

    There is a clear 70nt overlap, however the final.scaffold output is putting an 'n' between these two contigs. These are the details of the scaffold:
    >scaffold1.1|size6918|tigs5
    f_tig29|size803|links11|gaps-58
    f_tig15|size4884|links6|gaps-99
    f_tig13|size173|links6|gaps-24|merged38
    f_tig24|size387|links45|gaps-52|merged45
    f_tig8|size752

    However it is reporting lowercase nts between contig 13/24 and 24/8, so I'm wondering if there is some kind of threshold '-n' which determines what is reported.
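For anyone who wants to check such joins programmatically, here is a minimal Python sketch that splits the evidence lines shown above into fields. The function name `parse_evidence_line` and the field interpretation (leading `f` as orientation, `gaps-58` as a gap of -58, `merged38` as a merged overlap of 38 nt) are assumptions read off this one example, not a documented SSPACE format.

```python
import re

# One "name + integer" field such as size803, links11, gaps-58 or merged38.
FIELD = re.compile(r"([a-z]+)(-?\d+)")


def parse_evidence_line(line):
    """Split one evidence line like 'f_tig29|size803|links11|gaps-58'.

    Assumed layout (hypothetical, inferred from the post above):
    <orientation>_<contig>|<field><int>|<field><int>|...
    """
    name, *fields = line.strip().split("|")
    orientation, contig = name.split("_", 1)
    record = {"orientation": orientation, "contig": contig}
    for field in fields:
        match = FIELD.fullmatch(field)
        if match:
            record[match.group(1)] = int(match.group(2))
    return record


scaffold_lines = [
    "f_tig29|size803|links11|gaps-58",
    "f_tig15|size4884|links6|gaps-99",
    "f_tig13|size173|links6|gaps-24|merged38",
    "f_tig24|size387|links45|gaps-52|merged45",
    "f_tig8|size752",
]

# Contigs followed by a negative gap (expected overlap) but with no merge.
unmerged = [
    rec["contig"]
    for rec in map(parse_evidence_line, scaffold_lines)
    if rec.get("gaps", 0) < 0 and "merged" not in rec
]
print(unmerged)
```

On the scaffold above this flags tig29 and tig15, the joins where a negative gap was estimated but no merge was recorded.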

    My invocation was:
    perl /usr/local/bin/SSPACE-1.1_linux-x86_64/SSPACE_v1-1.pl -l library.txt -s contigsassem71.fa -x 0 -m 50 -o 20 -n 15 -p 1 -v 1 -b pass11_sspace

    Sorry for the spate of questions.
    Thanks!
    Kennels

    Comment


    • #92
      Hi Kennels,

      Try running SSPACE with -x 1 and -v 0. -v is the verbose option, but it prints many lines of intermediate steps, including the reads used for extension, and I don't think you want that. I'm familiar with the main::Dumper error; it is fixed in the premium version, and I'll fix it in the new basic version too.

      Just run with -v 0 and everything should work fine.

      Regards,
      Boetsie

      Originally posted by Kennels
      Hi boetsie

      I'm eager to use your program after having to 'manually' extend contigs using combinations of patman/bowtie/velvet processes.

      I initially had a bowtie-build error which was resolved by giving chmod a+x to all the files in the SSPACE subdirectories. I am using the latest version, v1.1.

      Unfortunately I'm getting another error when i use the -x 1 option.

      ######################################
      Finished Collecting Overlapping Reads - BUILDING CONSENSUS...
      Undefined subroutine &main::Dumper called at /usr/local/bin/SSPACE-1.1_linux-x86_64/bin/ExtendOrFormatContigs.pl line 212, <IN> line 8.

      LIBRARY pass7
      ------------------------------------------------------------

      =>Mon Aug 29 13:44:56 2011: Building Bowtie index for contigs (tmp.pass7_sspace/subset_contigs.fasta)
      Warning: Empty input file
      Reference file does not seem to be a FASTA file
      Command: /usr/local/bin/SSPACE-1.1_linux-x86_64/bowtie/bowtie-build --quiet --noref tmp.pass7_sspace/subset_contigs.fasta bowtieoutput/pass7_sspace.pass7.bowtieIndex
      #######################################

      I can't find the 'tmp.pass7_sspace/subset_contigs.fasta' file anywhere, but perhaps this has something to do with the undefined subroutine &main::Dumper? Also, I do have many unmapped reads, so I'm thinking it should be able to extend?

      When I use the -x 0 option however, I am able to finish with no problems. I don't think I have any problems with my inputs.

      My invocation was:
      perl /usr/local/bin/SSPACE-1.1_linux-x86_64/SSPACE_v1-1.pl -l library.txt -s contigs.fa -x 1 -m 50 -o 20 -p 1 -b pass7_sspace -v 1

      Could you comment?
      Thank you,
      kennels

      Comment


      • #93
        Hmmm, I've looked into the code and I see that it only searches for overlaps up to a maximum of 50bp. I was not aware of this limitation; I'll fix this in the new release.

        If you would like them to be fixed immediately, send me a personal message, so I can send you the code by e-mail.

        Sorry for these small bugs, and thanks for mentioning them!

        Boetsie

        Originally posted by Kennels
        Dear boetsie,

        On another note, for my -x 0 output, I am noticing that an 'n' is being reported instead of lower case nt's when there is a clear overlap between adjacent contigs in the scaffold output.

        For example, the first two contigs of a scaffold are:
        >contig29
        ....TTCTTTTTCTTCCCATCTTCAGCCTTCTTAGCTTCGGCTTCCTCCCTCTCTTTCAACACAACAAGGGCAT
        >contig15
        TTCTTTTTCTTCCCATCTTCAGCCTTCTTAGCTTCGGCTTCCTCCCTCTCTTTCAACACAACAAGGGCAT....

        There is a clear 70nt overlap, however the final.scaffold output is putting an 'n' between these two contigs. These are the details of the scaffold:
        >scaffold1.1|size6918|tigs5
        f_tig29|size803|links11|gaps-58
        f_tig15|size4884|links6|gaps-99
        f_tig13|size173|links6|gaps-24|merged38
        f_tig24|size387|links45|gaps-52|merged45
        f_tig8|size752

        However it is reporting lowercase nts between contig 13/24 and 24/8, so I'm wondering if there is some kind of threshold '-n' which determines what is reported.

        My invocation was:
        perl /usr/local/bin/SSPACE-1.1_linux-x86_64/SSPACE_v1-1.pl -l library.txt -s contigsassem71.fa -x 0 -m 50 -o 20 -n 15 -p 1 -v 1 -b pass11_sspace

        Sorry for the spate of questions.
        Thanks!
        Kennels

        Comment


        • #94
          Dear Boetsi

          I tried running SSPACE on the same data with different 'n' parameter values. However, there is no difference in the results for n=3, 5 or 15 (default). In all cases the N50 of the generated scaffolds comes to around 1995, and all other characteristics, such as the median, sum and maximum length, are the same as well. This was done on contigs generated by ABySS.

          To compare it with the scaffolder in SOAPdenovo, I generated contigs with SOAPdenovo and did the scaffolding with the SOAPdenovo scaff tool as well as with SSPACE. The N50 and other evaluation criteria are much better for SOAPdenovo! The N50 using SOAPdenovo scaff came to about 21,653 and that with SSPACE only about 2,677! I am using the default values of k=5 and a=0.7 with n=15. I have tried changing n, as stated in my first paragraph, and there is no advantage in doing so. Do you recommend any different values for k and a?

          Aby

          Comment


          • #95
            Changes to the -n parameter will not influence the N50 much. The -n parameter is only used for merging two contigs next to each other (thus removing gaps). If two contigs are merged, the overlapping bases are counted only once, so the N50 will decrease slightly instead of increasing.
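For reference, the N50 figures compared in this thread can be computed from a list of scaffold lengths. A minimal Python sketch follows; the function name `n50` and the example lengths are illustrative and not taken from any SSPACE output.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total assembly size.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


# Illustrative lengths only.
print(n50([100, 200, 300, 400]))
```

Because N50 depends only on the length distribution, a parameter that merely trims a few overlapping bases from existing scaffolds can hardly move it.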

            As stated in my previous post, it is important that sufficient paired reads map to the contigs. If not many paired reads map, you should lower the -k value to, for example, 3 (or even 2), especially since, as you stated before, you have low coverage. Another option may be to trim your reads to remove erroneous nucleotides.

            Originally posted by narain
            Dear Boetsi

            I tried running SSPACE on the same data with different 'n' parameter values. However, there is no difference in the results for n=3, 5 or 15 (default). In all cases the N50 of the generated scaffolds comes to around 1995, and all other characteristics, such as the median, sum and maximum length, are the same as well. This was done on contigs generated by ABySS.

            To compare it with the scaffolder in SOAPdenovo, I generated contigs with SOAPdenovo and did the scaffolding with the SOAPdenovo scaff tool as well as with SSPACE. The N50 and other evaluation criteria are much better for SOAPdenovo! The N50 using SOAPdenovo scaff came to about 21,653 and that with SSPACE only about 2,677! I am using the default values of k=5 and a=0.7 with n=15. I have tried changing n, as stated in my first paragraph, and there is no advantage in doing so. Do you recommend any different values for k and a?

            Aby

            Comment


            • #96
              Dear Boetsie

              Thank you for your suggestion. I will try reducing the 'k' parameter to 2 and 3. Do you recommend any changes to the 'a' parameter value?

              What exactly is the 'n' parameter useful for?

              Aby

              Comment


              • #97
                Originally posted by narain
                Dear Boetsie

                Thank you for your suggestion. I will try reducing the 'k' parameter to 2 and 3. Do you recommend any changes to the 'a' parameter value?

                What exactly is the 'n' parameter useful for?

                Aby
                If links to multiple contigs are found, you could decrease the -a value to 0.5, meaning that the best candidate must have at least 2 times more links than the next one.

                The -n parameter is useful for merging two contigs. Say contigA and contigB are scaffolded with a gap of -20bp. SSPACE will then search for an overlap of -n or more nucleotides:

                contigA
                AGATGATATAAAAGTATAGATTA
                contigB
                ATAAAAGTATAGATTAGGGGTTATGATA

                overlap:
                AGATGATATAAAAGTATAGATTA
                -------ATAAAAGTATAGATTAGGGGTTATGATA


                So if the size of the overlap is -n or more nucleotides, they are merged together:
                AGATGATATAAAAGTATAGATTAGGGGTTATGATA
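The merge step described above can be sketched in a few lines of Python. This is only an illustration of the idea (exact suffix/prefix match, longest overlap first); the function name `merge_if_overlapping` and its parameters are hypothetical and this is not the actual SSPACE code.

```python
def merge_if_overlapping(left, right, min_overlap=15):
    """Merge two contigs if the end of `left` exactly matches the start of `right`.

    Returns (merged_sequence, overlap_size), or (None, 0) when no overlap
    of at least `min_overlap` nucleotides is found. `min_overlap` plays
    the role of the -n parameter discussed in this thread.
    """
    longest_possible = min(len(left), len(right))
    # Try the longest candidate overlap first, down to the minimum.
    for size in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:], size
    return None, 0


contig_a = "AGATGATATAAAAGTATAGATTA"
contig_b = "ATAAAAGTATAGATTAGGGGTTATGATA"

merged, overlap = merge_if_overlapping(contig_a, contig_b, min_overlap=15)
print(overlap, merged)
```

With the two example contigs the overlap is 16 nt, so they merge at -n 15 but would stay separate at, say, -n 17.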

                regards,
                Boetsie

                Comment


                • #98
                  Dear Boetsie

                  As per your suggestion I ran SSPACE with the k parameter lowered from 5 to 2. The N50 value dropped further, from 1995 to 1424! The value of a was 0.7 and n was 10 as before. Did you rather mean to increase the value of k?

                  Aby

                  Comment


                  • #99
                    Originally posted by narain
                    Dear Boetsie

                    As per your suggestion I ran SSPACE with the k parameter lowered from 5 to 2. The N50 value dropped further, from 1995 to 1424! The value of a was 0.7 and n was 10 as before. Did you rather mean to increase the value of k?

                    Aby
                    Hi Aby,

                    Well, I should have expected this, since you said that your coverage was very low. Say you first had a nice scaffold with five links between two contigs, but with the lower -k another contig can now also be linked with four links. The ratio will be 4/5 = 0.8 (thus above your -a 0.7), so the pairing is ambiguous and is not made. This way, fewer scaffolds are formed. You could increase the -a option, but then you should wonder how reliable your scaffolds are!
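To make the interaction between -k and -a concrete, here is a minimal Python sketch of the decision as described in this thread. The function name `choose_partner`, its arguments, and the handling of a ratio exactly equal to -a are assumptions for illustration, not the SSPACE implementation.

```python
def choose_partner(link_counts, min_links=5, max_ratio=0.7):
    """Pick the contig to pair with, or None if there is no clear winner.

    link_counts: dict of candidate contig name -> number of read-pair links.
    min_links:   links below this are ignored (role of -k).
    max_ratio:   maximum allowed ratio second-best / best (role of -a).
    """
    # Only candidates with enough links are considered at all.
    candidates = {c: n for c, n in link_counts.items() if n >= min_links}
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) > 1:
        ratio = ranked[1][1] / ranked[0][1]
        if ratio > max_ratio:
            return None  # ambiguous: two candidates are too close
    return ranked[0][0]


links = {"contigB": 5, "contigC": 4}
print(choose_partner(links, min_links=5, max_ratio=0.7))  # contigC ignored
print(choose_partner(links, min_links=2, max_ratio=0.7))  # 4/5 = 0.8 > 0.7
```

With -k 5 the four-link candidate is ignored and contigB is chosen; with -k 2 both candidates count, the ratio 0.8 exceeds 0.7, and no pairing is made, which is how a lower -k can end up producing fewer scaffolds.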

                    Could you maybe send me your summary file and library file (personal message or to my private e-mail [email protected]), so I can hopefully help you further?

                    Regards,
                    Boetsie

                    Comment


                    • Dear Boetsie

                      I have 90 bp paired-end reads, about 110 GB in total, for the human genome. This is approximately 15x coverage, which is slightly less than what most assemblers look for (20x or more). With the decrease in the k parameter value there is a decrease in N50, which is not a good sign. Indeed, reliability of the generated scaffolds is of utmost importance. I am sending you the logfile and the summary file as an email attachment. Do you still suggest that I go for a lower value of a? I think if I need bigger scaffolds, I should go for a bigger value of a, say 0.9 or higher. Should I do that in combination with a higher value of k? How much higher can I set k, and what is your suggestion?

                      Aby

                      Comment


                      • I've added a new Basic version at http://www.baseclear.com/landingpages/sspacev12/

                        Main improvements are:
                        - searches now for overlaps larger than 50bp as suggested by Kennels
                        - merge-information is now correct in the evidence file. If multiple libraries were used, the merge information of previous libraries was not included in the final evidence file.
                        - Solved the error of 'main::Dumper not found', if -x 1 and -v 1 are set.
                        - now able to allow gaps for mapping the reads against the contigs with the -g option. -g 1 allows three gaps, max is 3 gaps.

                        Boetsie

                        Comment


                        • No pairs found

                          Hello Boetsie,

                          I'm now playing with SSPACE and I'm getting some strange output. I have two files with contigs, say contigs1.fasta and contigs2.fasta. They were output by the same assembler on the same data set (E. coli reads). The second file has some contigs from the first file glued together. For some reason, SSPACE successfully scaffolds contigs2, but fails to find a single read pair on contigs1. Could you please help with this? I'm attaching the two summary files.
                          Attached Files

                          Comment


                          • Hi Boetsie,

                            Thanks for the update. I've started using it in spades for a number of datasets and it's great.

                            In one project, I have a total of around 770 million 100nt-long PE reads across 7 lanes. Unfortunately I am quite limited in computing capacity (only 16 GB RAM; yes, I have been told before in other posts to get better specs, but we all have our circumstances) until we get access to a better machine, so as expected the analysis pretty much stopped at the stage of reading the unmapped reads into memory. I currently just want to extend a small number of separate contigs as much as possible, and it would be great to consider all reads at once.

                            1. I'm just wondering if it is possible to overcome the memory limitation. Is the Premium version using a different way to store/access the data?
                            2. Should I split my inputs into smaller libraries? Does SSPACE free up memory after reading a library before going on to the next (I think not)?
                            3. Or should I carry out separate SSPACE analyses? I'm afraid of losing possibilities to extend contigs by not considering all data at once.

                            Sorry if the questions are naive.

                            Cheers,
                            kennels

                            Comment


                            • Originally posted by kulikov
                              Hello Boetsie,

                              I'm now playing with SSPACE and I'm getting some strange output. I have two files with contigs, say contigs1.fasta and contigs2.fasta. They were output by the same assembler on the same data set (E. coli reads). The second file has some contigs from the first file glued together. For some reason, SSPACE successfully scaffolds contigs2, but fails to find a single read pair on contigs1. Could you please help with this? I'm attaching the two summary files.
                              Hi Kulikov,

                              Are the contigs of summary1.txt a mix of the two assemblies? In other words; are parts of the contigs present in other contigs? Because what i think has happened, is that reads could map to multiple contigs. SSPACE does not allow reads to map to multiple contigs.

                              Boetsie

                              Comment


                              • Originally posted by Kennels
                                Hi Boetsie,

                                Thanks for the update. I've started using it in spades for a number of datasets and it's great.

                                In one project, I have a total of around 770 million 100nt-long PE reads across 7 lanes. Unfortunately I am quite limited in computing capacity (only 16 GB RAM; yes, I have been told before in other posts to get better specs, but we all have our circumstances) until we get access to a better machine, so as expected the analysis pretty much stopped at the stage of reading the unmapped reads into memory. I currently just want to extend a small number of separate contigs as much as possible, and it would be great to consider all reads at once.

                                1. I'm just wondering if it is possible to overcome the memory limitation. Is the Premium version using a different way to store/access the data?
                                2. Should I split my inputs into smaller libraries? Does SSPACE free up memory after reading a library before going on to the next (I think not)?
                                3. Or should I carry out separate SSPACE analyses? I'm afraid of losing possibilities to extend contigs by not considering all data at once.

                                Sorry if the questions are naive.

                                Cheers,
                                kennels
                                Hi Kennels,

                                Good that it works great!

                                To start: it is important to filter your reads on quality, especially with such a large read length. For extension, the whole read is mapped to the contigs; if it does not map, it will be used for contig extension. If the quality of a read (or part of it) is bad, the read will not map and it will be used for contig extension.

                                1.
                                In SSPACE premium:
                                - you can run bowtie with gaps, allowing up to 3 gaps, thereby reducing the number of unmapped reads and thus the number of reads stored in memory.
                                - A different method of storing the unmapped reads is used compared with Basic version, saving 25% of memory.
                                - extension is faster.

                                2.
                                For contig extension all libraries are used at once, so all reads are used. Memory is thus not freed after each library.

                                3.
                                You will lose coverage if you split the libraries. If you have sufficient coverage for one library, I would give it a go.

                                Regards,
                                Boetsie

                                Comment
