  • Contigs from gsAssembler / newbler

    Hi all,

    I have 454 data from a metagenomic project. I tried to assemble the reads with gsAssembler / Newbler with default parameters.

    I got very strange things and I ask for help to understand what could be the reasons for this:

    1- Two very similar singletons are not assembled into a contig. I checked the corresponding reads and I do not understand why Newbler left them as singletons...

    2- Some contigs contain only one read! That is a complete mystery to me. Why are they not reported as singletons?

    3- Newbler trims the 5' or 3' ends of some contigs, yet when I check the reads, they align almost perfectly at those ends. Why does Newbler trim them?

    And finally a global question:
    I work with plant viruses, with plant nucleic acids in the pool. What would be the most suitable assembly software to recover my virus sequences (no splicing, genome size of 12 kb max)?

    Thanks in advance.

  • #2
    Originally posted by atalon1
    Hi all,

    I have 454 data from a metagenomic project. I tried to assemble the reads with gsAssembler / Newbler with default parameters.

    I got very strange things and I ask for help to understand what could be the reasons for this:

    1- Two very similar singletons are not assembled into a contig. I checked the corresponding reads and I do not understand why Newbler left them as singletons...
    How long are the reads? By default, reads must overlap by at least 40 bases with at least 90% identity to be joined.
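
    To make the default thresholds concrete, here is a minimal Python sketch of a suffix-prefix overlap test. It is only an illustration under simplifying assumptions (ungapped comparison, forward strand only), not Newbler's actual algorithm; the function name best_overlap and its parameters are hypothetical.

```python
def best_overlap(read_a, read_b, min_len=40, min_identity=0.90):
    """Find the longest ungapped overlap between a suffix of read_a and a
    prefix of read_b that reaches min_identity.

    Returns (overlap_length, identity), or None if no overlap of at least
    min_len bases passes the identity threshold.
    """
    max_len = min(len(read_a), len(read_b))
    # Try the longest candidate overlap first and shrink towards min_len.
    for length in range(max_len, min_len - 1, -1):
        suffix = read_a[-length:]
        prefix = read_b[:length]
        matches = sum(1 for x, y in zip(suffix, prefix) if x == y)
        identity = matches / length
        if identity >= min_identity:
            return length, identity
    return None


# Toy example with relaxed thresholds so the sequences stay readable.
print(best_overlap("AAAACCCC", "CCCCGGGG", min_len=4, min_identity=1.0))
```

    With the defaults, two reads shorter than 40 bases can never pass this test. The sketch also does not handle one read being fully contained inside another, which would need a separate check.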

    Originally posted by atalon1
    2- Some contigs contain only one read! That is a complete mystery to me. Why are they not reported as singletons?
    Your reads are from a metagenomic sample, so I guess the overall coverage (read depth) of the contigs is low. In that case, the contig graph can contain multiple nodes (i.e., contigs) that each hold only part of a single read. Do you find the same read popping up in other contigs? Perhaps have a look at my blog post on how Newbler works.

    Originally posted by atalon1
    3- Newbler trims the 5' or 3' ends of some contigs, yet when I check the reads, they align almost perfectly at those ends. Why does Newbler trim them?
    I don't completely understand the question, but perhaps you are referring to reads that start in one contig and continue in another? Again, once you understand the concept of the contig graph, and how collapsed repeats break up reads, it might make sense.

    Originally posted by atalon1
    And finally a global question:
    I work with plant viruses, with plant nucleic acids in the pool. What would be the most suitable assembly software to recover my virus sequences (no splicing, genome size of 12 kb max)?
    Try increasing the alignment stringency (the -mi and -ml parameters) to pull apart virus and plant reads, and check the contigs for depth and GC%. BLASTing the contigs is also an obvious thing to do (perhaps in combination with MEGAN, which allows selecting subsets of reads/contigs based on the taxonomic position of the BLAST hits).
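
    The GC% check suggested above takes only a few lines of Python. This is a minimal sketch, assuming the contigs are available as plain FASTA lines; read_fasta and gc_percent are hypothetical helper names, and read depth is not computed here.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_percent(seq):
    """Percentage of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / total if total else 0.0


# In-memory toy records; the contig names are made up for illustration.
example = [">contig00001", "ATATATAT", "ATGC", ">contig00002", "GGCCGGCC"]
for header, seq in read_fasta(example):
    print(header, len(seq), round(gc_percent(seq), 1))
```

    An open file handle (for example on 454AllContigs.fna) can be passed to read_fasta in place of the list. Contigs whose GC% clusters away from the plant background may be worth BLASTing first.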



    • #3
      1- I kept the default settings of 40 bases and 90%. For example, I have 2 reads, one of 179 bases and one of 164 bases. They are 100% identical, the first one is just a bit longer, and Newbler left both of them as singletons. Why do they not fall into one contig?

      2- Indeed, the overall coverage is low. I took one of those single-read contigs and BLASTed it against the other contigs, and I did not find anything. In contrast, when I BLAST it against the singletons, I find 2 reads that are longer than the contig and 99% identical to it over the 119 bases of the single-read contig. I guess it is due to the way Newbler cuts the contig path...

      3- Let me try to be clearer. Newbler builds a contig from 2 reads that overlap. When I check the start of the contig, it lies downstream of the start of the overlap between the 2 reads.
      Example:

      Code:
      read1 : aatcgtcgaatcgtcgaatcgtcgaatcgtcgaatcgtcg
      read2 :     gtcgaatcgtcgaatcgtcgaatcgtcgaatcgtcg
      contg :                gaatcgtcgaatcgtcgaatcgtcg
      Why does Newbler not start the contig at the start of the overlap between the 2 reads? Note that I used FASTA input without a quality file, so it cannot have trimmed the contig because of poor base quality.



      • #4
        Originally posted by atalon1
        1- I kept the default settings of 40 bases and 90%. For example, I have 2 reads, one of 179 bases and one of 164 bases. They are 100% identical, the first one is just a bit longer, and Newbler left both of them as singletons. Why do they not fall into one contig?
        Hmmm. Now I am lost as well...

        Originally posted by atalon1
        2- Indeed, the overall coverage is low. I took one of those single-read contigs and BLASTed it against the other contigs, and I did not find anything. In contrast, when I BLAST it against the singletons, I find 2 reads that are longer than the contig and 99% identical to it over the 119 bases of the single-read contig. I guess it is due to the way Newbler cuts the contig path...
        Again, lost...

        Originally posted by atalon1
        3- Let me try to be clearer. Newbler builds a contig from 2 reads that overlap. When I check the start of the contig, it lies downstream of the start of the overlap between the 2 reads.
        Example:

        Code:
        read1 : aatcgtcgaatcgtcgaatcgtcgaatcgtcgaatcgtcg
        read2 :     gtcgaatcgtcgaatcgtcgaatcgtcgaatcgtcg
        contg :                gaatcgtcgaatcgtcgaatcgtcg
        Why does Newbler not start the contig at the start of the overlap between the 2 reads? Note that I used FASTA input without a quality file, so it cannot have trimmed the contig because of poor base quality.
        Might be a Newbler thing. Technically, I agree that one needs an overlap to be able to call something a contig; Newbler might be programmed to be more lenient.

        Finally, if you can get your hands on the sff files of the 454 run, I think you will see a better assembly!



        • #5
          Ssr

          Originally posted by atalon1
          3- Let me try to be clearer. Newbler builds a contig from 2 reads that overlap. When I check the start of the contig, it lies downstream of the start of the overlap between the 2 reads.
          Example:

          Code:
          read1 : aatcgtcgaatcgtcgaatcgtcgaatcgtcgaatcgtcg
          read2 :     gtcgaatcgtcgaatcgtcgaatcgtcgaatcgtcg
          contg :                gaatcgtcgaatcgtcgaatcgtcg




          If you look at these 2 reads, you can see that they are repetitive sequences, so the alignment can shift to the right or to the left.


          read1 : aatcgtcgaatcgtcgaatcgtcgaatcgtcgaatcgtcg
          read2 : gtcgaatcgtcgaatcgtcgaatcgtcgaatcgtcg
          contg : gaatcgtcgaatcgtcgaatcgtcg


          or


          read1 : aatcgtcgaatcgtcgaatcgtcgaatcgtcgaatcgtcg
          read2 : gtcgaatcgtcgaatcgtcgaatcgtcgaatcgtcg
          contg : gaatcgtcgaatcgtcgaatcgtcg

          etc.
          Also consider that when aligning or BLASTing, software will usually mask this kind of sequence by default.
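
          The ambiguity can be checked directly. This small Python sketch (perfect_offsets is a hypothetical helper, not part of any assembler) lists every position at which a sequence matches read1 exactly; the three sequences are copied from atalon1's example.

```python
def perfect_offsets(reference, query):
    """Return every offset at which query matches reference exactly and completely."""
    return [
        i
        for i in range(len(reference) - len(query) + 1)
        if reference[i:i + len(query)] == query
    ]


read1 = "aatcgtcgaatcgtcgaatcgtcgaatcgtcgaatcgtcg"
read2 = "gtcgaatcgtcgaatcgtcgaatcgtcgaatcgtcg"
contig = "gaatcgtcgaatcgtcgaatcgtcg"

print("read2 on read1: ", perfect_offsets(read1, read2))
print("contig on read1:", perfect_offsets(read1, contig))
```

          For these sequences, read2 fits read1 at a single offset, while the contig fits at two offsets that are one 8-base repeat unit (aatcgtcg) apart, so its placement is not unique.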
          Last edited by VA-NGS; 07-21-2010, 03:56 PM.



          • #6
            Hi all- I think I'm having similar problems with alignments using the cDNA setting for de novo assembly.

            (1) when I put the contigs resulting from Newbler into Geneious, I find several hundred of them align with another contig with 100% similarity (as illustrated by atalon, many of them are completely overlapping, often by more than 100bp).

            (2) at the same time, some of my contigs (viewed in the .ace files) have less than 0.5% similarity. Basically, they should not be aligned with each other.

            (3) I also get a bunch of contigs that are 1-3bp long! There can be a contig containing 100 copies of just an A! I guess this is Newbler trying to tease apart splice variants. Does anyone know which setting raises the minimum length of a contig?

            I am currently trying to mess around with every setting that I can within Newbler, but these three problems persist. Does anyone have any suggestions?

            Thanks,
            Alice



            • #7
              Originally posted by aliceb
              Hi all- I think I'm having similar problems with alignments using the cDNA setting for de novo assembly.

              (1) when I put the contigs resulting from Newbler into Geneious, I find several hundred of them align with another contig with 100% similarity (as illustrated by atalon, many of them are completely overlapping, often by more than 100bp).
              From this I am wondering if you know the difference between contigs and isotigs (and isogroups, for that matter). See http://seqanswers.com/forums/showthread.php?t=5928.

              You should use the isotigs in Geneious, and expect isotigs from the same isogroup to be very similar.

              Originally posted by aliceb
              (2) at the same time, some of my contigs (viewed in the .ace files) have less than 0.5% similarity. Basically, they should not be aligned with each other.
              This I don't understand. There are lots of contigs in the ace file, and many of them will be less than 0.5% similar.

              Originally posted by aliceb
              (3) I also get a bunch of contigs that are 1-3bp long! There can be a contig containing 100 copies of just an A! I guess this is Newbler trying to tease apart splice variants. Does anyone know which setting raises the minimum length of a contig?
              In the ace file, all contigs of whatever length are included, and that cannot be changed. In the 454*.fna files, there is a default minimum length of 100 ('All') or 500 ('Large'). These can be adjusted.
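
              If very short contigs are a nuisance downstream, they can also be filtered after assembly. A minimal sketch, assuming the contigs have already been read into (name, sequence) pairs; filter_by_length is a hypothetical helper, and the thresholds of 100 and 500 simply mirror the defaults mentioned above.

```python
def filter_by_length(records, min_length=100):
    """Keep (name, sequence) pairs whose sequence has at least min_length bases."""
    return [(name, seq) for name, seq in records if len(seq) >= min_length]


# Toy records; names and sequences are made up for illustration.
contigs = [
    ("contig00001", "AAA"),          # 3 bases
    ("contig00002", "ACGT" * 30),    # 120 bases
    ("contig00003", "ACGT" * 150),   # 600 bases
]

all_contigs = filter_by_length(contigs, min_length=100)
large_contigs = filter_by_length(contigs, min_length=500)
print([name for name, _ in all_contigs])
print([name for name, _ in large_contigs])
```

              This only post-processes sequences; it does not change what Newbler writes to the ace file.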

              Originally posted by aliceb
              I am currently trying to mess around with every setting that I can within Newbler, but these three problems persist. Does anyone have any suggestions?

              Thanks,
              Alice



              • #8
                Thanks for the reply.. I thought I understood this contig/isotig/isogroup business, but maybe not...

                My problem is that there are contigs that are exactly identical- many base pairs overlap and match completely. These should be joined together. I am using my assembly to analyze expression, so if genes are incorrectly split (rather than true splice variants), it will lead to incorrect counts (because there is more than one contig for my expression data to match to..).

                On the second part of my question, I did mistype. When I look at the 454Isotigs.ace file, some of the individual isotigs are less than 0.5% similar. Looking at a few random sequences, there is no way they should have ever been aligned. If I were to just look at the 454Isotigs.fna file, I would be looking at the consensus sequence from these, right? And I would be assuming that these are good alignments, but they aren't. So, I'm trying to figure out what settings will increase the stringency of building these.

                The bottom line of all of this is that when I look at the contigs, they are often riddled with errors in their alignments, while at other times identical sequences have not been matched up. As I understand it, these contigs are being built into isotigs (is this right?), which appear to be even more poorly aligned.

                Am I completely on the wrong track here? And if not, how do I improve each of these steps?

                Thanks!
                -Alice



                • #9
                  Thanks for the reply.. I thought I understood this contig/isotig/isogroup business, but maybe not...
                  Maybe my just-published post will help?

                  My problem is that there are contigs that are exactly identical- many base pairs overlap and match completely. These should be joined together.
                  As I tried to explain in the post, small sequence variants between parental chromosomes, or between individuals if several of them were used to generate the RNA sample, will result in several almost identical contigs. You could try running CD-HIT or another clustering program on your isotigs to correct for this.
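
                  For the special case of sequences that are exactly identical and fully contained in a longer one, a much simpler filter can already help. The sketch below is not CD-HIT and does no similarity clustering; it only drops sequences that are exact substrings (on either strand) of a sequence that is kept. The names collapse_contained and reverse_complement are hypothetical.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence; unknown symbols become N."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def collapse_contained(records):
    """Drop every sequence that is an exact substring, on either strand,
    of an equally long or longer sequence that has already been kept."""
    kept = []
    # Longest first, so that shorter sequences are tested against longer ones.
    for name, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        seq = seq.upper()
        rc = reverse_complement(seq)
        if any(seq in kept_seq or rc in kept_seq for _, kept_seq in kept):
            continue
        kept.append((name, seq))
    return kept


# Toy records; names and sequences are made up for illustration.
isotigs = [
    ("isotig1", "ACGTACGTTT"),
    ("isotig2", "CGTACG"),      # contained in isotig1
    ("isotig3", "GGGGCCAA"),
]
print([name for name, _ in collapse_contained(isotigs)])
```

                  Near-identical variants that differ by even one base are left untouched by this filter; that is the case a real clustering tool with an identity threshold is meant for.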

                  When I look at the 454Isotigs.ace file, some of the individual isotigs are less than 0.5% similar. Looking at a few random sequences, there is no way they should have ever been aligned.
                  I hope you mean isotigs from the same isogroup, otherwise I still don't understand?

                  Looking at a few random sequences, there is no way they should have ever been aligned. If I were to just look at the 454Isotigs.fna file, I would be looking at the consensus sequence from these, right? And I would be assuming that these are good alignments, but they aren't.
                  Here, an example (screenshot or alignment) would help me understand the question better...

                  So, I'm trying to figure out what settings will increase the stringency of building these.
                  You could try changing the -mi and -ml settings (also explained on my blog).



                  • #10
                    Hi again- thanks for the help. Your post is really useful, particularly the graphic to illustrate the contig/isotig/isogroup business. I'm sending a couple of screenshots from Geneious and Tablet. I hope this helps clarify my questions.

                    First, the identical isotigs. I understand that sequences interpreted as splice variants can be very similar. But I am finding isotig sequences that are ENTIRELY identical. 100%, no differences, overlapping like this for over 1000bp (see attached Screenshot-5 copy.jpg). I thought maybe the number of contigs allowed in an isotig might have maxed out, but I set this at 5000 and got the same results. Certainly this isn't right.

                    Second, I used Tablet to view the alignment of my raw 454 reads as they are aligned into contigs and isotigs (454Isotigs.ace). I have a lot of things that are just called contigs, which means that they belong to an isogroup with only one sequence in it, right (i.e. Newbler didn't detect any evidence of splice variants in this 'gene')? The worrisome thing here is that some of these individual alignments are really bad (see attached Screenshot copy.jpg). I haven't given Newbler anything shorter than 100bp, and yet it has assembled many short sequences together. Doesn't this seem strange?

                    I should add that we've been as thorough as possible cleaning the data, so I don't think that there is some adapter/primer sequence floating around...

                    Thanks!


                    • #11
                      Originally posted by aliceb
                      First, the identical isotigs. I understand that sequences interpreted as splice variants can be very similar. But I am finding isotig sequences that are ENTIRELY identical. 100%, no differences, overlapping like this for over 1000bp (see attached Screenshot-5 copy.jpg).
                      Well, in the screenshot I do see that one of the isotigs is two bases longer at the beginning, right? For Newbler, that could indicate a graph with two variant contigs leading to two isotigs.

                      Second, I used Tablet to view the alignment of my raw 454 reads as they are aligned into contigs and isotigs (454Isotigs.ace). I have a lot of things that are just called contigs, which means that they belong to an isogroup with only one sequence in it, right (i.e. Newbler didn't detect any evidence of splice variants in this 'gene')?
                      No. Had there been only one contig in the isotig, it would have been called an isotig.
                      Contigs are added to the ace file when they are not used for isotigs, for various reasons explained in my post. I would check the FASTA header for that contig in the 454AllContigs.fna file: it reports the 'status', and when that is not 'isotig', it indicates why the contig did not make it into an isotig. Also, check how many contigs belong to this isogroup...

                      The worrisome thing here is that some of these individual alignments are really bad (see attached Screenshot copy.jpg). I haven't given Newbler anything shorter than 100bp, and yet it has assembled many short sequences together. Doesn't this seem strange?
                      Again, you need to think in terms of the contig graph. These 'short sequences' are probably parts of longer reads that come from different contigs, enter contig05526 because they overlap there, and exit towards different contigs again. I do think the alignment at the end of the contig (where the depth suddenly becomes much greater) is somewhat strange...

                      Hope this helps...
