  • newbler assembly ... padding and SNPs

    I'm looking over a newbler assembly of older 454 reads (~200bp) and it seems to me that base disagreements are padded and offset, instead of aligned on top of each other (see below); this seems like it will prevent automated SNP-finding, e.g. with the Marth Lab's polyBayesShort or (the updated version) GigaBayes (see the Marth Lab page at Boston College) ...

    Here's an example of what I mean in the consed view of the newbler-produced ace file:

    consensus:
    ...TTAT*cAGTGT...
    reads:
    ...TTATg*AGTGT...
    ...TTATg*AGTGT...
    ...TTAT*cAGTGT...
    ...TTAT*cAGTGT...
    ...TTAT*cAGTGT...

    ... and here's what I expected:
    consensus:
    ...TTATcAGTGT...
    reads:
    ...TTATGAGTGT...
    ...TTATGAGTGT...
    ...TTATCAGTGT...
    ...TTATCAGTGT...
    ...TTATCAGTGT...

    This is obviously a SNP candidate, but the former representation (padded and offset) is going to be harder to find with someone else's tool or my own scripts. I'm not seeing *any* of the latter case with this assembly ... but I've definitely seen the latter case with 454 reads assembled with PCAP. Does anyone recognize this ... am I missing some newbler default behavior?

  • #2
    Hi jnfass,

    Yes, that is the default behavior of newbler/gsAssembler/gsMapper. 454 data are fundamentally different from traditional data, so other tools might need to accommodate the difference here.

    Newbler aligns reads in flow space, not in base space.
    So in your case above, the 2 g's share a column with 3 pads, and the 3 c's share the next column with 2 pads. C and G will never be aligned in the same column in a newbler 454 alignment.

    Basically, a 454 alignment has no substitutions. It only has indels: insertion of a base or deletion of a base. This is due to the nature of flow-based technology.

    A substitution is counted as 2 events. For example, a C -> G change is simply a deletion of C immediately followed by an insertion of G; 2 adjacent indels in 454 are equivalent to 1 substitution at one base position.
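
    A minimal sketch of that equivalence, not newbler's own code: assuming the padded rows have already been pulled out of the ace file as equal-length strings with `*` as the pad character, two adjacent columns can be merged whenever every row has a pad in exactly one of them. The function name `collapse_offset_columns` is hypothetical.

```python
PAD = "*"


def collapse_offset_columns(rows):
    """Merge adjacent columns in which every row has a pad in exactly one of the two."""
    rows = list(rows)
    if not rows:
        return []
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all padded rows must have the same length")

    out = ["" for _ in rows]
    i = 0
    while i < width:
        # A "pad and offset" pair: each row carries a base in one column
        # and a pad in the other, never two bases and never two pads.
        mergeable = i + 1 < width and all(
            (r[i] == PAD) != (r[i + 1] == PAD) for r in rows
        )
        if mergeable:
            out = [o + (r[i] if r[i] != PAD else r[i + 1]) for o, r in zip(out, rows)]
            i += 2
        else:
            out = [o + r[i] for o, r in zip(out, rows)]
            i += 1
    return out


if __name__ == "__main__":
    reads = ["TTATg*AGTGT", "TTATg*AGTGT", "TTAT*cAGTGT", "TTAT*cAGTGT", "TTAT*cAGTGT"]
    for row in collapse_offset_columns(reads):
        print(row)
```

    On the rows from the first post this turns `TTATg*AGTGT` and `TTAT*cAGTGT` into `TTATgAGTGT` and `TTATcAGTGT`. It cannot tell a true substitution from two unrelated indels that happen to sit next to each other, so merged columns are only SNP candidates.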


    • #3
      Thanks hlu ...

      But I don't see how this is different than, say, solexa reads.

      I was thinking that this "pad and offset" strategy might be the default because of the poly-base repeat problem (mistakenly calling 4 A's instead of 5 because strings of one base are read as a fold increase in signal intensity) ... in which case a base disagreement may be the result of an extra (misread) base inserted. But I guess I was hoping that newbler (gsAssembler) / gsMapper could distinguish between this case and a substitution not near any repeats.
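
      To make the poly-base problem concrete, here is a rough sketch of turning flow signals into bases, assuming one signal value per flow and a homopolymer length obtained by rounding the signal. The flow order `TACG`, the signal values, and the function name `decode_flowgram` are assumptions for illustration only; real base calling is more involved.

```python
def decode_flowgram(signals, flow_order="TACG"):
    """Round each flow signal to a homopolymer length and emit that many bases."""
    bases = []
    for k, signal in enumerate(signals):
        base = flow_order[k % len(flow_order)]
        # Round to the nearest whole number of incorporated bases (never negative).
        count = max(0, int(signal + 0.5))
        bases.append(base * count)
    return "".join(bases)


if __name__ == "__main__":
    # Flows T, A, C, G: the only difference is the strength of the A signal.
    print(decode_flowgram([1.0, 4.4, 0.0, 1.1]))
    print(decode_flowgram([1.0, 4.6, 0.0, 1.1]))
```

      A small shift in one signal (4.4 versus 4.6) changes the call from four A's to five, which shows up in an alignment as an indel rather than a substitution.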

      ~Joe


      • #4
        Sequencing errors are more than poly-base issues. There are PCR errors and other errors in the data set.

        This is actually a data presentation issue, not a SNP issue. The ace file from gsMapper/gsAssembler/newbler merely renders the flow-space alignment as bases, without changing the flow anchoring coordinates.

        For SNPs, gsMapper provides the proper SNP information in another file.

        Among the gsMapper output files there is one called "454HCDiffs.txt", meaning 454 high-confidence differences, covering all the SNPs and mutations. In this file the flow space is converted into base space, so a C -> G mutation appears in a single row. This is obviously a post-ace processing result.

        You might want to look deeper into this file for mutations.

        BTW, assembly results are not well suited to calling mutations. If you want mutations, run gsMapper against a reference, or use the assembled contigs as the reference. That way you get the "454HCDiffs.txt" file, which has all the relevant SNPs and mutations.

        There is another file, called "454AllDiffs.txt", which also includes all the low-confidence SNPs/mutations from the gsMapper result.


        • #5
          I think I get what you're saying:
          So, where there's a base mismatch between reads in an alignment, the ace file representation retains the original flow space order ... i.e. in my original example, fluorophore-tagged G's were flowed over the slide before tagged C's? So theoretically I should be able to see the order of the cycle of bases by looking over my alignments ... as in (for example) I might always see "a*" on top of "*g", and "g*" on top of "*c", and "c*" on top of "*t" ... which would mean that the sequencer flows in a's, g's, c's, t's, and repeats ... (if I follow my own logic correctly)?
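
          That logic can be sketched as a small search, under the assumption that after a shared preceding base the flows continue around the cycle, so the variant whose base is flowed sooner gets the earlier column. Each observation is a hypothetical triple (preceding base, base in the earlier column, base in the later column); the triples below are made up, and `consistent_flow_orders` is not part of any 454 software.

```python
from itertools import permutations


def consistent_flow_orders(observations):
    """Return the cyclic flow orders (written starting at A) that fit all observations."""
    survivors = []
    for rest in permutations("CGT"):
        order = "A" + "".join(rest)  # rotations of a cycle are equivalent, so pin A first
        ok = True
        for prev, first, second in observations:
            prev, first, second = prev.upper(), first.upper(), second.upper()
            if prev in (first, second):
                continue  # homopolymer extension: gives no ordering information
            start = order.index(prev)
            # Number of flows after the preceding base until each variant is flowed.
            steps_first = (order.index(first) - start) % 4
            steps_second = (order.index(second) - start) % 4
            if not steps_first < steps_second:
                ok = False
                break
        if ok:
            survivors.append(order)
    return survivors


if __name__ == "__main__":
    made_up = [("T", "C", "G"), ("A", "G", "T"), ("C", "T", "A")]
    print(consistent_flow_orders(made_up))
```

          With too few observations several orders survive; the more distinct preceding bases are seen, the narrower the list gets.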

          I'll look into gsMapper - I wasn't aware that it would do SNP calls. I don't have a reference, but as you say I could map to the assembled contigs.

          Thanks for your comments!


          • #6
            I think you got it.

            I don't have a record of the flow order on hand. But each column of a newbler ace file alignment corresponds to a single fluorophore-tagged base. 454 flows 4 bases in 1 cycle.

            Newbler does not want to change that flow order information because the underlying information is flow based. The base-level display is only for the human eye, and I believe the software bases all of its calculations on the raw flow signal.
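
            As an illustration of why a single-base change touches two flows, here is a sketch that converts a base sequence into idealized per-flow counts and compares two reads. The flow order `TACG` is used purely as an assumption for the example, and the helper names are hypothetical.

```python
def to_flowgram(seq, flow_order="TACG"):
    """Return a list of (flowed base, incorporated count) for an error-free read."""
    seq = seq.upper()
    if not set(seq) <= set(flow_order):
        raise ValueError("sequence contains a base that is never flowed")
    flows = []
    i = 0  # position in the sequence
    k = 0  # index of the current flow
    while i < len(seq):
        base = flow_order[k % len(flow_order)]
        count = 0
        # A flow incorporates the whole homopolymer run of its base.
        while i < len(seq) and seq[i] == base:
            count += 1
            i += 1
        flows.append((base, count))
        k += 1
    return flows


def differing_flows(seq_a, seq_b, flow_order="TACG"):
    """Indices of flows whose counts differ (compared over the shared length only)."""
    pairs = zip(to_flowgram(seq_a, flow_order), to_flowgram(seq_b, flow_order))
    return [k for k, (a, b) in enumerate(pairs) if a != b]


if __name__ == "__main__":
    print(differing_flows("TTATCAGTGT", "TTATGAGTGT"))
```

            For the two reads from the first post the flowgrams differ in exactly two consecutive flows, one C flow and one G flow, which is the pair of indels described above.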


            • #7
              wag the dog

              Hlu -

              Thanks for your explanations of the padding in the ace file. Emailing from Branford and with deep understanding of the newbler system, your posts were enlightening.

              As enlightening and as rational as they may be, as a biologist I don't care about the flow information. Particularly, if I'm going from 0 SNPs to tens of thousands of SNPs in a single run of an untracted genome. The SNPs need to be verified anyway, why not produce output that is compatible with the rest of the world's developed software? It somewhat ruffled my feathers to have flowgrams dictating to biologists what they want to get out of the sequence data.

              Why not have it both ways? Display it in Newbler as flowgrams etc. but have an export option (with a hefty disclaimer if it makes you feel better) that condenses the pads to your best guess.

              If I understand what you are saying, I do an assembly and export the contigs. The fasta file doesn't have flowgram information. Then I use the assembled contigs as a reference to map all the original reads. No pads, and the SNPs were called. Why not just have gsAssembler do this mapping process as a macro with the click of an 'export traditional ace' button?


              • #8
                jaudall, I was interested in what your experience was, since this old post..
                --
                bioinfosm


                • #9
                  We've developed some workarounds with other software and our own code. No word from 454 about any changes. Perhaps they think they are right and it is the rest of the world that needs to change.


                  • #10
                    FYI I've noticed that newbler does this even in hybrid sanger/454 assemblies. It's a real problem for our existing SNP processing tools, and the solution of assembling THEN mapping again suggested above seems overly roundabout, not to mention that you couldn't be sure the same reads assembled initially will in fact map to the same place. I'm trying out mira and hopefully it will not do the same!


                    • #11
                      Any suggestions where we might publish our workaround? I would like to get it out there and get some kudos for it, but it isn't a very 'high-brow' bioinformatics application that would go in any journal.


                      • #12
                        Hi hlu,
                        Thank you very much for providing me with an explanation on why the SNP positions in the 454AlignmentInfo.tsv appear as they do when compared with 454HCDiffs.txt.

                        Would you know of an easy way to convert the flow space representation of 454AlignmentInfo.tsv into base space?

                        Cheers.


                        • #13
                          Greigite,

                          How did you get on with MIRA? Is handling SNPs easier with that? I've never used MIRA and am doing a lot of SNP identification so I will be very interested to hear your views.

                          Cheers.
