SEQanswers

Old 01-06-2009, 12:26 PM   #1
jnfass
Member
 
Location: Davis, CA

Join Date: Aug 2008
Posts: 88
newbler assembly ... padding and SNPs

I'm looking over a newbler assembly of older 454 reads (~200bp) and it seems to me that base disagreements are padded and offset, instead of aligned on top of each other (see below); this seems like it will prevent automated SNP-finding, e.g. with the Marth Lab's polyBayesShort or (the updated version) GigaBayes (see the Marth Lab page at Boston College) ...

Here's an example of what I mean in the consed view of the newbler-produced ace file:

consensus:
...TTAT*cAGTGT...
reads:
...TTATg*AGTGT...
...TTATg*AGTGT...
...TTAT*cAGTGT...
...TTAT*cAGTGT...
...TTAT*cAGTGT...

... and here's what I expected:
consensus:
...TTATcAGTGT...
reads:
...TTATGAGTGT...
...TTATGAGTGT...
...TTATCAGTGT...
...TTATCAGTGT...
...TTATCAGTGT...

This is obviously a SNP candidate, but the former representation (padded and offset) is going to be harder to find with someone else's tool or my own scripts. I'm not seeing *any* of the latter case with this assembly ... but I've definitely seen the latter case with 454 reads assembled with PCAP. Does anyone recognize this ... am I missing some newbler default behavior?
Old 01-06-2009, 01:22 PM   #2
hlu
Member
 
Location: Branford, Connecticut

Join Date: Jan 2009
Posts: 32

Hi jnfass,

Yes, that is the default behavior of newbler/gsAssembler/gsMapper. 454 data are fundamentally different from traditional data, so other tools might need to accommodate the difference here.

Newbler aligns reads in flow space, not in base space.
Therefore, in your case above, the 2 g's are aligned with 3 gaps, and the 3 c's are aligned with 2 gaps. C and G will never align together in one column in a newbler 454 alignment.

Basically, a 454 data set has no replacements. It has only indels: the insertion of a base or the deletion of a base. This is due to the nature of flow-based technology.

A replacement is counted as 2 actions. For example, a C -> G change is simply a deletion of C immediately followed by an insertion of G; in 454 data, 2 adjacent indels are the same as 1 replacement at one base position.
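
To make the deletion-plus-insertion picture concrete, here is a minimal Python sketch (not part of newbler or any 454 tool) that merges such an offset pair of columns back into a single base-space column. It assumes the padded rows have already been extracted from the ace file as equal-length strings with `*` as the pad character; the function name `collapse_offset_pads` is hypothetical. Two adjacent columns are merged only when every row has a base in exactly one of them, so no base is ever dropped.

```python
PAD = "*"  # pad character used in the padded alignment rows


def collapse_offset_pads(rows):
    """Merge adjacent column pairs in which every row has exactly one base.

    rows: equal-length padded strings (consensus and/or reads).
    Returns a new list of strings; all other columns are copied unchanged.
    """
    if not rows:
        return []
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all rows must have the same padded length")

    out = [[] for _ in rows]
    col = 0
    while col < width:
        # An "offset pair": in each row, one of the two cells is a pad
        # and the other is a base.
        is_pair = col + 1 < width and all(
            (r[col] == PAD) != (r[col + 1] == PAD) for r in rows
        )
        if is_pair:
            for cells, r in zip(out, rows):
                cells.append(r[col] if r[col] != PAD else r[col + 1])
            col += 2
        else:
            for cells, r in zip(out, rows):
                cells.append(r[col])
            col += 1
    return ["".join(cells) for cells in out]


# Rows taken from the example in the first post (ellipses dropped).
example = [
    "TTAT*cAGTGT",  # consensus
    "TTATg*AGTGT",
    "TTATg*AGTGT",
    "TTAT*cAGTGT",
]
for row in collapse_offset_pads(example):
    print(row)
```

On the rows from the first post, the g and c end up in the same column, which is the layout a base-space SNP caller expects. Pads that are not part of such a pair (for example, real homopolymer length differences) are left alone.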
Old 01-06-2009, 01:41 PM   #3
jnfass
Member
 
Location: Davis, CA

Join Date: Aug 2008
Posts: 88

Thanks hlu ...

But I don't see how this is different than, say, solexa reads.

I was thinking that this "pad and offset" strategy might be the default because of the poly-base repeat problem (mistakenly calling 4 A's instead of 5 because strings of one base are read as a fold increase in signal intensity) ... in which case a base disagreement may be the result of an extra (misread) base inserted. But I guess I was hoping that newbler (gsAssembler) / gsMapper could distinguish between this case and a substitution not near any repeats.
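
One rough way to separate those two cases, sketched in Python under stated assumptions: check whether the disagreeing position sits in or next to a homopolymer run in the unpadded consensus. The function names and the `min_run=3` cutoff are hypothetical choices for illustration, not anything newbler is known to do.

```python
def run_length_at(seq, pos):
    """Length of the run of identical bases that contains seq[pos]."""
    base = seq[pos].upper()
    start = pos
    while start > 0 and seq[start - 1].upper() == base:
        start -= 1
    end = pos
    while end + 1 < len(seq) and seq[end + 1].upper() == base:
        end += 1
    return end - start + 1


def touches_homopolymer(seq, pos, min_run=3):
    """True if pos, or a direct neighbour, lies in a run of >= min_run bases.

    min_run is an arbitrary cutoff chosen for this sketch.
    """
    neighbours = [p for p in (pos - 1, pos, pos + 1) if 0 <= p < len(seq)]
    return any(run_length_at(seq, p) >= min_run for p in neighbours)


# The candidate from the first post: the c at index 4 of the unpadded consensus.
print(touches_homopolymer("TTATcAGTGT", 4))
# A made-up sequence where the position of interest follows a run of five A's.
print(touches_homopolymer("GGAAAAATCC", 7))
```

A disagreement that does not touch a run is a better substitution candidate than one flanking a long homopolymer, though this check says nothing about read quality or coverage.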

~Joe
Old 01-06-2009, 02:27 PM   #4
hlu
Member
 
Location: Branford, Connecticut

Join Date: Jan 2009
Posts: 32

Sequencing errors are more than poly-base issues. There are PCR errors and other errors in the data set.

This is actually a data presentation issue, not a SNP issue. The ace file from gsMapper/gsAssembler/newbler merely presents the flow-space alignment faithfully as bases, without changing the flow anchoring coordinates.

For SNPs, gsMapper provides the proper SNP information in another file.

Among the gsMapper output files there is one called "454HCDiffs.txt" (the 454 high-confidence differences file), which lists all the SNPs and mutations. In this file, flow space is converted into base space, and a C -> G mutation appears in one row. This is obviously the result of post-ace processing.

You might want to look deeper into this file for mutations.

BTW, assembly results are not well suited to calling mutations. If you want mutations, run gsMapper using a reference, or using the assembled contigs as the reference. That way you get the "454HCDiffs.txt" file, which has all the relevant SNPs and mutations.

There is another file, called "454AllDiffs.txt", which also includes all the low-confidence SNPs/mutations from the gsMapper result.
Old 01-06-2009, 02:48 PM   #5
jnfass
Member
 
Location: Davis, CA

Join Date: Aug 2008
Posts: 88

I think I get what you're saying:
So, where there's a base mismatch between reads in an alignment, the ace file representation retains the original flow space order ... i.e. in my original example, fluorophore-tagged G's were flowed over the slide before tagged C's? So theoretically I should be able to see the order of the cycle of bases by looking over my alignments ... as in (for example) I might always see "a*" on top of "*g", and "g*" on top of "*c", and "c*" on top of "*t" ... which would mean that the sequencer flows in a's, g's, c's, t's, and repeats ... (if I follow my own logic correctly)?
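
That inference can be sketched in Python. The snippet below is a hypothetical helper, not a 454 tool: it assumes padded rows are available as equal-length strings with `*` as the pad, and it relies on the reading proposed in this paragraph, namely that the left column of an offset pair was flowed before the right one. It only tallies which base sits to the left of which; because flows repeat in a cycle, a single pair does not fix the full order by itself.

```python
from collections import Counter

PAD = "*"  # pad character used in the padded alignment rows


def offset_pairs(rows):
    """Count (left_base, right_base) for each adjacent offset column pair.

    A pair qualifies when every row has a base in exactly one of the two
    columns and each column holds a single kind of base.
    """
    counts = Counter()
    if not rows:
        return counts
    width = len(rows[0])
    for col in range(width - 1):
        if not all((r[col] == PAD) != (r[col + 1] == PAD) for r in rows):
            continue
        left = {r[col].upper() for r in rows if r[col] != PAD}
        right = {r[col + 1].upper() for r in rows if r[col + 1] != PAD}
        if len(left) == 1 and len(right) == 1:
            counts[(left.pop(), right.pop())] += 1
    return counts


# Rows from the example in the first post (ellipses dropped).
rows = [
    "TTAT*cAGTGT",
    "TTATg*AGTGT",
    "TTATg*AGTGT",
    "TTAT*cAGTGT",
]
print(offset_pairs(rows))
```

Run over many contigs, a consistent set of pairs such as (A, G), (G, C), (C, T) would support one cyclic order, while conflicting pairs would suggest the assumption does not hold.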

I'll look into gsMapper - I wasn't aware that it would do SNP calls. I don't have a reference, but as you say I could map to the assembled contigs.

Thanks for your comments!
Old 01-06-2009, 03:26 PM   #6
hlu
Member
 
Location: Branford, Connecticut

Join Date: Jan 2009
Posts: 32

I think you got it.

I don't have a record of the flow order on hand. But each column of a newbler ace file alignment is a unique fluorophore-tagged base. 454 has 4 bases in 1 cycle of the flow order.

Newbler does not want to change that flow order information because the basic information is flow based. The base-level display is only for the human eye, and I believe that in the software all calculations are based on the raw flow signal.
Old 02-12-2009, 09:09 PM   #7
jaudall
Junior Member
 
Location: provo

Join Date: Oct 2008
Posts: 3
wag the dog

Hlu -

Thanks for your explanations of the padding in the ace file. Emailing from Branford and with deep understanding of the newbler system, your posts were enlightening.

As enlightening and as rational as they may be, as a biologist I don't care about the flow information. Particularly, if I'm going from 0 SNPs to tens of thousands of SNPs in a single run of an untracted genome. The SNPs need to be verified anyway, why not produce output that is compatible with the rest of the world's developed software? It somewhat ruffled my feathers to have flowgrams dictating to biologists what they want to get out of the sequence data.

Why not have it both ways? Display it in Newbler as flowgrams etc. but have an export option (with a hefty disclaimer if it makes you feel better) that condenses the pads to your best guess.

If I understand what you are saying, I do an assembly and export the contigs. The fasta file doesn't have flowgram information. Then I use the assembled contigs as a reference to map all the original reads. No pads, and the SNPs were called. Why not just have gsAssembler do this mapping process as a macro with the click of an 'export traditional ace' button?
Old 07-24-2009, 09:29 AM   #8
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482

jaudall, I was interested in what your experience was, since this old post..
Old 07-24-2009, 10:56 AM   #9
jaudall
Junior Member
 
Location: provo

Join Date: Oct 2008
Posts: 3

We've developed some workarounds with other software and our own code. No word from 454 about any changes. Perhaps, they think they are right and it is the rest of the world that needs to change.
Old 09-04-2009, 01:11 PM   #10
greigite
Senior Member
 
Location: Cambridge, MA

Join Date: Mar 2009
Posts: 141

FYI I've noticed that newbler does this even in hybrid sanger/454 assemblies. It's a real problem for our existing SNP processing tools, and the solution of assembling THEN mapping again suggested above seems overly roundabout, not to mention that you couldn't be sure the same reads assembled initially will in fact map to the same place. I'm trying out mira and hopefully it will not do the same!
Old 09-04-2009, 08:29 PM   #11
jaudall
Junior Member
 
Location: provo

Join Date: Oct 2008
Posts: 3

Any suggestions where we might publish our workaround? I would like to get it out there and get some kudos for it, but it isn't a very 'high-brow' bioinformatics application that would go in any journal.
Old 06-09-2011, 04:55 AM   #12
Soni
Member
 
Location: Nairobi

Join Date: Oct 2009
Posts: 11

Hi hlu,
Thank you very much for providing me with an explanation on why the SNP positions in the 454AlignmentInfo.tsv appear as they do when compared with 454HCDiffs.txt.

Would you know of an easy way to convert the flow space representation of 454AlignmentInfo.tsv into base space?

Cheers.
Old 06-09-2011, 04:58 AM   #13
Soni
Member
 
Location: Nairobi

Join Date: Oct 2009
Posts: 11

Greigite,

How did you get on with MIRA? Is handling SNPs easier with that? I've never used MIRA and am doing a lot of SNP identification so I will be very interested to hear your views.

Cheers.
Tags
454, de novo, newbler, snp
