**hlu** · 01-06-2009, 01:22 PM

Hi jnfass,

Yes, that is the default behavior of newbler/gsAssembler/gsMapper. 454 data are fundamentally different from traditional sequencing data, so other tools might need to accommodate the difference.

Newbler aligns reads in flow space, not in base space.
Therefore, in your case above, 2 G's are aligned with 3 gaps, and 3 C's are aligned with 2 gaps. C and G will never align together in one column in newbler 454 data.

Basically, a 454 data set has no replacements (substitutions). It has only indels: insertion of a base or deletion of a base. This is due to the nature of flow-based technology.

A replacement is counted as 2 actions. For example, a C -> G change is simply a deletion of C immediately followed by an insertion of G. 2 indels in 454 are the same as 1 replacement at one base position.
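
To make the "one replacement equals two indels" point concrete, here is a minimal Python sketch of an idealised flow-space encoding. The flow order `TACG`, the function names, and the noise-free signal model are all assumptions for illustration; this is not newbler's actual code.

```python
def to_flowgram(seq, flow_order="TACG"):
    """Encode a base sequence as (flowed_base, incorporated_count) pairs.

    flow_order is an assumed cycle for illustration only; check the
    instrument's real flow order before relying on it.
    """
    seq = seq.upper()
    unknown = set(seq) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")
    flows = []
    pos = 0   # next base of seq still to be read
    flow = 0  # index of the current flow
    while pos < len(seq):
        base = flow_order[flow % len(flow_order)]
        count = 0
        # A homopolymer run is consumed within a single flow.
        while pos < len(seq) and seq[pos] == base:
            count += 1
            pos += 1
        flows.append((base, count))
        flow += 1
    return flows


def differing_flows(seq_a, seq_b, flow_order="TACG"):
    """Indices of flows whose counts differ, compared flow by flow.

    Only meaningful while both flowgrams stay in step; zip() stops at
    the shorter one.
    """
    a = to_flowgram(seq_a, flow_order)
    b = to_flowgram(seq_b, flow_order)
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


print(to_flowgram("ACT"))
print(to_flowgram("AGT"))
print(differing_flows("ACT", "AGT"))
```

Under these assumptions, `ACT` and `AGT` differ in two flows: the C flow drops from 1 to 0 (a deletion of C) and the G flow rises from 0 to 1 (an insertion of G), which is the pair of indels described above.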

**jnfass** · 01-06-2009, 01:41 PM

Thanks hlu ...

But I don't see how this is different than, say, solexa reads.

I was thinking that this "pad and offset" strategy might be the default because of the poly-base repeat problem (mistakenly calling 4 A's instead of 5 because strings of one base are read as a fold increase in signal intensity) ... in which case a base disagreement may be the result of an extra (misread) base inserted. But I guess I was hoping that newbler (gsAssembler) / gsMapper could distinguish between this case and a substitution not near any repeats.

~Joe

**hlu** · 01-06-2009, 02:27 PM

> Originally posted by jnfass
>
> Thanks hlu ...
>
> But I don't see how this is different than, say, solexa reads.
>
> I was thinking that this "pad and offset" strategy might be the default because of the poly-base repeat problem (mistakenly calling 4 A's instead of 5 because strings of one base are read as a fold increase in signal intensity) ... in which case a base disagreement may be the result of an extra (misread) base inserted. But I guess I was hoping that newbler (gsAssembler) / gsMapper could distinguish between this case and a substitution not near any repeats.
>
> ~Joe

Sequencing errors are more than poly-base issues. There are also PCR errors and other errors in the data set.

This is actually a data presentation issue, not a SNP issue. The ace file from gsMapper/gsAssembler/newbler merely presents the flow-space alignment faithfully as bases, without changing the flow anchoring coordinates.

For SNPs, gsMapper provides the proper SNP information in another file.

Among the gsMapper output files there is one called "454HCDiffs.txt", meaning 454 high-confidence differences, which lists all the SNPs and mutations. In this file, flow space is converted into base space, and a C -> G mutation appears in one row. This is obviously a post-ace processing result.

You might want to look deeper into this file for mutations.

BTW, assembly results are not well suited to calling mutations. If you want mutations, run gsMapper against a reference, or use the assembled contigs as the reference. That way you get the "454HCDiffs.txt" file, which has all the relevant SNPs and mutations.

There is another file, called "454AllDiffs.txt", which also includes all the low-confidence SNPs/mutations from the gsMapper result.

**jnfass** · 01-06-2009, 02:48 PM

I think I get what you're saying:
So, where there's a base mismatch between reads in an alignment, the ace file representation retains the original flow space order ... i.e. in my original example, fluorophore-tagged G's were flowed over the slide before tagged C's? So theoretically I should be able to see the order of the cycle of bases by looking over my alignments ... as in (for example) I might always see "a*" on top of "*g", and "g*" on top of "*c", and "c*" on top of "*t" ... which would mean that the sequencer flows in a's, g's, c's, t's, and repeats ... (if I follow my own logic correctly)?

I'll look into gsMapper - I wasn't aware that it would do SNP calls. I don't have a reference, but as you say I could map to the assembled contigs.

Thanks for your comments!

**hlu** · 01-06-2009, 03:26 PM

> Originally posted by jnfass
>
> I think I get what you're saying:
> So, where there's a base mismatch between reads in an alignment, the ace file representation retains the original flow space order ... i.e. in my original example, fluorophore-tagged G's were flowed over the slide before tagged C's? So theoretically I should be able to see the order of the cycle of bases by looking over my alignments ... as in (for example) I might always see "a*" on top of "*g", and "g*" on top of "*c", and "c*" on top of "*t" ... which would mean that the sequencer flows in a's, g's, c's, t's, and repeats ... (if I follow my own logic correctly)?
> Thanks for your comments!

I think you got it.

I don't have a record of the flow order on hand. But each column of a newbler ace file alignment is a unique fluorophore-tagged base. 454 has 4 bases in 1 cycle of the flow order.

Newbler does not want to change that flow order information because the basic information is flow based. The base-level display is only for the human eye, and I believe the software bases all of its calculations on the raw flow signal.
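
The idea of reading the flow order off the padded columns can be sketched as follows. This is a simplified model, assuming a hypothetical `TACG` cycle and a helper name (`first_flowed`) invented for illustration: after the flow that incorporated the preceding base, the remaining bases are flowed in cyclic order, and whichever candidate base is reached first gets the earlier alignment column.

```python
def first_flowed(prev_base, candidates, flow_order="TACG"):
    """Return the candidate base that is flowed soonest after prev_base.

    flow_order is an assumed cycle for illustration only. Candidates must
    differ from prev_base (the same base would extend the homopolymer run
    inside the same flow instead of opening a new column).
    """
    order = flow_order.upper()
    start = order.index(prev_base.upper())

    def flows_until(base):
        # Number of flows after prev_base's flow until `base` is flowed.
        steps = (order.index(base.upper()) - start) % len(order)
        if steps == 0:
            raise ValueError("candidate equals the preceding base")
        return steps

    return min(candidates, key=flows_until)


# Which of C or G gets the earlier column after an A?
print(first_flowed("A", ["C", "G"]))
# And which of C or T after a G?
print(first_flowed("G", ["C", "T"]))
```

In this model the relative order of two disagreeing bases depends on the base that precedes them, so recovering the full cycle from alignments would probably need observations from several different sequence contexts.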

**jaudall** · 02-12-2009, 09:09 PM

wag the dog

Hlu -

Thanks for your explanations of the padding in the ace file. Emailing from Branford and with deep understanding of the newbler system, your posts were enlightening.

As enlightening and as rational as they may be, as a biologist I don't care about the flow information. Particularly, if I'm going from 0 SNPs to tens of thousands of SNPs in a single run of an untracted genome. The SNPs need to be verified anyway, why not produce output that is compatible with the rest of the world's developed software? It somewhat ruffled my feathers to have flowgrams dictating to biologists what they want to get out of the sequence data.

Why not have it both ways? Display it in Newbler as flowgrams etc. but have an export option (with a hefty disclaimer if it makes you feel better) that condenses the pads to your best guess.

If I understand what you are saying, I do an assembly and export the contigs. The fasta file doesn't have flowgram information. Then I use the assembled contigs as a reference to map all the original reads. No pads, and the SNPs were called. Why not just have gsAssembler do this mapping process as a macro with the click of an 'export traditional ace' button?
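
A rough sketch of what such a "condense the pads" export might do for a pair of aligned rows is below. The function name and the greedy merge rule are assumptions for illustration, not anything newbler or gsAssembler provides, and a real ace file has many reads per column, which this does not handle.

```python
def condense_pads(row_a, row_b, pad="*"):
    """Merge adjacent base/pad + pad/base column pairs into one mismatch column.

    Works on two equal-length padded rows from a pairwise alignment.
    The merge is a best guess: a genuine insertion next to a genuine
    deletion would be collapsed into a substitution as well.
    """
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same padded length")
    out_a, out_b = [], []
    i, n = 0, len(row_a)
    while i < n:
        a, b = row_a[i], row_b[i]
        if i + 1 < n:
            a2, b2 = row_a[i + 1], row_b[i + 1]
            # Pattern "G*" over "*C": base in A only, then base in B only.
            if a != pad and b == pad and a2 == pad and b2 != pad:
                out_a.append(a)
                out_b.append(b2)
                i += 2
                continue
            # Pattern "*G" over "C*": base in B only, then base in A only.
            if a == pad and b != pad and a2 != pad and b2 == pad:
                out_a.append(a2)
                out_b.append(b)
                i += 2
                continue
        out_a.append(a)
        out_b.append(b)
        i += 1
    return "".join(out_a), "".join(out_b)


print(condense_pads("AG*T", "A*CT"))
```

Here the padded pair `AG*T` / `A*CT` condenses to `AGT` / `ACT`, a single G/C mismatch column, while a lone pad with no complementary neighbour is left as a true indel.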

**bioinfosm** · 07-24-2009, 08:29 AM

jaudall, I was interested in what your experience was, since this old post..

**jaudall** · 07-24-2009, 09:56 AM

We've developed some workarounds with other software and our own code. No word from 454 about any changes. Perhaps, they think they are right and it is the rest of the world that needs to change.

**greigite** · 09-04-2009, 12:11 PM

FYI I've noticed that newbler does this even in hybrid sanger/454 assemblies. It's a real problem for our existing SNP processing tools, and the solution of assembling THEN mapping again suggested above seems overly roundabout, not to mention that you couldn't be sure the same reads assembled initially will in fact map to the same place. I'm trying out mira and hopefully it will not do the same!

**jaudall** · 09-04-2009, 07:29 PM

> Originally posted by jaudall
>
> We've developed some workarounds with other software and our own code. No word from 454 about any changes. Perhaps, they think they are right and it is the rest of the world that needs to change.

Any suggestions where we might publish our workaround? I would like to get it out there and get some kudos for it, but it isn't a very 'high-brow' bioinformatics application that would go in any journal.

**Soni** · 06-09-2011, 03:55 AM

> Originally posted by hlu
>
> Sequencing errors are more than poly-base issues. There are also PCR errors and other errors in the data set.
>
> This is actually a data presentation issue, not a SNP issue. The ace file from gsMapper/gsAssembler/newbler merely presents the flow-space alignment faithfully as bases, without changing the flow anchoring coordinates.
>
> For SNPs, gsMapper provides the proper SNP information in another file.
>
> Among the gsMapper output files there is one called "454HCDiffs.txt", meaning 454 high-confidence differences, which lists all the SNPs and mutations. In this file, flow space is converted into base space, and a C -> G mutation appears in one row. This is obviously a post-ace processing result.
>
> You might want to look deeper into this file for mutations.
>
> BTW, assembly results are not well suited to calling mutations. If you want mutations, run gsMapper against a reference, or use the assembled contigs as the reference. That way you get the "454HCDiffs.txt" file, which has all the relevant SNPs and mutations.
>
> There is another file, called "454AllDiffs.txt", which also includes all the low-confidence SNPs/mutations from the gsMapper result.

Hi hlu,
Thank you very much for providing me with an explanation on why the SNP positions in the 454AlignmentInfo.tsv appear as they do when compared with 454HCDiffs.txt.

Would you know of an easy way to convert the flow space representation of 454AlignmentInfo.tsv into base space?

Cheers.

**Soni** · 06-09-2011, 03:58 AM

> Originally posted by greigite
>
> FYI I've noticed that newbler does this even in hybrid sanger/454 assemblies. It's a real problem for our existing SNP processing tools, and the solution of assembling THEN mapping again suggested above seems overly roundabout, not to mention that you couldn't be sure the same reads assembled initially will in fact map to the same place. I'm trying out mira and hopefully it will not do the same!

Greigite,

How did you get on with MIRA? Is handling SNPs easier with that? I've never used MIRA and am doing a lot of SNP identification so I will be very interested to hear your views.

Cheers.

newbler assembly ... padding and SNPs
