SEQanswers
Old 06-16-2009, 03:14 AM   #1
Layla
Member
 
Location: London

Join Date: Sep 2008
Posts: 58
Titanium upper and lower case bases

I am seeing a read like this from a 454 Titanium shotgun experiment using DNA from a capture array:

tcagCTCGAGATTCTGGATCCTCACGTAATTCATCCTACATTACCTAGTAATTggtgaccatctgcattagctaattagcttatagaagaagacaacttctcatggtttatgacagaatata
gtctgcaacttggagcaaggcacacaggggattaggnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn

The first four bases, tcag, are the key sequence. Why is the rest of the sequence in mixed upper and lower case? I thought upper case meant good-quality bases, but looking at the .fna files that does not seem to be the case.

Any help understanding this would be appreciated.

Cheers
L
Old 06-16-2009, 05:12 AM   #2
kmcarr
Senior Member
 
Location: USA, Midwest

Join Date: May 2008
Posts: 1,173

Layla,

The fact that you are seeing the key tag (tcag) in your sequence indicates that you have the untrimmed sequence. SFF files store the complete flowgram, sequence and quality scores for each well. They also contain trimming information for each read: the 5' and 3' positions of the high-quality sequence. The trim points also account for the key tag (and multiplex barcode, if used) at the 5' end and for the library adapter at the 3' end if the insert was short.

When FASTA and QUAL files are output from an SFF file using the sffinfo program, they normally contain just the trimmed sequence. It is also possible to output the entire untrimmed sequence by using the -n option when you run sffinfo. In that case the portions of the read beyond the trim points are also output, but in lower case. That is what you are seeing: the lower-case bases are the ones the 454 software has marked to be trimmed.
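
If you want to look at the trim points yourself, one option (my own suggestion, not part of the Roche tools) is Biopython's SFF parser: format "sff" returns the full-length read with the to-be-trimmed ends in lower case, much like sffinfo -n, while "sff-trim" returns only the high-quality portion, like plain sffinfo. A rough sketch, assuming Biopython is installed and using file1.sff as an example file name:

Code:
# Rough sketch, assuming Biopython is available (my suggestion, not a 454 tool).
# "sff" gives the full read with the to-be-trimmed ends in lower case (like
# sffinfo -n); "sff-trim" gives only the high-quality portion (like plain sffinfo).
from Bio import SeqIO

full = SeqIO.to_dict(SeqIO.parse("file1.sff", "sff"))
trimmed = SeqIO.to_dict(SeqIO.parse("file1.sff", "sff-trim"))

for name, rec in full.items():
    print(name,
          "raw length:", len(rec.seq),
          "trimmed length:", len(trimmed[name].seq),
          "quality clip points:", rec.annotations["clip_qual_left"],
          rec.annotations["clip_qual_right"])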
Old 06-16-2009, 05:50 AM   #3
Layla
Member
 
Location: London

Join Date: Sep 2008
Posts: 58
50% lower case bases

Thank you for the information, kmcarr.

I ran a simple sffinfo -s file1.sff > file1.fna command, without the -n option, to produce this file. Since the 454 software has marked these bases to be trimmed, should I also eliminate them before mapping to the human genome? My concern is that 50% of my bases (out of 500 MB) are in lower case, and after removing them each read will average only about 50 bases instead of the ~500 bases Titanium should be giving.

Any suggestions on what to do? I guess simply holding on to those reads is not an option?
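
In case it is useful, a quick way to count the lower-case fraction in the .fna is something like the following (a rough sketch using Biopython, which is my own choice and not part of the 454 software):

Code:
# Rough sketch: count how much of a mixed-case .fna is lower case, i.e. marked
# for trimming. Biopython is an assumption; file1.fna is the sffinfo output
# mentioned above.
from Bio import SeqIO

total = lower = 0
for rec in SeqIO.parse("file1.fna", "fasta"):
    seq = str(rec.seq)
    total += len(seq)
    lower += sum(base.islower() for base in seq)

print("%.1f%% of %d bases are lower case" % (100.0 * lower / total, total))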

L
Old 06-19-2009, 01:57 PM   #4
hlu
Member
 
Location: Branford, Connecticut

Join Date: Jan 2009
Posts: 32

Quote:
Originally Posted by Layla
Any suggestions on what to do? I guess simply holding on to those reads is not an option?

You might want to contact software support about this; it sounds like misbehavior of the sffinfo software.
Old 07-14-2009, 04:08 AM   #5
dan
wiki wiki
 
Location: Cambridge, England

Join Date: Jul 2008
Posts: 266

Looking at the 454TrimStatus.txt file (produced by assembly or mapping of an SFF), I get the following values:

Mean Raw Length = 534
Mean Orig Trimmed Length = 380


Regarding trimming before mapping: you should certainly trim the key tag and any adapter sequence from your reads before mapping; there is no way these could or should map onto your genome except by chance, i.e. in error.

With the 454 software, I was told there is no special consideration for low-quality mismatches, i.e. gsMapper does not use quality information when mapping. For this reason you should trim low-quality bases before mapping. However, I'd be interested to know of any mapper that can take quality information into account, e.g. by not penalising a low-quality mismatch, or by mapping on the high-quality bases and using the low-quality bases only when generating the consensus...
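
If you only have the mixed-case FASTA to hand, hard-trimming the lower-case bases before mapping could look something like this (a rough sketch, again assuming Biopython; the 50 bp minimum length and the output file name are arbitrary examples, not 454 defaults):

Code:
# Rough sketch: keep only the upper-case (untrimmed) portion of each read from a
# mixed-case .fna and drop anything shorter than a minimum length before mapping.
# Biopython, the 50 bp cutoff and the output name are my own choices.
from Bio import SeqIO

MIN_LENGTH = 50  # arbitrary example cutoff

with open("file1.trimmed.fna", "w") as out:
    for rec in SeqIO.parse("file1.fna", "fasta"):
        # sffinfo puts the to-be-trimmed bases at the ends of the read in lower
        # case, so stripping them leaves the high-quality middle portion
        kept = str(rec.seq).strip("acgtn")
        if len(kept) >= MIN_LENGTH:
            out.write(">%s\n%s\n" % (rec.id, kept))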

It seems that the 454 error model could be captured by an HMM. You could then map using all the available information for a read (excluding the key tag and any adapter sequence) and then somehow perform a multiple HMM-to-HMM alignment to generate the consensus... Any maths geniuses around?

Cheers,
__________________
Homepage: Dan Bolser
MetaBase, the database of biological databases.

Old 07-14-2009, 07:58 AM   #6
bioinfosm
Senior Member
 
Location: USA

Join Date: Jan 2008
Posts: 482

Perhaps MOSAIK from the Marth lab works with the quality values of 454 data.