  • Layla
    Member
    • Sep 2008
    • 58

    Titanium upper and lower case bases

    I am seeing a read like this from a 454 Titanium shotgun experiment using DNA from a capture array.

    tcagCTCGAGATTCTGGATCCTCACGTAATTCATCCTACATTACCTAGTAATTggtgaccatctgcattagctaattagcttatagaagaagacaacttctcatggtttatgacagaatata
    gtctgcaacttggagcaaggcacacaggggattaggnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn

    The first four bases (tcag) are the key sequence. Why is the rest of the sequence in a mix of upper and lower case? I thought upper case meant good-quality bases, but looking at the .fna files this does not seem to be the case.

    Any help in understanding this is appreciated.

    Cheers
    L
  • kmcarr
    Senior Member
    • May 2008
    • 1181

    #2
    Layla,

    The fact that you are seeing the key tag (tcag) in your sequence indicates that you have the untrimmed sequence. SFF files store the complete flowgram, sequence and quality scores for a well. They also contain trimming information for each read: the 5' and 3' positions that bound the high-quality sequence. The trim points also account for the key tag (and multiplex barcode, if used) at the 5' end and the library adapter at the 3' end if the insert was short.

    When the FASTA and QUAL files are output from an SFF file using the sffinfo program, they normally contain just the trimmed sequence. It is also possible to output the entire untrimmed sequence by using the -n option when you run sffinfo. In this case the portions of the read that lie beyond the trim points are also output, but in lower case. That is what you are seeing: the lower-case bases are those which the 454 software marked to be trimmed.
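As a rough illustration of the convention described above, here is a minimal Python sketch that recovers the retained part of an untrimmed read. It assumes only that lower-case bases lie outside the trim points; the function name `retained_span` and the toy read are made up for this example and are not part of the 454 software.

```python
def retained_span(read: str) -> str:
    """Return the stretch from the first to the last upper-case base.

    Assumption: lower-case bases mark the regions outside the trim points,
    so everything between the first and last upper-case base is kept.
    """
    upper_positions = [i for i, base in enumerate(read) if base.isupper()]
    if not upper_positions:
        return ""  # no base survived trimming
    return read[upper_positions[0]:upper_positions[-1] + 1]


# Toy read (not real data): key tag, insert, then a trimmed 3' tail.
print(retained_span("tcagACGTACGTTGCAggtgaccnnnn"))  # ACGTACGTTGCA
```

Running sffinfo without -n should make this step unnecessary; the sketch is only useful when untrimmed output is all that is at hand.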

    • Layla
      Member
      • Sep 2008
      • 58

      #3
      50% lower case bases

      Thank you for the information kmcarr.

      I ran a simple sffinfo -s file1.sff > file1.fna command, without the -n option, to get this file. Given that the 454 software has marked these bases for trimming, should I also remove them before I map the reads to the human genome? My concern is that 50% of my bases (out of 500 MB) are in lower case, and after removing them each read will average only 50 bases instead of the 500 bases that Titanium should be giving.

      Any suggestions on what one should do? I guess holding on to those reads is not an option?

      L
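To put a number on how much of a run is lower case, something like the following Python sketch could be used. It is an illustration only: the helper names `parse_fasta` and `case_summary` are invented here, and the sketch assumes a plain FASTA file in which case is the only trimming signal.

```python
from typing import Dict, Iterable, Iterator, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def case_summary(lines: Iterable[str]) -> Dict[str, float]:
    """Summarise upper- versus lower-case bases over all reads."""
    reads = total = upper = 0
    for _, seq in parse_fasta(lines):
        reads += 1
        total += len(seq)
        upper += sum(1 for base in seq if base.isupper())
    if reads == 0 or total == 0:
        return {"reads": reads, "mean_raw_length": 0.0,
                "mean_upper_length": 0.0, "lower_fraction": 0.0}
    return {
        "reads": reads,
        "mean_raw_length": total / reads,       # all bases per read
        "mean_upper_length": upper / reads,     # upper-case bases per read
        "lower_fraction": (total - upper) / total,
    }
```

Passing an open file handle for file1.fna to `case_summary` would report the mean read length with and without the lower-case bases, which makes it easier to judge how much sequence trimming would actually remove.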

      • hlu
        Member
        • Jan 2009
        • 32

        #4
        You might want to contact software support about this issue. This sounds like a misbehavior of the sffinfo software.

        • dan
          wiki wiki
          • Jul 2008
          • 194

          #5
          Looking at the 454TrimStatus.txt file (produced by assembly or mapping of an SFF), I get the following values:

          Mean Raw Length = 534
          Mean Orig Trimmed Length = 380


          About trimming before mapping... you should certainly trim the key tag and any adapter sequence from your reads before mapping (there is no way this could or should map onto your genome except by chance, i.e. in error).

          I was told that the 454 software takes no special account of low-quality mismatches, i.e. gsMapper does not use quality information when mapping. For this reason, you should trim low-quality bases before mapping. However, I'd be interested to know of any mapper that can take quality information into account, e.g. by not penalising a low-quality mismatch, or by mapping high-quality bases and using low-quality bases when generating the consensus...
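For what trimming low-quality bases before mapping might look like in the simplest case, here is a minimal Python sketch. It is not the algorithm the 454 software uses: the function name `trim_low_quality_3prime`, the 3'-only strategy and the default threshold of 20 are all assumptions chosen for illustration. It expects one integer quality score per base, as found in a QUAL file.

```python
from typing import List, Tuple


def trim_low_quality_3prime(
    seq: str, quals: List[int], min_q: int = 20
) -> Tuple[str, List[int]]:
    """Drop bases from the 3' end until one meets the quality threshold.

    min_q is an arbitrary illustrative cutoff, not a 454 default.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists differ in length")
    end = len(seq)
    # Walk back from the 3' end while the last kept base is below min_q.
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


print(trim_low_quality_3prime("ACGTAC", [30, 32, 28, 25, 12, 8]))
# ('ACGT', [30, 32, 28, 25])
```

A sliding-window average would be more robust to a single good base near the end of a poor tail; the version above is deliberately the simplest form of the idea.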

          It seems that the error model for 454 could be captured by an HMM. You could then map using all the available information for a read (excluding the key tag and any adapter sequence) and then somehow perform a multiple HMM-to-HMM alignment to generate the consensus... Any maths geniuses around?

          Cheers,
          Last edited by dan; 07-14-2009, 03:10 AM. Reason: fixed a typo
          Homepage: Dan Bolser
          MetaBase the database of biological databases.

          • bioinfosm
            Senior Member
            • Jan 2008
            • 483

            #6
            Perhaps MOSAIK from the Marth lab works with the quality values of 454 data.
            --
            bioinfosm
