  • Titanium upper and lower case bases

    I am seeing a read like this from a 454 Titanium shotgun experiment using DNA from a capture array:

    tcagCTCGAGATTCTGGATCCTCACGTAATTCATCCTACATTACCTAGTAATTggtgaccatctgcattagctaattagcttatagaagaagacaacttctcatggtttatgacagaatata
    gtctgcaacttggagcaaggcacacaggggattaggnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnn

    The first four bases (tcag) are the key sequence. Why is the rest of the sequence in a mix of upper and lower case? I thought upper case meant good quality bases, but looking at the .fna files this does not seem to be the case.

    Any help in understanding this is appreciated.

    Cheers
    L

  • #2
    Layla,

    The fact that you are seeing the key tag (tcag) in your sequence indicates that you have the untrimmed sequence. SFF files store the complete flowgram, sequence and quality scores for a well. They also contain trimming information for each read: the 5' and 3' positions that bound the high quality sequence. The trim points also account for the key tag (and multiplex barcode, if used) at the 5' end and for the library adapter at the 3' end if the insert was short.

    When the FASTA and QUAL files are output from an SFF file using the sffinfo program, they normally contain just the trimmed sequence. It is also possible to output the entire untrimmed sequence by using the -n option when you run sffinfo. In this case the portions of the read beyond the trim points are also output, but in lower case. That is what you are seeing: the lower case bases are those which the 454 software marked to be trimmed.
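As an illustration of the convention described above, here is a minimal Python sketch that splits an untrimmed read into its lower case 5' portion, the upper case core, and the lower case 3' portion. The function name `split_by_case` is hypothetical, and the sketch assumes each read contains a single contiguous upper case stretch; it works only from the letter case of the FASTA output and does not read the trim points stored in the SFF file.

```python
import re


def split_by_case(read: str) -> tuple[str, str, str]:
    """Split an untrimmed read into (5' lower case, upper case core, 3' lower case).

    Assumes the bases kept by the trim points form one contiguous
    upper case run, as in the read shown in this thread.
    """
    match = re.search(r"[A-Z]+", read)
    if match is None:
        # No upper case bases at all: the whole read was marked for trimming.
        return read, "", ""
    return read[:match.start()], match.group(0), read[match.end():]


# Shortened, made-up read in the same style as the one posted above.
five_prime, core, three_prime = split_by_case("tcagCTCGAGATTggtgacc")
print(five_prime, core, three_prime)
```

Here `five_prime` would hold the key tag (plus any barcode), and `core` is the part that would normally be written when sffinfo is run without -n.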


    • #3
      50% lower case bases

      Thank you for the information kmcarr.

      I ran a simple sffinfo -s file1.sff > file1.fna command, without the -n option, to get this file. Given that 454 has marked these bases to be trimmed, should I also eliminate them before I map the reads to the human genome? My concern is that 50% of my bases from 500MB are in lower case, and after removing such bases each read will only be on average 50 bases long instead of the 500 bases that Titanium should be giving.

      Any suggestions on what one should do? I guess still holding onto those reads should not be an option?

      L


      • #4
        You might want to contact software support about this issue. This sounds like misbehaviour of the sffinfo software.


        • #5
          Looking at the 454TrimStatus.txt file (produced by assembly or mapping of an SFF), I get the following values:

          Mean Raw Length = 534
          Mean Orig Trimmed Length = 380
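For a rough cross-check of figures like these, the following Python sketch computes the mean raw length and the mean number of upper case bases per read from an untrimmed FASTA file (such as one written with the -n option). It does not parse 454TrimStatus.txt; the helper names `read_fasta` and `length_summary` are hypothetical, and counting upper case bases is only an approximation of the trimmed length reported by the 454 software.

```python
from io import StringIO
from typing import Iterator, TextIO


def read_fasta(handle: TextIO) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def length_summary(handle: TextIO) -> tuple[float, float]:
    """Return (mean raw length, mean count of upper case bases) over all reads."""
    raw_total = upper_total = n_reads = 0
    for _, seq in read_fasta(handle):
        raw_total += len(seq)
        upper_total += sum(1 for base in seq if base.isupper())
        n_reads += 1
    if n_reads == 0:
        raise ValueError("no FASTA records found")
    return raw_total / n_reads, upper_total / n_reads


# Tiny made-up input; with real data, pass open("file1.fna") instead.
demo = StringIO(">r1\ntcagACGTACGTaa\n>r2\ntcagAC\nGTnn\n")
mean_raw, mean_upper = length_summary(demo)
print(f"Mean raw length = {mean_raw:.1f}, mean upper case length = {mean_upper:.1f}")
```

If the upper case mean comes out far below the raw mean, as in the case described earlier in the thread, that points to heavy trimming rather than short raw reads.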


          About trimming before mapping: you should certainly trim the key tag and any adapter sequence from your reads before mapping (there is no way these could or should map onto your genome except by chance, i.e. in error).

          I was told that the 454 software takes no special account of low quality mismatches, i.e. gsMapper does not use quality information when mapping. For this reason, you should trim low quality bases before mapping. However, I'd be interested to know of any mapper that can take quality information into account, e.g. by not penalising a low quality mismatch, or by mapping the high quality bases and then using the low quality bases when generating the consensus...
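To make "trim low quality bases before mapping" concrete, here is a minimal Python sketch of one simple approach: keep the longest contiguous stretch of bases whose quality scores all meet a threshold. This is not the algorithm the 454 software uses to set its trim points; the function names and the threshold of 20 are assumptions chosen for illustration only.

```python
def longest_high_quality_window(quals: list[int], min_q: int = 20) -> tuple[int, int]:
    """Return (start, end) of the longest run with every score >= min_q.

    The end index is exclusive. If no base passes, (0, 0) is returned.
    On ties, the first such run is kept.
    """
    best_start = best_end = 0
    run_start = None
    for i, q in enumerate(quals):
        if q >= min_q:
            if run_start is None:
                run_start = i
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            run_start = None
    return best_start, best_end


def quality_trim(seq: str, quals: list[int], min_q: int = 20) -> tuple[str, list[int]]:
    """Trim a read and its quality scores to the longest high quality window."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    start, end = longest_high_quality_window(quals, min_q)
    return seq[start:end], quals[start:end]


# Made-up read and scores, for illustration only.
trimmed_seq, trimmed_quals = quality_trim(
    "ACGTACGTAC", [10, 30, 30, 12, 35, 36, 37, 38, 9, 8]
)
print(trimmed_seq, trimmed_quals)
```

A stricter or looser threshold changes how much of each read survives, so it is worth checking the resulting length distribution before mapping.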

          It seems that the error model for 454 could be captured by an HMM. You could then map using all the available information for a read (excluding the key tag and any adapter sequence) and then somehow perform a multiple HMM-to-HMM alignment to generate the consensus... Any maths geniuses around?

          Cheers,
          Last edited by dan; 07-14-2009, 03:10 AM. Reason: fixed a typo
          Homepage: Dan Bolser
          MetaBase the database of biological databases.


          • #6
            Perhaps MOSAIK from the Marth lab works with the quality values of 454 data.
            --
            bioinfosm
