  • How will SAM evolve to handle multiple tags per fragment?

    A bit of an open question looking for debate.

    SAM format has a bunch of support for mate pair / paired end sequences in terms of flags and specific fields.

    A number of sequencing approaches can generate multiple tags per fragment. For example, Polonator reads are really mate quads. Complete Genomics takes this even further. Conceptually, SOLiD and perhaps Illumina could generate 4 tags from "jumping" libraries. Helicos and PacBio (and presumably the VisGen/Life technology) can use pulses of unlabeled nucleotides to potentially generate a very large number of linked reads.

    Any thoughts on how to accommodate these? Or will a new format be required?

  • #2
    Originally posted by krobison
    A bit of an open question looking for debate.

    SAM format has a bunch of support for mate pair / paired end sequences in terms of flags and specific fields.

    A number of sequencing approaches can generate multiple tags per fragment. For example, Polonator reads are really mate quads. Complete Genomics takes this even further. Conceptually, SOLiD and perhaps Illumina could generate 4 tags from "jumping" libraries. Helicos and PacBio (and presumably the VisGen/Life technology) can use pulses of unlabeled nucleotides to potentially generate a very large number of linked reads.

    Any thoughts on how to accommodate these? Or will a new format be required?
    I mentioned this a while ago, since we have had triple-end data for quite some time now (we no longer generate such data because it is not cost-effective). Anyhow, in my opinion there will have to be a non-backwards-compatible SAM format, since the FLAG field, among other fields, is not compatible with n-linked reads. The fields that hold information about the pair/mate will have to become dynamic while keeping the same structure. Maybe we could have a field that says how many tags are linked, so that the rest of the line is easily reconstructed.
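
One way to picture the "count field" idea is the sketch below. It is not SAM and not a proposal from the SAM authors: the layout (name, index of this tag, number of linked tags, then one reference/position pair per tag) is assumed purely for illustration, and `LinkedRecord`, `format_record`, and `parse_record` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LinkedRecord:
    """One line describing one tag of an n-tag fragment (hypothetical layout)."""
    qname: str                    # fragment name shared by all linked tags
    index: int                    # 0-based index of the tag this line describes
    links: List[Tuple[str, int]]  # (reference name, position) of every tag


def format_record(rec: LinkedRecord) -> str:
    """Write the count first, then one (reference, position) pair per tag."""
    fields = [rec.qname, str(rec.index), str(len(rec.links))]
    for ref, pos in rec.links:
        fields += [ref, str(pos)]
    return "\t".join(fields)


def parse_record(line: str) -> LinkedRecord:
    """Use the count field to decide how many link columns to read."""
    fields = line.rstrip("\n").split("\t")
    qname, index, count = fields[0], int(fields[1]), int(fields[2])
    if len(fields) != 3 + 2 * count:
        raise ValueError("link columns do not match the count field")
    links = [(fields[3 + 2 * i], int(fields[4 + 2 * i])) for i in range(count)]
    return LinkedRecord(qname, index, links)
```

Because the count comes before the variable-width part, a parser knows how many columns to expect without any change to the rest of the line's structure.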

    A very real problem is that rapidly iterating across the links will become difficult when the SAM file is sorted by coordinate. This can already be seen when trying to iterate across multiple alignments (some on different chromosomes).
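
To make the buffering cost concrete, here is a rough sketch of regrouping linked tags from a coordinate-sorted stream. It assumes each record has been reduced to a hypothetical `(qname, ref, pos)` tuple and that every fragment has exactly `n_linked` tags; neither assumption comes from the SAM specification, and `group_linked` is an illustrative name.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str, int]  # (fragment name, reference name, position)


def group_linked(
    records: Iterable[Record], n_linked: int
) -> Iterator[Tuple[str, List[Tuple[str, int]]]]:
    """Yield (name, tags) as soon as all n_linked tags of a fragment are seen."""
    pending: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for qname, ref, pos in records:
        pending[qname].append((ref, pos))
        if len(pending[qname]) == n_linked:
            # The fragment is complete, so it can leave the buffer.
            yield qname, pending.pop(qname)
    # Anything still in `pending` here is an incomplete fragment and is dropped.
```

Every fragment whose tags have not all been seen stays in `pending`, so a fragment with a tag on a later chromosome is held in memory until that chromosome is reached.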

    It would be helpful if the sequencing companies could help in this discussion by giving us insight into what they intend to release so that we can remain proactive in developing good open formats.

    My 2.097 cents CAD.


    • #3
      How did you generate "triple-end" data? Do you have access to the magic, unreleased SOLiD reagents for reading "backwards"?


      • #4
        We have discussed this issue before. A potential solution is to keep all ends in a loop, e.g. in the mate position field, end1 points to end2, end2 to end3, and end3 back to end1. We can also keep the positions of all ends in a tag for each alignment. However, the real question is how we are going to use multi-end information, which is why nothing has been decided yet. But I do not see this as a major problem.
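
The loop idea can be sketched as follows. This is only an illustration under stated assumptions: `End`, `link_in_loop`, and `walk_loop` are hypothetical names, `next_ref`/`next_pos` stand in for SAM's mate reference and mate position columns, and the ends of one fragment are assumed to map to distinct locations.

```python
from typing import Dict, List, NamedTuple, Tuple


class End(NamedTuple):
    ref: str       # reference this end aligns to
    pos: int       # position of this end
    next_ref: str  # reference of the next end in the loop
    next_pos: int  # position of the next end in the loop


def link_in_loop(ends: List[Tuple[str, int]]) -> List[End]:
    """Point each end at the following one; the last points back to the first."""
    n = len(ends)
    return [End(ref, pos, *ends[(i + 1) % n]) for i, (ref, pos) in enumerate(ends)]


def walk_loop(start: End, by_location: Dict[Tuple[str, int], End]) -> List[End]:
    """Follow the mate pointers from any end until the loop closes."""
    ends = [start]
    current = by_location[(start.next_ref, start.next_pos)]
    while current != start:
        if len(ends) > len(by_location):
            raise ValueError("mate pointers do not form a closed loop")
        ends.append(current)
        current = by_location[(current.next_ref, current.next_pos)]
    return ends
```

Starting from any one end, the full set of ends is recovered by following pointers, without widening the record. In a real file the lookup would also need the read name, which is omitted here for brevity.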


        • #5
          Originally posted by krobison
          How did you generate "triple-end" data? Do you have access to the magic, unreleased SOLiD reagents for reading "backwards"?
          Illumina

          Anyhow, it is not that exciting; I am just saying we will have to solve this problem.


          • #6
            Heng:

            This sort of data may be particularly valuable on the PacBio instrument, depending on how long a fragment you can do it on.

            For example, imagine a series of ~50 nt reads separated by ~500 nt gaps across a 20 kb fragment (I've totally pulled those numbers out of thin air, but I think this is roughly plausible from what I've heard). That would clearly be useful for de novo assembly, but would also help (I think) resolve complex repeat structures in resequencing applications.
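
As a back-of-the-envelope check on those numbers, the sketch below lays out evenly spaced tags along one fragment. The read length, gap, and fragment size are the admittedly made-up figures from the post, `tag_starts` is a hypothetical helper, and real spacing would certainly not be this regular.

```python
from typing import List


def tag_starts(fragment_len: int, read_len: int, gap_len: int) -> List[int]:
    """Return 0-based start offsets of evenly spaced tags that fit in the fragment."""
    if read_len <= 0 or gap_len < 0:
        raise ValueError("read_len must be positive and gap_len non-negative")
    period = read_len + gap_len
    # n tags need n * read_len + (n - 1) * gap_len bases in total.
    n_tags = (fragment_len + gap_len) // period
    return [i * period for i in range(n_tags)]


starts = tag_starts(fragment_len=20_000, read_len=50, gap_len=500)
covered = len(starts) * 50
print(len(starts), covered, covered / 20_000)
```

Under these assumptions a 20 kb fragment yields 37 linked tags covering 1,850 bases, a bit over 9% of the fragment, which is far more links per record than a single mate field can describe.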
