  • latest SAM format: did the meanings of read and fragment get swapped?

    In the latest SAM format specification, the meanings of "read" and "fragment" seem to be reversed from what I am accustomed to.

    My understanding is that a fragment is the result of breaking a piece of DNA or RNA (e.g., a chromosome, cDNA, or mRNA) into smaller pieces (e.g., by shearing or nebulization). It is a subsequence of the original DNA or RNA. And reads are subsequences of a fragment (e.g., for paired end reads, from sequencing both ends of the fragment).

    However, in the latest SAM Format Specification (v 1.4-r962), April 17, 2011, if I am understanding the specs correctly, the meanings of "fragment" and "read" have been swapped. (The specs can be downloaded from http://samtools.sourceforge.net/ , under the "General Information" heading, under "SAM Spec v1.4".)

    In that document, the definitions are (with my emphases):
    • Template: A DNA/RNA sequence part of which is sequenced on a sequencing machine or assembled from raw sequences.
    • Fragment: A contiguous (sub)sequence on a template which is sequenced or assembled. For sequencing data, fragments are indexed by the order in which they are sequenced. For fragments of an assembled sequence, they are indexed by the order of the leftmost coordinate on the assembled sequence.
    • Read: A raw sequence that comes off a sequencing machine. A read may consist of multiple fragments.


    And the bitwise FLAGs are:
    • 0x1 template having multiple fragments in sequencing
    • 0x2 each fragment properly aligned according to the aligner
    • 0x4 fragment unmapped
    • 0x8 next fragment in the template unmapped
    • 0x10 SEQ being reverse complemented
    • 0x20 SEQ of the next fragment in the template being reversed
    • 0x40 the first fragment in the template
    • 0x80 the last fragment in the template
    • 0x100 secondary alignment
    • 0x200 not passing quality controls
    • 0x400 PCR or optical duplicate


    In the samtools man page (http://samtools.sourceforge.net/samtools.shtml), by contrast, the SAM format bitwise flags are:
    • 0x0001 the read is paired in sequencing
    • 0x0002 the read is mapped in a proper pair
    • 0x0004 the query sequence itself is unmapped
    • 0x0008 the mate is unmapped
    • 0x0010 strand of the query (1 for reverse)
    • 0x0020 strand of the mate
    • 0x0040 the read is the first read in a pair
    • 0x0080 the read is the second read in a pair
    • 0x0100 the alignment is not primary
    • 0x0200 the read fails platform/vendor quality checks
    • 0x0400 the read is either a PCR or an optical duplicate
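The two flag tables above can be lined up bit by bit. The following is a minimal sketch in Python that pairs the v1.4 spec wording with the man page wording for each bit, taken verbatim from the lists above. The names `FLAG_BITS` and `decode_flag` are made up for illustration and are not part of samtools.

```python
# Illustrative only: FLAG_BITS and decode_flag are hypothetical names,
# not part of samtools or any SAM library.
# Each entry: (bit, SAM spec v1.4 wording, samtools man page wording).
FLAG_BITS = [
    (0x1, "template having multiple fragments in sequencing",
          "the read is paired in sequencing"),
    (0x2, "each fragment properly aligned according to the aligner",
          "the read is mapped in a proper pair"),
    (0x4, "fragment unmapped",
          "the query sequence itself is unmapped"),
    (0x8, "next fragment in the template unmapped",
          "the mate is unmapped"),
    (0x10, "SEQ being reverse complemented",
           "strand of the query (1 for reverse)"),
    (0x20, "SEQ of the next fragment in the template being reversed",
           "strand of the mate"),
    (0x40, "the first fragment in the template",
           "the read is the first read in a pair"),
    (0x80, "the last fragment in the template",
           "the read is the second read in a pair"),
    (0x100, "secondary alignment",
            "the alignment is not primary"),
    (0x200, "not passing quality controls",
            "the read fails platform/vendor quality checks"),
    (0x400, "PCR or optical duplicate",
            "the read is either a PCR or an optical duplicate"),
]


def decode_flag(flag):
    """Return (bit, spec_wording, man_page_wording) for every bit set in flag."""
    return [entry for entry in FLAG_BITS if flag & entry[0]]


if __name__ == "__main__":
    # 99 == 0x63 == 0x40 + 0x20 + 0x2 + 0x1
    for bit, spec, man in decode_flag(99):
        print(f"{bit:#06x}  {spec}  |  {man}")
```

For a flag of 99 (0x63), bits 0x1, 0x2, 0x20 and 0x40 are set. The bit values are identical in both documents; only the vocabulary used to describe them differs.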


    I am confused. As sequencing moves toward more than just two reads (paired ends) per piece of DNA/RNA, are the meanings of the terms "fragment" and "read" changing?
    Last edited by d f; 08-03-2011, 01:51 PM. Reason: Adding URL

  • #2
    Looking back, I don't think they have changed their definitions of read and fragment, although I admit it is somewhat natural to call the broken pieces of DNA "fragments". We can "read" that subsequence with many fragments (pairs, strobes, etc.). The spec allows for multi-fragment reads through the FLAG field (first/last fragment).

    • #3
      I had the same questions as df when looking at the SAMv1.4 specs. It seems that the bioinformatics community is using the word "fragment" to mean two different things.

      For example, in the Cufflinks FAQ, there is a discussion about FPKM versus RPKM. They write "Paired-end RNA-Seq experiments produce two reads per fragment". In the SAM specs, it seems that a paired-end read is counted as one read with two fragments.
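To make the two counting conventions concrete, here is a rough sketch in Python that tallies both units from a list of alignment records. It assumes each record is a `(QNAME, FLAG)` pair and that all records from the same template share a QNAME; the function name `count_units` and the sample records are invented for illustration and come from neither Cufflinks nor samtools.

```python
# Illustrative only: count_units is a hypothetical helper, not a Cufflinks
# or samtools function.
SECONDARY = 0x100  # "secondary alignment" / "the alignment is not primary"


def count_units(records):
    """Count sequenced ends and distinct templates.

    records: iterable of (qname, flag) tuples.
    Returns (ends, templates), where
      ends      = number of non-secondary records
                  ("reads" in the Cufflinks FAQ, "fragments" in SAM v1.4)
      templates = number of distinct QNAMEs among those records
                  ("fragments" in the Cufflinks FAQ, "templates" in SAM v1.4)
    """
    ends = 0
    templates = set()
    for qname, flag in records:
        if flag & SECONDARY:
            continue  # do not count the same end twice
        ends += 1
        templates.add(qname)
    return ends, len(templates)


if __name__ == "__main__":
    sample = [
        ("t1", 99),   # first end of a pair
        ("t1", 147),  # second end of the same pair
        ("t2", 0),    # single-end record
        ("t2", 256),  # secondary alignment of t2, skipped
    ]
    print(count_units(sample))
```

With the sample records, the first number counts three sequenced ends and the second counts two templates. A per-read measure would be based on the first number and a per-fragment measure (in the Cufflinks sense) on the second.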

      Another part that confused me was that a fragment could be a subsequence but a read is a "raw sequence" (and not "raw (sub)sequence") - shouldn't a paired-end read be considered a sub-sequence of the template?

      I think it would be great if we could standardize the terminology, at the very least to cut down on the questions about "RPKM versus FPKM".

      thanks!
      Justin
