  • latest SAM format: did the meanings of read and fragment get swapped?

    In the latest SAM format specification, the meanings of "read" and "fragment" seem to be reversed from what I am accustomed to.

    My understanding is that a fragment is the result of breaking a piece of DNA or RNA (e.g., a chromosome, cDNA, or mRNA) into smaller pieces (e.g., by shearing or nebulization). It is a subsequence of the original DNA or RNA. And reads are subsequences of a fragment (e.g., for paired end reads, from sequencing both ends of the fragment).

    However, in the latest SAM Format Specification (v 1.4-r962), April 17, 2011, if I am understanding the specs correctly, the meanings of "fragment" and "read" have been swapped. (The specs can be downloaded from http://samtools.sourceforge.net/, under the "General Information" heading, under "SAM Spec v1.4".)

    In that document, the definitions are (with my emphases):
    • Template: A DNA/RNA sequence part of which is sequenced on a sequencing machine or assembled from raw sequences.
    • Fragment: A contiguous (sub)sequence on a template which is sequenced or assembled. For sequencing data, fragments are indexed by the order in which they are sequenced. For fragments of an assembled sequence, they are indexed by the order of the leftmost coordinate on the assembled sequence.
    • Read: A raw sequence that comes off a sequencing machine. A read may consist of multiple fragments.


    And the bitwise FLAGs are:
    • 0x1 template having multiple fragments in sequencing
    • 0x2 each fragment properly aligned according to the aligner
    • 0x4 fragment unmapped
    • 0x8 next fragment in the template unmapped
    • 0x10 SEQ being reverse complemented
    • 0x20 SEQ of the next fragment in the template being reversed
    • 0x40 the first fragment in the template
    • 0x80 the last fragment in the template
    • 0x100 secondary alignment
    • 0x200 not passing quality controls
    • 0x400 PCR or optical duplicate


    Whereas in the samtools man page (http://samtools.sourceforge.net/samtools.shtml), the SAM format bitwise flags are:
    • 0x0001 the read is paired in sequencing
    • 0x0002 the read is mapped in a proper pair
    • 0x0004 the query sequence itself is unmapped
    • 0x0008 the mate is unmapped
    • 0x0010 strand of the query (1 for reverse)
    • 0x0020 strand of the mate
    • 0x0040 the read is the first read in a pair
    • 0x0080 the read is the second read in a pair
    • 0x0100 the alignment is not primary
    • 0x0200 the read fails platform/vendor quality checks
    • 0x0400 the read is either a PCR or an optical duplicate
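
    The two lists above use the same bit values and differ only in wording, so a flag can be decoded under both vocabularies side by side. The following is a minimal sketch, not part of samtools: the table `FLAG_BITS` and the function `decode_flag` are hypothetical names, and the labels are abbreviated paraphrases of the two lists quoted above.

```python
# Hypothetical lookup table: (bit, SAM spec v1.4 wording, samtools man page wording).
# Bit values come from the two lists quoted in the post; labels are abbreviated.
FLAG_BITS = [
    (0x1,   "template has multiple fragments", "read is paired"),
    (0x2,   "each fragment properly aligned",  "read mapped in proper pair"),
    (0x4,   "fragment unmapped",               "query unmapped"),
    (0x8,   "next fragment unmapped",          "mate unmapped"),
    (0x10,  "SEQ reverse complemented",        "query on reverse strand"),
    (0x20,  "SEQ of next fragment reversed",   "mate on reverse strand"),
    (0x40,  "first fragment in template",      "first read in pair"),
    (0x80,  "last fragment in template",       "second read in pair"),
    (0x100, "secondary alignment",             "alignment not primary"),
    (0x200, "not passing quality controls",    "fails quality checks"),
    (0x400, "PCR or optical duplicate",        "PCR or optical duplicate"),
]


def decode_flag(flag):
    """Return (spec_labels, man_page_labels) for every bit set in `flag`."""
    spec_labels = []
    man_labels = []
    for bit, spec_label, man_label in FLAG_BITS:
        if flag & bit:
            spec_labels.append(spec_label)
            man_labels.append(man_label)
    return spec_labels, man_labels


if __name__ == "__main__":
    # 99 == 0x1 + 0x2 + 0x20 + 0x40
    spec, man = decode_flag(99)
    for s, m in zip(spec, man):
        print(f"{s:35s} | {m}")
```

    Under this reading, what the man page calls a "read" in a pair is what the v1.4 spec calls a "fragment" of a template; the bits themselves are unchanged.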


    I am confused. As sequencing moves toward more than just two reads (paired ends) per piece of DNA/RNA, are the meanings of the terms "fragment" and "read" changing?
    Last edited by d f; 08-03-2011, 01:51 PM. Reason: Adding URL

  • #2
    Looking back, I don't think they have changed their definitions of read and fragment, although I admit it is somewhat natural to use "fragment" for the pieces of broken DNA. We can "read" that subsequence with many fragments (pairs, strobes, etc.). The spec allows for multi-fragment reads via the FLAG field (first/last fragment).


    • #3
      I had the same questions as df when looking at the SAMv1.4 specs. It seems that the bioinformatics community is using the word "fragment" to mean two different things.

      For example, in the Cufflinks FAQ, there is a discussion about FPKM versus RPKM. They write "Paired-end RNA-Seq experiments produce two reads per fragment". In the SAM specs, it seems that a paired-end read is counted as one read with two fragments.
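
      To make the two counting units concrete, here is a rough sketch under stated assumptions: alignment records are represented as hypothetical `(name, flag)` tuples, one per sequenced end, and two ends belong together when they share a name. The function `count_ends_and_templates` is illustrative only and is not how Cufflinks or samtools compute anything.

```python
def count_ends_and_templates(records):
    """Count sequenced ends and distinct templates.

    `records` is an iterable of (name, flag) tuples (a hypothetical,
    simplified stand-in for SAM alignment lines). Records with the 0x100 bit
    (secondary alignment) are skipped so that each end is counted once.
    """
    n_ends = 0
    templates = set()
    for name, flag in records:
        if flag & 0x100:
            continue
        n_ends += 1
        templates.add(name)
    return n_ends, len(templates)


if __name__ == "__main__":
    records = [
        ("t1", 99),    # first end of a pair
        ("t1", 147),   # second end of the same pair
        ("t2", 0),     # single-end record
        ("t1", 355),   # secondary alignment of t1's first end (99 + 0x100)
    ]
    n_ends, n_templates = count_ends_and_templates(records)
    print(n_ends, n_templates)
```

      In the Cufflinks wording, the first number counts "reads" and the second counts "fragments". In the SAM v1.4 wording as I read it, the first number counts "fragments" and the second counts "templates". The numbers are the same; only the names attached to them differ.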

      Another part that confused me was that a fragment could be a subsequence but a read is a "raw sequence" (and not "raw (sub)sequence") - shouldn't a paired-end read be considered a sub-sequence of the template?

      I think it would be great if we could standardize the terminology, at the very least to cut down on the questions about "RPKM versus FPKM".

      thanks!
      Justin
