How to infer Illumina paired-end strand specificity from SAM output?

David Harmin

Junior Member

Join Date: Feb 2011

Posts: 1
- Share
- Tweet
#1

How to infer Illumina paired-end strand specificity from SAM output?

02-16-2011, 09:34 AM

Hello, everyone, I'm new here.

I usually analyze unpaired-read data from SOLiD sequencing of RNA-Seq libraries. The output from mapping these reads is simple and unambiguously specifies whether each read was mapped to a chromosome's POS or NEG strand. Now for the first time I am looking at paired-end reads from Solexa sequencing of an RNA-Seq sample. The library was prepared via a dUTP second-strand method. These reads were mapped (to the mouse genome) by someone else using the program BWA. I've been handed a .bam file, and have since uncompressed it to SAM and familiarized myself with SAMtools and the SAM format.

At the moment I'm only concerned with characterizing individual reads, not with their pairing per se. My question concerns how to infer the strand to which each read maps, specifically whether a read is sense or antisense to a gene the read overlaps. I know that the 0x10 bit of SAM's FLAG field indicates reverse complementation (rc), which I take to mean that the read mapped to the NEG strand of the reference sequence. The RNA-Seq sample in question should almost entirely map to genes in the sense direction. I find that almost all -- but not quite all -- "left"-end reads of a pair do map to the POS reference strand ... but that almost all "right"-end reads of a pair map to the NEG strand.

Is it correct to assume that the right-most member of a pair, when rc'ed and thereby mapped to the NEG strand, actually originated on the POS strand, i.e., the same strand as the other member of its pair? In other words, if the left pair member is not rc'ed and the right pair member is rc'ed, do they both "really" map to the POS strand? And if the left member is rc'ed and the right member is not rc'ed, do they both map to the NEG strand? I can't find any reference that specifies "paired reads from the left and right [or 5' and 3'] ends of a fragment map by convention to the same and opposite strand, respectively, as the originating RNA fragment" or some such thing.

Looking at two highly expressed genes, one annotated on its chromosome's POS strand and the other on a NEG strand, I find that for both of them almost all the left-end reads are not rc'ed and almost all the right-end reads are rc'ed. If this holds for all genes one would conclude, depending on the convention used, that either every read maps to a POS strand, every read maps to a NEG strand, or close to half of reads are found on each strand irrespective of gene orientation. This just doesn't make sense to me, since reads stemming from transcription of these and other genes should align to the genes in the sense orientation -- i.e., they should map to the genes' strands in the reference genome.

I need to sort this out before deciding whether each read derives from an RNA fragment that was transcribed sense or antisense to a gene.

Thanks!

Addendum: I don't quite see the conistency between the sign of SAM's TLEN field and bit 0x40 bit of the FLAG field, where the flag indicates "first fragment in the template". ... Before I used TLEN >= 0 to identify "leftmost" (i.e., first) fragments. However, if I use instead use a set x40 bit to indicate first fragment, I find the following. For the gene whose reads should align to the POS strand, most reads in the SAM output are either ( 1st fragment and reverse-complemented ) or ( not 1st fragment and not rc'ed ). For the NEG-strand gene, I find the opposite trend, either ( 1st fragment and not rc'ed ) or ( not 1st fragment and rc'ed ). Good that these are now different. However, they each seem to suggest the wrong strand if one goes by whether the first fragment is reverse-complemented.

But the "wrong" strand would be reported if the whole sample actually consisted of reverse-complemented copies of RNA transcripts. This does seem to be the case in the dUTP second-strand sample prep: uridine digestion leaves the strand complementary to the initial transcript. Does this make sense?

If that's correct, then for this library prep a read should be assigned the NEG strand if and only if

-- it is a "first" fragment (FLAG x40 bit = 1) and its SEQ is not reverse-complemented (FLAG x10 bit = 0), or

-- it isn't a first fragment (x40 bit = 0) and its SEQ is reverse-complemented (x10 bit = 1).

Otherwise it's on the POS strand.

Last edited by David Harmin; 02-16-2011, 08:26 PM. Reason: Update of test cases
Tags: illumina rna-seq, paired end reads, sam, strand-specific

Previous template Next

Advancing Precision Medicine for Rare Diseases in Children

by seqadmin

Many organizations study rare diseases, but few have a mission as impactful as Rady Children’s Institute for Genomic Medicine (RCIGM). “We are all about changing outcomes for children,” explained Dr. Stephen Kingsmore, President and CEO of the group. The institute’s initial goal was to provide rapid diagnoses for critically ill children and shorten their diagnostic odyssey, a term used to describe the long and arduous process it takes patients to obtain an accurate...
- Channel: Articles
12-16-2024, 07:57 AM
Recent Advances in Sequencing Technologies

by seqadmin

Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

Long-Read Sequencing
Long-read sequencing has seen remarkable advancements,...
- Channel: Articles
12-02-2024, 01:49 PM

Topics	Statistics	Last Post
Evaluating Genome Sequencing for ECMO Patients in the NICU by seqadmin Started by seqadmin, 12-17-2024, 10:28 AM	0 responses 22 views 0 likes	Last Post by seqadmin 12-17-2024, 10:28 AM
New Genetic Toolkit Refines Studies on Gene Function and Disease by seqadmin Started by seqadmin, 12-13-2024, 08:24 AM	0 responses 42 views 0 likes	Last Post by seqadmin 12-13-2024, 08:24 AM
Study Links Brain Mechanism to Emotional Responses in Animals and Humans by seqadmin Started by seqadmin, 12-12-2024, 07:41 AM	0 responses 28 views 0 likes	Last Post by seqadmin 12-12-2024, 07:41 AM
Study Identifies Ribosomal RNA Fingerprints as Early Cancer Biomarkers by seqadmin Started by seqadmin, 12-11-2024, 07:45 AM	0 responses 42 views 0 likes	Last Post by seqadmin 12-11-2024, 07:45 AM

Seqanswers Leaderboard Ad

Announcement

How to infer Illumina paired-end strand specificity from SAM output?

Latest Articles

ad_right_rmr

News