Seqanswers Leaderboard Ad

**flobpf** · 03-17-2011, 12:19 PM

Strandness in GTF/GFF file: Strand of the gene, the coding strand of the RNA, the non-template strand

Strandness in column 2 of SAM file: The strand the read being mapped onto the reference

Strandness of Tophat XS:A tag: Strand of the RNA that produce the read.

What I don't understand is that "Strand of the RNA" and "Strand of the gene in GTF". I think they are the same thing?

If I am correct, then for a gene A sit on the +ve strand. Then I should only say the read comes from this gene A if Tophat tag this pair with XS:A:+ve.

Marcowanger, I think you are right. Please see this image attached for an illustrated explanation (and let me know if I'm wrong)

So the read obtained from directional sequencing corresponds to the template strand of the gene (not coding strand). Having said that, I infer the following. However, if we use XS tag in TopHat, we'd get the sense/antisense wrong (thanks Marcowanger for pointing that this was not an HTSeq issue):

Is my understanding correct? Or am I being mistaken?

Flobpf

UPDATE June 10, 2011: I contacted Illumina and they were confused too. However, finally they (and people at TopHat and our sequencing center) resolved the issue. The reads that come out of the machine have the same sequence as the CODING strand of the DNA and not the template strand.

**marcowanger** · 03-17-2011, 08:01 PM

To strengthen our understanding on this issue, perhaps we should take a look at the paper on strand specific sequencing.

I take this for an example. http://nar.oxfordjournals.org/conten...e123/suppl/DC1

Please see below attachment:
nar-00743-met-g-2009-File006.marco.pdf

I have add my annotation to the above supplementary file (highlighting for better illustration). Basically if you read to page 4, we know the read's sequence is identical to the 1st strand cDNA (i.e. complementary to the DNA coding strand, a.k.a mRNA sequence with T-U substitution). Then flobpf you are right. The read should corresponds to DNA's template strand, which is also complementary to the strandness in GTF/GFF file (which indicates the gene coding strand)

XS:A:+/-:

This is the tophat's tag not HTSeq tag.

If the interpretation of XS:A is right

XS: "+" for GT-AG and "-" for CT-AC

(http://seqanswers.com/forums/showthr...=tophat+strand)

As HT-Seq count only the read map to the Gene coding strand, the read from a gene-coding-strand being +ve will be mapped to -ve DNA strand (A gene with template strand -ve, coding strand+ve produces a mRNA that will be mapped to template strand -ve), then being discarded by HT-seq.

By the way, anyone know how HT-seq deals with paired-ended strand specific sequencing? Does it only look at Forward pair to infer strandness?

Originally posted by flobpf View Post

Marcowanger, I think you are right. Please see this image attached for an illustrated explanation (and let me know if I'm wrong)

So the read obtained from directional sequencing corresponds to the template strand of the gene (not coding strand). Having said that, I infer the following. However, HTSeq-counts seems to guess the sense/antisense wrong:

Is my understanding correct? Or am I being mistaken?

Flobpf

**marcowanger** · 03-17-2011, 08:07 PM

As Simon assumes

I assume strand-specific RNA-Seq sequences the coding strand

.

Then it probably explains why I got extremely low count for nearly all of the genes.

**Simon Anders** · 03-22-2011, 02:08 AM

Just to quickly clarify:

htseq-count does not look at the 'XS' optional field, only a the strand information according to the FLAG field. The strand information in the FLAG field (bit 0x10) specifies whether the sequence as read by the sequencer and as found in the FASTQ file has the same orientation as the sequence in the reference FASTA file ("+" strand, bit cleared), or whether the sequences in FASTQ and FASTA file are reverse-complements of each other ("-" strand, bit set).

With "--stranded=no", htseq-count ignored the strand bit. With "--stranded=yes" and single-end reads, it will count a read only if the alignment strand information according to the FLAG field is the same as the strand information for the gene/exon in the GFF file. For paired-end data, the strands have to be the same for the mate from the first sequencing pass and opposite for the mate from the second pass.

I guess I should add a new option "--stranded=reverse" that reverses this, i.e., counts reads only if, for single end, the strand informations in FLAG field and GFF file are opposite, and, for paired-end, if the first pass mate has opposite and the second pass mate same strand as the gene.

I hope this makes sense.

Simon

**marcowanger** · 03-22-2011, 02:18 AM

Thanks Simon.

Originally posted by Simon Anders View Post

Just to quickly clarify:

htseq-count does not look at the 'XS' optional field, only a the strand information according to the FLAG field. The strand information in the FLAG field (bit 0x10) specifies whether the sequence as read by the sequencer and as found in the FASTQ file has the same orientation as the sequence in the reference FASTA file ("+" strand, bit cleared), or whether the sequences in FASTQ and FASTA file are reverse-complements of each other ("-" strand, bit set).

With "--stranded=no", htseq-count ignored the strand bit. With "--stranded=yes" and single-end reads, it will count a read only if the alignment strand information according to the FLAG field is the same as the strand information for the gene/exon in the GFF file. For paired-end data, the strands have to be the same for the mate from the first sequencing pass and opposite for the mate from the second pass.

I guess I should add a new option "--stranded=reverse" that reverses this, i.e., counts reads only if, for single end, the strand informations in FLAG field and GFF file are opposite, and, for paired-end, if the first pass mate has opposite and the second pass mate same strand as the gene.

I hope this makes sense.

Simon

**flobpf** · 06-10-2011, 08:16 AM

Resolved

UPDATE June 10, 2011: I contacted Illumina and they were confused too. However, finally they (and people at TopHat and our sequencing center) resolved the issue. The reads that come out of the machine have the same sequence as the CODING strand of the DNA and not the template strand.

The correct option to use for Cufflinks is fr-secondstrand
Flobpf

**marcowanger** · 06-10-2011, 09:25 PM

Flobpf, nice to hear from you again.

May you kindly do some illustration? I got the protocol from our sequencing center and it seems the other way round. Maybe we have different protocol?

Marco

Originally posted by flobpf View Post

UPDATE June 10, 2011: I contacted Illumina and they were confused too. However, finally they (and people at TopHat and our sequencing center) resolved the issue. The reads that come out of the machine have the same sequence as the CODING strand of the DNA and not the template strand.

The correct option to use for Cufflinks is fr-secondstrand
Flobpf

**flobpf** · 06-15-2011, 10:26 AM

Depends on method

Originally posted by marcowanger View Post

Flobpf, nice to hear from you again.

May you kindly do some illustration? I got the protocol from our sequencing center and it seems the other way round. Maybe we have different protocol?

Marco

It depends on what method you are using. If you are using Illumina directional sequencing protocol, you get the coding strand sequence. With dUTP, its the template strand. The correct options to use are noted here now:

404 Not Found

http://cufflinks.cbcb.umd.edu/manual.html#library

**marcowanger** · 06-16-2011, 09:57 PM

It make more sense. We used dUTP so we got template strand.

Originally posted by flobpf View Post

It depends on what method you are using. If you are using Illumina directional sequencing protocol, you get the coding strand sequence. With dUTP, its the template strand. The correct options to use are noted here now:
http://cufflinks.cbcb.umd.edu/manual.html#library

**marcowanger** · 06-29-2011, 01:01 AM

Dear Simon,

How can I modify the code to reverse the --strand from counting coding strand to template strand?

Originally posted by Simon Anders View Post

Just to quickly clarify:

htseq-count does not look at the 'XS' optional field, only a the strand information according to the FLAG field. The strand information in the FLAG field (bit 0x10) specifies whether the sequence as read by the sequencer and as found in the FASTQ file has the same orientation as the sequence in the reference FASTA file ("+" strand, bit cleared), or whether the sequences in FASTQ and FASTA file are reverse-complements of each other ("-" strand, bit set).

With "--stranded=no", htseq-count ignored the strand bit. With "--stranded=yes" and single-end reads, it will count a read only if the alignment strand information according to the FLAG field is the same as the strand information for the gene/exon in the GFF file. For paired-end data, the strands have to be the same for the mate from the first sequencing pass and opposite for the mate from the second pass.

I guess I should add a new option "--stranded=reverse" that reverses this, i.e., counts reads only if, for single end, the strand informations in FLAG field and GFF file are opposite, and, for paired-end, if the first pass mate has opposite and the second pass mate same strand as the gene.

I hope this makes sense.

Simon

**Simon Anders** · 06-29-2011, 01:08 AM

I've added the '--standed=reverse' option promised above since then.

**marcowanger** · 06-29-2011, 01:46 AM

Dear Simon, I just downloaded 0.5.1.p3 .

It seems there is no --stranded=reverse?

Options:
-h, --help show this help message and exit
-m MODE, --mode=MODE mode to handle reads overlapping more than one
feature(choices: union, intersection-strict,
intersection-nonempty; default: union)
-s STRANDED, --stranded=STRANDED
whether the data is from a strand-specific assay
(default: yes)

is it still a hidden function?

Originally posted by Simon Anders View Post

I've added the '--standed=reverse' option promised above since then.

**marcowanger** · 06-29-2011, 01:52 AM

This is the error message I got from 0.5.1 p3

htseq-count: error: option -s: invalid choice: 'reverse' (choose from 'yes', 'no')

**Simon Anders** · 06-29-2011, 05:53 AM

Sorry, it seems I forgot to upload the latest version. Please try with 0.5.3.

**billstevens** · 08-21-2012, 09:41 AM

I'm sorry, I just read this entire thread, and I still have no idea if my data is strand-specific or not. The core I send it to uses the old TruSeq RNA kit, http://epigenome.usc.edu/docs/resour...15008136_A.pdf

I was looking at my accepted_hits.bam file from Tophat to figure it out, but the second column doesn't have a 0 or a 16. Here is the actual output:

HWI-1KL118:23:C0J57ACXX:8:1308:11486:195428 pPR1s chr1 565039 3 100M = 565073 135 CCGTCATCTACTCTACCATCTTTGCAGGCACACTCATCACAGCGCTAAGCTCGCACTGATTTTTTACCTGAGTAGGCCTAGAAATAAACATGCTAGCTTT @CCDDFFFHHDFHIJJJIEHIJJJFHIGHIEIIHFGHBGIIGHGHGHGGHGJJJBGHJCHHEEEDCEECECCCCDCCCCCCDDDDDCDDDDDDCDDDACC NM:i:0 NH:i:2 CC:Z:chrM CP:i:4490 HI:i:0

Can anyone help?

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 26 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 29 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 25 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News