Seqanswers Leaderboard Ad

**Haneko** · 04-13-2010, 10:10 PM

Hi there,

I believe the second column (the flag field. For more information, please refer to SAM format definition) contains the strand information. So you could write a script to determine the strand for that alignment and append the XS:A: field at the end of each line.

**Zimbobo** · 04-14-2010, 03:03 PM

Yes, the 2nd column of sam file could contain this information, but it appears (at least in tophat 1.0.13) the output sam file - accepted_hits.sam - contains the numbers 0 or 16 in this column. I am not sure what this means.

For the sam output of bowtie (0.12.5) it's the same story.

I got around that by running bowtie such that it produces its standard output, there the + - strand information is preserved and then use bowtie2sam.pl (one of the samtools tools).

I still don't know how to get the strand information out of tophat, if it's possible at all.

**Haneko** · 04-14-2010, 05:14 PM

The strand is 'encoded' in hexadecimal within the field values. So to get the strand information, you will have to do an "AND" operation on the flag field.

In the SAM definition for the flag field:
0x0010 strand of the query (1 for reverse)

So if you do AND between the flag and 0x0010 and you get a 1, the entry is on the negative strand, else it's on the positive strand.

**Zimbobo** · 04-15-2010, 01:29 PM

That was the info I needed. Thanks.

**Angela** · 05-05-2010, 09:35 AM

Hi Haneko,

If it were so I suppose I shouldn't be getting the following from TopHat:

Code:

HWI-EAS202_170:5:4:655:1891     0       chr1    11102910        255     15M10277N21M    *       0       0       GGATCATCCACCATGTTACCGATAAGCACCAGTTCA    B>@A@BA@>@??BAB@B@@@@??A>@;@>6>@:?>@    NM:i:0  XS:A:+  NS:i:0
HWI-EAS202_170:5:12:784:1887    16      chr1    11102913        255     12M10277N24M    *       0       0       TCATCCACCATGTTACCGATAAGCACCAGTTCAAAC    @@??AA9@AA@=AA9B@@A@BB?A9@BB@B@BBA@B    NM:i:0  XS:A:+  NS:i:0

One read has flag 0 and the other 16, meaning one maps to the forward and the other to the reverse, yet both have XS set to +

Did I get it wrong? I'm using TopHat v1.0.13
Thanks

**xhuister** · 05-08-2010, 06:16 PM

Hi Angela,

I'm using TopHat v1.0.13 too, but there is no these two columns "XS:A:+ NS:i:0" in my output files.

So, I used this command to add "XS:A:" column to the .sam file for cufflinks:
the flag column number is 2, when 16 then strand=- else strand='+'

awk -F'\t' 'BEGIN { OFS="\t"} {if($2==16) {$(NF+1)="XS:A:-"; print $0,$(NF+1)} else {$(NF+1)="XS:A:+"; print $0,$(NF+1)}}' test.sam > test.sam2

**Thomas Doktor** · 05-09-2010, 04:30 AM

I believe the XS tag is to be understood as the strand of the transcript which was sequenced and not the strand which the read itself was aligned to. That is why you, Angela, see reads aligned to opposite strands with identical XS tag. Because of that, it would be unwise to simply insert the XS tag manually by a script if you wish to run Cufflinks as a downstream analysis.

xhuister, have you checked if your TopHat output SAM contains the XS tag in the reads covering splice junctions but not in the reads contained within exons? Cufflinks only needs the tag for the reads covering splice junctions so if you just did a quick check of the TopHat output you might have missed it.

**xhuister** · 05-09-2010, 05:10 AM

Hi Thomas,

Thanks. You are right, there is XS tag in some of the reads of TOPHAT output, but in only a very small number of the reads.

$ grep -c "XS:A" accepted_hits.sam
17060
wc -c accepted_hits.sam
64463256

About what your said "Cufflinks only needs the tag for the reads covering splice junctions "?
At first, I thought I could used .sam output of Bowtie, then add "XS" tag to the reads, then use cufflinks to find the transcript boundaries. Am I wrong if I do so?

**Thomas Doktor** · 05-09-2010, 06:03 AM

Yes, it would be wrong to run Cufflinks on a SAM file produced directly by Bowtie. Bowtie does not identify the strand of the transcript (or rather the splice junctions) so you need to align your reads with TopHat and then run Cufflinks. TopHat runs Bowtie as part of its pipeline so you don't need to run anything before TopHat. Having said that, I haven't actually run Cufflinks on Bowtie output, it might be able to do reasonably well but there is simply no reason to do it since you get a much better estimate of transcript abundances by using TopHat and then Cufflinks.

The reason TopHat only reports a strand for the transcript when a read spans a splice junction is that it cannot determine the strand of the read if it is contained within an exon. TopHat determines the strand of a read that spans a splice junction by looking at the splice sites to determine what strand contains a valid GT-AG pair (or AT-AC in case of the minor spliceosome).

**xhuister** · 05-09-2010, 08:26 AM

Thanks Thomas, but I'm rather confused about the strand. I thought that when a read is mapped to the chromosome, then it should be + strand, when mapped to the reverse chromosome, then - strand.
But in what you said "it cannot determine the strand of the read within an exon", is this strand not the strand of read mapping to the chromosome?

**Thomas Doktor** · 05-09-2010, 08:33 AM

Sorry, I was being unclear. TopHat can't determine what strand the original transcript came from, not the read itself, if the read is entirely contained within an exon.

**xhuister** · 05-09-2010, 09:28 AM

Sorry, I'm still confused.

Given the followings:
chromosome AAGGGG....
read1 AAGGGG
read2 CCCCTT

Read1 will map to strand +, read2 strand -. If a transcript contains read1 then I think this transcript will be strand +. I don't know why the strand of this transcript is undetermined?

**Thomas Doktor** · 05-09-2010, 10:04 AM

It depends on the way you prepared your library. If you used a non-strand-specific protocol (which most people still do) you're not actually sequencing transcripts, you're sequencing cDNA with two strands. So the read could come from either of the two strands of a cDNA and you don't have any information which of the two strands corresponds to the original mRNA strand. This can be inferred when a read spans a splice junctions because splice site are highly conserved at the first 2 bases and last 2 bases of an intron.
Another way of inferring directionality would be to look at the exon islands of the transcripts and identify valid open reading frames, but TopHat does not, to my knowledge, employ that strategy.

**xhuister** · 05-09-2010, 10:25 AM

Thank you Thomas. I think the library I analyzed is strand-specific, but I didn't see any option in Tophat to specify strand-specific or non-strand-specific, do you know how to set this in Tophat?

Topics	Statistics	Last Post
Evaluating Genome Sequencing for ECMO Patients in the NICU by seqadmin Started by seqadmin, 12-17-2024, 10:28 AM	0 responses 26 views 0 likes	Last Post by seqadmin 12-17-2024, 10:28 AM
New Genetic Toolkit Refines Studies on Gene Function and Disease by seqadmin Started by seqadmin, 12-13-2024, 08:24 AM	0 responses 43 views 0 likes	Last Post by seqadmin 12-13-2024, 08:24 AM
Study Links Brain Mechanism to Emotional Responses in Animals and Humans by seqadmin Started by seqadmin, 12-12-2024, 07:41 AM	0 responses 29 views 0 likes	Last Post by seqadmin 12-12-2024, 07:41 AM
Study Identifies Ribosomal RNA Fingerprints as Early Cancer Biomarkers by seqadmin Started by seqadmin, 12-11-2024, 07:45 AM	0 responses 42 views 0 likes	Last Post by seqadmin 12-11-2024, 07:45 AM

Seqanswers Leaderboard Ad

Announcement

tophat + cufflinks: strand information

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News