SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Re-write strand information in BAM file droog_22 Bioinformatics 1 11-07-2011 03:25 AM
cufflinks transcript on wrong strand ornam Bioinformatics 2 12-13-2010 12:33 PM
Cufflinks library-type for SOLiD strand specific single data gengen Bioinformatics 1 10-19-2010 08:41 AM
tophat islands information syslm01 Bioinformatics 0 08-11-2010 03:48 AM
Bowtie, BWA, strand information Zimbobo Bioinformatics 6 04-24-2010 08:30 AM

Reply
 
Thread Tools
Old 04-13-2010, 04:34 PM   #1
Zimbobo
Member
 
Location: US

Join Date: Mar 2010
Posts: 25
Default tophat + cufflinks: strand information

I am using tophat.
Does anybody know a way to get the strand information in the output sam file?

I am interested in using cufflinks on tophat output. In the following is what I found on the cufflinks web page, it shows what I am after, how do I get that custom tag into the sam file?

Here's an example of an alignment Cufflinks will accept:
s6.25mer.txt-913508 16 chr1 4482736 255 14M431N11M * 0 0 \
CAAGATGCTAGGCAAGTCTTGGAAG IIIIIIIIIIIIIIIIIIIIIIIII NM:i:0 XS:A:-
Note the use of the custom tag XS. This attribute, which must have a value of "+" or "-", indicates which strand the RNA that produced this read came from. While this tag can be applied to any alignment, including unspliced ones, it must be present for all spliced alignment records (those with a 'N' operation in the CIGAR string).
Zimbobo is offline   Reply With Quote
Old 04-13-2010, 10:10 PM   #2
Haneko
Member
 
Location: Singapore

Join Date: Jan 2010
Posts: 36
Default

Hi there,

I believe the second column (the flag field. For more information, please refer to SAM format definition) contains the strand information. So you could write a script to determine the strand for that alignment and append the XS:A: field at the end of each line.
Haneko is offline   Reply With Quote
Old 04-14-2010, 03:03 PM   #3
Zimbobo
Member
 
Location: US

Join Date: Mar 2010
Posts: 25
Default

Yes, the 2nd column of sam file could contain this information, but it appears (at least in tophat 1.0.13) the output sam file - accepted_hits.sam - contains the numbers 0 or 16 in this column. I am not sure what this means.

For the sam output of bowtie (0.12.5) it's the same story.

I got around that by running bowtie such that it produces its standard output, there the + - strand information is preserved and then use bowtie2sam.pl (one of the samtools tools).

I still don't know how to get the strand information out of tophat, if it's possible at all.
Zimbobo is offline   Reply With Quote
Old 04-14-2010, 05:14 PM   #4
Haneko
Member
 
Location: Singapore

Join Date: Jan 2010
Posts: 36
Default

The strand is 'encoded' in hexadecimal within the field values. So to get the strand information, you will have to do an "AND" operation on the flag field.

In the SAM definition for the flag field:
0x0010 strand of the query (1 for reverse)

So if you do AND between the flag and 0x0010 and you get a 1, the entry is on the negative strand, else it's on the positive strand.
Haneko is offline   Reply With Quote
Old 04-15-2010, 01:29 PM   #5
Zimbobo
Member
 
Location: US

Join Date: Mar 2010
Posts: 25
Default

That was the info I needed. Thanks.
Zimbobo is offline   Reply With Quote
Old 05-05-2010, 09:35 AM   #6
Angela
Junior Member
 
Location: Cambridge

Join Date: Apr 2009
Posts: 1
Default

Hi Haneko,

If it were so I suppose I shouldn't be getting the following from TopHat:

Code:
HWI-EAS202_170:5:4:655:1891     0       chr1    11102910        255     15M10277N21M    *       0       0       GGATCATCCACCATGTTACCGATAAGCACCAGTTCA    B>@A@BA@>@??BAB@B@@@@??A>@;@>6>@:?>@    NM:i:0  XS:A:+  NS:i:0
HWI-EAS202_170:5:12:784:1887    16      chr1    11102913        255     12M10277N24M    *       0       0       TCATCCACCATGTTACCGATAAGCACCAGTTCAAAC    @@??AA9@AA@=AA9B@@A@BB?A9@BB@B@BBA@B    NM:i:0  XS:A:+  NS:i:0
One read has flag 0 and the other 16, meaning one maps to the forward and the other to the reverse, yet both have XS set to +

Did I get it wrong? I'm using TopHat v1.0.13
Thanks
Angela is offline   Reply With Quote
Old 05-08-2010, 06:16 PM   #7
xhuister
Member
 
Location: China

Join Date: Apr 2010
Posts: 41
Default

Hi Angela,

I'm using TopHat v1.0.13 too, but there is no these two columns "XS:A:+ NS:i:0" in my output files.

So, I used this command to add "XS:A:" column to the .sam file for cufflinks:
the flag column number is 2, when 16 then strand=- else strand='+'


awk -F'\t' 'BEGIN { OFS="\t"} {if($2==16) {$(NF+1)="XS:A:-"; print $0,$(NF+1)} else {$(NF+1)="XS:A:+"; print $0,$(NF+1)}}' test.sam > test.sam2
xhuister is offline   Reply With Quote
Old 05-09-2010, 04:30 AM   #8
Thomas Doktor
Senior Member
 
Location: University of Southern Denmark (SDU), Denmark

Join Date: Apr 2009
Posts: 105
Default

I believe the XS tag is to be understood as the strand of the transcript which was sequenced and not the strand which the read itself was aligned to. That is why you, Angela, see reads aligned to opposite strands with identical XS tag. Because of that, it would be unwise to simply insert the XS tag manually by a script if you wish to run Cufflinks as a downstream analysis.

xhuister, have you checked if your TopHat output SAM contains the XS tag in the reads covering splice junctions but not in the reads contained within exons? Cufflinks only needs the tag for the reads covering splice junctions so if you just did a quick check of the TopHat output you might have missed it.
Thomas Doktor is offline   Reply With Quote
Old 05-09-2010, 05:10 AM   #9
xhuister
Member
 
Location: China

Join Date: Apr 2010
Posts: 41
Default

Hi Thomas,

Thanks. You are right, there is XS tag in some of the reads of TOPHAT output, but in only a very small number of the reads.

$ grep -c "XS:A" accepted_hits.sam
17060
wc -c accepted_hits.sam
64463256

About what your said "Cufflinks only needs the tag for the reads covering splice junctions "?
At first, I thought I could used .sam output of Bowtie, then add "XS" tag to the reads, then use cufflinks to find the transcript boundaries. Am I wrong if I do so?
xhuister is offline   Reply With Quote
Old 05-09-2010, 06:03 AM   #10
Thomas Doktor
Senior Member
 
Location: University of Southern Denmark (SDU), Denmark

Join Date: Apr 2009
Posts: 105
Default

Yes, it would be wrong to run Cufflinks on a SAM file produced directly by Bowtie. Bowtie does not identify the strand of the transcript (or rather the splice junctions) so you need to align your reads with TopHat and then run Cufflinks. TopHat runs Bowtie as part of its pipeline so you don't need to run anything before TopHat. Having said that, I haven't actually run Cufflinks on Bowtie output, it might be able to do reasonably well but there is simply no reason to do it since you get a much better estimate of transcript abundances by using TopHat and then Cufflinks.

The reason TopHat only reports a strand for the transcript when a read spans a splice junction is that it cannot determine the strand of the read if it is contained within an exon. TopHat determines the strand of a read that spans a splice junction by looking at the splice sites to determine what strand contains a valid GT-AG pair (or AT-AC in case of the minor spliceosome).

Last edited by Thomas Doktor; 05-09-2010 at 06:06 AM.
Thomas Doktor is offline   Reply With Quote
Old 05-09-2010, 08:26 AM   #11
xhuister
Member
 
Location: China

Join Date: Apr 2010
Posts: 41
Default

Thanks Thomas, but I'm rather confused about the strand. I thought that when a read is mapped to the chromosome, then it should be + strand, when mapped to the reverse chromosome, then - strand.
But in what you said "it cannot determine the strand of the read within an exon", is this strand not the strand of read mapping to the chromosome?
xhuister is offline   Reply With Quote
Old 05-09-2010, 08:33 AM   #12
Thomas Doktor
Senior Member
 
Location: University of Southern Denmark (SDU), Denmark

Join Date: Apr 2009
Posts: 105
Default

Sorry, I was being unclear. TopHat can't determine what strand the original transcript came from, not the read itself, if the read is entirely contained within an exon.
Thomas Doktor is offline   Reply With Quote
Old 05-09-2010, 09:28 AM   #13
xhuister
Member
 
Location: China

Join Date: Apr 2010
Posts: 41
Default

Sorry, I'm still confused.

Given the followings:
chromosome AAGGGG....
read1 AAGGGG
read2 CCCCTT

Read1 will map to strand +, read2 strand -. If a transcript contains read1 then I think this transcript will be strand +. I don't know why the strand of this transcript is undetermined?
xhuister is offline   Reply With Quote
Old 05-09-2010, 10:04 AM   #14
Thomas Doktor
Senior Member
 
Location: University of Southern Denmark (SDU), Denmark

Join Date: Apr 2009
Posts: 105
Default

It depends on the way you prepared your library. If you used a non-strand-specific protocol (which most people still do) you're not actually sequencing transcripts, you're sequencing cDNA with two strands. So the read could come from either of the two strands of a cDNA and you don't have any information which of the two strands corresponds to the original mRNA strand. This can be inferred when a read spans a splice junctions because splice site are highly conserved at the first 2 bases and last 2 bases of an intron.
Another way of inferring directionality would be to look at the exon islands of the transcripts and identify valid open reading frames, but TopHat does not, to my knowledge, employ that strategy.

Last edited by Thomas Doktor; 05-09-2010 at 10:06 AM.
Thomas Doktor is offline   Reply With Quote
Old 05-09-2010, 10:25 AM   #15
xhuister
Member
 
Location: China

Join Date: Apr 2010
Posts: 41
Default

Thank you Thomas. I think the library I analyzed is strand-specific, but I didn't see any option in Tophat to specify strand-specific or non-strand-specific, do you know how to set this in Tophat?
xhuister is offline   Reply With Quote
Old 05-09-2010, 10:50 AM   #16
Thomas Doktor
Senior Member
 
Location: University of Southern Denmark (SDU), Denmark

Join Date: Apr 2009
Posts: 105
Default

No, I don't think TopHat "supports" strand-specific RNA-seq yet. Of course it will align the reads and all, it just won't report an XS tag other than for the splice junction spanning reads. If you're sure your reads came from a strand specific RNA-seq experiment it would probably be safe to go ahead and supply the SAM file with XS tags corresponding to the strand of the read itself. I'm not sure it will have a huge effect on the output of Cufflinks, it would be nice if you could report back on that.
Thomas Doktor is offline   Reply With Quote
Old 05-09-2010, 11:32 AM   #17
xhuister
Member
 
Location: China

Join Date: Apr 2010
Posts: 41
Default

Ok, I'll have a try. Thank you Thomas.
xhuister is offline   Reply With Quote
Old 05-09-2010, 05:18 PM   #18
Haneko
Member
 
Location: Singapore

Join Date: Jan 2010
Posts: 36
Default

Hi Thomas,

I'm using SOLiD BioScope SAM outputs for input into cufflinks, and it doesn't report the XS:A tag regardless of whether it is spliced or unspliced. My initial assumption was that the XS tag represents strand of the alignment of the read, but as you said, this is incorrect. However, if the read were to be uniquely mapped onto the reference genome, would it be ok to assume that the strand from which the read came from is the same as the strand to which it was aligned?
Haneko is offline   Reply With Quote
Old 05-10-2010, 06:17 AM   #19
Thomas Doktor
Senior Member
 
Location: University of Southern Denmark (SDU), Denmark

Join Date: Apr 2009
Posts: 105
Default

Haneko,

If your reads came from a strand specific RNA-seq experiment, it should be safe to do that. However, I have no experience with BioScope alignment, does it support alignment of reads spanning splice junctions?
Thomas Doktor is offline   Reply With Quote
Old 05-10-2010, 05:09 PM   #20
Haneko
Member
 
Location: Singapore

Join Date: Jan 2010
Posts: 36
Default

Hi Thomas,

Yes, bioscope supports spliced alignments. Unfortunately, the rna-seq experiment was not strand specific.

If that's the case, will assigning the wrong strand to an unspliced read result in any difference in cufflinks output? I understand that the flag is compulsory for spliced reads, but not unspliced ones..
Haneko is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:40 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO