PDA

View Full Version : My reads are mapping to the wrong strand?


rdsqc22
07-23-2014, 11:19 AM
Hello,

So, I'm doing some downstream analysis on a published RNA-seq data set for yeast: http://downloads.yeastgenome.org/published_datasets/Waern_2013_PMID_23390610/fastq/

However, after I mapped them to the yeast genome, I noticed using Samtools that, oddly enough, far, far more reads were mapping to the complement of genes, than to the genes themselves. That is, if a gene was on the (+) strand, between nucleotide 500-1000 (for example), I would find that for most of the genes, far more RNA-seq reads would map to that location on the (-) strand than the (+) strand. I found that only ~800 genes would map in a 'canonical' fashion, that is, having more reads than the complementary region, while ~5800 would map in a non-canonical way, where there were more reads complementary to a gene than within the gene.

I tested the script I wrote to make these measurements among other RNA-seq datasets, and did not find the same thing. What could be wrong with my yeast dataset?

I have performed alignment with both SHRiMP and Tophat- both programs gave the same numbers. Changing the library type on Tophat did not affect the outcome.

Thanks for any help!

Brian Bushnell
07-23-2014, 12:03 PM
Try reverse-complimenting the reads prior to mapping.

....just kidding! Actually, assuming you are using a stranded protocol, the strand reads map to is NOT affected by the library type flag you give Tophat. That only affects downstream processing using Cufflinks/Cuffdif. One of the library types is supposed to have read1 mapping to the 'wrong' strand.

On the other hand, if your protocol was unstranded, it doesn't matter either way. The strand bias in that case is probably an artifact of the number of PCR cycles or some kind of 3'/5' binding affinity difference (just a guess).

rdsqc22
07-23-2014, 01:33 PM
Actually, assuming you are using a stranded protocol, the strand reads map to is NOT affected by the library type flag you give Tophat. That only affects downstream processing using Cufflinks/Cuffdif. One of the library types is supposed to have read1 mapping to the 'wrong' strand.

On the other hand, if your protocol was unstranded, it doesn't matter either way. The strand bias in that case is probably an artifact of the number of PCR cycles or some kind of 3'/5' binding affinity difference (just a guess).

This makes sense- however, according to the manufacturer documentation for the sequencing platform (Illumina GA IIx) it claims to be strand-specific. So, I would have to look into the Cuffdiff results to see if I am indeed seeing most of my reads discarded for most genes, or mostly looked at, or all considered regardless of strand?

Based on my understanding, for stranded data, firststrand means that the read that comes out is equivalent to the original mRNA, and therefore will map to the opposite strand from the gene's location (as I am seeing in my data), whereas secondstrand means that the complement to the cDNA is sequenced, and the read is equivalent to the original gene, which is where it maps on the genome.

Would I be correct, then, to think that this data is probably a firststrand library, which will be clear once I run cuffdiff (on the data I generated from aligning with the firststrand argument in tophat)?

Brian Bushnell
07-23-2014, 01:37 PM
I don't do library prep, but my understanding is that machines are not inherently strand-specific; rather, some machines offer the possibility of using a strand-specific protocol. That does not ensure that your specific library was, in fact, sequenced using a strand-specific protocol; you'd have to check with the people who made it.

Just to be clear, is your data single-ended or paired?

rdsqc22
07-23-2014, 01:42 PM
It is single-end.
I just checked the protocol accompanying the data- it confirms that the reads are indeed strand-specific.

By the way, thank you so much for all of your help so far.

Brian Bushnell
07-23-2014, 02:43 PM
It is single-end.
I just checked the protocol accompanying the data- it confirms that the reads are indeed strand-specific.

By the way, thank you so much for all of your help so far.

You're welcome. As for 'firststrand' vs 'secondstrand', the documentation in Tophat is confusing, but I eventually concluded that for firststrand, read1 gets the sam tag "XS:A:+" if it maps to the plus strand and "XS:A:-" if it maps to the minus strand. This gives results concordant with Tophat, anyway, so I consider it empirically correct. According to the Tophat manual:

Note the use of the custom tag XS. This attribute, which must have a value of "+" or "-", indicates which strand the RNA that produced this read came from.

So... with 'firststrand', a plus-mapped read will get "XS:A:+", which by my reading indicates that its template RNA was minus strand, which indicates the gene is on the plus strand. But the description is vague so I'm not sure.