  • My reads are mapping to the wrong strand?

    Hello,

    So, I'm doing some downstream analysis on a published RNA-seq data set for yeast: http://downloads.yeastgenome.org/pub...3390610/fastq/

    However, after mapping the reads to the yeast genome, I noticed using Samtools that, oddly enough, far more reads were mapping to the complement of genes than to the genes themselves. That is, if a gene was on the (+) strand between nucleotides 500 and 1000 (for example), I would find, for most genes, far more RNA-seq reads mapping to that location on the (-) strand than on the (+) strand. Only ~800 genes mapped in a 'canonical' fashion, that is, with more reads on the gene's own strand than on the complementary strand, while ~5800 mapped in a non-canonical way, with more reads complementary to the gene than within it.

    I tested the script I wrote to make these measurements on other RNA-seq datasets and did not see the same pattern. What could be wrong with my yeast dataset?
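For anyone who wants to reproduce this kind of measurement, here is a minimal sketch of a per-gene sense/antisense count. It is not the original poster's script. It assumes single-end reads in plain SAM text, takes the mapped strand from FLAG bit 0x10, and assigns a read to a gene by its leftmost mapped position only (no CIGAR or spliced-read handling). The gene table and the name `count_sense_antisense` are hypothetical.

```python
from collections import Counter


def read_strand(flag: int) -> str:
    """Return '-' if SAM FLAG bit 0x10 (reverse strand) is set, else '+'."""
    return "-" if flag & 0x10 else "+"


def count_sense_antisense(sam_lines, genes):
    """Count reads on the same ('sense') or opposite ('antisense') strand as each gene.

    genes maps a gene name to (chrom, start, end, strand), 1-based inclusive.
    A read is assigned to a gene by its leftmost mapped position (a simplification).
    """
    counts = {name: Counter() for name in genes}
    for line in sam_lines:
        if line.startswith("@"):  # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag, chrom, pos = int(fields[1]), fields[2], int(fields[3])
        if flag & 0x4:  # skip unmapped reads
            continue
        strand = read_strand(flag)
        for name, (g_chrom, start, end, g_strand) in genes.items():
            if chrom == g_chrom and start <= pos <= end:
                key = "sense" if strand == g_strand else "antisense"
                counts[name][key] += 1
    return counts


# Toy example: one (+) strand gene and four made-up reads.
genes = {"GENE_A": ("chrI", 500, 1000, "+")}
sam = [
    "@HD\tVN:1.0",
    "r1\t16\tchrI\t600\t255\t50M\t*\t0\t0\t*\t*",
    "r2\t16\tchrI\t700\t255\t50M\t*\t0\t0\t*\t*",
    "r3\t0\tchrI\t800\t255\t50M\t*\t0\t0\t*\t*",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
counts = count_sense_antisense(sam, genes)
print(dict(counts["GENE_A"]))  # {'antisense': 2, 'sense': 1}
```

The linear scan over genes is fine for a toy table; a real script would more likely use an interval index or an existing counting tool.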

    I have performed the alignment with both SHRiMP and Tophat; both programs gave the same numbers. Changing the library type in Tophat did not affect the outcome.

    Thanks for any help!

  • #2
    Try reverse-complementing the reads prior to mapping.

    ....just kidding! Actually, assuming you are using a stranded protocol, the strand that reads map to is NOT affected by the library type flag you give Tophat. That only affects downstream processing using Cufflinks/Cuffdiff. One of the library types is supposed to have read1 mapping to the 'wrong' strand.

    On the other hand, if your protocol was unstranded, it doesn't matter either way. The strand bias in that case is probably an artifact of the number of PCR cycles or some kind of 3'/5' binding affinity difference (just a guess).
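To make the distinction concrete, here is a small sketch for single-end reads. The mapped strand comes straight from the SAM FLAG and is the same whatever library type is declared; only the interpreted transcript strand changes. The convention used (fr-firststrand reads are antisense to the transcript, fr-secondstrand reads are sense) is an assumption based on a common reading of the library-type descriptions and is worth checking against your own aligner's output. The function names are hypothetical.

```python
def mapped_strand(flag: int) -> str:
    """Strand the read aligned to, taken from SAM FLAG bit 0x10."""
    return "-" if flag & 0x10 else "+"


def transcript_strand(flag: int, library_type: str):
    """Interpret the transcript strand for a single-end read.

    Assumed convention (not verified against any particular tool version):
      fr-secondstrand: read is sense, so transcript strand == mapped strand
      fr-firststrand:  read is antisense, so transcript strand is the opposite
      fr-unstranded:   strand cannot be inferred from the read alone (None)
    """
    strand = mapped_strand(flag)
    if library_type == "fr-secondstrand":
        return strand
    if library_type == "fr-firststrand":
        return "+" if strand == "-" else "-"
    if library_type == "fr-unstranded":
        return None
    raise ValueError(f"unknown library type: {library_type}")


# The same alignment (FLAG 16, i.e. mapped to the minus strand) under each library type.
for lib in ("fr-unstranded", "fr-firststrand", "fr-secondstrand"):
    print(lib, mapped_strand(16), transcript_strand(16, lib))
```

Note that `mapped_strand(16)` is '-' in every row of the output; only the third column depends on the declared library type.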
    Last edited by Brian Bushnell; 07-23-2014, 12:11 PM.

    • #3
      Originally posted by Brian Bushnell
      Actually, assuming you are using a stranded protocol, the strand that reads map to is NOT affected by the library type flag you give Tophat. That only affects downstream processing using Cufflinks/Cuffdiff. One of the library types is supposed to have read1 mapping to the 'wrong' strand.

      On the other hand, if your protocol was unstranded, it doesn't matter either way. The strand bias in that case is probably an artifact of the number of PCR cycles or some kind of 3'/5' binding affinity difference (just a guess).
      This makes sense. However, the manufacturer documentation for the sequencing platform (Illumina GA IIx) claims that it is strand-specific. So, I would have to look into the Cuffdiff results to see whether, for most genes, most of my reads are being discarded, most are being used, or all are counted regardless of strand?

      Based on my understanding, for stranded data, firststrand means that the sequenced read corresponds to the first-strand cDNA, i.e. the reverse complement of the original mRNA, and therefore maps to the opposite strand from the gene's location (as I am seeing in my data), whereas secondstrand means that the complement of that cDNA is sequenced, so the read has the same sequence as the original mRNA and maps to the same strand as the gene.

      Would I be correct, then, to think that this data is probably a firststrand library, which will be clear once I run cuffdiff (on the data I generated from aligning with the firststrand argument in tophat)?
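As a rough sanity check on that guess, the gene counts from the first post (~800 canonical, ~5800 non-canonical) can be turned into a sense fraction and compared against a cutoff. This is only a sketch: the 0.8 threshold is an arbitrary assumption, the function name is hypothetical, and the labels reuse the Tophat library-type names under the assumption that firststrand reads map antisense to the gene.

```python
def guess_library_type(sense_genes: int, antisense_genes: int, threshold: float = 0.8):
    """Guess a library type from how many genes are dominated by sense vs antisense reads.

    threshold is an arbitrary cutoff (an assumption, not a published value).
    Returns (label, sense_fraction).
    """
    total = sense_genes + antisense_genes
    if total == 0:
        raise ValueError("no genes to classify")
    sense_fraction = sense_genes / total
    if sense_fraction >= threshold:
        return "fr-secondstrand", sense_fraction
    if sense_fraction <= 1 - threshold:
        return "fr-firststrand", sense_fraction
    return "fr-unstranded", sense_fraction


# Approximate numbers from the first post: ~800 canonical vs ~5800 non-canonical genes.
label, fraction = guess_library_type(800, 5800)
print(label, round(fraction, 3))  # fr-firststrand 0.121
```

A fraction near 0.5 would instead point to an unstranded library, which is why it is worth computing the ratio rather than only counting which side wins per gene.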

      • #4
        I don't do library prep, but my understanding is that machines are not inherently strand-specific; rather, some machines offer the possibility of using a strand-specific protocol. That does not ensure that your specific library was, in fact, sequenced using a strand-specific protocol; you'd have to check with the people who made it.

        Just to be clear, is your data single-ended or paired?

        • #5
          It is single-end.
          I just checked the protocol accompanying the data; it confirms that the reads are indeed strand-specific.

          By the way, thank you so much for all of your help so far.

          • #6
            Originally posted by rdsqc22
            It is single-end.
            I just checked the protocol accompanying the data; it confirms that the reads are indeed strand-specific.

            By the way, thank you so much for all of your help so far.
            You're welcome. As for 'firststrand' vs 'secondstrand', the documentation in Tophat is confusing, but I eventually concluded that for firststrand, read1 gets the SAM tag "XS:A:+" if it maps to the plus strand and "XS:A:-" if it maps to the minus strand. This gives results concordant with Tophat, anyway, so I consider it empirically correct. According to the Tophat manual:

            Note the use of the custom tag XS. This attribute, which must have a value of "+" or "-", indicates which strand the RNA that produced this read came from.
            So... with 'firststrand', a plus-mapped read will get "XS:A:+", which by my reading indicates that its template RNA was minus strand, which indicates the gene is on the plus strand. But the description is vague so I'm not sure.
