  • RNA-seq and sense/antisense expression differences

    Hello all,

    Wondering if anyone out there could demystify the inner workings of eXpress and edgeR for me, or give me a better suggestion on how to do something. Basically, I'm looking at strand-specific RNA-seq data and trying to identify cases where there might be antisense transcripts involved in gene regulation. I'm looking across species and across tissues, and these are non-model guys, so I'm having to use a reference transcriptome for all of the mapping.

    Isolating the reads that map to the + and - strands is easy enough, but what to do next is where I've started running around in circles. What I'm currently testing is using eXpress to generate read counts for each transcript and then comparing those counts with something like edgeR. In the back of my head, though, I'm a bit worried that I am somehow violating some corrective/normalization factor in edgeR or using the wrong statistical test, as this is basically a test of the distribution of reads within some libraries as well as between some libraries. It seems like as much of a contingency table test or binomial test case as anything else, and I think that's what edgeR is doing, but I'm a little unsure if it's 100% appropriate.

    Other than that, there's the issue of whether I'm using the right "count" variable: perhaps the bias-corrected counts from eXpress would be a more accurate means of count estimation. Any thoughts there? Usually, those methods say to use the raw counts, but if you know there is bias in your mapping, shouldn't you use the "unbiased" count instead?

    Finally, what about the fact that transcript coverage may be very different with sense/antisense gene regulation? By this I mean that lncRNAs/miRNAs might match only a portion of the targeted transcript. Any thoughts on a good way to identify that kind of pattern? I'm dealing with hundreds of thousands of predicted transcripts here, so keep that in mind as well (i.e. visualizing each transcript in IGV is a no-go).

    Thanks in advance for any thoughts and/or insight.
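
To make the "contingency table" idea in the question concrete, here is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table of sense/antisense counts for one transcript in two libraries. This is not what edgeR does internally; the function names and the example numbers are hypothetical, and the test treats every read as an independent draw, so it ignores replicate-to-replicate variability and will tend to give optimistic p-values on real data.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Rows are libraries, columns are (sense, antisense) read counts.
    Sums the probabilities of all tables with the same margins that are
    no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    log_denom = log_comb(row1 + row2, col1)

    def log_prob(x):
        # Hypergeometric log-probability that the top-left cell equals x.
        return log_comb(row1, x) + log_comb(row2, col1 - x) - log_denom

    log_obs = log_prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = 0.0
    for x in range(lowest, highest + 1):
        lp = log_prob(x)
        if lp <= log_obs + 1e-7:  # small tolerance for floating-point ties
            p_value += math.exp(lp)
    return min(p_value, 1.0)


# Hypothetical counts: (sense, antisense) in tissue A vs tissue B.
p = fisher_exact_two_sided(950, 50, 700, 300)
print(f"p = {p:.3g}")
```

With many transcripts, the resulting p-values would still need a multiple-testing correction before being interpreted.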

  • #2
    The method I used for this was to create a test based on binomial stats (video tutorial of our method - the tool has actually been updated since this but works in the same way). Because strand specific libraries aren't completely clean I started by working out the global level of antisense transcription to get a measure of the level of antisense noise (assuming that true antisense transcription would be a small portion of all observed antisense reads).

    For the test I then looked at the number of reads which mapped to a given region (I was using genes as my test regions). I then used the global antisense level to calculate an expected number of antisense reads given the total read count and then used a binomial test to see if the observed number of antisense reads was greater than this.
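
The test described above can be sketched roughly as follows. This is only an illustration of the idea, not the tool's actual code: the function names and the example counts are hypothetical, and the global antisense rate is simply estimated as total antisense reads over total reads.

```python
import math


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with terms computed in log space."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def antisense_pvalues(counts):
    """counts maps gene -> (sense_reads, antisense_reads).

    Returns gene -> one-sided p-value that the gene has more antisense
    reads than expected from the global antisense rate.
    """
    total_sense = sum(s for s, _ in counts.values())
    total_anti = sum(a for _, a in counts.values())
    global_rate = total_anti / (total_sense + total_anti)
    return {
        gene: binom_upper_tail(anti, sense + anti, global_rate)
        for gene, (sense, anti) in counts.items()
    }


# Hypothetical example counts.
example = {"geneA": (980, 20), "geneB": (970, 30), "geneC": (500, 500)}
for gene, p in antisense_pvalues(example).items():
    print(gene, f"{p:.3g}")
```

The explicit loop is slow for genes with very large read counts; a statistics library's binomial test would be the practical choice there.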

    The test works pretty well, but there are some things you'll want to be aware of. The biggest problem is that in a surprisingly large number of cases we found predicted antisense transcription which occurred because the 3' UTRs of genes on opposite strands of the chromosome overlapped. This isn't an incorrect result as such but it might not be what you're looking for. We also found quite a few cases where we saw a very tightly packed column of antisense reads in a very small area. These could be small transcripts, but we suspect a lot of them will be mapping artefacts, so we might add in a filter to measure the physical extent of antisense transcription (proportion of the gene covered or something similar), rather than just the number of reads.
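
The "physical extent" filter mentioned above could look something like this. It is a sketch under assumptions: coordinates are taken to be 0-based and half-open, `covered_fraction` is a hypothetical helper, and any threshold applied to its output would need to be chosen empirically.

```python
def covered_fraction(gene_start, gene_end, read_intervals):
    """Fraction of [gene_start, gene_end) covered by at least one read.

    read_intervals is an iterable of (start, end) pairs for the antisense
    reads; intervals are clipped to the gene and merged before measuring.
    """
    clipped = sorted(
        (max(start, gene_start), min(end, gene_end))
        for start, end in read_intervals
        if end > gene_start and start < gene_end
    )
    covered = 0
    cur_start = cur_end = None
    for start, end in clipped:
        if cur_end is None or start > cur_end:
            # Close the previous merged block and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / (gene_end - gene_start)


# A tight column of reads covers very little of a 1 kb gene...
print(covered_fraction(0, 1000, [(100, 150), (110, 160), (120, 170)]))
# ...whereas reads spread along the gene cover most of it.
print(covered_fraction(0, 1000, [(0, 500), (400, 1000)]))
```

A gene with many antisense reads but a very low covered fraction would then be a candidate mapping artefact rather than a genuine antisense transcript.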

    Comment


    • #3
      Hi everyone,

      I am also trying to identify antisense transcripts and quantify them.
      My data are coming from cattle macrophages and have been prepared as paired-end strand-specific RNA-seq (using the ScriptSeq v2 kit). I have used STAR for the alignment and featureCounts for gene count summarisation (with option -s 1 to get sense gene counts and -s 2 to get antisense gene counts).
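
As a rough sketch, the two featureCounts tables (one from the `-s 1` run, one from the `-s 2` run) can be paired up per gene like this. It assumes the usual tab-delimited output with a leading `#` comment line, a `Geneid` header row, and the count for a single BAM file in the seventh column; the function names are hypothetical.

```python
def read_featurecounts(lines):
    """Parse featureCounts output lines into {gene_id: count}.

    Assumes one BAM file per table, so the count is the 7th column.
    """
    counts = {}
    for line in lines:
        if line.startswith("#") or line.startswith("Geneid"):
            continue  # skip the command comment and the header row
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue
        counts[fields[0]] = int(fields[6])
    return counts


def pair_counts(sense, antisense):
    """Combine two count dicts into {gene_id: (sense, antisense)}."""
    return {gene: (sense[gene], antisense.get(gene, 0)) for gene in sense}


def antisense_fraction(pairs):
    """Per-gene antisense fraction; genes with no reads are skipped."""
    return {
        gene: anti / (sense + anti)
        for gene, (sense, anti) in pairs.items()
        if sense + anti > 0
    }
```

In practice the two inputs would be open file handles, for example `read_featurecounts(open("sense_counts.txt"))`, where the file name is a placeholder. The paired counts are the kind of per-gene input the binomial test described in #2 works from.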

      Based on my analyses, I get a very high correlation between sense and antisense counts per gene (see picture), which is not really what I was expecting (or hoping) to see. Can people share advice on this?
      Can people also share their experience with the pipelines they use to identify antisense transcripts (I'll probably give SeqMonk a try)? And what is the expected amount of antisense transcription in mammals? About 10% of my reads map to a gene but on the opposite strand.
      Also, my RNA-seq library preparation includes poly(A)+ purification; do people generally see antisense transcription after such a step?

      Thanks a lot, regards,
      Nicolas


      • #4
        I think there is a technical factor to take into consideration. Although the libraries are strand-specific (ScriptSeq), I believe there is some contamination (DNA, wrong direction, etc.). In my data, perhaps 2-3% of reads map in the "wrong direction"; some of this may be biological, some technical. Try mapping to your genome and inspecting the distribution of forward and reverse reads at genes. If a gene is on the negative strand and you get a 50/50 split of forward to reverse reads, it could also be a mapping problem. One colleague had this issue, so take it as simple (trivial) advice.
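
One way to check the forward/reverse split at a gene is to work from the SAM FLAG field of the reads overlapping it. The sketch below assumes a protocol in which read 1 has the same orientation as the transcript (which matches counting sense reads with featureCounts `-s 1`); for protocols where read 1 is antisense, pass `read1_is_sense=False`. The function names are hypothetical; the flag bits are the standard SAM ones.

```python
READ_REVERSE = 0x10    # SAM flag bit: read aligned to the reverse strand
SECOND_IN_PAIR = 0x80  # SAM flag bit: read is the second mate of a pair


def fragment_strand(flag, read1_is_sense=True):
    """Infer the strand of the originating transcript from a SAM flag."""
    is_plus = not (flag & READ_REVERSE)
    if flag & SECOND_IN_PAIR:
        is_plus = not is_plus  # mate 2 points the opposite way to mate 1
    if not read1_is_sense:
        is_plus = not is_plus  # protocols where read 1 is antisense
    return "+" if is_plus else "-"


def strand_tally(flags, gene_strand, read1_is_sense=True):
    """Count reads that are sense or antisense relative to gene_strand."""
    tally = {"sense": 0, "antisense": 0}
    for flag in flags:
        if fragment_strand(flag, read1_is_sense) == gene_strand:
            tally["sense"] += 1
        else:
            tally["antisense"] += 1
    return tally


# Flags 99/147 are a properly paired fragment from the + strand,
# 83/163 a properly paired fragment from the - strand.
print(strand_tally([99, 147, 83, 163], gene_strand="-"))
```

A tally close to 50/50 for many genes would point towards a library or mapping problem rather than widespread antisense transcription.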



        • #5
          Thanks for your answer Puggie. I am aware that I may (must) have some contamination (even with a strand-specific protocol), but since I do not have a spike-in in my samples it is impossible to determine the level of contamination.

          The SeqMonk software (as explained by Simon Andrews) may be the way to go, in order to statistically remove such contamination to identify my putative true antisense. Actually does anyone know if SeqMonk software can identify antisense if provided with paired-end data?
          Thanks



          • #6
            There is an antisense analysis pipeline in SeqMonk which looks at the global level of antisense (ie how unclean your strand specificity is) and then does a binomial test on each individual gene to see if the proportion of antisense within that gene is incompatible with the global level.

            It seems to work pretty well but the results tend to be contaminated by biological artefacts, particularly extended 3' UTRs which run over the adjacent gene. There's a video tutorial of this which shows the basic process.



            • #7
              Thanks for your answer Simon, I input my data into SeqMonk and it seems to work all right (I did follow the video instructions).
              Also was I correct to input my BAM files (containing paired-end reads) without ticking the "split splice reads" options? My understanding from reading the manual is that selecting this option will make the software consider my reads as single-reads, which is not really what I want to look at antisense, am I right?
              Also, for the extended 3' UTR contamination, I was planning to simply exclude antisense calls that fall too close to another gene's 3' UTR, which should hopefully work.
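
A minimal sketch of that exclusion step, under simplifying assumptions: it flags any gene that has an opposite-strand gene overlapping it or lying within `margin` bp, using whole-gene coordinates rather than annotated 3' UTRs. The function name, the default margin, and the example annotation are all hypothetical.

```python
def near_opposite_strand_gene(genes, margin=1000):
    """Return the names of genes with an opposite-strand neighbour nearby.

    genes maps name -> (chrom, start, end, strand), 0-based half-open.
    A neighbour counts if it overlaps the gene extended by margin bp
    on each side.
    """
    flagged = set()
    items = list(genes.items())
    for name, (chrom, start, end, strand) in items:
        for other, (o_chrom, o_start, o_end, o_strand) in items:
            if other == name or o_chrom != chrom or o_strand == strand:
                continue
            if o_start < end + margin and o_end > start - margin:
                flagged.add(name)
                break
    return flagged


# Hypothetical annotation: A and B are convergent neighbours on chr1.
annotation = {
    "A": ("chr1", 1000, 2000, "+"),
    "B": ("chr1", 2500, 3500, "-"),
    "C": ("chr1", 10000, 11000, "-"),
    "D": ("chr2", 1500, 2500, "-"),
}
print(sorted(near_opposite_strand_gene(annotation)))
```

The all-against-all comparison is quadratic in the number of genes; for a full annotation, sorting genes by chromosome and position first would be the obvious improvement.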
              Thanks a million for the help,



              • #8
                When do you use library type -firststrand or secondstrand? Does this relate to the sense or antisense strand used in sequencing?
