  • mapping to the genome uniquely or non-uniquely?

    Hi all,
    We are doing some short RNA sequencing (human) and are having an internal debate about the best approach for mapping the reads. We seem to be split along three lines; here are the options:

    1) Map the reads to the miRNA sequences instead of the whole genome. My thought is that this is not the best approach, as you are going to be biasing your reads toward those regions, when in fact some reads may map "better" to some other part of the genome. Also, there are short RNAs other than miRNAs, although miRNAs are probably the most abundant. Others in our group disagree and feel this is the best method for looking at the miRNAs.

    2) Map the reads uniquely to the whole genome (i.e. 1 read, 1 location). This way you are only dealing with reads that go to a single place, which would make differential expression analysis easier. This is my preferred option.

    3) Map the reads non-uniquely to the whole genome (i.e. 1 read may go to multiple places). Since the human genome is very repetitive, this approach would account for that. My concern is that, since 1 read could land in many places, it would be counted multiple times, which makes normalization and differential expression rather difficult. Also, since a read can go to many places, potentially hundreds or more, you are assuming that each of those locations is equally expressed, which is most likely not the case, and you will not know the exact genomic location of your molecule of interest. In this case you may be dealing with specific sequences that are differentially expressed rather than locations or annotated elements.
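To make the double-counting concern in option 3 concrete, here is a minimal sketch on made-up data (the read and locus names are hypothetical, and `count_reads` is not part of any aligner). It compares three tallies: unique reads only, every alignment counted once, and a 1/N fractional weighting. The fractional scheme is just one possible way to keep the total equal to the number of reads, and it builds in exactly the "every location is equally expressed" assumption described above.

```python
from collections import defaultdict

def count_reads(alignments, mode):
    """Tally reads per locus.

    alignments: dict mapping read id -> list of loci the read aligns to.
    mode: 'unique'     count only reads with exactly one locus,
          'all'        count every alignment once (inflates the total),
          'fractional' give each of a read's N loci a weight of 1/N.
    """
    counts = defaultdict(float)
    for loci in alignments.values():
        if not loci:
            continue  # unaligned read
        if mode == "unique":
            if len(loci) == 1:
                counts[loci[0]] += 1.0
        elif mode == "all":
            for locus in loci:
                counts[locus] += 1.0
        elif mode == "fractional":
            weight = 1.0 / len(loci)
            for locus in loci:
                counts[locus] += weight
        else:
            raise ValueError("unknown mode: " + mode)
    return dict(counts)

# Toy input (assumption): three reads, two of them multi-mapped.
toy = {
    "read1": ["locusA"],
    "read2": ["locusA", "locusB"],
    "read3": ["locusB", "locusC", "locusD", "locusE"],
}

for mode in ("unique", "all", "fractional"):
    per_locus = count_reads(toy, mode)
    print(mode, sum(per_locus.values()), per_locus)
```

With three input reads, the "all" tally sums to 7 alignments, while the unique tally keeps 1 read and the fractional tally keeps the total at 3.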

    Based on these options, does anyone have any thoughts or experiences? We are trying all 3 approaches to see which works best, although "best" can be rather subjective, as it may just mean getting the results you want to see.

    Thanks

  • #2
    I've not worked with RNA-seq data yet, so I'm no expert, but from a statistical standpoint I agree with your reasoning.

    I particularly dislike option 3 if you're interested in differential expression. Let's say you have 10 million reads and assume they can all be uniquely mapped; this gives you essentially 10 million data points. But if each read is counted multiple times because of non-unique mapping, you could end up with hundreds of millions of data points, which does not reflect the real data. I think this approach would be asking for trouble. I agree that option 1 could introduce bias. Since you are trying all three anyway, see how strongly correlated options 1 and 2 are for the miRNAs; if they're not strongly correlated, I'd suspect the bias you predict is the reason.
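The correlation check suggested above could be sketched roughly as follows. This assumes you already have per-miRNA read counts from both mapping strategies as plain dictionaries; the names `pearson` and `compare_mappings`, the log2(count + 1) transform, and the example counts are all illustrative choices, not something prescribed in this thread.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

def compare_mappings(counts_option1, counts_option2):
    """Correlate log2(count + 1) over every miRNA seen by either option.

    A miRNA missing from one dictionary is treated as a count of 0.
    """
    names = sorted(set(counts_option1) | set(counts_option2))
    x = [math.log2(counts_option1.get(name, 0) + 1) for name in names]
    y = [math.log2(counts_option2.get(name, 0) + 1) for name in names]
    return pearson(x, y)

# Made-up counts for hypothetical miRNAs (assumption, not real data).
option1 = {"mirA": 1200, "mirB": 340, "mirC": 15, "mirD": 90}
option2 = {"mirA": 1100, "mirB": 20, "mirC": 14, "mirD": 85}
print(round(compare_mappings(option1, option2), 3))
```

miRNAs that fall far off the diagonal (like the hypothetical `mirB` here) would be the first candidates to inspect for reads that map better elsewhere in the genome.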



    • #3
      Generally, when you are aligning, it is best to align to the genome, for several reasons:
      a) it does not bias towards the transcriptome or the sequences you are aligning to
      b) it allows for discovery of novel miRNAs, other short RNAs, etc.

      What is your goal ultimately? If you are just looking for differential expression of miRNAs, perhaps you could get away with the first option, but I wouldn't recommend it.

      I wouldn't recommend the third option at all. Tophat will attempt to align each read and choose the best match. If a read matches two locations in the genome equally well, one of them will be pseudorandomly selected (it isn't completely random, as the same input will always yield the same output) and the read will be aligned there. So, if you have sufficient coverage, expression will still be recorded for repetitive elements. If you had two identical miRNA sequences in the genome and one or both were being expressed, you would expect them both to be detected.
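As an illustration of "pseudorandom but reproducible" tie-breaking, here is a small sketch. It is not Tophat's or Bowtie's actual implementation; the function `pick_alignment` and the idea of hashing the read name are assumptions made purely to show how the same input can always yield the same choice while different reads spread across tied loci.

```python
import hashlib

def pick_alignment(read_name, candidates):
    """Pick one locus for a read from (locus, mismatches) candidates.

    The candidate with the fewest mismatches wins. Ties are broken by
    hashing the read name, so the choice looks random across reads but
    is identical every time the same input is processed.
    """
    if not candidates:
        return None
    best = min(mismatches for _, mismatches in candidates)
    # Sort tied loci so the result does not depend on input order.
    tied = sorted(locus for locus, mismatches in candidates if mismatches == best)
    if len(tied) == 1:
        return tied[0]
    digest = hashlib.sha256(read_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(tied)
    return tied[index]

# Hypothetical read with two equally good hits and one worse hit.
hits = [("chr5:1000", 0), ("chr20:5000", 0), ("chr1:42", 2)]
print(pick_alignment("read_0001", hits))
print(pick_alignment("read_0001", hits))  # same read, same answer
```

Given enough reads from a sequence present at two identical loci, a scheme like this would distribute reads across both loci, which is why both would show up as expressed.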

      Now, miRNAs often differ by only a few bases, from what I recall. If you had two similar miRNAs and only one of them was expressed, here is what you would expect from option 2 and option 3:

      Option 2: Reads could align to either miRNA sequence; however, as only one would be a perfect match and the other would have 1 or more mismatches, expression of only the perfectly matched miRNA would be detected.

      Option 3: As multiple alignments are allowed, both sequences would meet the alignment criteria (assuming you allow more than 0 mismatches), and expression of both would be detected.

      Also, if you were using option 3, you could probably only allow 0 mismatches, which may cause you to lose reads that you would be able to detect with option 2 allowing 1 or 2.
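The two-similar-miRNAs scenario can be mimicked with a toy example. The sequences and names below are invented, the comparison is a plain Hamming distance on equal-length strings rather than a real aligner, and `best_unique_hit` / `all_hits` are hypothetical helpers standing in for options 2 and 3.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

def best_unique_hit(read, refs, max_mismatches):
    """Option 2 style: report a reference only if it is the single best match."""
    scores = {name: hamming(read, seq) for name, seq in refs.items()
              if len(seq) == len(read)}
    passing = {name: d for name, d in scores.items() if d <= max_mismatches}
    if not passing:
        return []
    best = min(passing.values())
    winners = sorted(name for name, d in passing.items() if d == best)
    return winners if len(winners) == 1 else []

def all_hits(read, refs, max_mismatches):
    """Option 3 style: report every reference within the mismatch limit."""
    return sorted(name for name, seq in refs.items()
                  if len(seq) == len(read) and hamming(read, seq) <= max_mismatches)

# Two made-up 22-nt "miRNAs" that differ only at the last base.
refs = {
    "mirX": "ACGTACGTACGTACGTACGTAC",
    "mirY": "ACGTACGTACGTACGTACGTAG",
}
read = "ACGTACGTACGTACGTACGTAC"  # derived from mirX only

print(best_unique_hit(read, refs, 1))  # ['mirX']
print(all_hits(read, refs, 1))         # ['mirX', 'mirY']
print(all_hits(read, refs, 0))         # ['mirX']
```

With 1 mismatch allowed, the multi-hit mode reports the unexpressed `mirY` as well; tightening it to 0 mismatches removes that false hit, at the cost of discarding reads that carry a sequencing error.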
      Last edited by Mike2188; 07-30-2014, 02:32 PM.



      • #4
        Thanks for the replies.

        What is your goal ultimately? If you are just looking for differential expression of miRNAs, perhaps you could get away with the first option, but I wouldn't recommend it.
        Our goals are to find differentially expressed miRNAs as well as any other short RNA species that are expressed. We would also like to do some novel miRNA searching. Based on this, aligning to the miRNAs alone is a bad choice. I have seen many papers use this method, but I am still strongly against it.

        I wouldn't recommend the third option at all. Tophat will attempt to align each read and choose the best match. If a read matches two locations in the genome equally well, one of them will be pseudorandomly selected (it isn't completely random, as the same input will always yield the same output) and the read will be aligned there. So, if you have sufficient coverage, expression will still be recorded for repetitive elements. If you had two identical miRNA sequences in the genome and one or both were being expressed, you would expect them both to be detected.
        I definitely agree with your points on the non-unique mapping, which is why I am trying to stay away from it.

        Although you mention this:

        Also, if you were using option 3, you could probably only allow 0 mismatches, which may cause you to lose reads that you would be able to detect with option 2 allowing 1 or 2.
        This was an option that I was considering: essentially allowing a read to go to multiple places, but in all instances allowing 0 mismatches. For cases in which there are 2 miRNAs with exactly the same sequence (miR-103a is an example, I believe), a read would map to both copies, but we would not know from which locus it is being expressed. This approach might also help with repetitive elements in the genome when each copy has the same sequence.

        Our samples are currently having their libraries made and should be on the machine next week. We'll see what happens. I'll try to post back with an update after trying a few different approaches.



        • #5
          There are a number of approaches for resolving multiply-mapped reads:




          You could also try using option 3 to discover new loci, collapse redundant loci (such as miR-103a in your example), combine these with existing miR/smRNA databases, and then align using option 1.
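The "collapse redundant loci" step might look something like the sketch below. It assumes you have already extracted the sequence at each discovered locus into a dictionary; the locus names, sequences, and the helpers `collapse_loci` and `to_fasta` are hypothetical, and only loci with exactly identical sequence are merged.

```python
from collections import defaultdict

def collapse_loci(loci):
    """Group loci that share exactly the same sequence.

    loci: dict mapping locus name -> sequence.
    Returns a dict mapping each distinct sequence -> sorted locus names.
    """
    groups = defaultdict(list)
    for name, seq in loci.items():
        groups[seq.upper()].append(name)
    return {seq: sorted(names) for seq, names in groups.items()}

def to_fasta(collapsed):
    """Write one FASTA record per distinct sequence.

    The header joins all source loci with '|' so the origin stays traceable.
    """
    lines = []
    for seq in sorted(collapsed):
        lines.append(">" + "|".join(collapsed[seq]))
        lines.append(seq)
    return "\n".join(lines) + "\n"

# Made-up loci: two identical copies plus one distinct sequence.
discovered = {
    "copy_chr5": "AGCAGCATTGTACAGGGCTATGA",
    "copy_chr20": "AGCAGCATTGTACAGGGCTATGA",
    "novel_1": "TTGACCGTAGGCTAACGTTAGCA",
}

collapsed = collapse_loci(discovered)
print(to_fasta(collapsed))
```

The resulting non-redundant set could then be merged with entries from an existing miRNA/smRNA database to form the reference for an option 1 style alignment, where each read has at most one target per distinct sequence.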

          Note that Bowtie is preferred over Bowtie2 for short reads, and that the default Tophat segment length is 25 nt. You may want to tweak these settings for smRNA alignment.
