
  • mapping to the genome uniquely or non-uniquely?

    Hi all,
    We are doing some short RNA sequencing (human) and are having an internal debate about the best approach for mapping the reads. We seem to be split three ways; here are the options:

    1) Map the reads to the miRNA sequences instead of the whole genome. My thought is that this is not the best approach, as you are going to be biasing your reads toward those regions, when in fact some reads may map "better" to some other part of the genome. Also, there are short RNAs other than miRNAs, although miRNAs are probably the most abundant. Others in our group disagree and feel this is the best method to look at the miRNAs.

    2) Map the reads uniquely to the whole genome (i.e. 1 read, 1 location). In this way you are only dealing with reads that go to a single place, which would make differential expression analysis easier. This is my preferred option.

    3) Map the reads non-uniquely to the whole genome (i.e. 1 read may go to multiple places). Since the human genome is very repetitive, this approach would account for that fact. My thought, though, is that since 1 read would be in many places, it would make differential expression rather difficult: 1 read would be counted multiple times, which complicates any sort of normalization. Also, since a read can go to many places, potentially hundreds or more, you are assuming that each of those locations is equally expressed, which is most likely not the case, and you will not know the exact location in the genome of your molecule of interest. In this case you may be dealing with specific sequences that are differentially expressed rather than locations or annotated elements.

    Based on these options, does anyone have any thoughts or experiences? We are trying all 3 approaches to see which works best, although "best" can be rather subjective, as it may just mean getting the results you want to see.

    Thanks

  • #2
    I've not worked with RNA-seq data yet, so I'm no expert, but from a statistical standpoint I agree with your reasoning.

    I particularly dislike option 3 if you're interested in differential expression. Let's say you have 10 million reads and assume they can all be uniquely mapped; this gives you essentially 10 million data points. But if each read is counted multiple times because of non-unique mapping, you could end up with hundreds of millions of data points, which does not reflect the real data. I think this approach would be asking for trouble. I agree that option 1 could introduce bias. Since you are trying all three anyway, see how strongly correlated options 1 and 2 are for the miRNAs; if they're not strongly correlated, I'd suspect the bias you predict is the reason.
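
    The inflation described above can be illustrated with a toy example. This is only a sketch under stated assumptions: the read IDs, locus names, and alignment table are hypothetical, and the fractional (1/n) weighting shown last is a common alternative that nobody in this thread proposed. Note that it still assumes every location of a multi-mapped read is equally likely.

```python
from collections import defaultdict

# Hypothetical toy data: read ID -> loci the read aligns to.
alignments = {
    "read1": ["locusA"],
    "read2": ["locusA"],
    "read3": ["locusB", "locusC", "locusD"],  # multi-mapped read
    "read4": ["locusB"],
}

def count_unique_only(alignments):
    """Option 2: count only reads that align to exactly one locus."""
    counts = defaultdict(float)
    for loci in alignments.values():
        if len(loci) == 1:
            counts[loci[0]] += 1
    return dict(counts)

def count_every_hit(alignments):
    """Option 3, counted naively: every alignment adds a full count."""
    counts = defaultdict(float)
    for loci in alignments.values():
        for locus in loci:
            counts[locus] += 1
    return dict(counts)

def count_fractional(alignments):
    """Assumed alternative: split each read evenly across its n loci."""
    counts = defaultdict(float)
    for loci in alignments.values():
        for locus in loci:
            counts[locus] += 1 / len(loci)
    return dict(counts)

for name, fn in [("unique only", count_unique_only),
                 ("every hit", count_every_hit),
                 ("fractional", count_fractional)]:
    result = fn(alignments)
    print(name, result, "total =", round(sum(result.values()), 6))
```

    With 4 reads in the input, the naive count sums to 6 "data points", while the fractional count sums back to 4 and the unique-only count drops the multi-mapped read entirely.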



    • #3
      Generally, when you are aligning, it is best to align to the genome, for several reasons:
      a) it does not bias the alignment towards the transcriptome or whichever sequences you are aligning to
      b) it allows for discovery of novel miRNAs, other short RNAs, etc.

      What is your goal ultimately? If you are just looking for differential expression of miRNAs, perhaps you could get away with the first option, but I wouldn't recommend it.

      I wouldn't recommend the third option at all. Tophat will attempt to align each read and choose the best match. If a read matches two locations equally well, one will be selected pseudorandomly (it isn't completely random, as the same input will always yield the same output) and the read will be aligned there. So, if you have sufficient coverage, expression will still be recorded for repetitive elements. If you had two identical miRNA sequences in the genome and one or both were being expressed, you would expect them both to be detected.
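
      To make "pseudorandom but repeatable" concrete, here is a minimal sketch of deterministic tie-breaking. It is not Tophat's actual implementation, which I have not checked; the function name `pick_alignment`, the read IDs, and the locus names are all hypothetical. The idea is simply that the choice is derived from the read itself, so rerunning on the same input gives the same assignment.

```python
import hashlib

def pick_alignment(read_id, candidates):
    """Pick one of several equally good loci, deterministically per read."""
    if not candidates:
        raise ValueError("no candidate alignments")
    # Sort so the result does not depend on the order the hits were found in.
    ordered = sorted(candidates)
    # Hash the read ID and use the digest as a repeatable "random" index.
    digest = hashlib.sha256(read_id.encode("utf-8")).hexdigest()
    return ordered[int(digest, 16) % len(ordered)]

# Hypothetical read IDs and tied locus names.
tied_loci = ["chr5:1000", "chr20:2000"]
for read_id in ["read_001", "read_002", "read_003"]:
    print(read_id, "->", pick_alignment(read_id, tied_loci))
```

      With enough reads, roughly half should land on each of the two tied loci, which is why both copies would still show coverage.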

      From what I recall, miRNAs often differ by only a few bases. If you had two similar miRNAs and only one of them was expressed, here is what you would expect from option 2 and option 3:

      Option 2: Alignment would be attempted against both miRNA sequences; however, as only one would be a perfect match and the other would have 1 or more mismatches, only the expression of the perfectly matched miRNA would be detected.

      Option 3: As multiple alignments are allowed, both sequences would meet the alignment criteria (assuming you allow at least one mismatch), and the expression of both would be detected.

      Also, if you were using option 3, you could probably only allow 0 mismatches, which may cause you to lose reads that you would be able to detect with option 2 allowing 1 or 2 mismatches.
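
      The difference between the two options can be sketched with a toy matcher. This assumes ungapped, full-length comparison by Hamming distance, which is a simplification of what a real aligner does; the reference names and sequences are made up and are not real miRNAs.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

# Hypothetical references that differ at a single position.
references = {
    "mirA": "ACGTACGTTAGCTAGCATCGAT",
    "mirB": "ACGTACGTTAGCTAGCATCGTT",
}
read = "ACGTACGTTAGCTAGCATCGAT"  # a read derived from mirA

def all_hits(read, references, max_mismatches):
    """Option 3 style: report every reference within the mismatch limit."""
    return sorted(name for name, seq in references.items()
                  if hamming(read, seq) <= max_mismatches)

def best_hits(read, references, max_mismatches):
    """Option 2 style: keep only hits with the fewest mismatches."""
    scored = {name: hamming(read, seq) for name, seq in references.items()}
    valid = {n: d for n, d in scored.items() if d <= max_mismatches}
    if not valid:
        return []
    best = min(valid.values())
    return sorted(n for n, d in valid.items() if d == best)

print("best, 1 mismatch allowed:", best_hits(read, references, 1))
print("all,  1 mismatch allowed:", all_hits(read, references, 1))
print("all,  0 mismatches:      ", all_hits(read, references, 0))
```

      Allowing one mismatch, the best-hit rule assigns the read to mirA only, while reporting all hits also credits the unexpressed mirB. Dropping to 0 mismatches removes the false hit, at the cost of discarding any read carrying a sequencing error or variant.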
      Last edited by Mike2188; 07-30-2014, 02:32 PM.



      • #4
        Thanks for the replies.

        What is your goal ultimately? If you are just looking for differential expression of miRNAs, perhaps you could get away with the first option, but I wouldn't recommend it.
        Our goals are to find differentially expressed miRNAs as well as any other short RNA species that are expressed. We would also like to do some novel miRNA searching. So based on this, aligning to the miRNAs is a bad choice. I have seen many papers use this method, but I am still strongly against it.

        I wouldn't recommend the third option at all. Tophat will attempt to align each read and choose the best match. If a read matches two locations equally well, one will be selected pseudorandomly (it isn't completely random, as the same input will always yield the same output) and the read will be aligned there. So, if you have sufficient coverage, expression will still be recorded for repetitive elements. If you had two identical miRNA sequences in the genome and one or both were being expressed, you would expect them both to be detected.
        I definitely agree with your points on the non-unique mapping, which is why I am trying to stay away from it.

        Although you mention this:

        Also, if you were using option 3, you could probably only allow 0 mismatches, which may cause you to lose reads that you would be able to detect with option 2 allowing 1 or 2 mismatches.
        This was an option that I was considering: essentially allowing a read to go to multiple places, but in all instances allowing 0 mismatches. For cases in which there are 2 miRNAs with the exact same sequence (miR-103a, I believe, is an example), we would get a read mapping to both copies, but we would not know from which locus it is being expressed. This approach might also help with repetitive elements in the genome when each copy has the same sequence.

        Our samples are currently having their libraries made and should be on the machine next week. We'll see what happens. I'll try to post back with an update after trying a few different approaches.



        • #5
          There are a number of approaches for resolving multiply-mapped reads:




          You could also try using option 3 to discover new loci and then collapse redundant loci (such as the miR-103a in your example), then combine these with existing miR/smRNA databases and align using option 1.
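
          As a rough sketch of the "collapse redundant loci" step, assuming redundancy simply means identical sequence: group the discovered loci by sequence and keep one merged entry per group, then write them out as a FASTA reference. The locus names, sequences, and the `|` naming convention are all hypothetical.

```python
from collections import defaultdict

# Hypothetical discovered loci: name -> sequence (made-up sequences).
loci = {
    "locus_a": "GATTACAGATTACAGATTACAG",
    "locus_b": "GATTACAGATTACAGATTACAG",  # identical copy of locus_a
    "locus_c": "TTGACCGATAGGCTTAACGGAT",
}

def collapse_redundant(loci):
    """Merge loci that share an identical sequence into one entry."""
    by_sequence = defaultdict(list)
    for name, seq in loci.items():
        by_sequence[seq.upper()].append(name)
    # Join the names so the merged entry still records every source locus.
    return {"|".join(sorted(names)): seq for seq, names in by_sequence.items()}

def to_fasta(entries):
    """Format name -> sequence entries as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in sorted(entries.items()))

collapsed = collapse_redundant(loci)
print(to_fasta(collapsed), end="")
```

          Reads that match the merged entry are then counted once for the sequence, with the understanding that the locus of origin stays ambiguous.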

          Note that Bowtie is preferred over Bowtie2 for short reads, and that the default Tophat segment length is 25 nt. You may want to tweak these settings for smRNA alignment.
