  • lre1234
    Senior Member
    • Aug 2011
    • 110

    mapping to the genome uniquely or non-uniquely?

    Hi all,
    We are doing some short RNA sequencing (human) and are having an internal debate on what the best approach for mapping the reads would be. We seem to be split along three lines; here are the options:

    1) Map the reads to the miRNA sequences instead of the whole genome. My thought is that this is not the best approach, as you are going to be biasing your reads toward those regions when in fact some reads may map "better" to some other part of the genome. Also, there are other short RNAs besides miRNAs, although miRNAs are probably the most abundant. Others in our group disagree and feel this is the best method to look at the miRNAs.

    2) Map the reads uniquely to the whole genome (i.e. 1 read, 1 location). In this way you are only dealing with reads that go to a single place, which would make differential expression analysis easier. This is my preferred option.

    3) Map the reads non-uniquely to the whole genome (i.e. 1 read may go to multiple places). Since the human genome is very repetitive, this approach would account for that fact. My concern is that, since 1 read would be placed in many locations, it would be counted multiple times, making differential expression and any sort of normalization rather difficult. Also, since a read can go to many places, potentially 100s or more, you are assuming that each of those locations is equally expressed, which is most likely not the case, and you will not know the exact location in the genome of your molecule of interest. In this case you may be dealing with specific sequences that are differentially expressed rather than locations or annotated elements.
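
    To make the counting concern in option 3 concrete, here is a minimal Python sketch on made-up toy data (the read and locus names are hypothetical). It compares unique-only counting, naive counting of every location, and a 1/N fractional weighting. The fractional scheme is not something proposed above; it is shown only because it keeps each read worth one count in total, and it still assumes that all N locations are equally expressed.

```python
from collections import Counter

# Hypothetical toy alignments: read id -> loci it aligns to equally well.
alignments = {
    "read1": ["locusA"],
    "read2": ["locusA", "locusB"],
    "read3": ["locusB", "locusC", "locusD"],
    "read4": ["locusC"],
}

def count_unique(alignments):
    """Option 2: count only reads that map to exactly one location."""
    counts = Counter()
    for loci in alignments.values():
        if len(loci) == 1:
            counts[loci[0]] += 1
    return counts

def count_all(alignments):
    """Option 3, naive: every reported location receives a full count."""
    counts = Counter()
    for loci in alignments.values():
        for locus in loci:
            counts[locus] += 1
    return counts

def count_fractional(alignments):
    """Option 3 with 1/N weighting: each read contributes 1 in total."""
    counts = Counter()
    for loci in alignments.values():
        for locus in loci:
            counts[locus] += 1.0 / len(loci)
    return counts

unique_counts = count_unique(alignments)
all_counts = count_all(alignments)
frac_counts = count_fractional(alignments)

# Total "observations" implied by each scheme for the same 4 reads.
print("unique:", sum(unique_counts.values()))
print("naive multi:", sum(all_counts.values()))
print("fractional:", round(sum(frac_counts.values()), 6))
```

    With naive counting the library appears larger than the number of reads actually sequenced, which is what makes normalization awkward.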

    Based on these options, does anyone have any thoughts or experiences? We are trying all 3 approaches to see which works best, although "best" can be rather subjective, as it may just mean getting the results you want to see.

    Thanks
  • N311V
    Member
    • Jul 2013
    • 34

    #2
    I've not worked with RNAseq data yet, so I'm no expert, but my opinion from a statistical standpoint is to agree with your reasoning.

    I particularly dislike option 3 if you're interested in differential expression. Let's say you have 10 million reads and assume they can all be uniquely mapped; this gives you essentially 10 million data points. But if each read is counted multiple times due to non-unique mapping, you could end up with 100s of millions of data points, which does not reflect the real number of observations. I think this approach would be asking for trouble. I agree that option 1 could introduce bias. Since you are trying all three anyway, see how strongly correlated options 1 and 2 are for the miRNAs; if they're not strongly correlated, I'd suspect the bias you predict is the reason.
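
    As a rough sketch of that comparison, assuming you end up with a per-miRNA count table from each option, something like the following would do. The miRNA names and counts here are invented placeholders, and Pearson correlation on log2(count + 1) is just one reasonable choice; a rank correlation would work as well.

```python
import math

def log_pearson(x, y):
    """Pearson correlation of log2(count + 1) values for two count lists."""
    lx = [math.log2(v + 1) for v in x]
    ly = [math.log2(v + 1) for v in y]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var_x = sum((a - mean_x) ** 2 for a in lx)
    var_y = sum((b - mean_y) ** 2 for b in ly)
    # Note: raises ZeroDivisionError if either list is constant.
    return cov / math.sqrt(var_x * var_y)

# Hypothetical counts per miRNA from option 1 (miRNA-only reference)
# and option 2 (unique genome mapping).
option1 = {"miR-a": 1200, "miR-b": 300, "miR-c": 45, "miR-d": 800}
option2 = {"miR-a": 1100, "miR-b": 310, "miR-c": 5, "miR-d": 790}

# Compare only miRNAs present in both tables, in a fixed order.
shared = sorted(set(option1) & set(option2))
r = log_pearson([option1[m] for m in shared], [option2[m] for m in shared])
print(f"log-scale Pearson r over {len(shared)} miRNAs: {r:.3f}")
```

    miRNAs that drop sharply under option 2 relative to option 1 (like the placeholder miR-c) would be the first ones to inspect for reads that map better elsewhere in the genome.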


    • Mike2188
      Member
      • Oct 2013
      • 27

      #3
      Generally when you are aligning it is best to align to the genome for several reasons:
      a) do not bias towards the transcriptome or the sequences you are aligning to
      b) allows for discovery of novel miRNAs, other short RNAs, etc

      What is your goal ultimately? If you are just looking for differential expression of miRNAs, perhaps you could get away with the first option, but I wouldn't recommend it.

      I wouldn't recommend the third option at all. Tophat will attempt to align and choose the best match for each read. If two reads have equal matching to the genome then one will be pseudorandomly selected (it isn't completely random, as the same input will always yield the same output), and will align. So, if you have sufficient coverage the expression will still be recorded for repetitive elements. If you had two identical miRNA sequences in the genome and one or both were being expressed, you would expect them both to be detected.
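
      To illustrate what "pseudorandom but reproducible" can look like, here is a small Python sketch. This is not TopHat's actual code, and the function and locus names are made up; it only shows one way a tie between equally good locations can be broken at random while still giving the same answer on every run.

```python
import hashlib
import random

def pick_location(read_name, read_seq, candidate_loci):
    """Pick one of several equally good loci, reproducibly."""
    # Seed the generator from the read itself, so the same read always
    # gets the same choice regardless of run or processing order.
    digest = hashlib.sha256(f"{read_name}:{read_seq}".encode()).hexdigest()
    rng = random.Random(int(digest, 16))
    # Sort first so the result does not depend on the input list order.
    return rng.choice(sorted(candidate_loci))

# Hypothetical read with two equally good placements.
loci = ["chr5:1000", "chr20:2000"]
first = pick_location("read42", "ACGTACGTACGTACGTACGTAC", loci)
second = pick_location("read42", "ACGTACGTACGTACGTACGTAC", loci)
print(first == second)  # True: same input, same choice
```

      Across many reads from an identical sequence, choices like this spread the reads over the candidate loci, which is why both copies would still show coverage.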

      Now, miRNAs often differ by only a few base pairs, from what I recall. If you had two similar miRNAs and only one of them was expressed, here is what you would expect from option 2 and option 3:

      Option 2: The alignment would occur for both miRNA sequences, however, as only one would be a perfect match, and the other would have 1 or more mismatches, the expression of only the perfectly matched miRNA would be detected.

      Option 3: As multiple alignments are allowed, both sequences would fall within the alignment criteria (assuming you allow more than 0 mismatches), and the expression of both would be detected.

      Also, if you were using option 3, you could probably only allow 0 mismatches, which may cause you to lose reads that option 2 could detect when allowing 1 or 2 mismatches.
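
      Here is a toy Python sketch of the two behaviours, assuming ungapped matching scored by Hamming distance. The two 22 nt sequences are invented and differ at one position, and the helper names are hypothetical; real aligners are of course more involved.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def best_unique_hit(read, refs, max_mismatches):
    """Option 2: report the single best hit, or None if missing or tied."""
    scored = sorted((hamming(read, seq), name) for name, seq in refs.items())
    scored = [(dist, name) for dist, name in scored if dist <= max_mismatches]
    if not scored:
        return None
    if len(scored) > 1 and scored[0][0] == scored[1][0]:
        return None  # equally good hits: not unique
    return scored[0][1]

def all_hits(read, refs, max_mismatches):
    """Option 3: report every reference within the mismatch limit."""
    return sorted(n for n, s in refs.items() if hamming(read, s) <= max_mismatches)

# Two made-up miRNA-like sequences differing at one position.
refs = {
    "miR-x": "ACGTTGCATGCAAGCTTAGGCA",
    "miR-y": "ACGTTGCATACAAGCTTAGGCA",
}
read = refs["miR-x"]  # a read from the expressed miRNA

print(best_unique_hit(read, refs, max_mismatches=1))  # miR-x
print(all_hits(read, refs, max_mismatches=1))         # ['miR-x', 'miR-y']
print(all_hits(read, refs, max_mismatches=0))         # ['miR-x']
```

      With 1 mismatch allowed, option 3 credits the unexpressed miR-y as well; dropping to 0 mismatches avoids that, at the cost of losing reads that carry a sequencing error or variant.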
      Last edited by Mike2188; 07-30-2014, 02:32 PM.


      • lre1234
        Senior Member
        • Aug 2011
        • 110

        #4
        Thanks for the replies.

        "What is your goal ultimately? If you are just looking for differential expression of miRNAs, perhaps you could get away with the first option, but I wouldn't recommend it."
        Our goals are to find differentially expressed miRNAs as well as any other short RNA species that are expressed. We would also like to do some novel miRNA searching. So based on this, aligning to the miRNAs is a bad choice. I have seen many papers use this method, but I am still strongly against it.

        "I wouldn't recommend the third option at all. Tophat will attempt to align and choose the best match for each read. If two reads have equal matching to the genome then one will be pseudorandomly selected (it isn't completely random, as the same input will always yield the same output), and will align. So, if you have sufficient coverage the expression will still be recorded for repetitive elements. If you had two identical miRNA sequences in the genome and one or both were being expressed, you would expect them both to be detected."
        I definitely agree with your points on the non-unique mapping, which is why I am trying to stay away from it.

        Although you mention this:

        "Also, if you were using option 3, you could probably only allow 0 mismatches, which may cause you to lose reads that option 2 could detect when allowing 1 or 2 mismatches."
        This was an option that I was considering: essentially allowing a read to go to multiple places, but in all instances allowing 0 mismatches. For cases in which 2 miRNAs have the exact same sequence (miR-103a, I believe, is an example), a read would map to both copies, but we would not know from which locus it is being expressed. This approach might also help with repetitive elements in the genome when each copy has the same sequence.

        Our samples are currently having their libraries made and should be on the machine next week. We'll see what happens. I'll try to post back with an update after trying a few different approaches.


        • fanli
          Senior Member
          • Jul 2014
          • 197

          #5
          There are a number of approaches for resolving multiply-mapped reads:




          You could also try using option 3 to discover new loci and then collapse redundant loci (such as the miR-103a in your example), then combine these with existing miR/smRNA databases and align using option 1.
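
          The collapsing step could look roughly like this in Python. It is only a sketch on invented loci and sequences, and it treats two loci as redundant only when their sequences are exactly identical.

```python
from collections import defaultdict

def collapse_by_sequence(loci):
    """Group locus names that share an identical sequence."""
    groups = defaultdict(list)
    for name, seq in loci.items():
        groups[seq.upper()].append(name)
    return {seq: sorted(names) for seq, names in groups.items()}

# Hypothetical loci discovered by non-unique genome alignment.
loci = {
    "locus_1": "ACGTTGCATGCAAGCTTAGGCA",
    "locus_2": "ACGTTGCATGCAAGCTTAGGCA",  # identical copy of locus_1
    "locus_3": "TTGACCGATAGCTAGGCTAACG",
}

collapsed = collapse_by_sequence(loci)

# Emit one reference entry per distinct sequence, named after all
# the loci it represents, ready to merge with a miR/smRNA database.
for seq, names in collapsed.items():
    print(">" + "|".join(names))
    print(seq)
```

          Counts against the collapsed reference are then per sequence rather than per locus, so identical copies no longer compete for or double-count the same reads.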

          Note that Bowtie is preferred over Bowtie2 for short reads, and that the default Tophat segment length is 25nt. You may want to tweak these things for smRNA alignment.
