  • MichalO
    Member
    • Jan 2011
    • 10

    counting wars ;) HTSeq vs RSEM

    We are having an internal lab discussion right now, a classic case of "mapper wars". Which is better, e.g. in the sense of being "closer to real biology":

    HTSeq (opponents claim: "does not do counting well in case of overlapping genes")
    RSEM at the level of gene summaries: is its model good enough to tell where a read comes from when genes overlap? If so, is this advantage important enough that we should give up HTSeq?

    I was defending the HTSeq side a bit, as I know that SimonA. knows well what he's doing and RSEM is meant more for transcript deconvolution than for gene-level counting... but I have run out of arguments.

    Did anyone do a comparison like that or has a good intuition to help?

    Any suggestions welcome! Thanks!
  • dpryan
    Devon Ryan
    • Jul 2011
    • 3478

    #2
    If overlapping genes are such an issue for whatever you're working on, just use a stranded library prep. The likely more common objection to HTSeq is that it "ignores" multimappers rather than trying to extract some meaning from them. Honestly, that particular objection has never really swayed me, since the regions of genes not giving rise to multimapping reads should suffice to provide enough reliable signal for differential expression.

    Which method you choose will largely come down to how risk averse you are and what your downstream needs will be. If I'm going to use RNAseq results to generate a transgenic mouse or start some drug screens, I'm not going to spend time with RSEM data, the validity of which I'm nowhere near 100% certain of.
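
To make the multimapper point concrete, here is a minimal sketch of unique-only counting. The records, gene names, and the `count_unique` helper are made up for illustration. With real data the NH value (number of reported alignments) would come from the aligner's SAM/BAM output, and what htseq-count actually does depends on the aligner and the options used.

```python
from collections import Counter

# Hypothetical toy records: (read_id, gene, NH). NH mimics the SAM tag for
# "number of reported alignments"; real values come from the aligner.
alignments = [
    ("r1", "GENE_A", 1),
    ("r2", "GENE_A", 1),
    ("r3", "GENE_A", 2),         # multimapper: also aligns to GENE_A_PSEUDO
    ("r3", "GENE_A_PSEUDO", 2),
    ("r4", "GENE_B", 1),
]

def count_unique(records):
    """Count only reads reported as aligning to a single location."""
    counts = Counter()
    skipped = set()
    for read_id, gene, nh in records:
        if nh == 1:
            counts[gene] += 1
        else:
            skipped.add(read_id)  # multimapped read, not assigned to any gene
    return counts, len(skipped)

counts, n_multi = count_unique(alignments)
print(dict(counts), "multimapped reads skipped:", n_multi)
```

GENE_A still gets counts from its uniquely mappable reads, which is the signal the argument above relies on.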


    • MichalO
      Member
      • Jan 2011
      • 10

      #3
      Thanks dpryan! The stranded protocol is definitely a good point here. Still it costs some $100 per sample, so thrifty biologists often skip it...

      Originally posted by dpryan
      If I'm going to use RNAseq results to generate a transgenic mouse or start some drug screens, I'm not going to spend time with RSEM data, the validity of which I'm nowhere near 100% certain of.
      Could you briefly write down your objections to RSEM? I have mine, like heavy dependence on annotation, uncertainty when there are many isoforms, etc. Thanks!


      • jparsons
        Member
        • Feb 2012
        • 62

        #4
        So I pulled up HTSeq data and RSEM data from the same run, which I have because I've been trying to come up with a good metric to judge quantitation (both of genes and transcripts).

        Generally, the HTSeq counts and the RSEM expected counts are within a few percent of one another. However, there are some significant outliers, which from a cursory inspection appear to be almost exclusively mitochondrial genes, presumably ones which consist entirely of multi-mapped reads. HTSeq also assigns some low counts to some pseudogenes, which RSEM seems to avoid doing.

        I usually advocate HTSeq for gene counting due to its simplicity, but I'd say that RSEM is on the right side of what we consider to be biological 'truth' in this comparison.
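
A comparison like this can be sketched in a few lines. The numbers below are invented, not the data from this run, and the 20% threshold and both helper names are assumptions chosen for illustration. With real data the two dictionaries would be filled from the htseq-count output and from RSEM's gene-level results.

```python
def relative_difference(a, b):
    """Symmetric relative difference in [0, 2]; 0 when both counts are 0."""
    denom = (a + b) / 2
    return 0.0 if denom == 0 else abs(a - b) / denom

def flag_outliers(htseq, rsem, threshold=0.2):
    """Return genes whose two count estimates differ by more than threshold."""
    flagged = []
    for gene in sorted(set(htseq) | set(rsem)):
        h = htseq.get(gene, 0)
        r = rsem.get(gene, 0.0)
        d = relative_difference(h, r)
        if d > threshold:
            flagged.append((gene, h, r, round(d, 2)))
    return flagged

# Made-up example values (assumption): most genes agree within a few percent,
# one multimapper-heavy gene and one pseudogene disagree strongly.
htseq_counts = {"GENE_A": 1000, "GENE_B": 250, "MT_GENE": 10, "PSEUDO_1": 5}
rsem_expected = {"GENE_A": 1020.5, "GENE_B": 245.0, "MT_GENE": 900.0, "PSEUDO_1": 0.0}

flagged = flag_outliers(htseq_counts, rsem_expected)
for row in flagged:
    print(row)
```

The symmetric form avoids dividing by zero when one tool reports no counts for a gene, which is exactly the case of interest here.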


        • MichalO
          Member
          • Jan 2011
          • 10

          #5
          Thanks a lot too! That's what I suspected: some small artifacts on both sides, no big differences, at least at the gene level. I have to stop being lazy and try it myself. What was the species? H. sapiens?

          Originally posted by jparsons
          both of genes and transcripts
          Did you run HTSeq at the transcript level? And was it indeed similar?


          • jparsons
            Member
            • Feb 2012
            • 62

            #6
            It was a human sample. HTSeq claims not to work at the transcript level, so I used other programs there. I might just throw it at the wall anyway, but I don't have high expectations.


            • chadn737
              Senior Member
              • Jan 2009
              • 392

              #7
              Originally posted by jparsons
              So I pulled up HTSeq data and RSEM data from the same run, which I have because I've been trying to come up with a good metric to judge quantitation (both of genes and transcripts).

              Generally, the HTSeq counts and the RSEM expected counts are within a few percent of one another. However, there are some significant outliers, which from a cursory inspection appear to be almost exclusively mitochondrial genes, presumably ones which consist entirely of multi-mapped reads. HTSeq also assigns some low counts to some pseudogenes, which RSEM seems to avoid doing.

              I usually advocate HTSeq for gene counting due to its simplicity, but I'd say that RSEM is on the right side of what we consider to be biological 'truth' in this comparison.
              The statement that HTSeq "assigns some low counts to some pseudogenes which RSEM seems to avoid doing" does not make sense to me given how htseq-count works: reads assigned to pseudogenes would have to be uniquely aligned there in the first place by the aligner. Unless, of course, these are specifically pseudogenes overlapping other genes, and even then the read would have to come largely from the pseudogene not to be discarded by htseq-count's default settings.
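
As a rough illustration of the overlap handling, here is a simplified sketch in the spirit of htseq-count's "union" mode: a read touching exactly one gene is counted for it, a read touching more than one is set aside as ambiguous. The coordinates and gene names are invented, and the sketch ignores strand, exon structure, and alignment quality, all of which matter in the real tool.

```python
def genes_overlapping(read, genes):
    """Return names of genes whose half-open interval overlaps the read."""
    start, end = read
    return {name for name, (g_start, g_end) in genes.items()
            if start < g_end and g_start < end}

def assign_union(read, genes):
    """Assign a read to a gene only if it overlaps exactly one gene."""
    hits = genes_overlapping(read, genes)
    if not hits:
        return "__no_feature"
    if len(hits) > 1:
        return "__ambiguous"
    return next(iter(hits))

# Hypothetical annotation: a pseudogene partly overlapping a gene.
genes = {"GENE_A": (100, 500), "PSEUDO_1": (450, 700)}

reads = [(120, 170), (460, 510), (600, 650), (800, 850)]
assignments = [assign_union(r, genes) for r in reads]
print(assignments)
```

In this toy model only the read lying entirely in the non-overlapping part of the pseudogene is counted for it, so a pseudogene count implies the aligner placed the read there uniquely.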


              • jparsons
                Member
                • Feb 2012
                • 62

                #8
                It didn't make sense to me either, but when I was looking for places where there were discrepancies, that's what popped up. If I had to hypothesize, I would think that the pseudogene has unique sequence relative to the main gene, which by chance a sequencing error manages to catch. The alignment settings that RSEM uses were not identical to the ones I used for HTSeq, and may have been differently tolerant of mismatches, or maybe RSEM decided that a one-mismatch (mm1) alignment to the main gene was more likely than a perfect match to the pseudogene.
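
The last hypothesis can be illustrated with a toy expectation-maximization loop. This is not RSEM's actual model, which accounts for much more (fragment lengths, read positions, quality scores, transcript lengths). The function name, the gene names, and the likelihood of 0.1 for a one-mismatch alignment are all assumptions made up for this sketch.

```python
def em_expected_counts(reads, genes, n_iter=50):
    """Toy EM: reads is a list of {gene: likelihood} dicts for compatible genes."""
    theta = {g: 1.0 / len(genes) for g in genes}  # start from uniform abundances
    for _ in range(n_iter):
        expected = {g: 0.0 for g in genes}
        # E-step: split each read among its genes by abundance * likelihood
        for lik in reads:
            weights = {g: theta[g] * p for g, p in lik.items()}
            total = sum(weights.values())
            for g, w in weights.items():
                expected[g] += w / total
        # M-step: abundances are the normalized expected counts
        theta = {g: expected[g] / len(reads) for g in genes}
    return expected

genes = ["GENE_A", "PSEUDO_1"]
# 20 reads matching only GENE_A, plus one read that matches PSEUDO_1
# perfectly and GENE_A with one mismatch (assumed likelihood 0.1).
reads = [{"GENE_A": 1.0}] * 20 + [{"GENE_A": 0.1, "PSEUDO_1": 1.0}]

expected = em_expected_counts(reads, genes)
print({g: round(c, 3) for g, c in expected.items()})
```

Because the main gene is well supported by other reads, the ambiguous read ends up attributed almost entirely to it, while a counter working from the best alignment alone would give the pseudogene one read.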


                • chadn737
                  Senior Member
                  • Jan 2009
                  • 392

                  #9
                  Originally posted by jparsons
                  It didn't make sense to me either, but when I was looking for places where there were discrepancies, that's what popped up. If I had to hypothesize, I would think that the pseudogene has unique sequence relative to the main gene, which by chance a sequencing error manages to catch. The alignment settings that RSEM uses were not identical to the ones I used for HTSeq, and may have been differently tolerant of mismatches, or maybe RSEM decided that a one-mismatch (mm1) alignment to the main gene was more likely than a perfect match to the pseudogene.
                  Then that is a difference between aligners, not between htseq-count and RSEM. htseq-count does not align reads or determine their locations. That is done by whatever aligner is used prior to that. So an observed discrepancy in this instance will have arisen at earlier steps and is not a valid comparison of RSEM and htseq-count.


                  • Simon Anders
                    Senior Member
                    • Feb 2010
                    • 995

                    #10
                    I would like to add that RSEM and htseq-count are tools with different purposes. RSEM is designed to quantify expression strength; htseq-count is not! Rather, it is a tool for the express and sole purpose of forming the first step of an analysis for differential expression on the gene level. See my post #4 in this thread for an elaboration why these two goals suggest different treatments of overlapping genes and multimapping reads.


                    • MichalO
                      Member
                      • Jan 2011
                      • 10

                      #11
                      Thanks a lot, Simon! Precise and to the point as usual!


                      • lpachter
                        Member
                        • Feb 2010
                        • 40

                        #12
                        It's tempting to think that how one counts doesn't matter (for differential expression purposes), but here I argue that it does:

                        RNA-Seq is the new kid on the block, but there is still something to be learned from the stodgy microarray. One of the lessons is hidden in a tech report by Daniela Witten and Robert Tibshirani fro…
