Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to rescue multi-reads when using htseq to generate edgeR/DESeq counts?

    Hi. After using Bowtie2/Tophat2 for mapping an RNA-Seq study, I ran my data through htseq and then have aimed for methods like edgeR/DESeq that allow a neg. binomial distribution -- I gathered for differential gene expression, this may be more appropriate than Cufflinks/CuffDiff...?

    Yet because we just have Single End reads (100 bp HiSeq), there are a lot of multi-reads (=multi-hits / multi-mapped reads). HTSeq scores these as ambiguous and discards them; I'd like to rescue the multi-reads. Yet the Mortazavi method E-RANGE uses RPKM, and I'm unsure if I could reliably convert it back to raw counts for edgeR/DESeq. I found a newer package (BM-MAP) but am concerned (1) of the level of dependence it has on the estimate of polymorphism rates (if this is critical -- that could be a problem because I'm not sure if I can ascertain too reliably transcriptome-wide), and (2) I'm not sufficiently adept at coding to know how to translate the sam+ file it generates (appending a mapping probability to the end of each line in the sam file) into something I can put through htseq or some other program to generate counts, or to write my own script to do this. (3) Even if I CAN make this work, is it viable (or is the uncertainty due to using a Probability for multi-mapped, enough to make the DESeq/edgeR output unreliable)?

    I'm tempted to just run Cuffdiff instead -- to avoid losing the potential information gleaned by including multi-reads (for downstream analysis it may be enough to know if a certain gene family is up/down-regulated, even if we can't get at the exact gene). Yet again I gather that may be more for isoform-specific comparisons vs differential Gene expression?

    If anyone has time, suggestions are very welcome!
    ~Hilary

  • #2
    Hi Hilary, I am facing the same problem. At first I wrote a wrapper for a programme called SEQEM (http://www.ncbi.nlm.nih.gov/pubmed/21385047), which worked fine until the amount of data we were producing became too much for the programme *Snag 1* I was then able to get BM-MAP to work and use the sam+ files to assign the most probable location, but it was also unable to cope with the level of data that we are producing from a HiSeq2000 run. *Snag 2*

    One possibility for you would be to use RSEM. This uses an expectation maximisation approach to assign multiply mapped reads (for which the 'expected_count' output can be used in edgeR/DESeq or the programme EBSeq can, apparently, deal with RSEM output to give you DEGs). However, with RSEM you have to align to a transcriptome unless your genome is unspliced, as RSEM cannot cope with spliced reads. *Snag 3*

    If you are not using matched samples then I believe that Cuffdiff would be a very good choice for you as it can deal with gene-level as well as isoform-level. My issue is that I have paired samples (disease and non-disease tissues from the same patient) and Cuffdiff does not use this info, meaning I don't get the power I want from the study. *Snag 4* I am, therefore, wanting to use edgeR or DESeq... which means I am back to square one with how to assign multireads so I can get my count data....

    I am going to attempt to use Cufflinks to assign multiply mapped reads in the probabilistic manner it employs using the -u parameter, and then use the output FPKM values, per sample, to get count data (i.e. multiply by gene length and reads mapped) and then make use of the paired tests in edgeR (*phew*) but if anyone else has suggestions/comments regarding the issue of both a) using multireads in my analysis and b) performing tests on matched samples, I would love to hear them.

    As an aside, I am also using RSEM so I can see how the 2 different approaches affects the lists of significantly DEGs I get at the end. If you are interested in the results I will post them here

    Cheers

    EDIT: I have realised that there is a raw_frags column in the read_group_tracking output from cuffdiff so I am going to use that rather than fudging the FPKM
    Last edited by SEQquestions; 05-02-2013, 11:05 PM. Reason: spelling mistakes and a little more info needed

    Comment


    • #3
      Hi. Statistically our design is more complex than pairwise comparisons, so we also have trouble applying Cufflinks/Cuffdiff. I'm not (yet ... pending if I can find the time to teach myself) adept at perl/python parsing so I admit I was also a bit swayed against BM-MAP due to being unsure of an efficient way to convert the sam+ into a usable, standard sam format.

      I know there are some issues in converting from the Cufflinks output to edgeR and other count-based approaches (http://seqanswers.com/forums/showthread.php?t=5793). Though I think there's some option for raw counts now vs FPKM in one of the outputs.
      If memory serves me right, RSEM also uses the FPKM approach and wouldn't be easily compatible with edgeR/DESeq.

      So I have yet to find a good solution. I was thinking of a hybrid approach: basing most analysis on the more conservative (and hopefully easily defensible) route of discarding multi-mapped reads (using htseq with the "union" option, then edgeR) and just for comparison, perhaps also reporting what a Cufflinks/Cuffdiff run would yield. It's not ideal, especially because our data is single-end so of course we can't assign all reads -- and for our purposes it'd be good enough to know if a read belonged to gene family X (or to gene Y vs knowing the exact isoform -- we're looking for the big picture first).

      Thank you for your insight. The Tuxedo suite/Cufflinks has some really nice features, as does the htseq/edgeR/DESeq route; it's too bad there isn't a way to combine the best of both worlds. I think the fact that the uncertainty from mapping multi-hits isn't accounted for with edgeR etc., also adds a level of error -- and that the extent of the possible error / P value inflation isn't known (unless someone has done a simulation study I'm not aware of ...).
      Best,
      Hilary

      Comment


      • #4
        See here for an explanation of the rational behind htseq-count's discarding of non-unique alignments: http://sourceforge.net/p/htseq/support-requests/10/

        Comment

        Latest Articles

        Collapse

        • seqadmin
          Strategies for Sequencing Challenging Samples
          by seqadmin


          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
          03-22-2024, 06:39 AM
        • seqadmin
          Techniques and Challenges in Conservation Genomics
          by seqadmin



          The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

          Avian Conservation
          Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
          03-08-2024, 10:41 AM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by seqadmin, Yesterday, 06:37 PM
        0 responses
        11 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, Yesterday, 06:07 PM
        0 responses
        10 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 03-22-2024, 10:03 AM
        0 responses
        51 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 03-21-2024, 07:32 AM
        0 responses
        68 views
        0 likes
        Last Post seqadmin  
        Working...
        X