Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • RNA-seq: calculating per-gene coverage

    Hello,

    I know that this topic has been massively covered in this forum, but I wasn't able to find the right answer for my questions.

    Basically, I'm dealing with some bacterial RNA-seq data. I have two biological replicates and two conditions (so four libraries). I used bowtie for alignment of my reads and subsequently DESeq to calculate means/fold change/stats in order to compare my two conditions.

    Now, I would like to calculate gene coverage per 1kb (relative) in order to compare absolute expression of every gene. I was thinking that if I take number of reads and I divide it by the gene length and then multiply by 1000 I will get some kind of "gene expression ratio per 1kb" which I could use to compare for all my genes. So, I could say that gene "A" is ten times more expressed then gene "B" just by comparing those ratios. Is this the right way of doing it, or perhaps I'm missing something?

    Also, I sought to use BEDTools for this, but I don't know if I could insert multiple files (biological replicates) for my analysis. Anybody know how?

    Finaly, anybody know how to extract gene length from a GFF or even GenBank file? BEDTools does it but I have duplicated values (for gene/CDS/exons).

    Thanks in advance! I don't know what I would do without this forum...

    TP

  • #2
    After re-reading a couple of papers having for subject RNA-seq analysis, I'm a little bit confused. Many papers use Gene Expression Index (GEI) and it is calculated as mean coverage depth of the gene normalized by the total number of reads mapped to the non-rRNA regions (Wang 2011, BMC Genomics)

    Well, in my case, I've performed HTSeq to get reads/gene count and then DESeq to compare those data for significant genes. And, normally, DESeq already takes into account library depth and perform correction, right? So, I can't do it again?

    I might be missing something here... any help would be greatly appreciated!

    TP

    Comment


    • #3
      Is there a reason you don't want to use RPKM (reads per kilobase per million reads), such as you'd get out of a program like Cufflinks?

      RPKM values are nice, since they correct for different samples arbitrarily being sequenced to different depths (and unless you magically loaded precisely the same amount on the sequencer, and the poisson noise worked exactly to zero, you did sequence them to slightly different depths). On the other hand, the statistics for properly interpreting differential expression from RPKM alone are unconvincing to me, at best, so you'd want to use something like DESeq anyways.

      EDIT: another nice advantage is that people are used to seeing RPKM values, and have a sense of how to reason about them, and where they work/don't work, so reviewers will more readily accept them than something you cobbled together.
      Last edited by rflrob; 10-17-2012, 04:44 PM.

      Comment


      • #4
        Thanks for your answer.

        Well, that's exactly why I went for DESeq: it works with raw data (number of reads that mapped to a specific gene). Since I have only two biological replicates, I was not confident to compare with RPKM.

        On the other side, if I just want to get the general idea of gene expression in my samples, I cannot compare raw counts for two genes that have different length and coming from different sequencing lanes (sequencing depth being different). I don't know if I'm being clear but, for example, let's say that a gene A and B have both 100 reads aligned but gene A is 200pb long and gene B 500pb, their absolute expression is not the same even though they have the same coverage (reads/gene).

        Next, since data obtained from DESeq (coverage per gene, i.e. number of aligned reads per gene) is already normalized for sequencing depth, I could just take those data and divide it by genes lenght, so I would take into account seq-depth and gene length. Sound logic to me... and almost like RPKM.

        Finally, I do agree that calculating RPKM is convenient because everybody is familiar with that and won't stick on it.

        What do you think?

        Comment


        • #5
          The raw data (counts, i.e. DESeq, edgeR, etc) is the way to go for differential expression.

          FPKM is the way to go for relative expression within a condition.

          You just need to understand the limitations and deficiencies.

          For example, map-ability can drastically effect your FPKM value. As can other biases.
          --------------
          Ethan

          Comment


          • #6
            Okay! So, I have to use cufflinks to get RPKM? (I'm dealing with single-end data, so it's RPKM in my case). I'll try that...

            And, while we are on that subject, can you (or someone) give me some limitations and deficiencies on RPKM? What you mean by map-ability? The "ability" of reads to map on a reference sequence? Sorry if it sounds basic to you guys, but I just want to be sure I understand all this

            TP

            Comment


            • #7
              I can't say for certain where I got this impression, but I thought that relative expression in RNA-seq was close to un-knowable. (I know there have been papers criticizing RPKM lately*)

              There are just too many other factors in play (most notably differential PCR amplification of A vs B, although others exist) to be able to confidently say that 10X as many counts reflects a true difference in relative gene expression.

              Even spiked-in controls with known 'truth' values for their relative gene expression have some difficulty comparing A to B**

              *Bias detection and correction in RNA-Sequencing data.
              *Measurement of mRNA abundance using RNA-seq data: RPKM measure is inconsistent among samples
              **Synthetic spike-in standards for RNA-seq experiments.

              Comment


              • #8
                Originally posted by jparsons View Post

                There are just too many other factors in play (most notably differential PCR amplification of A vs B, although others exist) to be able to confidently say that 10X as many counts reflects a true difference in relative gene expression.
                I do agree with this however I think it can give an overall idea... I'm not sure how much can we actually be confident in interpretation but it can give an idea especially if the difference is huge.

                Comment


                • #9
                  Hi, does anybody know how to get the result of raw number of mapped reads for each gene?I don't want the FPKM or RPKM value.

                  Comment


                  • #10
                    HTSeq-count does the job. I have some scripts that will take the grunt work out of it, but it takes a little time to set up and there's a reasonable chance it won't work on your platform.
                    I updated a bunch of my scripts and they are working on Linux and Mac now.  Nothing innovative as I am not a statistician or informatician, but there are a couple nice easy to use scripts that link…
                    --------------
                    Ethan

                    Comment

                    Latest Articles

                    Collapse

                    • seqadmin
                      Strategies for Sequencing Challenging Samples
                      by seqadmin


                      Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                      03-22-2024, 06:39 AM
                    • seqadmin
                      Techniques and Challenges in Conservation Genomics
                      by seqadmin



                      The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                      Avian Conservation
                      Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                      03-08-2024, 10:41 AM

                    ad_right_rmr

                    Collapse

                    News

                    Collapse

                    Topics Statistics Last Post
                    Started by seqadmin, Yesterday, 06:37 PM
                    0 responses
                    12 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, Yesterday, 06:07 PM
                    0 responses
                    10 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-22-2024, 10:03 AM
                    0 responses
                    51 views
                    0 likes
                    Last Post seqadmin  
                    Started by seqadmin, 03-21-2024, 07:32 AM
                    0 responses
                    68 views
                    0 likes
                    Last Post seqadmin  
                    Working...
                    X