Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Total RNAseq and cufflinks/cuffmerge/cuffdiff/cummeRbund pipeline

    Dear all,

    I'm looking for small, non-coding RNAs in 3 samples of mouse tissue. To do this we have performed 2x76bp strand specific RNAseq on total RNA depleted for ribosomal RNA with a kit called RiboMinus. I have the data from our collaborators in BAM format. Although this data is for non-coding RNA I want to extract information on gene expression levels from it as well, if possible. So far I have done the following using the Galaxy bioinformatics platform to try and get expression levels:

    1) Transcript assembly with cufflinks for each sample with no reference annotation
    2) Used cuffmerge to combine the 3 samples with the UCSC genes reference annotation
    3) Used cuffdiff on the output of cuffmerge, along with my 3 original BAM files
    4) Taken the output from cuffdiff and used R with cummeRbund to analyse the results

    I have attached the output of csVolcano and csScatter for one pairwise comparison to give an idea of the data.

    Now, my questions:

    a) Should I be worried about the appearance of the plots. I have highlighted the regions which appear to represent 'expression' in only one sample. They seem to form a large portion of my data - a problem with cufflinks, perhaps?

    b) Why, when I do

    Code:
    mygene <- getGene(cuff_data, 'GENENAME') 
    mygene
    does the output for that gene look like this:

    Code:
    CuffGene instance for gene XLOC_181410 
    Short name:	 uc009ajk.1 
    Slots:
    	 annotation
    	 fpkm
    	 diff
    	 isoforms	 CuffFeature instance of size 1 
    	 TSS		 CuffFeature instance of size 1 
    	 CDS		 CuffFeature instance of size 1
    No fpkm, no diff etc. even for genes I know are highly expressed. This means I cannot, for example, plot expressionBarPlot for my genes of interest.

    c) What general quality control recommendations can anyone make for determining how good this data is at capturing gene expression profiles, bearing in mind that it is total RNA and therefore (I assume) not as well suited as polyA-enriched RNA for this task?


    Thanks for reading this, this is my first foray into NGS data and my only information has come from the excellent Nature Protocols paper "Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks" by Trapnell et al. It is entirely probable that I have missed some pretty basic points as a result so please recommend any beginners literature if you can.

    Cheers,

    J
    Attached Files

  • #2
    re (a): This is a good example for my claim, stated often here, that one should use raw counts (number of reads mapped to a gene) and not FPKM data, when comparing samples. Maybe it is simply a weak gene, expressed by only two or three reads in one sample and no reads in the other, which is, however, very short, so that the FPKM value looks large, even though your evidence for its expression is very weak. Only raw counts show you the actual evidence, i.e., how often you have seen the gene.

    Comment


    • #3
      Hi JWC,

      For 1) and 2), please try to view the alignments (BAM) in a browser like IGV to actually examine the reads/coverage as the first step.

      Best regards,
      Douglas

      Comment


      • #4
        Hi,

        Thanks for the replies.

        DZhang - I have looked at the data in the UCSC genome browser - it looks how I would expect; genes I would expect to be expressed are highly covered whilst those I would not expect to be expressed are present at low levels or not at all. Would you recommend any further QC for RNA-seq data?

        Simon - Do you have any recommendations for pipelines which work on a RPKM rather than FPKM basis?

        Also, I seem to have fixed question b) in my original post - I can now pull out gene info with the short name (e.g. "Actb" for Beta-Actin). I did this by running cuffmerge with the ensembl reference annotations for the mouse genome (Mus_musculus.GRCm38.68.gtf) rather than the UCSC known_genes annotations, which label chromosomes "chr1" etc. rather than simply "1".

        Cheers,

        J

        Comment


        • #5
          Originally posted by JWC View Post
          Simon - Do you have any recommendations for pipelines which work on a RPKM rather than FPKM basis?
          RPKM instead of FPKM? What difference should that make?

          I recommended to make a scatter plot of raw counts.

          For analysis, I recommend, of course, our tool, DESeq, if you want to try an alternative to cuffdiff.

          Comment


          • #6
            Originally posted by JWC View Post
            DZhang - I have looked at the data in the UCSC genome browser - it looks how I would expect; genes I would expect to be expressed are highly covered whilst those I would not expect to be expressed are present at low levels or not at all. Would you recommend any further QC for RNA-seq data?
            I assume that DZhang did not mean that you should look at arbitrary genes, but rather at those which look strange in your scatter plot, i.e., at the genes that correspond to the dots that you marked with an arrow.

            Comment


            • #7
              Originally posted by JWC View Post
              c) What general quality control recommendations can anyone make for determining how good this data is at capturing gene expression profiles, bearing in mind that it is total RNA and therefore (I assume) not as well suited as polyA-enriched RNA for this task?
              Is total RNA (depleted for ribosomal RNA by e.g. Ribo-Zero Gold) still inferior to poly(A)-selection for analysis of gene expression, and alternative splicing? Compared to poly(A)-selection the possibility to get data on ncRNA too is attractive.

              I assume that there's a need for more sequencing depth with total RNA (rRNA depleted) because of a larger RNA pool, and that full coverage incl. miRNA is obstructed by size filtering of popular RNA extraction kits?

              Comment


              • #8
                Concerning sequencing depth for poly(A) enriched samples vs. total RNA (rRNA depleted): could you extrapolate from EpiCentre's RiboZero Gold/EMBL data, that you need about 3-4x more reads on total RNA in order to get a coverage comparable to Poly(A) enriched libraries?


                Last edited by kben; 08-01-2012, 03:52 AM. Reason: spelling

                Comment


                • #9
                  Originally posted by Simon Anders View Post
                  I assume that DZhang did not mean that you should look at arbitrary genes, but rather at those which look strange in your scatter plot, i.e., at the genes that correspond to the dots that you marked with an arrow.
                  Simon, thank yo for clarifying it for me.

                  JWC, how many reads do you have for each sample? Coverage is an important factor to consider if you are concerned about some genes are covered in only one of the three samples.

                  Best regards,
                  Douglas

                  Comment

                  Latest Articles

                  Collapse

                  • seqadmin
                    Strategies for Sequencing Challenging Samples
                    by seqadmin


                    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                    03-22-2024, 06:39 AM
                  • seqadmin
                    Techniques and Challenges in Conservation Genomics
                    by seqadmin



                    The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                    Avian Conservation
                    Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                    03-08-2024, 10:41 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by seqadmin, 03-27-2024, 06:37 PM
                  0 responses
                  15 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-27-2024, 06:07 PM
                  0 responses
                  13 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-22-2024, 10:03 AM
                  0 responses
                  55 views
                  0 likes
                  Last Post seqadmin  
                  Started by seqadmin, 03-21-2024, 07:32 AM
                  0 responses
                  70 views
                  0 likes
                  Last Post seqadmin  
                  Working...
                  X