SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
cuffcompare or cuffmerge for cuffdiff upper Bioinformatics 6 04-07-2015 03:06 PM
Can cummeRbund analyze the output of running cufflinks? jerryeah RNA Sequencing 7 08-03-2013 04:09 PM
Biological replicates with cuffdiff, cummeRbund turnersd Bioinformatics 15 11-19-2012 05:59 AM
RNASeq from total RNA with a RIN under 8 jo_mason Sample Prep / Library Generation 2 11-17-2011 06:27 PM
RNAseq with <100 ng total RNA input lukas1848 Sample Prep / Library Generation 0 10-26-2011 02:14 AM

Reply
 
Thread Tools
Old 07-31-2012, 02:54 AM   #1
JWC
Junior Member
 
Location: Europe

Join Date: Jul 2012
Posts: 2
Default Total RNAseq and cufflinks/cuffmerge/cuffdiff/cummeRbund pipeline

Dear all,

I'm looking for small, non-coding RNAs in 3 samples of mouse tissue. To do this we have performed 2x76bp strand specific RNAseq on total RNA depleted for ribosomal RNA with a kit called RiboMinus. I have the data from our collaborators in BAM format. Although this data is for non-coding RNA I want to extract information on gene expression levels from it as well, if possible. So far I have done the following using the Galaxy bioinformatics platform to try and get expression levels:

1) Transcript assembly with cufflinks for each sample with no reference annotation
2) Used cuffmerge to combine the 3 samples with the UCSC genes reference annotation
3) Used cuffdiff on the output of cuffmerge, along with my 3 original BAM files
4) Taken the output from cuffdiff and used R with cummeRbund to analyse the results

I have attached the output of csVolcano and csScatter for one pairwise comparison to give an idea of the data.

Now, my questions:

a) Should I be worried about the appearance of the plots. I have highlighted the regions which appear to represent 'expression' in only one sample. They seem to form a large portion of my data - a problem with cufflinks, perhaps?

b) Why, when I do

Code:
mygene <- getGene(cuff_data, 'GENENAME') 
mygene
does the output for that gene look like this:

Code:
CuffGene instance for gene XLOC_181410 
Short name:	 uc009ajk.1 
Slots:
	 annotation
	 fpkm
	 diff
	 isoforms	 CuffFeature instance of size 1 
	 TSS		 CuffFeature instance of size 1 
	 CDS		 CuffFeature instance of size 1
No fpkm, no diff etc. even for genes I know are highly expressed. This means I cannot, for example, plot expressionBarPlot for my genes of interest.

c) What general quality control recommendations can anyone make for determining how good this data is at capturing gene expression profiles, bearing in mind that it is total RNA and therefore (I assume) not as well suited as polyA-enriched RNA for this task?


Thanks for reading this, this is my first foray into NGS data and my only information has come from the excellent Nature Protocols paper "Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks" by Trapnell et al. It is entirely probable that I have missed some pretty basic points as a result so please recommend any beginners literature if you can.

Cheers,

J
Attached Images
File Type: jpg Scatter.jpg (78.1 KB, 105 views)
File Type: jpg Volcano.jpg (77.3 KB, 106 views)
JWC is offline   Reply With Quote
Old 07-31-2012, 03:58 AM   #2
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

re (a): This is a good example for my claim, stated often here, that one should use raw counts (number of reads mapped to a gene) and not FPKM data, when comparing samples. Maybe it is simply a weak gene, expressed by only two or three reads in one sample and no reads in the other, which is, however, very short, so that the FPKM value looks large, even though your evidence for its expression is very weak. Only raw counts show you the actual evidence, i.e., how often you have seen the gene.
Simon Anders is offline   Reply With Quote
Old 07-31-2012, 10:51 AM   #3
DZhang
Senior Member
 
Location: East Coast, US

Join Date: Jun 2010
Posts: 177
Default

Hi JWC,

For 1) and 2), please try to view the alignments (BAM) in a browser like IGV to actually examine the reads/coverage as the first step.

Best regards,
Douglas
www.contigexpress.com
DZhang is offline   Reply With Quote
Old 08-01-2012, 01:20 AM   #4
JWC
Junior Member
 
Location: Europe

Join Date: Jul 2012
Posts: 2
Default

Hi,

Thanks for the replies.

DZhang - I have looked at the data in the UCSC genome browser - it looks how I would expect; genes I would expect to be expressed are highly covered whilst those I would not expect to be expressed are present at low levels or not at all. Would you recommend any further QC for RNA-seq data?

Simon - Do you have any recommendations for pipelines which work on a RPKM rather than FPKM basis?

Also, I seem to have fixed question b) in my original post - I can now pull out gene info with the short name (e.g. "Actb" for Beta-Actin). I did this by running cuffmerge with the ensembl reference annotations for the mouse genome (Mus_musculus.GRCm38.68.gtf) rather than the UCSC known_genes annotations, which label chromosomes "chr1" etc. rather than simply "1".

Cheers,

J
JWC is offline   Reply With Quote
Old 08-01-2012, 01:28 AM   #5
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by JWC View Post
Simon - Do you have any recommendations for pipelines which work on a RPKM rather than FPKM basis?
RPKM instead of FPKM? What difference should that make?

I recommended to make a scatter plot of raw counts.

For analysis, I recommend, of course, our tool, DESeq, if you want to try an alternative to cuffdiff.
Simon Anders is offline   Reply With Quote
Old 08-01-2012, 01:29 AM   #6
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by JWC View Post
DZhang - I have looked at the data in the UCSC genome browser - it looks how I would expect; genes I would expect to be expressed are highly covered whilst those I would not expect to be expressed are present at low levels or not at all. Would you recommend any further QC for RNA-seq data?
I assume that DZhang did not mean that you should look at arbitrary genes, but rather at those which look strange in your scatter plot, i.e., at the genes that correspond to the dots that you marked with an arrow.
Simon Anders is offline   Reply With Quote
Old 08-01-2012, 01:49 AM   #7
kben
Member
 
Location: Frankfurt/Main

Join Date: Feb 2012
Posts: 17
Default

Quote:
Originally Posted by JWC View Post
c) What general quality control recommendations can anyone make for determining how good this data is at capturing gene expression profiles, bearing in mind that it is total RNA and therefore (I assume) not as well suited as polyA-enriched RNA for this task?
Is total RNA (depleted for ribosomal RNA by e.g. Ribo-Zero Gold) still inferior to poly(A)-selection for analysis of gene expression, and alternative splicing? Compared to poly(A)-selection the possibility to get data on ncRNA too is attractive.

I assume that there's a need for more sequencing depth with total RNA (rRNA depleted) because of a larger RNA pool, and that full coverage incl. miRNA is obstructed by size filtering of popular RNA extraction kits?
kben is offline   Reply With Quote
Old 08-01-2012, 03:41 AM   #8
kben
Member
 
Location: Frankfurt/Main

Join Date: Feb 2012
Posts: 17
Default

Concerning sequencing depth for poly(A) enriched samples vs. total RNA (rRNA depleted): could you extrapolate from EpiCentre's RiboZero Gold/EMBL data, that you need about 3-4x more reads on total RNA in order to get a coverage comparable to Poly(A) enriched libraries?


http://www.epibio.com/item.asp?id=589

Last edited by kben; 08-01-2012 at 03:52 AM. Reason: spelling
kben is offline   Reply With Quote
Old 08-01-2012, 05:41 AM   #9
DZhang
Senior Member
 
Location: East Coast, US

Join Date: Jun 2010
Posts: 177
Default

Quote:
Originally Posted by Simon Anders View Post
I assume that DZhang did not mean that you should look at arbitrary genes, but rather at those which look strange in your scatter plot, i.e., at the genes that correspond to the dots that you marked with an arrow.
Simon, thank yo for clarifying it for me.

JWC, how many reads do you have for each sample? Coverage is an important factor to consider if you are concerned about some genes are covered in only one of the three samples.

Best regards,
Douglas
www.contigexpress.com
DZhang is offline   Reply With Quote
Reply

Tags
cuffdiff, cufflinks, cuffmerge, cummerbund, rnaseq alignment tuxedo

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:47 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO