![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
Differential expression from RNA-seq: variation between replicates | beans | RNA Sequencing | 6 | 11-03-2011 10:45 AM |
Differential Expression analysis without replicates | polsum | Bioinformatics | 1 | 08-05-2011 04:40 AM |
help with differential gene expression with cufflinks and tophat | waterboy | Bioinformatics | 1 | 11-28-2010 10:51 AM |
Differential gene expression of gene clusters | anjana.vr | RNA Sequencing | 1 | 10-28-2010 11:33 AM |
Differential gene expression: Can Cufflinks/Cuffcompare handle biological replicates? | marcora | Bioinformatics | 0 | 05-19-2010 02:11 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Member
Location: Pasadena, CA USA Join Date: Jan 2010
Posts: 52
|
![]()
I am new to this RNAseq business and I would like to use it to measure differential gene expression between wild-type and mutant mouse brains.
I have designed my experiment to have 4 biological replicates each for case and control. I really like bowtie/tophat and would like to use cufflinks/cuffdiff to measure diff gene expression, but I can't seem to find a way to make cuffdiff handle biological replicates (which I would hope is the way everybody designs their experiments ![]() ![]() I would greatly appreciate any feedback from this community of experts! Cheers ![]() |
![]() |
![]() |
![]() |
#2 |
Senior Member
Location: Heidelberg, Germany Join Date: Feb 2010
Posts: 994
|
![]() |
![]() |
![]() |
![]() |
#3 | |
Member
Location: Pasadena, CA USA Join Date: Jan 2010
Posts: 52
|
![]() Quote:
indeed I have very much enjoyed using HTSeq tools and I am looking forward to use DESeq. However, I would love to see an integrated solution such as bowtie>tophat>cufflinks because I would like to perform the analysis using two different set of tools and make sure the results match to a large extent. In regards to htseq-count this is my oversimplified pipeline at the moment (disclaimer: I have been doing RNAseq analysis for only a few days now): - run tophat using the RefSeq reference in GFF format without novel junctions discovery which outputs 'accepted_hits.sam' - run cufflinks on 'accepted_hits.sam' using the RefSeq reference in GTF format which outputs 'transcripts.gtf' - run htseq-count as follows: Code:
htseq-count accepted_hits.sam transcripts.gtf > counts.table Code:
chr1 Cufflinks transcript 136186581 136189125 1000 + . gene_id "Myog"; transcript_id "Myog"; FPKM "81.2341305460"; frac "1.000000"; conf_lo "76.954202"; conf_hi "85.514059"; cov "36.025000"; chr1 Cufflinks exon 136186581 136187103 1000 + . gene_id "Myog"; transcript_id "Myog"; exon_number "1"; FPKM "81.2341305460"; frac "1.000000"; conf_lo "76.954202"; conf_hi "85.514059"; cov "36.025000"; chr1 Cufflinks exon 136187670 136187751 1000 + . gene_id "Myog"; transcript_id "Myog"; exon_number "2"; FPKM "81.2341305460"; frac "1.000000"; conf_lo "76.954202"; conf_hi "85.514059"; cov "36.025000"; chr1 Cufflinks exon 136188251 136189125 1000 + . gene_id "Myog"; transcript_id "Myog"; exon_number "3"; FPKM "81.2341305460"; frac "1.000000"; conf_lo "76.954202"; conf_hi "85.514059"; cov "36.025000"; Code:
Zxdc 88 Zyg11a 0 Zyg11b 760 Zyx 97 Zzef1 150 Zzz3 163 l7Rn6 181 rp9 123 no_feature 10221124 ambiguous 15234 The size of accepted_hits.sam is 17.283 million lines. Given that the combination of all exons of Myog is 1.477 Kb and the reported FPKM (RPKM in this case because I am using reads that are NOT paired end) is 81.234 the Myog raw count from cufflinks output should be RPKM * length (Kb) * total no of reads (million) = 81.234 * 1.477 * 17.283 = 2,073 ![]() Not sure what I am doing wrong but both cufflinks and htseq-count outputs are, at this early stage of my understanding, very confusing in the fact that they don't match. Any help would be greatly appreciated ![]() Edoardo "Dado" Marcora p.s.: I am very soon going to install IGV on my machine and manually count the reads mapped to the Myog gene locus ![]() UPDATE: by approximate visual inspection, it looks like the Myog raw count estimated using cufflinks output is the correct one... htseq-count is lower probably because of all those "no_feature" hits Last edited by marcora; 05-19-2010 at 07:03 AM. |
|
![]() |
![]() |
![]() |
#4 |
Senior Member
Location: Heidelberg, Germany Join Date: Feb 2010
Posts: 994
|
![]()
Unless you have stranded RNA-Seq data, it is important to use the option "--stranded=no". Otherwise, htseq-count discards everything which is on the other strand.
Could this be the reason? (Maybe, I should mark this fact in bold on the web page.) Simon |
![]() |
![]() |
![]() |
#5 |
Member
Location: USA TN Join Date: Mar 2010
Posts: 17
|
![]()
Currently I am using the HTSeq to get count number. After adding -s, the result is much better.
Command: python -m HTSeq.scripts.count accepted_hits.sam transcripts.gtf >s1.count -s no Thanks both. |
![]() |
![]() |
![]() |
#6 |
Senior Member
Location: Kansas City Join Date: Mar 2008
Posts: 197
|
![]()
You can also count reads using bedtools on a bam file, like this.
coverageBed -abam ../data/accepted_hits.sorted.bam -b ../data/Mus_musculus.NCBIM37.52.bed > counts_bam.txt I think it's really smart to check things out. There's a lot that can go wrong with all these complicated programs, sequences, etc. |
![]() |
![]() |
![]() |
#7 | |
Member
Location: Pasadena, CA USA Join Date: Jan 2010
Posts: 52
|
![]() Quote:
Code:
Zxdc 170 Zyg11a 5 Zyg11b 1554 Zyx 209 Zzef1 297 Zzz3 375 l7Rn6 365 rp9 229 no_feature 5949010 ambiguous 63483 By the way, here is the script that I am using to generate raw counts for genes from 'accepted_hits.sam' (tophat output) and 'transcripts.gtf' (cufflinks output); perhaps somebody may find it useful (the output should be directly loadable into an R data frame for use with DEseq) n.b.: it's tailored to my file/directory structure... but the gist of it should be easily generalizable. Code:
#!/bin/bash for f in $( ls *.fa ); do if [ -f $f ]; then ACCEPTED_HITS_SIZE=`wc -l $f.tophat_out/accepted_hits.sam | cut -d' ' -f1` # (for each exon in gtf file input) # $10 = gene_id => $1 # $16 = FPKM => $2 # $5-$4 = exon lenght => $3 # => $1;$2;$3 # => $1 $2 $3 # # aggregate sum of $3 (exon length) by $1 (gene_id) # => $1 $2 $3 # # count = FPKM * total exon lenght (Kb) * number of mapped reads (million) # count = $2 * ($3 / 1,000) * ((ACCEPTED_HITS_SIZE-2) / 1,000,000) awk '$3 ~ /^exon$/ { print $10 $16 ($5-$4) }' $f.cufflinks_out/transcripts.gtf | \ awk '{ gsub(/"/, "", $0); print $0; }' - | \ awk -F\; '{ printf("%s\t%s\t%s\n",$1,$2,$3) }' - | \ awk 'BEGIN { gene_id = ""; i = 0 ; len = 0; } { if (gene_id == $1) { i++; len += $3; fpkm = $2; } else if (i != 0) { printf("%s\t%s\t%s\n",gene_id,fpkm,len); gene_id = $1; len = $3; i = 1; fpkm = $2; } else { gene_id = $1; len += $3; i++; fpkm = $2; } } END { if (i != 0) { printf("%s\t%s\t%s\n",gene_id,fpkm,len); } }' - | \ awk 'BEGIN { print "\t'${f%.fa}'" } { gene_id = $1; count = $2*($3/1000)*(('${ACCEPTED_HITS_SIZE}'-2)/1000000); printf("%s\t%d\n",gene_id,count); }' - > $f.counts.table fi done Dado ![]() |
|
![]() |
![]() |
![]() |
#8 | |
Member
Location: Pasadena, CA USA Join Date: Jan 2010
Posts: 52
|
![]() Quote:
![]() ![]() Gotta pick up the pace! ![]() |
|
![]() |
![]() |
![]() |
#9 | |
Member
Location: Pasadena, CA USA Join Date: Jan 2010
Posts: 52
|
![]() Quote:
Code:
chr1 Cufflinks transcript 136186581 136189125 1000 + . gene_id "Myog"; transcript_id "Myog"; FPKM "81.2341305460"; frac "1.000000"; conf_lo "76.954202"; conf_hi "85.514059"; cov "36.025000"; 1474 2532 2545 0.9948919 chr1 Cufflinks exon 136186581 136187103 1000 + . gene_id "Myog"; transcript_id "Myog"; exon_number "1"; FPKM "81.2341305460"; frac "1.000000"; conf_lo "76.954202"; conf_hi "85.514059"; cov "36.025000"; 494 510 523 0.9751434 chr1 Cufflinks exon 136187670 136187751 1000 + . gene_id "Myog"; transcript_id "Myog"; exon_number "2"; FPKM "81.2341305460"; frac "1.000000"; conf_lo "76.954202"; conf_hi "85.514059"; cov "36.025000"; 122 82 82 1.0000000 chr1 Cufflinks exon 136188251 136189125 1000 + . gene_id "Myog"; transcript_id "Myog"; exon_number "3"; FPKM "81.2341305460"; frac "1.000000"; conf_lo "76.954202"; conf_hi "85.514059"; cov "36.025000"; 906 875 875 1.0000000 Cheers, Dado |
|
![]() |
![]() |
![]() |
#10 | |
Member
Location: Pasadena, CA USA Join Date: Jan 2010
Posts: 52
|
![]() Quote:
Please find attached a screenshot of my igv viewer showing Sox17. htseq-count reports a raw count of 46 for Sox17. coverageBed reports a raw count of 46 for Sox17. cufflinks "reports" a raw count of 66 for Sox17. If you count the reads overlapping with exons in the screenshot, they are ~45. If you count all reads within the boundaries of the gene they are ~64. Started to wonder what the FPKM reported by cufflinks (which is exactly the same across transcript and exon features) takes into consideration. Just my two cents, Dado ![]() Last edited by marcora; 05-19-2010 at 10:17 AM. |
|
![]() |
![]() |
![]() |
#11 | |
Senior Member
Location: Boston, MA Join Date: Nov 2008
Posts: 212
|
![]() Quote:
We will not be directly reporting the number of fragments that come from each isoform, because when dealing with fragments that could have come from one of several transcripts in the transcriptome, the fragments need to be "probabilistically" assigned, because we can't say for for sure which transcript each fragment came from. Cufflinks takes this assignment uncertainty into account when performing its tests for differential expression. Testing directly on "estimated" counts throws out much of the "alignment uncertainty" information Cufflinks has computed along with the abundance estimates. When dealing with ambiguously mappable fragments (e.g. all fragments incident on constitutive features of alternatively spliced genes) ignoring this uncertainty during testing, even at the gene level, is hazardous. It's also worth noting that what Cufflinks is doing internally is computing for each transcript a number (in our paper we call this number "rho") that is it's relative abundance in the sample. We could write this down as a very small fraction, but instead we scale it up by a large constant and report FPKM - the expected number of fragments per kilobase of transcript per million fragments mapped. Now the source of the difference between the tools may be in two places: 1) When you use a GTF, Cufflinks only counts fragments that fall on features for the "millions of fragments mapped" in the denominator 2) Cufflinks uses each fragment's MAPQ to downweight multi-mapping fragments, because otherwise these get overcounted and contribute more to the overall map than uniquely mappable fragments. See the SAM format for more details on this. So I think comparing the actual FPKM, counts, etc. between tools is tricky, and comparing Cufflinks to a tool that directly reports counts is even more so. Because FPKM for a transcript is just a proxy for its relative abundance rho, it doesn't actually matter what this constant multiplier is, as long as its the same for all transcripts. We report FPKM simply because many people performing RNA-Seq are used to it. |
|
![]() |
![]() |
![]() |
#12 | |||
Member
Location: Pasadena, CA USA Join Date: Jan 2010
Posts: 52
|
![]() Quote:
thanx for your quick reply. The addition of this feature to cufflinks would definitively improve its versatility and adoption, since most experimental designs include/should include biological replicates. Quote:
Quote:
Thanx a lot for these great tools!!! Dado ![]() |
|||
![]() |
![]() |
![]() |
#13 |
Member
Location: Berkeley, cA Join Date: Feb 2010
Posts: 40
|
![]()
While it is ok to use raw counts to compare gene expression between samples, as Cole explained, to test differential expression of _isoforms_ its necessary to account for uncertainty in the assignment of reads to transcripts. Converting isoform FPKMs to counts and then applying DEseq is a bad idea because the uncertainty is then not incorporated into the DE calculation.
Secondly, to compare expression of genes within one experiment (or to measure relative expression of isoforms within a gene) one has to test the FPKMs and not counts, because counts do not accurately reflect the relative abundances. This is explained in more detail in the Cufflinks paper. |
![]() |
![]() |
![]() |
#14 | |
Member
Location: Pasadena, CA USA Join Date: Jan 2010
Posts: 52
|
![]() Quote:
Thanx all for your helpful feedback. ![]() |
|
![]() |
![]() |
![]() |
#15 |
I like code
Location: San Diego, CA, USA Join Date: Sep 2009
Posts: 438
|
![]()
I see this is mostly resolved but I thought I'd throw out something I was doing to find the approximate number of reads that hit a gene.
In the coverage.wig file produced by Tophat you can find the bp ranges of where reads were aligned to the genome. If you find the bp range of any exon from the genome within the coverage.wig file you do the following to compute the approx number of reads that hit that exon: at each row within the exon's bp range take the second number column, subtract the first and multiply by the third column; sum those products up over the range of the exon; divide by your read length. It's rough but it works. I was also using the coverage.wig file to find approximate read alignment numbers. They end up pretty similar to the count of number of lines in the accepted_hits.sam file. I figure that it doesn't take into account the fact that not all reads are fully aligned but on the scales we are working with here i don't find it makes too much difference in the end results of the numbers I've computed based on these values. |
![]() |
![]() |
![]() |
#16 | |
Senior Member
Location: Rochester, MN Join Date: Mar 2009
Posts: 191
|
![]() Quote:
By the way, I am not a statstician, so I could be completely off base. But I am learning a lot in these discussions and thanks to everyone for participating! |
|
![]() |
![]() |
![]() |
#17 |
Member
Location: Berkeley, cA Join Date: Feb 2010
Posts: 40
|
![]()
You are absolutely correct that biological replicates will help with accurately estimating relative isoform level abundance, and the replicated deconvolutions inform about the variability in isoform expression. With many replicates, one can directly estimate the variability in the MLE that way. But with few replicates, it is still necessary to estimate variability by leveraging variability in other transcripts and in addition it is important to account for the uncertainty in isoform level expression.
|
![]() |
![]() |
![]() |
#18 |
Member
Location: North Carolina Join Date: Jan 2010
Posts: 82
|
![]()
Just curious - any updates on the ETA for biological replicates in cufflinks?
also another question - if i have multiple RNAseq runs and want to predict isoforms - is the best thing to do to combine all the data into one big file and then run cufflinks? THere does not appear to be an option for including multiple sam files? Chris |
![]() |
![]() |
![]() |
#19 |
Junior Member
Location: Philadelphia Join Date: Jan 2009
Posts: 3
|
![]()
Hi Chris,
I am pretty new to RNA-seq data but my first had experience with cufflinks tells me that combining files is not a very good idea.....Cufflinks is memory intense algorithm and when looking to predict new isoforms it can run forever. However if you are using a reference .gtf file for the analysis and only concerned with those which are in the refrence file, it may work. Arpit |
![]() |
![]() |
![]() |
#20 |
Senior Member
Location: Woodbridge CT Join Date: Oct 2008
Posts: 231
|
![]()
Isoform expression is not only a key developmental marker, but also a response mechanism to environmental conditions and changes. Both are very non-static exercises of a given genome.
|
![]() |
![]() |
![]() |
Thread Tools | |
|
|