SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Differential expression from RNA-seq: variation between replicates beans RNA Sequencing 6 11-03-2011 10:45 AM
Differential Expression analysis without replicates polsum Bioinformatics 1 08-05-2011 04:40 AM
help with differential gene expression with cufflinks and tophat waterboy Bioinformatics 1 11-28-2010 10:51 AM
Differential gene expression of gene clusters anjana.vr RNA Sequencing 1 10-28-2010 11:33 AM
Differential gene expression: Can Cufflinks/Cuffcompare handle biological replicates? marcora Bioinformatics 0 05-19-2010 02:11 AM

Reply
 
Thread Tools
Old 05-19-2010, 03:33 AM   #1
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52
Cool Differential gene expression: Can Cufflinks/Cuffcompare handle biological replicates?

I am new to this RNAseq business and I would like to use it to measure differential gene expression between wild-type and mutant mouse brains.

I have designed my experiment to have 4 biological replicates each for case and control.

I really like bowtie/tophat and would like to use cufflinks/cuffdiff to measure diff gene expression, but I can't seem to find a way to make cuffdiff handle biological replicates (which I would hope is the way everybody designs their experiments )... am I missing something? Alternatively, I could use R packages like DESeq, but it's not easy - at least for me - to extract raw counts from cufflinks output (using awk at the moment)... it would be a great additional feature for cufflinks to output unnormalized read counts for each feature alongside FPKM values to avoid extra-steps and make it simpler and more versatile for beginner's use.

I would greatly appreciate any feedback from this community of experts!

Cheers
marcora is offline   Reply With Quote
Old 05-19-2010, 05:09 AM   #2
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

You can use htseq-count to extract raw counts.

Simon
Simon Anders is offline   Reply With Quote
Old 05-19-2010, 06:31 AM   #3
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52
Red face

Quote:
Originally Posted by Simon Anders View Post
You can use htseq-count to extract raw counts.
Dear Simon,

indeed I have very much enjoyed using HTSeq tools and I am looking forward to use DESeq. However, I would love to see an integrated solution such as bowtie>tophat>cufflinks because I would like to perform the analysis using two different set of tools and make sure the results match to a large extent.

In regards to htseq-count this is my oversimplified pipeline at the moment (disclaimer: I have been doing RNAseq analysis for only a few days now):

- run tophat using the RefSeq reference in GFF format without novel junctions discovery which outputs 'accepted_hits.sam'

- run cufflinks on 'accepted_hits.sam' using the RefSeq reference in GTF format which outputs 'transcripts.gtf'

- run htseq-count as follows:

Code:
 htseq-count accepted_hits.sam transcripts.gtf > counts.table
transcripts.gtf looks like this:

Code:
chr1	Cufflinks	transcript	136186581	136189125	1000	+	.	gene_id "Myog"; transcript_id "Myog"; FPKM "81.2341305460"; frac "1.000000"; conf_lo "76.954202"; conf_hi "85.514059"; cov "36.025000";
chr1	Cufflinks	exon	136186581	136187103	1000	+	.	gene_id "Myog"; transcript_id "Myog"; exon_number "1"; FPKM "81.2341305460"; frac "1.000000"; conf_lo "76.954202"; conf_hi "85.514059"; cov "36.025000";
chr1	Cufflinks	exon	136187670	136187751	1000	+	.	gene_id "Myog"; transcript_id "Myog"; exon_number "2"; FPKM "81.2341305460"; frac "1.000000"; conf_lo "76.954202"; conf_hi "85.514059"; cov "36.025000";
chr1	Cufflinks	exon	136188251	136189125	1000	+	.	gene_id "Myog"; transcript_id "Myog"; exon_number "3"; FPKM "81.2341305460"; frac "1.000000"; conf_lo "76.954202"; conf_hi "85.514059"; cov "36.025000";
whereas the tail of htseq-count output looks like this:

Code:
Zxdc	88
Zyg11a	0
Zyg11b	760
Zyx	97
Zzef1	150
Zzz3	163
l7Rn6	181
rp9	123
no_feature	10221124
ambiguous	15234
Not sure why there are so many 'no_feature' hits. Moreover, htseq-count reports a raw count of 741 for Myog (the gene shown above in the GTF bit). The sum of all genes raw counts is 4,281,575.

The size of accepted_hits.sam is 17.283 million lines.

Given that the combination of all exons of Myog is 1.477 Kb and the reported FPKM (RPKM in this case because I am using reads that are NOT paired end) is 81.234 the Myog raw count from cufflinks output should be RPKM * length (Kb) * total no of reads (million) = 81.234 * 1.477 * 17.283 = 2,073

Not sure what I am doing wrong but both cufflinks and htseq-count outputs are, at this early stage of my understanding, very confusing in the fact that they don't match.

Any help would be greatly appreciated

Edoardo "Dado" Marcora

p.s.: I am very soon going to install IGV on my machine and manually count the reads mapped to the Myog gene locus

UPDATE: by approximate visual inspection, it looks like the Myog raw count estimated using cufflinks output is the correct one... htseq-count is lower probably because of all those "no_feature" hits

Last edited by marcora; 05-19-2010 at 07:03 AM.
marcora is offline   Reply With Quote
Old 05-19-2010, 07:10 AM   #4
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Unless you have stranded RNA-Seq data, it is important to use the option "--stranded=no". Otherwise, htseq-count discards everything which is on the other strand.

Could this be the reason? (Maybe, I should mark this fact in bold on the web page.)

Simon
Simon Anders is offline   Reply With Quote
Old 05-19-2010, 08:33 AM   #5
sunnyvu
Member
 
Location: USA TN

Join Date: Mar 2010
Posts: 17
Default

Currently I am using the HTSeq to get count number. After adding -s, the result is much better.

Command:
python -m HTSeq.scripts.count accepted_hits.sam transcripts.gtf >s1.count -s no

Thanks both.
sunnyvu is offline   Reply With Quote
Old 05-19-2010, 08:45 AM   #6
mgogol
Senior Member
 
Location: Kansas City

Join Date: Mar 2008
Posts: 197
Default

You can also count reads using bedtools on a bam file, like this.

coverageBed -abam ../data/accepted_hits.sorted.bam -b ../data/Mus_musculus.NCBIM37.52.bed > counts_bam.txt

I think it's really smart to check things out. There's a lot that can go wrong with all these complicated programs, sequences, etc.
mgogol is offline   Reply With Quote
Old 05-19-2010, 09:34 AM   #7
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52
Wink

Quote:
Originally Posted by Simon Anders View Post
Unless you have stranded RNA-Seq data, it is important to use the option "--stranded=no". Otherwise, htseq-count discards everything which is on the other strand.

Could this be the reason? (Maybe, I should mark this fact in bold on the web page.)
All right, thanx... I missed that! Now the count for Myog is 1,445 (vs 2073 as I calculated it from cufflinks output - see prev post). The tail of htseq-count output is:

Code:
Zxdc	170
Zyg11a	5
Zyg11b	1554
Zyx	209
Zzef1	297
Zzz3	375
l7Rn6	365
rp9	229
no_feature	5949010
ambiguous	63483
Still quite a few "no_feature" hits, but overall much more realistic... it would be great to have a discussion between Simon and Cole about the different methodologies used to allocate reads to features by htseq-count and cufflinks that might explain the difference in output.

By the way, here is the script that I am using to generate raw counts for genes from 'accepted_hits.sam' (tophat output) and 'transcripts.gtf' (cufflinks output); perhaps somebody may find it useful (the output should be directly loadable into an R data frame for use with DEseq)

n.b.: it's tailored to my file/directory structure... but the gist of it should be easily generalizable.

Code:
#!/bin/bash
for f in $( ls *.fa ); do
    if [ -f $f ]; then
        ACCEPTED_HITS_SIZE=`wc -l $f.tophat_out/accepted_hits.sam | cut -d' ' -f1`
        # (for each exon in gtf file input)
        # $10 = gene_id => $1
        # $16 = FPKM => $2
        # $5-$4 = exon lenght => $3
        # => $1;$2;$3
        # => $1 $2 $3
        #
        # aggregate sum of $3 (exon length) by $1 (gene_id)
        # => $1 $2 $3
        #
        # count = FPKM * total exon lenght (Kb) * number of mapped reads (million)
        # count = $2 * ($3 / 1,000) * ((ACCEPTED_HITS_SIZE-2) / 1,000,000)
        awk '$3 ~ /^exon$/ { print $10 $16 ($5-$4) }' $f.cufflinks_out/transcripts.gtf | \
        awk '{ gsub(/"/, "", $0); print $0; }' - | \
        awk -F\; '{ printf("%s\t%s\t%s\n",$1,$2,$3) }' - | \
        awk 'BEGIN { gene_id = ""; i = 0 ; len = 0; } { if (gene_id == $1) { i++; len += $3; fpkm = $2; } else if (i != 0) { printf("%s\t%s\t%s\n",gene_id,fpkm,len); gene_id = $1; len = $3; i = 1; fpkm = $2; } else { gene_id = $1; len += $3; i++; fpkm = $2; } } END { if (i != 0) { printf("%s\t%s\t%s\n",gene_id,fpkm,len); } }' - | \
        awk 'BEGIN { print "\t'${f%.fa}'" } { gene_id = $1; count = $2*($3/1000)*(('${ACCEPTED_HITS_SIZE}'-2)/1000000); printf("%s\t%d\n",gene_id,count); }' - > $f.counts.table
    fi
done
But pleeeeeez, add raw counts to cufflinks output!!!

Dado
marcora is offline   Reply With Quote
Old 05-19-2010, 09:39 AM   #8
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52
Talking

Quote:
Originally Posted by marcora View Post
By the way, here is the script that I am using to generate raw counts for genes from 'accepted_hits.sam' (tophat output) and 'transcripts.gtf' (cufflinks output); perhaps somebody may find it useful (the output should be directly loadable into an R data frame for use with DEseq)

n.b.: it's tailored to my file/directory structure... but the gist of it should be easily generalizable.

Code:
#!/bin/bash
for f in $( ls *.fa ); do
    if [ -f $f ]; then
        ACCEPTED_HITS_SIZE=`wc -l $f.tophat_out/accepted_hits.sam | cut -d' ' -f1`
        # (for each exon in gtf file input)
        # $10 = gene_id => $1
        # $16 = FPKM => $2
        # $5-$4 = exon lenght => $3
        # => $1;$2;$3
        # => $1 $2 $3
        #
        # aggregate sum of $3 (exon length) by $1 (gene_id)
        # => $1 $2 $3
        #
        # count = FPKM * total exon lenght (Kb) * number of mapped reads (million)
        # count = $2 * ($3 / 1,000) * ((ACCEPTED_HITS_SIZE-2) / 1,000,000)
        awk '$3 ~ /^exon$/ { print $10 $16 ($5-$4) }' $f.cufflinks_out/transcripts.gtf | \
        awk '{ gsub(/"/, "", $0); print $0; }' - | \
        awk -F\; '{ printf("%s\t%s\t%s\n",$1,$2,$3) }' - | \
        awk 'BEGIN { gene_id = ""; i = 0 ; len = 0; } { if (gene_id == $1) { i++; len += $3; fpkm = $2; } else if (i != 0) { printf("%s\t%s\t%s\n",gene_id,fpkm,len); gene_id = $1; len = $3; i = 1; fpkm = $2; } else { gene_id = $1; len += $3; i++; fpkm = $2; } } END { if (i != 0) { printf("%s\t%s\t%s\n",gene_id,fpkm,len); } }' - | \
        awk 'BEGIN { print "\t'${f%.fa}'" } { gene_id = $1; count = $2*($3/1000)*(('${ACCEPTED_HITS_SIZE}'-2)/1000000); printf("%s\t%d\n",gene_id,count); }' - > $f.counts.table
    fi
done
Wow , I've just come to realize that that was my first public contribution to the field of hardcore bioinformatics... 4 lines of code!!!

Gotta pick up the pace!
marcora is offline   Reply With Quote
Old 05-19-2010, 09:51 AM   #9
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52
Default

Quote:
Originally Posted by mgogol View Post
You can also count reads using bedtools on a bam file, like this.

coverageBed -abam ../data/accepted_hits.sorted.bam -b ../data/Mus_musculus.NCBIM37.52.bed > counts_bam.txt
That was the other option I tried out... the output for Myog is the following (only the last 4 columns are added by coverageBed: 1] raw count/depth 2] # bases at depth 3] size of transcript/exon 4] % of transcript/exon at depth):

Code:
chr1	Cufflinks	transcript	136186581	136189125	1000	+	.	gene_id "Myog"; transcript_id "Myog"; FPKM "81.2341305460"; frac "1.000000"; conf_lo "76.954202"; conf_hi "85.514059"; cov "36.025000";	1474	2532	2545	0.9948919
chr1	Cufflinks	exon	136186581	136187103	1000	+	.	gene_id "Myog"; transcript_id "Myog"; exon_number "1"; FPKM "81.2341305460"; frac "1.000000"; conf_lo "76.954202"; conf_hi "85.514059"; cov "36.025000";	494	510	523	0.9751434
chr1	Cufflinks	exon	136187670	136187751	1000	+	.	gene_id "Myog"; transcript_id "Myog"; exon_number "2"; FPKM "81.2341305460"; frac "1.000000"; conf_lo "76.954202"; conf_hi "85.514059"; cov "36.025000";	122	82	82	1.0000000
chr1	Cufflinks	exon	136188251	136189125	1000	+	.	gene_id "Myog"; transcript_id "Myog"; exon_number "3"; FPKM "81.2341305460"; frac "1.000000"; conf_lo "76.954202"; conf_hi "85.514059"; cov "36.025000";	906	875	875	1.0000000
Which is more consistent with htseq-count output (raw count of 1,522 vs 1,445 respectively). Gotta reopen IGV and count the reads one by one to decide whether the raw count of 2,073 derived from cufflinks output is correct or an overestimation.

Cheers,

Dado
marcora is offline   Reply With Quote
Old 05-19-2010, 10:15 AM   #10
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52
Default

Quote:
Originally Posted by marcora View Post
Which is more consistent with htseq-count output (raw count of 1,522 vs 1,445 respectively). Gotta reopen IGV and count the reads one by one to decide whether the raw count of 2,073 derived from cufflinks output is correct or an overestimation.
Okay, here we go.

Please find attached a screenshot of my igv viewer showing Sox17.

htseq-count reports a raw count of 46 for Sox17.

coverageBed reports a raw count of 46 for Sox17.

cufflinks "reports" a raw count of 66 for Sox17.

If you count the reads overlapping with exons in the screenshot, they are ~45. If you count all reads within the boundaries of the gene they are ~64.

Started to wonder what the FPKM reported by cufflinks (which is exactly the same across transcript and exon features) takes into consideration.

Just my two cents,

Dado
Attached Images
File Type: jpg screenshot_sox17.jpg (20.8 KB, 128 views)

Last edited by marcora; 05-19-2010 at 10:17 AM.
marcora is offline   Reply With Quote
Old 05-20-2010, 06:01 PM   #11
Cole Trapnell
Senior Member
 
Location: Boston, MA

Join Date: Nov 2008
Posts: 212
Default

Quote:
Originally Posted by marcora View Post
I am new to this RNAseq business and I would like to use it to measure differential gene expression between wild-type and mutant mouse brains.

I have designed my experiment to have 4 biological replicates each for case and control.

I really like bowtie/tophat and would like to use cufflinks/cuffdiff to measure diff gene expression, but I can't seem to find a way to make cuffdiff handle biological replicates (which I would hope is the way everybody designs their experiments )... am I missing something? Alternatively, I could use R packages like DESeq, but it's not easy - at least for me - to extract raw counts from cufflinks output (using awk at the moment)... it would be a great additional feature for cufflinks to output unnormalized read counts for each feature alongside FPKM values to avoid extra-steps and make it simpler and more versatile for beginner's use.

I would greatly appreciate any feedback from this community of experts!

Cheers
We will be directly supporting biological replicates within the next few weeks in both cuffdiff and cufflinks itself. We've recently worked out the math for how to handle them well in our model and improve the robustness of our statistical testing. I need a few weeks to implement the enhancements and do the testing, etc.

We will not be directly reporting the number of fragments that come from each isoform, because when dealing with fragments that could have come from one of several transcripts in the transcriptome, the fragments need to be "probabilistically" assigned, because we can't say for for sure which transcript each fragment came from. Cufflinks takes this assignment uncertainty into account when performing its tests for differential expression. Testing directly on "estimated" counts throws out much of the "alignment uncertainty" information Cufflinks has computed along with the abundance estimates. When dealing with ambiguously mappable fragments (e.g. all fragments incident on constitutive features of alternatively spliced genes) ignoring this uncertainty during testing, even at the gene level, is hazardous.

It's also worth noting that what Cufflinks is doing internally is computing for each transcript a number (in our paper we call this number "rho") that is it's relative abundance in the sample. We could write this down as a very small fraction, but instead we scale it up by a large constant and report FPKM - the expected number of fragments per kilobase of transcript per million fragments mapped. Now the source of the difference between the tools may be in two places:

1) When you use a GTF, Cufflinks only counts fragments that fall on features for the "millions of fragments mapped" in the denominator
2) Cufflinks uses each fragment's MAPQ to downweight multi-mapping fragments, because otherwise these get overcounted and contribute more to the overall map than uniquely mappable fragments. See the SAM format for more details on this.

So I think comparing the actual FPKM, counts, etc. between tools is tricky, and comparing Cufflinks to a tool that directly reports counts is even more so.

Because FPKM for a transcript is just a proxy for its relative abundance rho, it doesn't actually matter what this constant multiplier is, as long as its the same for all transcripts. We report FPKM simply because many people performing RNA-Seq are used to it.
Cole Trapnell is offline   Reply With Quote
Old 05-21-2010, 05:22 AM   #12
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52
Default

Quote:
Originally Posted by Cole Trapnell View Post
We will be directly supporting biological replicates within the next few weeks in both cuffdiff and cufflinks itself. We've recently worked out the math for how to handle them well in our model and improve the robustness of our statistical testing. I need a few weeks to implement the enhancements and do the testing, etc.
Dear Cole,

thanx for your quick reply. The addition of this feature to cufflinks would definitively improve its versatility and adoption, since most experimental designs include/should include biological replicates.

Quote:
We will not be directly reporting the number of fragments that come from each isoform, because when dealing with fragments that could have come from one of several transcripts in the transcriptome, the fragments need to be "probabilistically" assigned, because we can't say for for sure which transcript each fragment came from. Cufflinks takes this assignment uncertainty into account when performing its tests for differential expression. Testing directly on "estimated" counts throws out much of the "alignment uncertainty" information Cufflinks has computed along with the abundance estimates. When dealing with ambiguously mappable fragments (e.g. all fragments incident on constitutive features of alternatively spliced genes) ignoring this uncertainty during testing, even at the gene level, is hazardous.
I don't quite get this. If cufflinks outputs FPKMs then it should be able to output raw counts... at least for unambiguous assignments, like those to exons and genes (not alternatively spliced transcripts).

Quote:
So I think comparing the actual FPKM, counts, etc. between tools is tricky, and comparing Cufflinks to a tool that directly reports counts is even more so.
If I've got this right, because of splice and multireads assignment, what cufflinks reports as FPKM cannot be directly translated into the raw counts of the kind that htseq-count and coverageBed output. Do you think that translating this FPKM value into a raw count value and then using tools like DESeq is valid or not, in order to complement or independently verify cuffdiff output once the 'biological replicates" feature is added?

Thanx a lot for these great tools!!!

Dado
marcora is offline   Reply With Quote
Old 05-21-2010, 11:28 AM   #13
lpachter
Member
 
Location: Berkeley, cA

Join Date: Feb 2010
Posts: 40
Default

While it is ok to use raw counts to compare gene expression between samples, as Cole explained, to test differential expression of _isoforms_ its necessary to account for uncertainty in the assignment of reads to transcripts. Converting isoform FPKMs to counts and then applying DEseq is a bad idea because the uncertainty is then not incorporated into the DE calculation.

Secondly, to compare expression of genes within one experiment (or to measure relative expression of isoforms within a gene) one has to test the FPKMs and not counts, because counts do not accurately reflect the relative abundances. This is explained in more detail in the Cufflinks paper.
lpachter is offline   Reply With Quote
Old 05-21-2010, 02:32 PM   #14
marcora
Member
 
Location: Pasadena, CA USA

Join Date: Jan 2010
Posts: 52
Default

Quote:
Originally Posted by lpachter View Post
While it is ok to use raw counts to compare gene expression between samples, as Cole explained, to test differential expression of _isoforms_ its necessary to account for uncertainty in the assignment of reads to transcripts. Converting isoform FPKMs to counts and then applying DEseq is a bad idea because the uncertainty is then not incorporated into the DE calculation.

Secondly, to compare expression of genes within one experiment (or to measure relative expression of isoforms within a gene) one has to test the FPKMs and not counts, because counts do not accurately reflect the relative abundances. This is explained in more detail in the Cufflinks paper.
I most definitively understand the rationale behind FPKMs and why they're advantageous in certain contexts over raw counts. But the contrary is also true. I was just thinking about providing the end user with both for added flexibility. Not sure if it is possible, but it would also be great to see which reads get assigned to which isoform.

Thanx all for your helpful feedback.
marcora is offline   Reply With Quote
Old 05-21-2010, 06:16 PM   #15
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

I see this is mostly resolved but I thought I'd throw out something I was doing to find the approximate number of reads that hit a gene.

In the coverage.wig file produced by Tophat you can find the bp ranges of where reads were aligned to the genome. If you find the bp range of any exon from the genome within the coverage.wig file you do the following to compute the approx number of reads that hit that exon: at each row within the exon's bp range take the second number column, subtract the first and multiply by the third column; sum those products up over the range of the exon; divide by your read length. It's rough but it works. I was also using the coverage.wig file to find approximate read alignment numbers. They end up pretty similar to the count of number of lines in the accepted_hits.sam file. I figure that it doesn't take into account the fact that not all reads are fully aligned but on the scales we are working with here i don't find it makes too much difference in the end results of the numbers I've computed based on these values.
sdriscoll is offline   Reply With Quote
Old 05-21-2010, 07:22 PM   #16
RockChalkJayhawk
Senior Member
 
Location: Rochester, MN

Join Date: Mar 2009
Posts: 191
Default

Quote:
Originally Posted by lpachter View Post
While it is ok to use raw counts to compare gene expression between samples, as Cole explained, to test differential expression of _isoforms_ its necessary to account for uncertainty in the assignment of reads to transcripts. Converting isoform FPKMs to counts and then applying DEseq is a bad idea because the uncertainty is then not incorporated into the DE calculation.
But wouldn't the uncertainty in the isoform abundances be overcome by the addition of biological replicates. For instance, gene X has 2 isoforms, A and B. A is assigned 20% of the gene expression, while 80% is assigned to B. If the uncertainty of the abundance of A is +- 10%, it would be dificult to assess the isoform abundance with confidence using this sample as you suggest. However, if two additional replicates call the abundance of A 15% and 23%, then the uncertainty of measurement should decrease if the precision of these combined measurements is that high. Is this correct? I recognize that the uncertainty in isoform abundance is a concern, but at what point will the accuracy and precision of the measurement only be a statistical excercise? In other words, would we benefit from knowing an isoform changed from 20% to 30%? Wouldn't the biological replicates inherentlly provide more power?

By the way, I am not a statstician, so I could be completely off base. But I am learning a lot in these discussions and thanks to everyone for participating!
RockChalkJayhawk is offline   Reply With Quote
Old 05-22-2010, 08:41 AM   #17
lpachter
Member
 
Location: Berkeley, cA

Join Date: Feb 2010
Posts: 40
Default

You are absolutely correct that biological replicates will help with accurately estimating relative isoform level abundance, and the replicated deconvolutions inform about the variability in isoform expression. With many replicates, one can directly estimate the variability in the MLE that way. But with few replicates, it is still necessary to estimate variability by leveraging variability in other transcripts and in addition it is important to account for the uncertainty in isoform level expression.
lpachter is offline   Reply With Quote
Old 07-26-2010, 07:29 AM   #18
chrisbala
Member
 
Location: North Carolina

Join Date: Jan 2010
Posts: 82
Default

Just curious - any updates on the ETA for biological replicates in cufflinks?

also another question - if i have multiple RNAseq runs and want to predict isoforms - is the best thing to do to combine all the data into one big file and then run cufflinks? THere does not appear to be an option for including multiple sam files?

Chris
chrisbala is offline   Reply With Quote
Old 07-26-2010, 12:29 PM   #19
Agent47
Junior Member
 
Location: Philadelphia

Join Date: Jan 2009
Posts: 3
Default

Hi Chris,
I am pretty new to RNA-seq data but my first had experience with cufflinks tells me that combining files is not a very good idea.....Cufflinks is memory intense algorithm and when looking to predict new isoforms it can run forever. However if you are using a reference .gtf file for the analysis and only concerned with those which are in the refrence file, it may work.

Arpit
Agent47 is offline   Reply With Quote
Old 07-27-2010, 08:11 AM   #20
Joann
Senior Member
 
Location: Woodbridge CT

Join Date: Oct 2008
Posts: 231
Default General wet lab comment:

Isoform expression is not only a key developmental marker, but also a response mechanism to environmental conditions and changes. Both are very non-static exercises of a given genome.
Joann is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 11:11 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO