Seqanswers Leaderboard Ad

**sdriscoll** · 11-19-2012, 01:15 PM

No problem. I haven't figured out if these are the best settings for me yet but I can at least explain why I'm changing them from their default values. I'm setting all of the following options:

Code:

--max-bundle-frags 999999999
--no-effective-length-correction
--min-isoform-fraction 0
--pre-mrna-fraction 0.05
--junc-alpha 0.05
--max-bundle-length 5500000
-b <genome.fa>
-G <my_annotation.gtf>

I've set --max-bundle-frags and --max-bundle-length to these values in order to force it to evaluate all of the bundles it finds in the mouse data that I usually work with. So you would want to tweak those depending on your depth of sequencing and the genome you're working with. Just watch for "warning skipping ...." in the cufflinks output it generates while running.

Per Cole in this same thread I've disabled effective length correction (--no-effective-length-correction) because I don't like that this correction extrapolates the raw data. It also greatly exaggerates the expression estimation of single exon genes that are in the gene annotations. Basically disabling this feature makes it so the counts/expression seem to match up with the raw coverage a little better.

Bias correction (-b) makes a big difference in the quality of the quantifications. It takes longer to run but it seems to really help cufflinks produce logical output.

--min-isoform-fraction and --pre-mrna-fraction are both settings that allow cufflinks to discard information based on some arbirary thresholding. --min-isoform-fraction filters out low-count isoforms and --pre-mrna-fraction filters out intronic alignment data. I'm not sure what a difference --pre-mrna-fraction makes but when I was picking out these options I was looking for any options that made any filtering less strict. My and the researchers I work with would rather do the filtering afterwards. It's more useful for us to see as much of the raw data as possible.

--junc-alpha is one that I haven't tested with different values but I have set it to be less strict. The default is 0.001. I plan to mess around with that one a little more to see what sort of impact it has. I'm not sure if it even has any impact at all on quantification (maybe it only applies to assembly).

I believe --min-isoform-fraction has an impact on what is reported when you use cufflinks to do assembly. For example if you aren't prepared to trust cufflinks' estimations of expression quantification but you want to use it to build the most robust assembly possible then by setting this value to 0 you ensure that cufflinks doesn't throw out assembled isoforms it thinks are low expressed. I assume --junc-alpha will have some kind of impact as well but I haven't tested it out much.

I used the above settings to quantify some real data against the mm9 known gene annotation (from UCSC) and compared those quantifications to those I got using eXpress. As I posted before - it seemed like cufflinks was doing a much better job of making specific decisions about which isoforms really needed to be expressed to fully explain the alignments. For example if all of the coverage and junction information in a locus can be explained by a single isoform why should these programs report that multiple isoforms are expressed? If they do then I think that decreases the sensitivity of differential splicing analysis. Maybe in another sample there's new junction and coverage information in that locus that DOES justify expression of a second isoform. While eXpress would have givin you expression of both isoforms in both cases in my tests cufflinks would more likely report that second isoform as something that was activated in the second sample. That translates to me that there was sufficient splicing or coverage evidence of that new isoform and not just that some proportion of reads are being assigned to it in both cases because they share exons.

There are some very subtle differences in isoforms in some of the loci in the mouse genome. For example I've seen 2-isoform genes where the isoforms differ only by a single amino acid somewhere in the middle of the isoform making an exon shared between the two be 3bp longer than the corresponding exon in the other isoform. Cufflinks picks up on that because of the spliced alignment data - if all of the junctions are anchoring into the exon with the extra amino it can use that information to help it make expression assignment. In those cases eXpress assigns nearly equal expression to both isoforms even though the genome alignment evidence points heavily towards one verses the other.

**adarob** · 12-10-2012, 10:46 AM

sdriscoll,

Note that in the case of a provided annotation, eXpress and Cufflinks use the same information to determine the transcript abundances. The spliced reads are no longer spliced in the case of transcriptome mappings, but they are no less informative.

What you are seeing is due to the fact that eXpress uses the online EM algorithm to optimize the likelihood function, which will only approximate the MLE in the case of low coverage. This can be good or bad. eXpress is less likely to make strong calls with insufficient data (for example, a single read), but it will also rarely have enough data to call an isoform completely unexpressed. Therefore, your analysis will need to reflect an understanding of this.

However, you can run multiple rounds of eXpress using the batch update (with the -B option), which will improve its accuracy in the cases you mention.

-Adam

**sdriscoll** · 12-10-2012, 01:36 PM

I noticed that Cufflinks and Express produced very similar results for higher expressed loci. It was indeed the lower expresed loci where I found the most differences between the two.

**dvanic** · 01-28-2013, 08:26 PM

sdriscoll, have you tried doing the same analysis with the -g option (i.e cufflinks using reference transcripts as well as identifying novel ones) vs the -G one (which is only quantifying known isoforms)? How much is the "accuracy" of expression estimates reduced, if at all?

**anthurium** · 04-30-2013, 07:24 AM

Originally posted by drdna View Post

Cole,
Thanks very much for your comprehensive explanation of how FPKM calculations and locus intervals are defined - it does a lot to demystify the inner workings of the programs. However, most of the transcripts I have looked at are considerably longer than the average fragment length and, so, should not be overly prone to the edge effects that you described. Moreover, the control dataset that I used for one analysis was generated by sampling fragments evenly across the transcript and, hence, had no bias in read distribution. In general, with the exception of very short RNAs which, obviously, will be excluded from size-selected libraries, there should be no selection-associated bias against transcript "edges." The only bias should come from partially degraded RNA preps, or incomplete reverse transcription. I was wondering if you have examined read distribution along a set of known transcripts to quantify the supposed "edge" effect? In addition, I draw you attention to the fact that the manner of shearing has a major effect on sequence read distribution. Specifically, mechanical shearing results in a massive overrepresentation of end fragments (Schwartz and Farman, 2010; BMC Genomics 11:87). For reverse transcripts, this would produce a superabundance of 3' end fragments.

In your explanation of transcript loci, you stated that cufflinks processes all the transcripts in a "locus" together, as part of a single isoform deconvolution. However the loci that I am referring to span several different genes. Do I understand, then, that cufflinks processes multiple genes "as part of a single isoform deconvolution"?

A more concerning issue, however, is that cufflinks and its associated programs appear to be processing "loci" that have no reads mapping within them and assigning FPKM values to those loci.

For example, I have a gene_exp.diff output line that reads:

XLOC_000005 XLOC_000005 - chr1:7945468-7967929 EG_LVNT EG_LV1 OK 420.778 415.167 -0.0193665 0.0561998 0.955183 0.9758 no

This suggests that there are 420+ fragments per kb per million mapped reads, yet when I look at the chr1:7945468-7967929 region in the bam file, there are no fragments mapping in this region at all. Furthermore, there are only 8 reads mapping within 20 kb of the boundaries of this region.

Hi all,
I would like to share with you a little observation I made about genomic loci in Cufflinks.
From Cufflinks Manual I read the definition of transcription loci: “a locus that contains a set of transcripts all of whose genomic locations do not overlap the genomic location of any transcript in any other locus” and I understood that the transcription loci are not biological and they have a computational purpose.
I’m analyzing RNA seq data from Arabidopsis 72bp PE Illumina
When I use the options in Cufflinks:
-g/--GTF-guide <reference_annotation.(gtf/gff)>
-b/--frag-bias-correct <genome.fa>
-u/--multi-read-correct
I found that transcription loci for certain genes are different from the analysis run with the default parameters.
Comparing results of my analysis obtained with default settings vs gbu options I noticed that along with the loci the significance (q-value) is also dramatically changed.
In particular I show this example:
Default settings: gene A, B, C, D, E are in different transcription loci
geneA is significant qvalue 10-6
gene B, C, D, E are not significant

gbu option: gene A,B,C, D, E are in the same transcription locus
gene A, D, E not significant
gene B, C significant qvalue 10-27

Can someone please comment this pattern?
thanks

**pengchy** · 05-09-2013, 11:24 PM

Hi anthurium,

I am a bit confused about your question.
1. What is the relation between gene and transcript loci?
In the output of cufflinks file transcripts.gtf, more than one transcript_id can come from one gene_id, but not vice versa.

2. where did the significant qvalue come from if you only run cufflinks?

**anthurium** · 05-09-2013, 11:53 PM

Hi pengchy,

I forgot to say that after cufflinks also cuffdiff has been run, and then I compared cuffdiff output files.

when I speak about transcription loci and q-value I refer to "isoform_exp.diff file" produced by cuffdiff when testing for Differential Expression.

in this file I find that the same genomic region (in the column "locus") is often assigned to multiple genes.

My observation is that using different options in cufflinks the boundaries of transcription loci can change. I observed also that the estimated significance of DE (q value) can change when the dimension of the transcription locus changes.

**pengchy** · 05-10-2013, 02:48 AM

Hi all,

For the problem of inconsistent between many reads coverage and zero fpkm, I think we should check the scope of this inconsistent. @lshen had done this. The scatter plot of the rounded fpkm and count-based rpkm showed r= 0.99 using transformation log(x)+1. Indeed, there are many points with very high rpkm but with very low or zero fpkm, while it not so vice versa. It seems fpkm always smaller than rpkm.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 24 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 25 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 21 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News