No problem. I haven't figured out if these are the best settings for me yet but I can at least explain why I'm changing them from their default values. I'm setting all of the following options:
I've set --max-bundle-frags and --max-bundle-length to these values in order to force cufflinks to evaluate all of the bundles it finds in the mouse data that I usually work with. You would want to tweak those depending on your depth of sequencing and the genome you're working with. Just watch for "warning skipping ...." in the output cufflinks generates while running.
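One way to watch for those warnings is to save the cufflinks output to a file and scan it afterwards. This is only a minimal sketch: it assumes the output was redirected to a log, the function name and the sample lines are made up for illustration, and the match is deliberately loose because the exact wording of the warning is not given here.

```python
import re

# Loose, case-insensitive match: any line mentioning "warning" followed by "skipping".
# The exact message text is an assumption; adjust the pattern to what your log shows.
SKIP_PATTERN = re.compile(r"warning.*skipping", re.IGNORECASE)

def find_skipped_lines(log_lines):
    """Return the log lines that look like skipped-bundle warnings."""
    return [line.rstrip("\n") for line in log_lines if SKIP_PATTERN.search(line)]

# Placeholder log lines, only to show the behaviour (not real cufflinks output).
example_log = [
    "processing a locus (placeholder text)\n",
    "warning skipping (placeholder text)\n",
    "processing another locus (placeholder text)\n",
]
skipped = find_skipped_lines(example_log)
print(len(skipped), "possible skipped bundle(s)")
```

If anything turns up, that would be the cue to raise --max-bundle-frags or --max-bundle-length and rerun.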
Per Cole in this same thread I've disabled effective length correction (--no-effective-length-correction) because I don't like that this correction extrapolates the raw data. It also greatly exaggerates the expression estimation of single exon genes that are in the gene annotations. Basically disabling this feature makes it so the counts/expression seem to match up with the raw coverage a little better.
Bias correction (-b) makes a big difference in the quality of the quantifications. It takes longer to run but it seems to really help cufflinks produce logical output.
--min-isoform-fraction and --pre-mrna-fraction are both settings that allow cufflinks to discard information based on some arbitrary thresholding. --min-isoform-fraction filters out low-count isoforms and --pre-mrna-fraction filters out intronic alignment data. I'm not sure how much of a difference --pre-mrna-fraction makes, but when I was picking out these options I was looking for anything that made the filtering less strict. The researchers I work with and I would rather do the filtering afterwards. It's more useful for us to see as much of the raw data as possible.
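As a rough sketch of what "filtering afterwards" could look like, the snippet below drops isoforms whose abundance falls below a chosen fraction of the most abundant isoform of the same gene. That interpretation of the threshold is an assumption and should be checked against the cufflinks manual; the function name, the tuple layout and the numbers are all made up for illustration and are not cufflinks' own code.

```python
from collections import defaultdict

def filter_minor_isoforms(isoforms, min_fraction):
    """Keep isoforms whose FPKM is at least min_fraction of the top isoform of their gene.

    isoforms: list of (gene_id, isoform_id, fpkm) tuples.
    """
    # Find the highest FPKM observed for each gene.
    top_fpkm = defaultdict(float)
    for gene_id, _, fpkm in isoforms:
        top_fpkm[gene_id] = max(top_fpkm[gene_id], fpkm)

    kept = []
    for gene_id, isoform_id, fpkm in isoforms:
        top = top_fpkm[gene_id]
        # Genes with no expression at all are kept as-is; there is nothing to compare against.
        if top == 0 or fpkm >= min_fraction * top:
            kept.append((gene_id, isoform_id, fpkm))
    return kept

# Made-up values for illustration only.
example = [
    ("geneA", "isoA1", 50.0),
    ("geneA", "isoA2", 2.0),
    ("geneB", "isoB1", 8.0),
]
filtered = filter_minor_isoforms(example, 0.1)
print(filtered)
```

Running cufflinks with --min-isoform-fraction 0 and filtering like this later means the threshold can be changed without rerunning the quantification.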
--junc-alpha is one that I haven't tested with different values but I have set it to be less strict. The default is 0.001. I plan to mess around with that one a little more to see what sort of impact it has. I'm not sure if it even has any impact at all on quantification (maybe it only applies to assembly).
I believe --min-isoform-fraction has an impact on what is reported when you use cufflinks to do assembly. For example, if you aren't prepared to trust cufflinks' expression estimates but you want to use it to build the most robust assembly possible, then setting this value to 0 ensures that cufflinks doesn't throw out assembled isoforms it thinks are lowly expressed. I assume --junc-alpha will have some kind of impact as well but I haven't tested it out much.
I used the above settings to quantify some real data against the mm9 known gene annotation (from UCSC) and compared those quantifications to the ones I got using eXpress. As I posted before, it seemed like cufflinks was doing a much better job of making specific decisions about which isoforms really needed to be expressed to fully explain the alignments. For example, if all of the coverage and junction information in a locus can be explained by a single isoform, why should these programs report that multiple isoforms are expressed? If they do, then I think that decreases the sensitivity of differential splicing analysis. Maybe in another sample there's new junction and coverage information in that locus that DOES justify expression of a second isoform. While eXpress would have given you expression of both isoforms in both cases, in my tests cufflinks would more likely report that second isoform as something that was activated in the second sample. To me that means there was sufficient splicing or coverage evidence for that new isoform, and not just that some proportion of reads is being assigned to it in both cases because the isoforms share exons.
There are some very subtle differences between isoforms at some loci in the mouse genome. For example, I've seen 2-isoform genes where the isoforms differ only by a single amino acid somewhere in the middle, so that an exon in one isoform is 3bp longer than the corresponding exon in the other. Cufflinks picks up on that because of the spliced alignment data: if all of the junctions are anchoring into the exon with the extra amino acid, it can use that information to help it assign expression. In those cases eXpress assigns nearly equal expression to both isoforms even though the genome alignment evidence points heavily towards one versus the other.
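To make the 3bp case concrete, here is a toy sketch of how spliced reads can separate two isoforms that share almost everything. It is not how cufflinks actually computes abundances; it only counts reads whose junction is unique to one isoform. The coordinates, read counts and names are made up for illustration.

```python
from collections import Counter

def junction_support(read_junctions, isoform_junctions):
    """Count spliced reads whose junction is unique to each isoform.

    read_junctions: list of (donor, acceptor) pairs, one per spliced read.
    isoform_junctions: dict mapping isoform name -> set of (donor, acceptor) pairs.
    """
    counts = Counter(read_junctions)
    support = {}
    for name, junctions in isoform_junctions.items():
        # Junctions shared with any other isoform cannot discriminate, so ignore them.
        others = set()
        for other_name, other_junctions in isoform_junctions.items():
            if other_name != name:
                others |= other_junctions
        unique = junctions - others
        support[name] = sum(counts[j] for j in unique)
    return support

# Made-up coordinates: the exon ends at 1000 in one isoform and at 1003 in the other.
isoforms = {
    "short_exon": {(1000, 2000), (2100, 3000)},
    "long_exon": {(1003, 2000), (2100, 3000)},
}
reads = [(1003, 2000)] * 9 + [(1000, 2000)] * 1 + [(2100, 3000)] * 5
support = junction_support(reads, isoforms)
print(support)
```

In this toy data the shared junction says nothing about which isoform is expressed, while the junctions at the 3bp difference point nine to one towards the longer exon.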
Code:
--max-bundle-frags 999999999 --no-effective-length-correction --min-isoform-fraction 0 --pre-mrna-fraction 0.05 --junc-alpha 0.05 --max-bundle-length 5500000 -b <genome.fa> -G <my_annotation.gtf>
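If you launch cufflinks from a script, the options above can be kept in one place as an argument list. A minimal sketch: the option values are the ones from this post, while the executable name `cufflinks`, the trailing alignment file argument, the file name `accepted_hits.bam` and the function name are assumptions added for illustration.

```python
import shlex

def build_cufflinks_command(genome_fasta, annotation_gtf, alignments_bam):
    """Assemble the option list from this post as an argument list for subprocess."""
    return [
        "cufflinks",  # assumed executable name on PATH
        "--max-bundle-frags", "999999999",
        "--no-effective-length-correction",
        "--min-isoform-fraction", "0",
        "--pre-mrna-fraction", "0.05",
        "--junc-alpha", "0.05",
        "--max-bundle-length", "5500000",
        "-b", genome_fasta,
        "-G", annotation_gtf,
        alignments_bam,  # assumed: the alignment file goes last
    ]

# Hypothetical file names, for illustration only.
cmd = build_cufflinks_command("genome.fa", "my_annotation.gtf", "accepted_hits.bam")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)`; the printed form is just for checking the command before running it.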