Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • sdriscoll
    I like code
    • Sep 2009
    • 436

    #61
    No problem. I haven't figured out if these are the best settings for me yet but I can at least explain why I'm changing them from their default values. I'm setting all of the following options:

    Code:
    --max-bundle-frags 999999999
    --no-effective-length-correction
    --min-isoform-fraction 0
    --pre-mrna-fraction 0.05
    --junc-alpha 0.05
    --max-bundle-length 5500000
    -b <genome.fa>
    -G <my_annotation.gtf>
    I've set --max-bundle-frags and --max-bundle-length to these values in order to force it to evaluate all of the bundles it finds in the mouse data that I usually work with. So you would want to tweak those depending on your depth of sequencing and the genome you're working with. Just watch for "warning skipping ...." in the cufflinks output it generates while running.

    Per Cole in this same thread I've disabled effective length correction (--no-effective-length-correction) because I don't like that this correction extrapolates the raw data. It also greatly exaggerates the expression estimation of single exon genes that are in the gene annotations. Basically disabling this feature makes it so the counts/expression seem to match up with the raw coverage a little better.

    Bias correction (-b) makes a big difference in the quality of the quantifications. It takes longer to run but it seems to really help cufflinks produce logical output.

    --min-isoform-fraction and --pre-mrna-fraction are both settings that allow cufflinks to discard information based on some arbirary thresholding. --min-isoform-fraction filters out low-count isoforms and --pre-mrna-fraction filters out intronic alignment data. I'm not sure what a difference --pre-mrna-fraction makes but when I was picking out these options I was looking for any options that made any filtering less strict. My and the researchers I work with would rather do the filtering afterwards. It's more useful for us to see as much of the raw data as possible.

    --junc-alpha is one that I haven't tested with different values but I have set it to be less strict. The default is 0.001. I plan to mess around with that one a little more to see what sort of impact it has. I'm not sure if it even has any impact at all on quantification (maybe it only applies to assembly).

    I believe --min-isoform-fraction has an impact on what is reported when you use cufflinks to do assembly. For example if you aren't prepared to trust cufflinks' estimations of expression quantification but you want to use it to build the most robust assembly possible then by setting this value to 0 you ensure that cufflinks doesn't throw out assembled isoforms it thinks are low expressed. I assume --junc-alpha will have some kind of impact as well but I haven't tested it out much.

    I used the above settings to quantify some real data against the mm9 known gene annotation (from UCSC) and compared those quantifications to those I got using eXpress. As I posted before - it seemed like cufflinks was doing a much better job of making specific decisions about which isoforms really needed to be expressed to fully explain the alignments. For example if all of the coverage and junction information in a locus can be explained by a single isoform why should these programs report that multiple isoforms are expressed? If they do then I think that decreases the sensitivity of differential splicing analysis. Maybe in another sample there's new junction and coverage information in that locus that DOES justify expression of a second isoform. While eXpress would have givin you expression of both isoforms in both cases in my tests cufflinks would more likely report that second isoform as something that was activated in the second sample. That translates to me that there was sufficient splicing or coverage evidence of that new isoform and not just that some proportion of reads are being assigned to it in both cases because they share exons.

    There are some very subtle differences in isoforms in some of the loci in the mouse genome. For example I've seen 2-isoform genes where the isoforms differ only by a single amino acid somewhere in the middle of the isoform making an exon shared between the two be 3bp longer than the corresponding exon in the other isoform. Cufflinks picks up on that because of the spliced alignment data - if all of the junctions are anchoring into the exon with the extra amino it can use that information to help it make expression assignment. In those cases eXpress assigns nearly equal expression to both isoforms even though the genome alignment evidence points heavily towards one verses the other.
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */

    Comment

    • adarob
      Member
      • Jul 2010
      • 71

      #62
      sdriscoll,

      Note that in the case of a provided annotation, eXpress and Cufflinks use the same information to determine the transcript abundances. The spliced reads are no longer spliced in the case of transcriptome mappings, but they are no less informative.

      What you are seeing is due to the fact that eXpress uses the online EM algorithm to optimize the likelihood function, which will only approximate the MLE in the case of low coverage. This can be good or bad. eXpress is less likely to make strong calls with insufficient data (for example, a single read), but it will also rarely have enough data to call an isoform completely unexpressed. Therefore, your analysis will need to reflect an understanding of this.

      However, you can run multiple rounds of eXpress using the batch update (with the -B option), which will improve its accuracy in the cases you mention.

      -Adam
      Last edited by adarob; 12-10-2012, 11:05 AM. Reason: Additional information.

      Comment

      • sdriscoll
        I like code
        • Sep 2009
        • 436

        #63
        I noticed that Cufflinks and Express produced very similar results for higher expressed loci. It was indeed the lower expresed loci where I found the most differences between the two.
        /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
        Salk Institute for Biological Studies, La Jolla, CA, USA */

        Comment

        • dvanic
          Member
          • Jan 2012
          • 61

          #64
          sdriscoll, have you tried doing the same analysis with the -g option (i.e cufflinks using reference transcripts as well as identifying novel ones) vs the -G one (which is only quantifying known isoforms)? How much is the "accuracy" of expression estimates reduced, if at all?

          Comment

          • anthurium
            Junior Member
            • Apr 2013
            • 4

            #65
            Originally posted by drdna View Post
            Cole,
            Thanks very much for your comprehensive explanation of how FPKM calculations and locus intervals are defined - it does a lot to demystify the inner workings of the programs. However, most of the transcripts I have looked at are considerably longer than the average fragment length and, so, should not be overly prone to the edge effects that you described. Moreover, the control dataset that I used for one analysis was generated by sampling fragments evenly across the transcript and, hence, had no bias in read distribution. In general, with the exception of very short RNAs which, obviously, will be excluded from size-selected libraries, there should be no selection-associated bias against transcript "edges." The only bias should come from partially degraded RNA preps, or incomplete reverse transcription. I was wondering if you have examined read distribution along a set of known transcripts to quantify the supposed "edge" effect? In addition, I draw you attention to the fact that the manner of shearing has a major effect on sequence read distribution. Specifically, mechanical shearing results in a massive overrepresentation of end fragments (Schwartz and Farman, 2010; BMC Genomics 11:87). For reverse transcripts, this would produce a superabundance of 3' end fragments.

            In your explanation of transcript loci, you stated that cufflinks processes all the transcripts in a "locus" together, as part of a single isoform deconvolution. However the loci that I am referring to span several different genes. Do I understand, then, that cufflinks processes multiple genes "as part of a single isoform deconvolution"?

            A more concerning issue, however, is that cufflinks and its associated programs appear to be processing "loci" that have no reads mapping within them and assigning FPKM values to those loci.

            For example, I have a gene_exp.diff output line that reads:

            XLOC_000005 XLOC_000005 - chr1:7945468-7967929 EG_LVNT EG_LV1 OK 420.778 415.167 -0.0193665 0.0561998 0.955183 0.9758 no

            This suggests that there are 420+ fragments per kb per million mapped reads, yet when I look at the chr1:7945468-7967929 region in the bam file, there are no fragments mapping in this region at all. Furthermore, there are only 8 reads mapping within 20 kb of the boundaries of this region.
            Hi all,
            I would like to share with you a little observation I made about genomic loci in Cufflinks.
            From Cufflinks Manual I read the definition of transcription loci: “a locus that contains a set of transcripts all of whose genomic locations do not overlap the genomic location of any transcript in any other locus” and I understood that the transcription loci are not biological and they have a computational purpose.
            I’m analyzing RNA seq data from Arabidopsis 72bp PE Illumina
            When I use the options in Cufflinks:
            -g/--GTF-guide <reference_annotation.(gtf/gff)>
            -b/--frag-bias-correct <genome.fa>
            -u/--multi-read-correct
            I found that transcription loci for certain genes are different from the analysis run with the default parameters.
            Comparing results of my analysis obtained with default settings vs gbu options I noticed that along with the loci the significance (q-value) is also dramatically changed.
            In particular I show this example:
            Default settings: gene A, B, C, D, E are in different transcription loci
            geneA is significant qvalue 10-6
            gene B, C, D, E are not significant

            gbu option: gene A,B,C, D, E are in the same transcription locus
            gene A, D, E not significant
            gene B, C significant qvalue 10-27

            Can someone please comment this pattern?
            thanks

            Comment

            • pengchy
              Senior Member
              • Feb 2009
              • 116

              #66
              Hi anthurium,

              I am a bit confused about your question.
              1. What is the relation between gene and transcript loci?
              In the output of cufflinks file transcripts.gtf, more than one transcript_id can come from one gene_id, but not vice versa.

              2. where did the significant qvalue come from if you only run cufflinks?

              Comment

              • anthurium
                Junior Member
                • Apr 2013
                • 4

                #67
                Hi pengchy,

                I forgot to say that after cufflinks also cuffdiff has been run, and then I compared cuffdiff output files.

                when I speak about transcription loci and q-value I refer to "isoform_exp.diff file" produced by cuffdiff when testing for Differential Expression.

                in this file I find that the same genomic region (in the column "locus") is often assigned to multiple genes.

                My observation is that using different options in cufflinks the boundaries of transcription loci can change. I observed also that the estimated significance of DE (q value) can change when the dimension of the transcription locus changes.

                Comment

                • pengchy
                  Senior Member
                  • Feb 2009
                  • 116

                  #68
                  Hi all,

                  For the problem of inconsistent between many reads coverage and zero fpkm, I think we should check the scope of this inconsistent. @lshen had done this. The scatter plot of the rounded fpkm and count-based rpkm showed r= 0.99 using transformation log(x)+1. Indeed, there are many points with very high rpkm but with very low or zero fpkm, while it not so vice versa. It seems fpkm always smaller than rpkm.

                  Comment

                  Latest Articles

                  Collapse

                  • SEQadmin2
                    From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                    by SEQadmin2


                    Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                    The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                    ...
                    06-02-2026, 10:05 AM
                  • SEQadmin2
                    Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                    by SEQadmin2


                    With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                    Introduction

                    Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                    05-22-2026, 06:42 AM
                  • SEQadmin2
                    Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                    by SEQadmin2

                    Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                    Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                    05-06-2026, 09:04 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by SEQadmin2, Yesterday, 08:59 AM
                  0 responses
                  14 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 06-02-2026, 12:03 PM
                  0 responses
                  22 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 06-02-2026, 11:40 AM
                  0 responses
                  19 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 05-28-2026, 11:40 AM
                  0 responses
                  32 views
                  0 reactions
                  Last Post SEQadmin2  
                  Working...