Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • #46
    Originally posted by drdna View Post
    I've come to the conclusion that the tophat, cufflinks, cuffdiff suite absolutely cannot be trusted to provide reliable results. Just recently I've been mapping paired end RNAseq reads to a bacterial genome using the -r and --mate-std-dev options. In the "bowtie.[left/right]_kept_reads.fixmap.log" files it states that the number of unpaired reads was 100%. However, the accepted_hits.bam file contains millions of pairs of lines that represent paired end data. Can we really trust that the programs aren't treating these reads as independent, if they don't even know they are dealing with paired ends? I'm not prepared to take the risk and will have to resort to throwing away the paired-end data and using reads coming from just one end. Either that or switch to CLC Bio...
    Are you sure you understood the log files correctly? That sounds like one of the intermediate log files, which would not necessarily say anything about the final results. I have had always reliable results from Tophat.

    And why recommend CLC bio over say BWA or a whole host of other open source programs?

    Comment


    • #47
      Hi Cole and the other friends here,

      Thanks for these wonderful discussions here.

      Hope you would have time to give a hand in this thread (about the status tag in Cuffdiff):
      Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


      Many thanks.

      Best regards,

      Qiongyi
      Last edited by qiongyi; 08-24-2012, 06:21 AM.

      Comment


      • #48
        i too have had many issues with cufflinks. i can describe them as similar to the problems drdna has had in this thread. it's very hard to trust a tool when you cannot check it's work. for example i have been witness to many occasions when cufflinks has assigned seemingly excessive expression to isoforms that have very few actual read counts. how can the software be so certain that the raw data needs to be interpreted in this way? this breaks down to cufflinks making decisions behind the scenes. either i don't understand them or i don't trust them - i'm not sure which. in all other areas of biological research i'm involved in we don't make such extrapolations to the raw data and i'm not convinced that it's OK to do so with this type of data either.

        differential expression results from cuffdiff have always be suspect to me. more recently i've been analyzing a knock-out experiment with a gene that has decent FPKM values (between 10 and 15 across replicates) and exactly 0 across the KO replicates. somehow this is not sufficient for that geen to be called significantly mis-expressed. if the purpose of our investigation is to see what sort of impact knocking that gene out has on gene expression of genes how can i possibly trust the results from cuffdiff if not even the completely knocked out gene is picked up in the statistical tests. i think cufflinks 1.3 detected the gene but cufflinks 1.3 was severely plagued with FAILs and NOTEST status' at many genes which clearly had reads hitting them. i couldn't for the life of my understand why a statistical test was being used that seemed to fail so widely.

        the other issue i have with cufflinks is its estimation of novel isoform structures. in my lab i've had many discussions with the post-docs who have used RNA-Seq in their projects and we cannot figure out why it's biologically relevant for an isoform assembly tool, like cufflinks, to force minimization of the number of isoforms thus dictating which intron trains are available for downstream analysis. i understand that it is applying some statistical methods to try to determine the most likely minimum set of isoforms but what i don't understand is why it wouldn't provide the option for the researcher to figure that out for themselves. maybe what I want to know is that my samples have 10 fold more splice variants than my controls. this would be hidden from me with cufflinks.

        in experiments where i've been tasked to attempt to measure the differential splicing between controls and mutants i've had a very hard time using cufflinks. i've tried several approaches at each one seems to be flawed in some way.

        one approach with a 3 vs 3 set of samples is to run cufflinks allowing novel isoform discovery separately for each sample and then use cuffcompare to compare those 6 assemblies and relate them to some reference annotation so that i can have cufflinks' assignment of '=', 'j', etc for each isoform. to try to select those isoforms unique to the mutants i'll scan for rows where an isoform is present in at least two of my mutants and not present in any of the controls. of course this gives me a list of isoforms (this process does yield results) but when i fact check these isoforms that's where things get fuzzy. i'll load up the cuffcmp.combined.gtf file into the UCSC genome browser and start to attempt to make sense of the results. for all of the isoforms that i have checked manually i've not been able to visually confirm, based on junctions and coverage, why it would have assigned these isoforms to the mutants but not the controls (they share similar coverage and many of the same junctions). this leads me to thinking that another approach would be wiser...

        so that's when i use the cuffmerge approach to merge those 6 assemblies into one and then use cuffdiff to attempt to measure differential expression between the groups in terms of isoforms. the problem is that when i run cuffmerge most of the interesting dynamics of the novel isoforms discovered in the initial cufflinks assemblies is lost. in fact the resulting merged.gtf often has completely different (possibly novel) isoforms in it from those in the cuffcompare assembly. this troubles me to no end.

        i feel like the developers can look at its output and say, "oh well that decision makes sense because i have a complete understanding of the assumptions i've made in this software and the algorithm i used" but for the rest of us that didn't write the code we look at the results and simply say, "hey...this doesn't make sense". it also doesn't seem like it's a big deal to them when people appear to be getting misled by these odd results or results that exclude genes and loci due to computation errors in their variance estimations. it's also odd to me that sometimes huge chunks of a chromosome are skipped because cufflinks finds it to be too complex thus resulting in a large net loss in the genes that are examined. why is this OK? if that behavior is expected there should be a solution. either that or the entire algorithm needs to be reexamined so that these loci don't have to be skipped. can you imagine if it turned out that a publication worthy gene of interest was in one of those loci and you never got to find it because of cufflinks' own computation issues?

        so it has been for me and cufflinks since i first tried it in 2009. it seems to me that we don't need anything more than raw level information and some basic estimation of the noise level in the data for gene level DE. if we have replicates then the alignment biases should be pretty consistent across them and we should still be able to call DE genes. isoform level DE seems to be to be a game of guesses and estimations. i'm not convinced we can really do that yet and i doubt anybody is going to attempt to confirm 10's of thousands of differential isoform expression levels at the bench to see if the tools really are working. as far as estimated isoform structures i think we need an assembly that applies some basic noise level filter to the alignments (including detected junctions) and simply presents us with a full list of possible structures so we can observe the splice variations with our eyes before making attempts to rule out specific isoforms. maybe that's just me.
        Last edited by sdriscoll; 11-03-2012, 12:50 PM.
        /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
        Salk Institute for Biological Studies, La Jolla, CA, USA */

        Comment


        • #49
          sdriscoll. My experiences with cuffdiff just get more and more bizarre. Like you, I have a situation where I have an FPKM value for one sample and zero for the other and yet the P-values are non-significant. Problem is I have genes with >1000 FPKM versus 0 and cuffdiff is still not reporting significant differences. Even more bizarre, I have one gene that showed an FPKM difference of 730 versus 0 - not significant; another gene had an FPKM difference of 700 versus 0 - significant: WTF!!! So I decided that it must be something to do with gene length differences and perhaps cuffdiff's significance calls involved statistical analysis of raw hit data. Looking at the genes.read_group_tracking information on raw hits, I was still unable to explain the lack of significance because raw hits were >5,000 versus 0 (seems pretty significant to me). As a last resort, I decided to query the raw alignment data in the .bam file. This told me that the one sample had >9,000 hits to the gene in question, while the other had 2,000.... are you f..king kidding me? All of the cufflinks, cuffdiff outputs said there were no hits to the gene when there were thousands - right there, clear as day, in the .bam file! THESE PROGRAMS ARE ABSOLUTE GARBAGE - STOP USING THEM AND SPREAD THE WORD!

          Comment


          • #50
            I'm using my own C++ program, that simply counts reads within isoforms it takes annotation from UCSC genomebrowser mysql database. Also, http://www-huber.embl.de/users/anders/HTSeq/ - this one is good for the beginning.

            Comment


            • #51
              Originally posted by drdna View Post
              sdriscoll. My experiences with cuffdiff just get more and more bizarre. Like you, I have a situation where I have an FPKM value for one sample and zero for the other and yet the P-values are non-significant. Problem is I have genes with >1000 FPKM versus 0 and cuffdiff is still not reporting significant differences. Even more bizarre, I have one gene that showed an FPKM difference of 730 versus 0 - not significant; another gene had an FPKM difference of 700 versus 0 - significant: WTF!!! So I decided that it must be something to do with gene length differences and perhaps cuffdiff's significance calls involved statistical analysis of raw hit data. Looking at the genes.read_group_tracking information on raw hits, I was still unable to explain the lack of significance because raw hits were >5,000 versus 0 (seems pretty significant to me). As a last resort, I decided to query the raw alignment data in the .bam file. This told me that the one sample had >9,000 hits to the gene in question, while the other had 2,000.... are you f..king kidding me? All of the cufflinks, cuffdiff outputs said there were no hits to the gene when there were thousands - right there, clear as day, in the .bam file! THESE PROGRAMS ARE ABSOLUTE GARBAGE - STOP USING THEM AND SPREAD THE WORD!
              That is really bizarre. The thing I wonder when I see such discrepancy between the cufflinks values and the actual coverage is either there's some fundamental property of interpretation of RNA seq data that I don't understand or that the cufflinks algorithm is kinda crazy sometimes. One issue is that they are setup to be happy with getting things mostly right. So if it gets 95% of the genes right that's good...right? Unfortunately it's not. As a good scientist you question everything. With a data set so big you can't possibly check it all just seeing a handful of strange illogical results out of the thousands throws everything into doubt. That's really the issue. I've never been disappointed with DE results from DESeq. That pipeline involves assigning read counts exactly as you see then with you own eyes. The data is noisy so it does the logical thing and attempts to estimate the noise (two kinds) and then it just has to model the variance since we usually only have a couple replicates. Cufflinks is trying to unambiguously assign reads to isoforms. It's clear to me that its algorithm takes extrapolation a little too far sometimes and sometimes it seems to get just plain confused. Then other times it totally fails and doesn't compute anything at all.

              It's worth trying RSEM if you don't mind waiting forever for the algorithm to finish its job. Keep in mind thought that all these tools have to go on are the sometimes tiny portions of isoforms that differentiate them from other isoforms in the same locus. It's really a big leap.
              /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
              Salk Institute for Biological Studies, La Jolla, CA, USA */

              Comment


              • #52
                The problem with cufflinks' interpretation of my data is that they are from bacterial RNAseq. Therefore, there should be no isoforms to confuse it. Honestly, after using it for analysis of human, fungal and bacterial RNA seq data, I am not sure that cufflinks ever gets counts correct for any gene. How many people actually go in and check their results to be sure they make sense? I have spent many hours trying to reconcile peculiar results with the tophat alignments (.bam file) and I have seen every possible error there can be: FPKM values in the hundreds when they are no reads from the gene in question; FPKM values of zero when the gene is represented by thousands of reads; merging transcripts when there are no reads supporting the merge, etc. Pretty much every time I have backtracked to see how an FPKM value was calculated, I have been unable to reach a value anywhere close to what was reported by these programs. I will not use them again.

                Comment


                • #53
                  While I've never agreed with cufflinks I've never had any trouble with tophat, with the exception of its long run time. It increases the number of alignments verses Bowie or BWA and the junction data is useful for me. I suppose, however, that i could achieve the same results by simply aligning to the transcriptome and then converting the alignments to genomic coordinates but I do sometime want to know about potential novel exons and splicing so Tophat it is. ENCODE developed a spliced aligned called STAR which claims to be able to align with higher sensitivity than Tophat and at a tremendous rate of speed. They claim 500 million pairs per hour which is nuts. I haven't gotten it running yet on my system but I do want to try it out. In the mean time ill keep using Tophat to align my reads. Lie Portah I've got my own abundance counting code (in C though, and I have a Python version too) that intersects almost a million reads per second with a GTF file using the same intersection theory as bedtools. I can use that to investigate counts and depth/breadth of coverage. Combined with the junction data from Tophat I've got my own half-assed alternative splicing pipeline. I look forward to trying out EBSeq next week as it claims to be robust to outlier expressions in replicates (although I'm not sure how obvious an outlier can be with three replicates). I'd really like to use RSEM for quantification but its slow as ****. It's been working on a single sample all day with a mere 40 million alignments. Cufflinks could have quantified those reads in less than 20 minutes. Once I have a set of samples quantified with RSEM ill try to do some checking on its values...see if they make sense.

                  *edit*
                  According to its author, RSEM should have finished my run in a hour so there's some kind of bug at least on my system. I'll report back.
                  Last edited by sdriscoll; 11-04-2012, 09:58 AM.
                  /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                  Salk Institute for Biological Studies, La Jolla, CA, USA */

                  Comment


                  • #54
                    Originally posted by Cole Trapnell View Post
                    As far as the transcript merging issues that others have mentioned: Cufflinks will merge transcripts, especially in particularly gene-dense organisms like some fungi, because it tries to fill in coverage gaps (by default of 50bp or less) in order to make more full length structures in low-abundance genes. It's a tradeoff - by not filling in such gaps, your overall transcriptome assembly will be more fragmented, but you'll get fewer erroneous merges. You can disable this behavior by setting --olap-radius 0 in the Cufflinks assembler. We haven't exposed this option yet for cuffmerge but we can do so in a future release.
                    the option appears to be --overlap-radius and when i attempt to set it i get an error that says "--max-multiread-fraction must be at least 1"

                    so there's two possibilities here: 1. there's an error in the code during argument parsing that's using --overlap-radius' value when checking the --max-multiread-fraction value OR it's using the value i specified for --overlap-radius to set --max-multiread-fraction and then it throws an error. which ever it is it would appear I cannot set overlap radius to 0 and if i set it to 1 am i really setting it to 1 or am i setting --max-multiread-fraction to 1? but then in the documentation it says --max-multiread-fraction is a decimal (up to 1) so I'm guessing that i AM setting the overlap-radius and it requires a value of at least 1. so we can't set it to 0 as was suggested.
                    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                    Salk Institute for Biological Studies, La Jolla, CA, USA */

                    Comment


                    • #55
                      in the interest of seeing if it's possible to figure out a way to run cufflinks that demystifies the output a little i've been looking closer at the various options you can specify. there there appear to 5 options that control filtering and thus "non-reporting" of isoforms in the output. They are:

                      Code:
                      --max-bundle-frags
                      --min-isoform-fraction
                      --min-frags-per-transfrag
                      --max-bundle-length
                      --max-multiread-fraction
                      it seems like fine tuning these parameters might produce different output results that may "make more sense" when visually checking results. i also think that if you want to utilize cufflinks if only for novel isoform/splice variant discovery tweaking these parameters might help you get more information out of it. say, for example, you don't trust cufflinks' quantification of expression then you might want to adjust --min-isoform-fraction because that controls reporting of what cufflinks has determined to be very low count isoforms within a locus. so after assembly and abundance quantification cufflinks does some pruning of the data and at this point some assembled isoforms may not be reported if it thinks they have very low expression. the other parameter is --min-frags-per-transfrag. This controls, again, dropping of assembled isoforms based on estimated low count/low data. If you set both of these options to 0 you should get a more elaborate, unfiltered, set of isoforms that you can pass off to another quantification tool.

                      there's a spot in the mouse genome (chr14:3498688-7119724) that cufflinks always skips with its default setting of --max-bundle-length (3,500,000bp). it's unfortunate that it is unable to even attempt to assign expression to that section. the basis for skipping that section is because cufflinks determins that section of the genome to be a single gene locus and it would otherwise attempt to build a single gene model from the alignments. i'm not clear with it can't even quantify that area with cufflinks is run with -G since in that mode we aren't asking for assembly. so maybe to get a more complete quantification the --max-bundle-length option can be set a little higher and then cufflinks will quantify the annotated genes in that section of the genome.

                      two options that i kind of don't understand the "why is this an option" for are both --max-bundle-frags and --max-bundle-length. these options explicitly allow cufflinks to throw away data and/or skip over semi-complex sections of the genome. i ask this: if the default setting was for it to NOT do either of those things - what researcher would ever enable either of those options? i can't think of a reason why we wouldn't always want to see the full extent of the data quantified. if we have to post-filter the information so be it.
                      /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                      Salk Institute for Biological Studies, La Jolla, CA, USA */

                      Comment


                      • #56
                        Hi all
                        Im a newby to RNA-Seq. I've been running tophat on my data sets and used the accepted_hits.bam file in cufflinks. I used the default parameters. In the output file of cufflinks i.e transcripts.gtf I see there's zero kb data. In the subsequent run i.e cuffcompare I need to use the transcript.gtf file for the two data sets I have to compare. Is this output of cufflinks one of the flaws? Has anyone experienced this or is there something I'm not doing right?
                        Lx
                        Last edited by Lizex; 11-14-2012, 12:51 AM.

                        Comment


                        • #57
                          that doesn't sound right. regardless of our agreement with cufflinks' output we still normally get output. transcripts.gtf should be a pretty good size file containing a GTF formatted annotation for every exon discovered in the assembly. you should also have the genes.fpkm_tracking and isoforms.fpkm_tracking files which contain gene and isoform level expression estimates.

                          if you post what you ran at the command line it would be helpful. i'm sure you just ran the standard options though. also if you could post the output from the following command run on your accepted_hits.bam file from Tophat:

                          Code:
                          samtools flagstat accepted_hits.bam
                          /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                          Salk Institute for Biological Studies, La Jolla, CA, USA */

                          Comment


                          • #58
                            Originally posted by sdriscoll View Post
                            that doesn't sound right. regardless of our agreement with cufflinks' output we still normally get output. transcripts.gtf should be a pretty good size file containing a GTF formatted annotation for every exon discovered in the assembly. you should also have the genes.fpkm_tracking and isoforms.fpkm_tracking files which contain gene and isoform level expression estimates.

                            if you post what you ran at the command line it would be helpful. i'm sure you just ran the standard options though. also if you could post the output from the following command run on your accepted_hits.bam file from Tophat:

                            Code:
                            samtools flagstat accepted_hits.bam
                            Thanks sdriscoll for the reply.
                            I think I've discovered my error. Initially I ran this command:
                            cufflinks -p 8 -N -v -o C2_Cufflinks_output accepted_hits.bam,
                            without the -N option and gave me 0 kb in the transcript.gtf file. After I've included the -N option, I got a substantial amount of data in the transcript.gtf file.

                            Lx

                            Comment


                            • #59
                              to update this dicussion i'm starting to feel that cufflinks itself can be tweaked to produce very logical results. i've been comparing cufflinks quantified isoforms verses eXpress quantified isoforms. i grouped the isoforms into loci and jump around randomly to loci where the relative assigment if expression was quite different between the two tools. for both tools i used counts that were not effective length corrected. as a control i simply loaded up the BAM alignments to the genome (with Tophat) in IGV and performed a visual inspection of the loci. I could wittle them down to which isoforms probably shouldn't have expression assigned based on the evidence of other isoforms and usually which isoform should have the most expression. it would all be based on junction presence and coverage pileup, of course, and what parts of the isoforms could be used to uniquely identify them relative to all others in the locus. i made it through about 50 loci and so far cufflinks has agreed wtih my inspection 75% of the time. it seems like eXpress tends to assign expression to a lot more isoforms than cufflinks. i believe this is due to the fact that cufflinks has knowledge of exons and eXpress does not. in most of those cases that cufflinks agreed with my own inspection the overall profile of isoform expression assignment was pretty similar - maybe the isoform cufflinks gave the most to was also the isoform that eXpress gave the most to. I think what I saw most of the time that was a major difference was that if the coverage could be explained by one isoform and there didn't seem to be the need to assign coverage to a second one (maybe one that's shorter than the other and mostly contained) then it wouldn't do that. eXpress would assign expression to both. so while maybe it's possible that both exist - there doesn't appear to be enough evidicent to justify the claim that they both exist so why report that they both exist?

                              anyways, as it stands I think that cuffidff is what you don't want to use however cufflinks, with a bit of tweaking of the options, seems to perform very well and it does in fact make pretty logical expression assignments. it seems like it would be a bit more sensitive if you were looking to compare locus level expression distributions between samples.
                              /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                              Salk Institute for Biological Studies, La Jolla, CA, USA */

                              Comment


                              • #60
                                Could you elaborate a bit on what kind of tweaking you have done with the cufflinks options sdriscoll?

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Strategies for Sequencing Challenging Samples
                                  by seqadmin


                                  Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                  03-22-2024, 06:39 AM
                                • seqadmin
                                  Techniques and Challenges in Conservation Genomics
                                  by seqadmin



                                  The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                  Avian Conservation
                                  Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                  03-08-2024, 10:41 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 06:37 PM
                                0 responses
                                8 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, Yesterday, 06:07 PM
                                0 responses
                                8 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-22-2024, 10:03 AM
                                0 responses
                                49 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 03-21-2024, 07:32 AM
                                0 responses
                                66 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X