  • jiexiong
    Junior Member
    • May 2010
    • 9

    #16
    Batch ORF finder for Cufflinks-assembled transcripts (mRNA)

    Hi,
    I used Cufflinks to assemble transcripts (mRNA) from an RNA-Seq experiment.
    My goal is to estimate the likely UTR lengths of each transcript, so I first need to find the best ORF for each transcript. Is there a tool that can find the best ORFs in batch?
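
For anyone with the same question, one way to approach it without a dedicated tool is a small script. This is only a minimal sketch: it assumes the transcript sequences have already been extracted to FASTA (for example with a tool such as gffread), that each sequence is already oriented on the sense strand, and that "best ORF" simply means the longest ATG-to-stop stretch. The function names are made up for illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}

def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)

def longest_orf(seq):
    """Return (start, end) of the longest ATG..stop ORF in the three forward frames.

    Coordinates are 0-based and end-exclusive, and the stop codon is included.
    Returns None when no complete ORF exists.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG after the previous stop opens an ORF
            elif start is not None and codon in STOPS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None  # look for the next ORF in this frame
    return best

def utr_lengths(seq):
    """Return (5' UTR length, 3' UTR length) around the longest ORF, or None."""
    orf = longest_orf(seq)
    if orf is None:
        return None
    start, end = orf
    return start, len(seq) - end

# Toy input standing in for a FASTA file of assembled transcripts.
demo = [">transcript_1", "CCATGAAA", "TAGGG"]
for name, seq in read_fasta(demo):
    print(name, longest_orf(seq), utr_lengths(seq))
```

For an unstranded assembly, the reverse complement of each transcript would need to be scanned as well, and the longest ORF is not always the biologically correct one.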


    • adarob
      Member
      • Jul 2010
      • 71

      #17
      The multiple FPKM problem occurs when genes have transcripts that do not overlap with any other transcripts in the gene. For example, this occurs in the ENSG00000125388 gene from ENSEMBL/hg19. We are aware of this issue and will eventually change the behavior, but for now a simple solution is just to sum the FPKMs since the gene FPKMs are just the sum of the transcript FPKMs anyways. The issue should not occur in Cuffdiff.

      I would not draw any conclusions about the FPKM of the FAILED genes.
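
As a rough illustration of the summing workaround described above, the sketch below adds up the FPKMs of all rows that share a gene name, separately for each sample. The row layout (gene name, FPKM in sample 1, FPKM in sample 2), the function name, and the values are assumptions made up for the example; this is not code from Cufflinks itself.

```python
from collections import defaultdict

def sum_fpkm_by_gene(rows):
    """Sum per-sample FPKMs over all rows that share a gene name.

    rows: iterable of (gene, fpkm_sample_1, fpkm_sample_2) tuples.
    Returns a dict mapping gene -> (total_sample_1, total_sample_2).
    """
    totals = defaultdict(lambda: [0.0, 0.0])
    for gene, fpkm_1, fpkm_2 in rows:
        totals[gene][0] += fpkm_1
        totals[gene][1] += fpkm_2
    return {gene: tuple(values) for gene, values in totals.items()}

# Made-up rows: GeneA appears twice, as happens with non-overlapping transcripts.
rows = [
    ("GeneA", 10.0, 5.0),
    ("GeneA", 30.0, 15.0),
    ("GeneB", 2.0, 8.0),
]
for gene, (total_1, total_2) in sorted(sum_fpkm_by_gene(rows).items()):
    print(gene, total_1, total_2)
```

Only the FPKM values can be combined like this; per-row test statistics and p-values cannot simply be added together.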


      • ngs
        Junior Member
        • Sep 2009
        • 2

        #18
        Originally posted by adarob:
        The multiple FPKM problem occurs when genes have transcripts that do not overlap with any other transcripts in the gene. For example, this occurs in the ENSG00000125388 gene from ENSEMBL/hg19. We are aware of this issue and will eventually change the behavior, but for now a simple solution is just to sum the FPKMs since the gene FPKMs are just the sum of the transcript FPKMs anyways. The issue should not occur in Cuffdiff.

        I would not draw any conclusions about the FPKM of the FAILED genes.
        Hi Adam,
        I ran TopHat (1.1.0) without a mouse GTF file, then ran Cufflinks (0.9.1), also without a mouse GTF file. Next I ran Cuffcompare with a mouse GTF file and the two GTF files that Cufflinks generated for my two samples. Finally, I ran Cuffdiff with compare.combined.gtf and the two accepted_hits.bam files.

        However, when I checked gene_exp.diff, I found that some genes still have multiple FPKM entries (see below):

        XLOC_000009 Cspp1 chr1:10053629-10189988 q1 q2 OK 44.5012 58.359 0.271096 -2.93789 0.00330457 yes
        XLOC_000010 Arfgef1 chr1:10053629-10189988 q1 q2 OK 10.0582 7.68137 -0.269589 4.88261 1.04688e-06 yes
        XLOC_000011 Arfgef1 chr1:10053629-10189988 q1 q2 OK 40.66 31.8566 -0.244 17.6406 0 yes
        XLOC_000013 Arfgef1 chr1:10053629-10189988 q1 q2 OK 2.7768 40.8059 2.68753 -144.972 0 yes
        XLOC_000015 Arfgef1 chr1:10053629-10189988 q1 q2 OK 54.0345 65.0081 0.18489 -12.9339 0 yes
        XLOC_000016 Arfgef1 chr1:10053629-10189988 q1 q2 OK 23.4654 43.6672 0.62107 -29.4492 0 yes
        XLOC_000031 Tram2 chr1:20986216-20997026 q1 q2 OK 5.8219 2.96147 -0.67594 3.70609 0.000210487 yes
        XLOC_000032 Tram2 chr1:20986216-20997026 q1 q2 OK 3.33419 14.9065 1.49757 -29.7646 0 yes
        XLOC_000057 Tmem131 chr1:36849038-36996484 q1 q2 OK 37.3723 30.8444 -0.191975 5.03247 4.84195e-07 yes

        Did I do something wrong?

        I have another question about the gene_exp.diff file. As you can see, the first gene, Cspp1, has the same coordinates (chr1:10053629-10189988) as the second gene, Arfgef1. But in my mouse GTF file (from Ensembl), the coordinates for those two genes are:
        Cspp1: Chromosome 1: 10,028,299-10,126,849
        Arfgef1: Chromosome 1: 10,127,652-10,222,751

        Those two genes do not overlap. Why do they have the same coordinates in the gene_exp.diff file?

        Thank you very much!


        • honey
          Senior Member
          • Feb 2010
          • 151

          #19
          If one has to sum the FPKMs for a gene, should one use the gene FPKM tracking file or the gene expr file from Cuffdiff? mgogol's Perl script uses the FPKM, FPKM lo, and FPKM high values, which are only in the tracking file. Is it OK to sum the FPKM values for a gene?
          Thanks


          • ngs_agd
            Junior Member
            • Feb 2011
            • 7

            #20
            Originally posted by adarob:
            The multiple FPKM problem occurs when genes have transcripts that do not overlap with any other transcripts in the gene. For example, this occurs in the ENSG00000125388 gene from ENSEMBL/hg19. We are aware of this issue and will eventually change the behavior, but for now a simple solution is just to sum the FPKMs since the gene FPKMs are just the sum of the transcript FPKMs anyways. The issue should not occur in Cuffdiff.

            I would not draw any conclusions about the FPKM of the FAILED genes.
            Does this mean that I will have to download the Cuffcompare file, edit it, upload it to Galaxy, and then run Cuffdiff on this GTF file? Thanks for your help!


            • ngs_agd
              Junior Member
              • Feb 2011
              • 7

              #21
              Sorry, in my previous post I asked whether the Cuffcompare file needs to be edited. I just looked at a Cuffcompare file, and it seems to contain only annotation information and no FPKM values. So how (or where) is one supposed to combine the FPKM values from different transcripts of a gene and then run Cuffdiff?


              • honey
                Senior Member
                • Feb 2010
                • 151

                #22
                Read

                It is not clear what you are asking. However, I agree that FPKM per gene is an area of ongoing research.


                • ngs_agd
                  Junior Member
                  • Feb 2011
                  • 7

                  #23
                  Hi Honey,
                  Sorry if I am not being clear. This is what I have done so far, and I am struggling to make sense of the information I am getting:
                  1. I have 2 .bam files (1 control and 1 disease), and I am trying to identify gene expression differences.
                  2. Using Galaxy, I ran the cufflinks-cuffcompare-cuffdiff workflow.
                  3. For cufflinks, I took the .bam files and ran it with the defaults.
                  4. I ran cuffcompare (with the assembled transcripts file from each sample, along with the reference).
                  5. I fed the output (transcript file) of cuffcompare along with the two original bam files into cuffdiff.
                  6. I was looking at the output of cuffdiff and am seeing a few things I don't quite understand:
                  There is more than one row per gene for most of the genes in the output file (I would have expected differential expression to be reported at the gene level). I read in some other threads on SEQanswers (including this one) that summing the FPKM values of the transcripts gives the gene-level value (which is fine). What I don't understand is which output file from the workflow I should perform the operation on:
                  a) The cufflinks output has the FPKM, but no gene annotations.
                  b) The cuffcompare output has the annotations, but not the FPKM values (unless I am missing them).
                  c) The cuffdiff output has both the FPKM and gene annotation values, but the "statistical" analysis is already done.
                  So should I take the cuffdiff output, edit it, and then feed it back into the workflow (and if so, at what point)?
                  This is where my first confusion is coming from.

                  There is another (possibly related) issue: some of the transcripts in the cuffdiff output have FPKM = 0, so when the differential analysis is run, the fold changes are ridiculous.
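
The extreme fold changes follow from the arithmetic: a log ratio is undefined (or unbounded) when either FPKM is 0. A common workaround, sketched here as a general technique and not as what Cuffdiff does internally, is to add a small pseudocount to both values before taking the ratio. The pseudocount of 1.0 and the function name are arbitrary choices for illustration.

```python
import math

def log2_fold_change(fpkm_a, fpkm_b, pseudocount=1.0):
    """Return log2((fpkm_b + pseudocount) / (fpkm_a + pseudocount)).

    The pseudocount keeps the ratio finite when either FPKM is 0,
    at the cost of shrinking fold changes for lowly expressed genes.
    """
    if pseudocount <= 0:
        raise ValueError("pseudocount must be positive")
    return math.log2((fpkm_b + pseudocount) / (fpkm_a + pseudocount))

# Made-up values: the first pair has FPKM = 0 in one condition.
for a, b in [(0.0, 7.0), (3.0, 3.0), (15.0, 0.0)]:
    print(a, b, log2_fold_change(a, b))
```

The choice of pseudocount changes the result noticeably for low FPKMs, so transcripts with a zero in one condition are still best interpreted with caution.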

                  What makes this all the more frustrating is that I am using published data (from a paper that lists genes differentially expressed between conditions, analyzed using Galaxy) in a bid to educate myself, and I am going in circles.

                  As you pointed out in one of my other threads, I have a lot of reading to do, but at the risk of sounding like a nag and unbelievably dense, I have been unsuccessful in finding material that might help me understand these things.

                  Any help from anybody greatly appreciated


                  • honey
                    Senior Member
                    • Feb 2010
                    • 151

                    #24
                    Look at the cuffdiff output files gene.expr and isoform.expr, which are the diff files, and at the combined GTF file. To get one FPKM per gene, the suggestion is to sum the FPKMs that share the same gene name and the same location. However, as Adam has also suggested, if a gene has more than one location (overlap), it may not be possible to sum those FPKMs. It is an ongoing area of research. I am not convinced that summing the FPKMs of all rows per gene is a good idea, though several publications, including a recent one, have reported doing so (http://genome.cshlp.org/content/earl...d-4783a31b68c6). My suggestion: if you are trying to learn RNA-seq, start with isoform.expr, not the gene level.
                    Best.


                    • edge
                      Senior Member
                      • Sep 2009
                      • 199

                      #25
                      Hi yjlui,

                      Have you already figured out what the "test status" values "OK", "LOWDATA", and "FAIL" mean?
                      Should I drop those transcripts from downstream analysis and treat them as poorly assembled?
                      Apart from that, do you have any idea what an FPKM of 0 means?
                      Does it mean that those transcripts are poorly assembled as well?
                      Thanks in advance.


                      • emanlee
                        Member
                        • Apr 2013
                        • 15

                        #26
                        Collapse duplicate FPKMs for a gene

                        Originally posted by mgogol:
                        I ended up writing a script to sum the FPKMS for a given gene id, which I think is right...

                        Here's my (unpolished) code (a perl script and a shell script).

                        This botches the confidence intervals, by the way.

                        The format of the Cufflinks output (genes.fpkm_tracking files) has changed since then. I updated the code written by mgogol and published it on SourceForge: https://sourceforge.net/projects/col...?source=navbar . I hope it will facilitate your work.
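
For readers who just want the general idea, here is a minimal Python sketch of collapsing duplicate gene rows in a tab-delimited tracking file. It is not the SourceForge script or mgogol's Perl code. It assumes the file has a header row with columns named gene_short_name, locus, FPKM, FPKM_conf_lo, and FPKM_conf_hi; check those names against the header of your own file. As noted in the quoted post, summing the confidence bounds is only a crude approximation.

```python
import csv
import io

VALUE_COLUMNS = ("FPKM", "FPKM_conf_lo", "FPKM_conf_hi")  # assumed header names

def collapse_fpkm_tracking(handle, key="gene_short_name"):
    """Sum FPKM columns over rows sharing the same key column.

    handle: an open text file (or file-like object) with a tab-delimited header row.
    Returns a dict mapping gene -> summed values plus the list of contributing loci.
    """
    collapsed = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        entry = collapsed.setdefault(
            row[key], {column: 0.0 for column in VALUE_COLUMNS} | {"loci": []}
        )
        for column in VALUE_COLUMNS:
            entry[column] += float(row[column])
        entry["loci"].append(row["locus"])
    return collapsed

# Made-up rows with only the assumed columns, standing in for a real file.
demo = (
    "gene_id\tgene_short_name\tlocus\tFPKM\tFPKM_conf_lo\tFPKM_conf_hi\n"
    "XLOC_1\tGeneA\tchr1:100-200\t10.0\t8.0\t12.0\n"
    "XLOC_2\tGeneA\tchr1:300-400\t5.0\t4.0\t6.0\n"
    "XLOC_3\tGeneB\tchr1:900-950\t2.5\t2.0\t3.0\n"
)
for gene, entry in collapse_fpkm_tracking(io.StringIO(demo)).items():
    print(gene, entry["FPKM"], entry["FPKM_conf_lo"], entry["FPKM_conf_hi"], entry["loci"])
```

With a real file, the call would be along the lines of `collapse_fpkm_tracking(open("genes.fpkm_tracking"))`. Keeping the list of loci makes it easy to spot genes whose rows come from different locations, which are the cases where summing is most questionable. The dict merge operator used here needs Python 3.9 or later.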


                        • tedwong
                          Member
                          • Mar 2015
                          • 13

                          #27
                          I'm using Cufflinks 2.2.1 but still seeing duplicate genes in the tracking file. Was the issue ever fixed?

