  • silin284
    Member
    • Jul 2009
    • 27

    Bug? duplicated genes in cufflinks output genes.expr

    Hi

    When I supplied a reference GTF to cufflinks (-G), I found duplicated gene IDs in the output "genes.expr". That seems a bit odd to me, and it is very rare (3 out of 50k genes). I checked those 3, and it turns out that cufflinks treats their isoforms as individual genes but still uses the gene_id supplied in the GTF file. All 3 genes share one characteristic: the genomic positions of each isoform's transcript/exon/CDS features are completely different. I guess cufflinks uses this positional information, rather than the gene_id supplied in the GTF, to judge whether different transcripts belong to the same gene.

    I can remove them by hand, but is there a way to "force" cufflinks to recognize them as a single gene?

    cheers
    silin

    original GTF file
    chr06 SZ transcript 3851140 3853473 . + . gene_id "Os06g07923"; transcript_id "Os06g07923.2";
    chr06 SZ CDS 3851140 3851247 . + 0 gene_id "Os06g07923"; transcript_id "Os06g07923.2";
    chr06 SZ CDS 3853062 3853304 . + 0 gene_id "Os06g07923"; transcript_id "Os06g07923.2";
    chr06 SZ exon 3853305 3853473 . + . gene_id "Os06g07923"; transcript_id "Os06g07923.2";
    ###
    chr06 SZ transcript 3851392 3852964 . + . gene_id "Os06g07923"; transcript_id "Os06g07923.1";
    chr06 SZ exon 3851392 3851900 . + . gene_id "Os06g07923"; transcript_id "Os06g07923.1";
    chr06 SZ CDS 3851901 3852434 . + 0 gene_id "Os06g07923"; transcript_id "Os06g07923.1";
    chr06 SZ exon 3852435 3852964 . + . gene_id "Os06g07923"; transcript_id "Os06g07923.1";

    cufflinks output "genes.expr"
    Os06g07923 141826 chr06 3851139 3853473 0 0 0 OK
    Os06g07923 141826 chr06 3851391 3852964 0 0 0 OK
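For anyone who wants to list the affected genes rather than spot them by eye, here is a minimal sketch. It assumes only that the first whitespace-separated column of genes.expr is the gene ID, as in the two rows above; the function name and the header check are my own and not part of cufflinks.

```python
from collections import Counter


def duplicated_gene_ids(lines):
    """Return gene IDs that appear on more than one row, in first-seen order."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        # Skip blank lines and a header row, if present
        # (assumed here to start with "gene_id").
        if not fields or fields[0] == "gene_id":
            continue
        counts[fields[0]] += 1
    return [gene for gene, n in counts.items() if n > 1]


# The two rows quoted in the post above.
rows = [
    "Os06g07923 141826 chr06 3851139 3853473 0 0 0 OK",
    "Os06g07923 141826 chr06 3851391 3852964 0 0 0 OK",
]
print(duplicated_gene_ids(rows))  # prints ['Os06g07923']
```

To run it on a real file, pass an open file handle instead of the list, e.g. `duplicated_gene_ids(open("genes.expr"))`.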
  • apadr007
    Member
    • Oct 2011
    • 21

    #2
    I have the same question. Why is cufflinks repeating genes?


    • kenphi
      Junior Member
      • Nov 2009
      • 2

      #3
      Dear silin

      I think this is because your reference annotation has "unrelated" transcripts annotated to the same gene. I noticed that this happens when there are independent transcript groups, i.e. groups of transcripts that do not overlap in exon coordinates. They can be side by side, or one can lie in the intron of the other. Some examples from Ensembl 64:

      ENSMUSG00000086255
      ENSMUSG00000062352
      ENSMUSG00000021879
      ENSMUSG00000033705
      ENSMUSG00000087461
      ENSMUSG00000022105
      ENSMUSG00000073791
      ENSMUSG00000052675
      ENSMUSG00000055407
      ENSMUSG00000056856
      ENSMUSG00000027203

      In some of these cases, I would say that Ensembl didn't follow its own guideline of assigning the same gene identifier to transcripts with overlapping positions, because these are clearly independent clusters.

      I keep them and use the gene_id column of cufflinks to make tables unique.

      Philip
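To check an annotation for such independent transcript groups before running cufflinks, something like the following sketch may help. It is only my reading of "do not overlap in exon coordinates", not how cufflinks itself decides: two transcripts are put in the same group if any of their exon/CDS features overlap, transitively. All function names are hypothetical, and the parser assumes the simple `key "value";` attribute style shown in the first post.

```python
import re
from collections import defaultdict

# Matches attribute pairs such as: gene_id "Os06g07923";
ATTR = re.compile(r'(\S+) "([^"]*)";')


def parse_features(gtf_lines, kinds=("exon", "CDS")):
    """Collect (start, end) intervals per transcript, grouped by gene_id."""
    genes = defaultdict(lambda: defaultdict(list))
    for line in gtf_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on any whitespace; the 9th field holds the attributes.
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[2] not in kinds:
            continue
        attrs = dict(ATTR.findall(fields[8]))
        interval = (int(fields[3]), int(fields[4]))
        genes[attrs["gene_id"]][attrs["transcript_id"]].append(interval)
    return genes


def overlaps(a, b):
    """True if any interval in a overlaps any interval in b (inclusive ends)."""
    return any(s1 <= e2 and s2 <= e1 for s1, e1 in a for s2, e2 in b)


def transcript_groups(transcripts):
    """Group transcripts that share at least one overlapping feature."""
    groups = []  # list of (transcript_ids, intervals)
    for tid, ivs in transcripts.items():
        merged_ids, merged_ivs = [tid], list(ivs)
        rest = []
        for ids, g_ivs in groups:
            if overlaps(merged_ivs, g_ivs):
                merged_ids = ids + merged_ids
                merged_ivs += g_ivs
            else:
                rest.append((ids, g_ivs))
        groups = rest + [(merged_ids, merged_ivs)]
    return [sorted(ids) for ids, _ in groups]


def split_genes(gtf_lines):
    """Return {gene_id: groups} for genes with more than one group."""
    result = {}
    for gene, transcripts in parse_features(gtf_lines).items():
        groups = transcript_groups(transcripts)
        if len(groups) > 1:
            result[gene] = groups
    return result


# Feature lines from the first post in this thread.
gtf = '''
chr06 SZ CDS 3851140 3851247 . + 0 gene_id "Os06g07923"; transcript_id "Os06g07923.2";
chr06 SZ CDS 3853062 3853304 . + 0 gene_id "Os06g07923"; transcript_id "Os06g07923.2";
chr06 SZ exon 3853305 3853473 . + . gene_id "Os06g07923"; transcript_id "Os06g07923.2";
chr06 SZ exon 3851392 3851900 . + . gene_id "Os06g07923"; transcript_id "Os06g07923.1";
chr06 SZ CDS 3851901 3852434 . + 0 gene_id "Os06g07923"; transcript_id "Os06g07923.1";
chr06 SZ exon 3852435 3852964 . + . gene_id "Os06g07923"; transcript_id "Os06g07923.1";
'''.splitlines()
print(split_genes(gtf))
```

On silin's example this reports two groups for Os06g07923, because transcript .1 sits entirely inside an intron of transcript .2. Strand and chromosome are ignored here for brevity; a real check should compare them too.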


      • emanlee
        Member
        • Apr 2013
        • 15

        #4
        Another thread on this issue:


        A solution based on mgogol's code:
        CollapseFPKM files. Full list of files for CollapseFPKM. This code is a solution for collapsing duplicate FPKMs for a gene.
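For readers who just want the idea, here is a rough sketch of collapsing duplicate rows. This is not mgogol's code. It assumes the column order seen in the first post (gene ID, bundle ID, chromosome, left, right, FPKM, then confidence bounds and status), and it assumes that summing FPKMs across the duplicated rows is acceptable for your analysis; both are assumptions, so check them against your own output. The function name is hypothetical.

```python
def collapse_rows(lines):
    """Merge rows sharing a gene ID: span = min left..max right, FPKM = sum."""
    merged = {}
    for line in lines:
        fields = line.split()
        # Skip blank lines and a header row, if present
        # (assumed here to start with "gene_id").
        if not fields or fields[0] == "gene_id":
            continue
        gene, chrom = fields[0], fields[2]
        left, right = int(fields[3]), int(fields[4])
        fpkm = float(fields[5])  # assumed FPKM column
        if gene in merged:
            rec = merged[gene]
            rec["left"] = min(rec["left"], left)
            rec["right"] = max(rec["right"], right)
            rec["fpkm"] += fpkm
        else:
            merged[gene] = {"chr": chrom, "left": left,
                            "right": right, "fpkm": fpkm}
    return merged


# The two rows from the first post in this thread (both FPKMs are 0 there).
rows = [
    "Os06g07923 141826 chr06 3851139 3853473 0 0 0 OK",
    "Os06g07923 141826 chr06 3851391 3852964 0 0 0 OK",
]
for gene, rec in collapse_rows(rows).items():
    print(gene, rec["chr"], rec["left"], rec["right"], rec["fpkm"])
```

The confidence bounds are dropped on purpose, since they cannot simply be added together.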
