Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • cuffmerge combining genes from adjacent loci??

    Hi,

    After running CuffDiff, I see many of the significant genes in gene_exp.diff are annotated with multiple geneids. I have looked in the merged.gft file I created from the assemblies generated by cufflinks from all of my replicates, and the test_id/gene_id (XLOC_..) is assigned to several different genes that don't appear to have any overlapping exons but are adjacent to one another. I'm really struggling to understand why this is happening and what I can do to improve the analysis.

    Am I doing something wrong in the cuffmerge step? Why would the transcripts be getting merged?

    My workflow is as follows.
    1. Run tophat (v2.0.13) on samples
    $ tophat -G models.gff --library-type=fr-firststrand bowtie_index fastq_file.fq

    2. Run cufflinks (v2.2.1) for all samples and replicates
    $ cufflinks -g models.gff --library-type=fr-firststrand -o (sample)_(replicate) accepted_hits.bam

    3. Merge all assemblies
    $ cuffmerge -g models.gff -s genome_seq.fa -p 6 assemblies.txt
    * Obviously data in assemblies.txt points to all of the transcript.gtf files from cufflinks output.

    4. Run cuffdiff
    $ cuffdiff -o comaprison_id -L cond1,cond2 --library-type=fr-firststrand -b genome_seq.fa -u merged_assemblies.gtf s1_A.bam,s1_B.bam,S1_C.bam s2_A.bam,s2_B.bam,S2_C.bam

    As an example, if I take the output from cuffdiff, I see for the gene XLOC_012766 that several gene ids have been assigned to this gene (GB51730,GB51731,GB51732)
    test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 log2(fold_change) test_stat p_value q_value significant
    XLOC_012766 XLOC_012766 GB51730,GB51731,GB51732 chr8:12970857-12981659 tg_w4 tg_q4 OK 21.5873 56.0719 1.37709 2.65271 5e-05 0.00717169 yes

    And if I look at the entry for XLOC_012766 in the merged gtf file (sorry for the massive amount of output below), there don't appear to be any overlapping exons and yet these assembled transcripts are being grouped into the same gene region.

    chr8 Cufflinks exon 12970858 12970860 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046670"; exon_number "1"; gene_name "GB51730"; oId "GB51730-RA"; contained_in "TCONS_00046671"; nearest_ref "GB51730-RA"; class_code "="; tss_id "TSS23627"; p_id "P12060";
    chr8 Cufflinks exon 12972845 12972919 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046670"; exon_number "2"; gene_name "GB51730"; oId "GB51730-RA"; contained_in "TCONS_00046671"; nearest_ref "GB51730-RA"; class_code "="; tss_id "TSS23627"; p_id "P12060";
    chr8 Cufflinks exon 12972974 12973024 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046670"; exon_number "3"; gene_name "GB51730"; oId "GB51730-RA"; contained_in "TCONS_00046671"; nearest_ref "GB51730-RA"; class_code "="; tss_id "TSS23627"; p_id "P12060";
    chr8 Cufflinks exon 12970858 12970860 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046671"; exon_number "1"; gene_name "GB51732"; oId "CUFF.11857.4"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12972845 12972919 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046671"; exon_number "2"; gene_name "GB51732"; oId "CUFF.11857.4"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12972974 12973290 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046671"; exon_number "3"; gene_name "GB51732"; oId "CUFF.11857.4"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12973358 12974471 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046671"; exon_number "4"; gene_name "GB51732"; oId "CUFF.11857.4"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12974596 12974698 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046671"; exon_number "5"; gene_name "GB51732"; oId "CUFF.11857.4"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12975020 12975167 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046671"; exon_number "6"; gene_name "GB51732"; oId "CUFF.11857.4"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12976190 12976308 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046671"; exon_number "7"; gene_name "GB51732"; oId "CUFF.11857.4"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12978130 12978150 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046671"; exon_number "8"; gene_name "GB51732"; oId "CUFF.11857.4"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12970858 12970860 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046674"; exon_number "1"; gene_name "GB51732"; oId "CUFF.11857.5"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12972845 12972919 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046674"; exon_number "2"; gene_name "GB51732"; oId "CUFF.11857.5"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12972974 12973290 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046674"; exon_number "3"; gene_name "GB51732"; oId "CUFF.11857.5"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12973358 12974471 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046674"; exon_number "4"; gene_name "GB51732"; oId "CUFF.11857.5"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12974596 12974698 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046674"; exon_number "5"; gene_name "GB51732"; oId "CUFF.11857.5"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12975020 12975167 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046674"; exon_number "6"; gene_name "GB51732"; oId "CUFF.11857.5"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12976190 12976308 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046674"; exon_number "7"; gene_name "GB51732"; oId "CUFF.11857.5"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12978684 12979560 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046674"; exon_number "8"; gene_name "GB51732"; oId "CUFF.11857.5"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12970858 12970860 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046673"; exon_number "1"; gene_name "GB51732"; oId "CUFF.11857.3"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12972845 12972919 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046673"; exon_number "2"; gene_name "GB51732"; oId "CUFF.11857.3"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12972974 12973290 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046673"; exon_number "3"; gene_name "GB51732"; oId "CUFF.11857.3"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12973358 12974471 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046673"; exon_number "4"; gene_name "GB51732"; oId "CUFF.11857.3"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12974596 12974698 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046673"; exon_number "5"; gene_name "GB51732"; oId "CUFF.11857.3"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12976190 12976308 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046673"; exon_number "6"; gene_name "GB51732"; oId "CUFF.11857.3"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12978684 12979560 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046673"; exon_number "7"; gene_name "GB51732"; oId "CUFF.11857.3"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12970858 12970860 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046672"; exon_number "1"; gene_name "GB51732"; oId "CUFF.11857.2"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12972845 12972919 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046672"; exon_number "2"; gene_name "GB51732"; oId "CUFF.11857.2"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12972974 12973290 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046672"; exon_number "3"; gene_name "GB51732"; oId "CUFF.11857.2"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12973358 12974463 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046672"; exon_number "4"; gene_name "GB51732"; oId "CUFF.11857.2"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12974596 12974698 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046672"; exon_number "5"; gene_name "GB51732"; oId "CUFF.11857.2"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12975020 12975167 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046672"; exon_number "6"; gene_name "GB51732"; oId "CUFF.11857.2"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12976190 12976308 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046672"; exon_number "7"; gene_name "GB51732"; oId "CUFF.11857.2"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12978684 12979560 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046672"; exon_number "8"; gene_name "GB51732"; oId "CUFF.11857.2"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
    chr8 Cufflinks exon 12973248 12973290 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046675"; exon_number "1"; gene_name "GB51731"; oId "GB51731-RA"; contained_in "TCONS_00046671"; nearest_ref "GB51731-RA"; class_code "="; tss_id "TSS23628"; p_id "P12061";
    chr8 Cufflinks exon 12973358 12973449 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046675"; exon_number "2"; gene_name "GB51731"; oId "GB51731-RA"; contained_in "TCONS_00046671"; nearest_ref "GB51731-RA"; class_code "="; tss_id "TSS23628"; p_id "P12061";
    chr8 Cufflinks exon 12974356 12974471 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046676"; exon_number "1"; gene_name "GB51732"; oId "GB51732-RA"; contained_in "TCONS_00046671"; nearest_ref "GB51732-RA"; class_code "="; tss_id "TSS23629"; p_id "P12062";
    chr8 Cufflinks exon 12974596 12974698 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046676"; exon_number "2"; gene_name "GB51732"; oId "GB51732-RA"; contained_in "TCONS_00046671"; nearest_ref "GB51732-RA"; class_code "="; tss_id "TSS23629"; p_id "P12062";
    chr8 Cufflinks exon 12975020 12975167 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046676"; exon_number "3"; gene_name "GB51732"; oId "GB51732-RA"; contained_in "TCONS_00046671"; nearest_ref "GB51732-RA"; class_code "="; tss_id "TSS23629"; p_id "P12062";
    chr8 Cufflinks exon 12976190 12976308 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046676"; exon_number "4"; gene_name "GB51732"; oId "GB51732-RA"; contained_in "TCONS_00046671"; nearest_ref "GB51732-RA"; class_code "="; tss_id "TSS23629"; p_id "P12062";
    chr8 Cufflinks exon 12978130 12978150 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046676"; exon_number "5"; gene_name "GB51732"; oId "GB51732-RA"; contained_in "TCONS_00046671"; nearest_ref "GB51732-RA"; class_code "="; tss_id "TSS23629"; p_id "P12062";
    chr8 Cufflinks exon 12976053 12976115 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046677"; exon_number "1"; gene_name "GB51732"; oId "CUFF.11857.8"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23630";
    chr8 Cufflinks exon 12976190 12976308 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046677"; exon_number "2"; gene_name "GB51732"; oId "CUFF.11857.8"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23630";
    chr8 Cufflinks exon 12978684 12979560 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046677"; exon_number "3"; gene_name "GB51732"; oId "CUFF.11857.8"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23630";
    chr8 Cufflinks exon 12978234 12978452 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046678"; exon_number "1"; gene_name "GB51676"; oId "CUFF.11857.9"; nearest_ref "GB51676-RA"; class_code "x"; tss_id "TSS23631";
    chr8 Cufflinks exon 12978684 12979560 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046678"; exon_number "2"; gene_name "GB51676"; oId "CUFF.11857.9"; nearest_ref "GB51676-RA"; class_code "x"; tss_id "TSS23631";

    for completeness, here is the entry for the above genes in my model.gff file.

    $ grep GB51730 models.gff |grep exon
    chr8 amel_OGSv3.2 exon 12970858 12970860 . + . Parent=GB51730-RA
    chr8 amel_OGSv3.2 exon 12972845 12972919 . + . Parent=GB51730-RA
    chr8 amel_OGSv3.2 exon 12972974 12973024 . + . Parent=GB51730-RA
    $ grep GB51731 models.gff |grep exon
    chr8 amel_OGSv3.2 exon 12973248 12973290 . + . Parent=GB51731-RA
    chr8 amel_OGSv3.2 exon 12973358 12973449 . + . Parent=GB51731-RA
    $ grep GB51732 models.gff |grep exon
    chr8 amel_OGSv3.2 exon 12974356 12974471 . + . Parent=GB51732-RA
    chr8 amel_OGSv3.2 exon 12974596 12974698 . + . Parent=GB51732-RA
    chr8 amel_OGSv3.2 exon 12975020 12975167 . + . Parent=GB51732-RA
    chr8 amel_OGSv3.2 exon 12976190 12976308 . + . Parent=GB51732-RA
    chr8 amel_OGSv3.2 exon 12978130 12978150 . + . Parent=GB51732-RA
    grep GB51676 models.gff |grep exon
    chr8 amel_OGSv3.2 exon 12979494 12979760 . - . Parent=GB51676-RA
    chr8 amel_OGSv3.2 exon 12980861 12981025 . - . Parent=GB51676-RA
    chr8 amel_OGSv3.2 exon 12981489 12981659 . - . Parent=GB51676-RA


    Am I doing something stupid ?? Can anyone explain what is going on and how I might be able to overcome this? I'm really stumped and hoping someone with more experience with cufflinks can point out where I am going wrong or suggest some options for improving the output.

    Thanks in advance.
    Last edited by beeman; 11-26-2014, 06:00 PM. Reason: added entries from models.gff file

  • #2
    *bump*

    Can no-one provide some guidance on my problem

    Comment


    • #3
      I can't really give you any guidance other than pointing you to a post I spotted on biostars, as I am having the same problem of adjacent loci being lumped together under the same XLOC. Actually, I would take a look at the loci with the adjacent genes in the IGV browser as a first step. In my own case, I noticed that some of the reads for the locus in question spanned two annotated genes. It could be that this overlap represents a potential anti-sense scenario, but without the stranded information we can't be sure. So perhaps it makes "sense" that cufflinks merged these reads into one giant transcriptional unit, which actually contains two distinct gene loci. If you note, there is a dotted vertical line in the following image:



      To the right of the dotted line are the reads mapping to a known Arabidopsis gene. To the left of the dotted line are reads mapping to another known Arabidopsis gene. You'll notice that some of the reads bridge the two genes causing cufflinks to merge them into a giant transcriptional unit (blue bar at the bottom).

      I think what I will end up doing is discarding these merged loci from further analysis as the data may not be reliable.

      Quoting Devon Ryan from Biostars (original post here: https://www.biostars.org/p/104551/)
      The only real answer would be to look through the cufflinks source code, since this isn't documented anywhere. I would guess that these are merged into a single locus for processing because the annotation file you gave to cufflinks, likely combined with the modifications it made to the annotated transcripts given your alignments, produced possibly overlapping features (genes in this case) that might need to be processed as a single unit. If you used an unstranded library where WASH7P was expressed, then cufflinks might have just merged that, DDX11L1, and MIR1302-10 into a single transcript, in which case treating the whole region as a single locus would make more sense. I suspect that cufflinks pre-bins the genome according to possible cases like this and then processes them separately, often producing multiple final loci. That's a slightly educated guess, at least.

      Welcome to the wonderful world of completely undocumented features :P

      Comment

      Latest Articles

      Collapse

      • seqadmin
        Strategies for Sequencing Challenging Samples
        by seqadmin


        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
        03-22-2024, 06:39 AM
      • seqadmin
        Techniques and Challenges in Conservation Genomics
        by seqadmin



        The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

        Avian Conservation
        Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
        03-08-2024, 10:41 AM

      ad_right_rmr

      Collapse

      News

      Collapse

      Topics Statistics Last Post
      Started by seqadmin, Yesterday, 06:37 PM
      0 responses
      10 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, Yesterday, 06:07 PM
      0 responses
      9 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 03-22-2024, 10:03 AM
      0 responses
      51 views
      0 likes
      Last Post seqadmin  
      Started by seqadmin, 03-21-2024, 07:32 AM
      0 responses
      67 views
      0 likes
      Last Post seqadmin  
      Working...
      X