SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Finding novel genes using Cufflinks/Cuffmerge kwatts59 Bioinformatics 10 04-17-2014 10:07 PM
cuffmerge merges multiple loci in RABT secda1 Bioinformatics 3 12-03-2012 04:42 AM
Cufflinks merges adjacent genes proteomania Bioinformatics 1 11-20-2010 02:58 PM

Reply
 
Thread Tools
Old 11-26-2014, 04:55 PM   #1
beeman
Member
 
Location: Australia

Join Date: May 2012
Posts: 20
Question cuffmerge combining genes from adjacent loci??

Hi,

After running CuffDiff, I see many of the significant genes in gene_exp.diff are annotated with multiple geneids. I have looked in the merged.gft file I created from the assemblies generated by cufflinks from all of my replicates, and the test_id/gene_id (XLOC_..) is assigned to several different genes that don't appear to have any overlapping exons but are adjacent to one another. I'm really struggling to understand why this is happening and what I can do to improve the analysis.

Am I doing something wrong in the cuffmerge step? Why would the transcripts be getting merged?

My workflow is as follows.
1. Run tophat (v2.0.13) on samples
$ tophat -G models.gff --library-type=fr-firststrand bowtie_index fastq_file.fq

2. Run cufflinks (v2.2.1) for all samples and replicates
$ cufflinks -g models.gff --library-type=fr-firststrand -o (sample)_(replicate) accepted_hits.bam

3. Merge all assemblies
$ cuffmerge -g models.gff -s genome_seq.fa -p 6 assemblies.txt
* Obviously data in assemblies.txt points to all of the transcript.gtf files from cufflinks output.

4. Run cuffdiff
$ cuffdiff -o comaprison_id -L cond1,cond2 --library-type=fr-firststrand -b genome_seq.fa -u merged_assemblies.gtf s1_A.bam,s1_B.bam,S1_C.bam s2_A.bam,s2_B.bam,S2_C.bam

As an example, if I take the output from cuffdiff, I see for the gene XLOC_012766 that several gene ids have been assigned to this gene (GB51730,GB51731,GB51732)
test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 log2(fold_change) test_stat p_value q_value significant
XLOC_012766 XLOC_012766 GB51730,GB51731,GB51732 chr8:12970857-12981659 tg_w4 tg_q4 OK 21.5873 56.0719 1.37709 2.65271 5e-05 0.00717169 yes

And if I look at the entry for XLOC_012766 in the merged gtf file (sorry for the massive amount of output below), there don't appear to be any overlapping exons and yet these assembled transcripts are being grouped into the same gene region.

chr8 Cufflinks exon 12970858 12970860 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046670"; exon_number "1"; gene_name "GB51730"; oId "GB51730-RA"; contained_in "TCONS_00046671"; nearest_ref "GB51730-RA"; class_code "="; tss_id "TSS23627"; p_id "P12060";
chr8 Cufflinks exon 12972845 12972919 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046670"; exon_number "2"; gene_name "GB51730"; oId "GB51730-RA"; contained_in "TCONS_00046671"; nearest_ref "GB51730-RA"; class_code "="; tss_id "TSS23627"; p_id "P12060";
chr8 Cufflinks exon 12972974 12973024 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046670"; exon_number "3"; gene_name "GB51730"; oId "GB51730-RA"; contained_in "TCONS_00046671"; nearest_ref "GB51730-RA"; class_code "="; tss_id "TSS23627"; p_id "P12060";
chr8 Cufflinks exon 12970858 12970860 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046671"; exon_number "1"; gene_name "GB51732"; oId "CUFF.11857.4"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12972845 12972919 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046671"; exon_number "2"; gene_name "GB51732"; oId "CUFF.11857.4"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12972974 12973290 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046671"; exon_number "3"; gene_name "GB51732"; oId "CUFF.11857.4"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12973358 12974471 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046671"; exon_number "4"; gene_name "GB51732"; oId "CUFF.11857.4"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12974596 12974698 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046671"; exon_number "5"; gene_name "GB51732"; oId "CUFF.11857.4"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12975020 12975167 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046671"; exon_number "6"; gene_name "GB51732"; oId "CUFF.11857.4"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12976190 12976308 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046671"; exon_number "7"; gene_name "GB51732"; oId "CUFF.11857.4"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12978130 12978150 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046671"; exon_number "8"; gene_name "GB51732"; oId "CUFF.11857.4"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12970858 12970860 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046674"; exon_number "1"; gene_name "GB51732"; oId "CUFF.11857.5"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12972845 12972919 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046674"; exon_number "2"; gene_name "GB51732"; oId "CUFF.11857.5"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12972974 12973290 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046674"; exon_number "3"; gene_name "GB51732"; oId "CUFF.11857.5"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12973358 12974471 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046674"; exon_number "4"; gene_name "GB51732"; oId "CUFF.11857.5"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12974596 12974698 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046674"; exon_number "5"; gene_name "GB51732"; oId "CUFF.11857.5"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12975020 12975167 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046674"; exon_number "6"; gene_name "GB51732"; oId "CUFF.11857.5"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12976190 12976308 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046674"; exon_number "7"; gene_name "GB51732"; oId "CUFF.11857.5"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12978684 12979560 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046674"; exon_number "8"; gene_name "GB51732"; oId "CUFF.11857.5"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12970858 12970860 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046673"; exon_number "1"; gene_name "GB51732"; oId "CUFF.11857.3"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12972845 12972919 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046673"; exon_number "2"; gene_name "GB51732"; oId "CUFF.11857.3"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12972974 12973290 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046673"; exon_number "3"; gene_name "GB51732"; oId "CUFF.11857.3"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12973358 12974471 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046673"; exon_number "4"; gene_name "GB51732"; oId "CUFF.11857.3"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12974596 12974698 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046673"; exon_number "5"; gene_name "GB51732"; oId "CUFF.11857.3"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12976190 12976308 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046673"; exon_number "6"; gene_name "GB51732"; oId "CUFF.11857.3"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12978684 12979560 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046673"; exon_number "7"; gene_name "GB51732"; oId "CUFF.11857.3"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12970858 12970860 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046672"; exon_number "1"; gene_name "GB51732"; oId "CUFF.11857.2"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12972845 12972919 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046672"; exon_number "2"; gene_name "GB51732"; oId "CUFF.11857.2"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12972974 12973290 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046672"; exon_number "3"; gene_name "GB51732"; oId "CUFF.11857.2"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12973358 12974463 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046672"; exon_number "4"; gene_name "GB51732"; oId "CUFF.11857.2"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12974596 12974698 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046672"; exon_number "5"; gene_name "GB51732"; oId "CUFF.11857.2"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12975020 12975167 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046672"; exon_number "6"; gene_name "GB51732"; oId "CUFF.11857.2"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12976190 12976308 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046672"; exon_number "7"; gene_name "GB51732"; oId "CUFF.11857.2"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12978684 12979560 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046672"; exon_number "8"; gene_name "GB51732"; oId "CUFF.11857.2"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23627";
chr8 Cufflinks exon 12973248 12973290 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046675"; exon_number "1"; gene_name "GB51731"; oId "GB51731-RA"; contained_in "TCONS_00046671"; nearest_ref "GB51731-RA"; class_code "="; tss_id "TSS23628"; p_id "P12061";
chr8 Cufflinks exon 12973358 12973449 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046675"; exon_number "2"; gene_name "GB51731"; oId "GB51731-RA"; contained_in "TCONS_00046671"; nearest_ref "GB51731-RA"; class_code "="; tss_id "TSS23628"; p_id "P12061";
chr8 Cufflinks exon 12974356 12974471 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046676"; exon_number "1"; gene_name "GB51732"; oId "GB51732-RA"; contained_in "TCONS_00046671"; nearest_ref "GB51732-RA"; class_code "="; tss_id "TSS23629"; p_id "P12062";
chr8 Cufflinks exon 12974596 12974698 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046676"; exon_number "2"; gene_name "GB51732"; oId "GB51732-RA"; contained_in "TCONS_00046671"; nearest_ref "GB51732-RA"; class_code "="; tss_id "TSS23629"; p_id "P12062";
chr8 Cufflinks exon 12975020 12975167 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046676"; exon_number "3"; gene_name "GB51732"; oId "GB51732-RA"; contained_in "TCONS_00046671"; nearest_ref "GB51732-RA"; class_code "="; tss_id "TSS23629"; p_id "P12062";
chr8 Cufflinks exon 12976190 12976308 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046676"; exon_number "4"; gene_name "GB51732"; oId "GB51732-RA"; contained_in "TCONS_00046671"; nearest_ref "GB51732-RA"; class_code "="; tss_id "TSS23629"; p_id "P12062";
chr8 Cufflinks exon 12978130 12978150 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046676"; exon_number "5"; gene_name "GB51732"; oId "GB51732-RA"; contained_in "TCONS_00046671"; nearest_ref "GB51732-RA"; class_code "="; tss_id "TSS23629"; p_id "P12062";
chr8 Cufflinks exon 12976053 12976115 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046677"; exon_number "1"; gene_name "GB51732"; oId "CUFF.11857.8"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23630";
chr8 Cufflinks exon 12976190 12976308 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046677"; exon_number "2"; gene_name "GB51732"; oId "CUFF.11857.8"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23630";
chr8 Cufflinks exon 12978684 12979560 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046677"; exon_number "3"; gene_name "GB51732"; oId "CUFF.11857.8"; nearest_ref "GB51732-RA"; class_code "j"; tss_id "TSS23630";
chr8 Cufflinks exon 12978234 12978452 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046678"; exon_number "1"; gene_name "GB51676"; oId "CUFF.11857.9"; nearest_ref "GB51676-RA"; class_code "x"; tss_id "TSS23631";
chr8 Cufflinks exon 12978684 12979560 . + . gene_id "XLOC_012766"; transcript_id "TCONS_00046678"; exon_number "2"; gene_name "GB51676"; oId "CUFF.11857.9"; nearest_ref "GB51676-RA"; class_code "x"; tss_id "TSS23631";

for completeness, here is the entry for the above genes in my model.gff file.

$ grep GB51730 models.gff |grep exon
chr8 amel_OGSv3.2 exon 12970858 12970860 . + . Parent=GB51730-RA
chr8 amel_OGSv3.2 exon 12972845 12972919 . + . Parent=GB51730-RA
chr8 amel_OGSv3.2 exon 12972974 12973024 . + . Parent=GB51730-RA
$ grep GB51731 models.gff |grep exon
chr8 amel_OGSv3.2 exon 12973248 12973290 . + . Parent=GB51731-RA
chr8 amel_OGSv3.2 exon 12973358 12973449 . + . Parent=GB51731-RA
$ grep GB51732 models.gff |grep exon
chr8 amel_OGSv3.2 exon 12974356 12974471 . + . Parent=GB51732-RA
chr8 amel_OGSv3.2 exon 12974596 12974698 . + . Parent=GB51732-RA
chr8 amel_OGSv3.2 exon 12975020 12975167 . + . Parent=GB51732-RA
chr8 amel_OGSv3.2 exon 12976190 12976308 . + . Parent=GB51732-RA
chr8 amel_OGSv3.2 exon 12978130 12978150 . + . Parent=GB51732-RA
grep GB51676 models.gff |grep exon
chr8 amel_OGSv3.2 exon 12979494 12979760 . - . Parent=GB51676-RA
chr8 amel_OGSv3.2 exon 12980861 12981025 . - . Parent=GB51676-RA
chr8 amel_OGSv3.2 exon 12981489 12981659 . - . Parent=GB51676-RA


Am I doing something stupid ?? Can anyone explain what is going on and how I might be able to overcome this? I'm really stumped and hoping someone with more experience with cufflinks can point out where I am going wrong or suggest some options for improving the output.

Thanks in advance.

Last edited by beeman; 11-26-2014 at 05:00 PM. Reason: added entries from models.gff file
beeman is offline   Reply With Quote
Old 11-30-2014, 03:35 PM   #2
beeman
Member
 
Location: Australia

Join Date: May 2012
Posts: 20
Default

*bump*

Can no-one provide some guidance on my problem
beeman is offline   Reply With Quote
Old 12-10-2014, 06:13 AM   #3
dalesan
Member
 
Location: portugal

Join Date: Feb 2011
Posts: 15
Default

I can't really give you any guidance other than pointing you to a post I spotted on biostars, as I am having the same problem of adjacent loci being lumped together under the same XLOC. Actually, I would take a look at the loci with the adjacent genes in the IGV browser as a first step. In my own case, I noticed that some of the reads for the locus in question spanned two annotated genes. It could be that this overlap represents a potential anti-sense scenario, but without the stranded information we can't be sure. So perhaps it makes "sense" that cufflinks merged these reads into one giant transcriptional unit, which actually contains two distinct gene loci. If you note, there is a dotted vertical line in the following image:



To the right of the dotted line are the reads mapping to a known Arabidopsis gene. To the left of the dotted line are reads mapping to another known Arabidopsis gene. You'll notice that some of the reads bridge the two genes causing cufflinks to merge them into a giant transcriptional unit (blue bar at the bottom).

I think what I will end up doing is discarding these merged loci from further analysis as the data may not be reliable.

Quoting Devon Ryan from Biostars (original post here: https://www.biostars.org/p/104551/)
Quote:
The only real answer would be to look through the cufflinks source code, since this isn't documented anywhere. I would guess that these are merged into a single locus for processing because the annotation file you gave to cufflinks, likely combined with the modifications it made to the annotated transcripts given your alignments, produced possibly overlapping features (genes in this case) that might need to be processed as a single unit. If you used an unstranded library where WASH7P was expressed, then cufflinks might have just merged that, DDX11L1, and MIR1302-10 into a single transcript, in which case treating the whole region as a single locus would make more sense. I suspect that cufflinks pre-bins the genome according to possible cases like this and then processes them separately, often producing multiple final loci. That's a slightly educated guess, at least.

Welcome to the wonderful world of completely undocumented features :P
dalesan is offline   Reply With Quote
Reply

Tags
cuffdiff, cufflinks, cuffmerge, rnaseq

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 03:47 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO