11-07-2012, 03:57 AM   #1
Junior Member
Location: Maryland, USA

Join Date: Jul 2012
Posts: 2
GTF modification for cufflinks

Hey, does anyone have any pointers, advice, or experience on modifying GTF files for use with cufflinks (v2.0.2)?

While analyzing RNA-seq data with the "tuxedo" pipeline (tophat -> cufflinks), I've run into an issue: tophat maps reads to apparently non-coding regions (possibly regulatory), but cufflinks won't report FPKM values for those pileups! So the strategy we are trying is to either modify or create a GTF annotation and tell tophat/cufflinks *not* to look for novel transcripts while using that GTF, in the hope that cufflinks will then report FPKM values for these regions.

One strategy we tried was to create a GTF with features corresponding to the regions of interest. We created them as "pseudogene" "exon" records (columns 2 and 3), using existing Ensembl gene IDs but custom transcript_ids, and fed the GTF to cufflinks. When cufflinks reached the "Loading Annotation" step (at the beginning of the run), it crashed with a segmentation fault! In the attribute column (#9), we provided nothing besides gene_id and transcript_id. cufflinks may have crashed because no gene_name was given, but we really don't know!
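For anyone trying the same thing, here is a minimal sketch of writing one well-formed record for a custom region. It is not a known fix for the segfault (the cause is still unknown); it only shows the nine tab-separated columns and the quoted, semicolon-terminated attributes, so that formatting problems can be ruled out. The function name `gtf_line`, the `custom` source label, and the region and IDs in the example are all made up for illustration.

```python
def gtf_line(chrom, start, end, strand, gene_id, transcript_id,
             source="custom", feature="exon", gene_name=None):
    """Return one 9-column, tab-separated GTF record (1-based, inclusive coordinates)."""
    if start < 1 or end < start:
        raise ValueError("GTF coordinates must satisfy 1 <= start <= end")
    if strand not in ("+", "-", "."):
        raise ValueError("strand must be '+', '-' or '.'")
    # gene_id and transcript_id come first; gene_name is optional here.
    attrs = [("gene_id", gene_id), ("transcript_id", transcript_id)]
    if gene_name is not None:
        attrs.append(("gene_name", gene_name))
    attr_text = " ".join('{} "{}";'.format(key, value) for key, value in attrs)
    # Columns: seqname, source, feature, start, end, score, strand, frame, attributes
    fields = [chrom, source, feature, str(start), str(end), ".", strand, ".", attr_text]
    return "\t".join(fields)


# Hypothetical region of interest; all names and coordinates are illustrative.
line = gtf_line("chr1", 10000, 10500, "+", "ROI_gene_1", "ROI_tx_1", gene_name="ROI_1")
print(line)
```

Two things worth checking in a hand-made file are that the separators are real tabs rather than spaces, and that the chromosome names match those in the alignment file.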

Another strategy we are currently trying is to *modify* an existing GTF (from Illumina iGenomes, Ensembl annotation) that *does* work with cufflinks. This time, to capture regions upstream and downstream of genes, for each gene_id we take the lowest start value over all of its annotations and decrease it by 1000 (to *hopefully* capture expression upstream). Similarly, we take the highest end value and increase it by 1000 to *hopefully* capture expression downstream. This run is in progress right now, so I don't know yet whether it will finish successfully and give us the FPKM values we are looking for.
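The extension step described above can be sketched roughly as follows. This is a sketch under assumptions, not the exact script we ran: `extend_genes` is a made-up name, it expects standard 9-column GTF lines with a `gene_id "..."` attribute, it clamps starts at 1, and it does not know chromosome lengths, so ends can run past the end of a chromosome.

```python
import re
from collections import defaultdict

GENE_ID = re.compile(r'gene_id "([^"]+)"')


def extend_genes(lines, flank=1000):
    """Lower each gene's smallest start and raise its largest end by `flank` bases."""
    records = []
    by_gene = defaultdict(list)  # gene_id -> indices of its records
    for line in lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        is_record = len(fields) == 9 and not line.startswith("#")
        match = GENE_ID.search(fields[8]) if is_record else None
        if match:
            by_gene[match.group(1)].append(len(records))
            fields[3], fields[4] = int(fields[3]), int(fields[4])
        records.append(fields)  # comments and odd lines pass through unchanged
    for indices in by_gene.values():
        lowest = min(records[i][3] for i in indices)
        highest = max(records[i][4] for i in indices)
        for i in indices:
            if records[i][3] == lowest:
                records[i][3] = max(1, lowest - flank)  # GTF is 1-based
            if records[i][4] == highest:
                records[i][4] = highest + flank
    return ["\t".join(str(field) for field in record) for record in records]


example = [
    'chr1\tsrc\texon\t5000\t5200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tsrc\texon\t6000\t6300\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
]
for extended in extend_genes(example):
    print(extended)
```

Note that every record sharing the extreme coordinate is moved, including CDS or codon records if the file has them, which may or may not be what you want.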

Any pointers, advice, experience, knowledge, insight, etc. with GTF file tweaking for cufflinks would be appreciated!

We are using tophat v2.0.4 and cufflinks v2.0.2 by the way.


Posted by eddiesalinas

11-07-2012, 07:25 AM   #2
Senior Member
Location: Boston

Join Date: Nov 2009
Posts: 224

Be very careful with all of this.

There can be a lot of reasons for seeing reads outside of annotated genes that have nothing to do with real biology. They might be artifacts of your library prep. Even if they were real, you have no guarantee that your coverage is deep enough to accurately determine an FPKM value. If you don't have enough coverage to determine the length of the transcribed region, then the K part of the FPKM could lead to biased expression values.
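To make the "K part" concrete, here is a small sketch using the textbook FPKM definition (fragments per kilobase of transcript per million mapped fragments). It is only the plain formula, not the statistical estimator Cufflinks actually runs, and the counts and lengths are invented for illustration.

```python
def fpkm(fragments, length_bp, total_mapped_fragments):
    """Textbook FPKM: fragments per kilobase of feature per million mapped fragments."""
    if length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


# Same 200 fragments, same 20 million fragment library, two guesses at the length.
library_size = 20_000_000
if_500_bp = fpkm(200, 500, library_size)
if_1500_bp = fpkm(200, 1500, library_size)
print(if_500_bp, if_1500_bp, if_500_bp / if_1500_bp)
```

If the true transcribed region is 500 bp but it is annotated as 1500 bp, the reported value is three times too low, which is the kind of length-driven bias I mean.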

If you just extend GTF regions by an arbitrary number not informed by the biology of your system, you will create a lot of problems. You will be extending every gene by the same number of bases, but that number will not be the same relative to the actual length of each gene. This will underestimate the expression of short genes more than that of long genes. Plus, what are you doing to ensure that your extensions don't create unwanted overlaps with other annotated genes?
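Both concerns can be checked with a few lines. This is a rough sketch with made-up gene coordinates: `dilution_factor` treats gene length as a simple span and assumes the flanks add no reads, and `find_overlaps` ignores strand, so adapt as needed.

```python
def dilution_factor(length_bp, flank=1000):
    """Ratio of extended to original length when `flank` bases are added on each side."""
    return (length_bp + 2 * flank) / length_bp


def find_overlaps(genes):
    """genes: list of (gene_id, chrom, start, end); return overlapping pairs per chromosome."""
    overlaps = []
    ordered = sorted(genes, key=lambda gene: (gene[1], gene[2]))
    for i, (id_a, chrom_a, start_a, end_a) in enumerate(ordered):
        for id_b, chrom_b, start_b, end_b in ordered[i + 1:]:
            # Sorted by start, so nothing later on this chromosome can overlap either.
            if chrom_b != chrom_a or start_b > end_a:
                break
            overlaps.append((id_a, id_b))
    return overlaps


# A 500 bp gene is diluted far more than a 50 kb gene by the same flank.
print(dilution_factor(500), dilution_factor(50_000))

# Hypothetical genes, 500 bp apart before extension.
genes = [("A", "chr1", 1000, 2000), ("B", "chr1", 2500, 3500), ("C", "chr2", 1000, 2000)]
extended = [(gid, chrom, max(1, start - 1000), end + 1000) for gid, chrom, start, end in genes]
print(find_overlaps(genes), find_overlaps(extended))
```

With a 1000 bp flank on each side, the 500 bp gene's length grows fivefold while the 50 kb gene's grows by 4%, and the two chr1 genes that were separate now overlap.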

Make sure you have a good reason for looking at regions outside of annotated coding regions before you start modifying those annotations.
Posted by pbluescript

11-07-2012, 07:57 AM   #3
Junior Member
Location: Maryland, USA

Join Date: Jul 2012
Posts: 2

Hi "pbluescript",

Good points to be aware of. I realize now that I should regard any output with some skepticism.

The goal at hand is to get FPKM values for the loci upstream and downstream of genes; that's why we "extended" the genes (decreased the lowest start value, increased the highest end value).


Last edited by eddiesalinas; 11-07-2012 at 08:08 AM.
Posted by eddiesalinas

07-07-2019, 05:25 AM   #4
Location: Bhopal

Join Date: Jul 2019
Posts: 19

While examining RNA-seq data and running RNA-seq analysis, an issue I keep running into (using the "tuxedo" pipeline of tophat->cufflinks) is that tophat maps reads to apparent non-coding regions (possibly regulatory), yet cufflinks won't report FPKM expression values for the pileups! So the strategy we are trying, whose goal is to get cufflinks to report FPKM expression values, is to either modify or create GTF annotation data and tell tophat/cufflinks *not* to look for novel transcripts while using the created/modified GTF, so that cufflinks might give FPKM values!
Posted by brojee

annotation, cufflinks, gtf, modification, rna-seq

All times are GMT -8.
