SEQanswers

07-01-2013, 03:22 AM   #1
rubbertjes
Member
 
Location: Netherlands

Join Date: Jun 2013
Posts: 13
GTF input tophat and cufflinks

Hi everyone,

I have two questions about the GTF file that you can use as a reference in both TopHat and cuffdiff. A general GTF file that can be downloaded from for instance UCSC will look something like:

chr12 refGene exon 12262139 12262238 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "1"; exon_id "NM_001146119.1"; gene_name "Fam49a$
chr12 refGene exon 12304181 12304322 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "2"; exon_id "NM_001146119.2"; gene_name "Fam49a$
chr12 refGene exon 12340679 12340758 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "3"; exon_id "NM_001146119.3"; gene_name "Fam49a$
chr12 refGene CDS 12340689 12340758 . + 0 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "3"; exon_id "NM_001146119.3"; gene_name "Fam49a$
chr12 refGene exon 12358045 12358166 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "4"; exon_id "NM_001146119.4"; gene_name "Fam49a$
chr12 refGene CDS 12358045 12358166 . + 2 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "4"; exon_id "NM_001146119.4"; gene_name "Fam49a$
chr12 refGene exon 12359213 12359318 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "5"; exon_id "NM_001146119.5"; gene_name "Fam49a$
chr12 refGene CDS 12359213 12359318 . + 0 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "5"; exon_id "NM_001146119.5"; gene_name "Fam49a$
chr12 refGene exon 12361435 12361571 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "6"; exon_id "NM_001146119.6"; gene_name "Fam49a$
chr12 refGene CDS 12361435 12361571 . + 2 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "6"; exon_id "NM_001146119.6"; gene_name "Fam49a$
chr12 refGene exon 12362015 12362092 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "7"; exon_id "NM_001146119.7"; gene_name "Fam49a$
chr12 refGene CDS 12362015 12362092 . + 0 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "7"; exon_id "NM_001146119.7"; gene_name "Fam49a$
chr12 refGene exon 12362252 12362368 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "8"; exon_id "NM_001146119.8"; gene_name "Fam49a$
chr12 refGene CDS 12362252 12362368 . + 0 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "8"; exon_id "NM_001146119.8"; gene_name "Fam49a$
chr12 refGene exon 12362461 12362540 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "9"; exon_id "NM_001146119.9"; gene_name "Fam49a$
chr12 refGene CDS 12362461 12362540 . + 0 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "9"; exon_id "NM_001146119.9"; gene_name "Fam49a$
chr12 refGene exon 12364720 12364846 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "10"; exon_id "NM_001146119.10"; gene_name "Fam4$
chr12 refGene CDS 12364720 12364846 . + 1 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "10"; exon_id "NM_001146119.10"; gene_name "Fam4$
chr12 refGene exon 12369894 12369964 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "11"; exon_id "NM_001146119.11"; gene_name "Fam4$
chr12 refGene CDS 12369894 12369964 . + 0 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "11"; exon_id "NM_001146119.11"; gene_name "Fam4$
chr12 refGene exon 12372747 12376361 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "12"; exon_id "NM_001146119.12"; gene_name "Fam4$
chr12 refGene CDS 12372747 12372807 . + 1 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "12"; exon_id "NM_001146119.12"; gene_name "Fam4$
chr12 refGene start_codon 12340689 12340691 . + 0 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "1"; exon_id "NM_001146119.1"; gene_name$
chr12 refGene stop_codon 12372808 12372810 . + 0 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "1"; exon_id "NM_001146119.1"; gene_name$
chr7 refGene exon 24902986 24903128 . + . gene_id "Arhgef1"; transcript_id "NM_008488"; exon_number "1"; exon_id "NM_008488.1"; gene_name "Arhgef1";
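For readers who want to inspect records like these programmatically, here is a minimal Python sketch of splitting one GTF record into its nine columns and its attribute dictionary. It assumes the file on disk is tab-delimited (the listing above shows spaces and is cut off at the `$` marks), and it splits attributes naively on `;`, which would break on values that themselves contain semicolons. The function name `parse_gtf_line` is made up for this illustration.

```python
def parse_gtf_line(line):
    """Split one tab-delimited GTF record into its nine columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(cols))
    seqname, source, feature, start, end, score, strand, frame, attr_text = cols

    # Column 9 is a list of `key "value";` pairs.
    attributes = {}
    for chunk in attr_text.split(";"):
        chunk = chunk.strip()
        if not chunk:
            continue
        key, _, value = chunk.partition(" ")
        attributes[key] = value.strip().strip('"')

    return {
        "seqname": seqname, "source": source, "feature": feature,
        "start": int(start), "end": int(end), "score": score,
        "strand": strand, "frame": frame, "attributes": attributes,
    }


# The last (untruncated) record from the listing, rebuilt with tabs.
example = "\t".join([
    "chr7", "refGene", "exon", "24902986", "24903128", ".", "+", ".",
    'gene_id "Arhgef1"; transcript_id "NM_008488"; exon_number "1"; '
    'exon_id "NM_008488.1"; gene_name "Arhgef1";',
])
record = parse_gtf_line(example)
print(record["feature"], record["attributes"]["transcript_id"])
```

The feature type (exon, CDS, start_codon, ...) is simply the third column, so any tool reading the file can tell the record types apart.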

Question 1:
The GTF file includes exons, coding sequences (CDS), miRNAs, and start and stop codons. The coding sequence and the exon are often identical or nearly so, which is why I kept only the exon lines in my reference file. Is that the wisest thing to do? What will TopHat and cufflinks do if I keep the additional information? Will they be able to use the feature annotations (exon, CDS, etc.), or will they simply map reads to each individual line in the reference file without distinguishing between the different feature types (exon versus CDS versus microRNA)? If the latter is the case, does that mean the number of reads for exons will be halved, because half of them are now mapped to the CDS?
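As an illustration of the filtering described above, here is a small Python sketch that tallies the feature types in a GTF and keeps only selected ones. It assumes a tab-delimited file; the function names and the toy records are hypothetical. It shows only how the feature column can be read and filtered, and says nothing about how TopHat or cufflinks treat each feature type internally.

```python
from collections import Counter


def feature_counts(lines):
    """Count how often each feature type (column 3) occurs."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 3:
            counts[cols[2]] += 1
    return counts


def keep_features(lines, wanted):
    """Yield only the records whose feature type is in `wanted`."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.split("\t")
        if len(cols) >= 3 and cols[2] in wanted:
            yield line


# Toy records (hypothetical coordinates), tab-delimited like a real GTF.
toy = [
    "\t".join(["chr1", "toy", feature, "100", "200", ".", "+", ".",
               'gene_id "g1"; transcript_id "t1";'])
    for feature in ("exon", "CDS", "exon", "start_codon")
]

print(feature_counts(toy))
exon_only = list(keep_features(toy, {"exon"}))
print(len(exon_only), "exon records kept")
```

Running the tally on the full annotation before and after filtering is a quick way to see exactly what was removed.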

Question 2:
How can cufflinks perform CDS-level transcription difference tests, splicing tests, promoter preference tests and relative CDS output tests? Where do I provide the input that tells it where these features are?
My output when using only exons looks like this:
Performed 12350 isoform-level transcription difference tests
Performed 0 tss-level transcription difference tests
Performed 10502 gene-level transcription difference tests
Performed 0 CDS-level transcription difference tests
Performed 0 splicing tests
Performed 0 promoter preference tests
Performing 0 relative CDS output tests
07-01-2013, 03:28 AM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,479

(1) Don't edit your GTF file; keep those lines in. TopHat et al. will still work fine (in fact, TopHat should produce the same results).

(2) If you leave the CDS and other features in and follow the normal cufflinks workflow, you'll generate a merged GTF file with p_id and tss_id attributes that cufflinks can use.
07-01-2013, 03:45 AM   #3
rubbertjes
Member
 
Location: Netherlands

Join Date: Jun 2013
Posts: 13

Quote:
Originally Posted by dpryan
(1) Don't edit your GTF file; keep those lines in. TopHat et al. will still work fine (in fact, TopHat should produce the same results).
So TopHat and cuffdiff do actually use the feature type given on each line (exon or coding sequence)?

Quote:
Originally Posted by dpryan
(2) If you leave the CDS and other features in and follow the normal cufflinks workflow, you'll generate a merged GTF file with p_id and tss_id attributes that cufflinks can use.
I'm using the workflow without gene discovery. Does that change your answer at all?
07-01-2013, 04:03 AM   #4
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,479

The GTF file with the tss_id and p_id fields can either be generated with cuffmerge (on the transcripts.gtf files from your samples) or cuffcompare (on the original GTF file). In the rare instances where I've used cufflinks, I've always used the cuffmerge route. I'm not familiar enough with the inner workings of cufflinks to state whether not doing novel gene detection really changes the transcripts.gtf file (again, I've never tried that route), so I can't offer any insight there.
07-01-2013, 12:42 PM   #5
geneart
Member
 
Location: DC area

Join Date: Sep 2011
Posts: 42

Hi all,
I have a few questions about using cufflinks for RNA-seq analysis.
First: I understand that cufflinks is designed for RNA-seq in general and not specifically for miRNA-seq.
I am working with miRNA data from an Illumina GA, with reads about 36 bases long.
My problem is that the transcripts.gtf generated as output (while running cuffcompare on two samples) gives me a locus in which four distinct clusters have been merged into one, CUFF.48.1. However, when I visualize this region in IGV, I see four clusters of reads with varied depths, separated by gaps about as long as a read (~36 bases).
Why are these merged into one cluster?
Is cufflinks treating my sequences as mRNA sequences?
What criteria or parameters does cuffcompare use to merge two clusters into one CUFF ID? For example, what is the minimum or maximum distance at which they are identified as one cluster?
Also, in the cuffcompare output file I have "-" for the first sequence file and a CUFF value for the second file I am comparing. However, in IGV I see the same four clusters in both files. Why is this happening?

Any tips to help me understand this would be greatly appreciated. I looked in the cufflinks manual but could not find much information.

Are there any differential analysis tools specifically for miRNA-seq?

Thanks,
Geneart.
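
The distance question can be made concrete with a toy Python sketch of gap-based interval merging: clusters whose gap is at or below a threshold collapse into one locus. This is only an illustration of the general idea and is not taken from the cufflinks or cuffcompare source; the coordinates and both `max_gap` values are hypothetical, and nothing in this thread confirms which threshold, if any, the tools apply.

```python
def merge_by_gap(intervals, max_gap):
    """Merge (start, end) intervals whose gap is <= max_gap bases."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


# Four hypothetical 36-base clusters separated by 36-base gaps.
clusters = [(100, 136), (172, 208), (244, 280), (316, 352)]

print(merge_by_gap(clusters, max_gap=50))  # gaps fall under the threshold
print(merge_by_gap(clusters, max_gap=10))  # gaps exceed the threshold
```

With the larger threshold the four clusters become a single interval, and with the smaller one they stay separate, which is the kind of behaviour the question describes.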
07-02-2013, 06:35 AM   #6
rubbertjes
Member
 
Location: Netherlands

Join Date: Jun 2013
Posts: 13

@geneart: you might want to create a new topic for this question.
07-02-2013, 07:42 AM   #7
malcook
Member
 
Location: 66206

Join Date: Sep 2009
Posts: 23
Why tss_id and p_id? How to add them?

The use of tss_id and p_id by cuffdiff is explained as:
tss_group_exp.diff: Primary transcript differential FPKM. Tests differences in the summed FPKM of transcripts sharing each tss_id.
cds_exp.diff: Coding sequence differential FPKM. Tests differences in the summed FPKM of transcripts sharing each p_id, independent of tss_id.
You can get a GTF that already contains p_id and tss_id from iGenomes for some organisms.
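
To make "summed FPKM of transcripts sharing each tss_id" concrete, here is a minimal Python sketch of the grouping step only. The transcript records, IDs and FPKM values are invented for illustration, and the sketch does not attempt the statistical test that cuffdiff runs on the summed values.

```python
from collections import defaultdict


def sum_fpkm_by(transcripts, key):
    """Sum FPKM over transcripts that share the same value of `key`."""
    totals = defaultdict(float)
    for t in transcripts:
        group = t.get(key)
        if group is None:
            continue  # a transcript without the attribute joins no group
        totals[group] += t["fpkm"]
    return dict(totals)


# Hypothetical transcripts; tx4 carries neither tss_id nor p_id.
transcripts = [
    {"transcript_id": "tx1", "tss_id": "TSS1", "p_id": "P1", "fpkm": 10.0},
    {"transcript_id": "tx2", "tss_id": "TSS1", "p_id": "P2", "fpkm": 5.0},
    {"transcript_id": "tx3", "tss_id": "TSS2", "p_id": "P1", "fpkm": 2.5},
    {"transcript_id": "tx4", "fpkm": 7.0},
]

by_tss = sum_fpkm_by(transcripts, "tss_id")
by_pid = sum_fpkm_by(transcripts, "p_id")
print(by_tss)
print(by_pid)
```

If no record carries the attribute, there are no groups at all, which would be consistent with the "Performed 0 tss-level ... tests" lines in the first post.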

If you want or need to use a GTF from another source, the cuffdiff documentation offers this advice:
Note: If an arbitrary GTF/GFF3 file is used as input (instead of the .combined.gtf file produced by Cuffcompare), these attributes will not be present, but Cuffcompare can still be used to obtain these attributes with a command like this:

cuffcompare -s /path/to/genome_seqs.fa -CG -r annotation.gtf annotation.gtf

The resulting cuffcmp.combined.gtf file created by this command will have the tss_id and p_id attributes added to each record and this file can be used as input for cuffdiff.
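
Before handing the file to cuffdiff, it may be worth checking how many records actually carry the two attributes. The following Python sketch does that under the assumption of a tab-delimited GTF with `key "value";` attributes; the function name and the toy records are hypothetical.

```python
import re

# Matches `key "value"` pairs in the ninth GTF column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def attribute_coverage(lines, names=("tss_id", "p_id")):
    """Return (number of records, {attribute: records carrying it})."""
    total = 0
    hits = {name: 0 for name in names}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed GTF record
        total += 1
        present = {key for key, _ in ATTR_RE.findall(cols[8])}
        for name in names:
            if name in present:
                hits[name] += 1
    return total, hits


# Toy records; on real data, pass an open file handle instead, e.g.
#   with open("cuffcmp.combined.gtf") as fh: attribute_coverage(fh)
prefix = "\t".join(["chr1", "toy", "exon", "100", "200", ".", "+", "."])
toy = [
    prefix + '\tgene_id "g1"; transcript_id "t1"; tss_id "TSS1"; p_id "P1";',
    prefix + '\tgene_id "g2"; transcript_id "t2"; tss_id "TSS2";',
    prefix + '\tgene_id "g3"; transcript_id "t3";',
]
total, hits = attribute_coverage(toy)
print(total, hits)
```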

However, as the comment in my alternate approach to the problem states:
## NOTE: cuffdiff's documented way of adding these attributes is to
## create a .combined.gtf file using `cuffcompare`, but this method
## unfortunately (unnecessarily!) resets the gene_id and
## transcript_id to newly generated unique values.
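
The script referred to above is not reproduced in this thread, so the following is not that code. It is a rough Python sketch of one way to keep the original IDs: read tss_id and p_id from the combined GTF and append them to the matching records of the original annotation. It assumes that the combined file stores the original transcript ID in an `oId` attribute; check your own cuffcompare output before relying on that. All function names and the toy records are hypothetical.

```python
import re

ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def attrs(col9):
    """Parse the ninth GTF column into a dict."""
    return dict(ATTR_RE.findall(col9))


def ids_by_original_transcript(combined_lines, original_key="oId"):
    """Map original transcript ID -> {tss_id, p_id} found in the combined GTF."""
    mapping = {}
    for line in combined_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue
        a = attrs(cols[8])
        original = a.get(original_key)  # assumption: oId holds the old ID
        if original is None:
            continue
        extra = {k: a[k] for k in ("tss_id", "p_id") if k in a}
        mapping.setdefault(original, {}).update(extra)
    return mapping


def annotate(original_lines, mapping):
    """Yield original records with tss_id/p_id appended, IDs untouched."""
    for line in original_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            yield line.rstrip("\n")
            continue
        extra = mapping.get(attrs(cols[8]).get("transcript_id"), {})
        base = cols[8].rstrip()
        if extra and not base.endswith(";"):
            base += ";"
        cols[8] = base + "".join(
            ' %s "%s";' % (k, v) for k, v in sorted(extra.items()))
        yield "\t".join(cols)


# Toy records standing in for the combined and the original GTF.
prefix = "\t".join(["chr7", "toy", "exon", "24902986", "24903128", ".", "+", "."])
combined = [prefix + '\tgene_id "XLOC_000001"; transcript_id "TCONS_00000001"; '
                     'oId "NM_008488"; tss_id "TSS1"; p_id "P1";']
original = [prefix + '\tgene_id "Arhgef1"; transcript_id "NM_008488";']

mapping = ids_by_original_transcript(combined)
annotated = list(annotate(original, mapping))
print(annotated[0].split("\t")[8])
```

The original gene_id and transcript_id stay as they were; only the two extra attributes are appended to each matching record.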
Tags
cuffdiff, cufflinks, gtf, tophat