Hi everyone,
I have two questions about the GTF file that can be used as a reference annotation in both TopHat and cuffdiff. A typical GTF file, downloaded from UCSC for instance, looks something like this:
chr12 refGene exon 12262139 12262238 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "1"; exon_id "NM_001146119.1"; gene_name "Fam49a$
chr12 refGene exon 12304181 12304322 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "2"; exon_id "NM_001146119.2"; gene_name "Fam49a$
chr12 refGene exon 12340679 12340758 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "3"; exon_id "NM_001146119.3"; gene_name "Fam49a$
chr12 refGene CDS 12340689 12340758 . + 0 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "3"; exon_id "NM_001146119.3"; gene_name "Fam49a$
chr12 refGene exon 12358045 12358166 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "4"; exon_id "NM_001146119.4"; gene_name "Fam49a$
chr12 refGene CDS 12358045 12358166 . + 2 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "4"; exon_id "NM_001146119.4"; gene_name "Fam49a$
chr12 refGene exon 12359213 12359318 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "5"; exon_id "NM_001146119.5"; gene_name "Fam49a$
chr12 refGene CDS 12359213 12359318 . + 0 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "5"; exon_id "NM_001146119.5"; gene_name "Fam49a$
chr12 refGene exon 12361435 12361571 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "6"; exon_id "NM_001146119.6"; gene_name "Fam49a$
chr12 refGene CDS 12361435 12361571 . + 2 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "6"; exon_id "NM_001146119.6"; gene_name "Fam49a$
chr12 refGene exon 12362015 12362092 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "7"; exon_id "NM_001146119.7"; gene_name "Fam49a$
chr12 refGene CDS 12362015 12362092 . + 0 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "7"; exon_id "NM_001146119.7"; gene_name "Fam49a$
chr12 refGene exon 12362252 12362368 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "8"; exon_id "NM_001146119.8"; gene_name "Fam49a$
chr12 refGene CDS 12362252 12362368 . + 0 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "8"; exon_id "NM_001146119.8"; gene_name "Fam49a$
chr12 refGene exon 12362461 12362540 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "9"; exon_id "NM_001146119.9"; gene_name "Fam49a$
chr12 refGene CDS 12362461 12362540 . + 0 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "9"; exon_id "NM_001146119.9"; gene_name "Fam49a$
chr12 refGene exon 12364720 12364846 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "10"; exon_id "NM_001146119.10"; gene_name "Fam4$
chr12 refGene CDS 12364720 12364846 . + 1 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "10"; exon_id "NM_001146119.10"; gene_name "Fam4$
chr12 refGene exon 12369894 12369964 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "11"; exon_id "NM_001146119.11"; gene_name "Fam4$
chr12 refGene CDS 12369894 12369964 . + 0 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "11"; exon_id "NM_001146119.11"; gene_name "Fam4$
chr12 refGene exon 12372747 12376361 . + . gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "12"; exon_id "NM_001146119.12"; gene_name "Fam4$
chr12 refGene CDS 12372747 12372807 . + 1 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "12"; exon_id "NM_001146119.12"; gene_name "Fam4$
chr12 refGene start_codon 12340689 12340691 . + 0 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "1"; exon_id "NM_001146119.1"; gene_name$
chr12 refGene stop_codon 12372808 12372810 . + 0 gene_id "Fam49a"; transcript_id "NM_001146119"; exon_number "1"; exon_id "NM_001146119.1"; gene_name$
chr7 refGene exon 24902986 24903128 . + . gene_id "Arhgef1"; transcript_id "NM_008488"; exon_number "1"; exon_id "NM_008488.1"; gene_name "Arhgef1";
Question 1:
The GTF file includes exons, coding sequences (CDS), miRNAs, and start and stop codons. Often the coding sequence and the exon will be identical, or nearly so. For this reason I kept only the exon lines in my reference file, and I was wondering whether this is the wisest thing to do. What will TopHat and Cufflinks do if I keep the additional information? Will they be able to use these annotations (exons, coding sequences, etc.), or will they just try to map the reads to each individual line in the reference file without distinguishing between the different "types" (exon versus CDS versus miRNA)? If the latter is the case, does this basically mean that the number of reads for exons will halve, since half are now mapped to the CDS?
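To make the filtering step concrete, here is a minimal Python sketch of this kind of exon-only filtering. It assumes a standard tab-separated GTF with the feature type in the third column. The function names (split_gtf_line, count_feature_types, keep_features) are hypothetical helpers made up for illustration, not part of TopHat or Cufflinks, and this is not necessarily how the file above was produced.

```python
from collections import Counter


def split_gtf_line(line):
    """Return the 9 tab-separated GTF columns, or None for comment,
    blank, or malformed lines. (Hypothetical helper.)"""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    return fields if len(fields) == 9 else None


def count_feature_types(lines):
    """Count how many records of each feature type (column 3) occur."""
    counts = Counter()
    for line in lines:
        fields = split_gtf_line(line)
        if fields is not None:
            counts[fields[2]] += 1
    return counts


def keep_features(lines, wanted=("exon",)):
    """Yield only the lines whose feature type is in `wanted`,
    leaving each kept line unchanged."""
    for line in lines:
        fields = split_gtf_line(line)
        if fields is not None and fields[2] in wanted:
            yield line
```

Calling count_feature_types on the open annotation file first shows which feature types are present (exon, CDS, start_codon, ...), and writing the output of keep_features(handle) to a new file gives the exon-only version. Passing wanted=("exon", "CDS") instead keeps the CDS records alongside the exons.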
Question 2:
How can Cufflinks perform CDS-level transcription difference tests, splicing tests, promoter preference tests and relative CDS output tests? Where do you provide the inputs so that it knows where these features are?
My output when using just exons looks like this:
Performed 12350 isoform-level transcription difference tests
Performed 0 tss-level transcription difference tests
Performed 10502 gene-level transcription difference tests
Performed 0 CDS-level transcription difference tests
Performed 0 splicing tests
Performed 0 promoter preference tests
Performing 0 relative CDS output tests