Hello;
I am trying to get familiar with RNA-seq analysis and want to have an overall work flow with the bundle(bowtie--->tophat--->cufflinks) to analyze my RNA-seq data, but there is no clear answer over the website. The closest discussion to my question was addressed to HTSeq in another thread, which seems irrelevant to my question although it shares some points.
Now I come to the final step to use cuffcompare to get the expression level of each gene. Here is my work flow:
1. First I built the index:
2. Next I use tophat to map the reads to the genome:
3. Then I use cufflinks to get the expression of genes and transcripts:
4. Then
Now I got result as following which contains redundant information. What I need is just AT1G01040 as a combination of all the transcript of this gene.
I have the standard gtf file of the genome (Arabidopsis TAIR10)
and the Tophat output gtf file seems fine too.
The output file combined.gtf should to be what I need according to the manual, but what I got still contains redundant information:
How do I ask cuffcompare to output the gene-based expression results ignoring all the alternative transcripts? For example, I only need the expression level (FPKM) for AT1G11240 as a union of all the exons and other alternative transcripts.
I am aware this post is long, but believe this is a very common question for the first step of RNA-seq analysis. Your reply is highly appreciated!
Yifang
I am trying to get familiar with RNA-seq analysis and want to have an overall work flow with the bundle(bowtie--->tophat--->cufflinks) to analyze my RNA-seq data, but there is no clear answer over the website. The closest discussion to my question was addressed to HTSeq in another thread, which seems irrelevant to my question although it shares some points.
Now I come to the final step to use cuffcompare to get the expression level of each gene. Here is my work flow:
1. First I built the index:
Code:
bowtie-build TAIR10_chrAll.fa ./Bowtie-indexes/TAIR10
Code:
tophat -r 20 -o ./Result-Tophat-out-20110329 -p 24 TAIR10 s_6_sequence.txt
Code:
cufflinks -o Result-cufflinks-20110329tair10c -G TAIR10_GFF3_genesb.gff -r TAIR10_chrAll.fa tophat-out/accepted_hits.bam
Code:
cuffcompare -o cuffcompare -r TAIR10_GFF3_genes.gff tophat-out/transcripts.gtf
Code:
trans_id bundle_id chr left right FPKM FMI frac FPKM_conf_lo FPKM_conf_hi coverage length effective_length status AT1G01040.1 31361 Chr1 23145 31227 1.71887e-08 2.38915e-05 3.08942e-10 0 1.05579e-06 2.10524e-09 6251 6151 OK AT1G01040.1-Protein 31361 Chr1 23518 31079 0.00071945 1 1.5685e-05 0.00050699 0.00093191 8.81164e-05 7561 7461 OK AT1G01040.2 31361 Chr1 23415 31120 29.3171 1 0.494891 21.6766 36.9575 3.59068 5877 5777 OK AT1G01040.1,AT1G01040.1-Protein 31361 Chr1 23518 31079 19.0961 1 0.314152 14.1945 23.9976 2.33884 5730 5630 OK AT1G01040.2,AT1G01040.2-Protein 31361 Chr1 23518 31079 3.218e-08 1.68516e-09 5.29679e-10 0 0.000645987 3.94133e-09 5733 5633 OK
Code:
Chr1 TAIR10 gene 3631 5899 . + . ID=AT1G01010;Note=protein_coding_gene;Name=AT1G01010 Chr1 TAIR10 mRNA 3631 5899 . + . ID=AT1G01010.1;Parent=AT1G01010;Name=AT1G01010.1;Index=1 Chr1 TAIR10 protein 3760 5630 . + . ID=AT1G01010.1-Protein;Name=AT1G01010.1;Derives_from=AT1G01010.1 Chr1 TAIR10 exon 3631 3913 . + . Parent=AT1G01010.1 Chr1 TAIR10 five_prime_UTR 3631 3759 . + . Parent=AT1G01010.1 Chr1 TAIR10 CDS 3760 3913 . + 0 Parent=AT1G01010.1,AT1G01010.1-Protein; Chr1 TAIR10 exon 3996 4276 . + . Parent=AT1G01010.1 Chr1 TAIR10 CDS 3996 4276 . + 2 Parent=AT1G01010.1,AT1G01010.1-Protein; Chr1 TAIR10 exon 4486 4605 . + . Parent=AT1G01010.1 Chr1 TAIR10 CDS 4486 4605 . + 0 Parent=AT1G01010.1,AT1G01010.1-Protein; Chr1 TAIR10 exon 4706 5095 . + . Parent=AT1G01010.1 Chr1 TAIR10 CDS 4706 5095 . + 0 Parent=AT1G01010.1,AT1G01010.1-Protein; Chr1 TAIR10 exon 5174 5326 . + . Parent=AT1G01010.1 Chr1 TAIR10 CDS 5174 5326 . + 0 Parent=AT1G01010.1,AT1G01010.1-Protein; Chr1 TAIR10 exon 5439 5899 . + . Parent=AT1G01010.1 Chr1 TAIR10 CDS 5439 5630 . + 0 Parent=AT1G01010.1,AT1G01010.1-Protein; Chr1 TAIR10 three_prime_UTR 5631 5899 . + . Parent=AT1G01010.1
Code:
trans_id bundle_id chr left right FPKM FMI frac FPKM_conf_lo FPKM_conf_hi coverage length effective_length status Chr1 Cufflinks transcript 3631 5899 1 + . gene_id "AT1G01010.1"; transcript_id "AT1G01010.1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; Chr1 Cufflinks exon 3631 3913 1 + . gene_id "AT1G01010.1"; transcript_id "AT1G01010.1"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; Chr1 Cufflinks exon 3996 4276 1 + . gene_id "AT1G01010.1"; transcript_id "AT1G01010.1"; exon_number "2"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; Chr1 Cufflinks exon 4486 4605 1 + . gene_id "AT1G01010.1"; transcript_id "AT1G01010.1"; exon_number "3"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; Chr1 Cufflinks exon 4706 5095 1 + . gene_id "AT1G01010.1"; transcript_id "AT1G01010.1"; exon_number "4"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; Chr1 Cufflinks exon 5174 5326 1 + . gene_id "AT1G01010.1"; transcript_id "AT1G01010.1"; exon_number "5"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; Chr1 Cufflinks transcript 3766936 3768212 1000 - . gene_id "CUFF.4957"; transcript_id "AT1G11240.1,AT1G11240.1-Protein"; FPKM "37.6814271783"; frac "0.999315"; conf_lo "25.348853"; conf_hi "50.014001"; cov "4.615128"; Chr1 Cufflinks exon 3766936 3767192 1000 - . gene_id "CUFF.4957"; transcript_id "AT1G11240.1,AT1G11240.1-Protein"; exon_number "1"; FPKM "37.6814271783"; frac "0.999315"; conf_lo "25.348853"; conf_hi "50.014001"; cov "4.615128"; Chr1 Cufflinks exon 3767406 3767520 1000 - . gene_id "CUFF.4957"; transcript_id "AT1G11240.1,AT1G11240.1-Protein"; exon_number "2"; FPKM "37.6814271783"; frac "0.999315"; conf_lo "25.348853"; conf_hi "50.014001"; cov "4.615128"; Chr1 Cufflinks exon 3767885 3767990 1000 - . gene_id "CUFF.4957"; transcript_id "AT1G11240.1,AT1G11240.1-Protein"; exon_number "3"; FPKM "37.6814271783"; frac "0.999315"; conf_lo "25.348853"; conf_hi "50.014001"; cov "4.615128"; Chr1 Cufflinks exon 3768088 3768212 1000 - . gene_id "CUFF.4957"; transcript_id "AT1G11240.1,AT1G11240.1-Protein"; exon_number "4"; FPKM "37.6814271783"; frac "0.999315"; conf_lo "25.348853"; conf_hi "50.014001"; cov "4.615128"; Chr1 Cufflinks transcript 3768724 3769706 1 + . gene_id "AT1G11250.1"; transcript_id "AT1G11250.1-Protein"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; Chr1 Cufflinks exon 3768724 3769706 1 + . gene_id "AT1G11250.1"; transcript_id "AT1G11250.1-Protein"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; Chr1 Cufflinks transcript 3768724 3769706 1 + . gene_id "CUFF.4961"; transcript_id "AT1G11250.1,AT1G11250.1-Protein"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; Chr1 Cufflinks exon 3768724 3769187 1 + . gene_id "CUFF.4961"; transcript_id "AT1G11250.1,AT1G11250.1-Protein"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000"; Chr1 Cufflinks exon 3769274 3769706 1 + . gene_id "CUFF.4961"; transcript_id "AT1G11250.1,AT1G11250.1-Protein"; exon_number "2"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
Code:
Chr1 Cufflinks exon 3631 3913 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "AT1G01010.1"; nearest_ref "AT1G01010.1"; class_code "="; tss_id "TSS1"; Chr1 Cufflinks exon 3996 4276 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; oId "AT1G01010.1"; nearest_ref "AT1G01010.1"; class_code "="; tss_id "TSS1"; Chr1 Cufflinks exon 4486 4605 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "3"; oId "AT1G01010.1"; nearest_ref "AT1G01010.1"; class_code "="; tss_id "TSS1"; Chr1 Cufflinks exon 4706 5095 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "4"; oId "AT1G01010.1"; nearest_ref "AT1G01010.1"; class_code "="; tss_id "TSS1"; Chr1 Cufflinks exon 5174 5326 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "5"; oId "AT1G01010.1"; nearest_ref "AT1G01010.1"; class_code "="; tss_id "TSS1"; Chr1 Cufflinks exon 5439 5899 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "6"; oId "AT1G01010.1"; nearest_ref "AT1G01010.1"; class_code "="; tss_id "TSS1"; Chr1 Cufflinks exon 3760 5630 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000002"; exon_number "1"; oId "AT1G01010.1-Protein"; nearest_ref "AT1G01010.1-Protein"; class_code "=";
I am aware this post is long, but believe this is a very common question for the first step of RNA-seq analysis. Your reply is highly appreciated!
Yifang
Comment