thickrick99 07-04-2014 08:30 AM

What do I do with output files from tophat/cufflinks
Hi, I am a beginner with RNA sequencing. I used tophat to align RNA-seq reads from GEUVADIS to hg19 from UCSC. In tophat, I provided the reference transcripts and then used the accepted_hits.bam file from the output as the input file for cufflinks.

I ran cufflinks both with and without the reference transcripts and have the outputs for both runs. So now I am stuck... What exactly can I do now? I have the isoform and gene FPKM files with the values, but how should I approach analyzing them in general? I am not doing a project; I just want to know about the different things I can do with these files as well as the transcripts.gtf file.

Also, what does an FPKM value of 0 mean? Some other forums mentioned that it means none of the reads mapped to the reference, so I wrote a simple script to filter all of these values out of the isoforms.fpkm_tracking file. Is this OK?

Lastly, what can I do to compare both the isoforms/transcripts files from cufflinks with and without the reference annotation?

Thank you so much for the help in advance!!! :)


blancha 07-04-2014 10:38 AM

These are vast questions.
I don't have time to answer them fully, but here are some tips, which I hope you will find helpful.

If you're willing to use some R commands, you might want to try CummeRbund for the downstream analysis.
It's not the greatest software, but it does make it easier to extract more information out of all the data.

I'm not sure why you want to remove isoforms with an FPKM of 0. An FPKM of 0 means that the isoform is either not expressed, or so lowly expressed that it cannot be detected at this sequencing depth. This is useful information, so I would not remove it.
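
As a rough illustration, rather than deleting the FPKM 0 rows, the isoforms can be split into two lists so that both stay available. This is only a minimal Python sketch: it assumes the tracking file is tab-delimited with a header row containing `tracking_id` and `FPKM` columns, and the function name `split_by_fpkm` and the three example rows are made up for illustration.

```python
import csv
import io

def split_by_fpkm(handle, threshold=0.0):
    """Return (detected, not_detected) lists of tracking_id values.

    Assumes a tab-delimited file with a header row that includes
    'tracking_id' and 'FPKM' columns.
    """
    detected, not_detected = [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        if float(row["FPKM"]) > threshold:
            detected.append(row["tracking_id"])
        else:
            # Kept, not discarded: these are the isoforms with no
            # detectable expression at this sequencing depth.
            not_detected.append(row["tracking_id"])
    return detected, not_detected

# Made-up rows standing in for a real isoforms.fpkm_tracking file.
example = (
    "tracking_id\tgene_id\tFPKM\tFPKM_status\n"
    "tx1\tgeneA\t12.5\tOK\n"
    "tx2\tgeneA\t0\tOK\n"
    "tx3\tgeneB\t0.8\tOK\n"
)
detected, not_detected = split_by_fpkm(io.StringIO(example))
print(len(detected), "detected;", len(not_detected), "at FPKM 0")
```

For a real run, the `io.StringIO(example)` argument would be replaced by an open file handle on isoforms.fpkm_tracking.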

thickrick99 07-04-2014 11:01 AM

Thanks blancha for your advice especially on the FPKM values! Yeah it makes sense to keep them since I can identify genes that are not expressed.

Sorry for the really broad questions. Essentially I just needed some advice on what to do next.

One question which I believe I mentioned above was using cufflinks with and without the reference. How can I view the novel transcripts that cufflinks found without the reference in de novo mode compared to the output file using the reference?

Lastly, does anyone know a good way to identify SNVs from the data? I wasn't sure how to approach this either. Thanks!

blancha 07-04-2014 11:12 AM

For the SNP calling, I would recommend reading the Broad Institute Best Practices Workflow.

thickrick99 07-04-2014 11:17 AM

Alright Cool! Yeah I heard that GATK is useful in SNP calling so I will definitely read through the protocol.

Thanks Again!

blancha 07-05-2014 03:36 AM

There are also several ways of analyzing the biological significance of the data.

goseq: R package to do gene ontology analysis. Corrects for length bias in RNA-Seq. Cumbersome to use. Default output not complete, e.g. ontology terms but not the genes inputted that are associated with the terms.

DAVID: Very easy to use. Biologists can do it. Does not correct for length bias. Algorithm rather mysterious. Interactive and informative output. Very easy to play with.

GSEA: Different algorithm. Can pick gene sets. Criteria must be chosen to rank genes however. There is no perfect ranking. Ranking by fold changes or adjusted p-values both have their disadvantages.
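
To make the ranking choice concrete, here is a small Python sketch comparing two possible criteria: the log2 fold change alone, and a signed -log10 of the adjusted p-value (direction taken from the fold change). Neither is presented as the right answer, the signed score is just one combination people sometimes use, and the function name `rank_genes`, the field names and the three genes are all invented for illustration.

```python
import math

def rank_genes(results, metric="signed_p"):
    """Return gene names ordered from most up- to most down-regulated.

    results: list of dicts with 'gene', 'log2fc' and 'padj' keys
    (hypothetical field names, not tied to any particular tool).
    """
    def score(r):
        if metric == "log2fc":
            return r["log2fc"]
        # Signed -log10(padj): magnitude from significance,
        # sign from the direction of the fold change.
        p = max(r["padj"], 1e-300)  # avoid log10(0)
        return math.copysign(-math.log10(p), r["log2fc"])
    return [r["gene"] for r in sorted(results, key=score, reverse=True)]

# Invented values: gene A has a large but noisy change,
# gene B a small but very consistent one.
example = [
    {"gene": "A", "log2fc": 4.0, "padj": 0.2},
    {"gene": "B", "log2fc": 1.0, "padj": 1e-10},
    {"gene": "C", "log2fc": -2.0, "padj": 1e-5},
]
print(rank_genes(example, "log2fc"))
print(rank_genes(example, "signed_p"))
```

The two orderings disagree about whether A or B should come first, which is exactly the trade-off: fold change favors large but possibly unreliable effects, while p-value based scores favor consistent but possibly small ones.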
