SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Bowtie2 in the Tophat/Cufflinks pipeline? jdanderson Bioinformatics 1 12-30-2011 06:10 AM
any pipeline for tophat with 454 sequences? feng RNA Sequencing 0 11-29-2011 05:41 PM
Cufflinks pipeline for splicing differences between animal samples seqhorn Bioinformatics 0 07-08-2011 01:23 AM
vcftools typical usage dawe Bioinformatics 1 08-13-2010 05:40 AM
tuning the tophat/cufflink pipeline oliviera RNA Sequencing 2 06-08-2010 11:49 PM

Reply
 
Thread Tools
Old 02-28-2011, 09:15 AM   #1
Aurelien Mazurie
Member
 
Location: Bozeman, MT

Join Date: Feb 2011
Posts: 15
Default Typical TopHat/Cufflinks pipeline?

Greetings,
I am new to RNA-seq analysis, and am trying to understand the high-order logic of the bowtie/tophat/cufflinks/cuffcompare/cuffdiff pipeline. As far as I understand, a typical run would be something like:

1. Creation of a genome index (needed for Bowtie and TopHat to operate)
bowtie-build genome.fa bowtie_output/genome
2. Alignments of RNA-seq reads to the genome with identification of exon-exon splice functions using TopHat
tophat -o=tophat_output/ bowtie_output/genome reads.fastq
3. Assembly of RNA-seq reads into transcripts and report of the RPKM/FPKM using Cufflinks
cufflinks -o=cufflinks_output/ --reference-seq=genome.fa tophat_output/accepted_hits.bam
4. Comparison of assembled transcripts with a reference annotation, or across experiments, using Cuffcompare
cuffcompare -o=cuffcompare_output -r=genome.gtf -R cufflinks_output/transcripts.gtf
5. Finding significant changes in transcript expression, splicing, and promoter use using Cuffdiff
cuffdiff -o=cuffdiff_output/ cufflinks_output/transcripts.gff tophat_output/accepted_hits1a.bam,tophat_output/accepted_hits1b.bam tophat_output/accepted_hits2.bam
(in this example, accepted_hits1a.bam and accepted_hits1b.sam are replicates, while accepted_hits2.bam is another experimental condition)

Now I have the following questions:

(1) I originally thought step 2 could be performed with all reads files (from different experimental conditions) in the same command line. However, tophat only generates one set of output file. I thus switched to calling tophat for each reads file at a time. Is that the correct approach?

(2) In step 2, tophat can uses a reference genome annotation (GTF file). I am using this pipeline on a genome which, so far, has only been automatically annotated (Phaeodactylum tricornutum). From people's experience, should I rely on an automatically generated GTF file (in this case, from ENSEMBL) with hypothetical gene models or is it better to 'trust' the gene models that tophat would generate from the RNA-seq reads alone? I am worried about which option would give the most false positives; indeed, the tophat documentation states that 'TopHat will use the exon records in this file to build a set of known splice junctions for each gene, and will attempt to align reads to these junctions even if they would not normally be covered by the initial mapping.'.

(3) In step 3, cufflinks also can uses a reference genome annotation. The question is the same: should I trust the reads' alignments more than an automatically generated GTF file for gene models? After all, a lot of person are using ChIP-seq and RNA-seq to improve on gene models.

(4) In step 3, cufflinks is ran with one reads file (i.e., one experimental condition) at a time, producing a transcripts.gtf file with the model transcripts for this condition. However in step 5 cuffdiff is expecting a transcripts GTF file that will be used to compare all SAM/BAM files (from various experimental conditions) provided as input. Should I understand that, in order to compare different experimental conditions, I have to generate a 'meta' transcripts file by running steps 2 and 3 with a concatenation of all my reads files? (ensuring the resulting transcripts.gtf file in step 3 will be the union of all transcripts.gtf files possible?)

Those are the key questions that prevent me, for now, to really use this pipeline (that, and an 'sort order of reads in BAMs must be the same' error in step 3 which, according to thread #9304 -- http://seqanswers.com/forums/showthread.php?t=9304 -- is due to a bug of cufflinks).

Best,
Aurelien
Aurelien Mazurie is offline   Reply With Quote
Old 02-28-2011, 11:31 PM   #2
Bioboatr
Member
 
Location: Lawrence, KS

Join Date: Jan 2010
Posts: 11
Default

Aurelien,

I'm working along the same lines as you, though my first step is simply to generate an annotation of our new genome from several different RNAseq datasests. I'll worrying about differential expression later...

Other folks have asked similar questions, see the following post and where it links too. Unfortunately, there seems to be a fair bit ambiguity surrounding how best to do annotations with cufflinks (as opposed to diff-express).

http://seqanswers.com/forums/showthread.php?t=7942

In principle I like the idea of a 'meta' transcripts file and plan to generate one via samtools merge of several different tophat runs. I seem to have some memory issues on our machine that prevents pooling all the samples. Haven't actually done this yet and I wonder if I'm not gonna be stopped by that sort order error...


Jamie
Bioboatr is offline   Reply With Quote
Old 03-01-2011, 09:46 AM   #3
Aurelien Mazurie
Member
 
Location: Bozeman, MT

Join Date: Feb 2011
Posts: 15
Default

Thank you for the links; I will definitively look at them. I never heard about diff-express, and couldn't find anything on Google. Could you give some more details about it?

Aurelien
Aurelien Mazurie is offline   Reply With Quote
Old 03-01-2011, 09:58 AM   #4
Aurelien Mazurie
Member
 
Location: Bozeman, MT

Join Date: Feb 2011
Posts: 15
Default

I found an answer for question (4): http://seqanswers.com/forums/showthr...3800#post13800

The suggestion from Cole Trapnell is indeed to pool samples with cuffcompare in step 4, then run cuffdiff (step 5) with the combined transcripts GTF file.
Aurelien Mazurie is offline   Reply With Quote
Old 03-01-2011, 08:41 PM   #5
Bioboatr
Member
 
Location: Lawrence, KS

Join Date: Jan 2010
Posts: 11
Default

diff-express was just a lazy way of writing 'differential expression', which I meant in opposition to the goal of 'annotation'. no software here... just a lazy typist.
Bioboatr is offline   Reply With Quote
Old 08-02-2012, 12:23 AM   #6
ymc
Senior Member
 
Location: Hong Kong

Join Date: Mar 2010
Posts: 498
Default

I didn't run 3) and 4). Instead I used the genes.gtf in the hg19 tar ball I downloaded from here:

http://cufflinks.cbcb.umd.edu/igenomes.html

If I am only interested in differential expression but not in transcript assembly, is it good enough?
ymc is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:21 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO