SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
cuffmerge crashes when converting gtf files to sam files swbiggs4 Bioinformatics 20 02-16-2017 09:19 AM
cuffdiff: use merged.gtf from cuffmerge or combined.gtf from cuffcompare? turnersd Bioinformatics 21 10-02-2014 03:41 AM
cufflinks GTF files incompatible with cuffmerge? shurjo Bioinformatics 21 09-23-2013 04:47 PM
Filtering "merge.gtf" (from cuffmerge) by redundance Ramon Vidal Bioinformatics 3 02-13-2012 06:47 AM
Cufflinks' computation of FPKM for --GTF and --GTF-guide estimation burt Bioinformatics 0 08-23-2011 11:59 PM

Reply
 
Thread Tools
Old 04-23-2012, 01:15 PM   #1
doublealice
Member
 
Location: US

Join Date: Feb 2011
Posts: 24
Default merge.gtf from cuffmerge

Hi all,

I used cufflinks on a small amount of reads and a part of reference sequence to perform transcript assembly. After running cufflinks and cuffmerge, I got a merge.gtf file finally. From the manual, it sounds this is the final output of the assembly. But I am a little confused when looking at this output file. Here is the first line of it:

Code:
chr1	Cufflinks	exon	1	3516	.	.	.	gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.4.1"; tss_id "TSS1";
My question is, does it mean the newly assembled exon locates on the reference genome's chr1: 1-3516?

But I don't think that the new transcript would contain exactly identical exon sequence with the reference. Was my understanding wrong?

I appreciate if anyone can help to clarify.

Alice

Last edited by doublealice; 04-23-2012 at 01:19 PM.
doublealice is offline   Reply With Quote
Old 04-23-2012, 05:03 PM   #2
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

did you only run a small amount of reads so the run wouldn't take too long?

in this case it might be helpful if you could paste in the exact commands you ran because the options you specified along the way can change the content of merged.gtf.

you are correct in reading that line from merged.gtf - that line says there is an exon found over the region chr1:1-3516. i'm not sure what you were expecting to find though - please clarify.
sdriscoll is offline   Reply With Quote
Old 04-24-2012, 08:13 AM   #3
doublealice
Member
 
Location: US

Join Date: Feb 2011
Posts: 24
Default

@sdriscoll, thanks very much for your reply.

I just finished the benchwork on RNA extraction and submitted all samples to a seq facility for RNA-seq. While I am waiting the result, I started to try these popular assemblers.

As you said, I used a small amount of reads and reference so that the run wouldn't take too long. All my testing data were downloaded from web resource.

In stead of whole transcriptome assembly, I am only interested in 20 genes in my studied species A. I therefore pickup the orthologs of these 20 genes (CDS sequence) from very closed species B, C and D to comprise a reference transcriptome. I also downloaded RNA-seq data of species B from NCBI SRA. My working flow is:

1. Use bowtie2 maps reads to reference. I didn't use tophat because my reference is CDS and no need to decide splice site.

bowtie2 -t -p 12 --sensitive -x ref_index -1 read_1_fastq -2 read_2_fastq -S output.sam

2. Convert to bam, run cufflinks.

cufflinks -p 12 -o cuff_1 output.sorted.bam

3. I have six pair of read files, so run above steps 6 times, then cuffmerge
cuffmerge -p 12 -s ref.fa assemblies.txt

Then I got above merge.gtf. I just simply changed reference gene names to chr1. The real file is:

gene1_fromB Cufflinks exon 1 3516
gene2_fromC Cufflinks exon 425 753
gene3_fromD Cufflinks exon 830 1200
gene3_fromD Cufflinks exon 1671 2045
...

I don't know if people use normal reference genome perform cufflinks assembly, would they get a similar merge.gtf. I am curious why the newly assembled exons have identical sequence with reference. I expect new outputs may have some mismatches, and indels in some regions. It should be a fasta sequence file instead of a gtf table based on reference.

In other words, does it mean cufflinks only assembles an exon which may have exact same sequence with reference?

Please correct me if my thought is not proper. Thanks!

Alice

Last edited by doublealice; 04-24-2012 at 08:18 AM.
doublealice is offline   Reply With Quote
Old 04-24-2012, 09:09 AM   #4
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

I think I see. The full cufflinks to cuffmerge pipeline is designed and only really useful for full genome alignments. So you'd align your RNA-Seq data to the full genome. Then cufflinks can use it's magic to assemble the islands of piled up reads into transcripts followed by cuffmerge building a consensus annotation, all in terms of the genome to which you aligned your data.

So that last step where you swapped in the chromosome name in place of the original reference names from your reference isn't correct. Since you did not align your reads to the genome then everything has to remain in terms of your original reference.

I guess to figure out what the "right" thing for you to do would be I need to know what your goal is. Are you looking for differential expression or are you looking to discover splice variations of your 20 genes? Also what species are you working with?
sdriscoll is offline   Reply With Quote
Old 04-24-2012, 11:49 AM   #5
doublealice
Member
 
Location: US

Join Date: Feb 2011
Posts: 24
Default

Thanks!

I didn't edit anything on my reference names when I performed the analysis. Just when I posted this thread, I guessed that using gene name in the first column may cause more confusion and whatever name it was may not be the point, so simply used chr1 instead. Sorry for causing more misunderstandings.

I am working with four species of gymnosperms, none of them has complete genome. My goal is to identify 20 genes from these gymonsperms' RNA-seq. Some of them have EST available online and some of them are quite conserved through all plants. I therefore consider to collect those EST and orthologs from well studied species, e.g, arabidopsis, as reference to perform assembly.

From what you said, my impression is cufflinks are good at whole genome / transcriptom assembly. For my such small amount data, cufflinks may not a good choice. Is it correct? What kind of other tools I should try?

I am thinking of de novo assembly, e.g. trinity or oases. But it does a challenge to my computer.

Thanks again for kind help.

Alice

Last edited by doublealice; 04-24-2012 at 11:51 AM.
doublealice is offline   Reply With Quote
Old 04-24-2012, 01:56 PM   #6
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

So it sounds like you want something like this:

RNA-Seq reads -> [some pipeline] -> transcriptome assembly

if that's right then what you need are tools for novel transcriptome assembly. something that essentially aligns your RNA-Seq reads to each other rather than to a reference. similar to how they build genomes from sequencing data. if that's right then bowtie isn't what you want. Cufflinks seeks to explain reads that are aligned to a complete genome. it "discovers" genes by looking at their alignments to a genome and annotates them as a GTF file which just gives you start/end coordinates for exons.

What you want to do is assemble reads into a reference (or something that you could in fact align reads to).

does that sound right?

so something like this would be relevant: http://www.nature.com/nbt/journal/v2.../nbt.1883.html
sdriscoll is offline   Reply With Quote
Old 04-25-2012, 07:40 AM   #7
doublealice
Member
 
Location: US

Join Date: Feb 2011
Posts: 24
Default

Thanks! I think trinity is probably the next assembler I should try. The reason I didn't want to use it at the beginning is the demand of computer resource. To my knowledge, trinity cannot separate millions reads into several parts, work on them one by one then merge the result finally. In that case, my computer maybe not powerful enough to finish that kind of job. I don't have confidence to guess if my job was still working very very slowly or crashed already after 350 hours waiting. I maybe wrong.

It is true, no reference genome for my studied species. But, I do have reference genes for my interested targets. To my understanding, theoretically, these genes could be used as a template to fish out reads from millions read pool, then assemble to a complete full-length gene. Compare to work on whole genome, is it faster? I found someone else ask similar question, but no respond. Probably this kind of question is too "naive" or too "beginning"??

Thanks again for the helps.

Alice

Last edited by doublealice; 04-25-2012 at 07:42 AM.
doublealice is offline   Reply With Quote
Old 04-25-2012, 12:30 PM   #8
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

i think it's more because most people aren't doing what you are doing and there isn't a coherent pipeline for it. the only reason i can give you any advice is because a friend of mine was working with frog data and wanted to do something similar. he too misinterpreted what cufflinks does thinking it would assemble reads into ESTs without a reference genome. He ended up having a friend assemble his reads into ESTs for him using this software: http://www.mybiosoftware.com/sequence-analysis/1282.

I think what you want to do is this. Run your collected data through an assembler, one that can "assemble reads without a reference genome". I know there are more than just Trinity. I think something called Abyss does this as well. The output of that analysis should be a long list of "genes" in FASTA format. You'll need to compare this list to your reference genes so that you can identify your genes of interest in the list of assembled genes. Use something like BLAST for that where you can do pairwise blasts and look for the best matches. Keep in mind that the assembler may have assembled several versions of the same gene so you might get perfect partial matches which might indicate a new isoform of your reference gene.

Once you have identified your reference genes in the assembled genes you should have a much longer list of genes to build your bowtie index from. You can build a Bowtie reference from this list (a single large FASTA file where each record has a unique gene name) and then align your raw reads to that reference. You can then parse the alignments to quantify the numbers of reads aligning to each of your reference genes and, finally, use something like DESeq to quantify differential expression across your samples.

It's a long pipeline. You'll need to know some R, Bash, and possibly Perl or Python to get through it. You can use the online BLAST or, if you are so inclined, you can download and install your own local copy so that you can do pairwise blasts between your reference genes and the assembled genes from Trinity (or whichever software you use) from a Bash script or something.

I think that's basically what my friend did with his frog data and it seemed to work well.
sdriscoll is offline   Reply With Quote
Old 04-25-2012, 02:01 PM   #9
doublealice
Member
 
Location: US

Join Date: Feb 2011
Posts: 24
Default

Thanks for giving such an elaborate explanation. I guess I may misinterpret the usage of cufflinks as well. I just want to "avoid" the whole genome assembly but seems it is not possible at the moment. The method your friend used is practical and I will try to follow it.

I am very happy to hear from you and appreciate what you suggested.

Thanks!

Alice

Last edited by doublealice; 04-25-2012 at 02:03 PM.
doublealice is offline   Reply With Quote
Old 04-25-2012, 02:05 PM   #10
sdriscoll
I like code
 
Location: San Diego, CA, USA

Join Date: Sep 2009
Posts: 438
Default

No problem. Good luck!
sdriscoll is offline   Reply With Quote
Old 11-15-2012, 02:34 AM   #11
pengchy
Senior Member
 
Location: China

Join Date: Feb 2009
Posts: 116
Default

The bowtie may not suitable for the alignment of RNA to CDS. The CDS is only one isoform and the exon skiping alternative splicing also produce gapped alignment which are not considered by bowtie. Anyway, tophat is a good choice to align RNA-seq reads to the reference, even the reference is denovo assembled transcripts.
pengchy is offline   Reply With Quote
Reply

Tags
cufflinks, rna-seq, transcriptome assembly

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 10:17 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO