Seqanswers Leaderboard Ad

**sdriscoll** · 04-23-2012, 05:03 PM

did you only run a small amount of reads so the run wouldn't take too long?

in this case it might be helpful if you could paste in the exact commands you ran because the options you specified along the way can change the content of merged.gtf.

you are correct in reading that line from merged.gtf - that line says there is an exon found over the region chr1:1-3516. i'm not sure what you were expecting to find though - please clarify.

**doublealice** · 04-24-2012, 08:13 AM

@sdriscoll, thanks very much for your reply.

I just finished the benchwork on RNA extraction and submitted all samples to a seq facility for RNA-seq. While I am waiting the result, I started to try these popular assemblers.

As you said, I used a small amount of reads and reference so that the run wouldn't take too long. All my testing data were downloaded from web resource.

In stead of whole transcriptome assembly, I am only interested in 20 genes in my studied species A. I therefore pickup the orthologs of these 20 genes (CDS sequence) from very closed species B, C and D to comprise a reference transcriptome. I also downloaded RNA-seq data of species B from NCBI SRA. My working flow is:

1. Use bowtie2 maps reads to reference. I didn't use tophat because my reference is CDS and no need to decide splice site.

bowtie2 -t -p 12 --sensitive -x ref_index -1 read_1_fastq -2 read_2_fastq -S output.sam

2. Convert to bam, run cufflinks.

cufflinks -p 12 -o cuff_1 output.sorted.bam

3. I have six pair of read files, so run above steps 6 times, then cuffmerge
cuffmerge -p 12 -s ref.fa assemblies.txt

Then I got above merge.gtf. I just simply changed reference gene names to chr1. The real file is:

gene1_fromB Cufflinks exon 1 3516
gene2_fromC Cufflinks exon 425 753
gene3_fromD Cufflinks exon 830 1200
gene3_fromD Cufflinks exon 1671 2045
...

I don't know if people use normal reference genome perform cufflinks assembly, would they get a similar merge.gtf. I am curious why the newly assembled exons have identical sequence with reference. I expect new outputs may have some mismatches, and indels in some regions. It should be a fasta sequence file instead of a gtf table based on reference.

In other words, does it mean cufflinks only assembles an exon which may have exact same sequence with reference?

Please correct me if my thought is not proper. Thanks!

Alice

**sdriscoll** · 04-24-2012, 09:09 AM

I think I see. The full cufflinks to cuffmerge pipeline is designed and only really useful for full genome alignments. So you'd align your RNA-Seq data to the full genome. Then cufflinks can use it's magic to assemble the islands of piled up reads into transcripts followed by cuffmerge building a consensus annotation, all in terms of the genome to which you aligned your data.

So that last step where you swapped in the chromosome name in place of the original reference names from your reference isn't correct. Since you did not align your reads to the genome then everything has to remain in terms of your original reference.

I guess to figure out what the "right" thing for you to do would be I need to know what your goal is. Are you looking for differential expression or are you looking to discover splice variations of your 20 genes? Also what species are you working with?

**doublealice** · 04-24-2012, 11:49 AM

Thanks!

I didn't edit anything on my reference names when I performed the analysis. Just when I posted this thread, I guessed that using gene name in the first column may cause more confusion and whatever name it was may not be the point, so simply used chr1 instead. Sorry for causing more misunderstandings.

I am working with four species of gymnosperms, none of them has complete genome. My goal is to identify 20 genes from these gymonsperms' RNA-seq. Some of them have EST available online and some of them are quite conserved through all plants. I therefore consider to collect those EST and orthologs from well studied species, e.g, arabidopsis, as reference to perform assembly.

From what you said, my impression is cufflinks are good at whole genome / transcriptom assembly. For my such small amount data, cufflinks may not a good choice. Is it correct? What kind of other tools I should try?

I am thinking of de novo assembly, e.g. trinity or oases. But it does a challenge to my computer.

Thanks again for kind help.

Alice

**sdriscoll** · 04-24-2012, 01:56 PM

So it sounds like you want something like this:

RNA-Seq reads -> [some pipeline] -> transcriptome assembly

if that's right then what you need are tools for novel transcriptome assembly. something that essentially aligns your RNA-Seq reads to each other rather than to a reference. similar to how they build genomes from sequencing data. if that's right then bowtie isn't what you want. Cufflinks seeks to explain reads that are aligned to a complete genome. it "discovers" genes by looking at their alignments to a genome and annotates them as a GTF file which just gives you start/end coordinates for exons.

What you want to do is assemble reads into a reference (or something that you could in fact align reads to).

does that sound right?

so something like this would be relevant: http://www.nature.com/nbt/journal/v2.../nbt.1883.html

**doublealice** · 04-25-2012, 07:40 AM

Thanks! I think trinity is probably the next assembler I should try. The reason I didn't want to use it at the beginning is the demand of computer resource. To my knowledge, trinity cannot separate millions reads into several parts, work on them one by one then merge the result finally. In that case, my computer maybe not powerful enough to finish that kind of job. I don't have confidence to guess if my job was still working very very slowly or crashed already after 350 hours waiting. I maybe wrong.

It is true, no reference genome for my studied species. But, I do have reference genes for my interested targets. To my understanding, theoretically, these genes could be used as a template to fish out reads from millions read pool, then assemble to a complete full-length gene. Compare to work on whole genome, is it faster? I found someone else ask similar question, but no respond. Probably this kind of question is too "naive" or too "beginning"??

Thanks again for the helps.

Alice

**sdriscoll** · 04-25-2012, 12:30 PM

i think it's more because most people aren't doing what you are doing and there isn't a coherent pipeline for it. the only reason i can give you any advice is because a friend of mine was working with frog data and wanted to do something similar. he too misinterpreted what cufflinks does thinking it would assemble reads into ESTs without a reference genome. He ended up having a friend assemble his reads into ESTs for him using this software: http://www.mybiosoftware.com/sequence-analysis/1282.

I think what you want to do is this. Run your collected data through an assembler, one that can "assemble reads without a reference genome". I know there are more than just Trinity. I think something called Abyss does this as well. The output of that analysis should be a long list of "genes" in FASTA format. You'll need to compare this list to your reference genes so that you can identify your genes of interest in the list of assembled genes. Use something like BLAST for that where you can do pairwise blasts and look for the best matches. Keep in mind that the assembler may have assembled several versions of the same gene so you might get perfect partial matches which might indicate a new isoform of your reference gene.

Once you have identified your reference genes in the assembled genes you should have a much longer list of genes to build your bowtie index from. You can build a Bowtie reference from this list (a single large FASTA file where each record has a unique gene name) and then align your raw reads to that reference. You can then parse the alignments to quantify the numbers of reads aligning to each of your reference genes and, finally, use something like DESeq to quantify differential expression across your samples.

It's a long pipeline. You'll need to know some R, Bash, and possibly Perl or Python to get through it. You can use the online BLAST or, if you are so inclined, you can download and install your own local copy so that you can do pairwise blasts between your reference genes and the assembled genes from Trinity (or whichever software you use) from a Bash script or something.

I think that's basically what my friend did with his frog data and it seemed to work well.

**doublealice** · 04-25-2012, 02:01 PM

Thanks for giving such an elaborate explanation. I guess I may misinterpret the usage of cufflinks as well. I just want to "avoid" the whole genome assembly but seems it is not possible at the moment. The method your friend used is practical and I will try to follow it.

I am very happy to hear from you and appreciate what you suggested.

Thanks!

Alice

**sdriscoll** · 04-25-2012, 02:05 PM

No problem. Good luck!

**pengchy** · 11-15-2012, 03:34 AM

The bowtie may not suitable for the alignment of RNA to CDS. The CDS is only one isoform and the exon skiping alternative splicing also produce gapped alignment which are not considered by bowtie. Anyway, tophat is a good choice to align RNA-seq reads to the reference, even the reference is denovo assembled transcripts.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 22 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 24 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 20 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

merge.gtf from cuffmerge

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News