Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • merge.gtf from cuffmerge

    Hi all,

    I used cufflinks on a small amount of reads and a part of reference sequence to perform transcript assembly. After running cufflinks and cuffmerge, I got a merge.gtf file finally. From the manual, it sounds this is the final output of the assembly. But I am a little confused when looking at this output file. Here is the first line of it:

    Code:
    chr1	Cufflinks	exon	1	3516	.	.	.	gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.4.1"; tss_id "TSS1";
    My question is, does it mean the newly assembled exon locates on the reference genome's chr1: 1-3516?

    But I don't think that the new transcript would contain exactly identical exon sequence with the reference. Was my understanding wrong?

    I appreciate if anyone can help to clarify.

    Alice
    Last edited by doublealice; 04-23-2012, 01:19 PM.

  • #2
    did you only run a small amount of reads so the run wouldn't take too long?

    in this case it might be helpful if you could paste in the exact commands you ran because the options you specified along the way can change the content of merged.gtf.

    you are correct in reading that line from merged.gtf - that line says there is an exon found over the region chr1:1-3516. i'm not sure what you were expecting to find though - please clarify.
    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
    Salk Institute for Biological Studies, La Jolla, CA, USA */

    Comment


    • #3
      @sdriscoll, thanks very much for your reply.

      I just finished the benchwork on RNA extraction and submitted all samples to a seq facility for RNA-seq. While I am waiting the result, I started to try these popular assemblers.

      As you said, I used a small amount of reads and reference so that the run wouldn't take too long. All my testing data were downloaded from web resource.

      In stead of whole transcriptome assembly, I am only interested in 20 genes in my studied species A. I therefore pickup the orthologs of these 20 genes (CDS sequence) from very closed species B, C and D to comprise a reference transcriptome. I also downloaded RNA-seq data of species B from NCBI SRA. My working flow is:

      1. Use bowtie2 maps reads to reference. I didn't use tophat because my reference is CDS and no need to decide splice site.

      bowtie2 -t -p 12 --sensitive -x ref_index -1 read_1_fastq -2 read_2_fastq -S output.sam

      2. Convert to bam, run cufflinks.

      cufflinks -p 12 -o cuff_1 output.sorted.bam

      3. I have six pair of read files, so run above steps 6 times, then cuffmerge
      cuffmerge -p 12 -s ref.fa assemblies.txt

      Then I got above merge.gtf. I just simply changed reference gene names to chr1. The real file is:

      gene1_fromB Cufflinks exon 1 3516
      gene2_fromC Cufflinks exon 425 753
      gene3_fromD Cufflinks exon 830 1200
      gene3_fromD Cufflinks exon 1671 2045
      ...

      I don't know if people use normal reference genome perform cufflinks assembly, would they get a similar merge.gtf. I am curious why the newly assembled exons have identical sequence with reference. I expect new outputs may have some mismatches, and indels in some regions. It should be a fasta sequence file instead of a gtf table based on reference.

      In other words, does it mean cufflinks only assembles an exon which may have exact same sequence with reference?

      Please correct me if my thought is not proper. Thanks!

      Alice
      Last edited by doublealice; 04-24-2012, 08:18 AM.

      Comment


      • #4
        I think I see. The full cufflinks to cuffmerge pipeline is designed and only really useful for full genome alignments. So you'd align your RNA-Seq data to the full genome. Then cufflinks can use it's magic to assemble the islands of piled up reads into transcripts followed by cuffmerge building a consensus annotation, all in terms of the genome to which you aligned your data.

        So that last step where you swapped in the chromosome name in place of the original reference names from your reference isn't correct. Since you did not align your reads to the genome then everything has to remain in terms of your original reference.

        I guess to figure out what the "right" thing for you to do would be I need to know what your goal is. Are you looking for differential expression or are you looking to discover splice variations of your 20 genes? Also what species are you working with?
        /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
        Salk Institute for Biological Studies, La Jolla, CA, USA */

        Comment


        • #5
          Thanks!

          I didn't edit anything on my reference names when I performed the analysis. Just when I posted this thread, I guessed that using gene name in the first column may cause more confusion and whatever name it was may not be the point, so simply used chr1 instead. Sorry for causing more misunderstandings.

          I am working with four species of gymnosperms, none of them has complete genome. My goal is to identify 20 genes from these gymonsperms' RNA-seq. Some of them have EST available online and some of them are quite conserved through all plants. I therefore consider to collect those EST and orthologs from well studied species, e.g, arabidopsis, as reference to perform assembly.

          From what you said, my impression is cufflinks are good at whole genome / transcriptom assembly. For my such small amount data, cufflinks may not a good choice. Is it correct? What kind of other tools I should try?

          I am thinking of de novo assembly, e.g. trinity or oases. But it does a challenge to my computer.

          Thanks again for kind help.

          Alice
          Last edited by doublealice; 04-24-2012, 11:51 AM.

          Comment


          • #6
            So it sounds like you want something like this:

            RNA-Seq reads -> [some pipeline] -> transcriptome assembly

            if that's right then what you need are tools for novel transcriptome assembly. something that essentially aligns your RNA-Seq reads to each other rather than to a reference. similar to how they build genomes from sequencing data. if that's right then bowtie isn't what you want. Cufflinks seeks to explain reads that are aligned to a complete genome. it "discovers" genes by looking at their alignments to a genome and annotates them as a GTF file which just gives you start/end coordinates for exons.

            What you want to do is assemble reads into a reference (or something that you could in fact align reads to).

            does that sound right?

            so something like this would be relevant: http://www.nature.com/nbt/journal/v2.../nbt.1883.html
            /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
            Salk Institute for Biological Studies, La Jolla, CA, USA */

            Comment


            • #7
              Thanks! I think trinity is probably the next assembler I should try. The reason I didn't want to use it at the beginning is the demand of computer resource. To my knowledge, trinity cannot separate millions reads into several parts, work on them one by one then merge the result finally. In that case, my computer maybe not powerful enough to finish that kind of job. I don't have confidence to guess if my job was still working very very slowly or crashed already after 350 hours waiting. I maybe wrong.

              It is true, no reference genome for my studied species. But, I do have reference genes for my interested targets. To my understanding, theoretically, these genes could be used as a template to fish out reads from millions read pool, then assemble to a complete full-length gene. Compare to work on whole genome, is it faster? I found someone else ask similar question, but no respond. Probably this kind of question is too "naive" or too "beginning"??

              Thanks again for the helps.

              Alice
              Last edited by doublealice; 04-25-2012, 07:42 AM.

              Comment


              • #8
                i think it's more because most people aren't doing what you are doing and there isn't a coherent pipeline for it. the only reason i can give you any advice is because a friend of mine was working with frog data and wanted to do something similar. he too misinterpreted what cufflinks does thinking it would assemble reads into ESTs without a reference genome. He ended up having a friend assemble his reads into ESTs for him using this software: http://www.mybiosoftware.com/sequence-analysis/1282.

                I think what you want to do is this. Run your collected data through an assembler, one that can "assemble reads without a reference genome". I know there are more than just Trinity. I think something called Abyss does this as well. The output of that analysis should be a long list of "genes" in FASTA format. You'll need to compare this list to your reference genes so that you can identify your genes of interest in the list of assembled genes. Use something like BLAST for that where you can do pairwise blasts and look for the best matches. Keep in mind that the assembler may have assembled several versions of the same gene so you might get perfect partial matches which might indicate a new isoform of your reference gene.

                Once you have identified your reference genes in the assembled genes you should have a much longer list of genes to build your bowtie index from. You can build a Bowtie reference from this list (a single large FASTA file where each record has a unique gene name) and then align your raw reads to that reference. You can then parse the alignments to quantify the numbers of reads aligning to each of your reference genes and, finally, use something like DESeq to quantify differential expression across your samples.

                It's a long pipeline. You'll need to know some R, Bash, and possibly Perl or Python to get through it. You can use the online BLAST or, if you are so inclined, you can download and install your own local copy so that you can do pairwise blasts between your reference genes and the assembled genes from Trinity (or whichever software you use) from a Bash script or something.

                I think that's basically what my friend did with his frog data and it seemed to work well.
                /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                Salk Institute for Biological Studies, La Jolla, CA, USA */

                Comment


                • #9
                  Thanks for giving such an elaborate explanation. I guess I may misinterpret the usage of cufflinks as well. I just want to "avoid" the whole genome assembly but seems it is not possible at the moment. The method your friend used is practical and I will try to follow it.

                  I am very happy to hear from you and appreciate what you suggested.

                  Thanks!

                  Alice
                  Last edited by doublealice; 04-25-2012, 02:03 PM.

                  Comment


                  • #10
                    No problem. Good luck!
                    /* Shawn Driscoll, Gene Expression Laboratory, Pfaff
                    Salk Institute for Biological Studies, La Jolla, CA, USA */

                    Comment


                    • #11
                      The bowtie may not suitable for the alignment of RNA to CDS. The CDS is only one isoform and the exon skiping alternative splicing also produce gapped alignment which are not considered by bowtie. Anyway, tophat is a good choice to align RNA-seq reads to the reference, even the reference is denovo assembled transcripts.

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Current Approaches to Protein Sequencing
                        by seqadmin


                        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                        04-04-2024, 04:25 PM
                      • seqadmin
                        Strategies for Sequencing Challenging Samples
                        by seqadmin


                        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                        03-22-2024, 06:39 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, 04-11-2024, 12:08 PM
                      0 responses
                      22 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 10:19 PM
                      0 responses
                      24 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 09:21 AM
                      0 responses
                      20 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-04-2024, 09:00 AM
                      0 responses
                      52 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X