  • annotate cufflink assembled transcripts with reference gtf

    Hello, there,

    1. I did a genome-guided de novo transcript assembly for my RNA-seq data using cufflinks. The .sam file here comes from STAR mapping:

    cufflinks -p 8 /mapping/mapped.sam

    2. I then merged the resulting gtf files from the same tissue into merged.gtf, without including reference.gtf:

    cuffmerge -p 8 gtf.filelist.DeNovo

    3. I tried to find the closest gene ID for those de novo assembled transcripts:

    cuffcompare merged.gtf -r reference.gtf

    What I've found is that none of my de novo assembled transcripts is matched to the reference gtf, even though some introns appear identical between merged.gtf and reference.gtf.

    For example, from the cufflinks merged.gtf, I have:

    more XLOC_005458.gtf
    chr2 Cufflinks exon 25289899 25290661 . . . gene_id "XLOC_005458"; transcript_id "TCONS_00010739"; exon_number "1"; oId "CUFF.5451.1"; tss_id "TSS7438";
    chr2 Cufflinks exon 25290738 25290883 . . . gene_id "XLOC_005458"; transcript_id "TCONS_00010739"; exon_number "2"; oId "CUFF.5451.1"; tss_id "TSS7438";
    chr2 Cufflinks exon 25290976 25291190 . . . gene_id "XLOC_005458"; transcript_id "TCONS_00010739"; exon_number "3"; oId "CUFF.5451.1"; tss_id "TSS7438";
    chr2 Cufflinks exon 25289938 25290082 . . . gene_id "XLOC_005458"; transcript_id "TCONS_00010740"; exon_number "1"; oId "CUFF.5451.2"; tss_id "TSS7438";
    chr2 Cufflinks exon 25290388 25291177 . . . gene_id "XLOC_005458"; transcript_id "TCONS_00010740"; exon_number "2"; oId "CUFF.5451.2"; tss_id "TSS7438";

    from the reference.gtf, I have:
    2 ensembl_havana CDS 25289989 25290661 . + 0 ccds_id "CCDS15763"; exon_number "1"; gene_biotype "protein_coding"; gene_id "ENSMUSG00000026961"; gene_name "Lrrc26"; gene_source "ensembl_havana"; gene_version "6"; havana_gene "OTTMUSG00000011934"; havana_gene_version "1"; havana_transcript "OTTMUST00000028197"; havana_transcript_version "1"; p_id "P45943"; protein_id "ENSMUSP00000028337"; protein_version "6"; tag "basic"; transcript_biotype "protein_coding"; transcript_id "ENSMUST00000028337"; transcript_name "Lrrc26-001"; transcript_source "ensembl_havana"; transcript_support_level "1"; transcript_version "6"; tss_id "TSS86428";

    2 ensembl_havana CDS 25290738 25291057 . + 2 ccds_id "CCDS15763"; exon_number "2"; gene_biotype "protein_coding"; gene_id "ENSMUSG00000026961"; gene_name "Lrrc26"; gene_source "ensembl_havana"; gene_version "6"; havana_gene "OTTMUSG00000011934"; havana_gene_version "1"; havana_transcript "OTTMUST00000028197"; havana_transcript_version "1"; p_id "P45943"; protein_id "ENSMUSP00000028337"; protein_version "6"; tag "basic"; transcript_biotype "protein_coding"; transcript_id "ENSMUST00000028337"; transcript_name "Lrrc26-001"; transcript_source "ensembl_havana"; transcript_support_level "1"; transcript_version "6"; tss_id "TSS86428";

    Apparently the same intron (between 25290661 and 25290738) exists in both the de novo assembled transcript and the reference. So my question is: why is XLOC_005458 from the cufflinks output not matched to Lrrc26 in reference.gtf, even though they cover the same gene region?

    Thanks for any input!

    C.
    Last edited by capricy; 02-01-2017, 08:15 AM.
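
One difference visible in the two excerpts above is the sequence name in column 1: merged.gtf uses "chr2" while reference.gtf uses "2". The sketch below is a hedged way to check whether two GTF files share any sequence names at all, on the assumption (not confirmed anywhere in this thread) that comparison tools match sequence names as literal strings. The toy files, the `seqnames` helper and the `merged.nochr.gtf` name are hypothetical and only reuse the coordinates posted above.

```shell
export LC_ALL=C
workdir=$(mktemp -d)

# Toy stand-ins for the two files, reusing coordinates from the excerpts above.
printf 'chr2\tCufflinks\texon\t25289899\t25290661\t.\t.\t.\tgene_id "XLOC_005458";\n' > "$workdir/merged.gtf"
printf '2\tensembl_havana\tCDS\t25289989\t25290661\t.\t+\t0\tgene_id "ENSMUSG00000026961";\n' > "$workdir/reference.gtf"

# List the distinct sequence names (column 1) of a GTF, skipping comment lines.
seqnames() {
    grep -v '^#' "$1" | cut -f 1 | sort -u
}

seqnames "$workdir/merged.gtf"    > "$workdir/merged.names"
seqnames "$workdir/reference.gtf" > "$workdir/reference.names"

# Names present in both files; an empty result means nothing can match by name.
shared=$(comm -12 "$workdir/merged.names" "$workdir/reference.names")
echo "shared sequence names: ${shared:-none}"

# One possible harmonisation (an assumption, not the poster's method):
# strip a leading "chr" from the assembled GTF and check again.
sed 's/^chr//' "$workdir/merged.gtf" > "$workdir/merged.nochr.gtf"
shared_after=$(seqnames "$workdir/merged.nochr.gtf" | comm -12 - "$workdir/reference.names")
echo "shared after renaming: ${shared_after:-none}"
```

Stripping "chr" is only a rough fix: the mitochondrial sequence, for instance, is named chrM in UCSC-style files and MT in Ensembl-style files, so the name lists are worth inspecting before renaming anything.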

  • #2
    RNA molecules can degrade, so the ends of assembled transcripts may not match the reference. Introns, however, are identified from splice junctions, which often fall in the middle of the reads, so they are more likely to be identified correctly. If you want all genes to be assembled much as they appear in the reference, you might need higher sequencing depth and/or higher-quality data.



    • #3
      Then what is an easy way to annotate those assembled transcripts? I mean, I would like to find the closest reference gene IDs for the transcripts.

      Thanks.

      C.



      • #4
        Hi Capricy,

        You can supply your gene annotation (reference.gtf) to Cufflinks during assembly, using the -g argument.
        Alternatively, you can use bedtools intersect to overlap and combine your merged.gtf and reference.gtf; see the bedtools intersect documentation. You need to convert the gtf files into bed files for this method.


        I hope this helps,



        • #5
          According to the cuffcompare documentation, if I use -r <reference.gtf>, the output should identify the overlapping transfrags. But it did not in my case.

          I just wonder whether there is something wrong with my steps.

          C.



          • #6
            I didn't use -g since I would only like to see the de novo assembled transfrags.



            • #7
              Then it seems that the easiest way for you is to use bedtools.

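              You can convert a gtf file to a bed file using:
              Code:
              awk 'BEGIN{FS=OFS="\t"} !/^#/ {print $1, $4-1, $5, $9}' yourfile.gtf > yourfile.bed
              This extracts the 1st, 4th, 5th and 9th columns from the gtf file and writes them to a new file. The start is reduced by 1 because gtf coordinates are 1-based while bed starts are 0-based, and comment lines are skipped.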

              Then you can use bedtools intersect to overlap the two files.
              The -loj and -wao options look well suited to your case, since both also report transcripts that have no overlap in the reference. You can take a look.
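
As a rough sketch of the two steps described in this post, the script below converts toy GTF records to BED and then runs `bedtools intersect -loj`. GTF intervals are 1-based and inclusive while BED intervals are 0-based and half-open, so the conversion subtracts 1 from the start. The toy files, the `gtf_to_bed` helper and the `annotated.tsv` output name are hypothetical, and the toy data assumes both files already use the same chromosome name ("2"). The intersect step runs only if bedtools is installed.

```shell
export LC_ALL=C
workdir=$(mktemp -d)

# Toy inputs based on the first exon/CDS shown in the thread (same chromosome name assumed).
printf '2\tCufflinks\texon\t25289899\t25290661\t.\t.\t.\tgene_id "XLOC_005458"; transcript_id "TCONS_00010739";\n' > "$workdir/merged.gtf"
printf '2\tensembl_havana\tCDS\t25289989\t25290661\t.\t+\t0\tgene_id "ENSMUSG00000026961"; gene_name "Lrrc26";\n' > "$workdir/reference.gtf"

# GTF (1-based, inclusive) -> 4-column BED (0-based start, half-open end).
# Column 4 keeps the full GTF attribute string so gene IDs survive the join.
gtf_to_bed() {
    awk 'BEGIN { FS = OFS = "\t" } !/^#/ { print $1, $4 - 1, $5, $9 }' "$1"
}

gtf_to_bed "$workdir/merged.gtf"    > "$workdir/merged.bed"
gtf_to_bed "$workdir/reference.gtf" > "$workdir/reference.bed"

if command -v bedtools >/dev/null 2>&1; then
    # -loj: left outer join, so every assembled feature is reported,
    # with placeholder columns when nothing in the reference overlaps it.
    bedtools intersect -loj -a "$workdir/merged.bed" -b "$workdir/reference.bed" > "$workdir/annotated.tsv"
    cat "$workdir/annotated.tsv"
else
    echo "bedtools not found; skipping the intersect step" >&2
fi
```

Replacing `-loj` with `-wao` would additionally append the number of overlapping bases to each line, which can help when choosing the closest reference gene among several candidates.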
