  • Finding novel genes using Cufflinks/Cuffmerge

    I am trying to identify novel genes from my RNA-seq data using Cufflinks. I ran Cufflinks on each of my bam files created by Tophat. Then I used Cuffmerge to combine the transcripts.gtf files generated by Cufflinks. The resulting merged.gtf file does not contain any transcripts, just the exons. Is this a known problem with Cuffmerge? Will it be fixed anytime soon? Am I doing something wrong?

  • #2
    Here is some additional info

    Below is the run.log file:
    /usr/local/bioinfo/cufflinks-2.1.1.Linux_x86_64/cuffmerge -o /rice2/cuffmerge4 /rice2/cuffmerge4/transcripts_gtf.list
    gtf_to_sam -F /rice2/control/7961X1/cufflinks3/transcripts.gtf /rice2/cuffmerge4/tmp/gtf2sam_file4OrNLa
    gtf_to_sam -F /rice2/aba_treated/7961X2/cufflinks3/transcripts.gtf /rice2/cuffmerge4/tmp/gtf2sam_file4kSObI
    gtf_to_sam -F /rice2/ga_treated/8198X1/cufflinks3/transcripts.gtf /rice2/cuffmerge4/tmp/gtf2sam_file4z0eHq
    gtf_to_sam -F /rice2/aba_ga_treated/8198X2/cufflinks3/transcripts.gtf /rice2/cuffmerge4/tmp/gtf2sam_fileOQfcsk
    sort -k 3,3 -k 4,4n --temporary-directory=/rice2/cuffmerge4/tmp/ /rice2/cuffmerge4/tmp/gtf2sam_file4OrNLa /rice2/cuffmerge4/tmp/gtf2sam_file4kSObI /rice2/cuffmerge4/tmp/gtf2sam_file4z0eHq /rice2/cuffmerge4/tmp/gtf2sam_fileOQfcsk > /rice2/cuffmerge4/tmp/mergeSam_fileUgYASo
    cufflinks -o /rice2/cuffmerge4/ -F 0.05 -q --overhang-tolerance 200 --library-type=transfrags -A 0.0 --min-frags-per-transfrag 0 --no-5-extend -p 1 /rice2/cuffmerge4/tmp/mergeSam_fileUgYASo
    cuffcompare -o tmp_meta_asm -C -G /rice2/cuffmerge4//transcripts.gtf
    cuffcompare -o tmp_meta_asm -C -G /rice2/cuffmerge4//merged.gtf


    Only a single output file is generated and it is named "merged.gtf".
    Below are the first 10 lines of the merged.gtf file. Note that the third column says "exon" throughout the entire file.
    chr01 Cufflinks exon 3354 3616 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.2.1"; tss_id "TSS1";
    chr01 Cufflinks exon 4357 4458 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; oId "CUFF.2.1"; tss_id "TSS1";
    chr01 Cufflinks exon 7133 7944 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.4.1"; tss_id "TSS2";
    chr01 Cufflinks exon 8028 8156 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; oId "CUFF.4.1"; tss_id "TSS2";
    chr01 Cufflinks exon 27054 27292 . + . gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "1"; oId "CUFF.25.1"; tss_id "TSS3";
    chr01 Cufflinks exon 27370 27894 . + . gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "2"; oId "CUFF.25.1"; tss_id "TSS3";
    chr01 Cufflinks exon 29682 29976 . + . gene_id "XLOC_000004"; transcript_id "TCONS_00000004"; exon_number "1"; oId "CUFF.27.1"; tss_id "TSS4";
    chr01 Cufflinks exon 30146 30400 . + . gene_id "XLOC_000004"; transcript_id "TCONS_00000004"; exon_number "2"; oId "CUFF.27.1"; tss_id "TSS4";
    chr01 Cufflinks exon 32716 32908 . + . gene_id "XLOC_000005"; transcript_id "TCONS_00000005"; exon_number "1"; oId "CUFF.34.1"; tss_id "TSS5";
    chr01 Cufflinks exon 33277 34486 . + . gene_id "XLOC_000005"; transcript_id "TCONS_00000005"; exon_number "2"; oId "CUFF.34.1"; tss_id "TSS5";

    If anybody can help me, your efforts will be appreciated.
    Thanks in advance.



    • #3
      novel INDELs, splice variants, transcripts

      Dear everybody,

      I have been running the tophat->cufflinks->cuffcompare software in order to find new INDELs, splice variants, and transcripts. This is a screenshot of my cuffcompare results:

      ref_gene_id ref_id class_code cuff_gene_id cuff_id FMI FPKM FPKM_conf_lo FPKM_conf_hi cov len major_iso_id ref_match_len
      NM_011541 NM_011541 p CUFF.2 CUFF.2.1 100 0.359342 0.150146 0.540524 2.236101 1081 CUFF.2.1 2671
      NM_001177795 NM_001177795 p CUFF.3 CUFF.3.1 100 0.498016 0.298153 0.689480 3.034472 1742 CUFF.3.1 2125
      NM_001177658 NM_001177658 = CUFF.1 CUFF.1.1 100 19.493129 17.552255 21.434003 115.930347 947 CUFF.1.1 4201
      NM_008866 NM_008866 = CUFF.4 CUFF.4.1 100 11.250214 9.844004 11.981710 68.637326 2460 CUFF.4.1 2433
      - - u CUFF.6 CUFF.6.1 100 1.115727 0.931503 1.305515 6.851548 4600 CUFF.6.1 -
      NM_021374 NM_021374 = CUFF.5 CUFF.5.1 100 1.345565 0.996552 1.666366 7.926568 1987 CUFF.5.1 1778
      NM_001159750 NM_001159750 = CUFF.7 CUFF.7.1 19 4.022620 3.373178 4.662116 24.433896 2545 CUFF.7.2 2668
      NM_011541 NM_011541 = CUFF.7 CUFF.7.2 100 21.482097 20.103894 22.897623 130.484947 2548 CUFF.7.2 2671
      NM_133826 NM_133826 = CUFF.8 CUFF.8.1 100 23.449073 22.067371 24.844060 143.647096 2621 CUFF.8.1 1976
      - - u CUFF.9 CUFF.9.1 100 0.425803 0.172667 0.656136 2.649674 940 CUFF.9.1 -
      - - u CUFF.10 CUFF.10.1 100 0.539862 0.335615 0.727167 3.101017 1741 CUFF.10.1 -
      - - u CUFF.11 CUFF.11.1 100 0.729683 0.292007 1.119361 3.973062 667 CUFF.11.1 -
      - - u CUFF.12 CUFF.12.1 100 5.273319 1.347881 4.043642 25.459617 289 CUFF.12.1 -
      NM_021511 NM_021511 = CUFF.13 CUFF.13.1 100 11.054199 9.417533 11.546153 66.036162 2013 CUFF.13.1 2048
      NM_183028 NM_183028 = CUFF.14 CUFF.14.1 100 13.063534 12.150132 13.717891 79.539141 5135 CUFF.14.1 5232
      NM_009826 NM_009826 j CUFF.15 CUFF.15.1 89 8.605083 7.931097 9.289027 52.128971 5126 CUFF.15.2 7046
      NM_009826 NM_009826 = CUFF.15 CUFF.15.2 100 9.674768 9.110064 10.277111 58.609051 7493 CUFF.15.2 7046
      NM_177547 NM_177547 c CUFF.17 CUFF.17.1 100 0.817438 0.497571 1.172846 4.754981 1091 CUFF.17.1 5408
      NR_024067 NR_024067 j CUFF.18 CUFF.18.1 18 3.811516 3.105650 4.526064 20.487205 1526 CUFF.18.1 407
      NR_024067 NR_024067 = CUFF.18 CUFF.18.2 100 21.203930 16.631886 25.800746 113.972831 329 CUFF.18.1 407
      NM_001285425 NM_001285425 j CUFF.19 CUFF.19.1 100 0.867241 0.668385 1.069417 5.237056 4063 CUFF.19.2 3746
      NM_001285425 NM_001285425 j CUFF.19 CUFF.19.2 79 0.682889 0.490159 0.872317 4.123800 4085 CUFF.19.2 3746

      Now, I understand that in the 3rd column those genes with a letter j represent new transcripts. Do these include:

      a) new INDELs?
      b) new splice variants?
      c) new transcripts?

      One of my big questions is, if a new gene (one marked with j) has a Refseq id, such as NM_001285425, then why is it a new gene? I mean, if it already has a Refseq id, then why is it new? Doesn't it count as already having been discovered?

      Thanks!



      • #4
        kwatts59

        First, if a reference annotation for your organism already exists, add it to the cuffmerge run, so that you can tell which transcripts have already been annotated and distinguish known from novel transcripts. If you gave cuffmerge a reference annotation, you will be able to distinguish the novel transcripts by their gene id, which will always start with XLOC_. Known genes will be identified by the gene id specified in the reference annotation.

        Second, the GTF file created by cuffmerge is merely an annotation file, identifying the transcripts (transcript_id) and their exons (exon_number). To quantify transcript abundance, run cuffdiff with the GTF file created by cuffmerge.
        Last edited by blancha; 04-13-2014, 08:07 PM.



        • #5
          That gtf is showing genes and transcripts. You should read about the gtf format. There are many guides, here’s one: http://cufflinks.cbcb.umd.edu/gff.html

          Basically, in each exon line, the transcript and gene are both specified in the last column by the 'gene_id "XLOC_000001"; transcript_id "TCONS_00000001"' bit, which sets up the parent/child relationship between the exon, transcript and gene.
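          To make that parent/child relationship concrete, here is a minimal Python sketch that rebuilds the gene -> transcript -> exon hierarchy from exon lines alone. The function name group_exons is hypothetical, and the sample lines are the first four lines of the merged.gtf posted earlier in this thread (pasted with spaces; a real GTF is tab-separated, which the whitespace split below also handles as long as no column is empty).

```python
import re
from collections import defaultdict

# Matches attribute pairs such as: gene_id "XLOC_000001";
ATTR_RE = re.compile(r'(\S+) "([^"]*)";')


def group_exons(gtf_lines):
    """Return {gene_id: {transcript_id: [(start, end), ...]}} from GTF exon lines."""
    genes = defaultdict(lambda: defaultdict(list))
    for line in gtf_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Eight fixed columns, then the free-form attribute column.
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[2] != "exon":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        start, end = int(fields[3]), int(fields[4])
        genes[attrs["gene_id"]][attrs["transcript_id"]].append((start, end))
    return genes


# Sample lines copied from the merged.gtf excerpt in this thread.
sample = [
    'chr01 Cufflinks exon 3354 3616 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.2.1"; tss_id "TSS1";',
    'chr01 Cufflinks exon 4357 4458 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; oId "CUFF.2.1"; tss_id "TSS1";',
    'chr01 Cufflinks exon 7133 7944 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.4.1"; tss_id "TSS2";',
    'chr01 Cufflinks exon 8028 8156 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; oId "CUFF.4.1"; tss_id "TSS2";',
]

grouped = group_exons(sample)
for gene_id, transcripts in grouped.items():
    for transcript_id, exons in transcripts.items():
        print(gene_id, transcript_id, exons)
```

          So even though every line says "exon" in the third column, each transcript is simply the set of exon lines that share a transcript_id, and each gene is the set of transcripts that share a gene_id.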

          As a general comment, I would say cufflinks is very loose in finding novel transcripts when you run the whole RABT mode pipeline. So, I’d suggest setting some strict parameters.



          • #6
            Novel transcripts in cuffcompare data

            Hello all,

            How can I identify novel transcripts when I run cuffcompare? These are the commands I ran:
            tophat -o output arabidopsis.fa file1_R1.fq file1_R2.fq
            cufflinks -o output accepted_hits.bam
            cuffmerge -s arabidopsis.fa assemblies.txt
            assemblies.txt(transcripts_1.gtf........transcripts_n.gtf)
            cuffcompare -s arabidopsis.fa -r known_annotation.gtf merged.gtf

            When I run this command I don't get any FPKM values in the output file. Can anyone suggest how I can identify novel transcripts?
            Here is the output file (cuff_compare.merged.gtf.tmap):
            ref_gene_id ref_id class_code cuff_gene_id cuff_id FMI FPKM FPKM_conf_lo FPKM_conf_hi cov len major_iso_id ref_match_len
            ANAC001 AT1G01010.1 = XLOC_000001 TCONS_00000002 0 0.000000 0.000000 0.000000 0.000000 1694 TCONS_00000002 1688
            ANAC001 AT1G01010.1 j XLOC_000001 TCONS_00000001 0 0.000000 0.000000 0.000000 0.000000 1674 TCONS_00000002 1688
            DCL1 AT1G01040.1 j XLOC_000002 TCONS_00000004 0 0.000000 0.000000 0.000000 0.000000 6611 TCONS_00000004 6251
            DCL1 AT1G01040.1 = XLOC_000002 TCONS_00000003 0 0.000000 0.000000 0.000000 0.000000 6251 TCONS_00000004 6251
            DCL1 AT1G01040.2 = XLOC_000002 TCONS_00000005 0 0.000000 0.000000 0.000000 0.000000 5984 TCONS_00000004 5877
            AT1G01073 AT1G01073.1 = XLOC_000003 TCONS_00000006 0 0.000000 0.000000 0.000000 0.000000 111 TCONS_00000006 111
            IQD18 AT1G01110.2 = XLOC_000004 TCONS_00000007 0 0.000000 0.000000 0.000000 0.000000 1782 TCONS_00000007 1782
            AT1G01115 AT1G01115.1 = XLOC_000005 TCONS_00000008 0 0.000000 0.000000 0.000000 0.000000 117 TCONS_00000008 117
            GIF2 AT1G01160.1 = XLOC_000006 TCONS_00000009 0 0.000000 0.000000 0.000000 0.000000 1045 TCONS_00000010 1045
            GIF2 AT1G01160.2 = XLOC_000006 TCONS_00000010 0 0.000000 0.000000 0.000000 0.000000 1129 TCONS_00000010 1129
            AT1G01180 AT1G01180.1 = XLOC_000007 TCONS_00000011 0 0.000000 0.000000 0.000000 0.000000 1176 TCONS_00000011 1176
            MIR165A AT1G01183.1 x XLOC_000008 TCONS_00000012 0 0.000000 0.000000 0.000000 0.000000 651 TCONS_00000012 101
            F6F3.2 AT1G01210.1 = XLOC_000009 TCONS_00000013 0 0.000000 0.000000 0.000000 0.000000 616 TCONS_00000013 616
            FKGP AT1G01220.1 = XLOC_000010 TCONS_00000014 0 0.000000 0.000000 0.000000 0.000000 3532 TCONS_00000014 3532
            Last edited by am@i; 04-15-2014, 04:52 AM.



            • #7
              @am@i: As far as I can tell, there are no novel transcripts in the output you've posted. They all have a ref_gene_id, meaning that all the transcripts you've posted were found in your reference annotation file. I've been wrong before, though.
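              The check described above (does a row have a ref_gene_id or not) can be sketched in Python as follows. This is only an illustration: the function names parse_tmap and without_reference are hypothetical, the column names are taken from the .tmap header posted in this thread, and the sample rows are copied from posts in this thread (the last one from a later post, included to show a row with no reference).

```python
# Column names as they appear in the .tmap header posted in this thread.
TMAP_COLUMNS = [
    "ref_gene_id", "ref_id", "class_code", "cuff_gene_id", "cuff_id",
    "FMI", "FPKM", "FPKM_conf_lo", "FPKM_conf_hi", "cov", "len",
    "major_iso_id", "ref_match_len",
]


def parse_tmap(lines):
    """Turn whitespace-separated .tmap lines into a list of dicts, skipping the header."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) != len(TMAP_COLUMNS) or fields[0] == "ref_gene_id":
            continue
        rows.append(dict(zip(TMAP_COLUMNS, fields)))
    return rows


def without_reference(rows):
    """Return cuff_id values of rows whose ref_gene_id is '-' (no reference match)."""
    return [row["cuff_id"] for row in rows if row["ref_gene_id"] == "-"]


sample = [
    " ".join(TMAP_COLUMNS),
    "ANAC001 AT1G01010.1 = XLOC_000001 TCONS_00000002 0 0.000000 0.000000 0.000000 0.000000 1694 TCONS_00000002 1688",
    "ANAC001 AT1G01010.1 j XLOC_000001 TCONS_00000001 0 0.000000 0.000000 0.000000 0.000000 1674 TCONS_00000002 1688",
    "- - u CUFF.1668 CUFF.1668.1 100 8.522216 1.780679 6.825938 15.61758 287 CUFF.1668.1 -",
]

rows = parse_tmap(sample)
print(without_reference(rows))
```

              Rows that do carry a ref_gene_id can still differ from the reference transcript, so the class_code column is worth checking as well.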



              • #8
                Hello Everyone,
                I have been running the tophat->cufflinks->cuffcompare software in order to find novel transcripts. This is part of my cuffcompare results:


                ref_gene_id ref_id class_code cuff_gene_id cuff_id FMI FPKM FPKM_conf_lo FPKM_conf_hi cov len major_iso_id ref_match_len
                AT1G01070 AT1G01070.1 = CUFF.3 CUFF.3.1 100 11.658709 8.811293 12.76999 25.330956 1334 CUFF.3.1 1311
                NGA3 AT1G01030.1 o CUFF.1 CUFF.1.1 100 1.210378 0.680911 1.733229 2.87049 1376 CUFF.1.1 1905
                LHY AT1G01060.3 j CUFF.4 CUFF.4.1 100 3.877988 3.027325 4.70032 8.757048 2318 CUFF.4.1 2517
                LHY AT1G01060.3 j CUFF.4 CUFF.4.2 19 0.745358 0.295704 1.225058 1.683125 2196 CUFF.4.1 2517
                ARV1 AT1G01020.1 c CUFF.2 CUFF.2.1 100 55.954548 13.748855 27.162373 98.898929 254 CUFF.2.1 1623
                ARV1 AT1G01020.1 j CUFF.5 CUFF.5.1 46 3.093444 1.312381 4.874558 6.489805 634 CUFF.5.2 1623
                ARV1 AT1G01020.1 j CUFF.5 CUFF.5.2 52 3.5225 1.734063 5.359831 7.389933 720 CUFF.5.2 1623
                ARV1 AT1G01020.2 c CUFF.5 CUFF.5.3 100 6.796436 4.118432 9.41356 14.258397 614 CUFF.5.2 1085
                ANAC001 AT1G01010.1 j CUFF.6 CUFF.6.1 24 3.075754 1.940899 4.185063 6.97969 1584 CUFF.6.1 1688
                ATRAD51D AT1G07745.1 j CUFF.604 CUFF.604.2 53 2.043285 1.104131 2.918061 4.845783 1080 CUFF.604.1 1188
                F24B9.13 AT1G07750.1 = CUFF.605 CUFF.605.1 100 27.163205 24.111159 30.21525 63.156172 1296 CUFF.605.1 1414
                RPS15A AT1G07770.1 p CUFF.606 CUFF.606.1 100 19.492055 9.645981 17.653966 44.82586 468 CUFF.606.1 725
                RPS15A AT1G07770.1 = CUFF.611 CUFF.611.1 74 64.794364 39.888684 57.433706 149.649577 568 CUFF.611.1 725
                ATMC8 AT1G16420.1 = CUFF.1258 CUFF.1258.1 100 4.306088 2.441968 5.567687 9.482726 872 CUFF.1258.1 1338
                AT1G16515 AT1G16515.1 o CUFF.1259 CUFF.1259.1 100 9.753342 4.017728 10.646979 22.304586 424 CUFF.1259.1 265
                F3O9.31 AT1G16510.1 c CUFF.1261 CUFF.1261.1 100 8.44923 4.4527 9.623577 18.571708 593 CUFF.1261.1 872
                AT1G16480 AT1G16480.1 c CUFF.1260 CUFF.1260.1 100 1.567716 0.589045 2.473987 3.532044 723 CUFF.1260.1 2814
                F3O9.30 AT1G16500.1 = CUFF.1262 CUFF.1262.1 100 27.439232 19.789247 28.908715 64.270531 934 CUFF.1262.1 1005
                AT1G16550 AT1G16550.1 o CUFF.1265 CUFF.1265.1 100 2.576989 1.439353 3.640717 5.378117 1006 CUFF.1265.1 2303
                F3O9.32 AT1G16520.1 j CUFF.1263 CUFF.1263.1 57 3.932072 2.676381 5.152034 9.010123 1273 CUFF.1263.1 1291
                F3O9.32 AT1G16520.1 c CUFF.1263 CUFF.1263.2 100 6.897696 2.557242 9.245414 15.805685 433 CUFF.1263.1 1291
                SIR3 AT1G16540.1 j CUFF.1267 CUFF.1267.1 91 2.995379 2.221049 3.786707 6.445242 2519 CUFF.1267.2 2758
                SIR3 AT1G16540.1 j CUFF.1267 CUFF.1267.2 100 3.278397 2.452575 4.087624 7.05422 2576 CUFF.1267.2 2758
                AT1G16489 AT1G16489.1 j CUFF.1264 CUFF.1264.1 100 2.208782 1.080485 3.349502 5.177359 968 CUFF.1266.1 412
                AT1G16489 AT1G16489.1 e CUFF.1266 CUFF.1266.1 100 1.80018 0.957649 2.633535 4.219601 1247 CUFF.1266.1 412
                SR45 AT1G16610.1 o CUFF.1269 CUFF.1269.1 100 114.428505 11.655641 26.449338 162.070939 190 CUFF.1269.1 1560
                - - u CUFF.1668 CUFF.1668.1 100 8.522216 1.780679 6.825938 15.61758 287 CUFF.1668.1 -


                How can I identify novel transcripts from my output files?
                Thank you for your help,
                Amrita
                Last edited by am@i; 04-16-2014, 12:54 AM.



                • #9
                  The ones marked with j in the 3rd column are new transcripts, according to the cuffcompare manual.



                  • #10
                    csmatyi's answer is better than my previous answer.
                    Here is the full list of class codes from the manual.

                    ---

                    Class Codes

                    If you ran cuffcompare with the -r option, tracking rows will contain the following values. If you did not use -r, the rows will all contain "-" in their class code column.
                    Priority Code Description
                    1 = Complete match of intron chain
                    2 c Contained
                    3 j Potentially novel isoform (fragment): at least one splice junction is shared with a reference transcript
                    4 e Single exon transfrag overlapping a reference exon and at least 10 bp of a reference intron, indicating a possible pre-mRNA fragment.
                    5 i A transfrag falling entirely within a reference intron
                    6 o Generic exonic overlap with a reference transcript
                    7 p Possible polymerase run-on fragment (within 2Kbases of a reference transcript)
                    8 r Repeat. Currently determined by looking at the soft-masked reference sequence and applied to transcripts where at least 50% of the bases are lower case
                    9 u Unknown, intergenic transcript
                    10 x Exonic overlap with reference on the opposite strand
                    11 s An intron of the transfrag overlaps a reference intron on the opposite strand (likely due to read mapping errors)
                    12 . (.tracking file only, indicates multiple classifications)
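                    As a rough illustration of how the class codes above can be used, the Python sketch below tallies the codes in a .tmap file and lists candidate transcripts. Which codes count as "candidates" is an assumption made here for the example (j and u); the names CANDIDATE_CODES and summarize are hypothetical, and the sample rows are copied from the cuffcompare output posted earlier in this thread.

```python
from collections import Counter

# Short descriptions taken from the class code list above.
CLASS_CODES = {
    "=": "Complete match of intron chain",
    "c": "Contained",
    "j": "Potentially novel isoform",
    "e": "Possible pre-mRNA fragment",
    "i": "Entirely within a reference intron",
    "o": "Generic exonic overlap",
    "p": "Possible polymerase run-on fragment",
    "r": "Repeat",
    "u": "Unknown, intergenic transcript",
    "x": "Exonic overlap on the opposite strand",
    "s": "Intron overlap on the opposite strand",
}

# Assumption for this example only: treat j and u as candidates worth a closer look.
CANDIDATE_CODES = {"j", "u"}


def summarize(tmap_lines):
    """Count class codes and collect (cuff_id, code) pairs for candidate rows."""
    counts = Counter()
    candidates = []
    for line in tmap_lines:
        fields = line.split()
        if len(fields) < 5 or fields[0] == "ref_gene_id":
            continue  # skip blank lines and the header
        code, cuff_id = fields[2], fields[4]
        counts[code] += 1
        if code in CANDIDATE_CODES:
            candidates.append((cuff_id, code))
    return counts, candidates


sample = [
    "AT1G01070 AT1G01070.1 = CUFF.3 CUFF.3.1 100 11.658709 8.811293 12.76999 25.330956 1334 CUFF.3.1 1311",
    "NGA3 AT1G01030.1 o CUFF.1 CUFF.1.1 100 1.210378 0.680911 1.733229 2.87049 1376 CUFF.1.1 1905",
    "LHY AT1G01060.3 j CUFF.4 CUFF.4.1 100 3.877988 3.027325 4.70032 8.757048 2318 CUFF.4.1 2517",
    "ARV1 AT1G01020.1 c CUFF.2 CUFF.2.1 100 55.954548 13.748855 27.162373 98.898929 254 CUFF.2.1 1623",
    "- - u CUFF.1668 CUFF.1668.1 100 8.522216 1.780679 6.825938 15.61758 287 CUFF.1668.1 -",
]

counts, candidates = summarize(sample)
for code, n in counts.most_common():
    print(code, n, CLASS_CODES.get(code, "unrecognized code"))
print(candidates)
```

                    A j or u code only marks a transcript as a candidate; it does not by itself show that the isoform or gene is real.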

                    Last edited by blancha; 04-17-2014, 12:52 PM. Reason: Added information



                    • #11
                      Thanks for the reply!
                      I have another question.

                      When I compare the assembled transcripts with a reference annotation using cuffcompare, one of the output files is cuff_in.tmap. The class code column in this file shows the relationship between each Cufflinks transcript and the reference transcripts. When the class code is j, the transcript is a potentially novel isoform, but how can we validate that it actually is a novel isoform?
