  • Cufflinks and annotation file

    Hi,
    I'm using Cufflinks and I have a serious problem when I use the annotation file.
    In particular:

    1) If my input is only the BAM file (cufflinks_0.9.3.Linux_x86_64 -v -p 1 -Q 0 -I 300000 --library-type fr-unstranded --num-importance-samples 1000 --max-mle-iterations 5000 -a 0.01 -j 0.05 -F 0.05 --min-frags-per-transfrag 10 "accepted_hits.sorted.bam"), my outputs are genes.expr, transcripts.expr and transcripts.gtf, as shown below:

    genes.expr

    gene_id   bundle_id  chr   left      right     FPKM     FPKM_conf_lo  FPKM_conf_hi  status
    CUFF.329  35842      chr1  11968237  11968315  923.039  862.276       983.802       OK
    CUFF.333  35844      chr1  11969913  11969985  1791.31  1706.66       1875.96       OK
    CUFF.631  35993      chr1  22973756  22974004  287.909  253.973       321.845       OK
    CUFF.661  36008      chr1  23696746  23696781  30806.2  30455.2       31157.3       OK
    CUFF.807  36081      chr1  28160911  28160947  87284.4  86693.5       87875.3       OK
    CUFF.835  36095      chr1  28833876  28834087  1740.4   1656.96       1823.83       OK

    transcripts.expr

    trans_id    bundle_id  chr   left      right     FPKM     FMI  frac  FPKM_conf_lo  FPKM_conf_hi  coverage  length  effective_length  status
    CUFF.329.1  35842      chr1  11968237  11968315  923.039  1    1     862.276       983.802       6.29217   78      44                OK
    CUFF.333.1  35844      chr1  11969913  11969985  1791.31  1    1     1706.66       1875.96       12.211    72      38                OK
    CUFF.631.1  35993      chr1  22973756  22974004  287.909  1    1     253.973       321.845       1.96262   248     214               OK
    CUFF.661.1  36008      chr1  23696746  23696781  30806.2  1    1     30455.2       31157.3       210       35      1                 OK
    CUFF.807.1  36081      chr1  28160911  28160947  87284.4  1    1     86693.5       87875.3       595       36      2                 OK

    transcripts.gtf

    chr1 Cufflinks transcript 11968238 11968315 1000 . . gene_id "CUFF.329"; transcript_id "CUFF.329.1"; FPKM "923.0388234431"; frac "1.000000"; conf_lo "862.275715"; conf_hi "983.801931"; cov "6.292170";
    chr1 Cufflinks exon 11968238 11968315 1000 . . gene_id "CUFF.329"; transcript_id "CUFF.329.1"; exon_number "1"; FPKM "923.0388234431"; frac "1.000000"; conf_lo "862.275715"; conf_hi "983.801931"; cov "6.292170";
    chr1 Cufflinks transcript 11969914 11969985 1000 . . gene_id "CUFF.333"; transcript_id "CUFF.333.1"; FPKM "1791.3078160611"; frac "1.000000"; conf_lo "1706.660127"; conf_hi "1875.955505"; cov "12.210985";
    chr1 Cufflinks exon 11969914 11969985 1000 . . gene_id "CUFF.333"; transcript_id "CUFF.333.1"; exon_number "1"; FPKM "1791.3078160611"; frac "1.000000"; conf_lo "1706.660127"; conf_hi "1875.955505"; cov "12.210985";

    2) If my inputs are the BAM file and the annotation file (cufflinks_0.9.3.Linux_x86_64 -v --GTF "genes.gtf" -p 1 -Q 0 -I 300000 --library-type fr-unstranded --num-importance-samples 1000 --max-mle-iterations 5000 -a 0.01 -j 0.05 -F 0.05 --min-frags-per-transfrag 10 "accepted_hits.sorted.bam"), my outputs are genes.expr, transcripts.expr and transcripts.gtf, as shown below:

    genes.expr

    gene_id          bundle_id  chr  left   right  FPKM  FPKM_conf_lo  FPKM_conf_hi  status
    ENSG00000253101  32866      1    11868  14409  0     0             0             OK
    ENSG00000223972  32866      1    12009  13670  0     0             0             OK
    ENSG00000243485  32866      1    29553  31109  0     0             0             OK
    ENSG00000221311  32866      1    30365  30503  0     0             0             OK

    transcripts.expr

    trans_id         bundle_id  chr  left   right  FPKM  FMI  frac  FPKM_conf_lo  FPKM_conf_hi  coverage  length  effective_length  status
    ENST00000518655  32866      1    11868  14409  0     0    0     0             0             0         1657    1657              OK
    ENST00000450305  32866      1    12009  13670  0     0    0     0             0             0         632     632               OK
    ENST00000473358  32866      1    29553  31097  0     0    0     0             0             0         712     712               OK
    ENST00000469289  32866      1    30266  31109  0     0    0     0             0             0         535     535               OK
    ENST00000408384  32866      1    30365  30503  0     0    0     0             0             0         138     138               OK

    transcripts.gtf

    1 Cufflinks transcript 11869 14409 1 + . gene_id "ENSG00000253101"; transcript_id "ENST00000518655"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
    1 Cufflinks exon 11869 12227 1 + . gene_id "ENSG00000253101"; transcript_id "ENST00000518655"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
    1 Cufflinks exon 12613 12721 1 + . gene_id "ENSG00000253101"; transcript_id "ENST00000518655"; exon_number "2"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
    1 Cufflinks exon 13221 14409 1 + . gene_id "ENSG00000253101"; transcript_id "ENST00000518655"; exon_number "3"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
    1 Cufflinks transcript 12010 13670 1 + . gene_id "ENSG00000223972"; transcript_id "ENST00000450305"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
    1 Cufflinks exon 12010 12057 1 + . gene_id "ENSG00000223972"; transcript_id "ENST00000450305"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";
    1 Cufflinks exon 12179 12227 1 + . gene_id "ENSG00000223972"; transcript_id "ENST00000450305"; exon_number "2"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";

    You can see the difference (the change in gene and transcript IDs is fine), especially in FPKM, FMI, frac, FPKM_conf_lo, FPKM_conf_hi and coverage: with the annotation file they are all 0.

    Any suggestions?
    Thanks a lot!!!!!!!!

  • #2
    What exactly is your problem?

    • #3
      Hi Mattia,

      It looks like you are using different chromosome names for the genome and for the genome annotation. The bowtie index you used in TopHat has a "chr" prefix on the chromosome names (UCSC?), and the GTF file from Ensembl doesn't. That may be why the FPKM values are not calculated correctly. You can try using FASTA and GTF files from the same database (Ensembl, for example), or changing the chromosome names in the GTF file, or changing them in the initial FASTA file before building the index.

      Emilie
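
A quick way to check for this kind of mismatch is to compare the sequence names in the BAM header with those in column 1 of the GTF. The following is only a minimal sketch: the function names are made up for illustration, and it assumes a tab-separated GTF and that samtools is available to print the BAM header.

```shell
# List the sequence names declared on @SQ lines of a SAM/BAM header
# (header text is read from stdin).
sam_header_seqnames() {
    awk -F'\t' '$1 == "@SQ" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^SN:/) print substr($i, 4)
    }' | LC_ALL=C sort -u
}

# List the sequence names used in column 1 of a GTF file,
# skipping comment lines.
gtf_seqnames() {
    awk -F'\t' '!/^#/ && NF > 1 { print $1 }' "$1" | LC_ALL=C sort -u
}

# Print the names that occur in both (sorted) name lists.
# Empty output means the two files share no sequence names.
shared_seqnames() {
    LC_ALL=C comm -12 "$1" "$2"
}

# Possible usage (file names taken from the commands in this thread):
#   samtools view -H accepted_hits.sorted.bam | sam_header_seqnames > bam_names.txt
#   gtf_seqnames genes.gtf > gtf_names.txt
#   shared_seqnames bam_names.txt gtf_names.txt
```

If the last command prints nothing, the alignments and the annotation do not agree on any chromosome name, which would fit the all-zero output shown in the first post.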

      • #4
        Thanks Emilie. Tomorrow I'll download the GTF and the bowtie index for color space from the same source (maybe from http://cufflinks.cbcb.umd.edu/igenomes.html, or would you suggest another one?).

        • #5
          I downloaded Homo_Sapiens_Ensembl_GRCh37.tar.gz from http://cufflinks.cbcb.umd.edu/igenomes.html; this archive contains both the reference (FASTA) and the GTF file. Even using these, I have the same problem. (For this test I used reads in FASTQ format, not in color space.)
          Last edited by mattia; 10-14-2011, 04:36 AM.

          • #6
            Did you check whether the GTF file in Homo_Sapiens_Ensembl_GRCh37.tar.gz uses "chr1, chr2, ..." type identifiers or the "1, 2, ..." type?

            You can convert it this way:
            Code:
            awk '/^#/ {print; next} {print "chr"$0}' genes.gtf | sed 's/^chrMT/chrM/' > genes.cufflinks.gtf
            Note that this will not properly convert the names of the "other" chromosomes (the "random" ones); they only get a "chr" prefix.
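
For the sequences that a plain "chr" prefix does not cover, one option is to rename column 1 through an explicit lookup table. This is just a sketch under assumptions: the two-column, tab-separated mapping file (old_name, new_name) is hypothetical and you would have to build it yourself, and the function name is made up for illustration.

```shell
# Rename column 1 of a tab-separated GTF using a mapping file.
#   $1 = mapping file: old_name<TAB>new_name (hypothetical, user-supplied)
#   $2 = GTF file
# Comment lines are kept as they are. Lines whose sequence name is not in
# the mapping are passed through unchanged, and each such name is reported
# once on stderr.
rename_gtf_seqnames() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR   { map[$1] = $2; next }
        /^#/        { print; next }
        ($1 in map) { $1 = map[$1]; print; next }
        {
            if (!($1 in seen)) {
                seen[$1] = 1
                print "unmapped: " $1 > "/dev/stderr"
            }
            print
        }
    ' "$1" "$2"
}

# Possible usage (names.tsv is the hypothetical mapping file):
#   rename_gtf_seqnames names.tsv genes.gtf > genes.cufflinks.gtf
```

Any name reported as unmapped still has its original spelling in the output, so those lines will keep mismatching until the table covers them.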

            • #7
              Hi Mattia

              Did you re-run TopHat with the new bowtie index that you downloaded before running Cufflinks? TopHat and Cufflinks need to be run using the same chromosome names.

              Emilie
