Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Modification of reference genome annotation by cufflinks/cuffdiff?

    Hi,

    I am totally new to NGS technologies. We are just now starting an RNA-seq project, and I am trying to figure out how to find differentially expressed genes with my Illumina (100bp) reads in Drosophila.

    We aligned our reads to the dm3 genome in galaxy, and I imported a gtf file from ucsc. One thing that was really weird is that some genes I knew to be expressed in the tissue we isolated was listed with a value of 0 FPKM in cuffdiff. However, when I looked at the tophat aligned reads in the ucsc browser, there were plenty of reads.

    Looking more closely at the data, I noticed that the chromosomal position reported by cuffdiff for the gene I was looking at had changed. It's previous range was 264127-264127 on the chromosome, but cuffdiff had set it to 264730-364670. So the gene is reported by cuffdiff to be a 100kb gene instead of a 2kb gene. The other weird thing is that many genes in that region of the genome had been changed w.r.t. the reference genome gtf file to have identical boundaries.

    Is this some kind of artifact that causes genes to become associated with a different gene's annotation?

    Thanks for any help you can provide!

    -Mark

  • #2
    Update: I ran cuffdiff on my computer (in verbose mode), and have discovered that it seems to be changing the gene position data before measuring the number of reads. Does anyone have any ideas to help me?

    Here is the fragment of the gtf file I used.
    chrX dm3_flyBaseGene exon 240182 241740 0.000000 + . gene_id "CG3777"; transcript_id "CG3777-RC";
    chrX dm3_flyBaseGene CDS 242113 242271 0.000000 + 2 gene_id "CG3777"; transcript_id "CG3777-RC";
    chrX dm3_flyBaseGene exon 242113 242271 0.000000 + . gene_id "CG3777"; transcript_id "CG3777-RC";
    chrX dm3_flyBaseGene CDS 242680 243357 0.000000 + 2 gene_id "CG3777"; transcript_id "CG3777-RC";
    chrX dm3_flyBaseGene exon 242680 243357 0.000000 + . gene_id "CG3777"; transcript_id "CG3777-RC";
    chrX dm3_flyBaseGene CDS 243477 243703 0.000000 + 2 gene_id "CG3777"; transcript_id "CG3777-RC";
    chrX dm3_flyBaseGene stop_codon 243704 243706 0.000000 + . gene_id "CG3777"; transcript_id "CG3777-RC";
    chrX dm3_flyBaseGene exon 243477 244237 0.000000 + . gene_id "CG3777"; transcript_id "CG3777-RC";
    chrX dm3_flyBaseGene start_codon 250712 250714 0.000000 + . gene_id "CG3757"; transcript_id "CG3757-RA";
    chrX dm3_flyBaseGene CDS 250712 250950 0.000000 + 0 gene_id "CG3757"; transcript_id "CG3757-RA";
    chrX dm3_flyBaseGene exon 250542 250950 0.000000 + . gene_id "CG3757"; transcript_id "CG3757-RA";
    chrX dm3_flyBaseGene CDS 253650 255034 0.000000 + 1 gene_id "CG3757"; transcript_id "CG3757-RA";
    chrX dm3_flyBaseGene stop_codon 255035 255037 0.000000 + . gene_id "CG3757"; transcript_id "CG3757-RA";
    chrX dm3_flyBaseGene exon 253650 255278 0.000000 + . gene_id "CG3757"; transcript_id "CG3757-RA";
    chrX dm3_flyBaseGene start_codon 264127 264129 0.000000 + . gene_id "CG3796"; transcript_id "CG3796-RA";
    chrX dm3_flyBaseGene CDS 264127 264729 0.000000 + 0 gene_id "CG3796"; transcript_id "CG3796-RA";
    chrX dm3_flyBaseGene stop_codon 264730 264732 0.000000 + . gene_id "CG3796"; transcript_id "CG3796-RA";
    chrX dm3_flyBaseGene exon 264064 264980 0.000000 + . gene_id "CG3796"; transcript_id "CG3796-RA";


    Here is the -verbose output :
    Inspecting bundle chrX:162553-173667 with 484 reads
    Inspecting bundle chrX:173735-364670 with 29393 reads
    Inspecting bundle chrX:365084-370625 with 418 reads
    ....

    Convergence reached in 24 iterations
    Tossing likely garbage isoforms
    Filtering isoform 162553-173667
    Filtering isoform 162553-173667
    Filtering isoform 162553-173667
    Filtering isoform 162553-173667
    Filtering isoform 162553-173667
    Filtering isoform 162553-173667
    Revising MLE
    Convergence reached in 6 iterations
    Importance sampling posterior distribution
    Testing for differential expression and regulation in locus [chrX:162553-173667]

    Reduced 659 frags to 55 (8.345979 percent)
    Calculating intial MLE
    Convergence reached in 770 iterations
    Tossing likely garbage isoforms
    Filtering isoform 210608-228832
    Filtering isoform 210608-228832
    Filtering isoform 210608-364670
    Revising MLE
    Convergence reached in 770 iterations
    Importance sampling posterior distribution

    Reduced 1244 frags to 54 (4.340836 percent)
    Calculating intial MLE
    Convergence reached in 567 iterations
    Tossing likely garbage isoforms
    Filtering isoform 210608-228832
    Filtering isoform 210608-228832
    Filtering isoform 210608-364670
    Revising MLE
    Convergence reached in 567 iterations
    Importance sampling posterior distribution
    Testing for differential expression and regulation in locus [chrX:173735-364670]

    Reduced 33 frags to 11 (33.333333 percent)
    Calculating intial MLE
    Tossing likely garbage isoforms
    Importance sampling posterior distribution
    ....
    And here is my output from cuffdiff, sorted by gene and position:
    CG3114 CG3114 - chrX 162553 173667 q1 q2 OK 4.26393 6.93421 0.486276
    CG12470 CG12470 - chrX 173735 364670 q1 q2 OK 0 0 0
    CG13374 CG13374 - chrX 173735 364670 q1 q2 OK 0 0 0
    CG13375 CG13375 - chrX 173735 364670 q1 q2 OK 0 0 0
    CG17885 CG17885 - chrX 173735 364670 q1 q2 OK 0 0 0
    CG3258 CG3258 - chrX 173735 364670 q1 q2 OK 0.078959 3.53832 3.80248
    CG32816 CG32816 - chrX 173735 364670 q1 q2 OK 0.0761256 0.950462 2.52456
    CG3757 CG3757 - chrX 173735 364670 q1 q2 OK 0 0.0544975 1.79769e+308
    CG3777 CG3777 - chrX 173735 364670 q1 q2 OK 135.839 262.513 0.658829
    CG3796 CG3796 - chrX 173735 364670 q1 q2 OK 0.287887 6.5634 3.1267
    CG3827 CG3827 - chrX 173735 364670 q1 q2 OK 0.945195 19.5633 3.03002
    CG3839 CG3839 - chrX 173735 364670 q1 q2 OK 0 0 0
    CG3972 CG3972 - chrX 173735 364670 q1 q2 OK 0.0259724 0 -1.79769e+308
    CG3923 CG3923 - chrX 365084 370625 q1 q2 OK 3.33054 9.50618 1.04881


    Any help would be much appreciated!

    Mark

    Comment


    • #3
      Hi Mark,

      I find myself in your shoes, did you figure out why you have FPKM of "0" for genes you know to be expressed? I also, find that if I use a reference annotation of dmel to run cufflinks I get "0" for FPKM, but if I run it without a reference file I get values. Any insight if you have, would be much appreciated.

      nsl

      Comment


      • #4
        This might be a trivial advice but worth checking... Are the reference sequence file and GTF file using the same chromosome names? (E.g. ENSEMBL calls chromosomes as 1, 2, 3 etc while UCSC calls them chr1, chr2 ...)

        Dario

        Comment

        Latest Articles

        Collapse

        • seqadmin
          Current Approaches to Protein Sequencing
          by seqadmin


          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
          04-04-2024, 04:25 PM
        • seqadmin
          Strategies for Sequencing Challenging Samples
          by seqadmin


          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
          03-22-2024, 06:39 AM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by seqadmin, 04-11-2024, 12:08 PM
        0 responses
        22 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 10:19 PM
        0 responses
        24 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 09:21 AM
        0 responses
        19 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-04-2024, 09:00 AM
        0 responses
        50 views
        0 likes
        Last Post seqadmin  
        Working...
        X