Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Cufflinks v2.2.1 : FPKM = 0 and coverage=0, but read maps to locus

    Hi,

    Edit : I felt the question was a little too long-winded. Just to be clear :
    a) Even though paired reads fall within an exon, cufflinks seems to be reporting zero coverage for that exon.
    b) cufflinks does not report the transcript length for any of the genes in my data set.

    I have been running a tophat (2.0.11) -> cufflinks (2.2.1) -> cuffnorm/cuffdiff pipeline for looking at some mouse RNA-seq data.

    When looking at the output of cufflinks, I found some genes where the expression level of a gene was zero, but there were many reads that mapped to the locus. This was a bug that was supposed to have been fixed in v1.2.0. This issue has been brought up in earlier threads, but not resolved clearly :
    Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

    Application of sequencing to RNA analysis (RNA-Seq, whole transcriptome, SAGE, expression analysis, novel organism mining, splice variants)

    Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


    I first ran the tophat->cufflinks pipeline on my data. I then wrote a shell script to look through the transcripts.gtf file produced by cufflinks for transcripts whose FPKM readout was 0. This gave me a fairly large number of gene_id's, of the order of a few thousand.

    I then extracted the exon definitions for each of these gene_ids from the merged transcriptome. I looked for reads falling into these exon loci, and used samtools to pull out the information on all reads that fell into these loci.

    Roughly, I found two kinds of instances in which cufflinks reports zero fpkm for a locus when reads map to that locus.

    i) Read alignments imply a fragment length much much larger than the exon or transcript definition implies. For example, I found that some tophat alignment reports a template length of 10^6 nt even though the transcript length is about 10^2 nt in length in the supplied gtf file. A lot of microRNAs and lincRNAs fell into this category. When I filtered out all these reads, I found zero reads mapping to these loci. The cufflinks output seemed reasonable here.

    ii) This was the category that confused me. For one of the genes whose expression was zero fpkm, I found that 184 reads aligned to one of the exons. This number dropped to 16 when I counted only paired alignments and ignoring supplementary alignments.

    Of these 16 alignments, I found that the fragment lengths implied by the alignments were are "reasonable" - 118 nt, 130 nt, etc. The cigar strings were all 101M, and these were the highest scoring alignments for these reads. An example read :

    HWI-1KL120:1171C63ACXX:3:1108:3575:87870 99 chr13 21924938 3 101M = 21925038 201 GTGAACGGCTGGTCTTATTTTCCCTTGGCCTTGTGGTGGCTCTCGGTCTTCTTGGGCAGCAGCACGGCCTGGATGTTGGGCAGGACGCCGCCCTGCTCGAT @@@BDEFFHHGH+AABFGGIIJCHBHEGGGGGIGGGFCHICBBD@B;AGHIIIIIIBHHEHHDBBDCDB?@AACAACC?DACDDBBD>@B@>BBABCC?@> AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:96G4 YT:Z:UU XS:A:- NH:i:2 HI:i:1

    I understand that out of millions of reads, 16 reads falling on a gene is more likely to be explained by sequencing and alignment errors than true gene expression. But, the transcripts.gtf file still reported zero coverage across any exon. For example the read above maps to chr13:21924938. But, the transcripts.gtf file states that the coverage of the exon containing this coordinate is zero :

    chr13 Cufflinks exon 21924893 21925444 1 - . gene_id "Hist1h2ao"; transcript_id "Hist1h2ao_dup1"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";

    What's weirder is that the read shown above has two possible alignments. One is to chr13:21924938 (alignment score AS:i:-5), and the other is to chr13:21902542 (As:i:0). This second alignment falls into this exon :

    chr13 Cufflinks exon 21902236 21902787 1 + . gene_id "Hist1h2ao"; transcript_id "Hist1h2ao"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";

    Given that the tophat alignment states that the read came from either one of these exons, why is the coverage of *both* exons zero? I'm guessing the multi-read correct option had something to do with it?

    Also, oddly, cufflinks does not report the transcript length for any of the transcripts in the genes.fpkm_tracking file. All transcript length entries are a "-". For example :

    tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id locus length coverage FPKM FPKM_conf_lo FPKM_conf_hi FPKM_status
    Rp1 - - Rp1 - - chr1:4280926-4399322 - - 0 0 0 OK

    Is there something that I am missing on how cufflinks calculates transcript lengths, or am I using the wrong inputs in some way?

    My cufflinks command :
    cufflinks --no-update-check -p $NCORES -o $OUTPUTDIR $INPUTFILE -G ${GTF[$INDEX]} --multi-read-correct --frag-bias-correct ${GENOME_FA[$INDEX]}

    The GTF file I used is from the UCSC mm9 refFlat tables.
    Last edited by vishakad; 10-07-2014, 01:46 AM.

  • #2
    I am having the exact same problem. Did you find a solution to this?

    Comment


    • #3
      Given that the tophat alignment states that the read came from either one of these exons, why is the coverage of *both* exons zero? I'm guessing the multi-read correct option had something to do with it?
      For the above part of the query, it looks like exons that align in more than one place are probably not being included.

      Comment


      • #4
        Yeah, I did figure it out I think. Sorry for not posting it.

        The issue wasn't something from the multi-read correction. It's that tophat is being pretty smart about interpreting the meaning of reads aligning to a given locus. When I went back and looked back at the alignment BAMS, I found that all the reads that landed in these FPKM=0 loci were reads where the mate pair aligned many megabases away.

        So I think tophat was merely discarding these reads, since the mate alignment is just way too far to figure out which transcript/fragment gave rise to the paired reads.

        I might well be wrong on this. Perhaps a more experienced tophat user can attest to this problem.

        Comment

        Latest Articles

        Collapse

        • seqadmin
          Strategies for Sequencing Challenging Samples
          by seqadmin


          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
          03-22-2024, 06:39 AM
        • seqadmin
          Techniques and Challenges in Conservation Genomics
          by seqadmin



          The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

          Avian Conservation
          Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
          03-08-2024, 10:41 AM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by seqadmin, Yesterday, 06:37 PM
        0 responses
        10 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, Yesterday, 06:07 PM
        0 responses
        10 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 03-22-2024, 10:03 AM
        0 responses
        51 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 03-21-2024, 07:32 AM
        0 responses
        67 views
        0 likes
        Last Post seqadmin  
        Working...
        X