SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
Cufflinks get stuck at a locus Mayflower Bioinformatics 1 01-06-2014 03:52 AM
How to Estimate the Average Coverage per RAD Locus fcr De novo discovery 11 02-08-2012 01:58 AM
cufflinks not able to assemble transcripts in high read coverage area apratap Bioinformatics 2 11-03-2011 07:21 PM
Cufflinks convert FPKM to Read Counts zee Bioinformatics 0 03-08-2010 06:35 PM
Why do my coverage maps look like complexity traces The_Roads Bioinformatics 3 07-10-2009 10:29 AM

Reply
 
Thread Tools
Old 10-07-2014, 12:29 AM   #1
vishakad
Junior Member
 
Location: India

Join Date: Oct 2014
Posts: 2
Default Cufflinks v2.2.1 : FPKM = 0 and coverage=0, but read maps to locus

Hi,

Edit : I felt the question was a little too long-winded. Just to be clear :
a) Even though paired reads fall within an exon, cufflinks seems to be reporting zero coverage for that exon.
b) cufflinks does not report the transcript length for any of the genes in my data set.

I have been running a tophat (2.0.11) -> cufflinks (2.2.1) -> cuffnorm/cuffdiff pipeline for looking at some mouse RNA-seq data.

When looking at the output of cufflinks, I found some genes where the expression level of a gene was zero, but there were many reads that mapped to the locus. This was a bug that was supposed to have been fixed in v1.2.0. This issue has been brought up in earlier threads, but not resolved clearly :
http://seqanswers.com/forums/showthread.php?t=17682
http://seqanswers.com/forums/showthread.php?t=20702
http://seqanswers.com/forums/showthr...cufflinks+fpkm

I first ran the tophat->cufflinks pipeline on my data. I then wrote a shell script to look through the transcripts.gtf file produced by cufflinks for transcripts whose FPKM readout was 0. This gave me a fairly large number of gene_id's, of the order of a few thousand.

I then extracted the exon definitions for each of these gene_ids from the merged transcriptome. I looked for reads falling into these exon loci, and used samtools to pull out the information on all reads that fell into these loci.

Roughly, I found two kinds of instances in which cufflinks reports zero fpkm for a locus when reads map to that locus.

i) Read alignments imply a fragment length much much larger than the exon or transcript definition implies. For example, I found that some tophat alignment reports a template length of 10^6 nt even though the transcript length is about 10^2 nt in length in the supplied gtf file. A lot of microRNAs and lincRNAs fell into this category. When I filtered out all these reads, I found zero reads mapping to these loci. The cufflinks output seemed reasonable here.

ii) This was the category that confused me. For one of the genes whose expression was zero fpkm, I found that 184 reads aligned to one of the exons. This number dropped to 16 when I counted only paired alignments and ignoring supplementary alignments.

Of these 16 alignments, I found that the fragment lengths implied by the alignments were are "reasonable" - 118 nt, 130 nt, etc. The cigar strings were all 101M, and these were the highest scoring alignments for these reads. An example read :

HWI-1KL120:1171C63ACXX:3:1108:3575:87870 99 chr13 21924938 3 101M = 21925038 201 GTGAACGGCTGGTCTTATTTTCCCTTGGCCTTGTGGTGGCTCTCGGTCTTCTTGGGCAGCAGCACGGCCTGGATGTTGGGCAGGACGCCGCCCTGCTCGAT @@@BDEFFHHGH+AABFGGIIJCHBHEGGGGGIGGGFCHICBBD@B;AGHIIIIIIBHHEHHDBBDCDB?@AACAACC?DACDDBBD>@B@>BBABCC?@> AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:96G4 YT:Z:UU XS:A:- NH:i:2 HI:i:1

I understand that out of millions of reads, 16 reads falling on a gene is more likely to be explained by sequencing and alignment errors than true gene expression. But, the transcripts.gtf file still reported zero coverage across any exon. For example the read above maps to chr13:21924938. But, the transcripts.gtf file states that the coverage of the exon containing this coordinate is zero :

chr13 Cufflinks exon 21924893 21925444 1 - . gene_id "Hist1h2ao"; transcript_id "Hist1h2ao_dup1"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";

What's weirder is that the read shown above has two possible alignments. One is to chr13:21924938 (alignment score AS:i:-5), and the other is to chr13:21902542 (As:i:0). This second alignment falls into this exon :

chr13 Cufflinks exon 21902236 21902787 1 + . gene_id "Hist1h2ao"; transcript_id "Hist1h2ao"; exon_number "1"; FPKM "0.0000000000"; frac "0.000000"; conf_lo "0.000000"; conf_hi "0.000000"; cov "0.000000";

Given that the tophat alignment states that the read came from either one of these exons, why is the coverage of *both* exons zero? I'm guessing the multi-read correct option had something to do with it?

Also, oddly, cufflinks does not report the transcript length for any of the transcripts in the genes.fpkm_tracking file. All transcript length entries are a "-". For example :

tracking_id class_code nearest_ref_id gene_id gene_short_name tss_id locus length coverage FPKM FPKM_conf_lo FPKM_conf_hi FPKM_status
Rp1 - - Rp1 - - chr1:4280926-4399322 - - 0 0 0 OK

Is there something that I am missing on how cufflinks calculates transcript lengths, or am I using the wrong inputs in some way?

My cufflinks command :
cufflinks --no-update-check -p $NCORES -o $OUTPUTDIR $INPUTFILE -G ${GTF[$INDEX]} --multi-read-correct --frag-bias-correct ${GENOME_FA[$INDEX]}

The GTF file I used is from the UCSC mm9 refFlat tables.

Last edited by vishakad; 10-07-2014 at 01:46 AM.
vishakad is offline   Reply With Quote
Old 10-26-2015, 12:20 PM   #2
michaelleonard
Junior Member
 
Location: Los Angeles

Join Date: Oct 2015
Posts: 3
Default

I am having the exact same problem. Did you find a solution to this?
michaelleonard is offline   Reply With Quote
Old 10-26-2015, 01:29 PM   #3
mastal
Senior Member
 
Location: uk

Join Date: Mar 2009
Posts: 667
Default

Quote:
Given that the tophat alignment states that the read came from either one of these exons, why is the coverage of *both* exons zero? I'm guessing the multi-read correct option had something to do with it?
For the above part of the query, it looks like exons that align in more than one place are probably not being included.
mastal is offline   Reply With Quote
Old 10-26-2015, 10:09 PM   #4
vishakad
Junior Member
 
Location: India

Join Date: Oct 2014
Posts: 2
Default

Yeah, I did figure it out I think. Sorry for not posting it.

The issue wasn't something from the multi-read correction. It's that tophat is being pretty smart about interpreting the meaning of reads aligning to a given locus. When I went back and looked back at the alignment BAMS, I found that all the reads that landed in these FPKM=0 loci were reads where the mate pair aligned many megabases away.

So I think tophat was merely discarding these reads, since the mate alignment is just way too far to figure out which transcript/fragment gave rise to the paired reads.

I might well be wrong on this. Perhaps a more experienced tophat user can attest to this problem.
vishakad is offline   Reply With Quote
Reply

Tags
coverage, cufflinks, fpkm

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 05:04 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO