Seqanswers Leaderboard Ad

**blancha** · 10-09-2015, 05:42 AM

Which version of Cufflinks are you using?
Also, what is the total number of uniquely mapped reads for each sample?
Have you tried comparing your results with the output of featureCounts and DESeq2?

**priya** · 10-09-2015, 06:29 AM

Thanks for your reply!

Originally posted by blancha View Post

Which version of Cufflinks are you using?
Also, what is the total number of uniquely mapped reads for each sample?
Have you tried comparing your results with the output of featureCounts and DESeq2?

Which version of Cufflinks are you using?
cufflinks/2.1.1 is used on bam files to calculate the fpkms

Also, what is the total number of uniquely mapped reads for each sample?
This number i got from this command
samtools view -F 4 accepted_hits_sorted_dupRemoved_P1827_101.bam | wc -l
31837064 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
31837064 + 0 mapped (100.00%:-nan%)
0 + 0 paired in sequencing
0 + 0 read1
0 + 0 read2
0 + 0 properly paired (-nan%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (-nan%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

Have you tried comparing your results with the output of featureCounts and DESeq2?
No, But the counts I used for comparision is from htSeqCount package

**yzzhang** · 10-09-2015, 09:30 AM

So you used different bam file for cufflinks and htseqcount? accepted_hits.bam for cufflinks and accepted_hits_sorted_dupRemoved_P1827_101.bam for htseqcount?

**blancha** · 10-09-2015, 05:32 PM

I would just start by updating Cufflinks.
I have noticed many bugs with older versions of Cufflinks, which often result in incorrect results.
I wonder in fact how much of an impact the Cufflinks bugs have had on genomic research given its very widespread use.
I'd be curious to know if your problem disappears with the latest version of Cufflinks.

**priya** · 10-12-2015, 12:33 AM

Originally posted by blancha View Post

I would just start by updating Cufflinks.
I have noticed many bugs with older versions of Cufflinks, which often result in incorrect results.
I wonder in fact how much of an impact the Cufflinks bugs have had on genomic research given its very widespread use.
I'd be curious to know if your problem disappears with the latest version of Cufflinks.

Thanks for the suggestion! As the final output tables for htseq and cufflinks are delivered for us by sequencing facility, I am not sure which files they used as input for htseqcount and cufflinks (with duplicates or without), now I am rerunning all the results of Cufflinks and htSeq-Count with duplicated_removed_bam files and update you if I overcome the issue.

**blancha** · 10-12-2015, 07:29 AM

Yes, please let me know if the issue is resolved.
I would not remove the duplicates, though.

**priya** · 10-13-2015, 06:32 AM

Originally posted by blancha View Post

Yes, please let me know if the issue is resolved.
I would not remove the duplicates, though.

Dear blancha,
I rerun htseq-count (0.6.1) and cufflinks(version 2.2.1) on my RNA-Seq samples without removing the duplicates. But I see the same results now, without much changes

First 3 values from Condition A with 3 replicates (red) and last 3 from Condition B with 3 replicates (Green)
Counts
ENSMUSG00000033450 Tagap 169 171 124 17 38 21
FPKMS
ENSMUSG00000033450 Tagap 4.43869 3.79506 3.20408 4.99311 4.66142 4.2739

Counts = Cond A (mean) > Cond B (mean)
FPKMs = Cond A (mean) < Cond B (mean)

Actually this is not for all the genes, for fewer genes (say 15 genes as far I noticed)

Commands used for htseq and cufflinks:

samtools sort $raw_dir/$sam $bam_dir/$nnm'.sorted'

cufflinks --library-type fr-firststrand -p 8 -G $main_dir/data/Mus_musculus.GRCm38.82.gtf -M $main_dir/data/mask_MR.gtf $bam_dir/$nnm'.sorted.bam' -o $cuff_dir

samtools view $bam_dir/$nnm'.sorted.bam' | htseq-count -s reverse - $main_dir/data/Mus_musculus.GRCm38.82.gtf> $ht_dir/$nnm"_count.txt"

**dpryan** · 10-13-2015, 12:06 PM

Have a look at the BAM files in IGV. It's likely that the files with low raw counts have a bunch of multimappers that htseq-count is (correctly, given its purpose) excluding while cufflinks (possibly correctly, given its purpose) is including. You should be able to tell by eye whether cufflinks is being reasonable or not.

**blancha** · 10-13-2015, 12:39 PM

Yes, @dpryan has a point.

@priya, I just want to point out that Cufflinks doesn't provide you only with the FPKM values, but also with the raw counts.
I initially thought you were giving me the raw counts for Cufflinks, and that there was a discrepancy between the raw counts given by Cufflinks and the FPKM given by Cufflinks.

If there is a difference between the counts for htseq-count and the counts for Cufflinks, this is a different matter altogether, and not necessarily indicative of a bug, since they may count reads differently, as pointed out by @dpryan.

The file "genes.count_tracking" contains the counts for Cuffdiff.

As pointed out by @dpryan, the simplest solution is always to check the reads in the BAM file yourself with IGV, if you are getting different counts with different software programs.

I hope I have not given you any misleading information

**priya** · 10-15-2015, 01:25 AM

Thanks dpryan and blancha for your suggestions!

@dpryan is correct, I noticed it when I viewed my bam files in IGV, actually its issue with multiple mapped reads for low raw counts.

@blancha: My aim of the analysis is to identify differentially expressed genes between wild-type and knockout lines, after going through this old posts from seqanswers
http://seqanswers.com/forums/showthread.php?t=9129
http://seqanswers.com/forums/showthread.php?t=5793
Two ways - htseqcount -> DE-tests (DESEq/EdgeR/Voom)
Cufflinks -> Cuffdiff
I stick to the first approach and also in my case, problem arises with lower counts and genes with higher counts has no issues (even i compare htseq_counts and fpkms) ; so for downstream analysis I am considering to eliminate low counts with considerable cutoff and perform differential expression analysis.

Another question is ;
If we want to compare htseq_counts between two different genes, we need to normalize by gene-lengths ; so can I do by this formulae: (my beginning question of this post)
for gene i; normalized count (Nc)
Nci = (htseq_count)i * 10^9/ (gene length * Total reads in the sample)

so by this way both gene lengths and sample sizes are corrected (If i am not wrong??)

**dpryan** · 10-15-2015, 03:17 AM

So that will sort of but not really make things comparable. There are a couple issues:

1) Unless there's only 1 isoform for each of the genes, there's always the question of what the most appropriate gene length is. In an ideal world you'd quantify things at the transcript level and then have an "effective gene length" to use, but that sort of thing takes a bit more work.

2) Typically things like GC content affect what gets sequenced, so you end up needing to correct for that as well.

3) Inevitably there are things not listed here that also need to be accounted for.

The safest method would be to try to do sequencing with some spike-ins that are similar so you can determine how to correct everything, but that'll take a significant time and resource commitment.

**priya** · 10-15-2015, 04:35 AM

Good to know about those points! Thanks !

**blancha** · 10-15-2015, 06:17 AM

@priya

The most recent versions of DESeq2 have a function fpkm() which will compute the FPKM values.
You do need to provide the gene lengths yourself.
If you use featureCounts to count the reads, it will compute the gene lengths from the GTF file.
The computations performed by fpkm() are very similar to your formula, but there are some differences explained in the vignette.

The results for the lower count genes are always less reliable, regardless of the program used. The slight differences in the algorithms used to count the reads will also be magnified in lower count genes.

**priya** · 10-15-2015, 06:54 AM

Originally posted by blancha View Post

@priya

The most recent versions of DESeq2 have a function fpkm() which will compute the FPKM values.
You do need to provide the gene lengths yourself.
If you use featureCounts to count the reads, it will compute the gene lengths from the GTF file.
The computations performed by fpkm() are very similar to your formula, but there are some differences explained in the vignette.

The results for the lower count genes are always less reliable, regardless of the program used. The slight differences in the algorithms used to count the reads will also be magnified in lower count genes.

@blancha Thanks for bringing into my notice about the latest version, its just updated yesterday .. (https://www.bioconductor.org/package...man/DESeq2.pdf)
I see there are lot of new functions added into DESeq2 , I need to explore more of it esp. fpkms() and normalizeGeneLength() .

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 39 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 41 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 35 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 55 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Problems with FPKM values

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News