Unconfigured Ad

**Simon Anders** · 05-19-2010, 04:09 AM

You can use htseq-count to extract raw counts.

Simon

**marcora** · 05-19-2010, 05:31 AM

Originally posted by Simon Anders View Post

You can use htseq-count to extract raw counts.

Dear Simon,

indeed I have very much enjoyed using HTSeq tools and I am looking forward to use DESeq. However, I would love to see an integrated solution such as bowtie>tophat>cufflinks because I would like to perform the analysis using two different set of tools and make sure the results match to a large extent.

In regards to htseq-count this is my oversimplified pipeline at the moment (disclaimer: I have been doing RNAseq analysis for only a few days now):

- run tophat using the RefSeq reference in GFF format without novel junctions discovery which outputs 'accepted_hits.sam'

- run cufflinks on 'accepted_hits.sam' using the RefSeq reference in GTF format which outputs 'transcripts.gtf'

- run htseq-count as follows:

Code:

 htseq-count accepted_hits.sam transcripts.gtf > counts.table

transcripts.gtf looks like this:

Code:

chr1	Cufflinks	transcript	136186581	136189125	1000	+	.	gene_id "Myog"; transcript_id "Myog"; FPKM "81.2341305460"; frac "1.000000"; conf_lo "76.954202"; conf_hi "85.514059"; cov "36.025000";
chr1	Cufflinks	exon	136186581	136187103	1000	+	.	gene_id "Myog"; transcript_id "Myog"; exon_number "1"; FPKM "81.2341305460"; frac "1.000000"; conf_lo "76.954202"; conf_hi "85.514059"; cov "36.025000";
chr1	Cufflinks	exon	136187670	136187751	1000	+	.	gene_id "Myog"; transcript_id "Myog"; exon_number "2"; FPKM "81.2341305460"; frac "1.000000"; conf_lo "76.954202"; conf_hi "85.514059"; cov "36.025000";
chr1	Cufflinks	exon	136188251	136189125	1000	+	.	gene_id "Myog"; transcript_id "Myog"; exon_number "3"; FPKM "81.2341305460"; frac "1.000000"; conf_lo "76.954202"; conf_hi "85.514059"; cov "36.025000";

whereas the tail of htseq-count output looks like this:

Code:

Zxdc	88
Zyg11a	0
Zyg11b	760
Zyx	97
Zzef1	150
Zzz3	163
l7Rn6	181
rp9	123
no_feature	10221124
ambiguous	15234

Not sure why there are so many 'no_feature' hits. Moreover, htseq-count reports a raw count of 741 for Myog (the gene shown above in the GTF bit). The sum of all genes raw counts is 4,281,575.

The size of accepted_hits.sam is 17.283 million lines.

Given that the combination of all exons of Myog is 1.477 Kb and the reported FPKM (RPKM in this case because I am using reads that are NOT paired end) is 81.234 the Myog raw count from cufflinks output should be RPKM * length (Kb) * total no of reads (million) = 81.234 * 1.477 * 17.283 = 2,073

Not sure what I am doing wrong but both cufflinks and htseq-count outputs are, at this early stage of my understanding, very confusing in the fact that they don't match.

Any help would be greatly appreciated

Edoardo "Dado" Marcora

p.s.: I am very soon going to install IGV on my machine and manually count the reads mapped to the Myog gene locus

UPDATE: by approximate visual inspection, it looks like the Myog raw count estimated using cufflinks output is the correct one... htseq-count is lower probably because of all those "no_feature" hits

**Simon Anders** · 05-19-2010, 06:10 AM

Unless you have stranded RNA-Seq data, it is important to use the option "--stranded=no". Otherwise, htseq-count discards everything which is on the other strand.

Could this be the reason? (Maybe, I should mark this fact in bold on the web page.)

Simon

**sunnyvu** · 05-19-2010, 07:33 AM

Currently I am using the HTSeq to get count number. After adding -s, the result is much better.

Command:
python -m HTSeq.scripts.count accepted_hits.sam transcripts.gtf >s1.count -s no

Thanks both.

**mgogol** · 05-19-2010, 07:45 AM

You can also count reads using bedtools on a bam file, like this.

coverageBed -abam ../data/accepted_hits.sorted.bam -b ../data/Mus_musculus.NCBIM37.52.bed > counts_bam.txt

I think it's really smart to check things out. There's a lot that can go wrong with all these complicated programs, sequences, etc.

**marcora** · 05-19-2010, 08:34 AM

Originally posted by Simon Anders View Post

Unless you have stranded RNA-Seq data, it is important to use the option "--stranded=no". Otherwise, htseq-count discards everything which is on the other strand.

Could this be the reason? (Maybe, I should mark this fact in bold on the web page.)

All right, thanx... I missed that! Now the count for Myog is 1,445 (vs 2073 as I calculated it from cufflinks output - see prev post). The tail of htseq-count output is:

Code:

Zxdc	170
Zyg11a	5
Zyg11b	1554
Zyx	209
Zzef1	297
Zzz3	375
l7Rn6	365
rp9	229
no_feature	5949010
ambiguous	63483

Still quite a few "no_feature" hits, but overall much more realistic... it would be great to have a discussion between Simon and Cole about the different methodologies used to allocate reads to features by htseq-count and cufflinks that might explain the difference in output.

By the way, here is the script that I am using to generate raw counts for genes from 'accepted_hits.sam' (tophat output) and 'transcripts.gtf' (cufflinks output); perhaps somebody may find it useful (the output should be directly loadable into an R data frame for use with DEseq)

n.b.: it's tailored to my file/directory structure... but the gist of it should be easily generalizable.

Code:

#!/bin/bash
for f in $( ls *.fa ); do
    if [ -f $f ]; then
        ACCEPTED_HITS_SIZE=`wc -l $f.tophat_out/accepted_hits.sam | cut -d' ' -f1`
        # (for each exon in gtf file input)
        # $10 = gene_id => $1
        # $16 = FPKM => $2
        # $5-$4 = exon lenght => $3
        # => $1;$2;$3
        # => $1 $2 $3
        #
        # aggregate sum of $3 (exon length) by $1 (gene_id)
        # => $1 $2 $3
        #
        # count = FPKM * total exon lenght (Kb) * number of mapped reads (million)
        # count = $2 * ($3 / 1,000) * ((ACCEPTED_HITS_SIZE-2) / 1,000,000)
        awk '$3 ~ /^exon$/ { print $10 $16 ($5-$4) }' $f.cufflinks_out/transcripts.gtf | \
        awk '{ gsub(/"/, "", $0); print $0; }' - | \
        awk -F\; '{ printf("%s\t%s\t%s\n",$1,$2,$3) }' - | \
        awk 'BEGIN { gene_id = ""; i = 0 ; len = 0; } { if (gene_id == $1) { i++; len += $3; fpkm = $2; } else if (i != 0) { printf("%s\t%s\t%s\n",gene_id,fpkm,len); gene_id = $1; len = $3; i = 1; fpkm = $2; } else { gene_id = $1; len += $3; i++; fpkm = $2; } } END { if (i != 0) { printf("%s\t%s\t%s\n",gene_id,fpkm,len); } }' - | \
        awk 'BEGIN { print "\t'${f%.fa}'" } { gene_id = $1; count = $2*($3/1000)*(('${ACCEPTED_HITS_SIZE}'-2)/1000000); printf("%s\t%d\n",gene_id,count); }' - > $f.counts.table
    fi
done

But pleeeeeez, add raw counts to cufflinks output!!!

Dado

**marcora** · 05-19-2010, 08:39 AM

Originally posted by marcora View Post

By the way, here is the script that I am using to generate raw counts for genes from 'accepted_hits.sam' (tophat output) and 'transcripts.gtf' (cufflinks output); perhaps somebody may find it useful (the output should be directly loadable into an R data frame for use with DEseq)

n.b.: it's tailored to my file/directory structure... but the gist of it should be easily generalizable.

Code:

#!/bin/bash
for f in $( ls *.fa ); do
    if [ -f $f ]; then
        ACCEPTED_HITS_SIZE=`wc -l $f.tophat_out/accepted_hits.sam | cut -d' ' -f1`
        # (for each exon in gtf file input)
        # $10 = gene_id => $1
        # $16 = FPKM => $2
        # $5-$4 = exon lenght => $3
        # => $1;$2;$3
        # => $1 $2 $3
        #
        # aggregate sum of $3 (exon length) by $1 (gene_id)
        # => $1 $2 $3
        #
        # count = FPKM * total exon lenght (Kb) * number of mapped reads (million)
        # count = $2 * ($3 / 1,000) * ((ACCEPTED_HITS_SIZE-2) / 1,000,000)
        awk '$3 ~ /^exon$/ { print $10 $16 ($5-$4) }' $f.cufflinks_out/transcripts.gtf | \
        awk '{ gsub(/"/, "", $0); print $0; }' - | \
        awk -F\; '{ printf("%s\t%s\t%s\n",$1,$2,$3) }' - | \
        awk 'BEGIN { gene_id = ""; i = 0 ; len = 0; } { if (gene_id == $1) { i++; len += $3; fpkm = $2; } else if (i != 0) { printf("%s\t%s\t%s\n",gene_id,fpkm,len); gene_id = $1; len = $3; i = 1; fpkm = $2; } else { gene_id = $1; len += $3; i++; fpkm = $2; } } END { if (i != 0) { printf("%s\t%s\t%s\n",gene_id,fpkm,len); } }' - | \
        awk 'BEGIN { print "\t'${f%.fa}'" } { gene_id = $1; count = $2*($3/1000)*(('${ACCEPTED_HITS_SIZE}'-2)/1000000); printf("%s\t%d\n",gene_id,count); }' - > $f.counts.table
    fi
done

Wow

, I've just come to realize that that was my first public contribution to the field of hardcore bioinformatics... 4 lines of code!!!

Gotta pick up the pace!

**marcora** · 05-19-2010, 08:51 AM

Originally posted by mgogol View Post

You can also count reads using bedtools on a bam file, like this.

coverageBed -abam ../data/accepted_hits.sorted.bam -b ../data/Mus_musculus.NCBIM37.52.bed > counts_bam.txt

That was the other option I tried out... the output for Myog is the following (only the last 4 columns are added by coverageBed: 1] raw count/depth 2] # bases at depth 3] size of transcript/exon 4] % of transcript/exon at depth):

Code:

chr1	Cufflinks	transcript	136186581	136189125	1000	+	.	gene_id "Myog"; transcript_id "Myog"; FPKM "81.2341305460"; frac "1.000000"; conf_lo "76.954202"; conf_hi "85.514059"; cov "36.025000";	1474	2532	2545	0.9948919
chr1	Cufflinks	exon	136186581	136187103	1000	+	.	gene_id "Myog"; transcript_id "Myog"; exon_number "1"; FPKM "81.2341305460"; frac "1.000000"; conf_lo "76.954202"; conf_hi "85.514059"; cov "36.025000";	494	510	523	0.9751434
chr1	Cufflinks	exon	136187670	136187751	1000	+	.	gene_id "Myog"; transcript_id "Myog"; exon_number "2"; FPKM "81.2341305460"; frac "1.000000"; conf_lo "76.954202"; conf_hi "85.514059"; cov "36.025000";	122	82	82	1.0000000
chr1	Cufflinks	exon	136188251	136189125	1000	+	.	gene_id "Myog"; transcript_id "Myog"; exon_number "3"; FPKM "81.2341305460"; frac "1.000000"; conf_lo "76.954202"; conf_hi "85.514059"; cov "36.025000";	906	875	875	1.0000000

Which is more consistent with htseq-count output (raw count of 1,522 vs 1,445 respectively). Gotta reopen IGV and count the reads one by one to decide whether the raw count of 2,073 derived from cufflinks output is correct or an overestimation.

Cheers,

Dado

**marcora** · 05-19-2010, 09:15 AM

Originally posted by marcora View Post

Which is more consistent with htseq-count output (raw count of 1,522 vs 1,445 respectively). Gotta reopen IGV and count the reads one by one to decide whether the raw count of 2,073 derived from cufflinks output is correct or an overestimation.

Okay, here we go.

Please find attached a screenshot of my igv viewer showing Sox17.

htseq-count reports a raw count of 46 for Sox17.

coverageBed reports a raw count of 46 for Sox17.

cufflinks "reports" a raw count of 66 for Sox17.

If you count the reads overlapping with exons in the screenshot, they are ~45. If you count all reads within the boundaries of the gene they are ~64.

Started to wonder what the FPKM reported by cufflinks (which is exactly the same across transcript and exon features) takes into consideration.

Just my two cents,

Dado

Attached Files

screenshot_sox17.jpg (20.8 KB, 325 views)

**Cole Trapnell** · 05-20-2010, 05:01 PM

Originally posted by marcora View Post

I am new to this RNAseq business and I would like to use it to measure differential gene expression between wild-type and mutant mouse brains.

I have designed my experiment to have 4 biological replicates each for case and control.

I really like bowtie/tophat and would like to use cufflinks/cuffdiff to measure diff gene expression, but I can't seem to find a way to make cuffdiff handle biological replicates (which I would hope is the way everybody designs their experiments

)... am I missing something?

Alternatively, I could use R packages like DESeq, but it's not easy - at least for me - to extract raw counts from cufflinks output (using awk at the moment)... it would be a great additional feature for cufflinks to output unnormalized read counts for each feature alongside FPKM values to avoid extra-steps and make it simpler and more versatile for beginner's use.

I would greatly appreciate any feedback from this community of experts!

Cheers

We will be directly supporting biological replicates within the next few weeks in both cuffdiff and cufflinks itself. We've recently worked out the math for how to handle them well in our model and improve the robustness of our statistical testing. I need a few weeks to implement the enhancements and do the testing, etc.

We will not be directly reporting the number of fragments that come from each isoform, because when dealing with fragments that could have come from one of several transcripts in the transcriptome, the fragments need to be "probabilistically" assigned, because we can't say for for sure which transcript each fragment came from. Cufflinks takes this assignment uncertainty into account when performing its tests for differential expression. Testing directly on "estimated" counts throws out much of the "alignment uncertainty" information Cufflinks has computed along with the abundance estimates. When dealing with ambiguously mappable fragments (e.g. all fragments incident on constitutive features of alternatively spliced genes) ignoring this uncertainty during testing, even at the gene level, is hazardous.

It's also worth noting that what Cufflinks is doing internally is computing for each transcript a number (in our paper we call this number "rho") that is it's relative abundance in the sample. We could write this down as a very small fraction, but instead we scale it up by a large constant and report FPKM - the expected number of fragments per kilobase of transcript per million fragments mapped. Now the source of the difference between the tools may be in two places:

1) When you use a GTF, Cufflinks only counts fragments that fall on features for the "millions of fragments mapped" in the denominator
2) Cufflinks uses each fragment's MAPQ to downweight multi-mapping fragments, because otherwise these get overcounted and contribute more to the overall map than uniquely mappable fragments. See the SAM format for more details on this.

So I think comparing the actual FPKM, counts, etc. between tools is tricky, and comparing Cufflinks to a tool that directly reports counts is even more so.

Because FPKM for a transcript is just a proxy for its relative abundance rho, it doesn't actually matter what this constant multiplier is, as long as its the same for all transcripts. We report FPKM simply because many people performing RNA-Seq are used to it.

**marcora** · 05-21-2010, 04:22 AM

Originally posted by Cole Trapnell View Post

We will be directly supporting biological replicates within the next few weeks in both cuffdiff and cufflinks itself. We've recently worked out the math for how to handle them well in our model and improve the robustness of our statistical testing. I need a few weeks to implement the enhancements and do the testing, etc.

Dear Cole,

thanx for your quick reply. The addition of this feature to cufflinks would definitively improve its versatility and adoption, since most experimental designs include/should include biological replicates.

We will not be directly reporting the number of fragments that come from each isoform, because when dealing with fragments that could have come from one of several transcripts in the transcriptome, the fragments need to be "probabilistically" assigned, because we can't say for for sure which transcript each fragment came from. Cufflinks takes this assignment uncertainty into account when performing its tests for differential expression. Testing directly on "estimated" counts throws out much of the "alignment uncertainty" information Cufflinks has computed along with the abundance estimates. When dealing with ambiguously mappable fragments (e.g. all fragments incident on constitutive features of alternatively spliced genes) ignoring this uncertainty during testing, even at the gene level, is hazardous.

I don't quite get this. If cufflinks outputs FPKMs then it should be able to output raw counts... at least for unambiguous assignments, like those to exons and genes (not alternatively spliced transcripts).

So I think comparing the actual FPKM, counts, etc. between tools is tricky, and comparing Cufflinks to a tool that directly reports counts is even more so.

If I've got this right, because of splice and multireads assignment, what cufflinks reports as FPKM cannot be directly translated into the raw counts of the kind that htseq-count and coverageBed output. Do you think that translating this FPKM value into a raw count value and then using tools like DESeq is valid or not, in order to complement or independently verify cuffdiff output once the 'biological replicates" feature is added?

Thanx a lot for these great tools!!!

Dado

**lpachter** · 05-21-2010, 10:28 AM

While it is ok to use raw counts to compare gene expression between samples, as Cole explained, to test differential expression of _isoforms_ its necessary to account for uncertainty in the assignment of reads to transcripts. Converting isoform FPKMs to counts and then applying DEseq is a bad idea because the uncertainty is then not incorporated into the DE calculation.

Secondly, to compare expression of genes within one experiment (or to measure relative expression of isoforms within a gene) one has to test the FPKMs and not counts, because counts do not accurately reflect the relative abundances. This is explained in more detail in the Cufflinks paper.

**marcora** · 05-21-2010, 01:32 PM

Originally posted by lpachter View Post

While it is ok to use raw counts to compare gene expression between samples, as Cole explained, to test differential expression of _isoforms_ its necessary to account for uncertainty in the assignment of reads to transcripts. Converting isoform FPKMs to counts and then applying DEseq is a bad idea because the uncertainty is then not incorporated into the DE calculation.

Secondly, to compare expression of genes within one experiment (or to measure relative expression of isoforms within a gene) one has to test the FPKMs and not counts, because counts do not accurately reflect the relative abundances. This is explained in more detail in the Cufflinks paper.

I most definitively understand the rationale behind FPKMs and why they're advantageous in certain contexts over raw counts. But the contrary is also true. I was just thinking about providing the end user with both for added flexibility. Not sure if it is possible, but it would also be great to see which reads get assigned to which isoform.

Thanx all for your helpful feedback.

**sdriscoll** · 05-21-2010, 05:16 PM

I see this is mostly resolved but I thought I'd throw out something I was doing to find the approximate number of reads that hit a gene.

In the coverage.wig file produced by Tophat you can find the bp ranges of where reads were aligned to the genome. If you find the bp range of any exon from the genome within the coverage.wig file you do the following to compute the approx number of reads that hit that exon: at each row within the exon's bp range take the second number column, subtract the first and multiply by the third column; sum those products up over the range of the exon; divide by your read length. It's rough but it works. I was also using the coverage.wig file to find approximate read alignment numbers. They end up pretty similar to the count of number of lines in the accepted_hits.sam file. I figure that it doesn't take into account the fact that not all reads are fully aligned but on the scales we are working with here i don't find it makes too much difference in the end results of the numbers I've computed based on these values.

Topics	Statistics	Last Post
UC San Diego Bioengineers Map Gene Function in Human Stem Cells by SEQadmin2 Started by SEQadmin2, 07-13-2026, 10:26 AM	0 responses 20 views 0 reactions	Last Post by SEQadmin2 07-13-2026, 10:26 AM
New Analysis Splits Leukemia Into 16 Epigenomic Subgroups by SEQadmin2 Started by SEQadmin2, 07-09-2026, 10:04 AM	0 responses 31 views 0 reactions	Last Post by SEQadmin2 07-09-2026, 10:04 AM
Genome-Wide CRISPR Screen Uncovers Unlikely Psoriasis Target by SEQadmin2 Started by SEQadmin2, 07-08-2026, 10:08 AM	0 responses 20 views 0 reactions	Last Post by SEQadmin2 07-08-2026, 10:08 AM
Engineered Protein Motor Takes Its First Steps Along DNA Track by SEQadmin2 Started by SEQadmin2, 07-07-2026, 11:05 AM	0 responses 34 views 0 reactions	Last Post by SEQadmin2 07-07-2026, 11:05 AM

Unconfigured Ad

Differential gene expression: Can Cufflinks/Cuffcompare handle biological replicates?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News