SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
htseq-count with warning for every read to represent all of zero counts in output hibachings2013 RNA Sequencing 10 07-15-2011 10:19 AM
Problem with Cufflinks output admiral RNA Sequencing 0 06-01-2011 01:19 PM
Interesting cufflinks output jetspeeder Bioinformatics 8 12-21-2010 06:51 AM
Cufflinks output - Next wat? ritzriya RNA Sequencing 5 09-13-2010 03:03 AM
cufflinks cuffcompare output Mark Bioinformatics 1 07-19-2010 07:23 AM

Reply
 
Thread Tools
Old 01-27-2011, 01:25 PM   #1
gen2prot
Member
 
Location: Hyderabad, India

Join Date: Apr 2010
Posts: 64
Default HTSeq output not correlated with Cufflinks output... Help

Hi All,

I used HTSeq-count on a .sam file, while providing an annotation in the GTF format. The program htseq-count does not seem to have run properly. I might be wrong. The thing is that I am getting FPKM values for some genes, whereas htseq-count comes up with a zero value. I sorted the Tophat output on the basis of "read id" and then used HTSeq-count with a drosophila GTF file. The gene expression output indicated that a lot of reads were not included. Also the gene expression "raw read count" was zero for a lot of genes, which had an FPKM value using Cufflinks. Here is what my GTF file looks like

Code:
chr4	FlyBase	exon	1144800	1144957	.	+	.	gene_id "FBgn0013749"; transcript_id "FBtr0089192"; exon_number "1"; gene_name "Arf102F"; parent_type=mRNA;
chr4	FlyBase	exon	1145062	1145324	.	+	.	gene_id "FBgn0013749"; transcript_id "FBtr0089192"; exon_number "2"; gene_name "Arf102F"; parent_type=mRNA;
chr4	FlyBase	exon	1145394	1145519	.	+	.	gene_id "FBgn0013749"; transcript_id "FBtr0089192"; exon_number "3"; gene_name "Arf102F"; parent_type=mRNA;
chr4	FlyBase	exon	1145576	1145807	.	+	.	gene_id "FBgn0013749"; transcript_id "FBtr0089192"; exon_number "4"; gene_name "Arf102F"; parent_type=mRNA;
chr4	FlyBase	exon	1135804	1136520	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089207"; exon_number "12"; gene_name "cals"; parent_type=mRNA;
chr4	FlyBase	exon	1135804	1136520	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089208"; exon_number "12"; gene_name "cals"; parent_type=mRNA;
chr4	FlyBase	exon	1136582	1136705	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089207"; exon_number "11"; gene_name "cals"; parent_type=mRNA;
chr4	FlyBase	exon	1136582	1136705	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089208"; exon_number "11"; gene_name "cals"; parent_type=mRNA;
chr4	FlyBase	exon	1136952	1137082	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089207"; exon_number "10"; gene_name "cals"; parent_type=mRNA;
chr4	FlyBase	exon	1136952	1137082	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089208"; exon_number "10"; gene_name "cals"; parent_type=mRNA;
chr4	FlyBase	exon	1137141	1137224	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089207"; exon_number "9"; gene_name "cals"; parent_type=mRNA;
chr4	FlyBase	exon	1137141	1137224	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089208"; exon_number "9"; gene_name "cals"; parent_type=mRNA;
chr4	FlyBase	exon	1137286	1137840	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089207"; exon_number "8"; gene_name "cals"; parent_type=mRNA;
chr4	FlyBase	exon	1137286	1137840	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089208"; exon_number "8"; gene_name "cals"; parent_type=mRNA;
chr4	FlyBase	exon	1138477	1138531	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089207"; exon_number "7"; gene_name "cals"; parent_type=mRNA;
chr4	FlyBase	exon	1138477	1138531	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089208"; exon_number "7"; gene_name "cals"; parent_type=mRNA;
chr4	FlyBase	exon	1138594	1139643	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089207"; exon_number "6"; gene_name "cals"; parent_type=mRNA;
chr4	FlyBase	exon	1138594	1139643	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089208"; exon_number "6"; gene_name "cals"; parent_type=mRNA;
chr4	FlyBase	exon	1140563	1140673	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089207"; exon_number "5"; gene_name "cals"; parent_type=mRNA;
chr4	FlyBase	exon	1140563	1140673	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089208"; exon_number "5"; gene_name "cals"; parent_type=mRNA;
Here are a few lines from the .sam file

Code:
HWI-EAS313:1:1:5:36#0	0	chr3R	1291390	255	42M	*	0	0	TTTTACTCCTGCAGGTCAGTAATTCAAGTCGGATATTAACTT	BCA:;BCBCCCB.?A:A:@:3=7BC@9B/?BB+?)<?5=ABA	NM:i:0	NH:i:1
HWI-EAS313:1:1:5:239#0	16	chr3L	16783628	255	42M	*	0	0	CGCCGGACATGCCAGCGGCGAAGTTTCGTCCCGTAAGGCCGA	BA?BBB@B@7=BAABBAA@BBBB9+=B@9@ABBBBBB@=ABA	NM:i:1	NH:i:1
HWI-EAS313:1:1:5:462#0	16	chr3R	19011622	255	42M	*	0	0	CAAGGAGATTCTTGGTATCTAAGCACTCCTCTTACCAAATTT	###B95+>5:>)+=B>-.;1<5A5)7@AC?>::=?A=>B9A>	NM:i:2	NH:i:1
HWI-EAS313:1:1:5:480#0	16	chr3R	8845667	255	42M	*	0	0	CAAAGGAGGATCTTGAGAGGTTCAAAGCTTTACGCATTATCG	?BCBCC7CBBCBCBCBCA=ACACBBCB@37C;C<BB@BC?CB	NM:i:1	NH:i:1
HWI-EAS313:1:1:5:501#0	16	chr3R	25748393	1	42M	*	0	0	CCCAGCTGCACCTGCCGCGACCAGGACTACGCCGGATGGTGG	##;39;.,51,1%0@==;(:7>?A@6@/-><AA@A=78>5;A	NM:i:5	NH:i:4	CC:Z:=	CP:i:25749839
HWI-EAS313:1:1:5:501#0	0	chr3R	25749839	1	42M	*	0	0	CCACCATCCGGCGTAGTCCTGGTCGCGGCAGGTGCAGCTGGG	A;5>87=A@AA<>-/@6@A?>7:(;==@0%1,15,.;93;##	NM:i:5	NH:i:4	CC:Z:=	CP:i:26313149
HWI-EAS313:1:1:5:501#0	0	chr3R	26313149	1	42M	*	0	0	CCACCATCCGGCGTAGTCCTGGTCGCGGCAGGTGCAGCTGGG	A;5>87=A@AA<>-/@6@A?>7:(;==@0%1,15,.;93;##	NM:i:5	NH:i:4	CC:Z:=	CP:i:26315145
HWI-EAS313:1:1:5:501#0	16	chr3R	26315145	1	42M	*	0	0	CCCAGCTGCACCTGCCGCGACCAGGACTACGCCGGATGGTGG	##;39;.,51,1%0@==;(:7>?A@6@/-><AA@A=78>5;A	NM:i:5	NH:i:4
HWI-EAS313:1:1:5:643#0	0	chr3R	25748461	3	42M	*	0	0	CTACGATGGCAGCCCACTGCCCGACTGGCTCCAGTCCGTCGA	@B9BBBB@BBBBA@ABB9BBBBB<BBBBBAB@==?@A<####	NM:i:0	NH:i:2	CC:Z:=	CP:i:25749771
HWI-EAS313:1:1:5:643#0	16	chr3R	25749771	3	42M	*	0	0	TCGACGGACTGGAGCCAGTCGGGCAGTGGGCTGCCATCGTAG	####<A@?==@BABBBBB<BBBBB9BBA@ABBBB@BBBB9B@	NM:i:0	NH:i:2
HWI-EAS313:1:1:5:761#0	16	chr3L	16031864	255	42M	*	0	0	CATGACCGCATGGCAGGAATGCCGTATTGTACTCGGCGCCGT	A@<?=AAABAABA=ABAB@BB4@AAABA;<>??7@=@3AA@B	NM:i:0	NH:i:1
HWI-EAS313:1:1:5:871#0	16	chrX	5627098	255	42M	*	0	0	CGTTTATCCTGCGGATCGATTGCGGTGCTATCAGTGCAGCGG	=CC@;A@=@@CCCC>CACCAC@@BBABBAACBCCCCCCCBCB	NM:i:0	NH:i:1
HWI-EAS313:1:1:5:1080#0	16	chr3L	3195955	255	42M	*	0	0	AGGTCANCAACTCTGCCTTCGTGGAGCGCGTCAAGGCCCGTG	>@@=9<%8>A@;<@@B@;@922BBB@<@@A@>@BBA?@8@AB	NM:i:1	NH:i:1
HWI-EAS313:1:1:5:1163#0	16	chr3L	14394814	255	42M	*	0	0	AAAGATCGGCATCCATCTGAGCAATGGCACCGGCAATTCTGT	##@<=<ACBBB;4@ABB;*<ACC7;;CC,BBCC?B;AABCCB	NM:i:3	NH:i:1
Here is the python command I gave

Code:
python -m HTSeq.scripts.count accepted_hits.sam ../ALLEXONS.gtf > output.out
Cufflinks output

Code:
gene_id	S1
FBgn0000003	208.127
FBgn0000008	2.84909
FBgn0000014	8.2101
FBgn0000015	3.71491
FBgn0000017	4.53253
FBgn0000018	4.94072
FBgn0000024	5.39896
FBgn0000032	18.2969
FBgn0000036	0.615742
FBgn0000037	0.651662
FBgn0000038	1.6011
FBgn0000039	1.12615
HTSeq-count output

Code:
FBgn0000003	0
FBgn0000008	52
FBgn0000014	157
FBgn0000015	56
FBgn0000017	123
FBgn0000018	36
FBgn0000022	5
FBgn0000024	85
FBgn0000028	0
FBgn0000032	144
As you can see the cufflinks and htseq-count output are not coherent. there are genes missing in cufflinks while "FBGn0000003" shows a value in cufflinks but zero in htseq. Maybe Dr. Simon Anders can help with this. I cannot understand. Furthermore is the formatting of my GTF file correct?
gen2prot is offline   Reply With Quote
Old 01-28-2011, 04:55 AM   #2
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

HTSeq-count can be quite conservative in the sense that it discards reads that cannot be unambiguously be assigned to a gene. This can cause values to be lower than reported by other tools. (This is done on purpose: For differential expression analysis, you need to discount ambiguous reads to avoid that differential signal on one gene shows up in another one that overlaps.)

The gene FBgn0000003 is an extremely short ncRNA (less than 300 bp), so it is well possible that very few reads, maybe just a single one, give rise to a rather high FPKM value. If you now wonder why these few reads were discounted by HTSeq and counted by cufflinks, display your SAM file in a genome browser (e.g., IGV) and have a look at the FBgn0000003 locus. How many reads do you see there? Do they maybe overlap with more than one feature? If you write down these reads' IDs, you can then use HTSeq-count's new '-o' option to check out what these reads were assigned to.

Finally: Your gene is on the '+' strand. By default, htseq-count only counts reads that were aligned to the same strand as the feature. This makes sense if your RNA-Seq protocol poreserves strand information. Remember that Illumina's default protocol does not do this, and for such data, you must specify the option '--stranded=no' to get htseq-count to count reads on either strand.

If this does not resolve the issue, post a screen shot from your genome browser, showing a disputed region, and maybe I can spot something.

Simon
Simon Anders is offline   Reply With Quote
Old 01-28-2011, 01:17 PM   #3
gen2prot
Member
 
Location: Hyderabad, India

Join Date: Apr 2010
Posts: 64
Default

Hi Simon,

Thanks for the input. Yes I ran the program with --stranded=no, and immediately the no_feature dropped from ~375000 to ~185000. Ambiguos and alignment_not_unique did not change much.

About the gene "FBgn0000003", I find that a lot of reads are associated with it. I am sending a screenshot. I investigated a few (10 reads), and find that they are "alignment_not_unique". Probably that is why htseq-count gives zero. But why does Cufflinks provide expression value for this? I would have thought that they account for ambiguity.

Thanks
Abhijit
Attached Files
File Type: pdf igv_snapshot.pdf (229.7 KB, 234 views)
gen2prot is offline   Reply With Quote
Old 01-30-2011, 04:07 AM   #4
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by gen2prot View Post
But why does Cufflinks provide expression value for this? I would have thought that they account for ambiguity.
Interesting question. Maybe send a mail to the Cufflinks developers and ask. If you do, please tell us what they say.

But note that this is not necessarily a bug. Depending on what you want to do with your data, I see two rationales on how to deal with reads that map to multiple places.

1. For htseq-count, I imagined the user to then use DESeq (or edgeR) to test for differential expression. Imagine we have two paralogous genes that have identical sequence at one half or their length and divergent sequence at the other half, and one of these genes is differentially expressed and the other is not. All reads that stem from the identical-sequence parts of the transcripts will map to both genes, and if we include them in our counts, both genes will appear to be differentially expressed, even though only one is really. If we count only the uniquely mapping reads (i.e., those stemming from the divergent parts of the transcripts), we are safe.

2. It seems to me that the cufflinks developers rather had in mind that the user has one sample and wishes to study which genes are strongly expressed and which are not. If we disregard non-uniquely mapped reads, all those genes that have highly similar paralogs will get less counts than genes that have diverged a lot since their last duplication event, i.e. the counts are biased according to the gene's within-genome evolutionary history.

This is especially an issue because cufflinks reports FPKM values, i.e., it divides by transcript length. Note that mappability is a feature of a the reference, not of the read. If a part of a gene's sequence appears somewhere else in the genome, no read will map there uniquely, i.e., the part cannot get any reads. Hence it should be subtracted from the transcript length when calculating FPKM values. Alternatively, one could distribute non-uniquely mapped reads to the different places, i.e., count them fractionally for each of their possible mappings. Maybe, this is what cufflinks does, but it makes the job of differential expression calling rather difficult, and I wonder how cuffdiff could take it into account.

Simon
Simon Anders is offline   Reply With Quote
Old 01-30-2011, 07:38 AM   #5
lpachter
Member
 
Location: Berkeley, cA

Join Date: Feb 2010
Posts: 40
Default

Cufflinks currently divides multi-reads (i.e. reads that map to multiple locations) uniformly in calculating expression values. This is not correct, however it is worth pointing out that (probabilistic) assignment of multi-reads is mathematically equivalent to the assignment of reads that map uniquely but to multiple isoforms. In fact, the RSEM program correctly handles multi-reads. We have worked out an effective strategy for probabilistically assigning multi-reads within the Cufflinks framework and the implementation is scheduled for the release-after-next (hopefully ~1 month away).

We have procrastinated dealing with multi-reads because as read length has increased they have been less of a problem. However they still affect expression estimates in gene families, and we are sensitive to the fact that many users still use short reads. In terms of differential expression estimates, there is nothing difficult about incorporating probabilistic assignment of multi-reads- it works exactly the same way as the current method implemented in cuffdiff, even with replicates.

Lior
P.S. Actually mappability is not a feature of the reference- it is a feature of the transcriptome. A read may map to multiple locations in a reference, but it may be possible to assign it to only one location if the other transcripts have very low expression (that can be estimated using other reads that map to them). In other words, mappability and expression estimation are intimately related, and fractional assignment of mappings is not an option, it is essential if one doesn't want to lose power by throwing away data.
lpachter is offline   Reply With Quote
Old 01-31-2011, 09:16 AM   #6
gen2prot
Member
 
Location: Hyderabad, India

Join Date: Apr 2010
Posts: 64
Default

Hi Simon and Lior,

Thank you for your thoughts. What I am trying to do is find a ratio between the autosomal expression in wild type male of drosophila and a mutant. We have reason to believe that global autosomal gene expression is elevated in the mutant. Therefore even if we wrongly call differential gene expression for some genes, on a global level it should not matter much. However, mapping reads correctly to call DGE is still important to me looking ahead. I had therefore tried to use as reference the transcripts of drosophila, rather than the genome. I was not concerned with finding novel transcripts or genes. Anyway, I used bowtie-build on the transcripts, then used tophat on the generated indices. I then designed a perl script which would count as "one" any read that mapped to two or more transcripts of the same gene, and would disregard any read which hit two transcripts belonging to different genes (I did not have a formatted -GTF file at the time, because then I could have just used the -GTF option of tophat on the reference genome). I find that the results by running against the transcripts is not very different (when it comes to ratios) from when run against the genome.

But I agree that the problem of "counts biased according to the genes within-genome evolutionary history" and also biased if a domain is the same between two completely unrelated genes will still remain. Mate-pairs and longer read lengths seem to be the only ways to lessen this problem.

Thanks
Abhijit
gen2prot is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:04 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO