![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
htseq-count with warning for every read to represent all of zero counts in output | hibachings2013 | RNA Sequencing | 10 | 07-15-2011 11:19 AM |
Problem with Cufflinks output | admiral | RNA Sequencing | 0 | 06-01-2011 02:19 PM |
Interesting cufflinks output | jetspeeder | Bioinformatics | 8 | 12-21-2010 07:51 AM |
Cufflinks output - Next wat? | ritzriya | RNA Sequencing | 5 | 09-13-2010 04:03 AM |
cufflinks cuffcompare output | Mark | Bioinformatics | 1 | 07-19-2010 08:23 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Member
Location: Hyderabad, India Join Date: Apr 2010
Posts: 66
|
![]()
Hi All,
I used HTSeq-count on a .sam file, while providing an annotation in the GTF format. The program htseq-count does not seem to have run properly. I might be wrong. The thing is that I am getting FPKM values for some genes, whereas htseq-count comes up with a zero value. I sorted the Tophat output on the basis of "read id" and then used HTSeq-count with a drosophila GTF file. The gene expression output indicated that a lot of reads were not included. Also the gene expression "raw read count" was zero for a lot of genes, which had an FPKM value using Cufflinks. Here is what my GTF file looks like Code:
chr4 FlyBase exon 1144800 1144957 . + . gene_id "FBgn0013749"; transcript_id "FBtr0089192"; exon_number "1"; gene_name "Arf102F"; parent_type=mRNA; chr4 FlyBase exon 1145062 1145324 . + . gene_id "FBgn0013749"; transcript_id "FBtr0089192"; exon_number "2"; gene_name "Arf102F"; parent_type=mRNA; chr4 FlyBase exon 1145394 1145519 . + . gene_id "FBgn0013749"; transcript_id "FBtr0089192"; exon_number "3"; gene_name "Arf102F"; parent_type=mRNA; chr4 FlyBase exon 1145576 1145807 . + . gene_id "FBgn0013749"; transcript_id "FBtr0089192"; exon_number "4"; gene_name "Arf102F"; parent_type=mRNA; chr4 FlyBase exon 1135804 1136520 . - . gene_id "FBgn0039928"; transcript_id "FBtr0089207"; exon_number "12"; gene_name "cals"; parent_type=mRNA; chr4 FlyBase exon 1135804 1136520 . - . gene_id "FBgn0039928"; transcript_id "FBtr0089208"; exon_number "12"; gene_name "cals"; parent_type=mRNA; chr4 FlyBase exon 1136582 1136705 . - . gene_id "FBgn0039928"; transcript_id "FBtr0089207"; exon_number "11"; gene_name "cals"; parent_type=mRNA; chr4 FlyBase exon 1136582 1136705 . - . gene_id "FBgn0039928"; transcript_id "FBtr0089208"; exon_number "11"; gene_name "cals"; parent_type=mRNA; chr4 FlyBase exon 1136952 1137082 . - . gene_id "FBgn0039928"; transcript_id "FBtr0089207"; exon_number "10"; gene_name "cals"; parent_type=mRNA; chr4 FlyBase exon 1136952 1137082 . - . gene_id "FBgn0039928"; transcript_id "FBtr0089208"; exon_number "10"; gene_name "cals"; parent_type=mRNA; chr4 FlyBase exon 1137141 1137224 . - . gene_id "FBgn0039928"; transcript_id "FBtr0089207"; exon_number "9"; gene_name "cals"; parent_type=mRNA; chr4 FlyBase exon 1137141 1137224 . - . gene_id "FBgn0039928"; transcript_id "FBtr0089208"; exon_number "9"; gene_name "cals"; parent_type=mRNA; chr4 FlyBase exon 1137286 1137840 . - . gene_id "FBgn0039928"; transcript_id "FBtr0089207"; exon_number "8"; gene_name "cals"; parent_type=mRNA; chr4 FlyBase exon 1137286 1137840 . - . gene_id "FBgn0039928"; transcript_id "FBtr0089208"; exon_number "8"; gene_name "cals"; parent_type=mRNA; chr4 FlyBase exon 1138477 1138531 . - . gene_id "FBgn0039928"; transcript_id "FBtr0089207"; exon_number "7"; gene_name "cals"; parent_type=mRNA; chr4 FlyBase exon 1138477 1138531 . - . gene_id "FBgn0039928"; transcript_id "FBtr0089208"; exon_number "7"; gene_name "cals"; parent_type=mRNA; chr4 FlyBase exon 1138594 1139643 . - . gene_id "FBgn0039928"; transcript_id "FBtr0089207"; exon_number "6"; gene_name "cals"; parent_type=mRNA; chr4 FlyBase exon 1138594 1139643 . - . gene_id "FBgn0039928"; transcript_id "FBtr0089208"; exon_number "6"; gene_name "cals"; parent_type=mRNA; chr4 FlyBase exon 1140563 1140673 . - . gene_id "FBgn0039928"; transcript_id "FBtr0089207"; exon_number "5"; gene_name "cals"; parent_type=mRNA; chr4 FlyBase exon 1140563 1140673 . - . gene_id "FBgn0039928"; transcript_id "FBtr0089208"; exon_number "5"; gene_name "cals"; parent_type=mRNA; Code:
HWI-EAS313:1:1:5:36#0 0 chr3R 1291390 255 42M * 0 0 TTTTACTCCTGCAGGTCAGTAATTCAAGTCGGATATTAACTT BCA:;BCBCCCB.?A:A:@:3=7BC@9B/?BB+?)<?5=ABA NM:i:0 NH:i:1 HWI-EAS313:1:1:5:239#0 16 chr3L 16783628 255 42M * 0 0 CGCCGGACATGCCAGCGGCGAAGTTTCGTCCCGTAAGGCCGA BA?BBB@B@7=BAABBAA@BBBB9+=B@9@ABBBBBB@=ABA NM:i:1 NH:i:1 HWI-EAS313:1:1:5:462#0 16 chr3R 19011622 255 42M * 0 0 CAAGGAGATTCTTGGTATCTAAGCACTCCTCTTACCAAATTT ###B95+>5:>)+=B>-.;1<5A5)7@AC?>::=?A=>B9A> NM:i:2 NH:i:1 HWI-EAS313:1:1:5:480#0 16 chr3R 8845667 255 42M * 0 0 CAAAGGAGGATCTTGAGAGGTTCAAAGCTTTACGCATTATCG ?BCBCC7CBBCBCBCBCA=ACACBBCB@37C;C<BB@BC?CB NM:i:1 NH:i:1 HWI-EAS313:1:1:5:501#0 16 chr3R 25748393 1 42M * 0 0 CCCAGCTGCACCTGCCGCGACCAGGACTACGCCGGATGGTGG ##;39;.,51,1%0@==;(:7>?A@6@/-><AA@A=78>5;A NM:i:5 NH:i:4 CC:Z:= CP:i:25749839 HWI-EAS313:1:1:5:501#0 0 chr3R 25749839 1 42M * 0 0 CCACCATCCGGCGTAGTCCTGGTCGCGGCAGGTGCAGCTGGG A;5>87=A@AA<>-/@6@A?>7:(;==@0%1,15,.;93;## NM:i:5 NH:i:4 CC:Z:= CP:i:26313149 HWI-EAS313:1:1:5:501#0 0 chr3R 26313149 1 42M * 0 0 CCACCATCCGGCGTAGTCCTGGTCGCGGCAGGTGCAGCTGGG A;5>87=A@AA<>-/@6@A?>7:(;==@0%1,15,.;93;## NM:i:5 NH:i:4 CC:Z:= CP:i:26315145 HWI-EAS313:1:1:5:501#0 16 chr3R 26315145 1 42M * 0 0 CCCAGCTGCACCTGCCGCGACCAGGACTACGCCGGATGGTGG ##;39;.,51,1%0@==;(:7>?A@6@/-><AA@A=78>5;A NM:i:5 NH:i:4 HWI-EAS313:1:1:5:643#0 0 chr3R 25748461 3 42M * 0 0 CTACGATGGCAGCCCACTGCCCGACTGGCTCCAGTCCGTCGA @B9BBBB@BBBBA@ABB9BBBBB<BBBBBAB@==?@A<#### NM:i:0 NH:i:2 CC:Z:= CP:i:25749771 HWI-EAS313:1:1:5:643#0 16 chr3R 25749771 3 42M * 0 0 TCGACGGACTGGAGCCAGTCGGGCAGTGGGCTGCCATCGTAG ####<A@?==@BABBBBB<BBBBB9BBA@ABBBB@BBBB9B@ NM:i:0 NH:i:2 HWI-EAS313:1:1:5:761#0 16 chr3L 16031864 255 42M * 0 0 CATGACCGCATGGCAGGAATGCCGTATTGTACTCGGCGCCGT A@<?=AAABAABA=ABAB@BB4@AAABA;<>??7@=@3AA@B NM:i:0 NH:i:1 HWI-EAS313:1:1:5:871#0 16 chrX 5627098 255 42M * 0 0 CGTTTATCCTGCGGATCGATTGCGGTGCTATCAGTGCAGCGG =CC@;A@=@@CCCC>CACCAC@@BBABBAACBCCCCCCCBCB NM:i:0 NH:i:1 HWI-EAS313:1:1:5:1080#0 16 chr3L 3195955 255 42M * 0 0 AGGTCANCAACTCTGCCTTCGTGGAGCGCGTCAAGGCCCGTG >@@=9<%8>A@;<@@B@;@922BBB@<@@A@>@BBA?@8@AB NM:i:1 NH:i:1 HWI-EAS313:1:1:5:1163#0 16 chr3L 14394814 255 42M * 0 0 AAAGATCGGCATCCATCTGAGCAATGGCACCGGCAATTCTGT ##@<=<ACBBB;4@ABB;*<ACC7;;CC,BBCC?B;AABCCB NM:i:3 NH:i:1 Code:
python -m HTSeq.scripts.count accepted_hits.sam ../ALLEXONS.gtf > output.out Code:
gene_id S1 FBgn0000003 208.127 FBgn0000008 2.84909 FBgn0000014 8.2101 FBgn0000015 3.71491 FBgn0000017 4.53253 FBgn0000018 4.94072 FBgn0000024 5.39896 FBgn0000032 18.2969 FBgn0000036 0.615742 FBgn0000037 0.651662 FBgn0000038 1.6011 FBgn0000039 1.12615 Code:
FBgn0000003 0 FBgn0000008 52 FBgn0000014 157 FBgn0000015 56 FBgn0000017 123 FBgn0000018 36 FBgn0000022 5 FBgn0000024 85 FBgn0000028 0 FBgn0000032 144 |
![]() |
![]() |
![]() |
#2 |
Senior Member
Location: Heidelberg, Germany Join Date: Feb 2010
Posts: 994
|
![]()
HTSeq-count can be quite conservative in the sense that it discards reads that cannot be unambiguously be assigned to a gene. This can cause values to be lower than reported by other tools. (This is done on purpose: For differential expression analysis, you need to discount ambiguous reads to avoid that differential signal on one gene shows up in another one that overlaps.)
The gene FBgn0000003 is an extremely short ncRNA (less than 300 bp), so it is well possible that very few reads, maybe just a single one, give rise to a rather high FPKM value. If you now wonder why these few reads were discounted by HTSeq and counted by cufflinks, display your SAM file in a genome browser (e.g., IGV) and have a look at the FBgn0000003 locus. How many reads do you see there? Do they maybe overlap with more than one feature? If you write down these reads' IDs, you can then use HTSeq-count's new '-o' option to check out what these reads were assigned to. Finally: Your gene is on the '+' strand. By default, htseq-count only counts reads that were aligned to the same strand as the feature. This makes sense if your RNA-Seq protocol poreserves strand information. Remember that Illumina's default protocol does not do this, and for such data, you must specify the option '--stranded=no' to get htseq-count to count reads on either strand. If this does not resolve the issue, post a screen shot from your genome browser, showing a disputed region, and maybe I can spot something. Simon |
![]() |
![]() |
![]() |
#3 |
Member
Location: Hyderabad, India Join Date: Apr 2010
Posts: 66
|
![]()
Hi Simon,
Thanks for the input. Yes I ran the program with --stranded=no, and immediately the no_feature dropped from ~375000 to ~185000. Ambiguos and alignment_not_unique did not change much. About the gene "FBgn0000003", I find that a lot of reads are associated with it. I am sending a screenshot. I investigated a few (10 reads), and find that they are "alignment_not_unique". Probably that is why htseq-count gives zero. But why does Cufflinks provide expression value for this? I would have thought that they account for ambiguity. Thanks Abhijit |
![]() |
![]() |
![]() |
#4 | |
Senior Member
Location: Heidelberg, Germany Join Date: Feb 2010
Posts: 994
|
![]() Quote:
But note that this is not necessarily a bug. Depending on what you want to do with your data, I see two rationales on how to deal with reads that map to multiple places. 1. For htseq-count, I imagined the user to then use DESeq (or edgeR) to test for differential expression. Imagine we have two paralogous genes that have identical sequence at one half or their length and divergent sequence at the other half, and one of these genes is differentially expressed and the other is not. All reads that stem from the identical-sequence parts of the transcripts will map to both genes, and if we include them in our counts, both genes will appear to be differentially expressed, even though only one is really. If we count only the uniquely mapping reads (i.e., those stemming from the divergent parts of the transcripts), we are safe. 2. It seems to me that the cufflinks developers rather had in mind that the user has one sample and wishes to study which genes are strongly expressed and which are not. If we disregard non-uniquely mapped reads, all those genes that have highly similar paralogs will get less counts than genes that have diverged a lot since their last duplication event, i.e. the counts are biased according to the gene's within-genome evolutionary history. This is especially an issue because cufflinks reports FPKM values, i.e., it divides by transcript length. Note that mappability is a feature of a the reference, not of the read. If a part of a gene's sequence appears somewhere else in the genome, no read will map there uniquely, i.e., the part cannot get any reads. Hence it should be subtracted from the transcript length when calculating FPKM values. Alternatively, one could distribute non-uniquely mapped reads to the different places, i.e., count them fractionally for each of their possible mappings. Maybe, this is what cufflinks does, but it makes the job of differential expression calling rather difficult, and I wonder how cuffdiff could take it into account. Simon |
|
![]() |
![]() |
![]() |
#5 |
Member
Location: Berkeley, cA Join Date: Feb 2010
Posts: 40
|
![]()
Cufflinks currently divides multi-reads (i.e. reads that map to multiple locations) uniformly in calculating expression values. This is not correct, however it is worth pointing out that (probabilistic) assignment of multi-reads is mathematically equivalent to the assignment of reads that map uniquely but to multiple isoforms. In fact, the RSEM program correctly handles multi-reads. We have worked out an effective strategy for probabilistically assigning multi-reads within the Cufflinks framework and the implementation is scheduled for the release-after-next (hopefully ~1 month away).
We have procrastinated dealing with multi-reads because as read length has increased they have been less of a problem. However they still affect expression estimates in gene families, and we are sensitive to the fact that many users still use short reads. In terms of differential expression estimates, there is nothing difficult about incorporating probabilistic assignment of multi-reads- it works exactly the same way as the current method implemented in cuffdiff, even with replicates. Lior P.S. Actually mappability is not a feature of the reference- it is a feature of the transcriptome. A read may map to multiple locations in a reference, but it may be possible to assign it to only one location if the other transcripts have very low expression (that can be estimated using other reads that map to them). In other words, mappability and expression estimation are intimately related, and fractional assignment of mappings is not an option, it is essential if one doesn't want to lose power by throwing away data. |
![]() |
![]() |
![]() |
#6 |
Member
Location: Hyderabad, India Join Date: Apr 2010
Posts: 66
|
![]()
Hi Simon and Lior,
Thank you for your thoughts. What I am trying to do is find a ratio between the autosomal expression in wild type male of drosophila and a mutant. We have reason to believe that global autosomal gene expression is elevated in the mutant. Therefore even if we wrongly call differential gene expression for some genes, on a global level it should not matter much. However, mapping reads correctly to call DGE is still important to me looking ahead. I had therefore tried to use as reference the transcripts of drosophila, rather than the genome. I was not concerned with finding novel transcripts or genes. Anyway, I used bowtie-build on the transcripts, then used tophat on the generated indices. I then designed a perl script which would count as "one" any read that mapped to two or more transcripts of the same gene, and would disregard any read which hit two transcripts belonging to different genes (I did not have a formatted -GTF file at the time, because then I could have just used the -GTF option of tophat on the reference genome). I find that the results by running against the transcripts is not very different (when it comes to ratios) from when run against the genome. But I agree that the problem of "counts biased according to the genes within-genome evolutionary history" and also biased if a domain is the same between two completely unrelated genes will still remain. Mate-pairs and longer read lengths seem to be the only ways to lessen this problem. Thanks Abhijit |
![]() |
![]() |
![]() |
Thread Tools | |
|
|