Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • HTSeq output not correlated with Cufflinks output... Help

    Hi All,

    I used HTSeq-count on a .sam file, while providing an annotation in the GTF format. The program htseq-count does not seem to have run properly. I might be wrong. The thing is that I am getting FPKM values for some genes, whereas htseq-count comes up with a zero value. I sorted the Tophat output on the basis of "read id" and then used HTSeq-count with a drosophila GTF file. The gene expression output indicated that a lot of reads were not included. Also the gene expression "raw read count" was zero for a lot of genes, which had an FPKM value using Cufflinks. Here is what my GTF file looks like

    Code:
    chr4	FlyBase	exon	1144800	1144957	.	+	.	gene_id "FBgn0013749"; transcript_id "FBtr0089192"; exon_number "1"; gene_name "Arf102F"; parent_type=mRNA;
    chr4	FlyBase	exon	1145062	1145324	.	+	.	gene_id "FBgn0013749"; transcript_id "FBtr0089192"; exon_number "2"; gene_name "Arf102F"; parent_type=mRNA;
    chr4	FlyBase	exon	1145394	1145519	.	+	.	gene_id "FBgn0013749"; transcript_id "FBtr0089192"; exon_number "3"; gene_name "Arf102F"; parent_type=mRNA;
    chr4	FlyBase	exon	1145576	1145807	.	+	.	gene_id "FBgn0013749"; transcript_id "FBtr0089192"; exon_number "4"; gene_name "Arf102F"; parent_type=mRNA;
    chr4	FlyBase	exon	1135804	1136520	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089207"; exon_number "12"; gene_name "cals"; parent_type=mRNA;
    chr4	FlyBase	exon	1135804	1136520	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089208"; exon_number "12"; gene_name "cals"; parent_type=mRNA;
    chr4	FlyBase	exon	1136582	1136705	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089207"; exon_number "11"; gene_name "cals"; parent_type=mRNA;
    chr4	FlyBase	exon	1136582	1136705	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089208"; exon_number "11"; gene_name "cals"; parent_type=mRNA;
    chr4	FlyBase	exon	1136952	1137082	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089207"; exon_number "10"; gene_name "cals"; parent_type=mRNA;
    chr4	FlyBase	exon	1136952	1137082	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089208"; exon_number "10"; gene_name "cals"; parent_type=mRNA;
    chr4	FlyBase	exon	1137141	1137224	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089207"; exon_number "9"; gene_name "cals"; parent_type=mRNA;
    chr4	FlyBase	exon	1137141	1137224	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089208"; exon_number "9"; gene_name "cals"; parent_type=mRNA;
    chr4	FlyBase	exon	1137286	1137840	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089207"; exon_number "8"; gene_name "cals"; parent_type=mRNA;
    chr4	FlyBase	exon	1137286	1137840	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089208"; exon_number "8"; gene_name "cals"; parent_type=mRNA;
    chr4	FlyBase	exon	1138477	1138531	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089207"; exon_number "7"; gene_name "cals"; parent_type=mRNA;
    chr4	FlyBase	exon	1138477	1138531	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089208"; exon_number "7"; gene_name "cals"; parent_type=mRNA;
    chr4	FlyBase	exon	1138594	1139643	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089207"; exon_number "6"; gene_name "cals"; parent_type=mRNA;
    chr4	FlyBase	exon	1138594	1139643	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089208"; exon_number "6"; gene_name "cals"; parent_type=mRNA;
    chr4	FlyBase	exon	1140563	1140673	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089207"; exon_number "5"; gene_name "cals"; parent_type=mRNA;
    chr4	FlyBase	exon	1140563	1140673	.	-	.	gene_id "FBgn0039928"; transcript_id "FBtr0089208"; exon_number "5"; gene_name "cals"; parent_type=mRNA;
    Here are a few lines from the .sam file

    Code:
    HWI-EAS313:1:1:5:36#0	0	chr3R	1291390	255	42M	*	0	0	TTTTACTCCTGCAGGTCAGTAATTCAAGTCGGATATTAACTT	BCA:;BCBCCCB.?A:A:@:3=7BC@9B/?BB+?)<?5=ABA	NM:i:0	NH:i:1
    HWI-EAS313:1:1:5:239#0	16	chr3L	16783628	255	42M	*	0	0	CGCCGGACATGCCAGCGGCGAAGTTTCGTCCCGTAAGGCCGA	BA?BBB@B@7=BAABBAA@BBBB9+=B@9@ABBBBBB@=ABA	NM:i:1	NH:i:1
    HWI-EAS313:1:1:5:462#0	16	chr3R	19011622	255	42M	*	0	0	CAAGGAGATTCTTGGTATCTAAGCACTCCTCTTACCAAATTT	###B95+>5:>)+=B>-.;1<5A5)7@AC?>::=?A=>B9A>	NM:i:2	NH:i:1
    HWI-EAS313:1:1:5:480#0	16	chr3R	8845667	255	42M	*	0	0	CAAAGGAGGATCTTGAGAGGTTCAAAGCTTTACGCATTATCG	?BCBCC7CBBCBCBCBCA=ACACBBCB@37C;C<BB@BC?CB	NM:i:1	NH:i:1
    HWI-EAS313:1:1:5:501#0	16	chr3R	25748393	1	42M	*	0	0	CCCAGCTGCACCTGCCGCGACCAGGACTACGCCGGATGGTGG	##;39;.,51,1%0@==;(:7>?A@6@/-><AA@A=78>5;A	NM:i:5	NH:i:4	CC:Z:=	CP:i:25749839
    HWI-EAS313:1:1:5:501#0	0	chr3R	25749839	1	42M	*	0	0	CCACCATCCGGCGTAGTCCTGGTCGCGGCAGGTGCAGCTGGG	A;5>87=A@AA<>-/@6@A?>7:(;==@0%1,15,.;93;##	NM:i:5	NH:i:4	CC:Z:=	CP:i:26313149
    HWI-EAS313:1:1:5:501#0	0	chr3R	26313149	1	42M	*	0	0	CCACCATCCGGCGTAGTCCTGGTCGCGGCAGGTGCAGCTGGG	A;5>87=A@AA<>-/@6@A?>7:(;==@0%1,15,.;93;##	NM:i:5	NH:i:4	CC:Z:=	CP:i:26315145
    HWI-EAS313:1:1:5:501#0	16	chr3R	26315145	1	42M	*	0	0	CCCAGCTGCACCTGCCGCGACCAGGACTACGCCGGATGGTGG	##;39;.,51,1%0@==;(:7>?A@6@/-><AA@A=78>5;A	NM:i:5	NH:i:4
    HWI-EAS313:1:1:5:643#0	0	chr3R	25748461	3	42M	*	0	0	CTACGATGGCAGCCCACTGCCCGACTGGCTCCAGTCCGTCGA	@B9BBBB@BBBBA@ABB9BBBBB<BBBBBAB@==?@A<####	NM:i:0	NH:i:2	CC:Z:=	CP:i:25749771
    HWI-EAS313:1:1:5:643#0	16	chr3R	25749771	3	42M	*	0	0	TCGACGGACTGGAGCCAGTCGGGCAGTGGGCTGCCATCGTAG	####<A@?==@BABBBBB<BBBBB9BBA@ABBBB@BBBB9B@	NM:i:0	NH:i:2
    HWI-EAS313:1:1:5:761#0	16	chr3L	16031864	255	42M	*	0	0	CATGACCGCATGGCAGGAATGCCGTATTGTACTCGGCGCCGT	A@<?=AAABAABA=ABAB@BB4@AAABA;<>??7@=@3AA@B	NM:i:0	NH:i:1
    HWI-EAS313:1:1:5:871#0	16	chrX	5627098	255	42M	*	0	0	CGTTTATCCTGCGGATCGATTGCGGTGCTATCAGTGCAGCGG	=CC@;A@=@@CCCC>CACCAC@@BBABBAACBCCCCCCCBCB	NM:i:0	NH:i:1
    HWI-EAS313:1:1:5:1080#0	16	chr3L	3195955	255	42M	*	0	0	AGGTCANCAACTCTGCCTTCGTGGAGCGCGTCAAGGCCCGTG	>@@=9<%8>A@;<@@B@;@922BBB@<@@A@>@BBA?@8@AB	NM:i:1	NH:i:1
    HWI-EAS313:1:1:5:1163#0	16	chr3L	14394814	255	42M	*	0	0	AAAGATCGGCATCCATCTGAGCAATGGCACCGGCAATTCTGT	##@<=<ACBBB;4@ABB;*<ACC7;;CC,BBCC?B;AABCCB	NM:i:3	NH:i:1
    Here is the python command I gave

    Code:
    python -m HTSeq.scripts.count accepted_hits.sam ../ALLEXONS.gtf > output.out
    Cufflinks output

    Code:
    gene_id	S1
    FBgn0000003	208.127
    FBgn0000008	2.84909
    FBgn0000014	8.2101
    FBgn0000015	3.71491
    FBgn0000017	4.53253
    FBgn0000018	4.94072
    FBgn0000024	5.39896
    FBgn0000032	18.2969
    FBgn0000036	0.615742
    FBgn0000037	0.651662
    FBgn0000038	1.6011
    FBgn0000039	1.12615
    HTSeq-count output

    Code:
    FBgn0000003	0
    FBgn0000008	52
    FBgn0000014	157
    FBgn0000015	56
    FBgn0000017	123
    FBgn0000018	36
    FBgn0000022	5
    FBgn0000024	85
    FBgn0000028	0
    FBgn0000032	144
    As you can see the cufflinks and htseq-count output are not coherent. there are genes missing in cufflinks while "FBGn0000003" shows a value in cufflinks but zero in htseq. Maybe Dr. Simon Anders can help with this. I cannot understand. Furthermore is the formatting of my GTF file correct?

  • #2
    HTSeq-count can be quite conservative in the sense that it discards reads that cannot be unambiguously be assigned to a gene. This can cause values to be lower than reported by other tools. (This is done on purpose: For differential expression analysis, you need to discount ambiguous reads to avoid that differential signal on one gene shows up in another one that overlaps.)

    The gene FBgn0000003 is an extremely short ncRNA (less than 300 bp), so it is well possible that very few reads, maybe just a single one, give rise to a rather high FPKM value. If you now wonder why these few reads were discounted by HTSeq and counted by cufflinks, display your SAM file in a genome browser (e.g., IGV) and have a look at the FBgn0000003 locus. How many reads do you see there? Do they maybe overlap with more than one feature? If you write down these reads' IDs, you can then use HTSeq-count's new '-o' option to check out what these reads were assigned to.

    Finally: Your gene is on the '+' strand. By default, htseq-count only counts reads that were aligned to the same strand as the feature. This makes sense if your RNA-Seq protocol poreserves strand information. Remember that Illumina's default protocol does not do this, and for such data, you must specify the option '--stranded=no' to get htseq-count to count reads on either strand.

    If this does not resolve the issue, post a screen shot from your genome browser, showing a disputed region, and maybe I can spot something.

    Simon

    Comment


    • #3
      Hi Simon,

      Thanks for the input. Yes I ran the program with --stranded=no, and immediately the no_feature dropped from ~375000 to ~185000. Ambiguos and alignment_not_unique did not change much.

      About the gene "FBgn0000003", I find that a lot of reads are associated with it. I am sending a screenshot. I investigated a few (10 reads), and find that they are "alignment_not_unique". Probably that is why htseq-count gives zero. But why does Cufflinks provide expression value for this? I would have thought that they account for ambiguity.

      Thanks
      Abhijit
      Attached Files

      Comment


      • #4
        Originally posted by gen2prot View Post
        But why does Cufflinks provide expression value for this? I would have thought that they account for ambiguity.
        Interesting question. Maybe send a mail to the Cufflinks developers and ask. If you do, please tell us what they say.

        But note that this is not necessarily a bug. Depending on what you want to do with your data, I see two rationales on how to deal with reads that map to multiple places.

        1. For htseq-count, I imagined the user to then use DESeq (or edgeR) to test for differential expression. Imagine we have two paralogous genes that have identical sequence at one half or their length and divergent sequence at the other half, and one of these genes is differentially expressed and the other is not. All reads that stem from the identical-sequence parts of the transcripts will map to both genes, and if we include them in our counts, both genes will appear to be differentially expressed, even though only one is really. If we count only the uniquely mapping reads (i.e., those stemming from the divergent parts of the transcripts), we are safe.

        2. It seems to me that the cufflinks developers rather had in mind that the user has one sample and wishes to study which genes are strongly expressed and which are not. If we disregard non-uniquely mapped reads, all those genes that have highly similar paralogs will get less counts than genes that have diverged a lot since their last duplication event, i.e. the counts are biased according to the gene's within-genome evolutionary history.

        This is especially an issue because cufflinks reports FPKM values, i.e., it divides by transcript length. Note that mappability is a feature of a the reference, not of the read. If a part of a gene's sequence appears somewhere else in the genome, no read will map there uniquely, i.e., the part cannot get any reads. Hence it should be subtracted from the transcript length when calculating FPKM values. Alternatively, one could distribute non-uniquely mapped reads to the different places, i.e., count them fractionally for each of their possible mappings. Maybe, this is what cufflinks does, but it makes the job of differential expression calling rather difficult, and I wonder how cuffdiff could take it into account.

        Simon

        Comment


        • #5
          Cufflinks currently divides multi-reads (i.e. reads that map to multiple locations) uniformly in calculating expression values. This is not correct, however it is worth pointing out that (probabilistic) assignment of multi-reads is mathematically equivalent to the assignment of reads that map uniquely but to multiple isoforms. In fact, the RSEM program correctly handles multi-reads. We have worked out an effective strategy for probabilistically assigning multi-reads within the Cufflinks framework and the implementation is scheduled for the release-after-next (hopefully ~1 month away).

          We have procrastinated dealing with multi-reads because as read length has increased they have been less of a problem. However they still affect expression estimates in gene families, and we are sensitive to the fact that many users still use short reads. In terms of differential expression estimates, there is nothing difficult about incorporating probabilistic assignment of multi-reads- it works exactly the same way as the current method implemented in cuffdiff, even with replicates.

          Lior
          P.S. Actually mappability is not a feature of the reference- it is a feature of the transcriptome. A read may map to multiple locations in a reference, but it may be possible to assign it to only one location if the other transcripts have very low expression (that can be estimated using other reads that map to them). In other words, mappability and expression estimation are intimately related, and fractional assignment of mappings is not an option, it is essential if one doesn't want to lose power by throwing away data.

          Comment


          • #6
            Hi Simon and Lior,

            Thank you for your thoughts. What I am trying to do is find a ratio between the autosomal expression in wild type male of drosophila and a mutant. We have reason to believe that global autosomal gene expression is elevated in the mutant. Therefore even if we wrongly call differential gene expression for some genes, on a global level it should not matter much. However, mapping reads correctly to call DGE is still important to me looking ahead. I had therefore tried to use as reference the transcripts of drosophila, rather than the genome. I was not concerned with finding novel transcripts or genes. Anyway, I used bowtie-build on the transcripts, then used tophat on the generated indices. I then designed a perl script which would count as "one" any read that mapped to two or more transcripts of the same gene, and would disregard any read which hit two transcripts belonging to different genes (I did not have a formatted -GTF file at the time, because then I could have just used the -GTF option of tophat on the reference genome). I find that the results by running against the transcripts is not very different (when it comes to ratios) from when run against the genome.

            But I agree that the problem of "counts biased according to the genes within-genome evolutionary history" and also biased if a domain is the same between two completely unrelated genes will still remain. Mate-pairs and longer read lengths seem to be the only ways to lessen this problem.

            Thanks
            Abhijit

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Strategies for Sequencing Challenging Samples
              by seqadmin


              Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
              03-22-2024, 06:39 AM
            • seqadmin
              Techniques and Challenges in Conservation Genomics
              by seqadmin



              The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

              Avian Conservation
              Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
              03-08-2024, 10:41 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, Yesterday, 06:37 PM
            0 responses
            12 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, Yesterday, 06:07 PM
            0 responses
            10 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 03-22-2024, 10:03 AM
            0 responses
            52 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 03-21-2024, 07:32 AM
            0 responses
            68 views
            0 likes
            Last Post seqadmin  
            Working...
            X