Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • htseq-count with warning for every read to represent all of zero counts in output

    Dear All,

    I encountered a problem in htseq-count when running

    with the command:
    Code:
    python -m HTSeq.scripts.count /home/accepted_hits.sam /home/ref_annotation.gtf --stranded=no > /home/readcount.txt
    got the warning messages for every read.

    Code:
    :
    :
    :
    
    Warning: Skipping read 'SRR064199:1:120:5973:2700#', because chromosome 'ZZEF1', to which it has been aligned, did not appear in the GFF file.
    Warning: Skipping read 'SRR064199:1:100:15531:20305#', because chromosome 'ZZEF1', to which it has been aligned, did not appear in the GFF file.
    Warning: Skipping read 'SRR064199:1:81:15517:14126#', because chromosome 'ZZEF1', to which it has been aligned, did not appear in the GFF file.
    Warning: Skipping read 'SRR064199:1:84:2264:2294#', because chromosome 'ZZEF1', to which it has been aligned, did not appear in the GFF file.
    Warning: Skipping read 'SRR064199:1:16:2321:19643#', because chromosome 'ZZEF1', to which it has been aligned, did not appear in the GFF file.
    8987862 sam lines  processed.
    
    :
    :
    :
    while it is running, such warnings are coming out on and on.
    Here are some information for my sam and gtf file.

    Code:
    dhcp128036164097:~ $ head -10 accepted_hits.sam
    @HD	VN:1.0	SO:coordinate
    @SQ	SN:1433E	LN:1456
    @SQ	SN:1433F	LN:1767
    @SQ	SN:1433G	LN:2842
    @SQ	SN:1433S	LN:1340
    @SQ	SN:1433T	LN:1803
    @SQ	SN:1433Z	LN:3600
    @SQ	SN:2ABB	LN:1631
    @SQ	SN:3BHS	LN:1521
    @SQ	SN:3BP5L	LN:3264
    Code:
    hcp128036164097:~ $ head -10 ref_annotation.gtf
    chr1	oviAri1_refFlat	stop_codon	2625590	2625592	0.000000	-	.	gene_id "PRLH"; transcript_id "PRLH"; 
    chr1	oviAri1_refFlat	CDS	2625593	2625786	0.000000	-	2	gene_id "PRLH"; transcript_id "PRLH"; 
    chr1	oviAri1_refFlat	exon	2625494	2625786	0.000000	-	.	gene_id "PRLH"; transcript_id "PRLH"; 
    chr1	oviAri1_refFlat	CDS	2626259	2626358	0.000000	-	0	gene_id "PRLH"; transcript_id "PRLH"; 
    chr1	oviAri1_refFlat	start_codon	2626356	2626358	0.000000	-	.	gene_id "PRLH"; transcript_id "PRLH"; 
    chr1	oviAri1_refFlat	exon	2626259	2626400	0.000000	-	.	gene_id "PRLH"; transcript_id "PRLH"; 
    chr1	oviAri1_refFlat	exon	2628330	2628369	0.000000	-	.	gene_id "PRLH"; transcript_id "PRLH"; 
    chr1	oviAri1_refFlat	stop_codon	2635951	2635953	0.000000	-	.	gene_id "MLPH"; transcript_id "MLPH"; 
    chr1	oviAri1_refFlat	CDS	2635954	2635977	0.000000	-	2	gene_id "MLPH"; transcript_id "MLPH"; 
    chr1	oviAri1_refFlat	exon	2635892	2635977	0.000000	-	.	gene_id "MLPH"; transcript_id "MLPH";

    And finally, getting all of zero counts for every gene like below:

    Code:
    ABCG2	0
    ABHD5	0
    ACACA	0
    ACLY	0
    ACTB	0
    ACVR2A	0
    ADCYAP1	0
    ADIG	0
    ADORA3	0
    ADRB2	0
    ADRB3	0
    AGPAT1	0
    AGTR1	0
    AHSG	0
    AIMP2	0
    AK1	0
    AKT1	0
    ALB	0
    ALDH1A1	0
    ALDOB	0
    AMP18	0
    ANXA2	0
    APOBEC3F	0
    APOBEC3Z1	0
    APOBEC3Z2	0
    APOBEC3Z3	0
    AQP1	0
    AQP4	0
    AQP5	0
    ARAF	0
    ARF1	0
    ARF3	0
    ARF4	0
    ARF5	0
    ARF6	0
    ARFIP2	0
    ARFRP1	0
    ARL1	0
    ARL2	0
    ARL2BP	0
    ARL3	0
    ARL4A	0
    ARL4C	0
    ARL5A	0
    ARL5C	0
    ARL6IP1	0
    ARL8A	0
    ARL8B	0
    ARNTL	0
    ASF1A	0
    ASIP	0
    ASZ1	0
    ATF4	0
    ATOX1	0
    :
    :
    :

    Anyone who encountered this problem before or who can advise me in details to address this?

    It would be very appreciated for feedback from you guys!!!

  • #2
    Is the gff file you use for tophat (I assume that is what you are running) different than what you are using for htseq-count? I have seen these errors before, and if I remember correctly, its because I was using a gff3 file for tophat and a gtf file for htseq-count and for some of the mitochondrial genes, the names were different, so that htseq-count could not recognize them.

    Comment


    • #3
      thanks chadn737, btw, how do i fix my current gtf file properly to fit with htseq-count??

      Comment


      • #4
        My bad, I meant that the chromsome names used in the fasta file supplied to tophat (not the gff3 file I gave it) were different than those in the gtf file that I supplied to htseq-count.

        How are your chromosomes named in the files you are aligning to? Based on the error message you supplied, am I right in guessing that you are aligning to a cDNA file rather than genomic? ZZEF1 is a gene name, is it not?
        Last edited by chadn737; 07-14-2011, 01:07 PM.

        Comment


        • #5
          chadn737, i just got a reply from simon that actually, the most common mistake is that chromosome 1 might be called 'chr1' or 'chr01' in your GTF file and '1' or 'I' in your FASTA and SAM file. HTSeq-count won't match this up for you. but i am not still quite sure what it means. which means, it seems i need to fix gtf file manually with a script?

          For gtf file of sheep annotation, from ucsc genome browser. sheep => refseq => refFlat => gtf file format i downloaded it.





          For the input files of tophat, i used "fastq files". A combined assembly "fasta file" of bos taurus + ovis aries + de novo oases was also used in the alignment from the paper :http://www.biomedcentral.com/1471-21...58/additional/


          In my fastq file, some of rows,
          Code:
          dhcp128036164097:~ $ head -10 s_1.fastq
          @SRR064199:1:1:1032:21098#/1
          GTAGACGTCAAGACTACCGATGGTTACTTGCTTCGTCTGTTCTGTGTGGGTTTTACTAAAAAGCGCAACAATCAGA
          +SRR064199:1:1:1032:21098#/1
          bbbbbbbbabab^b^bbbbb^bbabbbbb_babbR`b^b]bbbb\bXbbbS_bbb_bb`[bbbbbXb]bbb[b]ba
          @SRR064199:1:1:1033:14319#/1
          GTACCCTCTCCAGAGGGTGGAAATAGGGCTAAGTTTGTCTGCGCACTATTCAATTTGTACCACGGCACCCAGGAAA
          +SRR064199:1:1:1033:14319#/1
          aa_aaaaaaaa^aaaaaaaaXS^^W[````_`aaaaaaaaaaaaaaaY^aa_[`a]aaaa``T```````_a_]]V
          @SRR064199:1:1:1034:11655#/1
          CTGAACAATGCACTGGGAAGGGTGGAGATCGGGTCCTCGCCTGGAATCTATTTCCTCCCCTTGCCGGCGCTTTGAT

          ANy thought i can resolve this problem???

          Many thanks for your help!!
          Last edited by hibachings2013; 07-14-2011, 02:04 PM.

          Comment


          • #6
            So you used the FASTA file in supplement 2?

            That explains the issue then. That FASTA file is a file of transcripts, not the genome. So the chromosome name in your sam file is actually the transcript name and the coordinates are the read's coordinates within that transcript, not within the chromosome. So htseq-count can't find the correct position for your reads in the gtf file and it fails.

            Do you have to use the trancripts from that paper for alignment or is there a sequenced genome (I work in plants, so I don't know what the state of your organism's genome is)?
            Last edited by chadn737; 07-14-2011, 02:22 PM.

            Comment


            • #7
              yes, i need to use the paper data and i am trying to generate a read count or normalized data from raw fasta file in the paper now, it is a sheep organism and the fasta sequence file is a combined assembly of 3, Bos taurus, ovid Arosa and de Novo Ovis Aries.

              Comment


              • #8
                Hi seqanswers_chuck,

                As chadn737 pointed out, the complaint from htseq-count was that the chromosome names used in your bam files were different from those in your gtf files. So first you make sure the difference is really just a naming issue (not using a wrong gtf), then you can simply correct the difference by replacing the chromosome names in the gtf file with those in the bam files. It is a simple find-and-replace thing.

                Best regards,
                Douglas

                Comment


                • #9
                  yes, i will try it, thank you so much guys! i will keep you posted when i get it.

                  Comment


                  • #10
                    You missed the point of #6.

                    The whole point of htseq-count is to take a mapping against the genome and see which genes the mappinmg correspond to. If you map against the transcriptome, it simply makes no sense to use this script. Insetad, you just count directly from the SAM file.

                    Of course, there are many issues with mapping against the transcriptome, and I would advise against it unless you know precisely what you are doing.

                    Comment


                    • #11
                      thanks simon, yes, the comments of you guys seem to be exactly right. when i look at the my sam file from tophatoutput, there are no chromosome names. only ZZEF1...are showing up.

                      Code:
                      dhcp128036165079: $ tail -n 10 accepted_hits.sam
                      SRR064199:1:49:16987:18852#	16	ZZEF1	9166	255	76M	*	0	0	AGCATCTGAAAGTGATTTCCATCTAGACCGACATCCCTCTGATCCTTCTTAGGAGGTGGGACTGCCTGGAGTGAGG	=CCDCCBCCCCCCCCCCCCCCCCCCCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	NM:i:0	NH:i:1
                      SRR064199:1:59:8577:15840#	16	ZZEF1	9166	255	76M	*	0	0	AGCATCTGAAAGTGATTTCCATCTAGACCGACATCCCTCTGATCCTTCTTAGGAGGTGGGACTGCCTGGAGTGAGG	@CCDACCCDCCCCCCCCCCCDCCCCCDCCCDCDDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	NM:i:0	NH:i:1
                      SRR064199:1:8:18821:18136#	16	ZZEF1	9191	255	76M	*	0	0	GACCGACATCCCTCTGATCCTTCTTAGGAGGTGGGACTGCCTGGAGTGAGGTGGGACTGGCAGGAATGAAGGTGTT	CBCCCDCCDCCC@CCCCCCCCCCCCCCCBCCCCCCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	NM:i:0	NH:i:1
                      SRR064199:1:43:19181:11415#	16	ZZEF1	9198	255	76M	*	0	0	ATCCCTCTGATCCTTCTTAGGAGGTGGGACTGCCTGGAGTGAGGTGGGACTGGCAGGAATGAAGGTGTTGTGGAAA	CDCDCBCACCBCCBBCCCCCCDCCCCCCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	NM:i:0	NH:i:1
                      SRR064199:1:3:18744:5218#	16	ZZEF1	9206	255	76M	*	0	0	GATCCTTCTTAGGAGGTGGGACTGCCTGGAGTGAGGTGGGACTGGCAGGAATGAAGGTGTTGTGGAAACATGTGCC	=C@CACAC;CCCCCCCCCDCBCCACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	NM:i:0	NH:i:1
                      SRR064199:1:120:5973:2700#	16	ZZEF1	9217	255	76M	*	0
                      so htseq-count did not count those at all. so i will try downloading a genomic seq of sheep and rebuild index files of bowtie. and i will redo all of steps and will keep you updated soon. I really appreciate all comments!

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Strategies for Sequencing Challenging Samples
                        by seqadmin


                        Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                        03-22-2024, 06:39 AM
                      • seqadmin
                        Techniques and Challenges in Conservation Genomics
                        by seqadmin



                        The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                        Avian Conservation
                        Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                        03-08-2024, 10:41 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, Yesterday, 06:37 PM
                      0 responses
                      10 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, Yesterday, 06:07 PM
                      0 responses
                      9 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 03-22-2024, 10:03 AM
                      0 responses
                      49 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 03-21-2024, 07:32 AM
                      0 responses
                      67 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X