SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
DEXSeq Using Counts File From htseq-count FuzzyCoder Bioinformatics 20 01-03-2016 11:18 PM
Compare RNA counts: HTSeq vs Partek schaffer RNA Sequencing 2 12-02-2011 09:01 AM
Exclude chrM, chrUn* from reference // htseq-count warning on chrM ocs Bioinformatics 10 11-02-2011 10:21 AM
How to count Paired-End reads for RNA-Seq read counts epistatic RNA Sequencing 0 08-31-2011 02:44 PM
understanding HTSeq counts nimmi Bioinformatics 3 11-27-2010 07:24 PM

Reply
 
Thread Tools
Old 07-14-2011, 12:34 PM   #1
hibachings2013
Member
 
Location: na

Join Date: Jun 2010
Posts: 15
Default htseq-count with warning for every read to represent all of zero counts in output

Dear All,

I encountered a problem in htseq-count when running

with the command:
Code:
python -m HTSeq.scripts.count /home/accepted_hits.sam /home/ref_annotation.gtf --stranded=no > /home/readcount.txt
got the warning messages for every read.

Code:
:
:
:

Warning: Skipping read 'SRR064199:1:120:5973:2700#', because chromosome 'ZZEF1', to which it has been aligned, did not appear in the GFF file.
Warning: Skipping read 'SRR064199:1:100:15531:20305#', because chromosome 'ZZEF1', to which it has been aligned, did not appear in the GFF file.
Warning: Skipping read 'SRR064199:1:81:15517:14126#', because chromosome 'ZZEF1', to which it has been aligned, did not appear in the GFF file.
Warning: Skipping read 'SRR064199:1:84:2264:2294#', because chromosome 'ZZEF1', to which it has been aligned, did not appear in the GFF file.
Warning: Skipping read 'SRR064199:1:16:2321:19643#', because chromosome 'ZZEF1', to which it has been aligned, did not appear in the GFF file.
8987862 sam lines  processed.

:
:
:
while it is running, such warnings are coming out on and on.
Here are some information for my sam and gtf file.

Code:
dhcp128036164097:~ $ head -10 accepted_hits.sam
@HD	VN:1.0	SO:coordinate
@SQ	SN:1433E	LN:1456
@SQ	SN:1433F	LN:1767
@SQ	SN:1433G	LN:2842
@SQ	SN:1433S	LN:1340
@SQ	SN:1433T	LN:1803
@SQ	SN:1433Z	LN:3600
@SQ	SN:2ABB	LN:1631
@SQ	SN:3BHS	LN:1521
@SQ	SN:3BP5L	LN:3264
Code:
hcp128036164097:~ $ head -10 ref_annotation.gtf
chr1	oviAri1_refFlat	stop_codon	2625590	2625592	0.000000	-	.	gene_id "PRLH"; transcript_id "PRLH"; 
chr1	oviAri1_refFlat	CDS	2625593	2625786	0.000000	-	2	gene_id "PRLH"; transcript_id "PRLH"; 
chr1	oviAri1_refFlat	exon	2625494	2625786	0.000000	-	.	gene_id "PRLH"; transcript_id "PRLH"; 
chr1	oviAri1_refFlat	CDS	2626259	2626358	0.000000	-	0	gene_id "PRLH"; transcript_id "PRLH"; 
chr1	oviAri1_refFlat	start_codon	2626356	2626358	0.000000	-	.	gene_id "PRLH"; transcript_id "PRLH"; 
chr1	oviAri1_refFlat	exon	2626259	2626400	0.000000	-	.	gene_id "PRLH"; transcript_id "PRLH"; 
chr1	oviAri1_refFlat	exon	2628330	2628369	0.000000	-	.	gene_id "PRLH"; transcript_id "PRLH"; 
chr1	oviAri1_refFlat	stop_codon	2635951	2635953	0.000000	-	.	gene_id "MLPH"; transcript_id "MLPH"; 
chr1	oviAri1_refFlat	CDS	2635954	2635977	0.000000	-	2	gene_id "MLPH"; transcript_id "MLPH"; 
chr1	oviAri1_refFlat	exon	2635892	2635977	0.000000	-	.	gene_id "MLPH"; transcript_id "MLPH";

And finally, getting all of zero counts for every gene like below:

Code:
ABCG2	0
ABHD5	0
ACACA	0
ACLY	0
ACTB	0
ACVR2A	0
ADCYAP1	0
ADIG	0
ADORA3	0
ADRB2	0
ADRB3	0
AGPAT1	0
AGTR1	0
AHSG	0
AIMP2	0
AK1	0
AKT1	0
ALB	0
ALDH1A1	0
ALDOB	0
AMP18	0
ANXA2	0
APOBEC3F	0
APOBEC3Z1	0
APOBEC3Z2	0
APOBEC3Z3	0
AQP1	0
AQP4	0
AQP5	0
ARAF	0
ARF1	0
ARF3	0
ARF4	0
ARF5	0
ARF6	0
ARFIP2	0
ARFRP1	0
ARL1	0
ARL2	0
ARL2BP	0
ARL3	0
ARL4A	0
ARL4C	0
ARL5A	0
ARL5C	0
ARL6IP1	0
ARL8A	0
ARL8B	0
ARNTL	0
ASF1A	0
ASIP	0
ASZ1	0
ATF4	0
ATOX1	0
:
:
:

Anyone who encountered this problem before or who can advise me in details to address this?

It would be very appreciated for feedback from you guys!!!
hibachings2013 is offline   Reply With Quote
Old 07-14-2011, 12:50 PM   #2
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

Is the gff file you use for tophat (I assume that is what you are running) different than what you are using for htseq-count? I have seen these errors before, and if I remember correctly, its because I was using a gff3 file for tophat and a gtf file for htseq-count and for some of the mitochondrial genes, the names were different, so that htseq-count could not recognize them.
chadn737 is offline   Reply With Quote
Old 07-14-2011, 12:55 PM   #3
hibachings2013
Member
 
Location: na

Join Date: Jun 2010
Posts: 15
Default

thanks chadn737, btw, how do i fix my current gtf file properly to fit with htseq-count??
hibachings2013 is offline   Reply With Quote
Old 07-14-2011, 01:04 PM   #4
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

My bad, I meant that the chromsome names used in the fasta file supplied to tophat (not the gff3 file I gave it) were different than those in the gtf file that I supplied to htseq-count.

How are your chromosomes named in the files you are aligning to? Based on the error message you supplied, am I right in guessing that you are aligning to a cDNA file rather than genomic? ZZEF1 is a gene name, is it not?

Last edited by chadn737; 07-14-2011 at 01:07 PM.
chadn737 is offline   Reply With Quote
Old 07-14-2011, 01:49 PM   #5
hibachings2013
Member
 
Location: na

Join Date: Jun 2010
Posts: 15
Default

chadn737, i just got a reply from simon that actually, the most common mistake is that chromosome 1 might be called 'chr1' or 'chr01' in your GTF file and '1' or 'I' in your FASTA and SAM file. HTSeq-count won't match this up for you. but i am not still quite sure what it means. which means, it seems i need to fix gtf file manually with a script?

For gtf file of sheep annotation, from ucsc genome browser. sheep => refseq => refFlat => gtf file format i downloaded it.

http://genome.ucsc.edu/cgi-bin/hgTables



For the input files of tophat, i used "fastq files". A combined assembly "fasta file" of bos taurus + ovis aries + de novo oases was also used in the alignment from the paper :http://www.biomedcentral.com/1471-21...58/additional/


In my fastq file, some of rows,
Code:
dhcp128036164097:~ $ head -10 s_1.fastq
@SRR064199:1:1:1032:21098#/1
GTAGACGTCAAGACTACCGATGGTTACTTGCTTCGTCTGTTCTGTGTGGGTTTTACTAAAAAGCGCAACAATCAGA
+SRR064199:1:1:1032:21098#/1
bbbbbbbbabab^b^bbbbb^bbabbbbb_babbR`b^b]bbbb\bXbbbS_bbb_bb`[bbbbbXb]bbb[b]ba
@SRR064199:1:1:1033:14319#/1
GTACCCTCTCCAGAGGGTGGAAATAGGGCTAAGTTTGTCTGCGCACTATTCAATTTGTACCACGGCACCCAGGAAA
+SRR064199:1:1:1033:14319#/1
aa_aaaaaaaa^aaaaaaaaXS^^W[````_`aaaaaaaaaaaaaaaY^aa_[`a]aaaa``T```````_a_]]V
@SRR064199:1:1:1034:11655#/1
CTGAACAATGCACTGGGAAGGGTGGAGATCGGGTCCTCGCCTGGAATCTATTTCCTCCCCTTGCCGGCGCTTTGAT

ANy thought i can resolve this problem???

Many thanks for your help!!

Last edited by hibachings2013; 07-14-2011 at 02:04 PM.
hibachings2013 is offline   Reply With Quote
Old 07-14-2011, 02:20 PM   #6
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392
Default

So you used the FASTA file in supplement 2?

That explains the issue then. That FASTA file is a file of transcripts, not the genome. So the chromosome name in your sam file is actually the transcript name and the coordinates are the read's coordinates within that transcript, not within the chromosome. So htseq-count can't find the correct position for your reads in the gtf file and it fails.

Do you have to use the trancripts from that paper for alignment or is there a sequenced genome (I work in plants, so I don't know what the state of your organism's genome is)?

Last edited by chadn737; 07-14-2011 at 02:22 PM.
chadn737 is offline   Reply With Quote
Old 07-15-2011, 04:50 AM   #7
hibachings2013
Member
 
Location: na

Join Date: Jun 2010
Posts: 15
Default

yes, i need to use the paper data and i am trying to generate a read count or normalized data from raw fasta file in the paper now, it is a sheep organism and the fasta sequence file is a combined assembly of 3, Bos taurus, ovid Arosa and de Novo Ovis Aries.
hibachings2013 is offline   Reply With Quote
Old 07-15-2011, 05:48 AM   #8
DZhang
Senior Member
 
Location: East Coast, US

Join Date: Jun 2010
Posts: 177
Default

Hi seqanswers_chuck,

As chadn737 pointed out, the complaint from htseq-count was that the chromosome names used in your bam files were different from those in your gtf files. So first you make sure the difference is really just a naming issue (not using a wrong gtf), then you can simply correct the difference by replacing the chromosome names in the gtf file with those in the bam files. It is a simple find-and-replace thing.

Best regards,
Douglas
www.contigexpress.com
DZhang is offline   Reply With Quote
Old 07-15-2011, 06:53 AM   #9
hibachings2013
Member
 
Location: na

Join Date: Jun 2010
Posts: 15
Default

yes, i will try it, thank you so much guys! i will keep you posted when i get it.
hibachings2013 is offline   Reply With Quote
Old 07-15-2011, 09:50 AM   #10
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

You missed the point of #6.

The whole point of htseq-count is to take a mapping against the genome and see which genes the mappinmg correspond to. If you map against the transcriptome, it simply makes no sense to use this script. Insetad, you just count directly from the SAM file.

Of course, there are many issues with mapping against the transcriptome, and I would advise against it unless you know precisely what you are doing.
Simon Anders is offline   Reply With Quote
Old 07-15-2011, 10:19 AM   #11
hibachings2013
Member
 
Location: na

Join Date: Jun 2010
Posts: 15
Default

thanks simon, yes, the comments of you guys seem to be exactly right. when i look at the my sam file from tophatoutput, there are no chromosome names. only ZZEF1...are showing up.

Code:
dhcp128036165079: $ tail -n 10 accepted_hits.sam
SRR064199:1:49:16987:18852#	16	ZZEF1	9166	255	76M	*	0	0	AGCATCTGAAAGTGATTTCCATCTAGACCGACATCCCTCTGATCCTTCTTAGGAGGTGGGACTGCCTGGAGTGAGG	=CCDCCBCCCCCCCCCCCCCCCCCCCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	NM:i:0	NH:i:1
SRR064199:1:59:8577:15840#	16	ZZEF1	9166	255	76M	*	0	0	AGCATCTGAAAGTGATTTCCATCTAGACCGACATCCCTCTGATCCTTCTTAGGAGGTGGGACTGCCTGGAGTGAGG	@CCDACCCDCCCCCCCCCCCDCCCCCDCCCDCDDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	NM:i:0	NH:i:1
SRR064199:1:8:18821:18136#	16	ZZEF1	9191	255	76M	*	0	0	GACCGACATCCCTCTGATCCTTCTTAGGAGGTGGGACTGCCTGGAGTGAGGTGGGACTGGCAGGAATGAAGGTGTT	CBCCCDCCDCCC@CCCCCCCCCCCCCCCBCCCCCCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	NM:i:0	NH:i:1
SRR064199:1:43:19181:11415#	16	ZZEF1	9198	255	76M	*	0	0	ATCCCTCTGATCCTTCTTAGGAGGTGGGACTGCCTGGAGTGAGGTGGGACTGGCAGGAATGAAGGTGTTGTGGAAA	CDCDCBCACCBCCBBCCCCCCDCCCCCCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	NM:i:0	NH:i:1
SRR064199:1:3:18744:5218#	16	ZZEF1	9206	255	76M	*	0	0	GATCCTTCTTAGGAGGTGGGACTGCCTGGAGTGAGGTGGGACTGGCAGGAATGAAGGTGTTGTGGAAACATGTGCC	=C@CACAC;CCCCCCCCCDCBCCACCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC	NM:i:0	NH:i:1
SRR064199:1:120:5973:2700#	16	ZZEF1	9217	255	76M	*	0
so htseq-count did not count those at all. so i will try downloading a genomic seq of sheep and rebuild index files of bowtie. and i will redo all of steps and will keep you updated soon. I really appreciate all comments!
hibachings2013 is offline   Reply With Quote
Reply

Tags
htseq-count

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 06:18 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO