Seqanswers Leaderboard Ad

**dpryan** · 11-12-2013, 12:46 PM

Either specify a new value for "-i " or fix the GFF so it has a gene_id field.

**meher** · 11-12-2013, 12:53 PM

Originally posted by dpryan View Post

Either specify a new value for "-i " or fix the GFF so it has a gene_id field.

Could you give a sample output line of the gff file showing how it should be modified?

Thanks.

**dpryan** · 11-12-2013, 12:58 PM

Code:

 chr1 Gnomon exon 321851 322081 . - . ID=id1;Parent=rna0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2;gene_id=100856150

Those gene_ids happen to be from NCBI, should you wonder later. It should be pretty easy to just parse each line and extract the right part. In theory, you could instead use "-i gene", but that's unlikely to always be unique.

**meher** · 11-12-2013, 02:59 PM

Originally posted by dpryan View Post

Code:

 chr1 Gnomon exon 321851 322081 . - . ID=id1;Parent=rna0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2;gene_id=100856150

Those gene_ids happen to be from NCBI, should you wonder later. It should be pretty easy to just parse each line and extract the right part. In theory, you could instead use "-i gene", but that's unlikely to always be unique.

I have tried using:
htseq-count -i gene file.sam file.gff > count.txt

It processed without any error but the counts are zero.The output looks like:

A1BG 0
A1CF 0
A2M 0
A2ML1 0
.
.
.
no_feature 38443217
ambiguous 0
too_low_aQual 0
not_aligned 0
alignment_not_unique 9202815

I also tried by changing the gff file to the format suggested by you:

##gff-version 3
#!gff-spec-version 1.20
#!processor NCBI annotwriter
##sequence-region chr1 1 122678785
##species http://www.ncbi.nlm.nih.gov/Taxonomy...ax.cgi?id=9615
chr1 RefSeq region 1 122678785 . + . ID=id0;Name=1;Dbxref=taxon:9615;breed=boxer;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA;sex=female;sub-species=familiaris,gene_id=9615
chr1 Gnomon gene 251990 322081 . - . ID=gene0;Name=ENPP1;Dbxref=GeneID:100856150;gbkey=Gene;gene=ENPP1;part=1%2F1,gene_id=100856150
chr1 Gnomon mRNA 251990 322081 . - . ID=rna0;Name=XM_003638785.2;Parent=gene0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2,gene_id=100856150

The output and error:
htseq-count sorted.sam ../refGene_canFam3.gff > ref_counts.txt
Error occured in line 6 of file ../refGene_canFam3.gff.
Error: need more than 1 value to unpack
[Exception type: ValueError, raised in __init__.py:220]

Any further tips to fix this?

**dpryan** · 11-12-2013, 03:06 PM

Firstly you need a semicolon before "gene_id". Secondly, if you're getting 0 counts when you specify "-i gene" then there's something else wrong as well. You might just make a miniature SAM file with a few reads that map uniquely and should be counted and just see what happens (try using the -o option to help in tracking).

**meher** · 11-14-2013, 06:50 AM

Originally posted by dpryan View Post

Firstly you need a semicolon before "gene_id". Secondly, if you're getting 0 counts when you specify "-i gene" then there's something else wrong as well. You might just make a miniature SAM file with a few reads that map uniquely and should be counted and just see what happens (try using the -o option to help in tracking).

I have placed semicolon before gene_id and tried htseq-count.

Error occured in line 6 of file ../refGene_canFam3.gff.
Error: need more than 1 value to unpack
[Exception type: ValueError, raised in __init__.py:220

And infact it gave zerou counts with -i gene. so it means that there should be something else wrong. Could you let me know the commands or steps which you mean to check?

**dpryan** · 11-14-2013, 07:13 AM

Not explicitly without seeing how whichever aligner you're using indicates a uniquely aligned read. In general, you can probably just grep for "NH:i:0" and then just put the first thousand or so in a file. That'd be a miniature SAM file. Then just use htseq-count with "-o annotated_input.sam" to see how each read is treated. You might also just want to look at things in IGV.

The chromosome names are the same between your reference and the annotation file, I hope.

**meher** · 11-14-2013, 09:40 AM

Originally posted by dpryan View Post

Not explicitly without seeing how whichever aligner you're using indicates a uniquely aligned read. In general, you can probably just grep for "NH:i:0" and then just put the first thousand or so in a file. That'd be a miniature SAM file. Then just use htseq-count with "-o annotated_input.sam" to see how each read is treated. You might also just want to look at things in IGV.

The chromosome names are the same between your reference and the annotation file, I hope.

Do you mean "NH:i:1"? because "NH:i:0" seems to be absent when it tried

samtools view -S accepted_hits.sam | grep "NH:i:0".

**dpryan** · 11-14-2013, 10:28 AM

Mea culpa, yeah NH:i:1 would be correct :P

**meher** · 11-14-2013, 10:41 AM

Originally posted by dpryan View Post

Mea culpa, yeah NH:i:1 would be correct :P

I have done what u have suggested:

htseq-count -i gene mini.sam gff3 -o mini_annot.sam

So what should be observed from mini_annot.sam?

$ wc -l mini.sam
10000 mini.sam
$ wc -l mini_annot.sam
145 mini_annot.sam

It says "XF:Z:alignment_not_unique" at the end of each line in mini_annot.sam file.

Any clue out of this?

**dpryan** · 11-14-2013, 10:43 AM

I'm not sure why there are so few alignments in the mini_annot.sam file, but generally just check those to ensure that they actually do map to something in the gff file.

**meher** · 11-14-2013, 10:51 AM

Originally posted by dpryan View Post

I'm not sure why there are so few alignments in the mini_annot.sam file, but generally just check those in the reference GTF to ensure that they actually do map to something in the gff file.

There are a few mapping to the features in gff file. It is weird and no clue.

**dpryan** · 11-14-2013, 10:55 AM

Maybe take a subset of the gff file where some of those should map and test that. In general, htseq-count works well, so there's likely something off in the annotation somewhere. So start with a small chunk of the annotation such that things work and then just increase the size until it breaks. You will have then found the source of the problem.

Alternatively you can always try summarizeOverlaps() in GenomicRanges in R or featureCounts from subread. Perhaps one of those will ignore whatever is breaking htseq-count.

**Simon Anders** · 11-15-2013, 02:45 AM

Originally posted by meher View Post

It says "XF:Z:alignment_not_unique" at the end of each line in mini_annot.sam file.

Just post the SAM file, so we can see what's wrong.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 24 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 25 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 23 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

HTSeq 'gene_id' attribute error

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News