SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Error in HTSEQ komalsrathi Bioinformatics 4 11-19-2013 01:30 PM
htseq error Rob2012 Bioinformatics 0 11-08-2013 02:16 PM
cummeRbund error genes.gene_id jutri Bioinformatics 0 08-30-2013 09:22 AM
BAM record error: found spliced alignment without XS attribute paolo.kunder Bioinformatics 0 09-06-2012 07:09 AM
htseq-count error sissi Bioinformatics 0 03-21-2012 12:40 AM

Reply
 
Thread Tools
Old 11-12-2013, 12:28 PM   #1
meher
Member
 
Location: helsinki

Join Date: Jun 2011
Posts: 54
Default HTSeq 'gene_id' attribute error

I have just started using HTSeq and found problems using GFF file. I have checked some of the threads in seqanswers but couldn't fix.

Here is the command and error:

htseq-count sorted.sam ../ref_CanFam3.1_top_level.gff3
Error occured in line 9 of file ../ref_CanFam3.1_top_level.gff3.
Error: Feature id1 does not contain a 'gene_id' attribute
[Exception type: SystemExit, raised in count.py:55]


Below are a few lines of my GFF file:


#gff-version 3
#!gff-spec-version 1.20
#!processor NCBI annotwriter
##sequence-region chr1 1 122678785
##species http://www.ncbi.nlm.nih.gov/Taxonomy...ax.cgi?id=9615
chr1 RefSeq region 1 122678785 . + . ID=id0;Name=1;Dbxref=taxon:9615;breed=boxer;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA;sex=female;sub-species=familiaris
chr1 Gnomon gene 251990 322081 . - . ID=gene0;Name=ENPP1;Dbxref=GeneID:100856150;gbkey=Gene;gene=ENPP1;part=1%2F1
chr1 Gnomon mRNA 251990 322081 . - . ID=rna0;Name=XM_003638785.2;Parent=gene0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2
chr1 Gnomon exon 321851 322081 . - . ID=id1;Parent=rna0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2
chr1 Gnomon exon 289674 289746 . - . ID=id2;Parent=rna0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2
chr1 Gnomon exon 287787 287903 . - . ID=id3;Parent=rna0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2
chr1 Gnomon exon 286842 286967 . - . ID=id4;Parent=rna0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2
chr1 Gnomon exon 285904 285964 . - . ID=id5;Parent=rna0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2



Could someone help to fix this?
meher is offline   Reply With Quote
Old 11-12-2013, 12:46 PM   #2
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Either specify a new value for "-i " or fix the GFF so it has a gene_id field.
dpryan is offline   Reply With Quote
Old 11-12-2013, 12:53 PM   #3
meher
Member
 
Location: helsinki

Join Date: Jun 2011
Posts: 54
Default

Quote:
Originally Posted by dpryan View Post
Either specify a new value for "-i " or fix the GFF so it has a gene_id field.
Could you give a sample output line of the gff file showing how it should be modified?

Thanks.
meher is offline   Reply With Quote
Old 11-12-2013, 12:58 PM   #4
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Code:
 chr1 Gnomon exon 321851 322081 . - . ID=id1;Parent=rna0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2;gene_id=100856150
Those gene_ids happen to be from NCBI, should you wonder later. It should be pretty easy to just parse each line and extract the right part. In theory, you could instead use "-i gene", but that's unlikely to always be unique.
dpryan is offline   Reply With Quote
Old 11-12-2013, 02:59 PM   #5
meher
Member
 
Location: helsinki

Join Date: Jun 2011
Posts: 54
Default

Quote:
Originally Posted by dpryan View Post
Code:
 chr1 Gnomon exon 321851 322081 . - . ID=id1;Parent=rna0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2;gene_id=100856150
Those gene_ids happen to be from NCBI, should you wonder later. It should be pretty easy to just parse each line and extract the right part. In theory, you could instead use "-i gene", but that's unlikely to always be unique.
I have tried using:
htseq-count -i gene file.sam file.gff > count.txt

It processed without any error but the counts are zero.The output looks like:

A1BG 0
A1CF 0
A2M 0
A2ML1 0
.
.
.
no_feature 38443217
ambiguous 0
too_low_aQual 0
not_aligned 0
alignment_not_unique 9202815

I also tried by changing the gff file to the format suggested by you:

##gff-version 3
#!gff-spec-version 1.20
#!processor NCBI annotwriter
##sequence-region chr1 1 122678785
##species http://www.ncbi.nlm.nih.gov/Taxonomy...ax.cgi?id=9615
chr1 RefSeq region 1 122678785 . + . ID=id0;Name=1;Dbxref=taxon:9615;breed=boxer;chromosome=1;gbkey=Src;genome=chromosome;mol_type=genomic DNA;sex=female;sub-species=familiaris,gene_id=9615
chr1 Gnomon gene 251990 322081 . - . ID=gene0;Name=ENPP1;Dbxref=GeneID:100856150;gbkey=Gene;gene=ENPP1;part=1%2F1,gene_id=100856150
chr1 Gnomon mRNA 251990 322081 . - . ID=rna0;Name=XM_003638785.2;Parent=gene0;Dbxref=GeneID:100856150,Genbank:XM_003638785.2;gbkey=mRNA;gene=ENPP1;product=ectonucleotide pyrophosphatase%2Fphosphodiesterase 1;transcript_id=XM_003638785.2,gene_id=100856150

The output and error:
htseq-count sorted.sam ../refGene_canFam3.gff > ref_counts.txt
Error occured in line 6 of file ../refGene_canFam3.gff.
Error: need more than 1 value to unpack
[Exception type: ValueError, raised in __init__.py:220]



Any further tips to fix this?
meher is offline   Reply With Quote
Old 11-12-2013, 03:06 PM   #6
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Firstly you need a semicolon before "gene_id". Secondly, if you're getting 0 counts when you specify "-i gene" then there's something else wrong as well. You might just make a miniature SAM file with a few reads that map uniquely and should be counted and just see what happens (try using the -o option to help in tracking).
dpryan is offline   Reply With Quote
Old 11-14-2013, 06:50 AM   #7
meher
Member
 
Location: helsinki

Join Date: Jun 2011
Posts: 54
Default

Quote:
Originally Posted by dpryan View Post
Firstly you need a semicolon before "gene_id". Secondly, if you're getting 0 counts when you specify "-i gene" then there's something else wrong as well. You might just make a miniature SAM file with a few reads that map uniquely and should be counted and just see what happens (try using the -o option to help in tracking).
I have placed semicolon before gene_id and tried htseq-count.

Error occured in line 6 of file ../refGene_canFam3.gff.
Error: need more than 1 value to unpack
[Exception type: ValueError, raised in __init__.py:220


And infact it gave zerou counts with -i gene. so it means that there should be something else wrong. Could you let me know the commands or steps which you mean to check?

Last edited by meher; 11-14-2013 at 07:01 AM.
meher is offline   Reply With Quote
Old 11-14-2013, 07:13 AM   #8
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Not explicitly without seeing how whichever aligner you're using indicates a uniquely aligned read. In general, you can probably just grep for "NH:i:0" and then just put the first thousand or so in a file. That'd be a miniature SAM file. Then just use htseq-count with "-o annotated_input.sam" to see how each read is treated. You might also just want to look at things in IGV.

The chromosome names are the same between your reference and the annotation file, I hope.
dpryan is offline   Reply With Quote
Old 11-14-2013, 09:40 AM   #9
meher
Member
 
Location: helsinki

Join Date: Jun 2011
Posts: 54
Default

Quote:
Originally Posted by dpryan View Post
Not explicitly without seeing how whichever aligner you're using indicates a uniquely aligned read. In general, you can probably just grep for "NH:i:0" and then just put the first thousand or so in a file. That'd be a miniature SAM file. Then just use htseq-count with "-o annotated_input.sam" to see how each read is treated. You might also just want to look at things in IGV.

The chromosome names are the same between your reference and the annotation file, I hope.

Do you mean "NH:i:1"? because "NH:i:0" seems to be absent when it tried

samtools view -S accepted_hits.sam | grep "NH:i:0".
meher is offline   Reply With Quote
Old 11-14-2013, 10:28 AM   #10
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Mea culpa, yeah NH:i:1 would be correct :P
dpryan is offline   Reply With Quote
Old 11-14-2013, 10:41 AM   #11
meher
Member
 
Location: helsinki

Join Date: Jun 2011
Posts: 54
Default

Quote:
Originally Posted by dpryan View Post
Mea culpa, yeah NH:i:1 would be correct :P
I have done what u have suggested:

htseq-count -i gene mini.sam gff3 -o mini_annot.sam

So what should be observed from mini_annot.sam?

$ wc -l mini.sam
10000 mini.sam
$ wc -l mini_annot.sam
145 mini_annot.sam

It says "XF:Z:alignment_not_unique" at the end of each line in mini_annot.sam file.

Any clue out of this?
meher is offline   Reply With Quote
Old 11-14-2013, 10:43 AM   #12
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

I'm not sure why there are so few alignments in the mini_annot.sam file, but generally just check those to ensure that they actually do map to something in the gff file.

Last edited by dpryan; 11-14-2013 at 10:51 AM.
dpryan is offline   Reply With Quote
Old 11-14-2013, 10:51 AM   #13
meher
Member
 
Location: helsinki

Join Date: Jun 2011
Posts: 54
Default

Quote:
Originally Posted by dpryan View Post
I'm not sure why there are so few alignments in the mini_annot.sam file, but generally just check those in the reference GTF to ensure that they actually do map to something in the gff file.
There are a few mapping to the features in gff file. It is weird and no clue.
meher is offline   Reply With Quote
Old 11-14-2013, 10:55 AM   #14
dpryan
Devon Ryan
 
Location: Freiburg, Germany

Join Date: Jul 2011
Posts: 3,480
Default

Maybe take a subset of the gff file where some of those should map and test that. In general, htseq-count works well, so there's likely something off in the annotation somewhere. So start with a small chunk of the annotation such that things work and then just increase the size until it breaks. You will have then found the source of the problem.

Alternatively you can always try summarizeOverlaps() in GenomicRanges in R or featureCounts from subread. Perhaps one of those will ignore whatever is breaking htseq-count.
dpryan is offline   Reply With Quote
Old 11-15-2013, 02:45 AM   #15
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by meher View Post
It says "XF:Z:alignment_not_unique" at the end of each line in mini_annot.sam file.
Just post the SAM file, so we can see what's wrong.
Simon Anders is offline   Reply With Quote
Old 11-15-2013, 02:57 AM   #16
meher
Member
 
Location: helsinki

Join Date: Jun 2011
Posts: 54
Default

Quote:
Originally Posted by Simon Anders View Post
Just post the SAM file, so we can see what's wrong.
Here are the first few lines of sam file:

@HD VN:1.0 SO:coordinate
@SQ SN:gi|17737322|ref|chrMT| LN:16727
@SQ SN:gi|357579592|ref|chrX| LN:123869142
@SQ SN:gi|357579593|ref|chr38| LN:23914537
@SQ SN:gi|357579594|ref|chr37| LN:30902991
@SQ SN:gi|357579595|ref|chr36| LN:30810995
@SQ SN:gi|357579596|ref|chr35| LN:26524999
@SQ SN:gi|357579597|ref|chr34| LN:42124431
@SQ SN:gi|357579598|ref|chr33| LN:31377067
@SQ SN:gi|357579599|ref|chr32| LN:38810281
@SQ SN:gi|357579600|ref|chr31| LN:39895921
@SQ SN:gi|357579601|ref|chr30| LN:40214260
@SQ SN:gi|357579602|ref|chr29| LN:41845238
@SQ SN:gi|357579603|ref|chr28| LN:41182112
@SQ SN:gi|357579604|ref|chr27| LN:45876710
@SQ SN:gi|357579605|ref|chr26| LN:38964690
@SQ SN:gi|357579606|ref|chr25| LN:51628933
@SQ SN:gi|357579607|ref|chr24| LN:47698779
@SQ SN:gi|357579608|ref|chr23| LN:52294480
@SQ SN:gi|357579609|ref|chr22| LN:61439934
@SQ SN:gi|357579610|ref|chr21| LN:50858623
@SQ SN:gi|357579611|ref|chr20| LN:58134056
@SQ SN:gi|357579612|ref|chr19| LN:53741614
@SQ SN:gi|357579613|ref|chr18| LN:55844845
@SQ SN:gi|357579614|ref|chr17| LN:64289059
@SQ SN:gi|357579615|ref|chr16| LN:59632846
@SQ SN:gi|357579616|ref|chr15| LN:64190966
@SQ SN:gi|357579617|ref|chr14| LN:60966679
@SQ SN:gi|357579618|ref|chr13| LN:63241923
@SQ SN:gi|357579619|ref|chr12| LN:72498081
@SQ SN:gi|357579620|ref|chr11| LN:74389097
@SQ SN:gi|357579621|ref|chr10| LN:69331447
@SQ SN:gi|357579622|ref|chr9| LN:61074082
@SQ SN:gi|357579623|ref|chr8| LN:74330416
@SQ SN:gi|357579624|ref|chr7| LN:80974532
@SQ SN:gi|357579625|ref|chr6| LN:77573801
@SQ SN:gi|357579626|ref|chr5| LN:88915250
@SQ SN:gi|357579627|ref|chr4| LN:88276631
@SQ SN:gi|357579628|ref|chr3| LN:91889043
@SQ SN:gi|357579629|ref|chr2| LN:85426708
@SQ SN:gi|357579630|ref|chr1| LN:122678785
@PG ID:TopHat VN:2.0.9 CL:/csc/lohi/dog_tools/tophat-2.0.9.Linux_x86_64/tophat -p 3 -o ./ /csc/lohi/canFam3_ref_dogData/ncbi_fasta_canFam3/canFam3 B00EJHW.t30l32.pair1.fastq.gz B00EJHW.t30l32.pair2.fastq.gz
HISEQ4_0112:3:1101:1044:57612#GCCAAT 89 gi|357579618|ref|chr13| 621671950 101M * 0 0 TGCACTTGTTGAACTGCTGAAACACAAGCCCAAGGCAACAGATGAACAACTGAAAACTGTTATGGGAGATTTTGGAGCCTTTGTAGAGAAGTGCTGCGCNG @>3CDDC@3@DBC>6;6CC@EEBA?EEGBIGHCGIGFF@BC:JJJIIHCGD4BDGIFIGIGHEDGCJJIJJJIIHJJJJJIJJJJJHGHHHFFDDA4#C AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:99A1 YT:Z:UUNH:i:1
HISEQ4_0112:3:1101:1044:61660#GCCAAT 99 gi|357579610|ref|chr21| 328438950 6M1452N95M = 32845462 1652 CNCAAGAGTTTACTCTCCAAAGCTCGAGGAATCGATTCCAGCTCTGTTAAACTCCGAGGTGGTTCTTTATTCATGGATACAGAAAAATCAGGAAAAAGGGA C#1ADFFFHHHHHJJJJJJJJJJJJJJJJJIJJJJJJJJIIJJJJJJJJJJJJJJJHIJEHHHHHFFFFFFFFEEEEEEDDDDDDDDDDDDDDDDDDDDDD AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:1C99 YT:Z:UU XS:A:+ NH:i:1


Should you require the whole sam file to be attached?
meher is offline   Reply With Quote
Old 11-15-2013, 03:46 AM   #17
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Your chomosome names don't match. HTSeq-count cannot guess that "gi|357579630|ref|chr1|" in the SAM file is the same as "chr1" in the GFF file.
Simon Anders is offline   Reply With Quote
Old 11-15-2013, 04:02 AM   #18
meher
Member
 
Location: helsinki

Join Date: Jun 2011
Posts: 54
Default

Quote:
Originally Posted by Simon Anders View Post
Your chomosome names don't match. HTSeq-count cannot guess that "gi|357579630|ref|chr1|" in the SAM file is the same as "chr1" in the GFF file.

Thank you. As i thought both have the same notation i.e. chr1,chr2.. they are in sync with each other but that seems to be costly.
meher is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 01:15 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2019, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO