Seqanswers Leaderboard Ad

**bruce01** · 11-06-2013, 08:45 AM

Well, I usually call HTSeq-counts like this:

htseq-counts <flags/parameters> <input.sam> <input.gtf>

Try:

htseq-counts --help

This will tell you about the flags you are interested in:
-s yes/no is for stranded data. Is your data stranded? Ask the person who did library prep.
-t is featureType, so usually 'exon' if you are doing RNAseq.
-i is attribute, usually 'gene_id', but you could use 'transcript_id' I suppose, or 'gene_name'

Hope that is some help to you?

**wmseq** · 11-06-2013, 08:54 AM

bruce01,
Thanks for your answer!
I chose "-s no", because they are not stranded reads.
As to -i and -ID, do you have any idea how to write the command according my gff file?

**bruce01** · 11-06-2013, 08:56 AM

Can you post the output from the command:

head amel_OGSv3.2.gene.gff

**wmseq** · 11-06-2013, 09:06 AM

bruce01,
In fact, amel_OGSv3.2.gene.gff is trimed gff file from the original one, amel_OGSv3.2.gff. I just grep the rows with gene. It seams that I need grepping the rows with exon, right?
Here is output of the original one for one gene:

Group1.1 amel_OGSv3.1 gene 507599 515039 1 - . ID=GB42155
Group1.1 amel_OGSv3.1 mRNA 507599 515039 1 - . ID=GB42155-RA;Parent=GB42155
Group1.1 amel_OGSv3.1 CDS 514009 514378 . - 0 Parent=GB42155-RA
Group1.1 amel_OGSv3.1 CDS 513392 513906 . - 2 Parent=GB42155-RA
Group1.1 amel_OGSv3.1 five_prime_UTR 514761 515039 1 - . Parent=GB42155-RA
Group1.1 amel_OGSv3.1 five_prime_UTR 514379 514408 1 - . Parent=GB42155-RA
Group1.1 amel_OGSv3.1 three_prime_UTR 512910 513391 1 - . Parent=GB42155-RA
Group1.1 amel_OGSv3.1 three_prime_UTR 507599 509541 1 - . Parent=GB42155-RA
Group1.1 amel_OGSv3.1 exon 507599 509541 1 - . Parent=GB42155-RA
Group1.1 amel_OGSv3.1 exon 512910 513906 1 - . Parent=GB42155-RA
Group1.1 amel_OGSv3.1 exon 514009 514408 1 - . Parent=GB42155-RA
Group1.1 amel_OGSv3.1 exon 514761 515039 1 - . Parent=GB42155-RA
Group1.1 amel_OGSv3.1 gene 940410 942604 1 + . ID=GB42182

Thanks!!

**bruce01** · 11-06-2013, 09:17 AM

I don't use GFFs so not too sure about -i, but I think you should be using:

-t exon -i ID

This will mean that the the reads must overlap exons to be counted, but will be counted on a 'per ID' basis, so counts you get out in a table will be for the ID name. If you did:

-t gene -i ID

you would get any reads overlapping with any of the gene, including introns, which is probably not what you want(?)

Give it a go, any errors or issues post them up.

**wmseq** · 11-06-2013, 09:22 AM

Do You mean the original gff file can be used dirrectedly for the counting?

**bruce01** · 11-06-2013, 09:27 AM

Yes, I think the -t flag is used to parse exon, or gene, or whatever you put in there, from the GFF so you don't really need to grep exon or gene.

**wmseq** · 11-06-2013, 09:36 AM

Hi bruce01,
Using -t exon -i ID, I got the following information:

Error occured when processing GFF file (line 9 of file /media/wenfu/LaCie/my_rnaseq_dat/amel_OGSv3.2.gff3):
Feature GB42155-RA does not contain a 'ID' attribute
[Exception type: SystemExit, raised in count.py:55]

As you can see, it contains 'Parent' atribute.

**bruce01** · 11-06-2013, 09:37 AM

Yes, was wondering if you should use -i ID or -i Parent. Try -i Parent now

**wmseq** · 11-06-2013, 09:41 AM

I used -i Parent, it is running now.

**kmcarr** · 11-06-2013, 02:28 PM

Originally posted by wmseq View Post

I used -i Parent, it is running now.

This is going to result in an improper analysis. Your purpose is to count the number of reads uniquely aligned to each gene, but the "Parent" name identifies the transcript, not the gene. A gene with multiple isoforms will have multiple transcripts associated with it. This means that a single read will counted multiple times (once for each transcript which shares the exon which the read is aligned to). Reads which align to multiple entities are not counted by HT-Seq. As it stands the GFF file you have can not be used to properly count reads per GENE. To do that you need a GFF (or GTF) with a gene name (e.g. gene_id in ENSEMBL GTF files) associated with each exon feature.

**bruce01** · 11-07-2013, 05:01 AM

@kmcarr How can you tell that 'Parent' is a transcript and not a gene? The "-RA"? It will only result in an improper analysis if multiple transcripts exist, so could test easily for that.

@wmseq are you working on Apis mellifera? Any reason not to use the GTF here?

**kmcarr** · 11-07-2013, 05:26 AM

Originally posted by bruce01 View Post

@kmcarr How can you tell that 'Parent' is a transcript and not a gene? The "-RA"?

Look at the second line of the GFF fragment wmseq posted:

Code:

Group1.1	amel_OGSv3.1	mRNA	507599	515039	1	-	.	ID=GB42155-RA;Parent=GB42155

and you can see the the ID GB42155-RA is assigned to an mRNA (transcript) not to a gene. The gene (Parent) of GB42155-RA has the ID GB42155.

It will only result in an improper analysis if multiple transcripts exist, so could test easily for that.

A majority of genes in complex eukaryotes have multiple isoforms so it is pretty much guaranteed that there will be a problem using the transcript ID with HTSeq-count.

@wmseq are you working on Apis mellifera? Any reason not to use the GTF here?

That annotation file is specific to the version 4.0 assembly of the A. mellifera genome. wmseq appears to be working with the v3.2 assembly. If (his/her) reads are really aligned to the 3.2 build then trying to do gene counts with a v4.0 annotation file will fail.

**bruce01** · 11-07-2013, 06:04 AM

@kmcarr, thanks for the explanation, saw that line after I posted, not as familiar with GFF. Infact I distinctly dislike and avoid them! Just not intuitive to me.

A majority of genes in complex eukaryotes have multiple isoforms so it is pretty much guaranteed...

Yes, so one might test for multiple transcripts, instead of assume they are present. I was unsure of level of annotation and couldn't find the assembly for that GFF, hence looking at that v4.0 on Ensembl. Was a bit brief asking about using that, I did mean use the whole assembly, so realign etc.

An option otherwise is to remove the overlapping transcripts (if any) from GFF, thereby using a single transcript ID as a proxy gene ID. I have a script to do that for GTF, if you want PM me location of GFF for download and I can rewrite for GFF.

Topics	Statistics	Last Post
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, Today, 08:47 AM	0 responses 12 views 0 likes	Last Post by seqadmin Today, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 59 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 54 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM

Seqanswers Leaderboard Ad

Announcement

Is my parameter choice for HTSeq-count is right?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News