Seqanswers Leaderboard Ad

**sunkorner** · 07-03-2011, 06:09 AM

Please find the the gff file I have used at this url ftp://ftp.ncbi.nlm.nih.gov/genomes/B.../NC_004556.gff

otherwise also please check a sample of it, hope it will be readable in the post

##gff-version 3
#!gff-spec-version 1.14
#!source-version NCBI C++ formatter 0.2
##Type DNA NC_004556.1
NC_004556.1 RefSeq source 1 2519802 . + . organism=Xylella fastidiosa Temecula1;mol_type=genomic DNA;strain=Temecula1;db_xref=taxon:183190;note=Pierce%27s disease strain
NC_004556.1 RefSeq gene 146 1465 . + . ID=NC_004556.1:dnaA;locus_tag=PD0001;db_xref=GeneID:1144259
NC_004556.1 RefSeq CDS 146 1462 . + 0 ID=NC_004556.1:dnaA:unknown_transcript_1;Parent=NC_004556.1:dnaA;locus_tag=PD0001;note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;transl_table=11;product=chromosomal replication initiation protein;protein_id=NP_778260.1;db_xref=GI:28197946;db_xref=GeneID:1144259;exon_number=1
NC_004556.1 RefSeq start_codon 146 148 . + 0 ID=NC_004556.1:dnaA:unknown_transcript_1;Parent=NC_004556.1:dnaA;locus_tag=PD0001;note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;transl_table=11;product=chromosomal replication initiation protein;protein_id=NP_778260.1;db_xref=GI:28197946;db_xref=GeneID:1144259;exon_number=1
NC_004556.1 RefSeq stop_codon 1463 1465 . + 0 ID=NC_004556.1:dnaA:unknown_transcript_1;Parent=NC_004556.1:dnaA;locus_tag=PD0001;note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;transl_table=11;product=chromosomal replication initiation protein;protein_id=NP_778260.1;db_xref=GI:28197946;db_xref=GeneID:1144259;exon_number=1
NC_004556.1 RefSeq gene 1747 2847 . + . ID=NC_004556.1:dnaN;locus_tag=PD0002;db_xref=GeneID:1144260
NC_004556.1 RefSeq CDS 1747 2844 . + 0 ID=NC_004556.1:dnaN:unknown_transcript_1;Parent=NC_004556.1:dnaN;locus_tag=PD0002;EC_number=2.7.7.7;note=binds the polymerase to DNA and acts as a sliding clamp;transl_table=11;product=DNA polymerase III subunit beta;protein_id=NP_778261.1;db_xref=GI:28197947;db_xref=GeneID:1144260;exon_number=1
NC_004556.1 RefSeq start_codon 1747 1749 . + 0 ID=NC_004556.1:dnaN:unknown_transcript_1;Parent=NC_004556.1:dnaN;locus_tag=PD0002;EC_number=2.7.7.7;note=binds the polymerase to DNA and acts as a sliding clamp;transl_table=11;product=DNA polymerase III subunit beta;protein_id=NP_778261.1;db_xref=GI:28197947;db_xref=GeneID:1144260;exon_number=1
NC_004556.1 RefSeq stop_codon 2845 2847 . + 0 ID=NC_004556.1:dnaN:unknown_transcript_1;Parent=NC_004556.1:dnaN;locus_tag=PD0002;EC_number=2.7.7.7;note=binds the polymerase to DNA and acts as a sliding clamp;transl_table=11;product=DNA polymerase III subunit beta;protein_id=NP_778261.1;db_xref=GI:28197947;db_xref=GeneID:1144260;exon_number=1
NC_004556.1 RefSeq gene 3153 4247 . + . ID=NC_004556.1:recF;locus_tag=PD0003;gene_synonym=uvrF;db_xref=GeneID:1144261
NC_004556.1 RefSeq CDS 3153 4244 . + 0 ID=NC_004556.1:recF:unknown_transcript_1;Parent=NC_004556.1:recF;locus_tag=PD0003;gene_synonym=uvrF;note=Required for DNA replication%3B binds preferentially to single-stranded%2C linear DNA;transl_table=11;product=recombination protein F;protein_id=NP_778262.1;db_xref=GI:28197948;db_xref=GeneID:1144261;exon_number=1
NC_004556.1 RefSeq start_codon 3153 3155 . + 0 ID=NC_004556.1:recF:unknown_transcript_1;Parent=NC_004556.1:recF;locus_tag=PD0003;gene_synonym=uvrF;note=Required for DNA replication%3B binds preferentially to single-stranded%2C linear DNA;transl_table=11;product=recombination protein F;protein_id=NP_778262.1;db_xref=GI:28197948;db_xref=GeneID:1144261;exon_number=1
NC_004556.1 RefSeq stop_codon 4245 4247 . + 0 ID=NC_004556.1:recF:unknown_transcript_1;Parent=NC_004556.1:recF;locus_tag=PD0003;gene_synonym=uvrF;note=Required for DNA replication%3B binds preferentially to single-stranded%2C linear DNA;transl_table=11;product=recombination protein F;protein_id=NP_778262.1;db_xref=GI:28197948;db_xref=GeneID:1144261;exon_number=1

I am looking for the count of all the reads mapped to each gene. Please let me know if it is correct way I am doing using locus_tag and CDs. It is also microbial genome.

Thank you

**sunkorner** · 07-03-2011, 06:16 AM

On the other hand: Is this your complete output? Not a single feature? Or did you cut them off.

I just gave the last few lines where no features is shown, I also get the count of reads mapped to each locus_tag. eg
PD0003 23
PD0001 5
etc

Thank you very much

**Simon Anders** · 07-03-2011, 06:16 AM

Again: Are the five lines you quote the whole output that you get from htseq-count? And have you looked at your data and the GTF file with a genome browser?

And what is a "locus tag"? (Sorry, I'm not very familiar with prokaryotic data.)

**Simon Anders** · 07-03-2011, 06:17 AM

Then, you should really check things with a genome browser. Check especially whether the reads are really on the same strand as the gene, to make sure that '-s yes' is correct.

**sunkorner** · 07-03-2011, 07:01 AM

Locus tags are systematic, unique identifiers that are assigned to each gene in GenBank. Therefore I used them as identifiers in this context too. However I am not sure if the use of CDS is causing this problem of lack of identification.

I will also look into the genome browser and get back to you with results. Thank you

**sunkorner** · 07-03-2011, 08:02 PM

Results from genome browser view,

I could see many reads actually mapping to each gene which we call locus tag. I get lesser than the actual count for each gene. Any way to increase it. When I remove the -i option then I get error, tRNAs have no gene id feature, error on 72 line and the program exit. Please let me know how to proceed.

Any input from guys working with bacterial genomes will be very helpful

Thank you

**sunkorner** · 07-03-2011, 08:03 PM

There are also many regions in between the genes where the reads are mapping....but the ones mapping onto the gene too are not shown in the count

**sunkorner** · 07-04-2011, 04:42 AM

I tried to give command,

htseq-count -m intersection-nonempty -o out sorted.sam chr.gff > locus_out

I get error line 72 gene feature doesnt exist for tRNA .......then the progam exits. When I checked the line I see that the feature in this line has 'exon'. Such lines exists for many tRNAs in this gff file. but which are other transcripts we have different information.

NC_004556.1 RefSeq gene 17033 17950 . + . ID=NC_004556.1:hemF;locus_tag=PD0015;db_xref=GeneID:1144215
NC_004556.1 RefSeq CDS 17033 17947 . + 0 ID=NC_004556.1:hemF:unknown_transcript_1;Parent=NC_004556.1:hemF;locus_tag=PD0015;EC_number=1.3.3.3;note=catalyzes the conversion of the propionic acid groups of rings I and III to vinyl groups during heme synthesis;transl_table=11;product=coproporphyrinogen III oxidase;protein_id=NP_778274.1;db_xref=GI:28197960;db_xref=GeneID:1144215;exon_number=1
NC_004556.1 RefSeq start_codon 17033 17035 . + 0 ID=NC_004556.1:hemF:unknown_transcript_1;Parent=NC_004556.1:hemF;locus_tag=PD0015;EC_number=1.3.3.3;note=catalyzes the conversion of the propionic acid groups of rings I and III to vinyl groups during heme synthesis;transl_table=11;product=coproporphyrinogen III oxidase;protein_id=NP_778274.1;db_xref=GI:28197960;db_xref=GeneID:1144215;exon_number=1
NC_004556.1 RefSeq stop_codon 17948 17950 . + 0 ID=NC_004556.1:hemF:unknown_transcript_1;Parent=NC_004556.1:hemF;locus_tag=PD0015;EC_number=1.3.3.3;note=catalyzes the conversion of the propionic acid groups of rings I and III to vinyl groups during heme synthesis;transl_table=11;product=coproporphyrinogen III oxidase;protein_id=NP_778274.1;db_xref=GI:28197960;db_xref=GeneID:1144215;exon_number=1
NC_004556.1 RefSeq gene 20728 20804 . - . locus_tag=PD0016;db_xref=GeneID:1144216
NC_004556.1 RefSeq exon 20728 20804 . - . gbkey=tRNA;locus_tag=PD0016;product=tRNA-Met;note=found by tRNAscan;db_xref=GeneID:1144216;exon_number=1

So I am exculsively giving the command htseq-count -m -t CDS -i locus_tag -o out sorted.sam chr.gff > PD_out

Then in PD_out we see,
.........
,..........
PD2111 4
PD2112 2
PD2113 7
PD2105 15
PD2106 11
PD2107 35
PD2108 10
PD2109 21
PD2110 17
.....
...
....
no_feature 14905057
ambiguous 13
too_low_aQual 0
not_aligned 1913205
alignment_not_unique 0

Something like this. Am I on the right path.

By the by looking into genome browser, the number of reads mapping to the gene too are counted less for my dataset. It is because some genes are overlapping each other that the parameters I gave doesnt let it to recognise. I need a count of uniquley mapped genes, partially mapped genes and not unique genes.

Any suggestions please.

Thank you

**Simon Anders** · 07-04-2011, 04:57 AM

Originally posted by sunkorner View Post

htseq-count -m intersection-nonempty -o out sorted.sam chr.gff > locus_out

This won't work with your GFF file, because it does not have 'gene_id' attributes, and you don't have a '-i'.

This is why you needed this here:

htseq-count -m -t CDS -i locus_tag -o out sorted.sam chr.gff > PD_out

By the by looking into genome browser, the number of reads mapping to the gene too are counted less for my dataset. It is because some genes are overlapping each other that the parameters I gave doesnt let it to recognise. I need a count of uniquley mapped genes, partially mapped genes and not unique genes.

If you had many read falling onto overlapping genes, they would be counted as ambiguous. "no_feature" means that they don't fall anywhere.

What you should do is visualize the 'out' SAM file along your GFF file and see if you find any read that is marked as "no_feature" and appears to be on a feature. Then, make a screenshot and post it here, and we can see what is wrong.

**sunkorner** · 07-05-2011, 09:19 AM

I am using DESeq and I get the following error.

> exampleFile = "/home/sun/Desktop/xylella_normalized_rpm.tab"
> countsTable <- read.delim(exampleFile, header=TRUE, stringsAsFactors=TRUE)
> head(countsTable)
Locus_tag WT.1 WT.2 WT.3 mutant.1 mutant.2 mutant.3
1 PD0001 32.59 35.18 200.56 37.90 173.21 224.03
2 PD0002 26.55 11.80 149.89 18.16 130.71 162.92
3 PD0003 9.73 7.46 171.14 14.50 129.91 203.61
4 PD0004 7.36 6.46 20.83 12.31 16.76 29.63
5 PD0005 47.71 69.57 481.91 82.01 372.57 567.03
6 PD0006 5.13 4.90 175.65 13.04 154.31 202.34
> rownames( countsTable) <- countsTable$Locus_tag
> countsTable <- countsTable[ ,-1 ]
> conds <- c( "W", "W", "W", "M", "M", "M" )
> cds <- newCountDataSet( countsTable, conds )
Error in newCountDataSet(countsTable, conds) :
The countData is not integer.
> cds <- newCountDataSet( countsTable, conds)
Error in newCountDataSet(countsTable, conds) :
The countData is not integer.
> head(countsTable)
WT.1 WT.2 WT.3 mutant.1 mutant.2 mutant.3
PD0001 32.59 35.18 200.56 37.90 173.21 224.03
PD0002 26.55 11.80 149.89 18.16 130.71 162.92
PD0003 9.73 7.46 171.14 14.50 129.91 203.61
PD0004 7.36 6.46 20.83 12.31 16.76 29.63
PD0005 47.71 69.57 481.91 82.01 372.57 567.03
PD0006 5.13 4.90 175.65 13.04 154.31 202.34
> str(countsTable)
'data.frame': 2123 obs. of 6 variables:
$ WT.1 : num 32.59 26.55 9.73 7.36 47.71 ...
$ WT.2 : num 35.18 11.8 7.46 6.46 69.57 ...
$ WT.3 : num 200.6 149.9 171.1 20.8 481.9 ...
$ mutant.1: num 37.9 18.2 14.5 12.3 82 ...
$ mutant.2: num 173.2 130.7 129.9 16.8 372.6 ...
$ mutant.3: num 224 162.9 203.6 29.6 567 ...

I have to still visualize my out SAM file, so I will get back on that soon.

Thank you

**Simon Anders** · 07-05-2011, 09:47 AM

Please, read the error message. It says clearly:

"Error in newCountDataSet(countsTable, conds) :The countData is not integer."

Sorry but you are about the thousandth person to ask this, and I'm a bit desperate because I thought I wrote it nice and clearly in the manual:

DESeq expects as input a table, which tells, for each sample and each gene, how many reads fall onto the gene in the sample.

It seems that many users feel an irresistible urge to do strange things to their counts and then supply something like the number of read per million reads, the number per kilobase transcript length, or whatever.

So, please, people, stop normalizing, dividing, rounding or doing whatever with your data, and just hand over the raw unnormalized integer counts.

**sunkorner** · 07-05-2011, 10:52 AM

Thank you very much and also sorry. I am new to RNAseq data analysis and so somehow must have overlooked the details.

**Camg** · 02-09-2012, 12:05 PM

Error running HTSeq

Hi,

I'm having the same problem as the original post in this thread. When I run HTSeq I'm getting the error:

Warning: Skipping read 'HWI-ST765:7:2307:9917:51869#0', because chromosome 'chr1', to which it has been aligned, did not appear in the GFF file.
Warning: Skipping read 'HWI-ST765:7:2307:9917:76673#0', because chromosome 'chr1', to which it has been aligned, did not appear in the GFF file.

This is the command I used:

python ../../programs/HTSeq-0.5.3p3/HTSeq/scripts/count.py --mode=intersection-nonempty --stranded=no --idattr=transcript_id -o 24.1.htseq 24.1.sorted.sam ../ec_txn.gtf

However, I can see that my SAM file and GTF file are using the same chromosome names so I'm not sure why HTSeq isn't recognizing them.

Here are examples of a few lines from my SAM and GTF files:

HWI-ST765:7:1101:10001:176633#0 147 chr1 97536 255 101M = 97470 -167 ACAAAGATGTCGCACTTGGCATTGATGTACATGCAGACCTCCTGGCAGACCTTGCGCTCGATGTGCTTCTTGAGCTTGTGGAGCTGTCCTGGGCATCGGAG DDDDDDDDDDDDCCDDDDDDDDDDEEFEDDDDDDDCADDFEFFEHHHHHJIJJJIGGIJJIJIHEIJIJJJJJJJJHEJJJJIJJJJIHBHHHFFDB;CC@ NM:i:0 NH:i:1
HWI-ST765:7:1101:10001:176633#0 99 chr1 97470 255 101M = 97536 167 CGGAGGTTTCTGTATCTTCTCCTGTCGTTAACCACGACTCTGTAGAAGTCGCAGTCCCCGACCAGCACAAAGATGTCGCACTTGGCATTGATGTACATGCA BCCFFFBDHHHHHJJJJJJJJJJJJJIFIIJJJJJJIJJJGGIJIJJJHJJIJJIIIJJHHFFDEDDDDDDCDDDCCBDDDDDDDDCDDDDEDEEEEEDDD NM:i:1 NH:i:1
HWI-ST765:7:1101:10002:95570#0 147 chr1 107247 255 101M = 107184 -164 TCCTTCCAAGCTCCAAGCAGCTTCCAATCGAAATGACAGGAATGTTCTTCTTTTTTGCCAGCGCCATGACAAGCTTGGAAATTCTTGGCTCTGCATTCTCG BDDDCDDDDDDDDDDDDDDDDDDDDDEDEEEEFEFFFFFFFHHHHHIIGIIIJIJJJJJJJJJJIJIJIIJJJIJIJIJJJJJJJJJJHHHHHFFFFFCCC NM:i:0 NH:i:1
HWI-ST765:7:1101:10002:95570#0 99 chr1 107184 255 101M = 107247 164 CGGCAACACAGCACCCCTTGCTCCTCACCTTCCCGCTGGAGCTGACATTCTCCACACCGACAATCCTTCCAAGCTCCAAGCAGCTTCCAATCGAAATGACA CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJIIIIJFHIJIJIIIJGGJIIJIGHHBECDDDCDDDDDDDDDCDDDDDDDDDDDDDDDDDDDDDDDDD NM:i:0 NH:i:1

chr1 tophat CDS 4117 4593 . + 0 gene_id "ECU01_0010"; transcript_id "ECU01_0010";
chr1 tophat CDS 4391 5086 . - 0 gene_id "ECU01_0020"; transcript_id "ECU01_0020";
chr1 tophat CDS 5014 7486 . - 0 gene_id "ECU01_0023"; transcript_id "ECU01_0023";
chr1 tophat CDS 7536 8832 . - 0 gene_id "ECU01_0025"; transcript_id "ECU01_0025";
chr1 tophat CDS 11158 11529 . - 0 gene_id "ECU01_0030"; transcript_id "ECU01_0030";

Let me know if you have any ideas on how to solve this issue.
Thanks in advance

**Camg** · 02-09-2012, 12:23 PM

So I was messing around with parameters and found that when I changed --type=CDS to match my GTF file the program ran perfectly. I guess I was assuming CDS was the standard feature in column 3 of a GTF file since that's what is shown in the example GTF (link in manual). Also, I'm not sure why not specifying the --type attribute resulted in the error message I was getting.

Anyways, it is working now. Thanks for this great tool.

**Simon Anders** · 02-09-2012, 12:31 PM

The default value for 'type' is 'exon', not 'CDS'. This is because a GTF files usually have two lines for each exon, one labelled 'exon' and the other 'CDS', the difference being that the former includes untranslated regions and the latter only the coding part of the exon. (For interior exons, both lines contain the same coordinates.) I felt that most users would want htrseq-count to count a read also if it maps to a gene's UTR and this is why the default is 'exon', not 'CDS'.

In your GTF file, all the 'exon' lines seem to be missing, so you hit on the right solution to tell htseq-count to use the 'CDS' lines instead.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 58 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 53 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 45 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 55 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News