**Simon Anders** · 05-04-2011, 10:56 PM

So, does chromosome 'gi|49175990|ref|NC_000913.2|' appear in your GFF file or does it not? That would have been the first thing to check.

Obviously, chromosomes need to be identified by the same name in your Bowtie index and in your GFF file. Your Bowtie index seems to use these strange lengthy chromosome identifiers that the FASTA files from UCSC always use.

(A quick search on RefSeq tells me that NC_000913.2 is "Escherichia coli str. K-12 substr. MG1655 chromosome, complete genome". Is this what you want?)

Now look into your GFF file: what is the chromosome called there? Not the same, I'd guess.

You need to change the reference sequence names in one of the two files to match what is used in the other. However, be very careful to make sure that both files base their coordinates on the same build (assembly).
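The name check described above can be sketched in a few lines of plain Python. This is only an illustration: the function names are made up for this example, and it assumes the aligner reports each FASTA header up to the first whitespace as the reference name.

```python
def fasta_names(path):
    """Return the sequence names found in a FASTA file.

    Assumption: the name is the header text up to the first whitespace.
    """
    names = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def gff_names(path):
    """Return the set of values found in the first column of a GFF/GTF file."""
    names = set()
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip comments, directives and blank lines
            names.add(line.split()[0])
    return names


def report_mismatch(fasta_path, gff_path):
    """Return (names only in the FASTA, names only in the GFF), both sorted."""
    fasta = fasta_names(fasta_path)
    gff = gff_names(gff_path)
    return sorted(fasta - gff), sorted(gff - fasta)
```

If both returned lists are empty, the two files use the same names. Whether they also use the same assembly still has to be checked separately.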

**mbobro2** · 05-05-2011, 02:18 PM

Dear Simon,

So, I downloaded the "Escherichia coli str. K-12 substr. MG1655" genome in FASTA (NC_000913.fna) and GFF (NC_000913.gff) formats from the RefSeq FTP site, so both files should be made from the same genome build (NC_000913.2). Here are the first 10 lines of each.
FNA

gi|49175990|ref|NC_000913.2| Escherichia coli str. K-12 substr. MG1655 chromosome, complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT
GACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTT

GFF

##gff-version 3
#!gff-spec-version 1.14
#!source-version NCBI C++ formatter 0.2
##Type DNA NC_000913.2
NC_000913.2 RefSeq source 1 4639675 . + . organism=Escherichia coli str. K-12 substr. MG1655;mol_type=genomic DNA;strain=K-12;sub_strain=MG1655;db_xref=taxon:511145
NC_000913.2 RefSeq gene 190 255 . + . ID=NC_000913.2:thrL;locus_tag=b0001;gene_synonym=ECK0001%3B JW4367;db_xref=ECOCYC:EG11277;db_xref=EcoGene:EG11277;db_xref=GeneID:944742
NC_000913.2 RefSeq CDS 190 252 . + 0 ID=NC_000913.2:thrL:unknown_transcript_1;Parent=NC_000913.2:thrL;locus_tag=b0001;gene_synonym=ECK0001%3B JW4367;function=leader%3B Amino acid biosynthesis: Threonine;function=1.5.1.8 metabolism%3B building block biosynthesis%3B amino acids%3B threonine;GO_process=GO:0009088 - threonine biosynthetic process;transl_table=11;product=thr operon leader peptide;protein_id=NP_414542.1;db_xref=GI:16127995;db_xref=ASAP:ABE-0000006;db_xref=UniProtKB%2FSwiss-Prot:P0AD86;db_xref=ECOCYC:EG11277;db_xref=EcoGene:EG11277;db_xref=GeneID:944742;exon_number=1
NC_000913.2 RefSeq start_codon 190 192 . + 0 ID=NC_000913.2:thrL:unknown_transcript_1;Parent=NC_000913.2:thrL;locus_tag=b0001;gene_synonym=ECK0001%3B JW4367;function=leader%3B Amino acid biosynthesis: Threonine;function=1.5.1.8 metabolism%3B building block biosynthesis%3B amino acids%3B threonine;GO_process=GO:0009088 - threonine biosynthetic process;transl_table=11;product=thr operon leader peptide;protein_id=NP_414542.1;db_xref=GI:16127995;db_xref=ASAP:ABE-0000006;db_xref=UniProtKB%2FSwiss-Prot:P0AD86;db_xref=ECOCYC:EG11277;db_xref=EcoGene:EG11277;db_xref=GeneID:944742;exon_number=1
NC_000913.2 RefSeq stop_codon 253 255 . + 0 ID=NC_000913.2:thrL:unknown_transcript_1;Parent=NC_000913.2:thrL;locus_tag=b0001;gene_synonym=ECK0001%3B JW4367;function=leader%3B Amino acid biosynthesis: Threonine;function=1.5.1.8 metabolism%3B building block biosynthesis%3B amino acids%3B threonine;GO_process=GO:0009088 - threonine biosynthetic process;transl_table=11;product=thr operon leader peptide;protein_id=NP_414542.1;db_xref=GI:16127995;db_xref=ASAP:ABE-0000006;db_xref=UniProtKB%2FSwiss-Prot:P0AD86;db_xref=ECOCYC:EG11277;db_xref=EcoGene:EG11277;db_xref=GeneID:944742;exon_number=1
NC_000913.2 RefSeq gene 337 2799 . + . ID=NC_000913.2:thrA;locus_tag=b0002;gene_synonym=ECK0002%3B Hs%3B JW0001%3B thrA1%3B thrA2%3B thrD;db_xref=ECOCYC:EG10998;db_xref=EcoGene:EG10998;db_xref=GeneID:945803

I made the index from the FNA file using bowtie-build. Then I mapped my trimmed and groomed Illumina reads to this index. Next, I tried htseq-count again, giving it the SAM file from Bowtie together with the GFF file described above. I get the following warning again:

mbobro2:tophat bobrovskyy$ htseq-count -t -h ./output/pHDB3_5_bowtie.sam ./genomes/RefSeq_Genomes/NC_000913.gff
52211 GFF lines processed.
Warning: No features of type '-h' found.
Warning: Skipping read 'HWI-ST330_0103:1:1:1355:2205#CGATGT/1', because chromosome 'gi|49175990|ref|NC_000913.2|', to which it has been aligned, did not appear in the GFF file.
Warning: Skipping read 'HWI-ST330_0103:1:1:1355:2205#CGATGT/1', because chromosome 'gi|49175990|ref|NC_000913.2|', to which it has been aligned, did not appear in the GFF file.

...and keeps printing this warning over and over.

I'm having a hard time figuring out which part of the FNA or GFF file does not match and how to fix it, as I'm new to Unix scripting as well as bioinformatics. I really appreciate your help and I would love to make it work. Several other people I know have never used Bowtie and HTSeq with E. coli RNA-seq, so I think figuring out the way would be extremely useful for our E. coli community. Thank you very much!

Max Bobrovskyy
University of Illinois

**Simon Anders** · 05-05-2011, 10:46 PM

In your GFF file, the chromosome is called "NC_000913.2"; in the FASTA file, it is called "gi|49175990|ref|NC_000913.2|". You cannot expect a script to recognize these as the same.

I usually download my data from Ensembl, which uses shorter identifiers and is more consistent between FASTA and GTF, so I have not addressed this issue in HTSeq so far.

You could edit your FASTA file to remove the extra stuff from the sequence name and start over.
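A minimal sketch of that FASTA edit, in plain Python, might look like the following. The function names are hypothetical, and the sketch assumes headers of the form "gi|...|ref|ACCESSION|" as shown earlier in this thread; any other header is left alone apart from dropping its description.

```python
def short_name(header):
    """Reduce 'gi|49175990|ref|NC_000913.2| description' to 'NC_000913.2'.

    Headers that do not follow the 'gi|...|ref|...|' pattern keep their
    name and only lose the description after the first whitespace.
    """
    fields = header.split()
    name = fields[0] if fields else ""
    parts = name.split("|")
    if len(parts) >= 4 and parts[0] == "gi":
        return parts[3]
    return name


def rename_fasta(in_path, out_path):
    """Copy a FASTA file, shortening every header line."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                dst.write(">" + short_name(line[1:]) + "\n")
            else:
                dst.write(line)
```

After renaming, the Bowtie index has to be rebuilt from the new FASTA file and the reads mapped again, which is what "start over" implies.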

Alternatively you could write a little script to remove the extra characters from the SAM files. This Python script here might do the trick:

Code:

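import HTSeq

# Rewrite each aligned read's reference name from
# "gi|49175990|ref|NC_000913.2|" to "NC_000913.2".
# Only alignment lines are printed; header lines are not copied.
for a in HTSeq.SAM_Reader("myfile.sam"):
    if a.aligned:
        # Element 3 of the split is the part between the third and fourth bar.
        a.iv.chrom = a.iv.chrom.split("|")[3]
    print(a.get_sam_line())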

It splits each chromosome name at the vertical bars and then retains only the part between the third and fourth bar. As I don't have your SAM file, I couldn't test it, but it should be simple to adjust.

Simon

**mbobro2** · 05-10-2011, 10:49 AM

Dear Simon,

I have tried to use Ensembl FASTA file for bowtie index and GFF file for HTseq-count, nevertheless chromosome names did not match. I slightly modified FASTA name to obtain matching chromosome name "Chromosome" in the SAM file. GTF calls it "Chromosome" too. Still, HTseq gives an error. Here are the headers of SAM and GTF files with the error I get when I run HTseq:

SAM
@HD VN:1.0 SO:unsorted
@SQ SN:Chromosome LN:4639675
@PG ID:Bowtie VN:0.12.7 CL:"./bowtie -S -q -a -t -v 1 ./indexes/Ensembl_Indexes_2/ecoli_k12_index ./rnaseq/1_CV104_pHDB3_5.fastqsanger.txt"
HWI-ST330_0103:1:1:1210:2132#NGATGT/1 4 * 0 0 * NCCCATTCGGAAATCGCCGGTTATAACGGTTCATATCACCTTACCGACGCTTATCGCAGATTAGCACGTCCTTCATCGCCTCTGACAGA #96828<598@@@@@DDDDDDDDDD<<<><>====DDDDD??????=?7??=?;?;??????==?5<787:::;:66599<<<<<8479 XM:i:0
HWI-ST330_0103:1:1:1202:2211#NGATGT/1 4 * 0 0 * CTGGCAGTCAGAGGCGATGAAGGACGTGCTAATCTGCNATAAGNGTCGGTAAGGTGATATGAACCGTTATAACCGGCGATTTCCTAATG DDDD@BEEEEDEDCE?DEEE<5?@@:549;@?A######################################################## XM:i:0
HWI-ST330_0103:1:1:1268:2150#NGATGT/1 4 * 0 0 * CTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTTACGAGTGGCGGGCG FBFDDEBEAEGEGEFAGFCEFFFFBDAAC.???0;A?D,?DC5DCCC<C;A:@7>?################################# XM:i:0
HWI-ST330_0103:1:1:1355:2205#CGATGT/1 0 Chromosome 224030 255 89M * 0 0 GTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGG FF:CFEDEEEFGGGGEHFHDFHFFHGGBF?D5@C?GFEBECEDE;>ED?D=DEE8ED<ACCF>9CA(>AAA:=6?DD=AA;?D1@#### XA:i:0 MD:Z:89NM:i:0
HWI-ST330_0103:1:1:1355:2205#CGATGT/1 0 Chromosome 4164941 255 89M * 0 0 GTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGG FF:CFEDEEEFGGGGEHFHDFHFFHGGBF?D5@C?GFEBECEDE;>ED?D=DEE8ED<ACCF>9CA(>AAA:=6?DD=AA;?D1@#### XA:i:0 MD:Z:89NM:i:0
HWI-ST330_0103:1:1:1355:2205#CGATGT/1 0 Chromosome 4033813 255 89M * 0 0 GTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGG FF:CFEDEEEFGGGGEHFHDFHFFHGGBF?D5@C?GFEBECEDE;>ED?D=DEE8ED<ACCF>9CA(>AAA:=6?DD=AA;?D1@#### XA:i:0 MD:Z:89NM:i:0
HWI-ST330_0103:1:1:1355:2205#CGATGT/1 0 Chromosome 4206429 255 89M * 0 0 GTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGG FF:CFEDEEEFGGGGEHFHDFHFFHGGBF?D5@C?GFEBECEDE;>ED?D=DEE8ED<ACCF>9CA(>AAA:=6?DD=AA;?D1@#### XA:i:0 MD:Z:89NM:i:0

GTF
Chromosome protein_coding exon 148 5020 . + . gene_id "EBESCG00000001716"; transcript_id "EBESCT00000002097"; exon_number "1"; gene_name "thrB"; transcript_name "thrB-1";
Chromosome protein_coding CDS 2801 3730 . + 0 gene_id "EBESCG00000001716"; transcript_id "EBESCT00000002097"; exon_number "1"; gene_name "thrB"; transcript_name "thrB-1"; protein_id "EBESCP00000002097";
Chromosome protein_coding start_codon 2801 2803 . + 0 gene_id "EBESCG00000001716"; transcript_id "EBESCT00000002097"; exon_number "1"; gene_name "thrB"; transcript_name "thrB-1";
Chromosome protein_coding stop_codon 3731 3733 . + 0 gene_id "EBESCG00000001716"; transcript_id "EBESCT00000002097"; exon_number "1"; gene_name "thrB"; transcript_name "thrB-1";
Chromosome protein_coding exon 148 5020 . + . gene_id "EBESCG00000000900"; transcript_id "EBESCT00000001087"; exon_number "1"; gene_name "thrL"; transcript_name "thrL-1";
Chromosome protein_coding CDS 190 252 . + 0 gene_id "EBESCG00000000900"; transcript_id "EBESCT00000001087"; exon_number "1"; gene_name "thrL"; transcript_name "thrL-1"; protein_id "EBESCP00000001087";
Chromosome protein_coding start_codon 190 192 . + 0 gene_id "EBESCG00000000900"; transcript_id "EBESCT00000001087"; exon_number "1"; gene_name "thrL"; transcript_name "thrL-1";
Chromosome protein_coding stop_codon 253 255 . + 0 gene_id "EBESCG00000000900"; transcript_id "EBESCT00000001087"; exon_number "1"; gene_name "thrL"; transcript_name "thrL-1";
Chromosome protein_coding exon 148 5020 . + . gene_id "EBESCG00000002850"; transcript_id "EBESCT00000003475"; exon_number "1"; gene_name "thrC"; transcript_name "thrC-1";
Chromosome protein_coding CDS 3734 5017 . + 0 gene_id "EBESCG00000002850"; transcript_id "EBESCT00000003475"; exon_number "1"; gene_name "thrC"; transcript_name "thrC-1"; protein_id "EBESCP00000003475";

HTseq error
Maksym-Bobrovskyys-MacBook-Pro:tophat bobrovskyy$ htseq-count -t -h ./output/pHDB3_5/pHDB3_5_bowtie.sam ./genomes/Ensembl_Genomes/e_coli_k12.EB1_e_coli_k12.60.gtf
20586 GFF lines processed.
Warning: No features of type '-h' found.
Warning: Skipping read 'HWI-ST330_0103:1:1:1355:2205#CGATGT/1', because chromosome 'Chromosome', to which it has been aligned, did not appear in the GFF file.
Warning: Skipping read 'HWI-ST330_0103:1:1:1355:2205#CGATGT/1', because chromosome 'Chromosome', to which it has been aligned, did not appear in the GFF file.
Warning: Skipping read 'HWI-ST330_0103:1:1:1355:2205#CGATGT/1', because chromosome 'Chromosome', to which it has been aligned, did not appear in the GFF file.
Warning: Skipping read 'HWI-ST330_0103:1:1:1355:2205#CGATGT/1', because chromosome 'Chromosome', to which it has been aligned, did not appear in the GFF file.
Warning: Skipping read 'HWI-ST330_0103:1:1:1355:2205#CGATGT/1', because chromosome 'Chromosome', to which it has been aligned, did not appear in the GFF file.
Warning: Skipping read 'HWI-ST330_0103:1:1:1355:2205#CGATGT/1', because chromosome 'Chromosome', to which it has been aligned, did not appear in the GFF file.
Warning: Skipping read 'HWI-ST330_0103:1:1:1355:2205#CGATGT/1', because chromosome 'Chromosome', to which it has been aligned, did not appear in the GFF file.
Warning: Skipping read 'HWI-ST330_0103:1:1:1303:2244#CGATGT/1', because chromosome 'Chromosome', to which it has been aligned, did not appear in the GFF file.

I tried different combinations of FASTA and GTF files and it wouldn't work, I would still get the same message. I really appreciate your help on this!

I tried the script you provided in order to change the SAM file from my previous example, but because my SAM file is 35 GB (~25 million reads mapped), it was problematic to finish the job on my MacBook, as it takes up most of the CPU (we are getting access to a local cluster/server, so this job will become easier and faster). But I figured it would be easier to provide an appropriate FASTA file from the beginning, as I have 5 other datasets that require the same sort of analysis.

Please let me know what you think! I'm sure there is a simple solution that I just don't see. Thank you in advance!

Max Bobrovskyy
University of Illinois
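For a SAM file of this size, one alternative to parsing every record into HTSeq objects is a plain line-by-line pass that only touches the reference-name fields. The sketch below is untested on real data and uses hypothetical function names; it assumes tab-separated SAM lines and the "gi|...|ref|ACCESSION|" naming from the first example in this thread.

```python
def shorten(name):
    """Return the accession from a 'gi|...|ref|ACCESSION|' style name."""
    parts = name.split("|")
    if len(parts) >= 4 and parts[0] == "gi":
        return parts[3]
    return name  # '*', '=' and already-short names pass through unchanged


def fix_sam_line(line):
    """Rewrite reference names in one SAM line (header or alignment)."""
    fields = line.rstrip("\n").split("\t")
    if line.startswith("@"):
        if fields[0] == "@SQ":
            # The SN: tag of an @SQ header line holds the reference name.
            fields = ["SN:" + shorten(f[3:]) if f.startswith("SN:") else f
                      for f in fields]
    elif len(fields) >= 11:
        fields[2] = shorten(fields[2])  # RNAME column
        fields[6] = shorten(fields[6])  # RNEXT column (mate reference)
    return "\t".join(fields) + "\n"


def fix_sam_stream(src, dst):
    """Copy SAM text from src to dst one line at a time, fixing names."""
    for line in src:
        dst.write(fix_sam_line(line))
```

Called as `fix_sam_stream(sys.stdin, sys.stdout)` from a small script, this keeps only one line in memory at a time, so the size of the SAM file should not matter for memory use.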

**Simon Anders** · 05-10-2011, 11:52 AM

Your problem is here:

Originally posted by mbobro2

Maksym-Bobrovskyys-MacBook-Pro:tophat bobrovskyy$ htseq-count -t -h ./output/pHDB3_5/pHDB3_5_bowtie.sam ./genomes/Ensembl_Genomes/e_coli_k12.EB1_e_coli_k12.60.gtf
20586 GFF lines processed.
Warning: No features of type '-h' found.

You instructed htseq-count to only look at lines in the GFF file with feature type (i.e., third column) "-h". As such lines don't exist, it skips the whole GFF file. Omit the "-t -h" in your command line.
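To see which feature types a given annotation file actually offers for the "-t" option, the third column can be tallied with a short script. This is a sketch with a hypothetical function name, assuming a tab-separated GFF/GTF file.

```python
from collections import Counter


def feature_type_counts(path):
    """Count how often each feature type (third column) occurs in a GFF/GTF file."""
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip comments, directives and blank lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3:
                counts[fields[2]] += 1
    return counts
```

For the Ensembl GTF excerpt posted above, the tally would contain types such as "exon", "CDS", "start_codon" and "stop_codon", and no entry for "-h".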

**mbobro2** · 05-10-2011, 12:06 PM

Yay, thank you very much! It seems to be working! Highly appreciate your help! I might have some more questions on how to generate a tab-delimited table with multiple samples but I know there is a section on that in the manual. Hopefully I will be able to figure it out.

Once again, thank you for your help and wonderful software!

Max Bobrovskyy
University of Illinois

**mbobro2** · 05-11-2011, 10:37 AM

Dear Simon,

So I ran HTseq and it seems like it did count some of the reads and what I obtained is this:

EBESCG00000102254 40
EBESCG00000102255 1
EBESCG00000102256 1
EBESCG00000102257 343
EBESCG00000102258 7
EBESCG00000102260 0
EBESCG00000195128 0
EBESCG00000195129 0
EBESCG00000210071 0
EBESCG00000210072 188
EBESCG00000210073 0
EBESCG00000210074 0
EBESCG00000210075 67
no_feature 55630126
ambiguous 2951860
too_low_aQual 0
not_aligned 3760807
alignment_not_unique 0

What I don't understand is why I get Gene_ID instead of Gene_Name in the first column! Can you suggest a way to fix this!? (Unless I have to use TopHat with a gtf file for annotation first, which I thought was not necessary for bacteria due to lack of splice junctions).

Also, I'm not sure, but it seems like I get a lot of reads that are counted as no_feature, ambiguous and not_aligned. Is there a way to improve this, or is this common?

Max Bobrovskyy
University of Illinois

**mbobro2** · 05-11-2011, 11:14 AM

Simon,

I think all I need is to use -i gene_name option so that the final tab_delimited file will contain gene_name attribute instead of gene_id! I'm trying it right now. Sorry for unnecessary question! Let me know in case I'm wrong!

Nevertheless, number of no_feature, ambiguous and not_aligned reads worries me! What are your thoughts on it?

PS: Is it OK to make a tab-delimited table with multiple samples simply by pasting the final results for two samples into Excel and saving as tab-delimited? If not, what would be an easier way to do it? This table is intended to be used in DESeq. Thank you!

Max Bobrovskyy
University of Illinois
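As an alternative to pasting columns together in Excel, the per-sample htseq-count outputs can be merged with a short script. The sketch below uses hypothetical function and sample names; it assumes each output file has two tab-separated columns (feature name, count) and that the summary rows at the end (no_feature, ambiguous and so on, with or without a leading "__") should be left out of the table.

```python
import csv

# Summary rows written at the end of an htseq-count output file.
SPECIAL = {"no_feature", "ambiguous", "too_low_aQual",
           "not_aligned", "alignment_not_unique"}


def read_counts(path):
    """Read one htseq-count output file into a {feature: count} dict."""
    counts = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) != 2:
                continue
            name = row[0]
            if name in SPECIAL or name.startswith("__"):
                continue  # skip the summary rows
            counts[name] = int(row[1])
    return counts


def merge_counts(samples):
    """Merge several count files. samples maps sample name -> file path."""
    tables = {name: read_counts(path) for name, path in samples.items()}
    genes = sorted(set().union(*tables.values()))
    header = ["gene"] + list(tables)
    rows = [[gene] + [tables[name].get(gene, 0) for name in tables]
            for gene in genes]
    return header, rows


def write_table(samples, out_path):
    """Write the merged table as a tab-delimited text file."""
    header, rows = merge_counts(samples)
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)
```

A call such as `write_table({"sample1": "sample1_htseq.txt", "sample2": "sample2_htseq.txt"}, "counts.tsv")` (file names made up here) gives one row per gene and one column per sample. Features missing from a file are filled with 0, which should not happen when all samples were counted against the same GTF file.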

**sunkorner** · 07-01-2011, 05:48 AM

bacterial genome diff expression

Did you finally manage to complete the analysis? I am also working with a bacterial genome, and your input would be very helpful.

Thank you

**mbobro2** · 07-01-2011, 08:08 AM

I have managed to get differential expression values using DESeq. Nevertheless, I have a small problem: some of my genes show 0 reads, whereas I know for a fact that they are expressed. One example is the sRNA SgrS, which shows 0 reads, but I know it is expressed because a Northern blot was performed on that sample. Also SgrT (small prot. encoded in SgrS is not expressed). Another example is the manXYZ genes, which should be expressed but also show 0 reads. The FASTA and GTF files were obtained from Ensembl Bacteria. Single end priming was used in RNAseq.

I used bowtie for mapping with the following options:

./bowtie -S -q -a -t -v 2 ./indexes/Ensembl_Indexes_2/ecoli_k12_index ./rnaseq/pHDB3_5.fastqsanger.txt > ./output/rnaseq_2/pHDB3_5_bowtie.sam

I used HTseq for annotation using the following options:

htseq-count -m intersection-nonempty -i gene_name ./output/pLCV1_20_bowtie.sam ./genomes/Ensembl_Genomes/e_coli_k12.EB1_e_coli_k12.60.gtf > ./output/rnaseq_3/pLCV1_20_htseq.txt

Can anyone make any suggestions as to why I obtain 0 reads for some obviously expressed transcripts and how to modify my workflow in order to fix this?! Thank you!

**Simon Anders** · 07-01-2011, 08:32 AM

Have you looked at your SAM file at the locus of the gene with a genome browser (e.g., IGV)?

**sunkorner** · 07-01-2011, 11:53 AM

I am using the following commands, and I get a lot of reads counted as "no_feature". I tried union as well as intersection-strict, but there is not much change in the number of no_feature reads; only the number of ambiguous reads has changed. My total number of reads is around 18 million, so I am worried about this.

By the way, what does no_feature mean: are these the reads that were not mapped?

Please suggest what might be wrong with my htseq command. Thank you

===
htseq-count -m union -s yes -t CDS -i locus_tag -o WT-1_geneid_htseq_out3 sorted.sam /home/sun/Rna_seq_data/Xylella/chrComplete.gff > PD_out_union &
no_feature 14867530
ambiguous 4453
too_low_aQual 0
not_aligned 1913205
alignment_not_unique 0

htseq-count -m intersection-strict -s yes -t CDS -i locus_tag -o WT-1_geneid_htseq_out2 sorted.sam /home/sun/Rna_seq_data/Xylella/chrComplete.gff > PD_out &
no_feature 14905057
ambiguous 13
too_low_aQual 0
not_aligned 1913205
alignment_not_unique 0

**sunkorner** · 07-01-2011, 11:58 AM

mbobro2, do you think we may need to sort the SAM file before running htseq-count? Sometimes, if we don't sort, it can give odd results.

**Simon Anders** · 07-03-2011, 03:40 AM

Originally posted by sunkorner

By the way, what does no_feature mean: are these the reads that were not mapped?

No, it means that the read does not overlap with any feature in your GTF file, i.e., that it falls onto intergenic or intronic space.

To investigate, first display your SAM file and your GTF file side by side in a genome browser (e.g., IGV) and inspect visually whether there are indeed many reads that fall between features.

On the other hand: Is this your complete output? Not a single feature? Or did you cut them off?

If not, are you sure your GFF file fits your options? You have instructed htseq-count to only use those lines in your GFF file that have the string "CDS" in their third field and a suitable attribute called "locus_tag" in the last field. Maybe post an extract of the GFF file.
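One way to check whether a GFF file fits these options is to count the lines of the requested feature type and see how many of them carry the requested attribute. The following is a rough sketch with a hypothetical function name; it assumes tab-separated columns and attributes separated by semicolons, written either as key=value (GFF3) or as key "value" (GTF).

```python
import re


def count_usable_features(path, feature_type="CDS", id_attribute="locus_tag"):
    """Return (lines of the given type, how many of those carry the attribute)."""
    n_type = 0
    n_with_attribute = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith("#"):
                continue
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 9 or fields[2] != feature_type:
                continue
            n_type += 1
            # Collect attribute keys: the text before '=' or before the first space.
            keys = set()
            for item in fields[8].split(";"):
                item = item.strip()
                if item:
                    keys.add(re.split(r"[=\s]", item, maxsplit=1)[0])
            if id_attribute in keys:
                n_with_attribute += 1
    return n_type, n_with_attribute
```

If the first number is 0, the feature type passed with "-t" does not occur in the file; if the second number is smaller than the first, some lines of that type lack the attribute passed with "-i".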

HTseq:Adding GTF annotation to SAM alignment
