SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
htseq-count for sam and gff3 sofia17 RNA Sequencing 45 11-04-2016 03:32 PM
Problems creating GTF for Cufflinks annotation DrD2009 Bioinformatics 10 02-23-2015 06:20 AM
Error with GTF file when using htseq-count MDonlin Bioinformatics 13 01-13-2015 08:29 AM
HTSeq sort sam DZhang Bioinformatics 41 11-02-2014 11:36 PM
where can I find annotation.gtf when trying Cuffcompare? joyce kang Bioinformatics 0 11-14-2011 06:59 AM

Reply
 
Thread Tools
Old 05-04-2011, 12:44 PM   #1
mbobro2
Junior Member
 
Location: Illinois

Join Date: May 2011
Posts: 7
Default HTseq:Adding GTF annotation to SAM alignment

Please help,

I mapped my RNA-seq (Illumina) reads to E. coli index (index built from UCSC obtained FASTA file) using BowTie. Now I'm trying to assign gene annotation to the mapped reads. I decided to use HTSeq for this purpose. I use SAM file i obtained after running Bowtie (also used TopHat to get the SAM file) with GTF files from either UCSC Microbial or Ensembl Bacterial (I also tried converting them to GFF3 format). In either case, when I run HTseq-count using the command:
$ HTseq-count <file.sam> <file.gff (or .gtf)>

I get an error message such as:
Warning: Skipping read 'HWI-ST330_0103:1:1:19499:9909#CGATGT/1', because chromosome 'gi|49175990|ref|NC_000913.2|', to which it has been aligned, did not appear in the GFF file.

I also tried to provide GTF or GFF files while running TopHat, but it failed in that case also.

Can someone one make a suggestion or has previously encountered such problem and knows how to solve it? Please help?!

Also if anyone has done RNA-seq for bacteria, what is your workflow? My plan for now is BowTie (or TopHat) -> HTseq -> DEseq

Appreciate any feedback, thank you!

Max Bobrovskyy
University of Illinois
mbobro2 is offline   Reply With Quote
Old 05-04-2011, 10:56 PM   #2
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

So, does chromosome 'gi|49175990|ref|NC_000913.2|' appear in your GFF file or does it not? That would have been the first thing to check.

Obviously, chromosomes need to be identified with the same name in you Bowtie index and in your GFF file. Your Bowtie index seems to use these strange lengthy chromosome identifiers that the FASTA files from UCSC always use.

(A quick search on RefSeq tell me that NC_000913.2 is "Escherichia coli str. K-12 substr. MG1655 chromosome, complete genome". Is this what you want?)

Now look into your GFF file, how is it called there? Not the same, I'd guess.

You need to the reference sequence names in one of two file to what is used in the other. However, be very careful to make sure there that both files base their coordinates on the same built (assembly).
Simon Anders is offline   Reply With Quote
Old 05-05-2011, 02:18 PM   #3
mbobro2
Junior Member
 
Location: Illinois

Join Date: May 2011
Posts: 7
Default

Dear Simon,

So, I downloaded "Escherichia coli str. K-12 substr. MG1655" genome in fasta (NC_000913.fna) and gff (NC_000913.gff) formats, from RefSeq ftp, so both files should be made from the same genome built (NC_000913.2). Here are the first 10 lines of each.
FNA

gi|49175990|ref|NC_000913.2| Escherichia coli str. K-12 substr. MG1655 chromosome, complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG
CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA
GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC
AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG
AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT
GACGGGACTCGCCGCCGCCCAGCCGGGGTTCCCGCTGGCGCAATTGAAAACTTTCGTCGATCAGGAATTT


GFF

##gff-version 3
#!gff-spec-version 1.14
#!source-version NCBI C++ formatter 0.2
##Type DNA NC_000913.2
NC_000913.2 RefSeq source 1 4639675 . + . organism=Escherichia coli str. K-12 substr. MG1655;mol_type=genomic DNA;strain=K-12;sub_strain=MG1655;db_xref=taxon:511145
NC_000913.2 RefSeq gene 190 255 . + . ID=NC_000913.2:thrL;locus_tag=b0001;gene_synonym=ECK0001%3B JW4367;db_xref=ECOCYC:EG11277;db_xref=EcoGene:EG11277;db_xref=GeneID:944742
NC_000913.2 RefSeq CDS 190 252 . + 0 ID=NC_000913.2:thrL:unknown_transcript_1;Parent=NC_000913.2:thrL;locus_tag=b0001;gene_synonym=ECK0001%3B JW4367;function=leader%3B Amino acid biosynthesis: Threonine;function=1.5.1.8 metabolism%3B building block biosynthesis%3B amino acids%3B threonine;GO_process=GO:0009088 - threonine biosynthetic process;transl_table=11;product=thr operon leader peptide;protein_id=NP_414542.1;db_xref=GI:16127995;db_xref=ASAP:ABE-0000006;db_xref=UniProtKB%2FSwiss-Prot:P0AD86;db_xref=ECOCYC:EG11277;db_xref=EcoGene:EG11277;db_xref=GeneID:944742;exon_number=1
NC_000913.2 RefSeq start_codon 190 192 . + 0 ID=NC_000913.2:thrL:unknown_transcript_1;Parent=NC_000913.2:thrL;locus_tag=b0001;gene_synonym=ECK0001%3B JW4367;function=leader%3B Amino acid biosynthesis: Threonine;function=1.5.1.8 metabolism%3B building block biosynthesis%3B amino acids%3B threonine;GO_process=GO:0009088 - threonine biosynthetic process;transl_table=11;product=thr operon leader peptide;protein_id=NP_414542.1;db_xref=GI:16127995;db_xref=ASAP:ABE-0000006;db_xref=UniProtKB%2FSwiss-Prot:P0AD86;db_xref=ECOCYC:EG11277;db_xref=EcoGene:EG11277;db_xref=GeneID:944742;exon_number=1
NC_000913.2 RefSeq stop_codon 253 255 . + 0 ID=NC_000913.2:thrL:unknown_transcript_1;Parent=NC_000913.2:thrL;locus_tag=b0001;gene_synonym=ECK0001%3B JW4367;function=leader%3B Amino acid biosynthesis: Threonine;function=1.5.1.8 metabolism%3B building block biosynthesis%3B amino acids%3B threonine;GO_process=GO:0009088 - threonine biosynthetic process;transl_table=11;product=thr operon leader peptide;protein_id=NP_414542.1;db_xref=GI:16127995;db_xref=ASAP:ABE-0000006;db_xref=UniProtKB%2FSwiss-Prot:P0AD86;db_xref=ECOCYC:EG11277;db_xref=EcoGene:EG11277;db_xref=GeneID:944742;exon_number=1
NC_000913.2 RefSeq gene 337 2799 . + . ID=NC_000913.2:thrA;locus_tag=b0002;gene_synonym=ECK0002%3B Hs%3B JW0001%3B thrA1%3B thrA2%3B thrD;db_xref=ECOCYC:EG10998;db_xref=EcoGene:EG10998;db_xref=GeneID:945803



I made the index from FNA file using bowtie-build. Then I mapped my Trimmed and Groomed Illumina reads to this index. Next I tried to use htseq-count again inputting SAM file from bowtie with the GFF file described above. I get the following Warning again:

mbobro2:tophat bobrovskyy$ htseq-count -t -h ./output/pHDB3_5_bowtie.sam ./genomes/RefSeq_Genomes/NC_000913.gff
52211 GFF lines processed.
Warning: No features of type '-h' found.
Warning: Skipping read 'HWI-ST330_0103:1:1:1355:2205#CGATGT/1', because chromosome 'gi|49175990|ref|NC_000913.2|', to which it has been aligned, did not appear in the GFF file.
Warning: Skipping read 'HWI-ST330_0103:1:1:1355:2205#CGATGT/1', because chromosome 'gi|49175990|ref|NC_000913.2|', to which it has been aligned, did not appear in the GFF file.


...and keeps printing this warning over and over.

I'm having hard time figuring out what part of FNA or GFF does not match and how to fix it, as I'm new to the unix script as well as bioinformatics. I really appreciate your help and I would love to make it work. Several other people I know have never used Bowtie and HTseq with e. coli RNAseq, so I think figuring out the way would be extremely useful for our e.coli community. Thank you very much and appreciate your help!


Max Bobrovskyy
University of Illinois

Last edited by mbobro2; 05-05-2011 at 02:21 PM.
mbobro2 is offline   Reply With Quote
Old 05-05-2011, 10:46 PM   #4
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

In your GTF file, the chromosome is called "NC_000913.2", in the FASTA file, it is called "gi|49175990|ref|NC_000913.2|". You cannot expect a script to recognize these as the same.

I usually download my data from Ensembl, which uses shorter identidiers and is more consistent between FASTA and GTF, so I didn't address this issue so far in HTSeq.

You could edit your FASTA file to remove the extra stuff from the sequence name and start over.

Alternatively you could write a little script to remove the extra characters from the SAM files. This Python script here might do the trick:

Code:
import HTSeq

for a in HTSeq.SAM_Reader( "myfile.sam" ):
   if a.aligned:
      a.iv.chrom = a.iv.chrom.split("|")[3]
   print a.get_sam_line()
It splits each chromosome name at the vertical bars and then retains only the part between the third and fourth bar. As I don't have your SAM file, I couldn't test it, but it should be simple to adjust.

Simon
Simon Anders is offline   Reply With Quote
Old 05-10-2011, 10:49 AM   #5
mbobro2
Junior Member
 
Location: Illinois

Join Date: May 2011
Posts: 7
Default

Dear Simon,

I have tried to use Ensembl FASTA file for bowtie index and GFF file for HTseq-count, nevertheless chromosome names did not match. I slightly modified FASTA name to obtain matching chromosome name "Chromosome" in the SAM file. GTF calls it "Chromosome" too. Still, HTseq gives an error. Here are the headers of SAM and GTF files with the error I get when I run HTseq:

SAM
@HD VN:1.0 SO:unsorted
@SQ SN:Chromosome LN:4639675
@PG ID:Bowtie VN:0.12.7 CL:"./bowtie -S -q -a -t -v 1 ./indexes/Ensembl_Indexes_2/ecoli_k12_index ./rnaseq/1_CV104_pHDB3_5.fastqsanger.txt"
HWI-ST330_0103:1:1:1210:2132#NGATGT/1 4 * 0 0 * NCCCATTCGGAAATCGCCGGTTATAACGGTTCATATCACCTTACCGACGCTTATCGCAGATTAGCACGTCCTTCATCGCCTCTGACAGA #96828<598@@@@@DDDDDDDDDD<<<><>====DDDDD??????=?7??=?;?;??????==?5<787:::;:66599<<<<<8479 XM:i:0
HWI-ST330_0103:1:1:1202:2211#NGATGT/1 4 * 0 0 * CTGGCAGTCAGAGGCGATGAAGGACGTGCTAATCTGCNATAAGNGTCGGTAAGGTGATATGAACCGTTATAACCGGCGATTTCCTAATG DDDD@BEEEEDEDCE?DEEE<5?@@:549;@?A######################################################## XM:i:0
HWI-ST330_0103:1:1:1268:2150#NGATGT/1 4 * 0 0 * CTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTTACGAGTGGCGGGCG FBFDDEBEAEGEGEFAGFCEFFFFBDAAC.???0;A?D,?DC5DCCC<C;A:@7>?################################# XM:i:0
HWI-ST330_0103:1:1:1355:2205#CGATGT/1 0 Chromosome 224030 255 89M * 0 0 GTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGG FF:CFEDEEEFGGGGEHFHDFHFFHGGBF?D5@C?GFEBECEDE;>ED?D=DEE8ED<ACCF>9CA(>AAA:=6?DD=AA;?D1@#### XA:i:0 MD:Z:89NM:i:0
HWI-ST330_0103:1:1:1355:2205#CGATGT/1 0 Chromosome 4164941 255 89M * 0 0 GTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGG FF:CFEDEEEFGGGGEHFHDFHFFHGGBF?D5@C?GFEBECEDE;>ED?D=DEE8ED<ACCF>9CA(>AAA:=6?DD=AA;?D1@#### XA:i:0 MD:Z:89NM:i:0
HWI-ST330_0103:1:1:1355:2205#CGATGT/1 0 Chromosome 4033813 255 89M * 0 0 GTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGG FF:CFEDEEEFGGGGEHFHDFHFFHGGBF?D5@C?GFEBECEDE;>ED?D=DEE8ED<ACCF>9CA(>AAA:=6?DD=AA;?D1@#### XA:i:0 MD:Z:89NM:i:0
HWI-ST330_0103:1:1:1355:2205#CGATGT/1 0 Chromosome 4206429 255 89M * 0 0 GTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGG FF:CFEDEEEFGGGGEHFHDFHFFHGGBF?D5@C?GFEBECEDE;>ED?D=DEE8ED<ACCF>9CA(>AAA:=6?DD=AA;?D1@#### XA:i:0 MD:Z:89NM:i:0


GTF
Chromosome protein_coding exon 148 5020 . + . gene_id "EBESCG00000001716"; transcript_id "EBESCT00000002097"; exon_number "1"; gene_name "thrB"; transcript_name "thrB-1";
Chromosome protein_coding CDS 2801 3730 . + 0 gene_id "EBESCG00000001716"; transcript_id "EBESCT00000002097"; exon_number "1"; gene_name "thrB"; transcript_name "thrB-1"; protein_id "EBESCP00000002097";
Chromosome protein_coding start_codon 2801 2803 . + 0 gene_id "EBESCG00000001716"; transcript_id "EBESCT00000002097"; exon_number "1"; gene_name "thrB"; transcript_name "thrB-1";
Chromosome protein_coding stop_codon 3731 3733 . + 0 gene_id "EBESCG00000001716"; transcript_id "EBESCT00000002097"; exon_number "1"; gene_name "thrB"; transcript_name "thrB-1";
Chromosome protein_coding exon 148 5020 . + . gene_id "EBESCG00000000900"; transcript_id "EBESCT00000001087"; exon_number "1"; gene_name "thrL"; transcript_name "thrL-1";
Chromosome protein_coding CDS 190 252 . + 0 gene_id "EBESCG00000000900"; transcript_id "EBESCT00000001087"; exon_number "1"; gene_name "thrL"; transcript_name "thrL-1"; protein_id "EBESCP00000001087";
Chromosome protein_coding start_codon 190 192 . + 0 gene_id "EBESCG00000000900"; transcript_id "EBESCT00000001087"; exon_number "1"; gene_name "thrL"; transcript_name "thrL-1";
Chromosome protein_coding stop_codon 253 255 . + 0 gene_id "EBESCG00000000900"; transcript_id "EBESCT00000001087"; exon_number "1"; gene_name "thrL"; transcript_name "thrL-1";
Chromosome protein_coding exon 148 5020 . + . gene_id "EBESCG00000002850"; transcript_id "EBESCT00000003475"; exon_number "1"; gene_name "thrC"; transcript_name "thrC-1";
Chromosome protein_coding CDS 3734 5017 . + 0 gene_id "EBESCG00000002850"; transcript_id "EBESCT00000003475"; exon_number "1"; gene_name "thrC"; transcript_name "thrC-1"; protein_id "EBESCP00000003475";


HTseq error
Maksym-Bobrovskyys-MacBook-Pro:tophat bobrovskyy$ htseq-count -t -h ./output/pHDB3_5/pHDB3_5_bowtie.sam ./genomes/Ensembl_Genomes/e_coli_k12.EB1_e_coli_k12.60.gtf
20586 GFF lines processed.
Warning: No features of type '-h' found.
Warning: Skipping read 'HWI-ST330_0103:1:1:1355:2205#CGATGT/1', because chromosome 'Chromosome', to which it has been aligned, did not appear in the GFF file.
Warning: Skipping read 'HWI-ST330_0103:1:1:1355:2205#CGATGT/1', because chromosome 'Chromosome', to which it has been aligned, did not appear in the GFF file.
Warning: Skipping read 'HWI-ST330_0103:1:1:1355:2205#CGATGT/1', because chromosome 'Chromosome', to which it has been aligned, did not appear in the GFF file.
Warning: Skipping read 'HWI-ST330_0103:1:1:1355:2205#CGATGT/1', because chromosome 'Chromosome', to which it has been aligned, did not appear in the GFF file.
Warning: Skipping read 'HWI-ST330_0103:1:1:1355:2205#CGATGT/1', because chromosome 'Chromosome', to which it has been aligned, did not appear in the GFF file.
Warning: Skipping read 'HWI-ST330_0103:1:1:1355:2205#CGATGT/1', because chromosome 'Chromosome', to which it has been aligned, did not appear in the GFF file.
Warning: Skipping read 'HWI-ST330_0103:1:1:1355:2205#CGATGT/1', because chromosome 'Chromosome', to which it has been aligned, did not appear in the GFF file.
Warning: Skipping read 'HWI-ST330_0103:1:1:1303:2244#CGATGT/1', because chromosome 'Chromosome', to which it has been aligned, did not appear in the GFF file.


I tried different combinations of FASTA and GTF files and it wouldn't work, I would still get the same message. I really appreciate your help on this!

I tried the script you provided me with in order to change SAM file from my previous example, but because my SAM file is 35Gb (~25 Million reads mapped) It was problematic for me to finish the job on my MacBook as it takes most of CPU (we are getting access to a local cluster/server, so this job will become easier and faster). But I figured it would be easier to provide appropriate FASTA from the beginning as I have other 5 datasets that require same sort of analysis.

Please let me know what you think! I'm sure there is a simple solution that I just don't see. Thank you in advance!


Max Bobrovskyy
University of Illinois

Last edited by mbobro2; 05-10-2011 at 11:02 AM.
mbobro2 is offline   Reply With Quote
Old 05-10-2011, 11:52 AM   #6
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Your problem is here:

Quote:
Originally Posted by mbobro2 View Post
Maksym-Bobrovskyys-MacBook-Pro:tophat bobrovskyy$ htseq-count -t -h ./output/pHDB3_5/pHDB3_5_bowtie.sam ./genomes/Ensembl_Genomes/e_coli_k12.EB1_e_coli_k12.60.gtf
20586 GFF lines processed.
Warning: No features of type '-h' found.
You instructed HTSeq-count to only look at lines in the GFF files with feature type (i.e., third column) "-h". As such lines don't exist, it skips the whole GFF file. Omit the "-t -h" in your command line.
Simon Anders is offline   Reply With Quote
Old 05-10-2011, 12:06 PM   #7
mbobro2
Junior Member
 
Location: Illinois

Join Date: May 2011
Posts: 7
Default

Yay, thank you very much! It seems to be working! Highly appreciate your help! I might have some more questions on how to generate a tab-delimited table with multiple samples but I know there is a section on that in the manual. Hopefully I will be able to figure it out.

Once again, thank you for your help and wonderful software!

Max Bobrovskyy
University of Illinois
mbobro2 is offline   Reply With Quote
Old 05-11-2011, 10:37 AM   #8
mbobro2
Junior Member
 
Location: Illinois

Join Date: May 2011
Posts: 7
Default

Dear Simon,

So I ran HTseq and it seems like it did count some of the reads and what I obtained is this:


EBESCG00000102254 40
EBESCG00000102255 1
EBESCG00000102256 1
EBESCG00000102257 343
EBESCG00000102258 7
EBESCG00000102260 0
EBESCG00000195128 0
EBESCG00000195129 0
EBESCG00000210071 0
EBESCG00000210072 188
EBESCG00000210073 0
EBESCG00000210074 0
EBESCG00000210075 67
no_feature 55630126
ambiguous 2951860
too_low_aQual 0
not_aligned 3760807
alignment_not_unique 0


What I don't understand is why I get Gene_ID instead of Gene_Name in the first column! Can you suggest a way to fix this!? (Unless I have to use TopHat with a gtf file for annotation first, which I thought was not necessary for bacteria due to lack of splice junctions).

Also, I'm not sure but it seams like get a lot of reads that have no_feature, ambiguous and not_aligned. Is there a way to improve or is this a common thing?

Max Bobrovskyy
University of Illinois
mbobro2 is offline   Reply With Quote
Old 05-11-2011, 11:14 AM   #9
mbobro2
Junior Member
 
Location: Illinois

Join Date: May 2011
Posts: 7
Default

Simon,

I think all I need is to use -i gene_name option so that the final tab_delimited file will contain gene_name attribute instead of gene_id! I'm trying it right now. Sorry for unnecessary question! Let me know in case I'm wrong!

Nevertheless, number of no_feature, ambiguous and not_aligned reads worries me! What are your thoughts on it?

PS: Is it ok to make a tab_delimited table with multiple samples simply by pasting final result for two samples into excel and saving as tab delimited! If not what would be an easier way to do it? This table is intended to be used in DEseq. Thank you!

Max Bobrovskyy
University of Illinois
mbobro2 is offline   Reply With Quote
Old 07-01-2011, 05:48 AM   #10
sunkorner
Member
 
Location: Florida

Join Date: Jan 2011
Posts: 15
Default bacterial genome diff expression

Finally did u manage with the analysis. Even I am working with bacterial genome and your input will be very helpful

Thank you
sunkorner is offline   Reply With Quote
Old 07-01-2011, 08:08 AM   #11
mbobro2
Junior Member
 
Location: Illinois

Join Date: May 2011
Posts: 7
Default

I have managed to get differential expression values using DEseq. Nevertheless, I have a small problem, some of my genes show 0 reads, whereas I know for a fact they are expressed. One example is sRNA SgrS, which shows up to have 0 reads but I know it is expressed because a Northern blot was performed on that sample. Also SgrT (small prot. encoded in SgrS is not expressed). Another example is manXYZ genes, which should be expressed but also show up as 0 reads. FASTA and GTF files obtained from Ensembl Bacteria. Single end priming was used in RNAseq.

I used bowtie for mapping with the following options:

./bowtie -S -q -a -t -v 2 ./indexes/Ensembl_Indexes_2/ecoli_k12_index ./rnaseq/pHDB3_5.fastqsanger.txt > ./output/rnaseq_2/pHDB3_5_bowtie.sam

I used HTseq for annotation using the following options:

htseq-count -m intersection-nonempty -i gene_name ./output/pLCV1_20_bowtie.sam ./genomes/Ensembl_Genomes/e_coli_k12.EB1_e_coli_k12.60.gtf > ./output/rnaseq_3/pLCV1_20_htseq.txt

Can anyone make any suggestions as to why I obtain 0 reads for some obviously expressed transcripts and how to modify my workflow in order to fix this?! Thank you!
mbobro2 is offline   Reply With Quote
Old 07-01-2011, 08:32 AM   #12
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Have you looked into you SAM file at the locus of the gene with a gene browser (e.g., IGV)?
Simon Anders is offline   Reply With Quote
Old 07-01-2011, 11:53 AM   #13
sunkorner
Member
 
Location: Florida

Join Date: Jan 2011
Posts: 15
Default

I am using the following commands, I get lot of reads as "no_feature", I tried union as well as intersection strict, but no much change in the nubmer of no_feature. only the ambiguous reads nubmer has change. My total number of reads is around 18 million. So I am worried in this regard.

By the by what is the no_feature mean: are these number of reads not mapped?

Please suggest what might be wrong with my htseq command. Thank you

===
htseq-count -m union -s yes -t CDS -i locus_tag -o WT-1_geneid_htseq_out3 sorted.sam /home/sun/Rna_seq_data/Xylella/chrComplete.gff > PD_out_union &
no_feature 14867530
ambiguous 4453
too_low_aQual 0
not_aligned 1913205
alignment_not_unique 0

htseq-count -m intersection-strict -s yes -t CDS -i locus_tag -o WT-1_geneid_htseq_out2 sorted.sam /home/sun/Rna_seq_data/Xylella/chrComplete.gff > PD_out &
no_feature 14905057
ambiguous 13
too_low_aQual 0
not_aligned 1913205
alignment_not_unique 0
sunkorner is offline   Reply With Quote
Old 07-01-2011, 11:58 AM   #14
sunkorner
Member
 
Location: Florida

Join Date: Jan 2011
Posts: 15
Default

mbobro2, may be do u think we need to sort the sam file before HTSeq count, sometimes if we dont sort it can give odd results
sunkorner is offline   Reply With Quote
Old 07-03-2011, 03:40 AM   #15
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Quote:
Originally Posted by sunkorner View Post
By the by what is the no_feature mean: are these number of reads not mapped?
No, it means that the read does not overlap with any feature in your GTF file, i.e., that it falls onto intergenic or intronic space.

To investigate, first display your SAM file and you GTF file alongside in a genome browser (e.g., IGV) and inspect visually, whether there are indeed many reads that fall between features.

On the other hand: Is this your complete output? Not a single feature? Or did you cut them off.

If not, are you sure your GFF file fits your options? You have instructed htseq-count to only use those lines in your GFF file that have the string "CDS" in its third field, and have a suitable attribute called "locus_tag" in the last field. Maybe post an extract of the GFF file.
Simon Anders is offline   Reply With Quote
Old 07-03-2011, 06:09 AM   #16
sunkorner
Member
 
Location: Florida

Join Date: Jan 2011
Posts: 15
Default

Please find the the gff file I have used at this url ftp://ftp.ncbi.nlm.nih.gov/genomes/B.../NC_004556.gff

otherwise also please check a sample of it, hope it will be readable in the post
Quote:
##gff-version 3
#!gff-spec-version 1.14
#!source-version NCBI C++ formatter 0.2
##Type DNA NC_004556.1
NC_004556.1 RefSeq source 1 2519802 . + . organism=Xylella fastidiosa Temecula1;mol_type=genomic DNA;strain=Temecula1;db_xref=taxon:183190;note=Pierce%27s disease strain
NC_004556.1 RefSeq gene 146 1465 . + . ID=NC_004556.1:dnaA;locus_tag=PD0001;db_xref=GeneID:1144259
NC_004556.1 RefSeq CDS 146 1462 . + 0 ID=NC_004556.1:dnaA:unknown_transcript_1;Parent=NC_004556.1:dnaA;locus_tag=PD0001;note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;transl_table=11;product=chromosomal replication initiation protein;protein_id=NP_778260.1;db_xref=GI:28197946;db_xref=GeneID:1144259;exon_number=1
NC_004556.1 RefSeq start_codon 146 148 . + 0 ID=NC_004556.1:dnaA:unknown_transcript_1;Parent=NC_004556.1:dnaA;locus_tag=PD0001;note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;transl_table=11;product=chromosomal replication initiation protein;protein_id=NP_778260.1;db_xref=GI:28197946;db_xref=GeneID:1144259;exon_number=1
NC_004556.1 RefSeq stop_codon 1463 1465 . + 0 ID=NC_004556.1:dnaA:unknown_transcript_1;Parent=NC_004556.1:dnaA;locus_tag=PD0001;note=binds to the dnaA-box as an ATP-bound complex at the origin of replication during the initiation of chromosomal replication%3B can also affect transcription of multiple genes including itself.;transl_table=11;product=chromosomal replication initiation protein;protein_id=NP_778260.1;db_xref=GI:28197946;db_xref=GeneID:1144259;exon_number=1
NC_004556.1 RefSeq gene 1747 2847 . + . ID=NC_004556.1:dnaN;locus_tag=PD0002;db_xref=GeneID:1144260
NC_004556.1 RefSeq CDS 1747 2844 . + 0 ID=NC_004556.1:dnaN:unknown_transcript_1;Parent=NC_004556.1:dnaN;locus_tag=PD0002;EC_number=2.7.7.7;note=binds the polymerase to DNA and acts as a sliding clamp;transl_table=11;product=DNA polymerase III subunit beta;protein_id=NP_778261.1;db_xref=GI:28197947;db_xref=GeneID:1144260;exon_number=1
NC_004556.1 RefSeq start_codon 1747 1749 . + 0 ID=NC_004556.1:dnaN:unknown_transcript_1;Parent=NC_004556.1:dnaN;locus_tag=PD0002;EC_number=2.7.7.7;note=binds the polymerase to DNA and acts as a sliding clamp;transl_table=11;product=DNA polymerase III subunit beta;protein_id=NP_778261.1;db_xref=GI:28197947;db_xref=GeneID:1144260;exon_number=1
NC_004556.1 RefSeq stop_codon 2845 2847 . + 0 ID=NC_004556.1:dnaN:unknown_transcript_1;Parent=NC_004556.1:dnaN;locus_tag=PD0002;EC_number=2.7.7.7;note=binds the polymerase to DNA and acts as a sliding clamp;transl_table=11;product=DNA polymerase III subunit beta;protein_id=NP_778261.1;db_xref=GI:28197947;db_xref=GeneID:1144260;exon_number=1
NC_004556.1 RefSeq gene 3153 4247 . + . ID=NC_004556.1:recF;locus_tag=PD0003;gene_synonym=uvrF;db_xref=GeneID:1144261
NC_004556.1 RefSeq CDS 3153 4244 . + 0 ID=NC_004556.1:recF:unknown_transcript_1;Parent=NC_004556.1:recF;locus_tag=PD0003;gene_synonym=uvrF;note=Required for DNA replication%3B binds preferentially to single-stranded%2C linear DNA;transl_table=11;product=recombination protein F;protein_id=NP_778262.1;db_xref=GI:28197948;db_xref=GeneID:1144261;exon_number=1
NC_004556.1 RefSeq start_codon 3153 3155 . + 0 ID=NC_004556.1:recF:unknown_transcript_1;Parent=NC_004556.1:recF;locus_tag=PD0003;gene_synonym=uvrF;note=Required for DNA replication%3B binds preferentially to single-stranded%2C linear DNA;transl_table=11;product=recombination protein F;protein_id=NP_778262.1;db_xref=GI:28197948;db_xref=GeneID:1144261;exon_number=1
NC_004556.1 RefSeq stop_codon 4245 4247 . + 0 ID=NC_004556.1:recF:unknown_transcript_1;Parent=NC_004556.1:recF;locus_tag=PD0003;gene_synonym=uvrF;note=Required for DNA replication%3B binds preferentially to single-stranded%2C linear DNA;transl_table=11;product=recombination protein F;protein_id=NP_778262.1;db_xref=GI:28197948;db_xref=GeneID:1144261;exon_number=1
I am looking for the count of all the reads mapped to each gene. Please let me know if it is correct way I am doing using locus_tag and CDs. It is also microbial genome.

Thank you
sunkorner is offline   Reply With Quote
Old 07-03-2011, 06:16 AM   #17
sunkorner
Member
 
Location: Florida

Join Date: Jan 2011
Posts: 15
Default

Quote:
On the other hand: Is this your complete output? Not a single feature? Or did you cut them off.
I just gave the last few lines where no features is shown, I also get the count of reads mapped to each locus_tag. eg
PD0003 23
PD0001 5
etc

Thank you very much
sunkorner is offline   Reply With Quote
Old 07-03-2011, 06:16 AM   #18
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Again: Are the five lines you quote the whole output that you get from htseq-count? And have you looked at your data and the GTF file with a genome browser?

And what is a "locus tag"? (Sorry, I'm not very familiar with prokaryotic data.)
Simon Anders is offline   Reply With Quote
Old 07-03-2011, 06:17 AM   #19
Simon Anders
Senior Member
 
Location: Heidelberg, Germany

Join Date: Feb 2010
Posts: 994
Default

Then, you should really check things with a genome browser. Check especially whether the reads are really on the same strand as the gene, to make sure that '-s yes' is correct.
Simon Anders is offline   Reply With Quote
Old 07-03-2011, 07:01 AM   #20
sunkorner
Member
 
Location: Florida

Join Date: Jan 2011
Posts: 15
Default

Locus tags are systematic, unique identifiers that are assigned to each gene in GenBank. Therefore I used them as identifiers in this context too. However I am not sure if the use of CDS is causing this problem of lack of identification.

I will also look into the genome browser and get back to you with results. Thank you
sunkorner is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 04:33 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO