I have been having problems using HTseq-count.
I’ll go through my protocol
1: Take tophat output (accepted_hits.bam), sort them by read and convert to bam file.
samtools sort -n mst12_t14_2_accepted_hits.bam mst12_t14_2_accepted_hits_read_sorted
samtools view mst12_t14_2_accepted_hits_read_sorted.bam > mst12_t14_2_accepted_hits_read_sorted.sam
2: Use HTseq-count to count reads for each gene using GFF file used for tophat alignment
htseq-count -m union -s no -t exon -i gene_id mst12_t14_2_accepted_hits_read_sorted.sam magnaporthe_oryzae_70-15_8_transcripts.gff > test.txt
There were no error messages but when I looked at the output, the count for every gene is zero and all the reads are in “alignment_not_unique” category. Reads in this category supposedly have NH > 1 in SAM file, but the reads I can see all have NH=1.
Below is a sample of the SAM file and the GFF.
D3P26HQ1:1290TA0ACXX:2:1101:1108:73648 73 supercont8.4 4401713 50 35M * 0 0
CCTTGATGTGGACGCGCCCNGCGACCTTGCTGGCG ?=@BDDDDCF>D>EGG6FF#1:?DHDB@DH9D>;F AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:19G15 YT:Z:UU NH:i:1
D3P26HQ1:1290TA0ACXX:2:1101:1109:25458 89 supercont8.4 1740219 50 35M * 0 0
AGCGAATGACGAGCGNCGCTATCATGATAACGTCA ?091D?)1:?<<333#@2EDE<DC<>C;?2=41?: AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:15A19 YT:Z:UU NH:i:1
D3P26HQ1:1290TA0ACXX:2:1101:1112:73199 99 supercont8.1 3567120 50 36M = 3567285 210
TGCACAGAGAGATGGTTGCNTATATATATGCAAAAA ?<;ADDDD+<C;DE@2A@@#33A1CBEEC999CDDD AS:i:-1 XN:i:0 XM:i:1 XO:i:0
XG:i:0 NM:i:1 MD:Z:19A16 YT:Z:UU NH:i:1
D3P26HQ1:1290TA0ACXX:2:1101:1112:73199 147 supercont8.1 3567285 50 45M = 3567120 -210 CACAGACAAAGCGGCTCGTCTTGGCGCGGGCACGCATTGGCTCGG @CBHE?C?EGA@A;-(.9DDED0BD@@;<)B2DA<3HD80+BB?@ AS:i:-5 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:15G6T22 YT:Z:UU NH:i:1
D3P26HQ1:1290TA0ACXX:2:1101:1113:7390 89 supercont8.6 1611568 50 34M * 0 0 AGTACGTCTCGGCTNTCCAGAAGGCCAAGATCAC ?1CC?)1C:+33#3BC:EECA++A?BBB+??? AS:i:-3 XM:i:2 XO:i:0 XG:i:0 MD:Z:2G11T19 NM:i:2 NH:i:1
supercont8.5 MG8_CALLGENES_FINAL_1 start_codon 4470284 4470286 . - 0 gene_id "MGG_00005"; transcript_id "MGG_00005T0";
supercont8.5 MG8_CALLGENES_FINAL_1 stop_codon 4469688 4469690 . - 0 gene_id "MGG_00005"; transcript_id "MGG_00005T0";
supercont8.5 MG8_CALLGENES_FINAL_1 exon 4470173 4470286 . - . gene_id "MGG_00005"; transcript_id "MGG_00005T0";
supercont8.5 MG8_CALLGENES_FINAL_1 CDS 4470173 4470286 . - 0 gene_id "MGG_00005"; transcript_id "MGG_00005T0";
supercont8.5 MG8_CALLGENES_FINAL_1 exon 4470034 4470120 . - . gene_id "MGG_00005"; transcript_id "MGG_00005T0";
supercont8.5 MG8_CALLGENES_FINAL_1 CDS 4470034 4470120 . - 0 gene_id "MGG_00005"; transcript_id "MGG_00005T0";
supercont8.5 MG8_CALLGENES_FINAL_1 exon 4469273 4469969 . - . gene_id "MGG_00005"; transcript_id "MGG_00005T0";
supercont8.5 MG8_CALLGENES_FINAL_1 CDS 4469691 4469969 . - 0 gene_id "MGG_00005"; transcript_id "MGG_00005T0";
Any ideas?
Cheers, Darren
I’ll go through my protocol
1: Take tophat output (accepted_hits.bam), sort them by read and convert to bam file.
samtools sort -n mst12_t14_2_accepted_hits.bam mst12_t14_2_accepted_hits_read_sorted
samtools view mst12_t14_2_accepted_hits_read_sorted.bam > mst12_t14_2_accepted_hits_read_sorted.sam
2: Use HTseq-count to count reads for each gene using GFF file used for tophat alignment
htseq-count -m union -s no -t exon -i gene_id mst12_t14_2_accepted_hits_read_sorted.sam magnaporthe_oryzae_70-15_8_transcripts.gff > test.txt
There were no error messages but when I looked at the output, the count for every gene is zero and all the reads are in “alignment_not_unique” category. Reads in this category supposedly have NH > 1 in SAM file, but the reads I can see all have NH=1.
Below is a sample of the SAM file and the GFF.
D3P26HQ1:1290TA0ACXX:2:1101:1108:73648 73 supercont8.4 4401713 50 35M * 0 0
CCTTGATGTGGACGCGCCCNGCGACCTTGCTGGCG ?=@BDDDDCF>D>EGG6FF#1:?DHDB@DH9D>;F AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:19G15 YT:Z:UU NH:i:1
D3P26HQ1:1290TA0ACXX:2:1101:1109:25458 89 supercont8.4 1740219 50 35M * 0 0
AGCGAATGACGAGCGNCGCTATCATGATAACGTCA ?091D?)1:?<<333#@2EDE<DC<>C;?2=41?: AS:i:-1 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:15A19 YT:Z:UU NH:i:1
D3P26HQ1:1290TA0ACXX:2:1101:1112:73199 99 supercont8.1 3567120 50 36M = 3567285 210
TGCACAGAGAGATGGTTGCNTATATATATGCAAAAA ?<;ADDDD+<C;DE@2A@@#33A1CBEEC999CDDD AS:i:-1 XN:i:0 XM:i:1 XO:i:0
XG:i:0 NM:i:1 MD:Z:19A16 YT:Z:UU NH:i:1
D3P26HQ1:1290TA0ACXX:2:1101:1112:73199 147 supercont8.1 3567285 50 45M = 3567120 -210 CACAGACAAAGCGGCTCGTCTTGGCGCGGGCACGCATTGGCTCGG @CBHE?C?EGA@A;-(.9DDED0BD@@;<)B2DA<3HD80+BB?@ AS:i:-5 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:15G6T22 YT:Z:UU NH:i:1
D3P26HQ1:1290TA0ACXX:2:1101:1113:7390 89 supercont8.6 1611568 50 34M * 0 0 AGTACGTCTCGGCTNTCCAGAAGGCCAAGATCAC ?1CC?)1C:+33#3BC:EECA++A?BBB+??? AS:i:-3 XM:i:2 XO:i:0 XG:i:0 MD:Z:2G11T19 NM:i:2 NH:i:1
supercont8.5 MG8_CALLGENES_FINAL_1 start_codon 4470284 4470286 . - 0 gene_id "MGG_00005"; transcript_id "MGG_00005T0";
supercont8.5 MG8_CALLGENES_FINAL_1 stop_codon 4469688 4469690 . - 0 gene_id "MGG_00005"; transcript_id "MGG_00005T0";
supercont8.5 MG8_CALLGENES_FINAL_1 exon 4470173 4470286 . - . gene_id "MGG_00005"; transcript_id "MGG_00005T0";
supercont8.5 MG8_CALLGENES_FINAL_1 CDS 4470173 4470286 . - 0 gene_id "MGG_00005"; transcript_id "MGG_00005T0";
supercont8.5 MG8_CALLGENES_FINAL_1 exon 4470034 4470120 . - . gene_id "MGG_00005"; transcript_id "MGG_00005T0";
supercont8.5 MG8_CALLGENES_FINAL_1 CDS 4470034 4470120 . - 0 gene_id "MGG_00005"; transcript_id "MGG_00005T0";
supercont8.5 MG8_CALLGENES_FINAL_1 exon 4469273 4469969 . - . gene_id "MGG_00005"; transcript_id "MGG_00005T0";
supercont8.5 MG8_CALLGENES_FINAL_1 CDS 4469691 4469969 . - 0 gene_id "MGG_00005"; transcript_id "MGG_00005T0";
Any ideas?
Cheers, Darren
Comment