I appreciate any help to figure out this puzzle.
What I did:
1. I first used tophat2 to map the reads to the genome, get accepted.bam. This file is ~900 MB. Here is the alignment summary:
------------------------------------------------------------------------------------
Reads:
Input : 45593954
Mapped : 44546323 (97.7% of input)
of these: 1496005 ( 3.4%) have multiple alignments (7 have >20)
97.7% overall read mapping rate.
------------------------------------------------------------------------------------
2. I then used samtools to convert the bam file to sam file for htseq-count to handle:
samtools view -h -o WT3.sam WT3_th2out/accepted_hits.bam
The output sam file is quite large, is ~11 GB.
3. Finally I passed the sam file to htseq-count to generate a count table:
htseq-count --stranded=no WT3.sam genes.gtf > WT3_counts
The first line of the output is just one number followed by a tab:
" 1851236"
I assume this is the total counts for all genes (Is this right?).
At the end of the file it says:
------------------------------------------------------------------------------------
__no_feature 5542374
__ambiguous 5280500
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 3266554
------------------------------------------------------------------------------------
The questions:
I. The sum of all counts are merely 16 million, while the uniquely mapped reads are 43 million, why is the huge discrepancy?
II. For a few genes of interest, I got "0" in the count output, while I can see hundreds to thousands in IGV when looking at the bam file mapping.
What I have checked:
The genes.gtf was the same UCSC source as the genome I used for tophat2 mapping, so the chromosome names are consistent (chrI chrII etc).
The genes of interest I was looking at are not duplicated genes, so the mapped reads inside them should be mostly unique.
I am desparate to get some explanation for this.
Thanks!
Shaohe
What I did:
1. I first used tophat2 to map the reads to the genome, get accepted.bam. This file is ~900 MB. Here is the alignment summary:
------------------------------------------------------------------------------------
Reads:
Input : 45593954
Mapped : 44546323 (97.7% of input)
of these: 1496005 ( 3.4%) have multiple alignments (7 have >20)
97.7% overall read mapping rate.
------------------------------------------------------------------------------------
2. I then used samtools to convert the bam file to sam file for htseq-count to handle:
samtools view -h -o WT3.sam WT3_th2out/accepted_hits.bam
The output sam file is quite large, is ~11 GB.
3. Finally I passed the sam file to htseq-count to generate a count table:
htseq-count --stranded=no WT3.sam genes.gtf > WT3_counts
The first line of the output is just one number followed by a tab:
" 1851236"
I assume this is the total counts for all genes (Is this right?).
At the end of the file it says:
------------------------------------------------------------------------------------
__no_feature 5542374
__ambiguous 5280500
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique 3266554
------------------------------------------------------------------------------------
The questions:
I. The sum of all counts are merely 16 million, while the uniquely mapped reads are 43 million, why is the huge discrepancy?
II. For a few genes of interest, I got "0" in the count output, while I can see hundreds to thousands in IGV when looking at the bam file mapping.
What I have checked:
The genes.gtf was the same UCSC source as the genome I used for tophat2 mapping, so the chromosome names are consistent (chrI chrII etc).
The genes of interest I was looking at are not duplicated genes, so the mapped reads inside them should be mostly unique.
I am desparate to get some explanation for this.
Thanks!
Shaohe
Comment