View Single Post
Old 11-25-2013, 12:56 PM   #1
Location: UK

Join Date: May 2013
Posts: 10
Default HTSeq-count large proportion of mapped reads with no_feature

I am fairly new to RNA-Seq analysis and hoped someone could help me diagnose the reason why I have such a low proportion of reads mapped to features with my current method.

The experiment:
- Were looking at the retina in developing mice
- I have two conditions (wild-type vs mutant), each with 4 biological replicates (8 in total).
- RNA prepared using Illumina TruSeq stranded kit. Ribo-Zero Deplete techneque.
- Using a pair-end 2 x 100bp HiSeq reads. Approximately 40 million pairs per replicate.
The methods:
- Trimmed low quality regions/reads using Trimmomatic.
- Using TopHat2 to map reads to UCSC mm10 reference genome available on the tophat website.
  - --no-novel-juncs and larger --mate-inner-dist arguments
  - Example numbers from one replicate:
    - 92.8% overall mapping rate
    - 30,244,035 aligned pairs; of these:
      - 9.4% have multiple alignments
      - 1.9% are discordant alignments
    - 87.6% concordant pair alignment rate
    - These numbers seem pretty good to me.
- Sorted the acccepted_hits.bam file by read name using 'samtools sort -n'
- Converted bam to sam using 'samtools view'
- Extracting feature read counts using HTSeq-count and the gtf annotation file also from UCSC mm10.
  - Using default arguments (method=union and stranded=yes)
  - Example numbers:
    - successfull feature = 505,622
    - no feature = 29,039,281
    - ambiguous feature = 2525
    - non unique feature = 10,519,556
    - Success rate = 1.26%
As you can see I have only a very small proportion of reads mapped to a feature. I don't really know what is normal but this seems very small. I am certain that it is the correct genes.gtf annotation file as it was provided gzipped with the genome itself. What kind of % of reads mapped on features should I be expecting? What can I do to diagnose this problem? Can anyone spot anything wrong that I am doing? Thanks.

I also have a second question, I am able to do limited down-stream analysis using DESeq using the currently read counts. I get very few (only two) genes with differential expression. These two genes are directly adjacent to each other on the genome but not overlapping, these are Uckl1 and Znf512b ( There are 130 bp between them including the UTRs. I can't imagine how this may be happening if I am using the --union option in HTSeq-count as any reads that overlap both should be labeled as ambiguous features.

Thanks for any help.
edm1 is offline   Reply With Quote