SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   Bioinformatics (http://seqanswers.com/forums/forumdisplay.php?f=18)
-   -   HTSeq-count large proportion of mapped reads with no_feature (http://seqanswers.com/forums/showthread.php?t=36153)

edm1 11-25-2013 12:56 PM

HTSeq-count large proportion of mapped reads with no_feature
 
Hi,
I am fairly new to RNA-Seq analysis and hoped someone could help me diagnose the reason why I have such a low proportion of reads mapped to features with my current method.

The experiment:
Code:

- Were looking at the retina in developing mice
- I have two conditions (wild-type vs mutant), each with 4 biological replicates (8 in total).
- RNA prepared using Illumina TruSeq stranded kit. Ribo-Zero Deplete techneque.
- Using a pair-end 2 x 100bp HiSeq reads. Approximately 40 million pairs per replicate.

The methods:
Code:

- Trimmed low quality regions/reads using Trimmomatic.
- Using TopHat2 to map reads to UCSC mm10 reference genome available on the tophat website.
  - --no-novel-juncs and larger --mate-inner-dist arguments
  - Example numbers from one replicate:
    - 92.8% overall mapping rate
    - 30,244,035 aligned pairs; of these:
      - 9.4% have multiple alignments
      - 1.9% are discordant alignments
    - 87.6% concordant pair alignment rate
    - These numbers seem pretty good to me.
- Sorted the acccepted_hits.bam file by read name using 'samtools sort -n'
- Converted bam to sam using 'samtools view'
- Extracting feature read counts using HTSeq-count and the gtf annotation file also from UCSC mm10.
  - Using default arguments (method=union and stranded=yes)
  - Example numbers:
    - successfull feature = 505,622
    - no feature = 29,039,281
    - ambiguous feature = 2525
    - non unique feature = 10,519,556
    - Success rate = 1.26%

As you can see I have only a very small proportion of reads mapped to a feature. I don't really know what is normal but this seems very small. I am certain that it is the correct genes.gtf annotation file as it was provided gzipped with the genome itself. What kind of % of reads mapped on features should I be expecting? What can I do to diagnose this problem? Can anyone spot anything wrong that I am doing? Thanks.

I also have a second question, I am able to do limited down-stream analysis using DESeq using the currently read counts. I get very few (only two) genes with differential expression. These two genes are directly adjacent to each other on the genome but not overlapping, these are Uckl1 and Znf512b (http://genome-euro.ucsc.edu/cgi-bin/...gsid=194787325). There are 130 bp between them including the UTRs. I can't imagine how this may be happening if I am using the --union option in HTSeq-count as any reads that overlap both should be labeled as ambiguous features.

Thanks for any help.

kmcarr 11-25-2013 01:36 PM

Here's your problem.
Quote:

(method=union and stranded=yes)
For libraries prepared using TruSeq Stranded RNA Kits (dUTP second strand marking) you should set htseq-count -stranded=reverse

edm1 11-26-2013 02:40 AM

Thanks. That has greatly improved the feature counts:

successfull feature = 15,000,966 (45.3%)
no_feature = 8,715,612 (26.33%)
ambiguous = 172,178 (0.52%)
alignment_not_unique = 9,215,369 (27.8%)

Does 45.3% success rate seem close to what I should be expecting in mice? I have nothing to compare it to.

McG 03-05-2014 08:36 AM

Quote:

Originally Posted by edm1 (Post 123176)
Does 45.3% success rate seem close to what I should be expecting in mice? I have nothing to compare it to.

Hey edm1,

I'm not sure if you were ever able to get advice on whether 45.3% is close to what you should expect, but here are my two cents:
1) The "alignment_not_unique" feature is for reads with low alignment quality, which usually means reads that had multiple mappings. Count-based methods of differential expression (e.g. DESeq) usually ignore these, i.e. they focus on "unique matches", so you shouldn't include them in your final percentage tally. Excluding them gives you a success rate of 62.8%.
2) I'm not very familiar with the literature on what an expected success rate would be, but one example I am aware of (and was posted in another forum) is this nature paper (link: http://www.nature.com/nature/journal...ture08872.html), reporting in their supplementary information that they mapped an average of 86% (range of 64-91%) uniquely matching reads to exons. All of their numbers are reported in the first supplementary table. Thus, your success rate of 62.8% is close to the bottom of their range. There might still be some areas of improvement for you, but you're pretty darn close.

Hope that helps!

RemitoAmigo 05-26-2014 09:28 AM

Keep also in mind...this was Ribo Zero...All sorts of RNAs are in there...
I would expect PolyA selection to be higher (mapping to exons)...but that's just me thinking...

Remi


All times are GMT -8. The time now is 10:59 PM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.