SEQanswers

obscurite 04-29-2013 08:46 PM

Typical alignment mapping percentage with genome?
 
What are typical mapping percentages for alignment? My samples are giving me an average mapping rate of approx. 25% (from paired-end 100 bp reads). STAR produces somewhat fewer mappings than TopHat, but that's not surprising.

What kind of mapping numbers are you seeing? What do you expect? To what do you attribute the numbers, and how do you interpret them?

Thanks in advance.

alexdobin 04-30-2013 07:14 AM

A 25% mapping rate seems low to me. On a standard-quality human long RNA library we would typically get 85-90% of the reads mapped uniquely, and ~5% mapped to multiple loci. The first usual suspect is the sequencing quality of your library. If you post the Log.final.out report from STAR we can look for clues.

obscurite 04-30-2013 07:35 AM

Hi Alex. Thanks for weighing in. Here's one particularly low mapping count.

Started job on | Mar 15 04:59:21
Started mapping on | Mar 15 05:02:08
Finished on | Mar 15 05:29:49
Mapping speed, Million of reads per hour | 99.16

Number of input reads | 45750724
Average input read length | 202
UNIQUE READS:
Uniquely mapped reads number | 3331104
Uniquely mapped reads % | 7.28%
Average mapped length | 199.28
Number of splices: Total | 736416
Number of splices: Annotated (sjdb) | 665151
Number of splices: GT/AG | 733030
Number of splices: GC/AG | 2159
Number of splices: AT/AC | 569
Number of splices: Non-canonical | 658
Mismatch rate per base, % | 1.11%
Deletion rate per base | 0.04%
Deletion average length | 2.30
Insertion rate per base | 0.03%
Insertion average length | 2.02
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 578140
% of reads mapped to multiple loci | 1.26%
Number of reads mapped to too many loci | 43288
% of reads mapped to too many loci | 0.09%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 1.19%
% of reads unmapped: other | 90.17%

alexdobin 04-30-2013 09:05 AM

This appears to be an interesting case.
Here is how I assess these mapping statistics.

First check the uniquely mapped reads:
Average mapped length | 199.28 : good, close to your pair length of 202.
Mismatch rate per base, % | 1.11% : a bit on the high side, you would get 0.5-0.8% for good libraries.
The splices are dominated by annotated and canonical, which is good.
The indel rate is low.
So, the reads that actually mapped uniquely - as few as they are - look fine.

The ratio of unique to multi-mappers, 7.28%/1.26% ~ 6, is somewhat low (that is, the share of multi-mappers is relatively high) for typical human cells; I am not sure what you are sequencing. Our typical value is 15-20.

% of reads mapped to too many loci | 0.09% : by default "too many loci" is >10, but this number is good so you are not missing much.

Finally - most importantly - unmapped reads.
% of reads unmapped: too short | 1.19% : this number would be large if you had poor sequencing quality; it is surprisingly small (we typically get ~5%).

% of reads unmapped: other | 90.17% :
this is where all the unmapped reads went, and it is very unusual.

It means that for 90% of the reads STAR could not find good anchor seeds. Two main possibilities are:
1. Contamination. Most reads have very little homology with the human genome. You can check it by BLASTing a few unmapped reads against everything.
2. Repeat regions dominate expression. The number of loci a seed could map to is limited by --winAnchorMultimapNmax = 50 by default. You could increase it to ~1000 to see if more reads get mapped (also increase --outFilterMultimapNmax to output them as multi-mappers).
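
For anyone who wants to run the same checks on several samples, here is a minimal Python sketch that pulls the percentages out of a Log.final.out and computes the unique to multi-mapper ratio discussed above. The function names (`parse_star_log`, `percent`, `summarize`) are made up for illustration and are not part of STAR; the example text is a subset of the report posted above, and the parser only assumes the `name | value` layout shown there.

```python
def parse_star_log(text):
    """Return {field name: value string} for every 'name | value' line."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip blank lines and headers such as 'UNIQUE READS:'
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def percent(stats, key):
    """Convert a value such as '7.28%' to the float 7.28."""
    return float(stats[key].rstrip("%"))


def summarize(stats):
    """Collect the numbers used in the assessment above."""
    unique = percent(stats, "Uniquely mapped reads %")
    multi = percent(stats, "% of reads mapped to multiple loci")
    unmapped = sum(
        percent(stats, key)
        for key in stats
        if key.startswith("% of reads unmapped")
    )
    return {
        "unique_pct": unique,
        "multi_pct": multi,
        "unmapped_pct": unmapped,
        # None when there are no multi-mappers, to avoid dividing by zero.
        "unique_to_multi": unique / multi if multi else None,
    }


# Subset of the Log.final.out posted earlier in this thread.
EXAMPLE_LOG = """\
Number of input reads | 45750724
Uniquely mapped reads % | 7.28%
% of reads mapped to multiple loci | 1.26%
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 1.19%
% of reads unmapped: other | 90.17%
"""

if __name__ == "__main__":
    for name, value in summarize(parse_star_log(EXAMPLE_LOG)).items():
        print(f"{name}: {value:.2f}")
```

On the report above this gives a unique to multi-mapper ratio of about 5.8 and roughly 91% unmapped reads in total, which matches the numbers quoted in the assessment.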

Cofactor Genomics 04-30-2013 10:11 AM

Hello Obscurite,

We typically see higher than 80% mapping rate for our RNA-Seq differential expression projects as well.

I agree with alexdobin. One of the next things to check is contamination. It is not necessarily contamination of the sample in the classic sense: the reads may simply not correspond to your reference genome, yet still be important to the biology or phenotype observed in the sample you are sequencing. For example, we recently sequenced a mouse RNA-seq project focussed on differential expression and found that 80% of the reads were mapping to a viral component in NCBI's NR database. It turned out that this viral component was central to the phenotype observed in the mouse. The common saying around here is that every sequencing project is a metagenomics project; the question is just to what degree.

Jarret Glasscock
Cofactor Genomics
http://www.cofactorgenomics.com

HESmith 04-30-2013 12:45 PM

Following up alexdobin's post, ribosomal RNA contamination of mRNA-Seq libraries can produce this type of result due to the high copy number of the rRNA clusters. Adapter dimers are another possible culprit.

chadn737 04-30-2013 01:13 PM

My first suspect would be adaptor sequence. I have encountered that multiple times.
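
A quick way to test the adapter hypothesis before running a full QC tool is to count how many raw reads contain the start of the adapter. The sketch below is only a rough check under stated assumptions: `ADAPTER_PREFIX` is the commonly seen 13-base start of Illumina TruSeq-style adapters and should be replaced with the sequence from your own library prep, the function names are hypothetical, and the FASTQ is assumed to use the plain 4-lines-per-record layout.

```python
import io

# Assumption: common prefix of Illumina TruSeq-style adapters.
# Replace with the adapter sequence for your own kit.
ADAPTER_PREFIX = "AGATCGGAAGAGC"


def read_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def adapter_fraction(handle, adapter=ADAPTER_PREFIX, max_reads=100_000):
    """Fraction of the first max_reads reads that contain the adapter."""
    total = hits = 0
    for seq in read_sequences(handle):
        if total >= max_reads:
            break
        total += 1
        if adapter in seq:
            hits += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Tiny in-memory FASTQ; with real data, pass an open file handle instead.
    demo = io.StringIO(
        "@read1\nACGTAGATCGGAAGAGCAC\n+\nIIIIIIIIIIIIIIIIIII\n"
        "@read2\nACGTACGTACGT\n+\nIIIIIIIIIIII\n"
    )
    print(f"reads with adapter: {adapter_fraction(demo):.0%}")
```

This only finds exact, complete matches, so it will miss adapters with sequencing errors or adapters truncated at the end of a read. A high fraction is a strong hint; a low one does not rule adapters out, and dedicated tools such as FastQC or cutadapt are better for a real answer.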

GenoMax 04-30-2013 06:09 PM

We assume that one must have done some basic QC that should have caught an adapter contamination problem before the alignment was done :)

Another good tool to check for contamination also comes from Babraham Bioinformatics Group.

obscurite 05-02-2013 10:39 AM

Thanks for the tool and strategy suggestions. We have found some rRNA (despite depletion) and are running the QC tools. We are aware of ribopicker. Does anyone have a favorite technique for cleaning up pre-assembly sequences they are able to share? (e.g. rRNA, adapters, etc.) I've looked at normalization and clustering in the context of de novo assembly -- can those be useful for reference assembly?

caiosuz 07-05-2018 11:16 AM

Low percentage of mapped reads
 
I used STAR to align reads from 8 RNA-Seq libraries against the reference genome of the plant Citrus sinensis, and I got mapping rates that I consider very low, since I have seen many published works using the same reference genome with alignment rates between 80 and 95%. The best alignment rate among the 8 libraries I worked with was 39.22% and the worst was 9.90%.
Should I try to run the mapping with less stringent parameters?
Is it possible to run differential expression analyses with such a low mapping rate?
I'm sending the summary mapping results below.
Best regards.

Mapping speed, Million of reads per hour | 51.55

Number of input reads | 14018962
Average input read length | 200
UNIQUE READS:
Uniquely mapped reads number | 5497854
Uniquely mapped reads % | 39.22%
Average mapped length | 197.61
Number of splices: Total | 2557324
Number of splices: Annotated (sjdb) | 2536892
Number of splices: GT/AG | 2480142
Number of splices: GC/AG | 31660
Number of splices: AT/AC | 1565
Number of splices: Non-canonical | 43957
Mismatch rate per base, % | 0.89%
Deletion rate per base | 0.06%
Deletion average length | 1.86
Insertion rate per base | 0.03%
Insertion average length | 2.14
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 91130
% of reads mapped to multiple loci | 0.65%
Number of reads mapped to too many loci | 284
% of reads mapped to too many loci | 0.00%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 60.12%
% of reads unmapped: other | 0.01%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%

Mapping speed, Million of reads per hour | 31.96

Number of input reads | 12510660
Average input read length | 200
UNIQUE READS:
Uniquely mapped reads number | 1238257
Uniquely mapped reads % | 9.90%
Average mapped length | 196.44
Number of splices: Total | 500310
Number of splices: Annotated (sjdb) | 494951
Number of splices: GT/AG | 483272
Number of splices: GC/AG | 7143
Number of splices: AT/AC | 323
Number of splices: Non-canonical | 9572
Mismatch rate per base, % | 1.28%
Deletion rate per base | 0.06%
Deletion average length | 1.89
Insertion rate per base | 0.03%
Insertion average length | 2.13
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 26727
% of reads mapped to multiple loci | 0.21%
Number of reads mapped to too many loci | 98
% of reads mapped to too many loci | 0.00%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 89.88%
% of reads unmapped: other | 0.00%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%

GenoMax 07-05-2018 11:14 PM

Anytime you see low mapping results you should take a sample of reads that don't align and then blast them at NCBI to see if you have

a. Problem with contamination of data with unrelated species
b. rRNA contamination

You appear to have a low % of multi-mapping reads, so if your reference genome contains the rDNA repeats, then the possibility of (b) is small.
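
To take that sample, something like the following Python sketch can pull a random handful of reads out of a FASTQ file of unmapped reads and print them as FASTA, ready to paste into NCBI BLAST. It assumes the unmapped reads have already been written to a 4-lines-per-record FASTQ (STAR can do this with its --outReadsUnmapped Fastx option); the function names are made up for this example, and the demo reads are placeholders, not real data.

```python
import io
import random


def fastq_records(handle):
    """Yield (name, sequence) from a 4-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        handle.readline()  # quality line
        yield header.strip()[1:], seq  # drop the leading '@'


def sample_reads(handle, n, seed=0):
    """Reservoir-sample n records in one pass, without loading the file."""
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(fastq_records(handle)):
        if i < n:
            sample.append(record)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                sample[j] = record
    return sample


def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)


if __name__ == "__main__":
    # Placeholder reads; with real data, open the unmapped-reads file instead.
    demo = io.StringIO(
        "".join(f"@read{i}\nACGTACGTAC\n+\nIIIIIIIIII\n" for i in range(20))
    )
    print(to_fasta(sample_reads(demo, n=3)), end="")
```

Reservoir sampling is used so that every read has the same chance of being picked even though the file is read only once; taking just the first few reads of the file can be misleading because read order often follows position on the flow cell.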


All times are GMT -8.