SEQanswers

Old 04-29-2013, 07:46 PM   #1
obscurite
Junior Member
 
Location: NYC

Join Date: Feb 2013
Posts: 7
Typical alignment mapping percentage with genome?

What are typical mapping percentages for alignment? My samples are giving me an average mapping rate of approx. 25% (from paired-end 100 bp reads). STAR produces somewhat fewer mappings than TopHat, but that's not surprising.

What kind of mapping numbers are you seeing? What do you expect? To what do you attribute the numbers, and how do you interpret them?

Thanks in advance.
Old 04-30-2013, 06:14 AM   #2
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161

25% mapping rate seems low to me. On a standard quality human long RNA library we would typically get 85-90% of the reads mapped uniquely, and ~5% mapped to multiple loci. The first usual suspect is the sequencing quality of your library. If you post Log.final.out report from STAR we can look for clues.
Old 04-30-2013, 06:35 AM   #3
obscurite
Junior Member
 
Location: NYC

Join Date: Feb 2013
Posts: 7

Hi Alex. Thanks for weighing in. Here's one particularly low mapping count.

Started job on | Mar 15 04:59:21
Started mapping on | Mar 15 05:02:08
Finished on | Mar 15 05:29:49
Mapping speed, Million of reads per hour | 99.16

Number of input reads | 45750724
Average input read length | 202
UNIQUE READS:
Uniquely mapped reads number | 3331104
Uniquely mapped reads % | 7.28%
Average mapped length | 199.28
Number of splices: Total | 736416
Number of splices: Annotated (sjdb) | 665151
Number of splices: GT/AG | 733030
Number of splices: GC/AG | 2159
Number of splices: AT/AC | 569
Number of splices: Non-canonical | 658
Mismatch rate per base, % | 1.11%
Deletion rate per base | 0.04%
Deletion average length | 2.30
Insertion rate per base | 0.03%
Insertion average length | 2.02
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 578140
% of reads mapped to multiple loci | 1.26%
Number of reads mapped to too many loci | 43288
% of reads mapped to too many loci | 0.09%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 1.19%
% of reads unmapped: other | 90.17%
Old 04-30-2013, 08:05 AM   #4
alexdobin
Senior Member
 
Location: NY

Join Date: Feb 2009
Posts: 161

This appears to be an interesting case.
Here is how I assess these mapping statistics.

First check the uniquely mapped reads:
Average mapped length | 199.28 : good, close to your pair length of 202.
Mismatch rate per base, % | 1.11% : a bit on the high side, you would get 0.5-0.8% for good libraries.
The splices are dominated by annotated and canonical, which is good.
The indel rate is low.
So, the reads that actually mapped uniquely - as few as they are - look fine.

The ratio of unique to multi-mappers is 7.28%/1.26% ~ 6, which is somewhat low (that is, the multi-mapper fraction is somewhat high) for typical human cells; I am not sure what you are sequencing. Our typical value is 15-20.
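The unique-to-multi-mapper ratio quoted above can be read straight off a Log.final.out file. The following is a minimal sketch, not part of STAR itself: the function names are made up for illustration, and it only assumes that each statistic sits on a "name | value" line, as in the report posted earlier in this thread.

```python
def parse_star_log(text):
    """Collect the 'name | value' lines of a STAR Log.final.out into a dict."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as "UNIQUE READS:" carry no value
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def percent(stats, key):
    """Convert a value such as '7.28%' to the float 7.28."""
    return float(stats[key].rstrip("%"))


def unique_to_multi_ratio(stats):
    """Ratio of uniquely mapped reads to reads mapped to multiple loci."""
    return (percent(stats, "Uniquely mapped reads %")
            / percent(stats, "% of reads mapped to multiple loci"))


# Two lines copied from the report in post #3.
example = """\
Uniquely mapped reads % | 7.28%
% of reads mapped to multiple loci | 1.26%
"""
print(round(unique_to_multi_ratio(parse_star_log(example)), 2))  # 5.78
```

For a real run, pass the contents of the file instead, e.g. `parse_star_log(open("Log.final.out").read())`.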

% of reads mapped to too many loci | 0.09% : by default "too many loci" is >10, but this number is good so you are not missing much.

Finally - most importantly - unmapped reads.
% of reads unmapped: too short | 1.19% : this number would be large if you had poor sequencing quality, it is surprisingly small (we typically get ~5%).

% of reads unmapped: other | 90.17% :
this is where all the unmapped reads went, and it is very unusual.

It means that for 90% of the reads STAR could not find good anchor seeds. Two main possibilities are:
1. Contamination. Most reads have very little homology with human genome. You can check it by BLASTing a few unmapped reads against everything.
2. Repeat regions dominate expression. The number of loci a seed could map to is limited by --winAnchorMultimapNmax = 50 by default. You could increase it to ~1000 to see if more reads get mapped (also increase --outFilterMultimapNmax to output them as multi-mappers).
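A re-run for possibility 2 could be assembled as in the sketch below. This is only an illustration: the genome directory, FASTQ names, output prefix and thread count are placeholders, and using 1000 for --outFilterMultimapNmax as well is an assumption (the post only says to increase it). The function builds the argument list and does not launch STAR.

```python
import shlex


def star_rerun_command(genome_dir, read1, read2, prefix,
                       anchor_max=1000, multimap_max=1000, threads=8):
    """Build a STAR argument list with raised multimapping limits."""
    return [
        "STAR",
        "--runThreadN", str(threads),
        "--genomeDir", genome_dir,
        "--readFilesIn", read1, read2,
        "--outFileNamePrefix", prefix,
        # Allow anchor seeds that map to many loci (default is 50).
        "--winAnchorMultimapNmax", str(anchor_max),
        # Report reads mapping to up to this many loci as multi-mappers
        # (assumed value; the post does not give a number).
        "--outFilterMultimapNmax", str(multimap_max),
    ]


# Hypothetical paths, for illustration only.
cmd = star_rerun_command("star_index", "sample_R1.fastq", "sample_R2.fastq",
                         "rerun_")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be handed to `subprocess.run(cmd, check=True)`. Comparing the new Log.final.out with the old one then shows whether reads moved from "unmapped: other" into the multi-mapping categories.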
Old 04-30-2013, 09:11 AM   #5
Cofactor Genomics
Registered Vendor
 
Location: St. Louis

Join Date: Jan 2010
Posts: 52

Hello Obscurite,

We typically see higher than 80% mapping rate for our RNA-Seq differential expression projects as well.

I agree with alexdobin. One of the next things to check is contamination. It is not necessarily contamination of the sample in the classic sense; the reads may simply not correspond to your reference genome, yet still be important to the biology or phenotype observed in the sample you are sequencing. For example, we recently sequenced a mouse RNA-seq project focused on differential expression and found that 80% of the reads mapped to a viral component in NCBI's NR database. It turned out that this viral component was central to the phenotype observed in the mouse. The common saying around here is that every sequencing project is a metagenomics project; the question is just to what degree.

Jarret Glasscock
Cofactor Genomics
http://www.cofactorgenomics.com
Old 04-30-2013, 11:45 AM   #6
HESmith
Senior Member
 
Location: Bethesda MD

Join Date: Oct 2009
Posts: 498

Following up alexdobin's post, ribosomal RNA contamination of mRNA-Seq libraries can produce this type of result due to the high copy number of the rRNA clusters. Adapter dimers are another possible culprit.
Old 04-30-2013, 12:13 PM   #7
chadn737
Senior Member
 
Location: US

Join Date: Jan 2009
Posts: 392

My first suspect would be adaptor sequence. I have encountered that multiple times.
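A quick way to test the adapter hypothesis before re-running anything is to count how many reads contain the start of the adapter. The sketch below is a rough check rather than a replacement for a proper QC tool. It assumes an Illumina TruSeq-style library, whose adapters begin with AGATCGGAAGAGC, and a plain four-line-per-record FASTQ; swap in your own adapter sequence if your kit differs. The function names are invented for this example.

```python
import io
from itertools import islice

# Common prefix of Illumina TruSeq adapters (assumption about the library).
ADAPTER_PREFIX = "AGATCGGAAGAGC"


def fastq_sequences(handle):
    """Yield the sequence line of each record in a four-line FASTQ."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def adapter_fraction(sequences, adapter=ADAPTER_PREFIX, limit=100000):
    """Fraction of the first `limit` reads that contain the adapter."""
    total = hits = 0
    for seq in islice(sequences, limit):
        total += 1
        if adapter in seq:
            hits += 1
    return hits / total if total else 0.0


# Tiny in-memory example: the first read runs into the adapter.
demo = io.StringIO(
    "@r1\nACGTACGTAGATCGGAAGAGCACAC\n+\nIIIIIIIIIIIIIIIIIIIIIIIII\n"
    "@r2\nACGTACGTACGTACGTACGTACGTA\n+\nIIIIIIIIIIIIIIIIIIIIIIIII\n"
)
print(adapter_fraction(fastq_sequences(demo)))  # 0.5
```

For real data, open the FASTQ with `open(path)` (or `gzip.open(path, "rt")` for compressed files) and pass the handle to `fastq_sequences`. An exact substring match misses adapters with sequencing errors, so treat the result as a lower bound.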
Old 04-30-2013, 05:09 PM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,784

We assume that some basic QC was done, which should have caught any adapter contamination problem before the alignment was run.

Another good tool to check for contamination also comes from Babraham Bioinformatics Group.
Old 05-02-2013, 09:39 AM   #9
obscurite
Junior Member
 
Location: NYC

Join Date: Feb 2013
Posts: 7

Thanks for the tool and strategy suggestions. We have found some rRNA (despite depletion) and are running the QC tools. We are aware of ribopicker. Does anyone have a favorite technique for cleaning up pre-assembly sequences they are able to share? (e.g. rRNA, adapters, etc.) I've looked at normalization and clustering in the context of de novo assembly -- can those be useful for reference assembly?

Last edited by obscurite; 05-02-2013 at 10:21 AM.
Old 07-05-2018, 10:16 AM   #10
caiosuz
Member
 
Location: Brazil_Bahia_Ilheus

Join Date: Dec 2015
Posts: 12
Low percentage of mapped reads

I used STAR to align reads from 8 RNA-Seq libraries against the reference genome of the plant Citrus sinensis, and I got mapping results that I consider very low, since I've seen many published works using the same reference genome with alignment rates between 80 and 95%. The best alignment rate among the 8 libraries I worked with was 39.22% and the worst was 9.90%.
Should I try to run the mapping with less stringent parameters?
Is it possible to run differential expression analyses with such a low mapping rate?
The mapping summaries for those two libraries are below.
Best regards.

Mapping speed, Million of reads per hour | 51.55

Number of input reads | 14018962
Average input read length | 200
UNIQUE READS:
Uniquely mapped reads number | 5497854
Uniquely mapped reads % | 39.22%
Average mapped length | 197.61
Number of splices: Total | 2557324
Number of splices: Annotated (sjdb) | 2536892
Number of splices: GT/AG | 2480142
Number of splices: GC/AG | 31660
Number of splices: AT/AC | 1565
Number of splices: Non-canonical | 43957
Mismatch rate per base, % | 0.89%
Deletion rate per base | 0.06%
Deletion average length | 1.86
Insertion rate per base | 0.03%
Insertion average length | 2.14
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 91130
% of reads mapped to multiple loci | 0.65%
Number of reads mapped to too many loci | 284
% of reads mapped to too many loci | 0.00%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 60.12%
% of reads unmapped: other | 0.01%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%

Mapping speed, Million of reads per hour | 31.96

Number of input reads | 12510660
Average input read length | 200
UNIQUE READS:
Uniquely mapped reads number | 1238257
Uniquely mapped reads % | 9.90%
Average mapped length | 196.44
Number of splices: Total | 500310
Number of splices: Annotated (sjdb) | 494951
Number of splices: GT/AG | 483272
Number of splices: GC/AG | 7143
Number of splices: AT/AC | 323
Number of splices: Non-canonical | 9572
Mismatch rate per base, % | 1.28%
Deletion rate per base | 0.06%
Deletion average length | 1.89
Insertion rate per base | 0.03%
Insertion average length | 2.13
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 26727
% of reads mapped to multiple loci | 0.21%
Number of reads mapped to too many loci | 98
% of reads mapped to too many loci | 0.00%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 89.88%
% of reads unmapped: other | 0.00%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
Old 07-05-2018, 10:14 PM   #11
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,784

Any time you see low mapping results, you should take a sample of the reads that don't align and BLAST them at NCBI to see if you have:

a. Contamination of the data with an unrelated species
b. rRNA contamination

You appear to have a low % of multi-mapping reads, so if your genome contains the rDNA repeat, the possibility of (b) is small.
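Pulling a random sample of unmapped reads for BLAST can be sketched as follows. It assumes the unmapped reads are available as a four-line FASTQ (STAR can write them with --outReadsUnmapped Fastx) and uses reservoir sampling, so the file is read once and does not need to fit in memory. The function name and the sample size are arbitrary choices for this example.

```python
import io
import random


def sample_fastq_as_fasta(handle, n=100, seed=0):
    """Reservoir-sample n reads from a four-line FASTQ and return FASTA text."""
    rng = random.Random(seed)
    sample = []
    index = 0
    while True:
        header = handle.readline()
        if not header.strip():
            break  # end of file (or trailing blank line)
        seq = handle.readline().strip()
        handle.readline()  # '+' separator line
        handle.readline()  # quality line
        record = (header[1:].strip(), seq)
        if len(sample) < n:
            sample.append(record)
        else:
            # Keep each later read with probability n / (index + 1).
            j = rng.randint(0, index)
            if j < n:
                sample[j] = record
        index += 1
    return "".join(f">{name}\n{seq}\n" for name, seq in sample)


# Tiny in-memory example with two reads and a sample size larger than the file.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n")
print(sample_fastq_as_fasta(demo, n=10), end="")
```

The returned FASTA text can be saved to a file and pasted into the NCBI BLAST web form, a few dozen reads at a time.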