  • Typical alignment mapping percentage with genome?

    What are typical mapping percentages for alignment? My samples are giving me an average of approx. 25% mapping rate (from paired-end 100 bp reads). STAR produces somewhat fewer mappings than TopHat, but that's not surprising.

    What kind of mapping numbers are you seeing? What do you expect? To what do you attribute the numbers, and how do you interpret them?

    Thanks in advance.

  • #2
    A 25% mapping rate seems low to me. On a standard-quality human long RNA library we would typically get 85-90% of the reads mapped uniquely, and ~5% mapped to multiple loci. The first usual suspect is the sequencing quality of your library. If you post the Log.final.out report from STAR, we can look for clues.


    • #3
      Hi Alex. Thanks for weighing in. Here's one particularly low mapping count.

      Started job on | Mar 15 04:59:21
      Started mapping on | Mar 15 05:02:08
      Finished on | Mar 15 05:29:49
      Mapping speed, Million of reads per hour | 99.16

      Number of input reads | 45750724
      Average input read length | 202
      UNIQUE READS:
      Uniquely mapped reads number | 3331104
      Uniquely mapped reads % | 7.28%
      Average mapped length | 199.28
      Number of splices: Total | 736416
      Number of splices: Annotated (sjdb) | 665151
      Number of splices: GT/AG | 733030
      Number of splices: GC/AG | 2159
      Number of splices: AT/AC | 569
      Number of splices: Non-canonical | 658
      Mismatch rate per base, % | 1.11%
      Deletion rate per base | 0.04%
      Deletion average length | 2.30
      Insertion rate per base | 0.03%
      Insertion average length | 2.02
      MULTI-MAPPING READS:
      Number of reads mapped to multiple loci | 578140
      % of reads mapped to multiple loci | 1.26%
      Number of reads mapped to too many loci | 43288
      % of reads mapped to too many loci | 0.09%
      UNMAPPED READS:
      % of reads unmapped: too many mismatches | 0.00%
      % of reads unmapped: too short | 1.19%
      % of reads unmapped: other | 90.17%
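For anyone comparing several of these reports, a minimal sketch of reading the `key | value` lines into a dictionary follows. The function names are made up for illustration and are not part of STAR; the sample text is a few lines copied from the report above.

```python
def parse_star_log(text):
    """Parse 'key | value' lines of a STAR Log.final.out report into a dict."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as 'UNIQUE READS:' carry no value
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def as_percent(value):
    """Convert a string such as '7.28%' to the float 7.28."""
    return float(value.rstrip("%"))


# A few lines copied from the report posted above.
example = """
Number of input reads | 45750724
UNIQUE READS:
Uniquely mapped reads % | 7.28%
% of reads mapped to multiple loci | 1.26%
% of reads unmapped: other | 90.17%
"""

stats = parse_star_log(example)
print(int(stats["Number of input reads"]))
print(as_percent(stats["Uniquely mapped reads %"]))
```

Values are kept as strings so that dates and counts survive unchanged; convert only the fields you need.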


      • #4
        This appears to be an interesting case.
        Here is how I assess these mapping statistics.

        First check the uniquely mapped reads:
        Average mapped length | 199.28 : good, close to your pair length of 202
        Mismatch rate per base, % | 1.11% : a bit on the high side, you would get 0.5-0.8% for good libraries.
        The splices are dominated by annotated and canonical, which is good.
        The indel rate is low.
        So, the reads that actually mapped uniquely - as few as they are - look fine.

        The ratio of unique to multi-mappers is 7.28%/1.26% ~ 6, which is low for typical human cells (that is, the multi-mapper share is somewhat high), though I am not sure what you are sequencing. Our typical value is 15-20.
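The arithmetic behind that ratio can be checked with a few lines of Python. This is only a sketch: the function name is made up, and the 15-20 range is the typical value quoted in this post for human libraries, not a hard threshold.

```python
TYPICAL_RANGE = (15.0, 20.0)  # typical unique/multi ratio quoted in this post


def unique_to_multi_ratio(unique_pct, multi_pct):
    """Ratio of uniquely mapped reads to multi-mapped reads (both in %)."""
    if multi_pct <= 0:
        raise ValueError("multi_pct must be positive")
    return unique_pct / multi_pct


# Percentages taken from the Log.final.out report in post #3.
ratio = unique_to_multi_ratio(7.28, 1.26)
below_typical = ratio < TYPICAL_RANGE[0]
print(f"ratio = {ratio:.1f}, below typical range: {below_typical}")
```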

        % of reads mapped to too many loci | 0.09% : by default "too many loci" is >10, but this number is good so you are not missing much.

        Finally - most importantly - unmapped reads.
        % of reads unmapped: too short | 1.19% : this number would be large if you had poor sequencing quality, but it is surprisingly small (we typically get ~5%).

        % of reads unmapped: other | 90.17% :
        this is where all the unmapped reads went, and it is very unusual.

        It means that for 90% of the reads STAR could not find good anchor seeds. Two main possibilities are:
        1. Contamination. Most reads have very little homology with the human genome. You can check this by BLASTing a few unmapped reads against everything.
        2. Repeat regions dominate expression. The number of loci a seed can map to is limited by --winAnchorMultimapNmax, which is 50 by default. You could increase it to ~1000 to see if more reads get mapped (also increase --outFilterMultimapNmax to output them as multi-mappers).
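A rough sketch of assembling such a re-run from Python is below. The genome directory, read file names, and output prefix are placeholders, and setting --outFilterMultimapNmax to 1000 is an assumption (the post only says to increase it). The command is printed rather than executed.

```python
import shlex


def star_rerun_command(genome_dir, read_files,
                       anchor_multimap_nmax=1000,
                       out_multimap_nmax=1000,
                       prefix="rerun_"):
    """Build a STAR argument list with relaxed multi-mapping limits."""
    return [
        "STAR",
        "--genomeDir", genome_dir,
        "--readFilesIn", *read_files,
        # Allow seeds that map to many loci to serve as anchors (default 50).
        "--winAnchorMultimapNmax", str(anchor_multimap_nmax),
        # Report reads mapping to up to this many loci (assumed value).
        "--outFilterMultimapNmax", str(out_multimap_nmax),
        "--outFileNamePrefix", prefix,
    ]


# Placeholder paths; substitute your own index and FASTQ files.
cmd = star_rerun_command("genome_index", ["reads_1.fastq", "reads_2.fastq"])
print(shlex.join(cmd))
```

Comparing the new Log.final.out against the original then shows whether reads moved from "unmapped: other" into the multi-mapping categories.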


        • #5
          Hello Obscurite,

          We typically see higher than 80% mapping rate for our RNA-Seq differential expression projects as well.

          I agree with alexdobin. One of the next things to check is contamination. It is not necessarily contamination of the sample in the classic sense; the reads may simply not correspond to your reference genome, yet still be important to the biology or phenotype of the sample you are sequencing. For example, we recently sequenced a mouse RNA-seq project focused on differential expression and found that 80% of the reads were mapping to a viral component in NCBI's NR database. It turned out that this viral component was central to the phenotype observed in the mouse. The common saying around here is that every sequencing project is a metagenomics project; the question is just to what degree.

          Jarret Glasscock
          Cofactor Genomics


          • #6
            Following up alexdobin's post, ribosomal RNA contamination of mRNA-Seq libraries can produce this type of result due to the high copy number of the rRNA clusters. Adapter dimers are another possible culprit.


            • #7
              My first suspect would be adaptor sequence. I have encountered that multiple times.
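One quick way to test the adapter suspicion is to count how many reads contain the start of the adapter. The sketch below assumes plain four-line FASTQ records and uses `AGATCGGAAGAGC`, the common start of Illumina TruSeq adapters, as the search string; swap in whatever adapter your library prep actually used. The two demo reads are invented.

```python
import io

# Assumption: common start of Illumina TruSeq adapters; replace as needed.
ADAPTER_PREFIX = "AGATCGGAAGAGC"


def adapter_fraction(fastq_handle, adapter=ADAPTER_PREFIX):
    """Fraction of reads in a four-line-per-record FASTQ containing `adapter`."""
    total = hits = 0
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:  # the second line of each record is the sequence
            total += 1
            if adapter in line:
                hits += 1
    return hits / total if total else 0.0


# Two invented reads: the first carries the adapter, the second does not.
demo = io.StringIO(
    "@r1\nACGTACGTAGATCGGAAGAGCACAC\n+\nIIIIIIIIIIIIIIIIIIIIIIIII\n"
    "@r2\nACGTACGTACGTACGTACGTACGTA\n+\nIIIIIIIIIIIIIIIIIIIIIIIII\n"
)
frac = adapter_fraction(demo)
print(frac)
```

An exact substring match misses adapters with sequencing errors, so treat the result as a lower bound.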


              • #8
                We assume that some basic QC was done that should have caught an adapter contamination problem before the alignment.

                Another good tool to check for contamination also comes from Babraham Bioinformatics Group.


                • #9
                  Thanks for the tool and strategy suggestions. We have found some rRNA (despite depletion) and are running the QC tools. We are aware of ribopicker. Does anyone have a favorite technique for cleaning up pre-assembly sequences they are able to share? (e.g. rRNA, adapters, etc.) I've looked at normalization and clustering in the context of de novo assembly -- can those be useful for reference assembly?
                  Last edited by obscurite; 05-02-2013, 10:21 AM.


                  • #10
                    Low percentage of mapped reads

                    I used STAR to align reads from 8 RNA-Seq libraries against the reference genome of the plant Citrus sinensis and got mapping rates that I consider very low, since I've seen many published works using the same reference genome with alignment rates between 80 and 95%. The best alignment rate among the 8 libraries was 39.22% and the worst was 9.90%.
                    Should I try to run the mapping with less stringent parameters?
                    Is it possible to run differential expression analyses with such a low mapping rate?
                    I'm sending the summary mapping results below.
                    Best regards.

                    Mapping speed, Million of reads per hour | 51.55

                    Number of input reads | 14018962
                    Average input read length | 200
                    UNIQUE READS:
                    Uniquely mapped reads number | 5497854
                    Uniquely mapped reads % | 39.22%
                    Average mapped length | 197.61
                    Number of splices: Total | 2557324
                    Number of splices: Annotated (sjdb) | 2536892
                    Number of splices: GT/AG | 2480142
                    Number of splices: GC/AG | 31660
                    Number of splices: AT/AC | 1565
                    Number of splices: Non-canonical | 43957
                    Mismatch rate per base, % | 0.89%
                    Deletion rate per base | 0.06%
                    Deletion average length | 1.86
                    Insertion rate per base | 0.03%
                    Insertion average length | 2.14
                    MULTI-MAPPING READS:
                    Number of reads mapped to multiple loci | 91130
                    % of reads mapped to multiple loci | 0.65%
                    Number of reads mapped to too many loci | 284
                    % of reads mapped to too many loci | 0.00%
                    UNMAPPED READS:
                    % of reads unmapped: too many mismatches | 0.00%
                    % of reads unmapped: too short | 60.12%
                    % of reads unmapped: other | 0.01%
                    CHIMERIC READS:
                    Number of chimeric reads | 0
                    % of chimeric reads | 0.00%

                    Mapping speed, Million of reads per hour | 31.96

                    Number of input reads | 12510660
                    Average input read length | 200
                    UNIQUE READS:
                    Uniquely mapped reads number | 1238257
                    Uniquely mapped reads % | 9.90%
                    Average mapped length | 196.44
                    Number of splices: Total | 500310
                    Number of splices: Annotated (sjdb) | 494951
                    Number of splices: GT/AG | 483272
                    Number of splices: GC/AG | 7143
                    Number of splices: AT/AC | 323
                    Number of splices: Non-canonical | 9572
                    Mismatch rate per base, % | 1.28%
                    Deletion rate per base | 0.06%
                    Deletion average length | 1.89
                    Insertion rate per base | 0.03%
                    Insertion average length | 2.13
                    MULTI-MAPPING READS:
                    Number of reads mapped to multiple loci | 26727
                    % of reads mapped to multiple loci | 0.21%
                    Number of reads mapped to too many loci | 98
                    % of reads mapped to too many loci | 0.00%
                    UNMAPPED READS:
                    % of reads unmapped: too many mismatches | 0.00%
                    % of reads unmapped: too short | 89.88%
                    % of reads unmapped: other | 0.00%
                    CHIMERIC READS:
                    Number of chimeric reads | 0
                    % of chimeric reads | 0.00%
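Tabulating the two reports side by side makes it easier to see where the reads went. The sketch below just re-enters the percentages posted above (the dictionary keys and function name are made up) and reports which unmapped category dominates in each library.

```python
def dominant_unmapped_category(stats):
    """Return the unmapped category with the largest percentage."""
    unmapped = {k: v for k, v in stats.items() if k.startswith("unmapped")}
    return max(unmapped, key=unmapped.get)


# Percentages copied from the two Log.final.out summaries above.
best = {
    "unique": 39.22, "multi": 0.65, "too many loci": 0.00,
    "unmapped: too many mismatches": 0.00,
    "unmapped: too short": 60.12, "unmapped: other": 0.01,
}
worst = {
    "unique": 9.90, "multi": 0.21, "too many loci": 0.00,
    "unmapped: too many mismatches": 0.00,
    "unmapped: too short": 89.88, "unmapped: other": 0.00,
}

for name, lib in (("best", best), ("worst", worst)):
    total = sum(lib.values())
    print(name, dominant_unmapped_category(lib), f"total = {total:.2f}%")
```

In both libraries nearly all unmapped reads fall under "too short", unlike the report in post #3, where they fell under "other".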


                    • #11
                      Anytime you see low mapping results, you should take a sample of the reads that don't align and BLAST them at NCBI to see if you have:

                      a. Contamination of the data with an unrelated species
                      b. rRNA contamination

                      You appear to have a low % of multi-mapping reads, so if your genome contains the rDNA repeat, the possibility of (b) is small.
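A minimal sketch of pulling a random sample of unmapped reads into FASTA for pasting into BLAST follows. It assumes the unmapped reads are already in a four-line-per-record FASTQ file (STAR can write these with `--outReadsUnmapped Fastx`); the function name and demo reads are invented, and reservoir sampling is used so the whole file never has to sit in memory.

```python
import io
import random


def sample_fastq_as_fasta(fastq_handle, n, seed=0):
    """Reservoir-sample up to n reads from a FASTQ stream; return FASTA text."""
    rng = random.Random(seed)
    sample = []
    seen = 0
    header = None
    for i, line in enumerate(fastq_handle):
        line = line.rstrip("\n")
        if i % 4 == 0:
            header = line[1:]  # drop the leading '@'
        elif i % 4 == 1:
            seen += 1
            record = (header, line)
            if len(sample) < n:
                sample.append(record)
            else:
                # Replace an existing entry with probability n / seen.
                j = rng.randrange(seen)
                if j < n:
                    sample[j] = record
    return "".join(f">{name}\n{seq}\n" for name, seq in sample)


# Five invented reads standing in for an unmapped-reads FASTQ file.
demo_text = "".join(f"@read{k}\nACGTACGTAC\n+\nIIIIIIIIII\n" for k in range(5))
fasta_two = sample_fastq_as_fasta(io.StringIO(demo_text), n=2)
fasta_all = sample_fastq_as_fasta(io.StringIO(demo_text), n=10)
print(fasta_two)
```

A few dozen to a few hundred reads are usually enough to see whether one unexpected organism or rRNA dominates the hits.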
