Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • High percentage of STAR unmapped reads: too short

    Dear community,

    I am currently analysing a set of stranded paired-end RNA-seq data (human, 80 bp). Up till now, I only trimmed the adapter sequences and removed all reads <36 bp. Then, I used STAR to align them to the human reference genome but got an unexpectedly high percentage of unmapped reads.

    For my ~20 samples, I got 75-86% uniquely mapped reads, 4.3-5.6% multi-mapping reads and 10-19% unmapped reads: too short. The latter percentage seemed a bit high to me, so I looked at the output of unmapped reads (all, not only the ones classified as "too short") as provided by STAR. However, only 0.0007% of these unmapped reads are actually shorter than 80 bp. For a few of these sequences, I used megablast to check for highly similar sequences in NCBI's Nucleotide collection (nr/nt) to see whether there might be a contamination problem. Curiously, the reads that I checked, were all highly similar to some human genes or pseudogenes, leaving me wondering why they were not mapped in the first place.

    Do you have an idea what could have gone wrong in the mapping step? Or is up to 20% still an acceptable percentage for unmapped reads?

    Below are the STAR command I used (--sjdbGTFfile and --sjdbOverhang were already specified at the genome indexes generation step) and one exemplary final STAR log file. If needed, I can also attach the output fastq file for unmapped reads.

    ~/star/code/STAR-2.6.0a/bin/Linux_x86_64/STAR \
    --runThreadN 35 \
    --genomeDir ~/star/genome \
    --genomeLoad LoadAndRemove \
    --readFilesIn ~/seq_1.fastq.gz seq_2.fastq.gz \
    --readFilesCommand zcat \
    --outSAMtype BAM SortedByCoordinate \
    --outReadsUnmapped Fastx \
    --quantMode TranscriptomeSAM GeneCounts


    Started job on | Jul 10 12:51:05
    Started mapping on | Jul 10 12:55:25
    Finished on | Jul 10 13:28:34
    Mapping speed, Million of reads per hour | 132.32

    Number of input reads | 73107364
    Average input read length | 159
    UNIQUE READS:
    Uniquely mapped reads number | 55422599
    Uniquely mapped reads % | 75.81%
    Average mapped length | 159.19
    Number of splices: Total | 33925845
    Number of splices: Annotated (sjdb) | 33643272
    Number of splices: GT/AG | 33587698
    Number of splices: GC/AG | 276036
    Number of splices: AT/AC | 32871
    Number of splices: Non-canonical | 29240
    Mismatch rate per base, % | 0.47%
    Deletion rate per base | 0.01%
    Deletion average length | 1.53
    Insertion rate per base | 0.00%
    Insertion average length | 1.37
    MULTI-MAPPING READS:
    Number of reads mapped to multiple loci | 3689502
    % of reads mapped to multiple loci | 5.05%
    Number of reads mapped to too many loci | 15354
    % of reads mapped to too many loci | 0.02%
    UNMAPPED READS:
    % of reads unmapped: too many mismatches | 0.00%
    % of reads unmapped: too short | 19.10%
    % of reads unmapped: other | 0.02%
    CHIMERIC READS:
    Number of chimeric reads | 0
    % of chimeric reads | 0.00%


    Update (2018-07-12):
    I have now repeated the mapping using a slightly different genome annotation: Before I had Ensembl GRCh38 (release 92), now I used Gencode (v27). Interestingly, mapping with the Gencode annotation yields even worse results (~70% uniquely mapped vs. ~80% uniquely mapped from before). I also tested mapping the untrimmed reads instead, but this doesn't make a difference (which makes sense since only few reads containing adapter sequences were found in the sequences).
    I will now probably continue the analysis using the Ensembl mapping results I originally got (with 10-19% unmapped reads: too short). But I would still be grateful for anyone sharing their thoughts/experiences on the matter.
    Last edited by sevenless; 07-12-2018, 01:02 AM. Reason: Update

Latest Articles

Collapse

  • seqadmin
    Strategies for Sequencing Challenging Samples
    by seqadmin


    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
    03-22-2024, 06:39 AM
  • seqadmin
    Techniques and Challenges in Conservation Genomics
    by seqadmin



    The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

    Avian Conservation
    Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
    03-08-2024, 10:41 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, Yesterday, 06:37 PM
0 responses
8 views
0 likes
Last Post seqadmin  
Started by seqadmin, Yesterday, 06:07 PM
0 responses
8 views
0 likes
Last Post seqadmin  
Started by seqadmin, 03-22-2024, 10:03 AM
0 responses
49 views
0 likes
Last Post seqadmin  
Started by seqadmin, 03-21-2024, 07:32 AM
0 responses
66 views
0 likes
Last Post seqadmin  
Working...
X