High percentage of STAR unmapped reads: too short

sevenless

Junior Member

Join Date: Jul 2018

Posts: 1
- Share
- Tweet
#1

High percentage of STAR unmapped reads: too short

07-10-2018, 06:50 AM

Dear community,

I am currently analysing a set of stranded paired-end RNA-seq data (human, 80 bp). Up till now, I only trimmed the adapter sequences and removed all reads <36 bp. Then, I used STAR to align them to the human reference genome but got an unexpectedly high percentage of unmapped reads.

For my ~20 samples, I got 75-86% uniquely mapped reads, 4.3-5.6% multi-mapping reads and 10-19% unmapped reads: too short. The latter percentage seemed a bit high to me, so I looked at the output of unmapped reads (all, not only the ones classified as "too short") as provided by STAR. However, only 0.0007% of these unmapped reads are actually shorter than 80 bp. For a few of these sequences, I used megablast to check for highly similar sequences in NCBI's Nucleotide collection (nr/nt) to see whether there might be a contamination problem. Curiously, the reads that I checked, were all highly similar to some human genes or pseudogenes, leaving me wondering why they were not mapped in the first place.

Do you have an idea what could have gone wrong in the mapping step? Or is up to 20% still an acceptable percentage for unmapped reads?

Below are the STAR command I used (--sjdbGTFfile and --sjdbOverhang were already specified at the genome indexes generation step) and one exemplary final STAR log file. If needed, I can also attach the output fastq file for unmapped reads.

~/star/code/STAR-2.6.0a/bin/Linux_x86_64/STAR \
--runThreadN 35 \
--genomeDir ~/star/genome \
--genomeLoad LoadAndRemove \
--readFilesIn ~/seq_1.fastq.gz seq_2.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outReadsUnmapped Fastx \
--quantMode TranscriptomeSAM GeneCounts

Started job on | Jul 10 12:51:05
Started mapping on | Jul 10 12:55:25
Finished on | Jul 10 13:28:34
Mapping speed, Million of reads per hour | 132.32

Number of input reads | 73107364
Average input read length | 159
UNIQUE READS:
Uniquely mapped reads number | 55422599
Uniquely mapped reads % | 75.81%
Average mapped length | 159.19
Number of splices: Total | 33925845
Number of splices: Annotated (sjdb) | 33643272
Number of splices: GT/AG | 33587698
Number of splices: GC/AG | 276036
Number of splices: AT/AC | 32871
Number of splices: Non-canonical | 29240
Mismatch rate per base, % | 0.47%
Deletion rate per base | 0.01%
Deletion average length | 1.53
Insertion rate per base | 0.00%
Insertion average length | 1.37
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 3689502
% of reads mapped to multiple loci | 5.05%
Number of reads mapped to too many loci | 15354
% of reads mapped to too many loci | 0.02%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 19.10%
% of reads unmapped: other | 0.02%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%

Update (2018-07-12):
I have now repeated the mapping using a slightly different genome annotation: Before I had Ensembl GRCh38 (release 92), now I used Gencode (v27). Interestingly, mapping with the Gencode annotation yields even worse results (~70% uniquely mapped vs. ~80% uniquely mapped from before). I also tested mapping the untrimmed reads instead, but this doesn't make a difference (which makes sense since only few reads containing adapter sequences were found in the sequences).
I will now probably continue the analysis using the Ensembl mapping results I originally got (with 10-19% unmapped reads: too short). But I would still be grateful for anyone sharing their thoughts/experiences on the matter.

Last edited by sevenless; 07-12-2018, 01:02 AM. Reason: Update
Tags: mapping statistics, star aligner

Previous template Next

Recent Advances in Sequencing Analysis Tools

by seqadmin

The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
- Channel: Articles
05-06-2024, 07:48 AM
Essential Discoveries and Tools in Epitranscriptomics

by seqadmin

The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
- Channel: Articles
04-22-2024, 07:01 AM

Topics	Statistics	Last Post
The Role of Spliceosomes in RNA Splicing and Genome Evolution by seqadmin Started by seqadmin, Today, 07:03 AM	0 responses 10 views 0 likes	Last Post by seqadmin Today, 07:03 AM
A Closer Look at the Enigmatic Genomes of Oikopleura dioica by seqadmin Started by seqadmin, 05-10-2024, 06:35 AM	0 responses 31 views 0 likes	Last Post by seqadmin 05-10-2024, 06:35 AM
Advanced Epigenome Editing Platform Explores Gene Regulation Mechanisms by seqadmin Started by seqadmin, 05-09-2024, 02:46 PM	0 responses 41 views 0 likes	Last Post by seqadmin 05-09-2024, 02:46 PM
Telomere Maintenance by PARP1: A New Perspective in Cancer Research by seqadmin Started by seqadmin, 05-07-2024, 06:57 AM	0 responses 33 views 0 likes	Last Post by seqadmin 05-07-2024, 06:57 AM

Seqanswers Leaderboard Ad

Announcement

High percentage of STAR unmapped reads: too short

Latest Articles

ad_right_rmr

News