Go Back   SEQanswers > Bioinformatics > Bioinformatics

Similar Threads
Thread Thread Starter Forum Replies Last Post
High percentage of Off-target reads finswimmer Illumina/Solexa 2 10-20-2016 09:12 PM
STAR unmapped: too short jk1124 Bioinformatics 3 05-29-2015 03:59 PM
Unmapped RNA-seq reads consist of repeated nucleotides (short homopolymeric regions) ritandr SOLiD 3 06-04-2014 04:03 AM
Titanium - high fraction of short reads andpet 454 Pyrosequencing 18 11-30-2011 07:29 AM
High Short Reads % elly 454 Pyrosequencing 2 11-10-2008 12:10 PM

Thread Tools
Old 07-10-2018, 06:50 AM   #1
Junior Member
Location: Germany

Join Date: Jul 2018
Posts: 1
Default High percentage of STAR unmapped reads: too short

Dear community,

I am currently analysing a set of stranded paired-end RNA-seq data (human, 80 bp). Up till now, I only trimmed the adapter sequences and removed all reads <36 bp. Then, I used STAR to align them to the human reference genome but got an unexpectedly high percentage of unmapped reads.

For my ~20 samples, I got 75-86% uniquely mapped reads, 4.3-5.6% multi-mapping reads and 10-19% unmapped reads: too short. The latter percentage seemed a bit high to me, so I looked at the output of unmapped reads (all, not only the ones classified as "too short") as provided by STAR. However, only 0.0007% of these unmapped reads are actually shorter than 80 bp. For a few of these sequences, I used megablast to check for highly similar sequences in NCBI's Nucleotide collection (nr/nt) to see whether there might be a contamination problem. Curiously, the reads that I checked, were all highly similar to some human genes or pseudogenes, leaving me wondering why they were not mapped in the first place.

Do you have an idea what could have gone wrong in the mapping step? Or is up to 20% still an acceptable percentage for unmapped reads?

Below are the STAR command I used (--sjdbGTFfile and --sjdbOverhang were already specified at the genome indexes generation step) and one exemplary final STAR log file. If needed, I can also attach the output fastq file for unmapped reads.

~/star/code/STAR-2.6.0a/bin/Linux_x86_64/STAR \
--runThreadN 35 \
--genomeDir ~/star/genome \
--genomeLoad LoadAndRemove \
--readFilesIn ~/seq_1.fastq.gz seq_2.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outReadsUnmapped Fastx \
--quantMode TranscriptomeSAM GeneCounts

Started job on | Jul 10 12:51:05
Started mapping on | Jul 10 12:55:25
Finished on | Jul 10 13:28:34
Mapping speed, Million of reads per hour | 132.32

Number of input reads | 73107364
Average input read length | 159
Uniquely mapped reads number | 55422599
Uniquely mapped reads % | 75.81%
Average mapped length | 159.19
Number of splices: Total | 33925845
Number of splices: Annotated (sjdb) | 33643272
Number of splices: GT/AG | 33587698
Number of splices: GC/AG | 276036
Number of splices: AT/AC | 32871
Number of splices: Non-canonical | 29240
Mismatch rate per base, % | 0.47%
Deletion rate per base | 0.01%
Deletion average length | 1.53
Insertion rate per base | 0.00%
Insertion average length | 1.37
Number of reads mapped to multiple loci | 3689502
% of reads mapped to multiple loci | 5.05%
Number of reads mapped to too many loci | 15354
% of reads mapped to too many loci | 0.02%
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 19.10%
% of reads unmapped: other | 0.02%
Number of chimeric reads | 0
% of chimeric reads | 0.00%

Update (2018-07-12):
I have now repeated the mapping using a slightly different genome annotation: Before I had Ensembl GRCh38 (release 92), now I used Gencode (v27). Interestingly, mapping with the Gencode annotation yields even worse results (~70% uniquely mapped vs. ~80% uniquely mapped from before). I also tested mapping the untrimmed reads instead, but this doesn't make a difference (which makes sense since only few reads containing adapter sequences were found in the sequences).
I will now probably continue the analysis using the Ensembl mapping results I originally got (with 10-19% unmapped reads: too short). But I would still be grateful for anyone sharing their thoughts/experiences on the matter.

Last edited by sevenless; 07-12-2018 at 01:02 AM. Reason: Update
sevenless is offline   Reply With Quote

mapping statistics, star aligner

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off

All times are GMT -8. The time now is 08:51 AM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO