SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
High percentage of Off-target reads finswimmer Illumina/Solexa 2 10-20-2016 10:12 PM
STAR unmapped: too short jk1124 Bioinformatics 3 05-29-2015 04:59 PM
Unmapped RNA-seq reads consist of repeated nucleotides (short homopolymeric regions) ritandr SOLiD 3 06-04-2014 05:03 AM
Titanium - high fraction of short reads andpet 454 Pyrosequencing 18 11-30-2011 08:29 AM
High Short Reads % elly 454 Pyrosequencing 2 11-10-2008 01:10 PM

Reply
 
Thread Tools
Old 07-10-2018, 07:50 AM   #1
sevenless
Junior Member
 
Location: Germany

Join Date: Jul 2018
Posts: 1
Default High percentage of STAR unmapped reads: too short

Dear community,

I am currently analysing a set of stranded paired-end RNA-seq data (human, 80 bp). Up till now, I only trimmed the adapter sequences and removed all reads <36 bp. Then, I used STAR to align them to the human reference genome but got an unexpectedly high percentage of unmapped reads.

For my ~20 samples, I got 75-86% uniquely mapped reads, 4.3-5.6% multi-mapping reads and 10-19% unmapped reads: too short. The latter percentage seemed a bit high to me, so I looked at the output of unmapped reads (all, not only the ones classified as "too short") as provided by STAR. However, only 0.0007% of these unmapped reads are actually shorter than 80 bp. For a few of these sequences, I used megablast to check for highly similar sequences in NCBI's Nucleotide collection (nr/nt) to see whether there might be a contamination problem. Curiously, the reads that I checked, were all highly similar to some human genes or pseudogenes, leaving me wondering why they were not mapped in the first place.

Do you have an idea what could have gone wrong in the mapping step? Or is up to 20% still an acceptable percentage for unmapped reads?

Below are the STAR command I used (--sjdbGTFfile and --sjdbOverhang were already specified at the genome indexes generation step) and one exemplary final STAR log file. If needed, I can also attach the output fastq file for unmapped reads.

~/star/code/STAR-2.6.0a/bin/Linux_x86_64/STAR \
--runThreadN 35 \
--genomeDir ~/star/genome \
--genomeLoad LoadAndRemove \
--readFilesIn ~/seq_1.fastq.gz seq_2.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outReadsUnmapped Fastx \
--quantMode TranscriptomeSAM GeneCounts


Started job on | Jul 10 12:51:05
Started mapping on | Jul 10 12:55:25
Finished on | Jul 10 13:28:34
Mapping speed, Million of reads per hour | 132.32

Number of input reads | 73107364
Average input read length | 159
UNIQUE READS:
Uniquely mapped reads number | 55422599
Uniquely mapped reads % | 75.81%
Average mapped length | 159.19
Number of splices: Total | 33925845
Number of splices: Annotated (sjdb) | 33643272
Number of splices: GT/AG | 33587698
Number of splices: GC/AG | 276036
Number of splices: AT/AC | 32871
Number of splices: Non-canonical | 29240
Mismatch rate per base, % | 0.47%
Deletion rate per base | 0.01%
Deletion average length | 1.53
Insertion rate per base | 0.00%
Insertion average length | 1.37
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 3689502
% of reads mapped to multiple loci | 5.05%
Number of reads mapped to too many loci | 15354
% of reads mapped to too many loci | 0.02%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 19.10%
% of reads unmapped: other | 0.02%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%


Update (2018-07-12):
I have now repeated the mapping using a slightly different genome annotation: Before I had Ensembl GRCh38 (release 92), now I used Gencode (v27). Interestingly, mapping with the Gencode annotation yields even worse results (~70% uniquely mapped vs. ~80% uniquely mapped from before). I also tested mapping the untrimmed reads instead, but this doesn't make a difference (which makes sense since only few reads containing adapter sequences were found in the sequences).
I will now probably continue the analysis using the Ensembl mapping results I originally got (with 10-19% unmapped reads: too short). But I would still be grateful for anyone sharing their thoughts/experiences on the matter.

Last edited by sevenless; 07-12-2018 at 02:02 AM. Reason: Update
sevenless is offline   Reply With Quote
Reply

Tags
mapping statistics, star aligner

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 12:34 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2018, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO