Dear community,
I am currently analysing a set of stranded paired-end RNA-seq data (human, 80 bp). Up till now, I only trimmed the adapter sequences and removed all reads <36 bp. Then, I used STAR to align them to the human reference genome but got an unexpectedly high percentage of unmapped reads.
For my ~20 samples, I got 75-86% uniquely mapped reads, 4.3-5.6% multi-mapping reads and 10-19% unmapped reads: too short. The latter percentage seemed a bit high to me, so I looked at the output of unmapped reads (all, not only the ones classified as "too short") as provided by STAR. However, only 0.0007% of these unmapped reads are actually shorter than 80 bp. For a few of these sequences, I used megablast to check for highly similar sequences in NCBI's Nucleotide collection (nr/nt) to see whether there might be a contamination problem. Curiously, the reads that I checked, were all highly similar to some human genes or pseudogenes, leaving me wondering why they were not mapped in the first place.
Do you have an idea what could have gone wrong in the mapping step? Or is up to 20% still an acceptable percentage for unmapped reads?
Below are the STAR command I used (--sjdbGTFfile and --sjdbOverhang were already specified at the genome indexes generation step) and one exemplary final STAR log file. If needed, I can also attach the output fastq file for unmapped reads.
~/star/code/STAR-2.6.0a/bin/Linux_x86_64/STAR \
--runThreadN 35 \
--genomeDir ~/star/genome \
--genomeLoad LoadAndRemove \
--readFilesIn ~/seq_1.fastq.gz seq_2.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outReadsUnmapped Fastx \
--quantMode TranscriptomeSAM GeneCounts
Started job on | Jul 10 12:51:05
Started mapping on | Jul 10 12:55:25
Finished on | Jul 10 13:28:34
Mapping speed, Million of reads per hour | 132.32
Number of input reads | 73107364
Average input read length | 159
UNIQUE READS:
Uniquely mapped reads number | 55422599
Uniquely mapped reads % | 75.81%
Average mapped length | 159.19
Number of splices: Total | 33925845
Number of splices: Annotated (sjdb) | 33643272
Number of splices: GT/AG | 33587698
Number of splices: GC/AG | 276036
Number of splices: AT/AC | 32871
Number of splices: Non-canonical | 29240
Mismatch rate per base, % | 0.47%
Deletion rate per base | 0.01%
Deletion average length | 1.53
Insertion rate per base | 0.00%
Insertion average length | 1.37
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 3689502
% of reads mapped to multiple loci | 5.05%
Number of reads mapped to too many loci | 15354
% of reads mapped to too many loci | 0.02%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 19.10%
% of reads unmapped: other | 0.02%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
Update (2018-07-12):
I have now repeated the mapping using a slightly different genome annotation: Before I had Ensembl GRCh38 (release 92), now I used Gencode (v27). Interestingly, mapping with the Gencode annotation yields even worse results (~70% uniquely mapped vs. ~80% uniquely mapped from before). I also tested mapping the untrimmed reads instead, but this doesn't make a difference (which makes sense since only few reads containing adapter sequences were found in the sequences).
I will now probably continue the analysis using the Ensembl mapping results I originally got (with 10-19% unmapped reads: too short). But I would still be grateful for anyone sharing their thoughts/experiences on the matter.
I am currently analysing a set of stranded paired-end RNA-seq data (human, 80 bp). Up till now, I only trimmed the adapter sequences and removed all reads <36 bp. Then, I used STAR to align them to the human reference genome but got an unexpectedly high percentage of unmapped reads.
For my ~20 samples, I got 75-86% uniquely mapped reads, 4.3-5.6% multi-mapping reads and 10-19% unmapped reads: too short. The latter percentage seemed a bit high to me, so I looked at the output of unmapped reads (all, not only the ones classified as "too short") as provided by STAR. However, only 0.0007% of these unmapped reads are actually shorter than 80 bp. For a few of these sequences, I used megablast to check for highly similar sequences in NCBI's Nucleotide collection (nr/nt) to see whether there might be a contamination problem. Curiously, the reads that I checked, were all highly similar to some human genes or pseudogenes, leaving me wondering why they were not mapped in the first place.
Do you have an idea what could have gone wrong in the mapping step? Or is up to 20% still an acceptable percentage for unmapped reads?
Below are the STAR command I used (--sjdbGTFfile and --sjdbOverhang were already specified at the genome indexes generation step) and one exemplary final STAR log file. If needed, I can also attach the output fastq file for unmapped reads.
~/star/code/STAR-2.6.0a/bin/Linux_x86_64/STAR \
--runThreadN 35 \
--genomeDir ~/star/genome \
--genomeLoad LoadAndRemove \
--readFilesIn ~/seq_1.fastq.gz seq_2.fastq.gz \
--readFilesCommand zcat \
--outSAMtype BAM SortedByCoordinate \
--outReadsUnmapped Fastx \
--quantMode TranscriptomeSAM GeneCounts
Started job on | Jul 10 12:51:05
Started mapping on | Jul 10 12:55:25
Finished on | Jul 10 13:28:34
Mapping speed, Million of reads per hour | 132.32
Number of input reads | 73107364
Average input read length | 159
UNIQUE READS:
Uniquely mapped reads number | 55422599
Uniquely mapped reads % | 75.81%
Average mapped length | 159.19
Number of splices: Total | 33925845
Number of splices: Annotated (sjdb) | 33643272
Number of splices: GT/AG | 33587698
Number of splices: GC/AG | 276036
Number of splices: AT/AC | 32871
Number of splices: Non-canonical | 29240
Mismatch rate per base, % | 0.47%
Deletion rate per base | 0.01%
Deletion average length | 1.53
Insertion rate per base | 0.00%
Insertion average length | 1.37
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 3689502
% of reads mapped to multiple loci | 5.05%
Number of reads mapped to too many loci | 15354
% of reads mapped to too many loci | 0.02%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 19.10%
% of reads unmapped: other | 0.02%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
Update (2018-07-12):
I have now repeated the mapping using a slightly different genome annotation: Before I had Ensembl GRCh38 (release 92), now I used Gencode (v27). Interestingly, mapping with the Gencode annotation yields even worse results (~70% uniquely mapped vs. ~80% uniquely mapped from before). I also tested mapping the untrimmed reads instead, but this doesn't make a difference (which makes sense since only few reads containing adapter sequences were found in the sequences).
I will now probably continue the analysis using the Ensembl mapping results I originally got (with 10-19% unmapped reads: too short). But I would still be grateful for anyone sharing their thoughts/experiences on the matter.