![]() |
|
![]() |
||||
Thread | Thread Starter | Forum | Replies | Last Post |
A first look at Illumina’s new NextSeq 500 | AllSeq | Vendor Forum | 111 | 03-12-2020 03:25 AM |
Secondary size selection for single cell RNAseq on NextSeq 500 help | sallen | Illumina/Solexa | 0 | 03-08-2016 12:21 PM |
Lotsa new toys from Illumina: HiSeq X Five, 3000, 4000, NextSeq 550 | GW_OK | Illumina/Solexa | 53 | 05-21-2015 12:30 AM |
NextSeq 500 and HiSeq X Ten Services Coming Soon to Genohub.com | Genohub | Vendor Forum | 11 | 04-22-2014 09:46 AM |
General mapping rate of human resequencing data against reference in GAiix/Hiseq | cybog337 | Illumina/Solexa | 2 | 01-12-2011 09:43 AM |
![]() |
|
Thread Tools |
![]() |
#1 |
Senior Member
Location: Paris Join Date: Aug 2011
Posts: 239
|
![]()
Dear all,
I have two RNASeq experiments from which I try to understand mapping rate difference. The experiments are pretty different: Experiment 1: Samples: 19 primary tumors RNA selection: TruSeq Stranded mRNA (PolyA) Sequencing: Paired-ends, 75bp Sequencer: NextSeq 500 Experiment 2: Samples: 16 tumor cell lines RNA selection: TruSeq total RNA Stranded (Ribo zero) Sequencing: Paired-ends, 100bp Sequencer: HiSeq 4000 Below are the number of reads in each strand after preprocessing and mapping info. Mapping was performed using TopHat v2.1.0 and Bowtie v2.2.5.0 on hg19 reference genome. The average mapping rate is classic for recent experiments. Nevertheless, there is a difference of 10% between both experiments: Experiment 1: Mean number of reads after Trimmomatic (*_paired files): 67944508.54, SD=8215167.56, range=53282567-88949193 Overall read mapping rate: 97.71, SD=0.65, range=96-99 Concordant pair alignment rate: 94.41, SD=1.27, range=91.6-96.8 Experiment 2: Mean number of reads after Trimmomatic (*_paired files): 90733572.69, SD=12822798.96, range=72880790-122816426 Overall read mapping rate: 87.12, SD=5.16, range=77.9-93.4 Concordant pair alignment rate: 82.74, SD=5.70, range=72.3-89.8 I was asked from what was due this difference. I do not have the answer for sure, only hypotheses. I would greatly appreciate your opinion on these hypotheses: 1. Ribo-zero experiments have commonly lower mapping rate due to the presence of non coding elements that correspond to less conserved regions (=>less well annotated) and/or in repeated regions. I spend a couple of hours looking for mapping rate info in similar recent (publications between 2014 and now) experiments without any success. The mapping rate seems to be not reported anymore in the publications - for those that I checked at least. 2. We can observe a slight lower base quality, especially around the 75th base in the Ribo-zero samples (attached: one representative Ribo zero and one representative PolyA samples after preprocessing). This could reduce the mapping rate. Until now, I used 20 as threshold for both experiments from HiSeq and NextSeq. I read yesterday that a threshold of 20 was too high for reads from NextSeq. Can you confirm this? I am trying a more stringent base quality threshold (20 -> 24) on Ribo zero samples during preprocessing. 3. Different sequencers. Has a higher mapping rate been reported for NextSeq experiments? Finally, I used the 75 first bases from 3 samples of the Ribo zero experiment instead of the 100 bases. It increased from 1 to 2.5% the mapping rate, which is a bit surprising for me since mapping is supposed to increase as read length increases. On the other hand, there is this slight default around 75bp in these samples. I would really appreciate any feedback on mapping rates with both RNA selections and sequencers, any link to similar analyses or any test to try. Thank you in advance, Jane Last edited by Jane M; 04-26-2016 at 08:24 AM. |
![]() |
![]() |
![]() |
#2 |
Senior Member
Location: California Join Date: Jul 2014
Posts: 198
|
![]()
I'd guess it's the first of your hypotheses. You could try BLAST/BLAT on some of the unmapped reads to see what they hit...
|
![]() |
![]() |
![]() |
#3 |
Senior Member
Location: Sweden Join Date: Mar 2008
Posts: 324
|
![]()
My guess would be that you have more rDNA in the second experiment. The 45S rDNA sequence is not included in hg19.
|
![]() |
![]() |
![]() |
#4 | |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,083
|
![]()
1. HiSeq 4000 generates optical duplicates (due to pad-hopping). Depending on how you are handling your multi-mappers this could explain the difference.
2. Even though the samples were prepped there is bound to be some rRNA present. You could align to human rDNA repeat independently and see what you get for the two samples as @Chipper suggested. 3. Quote:
Last edited by GenoMax; 04-27-2016 at 04:54 AM. |
|
![]() |
![]() |
![]() |
#5 | |
Senior Member
Location: Paris Join Date: Aug 2011
Posts: 239
|
![]() Quote:
I used the Web BLAST interface on the first "polyA" and "ribo" unmapped reads using "Genome (all assemblies top level)" as database and default settings. 1/10 reads from "polyA" unmapped.sam gets "No significant similarity found", the other 9 reads had pretty good similaries on the full read or on part of the read. 1/10 reads from "ribo" unmapped.sam gets similaries and the 9 other gets "No significant similarity found". Is "Genome (all assemblies top level)" not the whole genome, with missing non coding regions where my "ribo" reads match? Or do they contain too many sequencing errors? |
|
![]() |
![]() |
![]() |
#6 | |
Senior Member
Location: Paris Join Date: Aug 2011
Posts: 239
|
![]() Quote:
Is it possible to test this hypothesis without mapping the whole transcriptome to hg38? |
|
![]() |
![]() |
![]() |
#7 | |||
Senior Member
Location: Paris Join Date: Aug 2011
Posts: 239
|
![]()
I really appreciate your suggestions GenoMax, thank you
Quote:
Quote:
One polyA unmapped.sam file: Code:
head unmapped.sam NS500197:119:HG2Y7BGXX:1:11101:24683:1054 69 * 0 255 * * 0 0 CTCCANTTTAATTTCCTTTAACAAATTTTCTGTACTTTTACTTTCCATCCAAACAGTAACTTATAAATTATTATT AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NS500197:119:HG2Y7BGXX:1:13311:17866:19393 69 * 0 255 * * 0 0 TAGTACTTCAGTGATTTGCCAATGCCCTCTAAGTCCCGTAACCTGTTGTCAATCTGCACATGAACATTTTGCAGA AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NS500197:119:HG2Y7BGXX:2:13311:16650:6888 69 * 0 255 * * 0 0 CTGGGCTGGTGTTGAGTGTCTGCAGCTTTTCCAGGCACAGAGTGCAGGCTGTCAGTGGATCTACCATCCTGGAGT AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NS500197:119:HG2Y7BGXX:4:13610:15232:11614 69 * 0 255 * * 0 0 CCCACTTGATTTTGGAGGGATCTCGCTCCTGGAAGATGGTGATGGGATTTCCATTGATGACAAGCTTCCCGAGAT AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NS500197:119:HG2Y7BGXX:4:13610:26394:11583 133 * 0 255 * * 0 0 TTTCTGAGCCTTTTACCACTCCAGCCTAGCCCCTACCCCCCAATTTGGCGGGCGCGGACCCC AAAAAEEEEEEEEEEEEAEEEEEEEEEEEEEEE//E6AAAA/A///EE/EEE//A/E///EE NS500197:119:HG2Y7BGXX:1:11101:17376:1060 69 * 0 255 * * 0 0 AGCCTNTTGGGGGTCAGTGCCCTCCTAATTGGGGGGTAGGGGCTAGGCTGGAGTGGTAAAAGGCTCAGAAAAATC AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEE NS500197:119:HG2Y7BGXX:1:13311:20274:19398 69 * 0 255 * * 0 0 GCCAGCTCTCTCCTGGTGTGCTCGTCTTGCCTCAGGTCCTTCTTATTCCCCACCAGGATGATGGGCACGAGATCG AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NS500197:119:HG2Y7BGXX:2:13311:9800:6890 69 * 0 255 * * 0 0 CAGGTAGAGCAGGAGCCAGTGAAGAGCAGGCCCTGGATGGGTGGGAGATCGGAAGAGCACACGTCTGAACTCCAG AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NS500197:119:HG2Y7BGXX:4:13610:3313:11628 69 * 0 255 * * 0 0 CCCCCGTTGTGGTCCCAGGTCCTAAACAGCAGGGCTAGTCCCAGAAAGACGACCCCCCCACGGAAGGGCTTTAGC AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE NS500197:119:HG2Y7BGXX:4:13610:5379:11596 133 * 0 255 * * 0 0 CATGTAGTTAATCTGACATGTCAAAATTGGGAAGACTGTAACCTTTTTTTTTTTTTTTTTTT AAA/AA/EEAEA/EEEEAEEEAEEE/E//E/EE66/EEAEE/AEEEE<A/EE6EEEA<EEEE wc -l unmapped.sam 2754730 unmapped.sam cat unmapped.sam | cut -f1 | sort | uniq | wc -l 2334225 cat unmapped.sam | cut -f10 | sort | uniq | wc -l 2098310 I don't understand why there are several reads with a same identifier since there is no multi-mapped read in unmapped sam. It seems that there are reads with the same sequence but different identifiers. What are they? Same info for one ribo unmapped.sam file: Code:
head unmapped.sam K00103:86:H5JKJBBXX:6:1101:1489:1086 69 * 0 255 * * 0 0 GTGGAGATGGGCGCCGCGAGGCGTCCAGTGCGGTAACGCGACCGATCCCGGAGAAGCCGGCGGGAGCCCCGGGGAGAGTTCTCTTTTCTTTGTGAAGGG AAFFJJJJJJJJJJFJJJJFFJJJJJAJJJJJJFFJFFJJJJJJJJJJJFFFAFFAJFAAA7-7AF<FAAJAAFFFJF-AAJJ<<FJJAJFJJFJJJAF K00103:86:H5JKJBBXX:6:1114:8298:40526 69 * 0 255 * * 0 0 TGGCATGCTAACTAGTTACGCGACCCCCGAGCGGTCGGCGTCCCCCAACTTCTTAGAGGGACAAGTGGCGTTCAGCCACCCGAGATTGAGCAATAACAGGAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJFJJJJJJ<FF-FJJFJJJFJFJFJF7<FJ7<77FF<7FFFF<JAJFJFFFJJJJJJFJJF K00103:86:H5JKJBBXX:6:1214:15696:30661 69 * 0 255 * * 0 0 CCTGGCGCTAAACCATTCGTAGACGACCTGCTTCTGGGTCGGGGTTTCGTACGTAGCAGAGCAGCTCCCTCGCTGCGATCTATTGAAAGTCAGCCCTCGAAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJAJJJJJJJJJJJJJJJJJJJJJJJ K00103:86:H5JKJBBXX:6:2214:26453:40192 69 * 0 255 * * 0 0 CTTGGTGGTAGTAGCAAATATTCAAACGAGAACTTTGAAGGCCGAAGTGGAGAAGGCTTCCATGTGAACAGCAGTTGAACATGGGTCAGTCGGTCCTGAGAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJFJJJJFF<JJJJJJJ<-<<FJFJFF7F<FFJJJFFJJJJJJJ K00103:86:H5JKJBBXX:6:2214:25621:40192 133 * 0 255 * * 0 0 CGAGGGCTGACTTTCAATAGATCGCAGCGAGGGAGCTGCTCTGCTACGTACGAAACCCCGACCCAGAAGCAGGTCGTCTACGAATGGTTTAGCGCCAGGTAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJAJ K00103:86:H5JKJBBXX:6:1101:1509:1086 69 * 0 255 * * 0 0 GAAAGACTAATCGAACCATCTAGTAGCTGGTTCCCTCCGAAGTTTCCCTCAGGATAGCTGGC AAAFFJFFJJJJFJFFAJJJJJJJJAFAJAAFFFF<<FJJ-FJJFJFJ<<AAJA<J7-F-<< K00103:86:H5JKJBBXX:6:1114:8501:40526 69 * 0 255 * * 0 0 CACCACACAGCGCGGCCAGGGCAGAGTGGGCCGCCGCCGCCGCCCAGAGACCGAGGCTGCCTTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATCAAFFFJJJFJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJFJJJJJFJJJJJJJFFFJJFFFFJJJJAJFFJFA<JJJJJJJJJJJJJJFJJJ K00103:86:H5JKJBBXX:6:1214:16122:30661 69 * 0 255 * * 0 0 CGGCGTCGCTCGTGGGGGGCCCAAGTCCTTCTGATCGAGGCCCAGCCCGTGGACGGTGTGAGGCCGGTAGCGGCCCCCGGCGCGCCGGGCCCGGGTCTTCAAFFFJJJJJJJJJJJJJJJJJJJJJJJJFJJFJJJJFJJFFFJFJJJF<AF7AJJJJJJJAJJFJJJJAF<AFAFJFFJFFFAFJFAJJJJFFJ7FJJJ K00103:86:H5JKJBBXX:6:2214:27082:40192 69 * 0 255 * * 0 0 GTCTTGGGGCCGAAACGATCTCAACCTATTCTCAAACTTTAAATGGGTAAGAAGCCCGGCTCGCTGGCGTGGAGCCGGGCGTGGAATGCGAGTGCCTAGTAAFFFJJJFJJJJJJJFFJJJJJFJJFJJJJJJJJFJF<JJJ<<JFF<AJJFFFAJJJJF7AAJ<<<<7F77A-A77AFJJ7<-FFJ7J<JJAFFFFA-F K00103:86:H5JKJBBXX:6:2214:26453:40192 133 * 0 255 * * 0 0 GGGATCTGAACCCGACTCCCTTTCGATCGGCCGAGGGCAACGGAGGCCATCGCCCGTCCCTTCGGAACGGCGCTCGCCCATCTCTCAGGACCGACTGACCA--AAJ7-FJJJJJJJJJJFJJJFFJJFJFFJ<JFJJJJJJFJJJJJJJJJJJJJJJJFFF<JFFJJJJAJJJJJJFJJJJJJJJJJJJJJJJJJJJJJF wc -l unmapped.sam 31234334 unmapped.sam cat unmapped.sam | cut -f 1 | sort | uniq | wc -l 18571998 Quote:
When trimming the HiSeq 4000 reads down to 75 bp -for 3 samples only- I got an increase in overall mapping rate of 1.1, 1.6 and 2.5%. Better but not 10% better. |
|||
![]() |
![]() |
![]() |
#8 | ||||
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,083
|
![]() Quote:
Quote:
Quote:
Quote:
![]() |
||||
![]() |
![]() |
![]() |
#9 | ||
Senior Member
Location: Paris Join Date: Aug 2011
Posts: 239
|
![]() Quote:
Quote:
Maybe I will try this after! |
||
![]() |
![]() |
![]() |
#10 | |||||||
Senior Member
Location: Paris Join Date: Aug 2011
Posts: 239
|
![]() Quote:
Quote:
Quote:
ribo sample: Quote:
Quote:
![]() Quote:
Then, I converted unmapped.bam with bamToFatq. I downloaded the gtf of rRNA and tRNA from UCSC tables. After creating a new transcriptome index, I reran the mapping with the preprocessed fastq: Quote:
|
|||||||
![]() |
![]() |
![]() |
#11 | |
Senior Member
Location: Paris Join Date: Aug 2011
Posts: 239
|
![]()
I wrote 2 posts last week, they do not appear (waiting for validation, I don't know why). I hope this one will be posted.
Quote:
1) map the unmapped reads as previously, except I change the reference to rDNA fasta sequence. Then, I got alignment rate of 83.2, 82, 81.6 and 74.8% for ribodepleted samples and 7, 4.4, 2.8 and 4% for polyA samples. 2) map the paired trimmed reads as previously, except I use as reference the rDNA fasta sequence. Then, I got alignment rate of 24.3, 14.1, 18.1 and 13.3% for ribodepleted samples and 0.5, 0.2, 0.2 and 0.1% for polyA samples. I read in some posts that 5% was low remaining rDNA sequence after ribodepletion. I have quite higher percentage. What is the classical range? From these tests, I am convince that the different mapping rates I got between the two experiments come from the rDNA sequence still present in ribodepleted samples that are not reported in hg19. Thank you for the help you provided me. |
|
![]() |
![]() |
![]() |
#12 |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,083
|
![]()
@Jane_M: Apologize for the posts from last week. I must have missed them. They were indeed waiting for moderation.
Thank you for providing detailed stats. It does appear that the ribo-depletion did not work well. Someone with experimental expertise can comment but % of rRNA should be less than 5% (perhaps 1% or less is ideal). Interestingly (if I am reading this right) most of your duplicates (>98%) on the 4000 are optical, with a lesser % in regular run. |
![]() |
![]() |
![]() |
#13 | |||
Senior Member
Location: Paris Join Date: Aug 2011
Posts: 239
|
![]() Quote:
Quote:
Quote:
Can someone please comment these two ratio: 35% (on ribodepleted sample - HiSeq 4000) and 51% (on polyA sample - NextSeq 500) for PERCENT_DUPLICATION? Last edited by Jane M; 05-02-2016 at 05:10 AM. |
|||
![]() |
![]() |
![]() |
#14 |
Senior Member
Location: East Coast USA Join Date: Feb 2008
Posts: 7,083
|
![]()
If MarkDuplicates is working as expected then the duplicates on HiSeq 4000 are real optical dups likely formed by pad-hopping (contamination of nearby nanowells). For NextSeq 500 they probably come from one too many PCR cycles.
One possible cause of the duplicates in general is experimental origin (likely # of PCR cycles). |
![]() |
![]() |
![]() |
#15 |
Senior Member
Location: Paris Join Date: Aug 2011
Posts: 239
|
![]()
Ok, thank you GenoMax.
|
![]() |
![]() |
![]() |
#16 |
Senior Member
Location: Paris Join Date: Aug 2011
Posts: 239
|
![]()
Still interested in your percentages of remaining rDNA sequence after ribodepletion
![]() I have one more question concerning what are the unmapped reads in the ribodepleted experiments. When looking at the overall mapping rate to hg19 and the overall mapping rate to rDNA for my 4 samples, one can see that Mapping Rate to hg19 + Mapping Rate to rDNA is beyond 100%: Code:
Sample MappingRateTohg19 MappingRateTorDNA Sample 1 79.8 24.3 Sample 2 88.2 14.1 Sample 3 84.7 18.1 Sample 4 87.8 13.3 Thank you Last edited by Jane M; 05-09-2016 at 08:35 AM. |
![]() |
![]() |
![]() |
Thread Tools | |
|
|