SEQanswers

Go Back   SEQanswers > Applications Forums > RNA Sequencing



Similar Threads
Thread Thread Starter Forum Replies Last Post
A first look at Illumina’s new NextSeq 500 AllSeq Vendor Forum 105 03-13-2017 01:39 PM
Secondary size selection for single cell RNAseq on NextSeq 500 help sallen Illumina/Solexa 0 03-08-2016 11:21 AM
Lotsa new toys from Illumina: HiSeq X Five, 3000, 4000, NextSeq 550 GW_OK Illumina/Solexa 53 05-20-2015 11:30 PM
NextSeq 500 and HiSeq X Ten Services Coming Soon to Genohub.com Genohub Vendor Forum 11 04-22-2014 08:46 AM
General mapping rate of human resequencing data against reference in GAiix/Hiseq cybog337 Illumina/Solexa 2 01-12-2011 08:43 AM

Reply
 
Thread Tools
Old 04-26-2016, 07:16 AM   #1
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 237
Question Mapping rate with PolyA and Ribo-zero selection on NextSeq 500 and HiSeq 4000

Dear all,

I have two RNASeq experiments from which I try to understand mapping rate difference. The experiments are pretty different:

Experiment 1:
Samples: 19 primary tumors
RNA selection: TruSeq Stranded mRNA (PolyA)
Sequencing: Paired-ends, 75bp
Sequencer: NextSeq 500

Experiment 2:
Samples: 16 tumor cell lines
RNA selection: TruSeq total RNA Stranded (Ribo zero)
Sequencing: Paired-ends, 100bp
Sequencer: HiSeq 4000

Below are the number of reads in each strand after preprocessing and mapping info. Mapping was performed using TopHat v2.1.0 and Bowtie v2.2.5.0 on hg19 reference genome. The average mapping rate is classic for recent experiments. Nevertheless, there is a difference of 10% between both experiments:

Experiment 1:
Mean number of reads after Trimmomatic (*_paired files): 67944508.54, SD=8215167.56, range=53282567-88949193
Overall read mapping rate: 97.71, SD=0.65, range=96-99
Concordant pair alignment rate: 94.41, SD=1.27, range=91.6-96.8

Experiment 2:
Mean number of reads after Trimmomatic (*_paired files): 90733572.69, SD=12822798.96, range=72880790-122816426
Overall read mapping rate: 87.12, SD=5.16, range=77.9-93.4
Concordant pair alignment rate: 82.74, SD=5.70, range=72.3-89.8

I was asked from what was due this difference. I do not have the answer for sure, only hypotheses. I would greatly appreciate your opinion on these hypotheses:

1. Ribo-zero experiments have commonly lower mapping rate due to the presence of non coding elements that correspond to less conserved regions (=>less well annotated) and/or in repeated regions.
I spend a couple of hours looking for mapping rate info in similar recent (publications between 2014 and now) experiments without any success. The mapping rate seems to be not reported anymore in the publications - for those that I checked at least.
2. We can observe a slight lower base quality, especially around the 75th base in the Ribo-zero samples (attached: one representative Ribo zero and one representative PolyA samples after preprocessing). This could reduce the mapping rate.
Until now, I used 20 as threshold for both experiments from HiSeq and NextSeq. I read yesterday that a threshold of 20 was too high for reads from NextSeq. Can you confirm this? I am trying a more stringent base quality threshold (20 -> 24) on Ribo zero samples during preprocessing.
3. Different sequencers. Has a higher mapping rate been reported for NextSeq experiments?

Finally, I used the 75 first bases from 3 samples of the Ribo zero experiment instead of the 100 bases. It increased from 1 to 2.5% the mapping rate, which is a bit surprising for me since mapping is supposed to increase as read length increases. On the other hand, there is this slight default around 75bp in these samples.

I would really appreciate any feedback on mapping rates with both RNA selections and sequencers, any link to similar analyses or any test to try.
Thank you in advance,
Jane
Attached Images
File Type: png ribo_R1_f.png (34.7 KB, 3 views)
File Type: png ribo_R2_f.png (38.6 KB, 1 views)
File Type: png polyA_R1_f.png (8.1 KB, 2 views)
File Type: png polyA_R2_f.png (8.1 KB, 1 views)

Last edited by Jane M; 04-26-2016 at 07:24 AM.
Jane M is offline   Reply With Quote
Old 04-26-2016, 07:57 AM   #2
fanli
Senior Member
 
Location: California

Join Date: Jul 2014
Posts: 196
Default

I'd guess it's the first of your hypotheses. You could try BLAST/BLAT on some of the unmapped reads to see what they hit...
fanli is offline   Reply With Quote
Old 04-27-2016, 02:55 AM   #3
Chipper
Senior Member
 
Location: Sweden

Join Date: Mar 2008
Posts: 324
Default

My guess would be that you have more rDNA in the second experiment. The 45S rDNA sequence is not included in hg19.
Chipper is offline   Reply With Quote
Old 04-27-2016, 03:52 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,435
Default

1. HiSeq 4000 generates optical duplicates (due to pad-hopping). Depending on how you are handling your multi-mappers this could explain the difference.
2. Even though the samples were prepped there is bound to be some rRNA present. You could align to human rDNA repeat independently and see what you get for the two samples as @Chipper suggested.
3.
Quote:
since mapping is supposed to increase as read length increases
Mapping should become more precise (harder to multi-map) as length increases so that could be another reason for the reduced alignment. You could trim the HiSeq 4000 reads down to 75 bp and see if you get the same ballpark alignment as NextSeq 500 reads.

Last edited by GenoMax; 04-27-2016 at 03:54 AM.
GenoMax is offline   Reply With Quote
Old 04-27-2016, 05:14 AM   #5
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 237
Default

Quote:
Originally Posted by fanli View Post
I'd guess it's the first of your hypotheses. You could try BLAST/BLAT on some of the unmapped reads to see what they hit...
Thank you fanli for your suggestion.
I used the Web BLAST interface on the first "polyA" and "ribo" unmapped reads using "Genome (all assemblies top level)" as database and default settings.
1/10 reads from "polyA" unmapped.sam gets "No significant similarity found", the other 9 reads had pretty good similaries on the full read or on part of the read.
1/10 reads from "ribo" unmapped.sam gets similaries and the 9 other gets "No significant similarity found".

Is "Genome (all assemblies top level)" not the whole genome, with missing non coding regions where my "ribo" reads match? Or do they contain too many sequencing errors?
Jane M is offline   Reply With Quote
Old 04-27-2016, 05:20 AM   #6
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 237
Default

Quote:
Originally Posted by Chipper View Post
My guess would be that you have more rDNA in the second experiment. The 45S rDNA sequence is not included in hg19.
Thank you Chipper for your help. Are these sequences present in hg38?
Is it possible to test this hypothesis without mapping the whole transcriptome to hg38?
Jane M is offline   Reply With Quote
Old 04-27-2016, 06:11 AM   #7
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 237
Default

I really appreciate your suggestions GenoMax, thank you

Quote:
Originally Posted by GenoMax View Post
1. HiSeq 4000 generates optical duplicates (due to pad-hopping). Depending on how you are handling your multi-mappers this could explain the difference.
I used TopHat2 with default parameters:
Quote:
tophat/v2.1.0/bin/tophat2 -p 8 --transcriptome-index=/data/KnownTranscriptome/Known -o /data/2_T1-ARN/tophat_out --read-realign-edit-dist 0 --mate-inner-dist 97 --mate-std-dev 20 --min-anchor-length 4 --splice-mismatches 0 --min-intron-length 50 --max-intron-length 500000 --no-coverage-search --library-type fr-firststrand --min-coverage-intron 50 --max-coverage-intron 20000 --min-segment-intron 50 --max-segment-intron 500000 /data/bioinfo/hg19/Bowtie2/2.2.5_20150629/Base_chr/hg19 /data/RNASeq/2_T1-ARN/2_T1-ARN_paired.R1.fastq /data/RNASeq/2_T1-ARN/2_T1-ARN_paired.R2.fastq
I was wondering if something was happening with duplicates based on some tests:
One polyA unmapped.sam file:
Code:
head unmapped.sam
NS500197:119:HG2Y7BGXX:1:11101:24683:1054	69	*	0	255	*	*	0	0	CTCCANTTTAATTTCCTTTAACAAATTTTCTGTACTTTTACTTTCCATCCAAACAGTAACTTATAAATTATTATT	AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
NS500197:119:HG2Y7BGXX:1:13311:17866:19393	69	*	0	255	*	*	0	0	TAGTACTTCAGTGATTTGCCAATGCCCTCTAAGTCCCGTAACCTGTTGTCAATCTGCACATGAACATTTTGCAGA	AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
NS500197:119:HG2Y7BGXX:2:13311:16650:6888	69	*	0	255	*	*	0	0	CTGGGCTGGTGTTGAGTGTCTGCAGCTTTTCCAGGCACAGAGTGCAGGCTGTCAGTGGATCTACCATCCTGGAGT	AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
NS500197:119:HG2Y7BGXX:4:13610:15232:11614	69	*	0	255	*	*	0	0	CCCACTTGATTTTGGAGGGATCTCGCTCCTGGAAGATGGTGATGGGATTTCCATTGATGACAAGCTTCCCGAGAT	AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
NS500197:119:HG2Y7BGXX:4:13610:26394:11583	133	*	0	255	*	*	0	0	TTTCTGAGCCTTTTACCACTCCAGCCTAGCCCCTACCCCCCAATTTGGCGGGCGCGGACCCC	AAAAAEEEEEEEEEEEEAEEEEEEEEEEEEEEE//E6AAAA/A///EE/EEE//A/E///EE
NS500197:119:HG2Y7BGXX:1:11101:17376:1060	69	*	0	255	*	*	0	0	AGCCTNTTGGGGGTCAGTGCCCTCCTAATTGGGGGGTAGGGGCTAGGCTGGAGTGGTAAAAGGCTCAGAAAAATC	AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEE
NS500197:119:HG2Y7BGXX:1:13311:20274:19398	69	*	0	255	*	*	0	0	GCCAGCTCTCTCCTGGTGTGCTCGTCTTGCCTCAGGTCCTTCTTATTCCCCACCAGGATGATGGGCACGAGATCG	AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
NS500197:119:HG2Y7BGXX:2:13311:9800:6890	69	*	0	255	*	*	0	0	CAGGTAGAGCAGGAGCCAGTGAAGAGCAGGCCCTGGATGGGTGGGAGATCGGAAGAGCACACGTCTGAACTCCAG	AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
NS500197:119:HG2Y7BGXX:4:13610:3313:11628	69	*	0	255	*	*	0	0	CCCCCGTTGTGGTCCCAGGTCCTAAACAGCAGGGCTAGTCCCAGAAAGACGACCCCCCCACGGAAGGGCTTTAGC	AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
NS500197:119:HG2Y7BGXX:4:13610:5379:11596	133	*	0	255	*	*	0	0	CATGTAGTTAATCTGACATGTCAAAATTGGGAAGACTGTAACCTTTTTTTTTTTTTTTTTTT	AAA/AA/EEAEA/EEEEAEEEAEEE/E//E/EE66/EEAEE/AEEEE<A/EE6EEEA<EEEE

wc -l unmapped.sam
2754730 unmapped.sam

cat unmapped.sam | cut -f1 | sort | uniq | wc -l
2334225

cat unmapped.sam | cut -f10 | sort | uniq | wc -l
2098310
Several reads have the same sequence. Are they duplicates?
I don't understand why there are several reads with a same identifier since there is no multi-mapped read in unmapped sam.
It seems that there are reads with the same sequence but different identifiers. What are they?

Same info for one ribo unmapped.sam file:

Code:
head unmapped.sam 
K00103:86:H5JKJBBXX:6:1101:1489:1086	69	*	0	255	*	*	0	0	GTGGAGATGGGCGCCGCGAGGCGTCCAGTGCGGTAACGCGACCGATCCCGGAGAAGCCGGCGGGAGCCCCGGGGAGAGTTCTCTTTTCTTTGTGAAGGG	AAFFJJJJJJJJJJFJJJJFFJJJJJAJJJJJJFFJFFJJJJJJJJJJJFFFAFFAJFAAA7-7AF<FAAJAAFFFJF-AAJJ<<FJJAJFJJFJJJAF
K00103:86:H5JKJBBXX:6:1114:8298:40526	69	*	0	255	*	*	0	0	TGGCATGCTAACTAGTTACGCGACCCCCGAGCGGTCGGCGTCCCCCAACTTCTTAGAGGGACAAGTGGCGTTCAGCCACCCGAGATTGAGCAATAACAGGAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJFJJJJJJ<FF-FJJFJJJFJFJFJF7<FJ7<77FF<7FFFF<JAJFJFFFJJJJJJFJJF
K00103:86:H5JKJBBXX:6:1214:15696:30661	69	*	0	255	*	*	0	0	CCTGGCGCTAAACCATTCGTAGACGACCTGCTTCTGGGTCGGGGTTTCGTACGTAGCAGAGCAGCTCCCTCGCTGCGATCTATTGAAAGTCAGCCCTCGAAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJAJJJJJJJJJJJJJJJJJJJJJJJ
K00103:86:H5JKJBBXX:6:2214:26453:40192	69	*	0	255	*	*	0	0	CTTGGTGGTAGTAGCAAATATTCAAACGAGAACTTTGAAGGCCGAAGTGGAGAAGGCTTCCATGTGAACAGCAGTTGAACATGGGTCAGTCGGTCCTGAGAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJFJJJJFF<JJJJJJJ<-<<FJFJFF7F<FFJJJFFJJJJJJJ
K00103:86:H5JKJBBXX:6:2214:25621:40192	133	*	0	255	*	*	0	0	CGAGGGCTGACTTTCAATAGATCGCAGCGAGGGAGCTGCTCTGCTACGTACGAAACCCCGACCCAGAAGCAGGTCGTCTACGAATGGTTTAGCGCCAGGTAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJAJ
K00103:86:H5JKJBBXX:6:1101:1509:1086	69	*	0	255	*	*	0	0	GAAAGACTAATCGAACCATCTAGTAGCTGGTTCCCTCCGAAGTTTCCCTCAGGATAGCTGGC	AAAFFJFFJJJJFJFFAJJJJJJJJAFAJAAFFFF<<FJJ-FJJFJFJ<<AAJA<J7-F-<<
K00103:86:H5JKJBBXX:6:1114:8501:40526	69	*	0	255	*	*	0	0	CACCACACAGCGCGGCCAGGGCAGAGTGGGCCGCCGCCGCCGCCCAGAGACCGAGGCTGCCTTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATCAAFFFJJJFJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJFJJJJJFJJJJJJJFFFJJFFFFJJJJAJFFJFA<JJJJJJJJJJJJJJFJJJ
K00103:86:H5JKJBBXX:6:1214:16122:30661	69	*	0	255	*	*	0	0	CGGCGTCGCTCGTGGGGGGCCCAAGTCCTTCTGATCGAGGCCCAGCCCGTGGACGGTGTGAGGCCGGTAGCGGCCCCCGGCGCGCCGGGCCCGGGTCTTCAAFFFJJJJJJJJJJJJJJJJJJJJJJJJFJJFJJJJFJJFFFJFJJJF<AF7AJJJJJJJAJJFJJJJAF<AFAFJFFJFFFAFJFAJJJJFFJ7FJJJ
K00103:86:H5JKJBBXX:6:2214:27082:40192	69	*	0	255	*	*	0	0	GTCTTGGGGCCGAAACGATCTCAACCTATTCTCAAACTTTAAATGGGTAAGAAGCCCGGCTCGCTGGCGTGGAGCCGGGCGTGGAATGCGAGTGCCTAGTAAFFFJJJFJJJJJJJFFJJJJJFJJFJJJJJJJJFJF<JJJ<<JFF<AJJFFFAJJJJF7AAJ<<<<7F77A-A77AFJJ7<-FFJ7J<JJAFFFFA-F
K00103:86:H5JKJBBXX:6:2214:26453:40192	133	*	0	255	*	*	0	0	GGGATCTGAACCCGACTCCCTTTCGATCGGCCGAGGGCAACGGAGGCCATCGCCCGTCCCTTCGGAACGGCGCTCGCCCATCTCTCAGGACCGACTGACCA--AAJ7-FJJJJJJJJJJFJJJFFJJFJFFJ<JFJJJJJJFJJJJJJJJJJJJJJJJFFF<JFFJJJJAJJJJJJFJJJJJJJJJJJJJJJJJJJJJJF

wc -l unmapped.sam
31234334 unmapped.sam

cat unmapped.sam | cut -f 1 | sort | uniq | wc -l
18571998
Can we say there are 25% of duplicates in the polyA unmapped file and 40% in the ribo unmapped file?

Quote:
Originally Posted by GenoMax View Post
2. Even though the samples were prepped there is bound to be some rRNA present. You could align to human rDNA repeat independently and see what you get for the two samples as @Chipper suggested.
Is there a way to try to map the unmapped reads to rRNA?

Quote:
Originally Posted by GenoMax View Post
3. Mapping should become more precise (harder to multi-map) as length increases so that could be another reason for the reduced alignment. You could trim the HiSeq 4000 reads down to 75 bp and see if you get the same ballpark alignment as NextSeq 500 reads.
When trimming the HiSeq 4000 reads down to 75 bp -for 3 samples only- I got an increase in overall mapping rate of 1.1, 1.6 and 2.5%. Better but not 10% better.
Jane M is offline   Reply With Quote
Old 04-27-2016, 07:10 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,435
Default

Quote:
Originally Posted by Jane M View Post
I really appreciate your suggestions GenoMax, thank you

I used TopHat2 with default parameters:
Default is 20 locations for multi-mapped reads for TopHat as I recall.
Quote:
It seems that there are reads with the same sequence but different identifiers. What are they?
Those may be the optical duplicates that were generated by pad-hopping (or PCR during prep). You can try to run the Picard MarkDuplicates protocol (including the optical dup marking) and see if they get flagged. I have tried doing this with a limited number of samples from HiSeq 4000 but have not managed to get useful results for the optical part.

Quote:
Is there a way to try to map the unmapped reads to rRNA?
You can use the sequence of the human rDNA repeat found here to map against.

Quote:
When trimming the HiSeq 4000 reads down to 75 bp -for 3 samples only- I got an increase in overall mapping rate of 1.1, 1.6 and 2.5%. Better but not 10% better.
Perhaps you should allow multi-mappers to map at all locations. See if that ups the percentage. An academic exercise
GenoMax is offline   Reply With Quote
Old 04-27-2016, 08:08 AM   #9
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 237
Default

Quote:
Originally Posted by GenoMax View Post
Default is 20 locations for multi-mapped reads for TopHat as I recall.
Exact

Quote:
Originally Posted by GenoMax View Post
Those may be the optical duplicates that were generated by pad-hopping (or PCR during prep). You can try to run the Picard MarkDuplicates protocol (including the optical dup marking) and see if they get flagged. I have tried doing this with a limited number of samples from HiSeq 4000 but have not managed to get useful results for the optical part.

You can use the sequence of the human rDNA repeat found here to map against.
Thank you for these suggestions. I will try both, counting the number of reads flagged as optical duplicates and map the unmapped reads to ribosomal DNA, each time in the 2 experiments.

Quote:
Originally Posted by GenoMax View Post
Perhaps you should allow multi-mappers to map at all locations. See if that ups the percentage. An academic exercise
Maybe I will try this after!
Jane M is offline   Reply With Quote
Old 04-28-2016, 06:36 AM   #10
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 237
Default

Quote:
Originally Posted by GenoMax View Post
Those may be the optical duplicates that were generated by pad-hopping (or PCR during prep). You can try to run the Picard MarkDuplicates protocol (including the optical dup marking) and see if they get flagged. I have tried doing this with a limited number of samples from HiSeq 4000 but have not managed to get useful results for the optical part.
I tried your suggestion on both unmapped and mapped reads for one polyA and one ribo sample:

Quote:
java -Xmx6g -jar picard/1.134/picard.jar MarkDuplicates INPUT=$path/unmapped.bam OUTPUT=$path/unmapped_MarkDuplicates.bam METRICS_FILE=unmapped_metrics OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 REMOVE_DUPLICATES=true VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true
Quote:
java -Xmx6g -jar picard/1.134/picard.jar MarkDuplicates INPUT=$path/ARN_POP.bam OUTPUT=$path/ARN_POP_MarkDuplicates.bam METRICS_FILE=mapped_metrics OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 REMOVE_DUPLICATES=true VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true
For both runs, I got:
ribo sample:
Quote:
unmapped_metrics file:
## METRICS CLASS picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Unknown Library 0 0 31234334 0 0 0 ?

mapped_metrics
## METRICS CLASS picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Unknown Library 5909662 58551716 0 5344654 19348992 19322086 0.358032 28585819416
polyA sample:
Quote:
ribo sample:
## METRICS CLASS picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Unknown Library 0 0 2754730 0 0 0 ?

mapped_metrics
## METRICS CLASS picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Unknown Library 1913720 64301596 0 1502675 32452071 28720194 0.508799 157544313
I got results for mapped reads only, maybe because they are mainly paired. The ribo sample, sequenced on HiSeq4000 has less PERCENT_DUPLICATION (36%) than the polyA sample (51%), sequenced on NextSeq 500. I was expected the contrary And these percentages seem very high. Is it normal?

Quote:
Originally Posted by GenoMax View Post
You can use the sequence of the human rDNA repeat found here to map against.
To test if more reads correspond to rRNA sequence in ribodepleted sample than in polyA sample, I first ran blast on the first 10 unmapped reads, but on "Nucleotide collection (nr/nt)" database this time. Now, for the ribodepleted sample, 9/10 had significant similarities, each had several similarities, often to 45S/60S, as Chipper suggested. For the polyA sample, again 9/10 had significant similarities, each had several similarities too, but not especially to ribosome.

Then, I converted unmapped.bam with bamToFatq. I downloaded the gtf of rRNA and tRNA from UCSC tables.
After creating a new transcriptome index, I reran the mapping with the preprocessed fastq:
Quote:
tophat/v2.1.0/bin/tophat2 -p 8 --transcriptome-index=rDNATranscriptome/rDNA -o /myfolder/rDNA_tophat_out --read-realign-edit-dist 0 --mate-inner-dist 98 --mate-std-dev 20 --min-anchor-length 4 --splice-mismatches 0 --min-intron-length 50 --max-intron-length 500000 --no-coverage-search --library-type fr-firststrand --min-coverage-intron 50 --max-coverage-intron 20000 --min-segment-intron 50 --max-segment-intron 500000 Bowtie2/myNewBowtie2Index myfolder/unmapped.R1.fastq myfolder/unmapped.R2.fastq
If most of the unmapped reads correspond to rRNA, I should now get a higher mapping rate for ribodepleted sample than for polyA sample. Waiting for these results.
Jane M is offline   Reply With Quote
Old 05-02-2016, 02:35 AM   #11
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 237
Default

I wrote 2 posts last week, they do not appear (waiting for validation, I don't know why). I hope this one will be posted.

Quote:
Originally Posted by GenoMax View Post
You can use the sequence of the human rDNA repeat found here to map against.
I ran two tests:
1) map the unmapped reads as previously, except I change the reference to rDNA fasta sequence.
Then, I got alignment rate of 83.2, 82, 81.6 and 74.8% for ribodepleted samples and 7, 4.4, 2.8 and 4% for polyA samples.

2) map the paired trimmed reads as previously, except I use as reference the rDNA fasta sequence.
Then, I got alignment rate of 24.3, 14.1, 18.1 and 13.3% for ribodepleted samples and 0.5, 0.2, 0.2 and 0.1% for polyA samples.

I read in some posts that 5% was low remaining rDNA sequence after ribodepletion. I have quite higher percentage. What is the classical range?

From these tests, I am convince that the different mapping rates I got between the two experiments come from the rDNA sequence still present in ribodepleted samples that are not reported in hg19.

Thank you for the help you provided me.
Jane M is offline   Reply With Quote
Old 05-02-2016, 03:23 AM   #12
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,435
Default

@Jane_M: Apologize for the posts from last week. I must have missed them. They were indeed waiting for moderation.

Thank you for providing detailed stats. It does appear that the ribo-depletion did not work well. Someone with experimental expertise can comment but % of rRNA should be less than 5% (perhaps 1% or less is ideal).

Interestingly (if I am reading this right) most of your duplicates (>98%) on the 4000 are optical, with a lesser % in regular run.
GenoMax is offline   Reply With Quote
Old 05-02-2016, 04:06 AM   #13
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 237
Default

Quote:
Originally Posted by GenoMax View Post
@Jane_M: Apologize for the posts from last week. I must have missed them. They were indeed waiting for moderation.
Thank you for posting them.

Quote:
Originally Posted by GenoMax View Post
Thank you for providing detailed stats. It does appear that the ribo-depletion did not work well. Someone with experimental expertise can comment but % of rRNA should be less than 5% (perhaps 1% or less is ideal).
I had the impression that 5% was already a good ribodepletion ratio. Let's see if others can give their feedback.


Quote:
Originally Posted by GenoMax View Post
Interestingly (if I am reading this right) most of your duplicates (>98%) on the 4000 are optical, with a lesser % in regular run.
Well, on HiSeq 4000, more than 99% of the READ_PAIR_DUPLICATES are READ_PAIR_OPTICAL_DUPLICATES. This ratio reaches 88% for NextSeq 500. Are there optical duplicates too? If not, what are they?

Can someone please comment these two ratio: 35% (on ribodepleted sample - HiSeq 4000) and 51% (on polyA sample - NextSeq 500) for PERCENT_DUPLICATION?

Last edited by Jane M; 05-02-2016 at 04:10 AM.
Jane M is offline   Reply With Quote
Old 05-02-2016, 04:32 AM   #14
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,435
Default

If MarkDuplicates is working as expected then the duplicates on HiSeq 4000 are real optical dups likely formed by pad-hopping (contamination of nearby nanowells). For NextSeq 500 they probably come from one too many PCR cycles.

One possible cause of the duplicates in general is experimental origin (likely # of PCR cycles).
GenoMax is offline   Reply With Quote
Old 05-02-2016, 05:46 AM   #15
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 237
Default

Ok, thank you GenoMax.
Jane M is offline   Reply With Quote
Old 05-09-2016, 05:15 AM   #16
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 237
Default

Still interested in your percentages of remaining rDNA sequence after ribodepletion

I have one more question concerning what are the unmapped reads in the ribodepleted experiments. When looking at the overall mapping rate to hg19 and the overall mapping rate to rDNA for my 4 samples, one can see that Mapping Rate to hg19 + Mapping Rate to rDNA is beyond 100%:

Code:
Sample  MappingRateTohg19  MappingRateTorDNA
Sample 1  79.8  24.3
Sample 2  88.2  14.1
Sample 3  84.7  18.1
Sample 4  87.8  13.3
What are those reads common to hg19 and rDNA: multimapped reads? rDNA sequences included in hg19 (Human ribosomal DNA complete repeating unit - http://www.ncbi.nlm.nih.gov/nuccore/U13369.1 - contains ~43kb. Isn't it too large)? Any feedback?

Thank you

Last edited by Jane M; 05-09-2016 at 07:35 AM.
Jane M is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 08:28 PM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2017, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO