SEQanswers

SEQanswers (http://seqanswers.com/forums/index.php)
-   RNA Sequencing (http://seqanswers.com/forums/forumdisplay.php?f=26)
-   -   Mapping rate with PolyA and Ribo-zero selection on NextSeq 500 and HiSeq 4000 (http://seqanswers.com/forums/showthread.php?t=67876)

Jane M 04-26-2016 08:16 AM

Mapping rate with PolyA and Ribo-zero selection on NextSeq 500 and HiSeq 4000
 
4 Attachment(s)
Dear all,

I have two RNASeq experiments from which I try to understand mapping rate difference. The experiments are pretty different:

Experiment 1:
Samples: 19 primary tumors
RNA selection: TruSeq Stranded mRNA (PolyA)
Sequencing: Paired-ends, 75bp
Sequencer: NextSeq 500

Experiment 2:
Samples: 16 tumor cell lines
RNA selection: TruSeq total RNA Stranded (Ribo zero)
Sequencing: Paired-ends, 100bp
Sequencer: HiSeq 4000

Below are the number of reads in each strand after preprocessing and mapping info. Mapping was performed using TopHat v2.1.0 and Bowtie v2.2.5.0 on hg19 reference genome. The average mapping rate is classic for recent experiments. Nevertheless, there is a difference of 10% between both experiments:

Experiment 1:
Mean number of reads after Trimmomatic (*_paired files): 67944508.54, SD=8215167.56, range=53282567-88949193
Overall read mapping rate: 97.71, SD=0.65, range=96-99
Concordant pair alignment rate: 94.41, SD=1.27, range=91.6-96.8

Experiment 2:
Mean number of reads after Trimmomatic (*_paired files): 90733572.69, SD=12822798.96, range=72880790-122816426
Overall read mapping rate: 87.12, SD=5.16, range=77.9-93.4
Concordant pair alignment rate: 82.74, SD=5.70, range=72.3-89.8

I was asked from what was due this difference. I do not have the answer for sure, only hypotheses. I would greatly appreciate your opinion on these hypotheses:

1. Ribo-zero experiments have commonly lower mapping rate due to the presence of non coding elements that correspond to less conserved regions (=>less well annotated) and/or in repeated regions.
I spend a couple of hours looking for mapping rate info in similar recent (publications between 2014 and now) experiments without any success. The mapping rate seems to be not reported anymore in the publications - for those that I checked at least.
2. We can observe a slight lower base quality, especially around the 75th base in the Ribo-zero samples (attached: one representative Ribo zero and one representative PolyA samples after preprocessing). This could reduce the mapping rate.
Until now, I used 20 as threshold for both experiments from HiSeq and NextSeq. I read yesterday that a threshold of 20 was too high for reads from NextSeq. Can you confirm this? I am trying a more stringent base quality threshold (20 -> 24) on Ribo zero samples during preprocessing.
3. Different sequencers. Has a higher mapping rate been reported for NextSeq experiments?

Finally, I used the 75 first bases from 3 samples of the Ribo zero experiment instead of the 100 bases. It increased from 1 to 2.5% the mapping rate, which is a bit surprising for me since mapping is supposed to increase as read length increases. On the other hand, there is this slight default around 75bp in these samples.

I would really appreciate any feedback on mapping rates with both RNA selections and sequencers, any link to similar analyses or any test to try.
Thank you in advance,
Jane

fanli 04-26-2016 08:57 AM

I'd guess it's the first of your hypotheses. You could try BLAST/BLAT on some of the unmapped reads to see what they hit...

Chipper 04-27-2016 03:55 AM

My guess would be that you have more rDNA in the second experiment. The 45S rDNA sequence is not included in hg19.

GenoMax 04-27-2016 04:52 AM

1. HiSeq 4000 generates optical duplicates (due to pad-hopping). Depending on how you are handling your multi-mappers this could explain the difference.
2. Even though the samples were prepped there is bound to be some rRNA present. You could align to human rDNA repeat independently and see what you get for the two samples as @Chipper suggested.
3.
Quote:

since mapping is supposed to increase as read length increases
Mapping should become more precise (harder to multi-map) as length increases so that could be another reason for the reduced alignment. You could trim the HiSeq 4000 reads down to 75 bp and see if you get the same ballpark alignment as NextSeq 500 reads.

Jane M 04-27-2016 06:14 AM

Quote:

Originally Posted by fanli (Post 192918)
I'd guess it's the first of your hypotheses. You could try BLAST/BLAT on some of the unmapped reads to see what they hit...

Thank you fanli for your suggestion.
I used the Web BLAST interface on the first "polyA" and "ribo" unmapped reads using "Genome (all assemblies top level)" as database and default settings.
1/10 reads from "polyA" unmapped.sam gets "No significant similarity found", the other 9 reads had pretty good similaries on the full read or on part of the read.
1/10 reads from "ribo" unmapped.sam gets similaries and the 9 other gets "No significant similarity found".

Is "Genome (all assemblies top level)" not the whole genome, with missing non coding regions where my "ribo" reads match? Or do they contain too many sequencing errors?

Jane M 04-27-2016 06:20 AM

Quote:

Originally Posted by Chipper (Post 192956)
My guess would be that you have more rDNA in the second experiment. The 45S rDNA sequence is not included in hg19.

Thank you Chipper for your help. Are these sequences present in hg38?
Is it possible to test this hypothesis without mapping the whole transcriptome to hg38?

Jane M 04-27-2016 07:11 AM

I really appreciate your suggestions GenoMax, thank you

Quote:

Originally Posted by GenoMax (Post 192957)
1. HiSeq 4000 generates optical duplicates (due to pad-hopping). Depending on how you are handling your multi-mappers this could explain the difference.

I used TopHat2 with default parameters:
Quote:

tophat/v2.1.0/bin/tophat2 -p 8 --transcriptome-index=/data/KnownTranscriptome/Known -o /data/2_T1-ARN/tophat_out --read-realign-edit-dist 0 --mate-inner-dist 97 --mate-std-dev 20 --min-anchor-length 4 --splice-mismatches 0 --min-intron-length 50 --max-intron-length 500000 --no-coverage-search --library-type fr-firststrand --min-coverage-intron 50 --max-coverage-intron 20000 --min-segment-intron 50 --max-segment-intron 500000 /data/bioinfo/hg19/Bowtie2/2.2.5_20150629/Base_chr/hg19 /data/RNASeq/2_T1-ARN/2_T1-ARN_paired.R1.fastq /data/RNASeq/2_T1-ARN/2_T1-ARN_paired.R2.fastq
I was wondering if something was happening with duplicates based on some tests:
One polyA unmapped.sam file:
Code:


head unmapped.sam
NS500197:119:HG2Y7BGXX:1:11101:24683:1054        69        *        0        255        *        *        0        0        CTCCANTTTAATTTCCTTTAACAAATTTTCTGTACTTTTACTTTCCATCCAAACAGTAACTTATAAATTATTATT        AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
NS500197:119:HG2Y7BGXX:1:13311:17866:19393        69        *        0        255        *        *        0        0        TAGTACTTCAGTGATTTGCCAATGCCCTCTAAGTCCCGTAACCTGTTGTCAATCTGCACATGAACATTTTGCAGA        AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
NS500197:119:HG2Y7BGXX:2:13311:16650:6888        69        *        0        255        *        *        0        0        CTGGGCTGGTGTTGAGTGTCTGCAGCTTTTCCAGGCACAGAGTGCAGGCTGTCAGTGGATCTACCATCCTGGAGT        AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
NS500197:119:HG2Y7BGXX:4:13610:15232:11614        69        *        0        255        *        *        0        0        CCCACTTGATTTTGGAGGGATCTCGCTCCTGGAAGATGGTGATGGGATTTCCATTGATGACAAGCTTCCCGAGAT        AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
NS500197:119:HG2Y7BGXX:4:13610:26394:11583        133        *        0        255        *        *        0        0        TTTCTGAGCCTTTTACCACTCCAGCCTAGCCCCTACCCCCCAATTTGGCGGGCGCGGACCCC        AAAAAEEEEEEEEEEEEAEEEEEEEEEEEEEEE//E6AAAA/A///EE/EEE//A/E///EE
NS500197:119:HG2Y7BGXX:1:11101:17376:1060        69        *        0        255        *        *        0        0        AGCCTNTTGGGGGTCAGTGCCCTCCTAATTGGGGGGTAGGGGCTAGGCTGGAGTGGTAAAAGGCTCAGAAAAATC        AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEE
NS500197:119:HG2Y7BGXX:1:13311:20274:19398        69        *        0        255        *        *        0        0        GCCAGCTCTCTCCTGGTGTGCTCGTCTTGCCTCAGGTCCTTCTTATTCCCCACCAGGATGATGGGCACGAGATCG        AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
NS500197:119:HG2Y7BGXX:2:13311:9800:6890        69        *        0        255        *        *        0        0        CAGGTAGAGCAGGAGCCAGTGAAGAGCAGGCCCTGGATGGGTGGGAGATCGGAAGAGCACACGTCTGAACTCCAG        AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
NS500197:119:HG2Y7BGXX:4:13610:3313:11628        69        *        0        255        *        *        0        0        CCCCCGTTGTGGTCCCAGGTCCTAAACAGCAGGGCTAGTCCCAGAAAGACGACCCCCCCACGGAAGGGCTTTAGC        AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
NS500197:119:HG2Y7BGXX:4:13610:5379:11596        133        *        0        255        *        *        0        0        CATGTAGTTAATCTGACATGTCAAAATTGGGAAGACTGTAACCTTTTTTTTTTTTTTTTTTT        AAA/AA/EEAEA/EEEEAEEEAEEE/E//E/EE66/EEAEE/AEEEE<A/EE6EEEA<EEEE

wc -l unmapped.sam
2754730 unmapped.sam

cat unmapped.sam | cut -f1 | sort | uniq | wc -l
2334225

cat unmapped.sam | cut -f10 | sort | uniq | wc -l
2098310

Several reads have the same sequence. Are they duplicates?
I don't understand why there are several reads with a same identifier since there is no multi-mapped read in unmapped sam.
It seems that there are reads with the same sequence but different identifiers. What are they?

Same info for one ribo unmapped.sam file:

Code:

head unmapped.sam
K00103:86:H5JKJBBXX:6:1101:1489:1086        69        *        0        255        *        *        0        0        GTGGAGATGGGCGCCGCGAGGCGTCCAGTGCGGTAACGCGACCGATCCCGGAGAAGCCGGCGGGAGCCCCGGGGAGAGTTCTCTTTTCTTTGTGAAGGG        AAFFJJJJJJJJJJFJJJJFFJJJJJAJJJJJJFFJFFJJJJJJJJJJJFFFAFFAJFAAA7-7AF<FAAJAAFFFJF-AAJJ<<FJJAJFJJFJJJAF
K00103:86:H5JKJBBXX:6:1114:8298:40526        69        *        0        255        *        *        0        0        TGGCATGCTAACTAGTTACGCGACCCCCGAGCGGTCGGCGTCCCCCAACTTCTTAGAGGGACAAGTGGCGTTCAGCCACCCGAGATTGAGCAATAACAGGAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJFJJJJJJ<FF-FJJFJJJFJFJFJF7<FJ7<77FF<7FFFF<JAJFJFFFJJJJJJFJJF
K00103:86:H5JKJBBXX:6:1214:15696:30661        69        *        0        255        *        *        0        0        CCTGGCGCTAAACCATTCGTAGACGACCTGCTTCTGGGTCGGGGTTTCGTACGTAGCAGAGCAGCTCCCTCGCTGCGATCTATTGAAAGTCAGCCCTCGAAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJAJJJJJJJJJJJJJJJJJJJJJJJ
K00103:86:H5JKJBBXX:6:2214:26453:40192        69        *        0        255        *        *        0        0        CTTGGTGGTAGTAGCAAATATTCAAACGAGAACTTTGAAGGCCGAAGTGGAGAAGGCTTCCATGTGAACAGCAGTTGAACATGGGTCAGTCGGTCCTGAGAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJFJJJJFF<JJJJJJJ<-<<FJFJFF7F<FFJJJFFJJJJJJJ
K00103:86:H5JKJBBXX:6:2214:25621:40192        133        *        0        255        *        *        0        0        CGAGGGCTGACTTTCAATAGATCGCAGCGAGGGAGCTGCTCTGCTACGTACGAAACCCCGACCCAGAAGCAGGTCGTCTACGAATGGTTTAGCGCCAGGTAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJAJ
K00103:86:H5JKJBBXX:6:1101:1509:1086        69        *        0        255        *        *        0        0        GAAAGACTAATCGAACCATCTAGTAGCTGGTTCCCTCCGAAGTTTCCCTCAGGATAGCTGGC        AAAFFJFFJJJJFJFFAJJJJJJJJAFAJAAFFFF<<FJJ-FJJFJFJ<<AAJA<J7-F-<<
K00103:86:H5JKJBBXX:6:1114:8501:40526        69        *        0        255        *        *        0        0        CACCACACAGCGCGGCCAGGGCAGAGTGGGCCGCCGCCGCCGCCCAGAGACCGAGGCTGCCTTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATCAAFFFJJJFJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJFJJJJJFJJJJJJJFFFJJFFFFJJJJAJFFJFA<JJJJJJJJJJJJJJFJJJ
K00103:86:H5JKJBBXX:6:1214:16122:30661        69        *        0        255        *        *        0        0        CGGCGTCGCTCGTGGGGGGCCCAAGTCCTTCTGATCGAGGCCCAGCCCGTGGACGGTGTGAGGCCGGTAGCGGCCCCCGGCGCGCCGGGCCCGGGTCTTCAAFFFJJJJJJJJJJJJJJJJJJJJJJJJFJJFJJJJFJJFFFJFJJJF<AF7AJJJJJJJAJJFJJJJAF<AFAFJFFJFFFAFJFAJJJJFFJ7FJJJ
K00103:86:H5JKJBBXX:6:2214:27082:40192        69        *        0        255        *        *        0        0        GTCTTGGGGCCGAAACGATCTCAACCTATTCTCAAACTTTAAATGGGTAAGAAGCCCGGCTCGCTGGCGTGGAGCCGGGCGTGGAATGCGAGTGCCTAGTAAFFFJJJFJJJJJJJFFJJJJJFJJFJJJJJJJJFJF<JJJ<<JFF<AJJFFFAJJJJF7AAJ<<<<7F77A-A77AFJJ7<-FFJ7J<JJAFFFFA-F
K00103:86:H5JKJBBXX:6:2214:26453:40192        133        *        0        255        *        *        0        0        GGGATCTGAACCCGACTCCCTTTCGATCGGCCGAGGGCAACGGAGGCCATCGCCCGTCCCTTCGGAACGGCGCTCGCCCATCTCTCAGGACCGACTGACCA--AAJ7-FJJJJJJJJJJFJJJFFJJFJFFJ<JFJJJJJJFJJJJJJJJJJJJJJJJFFF<JFFJJJJAJJJJJJFJJJJJJJJJJJJJJJJJJJJJJF

wc -l unmapped.sam
31234334 unmapped.sam

cat unmapped.sam | cut -f 1 | sort | uniq | wc -l
18571998

Can we say there are 25% of duplicates in the polyA unmapped file and 40% in the ribo unmapped file?

Quote:

Originally Posted by GenoMax (Post 192957)
2. Even though the samples were prepped there is bound to be some rRNA present. You could align to human rDNA repeat independently and see what you get for the two samples as @Chipper suggested.

Is there a way to try to map the unmapped reads to rRNA?

Quote:

Originally Posted by GenoMax (Post 192957)
3. Mapping should become more precise (harder to multi-map) as length increases so that could be another reason for the reduced alignment. You could trim the HiSeq 4000 reads down to 75 bp and see if you get the same ballpark alignment as NextSeq 500 reads.

When trimming the HiSeq 4000 reads down to 75 bp -for 3 samples only- I got an increase in overall mapping rate of 1.1, 1.6 and 2.5%. Better but not 10% better.

GenoMax 04-27-2016 08:10 AM

Quote:

Originally Posted by Jane M (Post 192966)
I really appreciate your suggestions GenoMax, thank you

I used TopHat2 with default parameters:

Default is 20 locations for multi-mapped reads for TopHat as I recall.
Quote:

It seems that there are reads with the same sequence but different identifiers. What are they?
Those may be the optical duplicates that were generated by pad-hopping (or PCR during prep). You can try to run the Picard MarkDuplicates protocol (including the optical dup marking) and see if they get flagged. I have tried doing this with a limited number of samples from HiSeq 4000 but have not managed to get useful results for the optical part.

Quote:

Is there a way to try to map the unmapped reads to rRNA?
You can use the sequence of the human rDNA repeat found here to map against.

Quote:

When trimming the HiSeq 4000 reads down to 75 bp -for 3 samples only- I got an increase in overall mapping rate of 1.1, 1.6 and 2.5%. Better but not 10% better.
Perhaps you should allow multi-mappers to map at all locations. See if that ups the percentage. An academic exercise :)

Jane M 04-27-2016 09:08 AM

Quote:

Originally Posted by GenoMax (Post 192970)
Default is 20 locations for multi-mapped reads for TopHat as I recall.

Exact

Quote:

Originally Posted by GenoMax (Post 192970)
Those may be the optical duplicates that were generated by pad-hopping (or PCR during prep). You can try to run the Picard MarkDuplicates protocol (including the optical dup marking) and see if they get flagged. I have tried doing this with a limited number of samples from HiSeq 4000 but have not managed to get useful results for the optical part.

You can use the sequence of the human rDNA repeat found here to map against.

Thank you for these suggestions. I will try both, counting the number of reads flagged as optical duplicates and map the unmapped reads to ribosomal DNA, each time in the 2 experiments.

Quote:

Originally Posted by GenoMax (Post 192970)
Perhaps you should allow multi-mappers to map at all locations. See if that ups the percentage. An academic exercise :)

Maybe I will try this after!

Jane M 04-28-2016 07:36 AM

Quote:

Originally Posted by GenoMax (Post 192970)
Those may be the optical duplicates that were generated by pad-hopping (or PCR during prep). You can try to run the Picard MarkDuplicates protocol (including the optical dup marking) and see if they get flagged. I have tried doing this with a limited number of samples from HiSeq 4000 but have not managed to get useful results for the optical part.

I tried your suggestion on both unmapped and mapped reads for one polyA and one ribo sample:

Quote:

java -Xmx6g -jar picard/1.134/picard.jar MarkDuplicates INPUT=$path/unmapped.bam OUTPUT=$path/unmapped_MarkDuplicates.bam METRICS_FILE=unmapped_metrics OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 REMOVE_DUPLICATES=true VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true
Quote:

java -Xmx6g -jar picard/1.134/picard.jar MarkDuplicates INPUT=$path/ARN_POP.bam OUTPUT=$path/ARN_POP_MarkDuplicates.bam METRICS_FILE=mapped_metrics OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 REMOVE_DUPLICATES=true VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true
For both runs, I got:
ribo sample:
Quote:

unmapped_metrics file:
## METRICS CLASS picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Unknown Library 0 0 31234334 0 0 0 ?

mapped_metrics
## METRICS CLASS picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Unknown Library 5909662 58551716 0 5344654 19348992 19322086 0.358032 28585819416
polyA sample:
Quote:

ribo sample:
## METRICS CLASS picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Unknown Library 0 0 2754730 0 0 0 ?

mapped_metrics
## METRICS CLASS picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Unknown Library 1913720 64301596 0 1502675 32452071 28720194 0.508799 157544313
I got results for mapped reads only, maybe because they are mainly paired. The ribo sample, sequenced on HiSeq4000 has less PERCENT_DUPLICATION (36%) than the polyA sample (51%), sequenced on NextSeq 500. I was expected the contrary:confused: And these percentages seem very high. Is it normal?

Quote:

Originally Posted by GenoMax (Post 192970)
You can use the sequence of the human rDNA repeat found here to map against.

To test if more reads correspond to rRNA sequence in ribodepleted sample than in polyA sample, I first ran blast on the first 10 unmapped reads, but on "Nucleotide collection (nr/nt)" database this time. Now, for the ribodepleted sample, 9/10 had significant similarities, each had several similarities, often to 45S/60S, as Chipper suggested. For the polyA sample, again 9/10 had significant similarities, each had several similarities too, but not especially to ribosome.

Then, I converted unmapped.bam with bamToFatq. I downloaded the gtf of rRNA and tRNA from UCSC tables.
After creating a new transcriptome index, I reran the mapping with the preprocessed fastq:
Quote:

tophat/v2.1.0/bin/tophat2 -p 8 --transcriptome-index=rDNATranscriptome/rDNA -o /myfolder/rDNA_tophat_out --read-realign-edit-dist 0 --mate-inner-dist 98 --mate-std-dev 20 --min-anchor-length 4 --splice-mismatches 0 --min-intron-length 50 --max-intron-length 500000 --no-coverage-search --library-type fr-firststrand --min-coverage-intron 50 --max-coverage-intron 20000 --min-segment-intron 50 --max-segment-intron 500000 Bowtie2/myNewBowtie2Index myfolder/unmapped.R1.fastq myfolder/unmapped.R2.fastq
If most of the unmapped reads correspond to rRNA, I should now get a higher mapping rate for ribodepleted sample than for polyA sample. Waiting for these results.

Jane M 05-02-2016 03:35 AM

I wrote 2 posts last week, they do not appear (waiting for validation, I don't know why). I hope this one will be posted.

Quote:

Originally Posted by GenoMax (Post 192970)
You can use the sequence of the human rDNA repeat found here to map against.

I ran two tests:
1) map the unmapped reads as previously, except I change the reference to rDNA fasta sequence.
Then, I got alignment rate of 83.2, 82, 81.6 and 74.8% for ribodepleted samples and 7, 4.4, 2.8 and 4% for polyA samples.

2) map the paired trimmed reads as previously, except I use as reference the rDNA fasta sequence.
Then, I got alignment rate of 24.3, 14.1, 18.1 and 13.3% for ribodepleted samples and 0.5, 0.2, 0.2 and 0.1% for polyA samples.

I read in some posts that 5% was low remaining rDNA sequence after ribodepletion. I have quite higher percentage. What is the classical range?

From these tests, I am convince that the different mapping rates I got between the two experiments come from the rDNA sequence still present in ribodepleted samples that are not reported in hg19.

Thank you for the help you provided me.

GenoMax 05-02-2016 04:23 AM

@Jane_M: Apologize for the posts from last week. I must have missed them. They were indeed waiting for moderation.

Thank you for providing detailed stats. It does appear that the ribo-depletion did not work well. Someone with experimental expertise can comment but % of rRNA should be less than 5% (perhaps 1% or less is ideal).

Interestingly (if I am reading this right) most of your duplicates (>98%) on the 4000 are optical, with a lesser % in regular run.

Jane M 05-02-2016 05:06 AM

Quote:

Originally Posted by GenoMax (Post 193152)
@Jane_M: Apologize for the posts from last week. I must have missed them. They were indeed waiting for moderation.

Thank you for posting them.

Quote:

Originally Posted by GenoMax (Post 193152)
Thank you for providing detailed stats. It does appear that the ribo-depletion did not work well. Someone with experimental expertise can comment but % of rRNA should be less than 5% (perhaps 1% or less is ideal).

I had the impression that 5% was already a good ribodepletion ratio. Let's see if others can give their feedback.


Quote:

Originally Posted by GenoMax (Post 193152)
Interestingly (if I am reading this right) most of your duplicates (>98%) on the 4000 are optical, with a lesser % in regular run.

Well, on HiSeq 4000, more than 99% of the READ_PAIR_DUPLICATES are READ_PAIR_OPTICAL_DUPLICATES. This ratio reaches 88% for NextSeq 500. Are there optical duplicates too? If not, what are they?

Can someone please comment these two ratio: 35% (on ribodepleted sample - HiSeq 4000) and 51% (on polyA sample - NextSeq 500) for PERCENT_DUPLICATION?

GenoMax 05-02-2016 05:32 AM

If MarkDuplicates is working as expected then the duplicates on HiSeq 4000 are real optical dups likely formed by pad-hopping (contamination of nearby nanowells). For NextSeq 500 they probably come from one too many PCR cycles.

One possible cause of the duplicates in general is experimental origin (likely # of PCR cycles).

Jane M 05-02-2016 06:46 AM

Ok, thank you GenoMax.

Jane M 05-09-2016 06:15 AM

Still interested in your percentages of remaining rDNA sequence after ribodepletion :D

I have one more question concerning what are the unmapped reads in the ribodepleted experiments. When looking at the overall mapping rate to hg19 and the overall mapping rate to rDNA for my 4 samples, one can see that Mapping Rate to hg19 + Mapping Rate to rDNA is beyond 100%:

Code:

Sample  MappingRateTohg19  MappingRateTorDNA
Sample 1  79.8  24.3
Sample 2  88.2  14.1
Sample 3  84.7  18.1
Sample 4  87.8  13.3

What are those reads common to hg19 and rDNA: multimapped reads? rDNA sequences included in hg19 (Human ribosomal DNA complete repeating unit - http://www.ncbi.nlm.nih.gov/nuccore/U13369.1 - contains ~43kb. Isn't it too large)? Any feedback?

Thank you


All times are GMT -8. The time now is 07:03 AM.

Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2021, vBulletin Solutions, Inc.