Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Mapping rate with PolyA and Ribo-zero selection on NextSeq 500 and HiSeq 4000

    Dear all,

    I have two RNASeq experiments from which I try to understand mapping rate difference. The experiments are pretty different:

    Experiment 1:
    Samples: 19 primary tumors
    RNA selection: TruSeq Stranded mRNA (PolyA)
    Sequencing: Paired-ends, 75bp
    Sequencer: NextSeq 500

    Experiment 2:
    Samples: 16 tumor cell lines
    RNA selection: TruSeq total RNA Stranded (Ribo zero)
    Sequencing: Paired-ends, 100bp
    Sequencer: HiSeq 4000

    Below are the number of reads in each strand after preprocessing and mapping info. Mapping was performed using TopHat v2.1.0 and Bowtie v2.2.5.0 on hg19 reference genome. The average mapping rate is classic for recent experiments. Nevertheless, there is a difference of 10% between both experiments:

    Experiment 1:
    Mean number of reads after Trimmomatic (*_paired files): 67944508.54, SD=8215167.56, range=53282567-88949193
    Overall read mapping rate: 97.71, SD=0.65, range=96-99
    Concordant pair alignment rate: 94.41, SD=1.27, range=91.6-96.8

    Experiment 2:
    Mean number of reads after Trimmomatic (*_paired files): 90733572.69, SD=12822798.96, range=72880790-122816426
    Overall read mapping rate: 87.12, SD=5.16, range=77.9-93.4
    Concordant pair alignment rate: 82.74, SD=5.70, range=72.3-89.8

    I was asked from what was due this difference. I do not have the answer for sure, only hypotheses. I would greatly appreciate your opinion on these hypotheses:

    1. Ribo-zero experiments have commonly lower mapping rate due to the presence of non coding elements that correspond to less conserved regions (=>less well annotated) and/or in repeated regions.
    I spend a couple of hours looking for mapping rate info in similar recent (publications between 2014 and now) experiments without any success. The mapping rate seems to be not reported anymore in the publications - for those that I checked at least.
    2. We can observe a slight lower base quality, especially around the 75th base in the Ribo-zero samples (attached: one representative Ribo zero and one representative PolyA samples after preprocessing). This could reduce the mapping rate.
    Until now, I used 20 as threshold for both experiments from HiSeq and NextSeq. I read yesterday that a threshold of 20 was too high for reads from NextSeq. Can you confirm this? I am trying a more stringent base quality threshold (20 -> 24) on Ribo zero samples during preprocessing.
    3. Different sequencers. Has a higher mapping rate been reported for NextSeq experiments?

    Finally, I used the 75 first bases from 3 samples of the Ribo zero experiment instead of the 100 bases. It increased from 1 to 2.5% the mapping rate, which is a bit surprising for me since mapping is supposed to increase as read length increases. On the other hand, there is this slight default around 75bp in these samples.

    I would really appreciate any feedback on mapping rates with both RNA selections and sequencers, any link to similar analyses or any test to try.
    Thank you in advance,
    Jane
    Attached Files
    Last edited by Jane M; 04-26-2016, 07:24 AM.

  • #2
    I'd guess it's the first of your hypotheses. You could try BLAST/BLAT on some of the unmapped reads to see what they hit...

    Comment


    • #3
      My guess would be that you have more rDNA in the second experiment. The 45S rDNA sequence is not included in hg19.

      Comment


      • #4
        1. HiSeq 4000 generates optical duplicates (due to pad-hopping). Depending on how you are handling your multi-mappers this could explain the difference.
        2. Even though the samples were prepped there is bound to be some rRNA present. You could align to human rDNA repeat independently and see what you get for the two samples as @Chipper suggested.
        3.
        since mapping is supposed to increase as read length increases
        Mapping should become more precise (harder to multi-map) as length increases so that could be another reason for the reduced alignment. You could trim the HiSeq 4000 reads down to 75 bp and see if you get the same ballpark alignment as NextSeq 500 reads.
        Last edited by GenoMax; 04-27-2016, 03:54 AM.

        Comment


        • #5
          Originally posted by fanli View Post
          I'd guess it's the first of your hypotheses. You could try BLAST/BLAT on some of the unmapped reads to see what they hit...
          Thank you fanli for your suggestion.
          I used the Web BLAST interface on the first "polyA" and "ribo" unmapped reads using "Genome (all assemblies top level)" as database and default settings.
          1/10 reads from "polyA" unmapped.sam gets "No significant similarity found", the other 9 reads had pretty good similaries on the full read or on part of the read.
          1/10 reads from "ribo" unmapped.sam gets similaries and the 9 other gets "No significant similarity found".

          Is "Genome (all assemblies top level)" not the whole genome, with missing non coding regions where my "ribo" reads match? Or do they contain too many sequencing errors?

          Comment


          • #6
            Originally posted by Chipper View Post
            My guess would be that you have more rDNA in the second experiment. The 45S rDNA sequence is not included in hg19.
            Thank you Chipper for your help. Are these sequences present in hg38?
            Is it possible to test this hypothesis without mapping the whole transcriptome to hg38?

            Comment


            • #7
              I really appreciate your suggestions GenoMax, thank you

              Originally posted by GenoMax View Post
              1. HiSeq 4000 generates optical duplicates (due to pad-hopping). Depending on how you are handling your multi-mappers this could explain the difference.
              I used TopHat2 with default parameters:
              tophat/v2.1.0/bin/tophat2 -p 8 --transcriptome-index=/data/KnownTranscriptome/Known -o /data/2_T1-ARN/tophat_out --read-realign-edit-dist 0 --mate-inner-dist 97 --mate-std-dev 20 --min-anchor-length 4 --splice-mismatches 0 --min-intron-length 50 --max-intron-length 500000 --no-coverage-search --library-type fr-firststrand --min-coverage-intron 50 --max-coverage-intron 20000 --min-segment-intron 50 --max-segment-intron 500000 /data/bioinfo/hg19/Bowtie2/2.2.5_20150629/Base_chr/hg19 /data/RNASeq/2_T1-ARN/2_T1-ARN_paired.R1.fastq /data/RNASeq/2_T1-ARN/2_T1-ARN_paired.R2.fastq
              I was wondering if something was happening with duplicates based on some tests:
              One polyA unmapped.sam file:
              Code:
              head unmapped.sam
              NS500197:119:HG2Y7BGXX:1:11101:24683:1054	69	*	0	255	*	*	0	0	CTCCANTTTAATTTCCTTTAACAAATTTTCTGTACTTTTACTTTCCATCCAAACAGTAACTTATAAATTATTATT	AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
              NS500197:119:HG2Y7BGXX:1:13311:17866:19393	69	*	0	255	*	*	0	0	TAGTACTTCAGTGATTTGCCAATGCCCTCTAAGTCCCGTAACCTGTTGTCAATCTGCACATGAACATTTTGCAGA	AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
              NS500197:119:HG2Y7BGXX:2:13311:16650:6888	69	*	0	255	*	*	0	0	CTGGGCTGGTGTTGAGTGTCTGCAGCTTTTCCAGGCACAGAGTGCAGGCTGTCAGTGGATCTACCATCCTGGAGT	AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
              NS500197:119:HG2Y7BGXX:4:13610:15232:11614	69	*	0	255	*	*	0	0	CCCACTTGATTTTGGAGGGATCTCGCTCCTGGAAGATGGTGATGGGATTTCCATTGATGACAAGCTTCCCGAGAT	AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
              NS500197:119:HG2Y7BGXX:4:13610:26394:11583	133	*	0	255	*	*	0	0	TTTCTGAGCCTTTTACCACTCCAGCCTAGCCCCTACCCCCCAATTTGGCGGGCGCGGACCCC	AAAAAEEEEEEEEEEEEAEEEEEEEEEEEEEEE//E6AAAA/A///EE/EEE//A/E///EE
              NS500197:119:HG2Y7BGXX:1:11101:17376:1060	69	*	0	255	*	*	0	0	AGCCTNTTGGGGGTCAGTGCCCTCCTAATTGGGGGGTAGGGGCTAGGCTGGAGTGGTAAAAGGCTCAGAAAAATC	AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEE
              NS500197:119:HG2Y7BGXX:1:13311:20274:19398	69	*	0	255	*	*	0	0	GCCAGCTCTCTCCTGGTGTGCTCGTCTTGCCTCAGGTCCTTCTTATTCCCCACCAGGATGATGGGCACGAGATCG	AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
              NS500197:119:HG2Y7BGXX:2:13311:9800:6890	69	*	0	255	*	*	0	0	CAGGTAGAGCAGGAGCCAGTGAAGAGCAGGCCCTGGATGGGTGGGAGATCGGAAGAGCACACGTCTGAACTCCAG	AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
              NS500197:119:HG2Y7BGXX:4:13610:3313:11628	69	*	0	255	*	*	0	0	CCCCCGTTGTGGTCCCAGGTCCTAAACAGCAGGGCTAGTCCCAGAAAGACGACCCCCCCACGGAAGGGCTTTAGC	AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
              NS500197:119:HG2Y7BGXX:4:13610:5379:11596	133	*	0	255	*	*	0	0	CATGTAGTTAATCTGACATGTCAAAATTGGGAAGACTGTAACCTTTTTTTTTTTTTTTTTTT	AAA/AA/EEAEA/EEEEAEEEAEEE/E//E/EE66/EEAEE/AEEEE<A/EE6EEEA<EEEE
              
              wc -l unmapped.sam
              2754730 unmapped.sam
              
              cat unmapped.sam | cut -f1 | sort | uniq | wc -l
              2334225
              
              cat unmapped.sam | cut -f10 | sort | uniq | wc -l
              2098310
              Several reads have the same sequence. Are they duplicates?
              I don't understand why there are several reads with a same identifier since there is no multi-mapped read in unmapped sam.
              It seems that there are reads with the same sequence but different identifiers. What are they?

              Same info for one ribo unmapped.sam file:

              Code:
              head unmapped.sam 
              K00103:86:H5JKJBBXX:6:1101:1489:1086	69	*	0	255	*	*	0	0	GTGGAGATGGGCGCCGCGAGGCGTCCAGTGCGGTAACGCGACCGATCCCGGAGAAGCCGGCGGGAGCCCCGGGGAGAGTTCTCTTTTCTTTGTGAAGGG	AAFFJJJJJJJJJJFJJJJFFJJJJJAJJJJJJFFJFFJJJJJJJJJJJFFFAFFAJFAAA7-7AF<FAAJAAFFFJF-AAJJ<<FJJAJFJJFJJJAF
              K00103:86:H5JKJBBXX:6:1114:8298:40526	69	*	0	255	*	*	0	0	TGGCATGCTAACTAGTTACGCGACCCCCGAGCGGTCGGCGTCCCCCAACTTCTTAGAGGGACAAGTGGCGTTCAGCCACCCGAGATTGAGCAATAACAGGAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJFJJJJJJ<FF-FJJFJJJFJFJFJF7<FJ7<77FF<7FFFF<JAJFJFFFJJJJJJFJJF
              K00103:86:H5JKJBBXX:6:1214:15696:30661	69	*	0	255	*	*	0	0	CCTGGCGCTAAACCATTCGTAGACGACCTGCTTCTGGGTCGGGGTTTCGTACGTAGCAGAGCAGCTCCCTCGCTGCGATCTATTGAAAGTCAGCCCTCGAAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJAJJJJJJJJJJJJJJJJJJJJJJJ
              K00103:86:H5JKJBBXX:6:2214:26453:40192	69	*	0	255	*	*	0	0	CTTGGTGGTAGTAGCAAATATTCAAACGAGAACTTTGAAGGCCGAAGTGGAGAAGGCTTCCATGTGAACAGCAGTTGAACATGGGTCAGTCGGTCCTGAGAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJFJJJJFF<JJJJJJJ<-<<FJFJFF7F<FFJJJFFJJJJJJJ
              K00103:86:H5JKJBBXX:6:2214:25621:40192	133	*	0	255	*	*	0	0	CGAGGGCTGACTTTCAATAGATCGCAGCGAGGGAGCTGCTCTGCTACGTACGAAACCCCGACCCAGAAGCAGGTCGTCTACGAATGGTTTAGCGCCAGGTAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJAJ
              K00103:86:H5JKJBBXX:6:1101:1509:1086	69	*	0	255	*	*	0	0	GAAAGACTAATCGAACCATCTAGTAGCTGGTTCCCTCCGAAGTTTCCCTCAGGATAGCTGGC	AAAFFJFFJJJJFJFFAJJJJJJJJAFAJAAFFFF<<FJJ-FJJFJFJ<<AAJA<J7-F-<<
              K00103:86:H5JKJBBXX:6:1114:8501:40526	69	*	0	255	*	*	0	0	CACCACACAGCGCGGCCAGGGCAGAGTGGGCCGCCGCCGCCGCCCAGAGACCGAGGCTGCCTTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATCAAFFFJJJFJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJFJJJJJFJJJJJJJFFFJJFFFFJJJJAJFFJFA<JJJJJJJJJJJJJJFJJJ
              K00103:86:H5JKJBBXX:6:1214:16122:30661	69	*	0	255	*	*	0	0	CGGCGTCGCTCGTGGGGGGCCCAAGTCCTTCTGATCGAGGCCCAGCCCGTGGACGGTGTGAGGCCGGTAGCGGCCCCCGGCGCGCCGGGCCCGGGTCTTCAAFFFJJJJJJJJJJJJJJJJJJJJJJJJFJJFJJJJFJJFFFJFJJJF<AF7AJJJJJJJAJJFJJJJAF<AFAFJFFJFFFAFJFAJJJJFFJ7FJJJ
              K00103:86:H5JKJBBXX:6:2214:27082:40192	69	*	0	255	*	*	0	0	GTCTTGGGGCCGAAACGATCTCAACCTATTCTCAAACTTTAAATGGGTAAGAAGCCCGGCTCGCTGGCGTGGAGCCGGGCGTGGAATGCGAGTGCCTAGTAAFFFJJJFJJJJJJJFFJJJJJFJJFJJJJJJJJFJF<JJJ<<JFF<AJJFFFAJJJJF7AAJ<<<<7F77A-A77AFJJ7<-FFJ7J<JJAFFFFA-F
              K00103:86:H5JKJBBXX:6:2214:26453:40192	133	*	0	255	*	*	0	0	GGGATCTGAACCCGACTCCCTTTCGATCGGCCGAGGGCAACGGAGGCCATCGCCCGTCCCTTCGGAACGGCGCTCGCCCATCTCTCAGGACCGACTGACCA--AAJ7-FJJJJJJJJJJFJJJFFJJFJFFJ<JFJJJJJJFJJJJJJJJJJJJJJJJFFF<JFFJJJJAJJJJJJFJJJJJJJJJJJJJJJJJJJJJJF
              
              wc -l unmapped.sam
              31234334 unmapped.sam
              
              cat unmapped.sam | cut -f 1 | sort | uniq | wc -l
              18571998
              Can we say there are 25% of duplicates in the polyA unmapped file and 40% in the ribo unmapped file?

              Originally posted by GenoMax View Post
              2. Even though the samples were prepped there is bound to be some rRNA present. You could align to human rDNA repeat independently and see what you get for the two samples as @Chipper suggested.
              Is there a way to try to map the unmapped reads to rRNA?

              Originally posted by GenoMax View Post
              3. Mapping should become more precise (harder to multi-map) as length increases so that could be another reason for the reduced alignment. You could trim the HiSeq 4000 reads down to 75 bp and see if you get the same ballpark alignment as NextSeq 500 reads.
              When trimming the HiSeq 4000 reads down to 75 bp -for 3 samples only- I got an increase in overall mapping rate of 1.1, 1.6 and 2.5%. Better but not 10% better.

              Comment


              • #8
                Originally posted by Jane M View Post
                I really appreciate your suggestions GenoMax, thank you

                I used TopHat2 with default parameters:
                Default is 20 locations for multi-mapped reads for TopHat as I recall.
                It seems that there are reads with the same sequence but different identifiers. What are they?
                Those may be the optical duplicates that were generated by pad-hopping (or PCR during prep). You can try to run the Picard MarkDuplicates protocol (including the optical dup marking) and see if they get flagged. I have tried doing this with a limited number of samples from HiSeq 4000 but have not managed to get useful results for the optical part.

                Is there a way to try to map the unmapped reads to rRNA?
                You can use the sequence of the human rDNA repeat found here to map against.

                When trimming the HiSeq 4000 reads down to 75 bp -for 3 samples only- I got an increase in overall mapping rate of 1.1, 1.6 and 2.5%. Better but not 10% better.
                Perhaps you should allow multi-mappers to map at all locations. See if that ups the percentage. An academic exercise

                Comment


                • #9
                  Originally posted by GenoMax View Post
                  Default is 20 locations for multi-mapped reads for TopHat as I recall.
                  Exact

                  Originally posted by GenoMax View Post
                  Those may be the optical duplicates that were generated by pad-hopping (or PCR during prep). You can try to run the Picard MarkDuplicates protocol (including the optical dup marking) and see if they get flagged. I have tried doing this with a limited number of samples from HiSeq 4000 but have not managed to get useful results for the optical part.

                  You can use the sequence of the human rDNA repeat found here to map against.
                  Thank you for these suggestions. I will try both, counting the number of reads flagged as optical duplicates and map the unmapped reads to ribosomal DNA, each time in the 2 experiments.

                  Originally posted by GenoMax View Post
                  Perhaps you should allow multi-mappers to map at all locations. See if that ups the percentage. An academic exercise
                  Maybe I will try this after!

                  Comment


                  • #10
                    Originally posted by GenoMax View Post
                    Those may be the optical duplicates that were generated by pad-hopping (or PCR during prep). You can try to run the Picard MarkDuplicates protocol (including the optical dup marking) and see if they get flagged. I have tried doing this with a limited number of samples from HiSeq 4000 but have not managed to get useful results for the optical part.
                    I tried your suggestion on both unmapped and mapped reads for one polyA and one ribo sample:

                    java -Xmx6g -jar picard/1.134/picard.jar MarkDuplicates INPUT=$path/unmapped.bam OUTPUT=$path/unmapped_MarkDuplicates.bam METRICS_FILE=unmapped_metrics OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 REMOVE_DUPLICATES=true VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true
                    java -Xmx6g -jar picard/1.134/picard.jar MarkDuplicates INPUT=$path/ARN_POP.bam OUTPUT=$path/ARN_POP_MarkDuplicates.bam METRICS_FILE=mapped_metrics OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 REMOVE_DUPLICATES=true VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true
                    For both runs, I got:
                    ribo sample:
                    unmapped_metrics file:
                    ## METRICS CLASS picard.sam.DuplicationMetrics
                    LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
                    Unknown Library 0 0 31234334 0 0 0 ?

                    mapped_metrics
                    ## METRICS CLASS picard.sam.DuplicationMetrics
                    LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
                    Unknown Library 5909662 58551716 0 5344654 19348992 19322086 0.358032 28585819416
                    polyA sample:
                    ribo sample:
                    ## METRICS CLASS picard.sam.DuplicationMetrics
                    LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
                    Unknown Library 0 0 2754730 0 0 0 ?

                    mapped_metrics
                    ## METRICS CLASS picard.sam.DuplicationMetrics
                    LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
                    Unknown Library 1913720 64301596 0 1502675 32452071 28720194 0.508799 157544313
                    I got results for mapped reads only, maybe because they are mainly paired. The ribo sample, sequenced on HiSeq4000 has less PERCENT_DUPLICATION (36%) than the polyA sample (51%), sequenced on NextSeq 500. I was expected the contrary And these percentages seem very high. Is it normal?

                    Originally posted by GenoMax View Post
                    You can use the sequence of the human rDNA repeat found here to map against.
                    To test if more reads correspond to rRNA sequence in ribodepleted sample than in polyA sample, I first ran blast on the first 10 unmapped reads, but on "Nucleotide collection (nr/nt)" database this time. Now, for the ribodepleted sample, 9/10 had significant similarities, each had several similarities, often to 45S/60S, as Chipper suggested. For the polyA sample, again 9/10 had significant similarities, each had several similarities too, but not especially to ribosome.

                    Then, I converted unmapped.bam with bamToFatq. I downloaded the gtf of rRNA and tRNA from UCSC tables.
                    After creating a new transcriptome index, I reran the mapping with the preprocessed fastq:
                    tophat/v2.1.0/bin/tophat2 -p 8 --transcriptome-index=rDNATranscriptome/rDNA -o /myfolder/rDNA_tophat_out --read-realign-edit-dist 0 --mate-inner-dist 98 --mate-std-dev 20 --min-anchor-length 4 --splice-mismatches 0 --min-intron-length 50 --max-intron-length 500000 --no-coverage-search --library-type fr-firststrand --min-coverage-intron 50 --max-coverage-intron 20000 --min-segment-intron 50 --max-segment-intron 500000 Bowtie2/myNewBowtie2Index myfolder/unmapped.R1.fastq myfolder/unmapped.R2.fastq
                    If most of the unmapped reads correspond to rRNA, I should now get a higher mapping rate for ribodepleted sample than for polyA sample. Waiting for these results.

                    Comment


                    • #11
                      I wrote 2 posts last week, they do not appear (waiting for validation, I don't know why). I hope this one will be posted.

                      Originally posted by GenoMax View Post
                      You can use the sequence of the human rDNA repeat found here to map against.
                      I ran two tests:
                      1) map the unmapped reads as previously, except I change the reference to rDNA fasta sequence.
                      Then, I got alignment rate of 83.2, 82, 81.6 and 74.8% for ribodepleted samples and 7, 4.4, 2.8 and 4% for polyA samples.

                      2) map the paired trimmed reads as previously, except I use as reference the rDNA fasta sequence.
                      Then, I got alignment rate of 24.3, 14.1, 18.1 and 13.3% for ribodepleted samples and 0.5, 0.2, 0.2 and 0.1% for polyA samples.

                      I read in some posts that 5% was low remaining rDNA sequence after ribodepletion. I have quite higher percentage. What is the classical range?

                      From these tests, I am convince that the different mapping rates I got between the two experiments come from the rDNA sequence still present in ribodepleted samples that are not reported in hg19.

                      Thank you for the help you provided me.

                      Comment


                      • #12
                        @Jane_M: Apologize for the posts from last week. I must have missed them. They were indeed waiting for moderation.

                        Thank you for providing detailed stats. It does appear that the ribo-depletion did not work well. Someone with experimental expertise can comment but % of rRNA should be less than 5% (perhaps 1% or less is ideal).

                        Interestingly (if I am reading this right) most of your duplicates (>98%) on the 4000 are optical, with a lesser % in regular run.

                        Comment


                        • #13
                          Originally posted by GenoMax View Post
                          @Jane_M: Apologize for the posts from last week. I must have missed them. They were indeed waiting for moderation.
                          Thank you for posting them.

                          Originally posted by GenoMax View Post
                          Thank you for providing detailed stats. It does appear that the ribo-depletion did not work well. Someone with experimental expertise can comment but % of rRNA should be less than 5% (perhaps 1% or less is ideal).
                          I had the impression that 5% was already a good ribodepletion ratio. Let's see if others can give their feedback.


                          Originally posted by GenoMax View Post
                          Interestingly (if I am reading this right) most of your duplicates (>98%) on the 4000 are optical, with a lesser % in regular run.
                          Well, on HiSeq 4000, more than 99% of the READ_PAIR_DUPLICATES are READ_PAIR_OPTICAL_DUPLICATES. This ratio reaches 88% for NextSeq 500. Are there optical duplicates too? If not, what are they?

                          Can someone please comment these two ratio: 35% (on ribodepleted sample - HiSeq 4000) and 51% (on polyA sample - NextSeq 500) for PERCENT_DUPLICATION?
                          Last edited by Jane M; 05-02-2016, 04:10 AM.

                          Comment


                          • #14
                            If MarkDuplicates is working as expected then the duplicates on HiSeq 4000 are real optical dups likely formed by pad-hopping (contamination of nearby nanowells). For NextSeq 500 they probably come from one too many PCR cycles.

                            One possible cause of the duplicates in general is experimental origin (likely # of PCR cycles).

                            Comment


                            • #15
                              Ok, thank you GenoMax.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Strategies for Sequencing Challenging Samples
                                by seqadmin


                                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                                03-22-2024, 06:39 AM
                              • seqadmin
                                Techniques and Challenges in Conservation Genomics
                                by seqadmin



                                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                                Avian Conservation
                                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                                03-08-2024, 10:41 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 06:37 PM
                              0 responses
                              10 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 06:07 PM
                              0 responses
                              9 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-22-2024, 10:03 AM
                              0 responses
                              49 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 03-21-2024, 07:32 AM
                              0 responses
                              67 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X