Quote:
Originally Posted by GenoMax
Those may be the optical duplicates that were generated by pad-hopping (or PCR during prep). You can try to run the Picard MarkDuplicates protocol (including the optical dup marking) and see if they get flagged. I have tried doing this with a limited number of samples from HiSeq 4000 but have not managed to get useful results for the optical part.
|
I tried your suggestion on both unmapped and mapped reads for one polyA and one ribo sample:
Quote:
java -Xmx6g -jar picard/1.134/picard.jar MarkDuplicates INPUT=$path/unmapped.bam OUTPUT=$path/unmapped_MarkDuplicates.bam METRICS_FILE=unmapped_metrics OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 REMOVE_DUPLICATES=true VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true
|
Quote:
java -Xmx6g -jar picard/1.134/picard.jar MarkDuplicates INPUT=$path/ARN_POP.bam OUTPUT=$path/ARN_POP_MarkDuplicates.bam METRICS_FILE=mapped_metrics OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500 REMOVE_DUPLICATES=true VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true
|
For both runs, I got:
ribo sample:
Quote:
unmapped_metrics file:
## METRICS CLASS picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Unknown Library 0 0 31234334 0 0 0 ?
mapped_metrics
## METRICS CLASS picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Unknown Library 5909662 58551716 0 5344654 19348992 19322086 0.358032 28585819416
|
polyA sample:
Quote:
ribo sample:
## METRICS CLASS picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Unknown Library 0 0 2754730 0 0 0 ?
mapped_metrics
## METRICS CLASS picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
Unknown Library 1913720 64301596 0 1502675 32452071 28720194 0.508799 157544313
|
I got results for mapped reads only, maybe because they are mainly paired. The ribo sample, sequenced on HiSeq4000 has less PERCENT_DUPLICATION (36%) than the polyA sample (51%), sequenced on NextSeq 500. I was expected the contrary

And these percentages seem very high. Is it normal?
Quote:
Originally Posted by GenoMax
You can use the sequence of the human rDNA repeat found here to map against.
|
To test if more reads correspond to rRNA sequence in ribodepleted sample than in polyA sample, I first ran blast on the first 10 unmapped reads, but on "Nucleotide collection (nr/nt)" database this time. Now, for the ribodepleted sample, 9/10 had significant similarities, each had several similarities, often to 45S/60S, as Chipper suggested. For the polyA sample, again 9/10 had significant similarities, each had several similarities too, but not especially to ribosome.
Then, I converted unmapped.bam with bamToFatq. I downloaded the gtf of rRNA and tRNA from UCSC tables.
After creating a new transcriptome index, I reran the mapping with the preprocessed fastq:
Quote:
tophat/v2.1.0/bin/tophat2 -p 8 --transcriptome-index=rDNATranscriptome/rDNA -o /myfolder/rDNA_tophat_out --read-realign-edit-dist 0 --mate-inner-dist 98 --mate-std-dev 20 --min-anchor-length 4 --splice-mismatches 0 --min-intron-length 50 --max-intron-length 500000 --no-coverage-search --library-type fr-firststrand --min-coverage-intron 50 --max-coverage-intron 20000 --min-segment-intron 50 --max-segment-intron 500000 Bowtie2/myNewBowtie2Index myfolder/unmapped.R1.fastq myfolder/unmapped.R2.fastq
|
If most of the unmapped reads correspond to rRNA, I should now get a higher mapping rate for ribodepleted sample than for polyA sample. Waiting for these results.