View Single Post
Old 04-27-2016, 06:11 AM   #7
Jane M
Senior Member
 
Location: Paris

Join Date: Aug 2011
Posts: 237
Default

I really appreciate your suggestions GenoMax, thank you

Quote:
Originally Posted by GenoMax View Post
1. HiSeq 4000 generates optical duplicates (due to pad-hopping). Depending on how you are handling your multi-mappers this could explain the difference.
I used TopHat2 with default parameters:
Quote:
tophat/v2.1.0/bin/tophat2 -p 8 --transcriptome-index=/data/KnownTranscriptome/Known -o /data/2_T1-ARN/tophat_out --read-realign-edit-dist 0 --mate-inner-dist 97 --mate-std-dev 20 --min-anchor-length 4 --splice-mismatches 0 --min-intron-length 50 --max-intron-length 500000 --no-coverage-search --library-type fr-firststrand --min-coverage-intron 50 --max-coverage-intron 20000 --min-segment-intron 50 --max-segment-intron 500000 /data/bioinfo/hg19/Bowtie2/2.2.5_20150629/Base_chr/hg19 /data/RNASeq/2_T1-ARN/2_T1-ARN_paired.R1.fastq /data/RNASeq/2_T1-ARN/2_T1-ARN_paired.R2.fastq
I was wondering if something was happening with duplicates based on some tests:
One polyA unmapped.sam file:
Code:
head unmapped.sam
NS500197:119:HG2Y7BGXX:1:11101:24683:1054	69	*	0	255	*	*	0	0	CTCCANTTTAATTTCCTTTAACAAATTTTCTGTACTTTTACTTTCCATCCAAACAGTAACTTATAAATTATTATT	AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
NS500197:119:HG2Y7BGXX:1:13311:17866:19393	69	*	0	255	*	*	0	0	TAGTACTTCAGTGATTTGCCAATGCCCTCTAAGTCCCGTAACCTGTTGTCAATCTGCACATGAACATTTTGCAGA	AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
NS500197:119:HG2Y7BGXX:2:13311:16650:6888	69	*	0	255	*	*	0	0	CTGGGCTGGTGTTGAGTGTCTGCAGCTTTTCCAGGCACAGAGTGCAGGCTGTCAGTGGATCTACCATCCTGGAGT	AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
NS500197:119:HG2Y7BGXX:4:13610:15232:11614	69	*	0	255	*	*	0	0	CCCACTTGATTTTGGAGGGATCTCGCTCCTGGAAGATGGTGATGGGATTTCCATTGATGACAAGCTTCCCGAGAT	AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
NS500197:119:HG2Y7BGXX:4:13610:26394:11583	133	*	0	255	*	*	0	0	TTTCTGAGCCTTTTACCACTCCAGCCTAGCCCCTACCCCCCAATTTGGCGGGCGCGGACCCC	AAAAAEEEEEEEEEEEEAEEEEEEEEEEEEEEE//E6AAAA/A///EE/EEE//A/E///EE
NS500197:119:HG2Y7BGXX:1:11101:17376:1060	69	*	0	255	*	*	0	0	AGCCTNTTGGGGGTCAGTGCCCTCCTAATTGGGGGGTAGGGGCTAGGCTGGAGTGGTAAAAGGCTCAGAAAAATC	AAAAA#EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAAEE
NS500197:119:HG2Y7BGXX:1:13311:20274:19398	69	*	0	255	*	*	0	0	GCCAGCTCTCTCCTGGTGTGCTCGTCTTGCCTCAGGTCCTTCTTATTCCCCACCAGGATGATGGGCACGAGATCG	AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
NS500197:119:HG2Y7BGXX:2:13311:9800:6890	69	*	0	255	*	*	0	0	CAGGTAGAGCAGGAGCCAGTGAAGAGCAGGCCCTGGATGGGTGGGAGATCGGAAGAGCACACGTCTGAACTCCAG	AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
NS500197:119:HG2Y7BGXX:4:13610:3313:11628	69	*	0	255	*	*	0	0	CCCCCGTTGTGGTCCCAGGTCCTAAACAGCAGGGCTAGTCCCAGAAAGACGACCCCCCCACGGAAGGGCTTTAGC	AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
NS500197:119:HG2Y7BGXX:4:13610:5379:11596	133	*	0	255	*	*	0	0	CATGTAGTTAATCTGACATGTCAAAATTGGGAAGACTGTAACCTTTTTTTTTTTTTTTTTTT	AAA/AA/EEAEA/EEEEAEEEAEEE/E//E/EE66/EEAEE/AEEEE<A/EE6EEEA<EEEE

wc -l unmapped.sam
2754730 unmapped.sam

cat unmapped.sam | cut -f1 | sort | uniq | wc -l
2334225

cat unmapped.sam | cut -f10 | sort | uniq | wc -l
2098310
Several reads have the same sequence. Are they duplicates?
I don't understand why there are several reads with a same identifier since there is no multi-mapped read in unmapped sam.
It seems that there are reads with the same sequence but different identifiers. What are they?

Same info for one ribo unmapped.sam file:

Code:
head unmapped.sam 
K00103:86:H5JKJBBXX:6:1101:1489:1086	69	*	0	255	*	*	0	0	GTGGAGATGGGCGCCGCGAGGCGTCCAGTGCGGTAACGCGACCGATCCCGGAGAAGCCGGCGGGAGCCCCGGGGAGAGTTCTCTTTTCTTTGTGAAGGG	AAFFJJJJJJJJJJFJJJJFFJJJJJAJJJJJJFFJFFJJJJJJJJJJJFFFAFFAJFAAA7-7AF<FAAJAAFFFJF-AAJJ<<FJJAJFJJFJJJAF
K00103:86:H5JKJBBXX:6:1114:8298:40526	69	*	0	255	*	*	0	0	TGGCATGCTAACTAGTTACGCGACCCCCGAGCGGTCGGCGTCCCCCAACTTCTTAGAGGGACAAGTGGCGTTCAGCCACCCGAGATTGAGCAATAACAGGAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJFJJJJJJ<FF-FJJFJJJFJFJFJF7<FJ7<77FF<7FFFF<JAJFJFFFJJJJJJFJJF
K00103:86:H5JKJBBXX:6:1214:15696:30661	69	*	0	255	*	*	0	0	CCTGGCGCTAAACCATTCGTAGACGACCTGCTTCTGGGTCGGGGTTTCGTACGTAGCAGAGCAGCTCCCTCGCTGCGATCTATTGAAAGTCAGCCCTCGAAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJAJJJJJJJJJJJJJJJJJJJJJJJ
K00103:86:H5JKJBBXX:6:2214:26453:40192	69	*	0	255	*	*	0	0	CTTGGTGGTAGTAGCAAATATTCAAACGAGAACTTTGAAGGCCGAAGTGGAGAAGGCTTCCATGTGAACAGCAGTTGAACATGGGTCAGTCGGTCCTGAGAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJFJJJJFF<JJJJJJJ<-<<FJFJFF7F<FFJJJFFJJJJJJJ
K00103:86:H5JKJBBXX:6:2214:25621:40192	133	*	0	255	*	*	0	0	CGAGGGCTGACTTTCAATAGATCGCAGCGAGGGAGCTGCTCTGCTACGTACGAAACCCCGACCCAGAAGCAGGTCGTCTACGAATGGTTTAGCGCCAGGTAAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJAJ
K00103:86:H5JKJBBXX:6:1101:1509:1086	69	*	0	255	*	*	0	0	GAAAGACTAATCGAACCATCTAGTAGCTGGTTCCCTCCGAAGTTTCCCTCAGGATAGCTGGC	AAAFFJFFJJJJFJFFAJJJJJJJJAFAJAAFFFF<<FJJ-FJJFJFJ<<AAJA<J7-F-<<
K00103:86:H5JKJBBXX:6:1114:8501:40526	69	*	0	255	*	*	0	0	CACCACACAGCGCGGCCAGGGCAGAGTGGGCCGCCGCCGCCGCCCAGAGACCGAGGCTGCCTTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACATCAAFFFJJJFJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJFJJJJJFJJJJJJJFFFJJFFFFJJJJAJFFJFA<JJJJJJJJJJJJJJFJJJ
K00103:86:H5JKJBBXX:6:1214:16122:30661	69	*	0	255	*	*	0	0	CGGCGTCGCTCGTGGGGGGCCCAAGTCCTTCTGATCGAGGCCCAGCCCGTGGACGGTGTGAGGCCGGTAGCGGCCCCCGGCGCGCCGGGCCCGGGTCTTCAAFFFJJJJJJJJJJJJJJJJJJJJJJJJFJJFJJJJFJJFFFJFJJJF<AF7AJJJJJJJAJJFJJJJAF<AFAFJFFJFFFAFJFAJJJJFFJ7FJJJ
K00103:86:H5JKJBBXX:6:2214:27082:40192	69	*	0	255	*	*	0	0	GTCTTGGGGCCGAAACGATCTCAACCTATTCTCAAACTTTAAATGGGTAAGAAGCCCGGCTCGCTGGCGTGGAGCCGGGCGTGGAATGCGAGTGCCTAGTAAFFFJJJFJJJJJJJFFJJJJJFJJFJJJJJJJJFJF<JJJ<<JFF<AJJFFFAJJJJF7AAJ<<<<7F77A-A77AFJJ7<-FFJ7J<JJAFFFFA-F
K00103:86:H5JKJBBXX:6:2214:26453:40192	133	*	0	255	*	*	0	0	GGGATCTGAACCCGACTCCCTTTCGATCGGCCGAGGGCAACGGAGGCCATCGCCCGTCCCTTCGGAACGGCGCTCGCCCATCTCTCAGGACCGACTGACCA--AAJ7-FJJJJJJJJJJFJJJFFJJFJFFJ<JFJJJJJJFJJJJJJJJJJJJJJJJFFF<JFFJJJJAJJJJJJFJJJJJJJJJJJJJJJJJJJJJJF

wc -l unmapped.sam
31234334 unmapped.sam

cat unmapped.sam | cut -f 1 | sort | uniq | wc -l
18571998
Can we say there are 25% of duplicates in the polyA unmapped file and 40% in the ribo unmapped file?

Quote:
Originally Posted by GenoMax View Post
2. Even though the samples were prepped there is bound to be some rRNA present. You could align to human rDNA repeat independently and see what you get for the two samples as @Chipper suggested.
Is there a way to try to map the unmapped reads to rRNA?

Quote:
Originally Posted by GenoMax View Post
3. Mapping should become more precise (harder to multi-map) as length increases so that could be another reason for the reduced alignment. You could trim the HiSeq 4000 reads down to 75 bp and see if you get the same ballpark alignment as NextSeq 500 reads.
When trimming the HiSeq 4000 reads down to 75 bp -for 3 samples only- I got an increase in overall mapping rate of 1.1, 1.6 and 2.5%. Better but not 10% better.
Jane M is offline   Reply With Quote