SEQanswers

Go Back   SEQanswers > Bioinformatics > Bioinformatics



Similar Threads
Thread Thread Starter Forum Replies Last Post
Samtools flagstat 0% properly paired jorhodes Bioinformatics 5 01-07-2014 09:41 AM
what does “properly paired” mean? raving Illumina/Solexa 2 07-17-2013 11:54 PM
Finding unmapped paired-end mates with samtools newguy Bioinformatics 3 09-11-2012 12:04 PM
can paired-end mates overlap? cmd Bioinformatics 5 03-05-2012 01:21 PM
Improperly paired mates spikesd17 Bioinformatics 2 02-07-2011 12:17 PM

Reply
 
Thread Tools
Old 03-14-2014, 12:56 PM   #1
emolinari
Member
 
Location: New haven

Join Date: May 2013
Posts: 47
Default Properly paired mates

Hi guys!

I have a weird problem to solve.
I have recently run an RNAseq pipeline on human cell line (polyA enriched, 75bp, PE, ~30 mil reads each).

12 samples out of 16 showed great alignment rates and library QC.
The remaining 4 instead have a ridiculously low properly mate pairing, like this one:

64544673 in total
0 QC failure
0 duplicates
64544673 mapped (100.00%)
64544673 paired in sequencing
33238851 read1
31305822 read2
14676 properly paired (0.02%)
57821112 with itself and mate mapped
6723561 singletons (10.42%)
54777510 with mate mapped to a different chr
44604000 with mate mapped to a different chr (mapQ>=5)
Sample_TK1_002_flagstat.txt (END)

Initially I thought it was a trivial issue, and I have proceeded with differential expression analysis. Unfortunately I have found out that these 4 samples were clustering together although belonging to 3 different cohorts.
Then I re-checked my work flow and realized that the Genome Center that performed the library prep and sequencing had to fund 2/3 sequencing for these samples, as they were providing a low yield (of reads).
Other thing to add, when trying to run RNASeQC on Gene Pattern, I cannot Mark Duplicates for these samples too (error mess:

net.sf.picard.PicardException: Value was put into PairInfoMap more than once. 1: Flowcell 1 Older_P3:BFC08P1:257:C38GUACXX:8:1108:18368:35984
at net.sf.picard.sam.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:124)
at net.sf.picard.sam.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:78)
at net.sf.picard.sam.DiskReadEndsMap.remove(DiskReadEndsMap.java:61)
at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:294)
at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:117)
at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:169)
at org.genepattern.sam.MarkDuplicatesWrapper.main(MarkDuplicatesWrapper.java:83)


I am sorry if all this sounds confusing... I am quite desperate in understanding what's going on.
I know that there is some samtools available for fixing mate pairing...would it be the solution?

Thanks to all of you!

Manu
emolinari is offline   Reply With Quote
Old 03-14-2014, 12:59 PM   #2
emolinari
Member
 
Location: New haven

Join Date: May 2013
Posts: 47
Default

Then I re-checked my work flow and realized that the Genome Center that performed the library prep and sequencing had to fund 2/3 sequencing for these samples, as they were providing a low yield (of reads).

Sorry there's a typo: the Genome center re-performed the sequencing a couple of times on my samples due to low reads yield!
emolinari is offline   Reply With Quote
Old 03-14-2014, 01:22 PM   #3
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Looks like you are mapping to a highly broken assembly, since most of the reads map to different scaffolds. Or else the read1 and read2 files don't go together. What did you map to, normal HG19 or something else? What mapper did you use? And are you sure the read1 and read2 files go together? Look at the read names and make sure they match.
Brian Bushnell is offline   Reply With Quote
Old 03-14-2014, 01:32 PM   #4
emolinari
Member
 
Location: New haven

Join Date: May 2013
Posts: 47
Default

Quote:
Originally Posted by Brian Bushnell View Post
Looks like you are mapping to a highly broken assembly, since most of the reads map to different scaffolds. Or else the read1 and read2 files don't go together. What did you map to, normal HG19 or something else? What mapper did you use? And are you sure the read1 and read2 files go together? Look at the read names and make sure they match.

Hi Brian!
thanks for the reply. I have mapped my samples with TopHat2, using normal hg19. I guess that it is not a broken hg19(got it from iGenomes), as it worked just fine for the others.
I have checked the read identifiers and they seem to agree...although I have just briefly looked at them, I haven't written any string to check if they were matching. Any suggestions how to?

Thank you!!!
Manu
emolinari is offline   Reply With Quote
Old 03-14-2014, 01:55 PM   #5
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

So, the weird thing is that HG19 only has 25 normal scaffolds and a handful of small extra scaffolds. You can't have 99.98% of pairs landing on different scaffolds by chance, even if you had mismatched files for read1 and read2. The only way I can see it happening is if, for example, every read1 was good, but every read2 mapped to the same location on a tiny scaffold (like mito).

I suggest you pileup and see if some tiny scaffold has immense coverage (or you can just visually look at the unsorted sam file and see where things mapped, to see if there's a pattern). Also, look at the read quality profiles and per-base ACGTN counts, and see if they look different than other libraries.

If you want to examine read names, just use "head" and "tail" on each of the fastq files, too see that the names are identical (except for e.g. " /1" and " /2"). Also use wc to ensure both files have the same number of lines.

Last edited by Brian Bushnell; 03-14-2014 at 02:16 PM.
Brian Bushnell is offline   Reply With Quote
Old 03-24-2014, 07:41 AM   #6
emolinari
Member
 
Location: New haven

Join Date: May 2013
Posts: 47
Default

Quote:
Originally Posted by Brian Bushnell View Post
So, the weird thing is that HG19 only has 25 normal scaffolds and a handful of small extra scaffolds. You can't have 99.98% of pairs landing on different scaffolds by chance, even if you had mismatched files for read1 and read2. The only way I can see it happening is if, for example, every read1 was good, but every read2 mapped to the same location on a tiny scaffold (like mito).

I suggest you pileup and see if some tiny scaffold has immense coverage (or you can just visually look at the unsorted sam file and see where things mapped, to see if there's a pattern). Also, look at the read quality profiles and per-base ACGTN counts, and see if they look different than other libraries.

If you want to examine read names, just use "head" and "tail" on each of the fastq files, too see that the names are identical (except for e.g. " /1" and " /2"). Also use wc to ensure both files have the same number of lines.
Hi Brian!

I've tried to do what you suggested. I am quite a newbie for this kind of things, so I have tried to visualize the sorted.bam file on IGV. The reads look weird to me...looking into the IGV manual I see that different colors suggest different chromosome mapping...is this correct?
I have tried to do the same for one of the other samples that I have aligned the same day and that gave a optimal rate of properly paired mates...and in fact they look normal compared to this funny one.
I have a doubt, but I am not able to see if it makes sense: this weird sample (as well as the other three that gave the same problem) was sequenced twice due to poor yield...could it be that either the reads header was not paired properly? or do you have other suggestions?

Thanks!!!

Manu
Attached Images
File Type: png Screen Shot 2014-03-24 at 9.25.08 AM.png (133.1 KB, 17 views)
emolinari is offline   Reply With Quote
Old 03-24-2014, 09:28 AM   #7
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by emolinari View Post
Hi Brian!

I've tried to do what you suggested. I am quite a newbie for this kind of things, so I have tried to visualize the sorted.bam file on IGV. The reads look weird to me...looking into the IGV manual I see that different colors suggest different chromosome mapping...is this correct?
Colors mean different things in different modes with IGV. But those colors on your picture indicate SNPs - the reference does not match the reads, like, anywhere. Could you be using a slightly different reference for mapping and for IGV?

Quote:
I have a doubt, but I am not able to see if it makes sense: this weird sample (as well as the other three that gave the same problem) was sequenced twice due to poor yield...could it be that either the reads header was not paired properly? or do you have other suggestions?

Thanks!!!

Manu
Possible. I suggest that you DO write something to verify that the read names all match. But I think your problem likely has to do with the reference. I would try re-indexing HG19, re-mapping, and re-loading everything (including the reference) into IGV.
Brian Bushnell is offline   Reply With Quote
Old 03-24-2014, 02:32 PM   #8
emolinari
Member
 
Location: New haven

Join Date: May 2013
Posts: 47
Default

Quote:
Originally Posted by Brian Bushnell View Post

Possible. I suggest that you DO write something to verify that the read names all match. But I think your problem likely has to do with the reference. I would try re-indexing HG19, re-mapping, and re-loading everything (including the reference) into IGV.
Thanks Brian,
I am uploading the reference genome once again, so I will run TopHat+Bowtie and see if they make sense now...

BTW, checking prep_read.logs I have found out this:

/panfs/home/bioinfo/software/tophat/tophat-2.0.6.Linux_x86_64/prep_reads: /lib64/libz.so.1: no version information available (required by /panfs/home/bioinfo/software/tophat/tophat-2.0.6.Linux_x86_64/prep_reads)
prep_reads v2.0.6 (3670M)
---------------------------
WARNING: read pairing issues detected (check prep_reads.log) !
Pair #1 name mismatch: BFC08P1:257:C38GUACXX:8:1111:14326:35343 vs BFC08P1:257:C38GUACXX:8:1101:1229:2109
101900 out of 31388352 reads have been filtered out
67141 out of 31388352 read mates have been filtered out

...can you explain?

Thanks!!!
Manu
emolinari is offline   Reply With Quote
Old 03-24-2014, 03:18 PM   #9
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by emolinari View Post
Thanks Brian,
I am uploading the reference genome once again, so I will run TopHat+Bowtie and see if they make sense now...

BTW, checking prep_read.logs I have found out this:

/panfs/home/bioinfo/software/tophat/tophat-2.0.6.Linux_x86_64/prep_reads: /lib64/libz.so.1: no version information available (required by /panfs/home/bioinfo/software/tophat/tophat-2.0.6.Linux_x86_64/prep_reads)
prep_reads v2.0.6 (3670M)
---------------------------
WARNING: read pairing issues detected (check prep_reads.log) !
Pair #1 name mismatch: BFC08P1:257:C38GUACXX:8:1111:14326:35343 vs BFC08P1:257:C38GUACXX:8:1101:1229:2109
101900 out of 31388352 reads have been filtered out
67141 out of 31388352 read mates have been filtered out

...can you explain?

Thanks!!!
Manu
Looks like the reads are not properly paired. Probably at some point QC was done on them in a way that was not sensitive to pairing and wrecked the files. There are programs that will re-pair them, but I would suggest acquiring the original raw data before it got messed up.
Brian Bushnell is offline   Reply With Quote
Old 03-24-2014, 06:35 PM   #10
martin2
Member
 
Location: Prague, Czech Republic

Join Date: Nov 2010
Posts: 40
Default

Quote:
Originally Posted by emolinari View Post
Hi guys!

I have a weird problem to solve.
I have recently run an RNAseq pipeline on human cell line (polyA enriched, 75bp, PE, ~30 mil reads each).

12 samples out of 16 showed great alignment rates and library QC.
The remaining 4 instead have a ridiculously low properly mate pairing, like this one:
What is the length of the paired-end linker in Illumina? 30nt or more? So in theory 22.5nt on each side? How could you map accurately 22nt piece to anything? I just don't get the idea. You just need longer reads. If the linker happens to be located not in the exact middle but shifted to a side, of course, it should be treated as a shotgun read because the sequence across the linker is way too short. That's the good thing in you situation.

Mapping 30nt sequence with sequencing quality 90% quality, well, can be done, but not uniquely. In human? Not even 100% identity will give you a single target (actually at least two in diploid).

Myself when I do de novo assemblies I request at least 80% overlap and 96% identity between reads. [Stupid, newbler cannot constrain by 80nt overlap. Who cares about 80% length overlap of 20nt short reads? Nobody.] Either way, 80%o with 96%i is still too loose, for mammals and other organisms which went through several genome duplication events you have to anticipate each gene is present in the genome in 4-8, maybe even 16 places for some gene families or promiscuous protein domains. All of them are relatively recent copies which did not accumulate yet enough mutations and all are practically >=98% similar ^H^H^H^H^H^H^Hidentical. Mapping even a whole 75nt read to a genome like human is always asking for at least two hits (mother+father) and within each of them for at least 4 places. The read is so short that the sequencing errors will give you a chance to distinguish authentic differences from sequencing errors. And if the data are badly trimmed of sample tags then don't be surprised one of the two halves does not "map".

I know you do reference-based mapping but still, those 75nt chunks could map to many places. If you happen to trim badly your reads offending 10nt could dampen your mapping efficiency. Still, I wonder what is the rationale.

Why don't you think this is all sign of badly trimmed data, or too bad overall sequencing quality? Why do you want to squeeze something out from crappy data? If it doesn't map then thrash it. Even worse if the sample had to be sequenced several times, maybe low-input DNA for initial PCR steps. All of them introduce new errors and amplify errors at the best. In my opinion, already the info that the "sample" had to be re-sequenced means there is something wrong with it. I am a biologist so I understand that acquiring some samples is a lot of work and low-gain, in terms of ug/ng of DNA but still, you should have your own limits.

[Honestly, I had always personal blocks to merge transcripts spanning 10 exons in 2.0kb mRNA while having 5'- and 3'-end reads from Sanger, overlapping 200-400nt (1 or 2 exons). Those were not from same physical clone so there was no evidence these were for sure from same mRNA. This overlap is not a proof that the overall combination of exons exists in real, at all, no matter the overlap is 400nt long and 2 exons out of 10 are "confirmed". That is just guesswork. We have plenty of predictions but we need real evidence we can trust. Working with NGS data? No, never ever under lengths of 454 Titanium. Those 300nt after trimming are already under my comfortable limits but definitely nothing shorter. Please. ;-) Maybe for amplicon sequencing and SNP calling in some crafted target regions the Illumina can be good, but not in general.]

Finally, what ploidy has your cell line?
martin2 is offline   Reply With Quote
Old 03-25-2014, 06:37 AM   #11
emolinari
Member
 
Location: New haven

Join Date: May 2013
Posts: 47
Default

Quote:
Originally Posted by martin2 View Post
What is the length of the paired-end linker in Illumina? 30nt or more? So in theory 22.5nt on each side? How could you map accurately 22nt piece to anything? I just don't get the idea. You just need longer reads. If the linker happens to be located not in the exact middle but shifted to a side, of course, it should be treated as a shotgun read because the sequence across the linker is way too short. That's the good thing in you situation.

Mapping 30nt sequence with sequencing quality 90% quality, well, can be done, but not uniquely. In human? Not even 100% identity will give you a single target (actually at least two in diploid).

Myself when I do de novo assemblies I request at least 80% overlap and 96% identity between reads. [Stupid, newbler cannot constrain by 80nt overlap. Who cares about 80% length overlap of 20nt short reads? Nobody.] Either way, 80%o with 96%i is still too loose, for mammals and other organisms which went through several genome duplication events you have to anticipate each gene is present in the genome in 4-8, maybe even 16 places for some gene families or promiscuous protein domains. All of them are relatively recent copies which did not accumulate yet enough mutations and all are practically >=98% similar ^H^H^H^H^H^H^Hidentical. Mapping even a whole 75nt read to a genome like human is always asking for at least two hits (mother+father) and within each of them for at least 4 places. The read is so short that the sequencing errors will give you a chance to distinguish authentic differences from sequencing errors. And if the data are badly trimmed of sample tags then don't be surprised one of the two halves does not "map".

I know you do reference-based mapping but still, those 75nt chunks could map to many places. If you happen to trim badly your reads offending 10nt could dampen your mapping efficiency. Still, I wonder what is the rationale.

Why don't you think this is all sign of badly trimmed data, or too bad overall sequencing quality? Why do you want to squeeze something out from crappy data? If it doesn't map then thrash it. Even worse if the sample had to be sequenced several times, maybe low-input DNA for initial PCR steps. All of them introduce new errors and amplify errors at the best. In my opinion, already the info that the "sample" had to be re-sequenced means there is something wrong with it. I am a biologist so I understand that acquiring some samples is a lot of work and low-gain, in terms of ug/ng of DNA but still, you should have your own limits.

[Honestly, I had always personal blocks to merge transcripts spanning 10 exons in 2.0kb mRNA while having 5'- and 3'-end reads from Sanger, overlapping 200-400nt (1 or 2 exons). Those were not from same physical clone so there was no evidence these were for sure from same mRNA. This overlap is not a proof that the overall combination of exons exists in real, at all, no matter the overlap is 400nt long and 2 exons out of 10 are "confirmed". That is just guesswork. We have plenty of predictions but we need real evidence we can trust. Working with NGS data? No, never ever under lengths of 454 Titanium. Those 300nt after trimming are already under my comfortable limits but definitely nothing shorter. Please. ;-) Maybe for amplicon sequencing and SNP calling in some crafted target regions the Illumina can be good, but not in general.]

Finally, what ploidy has your cell line?

My reads are 75bp long after adapter trimming.
Considering that are paired end, that it's homo sapiens reference transcriptome, that the literature advices this strategy and most importantly that ALL the other samples sequenced are just fine - only 4 samples failed- I think I can safely assume this is a good sequencing strategy for me.
I agree anyhow that the problem was created at the source (either PCR step or during barcoding)... trash data is always "easy" but it means re doing everything from scratch that of course takes lots of resources, such as money and most importantly TIME. So I was trying to save both, unless these data cannot be saved at all, as it starts to be evident

BTW my cells are diploid!
thanks

Manu
emolinari is offline   Reply With Quote
Old 04-14-2014, 06:38 PM   #12
emolinari
Member
 
Location: New haven

Join Date: May 2013
Posts: 47
Default

Quote:
Originally Posted by Brian Bushnell View Post
Looks like the reads are not properly paired. Probably at some point QC was done on them in a way that was not sensitive to pairing and wrecked the files. There are programs that will re-pair them, but I would suggest acquiring the original raw data before it got messed up.
Hi Brian,
I've run Trimmomatic on my samples, with ILLUMINACLIP TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 and my map pairing now is way better. I am waiting for Cuffdiff results...

206974197 in total
0 QC failure
0 duplicates
206974197 mapped (100.00%)
206974197 paired in sequencing
103366809 read1
103607388 read2
152531082 properly paired (73.70%)
204699398 with itself and mate mapped
2274799 singletons (1.10%)
26308258 with mate mapped to a different chr
18725784 with mate mapped to a different chr (mapQ>=5)

thanks for the help!!!
Manu
emolinari is offline   Reply With Quote
Old 04-14-2014, 07:00 PM   #13
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707
Default

Quote:
Originally Posted by emolinari View Post
Hi Brian,
...
thanks for the help!!!
Manu
Glad I could assist.

By the way... I generally recommend aligning against a genome rather than transcriptome, for better objectivity and repeatability, and less bias toward known genes/isoforms.

Last edited by Brian Bushnell; 04-14-2014 at 07:09 PM.
Brian Bushnell is offline   Reply With Quote
Reply

Thread Tools

Posting Rules
You may not post new threads
You may not post replies
You may not post attachments
You may not edit your posts

BB code is On
Smilies are On
[IMG] code is On
HTML code is Off




All times are GMT -8. The time now is 07:16 AM.


Powered by vBulletin® Version 3.8.9
Copyright ©2000 - 2020, vBulletin Solutions, Inc.
Single Sign On provided by vBSSO