Seqanswers Leaderboard Ad

**emolinari** · 03-14-2014, 11:59 AM

Then I re-checked my work flow and realized that the Genome Center that performed the library prep and sequencing had to fund 2/3 sequencing for these samples, as they were providing a low yield (of reads).

Sorry there's a typo: the Genome center re-performed the sequencing a couple of times on my samples due to low reads yield!

**Brian Bushnell** · 03-14-2014, 12:22 PM

Looks like you are mapping to a highly broken assembly, since most of the reads map to different scaffolds. Or else the read1 and read2 files don't go together. What did you map to, normal HG19 or something else? What mapper did you use? And are you sure the read1 and read2 files go together? Look at the read names and make sure they match.

**emolinari** · 03-14-2014, 12:32 PM

Originally posted by Brian Bushnell View Post

Looks like you are mapping to a highly broken assembly, since most of the reads map to different scaffolds. Or else the read1 and read2 files don't go together. What did you map to, normal HG19 or something else? What mapper did you use? And are you sure the read1 and read2 files go together? Look at the read names and make sure they match.

Hi Brian!
thanks for the reply. I have mapped my samples with TopHat2, using normal hg19. I guess that it is not a broken hg19(got it from iGenomes), as it worked just fine for the others.
I have checked the read identifiers and they seem to agree...although I have just briefly looked at them, I haven't written any string to check if they were matching. Any suggestions how to?

Thank you!!!
Manu

**Brian Bushnell** · 03-14-2014, 12:55 PM

So, the weird thing is that HG19 only has 25 normal scaffolds and a handful of small extra scaffolds. You can't have 99.98% of pairs landing on different scaffolds by chance, even if you had mismatched files for read1 and read2. The only way I can see it happening is if, for example, every read1 was good, but every read2 mapped to the same location on a tiny scaffold (like mito).

I suggest you pileup and see if some tiny scaffold has immense coverage (or you can just visually look at the unsorted sam file and see where things mapped, to see if there's a pattern). Also, look at the read quality profiles and per-base ACGTN counts, and see if they look different than other libraries.

If you want to examine read names, just use "head" and "tail" on each of the fastq files, too see that the names are identical (except for e.g. " /1" and " /2"). Also use wc to ensure both files have the same number of lines.

**emolinari** · 03-24-2014, 06:41 AM

Originally posted by Brian Bushnell View Post

So, the weird thing is that HG19 only has 25 normal scaffolds and a handful of small extra scaffolds. You can't have 99.98% of pairs landing on different scaffolds by chance, even if you had mismatched files for read1 and read2. The only way I can see it happening is if, for example, every read1 was good, but every read2 mapped to the same location on a tiny scaffold (like mito).

I suggest you pileup and see if some tiny scaffold has immense coverage (or you can just visually look at the unsorted sam file and see where things mapped, to see if there's a pattern). Also, look at the read quality profiles and per-base ACGTN counts, and see if they look different than other libraries.

If you want to examine read names, just use "head" and "tail" on each of the fastq files, too see that the names are identical (except for e.g. " /1" and " /2"). Also use wc to ensure both files have the same number of lines.

Hi Brian!

I've tried to do what you suggested. I am quite a newbie for this kind of things, so I have tried to visualize the sorted.bam file on IGV. The reads look weird to me...looking into the IGV manual I see that different colors suggest different chromosome mapping...is this correct?
I have tried to do the same for one of the other samples that I have aligned the same day and that gave a optimal rate of properly paired mates...and in fact they look normal compared to this funny one.
I have a doubt, but I am not able to see if it makes sense: this weird sample (as well as the other three that gave the same problem) was sequenced twice due to poor yield...could it be that either the reads header was not paired properly? or do you have other suggestions?

Thanks!!!

Manu

Attached Files

Screen Shot 2014-03-24 at 9.25.08 AM.png (133.1 KB, 17 views)

**Brian Bushnell** · 03-24-2014, 08:28 AM

Originally posted by emolinari View Post

Hi Brian!

I've tried to do what you suggested. I am quite a newbie for this kind of things, so I have tried to visualize the sorted.bam file on IGV. The reads look weird to me...looking into the IGV manual I see that different colors suggest different chromosome mapping...is this correct?

Colors mean different things in different modes with IGV. But those colors on your picture indicate SNPs - the reference does not match the reads, like, anywhere. Could you be using a slightly different reference for mapping and for IGV?

I have a doubt, but I am not able to see if it makes sense: this weird sample (as well as the other three that gave the same problem) was sequenced twice due to poor yield...could it be that either the reads header was not paired properly? or do you have other suggestions?

Thanks!!!

Manu

Possible. I suggest that you DO write something to verify that the read names all match. But I think your problem likely has to do with the reference. I would try re-indexing HG19, re-mapping, and re-loading everything (including the reference) into IGV.

**emolinari** · 03-24-2014, 01:32 PM

Originally posted by Brian Bushnell View Post

Possible. I suggest that you DO write something to verify that the read names all match. But I think your problem likely has to do with the reference. I would try re-indexing HG19, re-mapping, and re-loading everything (including the reference) into IGV.

Thanks Brian,
I am uploading the reference genome once again, so I will run TopHat+Bowtie and see if they make sense now...

BTW, checking prep_read.logs I have found out this:

/panfs/home/bioinfo/software/tophat/tophat-2.0.6.Linux_x86_64/prep_reads: /lib64/libz.so.1: no version information available (required by /panfs/home/bioinfo/software/tophat/tophat-2.0.6.Linux_x86_64/prep_reads)
prep_reads v2.0.6 (3670M)
---------------------------
WARNING: read pairing issues detected (check prep_reads.log) !
Pair #1 name mismatch: BFC08P1:257:C38GUACXX:8:1111:14326:35343 vs BFC08P1:257:C38GUACXX:8:1101:1229:2109
101900 out of 31388352 reads have been filtered out
67141 out of 31388352 read mates have been filtered out

...can you explain?

Thanks!!!
Manu

**Brian Bushnell** · 03-24-2014, 02:18 PM

Originally posted by emolinari View Post

Thanks Brian,
I am uploading the reference genome once again, so I will run TopHat+Bowtie and see if they make sense now...

BTW, checking prep_read.logs I have found out this:

/panfs/home/bioinfo/software/tophat/tophat-2.0.6.Linux_x86_64/prep_reads: /lib64/libz.so.1: no version information available (required by /panfs/home/bioinfo/software/tophat/tophat-2.0.6.Linux_x86_64/prep_reads)
prep_reads v2.0.6 (3670M)
---------------------------
WARNING: read pairing issues detected (check prep_reads.log) !
Pair #1 name mismatch: BFC08P1:257:C38GUACXX:8:1111:14326:35343 vs BFC08P1:257:C38GUACXX:8:1101:1229:2109
101900 out of 31388352 reads have been filtered out
67141 out of 31388352 read mates have been filtered out

...can you explain?

Thanks!!!
Manu

Looks like the reads are not properly paired. Probably at some point QC was done on them in a way that was not sensitive to pairing and wrecked the files. There are programs that will re-pair them, but I would suggest acquiring the original raw data before it got messed up.

**martin2** · 03-24-2014, 05:35 PM

Originally posted by emolinari View Post

Hi guys!

I have a weird problem to solve.
I have recently run an RNAseq pipeline on human cell line (polyA enriched, 75bp, PE, ~30 mil reads each).

12 samples out of 16 showed great alignment rates and library QC.
The remaining 4 instead have a ridiculously low properly mate pairing, like this one:

What is the length of the paired-end linker in Illumina? 30nt or more? So in theory 22.5nt on each side? How could you map accurately 22nt piece to anything? I just don't get the idea. You just need longer reads. If the linker happens to be located not in the exact middle but shifted to a side, of course, it should be treated as a shotgun read because the sequence across the linker is way too short. That's the good thing in you situation.

Mapping 30nt sequence with sequencing quality 90% quality, well, can be done, but not uniquely. In human? Not even 100% identity will give you a single target (actually at least two in diploid).

Myself when I do de novo assemblies I request at least 80% overlap and 96% identity between reads. [Stupid, newbler cannot constrain by 80nt overlap. Who cares about 80% length overlap of 20nt short reads? Nobody.] Either way, 80%o with 96%i is still too loose, for mammals and other organisms which went through several genome duplication events you have to anticipate each gene is present in the genome in 4-8, maybe even 16 places for some gene families or promiscuous protein domains. All of them are relatively recent copies which did not accumulate yet enough mutations and all are practically >=98% similar ^H^H^H^H^H^H^Hidentical. Mapping even a whole 75nt read to a genome like human is always asking for at least two hits (mother+father) and within each of them for at least 4 places. The read is so short that the sequencing errors will give you a chance to distinguish authentic differences from sequencing errors. And if the data are badly trimmed of sample tags then don't be surprised one of the two halves does not "map".

I know you do reference-based mapping but still, those 75nt chunks could map to many places. If you happen to trim badly your reads offending 10nt could dampen your mapping efficiency. Still, I wonder what is the rationale.

Why don't you think this is all sign of badly trimmed data, or too bad overall sequencing quality? Why do you want to squeeze something out from crappy data? If it doesn't map then thrash it. Even worse if the sample had to be sequenced several times, maybe low-input DNA for initial PCR steps. All of them introduce new errors and amplify errors at the best. In my opinion, already the info that the "sample" had to be re-sequenced means there is something wrong with it. I am a biologist so I understand that acquiring some samples is a lot of work and low-gain, in terms of ug/ng of DNA but still, you should have your own limits.

[Honestly, I had always personal blocks to merge transcripts spanning 10 exons in 2.0kb mRNA while having 5'- and 3'-end reads from Sanger, overlapping 200-400nt (1 or 2 exons). Those were not from same physical clone so there was no evidence these were for sure from same mRNA. This overlap is not a proof that the overall combination of exons exists in real, at all, no matter the overlap is 400nt long and 2 exons out of 10 are "confirmed". That is just guesswork. We have plenty of predictions but we need real evidence we can trust. Working with NGS data? No, never ever under lengths of 454 Titanium. Those 300nt after trimming are already under my comfortable limits but definitely nothing shorter. Please. ;-) Maybe for amplicon sequencing and SNP calling in some crafted target regions the Illumina can be good, but not in general.]

Finally, what ploidy has your cell line?

**emolinari** · 03-25-2014, 05:37 AM

Originally posted by martin2 View Post

What is the length of the paired-end linker in Illumina? 30nt or more? So in theory 22.5nt on each side? How could you map accurately 22nt piece to anything? I just don't get the idea. You just need longer reads. If the linker happens to be located not in the exact middle but shifted to a side, of course, it should be treated as a shotgun read because the sequence across the linker is way too short. That's the good thing in you situation.

Mapping 30nt sequence with sequencing quality 90% quality, well, can be done, but not uniquely. In human? Not even 100% identity will give you a single target (actually at least two in diploid).

Myself when I do de novo assemblies I request at least 80% overlap and 96% identity between reads. [Stupid, newbler cannot constrain by 80nt overlap. Who cares about 80% length overlap of 20nt short reads? Nobody.] Either way, 80%o with 96%i is still too loose, for mammals and other organisms which went through several genome duplication events you have to anticipate each gene is present in the genome in 4-8, maybe even 16 places for some gene families or promiscuous protein domains. All of them are relatively recent copies which did not accumulate yet enough mutations and all are practically >=98% similar ^H^H^H^H^H^H^Hidentical. Mapping even a whole 75nt read to a genome like human is always asking for at least two hits (mother+father) and within each of them for at least 4 places. The read is so short that the sequencing errors will give you a chance to distinguish authentic differences from sequencing errors. And if the data are badly trimmed of sample tags then don't be surprised one of the two halves does not "map".

I know you do reference-based mapping but still, those 75nt chunks could map to many places. If you happen to trim badly your reads offending 10nt could dampen your mapping efficiency. Still, I wonder what is the rationale.

Why don't you think this is all sign of badly trimmed data, or too bad overall sequencing quality? Why do you want to squeeze something out from crappy data? If it doesn't map then thrash it. Even worse if the sample had to be sequenced several times, maybe low-input DNA for initial PCR steps. All of them introduce new errors and amplify errors at the best. In my opinion, already the info that the "sample" had to be re-sequenced means there is something wrong with it. I am a biologist so I understand that acquiring some samples is a lot of work and low-gain, in terms of ug/ng of DNA but still, you should have your own limits.

[Honestly, I had always personal blocks to merge transcripts spanning 10 exons in 2.0kb mRNA while having 5'- and 3'-end reads from Sanger, overlapping 200-400nt (1 or 2 exons). Those were not from same physical clone so there was no evidence these were for sure from same mRNA. This overlap is not a proof that the overall combination of exons exists in real, at all, no matter the overlap is 400nt long and 2 exons out of 10 are "confirmed". That is just guesswork. We have plenty of predictions but we need real evidence we can trust. Working with NGS data? No, never ever under lengths of 454 Titanium. Those 300nt after trimming are already under my comfortable limits but definitely nothing shorter. Please. ;-) Maybe for amplicon sequencing and SNP calling in some crafted target regions the Illumina can be good, but not in general.]

Finally, what ploidy has your cell line?

My reads are 75bp long after adapter trimming.
Considering that are paired end, that it's homo sapiens reference transcriptome, that the literature advices this strategy and most importantly that ALL the other samples sequenced are just fine - only 4 samples failed- I think I can safely assume this is a good sequencing strategy for me.
I agree anyhow that the problem was created at the source (either PCR step or during barcoding)... trash data is always "easy" but it means re doing everything from scratch that of course takes lots of resources, such as money and most importantly TIME. So I was trying to save both, unless these data cannot be saved at all, as it starts to be evident

BTW my cells are diploid!
thanks

Manu

**emolinari** · 04-14-2014, 05:38 PM

Originally posted by Brian Bushnell View Post

Looks like the reads are not properly paired. Probably at some point QC was done on them in a way that was not sensitive to pairing and wrecked the files. There are programs that will re-pair them, but I would suggest acquiring the original raw data before it got messed up.

Hi Brian,
I've run Trimmomatic on my samples, with ILLUMINACLIP TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 and my map pairing now is way better. I am waiting for Cuffdiff results...

206974197 in total
0 QC failure
0 duplicates
206974197 mapped (100.00%)
206974197 paired in sequencing
103366809 read1
103607388 read2
152531082 properly paired (73.70%)
204699398 with itself and mate mapped
2274799 singletons (1.10%)
26308258 with mate mapped to a different chr
18725784 with mate mapped to a different chr (mapQ>=5)

thanks for the help!!!
Manu

**Brian Bushnell** · 04-14-2014, 06:00 PM

Originally posted by emolinari View Post

Hi Brian,
...
thanks for the help!!!
Manu

Glad I could assist.

By the way... I generally recommend aligning against a genome rather than transcriptome, for better objectivity and repeatability, and less bias toward known genes/isoforms.

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, Yesterday, 11:49 AM	0 responses 13 views 0 likes	Last Post by seqadmin Yesterday, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 16 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 61 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Properly paired mates

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News