SEQanswers

Poor seq quality due to low diversity sample (http://seqanswers.com/forums/showthread.php?t=93993)

invu 04-22-2020 11:04 AM

Poor seq quality due to low diversity sample
 
3 Attachment(s)
Hi,

I am analyzing several sets of HiSeq data, and the sequencing quality turned out quite bad. I have attached the "per base sequence quality" and "per tile sequence quality" plots for one of those sets, generated with FastQC.

I contacted the service provider, and they say it's due to my sample having low diversity, especially at the beginning of the reads. (I have also attached the per base sequence content plot.)
From some searching and reading of Illumina tech notes, I understand that diversity in the first several bases is quite important for the system to "calibrate" correctly, so that the base calls for later bases are of good quality.
My first question: is this roughly a correct interpretation? And is there any way to "post-process" the raw(er) data to correct or improve the sequence reads?

Second, what I still don't understand is why it affects the per tile sequence quality. How does low diversity at the initial bases have anything to do with spatial variation in sequence quality?

What do you guys think?
What should I argue when replying to my service provider? Should I ask for a re-run?

Any note will be greatly appreciated!
Thanks.

GenoMax 04-22-2020 12:38 PM

Yikes, this is a really low diversity sample. Do you know how much phiX (if any) was added to this sample? Did you not tell the sequencing provider that these were low diversity? If you did not, it would be hard to make a case for them to re-sequence this sample for free. You may have to pay for a re-run with a significant % of phiX (10-20% or more) if you want to get improved Q-scores.

It is possible that, in spite of the bad Q-scores etc., your sequence may still be usable. Have you looked at that?
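
One rough way to check how much of the data is still usable is to count the reads whose mean Phred score clears a cutoff. The sketch below is only an illustration: it assumes an uncompressed four-line-per-record FASTQ with Phred+33 encoding, and the cutoff of 20 and the function names are arbitrary choices for the example, not a recommendation from this thread.

```python
import io


def mean_phred(quality_line, offset=33):
    """Mean Phred score of one FASTQ quality string (Phred+33 assumed)."""
    scores = [ord(ch) - offset for ch in quality_line]
    return sum(scores) / len(scores) if scores else 0.0


def usable_fraction(fastq_handle, min_mean_q=20.0):
    """Return (passing, total) read counts for a 4-line-per-record FASTQ stream."""
    passing = 0
    total = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of file
        fastq_handle.readline()  # sequence line (not needed here)
        fastq_handle.readline()  # '+' separator line
        quality = fastq_handle.readline().rstrip("\n")
        total += 1
        if mean_phred(quality) >= min_mean_q:
            passing += 1
    return passing, total


# Tiny in-memory demo with two made-up reads: one high quality, one low quality.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n####\n")
passing, total = usable_fraction(demo)
print(f"{passing} of {total} reads pass the mean-quality cutoff")
```

For a real file, the same function can be called on `open("reads.fastq")` (a placeholder file name). A mean-quality cutoff is a blunt tool; it says nothing about whether the defined regions of each read were called correctly.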

invu 04-22-2020 02:29 PM

Thanks for your reply, GenoMax!
The sample is a custom set of sequences with well-defined regions (hence those low-diversity regions). I had declined PhiX spike-in to obtain as many valid read lines as possible w/o sacrificing any to PhiX. I hadn't told them about the diversity because I had no idea about this kind of issue before; that being said, my old results for samples similar to this (even though they did have a few degenerate bases at the beginning) didn't have this problem (at least weren't as bad as this). Will I really need PhiX if I get to repeat something like this? Which way will I lose more data -- 10-20% loss by PhiX or less well-defined loss by poor quality reads like this?

I am looking at the data, and a big portion of the lines do seem valid and usable. Ideally I'd need more lines, though. More importantly, if this issue caused extra base call errors even among the lines that look okay, then that's a separate problem, and one that is quite hard to detect just from looking at the lines themselves.

Do you happen to know whether someone looking at the rawer data (e.g., imaging data, if it is preserved; sorry, I'm not really familiar with the details of the sequencing machines) could correct or improve the base calls throughout the data even now? Or is everything done in real time by the machine, so that nothing can be done to improve this?
Also, do you know whether low diversity would also cause the tile-dependent quality loss shown in my diagram? (This is something I am having a hard time understanding, and something I'm trying to argue about.)
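
The PhiX trade-off asked about above comes down to simple arithmetic: usable reads = total reads x (1 - PhiX fraction) x fraction of sample reads that pass quality filtering. The sketch below only illustrates that arithmetic. Every number in it is a placeholder chosen for the example, not a measurement from this run, and the function names are hypothetical.

```python
def usable_reads(total_reads, phix_fraction, pass_rate):
    """Reads that are both non-PhiX and pass the quality filter."""
    if not 0.0 <= phix_fraction <= 1.0 or not 0.0 <= pass_rate <= 1.0:
        raise ValueError("fractions must be between 0 and 1")
    return total_reads * (1.0 - phix_fraction) * pass_rate


def break_even_pass_rate(phix_fraction, pass_rate_with_phix):
    """Pass rate a run without PhiX must reach to match a run with PhiX."""
    return (1.0 - phix_fraction) * pass_rate_with_phix


# Placeholder scenario: 100 million reads per lane.
TOTAL = 100e6
with_phix = usable_reads(TOTAL, phix_fraction=0.15, pass_rate=0.90)
without_phix = usable_reads(TOTAL, phix_fraction=0.0, pass_rate=0.60)
threshold = break_even_pass_rate(0.15, 0.90)

print(f"with PhiX:    {with_phix / 1e6:.1f} M usable reads")
print(f"without PhiX: {without_phix / 1e6:.1f} M usable reads")
print(f"break-even pass rate without PhiX: {threshold:.3f}")
```

Under these assumed numbers, skipping PhiX only pays off if the low-diversity run keeps its pass rate above the break-even value, which is the quantity to estimate from the actual data.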

GenoMax 04-22-2020 06:09 PM

You really should have asked for phiX to be added. Consider that this run could have failed completely if it had been a bit overloaded, leaving you with no data. Raw image data is generally not stored nowadays, so there is not much you can do afterwards. If you need more data, consider sequencing an additional lane rather than taking a chance like this.

invu 04-22-2020 06:20 PM

Ha, I see. Lesson learned. Thanks for your help, GenoMax!

ATϟGC 04-23-2020 04:53 AM

If these are amplicon libraries and you want to minimize the amount of PhiX, you can add "stagger" or "offset" nucleotides between the Illumina sequencing primer region (like the Nextera or TruSeq tail) and your locus-specific primer in order to create base diversity. These stagger nucleotides can also be added to restriction-digest adapters to increase base diversity.

I always add staggers to my amplicon primers and sequence multiple amplicons per run to increase diversity, but I still always add 5-12% PhiX just to be sure.
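
As a minimal sketch of the stagger layout described above (adapter tail, then 0 to a few offset bases, then the locus-specific primer), the snippet below builds one primer per stagger. All three sequences are short made-up placeholders, not real Nextera/TruSeq tails or real primers, and the choice of stagger bases is arbitrary.

```python
def staggered_primers(adapter_tail, locus_primer, staggers):
    """Build one primer per stagger: adapter tail + stagger bases + locus primer."""
    return [adapter_tail + stagger + locus_primer for stagger in staggers]


# Placeholder sequences for illustration only.
ADAPTER_TAIL = "ACACGACGCT"        # stands in for the sequencing-primer tail
LOCUS_PRIMER = "GTGCCAGCAG"        # stands in for the locus-specific primer
STAGGERS = ["", "T", "GA", "CTG"]  # 0-3 nt offsets, arbitrary bases

primers = staggered_primers(ADAPTER_TAIL, LOCUS_PRIMER, STAGGERS)
for stagger, primer in zip(STAGGERS, primers):
    print(f"{len(stagger)} nt stagger: {primer}")
```

Because the read starts immediately after the sequencing primer, each stagger length shifts the locus-specific primer to a different cycle, so the same fixed sequence is no longer read in phase across all clusters.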

invu 04-23-2020 05:33 AM

Thanks, ATϟGC, that's a good suggestion.
Looking back, the adapter-primers I used for my older runs, when I didn't have this issue, did have some degenerate bases in between (included for a different purpose), and I think that was key in preventing this issue.

Still, adding a minimal portion of PhiX is a good suggestion, too.
Thanks!!

cement_head 04-23-2020 08:46 AM

I'd have to agree with GenoMax; it's super-important to have a consultation with the sequencing center about the library composition and ask them what they recommend. You probably should have had a 10% PhiX spike-in added. HiSeqs are terrible at dynamic calibration; MiSeqs are better (to a point).

invu 04-23-2020 09:30 AM

I see. Next time I will consider PhiX spike-in. Thanks, cement_head!

ATϟGC 04-24-2020 05:05 AM

I agree that it would be best to discuss these issues with your sequencing provider.

If you do choose to use staggered bases, I recommend making an alignment to check base diversity in the first 12-20 base pairs of Read 1. This alignment should be made with respect to the Illumina sequencing primer. For my amplicon libraries, this means I anchor it on the left at the Nextera Read 1 sequence. You then only need to consider the base diversity of your staggered and/or unstaggered primers or adapters (I use a mix of both in my round 1 PCR reactions). I do this in Microsoft Excel so that I can calculate and optimize the base diversity of all the amplicons that will be pooled in my run.
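
The same check can be scripted instead of done in a spreadsheet. The sketch below left-anchors each expected read start (stagger plus locus-specific primer, i.e. everything after the sequencing primer) and tabulates the A/C/G/T fraction at each cycle. It assumes every primer is pooled in equal amounts and contains only A/C/G/T (no degenerate IUPAC codes), and the sequences are made-up placeholders.

```python
from collections import Counter


def per_cycle_composition(read_starts, n_cycles=12):
    """Fraction of A/C/G/T at each of the first n_cycles across the sequences."""
    table = []
    for cycle in range(n_cycles):
        # Sequences shorter than this cycle simply do not contribute.
        bases = [seq[cycle] for seq in read_starts if len(seq) > cycle]
        counts = Counter(bases)
        total = len(bases)
        table.append({b: (counts[b] / total if total else 0.0) for b in "ACGT"})
    return table


# Placeholder primer and staggers; the read begins right after the sequencing primer.
LOCUS_PRIMER = "GTGCCAGCAG"
STAGGERS = ["", "T", "GA", "CTG"]
read_starts = [stagger + LOCUS_PRIMER for stagger in STAGGERS]

composition = per_cycle_composition(read_starts, n_cycles=4)
for cycle, fractions in enumerate(composition, start=1):
    row = "  ".join(f"{base}={fraction:.2f}" for base, fraction in fractions.items())
    print(f"cycle {cycle}: {row}")
```

Cycles where a single base dominates are the ones to fix, for example by changing the stagger bases or by pooling additional amplicons.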

Adding stagger bases has the potential to introduce biases in your libraries due to secondary structures or other priming phenomena. If you use the same mix of staggers for all samples the bias should be the same in theory.

I have only sequenced amplicons on MiSeq and NovaSeq, and 5-12% PhiX has been enough for me on those platforms, so I cannot comment on HiSeq.


All times are GMT -8.