SEQanswers

Old 04-22-2020, 12:04 PM   #1
invu
Junior Member
 
Location: Boston, MA, USA

Join Date: Apr 2020
Posts: 5
Poor seq quality due to low diversity sample

Hi,

I am analyzing several sets of HiSeq data, and the sequencing quality turned out quite poor. I have attached the "per base seq quality" and "per tile seq quality" diagrams for one of those sets, generated using FastQC.

I contacted the service provider, and they say it's due to my sample having low diversity, especially at the beginning. (I have also attached the seq content diagram.)
Based on some searching and reading of Illumina tech notes, I see that diversity in the first several bases is quite important for the system to "calibrate" correctly, so that later bases get good-quality base calls.
My first question: is this roughly a correct interpretation? And is there any way to "post-process" the raw(er) data to correct or improve the reads?

Second, what I still don't understand is why it affects the per tile seq quality. How does low diversity at the initial bases have anything to do with spatial variation in seq quality?

What do you guys think?
What should I argue when replying to my service provider? Should I ask for a re-run?

Any note will be greatly appreciated!
Thanks.
Attached Images
File Type: png Per_base_seq_quality-SW3-R1.png (12.3 KB, 10 views)
File Type: png Per_tile_seq_quality-SW3-R1.png (12.1 KB, 7 views)
File Type: png Per_base_seq_content-SW3-R1.png (110.5 KB, 12 views)

Last edited by invu; 04-22-2020 at 12:06 PM.
Old 04-22-2020, 01:38 PM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079

Yikes, this is a really low diversity sample. Do you know how much phiX (if any) was added to this sample? Did you not tell the sequencing provider that these were low diversity? If you did not, it would be hard to make a case for them to re-sequence this sample for free. You may have to pay for a re-run with a significant percentage of phiX (10-20% or more) if you want improved Q-scores.

It is possible that, in spite of the bad Q-scores etc., your sequence may still be usable. Have you looked at that?
Old 04-22-2020, 03:29 PM   #3
invu
Junior Member
 
Location: Boston, MA, USA

Join Date: Apr 2020
Posts: 5

Thanks for your reply, GenoMax!
The sample is a custom set of sequences with well-defined regions (hence the low-diversity regions). I had declined the PhiX spike-in to obtain as many valid read lines as possible without sacrificing any to PhiX. I hadn't told them about the diversity because I had no idea about this kind of issue before. That said, my old results for similar samples (which did have a few degenerate bases at the beginning) didn't have this problem, or at least weren't as bad as this. Will I really need PhiX if I repeat something like this? Which way will I lose more data: a 10-20% loss to PhiX, or a less well-defined loss to poor-quality reads like these?
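
One rough way to frame that trade-off is to compare the fraction of clusters that are both library-derived and usable in each scenario. The sketch below is only arithmetic under stated assumptions: `usable_fraction` and its parameters are hypothetical names, and the pass rates in the example are placeholders, not measurements from this run.

```python
def usable_fraction(phix_fraction, pass_rate):
    """Fraction of all clusters that come from the library AND pass your quality filter.

    phix_fraction: share of clusters given to PhiX (0.0 means no spike-in).
    pass_rate:     share of library reads you consider usable (an assumption
                   you have to estimate from your own data).
    """
    if not 0.0 <= phix_fraction < 1.0:
        raise ValueError("phix_fraction must be in [0, 1)")
    if not 0.0 <= pass_rate <= 1.0:
        raise ValueError("pass_rate must be in [0, 1]")
    return (1.0 - phix_fraction) * pass_rate


# Placeholder numbers for illustration only.
with_phix = usable_fraction(phix_fraction=0.20, pass_rate=0.90)
without_phix = usable_fraction(phix_fraction=0.0, pass_rate=0.60)
print(f"with 20% PhiX: {with_phix:.2f}, without PhiX: {without_phix:.2f}")
```

Under these made-up numbers, a run without PhiX only comes out ahead if its pass rate exceeds the with-PhiX value of `usable_fraction`. This ignores the risk of a run failing outright.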

I am looking at the data, and a large portion of the lines do seem valid and usable. Ideally, though, I would need more lines. More importantly, even among the lines that look okay, this issue may have caused additional base-call errors, which is a separate problem and quite hard to tell just from looking at those lines.
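
A quick way to put a number on "a large portion" is to count how many reads clear a mean-quality threshold. This is a minimal sketch, assuming an uncompressed four-line-per-record FASTQ with Phred+33 encoding. The function names and the Q30 cutoff are illustrative choices, and mean Phred per read is a crude summary that will not reveal individual miscalled bases.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n"), seq, qual


def mean_phred(qual, offset=33):
    """Arithmetic mean of the Phred scores in one quality string."""
    return sum(ord(c) - offset for c in qual) / len(qual)


def fraction_passing(handle, min_mean_q=30.0):
    """Fraction of reads whose mean Phred score is at least min_mean_q."""
    total = passed = 0
    for _, _, qual in read_fastq(handle):
        if not qual:
            continue  # skip empty records
        total += 1
        if mean_phred(qual) >= min_mean_q:
            passed += 1
    return passed / total if total else 0.0


# Tiny in-memory example: 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n####\n")
print(fraction_passing(demo))
```

For a real gzipped file, the handle could come from `gzip.open(path, "rt")` instead of `io.StringIO`.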

Do you happen to know whether someone with access to the rawer data (e.g., imaging data, if it is preserved; sorry, I'm not really familiar with the details of the sequencing machines) could correct or improve the base calls throughout the data even now? Or is everything done in real time by the machine, so that nothing can be done to improve this?
Also, do you know whether this low-diversity issue would also cause the tile-dependent quality loss shown in my diagram? (This is something I am having a hard time understanding, and something I'm trying to argue about.)

Last edited by invu; 04-22-2020 at 03:33 PM.
Old 04-22-2020, 07:09 PM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 7,079

You really should have asked for phiX to be added. Consider that this run could have failed completely if it had been a bit overloaded, leaving you with no data. Raw image data is generally not stored nowadays, so there is not much you can do with it afterwards. If you need more data, consider sequencing an additional lane rather than taking a chance like this.
Old 04-22-2020, 07:20 PM   #5
invu
Junior Member
 
Location: Boston, MA, USA

Join Date: Apr 2020
Posts: 5

Ha, I see. Lesson learned. Thanks for your help, GenoMax!
Old 04-23-2020, 05:53 AM   #6
ATϟGC
Member
 
Location: Canada

Join Date: Jun 2013
Posts: 56

If these are amplicon libraries and you want to minimize the amount of PhiX, you can add "stagger" or "offset" nucleotides between the Illumina sequencing primer region (like the Nextera or TruSeq tail) and your locus-specific primer in order to create base diversity. These stagger nucleotides can also be added to restriction-digest adapters to increase base diversity.

I always add staggers to my amplicon primers and sequence multiple amplicons per run to increase diversity, but I still always add 5-12% PhiX just to be sure.
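
To make the primer layout concrete, here is a minimal sketch of how a staggered primer set could be assembled as strings: tail, then 0 to 3 stagger bases, then the locus-specific primer. All sequences below are placeholders rather than real Nextera/TruSeq tails or real primers, and the particular stagger bases are an arbitrary assumption.

```python
def staggered_primers(tail, locus_primer, staggers=("", "T", "GA", "CTG")):
    """Return one primer per stagger: tail + stagger bases + locus-specific primer.

    Each stagger shifts the start of the locus-specific sequence by its length,
    so a constant region is read at different cycles in different molecules.
    """
    return [tail + stagger + locus_primer for stagger in staggers]


# Placeholder sequences for illustration only.
TAIL = "ACGTACGTAC"        # stands in for the sequencing-primer tail
LOCUS_PRIMER = "GTGCCAGC"  # stands in for a locus-specific primer

for primer in staggered_primers(TAIL, LOCUS_PRIMER):
    print(primer)
```

Mixing these primers in the first-round PCR means Read 1 begins at a different offset into the locus for each stagger length.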
Old 04-23-2020, 06:33 AM   #7
invu
Junior Member
 
Location: Boston, MA, USA

Join Date: Apr 2020
Posts: 5

Thanks, ATϟGC, that's a good suggestion.
Looking back, the adapter-primers I used for my older runs, when I didn't have this issue, did have some degenerate bases in between (added for other purposes), and I think that was key in preventing this issue.

Still, adding a minimal portion of PhiX is a good suggestion, too.
Thanks!!
Old 04-23-2020, 09:46 AM   #8
cement_head
Senior Member
 
Location: Oxford, Ohio

Join Date: Mar 2012
Posts: 253

I'd have to agree with GenoMax; it is super-important to consult the sequencing center about the library composition and ask what they recommend. You probably should have had a 10% PhiX spike-in added. HiSeqs are terrible at dynamic calibration; MiSeqs are better (to a point).
Old 04-23-2020, 10:30 AM   #9
invu
Junior Member
 
Location: Boston, MA, USA

Join Date: Apr 2020
Posts: 5

I see. Next time I will consider PhiX spike-in. Thanks, cement_head!
Old 04-24-2020, 06:05 AM   #10
ATϟGC
Member
 
Location: Canada

Join Date: Jun 2013
Posts: 56

I agree that it would be best to discuss these issues with your sequencing provider.

If you do choose to use staggered bases, I recommend making an alignment to check base diversity in the first 12-20 base pairs of Read 1. This alignment should be made with respect to the Illumina sequencing primer; for my amplicon libraries, this means I anchor it on the left at the Nextera Read 1 sequence. You then only need to consider the base diversity of your staggered and/or unstaggered primers or adapters (I use a mix of both in my round 1 PCR reactions). I do this in Microsoft Excel so that I can calculate and optimize base diversity across all the amplicons that will be pooled in my run.
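
The same per-cycle check can be scripted rather than done in a spreadsheet. This is a minimal sketch, assuming each input string is what Read 1 would see immediately after the sequencing primer (stagger bases followed by the locus-specific primer). The function names, the optional pooling weights, and the 0.5 flagging threshold are all illustrative assumptions, not an Illumina guideline.

```python
from collections import Counter


def per_cycle_composition(read_starts, n_cycles=12, weights=None):
    """Base fractions (A, C, G, T) at each of the first n_cycles positions.

    read_starts: sequences as Read 1 would see them, left-anchored at the
                 first base after the sequencing primer.
    weights:     optional relative abundance of each sequence in the pool.
    """
    if weights is None:
        weights = [1.0] * len(read_starts)
    if len(weights) != len(read_starts):
        raise ValueError("weights must match read_starts in length")
    table = []
    for cycle in range(n_cycles):
        counts = Counter()
        for seq, weight in zip(read_starts, weights):
            if cycle < len(seq):
                counts[seq[cycle].upper()] += weight
        total = sum(counts.values())
        table.append({b: (counts[b] / total if total else 0.0) for b in "ACGT"})
    return table


def low_diversity_cycles(table, max_fraction=0.5):
    """1-based cycle numbers where a single base exceeds max_fraction."""
    return [i + 1 for i, row in enumerate(table) if max(row.values()) > max_fraction]


# Placeholder pool: one locus primer with 0-3 nt staggers (hypothetical sequences).
pool = ["GTGCCAGC", "TGTGCCAGC", "GAGTGCCAGC", "CTGGTGCCAGC"]
composition = per_cycle_composition(pool, n_cycles=8)
for cycle, row in enumerate(composition, start=1):
    print(cycle, row)
print("flagged cycles:", low_diversity_cycles(composition))
```

Adding further amplicons or stagger variants to `pool`, with `weights` reflecting the pooling ratios, shows how the composition at each cycle shifts before anything is ordered or sequenced.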

Adding stagger bases has the potential to introduce biases in your libraries due to secondary structures or other priming phenomena. If you use the same mix of staggers for all samples, the bias should in theory be the same.

I have only sequenced amplicons on MiSeq and NovaSeq, and 5-12% PhiX has been enough for me on those platforms, so I cannot comment on HiSeq.