SEQanswers

Old 11-05-2012, 03:39 PM   #1
Qingl
Member
 
Location: salt lake city

Join Date: Sep 2012
Posts: 17
Help: Illumina Sequencing Reads Quality Scale Problem!

Hi,
Our lab asked a sequencing company to do Illumina sequencing for a snail. However, the reads they produced seem to have a problem: I used Jellyfish for k-mer analysis and didn't find any coverage peak, even after quality filtering and trimming at 50x coverage, although the reads show very high quality scores in the FastQC check. Since the reads have the expected hits in our transcriptome, we can rule out contamination. The only remaining explanation I can think of is that the quality scale was assigned completely wrong during base calling: for example, bases are labeled Q30 but are actually Q10 or lower.

Could anyone give us some ideas about how base calling could fail during sequencing, and some suggestions for what to do next? We have already spent a huge amount of money on this...

Thank you very much!!
Looking forward to your reply!
Best,
Qing
Old 11-06-2012, 12:40 AM   #2
simonandrews
Simon Andrews
 
Location: Babraham Inst, Cambridge, UK

Join Date: May 2009
Posts: 871

It's pretty unlikely that the base calling has mis-assigned high quality scores to your data when it's actually really poor. If anything, the Illumina pipeline tends to go the other way and mark good data as bad.

It wasn't really clear to me how you've decided that your data is bad - you said you have hits to your transcriptome, so presumably you can see the degree of similarity and get some idea of how good your data is from that.

When you did your fastqc analysis what did the per-sequence GC plot look like? Genomic reads from a clean source should generally produce a nice looking normal distribution in this plot. If you have contamination with a different organism (with a different GC content) you should get some idea from that.
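The per-sequence GC check described above can be approximated by hand. The following is a minimal sketch, not FastQC's actual implementation: the function names, the 5% bin width, and the toy reads are all assumptions made for illustration. It computes the GC percentage of each read and bins the results, so a second peak from an organism with a different GC content would show up as a second cluster of bins.

```python
from collections import Counter

def gc_percent(seq):
    """Return the GC percentage of one read, ignoring N calls."""
    seq = seq.upper()
    called = sum(seq.count(b) for b in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called

def gc_histogram(reads, bin_width=5):
    """Bin reads by GC percentage; returns {bin_start: read_count}."""
    hist = Counter()
    for seq in reads:
        # Clamp 100% GC into the top bin so bins run 0, 5, ..., 95.
        start = min(int(gc_percent(seq) // bin_width) * bin_width,
                    100 - bin_width)
        hist[start] += 1
    return dict(sorted(hist.items()))

# Toy reads (hypothetical); real input would come from a FASTQ file.
reads = ["ACGTACGT", "GGGCCCAT", "ATATATAT", "ACGTNNGT"]
for start, count in gc_histogram(reads).items():
    print(f"{start:3d}-{start + 5:3d}%  {'#' * count}")
```

On real data one would expect a single roughly bell-shaped distribution from a clean genomic library, as the post says.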
Old 11-08-2012, 08:58 PM   #3
Qingl

Hi Simon,
Thank you very much for your reply! I suspect the data are bad because: 1) the assembly failed at high coverage depth (50x~100x) with high-quality reads (>Q20~Q30); 2) Jellyfish does not produce any peak, no matter how I change the k-mer size or the coverage depth of the data.
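For readers unfamiliar with the "peak" being discussed: a k-mer spectrum is a histogram of how many distinct k-mers occur once, twice, three times, and so on. The sketch below is a toy illustration of that idea in plain Python, not Jellyfish itself. It makes simplifying assumptions that Jellyfish does not require: everything is held in memory, k-mers containing N are skipped, and reverse complements are not merged. The function name and the toy reads are hypothetical.

```python
from collections import Counter

def kmer_spectrum(reads, k):
    """Return {multiplicity: number of distinct k-mers seen that many times}."""
    counts = Counter()
    for seq in reads:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:  # skip k-mers with uncalled bases
                counts[kmer] += 1
    # Histogram of the counts themselves: this is the "spectrum".
    return Counter(counts.values())

# Toy reads (hypothetical) that overlap, so some k-mers are seen twice.
reads = ["ACGTAC", "CGTACG"]
spectrum = kmer_spectrum(reads, k=4)
for multiplicity, n_kmers in sorted(spectrum.items()):
    print(f"k-mers seen {multiplicity}x: {n_kmers}")
```

With deep coverage of a genome, the bulk of genuine k-mers is usually expected to cluster around one multiplicity, which is the coverage peak the poster could not find.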

However, when I BLAST the reads against the transcriptome, I get the expected coverage, and the per-sequence GC plot looks normal. So I think we can rule out read contamination. Then the only possible explanation is that the read quality scale is off......
Any thoughts or ideas? Thanks! It has been a nightmare for me...
Old 11-09-2012, 12:16 AM   #4
simonandrews

I'd go back to the point that if you have reads which align against your known transcriptome then by looking at those you should be able to tell approximately what error rate you're actually seeing in your genomic data. If your data really is Q10 or below it should be pretty obvious in the number of mismatches you see to your existing RNA-seq data.
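The comparison suggested above rests on the standard Phred relation, where a quality score Q corresponds to an error probability of 10^(-Q/10). The sketch below shows what mismatch rate each claimed quality would imply and converts an observed mismatch rate back into an approximate quality. The function names, the 100 bp read length, and the mismatch counts are assumptions for illustration only. Note that observed mismatches also include real sequence differences between samples, so the value computed this way is, if anything, a pessimistic estimate of read quality.

```python
import math

def phred_to_error(q):
    """Phred score -> per-base error probability, p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Per-base error probability -> Phred score, Q = -10 * log10(p)."""
    return -10 * math.log10(p)

def empirical_quality(mismatches, aligned_bases):
    """Approximate Phred score implied by an observed mismatch rate."""
    if aligned_bases <= 0:
        raise ValueError("aligned_bases must be positive")
    if mismatches == 0:
        return math.inf  # no observed errors
    return error_to_phred(mismatches / aligned_bases)

read_length = 100  # hypothetical read length
for q in (10, 20, 30):
    p = phred_to_error(q)
    print(f"Q{q}: error rate {p:.3f}, "
          f"about {p * read_length:.1f} mismatches per {read_length} bp read")

# Hypothetical alignment summary: 2,000 mismatches over 100,000 aligned bases.
print(f"Implied quality: about Q{empirical_quality(2000, 100_000):.0f}")
```

Reads that were truly Q10 would carry roughly one mismatch in every ten bases, which is why such a problem should be obvious in the alignments.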

Also, you could make some back of the envelope calculations to work out if the number of sequences mapping to your RNA-Seq data falls into line with the size of genome you expect. As long as you can (roughly) estimate the proportion of your genome expected to be covered by exons then you can see if your mapped reads occur at roughly this rate (give or take an order of magnitude) in your genomic data.
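The back-of-the-envelope check can be written down in a few lines. This is only a sketch under stated assumptions: reads are taken to be sampled uniformly from the genome, so the fraction mapping to the transcriptome should be roughly the fraction of the genome covered by exons. The function names and all the numbers are hypothetical, and the "give or take an order of magnitude" tolerance is taken from the post above.

```python
def within_order_of_magnitude(observed, expected):
    """True if observed is within a factor of 10 of expected."""
    return expected / 10 <= observed <= expected * 10

def check_mapping_rate(mapped_reads, total_reads, exon_fraction):
    """Compare the observed mapped fraction with the expected exon fraction."""
    observed = mapped_reads / total_reads
    return observed, within_order_of_magnitude(observed, exon_fraction)

# Hypothetical numbers: 1,000,000 genomic reads, 30,000 of which hit the
# transcriptome, and a guess that exons cover about 2% of the genome.
observed, plausible = check_mapping_rate(30_000, 1_000_000, 0.02)
print(f"observed fraction: {observed:.3f}, plausible: {plausible}")
```

A mapped fraction far below the expected range would point towards contamination or a wrong genome size estimate rather than towards the base calling.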

As I said before I've never seen an Illumina dataset mis-represent bad data as high quality. If you were really serious about excluding this as a possibility you could even go back to the original run and look at some of the thumbnail images and see if the data looked OK (or get your sequencing centre to do this). It's possible that this is the cause, but it doesn't seem the most likely way that this would go wrong.
Old 11-11-2012, 09:54 AM   #5
Qingl

Thanks Simon! The mismatch rate against the transcriptome is 2% compared with the control, but the control itself may have a high error rate.

I asked for thumbnail images, but the seq center tells me that thumbnail images take up a lot of space, so they normally do not save them for their runs. They use control samples to diagnose issues with a run instead, and they tell me the control samples look normal.

After discussion, the seq center thinks the quality scale is possibly wrong, but that this may be caused by the low-complexity nature of the genome, which would cause the Illumina machine to mis-record the quality......
Would you have any idea about this? The FastQC sequence content check looks normal, but that's an averaged-out result. Maybe the genome really has some low-complexity regions?
Old 11-12-2012, 01:31 AM   #6
simonandrews

Low complexity sequence is only a problem when it affects the whole library (i.e. there is a bias for all bases at a particular position within a library). Having low complexity sequence on an individual read isn't normally a problem, though it will make assembling the genome difficult. Also, the Illumina pipeline has no problem flagging low complexity sequence as poor quality, sometimes even when the bases have actually been read OK.

Nothing you've seen suggests that there is a problem with the calling of the sequence library. You have good quality scores on a run where other samples worked OK, and you have a reasonably low level of mismatch to control sequences within your own library.

I'd suggest focusing your attention on the assembly, or on ruling out other possible sources of contamination, as these seem more likely places for the problem than the base calling.
Old 11-12-2012, 09:24 PM   #7
Qingl

Thanks Simon!
Could you have a look at the attached figure and tell me whether there is a bias for all bases at a particular position within the library? It looks odd, though it does not indicate low complexity...
[Attached image: Screen Shot 2012-11-12 at 10.13.56 PM.png]
Old 11-13-2012, 12:35 AM   #8
simonandrews

The plot you attached certainly suggests that there is some loss of complexity, but it's not bad - we've seen much worse and have had no problems with the run. It's not until you get up around 70% being one base that Illumina will have any real problem (as long as there's some signal in the other channels). Did you get anything reported in the overrepresented sequence module? The most common reason for plots like that is the presence of a single contaminating sequence (normally an adapter).
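The per-position bias being discussed can be checked directly from the reads. The sketch below is an illustration, not FastQC's code: it tabulates the base fractions at each read position and flags positions where a single base reaches a threshold. The 70% default is borrowed from the rough figure quoted in the post above and is not an official Illumina limit; the function names and toy reads are hypothetical.

```python
from collections import Counter

def per_position_composition(reads):
    """Return a list with one {base: fraction} dict per read position."""
    max_len = max(len(r) for r in reads)
    columns = [Counter() for _ in range(max_len)]
    for seq in reads:
        for i, base in enumerate(seq.upper()):
            columns[i][base] += 1
    return [{b: n / sum(col.values()) for b, n in col.items()}
            for col in columns]

def biased_positions(reads, threshold=0.70):
    """List (1-based position, base, fraction) where one base dominates."""
    flagged = []
    for i, fracs in enumerate(per_position_composition(reads)):
        base, frac = max(fracs.items(), key=lambda kv: kv[1])
        if frac >= threshold:
            flagged.append((i + 1, base, frac))
    return flagged

# Toy reads (hypothetical): every read starts with A, the rest is mixed.
reads = ["ACGT", "AGTC", "ATCA", "ACAG"]
for pos, base, frac in biased_positions(reads):
    print(f"position {pos}: {base} at {frac:.0%}")
```

A bias that appears at the same positions across the whole library, as in the toy data, is the case that matters; low complexity within individual reads would not show up here.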
Old 11-13-2012, 08:30 AM   #9
Qingl

Exactly! Yes, there are 4 overrepresented sequences, but they only account for 0.12% ~ 0.82% (see attachment), so we didn't pay much attention to them. Do you think that would be a problem in sequencing?
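Percentages like the ones quoted above come from counting how often the exact same sequence recurs. The following is a rough sketch of that idea, assuming whole-read exact matching and an arbitrary cutoff; it is not how FastQC's module is implemented, and the function name, the `min_percent` parameter, and the toy reads are hypothetical.

```python
from collections import Counter

def overrepresented(reads, min_percent=0.1):
    """Return (sequence, percent of all reads) for frequently repeated reads."""
    counts = Counter(r.upper() for r in reads)
    total = len(reads)
    hits = []
    for seq, n in counts.most_common():
        percent = 100.0 * n / total
        if percent >= min_percent:
            hits.append((seq, percent))
    return hits

# Toy reads (hypothetical): one sequence makes up half of the library.
reads = ["AGATCGGAAG", "AGATCGGAAG", "ACGTACGTAC", "TTGCATGCAA"]
for seq, percent in overrepresented(reads, min_percent=50):
    print(f"{seq}: {percent:.1f}% of reads")
```

Comparing the top hits against the adapter sequences used in library preparation is the usual next step.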

Also, I have attached the k-mer content plot. It seems like these low-complexity k-mers are related to the adapter problem? Do you think this would shift the base call quality scale?
Thank you!
[Attached images: Overrepresented_seq.png, Kmer_content.png]
Old 11-14-2012, 09:01 AM   #10
simonandrews

They wouldn't be a problem with sequencing, but they might cause your per-base sequence plot to show a bias.

The Kmer plot is difficult to interpret without seeing the accompanying table. It would really be easier if you could put the whole report up somewhere we could see it rather than sending snippets.
Old 11-14-2012, 09:08 AM   #11
Qingl

Hmm, I see, thanks! Do you have a Dropbox account? I put the report in a shared Dropbox folder. Would you mind if I invite you to the shared folder?
All times are GMT -8.