
**simonandrews** · 11-06-2012, 12:40 AM

It's pretty unlikely that the base calling has mis-assigned high quality scores to your data when it's actually really poor. If anything, the Illumina pipeline tends to go the other way and mark good data as bad.

It wasn't really clear to me how you've decided that your data is bad - you said you have hits to your transcriptome, so presumably you can see the degree of similarity and get some idea of how good your data is from that.

When you did your FastQC analysis, what did the per-sequence GC plot look like? Genomic reads from a clean source should generally produce a nice-looking normal distribution in this plot. If you have contamination from a different organism (with a different GC content), you should get some idea of it from that plot.
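As a rough illustration of what the per-sequence GC plot summarises, the sketch below bins reads by their GC percentage. It is not FastQC's own code; the function names, the 5% bin width and the toy reads are assumptions made for the example. A second peak or a broad shoulder in such a histogram is the kind of signal that would hint at a contaminant with a different GC content.

```python
from collections import Counter


def gc_percent(seq):
    """Return the GC percentage of one read, ignoring N calls."""
    seq = seq.upper()
    called = sum(seq.count(b) for b in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called


def gc_histogram(reads, bin_width=5):
    """Count reads per GC bin; each key is the lower edge of a bin."""
    hist = Counter()
    for read in reads:
        edge = int(gc_percent(read) // bin_width) * bin_width
        # Put reads with 100% GC into the top bin instead of a bin of their own.
        edge = min(edge, 100 - bin_width)
        hist[edge] += 1
    return dict(sorted(hist.items()))


# Toy reads, purely illustrative.
reads = ["ACGTACGT", "GGGCCCAT", "ATATATAT", "GCGCGCGC"]
print(gc_histogram(reads))
```

On real data the reads would come from the FASTQ file, and the histogram would be compared by eye (or by a fit) against a single bell-shaped curve.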

**Qingl** · 11-08-2012, 08:58 PM

Hi Simon,
Thank you very much for your reply! I suspect the data are bad because: (1) the assembly failed at high coverage depth (50x~100x) and high quality (>Q20~Q30); (2) jellyfish would not produce any peak, however I changed the k-mer size or the coverage depth of the data.

However, when I BLAST the reads against the transcriptome, I get the expected coverage of hits, and the per-sequence GC plot looks normal. So I think we can rule out read contamination. Then the only possible explanation is that the read quality scale is off...
Any thoughts or ideas? Thanks! It has been a nightmare for me...
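For context on where a k-mer histogram peak would normally be expected: under the usual idealisation of uniform coverage and error-free reads, the k-mer depth is the base depth scaled by (L - k + 1) / L, where L is the read length. The sketch below applies that relation. The read length of 100 is an assumption, since the thread does not state it, and `kmer_coverage` is a name made up for this example.

```python
def kmer_coverage(base_coverage, read_length, k):
    """Expected k-mer depth for a given base depth, assuming uniform,
    error-free coverage: C_k = C * (L - k + 1) / L."""
    if k > read_length:
        raise ValueError("k cannot exceed the read length")
    return base_coverage * (read_length - k + 1) / read_length


# 50x base coverage with an assumed read length of 100.
for k in (17, 25, 31):
    print(k, kmer_coverage(50, 100, k))
```

If no peak appears anywhere near these depths for any k, that points towards something unusual in the library or genome (repeats, heterozygosity, contamination) rather than towards the histogram tool.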

**simonandrews** · 11-09-2012, 12:16 AM

I'd go back to the point that if you have reads which align against your known transcriptome then by looking at those you should be able to tell approximately what error rate you're actually seeing in your genomic data. If your data really is Q10 or below it should be pretty obvious in the number of mismatches you see to your existing RNA-seq data.
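To make the Q10 remark concrete, a Phred score Q corresponds to an error probability of 10^(-Q/10), so Q10 means roughly one wrong base in ten. The sketch below converts scores to error probabilities and to expected errors per read. The read length of 100 is an assumption for illustration only.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def expected_errors_per_read(q, read_length):
    """Expected number of miscalled bases in a read of uniform quality q."""
    return phred_to_error_prob(q) * read_length


# Assumed read length of 100 bases.
for q in (10, 20, 30):
    print(f"Q{q}: p(error) = {phred_to_error_prob(q):.3f}, "
          f"about {expected_errors_per_read(q, 100):.1f} errors per 100 bp read")
```

Reads that are truly Q10 would therefore show on the order of ten mismatches per 100 bases when aligned to a trusted reference, which is hard to miss.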

Also, you could make some back of the envelope calculations to work out if the number of sequences mapping to your RNA-Seq data falls into line with the size of genome you expect. As long as you can (roughly) estimate the proportion of your genome expected to be covered by exons then you can see if your mapped reads occur at roughly this rate (give or take an order of magnitude) in your genomic data.
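A minimal sketch of that back-of-the-envelope check follows. All the numbers are hypothetical placeholders, not values from this thread, and the function names are invented for the example. The idea is simply to compare the observed fraction of genomic reads hitting the transcriptome with the fraction of the genome expected to be exonic, and to accept anything within about an order of magnitude.

```python
import math


def within_order_of_magnitude(observed, expected):
    """True if observed and expected differ by at most a factor of ten."""
    return abs(math.log10(observed / expected)) <= 1.0


def mapping_check(mapped_reads, total_reads, exon_bp, genome_bp):
    """Compare the observed mapped fraction with the expected exonic fraction."""
    observed = mapped_reads / total_reads
    expected = exon_bp / genome_bp
    return observed, expected, within_order_of_magnitude(observed, expected)


# Hypothetical example: 2M of 100M reads map; 30 Mb of exons in a 1 Gb genome.
observed, expected, ok = mapping_check(2_000_000, 100_000_000, 30_000_000, 1_000_000_000)
print(f"observed {observed:.3f}, expected {expected:.3f}, consistent: {ok}")
```

A large shortfall in the observed fraction would suggest that much of the library is not the target genome, or that the genome is far larger than assumed.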

As I said before I've never seen an Illumina dataset mis-represent bad data as high quality. If you were really serious about excluding this as a possibility you could even go back to the original run and look at some of the thumbnail images and see if the data looked OK (or get your sequencing centre to do this). It's possible that this is the cause, but it doesn't seem the most likely way that this would go wrong.

**Qingl** · 11-11-2012, 09:54 AM

Thanks Simon! The mismatch rate against the transcriptome is 2% compared with the control, but the control itself may have a high error rate.

I asked for the thumbnail images, but the seq center tells me that thumbnail images take up a lot of space, so they normally do not save them for their runs. They use control samples to diagnose issues with the run instead, and they tell me the control samples look normal.

After discussion, the seq center thinks the quality scale is possibly wrong, but that this may be caused by the low-complexity nature of the genome, which would cause the Illumina machine to mis-record it...
Would you have any ideas about this? The FastQC sequence content check looks normal, but that's an averaged-out result. Maybe the genome really does have some low-complexity regions?

**simonandrews** · 11-12-2012, 01:31 AM

Low complexity sequence is only a problem when it affects the whole library (i.e. there is a bias for all bases at a particular position within a library). Having low complexity sequence on an individual read isn't normally a problem, though it will make assembling the genome difficult. Also, the Illumina pipeline has no problem flagging low complexity sequence as poor quality, sometimes even when the bases have actually been read OK.

Nothing you've seen suggests that there is a problem with the calling of the sequence library. You have good quality scores on a run where other samples worked OK, and you have a reasonably low level of mismatch to control sequences within your own library.

I'd suggest focusing your attention on the assembly, or on ruling out other possible sources of contamination, as these seem the more likely places for the problem to be, rather than assuming that the base calling is wrong.

**Qingl** · 11-12-2012, 09:24 PM

Thanks Simon!
Could you have a look at the attached figure and tell me whether there is a bias for all bases at a particular position within the library? It looks odd, though it does not seem to indicate low complexity...

Attached Files

Screen Shot 2012-11-12 at 10.13.56 PM.png (59.0 KB, 29 views)

**simonandrews** · 11-13-2012, 12:35 AM

The plot you attached certainly suggests that there is some loss of complexity, but it's not bad - we've seen much worse and have had no problems with the run. It's not until you get up around 70% being one base that Illumina will have any real problem (as long as there's some signal in the other channels). Did you get anything reported in the overrepresented sequence module? The most common reason for plots like that is the presence of a single contaminating sequence (normally an adapter).
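The kind of position-wise bias discussed here can be checked directly from the reads. The sketch below tallies base composition at each read position and reports positions where a single base exceeds the roughly 70% level mentioned above. It is an illustrative stand-in for the per-base sequence content plot, not FastQC's implementation; the function names and toy reads are assumptions.

```python
from collections import Counter


def per_position_composition(reads):
    """Return one Counter of base counts for each read position."""
    length = max(len(r) for r in reads)
    counts = [Counter() for _ in range(length)]
    for read in reads:
        for i, base in enumerate(read.upper()):
            counts[i][base] += 1
    return counts


def dominated_positions(reads, threshold=0.70):
    """List (position, base, fraction) where one base exceeds the threshold.

    Positions are 0-based offsets from the start of the read.
    """
    flagged = []
    for i, counts in enumerate(per_position_composition(reads)):
        base, n = counts.most_common(1)[0]
        fraction = n / sum(counts.values())
        if fraction > threshold:
            flagged.append((i, base, fraction))
    return flagged


# Toy reads, purely illustrative.
reads = ["ACGT", "ACGA", "ATGC", "AGCT"]
print(dominated_positions(reads))
```

On a real library, an empty result would be consistent with the view that the observed loss of complexity is too mild to trouble the base caller.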

**Qingl** · 11-13-2012, 08:30 AM

Exactly! Yes, there are 4 overrepresented sequences, but they only account for 0.12% ~ 0.82% (see attachment), so we didn't pay much attention to them. Do you think that would be a problem in sequencing?

Also, I have attached the k-mer content plot. It seems like these low-complexity k-mers are related to the adapter problem? Do you think this would shift the base call quality scale?
Thank you!

**simonandrews** · 11-14-2012, 09:01 AM

They wouldn't be a problem with sequencing, but they might cause your per-base sequence plot to show a bias.

The Kmer plot is difficult to interpret without seeing the accompanying table. It would really be easier if you could put the whole report up somewhere we could see it rather than sending snippets.

**Qingl** · 11-14-2012, 09:08 AM

Hmm, I see, thanks! Do you have a Dropbox account? I have put the report in a shared Dropbox folder. Would you mind if I invited you to it?

Help- Illumina Sequencing Reads Quality Scale Problem!
