  • Help- Illumina Sequencing Reads Quality Scale Problem!

    Hi,
    Our lab asked a sequencing company to do Illumina sequencing for a snail. However, the reads they produced seem to have some problems: I used Jellyfish for k-mer analysis and didn't find any coverage peak, even after quality filtering and trimming at 50x coverage, although the reads show very high quality scores in the FastQC check. Since the reads have the expected hits in our transcriptome, we can rule out read contamination. The only possible reason I can think of is that the quality scale assigned during base calling is completely wrong. For example, bases are reported as Q30 but are actually Q10 or lower.

    Could anyone give us some idea of how base calling could fail during sequencing, and could anyone give us some suggestions? We have already spent a huge amount of money on it...

    Thank you very much!!
    Looking forward to your reply!
    Best,
    Qing

  • #2
    It's pretty unlikely that the base calling has mis-assigned high quality scores to your data when it's actually really poor. If anything, the Illumina pipeline tends to go the other way and mark good data as bad.

    It wasn't really clear to me how you've decided that your data is bad - you said you have hits to your transcriptome, so presumably you can see the degree of similarity and get some idea of how good your data is from that.

    When you did your fastqc analysis what did the per-sequence GC plot look like? Genomic reads from a clean source should generally produce a nice looking normal distribution in this plot. If you have contamination with a different organism (with a different GC content) you should get some idea from that.
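
    A minimal sketch of the per-read GC check described above, for anyone who wants to reproduce it outside FastQC. This is not FastQC's implementation; the function names and the 5% bin width are assumptions made for illustration. A single roughly normal hump is what a clean genomic library should give, while a second hump or shoulder hints at a contaminant with different GC content.

```python
from collections import Counter

def gc_percent(seq: str) -> float:
    """GC percentage of one read, ignoring N and other non-ACGT symbols."""
    seq = seq.upper()
    called = sum(seq.count(base) for base in "ACGT")
    if called == 0:
        return 0.0
    return 100.0 * (seq.count("G") + seq.count("C")) / called

def gc_histogram(reads, bin_width=5):
    """Count reads per GC bin; keys are the lower edge of each bin in percent."""
    hist = Counter()
    for read in reads:
        gc = gc_percent(read)
        # Put reads with exactly 100% GC into the last bin.
        bin_start = min(int(gc // bin_width) * bin_width, 100 - bin_width)
        hist[bin_start] += 1
    return dict(sorted(hist.items()))

if __name__ == "__main__":
    # Toy reads, purely illustrative.
    toy_reads = ["ATGCATGC", "AATTAATT", "GGCCGGCC", "ATGCNNNN"]
    for bin_start, n in gc_histogram(toy_reads, bin_width=25).items():
        print(f"{bin_start:3d}% and up: {n} reads")
```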


    • #3
      Originally posted by simonandrews
      It's pretty unlikely that the base calling has mis-assigned high quality scores to your data when it's actually really poor. If anything, the Illumina pipeline tends to go the other way and mark good data as bad.

      It wasn't really clear to me how you've decided that your data is bad - you said you have hits to your transcriptome, so presumably you can see the degree of similarity and get some idea of how good your data is from that.

      When you did your fastqc analysis what did the per-sequence GC plot look like? Genomic reads from a clean source should generally produce a nice looking normal distribution in this plot. If you have contamination with a different organism (with a different GC content) you should get some idea from that.
      Hi Simon,
      Thank you very much for your reply! I suspect the data are bad because: (1) the assembly failed at high coverage depth (50x~100x) and high quality (>Q20~Q30); (2) Jellyfish does not produce any peak, however I change the k-mer size or the coverage depth of the data.

      However, when I BLAST the reads against the transcriptome, I get the expected coverage hits, and the per-sequence GC plot looks normal. So I think we can rule out read contamination. Then the only possible explanation is that the read quality scale is off......
      Any thoughts or ideas? Thanks! It has been a nightmare for me...
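
      For readers unfamiliar with the k-mer peak being discussed, here is a rough sketch of what a k-mer spectrum is and where the main peak would be expected. It is a toy in-memory version, not how Jellyfish works internally; the function names are made up for illustration, and the 100 bp read length in the example is an assumption (the thread does not state it). The expected-peak formula ignores sequencing errors, heterozygosity and repeats, all of which flatten or shift a real peak.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def kmer_spectrum(reads, k):
    """Map multiplicity -> number of distinct canonical k-mers seen that many times."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers containing N etc.
                counts[canonical(kmer)] += 1
    return dict(sorted(Counter(counts.values()).items()))

def expected_kmer_coverage(base_coverage, read_length, k):
    """Error-free expectation for the position of the main k-mer peak."""
    return base_coverage * (read_length - k + 1) / read_length

if __name__ == "__main__":
    print(kmer_spectrum(["ACGTT"], 3))
    # 50x base coverage with hypothetical 100 bp reads and k = 21
    print(expected_kmer_coverage(50, 100, 21))
```

      On real data the histogram would come from Jellyfish itself; the point of the sketch is only that a single-copy genome at decent depth should show a hump near the expected k-mer coverage, well separated from the low-multiplicity error k-mers.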


      • #4
        I'd go back to the point that if you have reads which align against your known transcriptome then by looking at those you should be able to tell approximately what error rate you're actually seeing in your genomic data. If your data really is Q10 or below it should be pretty obvious in the number of mismatches you see to your existing RNA-seq data.
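
        To make the Q10 versus Q30 comparison concrete, the standard Phred relation is Q = -10 * log10(p), where p is the per-base error probability. The sketch below converts a quality score to an expected number of errors per read; the 100 bp read length is an assumed value, not something stated in this thread.

```python
def phred_to_error_prob(q: float) -> float:
    """Per-base error probability implied by Phred quality Q."""
    return 10 ** (-q / 10)

def expected_errors_per_read(read_length: int, q: float) -> float:
    """Expected sequencing errors in a read if every base had quality Q."""
    return read_length * phred_to_error_prob(q)

if __name__ == "__main__":
    read_length = 100  # assumed read length, for illustration only
    for q in (10, 20, 30):
        p = phred_to_error_prob(q)
        errors = expected_errors_per_read(read_length, q)
        print(f"Q{q}: p = {p:.4f}, about {errors:.1f} errors per {read_length} bp read")
```

        Under that assumption, true Q30 data would average about 0.1 errors per read and true Q10 data about 10. Real differences between the genomic sample and the transcriptome (polymorphism, assembly errors) add to the observed mismatches, so the observed rate is an upper bound on the sequencing error rate.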

        Also, you could make some back of the envelope calculations to work out if the number of sequences mapping to your RNA-Seq data falls into line with the size of genome you expect. As long as you can (roughly) estimate the proportion of your genome expected to be covered by exons then you can see if your mapped reads occur at roughly this rate (give or take an order of magnitude) in your genomic data.
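
        The back-of-the-envelope check above could look something like this. All numbers in the example are hypothetical placeholders (genome size, exonic bases, observed mapping fraction); substitute your own estimates. The "within an order of magnitude" tolerance follows the wording of the post.

```python
def expected_exonic_fraction(exon_bp: float, genome_bp: float) -> float:
    """Fraction of genomic reads expected to land in exons, assuming even coverage."""
    return exon_bp / genome_bp

def within_order_of_magnitude(observed: float, expected: float) -> bool:
    """True if observed is between one tenth and ten times the expected value."""
    ratio = observed / expected
    return 0.1 <= ratio <= 10

if __name__ == "__main__":
    genome_bp = 1e9   # hypothetical genome size
    exon_bp = 30e6    # hypothetical total exonic sequence
    observed = 0.02   # hypothetical fraction of genomic reads hitting the transcriptome

    expected = expected_exonic_fraction(exon_bp, genome_bp)
    print(f"expected fraction: {expected:.3f}, observed: {observed:.3f}")
    print("consistent:", within_order_of_magnitude(observed, expected))
```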

        As I said before I've never seen an Illumina dataset mis-represent bad data as high quality. If you were really serious about excluding this as a possibility you could even go back to the original run and look at some of the thumbnail images and see if the data looked OK (or get your sequencing centre to do this). It's possible that this is the cause, but it doesn't seem the most likely way that this would go wrong.


        • #5
          Thanks Simon! The mismatch rate for the transcriptome is 2% than the control, but the control itself may have a high error rate.

          I asked for the thumbnail images, but the seq center tells me that thumbnail images take up a lot of space, so they normally do not save them for their runs. They use control samples to diagnose issues with the run instead, and they tell me the control samples look normal.

          After discussion, the seq center thinks the quality scale is possibly wrong, but that this may be caused by the low-complexity nature of the genome, which would cause mis-recording by the Illumina machine......
          Would you have any idea about this? The FastQC sequence content check looks normal, but that's an averaged-out result. Maybe the genome really has some low-complexity regions?


          • #6
            Low complexity sequence is only a problem when it affects the whole library (i.e. there is a bias for all bases at a particular position within a library). Having low complexity sequence on an individual read isn't normally a problem - though it will make assembling the genome difficult. Also, the Illumina pipeline has no problem flagging low complexity sequence as poor quality - sometimes even when the bases have actually been read OK.

            Nothing you've seen suggests that there is a problem with the calling of the sequence library. You have good quality scores on a run where other samples worked OK, and you have a reasonably low level of mismatch to control sequences within your own library.

            I'd suggest focusing your attention on the assembly, or ruling out other possible sources of contamination rather than assuming that the base calling is wrong as this would seem to be the more likely place for the problem to be.


            • #7
              Thanks Simon!
              Could you have a look at the attached figure and tell me whether there is a bias for all bases at a particular position within the library? It looks odd, though it does not seem to indicate low complexity...
              Attached Files


              • #8
                The plot you attached certainly suggests that there is some loss of complexity, but it's not bad - we've seen much worse and have had no problems with the run. It's not until you get up around 70% being one base that Illumina will have any real problem (as long as there's some signal in the other channels). Did you get anything reported in the overrepresented sequence module? The most common reason for plots like that is the presence of a single contaminating sequence (normally an adapter).
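
                As a rough way to check the numbers behind a per-base sequence content plot, the sketch below tabulates base fractions at each read position and flags positions where one base reaches the roughly 70% level mentioned above. The function names are illustrative, the reads are assumed to be held in a list, and this is not FastQC's own code.

```python
from collections import Counter

def per_position_fractions(reads):
    """For each read position, return a dict of base -> fraction of reads with that base."""
    n_positions = max((len(read) for read in reads), default=0)
    counters = [Counter() for _ in range(n_positions)]
    for read in reads:
        for i, base in enumerate(read.upper()):
            counters[i][base] += 1
    fractions = []
    for counter in counters:
        total = sum(counter.values())
        fractions.append({base: n / total for base, n in counter.items()})
    return fractions

def biased_positions(reads, threshold=0.70):
    """List (position, base, fraction) where a single base reaches the threshold."""
    flagged = []
    for i, fractions in enumerate(per_position_fractions(reads)):
        base, fraction = max(fractions.items(), key=lambda item: item[1])
        if fraction >= threshold:
            flagged.append((i, base, fraction))
    return flagged

if __name__ == "__main__":
    # Toy reads: positions 0 and 2 are identical in every read.
    toy_reads = ["AAGT", "ACGT", "ATGA", "AGGC"]
    for position, base, fraction in biased_positions(toy_reads):
        print(f"position {position}: {base} in {fraction:.0%} of reads")
```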


                • #9
                  Exactly! Yes, there are 4 overrepresented sequences, but they only make up 0.12% ~ 0.82% (see attachment), so we didn't pay much attention to them. Do you think they would be a problem in sequencing?

                  Also, I have attached the k-mer content plot. It seems like these low-complexity k-mers are related to the adapter problem? Do you think this would shift the base call quality scale?
                  Thank you!
                  Attached Files


                  • #10
                    They wouldn't be a problem with sequencing, but they might cause your per-base sequence plot to show a bias.

                    The Kmer plot is difficult to interpret without seeing the accompanying table. It would really be easier if you could put the whole report up somewhere we could see it rather than sending snippets.


                    • #11
                      Hmm, I see, thx! Do you have a Dropbox account? I put the report in a shared Dropbox folder. Would you mind if I invite you to the shared folder?
