  • Illumina process controls present in input data

    Hello,

    I am trying to assemble the genome of an insect using data from Illumina HiSeq2500 (250 PE). The first check of my data with FastQC showed the presence of:
    [1] Illumina adapters
    [2] Illumina Process Controls
    [3] this sequence: GGGCCATACTAGTACTGGATGCATCTGCAGGATATCGCGGCCGC

    I understand why the adapters are present and how to deal with them, but why are there process controls? And where does the DNA sequence in point 3 come from? Can I just remove it?

    Thank you in advance!

  • #2
    #2 must be process controls from the TruSeq kit (you can find the sequences in the Illumina sequence letter for scanning/trimming purposes). #3 could very well be sequence from the genome you are interested in, so you shouldn't just throw it out. Use BBDuk from the BBMap suite to scan and trim your data, then run FastQC again to see how the data looks.
    Last edited by GenoMax; 09-22-2017, 07:02 AM.


    • #3
      Thank you for your response!

      The strange thing about this sequence (sorry, I forgot to mention it) is that it is always located at the beginning of the reads. And I still have it even after trimming the data with BBDuk.


      • #4
        Do you mean to say that sequence in #3 is present at the beginning of all reads? That would certainly be very odd.


        • #5
          No, only a certain percentage of reads contain this sequence (I think less than 1%, but I don't have an estimate yet), but in all of those reads the sequence is located at the beginning.

          Actually, I have the same problem as described here. I found the explanation for all the other sequences detected by FastQC (they correspond to Illumina Process Controls, which are documented on the Illumina website), but I have no idea where this remaining sequence comes from.


          • #6
            If you take out that sequence, does the rest of the read blast to the genome of the expected species (or a close relative)? You could either drop those reads altogether (since they are only 1%) or trim that sequence out (with bbduk's literal= option).
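
A minimal sketch of the "drop those reads" route using BBDuk's `literal=` option. The wrapper name and all file names are hypothetical placeholders, `k=31` is an assumed k-mer length (the literal is 44 bases long, so it is longer than k), and paired-end input would need the corresponding second-file parameters as well.

```shell
# Hypothetical wrapper; the three file names passed in are placeholders.
# Reads sharing a 31-mer with the literal are written to "matched" (outm=),
# all remaining reads are written to "clean" (out=).
drop_reads_with_literal() {
    bbduk.sh in="$1" out="$2" outm="$3" \
        literal=GGGCCATACTAGTACTGGATGCATCTGCAGGATATCGCGGCCGC k=31
}

# Example call (not run here):
# drop_reads_with_literal reads.fq.gz clean.fq.gz matched.fq.gz
```

Keeping the matched reads in a separate file makes it easy to blast a few of them afterwards instead of discarding them unseen.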


            • #7
              Thank you GenoMax for your suggestion. I just tried to blast these reads, and approximately one third of them blast to... the common carp genome! But I am working on an insect (and as far as I know there is no assembly available for any species close to mine). How could this be explained?

              I also tried to blast the remaining (normal) reads and none of them matched to that genome.

              And why is that sequence always located at the beginning of these reads? (Well, I just found 16 reads that have it in the middle, but all the other 52,150 have it at the beginning.)

              In any case, I suppose that I should remove all these reads from my assembly.
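
The start-versus-middle tally described above can be reproduced with a short awk filter. This is only a sketch: the function name `count_positions` is made up for illustration, it assumes a standard 4-lines-per-record FASTQ on stdin, and it looks only at the first occurrence of the sequence in each read.

```shell
# Count reads that carry the given sequence at position 1 versus further in.
# Reads FASTQ from stdin; prints "start=<n> elsewhere=<n>".
count_positions() {
    awk -v s="$1" 'NR % 4 == 2 {          # line 2 of each record is the read
        p = index($0, s)                   # 1-based position, 0 if absent
        if (p == 1) start++
        else if (p > 1) other++
    }
    END { printf "start=%d elsewhere=%d\n", start, other }'
}

# Example call (reads.fq.gz is a placeholder file name):
# zcat reads.fq.gz | count_positions GGGCCATACTAGTACTGGATGCATCTGCAGGATATCGCGGCCGC
```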


              • #8
                It may be best to remove them altogether. Hopefully you don't have a bigger contamination problem. Take a few of the other "normal" reads and confirm them by blast before you dive into the assembly.


                • #9
                  Looks like the reverse complement of #3 (GCGGCCGCGATATCCTGCAGATGCATCCAGTACTAGTATGGCCC) matches the last 55 bases of the TruSeq process controls CTA-150bp, CTA-450bp, CTA-550bp, and CTA-850bp.
                  Josh Kinman
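
For anyone who wants to check the reverse complement quoted above, here is a small sketch using standard command-line tools. The function name `revcomp` is made up for illustration, and it assumes `rev` (from util-linux) is available and that the input contains only A, C, G and T.

```shell
# Reverse-complement a DNA sequence: reverse the string, then swap A<->T and C<->G.
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# The sequence from point [3] of the first post.
query='GGGCCATACTAGTACTGGATGCATCTGCAGGATATCGCGGCCGC'
revcomp "$query"
# prints GCGGCCGCGATATCCTGCAGATGCATCCAGTACTAGTATGGCCC
```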


                  • #10
                    Originally posted by jdk787
                    Looks like the reverse complement of #3 (GCGGCCGCGATATCCTGCAGATGCATCCAGTACTAGTATGGCCC) matches the last 55 bases of the TruSeq process controls CTA-150bp, CTA-450bp, CTA-550bp, and CTA-850bp.
                    You are right jdk787, thank you very much!


                    • #11
                      Incidentally, that sequence also occurs in:

                      CTA___650bp, CTA___350bp, CTA___250bp, CTA___750bp

                      These are all distributed with BBMap in /bbmap/resources/sequencing_artifacts.fa.gz. Their names were anonymized, though, as Illumina required before I could distribute them publicly. Typically, before you do things like assembly, I suggest you perform adapter trimming and synthetic artifact removal, e.g.

                      Code:
                      bbduk.sh in=in.fq.gz out=trimmed.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=70 ref=adapters ftm=5
                      bbduk.sh in=trimmed.fq.gz out=filtered.fq.gz k=31 ref=artifacts,phix ordered cardinality
                      The current versions of BBMap allow you to specify "ref=artifacts", for example, and it will automatically use /bbmap/resources/sequencing_artifacts.fa.gz. The full suggested pipeline is in /bbmap/pipelines/assemblyPipeline.sh but some of the specific steps may be more relevant for bacteria than insect assembly.


                      • #12
                        Thank you Brian for your suggestion!

                        I only performed the first step (adapter trimming); I wasn't aware that bbduk could filter synthetic artifacts as well. I'll take a look at the suggested pipeline. Thank you again!
