  • Illumina process controls present in input data

    Hello,

    I am trying to assemble the genome of an insect using data from Illumina HiSeq2500 (250 PE). The first check of my data with FastQC showed the presence of:
    [1] Illumina adapters
    [2] Illumina Process Controls
    [3] this sequence: GGGCCATACTAGTACTGGATGCATCTGCAGGATATCGCGGCCGC

    I understand why the adapters are present and how to deal with them, but why are there process controls? And where does the DNA sequence in point 3 come from? Can I just remove it?

    Thank you in advance!

  • #2
    #2 must be process controls from the TruSeq kit (you can find the sequences in the Illumina sequence letter for scanning/trimming purposes). #3 could very well be sequence from the genome you are interested in, so you shouldn't just throw it out. Use BBDuk from the BBMap suite to scan and trim your data, then run FastQC again to see how the data looks.
    Last edited by GenoMax; 09-22-2017, 07:02 AM.


    • #3
      Thank you for your response!

      The strange thing about this sequence (sorry, I forgot to mention it) is that it is always located at the beginning of the reads. And I still have it even after trimming the data with BBDuk.


      • #4
        Do you mean to say that sequence in #3 is present at the beginning of all reads? That would certainly be very odd.


        • #5
          No, only a certain percentage of reads contain this sequence (I think less than 1%, but I don't have an estimate yet), but in all those reads the sequence is located at the beginning.

          Actually, I have the same problem as described here. I found the explanation for all the other sequences detected by FastQC (which correspond to Illumina Process Controls and are documented on the Illumina website), but I have no idea where this remaining sequence comes from.


          • #6
            If you take out that sequence, does the rest of the read BLAST to the genome of the expected species (or a close relative)? You could either drop those reads altogether (since they are only 1%) or trim that sequence out (with bbduk's literal= option).
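The two options above (drop the read, or trim the sequence off) can be sketched in plain Python. This is only an illustration of the idea, not how BBDuk works: BBDuk matches k-mers and can tolerate mismatches, whereas this sketch only handles an exact match at the very start of the read. The function name `handle_read` and the `mode` parameter are made up for this example.

```python
# Sequence #3 from the first post.
MOTIF = "GGGCCATACTAGTACTGGATGCATCTGCAGGATATCGCGGCCGC"


def handle_read(seq, qual, motif=MOTIF, mode="trim"):
    """Handle a read that may start with `motif` (exact match only).

    Returns (seq, qual) unchanged if the read does not start with the motif.
    With mode="trim", the motif and its quality values are cut off the front.
    With mode="drop", None is returned so the caller can discard the read.
    """
    if not seq.startswith(motif):
        return seq, qual
    if mode == "drop":
        return None
    n = len(motif)
    return seq[n:], qual[n:]


# A read carrying the motif followed by four more bases.
read_seq = MOTIF + "ACGT"
read_qual = "I" * len(read_seq)

trimmed = handle_read(read_seq, read_qual, mode="trim")
dropped = handle_read(read_seq, read_qual, mode="drop")
```

Trimming keeps the rest of the read for the BLAST check; dropping is the safer choice if the remainder turns out not to belong to the target genome.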


            • #7
              Thank you GenoMax for your suggestion. I just tried to BLAST these reads, and approximately one third of them hit... the common carp genome! But I am working on an insect (and as far as I know there is no assembly available for any species close to mine). How could this be explained?

              I also tried to BLAST the remaining (normal) reads, and none of them matched that genome.

              And why is that sequence always located at the beginning of these reads? (Well, I just found 16 reads that have it in the middle, but all the other 52,150 have it at the beginning.)

              In any case, I suppose that I should remove all these reads from my assembly.
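A count like the one above (sequence at the start of the read versus somewhere in the middle) can be reproduced with a few lines of Python. This is a minimal sketch that assumes a plain 4-line-per-record FASTQ and exact matches only; `classify` and `tally` are illustrative names, not part of any tool mentioned in this thread.

```python
from collections import Counter

# Sequence #3 from the first post.
MOTIF = "GGGCCATACTAGTACTGGATGCATCTGCAGGATATCGCGGCCGC"


def classify(seq, motif=MOTIF):
    """Return 'start', 'internal' or 'absent' for the first exact motif hit."""
    pos = seq.find(motif)
    if pos < 0:
        return "absent"
    return "start" if pos == 0 else "internal"


def tally(fastq_lines, motif=MOTIF):
    """Count motif positions over an iterable of FASTQ lines.

    Assumes strict 4-line records, so the sequence is every line with
    index 1 modulo 4.
    """
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:
            counts[classify(line.strip(), motif)] += 1
    return counts


# Tiny in-memory example with one read of each kind.
records = [
    "@r1", MOTIF + "ACGTACGT", "+", "I" * 52,
    "@r2", "TTTT" + MOTIF, "+", "I" * 48,
    "@r3", "ACGTACGTACGT", "+", "I" * 12,
]
counts = tally(records)
```

For a real file, an open handle can be passed directly, for example `tally(gzip.open("reads.fq.gz", "rt"))` after `import gzip` (the file name is a placeholder).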


              • #8
                It may be best to remove them altogether. Hopefully you don't have a bigger contamination problem. Take a few of the other "normal" reads and confirm them by BLAST before you dive into the assembly.


                • #9
                  Looks like the reverse complement of #3 (GCGGCCGCGATATCCTGCAGATGCATCCAGTACTAGTATGGCCC) matches the last 55 bases of TruSeq process controls CTA-150bp, CTA-450bp, CTA-550bp, and CTA-850bp
                  Josh Kinman


                  • #10
                    Originally posted by jdk787
                    Looks like the reverse complement of #3 (GCGGCCGCGATATCCTGCAGATGCATCCAGTACTAGTATGGCCC) matches the last 55 bases of TruSeq process controls CTA-150bp, CTA-450bp, CTA-550bp, and CTA-850bp
                    You are right jdk787, thank you very much!
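The reverse-complement relationship quoted above is easy to check independently. The snippet below only confirms that the two sequences in this thread are reverse complements of each other; it does not check them against the process control sequences themselves, which are not reproduced here.

```python
def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


observed = "GGGCCATACTAGTACTGGATGCATCTGCAGGATATCGCGGCCGC"   # sequence #3
quoted_rc = "GCGGCCGCGATATCCTGCAGATGCATCCAGTACTAGTATGGCCC"  # from post #9

print(reverse_complement(observed) == quoted_rc)  # prints True
```

This is also why a search for the sequence in one orientation only can miss the match: the controls list it on the opposite strand.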


                    • #11
                      Incidentally, that sequence also occurs in:

                      CTA___650bp, CTA___350bp, CTA___250bp, CTA___750bp

                      These are all distributed with BBMap in /bbmap/resources/sequencing_artifacts.fa.gz. Their names were anonymized, though, as required by Illumina before I could distribute them publicly. Typically, before you do things like assembly, I suggest you perform adapter-trimming and synthetic artifact removal, e.g.

                      Code:
                      bbduk.sh in=in.fq.gz out=trimmed.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=70 ref=adapters ftm=5
                      bbduk.sh in=trimmed.fq.gz out=filtered.fq.gz k=31 ref=artifacts,phix ordered cardinality
                      The current versions of BBMap allow you to specify "ref=artifacts", for example, and it will automatically use /bbmap/resources/sequencing_artifacts.fa.gz. The full suggested pipeline is in /bbmap/pipelines/assemblyPipeline.sh but some of the specific steps may be more relevant for bacteria than insect assembly.
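For readers wondering what the second command does conceptually, here is a rough Python sketch of k-mer based filtering: index every k-mer of the reference sequences (both strands) and discard any read that shares at least one k-mer with the index. This is a simplified illustration under that assumption, not BBDuk's actual implementation, and the toy reference below is just the sequence quoted earlier in this thread standing in for the real artifact file. All function names are hypothetical.

```python
def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def kmers(seq, k):
    """Return the set of all k-mers in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(references, k):
    """Collect k-mers from each reference and its reverse complement."""
    index = set()
    for ref in references:
        index |= kmers(ref, k)
        index |= kmers(reverse_complement(ref), k)
    return index


def is_artifact(read, index, k):
    """True if the read shares at least one exact k-mer with the index."""
    return any(read[i:i + k] in index for i in range(len(read) - k + 1))


K = 31
# Stand-in reference: the 44-base sequence from post #9, NOT the real file.
toy_reference = ["GCGGCCGCGATATCCTGCAGATGCATCCAGTACTAGTATGGCCC"]
index = build_index(toy_reference, K)

# A read that starts with sequence #3 (the opposite strand of the reference).
contaminated = "GGGCCATACTAGTACTGGATGCATCTGCAGGATATCGCGGCCGC" + "ACGTACGTAC"
clean = "ACGT" * 15

flag_contaminated = is_artifact(contaminated, index, K)
flag_clean = is_artifact(clean, index, K)
```

Indexing both strands is what lets the read be caught even though it carries the reverse complement of the reference sequence.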


                      • #12
                        Thank you Brian for your suggestion!

                        I only performed the first step (adapter trimming); I wasn't aware that bbduk could filter synthetic artifacts as well. I'll take a look at the suggested pipeline. Thank you again!
