SEQanswers

Old 09-22-2017, 05:24 AM   #1
svitlana
Member
 
Location: Brussels

Join Date: Jun 2017
Posts: 15
Illumina process controls present in input data

Hello,

I am trying to assemble the genome of an insect using data from Illumina HiSeq2500 (250 PE). The first check of my data with FastQC showed the presence of:
[1] Illumina adapters
[2] Illumina Process Controls
[3] this sequence: GGGCCATACTAGTACTGGATGCATCTGCAGGATATCGCGGCCGC

I understand why the adapters are present and how to deal with them, but why are there process controls? And where does the DNA sequence in point 3 come from? Can I just remove it?

Thank you in advance!
Old 09-22-2017, 07:00 AM   #2
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,951

#2 must be process controls from the TruSeq kit (you can find the sequences in the Illumina sequence letter for scanning/trimming purposes). #3 could very well be sequence from the genome you are interested in, so you shouldn't just throw it out. Use BBDuk from the BBMap suite to scan and trim your data, and then try running FastQC again to see how the data looks.

Last edited by GenoMax; 09-22-2017 at 07:02 AM.
Old 09-22-2017, 07:04 AM   #3
svitlana
Member
 
Location: Brussels

Join Date: Jun 2017
Posts: 15

Thank you for your response!

The strange thing about this sequence (sorry, I forgot to mention it) is that it is always located at the beginning of the reads. And I still have it even after trimming the data with BBDuk.
Old 09-22-2017, 07:13 AM   #4
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,951

Do you mean to say that sequence in #3 is present at the beginning of all reads? That would certainly be very odd.
Old 09-22-2017, 07:24 AM   #5
svitlana
Member
 
Location: Brussels

Join Date: Jun 2017
Posts: 15

No, only a certain percentage of reads contain this sequence (I think less than 1%, but I don't have an estimate yet), but in all of those reads the sequence is located at the beginning.

Actually, I have the same problem as described here. I found an explanation for all the other sequences detected by FastQC (they correspond to Illumina Process Controls and are documented on the Illumina website), but I have no idea where this remaining sequence comes from.
Old 09-22-2017, 07:29 AM   #6
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,951

If you take out that sequence, does the rest of the read BLAST to the genome of the expected species (or a close relative)? You could either drop those reads altogether (since they are only 1%) or choose to trim that sequence out (with bbduk's literal= option).
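
For readers following along, the two options above (drop the reads, or trim the sequence out) could look roughly like the sketch below. The file names are placeholders, and the choices of k=31 and hdist=1 are assumptions made for illustration rather than settings recommended anywhere in this thread; k only has to be no longer than the 44-base literal. The guard simply skips the run when BBDuk or the input file is not present.

```shell
# Sequence #3 from the first post (44 bases), so k must be at most 44.
motif=GGGCCATACTAGTACTGGATGCATCTGCAGGATATCGCGGCCGC

# trimmed.fq.gz is a placeholder name for the adapter-trimmed reads.
if command -v bbduk.sh >/dev/null 2>&1 && [ -f trimmed.fq.gz ]; then
  # Option A: drop every read sharing a 31-mer with the literal
  # (allowing one mismatch); matching reads go to outm for inspection.
  bbduk.sh in=trimmed.fq.gz out=clean.fq.gz outm=with_motif.fq.gz \
    literal="$motif" k=31 hdist=1

  # Option B: keep the reads, but trim the matching k-mers and
  # everything to their left (the sequence sits at the read start).
  bbduk.sh in=trimmed.fq.gz out=lefttrimmed.fq.gz \
    literal="$motif" k=31 ktrim=l
fi
```

Keeping the matched reads in a separate outm file makes it easy to BLAST a sample of them before deciding which option to use.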
Old 09-22-2017, 10:09 AM   #7
svitlana
Member
 
Location: Brussels

Join Date: Jun 2017
Posts: 15

Thank you GenoMax for your suggestion. I just tried to BLAST these reads, and approximately one third of them match... the common carp genome! But I am working on an insect (and as far as I know there is no assembly available for any species close to mine). How can this be explained?

I also tried to BLAST the remaining (normal) reads, and none of them matched that genome.

And why is that sequence always located at the beginning of these reads? (Well, I just found 16 reads that have it in the middle, but all the other 52,150 have it at the beginning.)

In any case, I suppose that I should remove all these reads from my assembly.
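
A tally like the one above (reads with the sequence at the start versus in the middle) can be reproduced with plain awk. This is only a sketch: the function name motif_positions and the toy reads are made up for illustration, it assumes a standard 4-line-per-record FASTQ, and it only looks for exact forward-strand matches.

```shell
# Tally where a motif first occurs in each read of a FASTQ stream on stdin.
# Assumes 4 lines per record, so the sequence is line 2 of every record.
motif_positions() {
  awk -v m="$1" '
    NR % 4 == 2 {
      p = index($0, m)            # 1-based position of first match, 0 if none
      if (p == 1)      start++
      else if (p > 1)  inside++
      else             absent++
    }
    END { printf "start=%d inside=%d absent=%d\n", start, inside, absent }
  '
}

# Toy input with a short made-up motif (not real data).
printf '%s\n' \
  '@r1' 'GATTACAAACC' '+' 'IIIIIIIIIII' \
  '@r2' 'CCGATTACAGG' '+' 'IIIIIIIIIII' \
  '@r3' 'ACGTACGTACG' '+' 'IIIIIIIIIII' |
  motif_positions GATTACA
```

On real data the same function could be fed with something like `zcat reads.fq.gz | motif_positions GGGCCATACTAGTACTGGATGCATCTGCAGGATATCGCGGCCGC`, where reads.fq.gz is a placeholder file name.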
Old 09-22-2017, 10:20 AM   #8
GenoMax
Senior Member
 
Location: East Coast USA

Join Date: Feb 2008
Posts: 6,951

It may be best to remove them altogether. Hopefully you don't have a bigger contamination problem. Take a few of the other "normal" reads and confirm them by BLAST before you dive into the assembly.
Old 09-22-2017, 05:22 PM   #9
jdk787
josh kinman
 
Location: Austin

Join Date: Apr 2014
Posts: 64

Looks like the reverse complement of #3 (GCGGCCGCGATATCCTGCAGATGCATCCAGTACTAGTATGGCCC) matches the last 55 bases of TruSeq process controls CTA-150bp, CTA-450bp, CTA-550bp, and CTA-850bp
__________________
Josh Kinman
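
The reverse complement quoted above is easy to double-check from the shell. A minimal sketch, assuming the standard `rev` and `tr` utilities are available; the function name revcomp is just a convenience chosen here.

```shell
# Reverse-complement a DNA string: reverse it, then swap A<->T and C<->G.
revcomp() {
  printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Sequence #3 from the first post.
seq3=GGGCCATACTAGTACTGGATGCATCTGCAGGATATCGCGGCCGC
revcomp "$seq3"
# prints GCGGCCGCGATATCCTGCAGATGCATCCAGTACTAGTATGGCCC
```

Searching a reference of known artifacts for both orientations is worthwhile, since a read can carry either strand of a synthetic control.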
Old 09-26-2017, 02:12 AM   #10
svitlana
Member
 
Location: Brussels

Join Date: Jun 2017
Posts: 15

Quote:
Originally Posted by jdk787
Looks like the reverse complement of #3 (GCGGCCGCGATATCCTGCAGATGCATCCAGTACTAGTATGGCCC) matches the last 55 bases of TruSeq process controls CTA-150bp, CTA-450bp, CTA-550bp, and CTA-850bp

You are right, jdk787, thank you very much!
Old 09-27-2017, 08:58 AM   #11
Brian Bushnell
Super Moderator
 
Location: Walnut Creek, CA

Join Date: Jan 2014
Posts: 2,707

Incidentally, that sequence also occurs in:

CTA___650bp, CTA___350bp, CTA___250bp, CTA___750bp

These are all distributed with BBMap in /bbmap/resources/sequencing_artifacts.fa.gz. Their names were anonymized, though, as required by Illumina before I could distribute them publicly. Typically before you do things like assembly I suggest you perform adapter-trimming and synthetic artifact removal, e.g.

Code:
bbduk.sh in=in.fq.gz out=trimmed.fq.gz ktrim=r k=23 mink=11 hdist=1 tbo tpe minlen=70 ref=adapters ftm=5
bbduk.sh in=trimmed.fq.gz out=filtered.fq.gz k=31 ref=artifacts,phix ordered cardinality
The current versions of BBMap allow you to specify "ref=artifacts", for example, and it will automatically use /bbmap/resources/sequencing_artifacts.fa.gz. The full suggested pipeline is in /bbmap/pipelines/assemblyPipeline.sh but some of the specific steps may be more relevant for bacteria than insect assembly.
Old 09-28-2017, 02:03 AM   #12
svitlana
Member
 
Location: Brussels

Join Date: Jun 2017
Posts: 15

Thank you Brian for your suggestion!

I only performed the first step (adapter trimming); I wasn't aware that bbduk was able to filter synthetic artifacts as well. I'll take a look at the suggested pipeline. Thank you again!