SEQanswers

SEQanswers > Sequencing Technologies/Companies > Illumina/Solexa



Old 09-28-2012, 05:27 AM   #1
tjs7
Junior Member
 
Location: Cleveland

Join Date: Sep 2012
Posts: 3
Why am I losing up to 5 bases at start of reads?

Hello all, this is my first post. I have been trying for several weeks now to figure out an issue with a dataset. I have discussed this with a number of local experts and am in contact with Illumina support, but no one has come up with an answer yet. My advisor suggested SEQanswers as a good, knowledgeable forum.

Our reads should start with a 4 base degenerate sequence (which rarely aligns to the genome; to be used to identify PCR duplicates), an invariant C at the 5th base, then genomic sequence.

For visualization, start of read should be: NNNNC followed by 30 - 80 nt of genomic sequence.
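As a minimal sketch of that expected structure (the function name and toy reads are illustrative, not from the actual dataset), checking for the invariant C at the 5th position could look like this:

```python
# Check whether reads match the expected NNNNC prefix:
# 4 degenerate bases, then an invariant C at position 5.

def has_expected_structure(read: str) -> bool:
    """Return True if the read is long enough and has C as its 5th base."""
    return len(read) >= 5 and read[4] == "C"

# Toy reads; in practice these would come from the FASTQ file.
reads = ["ACGTCTTAGG", "TTGAAAGGTT", "GGATTGTATG"]
matching = sum(has_expected_structure(r) for r in reads)
print(f"{matching}/{len(reads)} reads have C at position 5")
```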

Before even sending the library to be sequenced, I cloned a bit of library into pBluescript and sequenced 10 clones. All 10 had this correct structure, so we went ahead with sequencing.

However, after we sent the library to be sequenced on an Illumina HiScan SQ, the data that came back showed that only 33% of all reads had a C in the 5th position. Worse, when I randomly selected 30 reads and performed manual alignment, it appears as though anywhere from 0-5 of the first 5 bases align to the genome in a pretty random distribution. To put this another way, we have likely lost 1-5 nt from the beginning of reads (67% of all reads).

I can still work with the data by just aligning it without the first 5 bases and accepting that there will be PCR biases. However, I would prefer to use the degenerate bases to limit PCR biases and thus make the analysis a bit more quantitative.

Thanks for any help anyone can provide.
Old 09-28-2012, 04:06 PM   #2
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 343

Could you give us some details about your protocols and the structure of the adapters you are using? How do you get randomized sequences at the beginning of your reads? In any case, I would suggest running FastQC or a similar program on your data to check for any quality problems.

Last edited by luc; 09-28-2012 at 04:09 PM.
Old 10-01-2012, 11:38 AM   #3
tjs7
Junior Member
 
Location: Cleveland

Join Date: Sep 2012
Posts: 3

The core facility I collaborate with ran FastQC for me after I posted this, and it showed that quality scores were above 30 for bases 1-55, with the exception of base 5, which had a very low score. The explanation from the core facility computer analyst was that having a C in every read at position 5 is probably confusing the machine. Further analysis showed that 35% of the time C was correctly called, but the other ~65% of the time the machine called the 5th base as N.

During filtering, we were requiring that our reads have a C in the 5th position, thus we were throwing out a large portion of the data. By simply eliminating that requirement, we were able to include most reads in our data set, and most reads appear to have the correct structure.
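A small sketch of the relaxed filter described above (names and toy data are illustrative): instead of requiring a called C at position 5, also accept an uncalled N, and discard only reads where a different base was actually called there.

```python
# Relaxed filter: keep reads whose 5th base is the expected C
# or an uncalled N; drop only clear mismatches.

def passes_filter(read: str) -> bool:
    return len(read) >= 5 and read[4] in ("C", "N")

reads = ["AGTTCGGATC", "AGTTNGGATC", "AGTTGGGATC"]
kept = [r for r in reads if passes_filter(r)]
print(len(kept))  # the read with G at position 5 is discarded
```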

I have no explanation why this occurred, since libraries of essentially the same structure were sequenced a year ago and bases were called correctly. It could be a particular software update or machine update. If anyone needs specifics (like software version, etc.) I am sure I could get them.

Thanks
Old 10-01-2012, 07:10 PM   #4
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 343

Hi,

Good that you figured that out.
Having an identical base at one position in all clusters is, as you noted, not a good premise. Such problems are to be expected, and you might have been merely lucky on your first sequencing run. Further, I would guess the HiSeq system has gotten considerably better over the last year, meaning we are getting a lot more reads on average; perhaps denser clusters lead to more problems in low-complexity parts of the sequence?

I have some more questions. Why would you need the 4 degenerate bases to determine PCR duplicates? Are you analyzing a small genome? I would assume that for eukaryotic genomes the first 30 bases (or perhaps better, something like bases 12-40) are diverse enough for good removal of PCR duplicates, especially for paired-end data. At least that is our working assumption.

How did you generate the 4 degenerate bases at the beginning of the read? That sounds interesting.
What is the resulting base composition of your sequenced first 4 bases?

Last edited by luc; 10-01-2012 at 07:19 PM.
Old 10-02-2012, 05:10 AM   #5
tjs7
Junior Member
 
Location: Cleveland

Join Date: Sep 2012
Posts: 3

Our library prep strategy has two variables that help identify PCR duplicates. First, our reads are designed to be of various lengths. Second, the RT primer we use has the 4 degenerate bases, which end up at the start of our reads (essentially 256 possible RT primers in the mix).

Doing a probability calculation, this comes out to thousands of possible combinations of read lengths and 4 degenerate base "codes" for a given genomic location. Thus, if we have multiple reads mapping to the exact same genomic coordinates and having the same 4 base "code," we treat those as PCR duplicates and collapse those reads into 1 read.
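The collapsing rule described above can be sketched roughly as follows (the tuple layout and function name are assumptions for illustration, not the poster's actual pipeline):

```python
# Collapse PCR duplicates: reads mapping to identical genomic coordinates
# AND carrying the same 4-base degenerate code are treated as duplicates
# and collapsed into a single read.

def collapse_duplicates(reads):
    """reads: list of (chrom, start, end, code) tuples; returns unique reads."""
    seen = set()
    unique = []
    for r in reads:
        key = (r[0], r[1], r[2], r[3])  # coordinates + degenerate code
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

reads = [
    ("chr1", 100, 160, "ACGT"),
    ("chr1", 100, 160, "ACGT"),  # PCR duplicate: same coords, same code
    ("chr1", 100, 160, "TTAG"),  # same coords, different code: kept
]
print(len(collapse_duplicates(reads)))  # 2
```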

In practice, this works well for all but the most highly expressed genes. Those relatively few genes are so highly expressed in the tissue we study that they generate so many reads that each combination of length, sequence, and 4-base code is repeated multiple times. We are willing to accept this in order to limit PCR duplicates throughout the majority of the dataset.

The base composition of the first 4 bases ended up as 25% A, 25% G, 15% C, and 35% T. Not a perfect 25% each, but OK for our purposes, which are qualitative comparative analyses.
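Tallying that composition is straightforward; a small sketch with toy data (not the actual dataset):

```python
# Tally base composition over the first 4 positions of each read.
from collections import Counter

def first4_composition(reads):
    """Return the fraction of A/C/G/T among the first 4 bases of all reads."""
    counts = Counter(base for r in reads for base in r[:4])
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}

reads = ["AAGTGGC", "CGTTAC", "AGTTTT", "TGTACG"]
print(first4_composition(reads))
```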
Old 10-02-2012, 01:47 PM   #6
luc
Senior Member
 
Location: US

Join Date: Dec 2010
Posts: 343

Thanks a lot for the details on your protocol! Very interesting.
Old 10-03-2012, 06:31 AM   #7
jparsons
Member
 
Location: SF Bay Area

Join Date: Feb 2012
Posts: 62

I'd be interested to know how often you get reads that look like PCR dupes without the random RT primer but have different degenerate bases. In other words, are 90% of the "duplicate" reads really duplicates, or is it more like 9%?
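One way to quantify that, sketched with toy data (the tuple layout is an assumption, not from this thread): among reads mapping to identical coordinates, count what fraction of pairs also share the 4-base code.

```python
# Among reads that map to identical coordinates, what fraction of pairs
# also share the 4-base degenerate code (and would be collapsed as
# PCR duplicates) versus carry different codes?
from collections import defaultdict

def duplicate_fraction(reads):
    """reads: (chrom, start, end, code) tuples.
    Returns the fraction of coordinate-sharing pairs with matching codes."""
    by_coords = defaultdict(list)
    for chrom, start, end, code in reads:
        by_coords[(chrom, start, end)].append(code)
    same = diff = 0
    for codes in by_coords.values():
        for i in range(len(codes)):
            for j in range(i + 1, len(codes)):
                if codes[i] == codes[j]:
                    same += 1
                else:
                    diff += 1
    return same / (same + diff) if (same + diff) else 0.0

# Toy example: three reads at the same coordinates, two sharing a code.
reads = [
    ("chr1", 100, 160, "ACGT"),
    ("chr1", 100, 160, "ACGT"),
    ("chr1", 100, 160, "TTAG"),
]
print(duplicate_fraction(reads))  # 1 matching pair out of 3
```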