SEQanswers

Old 06-30-2011, 08:28 AM   #1
yog77
Member
 
Location: London

Join Date: Jun 2011
Posts: 18
Default Illumina PE 100bp and allele content

Hi all, I'm new here. I'm after advice concerning 100bp PE reads:

Q1) I have heard that it is a problem bioinformatically if you do 100bp PE reads and the reads from either end overlap (i.e. ideally the fragment should be more than 100bp from either end), because overlapping reads can't be aligned easily.

Q2) Following on from Q1, the reason I ask is that I would like to sequence directly from 200bp PCR fragments, as I am hoping to index 96 samples in a single lane. So is it possible to sequence many (~150) amplicons that are all 200bp long? I guess adapters and barcodes would be added to these after PCR.

So that is ~150 x 200bp amplicons for 96 patients using 100bp PE reads, giving 200bp of sequence per read pair that would hold allele-specific information for that one pair. I really have no other option but to use PCR products, and I want to keep the read along the full length of each amplicon.

Can anyone help?
Old 06-30-2011, 08:40 AM   #2
Jeremy37
Member
 
Location: Montreal, Canada

Join Date: Feb 2011
Posts: 17
Default

Will all of your amplicons be exactly 200 bp long?
And are you aligning to a reference genome after sequencing?
I'm just thinking that if you're not aligning, and your reads don't overlap, then how will you know the actual insert size -- i.e. how many nt are between your PE reads.

In terms of size, though, having some be less than 200 nt and having the PE reads overlap is no problem. I've seen a number of our samples have low insert sizes close to a mean of 180 nt, with a distribution around that (i.e. some < 150 nt), and aligning to the reference human genome has been no problem.
You just don't want your inserts to be so short that you start reading through them into the adaptor sequences. But it sounds like that wouldn't be the case here.
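The arithmetic behind this can be sketched in a few lines. This is only an illustration: "insert" here means the whole fragment between the adapters (which is how I am using the term above), and the function name pair_geometry is made up for the example.

```python
def pair_geometry(insert_len, read_len):
    """Return (overlap, read_through, uncovered) in bases for one read pair.

    insert_len: length of the fragment between the adapters
    read_len:   length of each of the two reads
    """
    # Each read can cover at most the whole insert.
    covered_by_one = min(read_len, insert_len)
    # Bases sequenced by both reads.
    overlap = max(0, 2 * covered_by_one - insert_len)
    # Bases of each read that run past the insert into the adapter.
    read_through = max(0, read_len - insert_len)
    # Bases in the middle of the insert that neither read reaches.
    uncovered = max(0, insert_len - 2 * read_len)
    return overlap, read_through, uncovered


for insert_len in (300, 200, 180, 150, 80):
    overlap, read_through, uncovered = pair_geometry(insert_len, 100)
    print(insert_len, overlap, read_through, uncovered)
```

With 2x100bp reads, a 200bp insert is covered exactly end to end, a 180bp insert gives a 20bp overlap, and only inserts shorter than 100bp would be read through into the adapter.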
Old 06-30-2011, 08:56 AM   #3
yog77
Member
 
Location: London

Join Date: Jun 2011
Posts: 18
Default

Quote:
Originally Posted by Jeremy37 View Post
Will all of your amplicons be exactly 200 bp long?
And are you aligning to a reference genome after sequencing?
I'm just thinking that if you're not aligning, and your reads don't overlap, then how will you know the actual insert size -- i.e. how many nt are between your PE reads.

In terms of size, though, having some be less than 200 nt and having the PE reads overlap is no problem. I've seen a number of our samples have low insert sizes close to a mean of 180 nt, with a distribution around that (i.e. some < 150 nt), and aligning to the reference human genome has been no problem.
You just don't want your inserts to be so short that you start reading through them into the adaptor sequences. But it sounds like that wouldn't be the case here.
====
Thanks for the quick response

I was hoping to generate similar sized amplicons (~200bp) so there would be no need for size selection.

I plan to do bisulphite sequencing and align to a small region that all my amplicons will come from (a 300kb region of bisulphite-converted sequence). Some of these amplicons will overlap with one another.

In essence I want as much read length (100bp x2) from the 200bp PCR amplicons as possible, so I was thinking there would be no insert. Is this possible?
Old 06-30-2011, 09:02 AM   #4
TonyBrooks
Senior Member
 
Location: London

Join Date: Jun 2009
Posts: 298
Default

Have you thought about combining 150bp paired end reads into one pseudo read of 200bp using the 50bp overlap?
These guys used that approach for their metagenomic work, but it should also work for other applications.

http://www.plosone.org/article/info%...l.pone.0011840
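A rough sketch of the idea, for illustration only. This is not the method from the linked paper; it is a naive overlap search, and the function names and the min_overlap / max_mismatch_frac parameters are made up for the example. Real merging tools also use the base qualities to resolve disagreements in the overlap.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def merge_pair(read1, read2, min_overlap=10, max_mismatch_frac=0.1):
    """Merge read 1 with the reverse complement of read 2.

    Returns the merged pseudo read, or None if no acceptable overlap is found.
    """
    r2 = reverse_complement(read2)
    # Try the longest possible overlap first, then shrink.
    for olen in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        tail, head = read1[-olen:], r2[:olen]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * olen:
            # Keep read 1 through the overlap, then append the rest of read 2.
            return read1 + r2[olen:]
    return None


# Toy example: a 30bp fragment sequenced with 2x20bp reads (10bp overlap).
fragment = "ACGTTGCAAGGCTTACCGATTGACCTAGGA"
read1 = fragment[:20]
read2 = reverse_complement(fragment[-20:])
print(merge_pair(read1, read2, min_overlap=5) == fragment)
```

For bisulphite data the two mates derive from the same converted strand, so a plain reverse-complement comparison like this should still apply, but treat that as an assumption to check on real reads.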
Old 06-30-2011, 09:44 AM   #5
yog77
Member
 
Location: London

Join Date: Jun 2011
Posts: 18
Default

Sorry, I'm new here and not sure if you directly got my response, Jeremy37.

I was hoping to generate similar sized bisulphite PCR amplicons (~200bp) so there would be no need for size selection, and this is a reliably obtainable size for bisulphite PCR.

I plan to do bisulphite sequencing and align to a small genomic region (a large gene) that all my amplicons will come from (a 300kb region of bisulphite-converted sequence, which will be used specifically for the alignment). Some of these ~200bp amplicons will overlap with one another (say where I was interested in a stretch of 3kb or so).

In essence I want as much read length (100bp x2) from the 200bp PCR amplicons as possible, to look at methylated CpGs and SNPs in the same amplicon, so I was thinking there would be no insert, as I want data on the whole 200bp PCR amplicon. Is this possible to achieve?
Old 06-30-2011, 09:49 AM   #6
yog77
Member
 
Location: London

Join Date: Jun 2011
Posts: 18
Default

Quote:
Originally Posted by TonyBrooks View Post
Have you thought about combining 150bp paired end reads into one pseudo read of 200bp using the 50bp overlap?
These guys used that approach for their metagenomic work, but it should also work for other applications.

http://www.plosone.org/article/info%...l.pone.0011840
Thanks will look into that
Old 06-30-2011, 10:28 AM   #7
Jeremy37
Member
 
Location: Montreal, Canada

Join Date: Feb 2011
Posts: 17
Default

I don't see any problem with what you're trying to do.
I'm not sure why you are concerned about the read length though. It seems to me that you could do this even with 50 bp reads if you wanted. You would be demultiplexing the samples yourself using your adapter sequences, I guess.

I'm not sure how the SNP calling would work, since with the bisulphite treatment (which I just had to look up) you're going to have a lot of differences from the reference. I think you need someone who knows about methylation analysis to comment...
Old 06-30-2011, 02:38 PM   #8
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

Overlapping ends are not a problem for any of the major read mappers. They could pose minor issues for SNP calling, but only minor ones.
Old 06-30-2011, 11:56 PM   #9
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

Technically, overlapping reads should not be a problem (unless they are completely contained within each other). However, if reads overlap you will potentially call the methylation state of the overlapping part twice, and you need to think about a strategy for dealing with this (e.g. use methylation calls from only a single read, or from both reads...).
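To make the double-counting point concrete, here is a small sketch of two possible policies. The data layout (a dict of reference position to True/False per read) and the function name combine_pair_calls are invented for the example and are not taken from any particular tool.

```python
def combine_pair_calls(read1_calls, read2_calls, policy="read1"):
    """Combine methylation calls from both mates of one fragment.

    Each argument maps reference position -> True (methylated) or
    False (unmethylated). Every position appears at most once in the result.

    policy "read1": in the overlap, keep only the call from read 1
    policy "agree": in the overlap, keep a call only if both reads agree
    """
    combined = dict(read2_calls)
    if policy == "read1":
        # Read 1 overrides read 2 wherever both have a call.
        combined.update(read1_calls)
    elif policy == "agree":
        for pos, call in read1_calls.items():
            if pos in read2_calls and read2_calls[pos] != call:
                del combined[pos]  # conflicting calls: drop the position
            else:
                combined[pos] = call
    else:
        raise ValueError("unknown policy: " + policy)
    return combined


read1 = {105: True, 150: True, 190: False}
read2 = {150: True, 190: True, 230: False}
print(combine_pair_calls(read1, read2, "read1"))
print(combine_pair_calls(read1, read2, "agree"))
```

Either way, positions 150 and 190 in the overlap contribute at most one call per fragment instead of two.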

I am also quite concerned about a read length of 100bp. In our experience, basecall qualities drop steadily towards the end of reads, usually starting from around base 50-70. BS-Seq is very dependent on good quality reads, especially if you also want to look at SNPs later on. We have seen numerous examples where long reads (75-108bp) had to be trimmed uniformly to ~50bp (or run through adaptive quality trimmers) in order to obtain a good mapping efficiency. This essentially means wasting half of the data and thus money. If I understood correctly, you should have many different products of your amplified gene, and I think more but shorter reads will be more useful than one 2x100bp run with low qualities.

If you have good coverage SNP calling is possible, but it is a bit trickier than normal because SNPs concerning Cs or Ts can only be called by looking at reads from the opposing strand (before BS conversion).
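A toy in-silico conversion shows why. The six-base sequences and the function names below are made up for illustration, and the conversion is idealised (every unmethylated C becomes T, no conversion failures).

```python
def revcomp(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def bisulfite_convert(strand, methylated_positions=()):
    """Idealised bisulphite conversion of one strand, read 5'->3'.

    Unmethylated C becomes T; C at an index in methylated_positions is kept.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(strand)
    )


ref_top = "AACGTA"  # reference allele: C at index 2
alt_top = "AATGTA"  # C>T SNP at index 2

# Top strand: an unmethylated reference C and the T allele look identical.
print(bisulfite_convert(ref_top), bisulfite_convert(alt_top))

# Opposing strand: the same site is G vs A, which conversion does not touch.
print(bisulfite_convert(revcomp(ref_top)), bisulfite_convert(revcomp(alt_top)))
```

On the top strand both alleles read AATGTA after conversion, while on the opposing strand they remain distinguishable.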
Old 07-01-2011, 04:40 AM   #10
yog77
Member
 
Location: London

Join Date: Jun 2011
Posts: 18
Default

Thank you all for your comments they have been really helpful as I don't have any hands on experience with Illumina sequencing just yet - it's been "Illuminating"

Jeremy37 - My reason for hoping for longer reads is to associate methylation on specific reads (originating from a single cluster, a bit like a single molecule) with their associated SNPs, so the longer the read, the more potential SNPs there are to associate the methylation status with.

fkrueger - OK, I get that overlapping won't be an issue, but I now appreciate that the quality is going to drop off from 50-75bp onwards, so wasting half the money! Also, I am aware that the C or T SNPs will need to be confirmed by genomic re-seq.

As you have experience with BS-seq, I just wondered: wouldn't mapping to a defined region (such as my 300kb gene region) be less complex than mapping to a whole genome, so that we might be more successful in mapping the poorer quality ends of reads? Or would you still recommend shorter, say 75bp, PE reads in the hope that the last 25 bases are OK-ish quality, or in your experience would this still be poor quality for BS-seq?
Old 07-01-2011, 05:01 AM   #11
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

I would assume that you wouldn't lose many reads due to ambiguous mapping if you aligned 2x50 or even 2x75bp reads to the whole genome instead of just your region of interest. Mapping to the region alone might be a bit quicker but shouldn't make such a big difference. If in doubt you could just compare the number of reads mapped against the whole genome with the number mapped against your region of interest, and if they don't differ very much I would possibly use the whole genome approach, as this can be informative about whether your experiment worked the way you intended and it is probably easier to justify for a publication at some point...

If I had a choice I would opt for 2x50 or 2x75bp reads, the latter might need to be run through a quality and/or adapter trimmer just to be sure. Low quality sequence can lead to wrong methylation calls, in rare cases even to mis-mappings (which generally produce random methylation calls). And of course many mismatches can bring down your mapping efficiency quite quickly if you use reasonably strict mapping parameters. So I suggest short to medium reads and possibly quality trimming, then you should be fine. Let me know if I can be of any further help with your project.
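To put a number on why a low-quality tail hurts under strict mismatch settings, the expected number of wrong bases in a read can be estimated from its quality string. This is a sketch only: it assumes Phred+33 encoded qualities (older Illumina pipelines used a +64 offset, so check your files), and the function name is made up.

```python
def expected_errors(quality_string, offset=33):
    """Expected number of miscalled bases implied by a FASTQ quality string.

    Each character encodes Q = ord(char) - offset, and the error
    probability for that base is 10 ** (-Q / 10).
    """
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality_string)


good_read = "D" * 100              # Q35 at every base
tailing_read = "D" * 50 + "+" * 50  # Q35 for 50 bases, then Q10

print(expected_errors(good_read))
print(expected_errors(tailing_read))
```

A read whose second half sits around Q10 is expected to carry several miscalls on its own, before any real SNPs or conversion differences are counted.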
Old 07-01-2011, 05:47 AM   #12
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

The read length you can sequence depends on many factors, such as machine, chemistry and optimization. All the HiSeq users I know can confidently get 2*100bp reads without much quality drop at the end. I have seen that an optimized GAIIx can also reach this level of accuracy. With 2*100bp reads, we have far fewer alignment artifacts than with 2*50bp reads. If your machine (e.g. HiSeq) can do that and you are not very constrained by funding, you should try to get 2*100 reads. Roche used to advertise "longer is better". That is true.

Also, in my previous post I just meant that overlapping ends do not cause mapping problems. How to deal with them is largely the task of downstream tools.

Last edited by lh3; 07-01-2011 at 05:51 AM.
Old 07-01-2011, 05:53 AM   #13
fkrueger
Senior Member
 
Location: Cambridge, UK

Join Date: Sep 2009
Posts: 625
Default

I agree that longer = better IF quality stays up until the end. The latest iPS BS-Seq datasets from Lister et al. have excellent qualities for reads >100bp for instance. However we have received loads of emails from people where the quality of their data deteriorated quite early on (as mentioned above).
Old 07-01-2011, 06:50 AM   #14
ECO
--Site Admin--
 
Location: SF Bay Area, CA, USA

Join Date: Oct 2007
Posts: 1,358
Default

Moving to ILMN forum.
Old 07-01-2011, 06:53 AM   #15
lh3
Senior Member
 
Location: Boston

Join Date: Feb 2008
Posts: 693
Default

All the HiSeq data I have seen so far have good quality at the end. Another potential concern is that not all BS mappers are optimized for 100bp reads. They may have better performance for 50bp reads.
Old 07-01-2011, 09:32 AM   #16
yog77
Member
 
Location: London

Join Date: Jun 2011
Posts: 18
Default

Quote:
Originally Posted by fkrueger View Post
I would assume that you wouldn't lose many reads due to ambiguous mapping if you aligned 2x50 or even 2x75bp reads to the whole genome instead of just your region of interest. Mapping to the region alone might be a bit quicker but shouldn't make such a big difference. If in doubt you could just compare the number of reads mapped against the whole genome with the number mapped against your region of interest, and if they don't differ very much I would possibly use the whole genome approach, as this can be informative about whether your experiment worked the way you intended and it is probably easier to justify for a publication at some point...

If I had a choice I would opt for 2x50 or 2x75bp reads, the latter might need to be run through a quality and/or adapter trimmer just to be sure. Low quality sequence can lead to wrong methylation calls, in rare cases even to mis-mappings (which generally produce random methylation calls). And of course many mismatches can bring down your mapping efficiency quite quickly if you use reasonably strict mapping parameters. So I suggest short to medium reads and possibly quality trimming, then you should be fine. Let me know if I can be of any further help with your project.
Thanks for the informative reply. I will give it some more thought and may have some more questions. Thanks.
Old 07-01-2011, 10:11 AM   #17
Jeremy37
Member
 
Location: Montreal, Canada

Join Date: Feb 2011
Posts: 17
Default

You should ask whoever is doing the sequencing for you what to expect in terms of read quality towards the end of the reads. When we were sequencing with the GA, we had extremely high quality all the way to base 100 -- often a median > Q35.

With the HiSeqs they have been optimizing some things. We have had some runs "fail"... but they were eventually redone, and typically we have quality >Q30 at base 100. Especially since you expect to have many mismatches, I think that the value of longer reads would be very high for your application. You can always quality trim your reads using a tool like the FASTX-Toolkit (fastq_quality_trimmer). Even when the quality trails off at the end of a run, you will get a large fraction of reads that don't need to be trimmed.
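For anyone unsure what 3' quality trimming does, here is a minimal sketch of the general idea. It is not the FASTX-Toolkit implementation; the function name, the thresholds and the Phred+33 offset are assumptions for the example.

```python
def trim_3prime(seq, qual, min_q=20, min_len=30, offset=33):
    """Trim low-quality bases from the 3' end of one read.

    Bases are removed from the end while their quality is below min_q.
    Returns (trimmed_seq, trimmed_qual), or None if the trimmed read
    is shorter than min_len and should be discarded.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    if end < min_len:
        return None
    return seq[:end], qual[:end]


# Toy read: six good bases (Q40) followed by four poor ones (Q2).
print(trim_3prime("ACGTACGTAC", "IIIIII####", min_q=20, min_len=5))
```

Reads that are good all the way to the end pass through untouched, which is why trimming costs little when only part of a run tails off.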
Old 07-13-2011, 04:02 AM   #18
yog77
Member
 
Location: London

Join Date: Jun 2011
Posts: 18
Default

Thanks All for the informative answers.

Hi all, I have another question regarding this approach:

Q2) A concern some of my colleagues have highlighted is that, using a PCR approach, we would get over-representation of the start and end of the desired 200bp amplicon in the reads and much lower, if not absent, coverage of the middle portion. Would anyone have any comments as to whether this would be the case and, if so, are there ways around this?
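To frame the question, the geometry can be sketched as below. It assumes intact amplicons are sequenced directly (no fragmentation), so every read pair starts at the two amplicon ends; the function name is made up, and this says nothing about per-base quality.

```python
def amplicon_coverage(amplicon_len, read_len):
    """Per-base coverage (0, 1 or 2) from one read pair on an intact amplicon."""
    coverage = [0] * amplicon_len
    n = min(read_len, amplicon_len)
    for i in range(n):                              # read 1 from one end
        coverage[i] += 1
    for i in range(amplicon_len - n, amplicon_len):  # read 2 from the other end
        coverage[i] += 1
    return coverage


for read_len in (50, 100, 150):
    cov = amplicon_coverage(200, read_len)
    print(read_len, "uncovered:", cov.count(0), "covered twice:", cov.count(2))
```

Under that assumption, 2x100bp reads reach every base of a 200bp amplicon once, whereas 2x50bp reads would leave the middle 100bp unsequenced.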