Hi, I just started a project where we need to resequence many E. coli mutants using an Illumina GAII system. Does anyone have thoughts on paired-end versus single-end runs, potentially with multiplexing to save money? I have estimated that we can run ~5 multiplexed samples to get >10X coverage for each (4.6 Mbp genome), but I have no experience with what the actual data will look like (errors, other things I haven't thought of, etc.). For anyone who has done resequencing: is 10X coverage realistic for identifying SNPs or other mutations?
-
If I were faced with this question, I'd probably start with thinking about the following:
Paired-end vs. single-end. What is the real cost per base pair for each approach? The read length would figure in here. Given the read lengths you are using and your E. coli (are these lab strains? wild strains?), what fraction of reads will not be unambiguously mappable with only single-end coverage? There are probably some E. coli datasets in the Short Read Archive to play with. Alternatively, you could simulate some (there are various tools out there, though getting the error profile right could be a bugaboo), or do a pilot run at very long read length and paired-end, from which you can then simulate any lesser version of the data (single-end, shorter reads).
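As a rough illustration of deriving a "lesser" dataset from a pilot run, here is a minimal Python sketch that trims FASTQ reads to a shorter length. The function names are hypothetical, and it assumes plain 4-line FASTQ records; a single-end dataset can be simulated simply by using only the read 1 file of a paired-end run.

```python
from typing import Iterable, Iterator, Tuple

FastqRecord = Tuple[str, str, str]  # (name, sequence, quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield (name, sequence, quality) from 4-line FASTQ records.

    Assumes well-formed input: header, sequence, '+' line, quality.
    """
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it).rstrip("\n")
        next(it)  # '+' separator line, ignored
        qual = next(it).rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@'


def truncate_reads(records: Iterable[FastqRecord], length: int) -> Iterator[FastqRecord]:
    """Keep only the first `length` bases (and qualities) of each read."""
    for name, seq, qual in records:
        yield name, seq[:length], qual[:length]


if __name__ == "__main__":
    demo = ["@r1\n", "ACGTACGTAC\n", "+\n", "IIIIIIIIII\n"]
    for name, seq, qual in truncate_reads(parse_fastq(demo), 4):
        print(name, seq, qual)
```

Truncating real reads keeps the instrument's actual error profile for the retained cycles, which sidesteps the error-model problem that pure simulation has.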
For the depth question, you really need to think about the risk of false positives (a miscall being classified as a SNP because you do not get reads to contradict it) versus false negatives (insufficient depth to find the SNP). How worried are you about each?
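One way to get a feel for the false-negative side is to ask how many genome positions would fall below some minimum read depth at a given average coverage. The sketch below assumes read depth per position is Poisson-distributed, which is an idealization (real coverage is usually more uneven), and the function names and the `min_depth` threshold are illustrative choices, not anything prescribed in this thread.

```python
import math


def poisson_cdf(k: int, mean: float) -> float:
    """P(X <= k) for X ~ Poisson(mean); returns 0.0 for k < 0."""
    return sum(math.exp(-mean) * mean**i / math.factorial(i) for i in range(k + 1))


def expected_low_depth_sites(genome_size: int, mean_coverage: float, min_depth: int) -> float:
    """Expected number of positions covered by fewer than `min_depth` reads,
    under the (assumed) Poisson model of per-position depth."""
    return genome_size * poisson_cdf(min_depth - 1, mean_coverage)


if __name__ == "__main__":
    genome = 4_600_000  # E. coli genome size quoted in the original question
    for cov in (10, 30):
        n = expected_low_depth_sites(genome, cov, min_depth=4)
        print(f"{cov}X mean coverage: ~{n:.1f} positions with depth < 4")
```

Under this model the number of thinly covered positions drops steeply between 10X and 30X, which is the trade-off to weigh against how many extra strains you could multiplex.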
One more set of considerations: how will the results be used? What will the next experiment look like? For example, if next you retest mutations by another sequencing strategy (PCR-based Sanger, for example), how many positives do you want to screen? Or is the next step really expensive, such as generating the same mutation in another background or expressing mutant protein or such?
-
I am still waiting to find out pricing for single end versus paired end runs with multiplexing, but I am trying to figure out if I should be using one versus the other for any specific reason other than price. Our ultimate goal is to sequence mutants of E. coli K12 to better understand pathway evolution. We will be confirming mutations by Sanger sequencing regions of interest so my biggest concern would probably be false negative errors.
As I am very new to this area, where would I find these E. coli datasets? Also, are there any programs I can run on a laptop with 1 GB RAM to look at this (running Windows XP)? We are going to be purchasing a new computer for the actual data analysis, but we haven't decided what to get quite yet.
Thanks for the help!
-
Paired-end reads don't offer any advantage for identifying SNPs, although they can be useful for identifying indels. Unless you have a large number of strains to analyze, 30X coverage is reasonable. A single lane of Illumina short-read sequence (30 million reads x 36 bp) is sufficient for 30X coverage of each of seven barcoded strains. Reagent costs for the clustering and sequencing are ~$600, and costs per sample library can be >$100 with a home-made protocol. So you're looking at ~$200 per strain; your labor costs for the time spent on data analysis will be MUCH more than that!
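The arithmetic behind those numbers can be sketched as follows. This assumes reads split evenly across strains and that every read is usable; the `barcode_length` parameter and the $100 library cost are illustrative assumptions, and the function names are made up for this example.

```python
def coverage_per_strain(reads_per_lane: int, read_length: int, genome_size: int,
                        n_strains: int, barcode_length: int = 0) -> float:
    """Average fold coverage per strain when one lane is shared evenly.

    `barcode_length` is the number of read bases consumed by an in-read
    barcode (assumed 0 unless stated otherwise).
    """
    usable_bases = read_length - barcode_length
    return reads_per_lane * usable_bases / (genome_size * n_strains)


def cost_per_strain(lane_cost: float, library_cost: float, n_strains: int) -> float:
    """Lane reagent cost shared across strains, plus one library prep each."""
    return lane_cost / n_strains + library_cost


if __name__ == "__main__":
    cov = coverage_per_strain(30_000_000, 36, 4_600_000, 7)
    cost = cost_per_strain(600, 100, 7)
    print(f"coverage per strain: {cov:.1f}X, cost per strain: ${cost:.0f}")
```

With the figures quoted above (30 million 36 bp reads, a 4.6 Mbp genome, seven strains) this works out to roughly 33X per strain before any losses to barcodes or unmappable reads.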
-Harold
-
Thanks for all of the information, it is hard to find everything in one place!
Our plan is to run paired-end with 36 cycles to catch genomic rearrangements as well as SNPs. I am thinking of starting by multiplexing 6 strains per lane and seeing what the coverage ends up looking like. Does anyone have thoughts on the size of barcodes? We don't want to lose too much sequence, but we need to maintain fidelity. Also, with paired-end runs, do you have to put the barcode on both ends? I was thinking you wouldn't, but after reading some of the threads I am confused on this issue again.
Thanks again!
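On the fidelity side of the barcode question, one common check is the minimum pairwise Hamming distance within a barcode set: a distance of at least 2 means a single sequencing error cannot turn one barcode into another, and at least 3 allows a single error to be corrected. The sketch below is only an illustration; the barcode sequences are made-up examples, not a recommended set.

```python
from itertools import combinations
from typing import Sequence


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("barcodes must have equal length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes: Sequence[str]) -> int:
    """Smallest Hamming distance between any two barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


if __name__ == "__main__":
    # Hypothetical 4-base barcodes for six strains.
    example = ["AACC", "CCAA", "GGTT", "TTGG", "ACAC", "GTGT"]
    print("minimum pairwise distance:", min_pairwise_distance(example))
```

Longer barcodes make larger distances easier to achieve, at the cost of read bases if the barcode is sequenced as part of the read.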
-
Originally posted by ECO: SNPs or transposons moving around. That is the question.
I had much better luck with SOLiD mate pairs in resolving unknown bacterial mutations, while only after the fact were we able to confirm with frag reads.
Thanks!
-
I've just finished analyzing our dataset of twelve barcoded E. coli samples (one WT and 11 mutants). Three lanes of 36bp single-end reads at less than maximum density produced enough sequence for 30X average depth of coverage. Ours were spontaneous mutations, so we anticipated that most would be transposon hops. I resorted to an inelegant but successful strategy to ID the insertion sites. I generated a second 'chromosome' file that contained the ends of all known insertion elements, and removed those sequences from the E. coli reference genome. I then aligned the 5' and 3' ends of my reads independently, then IDed the reads that mapped to different chromosomes (ECO was right; it would have been easier with paired-end data). Nine mutant strains contained unique transposon insertions that were absent in the WT. I used BFAST (shout-out to Nils) to screen for SNPs; one of the two remaining mutants contained a single unique SNP encoding a missense mutation. (The last mutant might contain a transposon in a repeated sequence; I'm still parsing the data).
Paired-end 36bp sequencing would be the way to go. In addition to making the transposon strategy work more smoothly, you could identify large rearrangements in the same manner (i.e., the ends would map to different loci in the reference genome).
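To illustrate the idea of pair ends mapping to different loci, here is a minimal sketch that flags read pairs whose two ends map farther apart than the library's insert size would allow. It assumes you already have one mapped position per end (from whatever aligner you use), treats the chromosome as circular, and ignores read orientation; the names and the `max_insert` cutoff are hypothetical.

```python
from typing import Iterable, List, Tuple


def circular_distance(a: int, b: int, genome_size: int) -> int:
    """Shortest distance between two positions on a circular chromosome."""
    d = abs(a - b) % genome_size
    return min(d, genome_size - d)


def discordant_pairs(pairs: Iterable[Tuple[str, int, int]],
                     genome_size: int, max_insert: int) -> List[str]:
    """Return names of pairs whose ends map more than `max_insert` apart.

    `pairs` holds (pair_name, position_of_end_1, position_of_end_2).
    """
    return [name for name, p1, p2 in pairs
            if circular_distance(p1, p2, genome_size) > max_insert]


if __name__ == "__main__":
    genome_size = 4_600_000  # approximate E. coli genome size from this thread
    mapped = [
        ("pair_a", 100, 350),        # ends close together: concordant
        ("pair_b", 100, 2_000_000),  # ends far apart: candidate rearrangement
        ("pair_c", 50, 4_599_900),   # spans the origin of the circular map
    ]
    print(discordant_pairs(mapped, genome_size, max_insert=500))
```

A single discordant pair is weak evidence; in practice you would look for several independent pairs supporting the same junction.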
-Harold
-
Originally posted by krobison: Cool!
If you want to have fun and have a sufficient machine, you could also feed each read set into Velvet, then take the contigs and BLAST them against the reference E. coli genome and your transposons.
-
Ours was a one-off experiment, so I didn't invest the time in mastering a de novo assembler (I'm a molecular biologist, so it takes me longer). But if we were working with a larger collection of mutations, I could see the advantage of building a reference genome from the starting strain. And coverage wouldn't be a problem; even one lane of 36bp single-end reads would yield >90X (and surely that's enough for an accurate assembly).
-Harold
-
We recently got data from several multiplexed 36bp single-read E. coli samples and are starting to analyze them (we couldn't find anyone to run paired-end lanes with us at the time). I have been using Bowtie to align the data to the E. coli K-12 MG1655 genome, but I am having problems with transposons and other large insertions and was curious about the method you described to find them. I am still pretty new at this and couldn't quite understand what you did with the second chromosome file. I think we may try some de novo sequencing down the line, but I don't know when the software will be available on our server, and I would really like to get started identifying regions of interest for Sanger sequencing.
Thanks!
Jamie
-
Hi Jamie,
By default, most aligners return the reads that have a single best match to the reference genome. Transposons are present in multiple copies, so reads derived from transposons will produce multiple best matches. To use the approach I took, you want to remove the transposon sequences from the K-12 genome and build a second reference file that contains a single copy of each transposon.
Specifically:
1) Build a new reference file that contains 20 bases from the 5' and 3' end of each transposon/insertion element.
2) Replace those sequences with N's in the K-12 genome.
3) Use your favorite aligner to map the first 16 bases of your reads to the two reference files.
4) Repeat the alignment with the last 16 bases of your reads.
5) Identify the set of reads where the first 16 bases align to the K-12 reference and the last 16 bases align to the transposon reference, and vice versa. These are the junction reads (genome/transposon).
6) Use the genome alignment position to map the location of your reads to known or novel insertion sites.
Note that there are more sophisticated tools for identifying indels, but this approach worked fine for my purposes.
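The steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the exact scripts used: the alignment itself (steps 3 and 4) is left to your aligner, and `head_hits` / `tail_hits` are hypothetical dictionaries you would build by parsing its output, mapping each read name to a (reference label, position) pair. All function names are made up for this example.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, int]  # (reference label: "genome" or "transposon", position)


def element_ends(name: str, element: str, n: int = 20) -> Dict[str, str]:
    """Step 1: take the first and last n bases of one insertion element."""
    return {f"{name}_5p": element[:n], f"{name}_3p": element[-n:]}


def mask_regions(genome: str, regions: List[Tuple[int, int]]) -> str:
    """Step 2: replace each 0-based, half-open (start, end) region with N's."""
    seq = list(genome)
    for start, end in regions:
        for i in range(start, min(end, len(seq))):
            seq[i] = "N"
    return "".join(seq)


def split_read(read: str, n: int = 16) -> Tuple[str, str]:
    """Steps 3-4: the first and last n bases of a read, to be aligned separately."""
    return read[:n], read[-n:]


def junction_reads(head_hits: Dict[str, Hit], tail_hits: Dict[str, Hit]) -> List[Tuple[str, int]]:
    """Step 5: reads with one end on the genome and the other on a transposon.

    Returns (read_name, genome_position) for each junction read; the genome
    position is what step 6 uses to locate the insertion site.
    """
    junctions = []
    for name, (head_ref, head_pos) in head_hits.items():
        if name not in tail_hits:
            continue  # only one end aligned; cannot call a junction
        tail_ref, tail_pos = tail_hits[name]
        if head_ref == "genome" and tail_ref == "transposon":
            junctions.append((name, head_pos))
        elif head_ref == "transposon" and tail_ref == "genome":
            junctions.append((name, tail_pos))
    return junctions


if __name__ == "__main__":
    heads = {"r1": ("genome", 1000), "r2": ("transposon", 3), "r3": ("genome", 50)}
    tails = {"r1": ("transposon", 0), "r2": ("genome", 2500), "r3": ("genome", 70)}
    print(junction_reads(heads, tails))
```

Comparing the resulting genome positions between each mutant and the wild type then separates new insertions from those already present in the starting strain.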
Email (smithhe2ATniddk.nih.gov) if you have further questions.
-Harold