
**NextGenSeq** · 05-05-2010, 04:57 AM

For SNP and mutation discovery people usually try to aim for 20 to 30X.

**krobison** · 05-05-2010, 06:16 AM

If I were faced with this question, I'd probably start with thinking about the following:

Paired-end vs. single-end. What is the real cost per base pair for each approach? The read length would figure in here. Given the read lengths you are using and E. coli (are these lab strains? wild strains?), what fraction of reads will not be unambiguously mappable with only single-end coverage? There are probably some E. coli datasets in the Short Read Archive to play with. Alternatively, you could simulate some (there are various tools out there, though getting the error profile right could be a bugaboo), or do a pilot run at very long read length and paired-end, from which you can then simulate any lesser version of the data (single-end, shorter reads).
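The idea of deriving a "lesser" dataset from a pilot run could be sketched roughly as follows. This is a minimal illustration in Python, assuming the pilot reads are stored as plain four-line FASTQ records; the function name `truncate_fastq` is hypothetical and not part of any existing tool. Running it on the first-mate file alone would mimic a shorter single-end run.

```python
from typing import Iterable, Iterator


def truncate_fastq(lines: Iterable[str], length: int) -> Iterator[str]:
    """Yield FASTQ lines with sequence and quality cut to `length` bases.

    Assumes plain four-line records: header, sequence, '+', quality.
    """
    record = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, seq, plus, qual = record
            yield header
            yield seq[:length]   # keep only the first `length` bases
            yield plus
            yield qual[:length]  # trim qualities to match
            record = []


# Toy 44-base read standing in for a long pilot read (made-up data).
pilot = ["@read1/1", "ACGT" * 11, "+", "I" * 44]
short = list(truncate_fastq(pilot, 36))
print(len(short[1]), len(short[3]))
```

Truncation only approximates a real shorter run, since quality tends to drop along a read and a genuine 36-cycle run may have a different error profile.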

For the depth question, you really need to think about the risk of false positive errors (a miscall being classified as a SNP because you do not get enough reads to contradict it) versus false negative errors (insufficient depth to find the SNP). How worried are you about each?
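One rough way to put numbers on the false-negative side is to assume that per-site depth follows a Poisson distribution around the mean coverage. That is an idealisation (real coverage is more uneven), and the cutoff of 8 reads below is an arbitrary illustrative threshold, not a recommendation from this thread.

```python
import math


def prob_depth_below(mean_depth: float, min_reads: int) -> float:
    """P(depth < min_reads) when depth ~ Poisson(mean_depth)."""
    return sum(
        math.exp(-mean_depth) * mean_depth ** k / math.factorial(k)
        for k in range(min_reads)
    )


# Compare a few mean depths against the same (assumed) minimum of 8 reads.
for depth in (10, 20, 30):
    p = prob_depth_below(depth, min_reads=8)
    print(f"{depth}X mean depth: P(fewer than 8 reads) = {p:.2e}")
```

Multiplying the probability by the genome length gives a crude estimate of how many sites would be too shallow to call at a given mean depth.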

One more set of considerations: how will the results be used? What will the next experiment look like? For example, if next you retest mutations by another sequencing strategy (PCR-based Sanger, for example), how many positives do you want to screen? Or is the next step really expensive, such as generating the same mutation in another background or expressing mutant protein or such?

**jkersh** · 05-05-2010, 11:28 AM

I am still waiting to find out pricing for single end versus paired end runs with multiplexing, but I am trying to figure out if I should be using one versus the other for any specific reason other than price. Our ultimate goal is to sequence mutants of E. coli K12 to better understand pathway evolution. We will be confirming mutations by Sanger sequencing regions of interest so my biggest concern would probably be false negative errors.

As I am very new to this area, where would I find these E. coli datasets? Also, are there any programs I can run on a laptop with 1 GB RAM to look at this (running Windows XP)? We are going to be purchasing a new computer for the actual data analysis, but we haven't decided what to get quite yet.

Thanks for the help!

**HESmith** · 05-05-2010, 05:15 PM

Paired-end reads don't offer any advantage for identifying SNPs, although they can be useful for identifying indels. Unless you have a large number of strains to analyze, 30X coverage is reasonable. A single lane of Illumina short-read sequence (30 million reads x 36bp) is sufficient for 30X coverage for seven barcoded strains. Reagent costs for the clustering and sequencing are ~$600, and costs per sample library can be >$100 w/ home-made protocol. So you're looking at ~$200 per strain; your labor costs for the time spent on data analysis will be MUCH more than that!

-Harold
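The lane arithmetic in the post above can be checked with a few lines of Python. The genome size of roughly 4.6 Mb for E. coli K-12 is an assumption not stated in the thread; the read count, read length, and number of strains are the figures quoted in the post. The estimate ignores bases spent on barcodes and reads that fail to map, so real usable depth would be somewhat lower.

```python
GENOME_SIZE_BP = 4_600_000   # approximate E. coli K-12 genome size (assumed)
READS_PER_LANE = 30_000_000  # figure quoted in the post
READ_LENGTH_BP = 36          # figure quoted in the post
STRAINS_PER_LANE = 7         # figure quoted in the post


def mean_coverage(reads: int, read_len: int, genome_size: int, samples: int = 1) -> float:
    """Average depth per sample: total sequenced bases / genome size / samples."""
    return reads * read_len / genome_size / samples


per_lane = mean_coverage(READS_PER_LANE, READ_LENGTH_BP, GENOME_SIZE_BP)
per_strain = mean_coverage(
    READS_PER_LANE, READ_LENGTH_BP, GENOME_SIZE_BP, STRAINS_PER_LANE
)
print(f"{per_lane:.0f}X per lane, {per_strain:.1f}X per strain")
```

Under these assumptions the per-strain figure comes out a little above 30X, consistent with the estimate in the post.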

**ECO** · 05-05-2010, 08:08 PM

SNPs or transposons moving around. That is the question.

I had much better luck with SOLiD mate pairs in resolving unknown bacterial mutations, while only after the fact were we able to confirm with frag reads.

**jkersh** · 05-18-2010, 02:41 PM

Thanks for all of the information, it is hard to find everything in one place!

Our plan is to run paired-end with 36 cycles to catch genomic rearrangements as well as SNPs. I am thinking of starting by multiplexing 6 strains per lane and seeing what the coverage ends up looking like. Does anyone have any thoughts on the size of barcodes? We don't want to lose too much sequence, but we still need to maintain fidelity. Also, with paired-end runs do you have to put the barcode on both ends? I was thinking you wouldn't, but after reading some of the threads I am now confused on this issue again.

Thanks again!

**jkersh** · 05-20-2010, 06:47 AM

Please ignore the barcoding question relating to paired-end reads; I just realized I was thinking about it incorrectly!

**jkersh** · 05-20-2010, 06:59 AM

> Originally posted by ECO
>
> SNPs or transposons moving around. That is the question.
>
> I had much better luck with SOLiD mate pairs in resolving unknown bacterial mutations, while only after the fact were we able to confirm with frag reads.

I agree that SOLiD would be better for this, but we only have an Illumina system available to us. We think the most likely mutations will be SNPs, but it is likely we will also find a variety of insertions/deletions. I am not sure if we will really be able to see transposons, but it would be nice to see those as well. I am hoping a paired-end Illumina run with 36 cycles will pick up most of those, but since I haven't had any experience with this, I am not sure which type of Illumina run will be the best.

Thanks!

**HESmith** · 05-26-2010, 06:45 PM

I've just finished analyzing our dataset of twelve barcoded E. coli samples (one WT and 11 mutants). Three lanes of 36bp single-end reads at less than maximum density produced enough sequence for 30X average depth of coverage. Ours were spontaneous mutations, so we anticipated that most would be transposon hops. I resorted to an inelegant but successful strategy to ID the insertion sites. I generated a second 'chromosome' file that contained the ends of all known insertion elements, and removed those sequences from the E. coli reference genome. I then aligned the 5' and 3' ends of my reads independently, then IDed the reads that mapped to different chromosomes (ECO was right; it would have been easier with paired-end data). Nine mutant strains contained unique transposon insertions that were absent in the WT. I used BFAST (shout-out to Nils) to screen for SNPs; one of the two remaining mutants contained a single unique SNP encoding a missense mutation. (The last mutant might contain a transposon in a repeated sequence; I'm still parsing the data).

Paired-end 36bp sequencing would be the way to go. In addition to making the transposon strategy work more smoothly, you could identify large rearrangements in the same manner (i.e., the ends would map to different loci in the reference genome).

-Harold

**krobison** · 05-27-2010, 09:07 AM

Cool!

If you want to have fun & have a sufficient machine, you could also feed each read set into velvet & then take the contigs & BLAST them against reference E.coli & your transposons.

**lh3** · 05-27-2010, 02:27 PM

> Originally posted by krobison
>
> Cool!
>
> If you want to have fun & have a sufficient machine, you could also feed each read set into velvet & then take the contigs & BLAST them against reference E.coli & your transposons.

I would also recommend doing de novo assembly, even if it requires deeper coverage and a higher cost.

**HESmith** · 05-31-2010, 02:46 PM

Ours was a one-off experiment, so I didn't invest the time in mastering a de novo assembler (I'm a molecular biologist, so it takes me longer). But if we were working with a larger collection of mutations, I could see the advantage of building a reference genome from the starting strain. And coverage wouldn't be a problem; even one lane of 36bp single end would yield >90X (and surely that's enough for an accurate assembly).

-Harold

**jkersh** · 10-01-2010, 09:07 AM

> Originally posted by HESmith
>
> I've just finished analyzing our dataset of twelve barcoded E. coli samples (one WT and 11 mutants). Three lanes of 36bp single-end reads at less than maximum density produced enough sequence for 30X average depth of coverage. Ours were spontaneous mutations, so we anticipated that most would be transposon hops. I resorted to an inelegant but successful strategy to ID the insertion sites. I generated a second 'chromosome' file that contained the ends of all known insertion elements, and removed those sequences from the E. coli reference genome. I then aligned the 5' and 3' ends of my reads independently, then IDed the reads that mapped to different chromosomes (ECO was right; it would have been easier with paired-end data). Nine mutant strains contained unique transposon insertions that were absent in the WT. I used BFAST (shout-out to Nils) to screen for SNPs; one of the two remaining mutants contained a single unique SNP encoding a missense mutation. (The last mutant might contain a transposon in a repeated sequence; I'm still parsing the data).
>
> Paired-end 36bp sequencing would be the way to go. In addition to making the transposon strategy work more smoothly, you could identify large rearrangements in the same manner (i.e., the ends would map to different loci in the reference genome).
>
> -Harold

Hi Harold,
We recently got data from several multiplexed 36bp single-read E. coli samples and are starting to analyze them (we couldn't find anyone to run paired-end lanes with us at the time). I have been using Bowtie to align the data to the E. coli K-12 MG1655 genome, but I am having problems with transposons and other large insertions and was curious about the method you described to find them. I am still pretty new at this and couldn't quite understand what you did with the second chromosome file. I think we may try some de novo sequencing down the line, but I don't know when the software will be available on our server, and I would really like to get started identifying regions of interest for Sanger sequencing.

Thanks!
Jamie

**HESmith** · 10-05-2010, 12:46 PM

Hi Jamie,

By default, most aligners return the reads that have a single best match to the reference genome. Transposons are present in multiple copies, so reads derived from transposons will produce multiple best matches. To use the approach I took, you want to remove the transposon sequences from the K-12 genome and build a second reference file that contains a single copy of each transposon.

Specifically:

1) Build a new reference file that contains 20 bases from the 5' and 3' end of each transposon/insertion element.
2) Replace those sequences with N's in the K-12 genome.
3) Use your favorite aligner to map the first 16 bases of your reads to the two reference files.
4) Repeat the alignment with the last 16 bases of your reads.
5) Identify the set of reads where the first 16 bases align to the K-12 reference and the last 16 bases align to the transposon reference, and vice versa. These are the junction reads (genome/transposon).
6) Use the genome alignment position to map the location of your reads to known or novel insertion sites.

Note that there are more sophisticated tools for identifying indels, but this approach worked fine for my purposes.
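The bookkeeping around the steps above can be sketched in Python. This is only an illustration under stated assumptions, not the scripts actually used: all function names are hypothetical, the alignment itself (steps 3 and 4) is left to whatever aligner you prefer, and its output is assumed to have been parsed into dictionaries mapping read ID to the reference it hit (`'genome'` or `'transposon'`).

```python
from typing import Dict, List, Tuple


def element_ends(element: str, flank: int = 20) -> Tuple[str, str]:
    """Step 1: first and last `flank` bases of an insertion element."""
    return element[:flank], element[-flank:]


def mask_regions(genome: str, regions: List[Tuple[int, int]]) -> str:
    """Step 2: replace each 0-based, half-open [start, end) region with N's."""
    seq = list(genome)
    for start, end in regions:
        end = min(end, len(seq))  # do not run past the end of the genome
        seq[start:end] = "N" * max(0, end - start)
    return "".join(seq)


def split_read(read: str, k: int = 16) -> Tuple[str, str]:
    """Steps 3 and 4: first and last k bases of a read, aligned separately."""
    return read[:k], read[-k:]


def junction_reads(head_hits: Dict[str, str], tail_hits: Dict[str, str]) -> List[str]:
    """Step 5: IDs of reads whose two ends hit different references.

    Reads where either end failed to align are absent from one of the
    dictionaries and are therefore skipped.
    """
    return sorted(
        rid
        for rid, ref in head_hits.items()
        if rid in tail_hits and tail_hits[rid] != ref
    )


# Toy data (made up) standing in for parsed aligner output.
head_hits = {"r1": "genome", "r2": "genome", "r3": "transposon"}
tail_hits = {"r1": "transposon", "r2": "genome", "r3": "genome"}

print(mask_regions("ACGTACGTAC", [(2, 6)]))   # ACNNNNGTAC
print(split_read("A" * 16 + "CCCC" + "G" * 16))
print(junction_reads(head_hits, tail_hits))   # ['r1', 'r3']
```

For step 6, the genome-side alignment position of each junction read would then be looked up and compared against known insertion sites; that part depends on the aligner's output format and is not shown here.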

Email (smithhe2ATniddk.nih.gov) if you have further questions.

-Harold

Illumina for resequencing E. coli
